Beautiful Soup Library -- 04 -- Modifying the Document Tree

S_numb

Category: Python. Tags: python, web scraping

Article link: https://blog.csdn.net/S_numb/article/details/120218188

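1. Modifying the Document Tree

Beautiful Soup's main strength is searching the document tree, but it also makes it easy to modify the tree.

1.1 Changing a tag's name and attributes

Changing the name and attributes: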

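from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

tag.name = "blockquote"      # rename the tag
tag['class'] = 'verybold'    # overwrite an existing attribute
tag['id'] = 1                # add a new attribute
print(tag)

Output: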

<blockquote class="verybold" id="1">Extremely bold</blockquote>

Deleting attributes:

del tag['class']
del tag['id']
print(tag)

Output:

<blockquote>Extremely bold</blockquote>
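
Attributes behave much like dictionary entries, so it can help to see what happens when you read one that has been deleted. The following is a minimal sketch; the `id` value `x1` is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest" id="x1">Extremely bold</b>', "html.parser")
tag = soup.b

# .attrs exposes all attributes of the tag as a dictionary.
print(tag.attrs)

del tag["id"]

# .get() returns None for a missing attribute,
# whereas tag["id"] would now raise KeyError.
print(tag.get("id"))
```

Using `.get()` is the safer choice when you are not sure an attribute is present.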

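1.2 Changing .string

Assigning to a tag's .string attribute replaces the tag's existing content with the new string:

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

tag = soup.a
tag.string = "New link text."
print(tag)

Output: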

<a href="http://example.com/">New link text.</a>

Be careful: if the tag contains other tags, assigning to its .string attribute overwrites all of its existing content, including the child tags.
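
To make the loss of child tags concrete, the sketch below checks for the `<i>` child before and after the assignment, reusing the markup from the example above.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

print(tag.i)          # the <i> child exists before the assignment
tag.string = "New link text."
print(tag.i)          # None: the child tag was discarded
print(tag.contents)   # only a single text node remains
```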

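1.3 append()

Tag.append() adds content to the end of a tag, much like append() on a Python list:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<a>Foo</a>", "html.parser")
soup.a.append("Bar")
print(soup)
print(soup.a.contents)

Output (with html.parser; the lxml parser would additionally wrap the result in <html><body>):

<a>FooBar</a>
['Foo', 'Bar']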

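1.4 NavigableString() and .new_tag()

To add text to a document, pass a Python string to append(), or call the NavigableString constructor:

from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<b></b>", "html.parser")
tag = soup.b
tag.append("Hello")                       # a plain str is converted automatically
new_string = NavigableString(" there")
tag.append(new_string)
print(tag)

Output:

<b>Hello there</b>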

Creating a comment:

from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup("<b></b>", "html.parser")

tag = soup.b
tag.append("Hello")

new_string = NavigableString(" there")
tag.append(new_string)

# Passing Comment as the second argument creates a comment node instead of plain text.
new_comment = soup.new_string("Nice to see you.", Comment)
tag.append(new_comment)

print(tag)

Output:

<b>Hello there<!--Nice to see you.--></b>

Passing a class such as Comment to new_string() is a feature added in Beautiful Soup 4.2.1.
To create a whole new tag, the best approach is to call the factory method BeautifulSoup.new_tag():

from bs4 import BeautifulSoup

soup = BeautifulSoup("<b></b>", "html.parser")

tag_original = soup.b
tag_new = soup.new_tag("a", href="http://www.example.com")
tag_original.append(tag_new)
print(tag_original)

Output:

<b><a href="http://www.example.com"></a></b>
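
Some attribute names, such as `data-role` or `class`, cannot be written as Python keyword arguments. A minimal sketch of one way around this is to pass them through the `attrs` dictionary of new_tag(); the attribute name `data-role` and its value are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p></p>", "html.parser")

# "data-role" is not a valid keyword argument name, so it goes in attrs.
link = soup.new_tag(
    "a",
    attrs={"data-role": "external"},
    href="http://www.example.com",
)
link.string = "Example"   # give the new tag some text
soup.p.append(link)       # attach it to the tree

print(soup.p)
```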

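1.5 insert()

Tag.insert() is similar to Tag.append().
The difference is that the new element is not necessarily added to the end of the parent's .contents; it is inserted at whatever position you specify.
It works just like insert() on a Python list:

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a
print(tag)

tag.insert(1, "but did not endorse")   # index 1: between the text and the <i> tag
print(tag)

print(tag.contents)

Output: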

<a href="http://example.com/">I linked to <i>example.com</i></a>
<a href="http://example.com/">I linked to but did not endorse<i>example.com</i></a>
['I linked to ', 'but did not endorse', <i>example.com</i>]
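
The positions follow list indexing, so index 0 puts the new node first and an index equal to the length of .contents puts it last. The short sketch below, with made-up text, shows both ends.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>middle</p>", "html.parser")
tag = soup.p

tag.insert(0, "first ")                   # position 0: before everything else
tag.insert(len(tag.contents), " last")    # end position: same effect as append()

print(tag)
print(tag.contents)
```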

1.6 insert_before() and insert_after()

insert_before() inserts content immediately before the current tag or text node:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>stop</b>", "html.parser")
tag = soup.new_tag("a")
tag.string = "Alice"
soup.b.string.insert_before(tag)   # insert before the text node "stop"

print(soup.b)

Output:

<b><a>Alice</a>stop</b>

insert_after() inserts content immediately after the current tag or text node.
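
As a counterpart to the insert_before() example, here is a minimal sketch of insert_after() on the same markup; the `<i>` tag and its text are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>stop</b>", "html.parser")
tag = soup.new_tag("i")
tag.string = "now"

# Insert the new tag right after the text node "stop".
soup.b.string.insert_after(tag)

print(soup.b)   # <b>stop<i>now</i></b>
```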

1.7 clear()

Tag.clear() removes the contents of the current tag:

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

tag.clear()

print(tag)

Output:

<a href="http://example.com/"></a>
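
Note that clear() empties the tag but leaves the tag itself and its attributes in place. A small sketch to confirm this, using the same markup:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

tag.clear()

print(tag.contents)   # []: text and child tags are gone
print(tag["href"])    # the attribute is still there
```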

1.8 extract()

PageElement.extract() removes a tag or text node from the document tree and returns it:

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

tag_a = soup.a
tag_i = soup.i.extract()   # detach <i> from the tree and keep a reference to it

print(tag_a)
print(tag_i)

Output:

<a href="http://example.com/">I linked to </a>
<i>example.com</i>

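At this point you effectively have two document trees:
- one rooted at the BeautifulSoup object used to parse the original document;
- one rooted at the tag that was removed and returned.
You can go on to call extract() on a child of the extracted tag.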
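
The two separate trees can be observed directly. In the sketch below the extracted tag has no parent, the original tree no longer contains it, and extract() still works on the extracted tag's own child.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

i_tag = soup.i.extract()

print(i_tag.parent)   # None: the extracted tag is the root of its own tree
print(soup.i)         # None: the original tree no longer contains an <i>

# The extracted tag is still a working tree, so its child can be extracted too.
text = i_tag.string.extract()
print(text)           # example.com
print(i_tag)          # <i></i>
```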

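1.9 decompose()

Tag.decompose() removes the current tag from the document tree and then completely destroys it together with its contents:

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a

soup.i.decompose()

print(a_tag)

Output (the trailing space of the remaining text node is preserved):

<a href="http://example.com/">I linked to </a>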
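
The practical difference from extract() is what you get back. The sketch below runs both on separate copies of the same markup: extract() hands back a usable tag, while decompose() returns nothing.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

# extract(): the removed tag is returned and can still be used.
soup_a = BeautifulSoup(markup, "html.parser")
kept = soup_a.i.extract()
print(kept)

# decompose(): the tag is destroyed and the call returns None.
soup_b = BeautifulSoup(markup, "html.parser")
result = soup_b.i.decompose()
print(result)
print(soup_b.find("i"))
```

Prefer extract() when you want to reuse the removed node, and decompose() when you simply want it gone.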

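1.10 replace_with()

replace_with() removes a tag or text node from the document tree and puts a new tag or text node in its place:

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a

new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)   # swap <i>example.com</i> for <b>example.net</b>

print(a_tag)

Output: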

<a href="http://example.com/">I linked to <b>example.net</b></a>

replace_with() returns the tag or text node that was replaced, so you can examine it or add it back to another part of the tree.
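
A minimal sketch of using that return value: the replaced tag is captured and then appended elsewhere in the same document. The markup here is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><i>old</i></p><div></div>", "html.parser")

new_tag = soup.new_tag("b")
new_tag.string = "new"

old_tag = soup.i.replace_with(new_tag)   # returns the <i> that was removed
soup.div.append(old_tag)                 # reattach it somewhere else

print(soup)   # <p><b>new</b></p><div><i>old</i></div>
```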

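1.11 wrap()

wrap() wraps an element (a tag or a text node) in the tag you pass in, and returns the new wrapper.

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")

print(soup.p.string.wrap(soup.new_tag("b")))   # wrap the text node in <b>
print(soup.p.wrap(soup.new_tag("div")))        # wrap the whole <p> in <div>

Output: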

<b>I wish I was bold.</b>
<div><p><b>I wish I was bold.</b></p></div>
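
wrap() is often applied in a loop. The sketch below, on made-up list markup, wraps the text of every `<li>` in an `<em>` tag; a fresh wrapper is created on each iteration because a tag can only live in one place in the tree.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")

for li in soup.find_all("li"):
    # Each text node needs its own new <em> wrapper.
    li.string.wrap(soup.new_tag("em"))

print(soup.ul)   # <ul><li><em>one</em></li><li><em>two</em></li></ul>
```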

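1.12 unwrap()

unwrap() is the opposite of wrap().
It replaces a tag with whatever is inside that tag, which makes it useful for stripping out markup while keeping the content:

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a

a_tag.i.unwrap()   # drop the <i> tag but keep its text
print(a_tag)

Output: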

<a href="http://example.com/">I linked to example.com</a>
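
To strip every occurrence of a tag rather than a single one, unwrap() can be combined with find_all(). This is a minimal sketch on made-up markup; find_all() returns a list of results, so modifying the tree inside the loop is safe.

```python
from bs4 import BeautifulSoup

markup = "<p>Some <b>bold</b> and <i>italic</i> and <b>more bold</b> text.</p>"
soup = BeautifulSoup(markup, "html.parser")

# Remove every <b> tag while keeping the text it contained.
for b_tag in soup.find_all("b"):
    b_tag.unwrap()

print(soup.p)   # <p>Some bold and <i>italic</i> and more bold text.</p>
```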
