Learning the BeautifulSoup module

Article link: https://blog.csdn.net/Marvin_Wind/article/details/83066007

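This article shows how to parse HTML documents with Python's BeautifulSoup library. It covers installing the module, basic usage, modifying tag attributes, querying children and descendants, manipulating tag contents, and using CSS selectors. Worked examples demonstrate finding, replacing, and deleting tags.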


Installing the module: pip3 install beautifulsoup4

from bs4 import BeautifulSoup

html_doc = """<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
<a href="dfghrt.cc">Chat on the internet</a>
<p>I have received a e-mail from an old friend yesterday. she asked how my summer holiday is.had I ever gone to
    somewhere for pleasure ! and also ,told me that she wanted to go around for fun ,but ,unfortunately,there is no
    time! <a href="dref.com">what a pity !</a>so ,if I would get to work, how will my life be? what type of jobs should
    I choose? uh, maybe I think a lot! how time flies! I have reached school for almost half a month.came to read
    articles day by day.</p>
<hr>
<div class="story"><p id="we11">Homely &comely appearance! want to make a beating plan! buy some herbs, face-mask and so on . surfing the
    internet ,I chance meet a stranger ,he has a beatiful net-mane,it give me a good sence,so I made him into my
    friends-list,and then i found ,that he has doctor degree. and he is of great knowlege! uh ,and a confident man! but
    unluckly ,he divorced .maybe in our country ,divorcement is normal .but I feel unsafety. I must get to listen "4+1"
    oral english now ,continue it later!</p></div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc,features = 'html.parser')

1. name: the tag name

tag = soup.find(name='a')
print(tag,tag.name)

Output: <a href="dfghrt.cc">Chat on the internet</a> a

2. attrs: the tag's attributes

print(tag.attrs)
tag.attrs = {'k1':'123'}   # replace all attributes at once
tag.attrs['k2'] = '456'     # add a new attribute
print(soup)

Partial output:

<body>
<a k1="123" k2="456">Chat on the internet</a>
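Attributes can also be read, changed, and deleted one at a time by indexing the tag like a dictionary. The following is a minimal sketch; the one-line HTML snippet and the variable names are made up for illustration and are not part of the original example.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only to demonstrate per-attribute edits.
soup = BeautifulSoup('<a href="dfghrt.cc" id="x">Chat</a>', features="html.parser")
a = soup.find("a")

a["href"] = "new.cc"   # overwrite an existing attribute
del a["id"]            # delete one attribute
a["k2"] = "456"        # add a new attribute (same effect as a.attrs["k2"] = ...)

result = a.decode()
print(result)          # <a href="new.cc" k2="456">Chat</a>
```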

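3. children: all direct children

body = soup.find('body')
l1 = body.children            # an iterator
print(list(l1))

The children include the newline text nodes between tags, not only the tags themselves.

4. descendants: all children, grandchildren, and deeper nodes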

l2 = body.descendants
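To make the difference between children and descendants concrete, here is a small sketch on a made-up snippet (the HTML and names are assumptions for illustration). It filters out text nodes with isinstance so that only tags are listed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet: <b> is nested inside <p>, so it is a descendant
# of <div> but not a direct child.
snippet = "<div><p>one <b>bold</b></p><hr/></div>"
soup = BeautifulSoup(snippet, features="html.parser")
div = soup.find("div")

# Both attributes also yield text nodes; keep only Tag objects.
child_names = [node.name for node in div.children if isinstance(node, Tag)]
descendant_names = [node.name for node in div.descendants if isinstance(node, Tag)]

print(child_names)        # ['p', 'hr']
print(descendant_names)   # ['p', 'b', 'hr']
```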

5. clear: empty the tag's contents but keep the tag itself

tag = soup.find('body')
tag.clear()
print(soup)

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Title</title>
</head>
<body></body>
</html>

6. decompose: recursively destroy everything inside the selected tag and remove the selected tag as well

body = soup.find('body')
body.decompose()
print(soup)

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Title</title>
</head>

</html>

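7. extract: remove the selected tag and everything inside it, and return the removed tag (the removal looks the same as above; it works like a cut operation)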

body = soup.find('body')
v= body.extract()
print(v)
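The practical difference between the two is what you get back. A minimal sketch, assuming a made-up two-paragraph snippet: extract hands back a detached tag that can still be used, while decompose returns nothing and destroys the tag.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration.
soup = BeautifulSoup(
    "<div><p id='a'>keep me</p><p id='b'>drop me</p></div>",
    features="html.parser",
)

removed = soup.find("p", id="a").extract()   # cut: detached tag is returned
soup.find("p", id="b").decompose()           # destroy: nothing usable is returned

remaining = soup.decode()
print(remaining)            # <div></div>
print(removed.get_text())   # keep me
print(removed.parent)       # None, the tag no longer belongs to the tree
```

Because the extracted tag is still a normal Tag, it can later be re-attached elsewhere, for example with append.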

8. decode: convert to a string (including the current tag); decode_contents excludes the current tag

body = soup.find('body')
val = body.decode()
print(type(body),type(val))

9. encode: convert to bytes (including the current tag); encode_contents excludes the current tag

body = soup.find('body')
val = body.encode()
print(type(body), type(val))
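The four methods from sections 8 and 9 can be compared side by side. This is a sketch on a made-up one-line snippet; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hi <b>there</b></p>", features="html.parser")
p = soup.find("p")

as_text = p.decode()               # str, includes the <p> tag itself
inner_text = p.decode_contents()   # str, only what is inside <p>
as_bytes = p.encode()              # bytes (UTF-8 by default), includes <p>
inner_bytes = p.encode_contents()  # bytes, only what is inside <p>

print(as_text)      # <p>hi <b>there</b></p>
print(inner_text)   # hi <b>there</b>
print(as_bytes)     # b'<p>hi <b>there</b></p>'
print(inner_bytes)  # b'hi <b>there</b>'
```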

10. find: get the first matching tag

tag = soup.find(name='p', attrs={'id': 'we11'}, recursive=True)

get: read an attribute of a tag

tag = soup.find('a')
val = tag.get('href')
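get behaves like dict.get, which matters when an attribute may be absent. A short sketch on a made-up link (the HTML and names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="x y" href="dref.com">link</a>', features="html.parser")
a = soup.find("a")

href = a.get("href")             # 'dref.com'
missing = a.get("title")         # None, no exception for a missing attribute
fallback = a.get("title", "n/a") # 'n/a', an explicit default
classes = a.get("class")         # ['x', 'y'], class is returned as a list

print(href, missing, fallback, classes)
```

Indexing with a["title"] would raise KeyError here, so get is the safer choice when the attribute is optional.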

11. find_all: get all matching tags, returned as a list

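tags = soup.find_all(name='a', limit=1)            # limit caps the number of results

tags = soup.find_all(name=['a', 'div'])            # match any tag name in the list

import re
rep = re.compile('^H')
tags = soup.find_all(string=rep, limit=1)          # match on node text; older code passes text= instead of string=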
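The filters above can be tried on a smaller document. This sketch uses a made-up snippet loosely modeled on html_doc; it also shows class_, the keyword used because class is a reserved word in Python.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration.
html = (
    '<div class="story"><p id="we11">Homely text</p>'
    '<a href="x.cc">Hi</a><a href="y.com">Other</a></div>'
)
soup = BeautifulSoup(html, features="html.parser")

first_link = soup.find_all("a", limit=1)             # at most one result
mixed = soup.find_all(["a", "div"])                  # any of the listed names
by_class = soup.find_all("div", class_="story")      # filter on the class attribute
h_strings = soup.find_all(string=re.compile("^H"))   # text nodes starting with "H"

print(len(first_link))                 # 1
print([t.name for t in mixed])         # ['div', 'a', 'a']
print(len(by_class))                   # 1
print([str(s) for s in h_strings])     # ['Homely text', 'Hi']
```

Note that a string= search on its own returns the matching text nodes, not the tags that contain them.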

12. has_attr: check whether the tag has the given attribute

tag = soup.find('a')
val = tag.has_attr('href')

13. get_text: get the text inside the tag

tag = soup.find('a')
val = tag.get_text()
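get_text concatenates the text of all nested nodes, and its separator and strip arguments control how the pieces are joined. A minimal sketch on a made-up paragraph:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p> Hello <b>big</b> world </p>", features="html.parser")
p = soup.find("p")

raw = p.get_text()                  # pieces joined as-is, whitespace kept
clean = p.get_text(" ", strip=True) # each piece stripped, joined with one space

print(repr(raw))     # ' Hello big world '
print(repr(clean))   # 'Hello big world'
```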

14. index: find the position of a child within the tag

tag = soup.find('body')
val = tag.index(tag.find('p'))
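The position returned by index counts every child node, including whitespace text nodes, so it is often larger than expected. A sketch on a made-up body (names and HTML are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# The newlines between tags are text nodes and occupy positions too.
soup = BeautifulSoup("<body>\n<a>x</a>\n<p>y</p></body>", features="html.parser")
body = soup.find("body")

pos = body.index(body.find("p"))
print(pos)                   # 3: newline, <a>, newline, then <p>
print(body.contents[pos])    # <p>y</p>
```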

15. is_empty_element: whether the tag is an empty or self-closing element (hr, br, input, img, meta, link, frame, and so on)

tag = soup.find('hr')
val = tag.is_empty_element
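A tag that merely has no content is not the same as a self-closing element. The sketch below, on a made-up snippet, shows that an empty p still reports False.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><hr><p></p><br/></div>", features="html.parser")

# Map each tag name to its is_empty_element flag.
flags = {tag.name: tag.is_empty_element for tag in soup.find_all(True)}
print(flags)   # {'div': False, 'hr': True, 'p': False, 'br': True}
```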

16. Related nodes

# tag.next
# tag.next_element
# tag.next_elements
# tag.next_sibling
# tag.next_siblings

# tag.previous
# tag.previous_element
# tag.previous_elements
# tag.previous_sibling
# tag.previous_siblings

# tag.parent
# tag.parents
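A few of these attributes in action, as a rough sketch on a made-up snippet. Note that the *_element attributes walk the parse order (so the next element of a tag is usually its own first child), while the *_sibling attributes stay on the same level.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><a>one</a><hr/><p>two</p></div>", features="html.parser")
a = soup.find("a")

next_el = a.next_element                      # the text node 'one' inside <a>
next_sib = a.next_sibling                     # <hr/>, same level as <a>
later = [s.name for s in a.next_siblings]     # generator over all later siblings
ancestors = [p.name for p in a.parents]       # walks up to the document root
prev_of_p = soup.find("p").previous_sibling   # <hr/>

print(repr(next_el))   # 'one'
print(next_sib.name)   # hr
print(later)           # ['hr', 'p']
print(ancestors)       # ['div', '[document]']
print(prev_of_p.name)  # hr
```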

17. Searching among a tag's related nodes

# tag.find_next(...)
# tag.find_all_next(...)
# tag.find_next_sibling(...)
# tag.find_next_siblings(...)

# tag.find_previous(...)
# tag.find_all_previous(...)
# tag.find_previous_sibling(...)
# tag.find_previous_siblings(...)

# tag.find_parent(...)
# tag.find_parents(...)
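These methods accept the same filters as find and find_all, but search relative to the current tag. A minimal sketch, assuming a made-up snippet with two links:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><a href='1'>one</a><p>two</p><a href='2'>three</a></div>",
    features="html.parser",
)
first = soup.find("a")

next_link = first.find_next("a")                         # next <a> in document order
next_p = first.find_next_sibling("p")                    # next sibling that is a <p>
all_after = [t.name for t in first.find_next_siblings()] # every later sibling tag
back = soup.find("p").find_previous_sibling("a")         # sibling before <p>
container = first.find_parent("div")                     # nearest <div> ancestor

print(next_link["href"])   # 2
print(next_p.get_text())   # two
print(all_after)           # ['p', 'a']
print(back["href"])        # 1
print(container.name)      # div
```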

18. select, select_one: CSS selectors

tag =soup.select("div p")

select matches using a CSS selector and returns a list; select_one returns only the first match.
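A few common selector forms, sketched on a made-up snippet (the HTML and names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div class="story"><p id="we11">a</p><p>b</p></div><p>c</p>',
    features="html.parser",
)

inside_div = soup.select("div p")          # descendant selector
direct = soup.select("div.story > p")      # class plus direct-child selector
by_id = soup.select_one("#we11")           # id selector, first match only
nothing = soup.select_one("span")          # no match

print([p.get_text() for p in inside_div])  # ['a', 'b']
print(len(direct))                         # 2
print(by_id.get_text())                    # a
print(nothing)                             # None
```

select returns an empty list when nothing matches, whereas select_one returns None, so the two need different checks.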

19. wrap: wrap the current tag inside the given tag

from bs4.element import Tag
tag1 = Tag(name='div',attrs={'color':'red'})
tag1.string = 'newtag'
tag2 = soup.find('a')
val = tag2.wrap(tag1)
print(soup)

20. unwrap: remove the current tag but keep the contents it wraps

tag = soup.find('div')
val = tag.unwrap()
print(soup)

val is the tag that was removed.
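wrap and unwrap are inverses, which a round trip makes easy to see. This sketch uses a made-up snippet and creates the wrapper with soup.new_tag, an alternative to constructing Tag directly as in section 19.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a href='x'>link</a></p>", features="html.parser")

wrapper = soup.new_tag("div")
wrapper["class"] = "box"

returned = soup.find("a").wrap(wrapper)   # returns the wrapper, now in the tree
after_wrap = soup.decode()

removed = soup.find("div").unwrap()       # returns the (now empty) removed tag
after_unwrap = soup.decode()

print(after_wrap)     # <p><div class="box"><a href="x">link</a></div></p>
print(after_unwrap)   # <p><a href="x">link</a></p>
print(removed.name)   # div
```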

21. Tag content

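tag = soup.find('p')
print(tag.string)          # None for the first <p> of html_doc: it holds text plus an <a>, so there is no single string
tag.string='newcontents'   # replaces everything inside the tag with this text
print(soup)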

# tag = soup.find('body')
# val = tag.stripped_strings
# print(next(val))
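The behavior of string depends on how many children the tag has, which the following sketch tries to show on a made-up snippet. stripped_strings is the usual fallback when a tag mixes text and nested tags.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p>  lead <a>link</a> tail </p><p>solo</p>",
    features="html.parser",
)
first, second = soup.find_all("p")

mixed = first.string                    # None: text plus a nested tag
single = second.string                  # 'solo': exactly one text child
pieces = list(first.stripped_strings)   # every text piece, whitespace trimmed

second.string = "changed"               # assignment replaces the tag's contents

print(mixed)             # None
print(single)            # solo
print(pieces)            # ['lead', 'link', 'tail']
print(second.decode())   # <p>changed</p>
```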

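22. append: append a tag at the end of the current tag's contents

insert: insert a tag at a given position inside the current tag

insert_before, insert_after: insert a tag immediately before or after the current tag

replace_with: replace the current tag with another one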
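The original gives no code for these methods, so here is a rough sketch of how they might be combined. The list snippet and the make_li helper are hypothetical, introduced only for this illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>b</li></ul>", features="html.parser")
ul = soup.find("ul")

def make_li(text):
    """Hypothetical helper: build a new <li> holding the given text."""
    li = soup.new_tag("li")
    li.string = text
    return li

ul.append(make_li("d"))                           # b, d
ul.insert(0, make_li("a"))                        # a, b, d
ul.find_all("li")[1].insert_after(make_li("c"))   # a, b, c, d
ul.insert_before(soup.new_tag("hr"))              # <hr> placed just before <ul>
ul.find_all("li")[3].replace_with(make_li("z"))   # a, b, c, z

items = [li.get_text() for li in ul.find_all("li")]
print(items)                      # ['a', 'b', 'c', 'z']
print(ul.previous_sibling.name)   # hr
print(soup)
```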

23. CSS selectors