Install the module: pip3 install beautifulsoup4
from bs4 import BeautifulSoup
html_doc = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<a href="dfghrt.cc">Chat on the internet</a>
<p>I have received a e-mail from an old friend yesterday. she asked how my summer holiday is.had I ever gone to
somewhere for pleasure ! and also ,told me that she wanted to go around for fun ,but ,unfortunately,there is no
time! <a href="dref.com">what a pity !</a>so ,if I would get to work, how will my life be? what type of jobs should
I choose? uh, maybe I think a lot! how time flies! I have reached school for almost half a month.came to read
articles day by day.</p>
<hr>
<div class="story"><p id="we11">Homely &comely appearance! want to make a beating plan! buy some herbs, face-mask and so on . surfing the
internet ,I chance meet a stranger ,he has a beatiful net-mane,it give me a good sence,so I made him into my
friends-list,and then i found ,that he has doctor degree. and he is of great knowlege! uh ,and a confident man! but
unluckly ,he divorced .maybe in our country ,divorcement is normal .but I feel unsafety. I must get to listen "4+1"
oral english now ,continue it later!</p></div>
</body>
</html>
"""
soup = BeautifulSoup(html_doc,features = 'html.parser')
1. name: the tag's name
tag = soup.find(name='a')
print(tag,tag.name)
Output: <a href="dfghrt.cc">Chat on the internet</a> a
2. attrs: the tag's attributes
print(tag.attrs)
tag.attrs = {'k1':'123'} # replace all attributes
tag.attrs['k2'] = '456' # add a new attribute
print(soup)
Partial output:
<body>
<a k1="123" k2="456">Chat on the internet</a>
3. children: all direct children of a tag
body = soup.find('body')
l1 = body.children # an iterator
print(list(l1))
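The children include the newline text nodes between tags, not only Tag objects.
4. descendants: all descendants (children, grandchildren, and so on)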
l2 = body.descendants
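To make the difference between children and descendants concrete, here is a minimal sketch. The `snippet` markup and the variable names are assumptions made for illustration and are not part of the html_doc above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical mini-document used only for this illustration.
snippet = "<div><p>one <b>two</b></p><hr/></div>"
demo = BeautifulSoup(snippet, features="html.parser")
div = demo.find("div")

# children yields direct children only; keep just the Tag objects.
child_names = [c.name for c in div.children if isinstance(c, Tag)]

# descendants walks every nested node in document order.
descendant_names = [d.name for d in div.descendants if isinstance(d, Tag)]

print(child_names)       # direct children of <div>
print(descendant_names)  # also includes the nested <b>
```

Both attributes are lazy iterators, so they are wrapped in list comprehensions here to inspect the results.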
5. clear: empty the tag's contents but keep the tag itself
tag = soup.find('body')
tag.clear()
print(soup)
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Title</title>
</head>
<body></body>
</html>
6. decompose: recursively destroy the selected tag together with everything inside it
body = soup.find('body')
body.decompose()
print(soup)
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Title</title>
</head></html>
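7. extract: remove the selected tag and everything inside it, and return the removed tag (same removal effect as decompose, but it works like a cut operation)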
body = soup.find('body')
v= body.extract()
print(v)
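The contrast between extract and decompose may be easier to see side by side. The following is a small sketch on a hypothetical `snippet` (the markup and names are assumptions, not taken from html_doc).

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
snippet = "<div><p id='a'>keep me</p><p id='b'>drop me</p></div>"
demo = BeautifulSoup(snippet, features="html.parser")

# extract() detaches the tag from the tree and hands it back ("cut").
cut = demo.find("p", id="a").extract()

# decompose() detaches the tag and destroys it; nothing useful is returned.
demo.find("p", id="b").decompose()

remaining = demo.decode()  # what is left in the tree
cut_html = cut.decode()    # the extracted tag is still usable

print(remaining)
print(cut_html)
```

An extracted tag has no parent, so it can later be re-attached elsewhere, for example with append.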
8. decode: convert to a string (including the current tag); decode_contents does the same but excludes the current tag
body = soup.find('body')
val = body.decode()
print(type(body),type(val))
<class 'bs4.element.Tag'> <class 'str'>
9. encode: convert to bytes (including the current tag); encode_contents does the same but excludes the current tag
body = soup.find('body')
val = body.encode()
print(type(body), type(val))
<class 'bs4.element.Tag'> <class 'bytes'>
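As a rough illustration of items 8 and 9, the sketch below compares the four variants on a hypothetical one-line `snippet` (an assumption for this example, not part of html_doc).

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
snippet = "<p><b>bold</b> text</p>"
demo = BeautifulSoup(snippet, features="html.parser")
p = demo.find("p")

outer_str = p.decode()             # str, includes the <p> tag itself
inner_str = p.decode_contents()    # str, only what is inside <p>
outer_bytes = p.encode()           # bytes, includes the <p> tag itself
inner_bytes = p.encode_contents()  # bytes, only what is inside <p>

print(outer_str)
print(inner_str)
print(outer_bytes)
print(inner_bytes)
```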
10. find: return the first matching tag
tag = soup.find(name='p', attrs={'id': 'we11'}, recursive=True)
get: read an attribute of a tag
tag = soup.find('a')
val = tag.get('href')
11. find_all: return all matching tags as a list
tags = soup.find_all(name='a', limit=1) # limit the number of results
tags = soup.find_all(name=['a','div']) # match any tag name in the list
import re
rep = re.compile('^H')
tags = soup.find_all(string=rep,limit=1) # match on node text; string= is the current name of the older text= argument
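The three kinds of find_all filters shown above can be tried on a smaller document. This is a minimal sketch; the `snippet` markup and variable names are assumptions chosen for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
snippet = '<a href="x.html">Home</a><div><a href="y.html">Help</a></div><p>Other</p>'
demo = BeautifulSoup(snippet, features="html.parser")

# limit stops the search after the given number of matches.
first_link = demo.find_all(name="a", limit=1)

# A list of names matches any of them, in document order.
links_and_divs = demo.find_all(name=["a", "div"])

# With only a string filter, the results are text nodes rather than tags.
h_strings = demo.find_all(string=re.compile("^H"))

print(first_link)
print([t.name for t in links_and_divs])
print([str(s) for s in h_strings])
```

Note that a string-only search returns the text nodes themselves; use `.parent` on a result to reach the enclosing tag.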
12. has_attr: check whether the tag has the given attribute
tag = soup.find('a')
val = tag.has_attr('href')
13. get_text: get the text content inside the tag
tag = soup.find('a')
val = tag.get_text()
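get_text also accepts `separator` and `strip` arguments, which can help when the markup contains stray whitespace. A small sketch, using a hypothetical `snippet` that is not part of html_doc:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
snippet = "<p> Hello <b>big</b> world </p>"
demo = BeautifulSoup(snippet, features="html.parser")
p = demo.find("p")

# Default: text nodes are concatenated exactly as they appear.
raw = p.get_text()

# strip=True trims each text node; separator joins the trimmed pieces.
clean = p.get_text(separator=" ", strip=True)

print(repr(raw))
print(repr(clean))
```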
14. index: find the position of a child within the tag
tag = soup.find('body')
val = tag.index(tag.find('p'))
15. is_empty_element: whether the tag is an empty (void) or self-closing element (hr, br, input, img, meta, link, frame, etc.)
tag = soup.find('hr')
val = tag.is_empty_element
16. Related nodes
# tag.next, tag.next_element, tag.next_elements
# tag.next_sibling, tag.next_siblings
# tag.previous, tag.previous_element, tag.previous_elements
# tag.previous_sibling, tag.previous_siblings
# tag.parent, tag.parents
17. Searching a tag's related nodes
# tag.find_next(...), tag.find_all_next(...)
# tag.find_next_sibling(...), tag.find_next_siblings(...)
# tag.find_previous(...), tag.find_all_previous(...)
# tag.find_previous_sibling(...), tag.find_previous_siblings(...)
# tag.find_parent(...), tag.find_parents(...)
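A few of the relative search methods listed in item 17 can be sketched as follows. The `snippet` markup and variable names are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
snippet = "<div><h2>Title</h2><p>first</p><p>second</p></div>"
demo = BeautifulSoup(snippet, features="html.parser")
h2 = demo.find("h2")

# First following sibling that is a <p>.
next_p = h2.find_next_sibling("p")

# All following siblings that are <p> tags.
all_next_p = h2.find_next_siblings("p")

# Closest enclosing <div>.
parent_div = next_p.find_parent("div")

# Walk backwards through the document from the last <p> to the nearest <h2>.
previous_heading = all_next_p[-1].find_previous("h2")

print(next_p.get_text())
print([p.get_text() for p in all_next_p])
print(parent_div.name)
print(previous_heading.get_text())
```

These methods take the same filter arguments as find and find_all, but search relative to the starting tag rather than below it.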
18. select, select_one: CSS selectors
tag =soup.select("div p")
select matches using a CSS selector and returns a list; select_one returns only the first match.
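The following sketch shows how the two return types differ. The `snippet` markup is a hypothetical example loosely modeled on the story div above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
snippet = '<div class="story"><p id="we11">one</p><p>two</p></div><p>three</p>'
demo = BeautifulSoup(snippet, features="html.parser")

# select returns a list of every match (possibly empty).
matches = demo.select("div.story p")

# select_one returns the first match, or None when nothing matches.
first = demo.select_one("div.story p")
by_id = demo.select_one("#we11")
missing = demo.select_one("ul li")

print([m.get_text() for m in matches])
print(first.get_text(), by_id.get_text(), missing)
```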
19. wrap: wrap the current tag in the given tag
from bs4.element import Tag
tag1 = Tag(name='div',attrs={'color':'red'})
tag1.string = 'newtag'
tag2 = soup.find('a')
val = tag2.wrap(tag1)
print(soup)
20. unwrap: remove the current tag but keep its contents in place
tag = soup.find('div')
val = tag.unwrap()
print(soup)
val is the tag that was removed.
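wrap and unwrap are inverses of each other, which a short round trip can illustrate. This sketch creates the wrapper with `soup.new_tag` instead of constructing `Tag` directly; the `snippet` and names are assumptions for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
snippet = '<a href="x">link</a>'
demo = BeautifulSoup(snippet, features="html.parser")
a = demo.find("a")

# wrap() puts the <a> inside the new <div> and returns the wrapper.
wrapper = a.wrap(demo.new_tag("div"))
wrapped = demo.decode()

# unwrap() removes the <div> again, leaving its contents in place,
# and returns the tag that was removed.
removed = wrapper.unwrap()
unwrapped = demo.decode()

print(wrapped)
print(unwrapped)
print(removed.name)
```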
21. Tag content
tag = soup.find('p')
print(tag.string)
tag.string='newcontents'
print(soup)
# tag = soup.find('body')
# val = tag.stripped_strings
# print(next(val))
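One thing that can be surprising about `.string` is that it is None when a tag has more than one child node. A minimal sketch, using hypothetical markup that is not part of html_doc:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
snippet = "<div>\n  <p>only text</p>\n  <p>mixed <b>content</b></p>\n</div>"
demo = BeautifulSoup(snippet, features="html.parser")
simple, mixed = demo.find_all("p")

# A tag with a single text child exposes it through .string.
simple_string = simple.string

# A tag with several children has no single string.
mixed_string = mixed.string

# stripped_strings yields every text node, trimmed, skipping blank ones.
pieces = list(demo.find("div").stripped_strings)

print(simple_string, mixed_string)
print(pieces)
```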
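22. append: append a tag inside the current tag
insert: insert a tag at a given position inside the current tag
insert_after, insert_before: insert a tag after or before the current tag
replace_with: replace the current tag with another one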
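The tree-modification methods in item 22 can be sketched together on a small list. The markup, the `make_li` helper, and the variable names are all hypothetical and introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
demo = BeautifulSoup("<ul><li>b</li></ul>", features="html.parser")
ul = demo.find("ul")

def make_li(text):
    """Hypothetical helper: build a new <li> holding the given text."""
    li = demo.new_tag("li")
    li.string = text
    return li

ul.append(make_li("d"))                # add at the end: b, d
ul.insert(0, make_li("a"))             # add at position 0: a, b, d
b_item = ul.find_all("li")[1]
b_item.insert_after(make_li("c"))      # add right after "b": a, b, c, d
ul.insert_before(demo.new_tag("hr"))   # add an <hr> in front of the <ul>
ul.find_all("li")[-1].replace_with(make_li("z"))  # swap "d" for "z"

items = [li.get_text() for li in ul.find_all("li")]
print(items)
print(demo.decode())
```

New tags are created with `new_tag` on the same BeautifulSoup object so that they use the same parser settings as the rest of the tree.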
23. CSS selectors
soup.select('ul li')