BeautifulSoup4

Latest recommended article published on 2023-03-26 11:05:42

云袖er

Category: Web scraping

Article link: https://blog.csdn.net/qq_39249347/article/details/104235739

Basic usage

from bs4 import BeautifulSoup

html = """
<div>test</div>
"""
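# The second argument selects the parser:
# html.parser is built into Python but is less tolerant of malformed markup
# lxml is fast and tolerant of malformed markup; it requires the lxml package (a C extension) and is the usual choice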
bs = BeautifulSoup(html, 'lxml')
print(bs.prettify())
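
To see what the parser argument changes, here is a minimal sketch that uses `html.parser` instead of `lxml`, chosen only so that it runs without installing anything beyond `bs4`. It assumes the same one-line fragment as above. `html.parser` leaves a fragment as it is, whereas `lxml` wraps it in `<html>` and `<body>` tags.

```python
from bs4 import BeautifulSoup

html = "<div>test</div>"

# html.parser is bundled with Python, so no extra package is needed.
soup = BeautifulSoup(html, "html.parser")

# html.parser does not add <html>/<body> wrappers around a fragment.
print(soup.html)        # None
print(soup.div.string)  # test
print(soup.prettify())
```

If the rest of a script relies on `soup.body` or `soup.html`, the choice of parser therefore matters even for well-formed input.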

Extracting elements

from bs4 import BeautifulSoup

html = """
<tr>
    <td>1</td>
    <td>2</td>
</tr>
<tr class='even'>
    <td>1</td>
    <td>2</td>
</tr>
<a class='test' id='test' href="www.baidu.com">2</a>
<a href="www.baidu.com"></a>
"""
soup = BeautifulSoup(html, 'lxml')
# 1. Get all tr tags
trs = soup.find_all('tr')
for tr in trs:
    print(tr)
# 2. Get the second tr tag
# limit caps the number of results returned
tr = soup.find_all('tr', limit=2)[1]
# 3. Get all tr tags whose class is even
trs = soup.find_all('tr', class_='even')
# or, equivalently
trs = soup.find_all('tr', attrs={'class': 'even'})
print(trs)
# 4. Get all a tags whose id is test and whose class is also test
aList = soup.find_all('a', id='test', class_='test')
# or
aList = soup.find_all('a', attrs={'id': 'test', 'class': 'test'})
print(aList)
# 5. Get the href attribute of every a tag
aList = soup.find_all('a')
for a in aList:
    # via subscript access
    href = a['href']
    print(href)
    # via the attrs dictionary
    href = a.attrs['href']
    print(href)
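# 6. Get the plain text
trs = soup.find_all('tr')
for tr in trs:
    print(tr)
    print(tr.string)
# .string returns None when a tag has more than one child, so it cannot read these rows
# 7. Get all text under each tr tag
trs = soup.find_all('tr')
for tr in trs:
    print(list(tr.stripped_strings))
# find() vs find_all():
# find() returns the first matching tag; find_all() returns all matching tags as a list.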
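
The difference between `find()` and `find_all()`, and the way `.string` behaves on a tag with several children, can be checked with a small sketch. The markup below is a hypothetical sample, and `html.parser` is used only so the snippet runs without `lxml`.

```python
from bs4 import BeautifulSoup

html = """
<tr><td>1</td><td>2</td></tr>
<tr class="even"><td>3</td><td>4</td></tr>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("tr")         # a single Tag, or None when nothing matches
all_rows = soup.find_all("tr")  # a list-like ResultSet, empty when nothing matches
missing = soup.find("table")    # no <table> in the sample, so this is None

print(first.string)                   # None, because tr has two children
print(first.get_text())               # 12
print(list(first.stripped_strings))   # ['1', '2']
print(len(all_rows))                  # 2
```

Because `find()` can return `None`, it is worth checking its result before indexing into it or reading attributes.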

select

from bs4 import BeautifulSoup

html = ''
soup = BeautifulSoup(html, 'lxml')
# 1. Select by tag name
p = soup.select('p')
# 2. Select by class name
p = soup.select('.className')
# 3. Select by id
p = soup.select('#idName')
# 4. Select with combinators (descendant, then direct child)
p = soup.select('.box p')
p = soup.select('.box>p')
# 5. Select by attribute value
p = soup.select('a[name="a"]')
# 6. Filter by tag name as well when selecting by class name or id
p = soup.select('div.line')
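
Since the listing above runs against an empty string, every `select()` call there returns an empty list. The sketch below applies the same kinds of selectors to a small hypothetical document; the class names, id and URL are made up for illustration, and `html.parser` is used so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

html = """
<div class="box" id="main">
  <p class="line">direct child</p>
  <section><p>nested</p></section>
</div>
<a name="a" href="https://example.com">link</a>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("p")))                  # 2: every p in the document
print(len(soup.select(".box p")))             # 2: any p inside .box
print(len(soup.select(".box > p")))           # 1: only direct children of .box
print(soup.select_one("#main")["id"])         # main
print(soup.select('a[name="a"]')[0]["href"])  # https://example.com
print(soup.select("p.line")[0].get_text())    # direct child
```

`select()` always returns a list, while `select_one()` returns the first match or `None`.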

Four common objects

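Tag: every tag in a BeautifulSoup tree is a Tag object, and the BeautifulSoup object itself is essentially a Tag as well. Methods such as find() and find_all() therefore belong to Tag rather than to BeautifulSoup.
NavigableString: inherits from Python's str and is used the same way as a str.
Comment: a subclass of NavigableString.
BeautifulSoup: inherits from Tag and is used to build the BeautifulSoup tree.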
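
These inheritance relationships can be confirmed with `isinstance`. This is a minimal sketch on a hypothetical one-line document, parsed with `html.parser` to avoid the `lxml` dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<p>text<!-- note --></p>", "html.parser")
p = soup.p
text, comment = p.contents  # one string followed by one comment

print(isinstance(soup, Tag))                 # True: BeautifulSoup inherits from Tag
print(isinstance(p, Tag))                    # True
print(isinstance(text, NavigableString))     # True
print(isinstance(text, str))                 # True: NavigableString inherits from str
print(isinstance(comment, Comment))          # True
print(isinstance(comment, NavigableString))  # True: Comment inherits from NavigableString
print(isinstance(text, Comment))             # False: plain text is not a comment
```

The last check is the usual way to skip comments when collecting the text of a page.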

Traversal

Both attributes below return the direct children of a tag, including strings.

contents: returns a list
children: returns an iterator
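
A short sketch of the difference, using a hypothetical list fragment and `html.parser`. Note that the whitespace between the two `li` tags counts as a child string, which is why filtering by `Tag` is often needed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<ul><li>a</li> <li>b</li></ul>", "html.parser")
ul = soup.ul

# contents is a real list, so it supports len() and indexing.
print(len(ul.contents))  # 3: <li>a</li>, the string ' ', <li>b</li>
print(ul.contents[0])    # <li>a</li>

# children is an iterator over the same direct children.
for child in ul.children:
    print(repr(child))

# Keep only the tags and drop the whitespace string.
tag_children = [child for child in ul.children if isinstance(child, Tag)]
print([t.get_text() for t in tag_children])  # ['a', 'b']
```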
