HTML document-tree traversal: Python web scraping with bs4 (traversing the document tree, searching the document tree, CSS selectors)

Latest recommended article published on 2024-03-28 21:33:54

weixin_39782832


Article tags: html, document-tree traversal

from bs4 import BeautifulSoup

import re

# The document content to parse

html_doc = """

The Dormouse's story

hhhh

Once upon a time there were three little sisters; and their names were

Elsie,

Lacie and

Tillie;

and they lived at the bottom of a well.

...

"""

soup = BeautifulSoup(html_doc,'lxml')
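The text of html_doc as shown above carries no tags, so the searches that follow would have nothing to match against it. As a minimal sketch, the block below uses hypothetical markup in the style of the Beautiful Soup documentation sample. The ids, classes, hrefs and the `data-a` attribute are assumptions chosen to fit the queries used later, not the article's exact document, and `html.parser` is swapped in for `lxml` only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: every attribute below is an assumption for illustration.
sample_html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister brother" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3" data-a="sister">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

# html.parser is part of the standard library; 'lxml' needs the lxml package.
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Read the title text and collect the id of every <a> tag in document order.
title_text = sample_soup.title.string
link_ids = [a["id"] for a in sample_soup.find_all("a")]
print(title_text)
print(link_ids)
```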

# Filters: find_all returns every tag that matches
# Match by tag name: pass a single name or a list of names
# print(soup.find_all('a'))
# print(soup.find_all(['a','p']))

# Find the <a> tag whose id is link1
# print(soup.find_all('a',attrs={'id':'link1'}))
# print(soup.find_all('a',attrs={'class':'sister'}))
# print(soup.find_all(name='a',id='link1'))

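# Note: to filter on the class attribute, use class_, because class is a Python keyword
# Several class names can be given in one string, separated by spaces.
# That form matches only when the tag's class attribute is exactly this string, e.g.:
# print(soup.find_all(name='a',class_='sister brother'))

# A single class name matches any tag whose classes include sister
# print(soup.find_all(name='a',class_='sister'))

# If an attribute name contains special characters, put the condition in attrs
# print(soup.find_all(name='a',attrs={'data-a':'sister'}))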

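# Match by text content (since Beautiful Soup 4.4.0 this argument is named string; text is the older name)
# print(soup.find_all(name='a',text='Elsie'))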
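To make the attribute filters above concrete, here is a small sketch run against hypothetical markup (the three links and their attributes are assumptions, not the article's document). It shows how `id`, `class_`, `attrs` and `string` each narrow the result, and in particular that a multi-class string matches only the exact attribute value.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; note link1 and link3 carry the same classes in a different order.
html = """
<p class="story">
<a class="sister brother" id="link1" href="http://example.com/elsie">Elsie</a>
<a class="sister" id="link2" data-a="sister" href="http://example.com/lacie">Lacie</a>
<a class="brother sister" id="link3" href="http://example.com/tillie">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")


def ids(tags):
    """Return the id of each tag so results are easy to compare."""
    return [t["id"] for t in tags]


# Keyword argument on an attribute name.
by_id = ids(soup.find_all("a", id="link1"))

# A single class name matches any tag that has that class among its classes.
by_one_class = ids(soup.find_all("a", class_="sister"))

# A space-separated string matches only the exact class attribute value,
# so "brother sister" on link3 is not matched.
by_exact_classes = ids(soup.find_all("a", class_="sister brother"))

# Attribute names that are not valid Python identifiers go in attrs.
by_data_attr = ids(soup.find_all("a", attrs={"data-a": "sister"}))

# string= (called text= before Beautiful Soup 4.4.0) matches the tag's text.
by_string = ids(soup.find_all("a", string="Elsie"))

print(by_id, by_one_class, by_exact_classes, by_data_attr, by_string)
```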

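# Filter types
# A string matches the tag name exactly: only <a> tags
# print(soup.find_all(name="a"))

# A regular expression matches tags whose name contains the letter b (e.g. body, b)
# res = re.compile('b')
# print(soup.find_all(name=res))

# A list matches any of the listed names
# print(soup.find_all(name=['body','a']))

# True matches every tag
# print(soup.find_all(True))

# Every tag that has an id attribute
# print(soup.find_all(id=True))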
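The filter types listed above can be compared side by side. The sketch below assumes a tiny hypothetical document and collects tag names for each kind of filter; it is meant only to illustrate how a string, a regular expression, a list, `True` and `id=True` differ.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical one-line document, for illustration only.
html = (
    "<html><head><title>T</title></head>"
    "<body><p id='p1'><b>bold</b> <a id='link1' href='#'>x</a></p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def names(tags):
    """Return the tag names in document order."""
    return [t.name for t in tags]


# A string matches the tag name exactly.
exact = names(soup.find_all("a"))

# A compiled pattern is applied with search(), so any name containing "b" matches.
by_regex = names(soup.find_all(re.compile("b")))

# A list matches any of its entries.
by_list = names(soup.find_all(["body", "a"]))

# True matches every tag (but not text nodes).
every_tag = names(soup.find_all(True))

# id=True matches tags that have an id attribute, whatever its value.
with_id = names(soup.find_all(id=True))

print(exact, by_regex, by_list, every_tag, with_id)
```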

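# Function filter (write a function that does the filtering)
# It must take exactly one argument: the tag being tested
def MyFilter(tag):
    return tag.name == "a" and tag.text != "Elsie" and tag.has_attr("id")

print(soup.find_all(MyFilter,limit=1))

# find is used the same way as find_all, but returns only the first match
print(soup.find('a'))

# Summary: a filter can be a string, a list, a regular expression, a function, or True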
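As a rough illustration of the function filter, `limit` and `find` described above, the following sketch uses hypothetical markup and a hypothetical helper named `is_named_link` (neither comes from the article). It shows that `limit=1` still returns a list, while `find` returns a single tag or `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration: one link has no id, one is named Elsie.
html = """
<a id="link1">Elsie</a>
<a id="link2">Lacie</a>
<a>Tillie</a>
<a id="link3">Tillie</a>
<p id="para">text</p>
"""
soup = BeautifulSoup(html, "html.parser")


def is_named_link(tag):
    """Accept <a> tags that have an id and whose text is not 'Elsie'."""
    return tag.name == "a" and tag.has_attr("id") and tag.get_text() != "Elsie"


# Without a limit, every matching tag is returned.
all_matches = [t["id"] for t in soup.find_all(is_named_link)]

# limit=1 stops after the first match but still returns a list.
limited = soup.find_all(is_named_link, limit=1)

# find returns the first matching tag itself, or None when nothing matches.
first = soup.find(is_named_link)
missing = soup.find("table")

print(all_matches, len(limited), first["id"], missing)
```

Because `find` can return `None`, code that indexes or reads attributes from its result is usually guarded with a check first.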
