A Summary of Common beautifulsoup4 Parsing Methods for Python Web Scraping

Latest recommended article published on 2024-05-08 16:31:25

weixin_39642998


Tags: Python web scraping, beautifulsoup

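Abstract

How to parse web pages in a variety of situations with beautifulsoup4.

Using beautifulsoup4

The official documentation already covers beautifulsoup4 in great detail, so here I simply summarize some commonly used parsing methods for quick reference.

Loading an HTML document

The first step in using beautifulsoup is to load the HTML document into it, producing a BeautifulSoup object.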

import requests
from bs4 import BeautifulSoup

url = "http://new.qq.com/omn/20180705/20180705A0920X.html"
r = requests.get(url)
htmls = r.text
#print(htmls)
soup = BeautifulSoup(htmls, 'html.parser')

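When initializing the BeautifulSoup class you pass two arguments: the first is the HTML source we fetched, and the second is the HTML parser. Three parsers are commonly used: "html.parser", "lxml", and "html5lib". The official documentation recommends lxml because it is fast, but it has to be installed first with pip install lxml.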
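Because lxml and html5lib are optional installs, a script may want to fall back to whatever parser is available. The following is only a minimal sketch: the helper name make_soup and the preference order are my own assumptions, not something from the official documentation. It relies on bs4 raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


# Hypothetical helper: try parsers in order of preference and use the first
# one that is actually installed. "html.parser" ships with Python itself.
def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("none of the requested parsers is available")


soup, used = make_soup("<p class='title'>hello</p>")
print(used, soup.find("p").get_text())
```

Returning the parser name alongside the soup makes it easy to log which parser was used, which helps when the same page parses differently on two machines.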

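These three parsers can, however, produce different objects in some cases, for example when the markup is incomplete (only half of a p tag is present):

soup = BeautifulSoup("<a></p>", "html.parser")
# A tag with only a start tag is closed automatically; a tag with only an end tag is ignored
# Result: <a></a>
soup = BeautifulSoup("<a></p>", "lxml")
# Result: <html><body><a></a></body></html>
soup = BeautifulSoup("<a></p>", "html5lib")
# html5lib completes every tag that appears, even a half one
# Result: <html><head></head><body><a><p></p></a></body></html>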

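Usage

I will try to cover things roughly in order of how often I use them, since this is meant as a quick reference.

Getting a single tag by tag name, id, class, and so on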

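html = '<p class="title" id="p1">The Dormouses story</p>'
soup = BeautifulSoup(html, 'lxml')
# Get the whole p tag by its class name
soup.find(class_="title")
# or
soup.find("p", class_="title", id="p1")
# Get the text of the p tag with class title: "The Dormouses story"
soup.find(class_="title").get_text()
# When extracting text you can specify a separator between the text of different tags, and choose whether to strip leading and trailing whitespace.
soup = BeautifulSoup('<p class="title" id="p1"><b>The Dormouses story</b><b>The Dormouses story</b></p>', "html5lib")
soup.find(class_="title").get_text("|", strip=True)
# Result: The Dormouses story|The Dormouses story
# Get the id of the p tag with class title
soup.find(class_="title").get("id")
# Match the class name with a regular expression
import re
soup.find_all(class_=re.compile("tit"))
# recursive parameter: with recursive=False, only the direct children of the current tag are searched
soup = BeautifulSoup('<html><head><title>abc</title></head></html>', 'lxml')
soup.html.find_all("title", recursive=False)
# Result: [] (title is a child of head, not a direct child of html)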
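To make the effect of recursive=False concrete, here is a small sketch that runs the same search from two different starting tags. The sample markup is made up for illustration, and html.parser is used only so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

doc = "<html><head><title>abc</title></head><body><p>text</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

# Default (recursive=True): every descendant of <html> is searched
all_titles = soup.html.find_all("title")

# recursive=False: only the direct children of <html> (head and body) are checked
direct_titles = soup.html.find_all("title", recursive=False)

# <title> is a direct child of <head>, so starting from <head> it is found
head_titles = soup.head.find_all("title", recursive=False)

print(len(all_titles), len(direct_titles), len(head_titles))
```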

Getting multiple tags by tag name, id, class, and so on

soup = BeautifulSoup('<p class="title">The like story</p><p class="title">The Dormouses story</p>', "html5lib")
# Get all tags with class title
for i in soup.find_all(class_="title"):
    print(i.get_text())
# Get a limited number of tags with class title
for i in soup.find_all(class_="title", limit=2):
    print(i.get_text())
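Since find_all returns a list-like result, it is often handier to collect the text into a plain list than to print inside a loop. A short sketch under assumed sample markup (the third paragraph is invented so that limit=2 has a visible effect):

```python
from bs4 import BeautifulSoup

doc = """
<p class="title">The like story</p>
<p class="title">The Dormouses story</p>
<p class="title">A third story</p>
"""
soup = BeautifulSoup(doc, "html.parser")

# All matching tags, in document order
all_titles = [p.get_text(strip=True) for p in soup.find_all("p", class_="title")]

# limit=2 stops the search after two matches
first_two = [p.get_text(strip=True) for p in soup.find_all("p", class_="title", limit=2)]

# find() returns the first matching tag itself (or None), not a list
first = soup.find("p", class_="title")

print(all_titles, first_two, first.get_text())
```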

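Getting a tag by its other attributes

html = '<a alog-action="qb-ask-uname">蜗牛宋</a>'
soup = BeautifulSoup(html, 'lxml')
# Get "蜗牛宋". This tag has neither a class nor an id, so the lookup rule has to be defined by its attribute
author = soup.find('a', {"alog-action": "qb-ask-uname"}).get_text()
# or
author = soup.find(attrs={"alog-action": "qb-ask-uname"}).get_text()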
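The dict form matters here because an attribute name containing a hyphen cannot be written as a Python keyword argument. The sketch below shows the dict lookup next to an equivalent CSS attribute selector; the second a tag is a made-up addition so the filtering is visible.

```python
from bs4 import BeautifulSoup

# The second <a> is hypothetical, added only to show that the filter is selective
html = '<a alog-action="qb-ask-uname">蜗牛宋</a><a alog-action="other">someone</a>'
soup = BeautifulSoup(html, "html.parser")

# soup.find("a", alog-action="...") would be a SyntaxError, so pass a dict
by_attrs = soup.find("a", attrs={"alog-action": "qb-ask-uname"})

# The same lookup written as a CSS attribute selector
by_css = soup.select_one('a[alog-action="qb-ask-uname"]')

# True matches every tag that has the attribute, whatever its value
with_attr = soup.find_all("a", attrs={"alog-action": True})

print(by_attrs.get_text(), by_css.get_text(), len(with_attr))
```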

Finding preceding and following tags

soup.find_all_previous("p")
soup.find_previous("p")
soup.find_all_next("p")
soup.find_next("p")
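In practice these methods are usually called on a tag that has already been located rather than on the soup object, since nothing comes before the start of the document. A minimal sketch with invented markup:

```python
from bs4 import BeautifulSoup

doc = "<div><p id='a'>one</p><p id='b'>two</p><p id='c'>three</p></div>"
soup = BeautifulSoup(doc, "html.parser")

# Start from the middle paragraph and look in both directions
middle = soup.find("p", id="b")

prev_p = middle.find_previous("p")        # nearest <p> before it in the document
next_p = middle.find_next("p")            # nearest <p> after it in the document
all_prev = middle.find_all_previous("p")  # every earlier <p>
all_next = middle.find_all_next("p")      # every later <p>

print(prev_p["id"], next_p["id"], len(all_prev), len(all_next))
```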

Finding parent tags

soup.find_parents("div")
soup.find_parent("div")
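As with the previous group, the parent lookups are normally called on a tag found earlier. A short sketch, assuming a made-up nesting of two div tags:

```python
from bs4 import BeautifulSoup

doc = "<div id='outer'><div id='inner'><p><a href='#'>link</a></p></div></div>"
soup = BeautifulSoup(doc, "html.parser")
a = soup.find("a")

# find_parent returns the nearest matching ancestor
nearest = a.find_parent("div")

# find_parents returns all matching ancestors, nearest first
ancestors = [d["id"] for d in a.find_parents("div")]

# When no ancestor matches, find_parent returns None
no_table = a.find_parent("table")

print(nearest["id"], ancestors, no_table)
```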

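CSS selectors

soup.select("title") # by tag name
soup.select("html head title") # by nested tag names
soup.select("p > a") # a tags that are direct children of a p tag
soup.select("p > #link1") # direct child of a p tag, looked up by id
soup.select("#link1 ~ .sister") # all following siblings of #link1 with class sister
soup.select("#link1 + .sister") # the sibling immediately after #link1, if it has class sister
soup.select(".sister") # by class name
soup.select("#sister") # by id
soup.select('a[href="http://example.com/elsie"]') # by attribute value
soup.select('a[href$="tillie"]') # href ends with "tillie"
soup.select_one(".sister") # only the first match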
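The selectors above are easier to compare when run against one concrete document. The sketch below uses a cut-down version of the "three sisters" sample from the official documentation; the helper ids is my own shorthand for readability.

```python
from bs4 import BeautifulSoup

doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(doc, "html.parser")


# Hypothetical helper: reduce a list of tags to their id attributes
def ids(tags):
    return [t["id"] for t in tags]


children = ids(soup.select("p > a"))               # direct children of <p>
following = ids(soup.select("#link1 ~ .sister"))   # all later siblings
adjacent = ids(soup.select("#link1 + .sister"))    # only the next sibling
ends_with = ids(soup.select('a[href$="tillie"]'))  # href ends with "tillie"
first = soup.select_one(".sister")["id"]           # first match only
missing = soup.select("#sister")                   # no tag has id="sister"

print(children, following, adjacent, ends_with, first, missing)
```

Note that select always returns a list (empty when nothing matches), while select_one returns a single tag or None.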

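Watch out for a few errors that may come up; you can catch them with try to keep the crawler process from being interrupted.

UnicodeEncodeError: 'charmap' codec can't encode character u'\xfoo' in position bar (or any other kind of UnicodeEncodeError)

The text needs to be re-encoded, for example to UTF-8, before it is printed or written out.

AttributeError: 'NoneType' object has no attribute 'foo'

The attribute does not exist: find() matched nothing and returned None, and an attribute was then accessed on that None.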
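Both errors can be guarded against in a few lines. This is only a sketch of one possible approach: the helper text_or_default and the "unknown" fallback value are my own assumptions, not part of beautifulsoup4.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title">The Dormouses story</p>', "html.parser")


# Hypothetical helper: text of the first match, or a default when find() returns None
def text_or_default(node, name, default="", **filters):
    tag = node.find(name, **filters)
    return tag.get_text(strip=True) if tag is not None else default


title = text_or_default(soup, "p", class_="title")
author = text_or_default(soup, "span", default="unknown", class_="author")

# The same protection written with try/except, as suggested in the text
try:
    author_via_try = soup.find("span", class_="author").get_text()
except AttributeError:
    author_via_try = "unknown"

# For UnicodeEncodeError: encode explicitly, replacing what the target cannot represent
ascii_bytes = "café".encode("ascii", errors="replace")

print(title, author, author_via_try, ascii_bytes)
```

Checking for None explicitly keeps the failure local to one field, so a single missing tag does not abort the whole page.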

That is all I will cover; it should handle most page structures.

Summary

That is everything in this article. I hope it serves as a useful reference for your study or work.
