1. Introduction to Beautiful Soup
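Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape. Because the API is simple, a complete application takes very little code.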
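Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it. In that case you only need to state the original encoding.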
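As a minimal sketch of stating the original encoding, the snippet below feeds Beautiful Soup GBK-encoded bytes that carry no charset declaration and passes from_encoding. The sample markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: GBK-encoded bytes with no <meta charset> declaration
raw = '<html><head><title>你好</title></head></html>'.encode('gbk')

# from_encoding tells Beautiful Soup how to decode the incoming bytes
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')

title_text = soup.title.string     # text is now a Unicode str
utf8_bytes = soup.encode('utf-8')  # serialize the document back out as UTF-8 bytes

print(title_text)
print(soup.original_encoding)      # the encoding Beautiful Soup ended up using
```

If the input is already a str rather than bytes, no decoding happens and from_encoding is not needed.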
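Beautiful Soup is not a parser itself: it sits on top of Python parsers such as lxml and html5lib (and the built-in html.parser), which lets you try different parsing strategies or trade flexibility for speed.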
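To make the choice of parser concrete, here is a small sketch that tries several parser names and skips the ones that are not installed. The sloppy sample markup is an assumption chosen for illustration; lxml and html5lib are optional third-party packages, while html.parser ships with Python.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = '<p>Hello<p>World'  # deliberately sloppy markup (unclosed p tags)

available = []
for parser in ('lxml', 'html5lib', 'html.parser'):
    try:
        soup = BeautifulSoup(html, parser)
    except FeatureNotFound:
        # Raised when the requested parser library is not installed
        continue
    available.append(parser)
    # Each parser may repair broken markup in its own way
    print(parser, '->', soup.find_all('p'))
```

Naming the parser explicitly, as the rest of this article does with html.parser, keeps results consistent across machines.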
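2. Basic use of CSS selectors in BeautifulSoup
2.1 Pick a piece of HTML
I copied some HTML from the Baidu home page to use as an example. Save the following code in the same directory as your script, in a file named test.html: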
<html>
<head><title>practice BeautifulSoup</title></head>
<body class="baidu" style="hello">
<div id="wrapper" class="wrapper_new">
<div id="s-top-left" class="s-top-left s-isindex-wrap">
<a href="http://news.baidu.com" class="mnav1">新闻</a>
<a href="https://www.hao123.com" class="mnav2">hao123</a>
<a href="http://map.baidu.com" class="mnav3">地图</a>
<a href="https://live.baidu.com/" class="mnav4">直播</a>
<a href="https://haokan.baidu.com/?sfrom=baidu-top" class="mnav1">视频</a>
<a href="http://tieba.baidu.com" class="mnav2">贴吧</a>
<a href="http://xueshu.baidu.com" class="mnav3">学术</a>
</div>
<ul class="s-hotsearch-content" id="hotsearch-content-wrapper">
<li class="hotsearch-item odd" data-index="0">
<span class="title-content-title">#苏炳添有望圆梦奥运奖牌#</span>
</li>
<li class="hotsearch-item even" data-index="3">
<span class="title-content-title">小学生为要偶像签名被骗19100元</span>
</li>
<li class="hotsearch-item odd" data-index="1">
<span class="title-content-title">40秒回顾英仙座流星雨划过天际</span>
</li>
<li class="hotsearch-item odd" data-index="2">
<span class="title-content-title">奥运接力银牌得主被停赛</span>
</li>
</ul>
</div>
</body></html>
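2.2 Load the HTML and create the object
from bs4 import BeautifulSoup
# Open the file explicitly so it is closed again after parsing
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
This reads test.html, selects html.parser as the parser, and builds a bs4.BeautifulSoup object from the HTML text. Passing encoding='utf-8' assumes the file was saved as UTF-8. Every example that follows uses this object's select method to extract information.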
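Besides select, the same object also offers select_one, which is handy when only the first match matters. A minimal sketch on a made-up inline snippet (so it runs without test.html):

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, shortened from the sample page
html = (
    '<div>'
    '<a class="mnav1" href="http://news.baidu.com">news</a>'
    '<a class="mnav2" href="https://www.hao123.com">hao123</a>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')

all_links = soup.select('a')        # list-like result holding every match
first_link = soup.select_one('a')   # only the first matching Tag
missing = soup.select_one('table')  # None when nothing matches

print(len(all_links), first_link['href'], missing)
```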
3. Basic usage
3.1 Select a tag by name
from bs4 import BeautifulSoup
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
items = soup.select('title')
for item in items:
    print(item.name)
    print(item.string)
# Output:
# title
# practice BeautifulSoup
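Taking the title tag as an example: pass the tag name as the argument and select pulls the title tag straight out of the document. select returns a bs4.element.ResultSet, which behaves like a list. Each element is a bs4.element.Tag object; its name attribute gives the tag name and its string attribute (an attribute, not a method) gives the tag's text.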
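One caveat about the string attribute: it only returns text when the tag has a single child, and it is None otherwise. The sketch below contrasts it with get_text on a made-up snippet.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one tag with plain text, one with mixed content
html = '<p id="plain">Hello</p><p id="mixed">Hello <b>world</b></p>'
soup = BeautifulSoup(html, 'html.parser')

plain = soup.select_one('#plain')
mixed = soup.select_one('#mixed')

plain_string = plain.string    # single text child, so the text is returned
mixed_string = mixed.string    # text plus a <b> child, so this is None
mixed_text = mixed.get_text()  # get_text joins all nested text

print(plain_string, mixed_string, mixed_text)
```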
3.2 Select a tag by id
In CSS, an id selector is the id prefixed with #. For example, to select the tag whose id is s-top-left:
from bs4 import BeautifulSoup
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
items = soup.select('#s-top-left')
print(items)
# Output:
# [<div class="s-top-left s-isindex-wrap" id="s-top-left">
# <a class="mnav1" href="http://news.baidu.com">新闻</a>
# <a class="mnav2" href="https://www.hao123.com">hao123</a>
# <a class="mnav3" href="http://map.baidu.com">地图</a>
# <a class="mnav4" href="https://live.baidu.com/">直播</a>
# <a class="mnav1" href="https://haokan.baidu.com/?sfrom=baidu-top">视频</a>
# <a class="mnav2" href="http://tieba.baidu.com">贴吧</a>
# <a class="mnav3" href="http://xueshu.baidu.com">学术</a>
# </div>]
To select only a div whose id is s-top-left, put div in front of the #. The result is the same as above:
from bs4 import BeautifulSoup
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
items = soup.select('div#s-top-left')
print(items)
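3.3 Select tags by class, and read a tag's text and attribute values
To select by class, put a . in front of the class name and pass it to select; every tag with that class is matched. Here we select the a tags whose class is mnav1:
from bs4 import BeautifulSoup
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
items = soup.select('a.mnav1')
for item in items:
    print(item)  # each a tag
    print(item.string)  # the tag's text
    print(item.attrs)  # all of the tag's attributes
    print(item.get('class'))  # the value of one attribute
    print()
# Output: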
# <a class="mnav1" href="http://news.baidu.com">新闻</a>
# 新闻
# {'href': 'http://news.baidu.com', 'class': ['mnav1']}
# ['mnav1']
#
# <a class="mnav1" href="https://haokan.baidu.com/?sfrom=baidu-top">视频</a>
# 视频
# {'href': 'https://haokan.baidu.com/?sfrom=baidu-top', 'class': ['mnav1']}
# ['mnav1']
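Class selectors can also be chained when a tag carries several classes, and attribute values can be read either with get or with square brackets. A minimal sketch on a shortened, made-up copy of the hot-search list:

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML modeled on the sample list
html = '''<ul>
<li class="hotsearch-item odd" data-index="0">first</li>
<li class="hotsearch-item even" data-index="3">second</li>
</ul>'''
soup = BeautifulSoup(html, 'html.parser')

# Chained classes: the li must have BOTH hotsearch-item and odd
odd_items = soup.select('li.hotsearch-item.odd')
item = odd_items[0]

index_value = item['data-index']  # raises KeyError if the attribute is absent
classes = item.get('class')       # class is multi-valued, so this is a list
no_href = item.get('href')        # get returns None for a missing attribute

print(index_value, classes, no_href)
```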
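3.4 Selecting through nested tags
3.4.1 Use '>' for direct parent-child relationships
For example, select the a tags that are children of a div that is itself a child of the tag whose id is wrapper. Adjacent parts of the expression must be direct parent and child: div is a child of #wrapper, and a is a grandchild of #wrapper.
from bs4 import BeautifulSoup
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
items = soup.select('#wrapper > div > a')
for item in items:
    print(item)
# Output: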
# <a class="mnav1" href="http://news.baidu.com">新闻</a>
# <a class="mnav2" href="https://www.hao123.com">hao123</a>
# <a class="mnav3" href="http://map.baidu.com">地图</a>
# <a class="mnav4" href="https://live.baidu.com/">直播</a>
# <a class="mnav1" href="https://haokan.baidu.com/?sfrom=baidu-top">视频</a>
# <a class="mnav2" href="http://tieba.baidu.com">贴吧</a>
# <a class="mnav3" href="http://xueshu.baidu.com">学术</a>
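3.4.2 Use a space when the tags are not direct parent and child
For example, select the span tags inside li tags inside body. body and li are not direct parent and child, but li is a descendant of body, so a space is enough:
from bs4 import BeautifulSoup
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
items = soup.select('body li span')
for item in items:
    print(item)
# Output: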
# <span class="title-content-title">#苏炳添有望圆梦奥运奖牌#</span>
# <span class="title-content-title">小学生为要偶像签名被骗19100元</span>
# <span class="title-content-title">40秒回顾英仙座流星雨划过天际</span>
# <span class="title-content-title">奥运接力银牌得主被停赛</span>
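The difference between the two combinators is easiest to see side by side. The sketch below runs both against the same made-up snippet.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML: span is a descendant of body, but not a direct child
html = '<body><ul><li><span>a</span></li><li><span>b</span></li></ul></body>'
soup = BeautifulSoup(html, 'html.parser')

descendants = soup.select('body span')            # space: any depth below body
direct = soup.select('body > span')               # '>': direct children only
full_chain = soup.select('body > ul > li > span') # every step is parent -> child

print(len(descendants), len(direct), len(full_chain))
```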
3.5 Select tags that have an href attribute
from bs4 import BeautifulSoup
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
items = soup.select('[href]')
for item in items:
    print(item)
# Output:
# <a class="mnav1" href="http://news.baidu.com">新闻</a>
# <a class="mnav2" href="https://www.hao123.com">hao123</a>
# <a class="mnav3" href="http://map.baidu.com">地图</a>
# <a class="mnav4" href="https://live.baidu.com/">直播</a>
# <a class="mnav1" href="https://haokan.baidu.com/?sfrom=baidu-top">视频</a>
# <a class="mnav2" href="http://tieba.baidu.com">贴吧</a>
# <a class="mnav3" href="http://xueshu.baidu.com">学术</a>
3.6 Select several tags at once
from bs4 import BeautifulSoup
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
items = soup.select('div#s-top-left, ul#hotsearch-content-wrapper')
for item in items:
    print(item)
# Output:
# <div class="s-top-left s-isindex-wrap" id="s-top-left">
# <a class="mnav1" href="http://news.baidu.com">新闻</a>
# <a class="mnav2" href="https://www.hao123.com">hao123</a>
# <a class="mnav3" href="http://map.baidu.com">地图</a>
# <a class="mnav4" href="https://live.baidu.com/">直播</a>
# <a class="mnav1" href="https://haokan.baidu.com/?sfrom=baidu-top">视频</a>
# <a class="mnav2" href="http://tieba.baidu.com">贴吧</a>
# <a class="mnav3" href="http://xueshu.baidu.com">学术</a>
# </div>
# <ul class="s-hotsearch-content" id="hotsearch-content-wrapper">
# <li class="hotsearch-item odd" data-index="0">
# <span class="title-content-title">#苏炳添有望圆梦奥运奖牌#</span>
# </li>
# <li class="hotsearch-item even" data-index="3">
# <span class="title-content-title">小学生为要偶像签名被骗19100元</span>
# </li>
# <li class="hotsearch-item odd" data-index="1">
# <span class="title-content-title">40秒回顾英仙座流星雨划过天际</span>
# </li>
# <li class="hotsearch-item odd" data-index="2">
# <span class="title-content-title">奥运接力银牌得主被停赛</span>
# </li>
# </ul>
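A comma-separated selector list matches any of its parts, and the results come back in document order rather than in the order the selectors were written. A small sketch with made-up ids:

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML: the div appears before the ul in the document
html = '<div id="first">1</div><ul id="second"><li>2</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# The ul selector is written first, but the div still comes back first
items = soup.select('ul#second, div#first')
ids = [item.get('id') for item in items]

print(ids)
```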
3.7 Select a tags that have an href attribute
from bs4 import BeautifulSoup
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
items = soup.select('a[href]')
for item in items:
    print(item)
# Output:
# <a class="mnav1" href="http://news.baidu.com">新闻</a>
# <a class="mnav2" href="https://www.hao123.com">hao123</a>
# <a class="mnav3" href="http://map.baidu.com">地图</a>
# <a class="mnav4" href="https://live.baidu.com/">直播</a>
# <a class="mnav1" href="https://haokan.baidu.com/?sfrom=baidu-top">视频</a>
# <a class="mnav2" href="http://tieba.baidu.com">贴吧</a>
# <a class="mnav3" href="http://xueshu.baidu.com">学术</a>
3.8 Select tags by an exact attribute value
from bs4 import BeautifulSoup
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
items = soup.select('[href="https://haokan.baidu.com/?sfrom=baidu-top"]')
for item in items:
    print(item)
# Output:
# <a class="mnav1" href="https://haokan.baidu.com/?sfrom=baidu-top">视频</a>
3.9 Select a tags whose href starts with https
from bs4 import BeautifulSoup
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
items = soup.select('a[href^="https"]')
for item in items:
    print(item)
# Output:
# <a class="mnav2" href="https://www.hao123.com">hao123</a>
# <a class="mnav4" href="https://live.baidu.com/">直播</a>
# <a class="mnav1" href="https://haokan.baidu.com/?sfrom=baidu-top">视频</a>
3.10 Select a tags whose href ends with hao123.com
from bs4 import BeautifulSoup
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
items = soup.select('a[href$="hao123.com"]')
for item in items:
    print(item)
# Output:
# <a class="mnav2" href="https://www.hao123.com">hao123</a>
3.11 Select a tags whose href contains 'www'
from bs4 import BeautifulSoup
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
items = soup.select('a[href*="www"]')
for item in items:
    print(item)
# Output:
# <a class="mnav2" href="https://www.hao123.com">hao123</a>
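The three attribute operators (^= for prefix, $= for suffix, *= for substring) can be compared on one small input. This is a sketch on three links copied from the sample page; note that a trailing slash in an href changes what the suffix match sees.

```python
from bs4 import BeautifulSoup

# Inline copy of three links from the sample page (link text shortened)
html = '''
<a href="http://news.baidu.com">news</a>
<a href="https://www.hao123.com">hao123</a>
<a href="https://live.baidu.com/">live</a>
'''
soup = BeautifulSoup(html, 'html.parser')

starts_https = [a['href'] for a in soup.select('a[href^="https"]')]   # prefix
ends_com = [a['href'] for a in soup.select('a[href$=".com"]')]        # suffix
contains_baidu = [a['href'] for a in soup.select('a[href*="baidu"]')] # substring

print(starts_https)
print(ends_com)  # live.baidu.com/ is absent because its href ends with '/'
print(contains_baidu)
```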
3.12 Select a tags that have a class attribute
from bs4 import BeautifulSoup
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
items = soup.select('a[class]')
for item in items:
    print(item)
# Output:
# <a class="mnav1" href="http://news.baidu.com">新闻</a>
# <a class="mnav2" href="https://www.hao123.com">hao123</a>
# <a class="mnav3" href="http://map.baidu.com">地图</a>
# <a class="mnav4" href="https://live.baidu.com/">直播</a>
# <a class="mnav1" href="https://haokan.baidu.com/?sfrom=baidu-top">视频</a>
# <a class="mnav2" href="http://tieba.baidu.com">贴吧</a>
# <a class="mnav3" href="http://xueshu.baidu.com">学术</a>
4. Closing
If you spot any mistakes, corrections are welcome!