What is Beautiful Soup?
Like XPath, covered in the previous post, Beautiful Soup is a Python HTML/XML parsing library that makes it easy to extract data from web pages.
lxml only traverses the document locally, while Beautiful Soup is built on the HTML DOM: it loads the entire document and parses the full DOM tree, so its time and memory overhead are much larger and its performance is lower than lxml's. In exchange, parsing HTML with BeautifulSoup is simple and the API is very friendly; it supports CSS selectors and the HTML parser in the Python standard library, as well as lxml's XML parser. Beautiful Soup 3 is no longer maintained, so new projects should use Beautiful Soup 4. Install it with pip: pip install beautifulsoup4. Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0
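As a quick check that the install worked, you can parse a tiny fragment (the markup below is just illustrative) with the built-in html.parser, which needs no third-party dependencies:

```python
from bs4 import BeautifulSoup

# Parse a one-tag fragment with the standard-library parser.
soup = BeautifulSoup("<p class='greeting'>Hello, soup!</p>", "html.parser")

print(soup.p.string)    # Hello, soup!
print(soup.p["class"])  # ['greeting'] (class is multi-valued, so it comes back as a list)
```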
Comparing extraction approaches
Tool | Speed | Ease of use | Installation |
---|---|---|---|
Regular expressions | Fastest | Hard | None (built in) |
BeautifulSoup | Slow | Easiest | Easy |
lxml | Fast | Easy | Moderate |
Parsers supported by Beautiful Soup
Parser | Usage | Advantages |
---|---|---|
Python standard library | BeautifulSoup(markup, 'html.parser') | Built into Python; moderate speed; good error tolerance |
lxml HTML parser | BeautifulSoup(markup, 'lxml') | Fast; good error tolerance |
lxml XML parser | BeautifulSoup(markup, 'xml') | Fast; the only parser that supports XML |
html5lib | BeautifulSoup(markup, 'html5lib') | Best error tolerance; parses the document the way a browser does; produces valid HTML5 |
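The parser name from the table is passed as the second argument to the BeautifulSoup constructor. A minimal sketch; note that lxml and html5lib are third-party packages and may not be installed, so the loop tolerates their absence:

```python
from bs4 import BeautifulSoup

html = "<ul><li>Foo<li>Bar"  # deliberately broken markup: unclosed tags

# The second argument selects the parser; each one repairs
# broken markup in its own way.
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(html, parser)
        print(parser, "->", len(soup.find_all("li")), "li tags found")
    except Exception:
        print(parser, "is not installed")
```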
Note that bs4 relies on one of these underlying parsers when parsing a document.
Using BeautifulSoup4
Basic usage
from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
# prettify() formats the output and completes unclosed HTML tags
print(soup.prettify())
# The title tag
print(soup.title)
# The title tag's name
print(soup.title.name)
# The title tag's text content
print(soup.title.string)
Tag selectors
(1) Selecting elements
from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
# Create the soup object
soup = BeautifulSoup(html, 'lxml')
# The title tag
print(soup.title)
# Its type (bs4.element.Tag)
print(type(soup.title))
# The head tag
print(soup.head)
# The p tag; only the first match is returned
print(soup.p)
(2) Getting the tag name
from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
# Create the soup object
soup = BeautifulSoup(html, 'lxml')
# The tag's name
print(soup.title.name)
(3) Getting attributes
from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="demo"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
# Create the soup object
soup = BeautifulSoup(html, 'lxml')
# Getting an attribute, option 1
print(soup.p.attrs['name'])
# Getting an attribute, option 2 (shorthand)
print(soup.p['name'])
(4) Getting text content
from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="demo"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
# Create the soup object
soup = BeautifulSoup(html, 'lxml')
# The text content of the first matching p tag
print(soup.p.string)
(5) Nested selection
from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="demo"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
# Create the soup object
soup = BeautifulSoup(html, 'lxml')
# Nested selection: the text of the title tag inside the head tag
print(soup.head.title.string)
(6) Children and descendants
from bs4 import BeautifulSoup
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
# All direct children of the p tag, as a list
print(soup.p.contents)
# children also returns the direct children, but as an iterator
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
# descendants yields all descendant nodes (children, grandchildren, and so on)
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
(7) Parent and ancestor nodes
from bs4 import BeautifulSoup
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
# The direct parent of the first a tag
# print(soup.a.parent)
# All ancestor nodes of the first a tag
print(list(enumerate(soup.a.parents)))
(8) Sibling nodes
from bs4 import BeautifulSoup
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
# All siblings after the first a tag
print(list(enumerate(soup.a.next_siblings)))
# All siblings before the first a tag
print(list(enumerate(soup.a.previous_siblings)))
Standard selectors
1. find_all(name, attrs, recursive, text, **kwargs)
Searches the document by tag name, attributes, or text content, and returns all matching elements.
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# Find all ul tags; returns a list of all matches
print(soup.find_all('ul'))
# The type of the first match (bs4.element.Tag)
print(type(soup.find_all('ul')[0]))
# Iterate over the ul tags and pull out the li tags inside each
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
The attrs argument
from bs4 import BeautifulSoup
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# attrs takes a dictionary of attribute name/value pairs
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))
# Common attributes can also be passed as keyword arguments
print(soup.find_all(id="list-1"))
# class is a Python keyword, so it cannot be passed directly; use class_ instead
print(soup.find_all(class_="element"))
The text argument
This matches text nodes and returns the matching strings rather than the tags, so it is not convenient for locating elements.
from bs4 import BeautifulSoup
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# Select by text content; returns the matching strings, not the tags
print(soup.find_all(text='Foo'))
2. find(name, attrs, recursive, text, **kwargs)
find returns a single element: the first match, or None if nothing matches.
from bs4 import BeautifulSoup
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# The first ul tag (first match only)
print(soup.find('ul'))
# Its type
print(type(soup.find('ul')))
# No <page> tag exists, so find() returns None
print(soup.find('page'))
Other similar methods
(1) find_parents() vs. find_parent()
find_parents(): returns all ancestor nodes
find_parent(): returns the direct parent node
(2) find_next_siblings() vs. find_next_sibling()
find_next_siblings(): returns all following sibling nodes
find_next_sibling(): returns the first following sibling node
(3) find_previous_siblings() vs. find_previous_sibling()
find_previous_siblings(): returns all preceding sibling nodes
find_previous_sibling(): returns the first preceding sibling node
(4) find_all_next() vs. find_next()
find_all_next(): returns all matching nodes after the current node
find_next(): returns the first matching node after the current node
(5) find_all_previous() vs. find_previous()
find_all_previous(): returns all matching nodes before the current node
find_previous(): returns the first matching node before the current node
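A short sketch of a few of these methods, reusing the three-sisters markup from the earlier examples (html.parser is used here so the fragment is parsed as-is):

```python
from bs4 import BeautifulSoup

html = """
<p class="story"><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a></p>
"""
soup = BeautifulSoup(html, "html.parser")
first_a = soup.a  # the link1 tag

# Direct parent vs. all ancestors
print(first_a.find_parent("p")["class"])         # ['story']
print([t.name for t in first_a.find_parents()])  # the p tag, then the document root

# First following sibling vs. all following siblings (tags only)
print(first_a.find_next_sibling("a")["id"])                # link2
print([a["id"] for a in first_a.find_next_siblings("a")])  # ['link2', 'link3']

# Nothing precedes link1, so the previous-sibling lookup returns None
print(first_a.find_previous_sibling("a"))  # None
```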
CSS selectors
Pass a CSS selector string directly to select() to make a selection.
from bs4 import BeautifulSoup
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# Select by class: a space selects descendants
print(soup.select('.panel .panel-heading'))
# li tags inside ul tags
print(soup.select('ul li'))
# Elements with class "element" inside the element with id "list-2"
print(soup.select('#list-2 .element'))
# The type of the first result (bs4.element.Tag)
print(type(soup.select('ul')[0]))
(1) Getting attributes
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
(2) Getting text content
from bs4 import BeautifulSoup
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    # get_text() returns the tag's text content
    print(li.get_text())
Summary
1. Prefer the lxml parser; use html.parser when necessary.
2. Tag selectors (e.g. soup.title) have weak filtering but are fast.
3. Use find() or find_all() to match a single result or multiple results.
4. If you are familiar with CSS selectors, use the select() method.
5. Remember the common methods for getting attributes and text values.