1. Introduction to BeautifulSoup4
Like lxml, Beautiful Soup is an HTML/XML parser whose main job is parsing and extracting data from HTML/XML documents.
lxml only traverses the document section by section, while Beautiful Soup is based on the HTML DOM: it loads the whole document and builds the full DOM tree, so its time and memory overhead are much higher and its performance is lower than lxml's.
BeautifulSoup makes parsing HTML simple, and its API is very friendly.
It supports CSS selectors (http://www.w3school.com.cn/cssref/css_selectors.asp), the HTML parser in the Python standard library, and lxml's XML parser.
The table below gives a direct comparison:
Scraping tool | Speed | Ease of use | Installation difficulty |
Regular expressions | Fastest | Hard | None (built-in) |
BeautifulSoup | Slow | Easiest | Easy |
lxml | Fast | Easy | Moderate |
Installation is straightforward; just run the appropriate command below:
Python 3 install command on Linux: sudo pip3 install beautifulsoup4
Python 2 install command with pip: sudo pip install beautifulsoup4
Official documentation (Chinese): http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0
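Since CSS selector support was mentioned above, here is a minimal sketch of it. The markup is a made-up fragment, and the standard-library html.parser is used so nothing extra needs to be installed:

```python
from bs4 import BeautifulSoup

# Made-up minimal markup, just for illustration
html = '<div><p class="story">hello</p><p>world</p></div>'

# html.parser is the parser from the Python standard library
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector string and returns a list of matches
print(soup.select("p.story")[0].text)  # hello
```

Note that select() always returns a list, even when only one element matches.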
2. The Four Object Types
Beautiful Soup transforms a complex HTML document into a complex tree structure in which every node is a Python object. All of these objects fall into four kinds:
· Tag: simply put, one of the tags in the HTML
· NavigableString: the text inside a tag
· BeautifulSoup: an object representing the content of the whole document
· Comment: a special kind of NavigableString whose output does not include the comment markers
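The four types above can be checked with a short snippet. The markup here is a made-up fragment chosen to contain one of each, and the standard-library parser is used:

```python
from bs4 import BeautifulSoup

# Made-up fragment containing a tag, some text, and a comment
soup = BeautifulSoup('<p class="title"><b>text</b><!-- a comment --></p>',
                     "html.parser")

print(type(soup))                # BeautifulSoup: the whole document
print(type(soup.p))              # Tag
print(type(soup.b.string))       # NavigableString: the text inside <b>
print(type(soup.p.contents[1]))  # Comment: the <!-- a comment --> node
```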
Since NavigableString is the one used most in practice, only NavigableString is covered in detail here.
# Import BeautifulSoup4
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object
soup = BeautifulSoup(html, "lxml")  # omitting "lxml" often causes a warning or error

print(soup.p.string)        # The Dormouse's story
print(type(soup.p.string))  # <class 'bs4.element.NavigableString'>
3. Traversing the Document Tree
3.1. Direct Children: the .contents and .children Attributes
The .contents and .children attributes contain only a tag's direct children.
3.1.1 .contents
A tag's .contents attribute returns the tag's children as a list:
print(soup.head.contents) #[<title>The Dormouse's story</title>]
Since the output is a list, we can use a list index to pick out one of its elements:
print(soup.head.contents[0])#<title>The Dormouse's story</title>
# Create the Beautiful Soup object
sp = BeautifulSoup(html, "lxml")
for a in sp.find_all(name="a"):
    content = a.contents
    print(content)
3.1.2 .children
The .children attribute does not return a list, but we can iterate over it to get all the child nodes.
Printing .children shows that it is a list iterator object:
print(soup.head.children)  # <listiterator object at 0x7f71457f5710>
for child in soup.body.children:
    print(child)
# Create the Beautiful Soup object
sp = BeautifulSoup(html, "lxml")
for item in sp.find_all(name="p"):
    # iterate over all the p tags and print each one's children
    for ch in item.children:
        print(ch)
3.2. All Descendants: the .descendants Attribute
The .contents and .children attributes contain only a tag's direct children, while the .descendants attribute recurses through all of a tag's descendants. As with .children, we need to iterate over it to get the contents.
for child in soup.descendants:
    print(child)
Output:
<html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> </body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story
<body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> </body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story
<p class="story">Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
; and they lived at the bottom of a well.
<p class="story">...</p>
...
Example 2:
# Create the Beautiful Soup object
sp = BeautifulSoup(html, "lxml")
list_p = sp.find_all(name="p")
for p in list_p:
    print(p.name, ": all descendants ==============")
    for child in p.descendants:
        print(child)
3.3. Node Content: the .string Attribute
If a tag has only a single child of type NavigableString, the tag can use .string to get that child. If a tag has exactly one child tag, the tag can also use .string, and the output is the same as calling .string on that only child.
In plain terms: if a tag contains no nested tags, .string returns the text inside it; if a tag contains exactly one nested tag, .string likewise returns the innermost content. For example:
print(soup.head.string)   # The Dormouse's story
print(soup.title.string)  # The Dormouse's story
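One caveat worth knowing, shown here with a made-up fragment: when a tag has more than one child, .string cannot decide which one to return and gives None:

```python
from bs4 import BeautifulSoup

# Made-up fragment: <p> has two children, <b> has exactly one
soup = BeautifulSoup("<p><b>one</b>two</p>", "html.parser")

print(soup.b.string)  # one
print(soup.p.string)  # None -- <p> has more than one child
```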
4. Searching the Document Tree: find_all
Signature: .find_all(name, attrs, recursive, text, **kwargs)
4.1 The name Parameter
The name parameter finds all tags whose name matches name; string objects (text nodes) are skipped automatically.
A. Passing a String
The simplest filter is a string. Pass a string to a search method and Beautiful Soup finds content that matches that string exactly. The example below finds all the <b> tags in the document:
print(soup.find_all('b'))
# [<b>The Dormouse's story</b>]
print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
B. Passing a Regular Expression
If you pass a regular expression, Beautiful Soup matches tag names against it with the regex's match() method. The example below finds all tags whose names start with b, which means both the <body> and <b> tags are found:
import re

soup = BeautifulSoup(html, "lxml")
# match all tags whose names start with b: the body tag and the b tag
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
C. Passing a List
If you pass a list, Beautiful Soup returns content matching any element of the list. The code below finds all the <a> and <b> tags in the document:
print(soup.find_all(["a", "b"]))
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
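Beyond strings, regular expressions, and lists, find_all also accepts a function that receives a tag and returns True or False. This is not covered above, so here is a supplementary sketch using a made-up predicate and fragment:

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration
html = '<p id="a">x</p><p class="c">y</p><b>z</b>'
soup = BeautifulSoup(html, "html.parser")

# Made-up predicate: match tags that have an id attribute
def has_id(tag):
    return tag.has_attr("id")

print(soup.find_all(has_id))  # [<p id="a">x</p>]
```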
4.2 The keyword Parameter
Find the tag whose id is link2:
print(soup.find_all(id='link2'))
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
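Other attributes work the same way through **kwargs. One wrinkle worth noting: class is a reserved word in Python, so Beautiful Soup uses class_ instead (an attrs dict also works). A sketch with a made-up fragment:

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration
html = ('<a class="sister" id="link1">Elsie</a>'
        '<a class="sister" id="link2">Lacie</a>')
soup = BeautifulSoup(html, "html.parser")

# class is a Python keyword, so the keyword argument is class_
print(soup.find_all("a", class_="sister"))
# the attrs-dict form is equivalent:
print(soup.find_all("a", attrs={"class": "sister"}))
```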
4.3 The text Parameter
The text parameter searches the document's string content. Like the name parameter, text accepts a string, a regular expression, or a list.
import re

print(soup.find_all(text="Elsie"))
# [] -- the only "Elsie" is inside a comment, and its text is " Elsie " with surrounding spaces
print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))
# ['Lacie', 'Tillie']
# find the strings containing "Dormouse"
print(soup.find_all(text=re.compile("Dormouse")))
# ["The Dormouse's story", "The Dormouse's story"]
4.4 Filtering by href
Get all the links:
from bs4 import BeautifulSoup
import re

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object
soup = BeautifulSoup(html, "lxml")

# get all the links
links = soup.find_all(href=re.compile(r'http://example.com/'))
for link in links:
    print(link["href"])