Reference post: https://cuiqingcai.com/1319.html
1. Introduction to Beautiful Soup
Beautiful Soup is a Python library whose main job is extracting data from web pages. It works on top of parsers such as lxml and html5lib, letting you choose between flexible parsing strategies and raw speed. Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape; because it is so simple, a complete application takes very little code. Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8, so you normally don't need to think about encodings at all. The one exception is a document that doesn't declare its encoding: Beautiful Soup cannot detect it automatically, and you just need to specify the original encoding yourself.
Installation:
pip install beautifulsoup4, or conda install beautifulsoup4 (if Anaconda is installed)
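As a quick sanity check (a minimal sketch, not from the original post), parsing a one-line snippet with the built-in 'html.parser' confirms that the install works; "lxml" or "html5lib" can be swapped in as the second argument if they are installed:

```python
from bs4 import BeautifulSoup

# 'html.parser' ships with Python, so no extra dependency is needed.
soup = BeautifulSoup("<p class='demo'>hello</p>", "html.parser")
print(soup.p.string)    # the text inside the tag
print(soup.p["class"])  # the class attribute, always returned as a list
```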
Beautiful Soup turns a complex HTML document into a tree of Python objects. Every node is a Python object, and all of them fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
BeautifulSoup normalizes the HTML as it parses, repairing the tree where tags are missing or unclosed.
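For instance, even broken markup is repaired during parsing; prettify() (a standard Beautiful Soup method) then renders the normalized tree one node per line. A small sketch:

```python
from bs4 import BeautifulSoup

# The <b> tag below is never closed; the parser closes it at end of input,
# and prettify() shows the repaired, indented tree.
soup = BeautifulSoup("<html><body><b>broken", "html.parser")
print(soup.prettify())
```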
2. The four Beautiful Soup object types
All Beautiful Soup objects fall into one of four types:
1. Tag
2. NavigableString
3. BeautifulSoup
4. Comment
2.1 Tag
A Tag is, informally, one of the tags in the HTML, for example:
<title>The Dormouse's story</title>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Each of the title and a elements above, together with its contents, is a Tag.
print (soup.title)
Output: <title>The Dormouse's story</title>
print (soup.title.name)
Output: title
print (soup.title.string)
Output: The Dormouse's story
print (soup.title.parent.name)
Output: head
print (soup.p)
Output: <p class="title"><b>The Dormouse's story</b></p>
print (soup.p.get('class'))
print (soup.p['class'])
Output: ['title'] (a list)
print (soup.find_all('a'))
Output: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print (soup.find(id="link3"))
Output: <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
A tag's attributes are available as a dictionary:
print (soup.p.attrs)
{'class': ['title']}
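Beyond reading .attrs, a Tag's attributes can also be modified like dictionary entries; a short sketch (my addition, not from the original post):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title"><b>text</b></p>', "html.parser")
tag = soup.p
tag["id"] = "intro"  # add an attribute, dict-style
del tag["class"]     # delete one the same way
print(tag.attrs)     # only the remaining attribute is left
```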
2.2 NavigableString
The above showed how to get the tags themselves, but we often want the text inside a tag. For that we can use .string, get_text(), and similar.
The following two calls produce the same result:
print (soup.a.string)
print (soup.a.get_text())
Elsie
Elsie
Let's check what type the return value is:
print (type(soup.a.string))
Output: <class 'bs4.element.NavigableString'>
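One practical note: a NavigableString keeps a reference back to the whole parse tree, so when only the text is needed it can be converted with str(); a brief sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>bold text</b>", "html.parser")
s = soup.b.string  # a NavigableString, still tied to the tree
plain = str(s)     # a plain str, with no reference to the tree
print(type(s).__name__, "->", type(plain).__name__)
```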
2.3 BeautifulSoup
The BeautifulSoup object represents the entire document. Most of the time you can treat it as a special kind of Tag, and inspect its type, name, and attributes:
print (type(soup.name))
print (soup.name)
<class 'str'>
[document]
2.4 Comment
A Comment object is a special kind of NavigableString. Its string value no longer includes the comment markers, which can cause unexpected trouble in text processing if you don't handle it carefully. Suppose the first <a> tag is changed so that its content is commented out, i.e. <a ...><!-- Elsie --></a> (the outputs below assume this variant of the document):
print (soup.a)
print (soup.a.string)
print (type(soup.a.string))
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
Elsie (note: unlike the source markup, the comment markers are gone)
<class 'bs4.element.Comment'>
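A common defensive pattern (my addition, not from the original post) is to check the type with isinstance before treating .string as visible text:

```python
from bs4 import BeautifulSoup, Comment

# The tag's only child is a comment, so .string returns a Comment whose
# str() value silently drops the <!-- --> markers.
soup = BeautifulSoup('<a class="sister"><!-- Elsie --></a>', "html.parser")
text = soup.a.string
if isinstance(text, Comment):
    print("comment, not visible text:", str(text).strip())
```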
3. Traversing the document tree
3.1 contents
A tag's .contents attribute returns its children as a list:
print (soup.head.contents)
[<title>The Dormouse's story</title>]
print (soup.head.contents[0])
<title>The Dormouse's story</title>
3.2 children
.children returns an iterator rather than a list, but we can loop over it to get all the direct children. Printing .children shows that it is a list iterator:
print (soup.head.children)
<list_iterator object at 0x000000000D293588>
for child in soup.body.children:
    print (child)
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
3.3 All descendants (.descendants)
Unlike .children, which only covers direct children, .descendants recursively iterates over every node in the subtree:
for child in soup.descendants:
    print (child)
Output:
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story
3.4 Node content (.string)
If a tag has exactly one string child (possibly wrapped in a single nested tag), .string returns that string; if a tag has multiple children, .string returns None:
print (soup.head.string)
print (soup.title.string)
The Dormouse's story
The Dormouse's story
print (soup.html.string)
None
3.5 Getting multiple strings (.strings)
The .strings generator yields every string in the document, with no need to navigate to each node individually:
for string in soup.strings:
    print(repr(string))
'\n'
"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'
3.6 Stripping whitespace (.stripped_strings)
.stripped_strings works like .strings but skips whitespace-only strings and strips leading and trailing whitespace from the rest:
for string in soup.stripped_strings:
    print(repr(string))
"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
'...'
3.7 Parent node (.parent)
content = soup.head.title.string
print (content.parent.name)
title
3.8 All ancestors (.parents)
The .parents attribute iterates recursively over all of an element's ancestors:
for parent in content.parents:
    print (parent.name)
title
head
html
[document]
3.9 Sibling nodes (.next_sibling and .previous_sibling)
Siblings are nodes at the same level of the tree. .next_sibling returns a node's next sibling and .previous_sibling its previous one; if no such sibling exists, they return None.
Note: in a real document, a tag's .next_sibling or .previous_sibling is usually a whitespace string, because the newlines and spaces between tags count as nodes too.
print (soup.p.next_sibling)
# whitespace: the newline between the two <p> tags counts as a node
print (soup.p.previous_sibling)
# also whitespace: the newline that follows <body>
print (soup.p.next_sibling.next_sibling)
# the next tag sibling:
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
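To avoid chaining .next_sibling past whitespace nodes, find_next_sibling() (a standard Beautiful Soup method) jumps straight to the next matching tag sibling; a small sketch with a made-up two-paragraph document:

```python
from bs4 import BeautifulSoup

html = '<body>\n<p id="a">first</p>\n<p id="b">second</p>\n</body>'
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="a")
print(repr(first.next_sibling))       # the newline between the two tags
print(first.find_next_sibling("p"))   # skips the whitespace node
```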
3.10 All siblings (.next_siblings and .previous_siblings)
The .next_siblings and .previous_siblings attributes iterate over all of a node's following or preceding siblings:
for sibling in soup.a.next_siblings:
    print(repr(sibling))
',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'
4. Searching the document tree
4.1 find_all(name, attrs, recursive, text, **kwargs)
I. The name parameter
The name parameter finds every tag whose name matches; string nodes are ignored automatically.
1 Passing a string
If you pass a string to find_all, Beautiful Soup finds the tags whose name exactly matches it:
soup.find_all('a')
Out[11]:
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
2 Passing a regular expression
If you pass a regular expression, Beautiful Soup matches tag names against it. The code below finds every tag whose name starts with "b", which here means <body> and <b>:
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
Output:
body
b
3 Passing a list
If you pass a list, Beautiful Soup returns anything that matches any element of the list. The code below finds all the <a> and <b> tags in the document:
soup.find_all(["a", "b"])
Out[15]:
[<b>The Dormouse's story</b>,
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
4 Passing True
True matches every tag: the loop below visits every tag in the document, though string nodes are still excluded:
for tag in soup.find_all(True):
    print(tag.name)
html
head
title
body
p
b
p
a
a
a
p
5 Passing a function
If no other filter fits, you can define a function that takes a single tag argument and returns True when the tag matches, False otherwise. The function below returns True for tags that have a class attribute but no id attribute:
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)
Out[20]:
[<p class="title"><b>The Dormouse's story</b></p>,
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>,
<p class="story">...</p>]
II. The text parameter
The text parameter searches the document's string content. Like name, it accepts a string, a regular expression, a list, or True.
soup.find_all('a')
Out[22]:
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all(text="Elsie")
Out[23]: []
(empty because in this document "Elsie" sits inside a comment, so there is no plain string node matching it)
soup.find_all(text=["Tillie", "Elsie", "Lacie"])
Out[24]: ['Lacie', 'Tillie']
import re
soup.find_all(text=re.compile("Dormouse"))
Out[25]: ["The Dormouse's story", "The Dormouse's story"]
III. The limit parameter
find_all() returns every match, which can be slow on a large document tree. If you don't need them all, the limit parameter caps the number of results, much like LIMIT in SQL: the search stops as soon as limit matches have been found. The document contains three matching <a> tags, but only two are returned because of the limit:
soup.find_all("a", limit=2)
Out[27]:
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
IV. The recursive parameter
By default find_all() searches all of a tag's descendants. With recursive=False it only examines direct children; <title> is a grandchild of <html> (it sits inside <head>), so the second call finds nothing:
soup.html.find_all("title")
Out[29]: [<title>The Dormouse's story</title>]
soup.html.find_all("title", recursive=False)
Out[30]: []
V. Keyword arguments
Any keyword argument that find_all() does not recognize is used as a filter on the tag attribute of the same name:
soup.find_all(id='link2')
Out[31]: [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.find_all(href=re.compile("elsie"))
Out[32]: [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
soup.find_all(href=re.compile("elsie"), id='link1')
Out[33]: [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
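One caveat worth knowing: class is a reserved word in Python, so it cannot be passed as a keyword argument directly; Beautiful Soup accepts class_ instead. A short sketch with a made-up two-link document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="sister">Elsie</a><a>plain</a>', "html.parser")
# class_ (with a trailing underscore) filters on the CSS class attribute.
print(soup.find_all("a", class_="sister"))
```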
4.2 find(name, attrs, recursive, text, **kwargs)
find() takes the same arguments as find_all() but returns only the first matching element (or None when nothing matches) instead of a list:
soup.find('a')
Out[37]: <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
soup.find_all('a')
Out[38]:
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
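A sketch of the practical difference when nothing matches, using a made-up one-paragraph document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>only a paragraph</p>", "html.parser")
# find() returns None on no match, so guard before accessing attributes;
# find_all() returns an empty list, which is safe to iterate over.
print(soup.find("a"))
print(soup.find_all("a"))
```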