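I. Parsers
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires a C library to be installed |
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires a C library to be installed |
html5lib | BeautifulSoup(markup, "html5lib") | Most tolerant; parses pages the way a browser does; produces valid HTML5 | Very slow (pure Python, with no external C dependency) |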
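II. Parsing methods
Recommendations:
1. Use the lxml parser
2. Use find_all()
(I) Constructing the soup
soup = BeautifulSoup(markup, parser) parses the markup (for example, a string of HTML code), completing it where it is incomplete, and returns a BeautifulSoup object that exposes the document's tags as members along with a set of methods.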
soup = BeautifulSoup(html, "lxml")
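As a small illustration of the completion behavior described above, the sketch below feeds deliberately unclosed markup to BeautifulSoup. The `broken` string is made up for the example, and `html.parser` (the standard-library parser) is assumed only so the snippet runs without lxml installed; lxml would additionally wrap the result in html and body tags.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: neither <p> nor <b> is closed.
broken = "<p>unclosed paragraph <b>bold text"

# Assumption: html.parser is used here so that only bs4 is required.
soup = BeautifulSoup(broken, "html.parser")

# Serializing the soup shows the closing tags that were added.
repaired = str(soup)
print(repaired)
# <p>unclosed paragraph <b>bold text</b></p>
```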
(II) Object members
Markup to parse:
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>'''
1. Tag object: soup.tag
print(soup.p)  # soup.tag returns the first matching Tag object
# <p class="title"><b>The Dormouse's story</b></p>
2. Tag name
print(soup.p.name)  # soup.tag.name is the tag's name, a str
# p
3. Tag attributes
(1) Dictionary of all the tag's attributes
print(soup.a.attrs)
# {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
for key, val in soup.a.attrs.items():
    print(key, val)
# href http://example.com/elsie
# class ['sister']
# id link1
(2) Access an attribute's value by its name. Note that a multi-valued attribute such as class returns a list of values.
print(soup.a["class"])
# ['sister']
print(soup.a.get("class"))
# ['sister']
soup.a["class"].append("yan")
print(soup.a["class"])
# ['sister', 'yan']
(3) Modifying attributes
soup.a["extra_attr"] = "extra"
print(soup.a.attrs)
# {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1', 'extra_attr': 'extra'}
del soup.a["extra_attr"]
print(soup.a.attrs)
# {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
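The difference between single-valued and multi-valued attributes, and between indexing and .get(), can be sketched as follows. The markup and variable names are invented for the illustration, and `html.parser` is assumed so the snippet needs only bs4.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/x" class="sister big" id="link1">Elsie</a>', "html.parser")
a = soup.a

classes = a["class"]              # multi-valued attribute: a list of values
tag_id = a["id"]                  # single-valued attribute: a plain string
missing = a.get("title")          # .get() returns None for a missing attribute
fallback = a.get("title", "n/a")  # ... or a default that you supply

raised = False
try:
    a["title"]                    # indexing a missing attribute raises KeyError
except KeyError:
    raised = True

print(classes, tag_id, missing, fallback, raised)
# ['sister', 'big'] link1 None n/a True
```

Using .get() is the safer choice when an attribute may be absent on some tags.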
4. Tag text
(1) Access the text inside a tag: soup.tag.string
soup = BeautifulSoup('''<p class="title"><b>The Dormouse's story</b></p>''', 'lxml')
print(soup.p.string)
# The Dormouse's story
(2) Replace the text inside a tag: soup.tag.string.replace_with(str). Note that the text cannot be edited in place, only replaced.
soup.p.string.replace_with("hahahahahah")
print(soup.p.string)
# hahahahahah
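One caveat worth a short sketch: .string only yields a value when the tag contains a single piece of text (directly, or through a single child tag). When a tag has several children it is None, and get_text() is the usual alternative. The markup below is made up for the example, and `html.parser` is an assumption to keep the snippet dependency-free.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b>!</p>", "html.parser")
p = soup.p

whole = p.string          # None: <p> has three children ("Hello ", <b>, "!")
inner = p.b.string        # "world": <b> has exactly one string child
all_text = p.get_text()   # concatenates every string under <p>

print(whole, inner, all_text)
# None world Hello world!
```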
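(III) Navigating the tree
1. Going down: direct children
(1) soup.tag.contents returns a list of the tag's direct children (tags and text nodes)
soup = BeautifulSoup(html, "lxml")
for i, tag in enumerate(soup.body.contents, start=1):
    print(i, ": ", tag)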
# 1 :
#
# 2 : <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# 3 :
#
# 4 : <p class="story">...</p>
(2) soup.tag.children returns an iterator over the tag's direct children; it cannot be indexed directly
for i, tag in enumerate(soup.body.children):
    print(i, ": ", tag)
# 0 :
#
# 1 : <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# 2 :
#
# 3 : <p class="story">...</p>
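As the output above shows, the direct children include the newline strings that sit between tags. A minimal sketch of keeping only the Tag objects follows; the markup is invented for the example and `html.parser` is assumed so that only bs4 is needed.

```python
from bs4 import BeautifulSoup, Tag

html_doc = "<body>\n<p>one</p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html_doc, "html.parser")

# .contents is a list, so it supports len() and indexing;
# it includes the newline strings between the tags.
all_children = soup.body.contents

# .children is an iterator; filter it to keep only real tags.
tag_children = [child for child in soup.body.children if isinstance(child, Tag)]
texts = [tag.get_text() for tag in tag_children]

print(texts)
# ['one', 'two']
```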
2. soup.tag.descendants goes down through all of the tag's descendants: its direct children and everything nested below them
for i, tag in enumerate(soup.body.descendants, start=1):
    print(i, ": ", tag)
# 1 :
#
# 2 : <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# 3 : Once upon a time there were three little sisters; and their names were
#
# 4 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 5 : Elsie
# 6 : ,
#
# 7 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 8 : Lacie
# 9 : and
#
# 10 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# 11 : Tillie
# 12 : ;
# and they lived at the bottom of a well.
# 13 :
#
# 14 : <p class="story">...</p>
# 15 : ...
3. Accessing all the text inside a tag
(1) soup.tag.strings returns every text node inside the tag, as a generator
for i, text in enumerate(soup.p.strings, start=1):
    print(i, ": ", text)
# 1 : Once upon a time there were three little sisters; and their names were
#
# 2 : Elsie
# 3 : ,
#
# 4 : Lacie
# 5 : and
#
# 6 : Tillie
# 7 : ;
# and they lived at the bottom of a well.
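(2) soup.tag.stripped_strings returns the same text nodes with leading and trailing whitespace stripped; whitespace-only strings are skipped
for i, text in enumerate(soup.p.stripped_strings, start=1):
    print(i, ": ", text)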
# 1 : Once upon a time there were three little sisters; and their names were
# 2 : Elsie
# 3 : ,
# 4 : Lacie
# 5 : and
# 6 : Tillie
# 7 : ;
# and they lived at the bottom of a well.
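A common use of stripped_strings is to rebuild clean text from a tag. The sketch below joins the pieces by hand and compares the result with get_text(" ", strip=True); the markup is a shortened, made-up variant of the story paragraph, and `html.parser` is an assumption.

```python
from bs4 import BeautifulSoup

html_doc = "<p>  Once upon a time\n <a>Elsie</a>,\n <a>Lacie</a>  </p>"
soup = BeautifulSoup(html_doc, "html.parser")

# Each text node is stripped; whitespace-only nodes are dropped.
pieces = list(soup.p.stripped_strings)

joined = " ".join(pieces)
via_get_text = soup.p.get_text(" ", strip=True)

print(pieces)
# ['Once upon a time', 'Elsie', ',', 'Lacie']
print(joined == via_get_text)
# True
```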
4. soup.tag.parent returns the tag's direct parent; the top-level BeautifulSoup object has no parent, so its .parent is None
print(soup.p.parent)
# <body>
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p></body>
print(soup.parent)
# None
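5. soup.tag.parents returns an iterator over all of the tag's ancestors, from its direct parent up to the BeautifulSoup object; it cannot be indexed directly
soup = BeautifulSoup(html, "lxml")
for i, tag in enumerate(soup.p.parents):
    print(i, ": ", tag)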
# 0 : <body>
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p></body>
# 1 : <html><head><title>The Dormouse's story</title></head>
# <body>
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p></body></html>
# 2 : <html><head><title>The Dormouse's story</title></head>
# <body>
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p></body></html>
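6. soup.tag.previous_sibling and soup.tag.next_sibling access the nodes immediately before and after the tag under the same parent, and return None when there is no such node. Note: the value returned is often a whitespace string such as a newline, as it is here
print(repr(soup.p.previous_sibling))
# '\n'
print(repr(soup.p.next_sibling))
# '\n'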
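Because the raw sibling is so often a newline string, a search method is usually more convenient when the next tag is what you want. A minimal sketch, with invented markup and `html.parser` assumed:

```python
from bs4 import BeautifulSoup

html_doc = "<div>\n<p id='a'>A</p>\n<p id='b'>B</p>\n</div>"
soup = BeautifulSoup(html_doc, "html.parser")
first = soup.find("p", id="a")

raw_next = first.next_sibling             # the newline between the two <p> tags
tag_next = first.find_next_sibling("p")   # skips text nodes, returns the next <p>
after_last = tag_next.find_next_sibling("p")  # no further <p>, so None

print(repr(raw_next), tag_next["id"], after_last)
# '\n' b None
```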
7. soup.tag.previous_siblings and soup.tag.next_siblings return iterators over all the siblings before and after the tag
for i, tag in enumerate(soup.p.next_siblings):
    print(i, ": ", tag)
# 0 :
#
# 1 : <p class="story">...</p>
for i, tag in enumerate(soup.p.previous_siblings):
    print(i, ": ", tag)
# 0 :
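8. soup.tag.previous_elements and soup.tag.next_elements return iterators over everything parsed before and after the tag, in document order and regardless of nesting; next_elements starts with the tag's own descendants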
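To make the difference from sibling navigation concrete, the sketch below compares next_elements with next_siblings on a tiny made-up document. The `labels` helper list is only for display, and `html.parser` is an assumption.

```python
from bs4 import BeautifulSoup, Tag

html_doc = "<p><b>one</b> two</p><i>three</i>"
soup = BeautifulSoup(html_doc, "html.parser")
b = soup.b

# next_elements follows parse order: into <b>, then out of <p>, then on to <i>.
labels = [f"<{e.name}>" if isinstance(e, Tag) else str(e) for e in b.next_elements]

# next_siblings stays under the same parent (<p>).
siblings = [str(s) for s in b.next_siblings]

# The singular form steps one element backward in parse order.
before_i = soup.i.previous_element

print(labels)
# ['one', ' two', '<i>', 'three']
print(siblings)
# [' two']
```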
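(IV) Searching the tree
1. Searching downward
soup.tag.find(name, attrs, recursive, string, **kwargs) returns the first matching tag, or None if nothing matches
soup.tag.find_all(name, attrs, recursive, string, limit, **kwargs) returns a list of all matching tags
(1) name specifies the name of the target tag
Accepted values: a string, a compiled regular expression, a list of strings, True, or a function
(2) attrs specifies attributes of the target tag
Forms:
find_all(id="…") Note that some attribute names cannot be used as keyword arguments, such as data-* attributes and names that clash with find_all's own parameters (for example name and attrs); for class, use class_
find_all(attrs={"id": "…"})
Accepted values: a string, a regular expression, a function, or True
(3) string specifies the text content of the target
Accepted values: a string, a list of strings, a regular expression, a function, or True
(4) limit caps the number of results returned; it takes a non-negative integer
(5) recursive=False searches only the tag's direct children; the default, recursive=True, searches all descendants
(6) Shorthand: soup(name, attrs, recursive, string, limit, **kwargs) is equivalent to calling find_all()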
(7) Passing a function as a filter
def has_class_and_id(tag):
    return tag.has_attr("class") and tag.has_attr("id")
for tag in soup.find_all(has_class_and_id):  # no keyword given, so the function receives each Tag object
    print(tag)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
import re
def no_lacie(href):
    return href and not re.search("lacie", href)
for tag in soup.find_all(href=no_lacie):  # href keyword given, so the function receives the attribute value
    print(tag)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
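The remaining filter types listed in (1) to (6) can be sketched in one place. The markup, the `data-kind` attribute and the variable names are all invented for the illustration, and `html.parser` is assumed so the snippet needs only bs4.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<div id="box" data-kind="demo">
  <a class="sister" href="/elsie" id="link1">Elsie</a>
  <a class="sister" href="/lacie" id="link2">Lacie</a>
  <p class="story">text <b>bold</b></p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_list = [t.name for t in soup.find_all(["a", "b"])]          # list of names
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]   # names starting with b
by_class = soup.find_all("a", class_="sister")                 # class needs class_
by_data = soup.find_all(attrs={"data-kind": "demo"})           # data-* needs attrs={}
by_string = soup.find_all(string="Elsie")                      # matches text nodes
limited = soup.find_all("a", limit=1)                          # stop after one hit
direct = soup.div.find_all("b", recursive=False)               # direct children only
nested = soup.div.find_all("b")                                # all descendants
shorthand = soup("a")                                          # same as find_all("a")

print(by_list, by_regex, len(by_class), by_data[0]["id"])
# ['a', 'a', 'b'] ['b'] 2 box
print(by_string, len(limited), direct, len(nested), len(shorthand))
# ['Elsie'] 1 [] 1 2
```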
2. Searching upward
soup.tag.find_parents(name, attrs, string, limit, **kwargs)
soup.tag.find_parent(name, attrs, string, **kwargs)
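A minimal sketch of upward searching, using made-up nested markup (`html.parser` is an assumption): find_parent() returns the nearest matching ancestor, and find_parents() returns all of them, nearest first.

```python
from bs4 import BeautifulSoup

html_doc = "<div id='outer'><div id='inner'><p><a href='/x'>link</a></p></div></div>"
soup = BeautifulSoup(html_doc, "html.parser")
a = soup.a

nearest_div = a.find_parent("div")                    # closest enclosing <div>
all_divs = [d["id"] for d in a.find_parents("div")]   # every enclosing <div>
no_table = a.find_parent("table")                     # nothing matches, so None

print(nearest_div["id"], all_divs, no_table)
# inner ['inner', 'outer'] None
```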
3. Searching sideways (sibling tags)
soup.tag.find_previous_siblings()
soup.tag.find_previous_sibling()
soup.tag.find_next_siblings()
soup.tag.find_next_sibling()
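The sibling search methods might be used as in the sketch below. The list markup is invented and `html.parser` is assumed. Note that the results of find_previous_siblings() come back nearest first.

```python
from bs4 import BeautifulSoup

html_doc = "<ul><li id='a'>A</li>\n<li id='b'>B</li>\n<li id='c'>C</li></ul>"
soup = BeautifulSoup(html_doc, "html.parser")
b = soup.find("li", id="b")
c = soup.find("li", id="c")

prev_one = b.find_previous_sibling("li")["id"]
next_all = [li["id"] for li in b.find_next_siblings("li")]
prev_all = [li["id"] for li in c.find_previous_siblings("li")]

print(prev_one, next_all, prev_all)
# a ['c'] ['b', 'a']
```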
4. Searching forward and backward (all elements, in document order)
soup.tag.find_next()
soup.tag.find_all_next()
soup.tag.find_previous()
soup.tag.find_all_previous()
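Unlike the sibling methods, these four follow document order and therefore cross nesting levels. A small sketch with made-up markup (`html.parser` is an assumption):

```python
from bs4 import BeautifulSoup

html_doc = "<p id='first'><a id='l1'>one</a></p><p id='second'><a id='l2'>two</a></p>"
soup = BeautifulSoup(html_doc, "html.parser")
first = soup.find("p", id="first")
second = soup.find("p", id="second")

# Forward search reaches the tag's own descendants and everything after it.
next_a = first.find_next("a")["id"]
all_next_a = [a["id"] for a in first.find_all_next("a")]

# The <a> tags are not siblings of <p>, so a sibling search finds nothing.
sibling_a = first.find_next_siblings("a")

# Backward search walks toward the start of the document.
prev_a = second.find_previous("a")["id"]
prev_p = [p["id"] for p in second.find_all_previous("p")]

print(next_a, all_next_a, sibling_a, prev_a, prev_p)
# l1 ['l1', 'l2'] [] l1 ['first']
```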
(V) CSS selectors: select() (takes a selector string)
1. Select by tag name
print(soup.select("title"))
# [<title>The Dormouse's story</title>]
(1) soup.select("name1 name2") finds every name2 tag anywhere under a name1 tag (a space means nesting at any depth)
for i, tag in enumerate(soup.select("body a"), start=1):
    print(i, ": ", tag)
# 1 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 2 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 3 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
(2) soup.select("name1 > name2") finds every name2 tag that is a direct child of a name1 tag (> means direct nesting)
print(soup.select("body > a"))  # body has no a tags as direct children
# []
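Since the example above returns an empty list, a case where the two combinators give different non-empty results may help. The markup is invented so that one link is a direct child of the paragraph and the other is nested one level deeper; `html.parser` is an assumption.

```python
from bs4 import BeautifulSoup

html_doc = "<body><p class='story'><a id='l1'>x</a><span><a id='l2'>y</a></span></p></body>"
soup = BeautifulSoup(html_doc, "html.parser")

descendants = [a["id"] for a in soup.select("p a")]    # any depth under <p>
children = [a["id"] for a in soup.select("p > a")]     # direct children of <p> only
under_body = soup.select("body > a")                   # no <a> directly under <body>

print(descendants, children, under_body)
# ['l1', 'l2'] ['l1'] []
```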
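2. Select by attribute value
(1) By class value (a leading . denotes a class value)
soup.select(".val") finds every tag whose class includes val; selectors can be nested with spaces
for i, tag in enumerate(soup.select(".story .sister"), start=1):  # tags with class="sister" under a tag with class="story"
    print(i, ": ", tag)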
# 1 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 2 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 3 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
(2) By id value (a leading # denotes an id value)
soup.select("#val") finds every tag with id=val; selectors can be nested with spaces
print(soup.select("#link1"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
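3. Selecting several targets at once (a comma joins alternatives)
Note: selectors written together with no space between them, such as a.sister, require a tag to satisfy both conditions at once
for i, tag in enumerate(soup.select("title,.sister"), start=1):  # every tag named title, plus every tag with class="sister"
    print(i, ": ", tag)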
# 1 : <title>The Dormouse's story</title>
# 2 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 3 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 4 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
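The three ways of combining selectors (no separator, a space, a comma) are easy to mix up, so here is a hedged side-by-side sketch. The markup, including the `brother` class, is made up for the example, and `html.parser` is assumed.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p class="story"><a class="sister" id="link1">Elsie</a>'
    '<a class="brother" id="link2">Bob</a></p>'
    '<a class="sister" id="link3">Out</a>'
)
soup = BeautifulSoup(html_doc, "html.parser")

# No separator: one tag must satisfy every condition.
both = [a["id"] for a in soup.select("a.sister")]
exact = [a["id"] for a in soup.select("a.sister#link3")]

# Space: the second selector must match inside the first.
nested = [a["id"] for a in soup.select(".story .sister")]

# Comma: either selector may match; results come back in document order.
either = [t.name for t in soup.select("p.story, a.brother")]

print(both, exact, nested, either)
# ['link1', 'link3'] ['link3'] ['link1'] ['p', 'a']
```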
4. Select by tag name and attribute value: soup.select('name[attr="…"]')
for i, tag in enumerate(soup.select('a[class="sister"]'), start=1):
    print(i, ": ", tag)
# 1 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 2 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 3 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
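Attribute selectors also support presence and partial matches, which are standard CSS rather than anything specific to this tutorial. The sketch below shows a few of them, plus select_one() for when only the first match is needed; the markup is a trimmed, partly invented variant of the sister links and `html.parser` is an assumption.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    '<a href="/local">Local</a>'
)
soup = BeautifulSoup(html_doc, "html.parser")

has_id = [a.get_text() for a in soup.select("a[id]")]                   # attribute present
starts = [a.get_text() for a in soup.select('a[href^="http://"]')]      # value starts with
ends = [a.get_text() for a in soup.select('a[href$="lacie"]')]          # value ends with
contains = [a.get_text() for a in soup.select('a[href*="example.com"]')]  # value contains

first_sister = soup.select_one("a.sister")   # first match only
no_match = soup.select_one("a.missing")      # None when nothing matches

print(has_id, starts, ends, contains)
# ['Elsie', 'Lacie'] ['Elsie', 'Lacie'] ['Lacie'] ['Elsie', 'Lacie']
print(first_sister["id"], no_match)
# link1 None
```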