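Using regular expressions to extract node information from a web page is still inconvenient. As we have seen, the regular expression needed to match even moderately complex page content tends to be long and hard to read, and a single mistake anywhere in it makes the match fail.
Using XPath
XPath is a path language for locating information in XML documents. It was originally designed for searching XML documents, but it works equally well for HTML documents.
XPath's selection capabilities are very powerful: it offers concise path expressions, along with more than 100 built-in functions for matching strings, numbers, and times and for processing nodes and sequences. Almost any node we want to locate can be selected with XPath.
Common XPath rules
The table below lists the most common XPath rules: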
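Expression | Description |
---|---|
nodename | Selects the child nodes of the current node that have this name |
/ | Selects direct children of the current node |
// | Selects descendants of the current node |
. | Selects the current node |
.. | Selects the parent of the current node |
@ | Selects attributes |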
An introductory example
from lxml import etree
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
# etree.HTML() parses the text string into an HTML document that XPath can query
html = etree.HTML(text)
# tostring() returns bytes
result = etree.tostring(html)
# decode() converts the bytes to str
print(result.decode('utf-8'))
# /----------******** Output: *********--------\
# The text string above is incomplete and is not well-formed HTML (for example, the last li is never closed), but the output shows that etree.HTML() repairs it automatically
# <html><body><div>
# <ul>
# <li class="item-0"><a href="link1.html">first item</a></li>
# <li class="item-1"><a href="link2.html">second item</a></li>
# <li class="item-inactive"><a href="link3.html">third item</a></li>
# <li class="item-1"><a href="link4.html">fourth item</a></li>
# <li class="item-0"><a href="link5.html">fifth item</a>
# </li></ul>
# </div>
# </body></html>
Getting all nodes
Save the text string from the code above to a file named test.html in the current directory so that the HTML can be read from the file.
from lxml import etree
# Parse test.html with the HTMLParser parser
html = etree.parse('./test.html', etree.HTMLParser())
# '//*' selects every node in the HTML document; xpath() returns a list
result = html.xpath('//*')
print(result)
# /----------******** Output: *********--------\
# No attribute or text nodes were requested, so every item in the list is an Element
# [<Element html at 0x17148a0>, <Element body at 0x1714878>, <Element div at 0x1714850>, <Element ul at 0x1714418>, <Element li at 0x1714148>, <Element a at 0x1714198>, <Element li at 0x1714170>, <Element a at 0x1710ee0>, <Element li at 0x1710eb8>, <Element a at 0x1710d28>, <Element li at 0x1710d00>, <Element a at 0x17109e0>, <Element li at 0x17109b8>, <Element a at 0x1710990>]
Child nodes
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
# Select every a node that is a direct child of an li node
result = html.xpath('//li/a')
# This expression matches the same nodes here: it selects every a node anywhere under a ul node (not only direct children)
result2 = html.xpath('//ul//a')
print(result)
print(result2)
# /----------******** Output: *********--------\
# [<Element a at 0x784878>, <Element a at 0x784850>, <Element a at 0x784418>, <Element a at 0x784148>, <Element a at 0x784198>]
# [<Element a at 0x784878>, <Element a at 0x784850>, <Element a at 0x784418>, <Element a at 0x784148>, <Element a at 0x784198>]
Parent nodes
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
# Select the a node whose href is link4.html, step up to its parent with .., and read the parent's class attribute
result = html.xpath('//a[@href="link4.html"]/../@class')
# The parent:: axis is an equivalent way to reach the parent node
result2 = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result)
print(result2)
# /----------******** Output: *********--------\
# ['item-1']
# ['item-1']
Attribute matching
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
# Match li nodes whose class attribute equals item-0
result = html.xpath('//li[@class="item-0"]')
print(result)
# /----------******** Output: *********--------\
# [<Element li at 0x21c4850>, <Element li at 0x21c4418>]
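Getting text
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
# Get the text directly inside the li nodes
result = html.xpath('//li[@class="item-0"]/text()')
print(result)
# /----------******** Output: *********--------\
# ['\n']
The result contains nothing but a newline. Why?
After HTMLParser repairs the markup, the XPath expression actually matches these two nodes:
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</li>
The only text directly inside these two li nodes is the newline left by the repair (the closing </li> of the last node was added on a new line); all the meaningful text sits inside the child a nodes.
To get all the text inside li, change the / before text() to //, which collects the text of the node itself and of all its descendants: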
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]//text()')
print(result)
# /----------******** Output: *********--------\
# ['first item', 'fifth item', '\n']
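When the text of each li is wanted as one string rather than as a flat list of text nodes, XPath's string() function is one option. The sketch below is an illustration rather than part of the original example: the sample markup and the variable name texts are made up for the demo, and string(.) is evaluated once per matched li node.

```python
from lxml import etree

# Hypothetical sample markup; the second link has a nested <b> tag
text = '''
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-0"><a href="link5.html">fifth <b>item</b></a></li>
</ul>
'''
html = etree.HTML(text)

# string(.) concatenates every descendant text node of the context node,
# so each li yields exactly one string; strip() drops surrounding whitespace
texts = [li.xpath('string(.)').strip()
         for li in html.xpath('//li[@class="item-0"]')]
print(texts)  # ['first item', 'fifth item']
```

Compared with //text(), this keeps the text grouped per node, which tends to be easier to work with when each li corresponds to one record.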
Getting attributes
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
# Get the href attribute of every a node that is a direct child of an li node
result = html.xpath('//li/a/@href')
print(result)
# /----------******** Output: *********--------\
# ['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
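Matching attributes with multiple values
from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[@class="li"]/a/text()')
print(result)
# /----------******** Output: *********--------\
# []
The single-value comparison used so far no longer works when an attribute holds several values: @class="li" is compared against the whole string "li li-first", so nothing matches.
Instead, use the contains() function, which takes the attribute as its first argument and the value to look for as its second: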
from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
# contains() selects the <li> nodes whose class attribute value contains 'li'
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)
# /----------******** Output: *********--------\
# ['first item']
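One caveat: contains() is a plain substring test, so contains(@class, "li") would also match a class such as "list-item". A common XPath 1.0 idiom for matching a whole class token pads the attribute with spaces first. The sketch below is an illustration with made-up markup; the variable names loose and exact are hypothetical.

```python
from lxml import etree

# Hypothetical markup: only the first li really has the class token "li"
text = '''
<ul>
<li class="li li-first"><a href="a.html">first item</a></li>
<li class="list-item"><a href="b.html">second item</a></li>
</ul>
'''
html = etree.HTML(text)

# Substring test: "list-item" also contains "li", so both nodes match
loose = html.xpath('//li[contains(@class, "li")]/a/text()')

# Token test: surround the normalized class value with spaces and look for " li "
exact = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(loose)  # ['first item', 'second item']
print(exact)  # ['first item']
```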
Selecting nodes by multiple attributes
from lxml import etree
text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
# Join multiple attribute conditions with and
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)
# /----------******** Output: *********--------\
# ['first item']
The and used above is one of XPath's operators. Others include or, mod, | (node-set union), +, -, *, div, =, !=, <, <=, > and >=.
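A few of the other XPath operators (or, mod, and the union operator |) can be sketched as follows. This is an illustrative example, not part of the original text: it reuses the five-item list from earlier in a well-formed version, and the variable names either, odd, and union are made up for the demo.

```python
from lxml import etree

text = '''
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
'''
html = etree.HTML(text)

# or: either condition may hold
either = html.xpath('//li[@class="item-0" or @class="item-inactive"]/a/text()')

# mod: keep the li nodes at odd positions (1, 3, 5)
odd = html.xpath('//li[position() mod 2 = 1]/a/text()')

# |: union of two node sets, returned in document order
union = html.xpath('//li[1]/a/@href | //li[last()]/a/@href')

print(either)  # ['first item', 'third item', 'fifth item']
print(odd)     # ['first item', 'third item', 'fifth item']
print(union)   # ['link1.html', 'link5.html']
```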
Selecting matching nodes by position
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[1]/a/text()')
print(result)
result = html.xpath('//li[last()]/a/text()')
print(result)
result = html.xpath('//li[position()<3]/a/text()')
print(result)
result = html.xpath('//li[last()-2]/a/text()')
print(result)
# /----------******** Output: *********--------\
# ['first item']
# ['fifth item']
# ['first item', 'second item']
# ['third item']
For other position-related XPath functions, see the w3cschool XPath tutorial.
Selecting nodes along axes
from lxml import etree
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html"><span>first item</span></a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
html = etree.HTML(text)
# The ancestor axis selects all ancestor nodes; * matches any node name, so result lists every ancestor
result = html.xpath('//li[1]/ancestor::*')
print(result)
# Same as above, but restricted to ancestors named div
result = html.xpath('//li[1]/ancestor::div')
print(result)
# The attribute axis selects attributes; with * it returns all attribute values of the node
result = html.xpath('//li[1]/attribute::*')
print(result)
# The child axis selects direct child nodes
result = html.xpath('//li[1]/child::a[@href="link1.html"]')
print(result)
# The descendant axis selects all descendant nodes
result = html.xpath('//li[1]/descendant::span')
print(result)
# The following axis selects every node after the current node in document order (not only siblings); [2] keeps only the second one
result = html.xpath('//li[1]/following::*[2]')
print(result)
# The following-sibling axis selects all later siblings of the current node
result = html.xpath('//li[1]/following-sibling::*')
print(result)
# /----------******** Output: *********--------\
# [<Element html at 0x7968c8>, <Element body at 0x796878>, <Element div at 0x796850>, <Element ul at 0x796828>]
# [<Element div at 0x796850>]
# ['item-0']
# [<Element a at 0x796878>]
# [<Element span at 0x796828>]
# [<Element a at 0x796850>]
# [<Element li at 0x796828>, <Element li at 0x796878>, <Element li at 0x7968a0>, <Element li at 0x7963f0>]
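Using Beautiful Soup
Beautiful Soup relies on a parser to do the actual parsing. Besides the HTML parser in the Python standard library, it supports several third-party parsers (such as lxml).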
Among the supported parsers, lxml can parse both HTML and XML, is fast, and is tolerant of malformed markup, so it is the preferred choice.
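Because lxml is a third-party package, it may not be installed everywhere. One possible way to prefer lxml while still running without it is sketched below. The helper name make_soup is hypothetical; FeatureNotFound is the exception Beautiful Soup raises when the requested parser is unavailable, and 'html.parser' is the parser from the Python standard library.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when available, else with the built-in parser."""
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # lxml is not installed; fall back to the standard-library parser
        return BeautifulSoup(markup, 'html.parser')


# The <p> tag is deliberately left unclosed; both parsers tolerate this
soup = make_soup('<p class="title"><b>Hello</b>')
print(soup.b.string)  # Hello
```

Note that different parsers can build slightly different trees from badly broken markup, so results on malformed pages may vary with the parser in use.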
Basic usage
from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dromouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Eleie--></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http:///example.com/title" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
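# [Initializing the BeautifulSoup object] Pass the html string to BeautifulSoup(); like lxml in the XPath examples, it automatically repairs malformed HTML
soup = BeautifulSoup(html, 'lxml')
# prettify() returns the parsed document as a string with standard indentation
print(soup.prettify())
print(soup.title.string)
# /----------******** Output: *********--------\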
# <html>
# <head>
# <title>
# The Dormouse's story
# </title>
# </head>
# <body>
# <p class="title" name="dromouse">
# <b>
# The Dromouse's story
# </b>
# </p>
# <p class="story">
# Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">
# <!--Eleie-->
# </a>
# <a class="sister" href="http://example.com/lacie" id="link2">
# Lacie
# </a>
# and
# <a class="sister" href="http:///example.com/title" id="link3">
# Tillie
# </a>
# ;
# and they lived at the bottom of a well.
# </p>
# <p class="story">
# ...
# </p>
# </body>
# </html>
# The Dormouse's story
Node selectors
Selecting elements
Use BeautifulSoup to select the nodes you need from the HTML.
from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dromouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Eleie--></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http:///example.com/title" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
# Selecting a node this way returns a bs4.element.Tag object
print(type(soup.title))
print(soup.title.string) # the text inside the title node
print(soup.head)
# When several nodes match, only the first p node is returned
print(soup.p)
# /----------******** Output: *********--------\
# <title>The Dormouse's story</title>
# <class 'bs4.element.Tag'>
# The Dormouse's story
# <head><title>The Dormouse's story</title></head>
# <p class="title" name="dromouse"><b>The Dromouse's story</b></p>
Extracting information
(1). Getting the node name
A node's name attribute holds the name of the node.
print(soup.title.name)
# /----------******** Output: *********--------\
# title
(2). Getting attribute values
A node's attrs attribute returns its attributes as a dictionary.
print(soup.p.attrs)
print(soup.p.attrs['name'])
# /----------******** Output: *********--------\
# {'name': 'dromouse', 'class': ['title']}
# dromouse
There is also a shorter form: index the node with square brackets [ ] to get an attribute value directly.
print(soup.p['name'])
print(soup.p['class'])
# /----------******** Output: *********--------\
# dromouse
# ['title']
Because the class attribute can hold several values, its value is returned as a list.
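Square-bracket access raises KeyError when the attribute is absent. The get() method of a Tag behaves like dict.get() and is a safer choice when an attribute may be missing. The snippet below is a small sketch with made-up markup; it uses the built-in 'html.parser' only so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document for illustration
soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', 'html.parser')
p = soup.p

classes = p.get('class')          # multi-valued attribute, returned as a list
missing = p.get('id')             # attribute is absent, so None is returned
fallback = p.get('id', 'no-id')   # a default value can be supplied

print(classes)   # ['title', 'big']
print(missing)   # None
print(fallback)  # no-id
```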
(3). Getting the node's content
A node's string attribute holds the text content of the node.
print(soup.p.string)
# /----------******** Output: *********--------\
# The Dromouse's story
Nested selection
Nodes can be selected level by level. This nested style mirrors the way nodes are nested in HTML, which makes it intuitive and convenient.
from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dromouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Eleie--></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http:///example.com/title" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
# Select the title node inside the head node (nested selection)
print(soup.head.title)
# Like soup.head, soup.head.title is still a bs4.element.Tag, so it can be nested further
print(type(soup.head.title))
print(soup.head.title.string)
# /----------******** Output: *********--------\
# <title>The Dormouse's story</title>
# <class 'bs4.element.Tag'>
# The Dormouse's story
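Relational selection
Sometimes the node you want cannot be selected in one step. You first select some node and then, using it as a reference point, select its children, parent, siblings, and so on. This is relational selection.
(1). Children and descendants
The contents attribute returns a node's direct children.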
from bs4 import BeautifulSoup
html = '''
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Eleie--></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http:///example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)
# /----------******** Output: *********--------\
# ['Once upon a time there were three little sisters; and their names were\n', <a class="sister" href="http://example.com/elsie" id="link1"><!--Eleie--></a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' and\n', <a class="sister" href="http:///example.com/tillie" id="link3">Tillie</a>, ';\nand they lived at the bottom of a well.\n']
contents returns all direct children as a single list. The children attribute gives the same children as an iterator, so they can be processed one at a time:
from bs4 import BeautifulSoup
html = '''
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Eleie--></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http:///example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
# children is an iterator, so it is consumed in a loop
for i, child in enumerate(soup.p.children):
    print(i, child)
# /----------******** Output: *********--------\
# 0 Once upon a time there were three little sisters; and their names were
#
# 1 <a class="sister" href="http://example.com/elsie" id="link1"><!--Eleie--></a>
# 2
#
# 3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 4 and
#
# 5 <a class="sister" href="http:///example.com/tillie" id="link3">Tillie</a>
# 6 ;
# and they lived at the bottom of a well.
The examples so far return only direct children. The descendants attribute walks the tree recursively and yields all descendant nodes.
from bs4 import BeautifulSoup
html = '''
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Eleie--></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http:///example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
# descendants returns a generator
for i, child in enumerate(soup.p.descendants):
    print(i, child)
# /----------******** Output: *********--------\
# 0 Once upon a time there were three little sisters; and their names were
#
# 1 <a class="sister" href="http://example.com/elsie" id="link1"><!--Eleie--></a>
# 2 Eleie
# 3
#
# 4 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 5 Lacie
# 6 and
#
# 7 <a class="sister" href="http:///example.com/tillie" id="link3">Tillie</a>
# 8 Tillie
# 9 ;
# and they lived at the bottom of a well.
(2). Parent and ancestor nodes
To get the parent of a node, use the parent attribute.
from bs4 import BeautifulSoup
html = '''
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
# Get the parent of the a node, which is the p node
print(soup.a.parent)
# /----------******** Output: *********--------\
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">
# <span>Elsie</span>
# </a>
# </p>
parent returns only the direct parent. To get all ancestors, use the parents attribute.
from bs4 import BeautifulSoup
html = '''
<html>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
</p>
'''
soup = BeautifulSoup(html, 'lxml')
# soup.a.parents is a generator; list(enumerate(...)) turns it into a list of (index, node) pairs
print(type(soup.a.parents))
print(list(enumerate(soup.a.parents)))
# /----------******** Output: *********--------\
# <class 'generator'>
# [(0, <p class="story">
# <a class="sister" href="http://example.com/elsie" id="link1">
# <span>Elsie</span>
# </a>
# </p>), (1, <body>
# <p class="story">
# <a class="sister" href="http://example.com/elsie" id="link1">
# <span>Elsie</span>
# </a>
# </p>
# </body>), (2, <html>
# <body>
# <p class="story">
# <a class="sister" href="http://example.com/elsie" id="link1">
# <span>Elsie</span>
# </a>
# </p>
# </body></html>), (3, <html>
# <body>
# <p class="story">
# <a class="sister" href="http://example.com/elsie" id="link1">
# <span>Elsie</span>
# </a>
# </p>
# </body></html>)]
(3). Sibling nodes
from bs4 import BeautifulSoup
html = '''
<html>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
Hello
<a href="http://example.com/lacie" class="sister" id="link2">lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
'''
soup = BeautifulSoup(html, 'lxml')
# next_sibling and previous_sibling return the next and the previous sibling of the node
print('Next Sibling', soup.a.next_sibling)
print('Prev Sibling', soup.a.previous_sibling)
# next_siblings and previous_siblings return generators over all following and all preceding siblings, respectively
print('Next Siblings', list(enumerate(soup.a.next_siblings)))
print('Prev Siblings', list(enumerate(soup.a.previous_siblings)))
# /----------******** Output: *********--------\
# Next Sibling
# Hello
#
# Prev Sibling Once upon a time there were three little sisters; and their names were
#
# Next Siblings [(0, '\n Hello\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">lacie</a>), (2, '\n and\n'), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n and they lived at the bottom of a well.\n')]
# Prev Siblings [(0, 'Once upon a time there were three little sisters; and their names were\n')]
(4). Extracting node information
from bs4 import BeautifulSoup
html = '''
<html>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Bob</a><a href="http://example.com/lacie" class="sister" id="link2">lacie</a>
</p>
'''
soup = BeautifulSoup(html, 'lxml')
# The two a tags are written with no whitespace between them, so next_sibling is the second a tag rather than a text node
print('Next Sibling:')
print(type(soup.a.next_sibling))
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print("Parent:")
print(type(soup.a.parents))
# list() converts the generator into a list
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])
# /----------******** Output: *********--------\
# Next Sibling:
# <class 'bs4.element.Tag'>
# <a class="sister" href="http://example.com/lacie" id="link2">lacie</a>
# lacie
# Parent:
# <class 'generator'>
# <p class="story">
# Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Bob</a><a class="sister" href="http://example.com/lacie" id="link2">lacie</a>
# </p>
# ['story']
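If the result is a single node, you can read its text and attributes directly through string, attrs, and so on. If the result is a generator over several nodes, convert it to a list, pick an element, and then read string, attrs, and so on from that node.
Method selectors
The selection techniques described so far all work through attributes. They are fast, but they become tedious and inflexible for more complex selections. Beautiful Soup also provides query methods such as find_all() and find(); calling them with the right arguments makes queries flexible.
find_all()
find_all() is declared as follows:
find_all(name, attrs, recursive, text, **kwargs)
(1). name
Pass a node name to find_all() to query by node name.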
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-samll" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# Unlike soup.ul, find_all() returns every matching node, not just the first
print(soup.find_all(name='ul'))
# Every element of the list returned by find_all() is a bs4.element.Tag
print(type(soup.find_all(name='ul')[0]))
# /----------******** Output: *********--------\
# [<ul class="list" id="list-1">
# <li class="element">Foo</li>
# <li class="element">Bar</li>
# <li class="element">Jay</li>
# </ul>, <ul class="list list-samll" id="list-2">
# <li class="element">Foo</li>
# <li class="element">Bar</li>
# </ul>]
# <class 'bs4.element.Tag'>
Because every element in the list returned by find_all() is a bs4.element.Tag, the results can be queried again in a nested fashion:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
# /----------******** Output: *********--------\
# [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
# [<li class="element">Foo</li>, <li class="element">Bar</li>]
(2). attrs
Besides node names, queries can also match on attribute values.
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-samll" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# The attrs argument of find_all() is a dictionary; find_all() always returns a list
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))
# /----------******** Output: *********--------\
# [<ul class="list" id="list-1" name="elements">
# <li class="element">Foo</li>
# <li class="element">Bar</li>
# <li class="element">Jay</li>
# </ul>]
# [<ul class="list" id="list-1" name="elements">
# <li class="element">Foo</li>
# <li class="element">Bar</li>
# <li class="element">Jay</li>
# </ul>]
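For common attributes such as class and id, there is no need to wrap them in attrs; they can be passed to find_all() directly as keyword arguments, in the same way as name. For example: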
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-samll" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
# Note: the keyword is class_ rather than class, because class is a reserved word in Python
print(soup.find_all(class_='element'))
# /----------******** Output: *********--------\
# [<ul class="list" id="list-1" name="elements">
# <li class="element">Foo</li>
# <li class="element">Bar</li>
# <li class="element">Jay</li>
# </ul>]
# [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
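(3). text
The text parameter matches text nodes in the HTML. It accepts either a string or a compiled regular expression object. (Newer Beautiful Soup versions name this parameter string and keep text as the older alias.)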
import re
html = '''
<div class="panel">
<div class="panel-body">
<a>Hello, this is a link</a>
<a>Hello, this is a link, too</a>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# find_all(text=...) returns a list of all matching text nodes
print(soup.find_all(text=re.compile('link')))
# /----------******** Output: *********--------\
# ['Hello, this is a link', 'Hello, this is a link, too']
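The recursive parameter from the signature is not demonstrated above. A minimal sketch follows, using made-up markup and the built-in 'html.parser' so that it runs without lxml. It also uses limit, a find_all() parameter that does not appear in the signature quoted earlier and caps the number of results.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two p tags are direct children of #outer, one is nested deeper
html = '<div id="outer"><p>one</p><div><p>two</p></div><p>three</p></div>'
soup = BeautifulSoup(html, 'html.parser')
outer = soup.find(id='outer')

# Default: search all descendants
all_p = [p.string for p in outer.find_all('p')]

# recursive=False: look only at direct children
direct_p = [p.string for p in outer.find_all('p', recursive=False)]

# limit=2: stop after the first two matches
first_two = [p.string for p in outer.find_all('p', limit=2)]

print(all_p)      # ['one', 'two', 'three']
print(direct_p)   # ['one', 'three']
print(first_two)  # ['one', 'two']
```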
find()
find() differs from find_all() in that find() returns only the first matching element, whereas find_all() returns a list of all matching elements.
import re
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class"element">Bar</li>
<li class="element">Jay</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# As the output shows, find() returns a single element rather than a list, and that element is a bs4.element.Tag
print(soup.find(name='ul'))
print(type(soup.find(name='ul')))
print(soup.find(class_='list'))
# /----------******** Output: *********--------\
# <ul class="list" id="list-1">
# <li class="element">Foo</li>
# <li class="">Bar</li>
# <li class="element">Jay</li>
# </ul>
# <class 'bs4.element.Tag'>
# <ul class="list" id="list-1">
# <li class="element">Foo</li>
# <li class="">Bar</li>
# <li class="element">Jay</li>
# </ul>
find() takes the same arguments as find_all() and is used in the same way.
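One practical difference is what happens when nothing matches: find() returns None, while find_all() returns an empty list. The sketch below, with made-up markup and the built-in 'html.parser', shows a guard that avoids an AttributeError on the None result; the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li class="element">Foo</li></ul>', 'html.parser')

missing_one = soup.find('table')       # no <table> in the document
missing_all = soup.find_all('table')

print(missing_one)  # None
print(missing_all)  # []

# Guard before touching attributes of the result
table_text = missing_one.get_text() if missing_one is not None else ''
print(repr(table_text))  # ''
```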
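Other query methods
Besides find_all() and find(), bs4 provides many other methods for querying nodes. They are used in much the same way as find_all() and find(); only the search scope differs. A brief overview:
- find_parents() and find_parent(): the former returns all ancestor nodes, the latter returns the direct parent.
- find_next_siblings() and find_next_sibling(): the former returns all following siblings, the latter returns the first following sibling.
- find_previous_siblings() and find_previous_sibling(): the former returns all preceding siblings, the latter returns the first preceding sibling.
- find_all_next() and find_next(): the former returns all matching nodes after the current node, the latter returns the first such node.
- find_all_previous() and find_previous(): the former returns all matching nodes before the current node, the latter returns the first such node.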
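A few of these methods can be tried on a small list as follows. This is an illustrative sketch with made-up markup; it uses the built-in 'html.parser' so that it runs without lxml, and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

html = ('<ul id="list-1">'
        '<li class="element">Foo</li>'
        '<li class="element">Bar</li>'
        '<li class="element">Jay</li>'
        '</ul>')
soup = BeautifulSoup(html, 'html.parser')

# Use the middle li as the reference node
bar = soup.find_all('li')[1]

parent_id = bar.find_parent('ul')['id']
next_text = bar.find_next_sibling('li').string
prev_text = bar.find_previous_sibling('li').string
before = [li.string for li in bar.find_all_previous('li')]
after = [li.string for li in bar.find_all_next('li')]

print(parent_id)  # list-1
print(next_text)  # Jay
print(prev_text)  # Foo
print(before)     # ['Foo']
print(after)      # ['Jay']
```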
CSS selectors
Beautiful Soup also offers another kind of selector: CSS selectors.
To use one, call the select() method and pass in the CSS selector.
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class"element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# The expressions passed to select() are standard CSS selectors
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))
# /----------******** Output: *********--------\
# [<div class="panel-heading">
# <h4>Hello</h4>
# </div>]
# [<li class="element">Foo</li>, <li class="">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
# [<li class="element">Foo</li>, <li class="element">Bar</li>]
# <class 'bs4.element.Tag'>
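select() always returns a list. When only the first match is needed, select_one() returns a single Tag, or None when nothing matches. The sketch below uses made-up markup and the built-in 'html.parser'; it also shows the child combinator >, which restricts the match to direct children.

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Baz</li>
</ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

first = soup.select_one('ul li')            # first matching Tag only
small = soup.select('ul.list-small > li')   # direct li children of ul.list-small
missing = soup.select_one('table')          # nothing matches

print(first.string)                   # Foo
print([li.string for li in small])    # ['Baz']
print(missing)                        # None
```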
Nested selection
The select() method also supports nested selection.
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class"element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# Each element returned by soup.select() is a bs4.element.Tag, so nested selection still works
for ul in soup.select('ul'):
    print(ul.select('li'))
# /----------******** Output: *********--------\
# [<li class="element">Foo</li>, <li class="">Bar</li>, <li class="element">Jay</li>]
# [<li class="element">Foo</li>, <li class="element">Bar</li>]
Getting attributes
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class"element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
# /----------******** Output: *********--------\
# list-1
# list-1
# list-2
# list-2
Getting text
To get the text content, use the string attribute introduced earlier or the get_text() method.
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class"element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# For these li nodes, which contain only text, get_text() and the string attribute give the same result
for li in soup.select('li'):
    print('Get Text:', li.get_text())
    print('String:', li.string)
# /----------******** Output: *********--------\
# Get Text: Foo
# String: Foo
# Get Text: Bar
# String: Bar
# Get Text: Jay
# String: Jay
# Get Text: Foo
# String: Foo
# Get Text: Bar
# String: Bar
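The two agree here only because each li holds a single piece of text. When a node contains nested tags alongside text, string is None while get_text() still joins all the text. A short sketch with made-up markup, using the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Hypothetical li that mixes plain text with a nested <b> tag
soup = BeautifulSoup('<li class="element">Foo <b>Bar</b></li>', 'html.parser')
li = soup.li

as_string = li.string                       # more than one child, so None
as_text = li.get_text()                     # all descendant text, concatenated
joined = li.get_text(' ', strip=True)       # strip each piece, join with a space

print(as_string)  # None
print(as_text)    # Foo Bar
print(joined)     # Foo Bar
```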
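Using pyquery
The pyquery parsing library offers stronger CSS selector support, and its selection functions are modeled on jQuery. In the author's view jQuery is on its way out, so unless pyquery is specifically needed, the two libraries covered above are sufficient.
Initialization
A pyquery object can be created in several ways, for example from an HTML string or from the URL of a website.
(1). Initializing from an HTML string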
html = '''
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from pyquery import PyQuery as pq
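# Initialize by passing in the HTML string
doc = pq(html)
# A PyQuery object is callable: pass a CSS selector in parentheses to select nodes, with no extra method needed
print(doc('li'))
# /----------******** Output: *********--------\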
# <li class="item-0">first item</li>
# <li class="item-1"><a href="link2.html">second item</a></li>
# <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
# <li class="item-1 active"><a href="link4.html">fourth item</a></li>
# <li class="item-0"><a href="link5.html">fifth item</a></li>
(2). Initializing from a URL
from pyquery import PyQuery as pq
doc = pq(url='https://cuiqingcai.com')
print(doc('li'))
# /----------******** Output: *********--------\
# The page contains a lot of HTML, so the output is omitted...
(3). Initializing from a local file
from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
print(doc('li'))
Basic CSS selectors
Using CSS selectors on a pyquery object.
html = '''
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('#container .list li'))
print(type(doc('#container .list li')))
# /----------******** Output: *********--------\
# <li class="item-0">first item</li>
# <li class="item-1"><a href="link2.html">second item</a></li>
# <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
# <li class="item-1 active"><a href="link4.html">fourth item</a></li>
# <li class="item-0"><a href="link5.html">fifth item</a></li>
#
# <class 'pyquery.pyquery.PyQuery'>
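Finding nodes
The functions below are used for node lookup; they work the same way as their jQuery counterparts.
Children and descendants
find() returns all descendants of the node that match the given selector.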
html = '''
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
# Whether a selection holds one node or many, the result is a pyquery.pyquery.PyQuery object
items = doc('.list')
print(type(items))
print(items)
# find() searches all descendants of the node
lis = items.find('li')
print(type(lis))
print(lis)
# /----------******** Output: *********--------\
# <class 'pyquery.pyquery.PyQuery'>
# <ul class="list">
# <li class="item-0">first item</li>
# <li class="item-1"><a href="link2.html">second item</a></li>
# <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
# <li class="item-1 active"><a href="link4.html">fourth item</a></li>
# <li class="item-0"><a href="link5.html">fifth item</a></li>
# </ul>
#
# <class 'pyquery.pyquery.PyQuery'>
# <li class="item-0">first item</li>
# <li class="item-1"><a href="link2.html">second item</a></li>
# <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
# <li class="item-1 active"><a href="link4.html">fourth item</a></li>
# <li class="item-0"><a href="link5.html">fifth item</a></li>
find() searches all descendants of the current node, whereas children() looks only at its direct children.
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
# Pass a CSS selector to children() to narrow the result
lis = items.children('.active')
print(type(lis))
print(lis)
# /----------******** Output: *********--------\
# <class 'pyquery.pyquery.PyQuery'>
# <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
# <li class="item-1 active"><a href="link4.html">fourth item</a></li>
Parent nodes
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
# parent() returns the direct parent of the current node
container = items.parent()
print(type(container))
print(container)
# /----------******** Output: *********--------\
# <class 'pyquery.pyquery.PyQuery'>
# <div id="container">
# <ul class="list">
# <li class="item-0">first item</li>
# <li class="item-1"><a href="link2.html">second item</a></li>
# <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
# <li class="item-1 active"><a href="link4.html">fourth item</a></li>
# <li class="item-0"><a href="link5.html">fifth item</a></li>
# </ul>
# </div>
parent() returns only the direct parent of the current node. To get all ancestor nodes, use parents().
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
# parents() returns all ancestors of the current node
parents = items.parents()
print(type(parents))
print(parents)
# /----------******** Output: *********--------\
# <class 'pyquery.pyquery.PyQuery'>
# <div class="wrap">
# <div id="container">
# <ul class="list">
# <li class="item-0">first item</li>
# <li class="item-1"><a href="link2.html">second item</a></li>
# <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
# <li class="item-1 active"><a href="link4.html">fourth item</a></li>
# <li class="item-0"><a href="link5.html">fifth item</a></li>
# </ul>
# </div>
# </div><div id="container">
# <ul class="list">
# <li class="item-0">first item</li>
# <li class="item-1"><a href="link2.html">second item</a></li>
# <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
# <li class="item-1 active"><a href="link4.html">fourth item</a></li>
# <li class="item-0"><a href="link5.html">fifth item</a></li>
# </ul>
# </div>
Sibling nodes
from pyquery import PyQuery as pq
doc = pq(html)
# .item-0.active (two class names with no space between them) matches nodes whose class contains both values
li = doc('.list .item-0.active')
print(li.siblings())
# /----------******** Output: *********--------\
# <li class="item-1"><a href="link2.html">second item</a></li>
# <li class="item-0">first item</li>
# <li class="item-1 active"><a href="link4.html">fourth item</a></li>
# <li class="item-0"><a href="link5.html">fifth item</a></li>
Iteration
The items() method returns a generator over the nodes in the selection.
from pyquery import PyQuery as pq
doc = pq(html)
# items() returns a generator over the li nodes, which can then be iterated
lis = doc('li').items()
print(type(lis)) # <class 'generator'>
for li in lis:
    print(li, type(li)) # each node is a <class 'pyquery.pyquery.PyQuery'>
# /----------******** Output: *********--------\
# <class 'generator'>
# <li class="item-0">first item</li>
# <class 'pyquery.pyquery.PyQuery'>
# <li class="item-1"><a href="link2.html">second item</a></li>
# <class 'pyquery.pyquery.PyQuery'>
# <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
# <class 'pyquery.pyquery.PyQuery'>
# <li class="item-1 active"><a href="link4.html">fourth item</a></li>
# <class 'pyquery.pyquery.PyQuery'>
# <li class="item-0"><a href="link5.html">fifth item</a></li>
# <class 'pyquery.pyquery.PyQuery'>
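Extracting node information
In practice, extracting information usually means getting a node's content or its attribute values.
Getting attributes
Use either the attr() method or the attr property to get a node's attribute value.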
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('a')
print(a, type(a))
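# Get an attribute value with the attr() method
print(a.attr('href'))
# The attr property works as well
print(a.attr.href)
# Both forms return the attribute of only the first node in the selection.
# To get the attribute of every node, iterate over the selection:
print('\n')
for item in a.items():
    print(item.attr('href'))
# /----------******** Output: *********--------\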
# <a href="link2.html">second item</a><a href="link3.html"><span class="bold">third item</span></a><a href="link4.html">fourth item</a><a href="link5.html">fifth item</a> <class 'pyquery.pyquery.PyQuery'>
# link2.html
# link2.html
#
# link2.html
# link3.html
# link4.html
# link5.html
Getting text
text() returns the plain text inside a node, and html() returns its inner content including HTML tags.
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)
# text() returns the plain text inside the node
print(a.text())
# html() returns the inner HTML, similar to innerHTML
print(a.html())
# /----------******** Output: *********--------\
# <a href="link3.html"><span class="bold">third item</span></a>
# third item
# <span class="bold">third item</span>
When text() or html() is called on a node set rather than a single node, the two methods behave differently:
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('li')
# html() returns the inner HTML of the first node in the set
print(li.html())
# text() joins the plain text of every node in the set into a single string
print(li.text())
print(type(li.text()))
# /----------******** Output: *********--------\
# <a href="link2.html">second item</a>
# second item third item fourth item fifth item
# <class 'str'>
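Node manipulation
pyquery provides a set of methods for modifying nodes dynamically, such as adding a class to a node or removing a node.
add_class() and remove_class()
add_class() and remove_class() add a class to a node and remove a class from it, respectively.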
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.remove_class('active')
print(li)
li.add_class('active')
print(li)
# /----------******** Output: *********--------\
# <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
# <li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
# <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
add_class() and remove_class() each have an equivalent camelCase alias: addClass() and removeClass().
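attr(), text(), html()
Besides reading a node's attribute, attr() can also set it. Likewise, text() and html() can either read or replace a node's content. Which one happens depends on the arguments: called with a value they set, and called without one they get.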
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
# Set the name attribute with attr()
li.attr('name', 'link')
print(li)
# Replace the plain text inside the li node with text()
li.text('changed item')
print(li)
# Replace the inner HTML of the li node with html()
li.html('<span>changed item</span>')
print(li)
# /----------******** Output: *********--------\
# <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
# <li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
# <li class="item-0 active" name="link">changed item</li>
# <li class="item-0 active" name="link"><span>changed item</span></li>
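remove()
remove() deletes the selected nodes from the tree, for example a child node from its parent. This is useful when the node would otherwise pollute the extracted text.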
html = '''
<div class="wrap">
Hello, World
<p>This is a paragraph.</p>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
# Remove the p node so that only the text directly inside .wrap is printed
wrap.find('p').remove()
print(wrap.text())
# /----------******** Output: *********--------\
# Hello, World This is a paragraph.
# Hello, World
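Pseudo-class selectors
These are the pseudo-class selectors from CSS, plus jQuery-style extensions such as :gt() and :contains().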
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
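# li nodes that are the first child of their parent
li = doc('li:first-child')
print(li)
# li nodes that are the last child of their parent
li = doc('li:last-child')
print(li)
# The second li node
li = doc('li:nth-child(2)')
print(li)
# li nodes with a zero-based index greater than 2, i.e. the fourth li onward
li = doc('li:gt(2)')
print(li)
# li nodes at even positions (2nd, 4th, ...)
li = doc('li:nth-child(2n)')
print(li)
# li nodes whose text contains "second"
li = doc('li:contains(second)')
print(li)
# /----------******** Output: *********--------\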
# <li class="item-1"><a href="link2.html">second item</a></li>
#
# <li class="item-0"><a href="link5.html">fifth item</a></li>
#
# <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
#
# <li class="item-0"><a href="link5.html">fifth item</a></li>
#
# <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
# <li class="item-0"><a href="link5.html">fifth item</a></li>
#
# <li class="item-1"><a href="link2.html">second item</a></li>