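XPath Parsing Library
I. Introduction
1. Introduction
XPath, short for XML Path Language, is a language for locating information in XML documents. It was originally designed for searching XML, but it works equally well on HTML documents and offers powerful selection features.
2. Installation
Many Python libraries support XPath, but lxml is the most widely used. It is built on the C libraries libxml2 and libxslt, which makes it fast.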
pip install lxml
3. Official documentation
https://www.w3.org/TR/xpath/
4. Chinese documentation
https://www.w3school.com.cn/xpath/index.asp
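II. Basics
1. Common XPath rules
Expression | Description |
---|---|
nodename | Selects the child nodes of the current node that are named nodename |
/ | Selects direct children of the current node (at the start of an expression, starts from the document root) |
// | Selects descendants of the current node (at the start of an expression, searches the whole document) |
. | Selects the current node |
.. | Selects the parent of the current node |
@ | Selects attributes |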
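The rules in the table can be tried out with a minimal sketch like the one below. The one-item markup and the variable names are assumptions made only for this illustration.

```python
from lxml import etree

# Hypothetical sample markup, used only for this illustration.
doc = etree.HTML(
    '<div><ul><li class="one"><a href="link1.html">first</a></li></ul></div>'
)

ul = doc.xpath('//ul')[0]                      # // : search the whole document
li_children = ul.xpath('li')                   # nodename : children named li
direct_a = ul.xpath('./li/a')                  # . and / : start at ul, step child by child
parent_tag = direct_a[0].xpath('..')[0].tag    # .. : parent of the a element
href = direct_a[0].xpath('@href')              # @ : attribute of the current node

print(len(li_children), parent_tag, href)
```

Each call is made on an element, so the relative expressions are evaluated with that element as the current node.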
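2. Building an XPath parsing object
(1). Declaring HTML text
First import the etree module from lxml, then declare a piece of HTML text and pass it to etree.HTML() to get an object that can be queried with XPath. The parser automatically repairs the markup; calling tostring() returns the repaired HTML as bytes. In the example below, the missing closing li tag is filled in and the body and html nodes are added.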
from lxml import etree
text = '''
<div>
<ul>
<li class="one"><a href="link1.html">first</a></li>
<li class="two"><a href="link2.html">second</a></li>
<li class="three"><a href="link3.html">third</a></li>
<li class="two"><a href="link4.html">fourth</a></li>
<li class="one"><a href="link5.html">fifth</a>
</ul>
</div>
'''
html = etree.HTML(text) # Build the XPath parsing object
result = etree.tostring(html)
print(result)
print(result.decode('utf-8'))
**********************************************************************
b'<html><body><div>\n<ul>\n<li class="one"><a href="link1.html">first</a></li>\n<li class="two"><a href="link2.html">second</a></li>\n<li class="three"><a href="link3.html">third</a></li>\n<li class="two"><a href="link4.html">fourth</a></li>\n<li class="one"><a href="link5.html">fifth</a>\n</li></ul>\n</div>\n</body></html>'
<html><body><div>
<ul>
<li class="one"><a href="link1.html">first</a></li>
<li class="two"><a href="link2.html">second</a></li>
<li class="three"><a href="link3.html">third</a></li>
<li class="two"><a href="link4.html">fourth</a></li>
<li class="one"><a href="link5.html">fifth</a>
</li></ul>
</div>
</body></html>
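If a str is wanted directly instead of bytes, tostring() also accepts encoding='unicode'. The sketch below uses a shortened, hypothetical snippet with one unclosed li to show both forms side by side.

```python
from lxml import etree

# Hypothetical snippet with one unclosed li tag.
html = etree.HTML('<ul><li>first</li><li>second</ul>')

as_bytes = etree.tostring(html)                                     # bytes
as_text = etree.tostring(html, encoding='unicode', method='html')  # str

print(type(as_bytes).__name__, type(as_text).__name__)
print(as_text)
```

With encoding='unicode' there is no separate decode() step.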
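(2). Reading a text file
A file can also be parsed directly. The serialized result then includes an extra DOCTYPE declaration, which does not affect parsing.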
from lxml import etree
html = etree.parse(r'./test.html', etree.HTMLParser()) # Build the XPath parsing object
result = etree.tostring(html)
print(result.decode())
**********************************************************************
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
<ul>
<li class="one"><a href="link1.html">first</a></li>
<li class="two"><a href="link2.html">second</a></li>
<li class="three"><a href="link3.html">third</a></li>
<li class="two"><a href="link4.html">fourth</a></li>
<li class="one"><a href="link5.html">fifth</a>
</li></ul>
</div></body></html>
The test.html file
Many of the examples below use this file.
<!--test.html-->
<div>
<ul>
<li class="one"><a href="link1.html">first</a></li>
<li class="two"><a href="link2.html">second</a></li>
<li class="three"><a href="link3.html">third</a></li>
<li class="two"><a href="link4.html">fourth</a></li>
<li class="one"><a href="link5.html">fifth</a>
</ul>
</div>
3. The xpath() method
Once the parsing object is built, call its xpath() method with the rules above to extract information.
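What xpath() returns depends on the expression. The following sketch, using a hypothetical two-item list, shows the Python types that come back for elements, text, attributes, numbers, and booleans.

```python
from lxml import etree

# Hypothetical two-item list, used only for this illustration.
doc = etree.HTML(
    '<ul><li class="one"><a href="link1.html">first</a></li>'
    '<li class="two"><a href="link2.html">second</a></li></ul>'
)

elements = doc.xpath('//li')                         # list of Element objects
texts = doc.xpath('//li/a/text()')                   # list of strings
hrefs = doc.xpath('//li/a/@href')                    # list of strings
count = doc.xpath('count(//li)')                     # float
has_two = doc.xpath('boolean(//li[@class="two"])')   # bool

print(elements, texts, hrefs, count, has_two)
```

Node-set expressions always give a list, even when only one node matches, so single results still need indexing.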
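4. Operators in XPath
Operator | Description | Example |
---|---|---|
or | Logical or | a=1 or a=2 |
and | Logical and | a=1 and a=2 |
mod | Remainder of a division | a mod b |
\| | Union of two node-sets | //a \| //img |
+ | Addition | 1 + 2 |
- | Subtraction | 1 - 2 |
* | Multiplication | 1 * 2 |
div | Division | 1 div 2 |
= | Equal to | a=1 |
!= | Not equal to | a!=1 |
< | Less than | a<1 |
<= | Less than or equal to | a<=1 |
> | Greater than | a>1 |
>= | Greater than or equal to | a>=1 |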
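A few of these operators in use, as a rough sketch. The four-item list is a hypothetical sample, and the variable names are chosen only for this illustration.

```python
from lxml import etree

# Hypothetical sample list, used only for this illustration.
doc = etree.HTML('''
<ul>
<li class="one"><a href="link1.html">first</a></li>
<li class="two"><a href="link2.html">second</a></li>
<li class="three"><a href="link3.html">third</a></li>
<li class="two"><a href="link4.html">fourth</a></li>
</ul>
''')

one_or_three = doc.xpath('//li[@class="one" or @class="three"]/a/text()')
odd_items = doc.xpath('//li[position() mod 2 = 1]/a/text()')   # 1st and 3rd li
not_two = doc.xpath('//li[@class != "two"]/a/text()')
union_tags = [el.tag for el in doc.xpath('//ul | //a')]        # one ul plus four a
total = doc.xpath('1 + 2 * 3')                                 # arithmetic gives a float
half = doc.xpath('1 div 2')

print(one_or_three, odd_items, not_two, union_tags, total, half)
```

Comparison and logical operators are mostly used inside predicates, while | combines the results of two complete paths.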
III. Matching
1. All nodes: //
//x matches all x nodes anywhere in the document; //* matches all element nodes.
from lxml import etree
html = etree.parse(r'./test.html', etree.HTMLParser()) # Build the XPath parsing object
result1 = html.xpath('//*') # Match all element nodes
print(result1) # The result is a list; each item is an Element object
result2 = html.xpath('//li') # Match all li nodes
print(result2) # The result is a list; each item is an Element object
**********************************************************************
[<Element html at 0x16517842388>, <Element body at 0x16517842488>, <Element div at 0x165178424c8>, <Element ul at 0x16517842508>, <Element li at 0x16517842548>, <Element a at 0x165178425c8>, <Element li at 0x16517842608>, <Element a at 0x16517842648>, <Element li at 0x16517842688>, <Element a at 0x16517842588>, <Element li at 0x165178426c8>, <Element a at 0x16517842708>, <Element li at 0x16517842748>, <Element a at 0x16517842788>]
[<Element li at 0x21c75ae2408>, <Element li at 0x21c75ae2448>, <Element li at 0x21c75ae2488>, <Element li at 0x21c75ae24c8>, <Element li at 0x21c75ae2508>]
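2. Child nodes: /
x/y matches y only when y is a direct child of x; otherwise the result is empty.
Because ul has no direct a children, only li children, //ul/a returns an empty list. /a is also empty: a path that starts with / is evaluated from the document root, and the root element here is html, not a.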
from lxml import etree
html = etree.parse(r'./test.html', etree.HTMLParser()) # Build the XPath parsing object
result1 = html.xpath('//li/a')
print(result1)
result2 = html.xpath('//ul/a')
print(result2)
result3 = html.xpath('/a')
print(result3)
**********************************************************************
[<Element a at 0x1c0637d2448>, <Element a at 0x1c0637d2488>, <Element a at 0x1c0637d24c8>, <Element a at 0x1c0637d2508>, <Element a at 0x1c0637d2548>]
[]
[]
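To see why a path that starts with / can come back empty, the sketch below compares several absolute paths on a hypothetical one-link snippet. After parsing, the root element is html.

```python
from lxml import etree

# Hypothetical snippet; etree.HTML() wraps it in html and body.
doc = etree.HTML('<div><ul><li><a href="link1.html">first</a></li></ul></div>')

root_a = doc.xpath('/a')                                  # root element is html, so empty
root_html = [el.tag for el in doc.xpath('/html')]         # the root element itself
full_path = doc.xpath('/html/body/div/ul/li/a/text()')    # every level spelled out
skip_levels = doc.xpath('/html//a/text()')                # // skips the levels in between

print(root_a, root_html, full_path, skip_levels)
```

An absolute path has to name every level from the root, which is why // is usually the more practical starting point.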
3. Attribute matching: @
x[@y="z"] matches x nodes whose y attribute equals z.
from lxml import etree
html = etree.parse(r'./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="one"]')
print(result)
**********************************************************************
[<Element li at 0x1f6e5192488>, <Element li at 0x1f6e51924c8>]
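4. Nested queries: .
./ continues the query from inside an already matched element. Note that xpath() returns a list, and a list has no xpath() method, so pick an element from it first.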
from lxml import etree
html = etree.parse(r'./test.html', etree.HTMLParser())
result1 = html.xpath('//li[@class="one"]')
print(result1)
result2 = result1[0].xpath('./a')
print(result2)
**********************************************************************
[<Element li at 0x295d9d50e88>, <Element li at 0x295d9d50ec8>]
[<Element a at 0x295d9d50f08>]
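A common pitfall in nested queries is forgetting the leading dot. As a minimal sketch on a hypothetical two-item list, the comparison below shows that //a called on an element still searches the whole document, while ./a stays inside that element.

```python
from lxml import etree

# Hypothetical two-item list, used only for this illustration.
doc = etree.HTML('''
<ul>
<li class="one"><a href="link1.html">first</a></li>
<li class="two"><a href="link2.html">second</a></li>
</ul>
''')

items = doc.xpath('//li')

# Relative path: each li only sees its own a child.
relative = [li.xpath('./a/text()') for li in items]

# Path starting with //: evaluated from the document root for every li.
absolute = [li.xpath('//a/text()') for li in items]

print(relative)
print(absolute)
```

When looping over matched elements, start the inner expression with ./ or .// to keep the query scoped to the current element.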
4. Parent node: ..
x/.. matches the parent of the x node.
from lxml import etree
html = etree.parse(r'./test.html', etree.HTMLParser())
result1 = html.xpath('//a[@href="link4.html"]/..')
print(result1)
result2 = html.xpath('//a/..')
print(result2)
**********************************************************************
[<Element li at 0x1c0b20623c8>]
[<Element li at 0x1b1494f2548>, <Element li at 0x1b1494f2588>, <Element li at 0x1b1494f25c8>, <Element li at 0x1b1494f2608>, <Element li at 0x1b1494f2648>]
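The parent can also be reached with the parent:: axis or with the element's getparent() method. The sketch below, on a hypothetical two-item list, checks that the three routes lead to the same li.

```python
from lxml import etree

# Hypothetical two-item list, used only for this illustration.
doc = etree.HTML(
    '<ul><li class="one"><a href="link1.html">first</a></li>'
    '<li class="two"><a href="link2.html">second</a></li></ul>'
)

a = doc.xpath('//a[@href="link2.html"]')[0]

via_dotdot = a.xpath('..')[0]           # abbreviated syntax
via_axis = a.xpath('parent::li')[0]     # explicit axis with a name test
via_api = a.getparent()                 # lxml element API
wrong_name = a.xpath('parent::ul')      # parent is li, not ul, so empty

print(via_dotdot.get('class'), via_axis.get('class'), via_api.get('class'))
print(wrong_name)
```

The axis form is handy when the parent should match only if it has a particular tag name.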
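5. Getting text
x/text() returns the text nodes that are direct children of x.
Result 1 is the line break inside the repaired last li node, result 2 is the contents of the a nodes plus that line break, and result 3 is the contents of the a nodes only.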
from lxml import etree
html = etree.parse(r'./test.html', etree.HTMLParser())
result1 = html.xpath('//li[@class="one"]/text()')
print(result1)
result2 = html.xpath('//li[@class="one"]//text()')
print(result2)
result3 = html.xpath('//li[@class="one"]/a/text()')
print(result3)
**********************************************************************
['\r\n']
['first', 'fifth', '\r\n']
['first', 'fifth']
string(//x) returns all the text under the first matched x node, including the text of its descendants.
from lxml import etree
html = etree.parse(r'./test.html', etree.HTMLParser())
result = html.xpath('string(//ul)')
print(result)
**********************************************************************
first
second
third
fourth
fifth
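Stray line breaks like the ones above usually need cleaning up. The sketch below, on a hypothetical snippet, filters whitespace-only text nodes in Python, shows that string() only uses the first matched node, and uses the standard XPath function normalize-space() to collapse whitespace.

```python
from lxml import etree

# Hypothetical snippet with line breaks between the tags.
doc = etree.HTML('<ul>\n<li><a>first</a></li>\n<li><a>second</a>\n</li></ul>')

raw = doc.xpath('//li//text()')                    # includes a whitespace-only node
cleaned = [t.strip() for t in raw if t.strip()]    # drop whitespace-only entries

first_only = doc.xpath('string(//li)')             # string value of the first li only
normalized = doc.xpath('normalize-space(//ul)')    # trims and collapses whitespace

print(raw)
print(cleaned)
print(repr(first_only), repr(normalized))
```

Filtering in Python keeps one entry per text node, while normalize-space() returns a single joined string.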
6. Getting attributes
@x returns the value of attribute x.
from lxml import etree
html = etree.parse(r'./test.html', etree.HTMLParser())
result1 = html.xpath('//li/a/@href')
print(result1)
**********************************************************************
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
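When text and attribute values are needed together, one option is to select the elements and read both from each element with lxml's .text and .get(). A minimal sketch on a hypothetical two-item list:

```python
from lxml import etree

# Hypothetical two-item list, used only for this illustration.
doc = etree.HTML(
    '<ul><li><a href="link1.html">first</a></li>'
    '<li><a href="link2.html">second</a></li></ul>'
)

# Pair each link text with its href by reading both from the same element.
links = {a.text: a.get('href') for a in doc.xpath('//li/a')}

# Asking for an attribute that no element has gives an empty list.
missing = doc.xpath('//li/a/@title')

print(links)
print(missing)
```

Reading both values from the same element avoids mismatched pairs when some elements lack the attribute.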
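7. Matching multi-valued attributes
x[contains(@y, "z")] matches x nodes whose y attribute contains z.
When an attribute holds several values, an exact match with @y="z" fails. Use contains() instead: the first argument is the attribute (written with @), the second is the value to look for as a quoted string, and the node matches when the attribute value contains that string.
from lxml import etree
text = '''
<li class="one two three"><a href="link1.html">first</a></li>
'''
html = etree.HTML(text)
result1 = html.xpath('//li[@class="one"]')
print(result1)
result2 = html.xpath('//li[contains(@class, "one")]')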
print(result2)
**********************************************************************
[]
[<Element li at 0x26ccf312308>]
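Because contains() is a plain substring test, it also matches class names that merely include the text. The sketch below, on hypothetical markup, compares it with a commonly used whole-token pattern built from the standard XPath functions concat() and normalize-space().

```python
from lxml import etree

# Hypothetical markup: "someone" contains the substring "one".
doc = etree.HTML('<ul><li class="one two">a</li><li class="someone">b</li></ul>')

# Substring test: matches both items.
substring = doc.xpath('//li[contains(@class, "one")]/text()')

# Whole-token test: pad the attribute with spaces and look for " one ".
token = doc.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " one ")]/text()'
)

print(substring)
print(token)
```

The padded form is longer but only matches when "one" is a complete class name.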
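8. Matching multiple attributes
/x[@y="z" and @m="n"] matches x nodes whose y attribute equals z and whose m attribute equals n.
To match several attributes at once, join the conditions with the and operator.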
from lxml import etree
text = '''
<li class="one" name="number"><a href="link1.html">first</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[@class="one" and @name="number"]')
print(result)
**********************************************************************
[<Element li at 0x1f65c162448>]
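9. Selecting by position
/x[y] matches the y-th x node among its siblings; positions start at 1, not 0.
/x[last()] matches the last x node.
/x[last()-1] matches the second-to-last x node.
/x[position()<3] matches the x nodes whose position is less than 3, that is, the first two.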
from lxml import etree
html = etree.parse(r'./test.html', etree.HTMLParser())
result1 = html.xpath('//li[1]/a/text()')
print(result1)
result2 = html.xpath('//li[last()]/a/text()')
print(result2)
result3 = html.xpath('//li[last()-1]/a/text()')
print(result3)
result4 = html.xpath('//li[position()<3]/a/text()')
print(result4)
**********************************************************************
['first']
['fifth']
['fourth']
['first', 'second']
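Positions are counted among siblings under the same parent, which matters once a document has more than one list. The sketch below uses hypothetical markup with two ul elements to contrast //li[1] with (//li)[1].

```python
from lxml import etree

# Hypothetical markup with two separate lists.
doc = etree.HTML(
    '<div><ul><li>a1</li><li>a2</li></ul>'
    '<ul><li>b1</li><li>b2</li></ul></div>'
)

per_parent = doc.xpath('//li[1]/text()')       # first li under each parent
overall = doc.xpath('(//li)[1]/text()')        # first li in the whole document
last_each = doc.xpath('//li[last()]/text()')   # last li under each parent

print(per_parent, overall, last_each)
```

Wrapping the path in parentheses applies the index to the combined result instead of to each parent separately.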
10. Selecting along node axes
See https://www.w3school.com.cn/xpath/xpath_axes.asp
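As a rough sketch of how axes are written, the example below applies a few standard XPath 1.0 axes to a hypothetical three-item list, starting from the middle li.

```python
from lxml import etree

# Hypothetical three-item list, used only for this illustration.
doc = etree.HTML('''<div><ul>
<li class="one"><a href="link1.html">first</a></li>
<li class="two"><a href="link2.html">second</a></li>
<li class="three"><a href="link3.html">third</a></li>
</ul></div>''')

second = doc.xpath('//li[@class="two"]')[0]

ancestors = {el.tag for el in second.xpath('ancestor::*')}    # all enclosing elements
following = second.xpath('following-sibling::li/a/text()')    # later siblings
preceding = second.xpath('preceding-sibling::li/a/text()')    # earlier siblings
children = [el.tag for el in second.xpath('child::*')]        # direct child elements
attrs = second.xpath('attribute::*')                          # all attribute values
descendants = doc.xpath('//ul/descendant::a/text()')          # a nodes at any depth

print(ancestors, following, preceding, children, attrs, descendants)
```

The form is always axis::node-test, and the abbreviations used earlier (.., @, //) are shorthand for some of these axes.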