XPath Parsing Library

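I. Introduction

1. Overview

XPath, short for XML Path Language, is a language for finding information in XML documents. It was originally designed for searching XML documents, but it works equally well on HTML documents and offers powerful selection capabilities.

2. Installation

Many Python libraries provide XPath support; lxml is the most widely used of them and is known for its speed.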

pip install lxml
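
To confirm that the installation worked, a quick check along these lines can be run. It is only a sketch and assumes lxml is importable in the current environment; the variable name lxml_version is made up for illustration.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml release
lxml_version = etree.LXML_VERSION
print(lxml_version)
```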

3. Official documentation

https://www.w3.org/TR/xpath/

4. Chinese-language documentation

https://www.w3school.com.cn/xpath/index.asp

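II. Basics

1. Common XPath rules

Expression   Description
nodename     Selects the child elements of the current node that are named nodename
/            Selects direct children of the current node (at the start of an expression, starts from the document root)
//           Selects descendants of the current node at any depth
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes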
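
The rules above can be tried out on a small fragment. The snippet below is a minimal sketch; the one-item HTML string and the variable names are made up for illustration.

```python
from lxml import etree

html = etree.HTML(
    '<div><ul><li class="one"><a href="link1.html">first</a></li></ul></div>'
)

ul = html.xpath('//ul')[0]        # //  : search the whole document for ul
li_children = ul.xpath('li')      # nodename : child elements named li
a_direct = ul.xpath('./li/a')     # . and / : walk direct children step by step
a_any = ul.xpath('.//a')          # .// : any descendant of the current node
parent = a_any[0].xpath('..')     # .. : parent of the a element
hrefs = ul.xpath('.//a/@href')    # @  : attribute values

print([e.tag for e in li_children])  # ['li']
print(parent[0].tag)                 # li
print(hrefs)                         # ['link1.html']
```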

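2. Building an XPath parse object

(1). Declaring HTML text

First import the etree module from lxml, then declare a piece of HTML text and pass it to etree.HTML() to initialize it. This produces an object that can be queried with XPath. The etree module repairs the markup automatically, and tostring() outputs the repaired HTML as bytes. In the example below, the unclosed last li tag is completed and the body and html nodes are added.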

from lxml import etree

text = '''
<div>
<ul>
<li class="one"><a href="link1.html">first</a></li>
<li class="two"><a href="link2.html">second</a></li>
<li class="three"><a href="link3.html">third</a></li>
<li class="two"><a href="link4.html">fourth</a></li>
<li class="one"><a href="link5.html">fifth</a>
</ul>
</div>
'''
html = etree.HTML(text)     # build the XPath parse object
result = etree.tostring(html)
print(result)
print(result.decode('utf-8'))
**********************************************************************
b'<html><body><div>\n<ul>\n<li class="one"><a href="link1.html">first</a></li>\n<li class="two"><a href="link2.html">second</a></li>\n<li class="three"><a href="link3.html">third</a></li>\n<li class="two"><a href="link4.html">fourth</a></li>\n<li class="one"><a href="link5.html">fifth</a>\n</li></ul>\n</div>\n</body></html>'
<html><body><div>
<ul>
<li class="one"><a href="link1.html">first</a></li>
<li class="two"><a href="link2.html">second</a></li>
<li class="three"><a href="link3.html">third</a></li>
<li class="two"><a href="link4.html">fourth</a></li>
<li class="one"><a href="link5.html">fifth</a>
</li></ul>
</div>
</body></html>
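(2). Reading a text file

A text file can also be read and parsed directly. The output then carries an extra DOCTYPE declaration, which does not affect parsing. The &#13; entities in the output below are the carriage returns of a file saved with Windows (CRLF) line endings.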

from lxml import etree

html = etree.parse(r'./test.html', etree.HTMLParser())     # build the XPath parse object
result = etree.tostring(html)
print(result.decode())
**********************************************************************
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>&#13;
<ul>&#13;
<li class="one"><a href="link1.html">first</a></li>&#13;
<li class="two"><a href="link2.html">second</a></li>&#13;
<li class="three"><a href="link3.html">third</a></li>&#13;
<li class="two"><a href="link4.html">fourth</a></li>&#13;
<li class="one"><a href="link5.html">fifth</a>&#13;
</li></ul>&#13;
</div></body></html>
The test.html file

Many of the examples below use this file.

<!--test.html-->
<div>
<ul>
<li class="one"><a href="link1.html">first</a></li>
<li class="two"><a href="link2.html">second</a></li>
<li class="three"><a href="link3.html">third</a></li>
<li class="two"><a href="link4.html">fourth</a></li>
<li class="one"><a href="link5.html">fifth</a>
</ul>
</div>
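
For readers who want to reproduce the file-based examples without creating test.html by hand, the following sketch writes the same markup to a temporary directory and parses it. The temporary location and the variable names are assumptions made for illustration; the original examples read ./test.html from the working directory.

```python
import os
import tempfile

from lxml import etree

content = '''<div>
<ul>
<li class="one"><a href="link1.html">first</a></li>
<li class="two"><a href="link2.html">second</a></li>
<li class="three"><a href="link3.html">third</a></li>
<li class="two"><a href="link4.html">fourth</a></li>
<li class="one"><a href="link5.html">fifth</a>
</ul>
</div>
'''

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.html')
    # newline='\n' keeps LF line endings, so no &#13; entities appear in the output
    with open(path, 'w', encoding='utf-8', newline='\n') as f:
        f.write(content)

    tree = etree.parse(path, etree.HTMLParser())   # returns an ElementTree
    root_tag = tree.getroot().tag
    li_count = len(tree.xpath('//li'))
    # encoding='unicode' makes tostring() return str instead of bytes
    as_text = etree.tostring(tree, encoding='unicode', method='html')

print(root_tag, li_count)
print(as_text)
```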

3. The xpath() method

Once the parse object has been built, calling its xpath() method with the common XPath rules is enough to extract information.
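
What xpath() returns depends on the expression. The sketch below, using a made-up two-item list, shows the result types that come up most often: a list of Element objects, a list of strings, a float, and a bool.

```python
from lxml import etree

html = etree.HTML(
    '<ul>'
    '<li class="one"><a href="link1.html">first</a></li>'
    '<li class="two"><a href="link2.html">second</a></li>'
    '</ul>'
)

elements = html.xpath('//li')            # node expression -> list of Element objects
texts = html.xpath('//a/text()')         # text() -> list of strings
count = html.xpath('count(//li)')        # numeric function -> float
has_two = html.xpath('boolean(//li[@class="two"])')  # boolean function -> bool

print([e.tag for e in elements])  # ['li', 'li']
print(texts)                      # ['first', 'second']
print(count)                      # 2.0
print(has_two)                    # True
```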

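4. XPath operators

Operator   Description                Example
or         Logical or                 a=1 or a=2
and        Logical and                a=1 and a=2
mod        Remainder of a division    a mod b
|          Union of two node-sets     //a | //img
+          Addition                   1 + 2
-          Subtraction                1 - 2
*          Multiplication             1 * 2
div        Division                   1 div 2
=          Equal to                   a=1
!=         Not equal to               a!=1
<          Less than                  a<1
<=         Less than or equal to      a<=1
>          Greater than               a>1
>=         Greater than or equal to   a>=1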
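
A few of these operators are shown in action below. This is a sketch that reuses the five-item list from this article as an inline string; the variable names are made up for illustration.

```python
from lxml import etree

text = '''
<ul>
<li class="one"><a href="link1.html">first</a></li>
<li class="two"><a href="link2.html">second</a></li>
<li class="three"><a href="link3.html">third</a></li>
<li class="two"><a href="link4.html">fourth</a></li>
<li class="one"><a href="link5.html">fifth</a></li>
</ul>
'''
html = etree.HTML(text)

# or: li nodes whose class is "two" or "three"
two_or_three = html.xpath('//li[@class="two" or @class="three"]/a/text()')
# mod: li nodes at odd positions
odd = html.xpath('//li[position() mod 2 = 1]/a/text()')
# |: union of two node-sets, returned in document order
union = html.xpath('//li[@class="three"] | //li[@class="three"]/a')
# div and mod on plain numbers return floats
quotient = html.xpath('7 div 2')
remainder = html.xpath('7 mod 2')

print(two_or_three)             # ['second', 'third', 'fourth']
print(odd)                      # ['first', 'third', 'fifth']
print([e.tag for e in union])   # ['li', 'a']
print(quotient, remainder)      # 3.5 1.0
```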

III. Matching

1. All nodes: //

//x matches every x node anywhere in the document. //* matches every element node.

from lxml import etree

html = etree.parse(r'./test.html', etree.HTMLParser())     # build the XPath parse object
result1 = html.xpath('//*')     # match every node with xpath()
print(result1)     # a list in which each item is an Element object
result2 = html.xpath('//li')     # match every li node with xpath()
print(result2)     # a list in which each item is an Element object
**********************************************************************
[<Element html at 0x16517842388>, <Element body at 0x16517842488>, <Element div at 0x165178424c8>, <Element ul at 0x16517842508>, <Element li at 0x16517842548>, <Element a at 0x165178425c8>, <Element li at 0x16517842608>, <Element a at 0x16517842648>, <Element li at 0x16517842688>, <Element a at 0x16517842588>, <Element li at 0x165178426c8>, <Element a at 0x16517842708>, <Element li at 0x16517842748>, <Element a at 0x16517842788>]
[<Element li at 0x21c75ae2408>, <Element li at 0x21c75ae2448>, <Element li at 0x21c75ae2488>, <Element li at 0x21c75ae24c8>, <Element li at 0x21c75ae2508>]

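2. Child nodes: /

x/y matches y only when y is a direct child of x; otherwise the result is empty.

Because the ul tag has no direct a child (only li children), //ul/a matches nothing. An expression that starts with a single / begins at the document root, so /a would only match if a were the root element. The root here is html, so that result is empty as well.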

from lxml import etree

html = etree.parse(r'./test.html', etree.HTMLParser())     # build the XPath parse object
result1 = html.xpath('//li/a')
print(result1)
result2 = html.xpath('//ul/a')
print(result2)
result3 = html.xpath('/a')
print(result3)
**********************************************************************
[<Element a at 0x1c0637d2448>, <Element a at 0x1c0637d2488>, <Element a at 0x1c0637d24c8>, <Element a at 0x1c0637d2508>, <Element a at 0x1c0637d2548>]
[]
[]
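
To make the role of a leading / concrete, the following sketch compares /a with a full path that starts at the root. The one-item HTML string is made up for illustration.

```python
from lxml import etree

html = etree.HTML('<div><ul><li><a href="link1.html">first</a></li></ul></div>')

no_match = html.xpath('/a')          # the root element is html, not a
root = html.xpath('/html')           # a leading / starts at the document root
full_path = html.xpath('/html/body/div/ul/li/a/text()')  # every step is a direct child

print(no_match)                 # []
print([e.tag for e in root])    # ['html']
print(full_path)                # ['first']
```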

3. Attribute matching: @

x[@y="z"] matches x nodes whose y attribute equals z.

from lxml import etree

html = etree.parse(r'./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="one"]')
print(result)
**********************************************************************
[<Element li at 0x1f6e5192488>, <Element li at 0x1f6e51924c8>]

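4. Nested queries: .

./ matches relative to the current element. Note that xpath() returns a list, and a list has no xpath() method, so take an element out of the list first.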

from lxml import etree

html = etree.parse(r'./test.html', etree.HTMLParser())
result1 = html.xpath('//li[@class="one"]')
print(result1)
result2 = result1[0].xpath('./a')
print(result2)
**********************************************************************
[<Element li at 0x295d9d50e88>, <Element li at 0x295d9d50ec8>]
[<Element a at 0x295d9d50f08>]
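
A common use of nested queries is to loop over the matched elements and run a relative query on each one. The sketch below uses a made-up two-item list; it also shows that leaving out the leading dot makes the query search the whole document again.

```python
from lxml import etree

text = '''
<ul>
<li class="one"><a href="link1.html">first</a></li>
<li class="two"><a href="link2.html">second</a></li>
</ul>
'''
html = etree.HTML(text)

rows = []
for li in html.xpath('//li'):           # xpath() returns a list, so iterate over it
    href = li.xpath('./a/@href')[0]     # "./" is relative to this li element
    label = li.xpath('./a/text()')[0]
    rows.append((li.get('class'), href, label))
print(rows)
# [('one', 'link1.html', 'first'), ('two', 'link2.html', 'second')]

first_li = html.xpath('//li')[0]
inside = first_li.xpath('.//a')     # only the a inside this li
everywhere = first_li.xpath('//a')  # "//" without "." searches the whole document
print(len(inside), len(everywhere))  # 1 2
```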

4. Parent node: ..

x/.. matches the parent of node x.

from lxml import etree

html = etree.parse(r'./test.html', etree.HTMLParser())
result1 = html.xpath('//a[@href="link4.html"]/..')
print(result1)
result2 = html.xpath('//a/..')
print(result2)
**********************************************************************
[<Element li at 0x1c0b20623c8>]
[<Element li at 0x1b1494f2548>, <Element li at 0x1b1494f2588>, <Element li at 0x1b1494f25c8>, <Element li at 0x1b1494f2608>, <Element li at 0x1b1494f2648>]

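5. Getting text

x/text() returns the text located directly inside node x.

result1 is the line break inside the last li node, which the parser closed automatically. result2 contains the text of the a nodes together with that line break. result3 contains only the text of the a nodes.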

from lxml import etree

html = etree.parse(r'./test.html', etree.HTMLParser())
result1 = html.xpath('//li[@class="one"]/text()')
print(result1)
result2 = html.xpath('//li[@class="one"]//text()')
print(result2)
result3 = html.xpath('//li[@class="one"]/a/text()')
print(result3)
**********************************************************************
['\r\n']
['first', 'fifth', '\r\n']
['first', 'fifth']

string(//x) returns all the text under the first matching x node as a single string.

from lxml import etree

html = etree.parse(r'./test.html', etree.HTMLParser())
result = html.xpath('string(//ul)')
print(result)
**********************************************************************
first
second
third
fourth
fifth
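
Text pulled out this way usually carries line breaks and other whitespace. The following sketch shows two ways of tidying it: stripping in Python, or using the XPath function normalize-space(). The inline HTML string mirrors the list used in this article, and the variable names are made up for illustration.

```python
from lxml import etree

text = '''
<ul>
<li class="one"><a href="link1.html">first</a></li>
<li class="two"><a href="link2.html">second</a></li>
<li class="three"><a href="link3.html">third</a></li>
<li class="two"><a href="link4.html">fourth</a></li>
<li class="one"><a href="link5.html">fifth</a></li>
</ul>
'''
html = etree.HTML(text)

raw = html.xpath('//ul//text()')                 # includes whitespace-only strings
clean = [t.strip() for t in raw if t.strip()]    # drop them in Python
joined = html.xpath('normalize-space(string(//ul))')  # collapse whitespace in XPath

print(clean)    # ['first', 'second', 'third', 'fourth', 'fifth']
print(joined)   # first second third fourth fifth
```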

6. Getting attributes

@x, where x is the attribute to retrieve.

from lxml import etree

html = etree.parse(r'./test.html', etree.HTMLParser())
result1 = html.xpath('//li/a/@href')
print(result1)
**********************************************************************
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
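
Attribute values are often wanted together with the text they belong to. The sketch below pairs link text with href values, and also reads an attribute from an Element with get(), which returns None when the attribute is missing. The two-item HTML string and the variable names are made up for illustration.

```python
from lxml import etree

text = '''
<ul>
<li class="one"><a href="link1.html">first</a></li>
<li class="two"><a href="link2.html">second</a></li>
</ul>
'''
html = etree.HTML(text)

# Pair each link text with its href; both lists come back in document order
links = dict(zip(html.xpath('//li/a/text()'), html.xpath('//li/a/@href')))
print(links)   # {'first': 'link1.html', 'second': 'link2.html'}

first_a = html.xpath('//li/a')[0]
href = first_a.get('href')      # same value as @href, read from the Element
missing = first_a.get('title')  # None, because the attribute does not exist
attrs = dict(first_a.attrib)    # all attributes of the element as a dict
print(href, missing, attrs)
```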

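7. Matching attributes with multiple values

x[contains(@y, "z")] matches x nodes whose y attribute value contains z.

When an attribute holds several values, an exact match with @y="z" fails, and contains() is needed instead. The first argument is the attribute (written as @name) and the second is the value to look for, as a quoted string. The match succeeds as long as the attribute value contains that string.

from lxml import etree

text = '''
<li class="one two three"><a href="link1.html">first</a></li>
'''
html = etree.HTML(text)
result1 = html.xpath('//li[@class="one"]')
print(result1)
result2 = html.xpath('//li[contains(@class, "one")]')     # @class and a quoted value; bare names would be read as child elements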
print(result2)
**********************************************************************
[]
[<Element li at 0x26ccf312308>]
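
One caveat: contains() is a plain substring test, so "one" also matches a class such as "someone". A common XPath 1.0 idiom for matching a whole class token is to pad the attribute value with spaces first. The sketch below uses a made-up two-item list to show the difference.

```python
from lxml import etree

text = '''
<ul>
<li class="one two three">a</li>
<li class="someone">b</li>
</ul>
'''
html = etree.HTML(text)

# Substring match: "someone" also contains "one"
loose = html.xpath('//li[contains(@class, "one")]/text()')

# Token match: wrap the normalized value in spaces and look for " one "
strict = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " one ")]/text()'
)

print(loose)    # ['a', 'b']
print(strict)   # ['a']
```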

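8. Matching multiple attributes

/x[@y="z" and @m="n"] matches x nodes whose y attribute equals z and whose m attribute equals n.

To match on several attributes at once, join the conditions with the and operator.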

from lxml import etree

text = '''
<li class="one" name="number"><a href="link1.html">first</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[@class="one" and @name="number"]')
print(result)
**********************************************************************
[<Element li at 0x1f65c162448>]

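9. Selecting by position

/x[y] matches the y-th x node under its parent; numbering starts at 1, not 0.

/x[last()] matches the last x node.

/x[last()-1] matches the second-to-last x node.

/x[position()<2] matches x nodes whose position is less than 2.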

from lxml import etree

html = etree.parse(r'./test.html', etree.HTMLParser())
result1 = html.xpath('//li[1]/a/text()')
print(result1)
result2 = html.xpath('//li[last()]/a/text()')
print(result2)
result3 = html.xpath('//li[last()-1]/a/text()')
print(result3)
result4 = html.xpath('//li[position()<3]/a/text()')
print(result4)
**********************************************************************
['first']
['fifth']
['fourth']
['first', 'second']
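
Positions are counted among siblings under the same parent, not across the whole document. With more than one list on a page, //li[1] returns the first li of every list, while wrapping the path in parentheses indexes the combined result. The two-list HTML string below is made up for illustration.

```python
from lxml import etree

html = etree.HTML(
    '<div>'
    '<ul><li>a1</li><li>a2</li></ul>'
    '<ul><li>b1</li><li>b2</li></ul>'
    '</div>'
)

first_of_each = html.xpath('//li[1]/text()')        # first li under each ul
first_overall = html.xpath('(//li)[1]/text()')      # first li in the document
last_overall = html.xpath('(//li)[last()]/text()')  # last li in the document

print(first_of_each)   # ['a1', 'b1']
print(first_overall)   # ['a1']
print(last_overall)    # ['b2']
```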

10. Selecting by node axis

See https://www.w3school.com.cn/xpath/xpath_axes.asp
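
As a brief illustration of axes, the sketch below applies a few standard XPath 1.0 axes (ancestor, following-sibling, preceding-sibling, attribute, child) to the list used in this article. The inline HTML string and the variable names are made up for illustration.

```python
from lxml import etree

text = '''
<div>
<ul>
<li class="one"><a href="link1.html">first</a></li>
<li class="two"><a href="link2.html">second</a></li>
<li class="three"><a href="link3.html">third</a></li>
<li class="two"><a href="link4.html">fourth</a></li>
<li class="one"><a href="link5.html">fifth</a></li>
</ul>
</div>
'''
html = etree.HTML(text)

# ancestor::* : every element above the matched a node
ancestors = [e.tag for e in html.xpath('//a[@href="link3.html"]/ancestor::*')]
# following-sibling / preceding-sibling : li nodes after or before the current one
following = html.xpath('//li[@class="three"]/following-sibling::li/a/text()')
preceding = html.xpath('//li[@class="three"]/preceding-sibling::li/a/text()')
# attribute::* : all attribute values of the first li
attrs = html.xpath('//li[1]/attribute::*')
# child::li is the long form of the plain step li
first_child = html.xpath('//ul/child::li[1]/a/text()')

print(ancestors)    # ['html', 'body', 'div', 'ul', 'li']
print(following)    # ['fourth', 'fifth']
print(preceding)
print(attrs)        # ['one']
print(first_child)  # ['first']
```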
