XPath Parsing Library

Latest recommended article published on 2023-06-20 13:37:12

雪小妮

Category: Python Basic Crawlers; tags: python

Article link: https://blog.csdn.net/qq_35249586/article/details/117451299


1. Common XPath rules
[Figure: table of common XPath rules]
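
As a rough illustration of the most common XPath expressions (nodename, /, //, ., .. and @), the sketch below runs each one against a tiny document with lxml. The sample markup and the variable names are made up for this illustration and do not come from the article; lxml is assumed to be installed.

```python
from lxml import etree

# Hypothetical sample document, used only for this illustration.
doc = etree.fromstring(
    '<root><item id="a"><name>first</name></item>'
    '<item id="b"><name>second</name></item></root>'
)

# nodename: child elements with that tag name, relative to the context node
items = doc.xpath('item')
# / : direct children, here walked step by step from the document root
names_direct = doc.xpath('/root/item/name')
# // : descendants at any depth
names_any = doc.xpath('//name')
# . : the current node
current = items[0].xpath('.')
# .. : the parent of the current node
parent = items[0].xpath('..')
# @ : an attribute
ids = doc.xpath('//item/@id')

print(len(items), len(names_direct), len(names_any))
print(current[0].tag, parent[0].tag, ids)
```

Every call to xpath() returns a list, even when only one node matches, so results are indexed before use.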
2. Introduction
Install the lxml library (for example with pip install lxml), then import it: from lxml import etree
2.1 Constructing the XPath parsing object

Contents of index.html:
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
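from lxml import etree
# Parse index.html with the lenient HTML parser to build the tree
html = etree.parse('index.html', etree.HTMLParser())
print(html)
# tostring() serializes the tree as bytes; tags missing from the file (such as </body> and </html>) have been filled in by the parser
result = etree.tostring(html)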
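
When the markup is already in memory, a similar result can be obtained with etree.HTML(), which parses a string instead of a file. The sketch below uses a shortened, deliberately unclosed fragment (an assumption for illustration, not the article's index.html) to show that the serialized output contains the closing tags the input lacked.

```python
from lxml import etree

# Shortened fragment with no closing </p>, </body> or </html> (illustrative only).
text = '<html><body><p class="story">Once upon a time'

root = etree.HTML(text)        # parse a string with the HTML parser
result = etree.tostring(root)  # serialized tree, returned as bytes
print(result.decode('utf-8'))
```

Because tostring() returns bytes, the result is decoded before printing.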

2.2 Selecting nodes
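Selecting all nodes: an XPath rule that starts with // selects every matching node in the document, whatever its depth. * matches any element, so //* selects all of them.
Selecting child nodes: use / for direct children and // for descendants at any depth.
Selecting the parent node: use .. or the parent:: axis.
Attribute matching: filter with @ inside a predicate, for example [@class="story"].
Getting text: use text(). Either select a specific node first and then take its text with /text(), or use //text() to collect the text of all its descendants.
Getting attributes: use @ as a path step, for example /@href.
Matching a multi-valued attribute: an attribute of some nodes may hold several values; match it with the contains() function.
Matching multiple attributes: combine the conditions with and.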
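
The last two rules are easiest to see on a node whose class attribute holds more than one value. The following is a minimal sketch; the markup and variable names are hypothetical and are not taken from index.html.

```python
from lxml import etree

# Hypothetical markup: class holds two values, and a name attribute is present.
text = '<ul><li class="li li-first" name="item"><a href="link.html">first item</a></li></ul>'
root = etree.HTML(text)

# An exact comparison fails because the full value is "li li-first", not "li"
exact = root.xpath('//li[@class="li"]/a/text()')
# contains() succeeds when the attribute value contains the given substring
multi_value = root.xpath('//li[contains(@class, "li")]/a/text()')
# "and" combines two conditions inside one predicate
multi_attr = root.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')

print(exact)        # []
print(multi_value)  # ['first item']
print(multi_attr)   # ['first item']
```

Note that contains() is a plain substring test, so contains(@class, "li") would also match a class such as "list"; it is a convenient approximation rather than a true class-token match.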

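from lxml import etree

html = etree.parse('index.html', etree.HTMLParser())
result1 = html.xpath('//*')  # * matches every element
print(result1)
result2 = html.xpath('//p')  # all p elements
print(result2)
result3 = html.xpath('//p/b')  # b elements that are direct children of a p
print(result3)
# Parent of the a element with this href, then the parent's class attribute: ['story']
result4 = html.xpath('//a[@href="http://example.com/elsie"]/../@class')
print(result4)
# Same parent lookup, selecting the a element by id
result5 = html.xpath('//a[@id="link1"]/../@class')
print(result5)
# Same again, using the parent:: axis instead of ..
result6 = html.xpath('//a[@id="link1"]/parent::*/@class')
print(result6)
result7 = html.xpath('//a[@id="link1"]')  # attribute matching
print(result7)

# Text of a elements directly under p.story; link1 holds only a comment, so it contributes no text
result8 = html.xpath('//p[@class="story"]/a/text()')
print(result8)

# All text nodes at any depth under p.story, including whitespace between the links
result9 = html.xpath('//p[@class="story"]//text()')
print(result9)
result10 = html.xpath('//p/a/@href')  # href attribute of every a under a p
print(result10)
result11 = html.xpath('//p/a/@class')  # class attribute of every a under a p
print(result11)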