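1. Common XPath Rules
2. Introduction
Install the lxml library (pip install lxml), then import it: from lxml import etree
2.1 Constructing an XPath parsing object
Contents of index.html: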
'''<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>'''
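html = etree.parse('index.html', etree.HTMLParser())  # build the parse tree using the HTML parser
print(html)
result = etree.tostring(html)  # serialize the tree; the HTML parser fills in missing tags such as </html>
print(result.decode('utf-8'))  # tostring() returns bytes, so decode before printing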
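The repair behaviour described above can also be seen without a file on disk. The following is a minimal sketch, assuming the markup arrives as a string; the broken fragment is made up for illustration and is not part of index.html. It uses etree.HTML(), which parses a string with the HTML parser and returns the root element.

```python
from lxml import etree

# Hypothetical broken fragment: no <html>/<body>, and neither <p> is closed.
broken = '<p class="story">Once upon a time<p class="story">...'

# etree.HTML() parses a string with the HTML parser and returns the root element.
root = etree.HTML(broken)

# encoding='unicode' makes tostring() return str instead of bytes.
repaired = etree.tostring(root, encoding='unicode')
print(repaired)
```

The printed markup should contain the html and body wrappers and closing tags that the original string lacked.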
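2.2 Selecting nodes
All nodes: an XPath rule that starts with // selects every matching node in the document; * matches any element.
Child nodes: use / to select direct children and // to select descendants at any depth.
Parent node: use .. or the parent:: axis.
Attribute matching: use @ inside a predicate, e.g. [@class="story"].
Getting text: use text() (1. select a node first and then take its text with /text(); 2. use //text() to collect the text of all descendants).
Getting attribute values: use @, e.g. /@href.
Multi-valued attributes: when an attribute of a node holds several values, match it with the contains() function.
Multiple attributes: combine the conditions with and.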
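The last two rules (contains() and and) are easiest to see on a node whose class attribute holds more than one value. Here is a small sketch; the <ul> snippet and its attribute values are invented for illustration and do not come from index.html.

```python
from lxml import etree

# Hypothetical snippet: the second <li> has two class values and a name attribute.
text = '''
<ul>
  <li class="li"><a href="link1.html">first item</a></li>
  <li class="li li-first" name="item"><a href="link2.html">second item</a></li>
</ul>
'''
root = etree.HTML(text)

# An exact comparison uses the whole attribute string, so "li li-first" does not match.
exact = root.xpath('//li[@class="li-first"]/a/text()')

# contains() matches when the attribute value contains the given substring.
multi_value = root.xpath('//li[contains(@class, "li-first")]/a/text()')

# "and" combines conditions: class contains "li" AND name equals "item".
multi_attr = root.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')

print(exact)        # []
print(multi_value)  # ['second item']
print(multi_attr)   # ['second item']
```

Note that contains() is a plain substring test, so contains(@class, "li") also matches a class such as "list"; that is acceptable for a quick scrape but worth keeping in mind.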
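from lxml import etree

html = etree.parse('index.html', etree.HTMLParser())
result1 = html.xpath('//*')  # * matches every element in the document
print(result1)
result2 = html.xpath('//p')  # all p nodes
print(result2)
result2 = html.xpath('//p/b')  # b nodes that are direct children of a p node
print(result2)
result3 = html.xpath('//a[@href="http://example.com/elsie"]/../@class')  # match by href, step up with .., read class
print(result3)
result4 = html.xpath('//a[@id="link1"]/../@class')  # same parent lookup, matching by id
print(result4)
result5 = html.xpath('//a[@id="link1"]/parent::*/@class')  # parent:: axis, equivalent to ..
print(result5)
result6 = html.xpath('//a[@id="link1"]')  # attribute matching returns the element itself
print(result6)
result7 = html.xpath('//p[@class="story"]/a/text()')  # text of each a node; link1 holds only a comment, so it adds no text
print(result7)
result8 = html.xpath('//p[@class="story"]//text()')  # all text beneath the p nodes, including whitespace and punctuation
print(result8)
result9 = html.xpath('//p/a/@href')  # href attribute values
print(result9)
result9 = html.xpath('//p/a/@class')  # class attribute values
print(result9)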
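Printing a node query such as //a shows only a list of Element objects, while attribute and text queries return strings. The sketch below, which assumes a shortened copy of the story paragraph passed in as a string rather than read from index.html, shows one way to read the tag, attributes, and text of each returned element.

```python
from lxml import etree

# Shortened copy of the story paragraph, embedded as a string for this sketch.
text = '''<p class="story">
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>'''
root = etree.HTML(text)

# A node query returns a list of Element objects.
links = root.xpath('//a[@class="sister"]')
for a in links:
    # .tag is the element name, .get() reads an attribute, .text is the leading text.
    print(a.tag, a.get('id'), a.get('href'), a.text)

# An attribute query returns a list of strings instead of elements.
ids = root.xpath('//a/@id')
print(ids)  # ['link2', 'link3']
```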