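Introduction to XPath
XPath, short for XML Path Language, is a language for locating information in XML documents, and it works equally well for searching HTML documents. In web scraping it can be used to extract the information you need. It offers concise path selection expressions and more than 100 built-in functions for matching strings, numbers, and times and for processing nodes and sequences, so almost any node we want to locate can be selected with XPath.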
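XPath rules:
Expression    Description
nodename      Selects all child nodes of the named node
/             Selects direct children of the current node
//            Selects descendants of the current node
.             Selects the current node
..            Selects the parent of the current node
@             Selects attributes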
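As a rough illustration of the rules above, the following sketch runs each expression with lxml. The HTML snippet and the variable names are made up for this example and are not the test document used in the rest of this article.

```python
from lxml import etree

# Hypothetical snippet: one <a> directly under <div>, one nested inside <p>.
doc = etree.HTML('<div id="box"><a>one</a><p><a>two</a></p></div>')
div = doc.xpath('//div')[0]

direct = div.xpath('./a/text()')      # "/"  : direct children of the current node
nested = div.xpath('.//a/text()')     # "//" : all descendants of the current node
parent = div.xpath('./a/..')[0].tag   # ".." : parent of the selected <a>
attr = doc.xpath('//div/@id')         # "@"  : attribute value

print(direct, nested, parent, attr)
```

The difference between `/` and `//` shows up in the first two results: only the nested query reaches the `a` element inside `p`.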
XPath usage:
A simple example of parsing a web page with XPath

from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
# Build the XPath parsing object (this also repairs the HTML text automatically)
html = etree.HTML(text)
# If the HTML is stored in a file, replace the line above with:
#   html = etree.parse('./test.html', etree.HTMLParser())
# Serialize the repaired HTML
result = etree.tostring(html)
# tostring() returns bytes; decode it to str
print(result.decode('utf-8'))

Result:
<html><body><div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
</body></html>
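To make the automatic repair easier to see, here is a minimal sketch that passes etree.HTML a fragment with unclosed li tags. The fragment is hypothetical, and the exact whitespace of the serialized output may depend on the libxml2 version, so the sketch only inspects the structure.

```python
from lxml import etree

# Hypothetical broken fragment: neither <li> is closed.
broken = '<ul><li>first<li>second</ul>'

root = etree.HTML(broken)                     # parser wraps and repairs the markup
fixed = etree.tostring(root).decode('utf-8')  # bytes -> str
items = root.xpath('//li/text()')             # both items are now separate nodes

print(fixed)
print(items)
```

The parser adds the missing html and body wrappers and closes each li, which is why XPath queries work even on untidy pages.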
Test document (test.html)

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
</body>
</html>
Selecting all nodes

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# Get every node in the HTML (* matches any node); the result is a list of elements
result = html.xpath('//*')
print(result)

Result:
[<Element html at 0x2cb6748>, <Element head at 0x2cb6848>, <Element meta at 0x2cb6888>, <Element title at 0x2cb68c8>, <Element body at 0x2cb6908>, <Element div at 0x2cb6988>, <Element ul at 0x2cb69c8>, <Element li at 0x2cb6a08>, <Element a at 0x2cb6a48>, <Element li at 0x2cb6948>, <Element a at 0x2cb6a88>, <Element li at 0x2cb6ac8>, <Element a at 0x2cb6b08>, <Element li at 0x2cb6b48>, <Element a at 0x2cb6b88>, <Element li at 0x2cb6bc8>, <Element a at 0x
Selecting specific nodes

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# Select all li nodes with //li
result = html.xpath('//li')
print(result)

Result:
[<Element li at 0x2c96848>, <Element li at 0x2c96888>, <Element li at 0x2c968c8>, <Element li at 0x2c96908>, <Element li at 0x2c96948>]
Selecting child nodes

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# Get every a node that is a direct child of an li node
result = html.xpath('//li/a')
print(result)

Result:
[<Element a at 0x2cc6848>, <Element a at 0x2cc6888>, <Element a at 0x2cc68c8>, <Element a at 0x2cc6908>, <Element a at 0x2cc6948>]

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# Get every a node that is a descendant of the ul node (// is not limited to direct children)
result = html.xpath('//ul//a')
print(result)

Result:
[<Element a at 0x2ca6848>, <Element a at 0x2ca6888>, <Element a at 0x2ca68c8>, <Element a at 0x2ca6908>, <Element a at 0x2ca6948>]
Selecting parent nodes

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# Find the a node whose href is link4.html, then get the class attribute of its parent (using ..)
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)

Result:
['item-1']

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# Same query, written with the parent:: axis instead of ..
result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result)

Result:
['item-1']
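The parent can also be reached from Python once an element has been selected. The sketch below uses lxml's getparent() and get() methods on a shortened, hypothetical version of the list, as an alternative to writing `..` or `parent::*` in the expression itself.

```python
from lxml import etree

# Shortened, hypothetical version of the test list with a single item.
text = '<ul><li class="item-1"><a href="link4.html">fourth item</a></li></ul>'
html = etree.HTML(text)

a = html.xpath('//a[@href="link4.html"]')[0]   # the matched <a> element
parent = a.getparent()                          # its parent <li> element
via_xpath = html.xpath('//a[@href="link4.html"]/../@class')

print(parent.tag, parent.get('class'), via_xpath)
```

Both routes give the same class value; the XPath form is handy when only the attribute is needed, while getparent() is convenient when further work on the parent element follows.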
Selecting nodes by attribute match

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# Get the li nodes whose class attribute is item-0
result = html.xpath('//li[@class="item-0"]')
print(result)

Result:
[<Element li at 0x2c8f848>, <Element li at 0x2c8f888>]
Getting node text

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# Get the text of li nodes whose class is item-0, stepping down level by level
result = html.xpath('//li[@class="item-0"]/a/text()')
print(result)

Result:
['first item', 'fifth item']

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# Use //text() to collect the text of all descendants of li nodes whose class is item-0
result = html.xpath('//li[@class="item-0"]//text()')
print(result)

Result:
['first item', 'fifth item']

Getting attributes (@attribute name)

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# Get the href attribute of every a node directly under an li node
result = html.xpath('//li/a/@href')
print(result)

Result:
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
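In the test document the two text queries return the same list because each li holds nothing but its a child. The following sketch, which uses a hypothetical li that also carries its own trailing text, shows where `/text()` and `//text()` start to differ.

```python
from lxml import etree

# Hypothetical item: the <li> has text of its own after the <a> element.
text = '<ul><li class="item-0"><a href="link1.html">first item</a> (new)</li></ul>'
html = etree.HTML(text)

own_text = html.xpath('//li[@class="item-0"]/text()')     # text directly inside <li>
link_text = html.xpath('//li[@class="item-0"]/a/text()')  # text directly inside <a>
all_text = html.xpath('//li[@class="item-0"]//text()')    # text of <li> and all descendants

print(own_text, link_text, all_text)
```

Stepping down level by level gives precise results, while `//text()` gathers everything and may need cleaning afterwards.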
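Matching an attribute with multiple values (one attribute holding several values)

from lxml import etree

text = '''<li class="li li-first"><a href="link.html">first item</a></li>'''
html = etree.HTML(text)
# class holds two values, so match with contains() instead of an exact comparison
result = html.xpath('//li[contains(@class,"li")]/a/text()')
print(result)

Result:
['first item']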
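The sketch below compares three ways of matching a multi-valued class. Note that contains() is a plain substring test, so it also matches unrelated values such as "list". The second li and the whitespace-padded token test are additions for this illustration (the latter is a common XPath 1.0 idiom), not something from the example above.

```python
from lxml import etree

# Hypothetical list: the second <li> has an unrelated class that contains "li".
text = '''<ul>
<li class="li li-first"><a href="link.html">first item</a></li>
<li class="list"><a href="other.html">other item</a></li>
</ul>'''
html = etree.HTML(text)

# Exact comparison fails because the attribute value is "li li-first".
exact = html.xpath('//li[@class="li"]/a/text()')
# Substring test matches both "li li-first" and "list".
substring = html.xpath('//li[contains(@class,"li")]/a/text()')
# Token test: pad the value with spaces and look for " li " as a whole word.
token = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(exact, substring, token)
```

When class names on a page share prefixes, the token form is usually the safer choice.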
Matching on multiple attributes (several attributes together identify one node)

from lxml import etree

text = '''<li class="li li-first" name="item"><a href="link.html">first item</a></li>'''
html = etree.HTML(text)
# Combine both conditions with the and operator
result = html.xpath('//li[contains(@class,"li") and @name="item"]/a/text()')
print(result)

Result:
['first item']
Selecting by position (an expression may match several nodes when you only want one of them, such as the second or the last)

from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
html = etree.HTML(text)
# Get the first li node (XPath positions start at 1)
result = html.xpath('//li[1]/a/text()')
print(result)
# Get the last li node
result = html.xpath('//li[last()]/a/text()')
print(result)
# Get the li nodes whose position is less than 3, using position()
result = html.xpath('//li[position()<3]/a/text()')
print(result)
# Get the third li node from the end, using last()
result = html.xpath('//li[last()-2]/a/text()')
print(result)

Result:
['first item']
['fifth item']
['first item', 'second item']
['third item']
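One detail worth knowing about positional predicates: in `//li[1]` the position is counted among the siblings under each parent, not across the whole document. The sketch below uses a hypothetical page with two lists to show the difference, and how parentheses select by position in the overall result instead.

```python
from lxml import etree

# Hypothetical page with two separate lists.
text = '<div><ul><li>a1</li><li>a2</li></ul><ul><li>b1</li><li>b2</li></ul></div>'
html = etree.HTML(text)

first_per_list = html.xpath('//li[1]/text()')      # first li under each ul
first_overall = html.xpath('(//li)[1]/text()')     # first li in the whole document
last_overall = html.xpath('(//li)[last()]/text()') # last li in the whole document

print(first_per_list, first_overall, last_overall)
```

With a single list, as in the example above, both forms give the same answer, which is why the distinction is easy to miss.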