xpath

Latest recommended article published on 2024-01-12 16:49:17

weixin_30828379

Original link: http://www.cnblogs.com/peng-zhao/p/10706287.html

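XPath syntax

XPath uses path expressions to select nodes or node-sets in an XML document. A node is selected by following a path, which is made up of steps.

Selecting nodes: XPath provides six basic expressions for selecting nodes, and they can be combined freely.

1. nodename (a node name such as div, a, or book): selects the child elements of the current node that have this name;

2. / : selects from the root node;

3. // : selects matching nodes anywhere in the document (or anywhere below the current node when used inside a path), regardless of their position;

4. . : selects the current node;

5. .. : selects the parent of the current node;

6. @ : selects attributes.

Example: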

from lxml import etree

doc = """
<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
    <body>
        <bookstore id="test" class="ttt">
        <book id= "1" class = "2">
          <title lang="eng">Harry Potter</title>
          <price>29.99</price>
        </book>
        <book id = "2222222222222">11111111111111111111
          <title lang="abc">Learning XML</title>
          <price>39.95</price>
        </book>
        </bookstore>
    <a></a>
    </body>
</html>
"""

html = etree.HTML(doc)
print(html.xpath("body"))  
# result: [<Element body at 0x24ea98109c8>]

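# "/" searches only from the document root; "//" searches the whole document for every matching tag
print(html.xpath("/bookstore"))    # an absolute path must start at the root element, which is <html> here, so nothing matches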
# result: []

print(html.xpath("//bookstore"))   # find every match in the whole document
# result: [<Element bookstore at 0x241ba69f948>]

print(html.xpath("//bookstore[@class='ttt']//book"))   # every <book> at any depth below the matching <bookstore>
# result: [<Element book at 0x13ddc771948>, <Element book at 0x13ddc771988>]

print(html.xpath("//book"))
# result: [<Element book at 0x13ddc771948>, <Element book at 0x13ddc771988>]

print(html.xpath("//*"))  # * is a wildcard that matches any element
# result: [<Element html at 0x13ddc771848>, <Element body at 0x13ddc771948>, <Element bookstore at 0x13ddc771988>, <Element book at 0x13ddc7719c8>, <Element title at 0x13ddc771a08>, <Element price at 0x13ddc771a88>, <Element book at 0x13ddc771ac8>, <Element title at 0x13ddc771b08>, <Element price at 0x13ddc771b48>, <Element a at 0x13ddc771a48>]
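
The example above does not exercise `.`, `..`, or `@`. A minimal sketch of those three follows. It assumes a trimmed copy of the bookstore document and parses it as XML with `etree.fromstring` (rather than `etree.HTML`) so that no `<html>`/`<body>` wrapper is added; the variable names are illustrative only.

```python
from lxml import etree

# Assumption: a trimmed, well-formed copy of the bookstore document
xml = (
    '<bookstore id="test">'
    '<book id="1"><title lang="eng">Harry Potter</title><price>29.99</price></book>'
    '<book id="2"><title lang="abc">Learning XML</title><price>39.95</price></book>'
    '</bookstore>'
)
root = etree.fromstring(xml)

first_title = root.xpath("//title")[0]

# "." is the current node: the expression is evaluated relative to first_title
same_node = first_title.xpath(".")
# ".." is the parent of the current node
parent = first_title.xpath("..")
# "@" selects attributes; lxml returns the attribute values as strings
langs = root.xpath("//title/@lang")
book_ids = root.xpath("//book/@id")

print(same_node[0].tag)  # title
print(parent[0].tag)     # book
print(langs)             # ['eng', 'abc']
print(book_ids)          # ['1', '2']
```

Calling `xpath()` on an element rather than on the whole tree is what makes `.` and `..` useful: the element becomes the current node for that expression.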

To make queries more precise, XPath also provides predicates. A predicate is a restricting condition and is written inside square brackets.

# Select by index (XPath positions start at 1, not 0)
print(html.xpath("//bookstore/book[1]/title/text()"))  # the first book
# result: ['Harry Potter']
print(html.xpath("//bookstore/book[last()-1]/title/text()"))  # last() is the last one; last()-1 is the second to last
# result: ['Harry Potter']
print(html.xpath("//bookstore/book[position()>1]/title/text()"))  # positions greater than 1
# result: ['Learning XML']

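# Filter by attribute
# Match any element that has a lang attribute
print(html.xpath("//*[@lang]"))
# result: [<Element title at 0x26867410948>, <Element title at 0x26867410908>]

# Match any element that has at least one attribute: @ means attribute, * is a wildcard
print(html.xpath("//*[@*]"))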
#result: [<Element bookstore at 0x224c1d41a08>, <Element book at 0x224c1d41a48>, <Element title at 0x224c1d419c8>, <Element book at 0x224c1d41a88>, <Element title at 0x224c1d41ac8>]
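
Predicates can also compare values, which the snippet above does not show. The sketch below is a hedged illustration and not part of the original article: it assumes a trimmed copy of the bookstore document parsed as XML with `etree.fromstring`, and the variable names are made up for the example.

```python
from lxml import etree

# Assumption: a trimmed, well-formed copy of the bookstore document
xml = (
    '<bookstore id="test">'
    '<book id="1"><title lang="eng">Harry Potter</title><price>29.99</price></book>'
    '<book id="2"><title lang="abc">Learning XML</title><price>39.95</price></book>'
    '</bookstore>'
)
root = etree.fromstring(xml)

# Attribute value test: titles whose lang attribute equals "eng"
eng_titles = root.xpath("//title[@lang='eng']/text()")
# Child value test: the text of <price> is converted to a number for the comparison
expensive = root.xpath("//book[price>35]/title/text()")
# Predicates can be chained; each one filters the result of the previous one
chained = root.xpath("//book[@id][price<35]/title/text()")

print(eng_titles)  # ['Harry Potter']
print(expensive)   # ['Learning XML']
print(chained)     # ['Harry Potter']
```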

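When several alternative paths are acceptable, join them with "|"; the result is the union of the nodes matched by each path.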

# Several alternative paths
print(html.xpath("//title|//price"))
# result： [<Element title at 0x1d6f4ec0b48>, <Element price at 0x1d6f4ec0a48>, <Element title at 0x1d6f4ec0bc8>, <Element price at 0x1d6f4ec0a88>]
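
The union also works on text nodes, and the same selection can be written with a `self::` test inside a predicate. The following is a small sketch under the same assumptions as before (a trimmed bookstore document parsed with `etree.fromstring`; names are illustrative).

```python
from lxml import etree

# Assumption: a trimmed, well-formed copy of the bookstore document
xml = (
    '<bookstore id="test">'
    '<book id="1"><title lang="eng">Harry Potter</title><price>29.99</price></book>'
    '<book id="2"><title lang="abc">Learning XML</title><price>39.95</price></book>'
    '</bookstore>'
)
root = etree.fromstring(xml)

# Union of two paths; the result comes back in document order
values = root.xpath("//title/text() | //price/text()")

# An alternative spelling of "title or price" using a predicate
tags = [e.tag for e in root.xpath("//*[self::title or self::price]")]

print(values)  # ['Harry Potter', '29.99', 'Learning XML', '39.95']
print(tags)    # ['title', 'price', 'title', 'price']
```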

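XPath also provides axes. An axis defines a node-set relative to the current node.

1. ancestor: selects all ancestors (parent, grandparent, etc.) of the current node.

2. ancestor-or-self: selects all ancestors (parent, grandparent, etc.) of the current node and the current node itself.

3. attribute: selects all attributes of the current node.

4. child: selects all children of the current node.

5. descendant: selects all descendants (children, grandchildren, etc.) of the current node.

6. descendant-or-self: selects all descendants (children, grandchildren, etc.) of the current node and the current node itself.

7. following: selects everything in the document after the closing tag of the current node.

8. following-sibling: selects all siblings after the current node.

9. namespace: selects all namespace nodes of the current node.

10. parent: selects the parent of the current node.

11. preceding: selects all nodes that appear before the current node in the document, except its ancestors, attribute nodes, and namespace nodes.

12. preceding-sibling: selects all siblings before the current node.

13. self: selects the current node.

Example: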

from lxml import etree

doc = """
<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
    <body>
        <bookstore id="test" class="ttt">

        <book id= "1" class = "2">
          <title lang="eng">Harry Potter</title>
          <price>29.99</price>
        </book>

        <book id = "2222222222222">11111111111111111111
          <title lang="abc">Learning XML</title>
          <price>39.95</price>
        </book>

        </bookstore>
    <a></a>
    </body>
</html>
"""

html = etree.HTML(doc)

# Axes
print(html.xpath("//bookstore/ancestor::*"))  # all ancestors
# result: [<Element html at 0x1fbeac80848>, <Element body at 0x1fbeac80948>]

print(html.xpath("//bookstore/ancestor::body"))  # all ancestors named body
# result: [<Element body at 0x203f46f0988>]

print(html.xpath("//bookstore/ancestor-or-self::*"))  # all ancestors, plus the node itself
# result: [<Element html at 0x203f46f0848>, <Element body at 0x203f46f0948>, <Element bookstore at 0x203f46f09c8>]
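
The example above only covers the ancestor axes. A hedged sketch of a few others (child, parent, attribute, descendant, and the sibling axes) is given below. It assumes a trimmed copy of the bookstore document parsed as XML with `etree.fromstring`; the variable names are illustrative.

```python
from lxml import etree

# Assumption: a trimmed, well-formed copy of the bookstore document
xml = (
    '<bookstore id="test">'
    '<book id="1"><title lang="eng">Harry Potter</title><price>29.99</price></book>'
    '<book id="2"><title lang="abc">Learning XML</title><price>39.95</price></book>'
    '</bookstore>'
)
root = etree.fromstring(xml)
first_book, second_book = root.xpath("//book")

# child::* is the long form of "*": all child elements
children = [e.tag for e in first_book.xpath("child::*")]
# parent::* is the long form of ".."
parent_tag = first_book.xpath("parent::*")[0].tag
# attribute::* is the long form of "@*"
attrs = first_book.xpath("attribute::*")
# Sibling axes look at nodes that share the same parent
next_ids = first_book.xpath("following-sibling::book/@id")
prev_ids = second_book.xpath("preceding-sibling::book/@id")
# descendant:: searches children, grandchildren, and so on
titles = root.xpath("descendant::title/text()")

print(children)    # ['title', 'price']
print(parent_tag)  # bookstore
print(attrs)       # ['1']
print(next_ids)    # ['2']
print(prev_ids)    # ['1']
print(titles)      # ['Harry Potter', 'Learning XML']
```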

Appendix:

tag = html.xpath('//ul[@class="gl-warp clearfix"]/li/div/div[@class="p-img"]/a/img/@src')
# The two-step version below captures the same content, but only for the first child of the <ul>
tags = html.xpath('//ul[@class="gl-warp clearfix"]/child::*')
tag = tags[0]
imgpath = tag.xpath('./div/div[@class="p-img"]/a/img/@src')
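
To make the comparison concrete, here is a runnable sketch of both approaches. The page fragment is hypothetical: it only imitates the structure that the paths above expect, and `a.jpg` and `b.jpg` are placeholder values.

```python
from lxml import etree

# Hypothetical fragment shaped like the list the paths above expect
page = """
<ul class="gl-warp clearfix">
  <li><div><div class="p-img"><a href="#"><img src="a.jpg"/></a></div></div></li>
  <li><div><div class="p-img"><a href="#"><img src="b.jpg"/></a></div></div></li>
</ul>
"""
html = etree.HTML(page)

# One absolute path that walks all the way down to the src attributes
all_srcs = html.xpath(
    '//ul[@class="gl-warp clearfix"]/li/div/div[@class="p-img"]/a/img/@src'
)

# Two steps: first select the children of the <ul>, then query each one
# with a relative path that starts at "."
items = html.xpath('//ul[@class="gl-warp clearfix"]/child::*')
per_item = [
    src
    for item in items
    for src in item.xpath('./div/div[@class="p-img"]/a/img/@src')
]

print(all_srcs)  # ['a.jpg', 'b.jpg']
print(per_item)  # ['a.jpg', 'b.jpg']
```

The two-step form is convenient when several fields must be extracted per item, since each relative query stays scoped to a single child of the list.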