lxml and XPath (04)

Spark_zzz

Last modified: 2022-11-13 23:53:35

First published: 2022-11-10 23:41:21

Article link: https://blog.csdn.net/m0_60255954/article/details/127798520

Selecting nodes by position

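XPath has built-in functions for positional selection: position() returns the position of the current node within the node set being filtered, and last() returns the position of the last node in that set. Positions start at 1, not 0.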

from lxml import etree
text="""
<div>
    <a href="https://geekori.com">geekori.com</a>
    <a href="https://www.jd.com">京东商城</a>
    <a href="https://www.taobao.com">淘宝</a>
    <a href="https://www.google.com">谷歌</a>
    <a href="https://www.microsoft.com">微软</a>
</div>
    """
html=etree.HTML(text)
# Select the first <a> node
a1=html.xpath('//a[1]/text()')
# Select the second <a> node
a2=html.xpath('//a[2]/text()')
# Output: ['geekori.com'] ['京东商城']
print(a1,a2)
# Select the last <a> node
lasta=html.xpath('//a[last()]/text()')
# Output: ['微软']
print(lasta)
# Select the <a> nodes whose position is greater than 3
aList=html.xpath('//a[position()>3]/text()')
# Output: ['谷歌', '微软']
print(aList)
# Select the second <a> node and the second-to-last <a> node
aList=html.xpath('//a[position()=2 or position()=last()-1]/text()')
# Output: ['京东商城', '谷歌']
print(aList)
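
One point worth keeping in mind with positional predicates: `//a[1]` selects every `<a>` that is the first `<a>` child of its own parent, not the first `<a>` in the whole document. The example above has a single parent, so the difference does not show. The sketch below uses a small hypothetical snippet (not taken from the article) with two `<div>` parents, and wraps the path in parentheses to index into the complete result instead.

```python
from lxml import etree

# Hypothetical markup for illustration: two parents, each with its own <a> children.
snippet = """
<div><a>one</a><a>two</a></div>
<div><a>three</a><a>four</a></div>
"""
doc = etree.HTML(snippet)

# [1] is evaluated per parent, so this returns the first <a> of each <div>.
first_per_parent = doc.xpath('//a[1]/text()')

# Parentheses apply the predicate to the whole node set instead.
first_overall = doc.xpath('(//a)[1]/text()')
last_overall = doc.xpath('(//a)[last()]/text()')

print(first_per_parent)  # ['one', 'three']
print(first_overall)     # ['one']
print(last_overall)      # ['four']
```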

Selecting nodes with axes

XPath provides many axes for selecting nodes relative to the current node, including ancestors, siblings, and descendants.

Defining the HTML

from lxml import etree
text="""
<html>
<head>
    <meta charset="UTF-8">
    <title>XPath演示</title>
</head>
<body class="item">
<div>
    <ul class="item">
        <li class="item1"><a href="https://geekori.com">geekori.com</a></li>
        <li class="item2"><a href="https://www.jd.com">京东商城</a>
                            <value url="https://geekori.com"/>
                            <value url="https://google.com"/>
        </li>
        <li class="item3"><a href="https://www.taobao.com">淘宝</a>
                        <a href="https://www.tmall.com">天猫</a></li>
        <li class="item4" value="1234"><a href="https://www.google.com">谷歌</a></li>
        <li class="item5"><a href="https://www.microsoft.com">微软</a></li>
    </ul>
</div>
</body>
</html>
"""
html=etree.HTML(text)

Using the ancestor axis

# The ancestor axis returns all ancestors of a node. The axis name must be
# followed by two colons (::) and then a node selector.
# Here * matches every element.
result=html.xpath('//li[1]/ancestor::*')
# Output: html body div ul
for value in result:
    print(value.tag,end=' ')
print()

# Use the ancestor axis to match only the ancestors whose class attribute is "item"
result=html.xpath('//li[1]/ancestor::*[@class="item"]')
# Output: body ul
for value in result:
    print(value.tag,end=' ')
print()
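
The results above come back in document order (outermost ancestor first), but a positional predicate on `ancestor` counts from the nearest ancestor outward, because it is a reverse axis. The following is a minimal sketch on a hypothetical snippet (not the article's HTML) showing that `ancestor::*[1]` picks the closest ancestor, alongside lxml's `iterancestors()`, which walks the same nodes from nearest to farthest.

```python
from lxml import etree

# Hypothetical markup for illustration; etree.HTML adds <html> and <body> around it.
snippet = '<div class="item"><ul class="item"><li>x</li></ul></div>'
doc = etree.HTML(snippet)
li = doc.xpath('//li')[0]

# Result lists are returned in document order.
all_ancestors = [e.tag for e in li.xpath('ancestor::*')]

# Inside a predicate, positions on a reverse axis count from the nearest node.
nearest = [e.tag for e in li.xpath('ancestor::*[1]')]

# lxml's own iterator yields ancestors from nearest to farthest.
walk = [e.tag for e in li.iterancestors()]

print(all_ancestors)  # ['html', 'body', 'div', 'ul']
print(nearest)        # ['ul']
print(walk)           # ['ul', 'div', 'body', 'html']
```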

Using the attribute axis

# Use the attribute axis to get all attribute values of the fourth <li> node
result=html.xpath('//li[4]/attribute::*')
# Output: ['item4', '1234']
print(result)
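
The attribute axis has a common shorthand, `@`, so `attribute::*` and `@*` select the same nodes. The values lxml returns are string subclasses that also remember which attribute they came from. A minimal sketch, using a hypothetical one-item list rather than the article's HTML:

```python
from lxml import etree

# Hypothetical markup for illustration.
doc = etree.HTML('<ul><li class="item4" value="1234">x</li></ul>')

long_form = doc.xpath('//li/attribute::*')
short_form = doc.xpath('//li/@*')  # abbreviated syntax for the same axis

# Each result is a "smart string": attrname holds the attribute's name.
names = [v.attrname for v in long_form]

# If name/value pairs are needed, the element's attrib mapping is more direct.
pairs = dict(doc.xpath('//li')[0].attrib)

print(long_form)   # ['item4', '1234']
print(short_form)  # ['item4', '1234']
print(names)       # ['class', 'value']
print(pairs)       # {'class': 'item4', 'value': '1234'}
```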

Using the child axis

# Use the child axis to get all child elements of the third <li> node
result=html.xpath('//li[3]/child::*')
# Output: https://www.taobao.com 淘宝 https://www.tmall.com 天猫
for value in result:
    print(value.get('href'),value.text,end=' ')
print()
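
`child` is the default axis in XPath, so `child::a` can be written as plain `a`. Note also that `child::*` matches elements only; text directly inside the node is reached with `child::text()`. The sketch below illustrates both points on a hypothetical snippet (the `u1`/`u2` values are placeholders, not real URLs).

```python
from lxml import etree

# Hypothetical markup for illustration: a text node followed by two <a> children.
doc = etree.HTML('<ul><li>text<a href="u1">A</a><a href="u2">B</a></li></ul>')

# Explicit axis and abbreviated form select the same elements.
explicit = [a.get('href') for a in doc.xpath('//li/child::a')]
implicit = [a.get('href') for a in doc.xpath('//li/a')]

# child::* returns element children only; text nodes need text().
element_tags = [e.tag for e in doc.xpath('//li/child::*')]
direct_text = doc.xpath('//li/child::text()')

print(explicit)      # ['u1', 'u2']
print(implicit)      # ['u1', 'u2']
print(element_tags)  # ['a', 'a']
print(direct_text)   # ['text']
```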

Using the descendant axis

# Use the descendant axis to get all descendants named value of the second <li> node
result=html.xpath('//li[2]/descendant::value')
# Output: https://geekori.com https://google.com
for value in result:
    print(value.get('url'),end=' ')
print()
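
To make the difference between `child` and `descendant` concrete, here is a small sketch on a hypothetical snippet in which one `<value>` element is nested one level deeper than the other. The `u1`/`u2` values and the extra `<span>` are assumptions made for the illustration.

```python
from lxml import etree

# Hypothetical markup: one <value> is wrapped in a <span>, the other is a direct child.
doc = etree.HTML(
    '<ul><li><a>x</a>'
    '<span><value url="u1"></value></span>'
    '<value url="u2"></value>'
    '</li></ul>'
)

# child:: only looks one level down.
child_only = doc.xpath('//li/child::value/@url')

# descendant:: searches the whole subtree below the context node.
all_levels = doc.xpath('//li/descendant::value/@url')

# The // abbreviation between steps reaches the same nodes here.
abbreviated = doc.xpath('//li//value/@url')

print(child_only)   # ['u2']
print(all_levels)   # ['u1', 'u2']
print(abbreviated)  # ['u1', 'u2']
```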

Using the following axis

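# Use the following axis to get every element that comes after the first <li> node
# in document order (later siblings and all of their descendants)
result=html.xpath('//li[1]/following::*')
# Output: li a value value li a a li a li a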
for value in result:
    print(value.tag,end=' ')
print()

Using the following-sibling axis

# Use the following-sibling axis to get all nodes after the first <li> node that share its parent
result=html.xpath('//li[1]/following-sibling::*')
# Output: li li li li
for value in result:
    print(value.tag,end=' ')
print()
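
As a side-by-side comparison of the last two axes, the sketch below runs both on a hypothetical three-item list (the `id` values are made up for the illustration). It also shows that `following-sibling::*[1]` selects just the next sibling, which matches what lxml's `getnext()` returns.

```python
from lxml import etree

# Hypothetical markup for illustration.
doc = etree.HTML(
    '<ul>'
    '<li id="a"><a>1</a></li>'
    '<li id="b"><a>2</a></li>'
    '<li id="c"><a>3</a></li>'
    '</ul>'
)
first = doc.xpath('//li[1]')[0]

# following:: includes later siblings and everything nested inside them.
following_tags = [e.tag for e in first.xpath('following::*')]

# following-sibling:: stays on the same level as the context node.
sibling_ids = [e.get('id') for e in first.xpath('following-sibling::*')]

# [1] on this forward axis is the immediately following sibling.
next_id = first.xpath('following-sibling::*[1]')[0].get('id')
getnext_id = first.getnext().get('id')

print(following_tags)       # ['li', 'a', 'li', 'a']
print(sibling_ids)          # ['b', 'c']
print(next_id, getnext_id)  # b b
```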