Crawler Handbook 01: Using XPath

Using XPath

Goal: list commonly used XPath selectors for easy reference later.

The test.html file used by the code below:

<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>

1. All nodes

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//*')
print(result)

Output:

[<Element html at 0x224bd79a648>, <Element body at 0x224bd9293c8>, <Element div at 0x224bd929408>, <Element ul at 0x224bd929488>, <Element li at 0x224bd929688>, <Element a at 0x224bd9296c8>, <Element li at 0x224bd929748>, <Element a at 0x224bd929788>, <Element li at 0x224bd9297c8>, <Element a at 0x224bd929588>, <Element li at 0x224bd929808>, <Element a at 0x224bd929848>, <Element li at 0x224bd929888>, <Element a at 0x224bd9298c8>]
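
The element objects printed above are hard to scan. As a minimal sketch, assuming the same markup is inlined as a string and parsed with etree.HTML instead of being read from test.html (the variable name tags is only illustrative), the tag names can be printed instead:

```python
from lxml import etree

# Same markup as test.html, inlined so the snippet runs on its own.
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)

# '//*' matches every element; .tag gives the element name as a string.
tags = [element.tag for element in html.xpath('//*')]
print(tags)
```

The html and body entries appear because the HTML parser wraps the fragment in a complete document.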

2. Child nodes

2.1 Direct children

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a')
print(result)

Output:

[<Element a at 0x2a16da09288>, <Element a at 0x2a16da092c8>, <Element a at 0x2a16da09348>, <Element a at 0x2a16da09548>, <Element a at 0x2a16da09448>]

2.2 Descendant nodes

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//ul//a')
print(result)

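Output: the same as above. The first query finds a father's sons (direct children of li); this one finds a grandfather's grandsons (descendants of ul at any depth).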

[<Element a at 0x223c6616448>, <Element a at 0x223c6616488>, <Element a at 0x223c6616508>, <Element a at 0x223c6616708>, <Element a at 0x223c6616608>]

But a query written like this finds nothing:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//ul/a')
print(result)

Output: empty, because ul has no a elements among its direct children.

[]
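
The three queries can be compared side by side. This is only a sketch on a cut-down, one-item list (the markup and the variable names child, descendant and missing are assumptions for illustration):

```python
from lxml import etree

html = etree.HTML('<ul><li><a href="link1.html">first item</a></li></ul>')

# '/' steps to direct children only; '//' searches all descendants.
child = html.xpath('//li/a/@href')        # a is a child of li
descendant = html.xpath('//ul//a/@href')  # a is a descendant of ul
missing = html.xpath('//ul/a/@href')      # a is not a direct child of ul

print(child)
print(descendant)
print(missing)
```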

3. Parent node

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)

Output:

['item-1']

The same query can also be written with the parent:: axis; the result is identical.

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result)
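
A quick way to confirm that the two spellings agree is to run both on the same input. The sketch below assumes a one-item fragment rather than test.html, and the names short_form, axis_form and link are illustrative. It also shows getparent(), which lxml provides on elements as an alternative to going through XPath:

```python
from lxml import etree

html = etree.HTML(
    '<ul><li class="item-1"><a href="link4.html">fourth item</a></li></ul>'
)

# '..' is the abbreviated form of 'parent::node()'.
short_form = html.xpath('//a[@href="link4.html"]/../@class')
axis_form = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(short_form, axis_form)

# lxml elements also expose their parent directly.
link = html.xpath('//a[@href="link4.html"]')[0]
parent_class = link.getparent().get('class')
print(parent_class)
```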

4. Attribute matching

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]')
print(result)

Output:

[<Element li at 0x19868d59408>, <Element li at 0x19868d59488>]

5. Getting text

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/a/text()')
print(result)

Output:

['first item', 'fifth item']

6. Getting attributes

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a/@href')
print(result)

Output:

['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
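
When both the link target and its label are needed, one option is to select each a element once and then query it with relative paths. A minimal sketch, assuming a shortened two-item list (the markup and the names links, href and label are illustrative):

```python
from lxml import etree

text = '''
<ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
</ul>
'''
html = etree.HTML(text)

# Select each a element once, then query it with paths relative to it ('./').
links = []
for a in html.xpath('//li/a'):
    href = a.xpath('./@href')[0]
    label = a.xpath('./text()')[0]
    links.append((href, label))
print(links)
```

Querying per element keeps each href paired with its own text, which two separate document-wide queries would not guarantee if some a element had no text.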

7. Matching multi-valued attributes

from lxml import etree

text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)

Output:

['first item']
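
The reason contains() is needed here is that an exact comparison tests the whole attribute value. The sketch below reuses the same li element (the variable names exact, substring and token are illustrative) and adds a commonly used idiom for matching a whole class token, since contains() is only a substring test:

```python
from lxml import etree

html = etree.HTML('<li class="li li-first"><a href="link.html">first item</a></li>')

# Exact comparison fails: the attribute value is "li li-first", not "li".
exact = html.xpath('//li[@class="li"]/a/text()')

# contains() is a substring test, so it would also match class="list".
substring = html.xpath('//li[contains(@class, "li")]/a/text()')

# Pad the value with spaces and look for " li " to match a whole token.
token = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(exact)
print(substring)
print(token)
```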

8. Matching multiple attributes

from lxml import etree

text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)

Output:

['first item']

9. Selecting by position

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
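html = etree.HTML(text)
# First li (XPath positions start at 1, not 0)
result = html.xpath('//li[1]/a/text()')
print(result)
# Last li
result = html.xpath('//li[last()]/a/text()')
print(result)
# li elements whose position is less than 3, i.e. the first two
result = html.xpath('//li[position()<3]/a/text()')
print(result)
# Third li from the end
result = html.xpath('//li[last()-2]/a/text()')
print(result)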

Output:

['first item']
['fifth item']
['first item', 'second item']
['third item']
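
One detail worth keeping in mind: a position predicate is evaluated within each parent, not over the whole document. The example above has a single ul, so the difference does not show. A small sketch with two lists (the markup and the names per_parent and overall are assumptions for illustration):

```python
from lxml import etree

text = '''
<div>
    <ul><li>a1</li><li>a2</li></ul>
    <ul><li>b1</li><li>b2</li></ul>
</div>
'''
html = etree.HTML(text)

# Positions are counted within each parent,
# so //li[1] returns the first li of every ul.
per_parent = html.xpath('//li[1]/text()')

# Parentheses apply the index to the combined result set instead.
overall = html.xpath('(//li)[1]/text()')

print(per_parent)
print(overall)
```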

10. Node axis selection

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html"><span>first item</span></a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
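html = etree.HTML(text)
# ancestor::* selects every ancestor of the first li
result = html.xpath('//li[1]/ancestor::*')
print(result)
# ancestor::div keeps only the div ancestors
result = html.xpath('//li[1]/ancestor::div')
print(result)
# attribute::* returns all attribute values of the first li
result = html.xpath('//li[1]/attribute::*')
print(result)
# child::a[...] selects direct a children with the given href
result = html.xpath('//li[1]/child::a[@href="link1.html"]')
print(result)
# descendant::span selects span elements at any depth below the li
result = html.xpath('//li[1]/descendant::span')
print(result)
# following::*[2] is the second element after the li in document order
result = html.xpath('//li[1]/following::*[2]')
print(result)
# following-sibling::* selects all later siblings of the li
result = html.xpath('//li[1]/following-sibling::*')
print(result)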

Output:

[<Element html at 0x1ead9b366c8>, <Element body at 0x1ead9cc63c8>, <Element div at 0x1ead9cc6408>, <Element ul at 0x1ead9cc6448>]
[<Element div at 0x1ead9cc6408>]
['item-0']
[<Element a at 0x1ead9cc63c8>]
[<Element span at 0x1ead9cc6408>]
[<Element a at 0x1ead9cc6448>]
[<Element li at 0x1ead9cc66c8>, <Element li at 0x1ead9cc65c8>, <Element li at 0x1ead9cc6308>, <Element li at 0x1ead9cc64c8>]
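
Because element reprs only show memory addresses, it can help to map axis results to tag names or attribute values when exploring. A minimal sketch on a shortened three-item list (the markup and the names ancestors and siblings are illustrative):

```python
from lxml import etree

text = '''
<ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
</ul>
'''
html = etree.HTML(text)

# ancestor::* walks up to the root; results come back in document order.
ancestors = [el.tag for el in html.xpath('//li[1]/ancestor::*')]

# following-sibling::* selects later siblings under the same parent.
siblings = [el.get('class') for el in html.xpath('//li[1]/following-sibling::*')]

print(ancestors)
print(siblings)
```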