Python Web Scraping Basics: XPath Extraction with the lxml Library

Getting started with lxml

Common rules

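Expression   Description
nodename     Selects the child elements of the current node that are named nodename
/            Selects direct children of the current node (a leading / starts at the document root)
//           Selects descendants of the current node at any depth (a leading // searches the whole document)
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes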
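
The rules in the table above can be tried one at a time on a small document. The following is a minimal sketch; the sample markup and the variable names are made up for illustration and are not part of the original article.

```python
from lxml import etree

# Hypothetical sample document, used only to illustrate the rules above.
doc = etree.HTML('<div id="box"><p class="intro">hello <b>world</b></p></div>')

div = doc.xpath('//div')[0]                # //  : search the whole document for div
children = div.xpath('p')                  # nodename : child elements named p
current_tag = div.xpath('.')[0].tag        # .   : the current node itself
direct_text = div.xpath('./p/b/text()')    # /   : step through direct children
deep_text = div.xpath('.//b/text()')       # .// : any descendant of the current node
parent_tag = div.xpath('p/..')[0].tag      # ..  : parent of p, which is the div again
classes = div.xpath('p/@class')            # @   : attribute value

print([el.tag for el in children])  # ['p']
print(current_tag)                  # div
print(direct_text, deep_text)       # ['world'] ['world']
print(parent_tag)                   # div
print(classes)                      # ['intro']
```

Calling xpath() on an element rather than on the whole tree makes that element the current node, which is why the relative expressions here start from div.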

Common usage

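  • All nodes: //*
  • Child nodes: //li/a
  • Parent node: //a/..
  • Attribute match: li[@class="xxx"]
  • Text extraction: /text()
  • Attribute extraction: @href
  • Matching an attribute with multiple values: li[contains(@class, "xxx")]
  • Matching on several attributes: li[contains(@class, "li_test") and @tag="tag"]
  • Selecting by position: li[1], li[last()]
  • Axis selection: li[1]/ancestor::a, li[1]/attribute::a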
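
As a rough sketch of the all-nodes, child, parent, and attribute-extraction items in the list above, the snippet below runs each one against a tiny list. The markup and the example.com URLs are placeholders chosen for this illustration only.

```python
from lxml import etree

# Hypothetical markup; the URLs are placeholders.
html = etree.HTML('''
<ul>
<li class="item"><a href="https://example.com/a">A</a></li>
<li class="item active"><a href="https://example.com/b">B</a></li>
</ul>
''')

# All nodes: //* matches every element, in document order.
all_tags = [el.tag for el in html.xpath('//*')]

# Child nodes: a elements that are direct children of an li.
child_links = html.xpath('//li/a')

# Parent node: go from the matching a up to its li and read the class.
parent_classes = html.xpath('//a[@href="https://example.com/b"]/../@class')

# Attribute extraction: @href returns the attribute values as strings.
hrefs = html.xpath('//li/a/@href')

print(all_tags)        # ['html', 'body', 'ul', 'li', 'a', 'li', 'a']
print(len(child_links))  # 2
print(parent_classes)  # ['item active']
print(hrefs)           # ['https://example.com/a', 'https://example.com/b']
```

Note that etree.HTML wraps the fragment in html and body elements, which is why they show up in the all-nodes result.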

A simple example

from lxml import etree
text = '''
<div>
<ul class="ul_test">
<li class="first_li"><a href="www.baidu.com">one</a>
<li class="li_test"><a href="www.csdn.net" tag="tag">two</a></li>
<li class="li li_test"><a href="www.163.com">three</a></li>
<li class="li li_test" tag="tag"><a href="www.163.com">four</a></li>
</div>
'''
# Parse the string; lxml repairs the missing </li> and </ul> tags.
html = etree.HTML(text)
# Alternatively: html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)  # serialize the repaired tree to bytes
print(result.decode('utf-8')+'\n')
# Attribute match on a, then the text and the parent's class
print(html.xpath('//li/a[@tag="tag"]/text()'))
print(html.xpath('//li/a[@tag="tag"]/../@class'))
# Exact class match versus substring match with contains()
print(html.xpath('//li[@class="li_test"]/a/text()'))
print(html.xpath('//li[contains(@class,"li_test")]/a/text()'))
# Several conditions combined with "and"
print(html.xpath('//li[contains(@class,"li_test") and @tag="tag"]/a/text()'))
# Selection by position (XPath positions start at 1)
print(html.xpath('//li[1]/a/text()'),
    html.xpath('//li[last()]/a/text()'),
    html.xpath('//li[position()<3]/a/text()'),
    html.xpath('//li[last()-1]/a/text()'))
print('*'*20)
# Axes: ancestors, attributes, children, descendants, following nodes, siblings
print(html.xpath('//li[1]/ancestor::*'),
    html.xpath('//li[1]/ancestor::ul'),
    html.xpath('//li[1]/attribute::*'),
    html.xpath('//li[1]/child::*'),
    html.xpath('//li[1]/descendant::*'),
    html.xpath('//li[1]/following::*'),
    html.xpath('//li[1]/following-sibling::*'), sep="\n")

"""

Output:

<html><body><div>
<ul class="ul_test">
<li class="first_li"><a href="www.baidu.com">one</a>
</li><li class="li_test"><a href="www.csdn.net" tag="tag">two</a></li>
<li class="li li_test"><a href="www.163.com">three</a></li>
<li class="li li_test" tag="tag"><a href="www.163.com">four</a></li>
</ul></div>
</body></html>

['two']
['li_test']
['two']
['two', 'three', 'four']
['four']
['one'] ['four'] ['one', 'two'] ['three']
********************
[<Element html at 0x1d4f8dba688>, <Element body at 0x1d4f8dba6c8>, <Element div at 0x1d4f8dba5c8>, <Element ul at 0x1d4f8dba748>]
[<Element ul at 0x1d4f8dba748>]
['first_li']
[<Element a at 0x1d4f8dba608>]
[<Element a at 0x1d4f8dba608>]
[<Element li at 0x1d4f8dba808>, <Element a at 0x1d4f8dba848>, <Element li at 0x1d4f8dba888>, <Element a at 0x1d4f8dba8c8>, <Element li at 0x1d4f8dba908>, <Element a at 0x1d4f8dba988>]
[<Element li at 0x1d4f8dba808>, <Element li at 0x1d4f8dba888>, <Element li at 0x1d4f8dba908>]
"""