XPath in Practice

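This post walks through examples of using XPath from Python: repairing incomplete source, selecting nodes, filtering by attribute, and retrieving text and attributes. Using the lxml library, it shows how XPath expressions extract specific information from an HTML document, for example selecting all li nodes, getting the text of a tags inside li nodes with a given class, and getting the href attribute of a tags. It also covers more advanced usage: selecting by position, axis selection, and matching multi-valued attributes.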

XPath examples


Repairing incomplete source with lxml
# coding=utf-8
from lxml import etree

text = '''
<div>
	<ul>
		<li class="item-0"><a href="link1.html"><span>first item</span></a></li>
		<li class="item-1"><a href="link2.html">second item</a></li>
		<li class="item-inactive"><a href="link3.html">third item</a></li>
		<li class="item-1"><a href="link4.html">first item</a></li>
		<li class="item-0"><a href="link5.html">first item</a>
	</ul>
</div>
'''
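# Parse the page source; etree.HTML() builds the element tree that XPath queries run against
html = etree.HTML(text)
# result = html.xpath()
# Serialize the tree; the parser has already closed the missing </li> and added <html>/<body>
result = etree.tostring(html)
# tostring() returns bytes, so decode them to a UTF-8 str
print(result.decode('utf-8'))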
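
To make the repair step easier to see, here is a minimal sketch on a one-line fragment. The `broken` string is a made-up, trimmed version of the sample above, and the variable names are only illustrative. It checks that the parser supplies the closing `</li>` and wraps the fragment in `html` and `body` elements.

```python
from lxml import etree

# Hypothetical fragment: the <li> is never closed, as in the last item of the sample above
broken = '<ul><li class="item-0"><a href="link5.html">first item</a></ul>'

root = etree.HTML(broken)                         # lenient HTML parser builds a full tree
repaired = etree.tostring(root).decode('utf-8')   # serialize the tree back to text

print(repaired)
print(root.tag, root[0].tag)                      # root element and its first child
```

The repair is done by the HTML parser when the tree is built, so any later XPath query already runs against the corrected structure.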

Selecting nodes
# coding=utf-8
from lxml import etree

text = '''
<div>
	<ul>
		<li class="item-0"><a href="link1.html"><span>first item</span></a></li>
		<li class="item-1"><a href="link2.html">second item</a></li>
		<li class="item-inactive"><a href="link3.html">third item</a></li>
		<li class="item-1"><a href="link4.html">first item</a></li>
		<li class="item-0"><a href="link5.html">first item</a>
	</ul>
</div>
'''

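# Select every element node in the document
html = etree.HTML(text)
result = html.xpath('//*')

# Select every li node at any depth
# result = html.xpath('//li')

# Select a nodes that are direct children of li
# result = html.xpath('//li/a')

# Get the class attribute of the parent of the a tag whose href is link5.html
# result = html.xpath('//a[@href="link5.html"]/../@class')

# Print each result on its own line
for i in result:
    print(i)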
Filtering nodes with @
# coding=utf-8
from lxml import etree

text = '''
<div>
	<ul>
		<li class="item-0"><a href="link1.html"><span>first item</span></a></li>
		<li class="item-1"><a href="link2.html">second item</a></li>
		<li class="item-inactive"><a href="link3.html">third item</a></li>
		<li class="item-1"><a href="link4.html">first item</a></li>
		<li class="item-0"><a href="link5.html">first item</a>
	</ul>
</div>
'''

html = etree.HTML(text)
# Filter by attribute value with @
result = html.xpath('//li[@class="item-0"]')

# Print each result on its own line
for i in result:
    print(i)
Getting text
# coding=utf-8
from lxml import etree

text = '''
<div>
	<ul>
		<li class="item-0"><a href="link1.html"><span>first item</span></a></li>
		<li class="item-1"><a href="link2.html">second item</a></li>
		<li class="item-inactive"><a href="link3.html">third item</a></li>
		<li class="item-1"><a href="link4.html">first item</a></li>
		<li class="item-0"><a href="link5.html">first item</a>
	</ul>
</div>
'''

html = etree.HTML(text)
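# text() returns only text that is a direct child of the a tag; text nested in <span> is not included
result = html.xpath('//li[@class="item-0"]/a/text()')

# Print each result on its own line
for i in result:
    print(i)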
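
The query above returns only the text that sits directly inside `a`, so a link whose text is wrapped in `span` is skipped. The sketch below compares three ways of reading the text. The markup is a shortened, hypothetical variant of the sample (the second label is changed to "fifth item" so the two results are distinguishable), and the names `direct`, `nested` and `joined` are illustrative.

```python
from lxml import etree

# Hypothetical two-item sample: one label is wrapped in <span>, the other is not
text = '''
<ul>
    <li class="item-0"><a href="link1.html"><span>first item</span></a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
'''
html = etree.HTML(text)

# /text() : text nodes that are direct children of <a>
direct = html.xpath('//li[@class="item-0"]/a/text()')

# //text() : text nodes anywhere below <a>
nested = html.xpath('//li[@class="item-0"]/a//text()')

# string(.) : concatenated text of each <a>, evaluated per element
joined = [a.xpath('string(.)') for a in html.xpath('//li[@class="item-0"]/a')]

print(direct)
print(nested)
print(joined)
```

`//text()` applied to the `li` itself would also pick up whitespace-only text nodes between tags, which is why the sketch restricts it to the `a` element.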
Getting attributes with @
# coding=utf-8
from lxml import etree

text = '''
<div>
	<ul>
		<li class="item-0"><a href="link1.html"><span>first item</span></a></li>
		<li class="item-1"><a href="link2.html">second item</a></li>
		<li class="item-inactive"><a href="link3.html">third item</a></li>
		<li class="item-1"><a href="link4.html">first item</a></li>
		<li class="item-0"><a href="link5.html">first item</a>
	</ul>
</div>
'''

html = etree.HTML(text)
# Get the href attribute value of every a under li
result = html.xpath('//li/a/@href')

# Print each result on its own line
for i in result:
    print(i)
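
When both the attribute and the link text are needed, one option is to select the `a` elements first and read each value from the element. This is a small sketch on a shortened, hypothetical copy of the sample. `Element.get()` and `string(.)` are standard lxml and XPath features, while the `links` list is just an illustrative name.

```python
from lxml import etree

# Hypothetical shortened sample
text = '''
<ul>
    <li class="item-0"><a href="link1.html"><span>first item</span></a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
</ul>
'''
html = etree.HTML(text)

links = []
for a in html.xpath('//li/a'):
    href = a.get('href')            # attribute value, or None if the attribute is absent
    label = a.xpath('string(.)')    # full text of this <a>, including nested tags
    links.append((href, label))

print(links)
```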
Matching multi-valued attributes
# coding=utf-8
from lxml import etree

text = '''
<div>
	<ul>
		<li class="item-0 li li-first" name="item"><a href="link1.html">first item></a></li>
		<li class="item-1"><a href="link2.html">second item</a></li>
		<li class="item-inactive"><a href="link3.html">third item</a></li>
		<li class="item-1"><a href="link4.html">first item</a></li>
		<li class="item-0"><a href="link5.html">first item</a>
	</ul>
</div>
'''

html = etree.HTML(text)
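# Match an attribute that holds several values, using contains()
# result = html.xpath('//li[contains(@class,"li")]/a/text()')
# Match on several attributes at once, using and
result = html.xpath('//li[contains(@class,"li") and @name="item"]/a/text()')

# Print each result on its own line
for i in result:
    print(i)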
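
One caveat worth illustrating: `contains()` is a plain substring test, so `contains(@class, "li")` would also match a class such as `list`. The sketch below contrasts it with a common XPath 1.0 idiom for matching a whole class token. The markup is hypothetical (the `list` class was added to show the difference), and `loose` and `strict` are illustrative names.

```python
from lxml import etree

# Hypothetical sample: the second li carries a class that merely contains the letters "li"
text = '''
<ul>
    <li class="item-0 li li-first"><a href="link1.html">first item</a></li>
    <li class="item-1 list"><a href="link2.html">second item</a></li>
</ul>
'''
html = etree.HTML(text)

# Substring match: "li" occurs inside both class attributes
loose = html.xpath('//li[contains(@class, "li")]/a/text()')

# Token match: pad the normalized class list with spaces and look for " li "
strict = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(loose)
print(strict)
```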
Selecting by position
# coding=utf-8
from lxml import etree

text = '''
<div>
	<ul>
		<li class="item-0 li li-first" name="item"><a href="link1.html">first item></a></li>
		<li class="item-1"><a href="link2.html">second item</a></li>
		<li class="item-inactive"><a href="link3.html">third item</a></li>
		<li class="item-1"><a href="link4.html">first item</a></li>
		<li class="item-0"><a href="link5.html">first item</a>
	</ul>
</div>
'''

html = etree.HTML(text)
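# Select by position: the first li under its parent (XPath positions start at 1)
result = html.xpath('//li[1]/a/text()')

# Text of the last li
# result = html.xpath('//li[last()]/a/text()')

# Text of the first two li
# result = html.xpath('//li[position()<3]/a/text()')

# Text of the third li from the end
# result = html.xpath('//li[last()-2]/a/text()')

# Print each result on its own line
for i in result:
    print(i)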
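
Positional predicates are evaluated relative to each parent, which only becomes visible once a document has more than one list. The following sketch uses a made-up document with two `ul` elements to compare `//li[1]` with `(//li)[1]`. All names and sample values are illustrative.

```python
from lxml import etree

# Hypothetical document with two separate lists
text = '''
<div>
    <ul><li>a1</li><li>a2</li></ul>
    <ul><li>b1</li><li>b2</li></ul>
</div>
'''
html = etree.HTML(text)

# First li within each parent ul
per_parent = html.xpath('//li[1]/text()')

# First li in the whole document: parenthesize, then index
overall = html.xpath('(//li)[1]/text()')

# Last li within each parent ul
last_each = html.xpath('//li[last()]/text()')

print(per_parent, overall, last_each)
```

In the single-list sample used in this post the two forms return the same node, so the difference only matters on pages with several lists.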
Axis selection
# coding=utf-8
from lxml import etree

text = '''
<div>
	<ul>
		<li class="item-0 li li-first" name="item"><a href="link1.html">first item></a></li>
		<li class="item-1"><a href="link2.html">second item</a></li>
		<li class="item-inactive"><a href="link3.html">third item</a></li>
		<li class="item-1"><a href="link4.html">first item</a></li>
		<li class="item-0"><a href="link5.html">first item</a>
	</ul>
</div>
'''

html = etree.HTML(text)
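# Get all ancestor nodes of the first li (the leading // is needed to reach li from the root)
result = html.xpath('//li[1]/ancestor::*')

# Print each result on its own line
for i in result:
    print(i)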
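
`ancestor` is only one of the XPath axes. As a rough sketch, the snippet below runs a few other standard axes (`attribute`, `child`, `descendant`, `following-sibling`) against a shortened, hypothetical copy of the sample and collects tag names so the results are easy to read. The variable names are illustrative.

```python
from lxml import etree

# Hypothetical shortened sample with three items
text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html"><span>first item</span></a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
    </ul>
</div>
'''
html = etree.HTML(text)

# ancestor: every element above the first li, in document order
ancestors = [e.tag for e in html.xpath('//li[1]/ancestor::*')]

# attribute: all attribute values of the first li
attrs = html.xpath('//li[1]/attribute::*')

# child: direct children filtered by a predicate
children = [e.tag for e in html.xpath('//li[1]/child::a[@href="link1.html"]')]

# descendant: elements at any depth below the first li
descendants = [e.tag for e in html.xpath('//li[1]/descendant::span')]

# following-sibling: later siblings of the first li
siblings = [e.get('class') for e in html.xpath('//li[1]/following-sibling::*')]

print(ancestors, attrs, children, descendants, siblings)
```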
