Python Web Scraping Notes (2): Parsing HTML with XPath

~程序员小白~

Last modified: 2024-04-07 10:28:18

Tags: python, web scraping

First published: 2024-04-07 10:13:33

Article link: https://blog.csdn.net/qq_31957463/article/details/137453985

For the XPath syntax itself, see the XPath Tutorial.
Sample HTML to be parsed (saved as test.html):

<html>
<body>
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
        </li>
    </ul>
</div>
</body>
</html>

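'''
        nodename   selects the child nodes named nodename
        /   selects direct children of the current node
        //  selects descendants of the current node, at any depth
        .   selects the current node
        ..  selects the parent of the current node
        @   selects attributes
'''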
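from lxml import etree

def xpathParse():
    # Parse the HTML file; HTMLParser tolerates and repairs malformed markup
    result1 = etree.parse('test.html', etree.HTMLParser())
    # Serialize the parsed tree back to HTML bytes and print it
    result2 = etree.tostring(result1, method='html')
    print(result2.decode('utf-8'))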
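When no test.html file is at hand, the same tree can be built from a string. The sketch below assumes a condensed copy of the sample markup held in a variable named `HTML` (an illustrative name, not from the original code) and uses `etree.HTML`, which returns the root element rather than the ElementTree that `etree.parse` returns.

```python
from lxml import etree

# Condensed copy of the sample markup (assumption: kept inline instead of test.html)
HTML = (
    '<html><body><div><ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '<li class="item-inactive"><a href="link3.html">third item</a></li>'
    '<li class="item-1"><a href="link4.html">fourth item</a></li>'
    '<li class="item-0"><a href="link5.html">fifth item</a>\n</li>'
    '</ul></div></body></html>'
)

# etree.HTML parses a string with the HTML parser and returns the root <html> element
root = etree.HTML(HTML)
root_tag = root.tag
li_count = len(root.xpath('//li'))
print(root_tag, li_count)
```

Both objects expose `.xpath()`, so the queries in the rest of this post should work on either one.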

1. Get the current node

    # Print the current node
    res = result1.xpath('.')
    print('当前节点为:', res)

2. Get all nodes

    # Get all nodes; * matches every element
    result3 = result1.xpath('//*')
    print('result3输出结果为:', result3)
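Element objects print as memory addresses, which makes it hard to see what `//*` matched. A minimal sketch, assuming the sample markup is held in an inline string (`HTML` and `tags` are illustrative names), lists the tag of each matched element in document order instead:

```python
from lxml import etree

# Condensed copy of the sample markup (assumption: inline instead of test.html)
HTML = (
    '<html><body><div><ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '<li class="item-inactive"><a href="link3.html">third item</a></li>'
    '<li class="item-1"><a href="link4.html">fourth item</a></li>'
    '<li class="item-0"><a href="link5.html">fifth item</a>\n</li>'
    '</ul></div></body></html>'
)
root = etree.HTML(HTML)

# //* matches every element in the document, in document order
tags = [el.tag for el in root.xpath('//*')]
print(len(tags), tags)
```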

3. Get all li nodes (// matches li elements at any depth below the root html element)

    result4 = result1.xpath('//li')
    print('result4输出结果为:', result4)

4. Get the a nodes directly under every li node (// first selects all li descendants, then / selects their direct a children; the result is an empty list when nothing matches)

    result5 = result1.xpath('//li/a')
    print('result5输出结果为:', result5)
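The difference between `/` and `//` shows up when the target is not a direct child. The following sketch, again assuming the sample markup in an inline string (all variable names are illustrative), compares `//ul/a` with `//ul//a`:

```python
from lxml import etree

# Condensed copy of the sample markup (assumption: inline instead of test.html)
HTML = (
    '<html><body><div><ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '<li class="item-inactive"><a href="link3.html">third item</a></li>'
    '<li class="item-1"><a href="link4.html">fourth item</a></li>'
    '<li class="item-0"><a href="link5.html">fifth item</a>\n</li>'
    '</ul></div></body></html>'
)
root = etree.HTML(HTML)

# a is a grandchild of ul, so the direct-child step finds nothing
direct = root.xpath('//ul/a')
# the descendant step reaches through the li elements
nested = root.xpath('//ul//a')
print(direct, len(nested))
```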

5. Get the class attribute of a node's parent (.. and parent::* are equivalent here)

    result6 = result1.xpath('//a[@href="link4.html"]/../@class')
    result7 = result1.xpath('//a[@href="link4.html"]/parent::*/@class')
    print('result6输出结果为:', result6)
    print('result7输出结果为:', result7)
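The same parent lookup can also be done on the element object. This sketch assumes the sample markup in an inline string and uses lxml's `getparent()` method as an alternative to the XPath `..` step; the variable names are illustrative.

```python
from lxml import etree

# Condensed copy of the sample markup (assumption: inline instead of test.html)
HTML = (
    '<html><body><div><ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '<li class="item-inactive"><a href="link3.html">third item</a></li>'
    '<li class="item-1"><a href="link4.html">fourth item</a></li>'
    '<li class="item-0"><a href="link5.html">fifth item</a>\n</li>'
    '</ul></div></body></html>'
)
root = etree.HTML(HTML)

# XPath route: step up to the parent, then read its class attribute
via_xpath = root.xpath('//a[@href="link4.html"]/../@class')

# Object route: take the matched element and ask lxml for its parent
a = root.xpath('//a[@href="link4.html"]')[0]
via_api = a.getparent().get('class')
print(via_xpath, via_api)
```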

6. Filter by attribute with @: select the li elements whose class attribute is item-0

    result8 = result1.xpath('//li[@class="item-0"]')
    print('result8输出结果为:', result8)
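An attribute predicate compares the whole attribute value, so a partial value matches nothing. A small sketch under the same assumption (sample markup in an inline string, illustrative variable names):

```python
from lxml import etree

# Condensed copy of the sample markup (assumption: inline instead of test.html)
HTML = (
    '<html><body><div><ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '<li class="item-inactive"><a href="link3.html">third item</a></li>'
    '<li class="item-1"><a href="link4.html">fourth item</a></li>'
    '<li class="item-0"><a href="link5.html">fifth item</a>\n</li>'
    '</ul></div></body></html>'
)
root = etree.HTML(HTML)

# Exact match on the full attribute value
matches = root.xpath('//li[@class="item-0"]')
texts = [li.xpath('./a/text()')[0] for li in matches]

# "item" is only a prefix of the real values, so the exact comparison fails
partial = root.xpath('//li[@class="item"]')
print(texts, partial)
```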

7. Get the text inside a tag

    result9 = result1.xpath('//a[@href="link4.html"]/text()')
    print('result9输出结果为:', result9)
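`text()` returns only the text nodes that are direct children of the selected element. In the sample, the link text belongs to a, not li, so `//li/text()` picks up only the stray whitespace before the fifth `</li>`. The sketch below assumes the sample markup in an inline string, with that line break written as `\n`:

```python
from lxml import etree

# Condensed copy of the sample markup (assumption: inline instead of test.html);
# the "\n" reproduces the line break before the fifth </li>
HTML = (
    '<html><body><div><ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '<li class="item-inactive"><a href="link3.html">third item</a></li>'
    '<li class="item-1"><a href="link4.html">fourth item</a></li>'
    '<li class="item-0"><a href="link5.html">fifth item</a>\n</li>'
    '</ul></div></body></html>'
)
root = etree.HTML(HTML)

# Direct text children of the li elements: only whitespace
direct_text = root.xpath('//li[@class="item-0"]/text()')
# All text below those li elements, including the text inside a
all_text = root.xpath('//li[@class="item-0"]//text()')
# Drop whitespace-only entries to keep the visible text
visible = [t for t in all_text if t.strip()]
print(direct_text, visible)
```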

8. Get attribute values with @

    result10 = result1.xpath('//li/a/@href')
    print('result10输出结果为:', result10)
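In a scraper the href and the link text are usually wanted together. One possible way, sketched here with the sample markup in an inline string, is to select the a elements once and read both values from each element (`links` is an illustrative name):

```python
from lxml import etree

# Condensed copy of the sample markup (assumption: inline instead of test.html)
HTML = (
    '<html><body><div><ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '<li class="item-inactive"><a href="link3.html">third item</a></li>'
    '<li class="item-1"><a href="link4.html">fourth item</a></li>'
    '<li class="item-0"><a href="link5.html">fifth item</a>\n</li>'
    '</ul></div></body></html>'
)
root = etree.HTML(HTML)

# .get() reads an attribute, .text reads the text directly inside the element
links = [{'href': a.get('href'), 'text': a.text} for a in root.xpath('//li/a')]
for link in links:
    print(link)
```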

9. Match part of an attribute value with contains()

    result11 = result1.xpath('//li[contains(@class,"item")]/a/text()')
    print('result11输出结果为:', result11)
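Because every class in the sample contains "item", the query above returns all five links. A narrower substring shows the filtering more clearly. This sketch assumes the sample markup in an inline string and also uses the XPath 1.0 function `starts-with()` for comparison:

```python
from lxml import etree

# Condensed copy of the sample markup (assumption: inline instead of test.html)
HTML = (
    '<html><body><div><ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '<li class="item-inactive"><a href="link3.html">third item</a></li>'
    '<li class="item-1"><a href="link4.html">fourth item</a></li>'
    '<li class="item-0"><a href="link5.html">fifth item</a>\n</li>'
    '</ul></div></body></html>'
)
root = etree.HTML(HTML)

# Substring match anywhere in the class value
inactive = root.xpath('//li[contains(@class, "inactive")]/a/text()')
# Prefix match on the class value
item_1 = root.xpath('//li[starts-with(@class, "item-1")]/a/text()')
print(inactive, item_1)
```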

10. Select by position (XPath positions start at 1)

    result12 = result1.xpath('//li[1]')
    result13 = result1.xpath('//li[position()<3]')
    result14 = result1.xpath('//li[last()]')
    print('result12输出结果为:', result12)
    print('result13输出结果为:', result13)
    print('result14输出结果为:', result14)
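To see which items the position predicates pick, the sketch below prints the link text instead of the element objects. It assumes the sample markup in an inline string and adds `last()-1` as one more case; the variable names are illustrative.

```python
from lxml import etree

# Condensed copy of the sample markup (assumption: inline instead of test.html)
HTML = (
    '<html><body><div><ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '<li class="item-inactive"><a href="link3.html">third item</a></li>'
    '<li class="item-1"><a href="link4.html">fourth item</a></li>'
    '<li class="item-0"><a href="link5.html">fifth item</a>\n</li>'
    '</ul></div></body></html>'
)
root = etree.HTML(HTML)

first = root.xpath('//li[1]/a/text()')                 # first li
first_two = root.xpath('//li[position()<3]/a/text()')  # positions 1 and 2
last = root.xpath('//li[last()]/a/text()')             # last li
second_last = root.xpath('//li[last()-1]/a/text()')    # one before the last
print(first, first_two, last, second_last)
```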

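11. Select along node axes

    '''
        child: the direct children of the current node.
        parent: the parent of the current node.
        descendant: all descendants of the current node (children, children of children, and so on).
        ancestor: all ancestors of the current node (parent, parent of the parent, and so on).
        following: every node in the document after the end of the current node, excluding its descendants.
        preceding: every node in the document before the current node, excluding its ancestors.
        following-sibling: the siblings that come after the current node.
        preceding-sibling: the siblings that come before the current node.
        self: the current node.
        descendant-or-self: the current node and all of its descendants.
        ancestor-or-self: the current node and all of its ancestors.
    '''
    # Get all ancestors of the first li node
    result15 = result1.xpath('//li[1]/ancestor::*')
    print('result15输出结果为:', result15)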
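A few of the axes listed above can be tried on the sample as follows. This is a sketch, assuming the sample markup in an inline string; the variable names are illustrative, and results are reduced to tag names or text so they are easy to read.

```python
from lxml import etree

# Condensed copy of the sample markup (assumption: inline instead of test.html)
HTML = (
    '<html><body><div><ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '<li class="item-inactive"><a href="link3.html">third item</a></li>'
    '<li class="item-1"><a href="link4.html">fourth item</a></li>'
    '<li class="item-0"><a href="link5.html">fifth item</a>\n</li>'
    '</ul></div></body></html>'
)
root = etree.HTML(HTML)

# ancestor: every element above the first li, outermost first
ancestors = [el.tag for el in root.xpath('//li[1]/ancestor::*')]
# ancestor with a node test: only the div ancestor
div_only = [el.tag for el in root.xpath('//li[1]/ancestor::div')]
# attribute: all attribute values of the first li
attrs = root.xpath('//li[1]/attribute::*')
# following-sibling: the li elements after the first one
after_first = root.xpath('//li[1]/following-sibling::*/a/text()')
# preceding-sibling: the li elements before the third one
before_third = root.xpath('//li[3]/preceding-sibling::*/a/text()')
print(ancestors, div_only, attrs, after_first, before_third)
```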

Output:

当前节点为: [<Element html at 0x1e84c28a380>]
result3输出结果为: [<Element html at 0x1e84c28a380>, <Element body at 0x1e84c5d7040>, <Element div at 0x1e84c5d70c0>, <Element ul at 0x1e84c5d7100>, <Element li at 0x1e84c5d7140>, <Element a at 0x1e84c5d71c0>, <Element li at 0x1e84c5d7200>, <Element a at 0x1e84c5d7240>, <Element li at 0x1e84c5d7280>, <Element a at 0x1e84c5d7180>, <Element li at 0x1e84c5d72c0>, <Element a at 0x1e84c5d7300>, <Element li at 0x1e84c5d7340>, <Element a at 0x1e84c5d7380>]
result4输出结果为: [<Element li at 0x1e84c5d7140>, <Element li at 0x1e84c5d7200>, <Element li at 0x1e84c5d7280>, <Element li at 0x1e84c5d72c0>, <Element li at 0x1e84c5d7340>]
result5输出结果为: [<Element a at 0x1e84c5d71c0>, <Element a at 0x1e84c5d7240>, <Element a at 0x1e84c5d7180>, <Element a at 0x1e84c5d7300>, <Element a at 0x1e84c5d7380>]
result6输出结果为: ['item-1']
result7输出结果为: ['item-1']
result8输出结果为: [<Element li at 0x1e84c5d7140>, <Element li at 0x1e84c5d7340>]
result9输出结果为: ['fourth item']
result10输出结果为: ['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
result11输出结果为: ['first item', 'second item', 'third item', 'fourth item', 'fifth item']
result12输出结果为: [<Element li at 0x1e84c5d7140>]
result13输出结果为: [<Element li at 0x1e84c5d7140>, <Element li at 0x1e84c5d7200>]
result14输出结果为: [<Element li at 0x1e84c5d7340>]
result15输出结果为: [<Element html at 0x1e84c28a380>, <Element body at 0x1e84c5d7040>, <Element div at 0x1e84c5d70c0>, <Element ul at 0x1e84c5d7100>]
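The memory addresses in this output say little about what was matched. One possible remedy is a small helper that turns a result list into readable labels. `describe` is a hypothetical helper written for this post, not part of lxml, and the sample markup is again assumed to be in an inline string.

```python
from lxml import etree

# Condensed copy of the sample markup (assumption: inline instead of test.html)
HTML = (
    '<html><body><div><ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '<li class="item-inactive"><a href="link3.html">third item</a></li>'
    '<li class="item-1"><a href="link4.html">fourth item</a></li>'
    '<li class="item-0"><a href="link5.html">fifth item</a>\n</li>'
    '</ul></div></body></html>'
)
root = etree.HTML(HTML)


def describe(nodes):
    """Hypothetical helper: strings pass through, elements become their tag name."""
    return [n if isinstance(n, str) else n.tag for n in nodes]


ancestor_tags = describe(root.xpath('//li[1]/ancestor::*'))
hrefs = describe(root.xpath('//li/a/@href'))
print(ancestor_tags)
print(hrefs)
```

Text and attribute results from lxml are string subclasses, which is why the `isinstance(n, str)` check is enough to tell them apart from elements.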