[Python Web Scraping] A Simple Guide to Using XPath

Latest recommended article published 2024-02-01 11:42:23

猪猪传奇


Article link: https://blog.csdn.net/qq_42127861/article/details/108444485


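1. XPath (XML Path Language) is a language for locating information in XML documents. It can be used to traverse the elements and attributes of an XML document. To use it from Python, install the lxml library.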

The most commonly used path expressions

[Image: the most commonly used path expressions]

Common path expressions and their results

[Image: common path expressions and their results]
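
Since the tables above survive only as image placeholders, here is a minimal sketch of the basic path expressions in lxml. The bookstore document is a hypothetical sample made up for illustration; it is not taken from the original tables.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration
xml = """<bookstore>
  <book category="web"><title lang="en">Learning XML</title><price>39.95</price></book>
  <book category="kids"><title lang="en">Harry Potter</title><price>29.99</price></book>
</bookstore>"""

root = etree.fromstring(xml)

# nodename: child elements with that name, relative to the context node (root here)
books = root.xpath('book')
# /: an absolute path that starts at the document root
titles_abs = root.xpath('/bookstore/book/title')
# //: matching nodes at any depth
prices = root.xpath('//price/text()')
# .: the context node itself
same = books[0].xpath('.')
# ..: the parent of the context node
parent = books[0].xpath('..')
# @: attribute selection
categories = root.xpath('//book/@category')

print(len(books))                      # 2
print([t.text for t in titles_abs])    # ['Learning XML', 'Harry Potter']
print(prices)                          # ['39.95', '29.99']
print(same[0].tag, parent[0].tag)      # book bookstore
print(categories)                      # ['web', 'kids']
```

Note that `xpath()` always returns a list here, even when only one node matches.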

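Predicates are used to find a specific node, or a node that contains a specific value. They are written inside square brackets.

[Image: predicate examples]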
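
A few typical predicates, sketched against a hypothetical bookstore document (the data and the `titles` helper are my own illustration, not part of the original table):

```python
from lxml import etree

# Hypothetical sample document, used only for illustration
xml = """<bookstore>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
  <book><title lang="fr">Le Petit Prince</title><price>12.50</price></book>
  <book><title>Untitled Draft</title><price>35.00</price></book>
</bookstore>"""
root = etree.fromstring(xml)

def titles(expr):
    """Hypothetical helper: run expr and return the title text of each matched book."""
    return [book.findtext('title') for book in root.xpath(expr)]

first = titles('/bookstore/book[1]')                   # positions start at 1
last = titles('/bookstore/book[last()]')               # last book child
first_two = titles('/bookstore/book[position()<3]')    # first two book children
with_lang = root.xpath('//title[@lang]/text()')        # titles that have a lang attribute
english = root.xpath("//title[@lang='en']/text()")     # titles whose lang is exactly 'en'
expensive = titles('/bookstore/book[price>35.00]')     # books whose price is above 35.00

print(first, last, first_two)
print(with_lang, english, expensive)
```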

Selecting unknown nodes

[Image: wildcards for selecting unknown nodes]
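
The standard XPath wildcards are `*` (any element), `@*` (any attribute) and `node()` (any node type). A small sketch on a made-up document:

```python
from lxml import etree

# Hypothetical sample document, used only for illustration
root = etree.fromstring('<root><a id="1" kind="x">text<b/></a><c/></root>')

# *: any element node
children = [e.tag for e in root.xpath('/root/*')]
all_elems = [e.tag for e in root.xpath('//*')]
# @*: any attribute node
attrs = root.xpath('/root/a/@*')
# node(): any child node, whether text or element
nodes = root.xpath('/root/a/node()')
# Elements that carry at least one attribute
with_attr = [e.tag for e in root.xpath('//*[@*]')]

print(children)    # ['a', 'c']
print(all_elems)   # ['root', 'a', 'b', 'c']
print(attrs)       # ['1', 'x']
print(len(nodes))  # 2: the text 'text' and the element <b>
print(with_attr)   # ['a']
```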

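Selecting several paths: by using the "|" operator in a path expression, you can select several paths at once.

[Image: examples of selecting several paths]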
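
A minimal sketch of the `|` operator, again on a hypothetical document. The result is the union of both node-sets, so no node appears twice.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration
xml = ("<bookstore>"
       "<book><title>A</title><price>10</price><author>X</author></book>"
       "<book><title>B</title><price>20</price><author>Y</author></book>"
       "</bookstore>")
root = etree.fromstring(xml)

# Every title and every price under a book; author elements are left out
mixed = root.xpath('//book/title | //book/price')
tags = [e.tag for e in mixed]

# The same path on both sides still yields each node only once
deduplicated = root.xpath('//title | //title')

print(len(mixed), sorted(set(tags)))   # 4 ['price', 'title']
print(len(deduplicated))               # 2
```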

XPath operators

[Image: XPath operators]
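
XPath 1.0 has arithmetic operators (`+`, `-`, `*`, `div`, `mod`), comparisons (`=`, `!=`, `<`, `<=`, `>`, `>=`) and the boolean operators `and` and `or`. The sketch below tries a few of them through lxml, which returns numbers as `float` and booleans as `bool`. The item data is hypothetical.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration
root = etree.fromstring(
    '<items><item><price>9.80</price></item><item><price>9.90</price></item></items>'
)

# Arithmetic expressions evaluate to floats
total = root.xpath('6 + 4')
quotient = root.xpath('8 div 4')
remainder = root.xpath('5 mod 2')
# A comparison evaluates to a bool
two_items = root.xpath('count(//item) = 2')
# Comparisons and boolean operators inside predicates
cheap = root.xpath('//item[price < 9.85]/price/text()')
either = root.xpath('//item[price = 9.80 or price = 9.70]/price/text()')
both = root.xpath('//item[price > 9.00 and price < 9.90]/price/text()')
not_equal = root.xpath('//item[price != 9.80]/price/text()')

print(total, quotient, remainder, two_items)   # 10.0 2.0 1.0 True
print(cheap, either, both, not_equal)          # ['9.80'] ['9.80'] ['9.80'] ['9.90']
```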

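2. A simple way to think about XPath

A brief look at element nodes, attribute nodes, and text nodes in the DOM structure

The post above records my early understanding of element nodes, attribute nodes, and text nodes. Personally, I like to think of XPath in terms of those nodes, which together form a DOM tree. / selects nodes one level below the current one, so if attributes and text are both treated as nodes, using / to reach text and attributes becomes easy to understand. (Strictly speaking, XPath reaches attributes through a separate attribute axis rather than the child axis, but the mental model still works.) /node selects the node element, /div/@id selects the id attribute node of div, /div/text() selects the text nodes among the children of div, and /div/p selects the p elements directly under div.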

As shown in the figure:
[image not preserved]
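
The four expressions from the paragraph above can be tried directly. This is a minimal sketch on a hypothetical one-line document whose root element is div:

```python
from lxml import etree

# Hypothetical sample document, used only for illustration
doc = etree.fromstring('<div id="box">hello<p>world</p></div>')

element = doc.xpath('/div')             # the div element itself
id_attr = doc.xpath('/div/@id')         # the id attribute node, returned as a string
text = doc.xpath('/div/text()')         # text nodes directly inside div
p_elems = doc.xpath('/div/p')           # p elements directly under div
p_text = doc.xpath('/div/p/text()')     # text inside p
child_nodes = doc.xpath('/div/node()')  # children only: the text and <p>, not the attribute

print(element[0].tag, id_attr, text)       # div ['box'] ['hello']
print(p_elems[0].tag, p_text)              # p ['world']
print(len(child_nodes))                    # 2
```

The last line shows where the mental model bends a little: `node()` lists the text and the p element as children, while the id attribute is reachable only through `@`.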

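3. Basic XPath syntax

Note: XPath matching feels somewhat like a greedy algorithm to me: as long as a step matches, matching continues with the next step down. The predicate part of an expression is a test: if it evaluates to true, the current node is selected. There is little restriction on the test itself; any expression whose result can be converted to true or false will do. The one special case is a numeric result, which is compared with the node's position, so div[2] means div[position()=2].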
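
To make the predicate rules concrete, here is a small sketch on a hypothetical list. It follows the XPath 1.0 conversion rules: a number is compared with the position, a non-empty string counts as true, and an empty string counts as false.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration
doc = etree.fromstring('<ul><li>a</li><li>b</li><li>c</li></ul>')

by_number = doc.xpath('/ul/li[2]/text()')               # a number means position()=2
by_position = doc.xpath('/ul/li[position()=2]/text()')  # the long form of the same test
always_true = doc.xpath('/ul/li[true()]/text()')        # true for every li
forced_bool = doc.xpath('/ul/li[boolean(2)]/text()')    # boolean(2) is true, so this is not a position test
non_empty_str = doc.xpath('/ul/li["x"]/text()')         # a non-empty string converts to true
empty_str = doc.xpath('/ul/li[""]/text()')              # an empty string converts to false
position_zero = doc.xpath('/ul/li[0]/text()')           # no node has position 0

print(by_number, by_position)                    # ['b'] ['b']
print(always_true, forced_bool, non_empty_str)   # ['a', 'b', 'c'] three times
print(empty_str, position_zero)                  # [] []
```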

from lxml import etree


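'''
General note:
    In an expression such as div[...], the brackets hold a predicate (a filter condition).
    A div is selected when the predicate evaluates to true for it.
    A numeric predicate such as div[2] is shorthand for div[position()=2].
'''
# Sample data of type str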
s = '''<div id="head_wrapper">这是第一个div<div id="son_div">这是子一div<div>这是孙一div</div></div><div id="son_div2" class="suner">这是子二div<price>45</price></div></div>
        <div id="s-isindex-wrap">这时第二个div
        </div>
        <div id="nologin" >这时第三个div</div>'''

def print_(item,name):
    # Print one Element, or every Element in a list, as serialized markup
    print('*' * 25, '%s' % name, '#' * 25,'\n')
    if not isinstance(item, list):
        print(etree.tostring(item,encoding='utf-8').decode('utf-8'))
    else:
        for i in item:
            print(etree.tostring(i,encoding='utf-8').decode('utf-8'))
            print('-' * 50)
    print('\n')

def print_l(item,name):
    print('*' * 25, '%s' % name, '#' * 25,'\n')
    print(item)
    print('\n')

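if __name__ == '__main__':
    html = etree.HTML(s)
    # etree.HTML parses the string with lxml's default HTML parser and returns an Element
    print(html,type(html))
    # Serialize the Element back to a str; the parser has added the missing <html><body></body></html> wrapper

    extract_html = etree.tostring(html,encoding='utf-8').decode('utf-8')
    print_l(extract_html,'extract_html')


    # All the real work is done on the Element object
    # Get the id attribute of every /html/body/div; an attribute is selected with an extra /@name step
    # Returns a list of plain strings: ['head_wrapper', 's-isindex-wrap', 'nologin']
    # The strings cannot be queried further with xpath()
    result1 = html.xpath('/html/body/div/@id')
    print_l(result1, "/html/body/div的id属性")

    # Returns a list of id attribute values (strings); they cannot be queried further with xpath()
    result2 = html.xpath('//div/@id')  # // matches div elements at any depth
    print_l(result2,'所有div的id属性列表')

    # Returns a list of text strings; they cannot be queried further with xpath()
    result3 = html.xpath('/html/body/div/text()')
    print_l(result3,'/html/body/div/text()')

    # results is a list of Element objects; each one can be queried again and can navigate back up the tree
    # Each Element is still attached to the document tree rather than cut out of it, so its parent is reachable
    # In effect the returned Element is a pointer to a node in the tree, not an isolated node
    results = html.xpath('//div')
    print_l(len(results),'results的返回长度')
    for index,res in enumerate(results):
        print_(res,'第%d个res'%index)
        print(type(res))  # res is an Element object
        tmp = res.xpath('..')  # the parent of res, returned as a one-item list
        tmp1 = res.xpath('/*')  # an absolute path starts at the document root, so this returns only <html> even though res is not the root
        print_(tmp,'第%d个res的父节点'%index)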

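    # XPath positions start at 1
    # A predicate after //div is evaluated per parent: position() and last() count the div children of the same parent
    # So //div[last()] returns the last div child of every parent that has div children, one result per such parent
    div = html.xpath('//div[last()]')
    print_(div,'last() div')

    div2 = html.xpath('//div[position()=2]')
    print_(div2,'position()=2 的div')

    div3 = html.xpath('//div[position()<=2]')
    print_(div3,'position()<=2 的div')

    result4 = html.xpath('//div[@id="son_div"]')
    print_(result4,'带谓语选择条件的div')

    # Form 1: chained predicates; each one filters the result of the previous one
    result5 = html.xpath('//div[@id="son_div2"][@class="suner"]')
    print_(result5,'形式一：带多个条件筛选的div')

    # Form 2: several conditions on the same node, combined with and / or inside one predicate
    result5 = html.xpath('//div[@id="son_div2" and @class="suner"]')
    print_(result5, '形式二：带多个条件筛选的div')

    # contains() does substring matching: every div whose id contains "son"
    result6 = html.xpath('//div[contains(@id,"son")]')
    print_(result6,'模糊查询，只要带有son的都会被找到')

    # | is the union of two node-sets: a node matched by either path is selected
    result7 = html.xpath('//div[@id="s-isindex-wrap"] | //div[@id="nologin"]')
    # Note how @id is written here: ./@id is the same as @id, but /@id is an absolute path
    # that looks for an id attribute on the document root, which never exists,
    # so the second branch matches nothing and only s-isindex-wrap is returned
    result7 = html.xpath('//div[./@id="s-isindex-wrap"] | //div[/@id="nologin"]')
    print_(result7,'或者形式的查询，两个条件二选一')

    # Three ways to compare the child element price with a number; they are equivalent for this data
    result8 = html.xpath('//div[price/text()>40]')
    result8 = html.xpath('//div[./price/text()>40]')
    result8 = html.xpath('//div[price>40]')
    print_(result8,'使用子元素节点的text作为筛选条件的')

    result9 = html.xpath('//div[price mod 5 =0]')
    print_(result9,'mod 作为筛选条件的使用')

    # A predicate that is always true selects every div
    result10 = html.xpath('//div[1 = 1]')
    print_(result10,'使用恒等式作为谓语')

    # A node-set predicate is true when it is non-empty: divs that have a price child
    result11 = html.xpath('//div[price]')
    print_(result11,'使用是否包含节点作为谓语')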

Output:

<Element html at 0x1cd41ed9548> <class 'lxml.etree._Element'>
************************* extract_html ######################### 
<html><body><div id="head_wrapper">这是第一个div<div id="son_div">这是子一div<div>这是孙一div</div></div><div id="son_div2" class="suner">这是子二div<price>45</price></div></div>
        <div id="s-isindex-wrap">这时第二个div
        </div>
        <div id="nologin">这时第三个div</div></body></html>
************************* /html/body/div的id属性 ######################### 
['head_wrapper', 's-isindex-wrap', 'nologin']
************************* 所有div的id属性列表 ######################### 
['head_wrapper', 'son_div', 'son_div2', 's-isindex-wrap', 'nologin']
************************* /html/body/div/text() ######################### 
['这是第一个div', '这时第二个div\n        ', '这时第三个div']
************************* results的返回长度 ######################### 
6
************************* 第0个res ######################### 
<div id="head_wrapper">这是第一个div<div id="son_div">这是子一div<div>这是孙一div</div></div><div id="son_div2" class="suner">这是子二div<price>45</price></div></div>
        
<class 'lxml.etree._Element'>
************************* 第0个res的父节点 ######################### 
<body><div id="head_wrapper">这是第一个div<div id="son_div">这是子一div<div>这是孙一div</div></div><div id="son_div2" class="suner">这是子二div<price>45</price></div></div>
        <div id="s-isindex-wrap">这时第二个div
        </div>
        <div id="nologin">这时第三个div</div></body>
--------------------------------------------------
************************* 第1个res ######################### 
<div id="son_div">这是子一div<div>这是孙一div</div></div>
<class 'lxml.etree._Element'>
************************* 第1个res的父节点 ######################### 
<div id="head_wrapper">这是第一个div<div id="son_div">这是子一div<div>这是孙一div</div></div><div id="son_div2" class="suner">这是子二div<price>45</price></div></div>
        
--------------------------------------------------
************************* 第2个res ######################### 
<div>这是孙一div</div>
<class 'lxml.etree._Element'>
************************* 第2个res的父节点 ######################### 
<div id="son_div">这是子一div<div>这是孙一div</div></div>
--------------------------------------------------
************************* 第3个res ######################### 
<div id="son_div2" class="suner">这是子二div<price>45</price></div>
<class 'lxml.etree._Element'>
************************* 第3个res的父节点 ######################### 
<div id="head_wrapper">这是第一个div<div id="son_div">这是子一div<div>这是孙一div</div></div><div id="son_div2" class="suner">这是子二div<price>45</price></div></div>
        
--------------------------------------------------
************************* 第4个res ######################### 
<div id="s-isindex-wrap">这时第二个div
        </div>
        
<class 'lxml.etree._Element'>
************************* 第4个res的父节点 ######################### 
<body><div id="head_wrapper">这是第一个div<div id="son_div">这是子一div<div>这是孙一div</div></div><div id="son_div2" class="suner">这是子二div<price>45</price></div></div>
        <div id="s-isindex-wrap">这时第二个div
        </div>
        <div id="nologin">这时第三个div</div></body>
--------------------------------------------------
************************* 第5个res ######################### 
<div id="nologin">这时第三个div</div>
<class 'lxml.etree._Element'>
************************* 第5个res的父节点 ######################### 
<body><div id="head_wrapper">这是第一个div<div id="son_div">这是子一div<div>这是孙一div</div></div><div id="son_div2" class="suner">这是子二div<price>45</price></div></div>
        <div id="s-isindex-wrap">这时第二个div
        </div>
        <div id="nologin">这时第三个div</div></body>
--------------------------------------------------
************************* last() div ######################### 
<div>这是孙一div</div>
--------------------------------------------------
<div id="son_div2" class="suner">这是子二div<price>45</price></div>
--------------------------------------------------
<div id="nologin">这时第三个div</div>
--------------------------------------------------
************************* position()=2 的div ######################### 
<div id="son_div2" class="suner">这是子二div<price>45</price></div>
--------------------------------------------------
<div id="s-isindex-wrap">这时第二个div
        </div>
        
--------------------------------------------------
************************* position()<=2 的div ######################### 
<div id="head_wrapper">这是第一个div<div id="son_div">这是子一div<div>这是孙一div</div></div><div id="son_div2" class="suner">这是子二div<price>45</price></div></div>
        
--------------------------------------------------
<div id="son_div">这是子一div<div>这是孙一div</div></div>
--------------------------------------------------
<div>这是孙一div</div>
--------------------------------------------------
<div id="son_div2" class="suner">这是子二div<price>45</price></div>
--------------------------------------------------
<div id="s-isindex-wrap">这时第二个div
        </div>
        
--------------------------------------------------
************************* 带谓语选择条件的div ######################### 
<div id="son_div">这是子一div<div>这是孙一div</div></div>
--------------------------------------------------
************************* 形式一：带多个条件筛选的div ######################### 
<div id="son_div2" class="suner">这是子二div<price>45</price></div>
--------------------------------------------------
************************* 形式二：带多个条件筛选的div ######################### 
<div id="son_div2" class="suner">这是子二div<price>45</price></div>
--------------------------------------------------
************************* 模糊查询，只要带有son的都会被找到 ######################### 
<div id="son_div">这是子一div<div>这是孙一div</div></div>
--------------------------------------------------
<div id="son_div2" class="suner">这是子二div<price>45</price></div>
--------------------------------------------------
************************* 或者形式的查询，两个条件二选一 ######################### 
<div id="s-isindex-wrap">这时第二个div
        </div>
        
--------------------------------------------------
************************* 使用子元素节点的text作为筛选条件的 ######################### 
<div id="son_div2" class="suner">这是子二div<price>45</price></div>
--------------------------------------------------
************************* mod 作为筛选条件的使用 ######################### 
<div id="son_div2" class="suner">这是子二div<price>45</price></div>
--------------------------------------------------
************************* 使用恒等式作为谓语 ######################### 
<div id="head_wrapper">这是第一个div<div id="son_div">这是子一div<div>这是孙一div</div></div><div id="son_div2" class="suner">这是子二div<price>45</price></div></div>
        
--------------------------------------------------
<div id="son_div">这是子一div<div>这是孙一div</div></div>
--------------------------------------------------
<div>这是孙一div</div>
--------------------------------------------------
<div id="son_div2" class="suner">这是子二div<price>45</price></div>
--------------------------------------------------
<div id="s-isindex-wrap">这时第二个div
        </div>
        
--------------------------------------------------
<div id="nologin">这时第三个div</div>
--------------------------------------------------
************************* 使用是否包含节点作为谓语 ######################### 
<div id="son_div2" class="suner">这是子二div<price>45</price></div>
--------------------------------------------------
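
The `last() div` and `position()=2` results above return more than one node because the predicate is applied per parent. If what you want is the last or second div of the whole document, one common approach is to wrap the path in parentheses first. A minimal sketch on a smaller, hypothetical HTML fragment:

```python
from lxml import etree

# Hypothetical fragment, used only for illustration
html = etree.HTML(
    '<div id="a"><div id="a1"></div><div id="a2"></div></div><div id="b"></div>'
)

# Predicate applied per parent: the last div child of each parent
per_parent_last = html.xpath('//div[last()]/@id')
# Parentheses build the full node-set first, then the predicate picks from it
overall_last = html.xpath('(//div)[last()]/@id')

per_parent_second = html.xpath('//div[2]/@id')
overall_second = html.xpath('(//div)[2]/@id')

print(per_parent_last, overall_last)       # ['a2', 'b'] ['b']
print(per_parent_second, overall_second)   # ['a2', 'b'] ['a1']
```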

4. An XPath example

Scrape the jokes from duanziwang.com and save them to a file.

import requests
from lxml import etree
import time
import random

url = 'https://duanziwang.com/page/%d/'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3766.400 QQBrowser/10.6.4163.400"
}

if __name__ == '__main__':
    # The with block closes the file even if a request or an XPath lookup fails
    with open('./joke.csv',mode='a',encoding='utf-8') as f:
        for i in range(1,11):
            response = requests.get(url=url%i,headers=headers,timeout=10)
            response.encoding = 'utf-8'
            html = etree.HTML(response.text)
            # print(etree.tostring(html,encoding='utf-8').decode('utf-8'))
            node_list = html.xpath('//article[@class="post"]')
            for item in node_list:
                # Every xpath() call returns a list; [0] raises IndexError if the page layout differs
                title = item.xpath('./div[@class="post-head"]/h1/a/text()')[0]
                post_time = item.xpath('./div/div/time[1]/text()')[0]
                hotRate = item.xpath('.//time[2]/text()')[0]
                like = item.xpath('.//span/text()')[0]
                content = item.xpath('./div[@class="post-content"]//code/text()')[0]
                # print(title,'\t',content)
                # strip() with no argument removes leading and trailing whitespace, newlines included; strip('\n') removes only newlines
                f.write('%s\t时间：%s\t热度：%s\t点赞%s\n%s\n'%(title.strip(),post_time.strip(),hotRate.strip(),like.strip(),content.strip()))
            # Pause 1 to 3 seconds between pages to avoid hammering the site
            time.sleep(random.randint(1,3))
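
Indexing every `xpath()` result with `[0]` raises IndexError as soon as one post lacks a field. One possible way around that, sketched below, is to move the extraction into a function with a default value. `first_text`, `parse_posts` and the sample markup are hypothetical names and data of my own; the markup only imitates the structure that the XPath expressions above expect and is not the real page.

```python
from lxml import etree

def first_text(node, expr, default=''):
    """Return the first XPath string result, stripped, or default when nothing matches."""
    found = node.xpath(expr)
    return found[0].strip() if found else default

def parse_posts(page_source):
    """Extract the title and content of every <article class="post"> in the page."""
    html = etree.HTML(page_source)
    posts = []
    for item in html.xpath('//article[@class="post"]'):
        posts.append({
            'title': first_text(item, './div[@class="post-head"]/h1/a/text()'),
            'content': first_text(item, './div[@class="post-content"]//code/text()'),
        })
    return posts

# Hypothetical markup that imitates the assumed page structure
sample = '''
<article class="post">
  <div class="post-head"><h1><a href="#"> First joke </a></h1></div>
  <div class="post-content"><p><code>Why did the chicken cross the road?</code></p></div>
</article>
<article class="post">
  <div class="post-head"><h1><a href="#">No body</a></h1></div>
</article>
'''

posts = parse_posts(sample)
for post in posts:
    print(post)
```

Keeping the parsing separate from the network request also makes it possible to test the XPath expressions on a saved page without hitting the site.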

Output:

[image not preserved]
