Web Scraping: XPath Syntax and the lxml Module

yc_1994

Category: python. Tags: xpath, xml, web scraping

Article link: https://blog.csdn.net/weixin_44269952/article/details/106974045

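XPath syntax and the lxml module

Introduction to XPath
The lxml library
- Basic usage
- Parsing HTML with lxml
Combining lxml with XPath
- Example: scraping Douban movies with lxml + XPath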

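Introduction to XPath

What is XPath?

XPath (XML Path Language) is a language for locating information in XML and HTML documents. It can be used to traverse the elements and attributes of such documents.

XPath development tools

Chrome extension: XPath Helper.
Firefox add-on: Try XPath.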
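XPath syntax

Selecting nodes:
XPath uses path expressions to select nodes or node-sets in an XML document.
Table 1

Expression	Description	Example	Result
nodename	Selects the child elements of the current node that are named nodename	bookstore	Selects all bookstore child elements of the current node
/	At the start of a path, selects from the root node; elsewhere, selects a direct child of the preceding node	/bookstore	Selects the root element bookstore
@	Selects an attribute of a node	//book[@price]	Selects all book nodes that have a price attribute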
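
The expressions in Table 1 can be tried out with lxml. The following is a minimal sketch; the bookstore document and its titles are made up for illustration and do not come from any real site.

```python
from lxml import etree

# Hypothetical sample document, invented for this illustration.
BOOKSTORE_XML = """
<bookstore>
  <book price="10"><title>Python Basics</title></book>
  <book price="35"><title>Web Scraping</title></book>
  <book><title>No Price Listed</title></book>
</bookstore>
"""

root = etree.fromstring(BOOKSTORE_XML)

# A leading "/" starts at the document root, so this selects the root element.
root_nodes = root.xpath("/bookstore")

# A bare node name is relative to the current node (here: the root element).
child_books = root.xpath("book")

# "@price" inside a predicate keeps only the book nodes that have the attribute.
books_with_price = root.xpath("//book[@price]")

# "@price" as the last step returns the attribute values themselves.
prices = root.xpath("//book/@price")

print(len(root_nodes), len(child_books), len(books_with_price), prices)
```

Note that an attribute step returns plain strings, while the other expressions return element objects.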

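Predicates:
A predicate is used to find a specific node, or a node that contains a specific value. It is written inside square brackets.
The table below lists some path expressions with predicates and their results. Note that indexes in XPath start at 1.

Table 2

Path expression	Description
/bookstore/book[1]	Selects the first book child element of bookstore
/bookstore/book[last()]	Selects the last book child element of bookstore
/bookstore/book[position()<3]	Selects the first two book child elements of bookstore
//book[@price]	Selects all book nodes that have a price attribute
//book[@price=10]	Selects all book nodes whose price attribute equals 10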
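
The predicates in Table 2 can be checked against a small document. This is a sketch under the assumption of a three-book sample invented for illustration; `titles` is a hypothetical helper, not part of lxml.

```python
from lxml import etree

# Hypothetical three-book sample, invented for this illustration.
root = etree.fromstring(
    "<bookstore>"
    "<book price='10'><title>A</title></book>"
    "<book price='35'><title>B</title></book>"
    "<book><title>C</title></book>"
    "</bookstore>"
)


def titles(expr):
    """Hypothetical helper: run an XPath and return the title of each book found."""
    return [book.findtext("title") for book in root.xpath(expr)]


first = titles("/bookstore/book[1]")                   # indexes start at 1, not 0
last = titles("/bookstore/book[last()]")
first_two = titles("/bookstore/book[position() < 3]")
priced = titles("//book[@price]")
price_is_10 = titles("//book[@price=10]")

print(first, last, first_two, priced, price_is_10)
```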

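Wildcards:
Wildcards match nodes or attributes without naming them.

Table 3

Wildcard	Description	Example	Result
*	Matches any element node	/bookstore/*	Selects all child elements of bookstore
@*	Matches any attribute of a node	//book[@*]	Selects all book elements that have at least one attribute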
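
A short sketch of both wildcards, again on a made-up document (the `magazine` element and the `lang` attribute are assumptions added only to give the wildcards something to match):

```python
from lxml import etree

# Hypothetical sample, invented for this illustration.
root = etree.fromstring(
    "<bookstore>"
    "<book price='10'/>"
    "<book lang='en'/>"
    "<book/>"
    "<magazine/>"
    "</bookstore>"
)

# "*" matches every child element, whatever its name.
child_tags = [node.tag for node in root.xpath("/bookstore/*")]

# "@*" matches any attribute, so this keeps books with at least one attribute.
books_with_any_attr = root.xpath("//book[@*]")

print(child_tags, len(books_with_any_attr))
```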

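Selecting several paths:
The "|" operator combines several paths into one selection.
For example:

//bookstore/book | //book/title
Selects all book elements that are children of bookstore, plus all title elements that are children of a book element.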
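
The union expression above can be sketched as follows. The two-book document is a hypothetical sample; the example only counts the matched tags rather than relying on the order of the result.

```python
from collections import Counter

from lxml import etree

# Hypothetical sample, invented for this illustration.
root = etree.fromstring(
    "<bookstore>"
    "<book><title>A</title><price>10</price></book>"
    "<book><title>B</title><price>35</price></book>"
    "</bookstore>"
)

# "|" merges the node-sets selected by the two paths into one result list.
nodes = root.xpath("//bookstore/book | //book/title")
tag_counts = Counter(node.tag for node in nodes)

print(tag_counts)
```

The price elements are not selected because neither path names them.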

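Operators:

Table 4

Operator	Description	Example	Return value
div	Division	8 div 4	2
=	Equal to	price=11	true if price is 11, false if price is 9.8
!=	Not equal to	price!=11	true if price is 9.8, false if price is 11
<	Less than	price<11	true if price is 9.8, false if price is 12
>	Greater than	price>11	true if price is 12, false if price is 9.8
<=	Less than or equal to	price<=11	true if price is 8, false if price is 12
>=	Greater than or equal to	price>=11	true if price is 12, false if price is 8
or	Logical or	price>11 or price<5	true if price is 12, false if price is 6
and	Logical and	price>11 and price<15	true if price is 12, false if price is 8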
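
The operators in Table 4 can be evaluated directly, because lxml returns Python floats and booleans for XPath expressions that produce numbers or truth values. This is a minimal sketch; `check` is a hypothetical helper that builds a one-book document with the given price.

```python
from lxml import etree


def check(price, expr):
    """Hypothetical helper: evaluate expr against a book with this price."""
    book = etree.fromstring(f"<book><price>{price}</price></book>")
    return book.xpath(expr)


quotient = check(0, "8 div 4")               # numeric result comes back as a float

equal = check(11, "price = 11")
not_equal = check(9.8, "price != 11")
less_or_equal = check(8, "price <= 11")
too_big = check(12, "price <= 11")
either = check(6, "price > 11 or price < 5")
both = check(12, "price > 11 and price < 15")

print(quotient, equal, not_equal, less_or_equal, too_big, either, both)
```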

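XPath notes

Predicate indexes start at 1.
The difference between // and /:
// selects all descendant nodes, while / selects only direct child nodes.
contains: an attribute sometimes holds several values (for example, several class names). In that case, use the "contains" function to match one of them. Example:

//div[contains(@class,'job_detail')]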
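
Both notes can be seen side by side in a small sketch. The HTML below is a made-up fragment; only the class name job_detail is taken from the example above.

```python
from lxml import etree

# Hypothetical fragment, invented for this illustration.
html = etree.HTML("""
<div class="wrap">
  <div class="job_detail main"><p>Backend engineer</p></div>
  <div class="inner"><div class="job_detail"><p>Data analyst</p></div></div>
  <div class="footer"><p>Contact</p></div>
</div>
""")

# "/" only steps to direct children of the wrapper div.
direct_children = html.xpath("/html/body/div/div")

# "//" reaches every descendant div, however deeply nested.
all_descendants = html.xpath("/html/body/div//div")

# An exact comparison misses class="job_detail main" ...
exact = html.xpath("//div[@class='job_detail']")

# ... while contains() matches any class attribute holding the substring.
partial = html.xpath("//div[contains(@class,'job_detail')]")

print(len(direct_children), len(all_descendants), len(exact), len(partial))
```

Keep in mind that contains() is a plain substring test, so it would also match a class such as job_detail_old.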

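The lxml library

lxml is an HTML/XML parser. Its main job is to parse HTML/XML and extract data from it.
Installing lxml
Install it with pip: pip install lxml
If the installation fails, specify the target path explicitly:

pip install --target=d:\python\python37\lib\site-packages lxml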

Basic usage:

lxml can parse HTML code, and when the HTML is not well-formed it fills in the missing parts automatically. Example:

# Use the etree module from lxml
# encoding: utf-8
from lxml import etree

text="""
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>百度招聘</title>
    <meta name="description" content="百度官方招聘平台-诚挚邀请来自社会，校园，实习生，海外的各界精英了解百度，加入百度。百度，招最好的人，给最大的空间，看最后的结果，让优秀人才脱颖而出。">
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <meta http-equiv="Cache-Control" content="no-store" />
    <meta http-equiv="Pragma" content="no-cache" />
    <meta http-equiv="Expires" content="0" />
    <meta name="baidu-site-verification" content="OP8Pm08nAE" />
    <link rel="shortcut icon" href="images/favicon-ignore.ico" type="image/x-icon" />
    <link rel="icon" href="images/favicon-ignore.ico" type="image/x-icon" />

    <link rel="stylesheet" href="src/_common/css/app.min.css"/>

    <!--[if lt IE 9]>
        <script src="vendor/es5-shim/es5-shim.min.js"></script>
        <script src="vendor/es5-shim/es5-sham.min.js"></script>
        <script src="vendor/html5shiv/dist/html5shiv.js"></script>
    <![endif]-->

</head>
<body>
    <div class="wrap">
        <div class="header"></div>
        <div class="body"></div>
    </div>
    <div class="footer"></div>
    <script src="lib/requirejs/require.js" data-main="src/entry"></script>
    <!-- Baidu Share -->
    <script>
        window._bd_share_config={"common":{"bdSnsKey":{},"bdText":"","bdMini":"2","bdMiniList":[],"bdPic":"","bdStyle":"1","bdSize":"16"},"share":{}};with(document)0[(getElementsByTagName('head')[0]||body).appendChild(createElement('script')).src='scripts/share.min.js?v=89860593.js?cdnversion='+~(-new Date()/36e5)];
    </script>
    <!-- Baidu Tongji -->
    <script type="text/javascript">
    var _bdhmProtocol = (("https:" == document.location.protocol) ? " https://" : " http://");
    document.write(unescape("%3Cscript src='" + _bdhmProtocol + "hm.baidu.com/h.js%3F50e85ccdd6c1e538eb1290bc92327926' type='text/javascript'%3E%3C/script%3E"));
    </script>

</body>
</html>

"""
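def parse_text():
    # Parse an HTML string; lxml fills in any missing tags.
    html = etree.HTML(text)
    print(etree.tostring(html, encoding='utf-8').decode('utf-8'))


def parse_file():
    # etree.parse uses the XML parser by default, so this raises an error
    # when the HTML is not well-formed. Create an HTML parser instead.
    html = etree.parse("renrer.html")
    print(etree.tostring(html, encoding='utf-8').decode('utf-8'))


def parse_lagou_file():
    # Pass an explicit HTMLParser so that malformed HTML is tolerated.
    parser = etree.HTMLParser(encoding='utf-8')
    html = etree.parse("renrer.html", parser=parser)
    print(etree.tostring(html, encoding='utf-8').decode('utf-8'))


if __name__ == '__main__':
    parse_lagou_file()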

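Parsing HTML with lxml

Parsing an HTML string: use 'lxml.etree.HTML'.
Parsing an HTML file: use 'lxml.etree.parse'. This function uses the XML parser by default, so it raises an error on HTML that is not well-formed.
In that case, create an HTML parser yourself ('lxml.etree.HTMLParser') and pass it in through the parser argument.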
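
The difference between the two parsers can be reproduced without any external file. This is a sketch that writes a deliberately malformed snippet to a temporary file; the snippet and the file name sample.html are assumptions made for illustration.

```python
import os
import tempfile

from lxml import etree

# Hypothetical malformed HTML: the <li> tags are never closed.
BROKEN_HTML = "<html><body><ul><li>first<li>second</ul></body></html>"

# Parsing a string: etree.HTML always uses the forgiving HTML parser.
items_from_string = etree.HTML(BROKEN_HTML).xpath("//li/text()")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.html")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(BROKEN_HTML)

    # Parsing a file with the default (XML) parser fails on this input.
    try:
        etree.parse(path)
        xml_parse_failed = False
    except etree.XMLSyntaxError:
        xml_parse_failed = True

    # Passing an HTMLParser makes the same file parse successfully.
    tree = etree.parse(path, parser=etree.HTMLParser(encoding="utf-8"))
    items_from_file = tree.xpath("//li/text()")

print(xml_parse_failed, items_from_string, items_from_file)
```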

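Combining lxml with XPath

To run an XPath expression, call the 'xpath' method of an element. It returns a list of matches.
Example:

trs=html.xpath("//tr[position()>1]")

Getting an attribute of a tag

a=html.xpath("//a/@href")
# Gets the value of the href attribute of every a tag

Text is retrieved with the XPath 'text()' function. Example:

text=html.xpath(".//tr[4]/text()")

When you call xpath on a particular element to get that element's descendants,
put a dot in front of // (.//). The dot means "search under the current element"; without it, // searches the whole document.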
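
The effect of the leading dot is easiest to see on a small table. The following sketch uses a made-up table and link targets; none of it comes from a real page.

```python
from lxml import etree

# Hypothetical table, invented for this illustration.
html = etree.HTML("""
<table>
  <tr><th>Title</th><th>Link</th></tr>
  <tr><td>Job A</td><td><a href="/a">detail</a></td></tr>
  <tr><td>Job B</td><td><a href="/b">detail</a></td></tr>
</table>
<a href="/outside">outside</a>
""")

# Skip the header row: keep every tr after the first one.
rows = html.xpath("//tr[position()>1]")
first_row = rows[0]

# With the dot, the search stays inside the current row.
links_in_row = first_row.xpath(".//a/@href")

# Without the dot, "//" starts from the document root again.
links_in_document = first_row.xpath("//a/@href")

# text() returns the text nodes of the selected element.
title = first_row.xpath("./td[1]/text()")

print(len(rows), links_in_row, links_in_document, title)
```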

Example: scraping Douban movies with lxml + XPath

import requests
from lxml import etree

# Step 1: fetch the Douban page
url = "https://movie.douban.com/cinema/nowplaying/shanghai/"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"}

response = requests.get(url, headers=header)
reqt = response.text
# print(response.text)

# Step 2: process the data
html = etree.HTML(reqt)
# The first ul with class "lists" holds the movies that are now playing
ul = html.xpath("//ul[@class='lists']")[0]
lis = ul.xpath("./li")
movies = []
for li in lis:
    # Movie details are stored in data-* attributes of each li
    title = li.xpath("@data-title")[0]
    score = li.xpath("@data-score")[0]
    release = li.xpath("@data-release")[0]
    duration = li.xpath("@data-duration")[0]
    href = li.xpath(".//li[@class='poster']/a/@href")[0]
    movie = {
        'title': title,
        'score': score,
        'release': release,
        'duration': duration,
        'href': href
    }
    movies.append(movie)

print(movies)
# print(etree.tostring(ul, encoding='utf-8').decode("utf-8"))
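
The same extraction logic can be tried offline. The sketch below runs it on a hand-written fragment that only imitates the structure the code above assumes (a ul with class lists, li items with data-* attributes, and a nested li with class poster); the movie data and URLs are invented. The hypothetical `first` helper returns a default instead of raising IndexError when an attribute or link is missing.

```python
from lxml import etree

# Hand-written fragment imitating the assumed page structure; all values are made up.
SAMPLE = """
<ul class="lists">
  <li data-title="Movie A" data-score="8.1" data-release="2020" data-duration="120min">
    <ul><li class="poster"><a href="https://example.com/a"></a></li></ul>
  </li>
  <li data-title="Movie B" data-score="0" data-release="2020">
    <ul><li class="poster"></li></ul>
  </li>
</ul>
"""


def first(node, expr, default=None):
    """Hypothetical helper: first XPath match, or default if nothing matched."""
    result = node.xpath(expr)
    return result[0] if result else default


html = etree.HTML(SAMPLE)
ul = html.xpath("//ul[@class='lists']")[0]

movies = []
for li in ul.xpath("./li"):
    movies.append({
        "title": first(li, "@data-title"),
        "score": first(li, "@data-score"),
        "release": first(li, "@data-release"),
        "duration": first(li, "@data-duration"),
        "href": first(li, ".//li[@class='poster']/a/@href"),
    })

print(movies)
```

Returning a default keeps one incomplete entry from aborting the whole loop, which matters on live pages where the markup can change.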
