Python Web Scraping for Beginners (2): Data Extraction with lxml

Latest recommended article published on 2022-06-17 11:44:12

Story–teller


Tags: python, web scraping basics

Article link: https://blog.csdn.net/qq_42019407/article/details/103072642


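XPath Syntax and the lxml Module

What is XPath?

XPath is a language for locating information in XML and HTML documents. It can be used to traverse the elements and attributes of such documents.

XPath development tools

Chrome extension: XPath Helper
Firefox extension: Try XPath

XPath syntax:

Selecting nodes:

XPath uses path expressions to select nodes in an XML document. A node is selected by following a path, that is, a sequence of steps.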
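As a rough illustration of path expressions, here is a minimal sketch using lxml. The bookstore document is made up for this example and does not come from the article's data.

```python
from lxml import etree

# A tiny, made-up document used only for illustration.
doc = etree.fromstring(
    "<bookstore>"
    "<book><title lang='en'>Harry Potter</title><price>29.99</price></book>"
    "<book><title lang='en'>Learning XML</title><price>39.95</price></book>"
    "</bookstore>"
)

# "/" starts at the root and walks down one step at a time.
titles = doc.xpath("/bookstore/book/title/text()")
# "//" selects matching nodes anywhere in the document.
prices = doc.xpath("//price/text()")
# "@" selects an attribute instead of an element.
langs = doc.xpath("//title/@lang")

print(titles)
print(prices)
print(langs)
```

Each call to xpath() returns a list, even when only one node matches.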

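Predicates

A predicate is used to find a specific node, or a node that contains a specific value.

Predicates are written inside square brackets.

Examples

The table below lists some path expressions with predicates, together with the result of each expression: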
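A few common predicate forms can be tried out directly with lxml. The sketch below uses a made-up bookstore document; the element names and values are assumptions chosen for illustration.

```python
from lxml import etree

# Made-up sample data for illustration.
doc = etree.fromstring(
    "<bookstore>"
    "<book><title lang='en'>Harry Potter</title><price>29.99</price></book>"
    "<book><title lang='fr'>Le Petit Prince</title><price>35.00</price></book>"
    "<book><title>Learning XML</title><price>39.95</price></book>"
    "</bookstore>"
)

# Position-based predicates.
first = doc.xpath("/bookstore/book[1]/title/text()")
last = doc.xpath("/bookstore/book[last()]/title/text()")
first_two = doc.xpath("/bookstore/book[position()<3]/title/text()")

# Attribute-based predicates.
with_lang = doc.xpath("//title[@lang]/text()")      # has a lang attribute
english = doc.xpath("//title[@lang='en']/text()")   # lang equals 'en'

# Value-based predicate on a child element.
expensive = doc.xpath("/bookstore/book[price>35.00]/title/text()")

print(first, last, first_two, with_lang, english, expensive)
```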

Selecting unknown nodes:

XPath wildcards can be used to select XML elements whose names are not known in advance.
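The wildcards `*` (any element) and `@*` (any attribute) can be sketched as follows. The document is again made up for the example.

```python
from lxml import etree

# Made-up sample data for illustration.
doc = etree.fromstring(
    "<bookstore>"
    "<book id='b1'><title lang='en'>Harry Potter</title><price>29.99</price></book>"
    "</bookstore>"
)

# "*" matches any element: here, every child of <book>.
children = [el.tag for el in doc.xpath("/bookstore/book/*")]
# "//*" matches every element in the document.
all_tags = [el.tag for el in doc.xpath("//*")]
# "@*" matches any attribute.
attrs = doc.xpath("//@*")
# "//*[@*]" matches every element that has at least one attribute.
with_any_attr = [el.tag for el in doc.xpath("//*[@*]")]

print(children, all_tags, attrs, with_any_attr)
```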

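General usage

Use // to search the whole page, then write the tag name, then add a predicate to narrow down the match.

Points to note

1. The difference between / and //: / selects only direct children, while // selects all descendants. In practice // is used more often.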
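The difference is easy to see on a small fragment. The sketch below uses made-up HTML and also shows a related pitfall: when calling xpath() on an element, `//` still searches the whole document, while `.//` stays inside that element.

```python
from lxml import etree

# Made-up HTML fragment for illustration.
html = etree.HTML(
    "<div id='outer'>"
    "<p>direct child</p>"
    "<div><p>nested paragraph</p></div>"
    "</div>"
    "<div id='other'><p>elsewhere</p></div>"
)
outer = html.xpath("//div[@id='outer']")[0]

# "/" only looks at direct children of the current node.
direct = outer.xpath("./p/text()")
# ".//" looks at all descendants of the current node.
descendants = outer.xpath(".//p/text()")
# "//" without the leading dot searches the whole document,
# even though xpath() is called on the outer element.
whole_document = outer.xpath("//p/text()")

print(direct)
print(descendants)
print(whole_document)
```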

2. contains: an attribute sometimes holds several values, in which case you can use contains.

For example:

//div[contains(@class,'jab-detail')]
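A minimal sketch of why contains helps when a class attribute holds several values. The HTML is made up; only the class name jab-detail is taken from the expression above.

```python
from lxml import etree

# Made-up HTML for illustration.
html = etree.HTML(
    "<div class='jab-detail active'>first</div>"
    "<div class='jab-detail'>second</div>"
    "<div class='other'>third</div>"
)

# An exact comparison only matches when the whole attribute value is equal.
exact = html.xpath("//div[@class='jab-detail']/text()")
# contains() matches whenever the value includes the given substring.
partial = html.xpath("//div[contains(@class,'jab-detail')]/text()")

print(exact)
print(partial)
```

Keep in mind that contains is a plain substring test, so it would also match a class such as jab-detail-old.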

3. Indexes in predicates start at 1, not 0.
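The contrast with Python's 0-based lists can be sketched like this, using a made-up list of items:

```python
from lxml import etree

# Made-up HTML for illustration.
html = etree.HTML("<ul><li>a</li><li>b</li><li>c</li></ul>")

# Inside an XPath predicate, the first item is [1].
first_by_xpath = html.xpath("//li[1]/text()")
# [0] is valid syntax but never matches anything.
zero_index = html.xpath("//li[0]/text()")

# The list returned by xpath() is a normal Python list, indexed from 0.
items = html.xpath("//li/text()")
first_by_python = items[0]

print(first_by_xpath, zero_index, first_by_python)
```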

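The lxml library

lxml is an HTML/XML parser. Its main job is to parse HTML/XML data and extract information from it.

Like Python's regular expression engine, lxml is implemented in C (it is built on the libxml2 and libxslt C libraries), which makes it a high-performance HTML/XML parser for Python. With the XPath syntax covered above, we can quickly locate specific elements and node information.

lxml depends on C libraries, but it can be installed with pip: pip install lxml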
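One quick way to check that the installation worked is to import the module and print its version information. This is only a sanity check, and the numbers printed depend on your environment.

```python
from lxml import etree

# Both attributes are version tuples exposed by lxml.etree.
print("lxml version:", etree.LXML_VERSION)
print("libxml2 version:", etree.LIBXML_VERSION)
```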

Basic usage:

We can use it to parse HTML code. If the HTML is malformed, lxml completes it automatically.

#encoding=utf-8

from lxml import etree
text = """
 <div class="r_city_tit">推荐城市：</div>
    <ul class="r_city_con">
                 <li class="r_search_item"><a href="https://www.lagou.com/beijing/">北京找工作</a></li>
                 <li class="r_search_item"><a href="https://www.lagou.com/beijing/">北京招聘</a></li>
                 <li class="r_search_item"><a href="https://www.lagou.com/shanghai/">上海找工作</a></li>
                 <li class="r_search_item"><a href="https://www.lagou.com/shanghai/">上海招聘</a></li>
                 <li class="r_search_item"><a href="https://www.lagou.com/hangzhou/">杭州找工作</a></li>
                 <li class="r_search_item"><a href="https://www.lagou.com/hangzhou/">杭州招聘</a></li>
                 <li class="r_search_item"><a href="https://www.lagou.com/guangzhou/">广州找工作</a></li>
                 <li class="r_search_item"><a href="https://www.lagou.com/guangzhou/">广州招聘</a></li>
                 <li class="r_search_item"><a href="https://www.lagou.com/shenzhen/">深圳找工作</a></li>
                 <li class="r_search_item"><a href="https://www.lagou.com/shenzhen/">深圳招聘</a></li>
                 <li class="r_search_item"><a href="https://www.lagou.com/chengdu/">成都找工作</a></li>
                 <li class="r_search_item"><a href="https://www.lagou.com/chengdu/">成都招聘</a></li>
           </ul>
</div>
"""

htmlElem=etree.HTML(text)
print(etree.tostring(htmlElem,encoding='utf-8').decode('utf-8'))

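etree.parse() uses the XML parser by default when parsing an HTML file, so malformed HTML causes a parse error. In that case you need to specify an HTML parser yourself.

Besides parsing directly from a string, lxml can also read content from a file. Create a file named hello.html,

then read it with the etree.parse() method:

htmlElem = etree.parse("hello.html")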

When a page contains mismatched tags, as in the following situation,

you can use HTMLParser to parse the HTML code:

parser = etree.HTMLParser(encoding='utf-8')
htmlElem = etree.parse("text.html", parser=parser)
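The difference between the two parsers can be sketched without creating a file, because etree.parse() also accepts file-like objects. The broken HTML string below is made up for illustration.

```python
import io
from lxml import etree

# Made-up malformed HTML: <p> and <br> are never closed.
broken = "<html><body><p>unclosed paragraph<br></body></html>"

# The default XML parser rejects the mismatched tags.
try:
    etree.parse(io.StringIO(broken))
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser recovers and builds a usable tree.
parser = etree.HTMLParser()
tree = etree.parse(io.StringIO(broken), parser=parser)
text = tree.xpath("//p/text()")

print(xml_ok)
print(text)
```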

Example code

from lxml import etree

parser=etree.HTMLParser(encoding='utf-8')
html=etree.parse("tenxun.html",parser=parser)

# Get all div tags
# divs=html.xpath("//div")
# for div in divs:
#     print(etree.tostring(div,encoding='utf-8').decode('utf-8'))

# Get the second div tag (note: //div[2] matches every div that is the second div child of its parent)
# div=html.xpath("//div[2]")[0]
# print(etree.tostring(div,encoding='utf-8').decode('utf-8'))

# Get the div tags with class="header-wrap"
# divs=html.xpath("//div[@class='header-wrap']")
# for div in divs:
#     print(etree.tostring(div,encoding='utf-8').decode('utf-8'))

# Get the href attribute of every a tag
# ass = html.xpath("//a/@href")
# for a in ass:
#     print("https://www.luogu.org/problem/list"+a)

# Get plain text
trs = html.xpath("//div[@class='row-wrap']")
infos = []
for tr in trs:
    # xpath() returns a list, so take element [0] before concatenating;
    # 'https://www.luogu.org/problem/list' + <list> would raise a TypeError
    href = tr.xpath(".//div[@class='title']/a/@href")[0]
    fullhref = 'https://www.luogu.org/problem/list' + href
    num = tr.xpath(".//div[@class='part left-part']/span/text()")[0]
    title = tr.xpath(".//div[@class='title']/a/text()")[0]
    info = {
        'number': num,
        'name': title,
        'link': fullhref
    }
    infos.append(info)
print(infos)
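The loop above has two fragile spots: indexing with [0] raises an IndexError when a row has no match, and plain string concatenation only produces a valid URL if the href has exactly the expected shape. One possible way to soften both is sketched below. The helper first() and the sample row are hypothetical and not part of the original code; urljoin comes from the standard library.

```python
from urllib.parse import urljoin
from lxml import etree


def first(nodes, default=None):
    """Return the first XPath result, or a default when nothing matched."""
    return nodes[0] if nodes else default


# Made-up row for illustration; the real tenxun.html may look different.
row = etree.HTML(
    "<div class='row-wrap'>"
    "<div class='title'><a href='/problem/P1001'>A+B Problem</a></div>"
    "</div>"
).xpath("//div[@class='row-wrap']")[0]

base = "https://www.luogu.org/problem/list"
href = first(row.xpath(".//div[@class='title']/a/@href"))
# No element has this class, so the default is returned instead of raising.
missing = first(row.xpath(".//div[@class='no-such-class']/text()"), default="")

# urljoin resolves href against base instead of gluing the strings together.
full_href = urljoin(base, href) if href else None

print(full_href)
print(repr(missing))
```

Whether urljoin or simple concatenation is the better fit depends on what the href values in the real page look like.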

tenxun.html

<div data-v-6c294b5c="" class="border"><div data-v-6c294b5c="" class="header-wrap"><div data-v-65fb3fca=""
