Python Web Scraping: Using lxml

冯子玉

Category: web scraping. Tags: web scraping, lxml, HTML parsing, python

Article link: https://blog.csdn.net/qq_35488769/article/details/72855494


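When I wrote crawlers in Java, I usually parsed HTML with jsoup; when writing crawlers in Python, I usually used BeautifulSoup.

Both are easy to use but relatively slow compared with other libraries, so I am now learning how to use lxml in Python.

Below, lxml's syntax is compared with BeautifulSoup's.

1. Loading the HTML content

BeautifulSoup implementation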

>>> data = open("f:\\test5.html","rb").read()
>>> html = data.decode("utf-8","ignore")
>>> from bs4 import BeautifulSoup
>>> bs = BeautifulSoup(html,"lxml")

lxml implementation (the part that reads the HTML from the file is omitted)

>>> from lxml import etree
>>> html = etree.HTML(html)
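
The loading step can be sketched as a self-contained script. The markup below is a hypothetical stand-in for the contents of test5.html, which the original does not show.

```python
from lxml import etree

# Hypothetical sample bytes standing in for the contents of test5.html
raw = b"<html><body><p>hello</p></body></html>"

# Decode the bytes the same way as in the BeautifulSoup example
text = raw.decode("utf-8", "ignore")

# etree.HTML parses the string and returns the root <html> element
tree = etree.HTML(text)

root_tag = tree.tag                    # tag name of the root element
first_p = tree.xpath("//p")[0].text    # text of the first <p>
print(root_tag, first_p)
```

If the HTML lives in a file, etree.parse(path, etree.HTMLParser()) should also work and skips the manual read and decode.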

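2. Parsing the content

To use lxml, you first need to learn XPath syntax.

The following description comes from w3cschool:

XPath uses path expressions to select nodes or node-sets in an XML document. Nodes are selected by following a path or steps.

My understanding is that XPath treats an HTML or XML document like a file system, where each node corresponds to one segment of a path.

"/" denotes the document root (the document itself), while /html denotes the outermost element of the document.

To locate a tag, just as when locating a file, you can use either a relative path or an absolute path.

An absolute path lists every step from the outermost /html tag down to the tag we want to locate (exactly like an absolute file path).

For example, to find the a tags that sit directly under body, the absolute path is

u"/html/body/a"

The same kind of query, here for link tags: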

>>> hrefs = html.xpath(u"/html/body/link")
>>> hrefs

[<Element link at 0x3f451c0>, <Element link at 0x4c2bf08>, <Element link at 0x4c2be68>, <Element link at 0x4c2bee0>, <Element link at 0x4c2beb8>, <Element link at 0x4c2be90>, <Element link at 0x4c2bd50>]

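As you can see, the absolute path finds every link that is a direct child of body, but it does not descend into the children of those children.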
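
A minimal sketch of this behaviour, using a hypothetical two-link document rather than the author's test5.html: only the link directly under body is matched.

```python
from lxml import etree

# Hypothetical document: one <a> directly under <body>, one nested in a <div>
doc = etree.HTML(
    "<html><body>"
    "<a href='/top'>top</a>"
    "<div><a href='/nested'>nested</a></div>"
    "</body></html>"
)

# Each step of the absolute path matches direct children only
direct = doc.xpath("/html/body/a")
direct_hrefs = [a.get("href") for a in direct]
print(direct_hrefs)
```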

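Relative path (strictly speaking, "//" is XPath shorthand for "at any depth", so //a still starts from the document root)

u"//a"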

>>> hrefs = html.xpath(u"//a")

>>> hrefs

[<Element a at 0x4c39648>, <Element a at 0x4c2bb20>, <Element a at 0x4c2beb8>, <Element a at 0x4c2be90>, <Element a at 0x4c2bee0>, <Element a at 0x4c2bf08>, <Element a at 0x4c2be68>, <Element a at 0x4c2bf80>, <Element a at 0x4c2bf30>, <Element a at 0x4c2bfa8>, <Element a at 0x4c2bfd0>, <Element a at 0x4c32030>, <Element a at 0x4c32058>, <Element a at 0x4c32080>, <Element a at 0x4c320a8>, <Element a at 0x4c320d0>, <Element a at 0x4c320f8>, <Element a at 0x4c32120>, <Element a at 0x4c32148>, <Element a at 0x4c32170>, <Element a at 0x4c32198>, <Element a at 0x4c321c0>, <Element a at 0x4c321e8>, <Element a at 0x4c32210>, <Element a at 0x4c32238>, <Element a at 0x4c32260>, <Element a at 0x4c32288>, <Element a at 0x4c322b0>, <Element a at 0x4c322d8>, <Element a at 0x4c32300>, <Element a at 0x4c32350>, <Element a at 0x4c32378>, <Element a at 0x4c323a0>, <Element a at 0x4c323c8>, <Element a at 0x4c323f0>, <Element a at 0x4c32418>, <Element a at 0x4c32440>, <Element a at 0x4c32468>, <Element a at 0x4c32490>, <Element a at 0x4c324b8>, <Element a at 0x4c324e0>, <Element a at 0x4c32508>, <Element a at 0x4c32530>, <Element a at 0x4c32558>, <Element a at 0x4c32580>, <Element a at 0x4c325a8>, <Element a at 0x4c325d0>, <Element a at 0x4c325f8>, <Element a at 0x4c32620>, <Element a at 0x4c32648>, <Element a at 0x4c32670>, <Element a at 0x4c32698>, <Element a at 0x4c326c0>, <Element a at 0x4c326e8>, <Element a at 0x4c32710>, <Element a at 0x4c32738>, <Element a at 0x4c32760>, <Element a at 0x4c32788>

As you can see, //a finds every a node in the document, however deeply nested.
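
To illustrate, here is a small sketch on a hypothetical document (the id 'box' is made up for the example). It also shows a path that is relative in the strict sense, evaluated from an element with a leading ".".

```python
from lxml import etree

# Hypothetical document with links at two different depths
doc = etree.HTML(
    "<html><body>"
    "<a href='/top'>top</a>"
    "<div id='box'><a href='/nested'>nested</a></div>"
    "</body></html>"
)

# //a matches <a> elements at any depth, in document order
all_hrefs = [a.get("href") for a in doc.xpath("//a")]

# .//a is evaluated relative to the element it is called on
box = doc.xpath("//div[@id='box']")[0]
inside_box = [a.get("href") for a in box.xpath(".//a")]

print(all_hrefs, inside_box)
```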

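How do we locate the one element we actually need?

As with BeautifulSoup, we narrow the search using the tag's attributes, but XPath has its own syntax for this.

Locating an element by class name

With BeautifulSoup: href = bs.find_all("a",class_="classname")

With lxml: href = html.xpath(u"//a[@class='classname']")

So the XPath syntax for locating a specific tag is u"path[@attribute='value']", where path may be relative or absolute.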
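
One caveat worth sketching: [@class='classname'] compares the whole attribute string, whereas BeautifulSoup's class_ matches any one of several space-separated classes. The document below is hypothetical, and the longer contains(...) expression is a common XPath 1.0 idiom, not something from the original post.

```python
from lxml import etree

# Hypothetical document: one exact class, one multi-class, one without class
doc = etree.HTML(
    "<html><body>"
    "<a class='classname' href='/one'>one</a>"
    "<a class='classname extra' href='/two'>two</a>"
    "<a href='/three'>three</a>"
    "</body></html>"
)

# Exact comparison against the full attribute value
exact = [a.get("href") for a in doc.xpath("//a[@class='classname']")]

# Token match: pad the class list with spaces and look for ' classname '
token_xpath = (
    "//a[contains(concat(' ', normalize-space(@class), ' '), ' classname ')]"
)
loose = [a.get("href") for a in doc.xpath(token_xpath)]

print(exact, loose)
```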

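What if the tag has no attributes?

You can locate it by the text between the tags (the text attribute of a tag in bs). Note that this uses the text() node test, not an attribute, so there is no "@":

u"path[text()='content']", where path may be relative or absolute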
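
A short sketch of matching on text, using a hypothetical document. In XPath, @text would refer to an attribute literally named text, so the text between tags is reached with the text() node test instead; contains() is shown as a looser alternative.

```python
from lxml import etree

# Hypothetical document with three paragraphs
doc = etree.HTML(
    "<html><body>"
    "<p>content</p>"
    "<p>other</p>"
    "<p> content </p>"
    "</body></html>"
)

# Exact match on the text node: surrounding whitespace counts
exact = doc.xpath("//p[text()='content']")

# Substring match: also catches the padded paragraph
partial = doc.xpath("//p[contains(text(), 'content')]")

exact_count = len(exact)
partial_count = len(partial)
print(exact_count, partial_count)
```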

In XPath, * matches any element (much like a wildcard).

For example, u"/html/body/*/*/p" selects the p nodes whose parent is a grandchild of body.
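
As a sketch on a hypothetical document, each * consumes exactly one level, so only the p nested three levels below body is matched.

```python
from lxml import etree

# Hypothetical document with <p> elements at three depths below <body>
doc = etree.HTML(
    "<html><body>"
    "<p>depth1</p>"
    "<div><p>depth2</p><div><p>depth3</p></div></div>"
    "</body></html>"
)

# body -> any child -> any grandchild -> p
texts = [p.text for p in doc.xpath("/html/body/*/*/p")]
print(texts)
```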
