Extraction problems encountered with from lxml import etree

木下瞳

Article link: https://blog.csdn.net/zjkpy_5/article/details/81041815

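Scraping structure

import requests
from lxml import etree

res = requests.get(url, headers=headers)
selector = etree.HTML(res.text)  # parse the response into an element tree
url_infos = selector.xpath('xpath path')  # select the outer container elements first
for url_info in url_infos:
    url_info.xpath('...')[0]
    ...

After selecting the outer elements, the path in url_info.xpath('...')[0] must point inside that element. It has to be a descendant of the outer path; it cannot be the outer path itself or anything outside it.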

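selector = etree.HTML(res.text) parses malformed HTML into a consistent, well-formed tree. lxml.html can do the same and serialize the repaired markup:
from lxml.html import fromstring, tostring
broken_html = '...'
tree = fromstring(broken_html)  # parse the HTML
fixed_html = tostring(tree, pretty_print=True)  # serialize the repaired tree (returns bytes)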
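
As a small illustration of the repair step, the sketch below feeds a fragment with unclosed <li> tags to lxml.html. The markup is made up for the example; the point is only that the parser closes the tags and the tree can be queried normally afterwards.

```python
from lxml.html import fromstring, tostring

# Hypothetical broken fragment: neither <li> is closed.
broken_html = '<ul><li>one<li>two</ul>'

tree = fromstring(broken_html)  # the parser repairs the structure
fixed_html = tostring(tree, pretty_print=True).decode('utf-8')

# The repaired tree now has two sibling <li> elements.
items = [li.text for li in tree.xpath('.//li')]

print(fixed_html)
print(items)
```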

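url_info.xpath('string(.)').strip():

Reference: "XPath string(): extracting text from multiple child nodes", by 雷子-LL on 博客园 (cnblogs)

This extracts all the text inside the currently selected element, including the text of its descendants.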
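
A minimal sketch of the difference between text() and string(.), using made-up markup. text() returns only the direct text nodes as a list, while string(.) joins the text of the element and all of its descendants into one string.

```python
from lxml import etree

# Hypothetical markup: the text is split around a nested <b> element.
html = '<div id="info"><p>Title: <b>lxml</b> in practice</p></div>'
node = etree.HTML(html).xpath('//div[@id="info"]/p')[0]

pieces = node.xpath('./text()')         # direct text nodes only, the <b> text is missing
full = node.xpath('string(.)').strip()  # all text under the node, as one string

print(pieces)
print(full)
```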

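url_info.xpath('//*[@id="info"]/text()[2]'):

When the path ends up relying on something like text()[2], a regular expression is usually the better choice, because the position of a text node changes as soon as the markup around it does.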
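
One possible way to replace a positional text() lookup with a regular expression is sketched below. The markup and the "Year" label are assumptions made for the example; the idea is to take the whole text of the block with string(.) and match on a label instead of a position.

```python
import re
from lxml import etree

# Hypothetical info block where the values are loose text nodes between tags.
html = ('<div id="info"><span>Author</span>: Jane<br/>'
        '<span>Year</span>: 2018<br/></div>')
info = etree.HTML(html).xpath('//*[@id="info"]')[0]

text = info.xpath('string(.)')  # all text of the block as one string
match = re.search(r'Year\s*:\s*(\d{4})', text)
year = match.group(1) if match else None

print(year)
```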

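url_info.xpath('//h1[@class="title"]/text()'):

When the first step of the path is a tag selected by an attribute value, start the path with '//', not '/'. A single '/' is an absolute path from the document root, so '/h1' only matches if h1 is the root element, while '//h1' searches the whole document.

url_info.xpath('//h1[@class="title"]/text()') is correct

url_info.xpath('/h1[@class="title"]/text()') is wrong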
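
The following sketch, on made-up markup, shows the two forms side by side, plus the full absolute path that a single leading '/' would actually require.

```python
from lxml import etree

# Hypothetical page with the <h1> nested a few levels down.
html = '<html><body><div><h1 class="title">Hello</h1></div></body></html>'
tree = etree.HTML(html)

found = tree.xpath('//h1[@class="title"]/text()')  # searches the whole document
missed = tree.xpath('/h1[@class="title"]/text()')  # h1 is not the root element
full_path = tree.xpath('/html/body/div/h1[@class="title"]/text()')

print(found, missed, full_path)
```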

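When the url_info.xpath('') path is correct but nothing matches:

Suppose three elements sit under the same path:

1) .xpath('//div[@class="meta"]/span[1]/text()')[0]

2) .xpath('//div[@class="meta"]/span[2]/text()')[0]

3) .xpath('//div[@class="meta"]/span[3]/text()')[0]

1) and 2) work, but 3) matches nothing. Searching the page source shows that the elements for 1) and 2) are there, while the element for 3) is not. That means element 3) is loaded asynchronously. If the content you need still appears in the source inside a <script> tag, extract it with a regular expression.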
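
A rough sketch of that situation follows. The page, the pageData variable name, and the JSON shape are all invented for the example; real pages will embed their data differently, so the regular expression has to be adapted to the actual source.

```python
import json
import re
from lxml import etree

# Hypothetical page source: the third value exists only inside a <script> tag.
html = '''<html><body>
<div class="meta"><span>2018</span><span>Drama</span></div>
<script>var pageData = {"rating": "8.7"};</script>
</body></html>'''

tree = etree.HTML(html)
spans = tree.xpath('//div[@class="meta"]/span/text()')
third = tree.xpath('//div[@class="meta"]/span[3]/text()')  # not in the static HTML

# Fall back to a regular expression over the raw source.
match = re.search(r'var pageData = (\{.*?\});', html)
rating = json.loads(match.group(1))['rating'] if match else None

print(spans, third, rating)
```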

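“*”

Use * to select every element at a given level, regardless of tag name.

For example, ./* selects all child elements of the current node.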
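
A short sketch on made-up markup: ./* returns the direct children only, and .//* returns every descendant element.

```python
from lxml import etree

# Hypothetical container with mixed child tags.
html = '<div id="box"><h2>Title</h2><p>Text</p><ul><li>a</li></ul></div>'
box = etree.HTML(html).xpath('//div[@id="box"]')[0]

children = [el.tag for el in box.xpath('./*')]      # direct children only
descendants = [el.tag for el in box.xpath('.//*')]  # all nested elements

print(children)
print(descendants)
```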

Selecting by substring

"//a[contains(@href, 'baidu')]" selects links whose href contains baidu
"//a[starts-with(@href, 'http://www.')]" selects links whose href starts with http://www.
"//a[not(contains(@href, 'abc'))]" selects links whose href does not contain abc
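
The three predicates can be tried on a small, made-up set of links as follows. Wrapping the expression in double quotes lets the XPath string literals keep their single quotes.

```python
from lxml import etree

# Hypothetical links for the example.
html = '''<div>
<a href="http://www.baidu.com/s">search</a>
<a href="https://example.com/abc">abc</a>
<a href="/local/page">local</a>
</div>'''
tree = etree.HTML(html)

baidu = tree.xpath("//a[contains(@href, 'baidu')]/@href")
www = tree.xpath("//a[starts-with(@href, 'http://www.')]/@href")
no_abc = tree.xpath("//a[not(contains(@href, 'abc'))]/@href")

print(baidu, www, no_abc)
```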

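“.” the current node

“.” refers to the current node. When the XPath looks right but the results are wrong, write the path relative to the current node. A path that starts with '/' is evaluated from the document root, even when it is called on a sub-element.

.xpath('./h3/a/@title') returns the correct result
.xpath('/h3/a/@title') returns the wrong result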
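
To make the difference concrete, the sketch below calls three variants of the path on the second of two made-up items. Only the relative form stays inside the selected element.

```python
from lxml import etree

# Hypothetical list page with two items.
html = '''<html><body>
<div class="item"><h3><a title="First">1</a></h3></div>
<div class="item"><h3><a title="Second">2</a></h3></div>
</body></html>'''
items = etree.HTML(html).xpath('//div[@class="item"]')
second = items[1]

relative = second.xpath('./h3/a/@title')  # relative to the current item
absolute = second.xpath('/h3/a/@title')   # from the document root, matches nothing
anywhere = second.xpath('//h3/a/@title')  # whole document, ignores the current item

print(relative, absolute, anywhere)
```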

.xpath('.//div[last()]'), the last() function

Selects the last matching element at that step. To select the second to last, use last() - 1, and so on.
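
A minimal sketch of last() on a made-up list, using li elements instead of div to keep the markup short:

```python
from lxml import etree

# Hypothetical navigation list.
html = '<ul id="nav"><li>one</li><li>two</li><li>three</li></ul>'
nav = etree.HTML(html).xpath('//ul[@id="nav"]')[0]

last_item = nav.xpath('./li[last()]/text()')
second_last = nav.xpath('./li[last() - 1]/text()')

print(last_item, second_last)
```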

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes

https://blog.csdn.net/weixin_42081389/article/details/103891908

resp = requests.get(url, headers=headers)
resp_text = resp.text
html = etree.HTML(resp_text.encode('utf-8'))  # pass bytes instead of str
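
The error typically shows up when the page text begins with an XML declaration that names an encoding. The sketch below uses a made-up document of that kind. Whether the str call raises may depend on the lxml version, so it is wrapped in try/except; the bytes call is the part that mirrors the fix above. With requests, passing resp.content (already bytes) should be an equivalent alternative.

```python
from lxml import etree

# Hypothetical page text that starts with an encoding declaration.
text = ('<?xml version="1.0" encoding="utf-8"?>'
        '<html><head><title>Demo</title></head><body></body></html>')

try:
    etree.HTML(text)  # str input with an encoding declaration
    str_result = 'parsed'
except ValueError:
    str_result = 'ValueError'

# Encoding to bytes first avoids the problem.
tree = etree.HTML(text.encode('utf-8'))
title = tree.xpath('//title/text()')

print(str_result, title)
```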
