etree.HTML and garbled Chinese text: [A novice's pointers] requests + etree + a fix for garbled Chinese

Latest recommended article published 2024-02-17 22:38:06

weixin_39956353

Today I'm noting down a fix for an encoding problem (one that seems to pop up and annoy me every so often).

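import requests
from lxml import etree

req = requests.get("https://www.cn.com/index.html")  # some page that contains Chinese text

# requests falls back to ISO-8859-1 when a text response's Content-Type header carries no charset
if req.encoding == 'ISO-8859-1':
    # Look for a charset declared inside the page itself. This approach is worth digging into further,
    # but not in this post; printing each step and reading the docs makes it clear.
    encodings = requests.utils.get_encodings_from_content(req.text)
    if encodings:
        encoding = encodings[0]
    else:
        encoding = req.apparent_encoding
else:
    encoding = req.encoding or req.apparent_encoding  # otherwise encoding would be undefined below

# Put plainly, req.content is bytes!!! With 'replace', undecodable bytes become U+FFFD instead of raising.
encode_content = req.content.decode(encoding, 'replace')
selector = etree.HTML(encode_content)  # be sure to pass the decoded text
content = selector.xpath('//div[@class="publish"]/div/table')[0]  # content is an lxml element object, not a string
content_bytes = etree.tostring(content, encoding='utf-8')  # yes, this variable is bytes too
content_str = content_bytes.decode('utf-8', 'replace')  # same approach as the "!!!" line, using the encoding tostring produced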
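
To see why the ISO-8859-1 check matters without touching the network, here is a minimal sketch using only the standard library. The helper `sniff_charset` is hypothetical (it is not part of requests; it only imitates the idea of reading the charset from a `<meta>` tag), and the GBK page is simulated data standing in for `req.content`.

```python
import re

# Hypothetical helper: find a charset declared in a <meta> tag near the top of the page.
_META_CHARSET = re.compile(rb"""<meta[^>]+charset=["']?([\w-]+)""", re.IGNORECASE)


def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named in the first 2048 bytes, or `default` if none is found."""
    match = _META_CHARSET.search(raw[:2048])
    return match.group(1).decode("ascii") if match else default


# Simulated response body: a GBK-encoded page, standing in for req.content.
html = '<html><head><meta charset="gbk"></head><body><p>中文网页</p></body></html>'
raw = html.encode("gbk")

# Decoding with the ISO-8859-1 fallback never fails, but the Chinese text is garbled.
garbled = raw.decode("iso-8859-1")

# Decoding the same bytes with the declared charset recovers the original text.
decoded = raw.decode(sniff_charset(raw), "replace")

# With 'replace', a byte that cannot be decoded becomes U+FFFD rather than raising.
bad = b"ok\xff".decode("utf-8", "replace")

print("中文网页" in garbled)
print("中文网页" in decoded)
print(repr(bad))
```

The point of the sketch is that the bytes never change; only the codec used to read them does, which is why the fix operates on `req.content` rather than on `req.text`.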

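The last three lines are arguably the most confusing.

First, content is an element object rather than a string; fine. But why does something named tostring return bytes??? Taking that for granted cost me an hour (my debugging skills are weak: mostly guessing, then confirming with print). It turns out etree.tostring returns bytes whenever a byte encoding such as 'utf-8' is requested, and returns a str only when called with encoding='unicode'.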
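
The bytes-versus-str behaviour can be checked in isolation. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml (an assumption made so the example runs without third-party packages); its `tostring` follows the same convention for `encoding='utf-8'` and `encoding='unicode'`. The table markup is made-up sample data.

```python
import xml.etree.ElementTree as ET

# Made-up sample markup standing in for the table selected from the page.
root = ET.fromstring("<table><tr><th>标题</th></tr></table>")
th = root.find(".//th")

as_bytes = ET.tostring(th, encoding="utf-8")    # a byte encoding yields bytes
as_text = ET.tostring(th, encoding="unicode")   # the special value 'unicode' yields str

print(type(as_bytes).__name__, type(as_text).__name__)

# Decoding with the encoding that tostring used gives back the same text.
print(as_bytes.decode("utf-8") == as_text)

# Decoding UTF-8 bytes with a different codec (here GBK) garbles the result.
print(as_bytes.decode("gbk", "replace") == as_text)
```

If a str is all that is needed, asking for `encoding='unicode'` skips the extra decode step entirely.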

If you only need the text, none of this is necessary; just query it directly:

# selector is the one assigned in the snippet above
headers = selector.xpath('//div[@class="publish"]/div/table//th/text()')
print(headers)
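
As a rough illustration of why text extraction needs no extra decoding, the sketch below again swaps in the standard library's `xml.etree.ElementTree` for lxml (an assumption, for the sake of a dependency-free example) and reads `th.text` in place of the `text()` XPath step. The header names are invented sample data.

```python
import xml.etree.ElementTree as ET

# Invented sample bytes, standing in for a GBK-encoded response body.
raw = "<table><tr><th>姓名</th><th>日期</th></tr></table>".encode("gbk")

# Decode once, up front, then parse the resulting str.
decoded = raw.decode("gbk", "replace")
root = ET.fromstring(decoded)

# Text pulled out of the parsed tree is already str, so nothing is left to decode.
headers = [th.text for th in root.iter("th")]
print(headers)
```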
