[A Dabbler's Guide] Fixing garbled Chinese text with requests + etree

取啥都被占用

Category: Python · Tags: etree

Article link: https://blog.csdn.net/u011410413/article/details/106004885

Today I'm writing down the fix for an encoding problem (one that seems to pop up and annoy me every so often).

import requests
from lxml import etree

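req = requests.get("https://www.cn.com/index.html")  # some web page that contains Chinese

# When a text/* response declares no charset in its Content-Type header, requests
# falls back to ISO-8859-1, so look for the real encoding in that case.
encoding = req.encoding or req.apparent_encoding
if req.encoding == 'ISO-8859-1':
    # Charsets declared inside the HTML itself. There is more to dig into here than this
    # post covers; printing each step and reading the docs makes it clear.
    encodings = requests.utils.get_encodings_from_content(req.text)
    if encodings:
        encoding = encodings[0]
    else:
        encoding = req.apparent_encoding  # guessed from the raw bytes
# req.content is plain bytes! With 'replace', undecodable bytes become U+FFFD instead of raising.
encode_content = req.content.decode(encoding, 'replace')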


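selector = etree.HTML(encode_content)  # be sure to pass in the decoded text
content = selector.xpath('//div[@class="publish"]/div/table')[0]  # content is an lxml element object, not a string
content_bytes = etree.tostring(content, encoding='utf-8')  # yes, this variable is bytes too
content_str = content_bytes.decode('utf-8', 'replace')  # same decode approach as for req.content above; it was serialized as UTF-8, so decode as UTF-8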

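The last three lines of the snippet above are arguably the most confusing part.

First, content is an object (an lxml element) rather than a string, and that much is fine. But why does something named tostring return bytes??? That one assumption cost me an hour (my debugging skills are limited: I mostly guess, then confirm with print).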
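The bytes-versus-str behaviour of tostring can be checked offline with a small sketch. The table fragment and the variable names below are made up for illustration, and the import falls back to the standard library's ElementTree (whose tostring behaves the same way for these calls) only so the sketch runs even where lxml is not installed. Passing encoding='unicode' is one way to get a str back directly and skip the manual decode.

```python
try:
    from lxml import etree  # the library used in this post
except ImportError:
    # Assumption: stdlib fallback, used only so the sketch runs without lxml.
    import xml.etree.ElementTree as etree

# Hypothetical fragment standing in for the table selected by the XPath query.
table = etree.fromstring('<table><tr><th>标题</th></tr></table>')

# With a byte encoding such as 'utf-8', tostring returns bytes.
as_bytes = etree.tostring(table, encoding='utf-8')

# Those bytes are UTF-8, so they must be decoded as UTF-8,
# whatever encoding the original page used.
decoded = as_bytes.decode('utf-8')

# With encoding='unicode', tostring returns a str and no decode step is needed.
as_str = etree.tostring(table, encoding='unicode')

print(type(as_bytes).__name__, type(as_str).__name__)
print(as_str)
```

Whether to decode manually or ask for 'unicode' is a matter of taste; the second form simply leaves less room for decoding with the wrong codec.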

If you only need to extract the text, it is much less complicated. Just do this:

# selector is assigned as in the code snippet above.
headers= selector.xpath('//div[@class="publish"]/div/table//th/text()')
print(headers)
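For anyone who wants to see the underlying cause without touching the network, here is a minimal sketch using only the standard library. The GBK-encoded page below is a made-up stand-in for req.content, and the variable names are illustrative. It shows that decoding with the wrong codec silently produces garbage, that the right codec restores the text, and what the 'replace' error handler does.

```python
# Hypothetical sample: a tiny HTML page encoded as GBK, standing in for req.content.
html = '<html><head><meta charset="gbk"></head><body><p>中文内容</p></body></html>'
raw = html.encode('gbk')

# ISO-8859-1 maps every byte to some character, so this never fails,
# but the Chinese text comes out as mojibake.
wrong = raw.decode('ISO-8859-1')

# Decoding with the codec the page was actually written in restores the text.
right = raw.decode('gbk', 'replace')

# With errors='replace', bytes that are invalid for the chosen codec
# become U+FFFD (the replacement character) instead of raising UnicodeDecodeError.
lossy = raw.decode('utf-8', 'replace')

print('中文内容' in wrong)   # False
print('中文内容' in right)   # True
print('\ufffd' in lossy)     # True
```

This is why the main snippet only trusts req.encoding when it is something other than the ISO-8859-1 fallback.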
