Fixing garbled text in scraped web pages with response.encoding = "utf-8"

Latest recommended article published on 2025-03-29 15:26:33

基础决定反应速度

Article link: https://blog.csdn.net/abcdasdff/article/details/82053282

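While scraping data with requests, I found that Chinese text came out as Unicode codes when printed or saved to a file (honestly, I am not sure which encoding it was; either way it was garbled).

Fetching some site: response = requests.get("http://www.xxxxx.com/")

As we know, response has two properties, text and content. Both refer to the response body, but they are not the same thing. The docs say: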

The doc for text reads:

Content of the response, in unicode.
If Response.encoding is None, encoding will be guessed using chardet.
The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.

while the doc for content reads:

Content of the response, in bytes.

In other words, text is the body decoded into a Unicode string (str), while content is the raw, undecoded bytes.
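The difference between the two can be sketched without any network access. This is only an illustration under stated assumptions: the sample string is made up, and ISO-8859-1 stands in for "a codec that does not match the page" (it is the fallback requests applies to text responses whose headers carry no charset).

```python
# Raw bytes as a server would send them for a UTF-8 page.
# This corresponds to what response.content holds.
raw = "中文网页".encode("utf-8")

# Decoding with a mismatched codec gives mojibake. This is roughly what
# response.text looks like when the assumed encoding is wrong.
garbled = raw.decode("iso-8859-1")

# Decoding with the matching codec recovers the original string.
fixed = raw.decode("utf-8")

print(type(raw).__name__, type(fixed).__name__)
print(garbled)
print(fixed)
```

The bytes never change; only the codec used to turn them into a string does, which is why setting the encoding before reading text is enough.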

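Next, look at the page's headers and at the character encoding declared by charset in the <meta> tag under the <head> tag of the page source, for example <meta charset="utf-8">.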
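A minimal sketch of reading that declaration from the raw bytes of a page, assuming the charset appears in a meta tag in one of the two common forms. The helper name find_meta_charset and the sample HTML are hypothetical, and a regular expression is a rough heuristic here, not a full HTML parser.

```python
import re

# Matches charset=... inside a <meta ...> tag, with or without quotes.
_META_CHARSET = re.compile(rb'<meta[^>]*?charset=["\']?([\w-]+)', re.IGNORECASE)


def find_meta_charset(html_bytes):
    """Return the charset declared in a <meta> tag, or None if there is none."""
    match = _META_CHARSET.search(html_bytes)
    if match is None:
        return None
    # The charset name itself is plain ASCII, so this decode is safe.
    return match.group(1).decode("ascii").lower()


sample_short = b'<html><head><meta charset="UTF-8"></head></html>'
sample_long = (
    b'<html><head>'
    b'<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
    b'</head></html>'
)

print(find_meta_charset(sample_short))
print(find_meta_charset(sample_long))
print(find_meta_charset(b"<html><head></head></html>"))
```

With a real response, the bytes to search would be response.content, since text may already have been decoded with the wrong encoding.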

So when the HTML obtained through the text property comes out garbled, set response.encoding to the encoding the page declares before reading text, and the garbling goes away.

import requests

response = requests.get("http://www.xxxxx.com/")
response.encoding = "utf-8"  # manually set the character encoding to utf-8
print(response.text)
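Hard-coding "utf-8" only works when the page really is UTF-8. One possible variation, sketched here as an assumption and not as the method of the original post, is to fall back to the encoding that requests detects from the body (response.apparent_encoding) when the headers gave no useful charset. The helper name choose_encoding is hypothetical.

```python
def choose_encoding(response):
    """Pick an encoding for a requests-style response object (hypothetical helper).

    Works with any object that has .encoding and .apparent_encoding attributes.
    """
    header_encoding = response.encoding
    # requests falls back to ISO-8859-1 for text responses without a charset
    # in the headers, so treat that value (and None) as "unknown".
    if header_encoding is None or header_encoding.lower() == "iso-8859-1":
        # apparent_encoding is a guess made by a detection library from the
        # body bytes; it can be wrong, especially on short pages.
        return response.apparent_encoding
    return header_encoding


# Usage with requests (not executed here, since it needs network access):
#   import requests
#   response = requests.get("http://www.xxxxx.com/")
#   response.encoding = choose_encoding(response)
#   print(response.text)
```

If the page is known to be UTF-8, setting the encoding explicitly as above remains the simpler and more predictable choice.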