Character encoding problems when scraping Chinese websites with requests

薛定谔的DFA

Category: 奇怪的小问题 (Odd Little Problems). Tags: python, character encoding

Article link: https://blog.csdn.net/qq_30103413/article/details/78768925

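When scraping Chinese websites with requests, Chinese text in the response can show up as unreadable Unicode codes. The difference between response.content and response.text explains why: the encoding used to decode the body is determined by the HTTP response headers. Setting the correct character encoding, for example the charset declared in the page source, fixes the problem. Separately, the default encoding Python uses when opening a text file may not be the one you need, so specify the encoding explicitly to avoid garbled output.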

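Over the past couple of days, while scraping some portal sites with requests, I found that Chinese text came out as Unicode codes when printed or saved to a file. That was annoying, so I searched online for related issues. Once you understand what is going on, the fix is simple.
Take scraping ifeng.com (凤凰网) as an example: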

import requests

response = requests.get("http://www.ifeng.com/")

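As we know, a response has two properties, text and content. Both refer to the response body, but they are not the same thing. Their docstrings spell out the difference.

The docstring for text reads: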

Content of the response, in unicode. If Response.encoding is None, encoding will be guessed using ``chardet``. The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set ``r.encoding`` appropriately before accessing this property.
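In practice, the docstring means that text is the raw body decoded with whatever encoding requests picked from the headers, and that you can override that choice. The sketch below illustrates the idea offline with plain bytes. The helper names `decode_body` and `fetch_text` are hypothetical, and the assumption that the page is UTF-8 is for illustration only; check the charset declared in the page source. ISO-8859-1 stands in for a wrong guess, which is what requests is expected to fall back to for text responses whose headers carry no charset.

```python
# The raw body as the server sent it: bytes, which is what response.content holds.
raw = "凤凰网".encode("utf-8")


def decode_body(raw: bytes, encoding: str) -> str:
    """Roughly what response.text does once response.encoding is known."""
    return raw.decode(encoding, errors="replace")


# Decoding UTF-8 bytes with the wrong codec yields garbled text.
wrong = decode_body(raw, "ISO-8859-1")
# Decoding with the page's real charset yields readable Chinese.
right = decode_body(raw, "utf-8")


def fetch_text(url: str, encoding: str = "utf-8") -> str:
    """Fetch a page and decode it with an explicitly chosen encoding.

    The default "utf-8" is an assumption; pass the charset the page declares.
    """
    # Imported here so the offline part above runs without requests or a network.
    import requests

    response = requests.get(url, timeout=10)
    # Must be set before response.text is accessed, as the docstring advises.
    response.encoding = encoding
    return response.text
```

Calling `fetch_text("http://www.ifeng.com/")` would then return the page decoded with the chosen encoding instead of the one guessed from the headers.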

The docstring for content reads: