Learning Python: garbled Chinese text when scraping with requests


sentimental_dog


Category: Machine Learning

Article link: https://blog.csdn.net/sentimental_dog/article/details/52661974


First, here is what the official documentation says:

Compliance

Requests is intended to be compliant with all relevant specifications and RFCs where that compliance will not cause difficulties for users. This attention to the specification can lead to some behaviour that may seem unusual to those not familiar with the relevant specification.

Encodings

When you receive a response, Requests makes a guess at the encoding to use for decoding the response when you access the Response.text attribute. Requests will first check for an encoding in the HTTP header, and if none is present, will use chardet to attempt to guess the encoding.

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

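In other words, if the Content-Type header says the body is text but declares no charset, requests falls back to the default ISO-8859-1 (commonly known as latin-1), even though most web pages today are actually encoded in UTF-8. What does this lead to?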

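When requests fetches an HTML document, what it actually receives is a stream of bytes (similar to `str` in Python 2). Because Python represents text internally as Unicode, requests tries to decode this byte stream into a unicode object (for encode and decode, see http://blog.csdn.net/sentimental_dog/article/details/52658725).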

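Because the target URL's headers provide no charset, the byte stream is decoded as latin-1 to produce the unicode object we see. The page itself, however, is encoded in UTF-8, so what we actually need is a UTF-8 decode. The bytes are decoded with the wrong codec, and the resulting text is garbled.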

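So how do we recover the text? First call encode('latin-1') on the garbled object to restore the original byte stream, then call decode('utf-8') to decode those bytes as UTF-8. The result is the unicode object that the UTF-8 byte stream was meant to produce.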
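The round trip described above can be sketched offline, without any network call. The sample string and the helper name `repair_mojibake` are illustrative assumptions, not part of requests:

```python
def repair_mojibake(garbled):
    """Undo a wrong latin-1 decode of bytes that were really UTF-8.

    encode('latin-1') maps each character back to the byte it came from,
    and decode('utf-8') then interprets those bytes correctly.
    """
    return garbled.encode("latin-1").decode("utf-8")


original = "厦门天气"                     # text the page really contains (sample)
raw_bytes = original.encode("utf-8")     # bytes sent by the server
garbled = raw_bytes.decode("latin-1")    # what a latin-1 decode produces
repaired = repair_mojibake(garbled)      # equal to `original` again
```

This only works because latin-1 maps every byte value to exactly one character, so the wrong decode loses no information.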

How do we detect this situation?

It is quite simple: we only need to check whether the response's encoding is wrong.

import requests

url = 'http://weather.sina.com.cn/xiamen'
content = requests.get(url)
print(content.encoding)  # ISO-8859-1
# This confirms that the response is being decoded as latin-1

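Of course, if we already know the page is encoded in UTF-8, we can set response.encoding = 'utf-8' before reading response.text (an attribute, not a method). Then there is no need to restore the bytes with encode('latin-1') and decode them again as described above.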
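A minimal sketch of applying that fix only when the server declared no charset. The helper name `ensure_encoding` and the UTF-8 fallback are assumptions for illustration, and the `demo` object is a stand-in so the sketch runs offline:

```python
from types import SimpleNamespace


def ensure_encoding(response, fallback="utf-8"):
    """Set response.encoding to `fallback` if Content-Type has no charset.

    `response` only needs a `headers` mapping and a writable `encoding`
    attribute, which a requests.Response provides.
    """
    content_type = response.headers.get("Content-Type", "")
    if "charset" not in content_type.lower():
        # No charset declared: an ISO-8859-1 value here is only the default.
        response.encoding = fallback
    return response


# Stand-in for a real response whose header carries no charset (hypothetical).
demo = SimpleNamespace(headers={"Content-Type": "text/html"},
                       encoding="ISO-8859-1")
ensure_encoding(demo)  # demo.encoding is now "utf-8"
```

With a real response, the call would go between `requests.get(url)` and the first access to `.text`. If the page might not be UTF-8, `response.apparent_encoding` could be passed as the fallback instead.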

The raw body of the response is available in response.content as bytes, and you can process it however you like.

Attentive readers may also have noticed that `print response.content` and `print response.text` often display the same result. Why is that?

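According to the documentation quoted above, response.content gives us the original byte stream. Since the page is encoded in UTF-8, what we get is a UTF-8 encoded string, so in Python 2 `print response.content` displays correctly on a UTF-8 terminal.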

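What response.text does is call decode on response.content to obtain a unicode object. If the encoding is correct, all this does is turn the UTF-8 byte stream into a unicode object, so the printed output is the same. But if response.encoding is wrong, the resulting unicode object is not what we want, and the output is garbled.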
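The relationship between the two attributes can be modelled with a small stand-in class. `FakeResponse` is hypothetical and only approximates how requests behaves. The sample body is also an assumption:

```python
class FakeResponse:
    """Hypothetical stand-in modelling how .text relates to .content."""

    def __init__(self, content, encoding):
        self.content = content      # raw bytes, never decoded
        self.encoding = encoding    # codec applied when .text is read

    @property
    def text(self):
        # Decode the raw bytes with whatever encoding is currently set.
        return self.content.decode(self.encoding, errors="replace")


body = "厦门".encode("utf-8")               # sample UTF-8 page body
right = FakeResponse(body, "utf-8")         # encoding matches the page
wrong = FakeResponse(body, "ISO-8859-1")    # default applied by mistake

same_when_correct = right.text.encode("utf-8") == right.content
differs_when_wrong = wrong.text != right.text
```

Both objects hold identical bytes in `content`. Only the codec used for `text` differs, which is why fixing `encoding` is enough to fix the output.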

For the difference between unicode and UTF-8 objects, see http://blog.csdn.net/sentimental_dog/article/details/52662259

This article draws on 依云's answer on SegmentFault: https://segmentfault.com/q/1010000000341014