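Over the past couple of days I was using requests to scrape data from a few portal sites, and I noticed that Chinese text, whether printed or saved to a file, showed up as Unicode code sequences instead of readable characters. That was annoying enough that I went looking online for the cause. As it turns out, once you understand what is going on, the fix is simple.
For example, take fetching the ifeng.com (凤凰网) home page: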
import requests
response = requests.get("http://www.ifeng.com/")
As we all know, response has two properties, text and content. Both refer to the response body, but they are not the same thing. Their docstrings show the difference:
The docstring for text reads:
Content of the response, in unicode.
If Response.encoding is None, encoding will be guessed using ``chardet``.
The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set ``r.encoding`` appropriately before accessing this property.
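To make the docstring concrete, here is a minimal sketch of what a wrong encoding guess does to Chinese text. It does not call requests at all: it simulates a response body with plain bytes, and `decode_body` is a hypothetical helper introduced only for illustration. It assumes the server sends UTF-8 bytes while the headers lead to ISO-8859-1, the RFC 2616 default for text content with no declared charset.

```python
# Simulated raw response body: UTF-8 bytes, as a server might send them.
raw_body = "凤凰网".encode("utf-8")


def decode_body(body: bytes, encoding: str) -> str:
    """Hypothetical helper: decode raw bytes with a given encoding."""
    return body.decode(encoding, errors="replace")


# Decoding with the wrong encoding turns each byte into its own character.
wrong = decode_body(raw_body, "iso-8859-1")

# Decoding with the encoding the bytes were actually written in works.
right = decode_body(raw_body, "utf-8")

print(repr(wrong))
print(right)
```

With requests itself, the corresponding step would be assigning `response.encoding = "utf-8"` before reading `response.text`, as the docstring suggests, assuming the page really is UTF-8.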
And the docstring for content reads: