The difference between 'content' and 'text':
content
Content of the response, in bytes.
text
Content of the response, in unicode.
If Response.encoding is None, encoding will be guessed using chardet. The encoding of the response content is determined based solely on HTTP headers.
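As a rough sketch of how the two attributes relate (an illustration using only the standard library, not the library's actual code): text is what you get by decoding content with the response's encoding. The helper name `to_text` and the UTF-8 fallback used when no encoding is known are assumptions made for this example; requests itself guesses from the bytes in that case.

```python
from typing import Optional


def to_text(content: bytes, encoding: Optional[str]) -> str:
    """Hypothetical helper: decode raw body bytes the way a 'text' view would."""
    # Assumption for this sketch: fall back to UTF-8 when no encoding is known.
    # (requests would guess the encoding from the bytes instead.)
    chosen = encoding or "utf-8"
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    return content.decode(chosen, errors="replace")


body = "价格: 10M".encode("utf-8")  # what 'content' would hold: raw bytes
print(type(body).__name__)  # bytes
print(to_text(body, "utf-8"))
```

The point of the sketch is that content never changes; only the encoding chosen for decoding decides whether text is readable.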
Garbled text
Problem
When reading text from a requests response, the Chinese characters come out garbled, for example: 全国移动10M-3å
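Output of this kind can usually be reproduced without any network call. The sketch below assumes the server sent UTF-8 bytes and that they were decoded as ISO-8859-1; the sample string borrows the readable part of the example above.

```python
original = "全国移动10M"

# Bytes as a UTF-8 server would send them.
raw = original.encode("utf-8")

# Decoding with the wrong codec: every byte becomes one Latin-1 character.
garbled = raw.decode("iso-8859-1")
print(repr(garbled))

# ISO-8859-1 maps each byte to exactly one code point, so the damage is reversible.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == original)  # True
```

The ASCII part ("10M") survives because UTF-8 and ISO-8859-1 agree on those bytes; only the multi-byte Chinese characters are mangled.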
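Analysis
requests takes the character set from the Content-Type header of the server's response. Only if Content-Type includes a charset parameter can requests identify the encoding reliably; otherwise, for text content types, it falls back to the default ISO-8859-1. Non-compliant pages often have this problem. Printing r.encoding shows the encoding requests settled on.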
# Source excerpt from requests.utils
def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    :rtype: str
    """
    content_type = headers.get('content-type')
    if not content_type:
        return None
    content_type, params = cgi.parse_header(content_type)
    if 'charset' in params:
        return params['charset'].strip("'\"")
    if 'text' in content_type:
        return 'ISO-8859-1'
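To see the fallback in action without installing anything, here is a minimal standalone sketch of the same decision logic. Because the stdlib cgi module used in the excerpt was removed in Python 3.13, the sketch parses the header with email.message.Message instead; the function name `guess_encoding` is hypothetical and the parsing details may differ from what requests does internally.

```python
from email.message import Message
from typing import Optional


def guess_encoding(content_type: Optional[str]) -> Optional[str]:
    """Hypothetical stand-in mirroring the fallback logic shown above."""
    if not content_type:
        return None
    # Let the stdlib parse the "type/subtype; key=value" header syntax.
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        return charset.strip("'\"")
    # No charset parameter: text types fall back to ISO-8859-1.
    if "text" in msg.get_content_type():
        return "ISO-8859-1"
    return None


for header in ("text/html; charset=utf-8", "text/html", "application/json", None):
    print(header, "->", guess_encoding(header))
```

The second case is the one that causes garbled Chinese: a text/html response with no charset is treated as ISO-8859-1 even when the body is really UTF-8 or GBK.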
Solution
Method 1: if you know the encoding, set it explicitly. For example:
r = requests.post(url=url)
r.encoding = 'utf-8'  # set this before reading r.text
print(r.text)
Method 2: take the raw bytes and decode them with the encoding you want.
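r = requests.post(url=url)
content = r.content  # raw bytes, untouched by r.encoding
print(content.decode('utf-8'))  # decode() defaults to UTF-8; pass the page's real encoding if it differs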
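When the encoding is not known in advance, Method 2 can be extended by trying a few likely candidates in order. This is only a sketch under stated assumptions: the helper name `decode_with_fallback` and the candidate list (UTF-8 first, then GB18030 for Chinese pages) are illustrative choices, not part of requests.

```python
from typing import Iterable


def decode_with_fallback(raw: bytes, candidates: Iterable[str] = ("utf-8", "gb18030")) -> str:
    """Hypothetical helper: return the first strict decode that succeeds."""
    for name in candidates:
        try:
            return raw.decode(name)  # strict mode raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway, marking undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


print(decode_with_fallback("中文".encode("utf-8")))
print(decode_with_fallback("中文".encode("gbk")))
```

With a real response this would be called as decode_with_fallback(r.content). Order matters: a strict codec such as UTF-8 should come first, because a permissive single-byte codec like ISO-8859-1 accepts any input and would mask the later candidates.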