Garbled Chinese text when scraping pages with requests in Python 3

Latest recommended article published on 2024-05-06 13:42:45

沧_海_笑


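After some searching, I found that requests determines the encoding solely from the Content-Type response header. If the header carries a charset parameter, the encoding is identified correctly; otherwise, for text responses, requests falls back to ISO-8859-1. Pages from servers that omit the charset therefore come out garbled whenever their real encoding is something else, such as UTF-8 or GB2312.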

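For example, the browser's developer tools may show a Content-Type of just text/html, with no charset.

[Two screenshots comparing the Content-Type response headers of two different sites]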
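To make the failure concrete, here is a minimal sketch using only the standard library. The sample string is made up for illustration; it stands in for the body of a UTF-8 page whose server sent no charset.

```python
# Bytes a server might send for a UTF-8 page (the sample text is made up).
body = "中文页面".encode("utf-8")

# Decoding with the ISO-8859-1 fallback never raises, but yields mojibake:
# every byte becomes one Latin-1 character.
garbled = body.decode("ISO-8859-1")
print(repr(garbled))

# Because Latin-1 maps each byte to exactly one code point, the mistake is
# reversible: re-encode as ISO-8859-1, then decode with the real encoding.
recovered = garbled.encode("ISO-8859-1").decode("utf-8")
print(recovered)
```

The round trip in the last step is the same trick that Method 1 below relies on.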

Solutions:

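Method 1: requests.utils provides a function, get_encodings_from_content, that extracts the page encoding from the response body. When the response header carries no charset, calling get_encodings_from_content reveals the encoding declared in the page itself.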

Example (r is the response object): r.text.encode('ISO-8859-1').decode(requests.utils.get_encodings_from_content(r.text)[0])
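The one-liner above can be wrapped in a small helper. This is only a sketch: decode_response is a hypothetical name, and it assumes a requests 2.x release, where get_encodings_from_content still exists (it emits a deprecation warning in recent versions). Setting r.encoding before reading r.text has the same effect as the encode/decode round trip, and the helper falls back to r.apparent_encoding when the page declares no charset.

```python
import requests


def decode_response(r):
    """Return r.text, replacing the ISO-8859-1 fallback with a better guess.

    Hypothetical helper, not part of requests.
    """
    if r.encoding is None or r.encoding.upper() == "ISO-8859-1":
        # r.text is decoded as Latin-1 here, which keeps the ASCII
        # <meta charset=...> declaration readable for the regex search.
        declared = requests.utils.get_encodings_from_content(r.text)
        # Fall back to statistical detection if the page declares nothing.
        r.encoding = declared[0] if declared else r.apparent_encoding
    return r.text
```

Typical use would be text = decode_response(requests.get(url)), where url is whatever page is being scraped.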

Method 2: write a patch for requests.models.Response.content, which is clearly more cumbersome.

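Requests is built on urllib3 and therefore inherits all of its features. Requests supports HTTP keep-alive and connection pooling, cookie-based sessions, file uploads, automatic detection of the response encoding, internationalized URLs, and automatic encoding of POST data. Modern, international, and human-friendly.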

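While using Requests recently, I ran into a problem: some Chinese pages came back garbled, and printing encoding showed ISO-8859-1. Why? Reading the source showed that the default detection is quite simple: it reads the Content-Type response header directly. If charset is present, the encoding is identified correctly; if charset is absent but the type contains text, ISO-8859-1 is assumed. See utils.py: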

def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.
    :param headers: dictionary to extract encoding from.
    """
    content_type = headers.get('content-type')
    if not content_type:
        return None
    content_type, params = cgi.parse_header(content_type)
    if 'charset' in params:
        return params['charset'].strip("'\"")
    if 'text' in content_type:
        return 'ISO-8859-1'
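The listing above depends on the cgi module, which was removed from the standard library in Python 3.13. To experiment with the same rule on a current interpreter, a rough equivalent can be sketched with email.message.Message. The function name encoding_from_content_type is hypothetical, and this is an illustration of the rule, not the code requests ships.

```python
from email.message import Message


def encoding_from_content_type(value):
    """Mimic the header rule above for a raw Content-Type value (sketch)."""
    if not value:
        return None
    msg = Message()
    msg["content-type"] = value
    # get_param returns the charset parameter, or None if it is absent.
    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        return charset.strip("'\"")
    # No charset: text types fall back to ISO-8859-1, as in the listing.
    media_type = value.split(";", 1)[0].strip().lower()
    if "text" in media_type:
        return "ISO-8859-1"
    return None


print(encoding_from_content_type("text/html; charset=UTF-8"))
print(encoding_from_content_type("text/html"))
print(encoding_from_content_type("application/octet-stream"))
```

The second call is the case that causes the garbling: a text type with no charset.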

Requests does provide a way to get the encoding from the content; it is just not used by default. See utils.py:

def get_encodings_from_content(content):
    """Returns encodings from given content string.
    :param content: bytestring to extract encodings from.
    """
    charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
    pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
    xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')
    return (charset_re.findall(content) +
            pragma_re.findall(content) +
            xml_re.findall(content))
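One Python 3 pitfall is worth spelling out: the patterns above are str patterns, so they cannot be applied to the bytes in r.content. The sketch below copies the first pattern from the listing and runs it on a made-up sample body; decoding the bytes as ISO-8859-1 first is safe because that decoding never fails and leaves the ASCII charset declaration intact.

```python
import re

# Same pattern as charset_re in the listing above.
charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)

# Made-up sample body, as bytes (what r.content holds in Python 3).
raw = b'<html><head><meta charset="gb2312"></head></html>'

error = None
try:
    charset_re.findall(raw)  # str pattern applied to bytes
except TypeError as exc:
    error = exc  # Python 3 refuses to mix str patterns and bytes input

# Decode first, then search.
found = charset_re.findall(raw.decode("ISO-8859-1"))
print(error)
print(found)
```

In practice this means passing r.text, or a Latin-1 decoding of r.content, to get_encodings_from_content.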

It also provides chardet-based encoding detection; see models.py:

@property
def apparent_encoding(self):
    """The apparent encoding, provided by the lovely Charade library
    (Thanks, Ian!)."""
    return chardet.detect(self.content)['encoding']

How can this be fixed? First, an example:

>>> r = requests.get('http://cn.python-requests.org/en/latest/')
>>> r.headers['content-type']
'text/html'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
>>> requests.utils.get_encodings_from_content(r.text)
['utf-8']
>>> r = requests.get('http://reader.360duzhe.com/2013_24/index.html')
>>> r.headers['content-type']
'text/html'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'gb2312'
>>> requests.utils.get_encodings_from_content(r.text)
['gb2312']

With that understanding, a monkey patch can solve the problem:

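import requests

def monkey_patch():
    prop = requests.models.Response.content

    def content(self):
        _content = prop.fget(self)  # raw bytes from the original property
        if _content and self.encoding == 'ISO-8859-1':
            # The charset regexes need str in Python 3; Latin-1 decoding never fails.
            encodings = requests.utils.get_encodings_from_content(
                _content.decode('ISO-8859-1'))
            if encodings:
                self.encoding = encodings[0]
            else:
                # apparent_encoding reads self.content, i.e. this property;
                # clear the encoding first so the nested call skips this branch.
                self.encoding = None
                self.encoding = self.apparent_encoding or 'utf-8'
            # Normalize the body to UTF-8 and keep self.encoding consistent with it.
            _content = _content.decode(self.encoding, 'replace').encode('utf8', 'replace')
            self._content = _content
            self.encoding = 'utf-8'
        return _content

    requests.models.Response.content = property(content)

monkey_patch()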
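If patching a library class feels too invasive, a lighter alternative (not from the original post) might be a response hook registered on a Session. The sketch below assumes requests 2.x; fix_encoding_hook is a hypothetical name. Unlike the monkey patch, it leaves the body bytes untouched and only corrects r.encoding.

```python
import requests


def fix_encoding_hook(r, *args, **kwargs):
    """Response hook (hypothetical helper) that replaces the ISO-8859-1 fallback."""
    # Note: a server that explicitly declares ISO-8859-1 is treated the same way.
    if (r.encoding or "").upper() == "ISO-8859-1":
        # Search the start of the body for a declared charset. Reading
        # r.content loads the whole body, so this does not suit stream=True.
        head = r.content[:4096].decode("ISO-8859-1")
        declared = requests.utils.get_encodings_from_content(head)
        r.encoding = declared[0] if declared else r.apparent_encoding
    return r


session = requests.Session()
session.hooks["response"].append(fix_encoding_hook)
```

Every response obtained through session.get(...) would then pass through the hook before its text is read, while plain requests.get calls stay unaffected.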
