requests_html encoding: Python + Requests encoding-detection bug

Latest recommended article published 2024-05-01 19:51:16

weixin_39831567

Article tags: requests_html encoding

Requests is an HTTP library released under the Apache2 license. It is written in Python and aims to be friendlier and easier to use.

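Requests is built on urllib3 and therefore inherits all of its features. It supports HTTP keep-alive and connection pooling, cookie-based session persistence, file uploads, automatic detection of the response content's encoding, and automatic encoding of internationalized URLs and POST data. Modern, international, and human-friendly.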

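While using Requests recently I ran into a problem: when scraping certain Chinese web pages the text came out garbled, and printing encoding showed ISO-8859-1. Why does this happen? Reading the source, I found that the default encoding detection is quite simple: it takes the encoding straight from the Content-Type response header. If a charset parameter is present, the encoding is recognized correctly; if there is no charset but the type contains text, the encoding is assumed to be ISO-8859-1. See utils.py: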

    # Excerpt from requests/utils.py; the module imports cgi at the top.
    def get_encoding_from_headers(headers):
        """Returns encodings from given HTTP Header Dict.

        :param headers: dictionary to extract encoding from.
        """
        content_type = headers.get('content-type')

        if not content_type:
            return None

        content_type, params = cgi.parse_header(content_type)

        if 'charset' in params:
            return params['charset'].strip("'\"")

        if 'text' in content_type:
            return 'ISO-8859-1'
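To see this header rule in isolation, here is a minimal sketch that mirrors it using only the standard library. The name `encoding_from_content_type` is hypothetical and not part of Requests, and parameter parsing goes through `email.message.Message` because the `cgi` module was removed in Python 3.13.

```python
from email.message import Message


def encoding_from_content_type(value):
    """Hypothetical helper: apply the header-only rule to one Content-Type value."""
    if not value:
        return None
    msg = Message()
    msg['content-type'] = value
    # get_param reads parameters of the Content-Type header and unquotes them.
    charset = msg.get_param('charset')
    if isinstance(charset, str) and charset:
        return charset.strip("'\"")
    # No charset parameter: any text/* type falls back to ISO-8859-1.
    if 'text' in msg.get_content_type():
        return 'ISO-8859-1'
    return None
```

A bare `text/html` header therefore yields ISO-8859-1 no matter what bytes the page actually contains, which is exactly the situation that produces garbled Chinese text.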

Requests does in fact provide a way to get the encoding from the content itself; it is just not used by default. See utils.py:

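    # Excerpt from requests/utils.py; the module imports re at the top.
    def get_encodings_from_content(content):
        """Returns encodings from given content string.

        :param content: bytestring to extract encodings from.
        """
        charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
        pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
        xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

        return (charset_re.findall(content) +
                pragma_re.findall(content) +
                xml_re.findall(content))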
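The following sketch shows how this kind of content scan behaves on sample pages. `declared_encodings` is a hypothetical stand-alone function, and its patterns are modeled on the ones in the Requests helper, so treat the exact expressions as an assumption. The bytes are decoded leniently first because, on Python 3, a text pattern cannot be applied to a bytes object.

```python
import re

# Patterns modeled on the Requests helper (assumed, not copied from a specific release).
CHARSET_RE = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
PRAGMA_RE = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
XML_RE = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')


def declared_encodings(raw):
    """Hypothetical helper: list the encodings a page declares about itself."""
    # Declarations are plain ASCII, so dropping non-ASCII bytes is safe here.
    text = raw.decode('ascii', 'ignore')
    return (CHARSET_RE.findall(text) +
            PRAGMA_RE.findall(text) +
            XML_RE.findall(text))


SAMPLE_PAGE = (b'<html><head><meta http-equiv="Content-Type" '
               b'content="text/html; charset=gb2312"></head></html>')
```

Calling `declared_encodings(SAMPLE_PAGE)` returns a list with the single entry `gb2312`, and a page with no declaration returns an empty list, in which case some other detection method is needed.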

It also provides encoding detection based on chardet; see models.py:

    @property
    def apparent_encoding(self):
        """The apparent encoding, provided by the lovely Charade library
        (Thanks, Ian!)."""
        return chardet.detect(self.content)['encoding']
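For readers without chardet at hand, the idea of guessing an encoding from the bytes alone can be approximated by trial decoding. This is a crude stand-in and not how chardet works (chardet uses statistical models); `guess_encoding` and its candidate list are hypothetical.

```python
def guess_encoding(raw, candidates=('utf-8', 'gb18030')):
    """Hypothetical helper: return the first candidate that decodes raw cleanly."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    # None of the candidates could decode the bytes.
    return None
```

Candidate order matters: strict encodings such as UTF-8 should come first, because permissive ones (ISO-8859-1 accepts any byte sequence) would always match. GB18030 is listed rather than GB2312 because it is a superset and decodes GB2312 pages as well.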

How can this be fixed? First, look at an example:

    >>> r = requests.get('http://cn.python-requests.org/en/latest/')
    >>> r.headers['content-type']
    'text/html'
    >>> r.encoding
    'ISO-8859-1'
    >>> r.apparent_encoding
    'utf-8'
    >>> requests.utils.get_encodings_from_content(r.content)
    ['utf-8']

    >>> r = requests.get('http://reader.360duzhe.com/2013_24/index.html')
    >>> r.headers['content-type']
    'text/html'
    >>> r.encoding
    'ISO-8859-1'
    >>> r.apparent_encoding
    'gb2312'
    >>> requests.utils.get_encodings_from_content(r.content)
    ['gb2312']
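Why the wrong encoding shows up as garbled text can be reproduced without any network access. The sketch below uses a two-character sample string (an assumption for illustration) and decodes its UTF-8 bytes the way the ISO-8859-1 fallback would.

```python
original = '中文'
raw = original.encode('utf-8')           # the bytes a UTF-8 page would send

# ISO-8859-1 maps every byte to one character, so decoding never fails,
# it just yields one wrong character per byte.
wrong = raw.decode('ISO-8859-1')

# Nothing is lost: re-encoding recovers the bytes, which can then be
# decoded with the encoding the page really uses.
recovered = wrong.encode('ISO-8859-1').decode('utf-8')
```

With a real response object, a commonly used equivalent is to assign `r.encoding = r.apparent_encoding` before reading `r.text`.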

With that understood, the problem can be worked around with a monkey patch:

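    import requests

    def monkey_patch():
        prop = requests.models.Response.content

        def content(self):
            _content = prop.fget(self)
            # Only step in when Requests fell back to its header default.
            if _content and self.encoding == 'ISO-8859-1':
                # The helper scans text, so decode leniently for the <meta> search.
                head = _content.decode('ascii', 'ignore')
                encodings = requests.utils.get_encodings_from_content(head)
                if encodings:
                    encoding = encodings[0]
                else:
                    # apparent_encoding reads self.content; clear the fallback
                    # first so that nested call skips this branch (no recursion).
                    self.encoding = None
                    encoding = self.apparent_encoding or 'utf-8'
                _content = _content.decode(encoding, 'replace').encode('utf8', 'replace')
                self._content = _content
                # The stored bytes are now UTF-8, so r.text must decode them as such.
                self.encoding = 'utf-8'
            return _content

        requests.models.Response.content = property(content)

    monkey_patch()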
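The decision the patch makes can also be written as two small pure functions, which is easier to test than a patched property. This is a sketch under assumptions: `choose_encoding` and `to_utf8` are hypothetical names, and the three inputs stand for the header encoding, the encodings declared in the content, and the detected encoding.

```python
def choose_encoding(header_encoding, declared, detected):
    """Hypothetical helper: pick an encoding in the same order as the patch."""
    # A header encoding other than the ISO-8859-1 fallback is trusted as-is.
    if header_encoding and header_encoding.upper() != 'ISO-8859-1':
        return header_encoding
    # Otherwise prefer what the page declares, then what detection suggests.
    if declared:
        return declared[0]
    return detected or header_encoding


def to_utf8(raw, encoding):
    """Re-encode raw bytes as UTF-8, replacing anything undecodable."""
    return raw.decode(encoding, 'replace').encode('utf-8')
```

One limitation carries over from the patch: a server that explicitly sends `charset=ISO-8859-1` is indistinguishable from the fallback case, so its pages are re-examined as well.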

Original article: http://my.oschina.net/u/1188877/blog/493207
