Question 4: Garbled text (mojibake): how to determine the charset of a text

laoyouzhazi

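Categories: Python, Notes. Tags: python, mojibake, unicode, regular expressions, strings

This is an original article by the author. Do not reproduce it without the author's permission.

Article link: https://blog.csdn.net/qq_21264377/article/details/105834991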

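Question 3 (fetching the HTML text of a given URL) left one issue open: how do we identify the character encoding (charset) of the text? A web search turned up several related blog posts:

csdn weixin_33924220: Python encoding conversion and Chinese text handling

csm201314: Scraping web pages whose charset is gbk with Python

csdn 八戒爱飘柔: Encoding issues when using requests in Python

The rest are not listed here.

At first I specified the charset by hand, as in the first two posts (no need for any wilder hacks :)). Like the author of the third post, I also searched the official documentation without finding an answer. Then I came across a third-party package: chardet.

Let's try the "lazy" approach!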

"""
Run chardet on a bunch of documents and see that we get the correct encodings.
:author: Dan Blanchard
:author: Ian Cordasco
"""
'''
    Part of the code is omitted here.
'''
def test_detect_all_and_detect_one_should_agree(txt, enc, rnd):
    try:
        data = txt.encode(enc)
    except UnicodeEncodeError:
        assume(False)
    try:
        result = chardet.detect(data)
        results = chardet.detect_all(data)
        assert result['encoding'] == results[0]['encoding']
    except Exception:
        raise Exception('%s != %s' % (result, results))

A slightly modified version that is easier to experiment with:

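'''
    Part of the code is omitted here.
    data is assumed to be a bytes object (raw data).
    Note: detect_all() is only available in newer chardet releases.
'''
import chardet

def test(data, default_encoding='utf-8'):
    try:
        # Fast path: the data decodes cleanly with the default encoding.
        text = data.decode(default_encoding)
    except UnicodeDecodeError:
        # Fall back to chardet and check whether detect() agrees with
        # the top-ranked candidate returned by detect_all().
        result = chardet.detect(data)
        results = chardet.detect_all(data)
        if results and result['encoding'] == results[0]['encoding']:
            print('test success')
        else:
            print('test failure: %s != %s' % (result, results))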

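Judging from the example above alone, detect() and detect_all() can disagree, and detect() certainly gets some inputs wrong, for example text that mixes Chinese and English. In complex environments, where accuracy matters most, detect_all() or a combination of the two is the better choice. In my own real-world use, chardet.detect() did produce a small number of misdetections. That covers basic usage; for details, see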

example-using-the-detect-function

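'''
Example: Detecting encoding incrementally
(adapted to Python 3: urllib.request and print())
'''
import urllib.request
from chardet.universaldetector import UniversalDetector

usock = urllib.request.urlopen('http://yahoo.co.jp/')
detector = UniversalDetector()
for line in usock.readlines():
    detector.feed(line)
    # Stop as soon as the detector is confident enough.
    if detector.done:
        break
detector.close()
usock.close()
print(detector.result)
'''
Sample output from the original example (the live page may give a different result):
{'encoding': 'EUC-JP', 'confidence': 0.99}
'''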

Encodings supported by chardet on PyPI:

PyPI chardet--Supported encodings

Universal Encoding Detector currently supports over two dozen character encodings.

Big5, GB2312/GB18030, EUC-TW, HZ-GB-2312, and ISO-2022-CN (Traditional and Simplified Chinese)
EUC-JP, SHIFT_JIS, and ISO-2022-JP (Japanese)
EUC-KR and ISO-2022-KR (Korean)
KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, and windows-1251 (Russian)
ISO-8859-2 and windows-1250 (Hungarian)
ISO-8859-5 and windows-1251 (Bulgarian)
ISO-8859-1 and windows-1252 (Western European languages)
ISO-8859-7 and windows-1253 (Greek)
ISO-8859-8 and windows-1255 (Visual and Logical Hebrew)
TIS-620 (Thai)
UTF-32 BE, LE, 3412-ordered, or 2143-ordered (with a BOM)
UTF-16 BE or LE (with a BOM)
UTF-8 (with or without a BOM)
ASCII
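Several entries in the list above depend on a byte order mark (BOM). As a rough illustration of what "with a BOM" means in practice, here is a minimal sketch that checks the first bytes against the BOM constants in the standard `codecs` module. The helper name `sniff_bom` is hypothetical, the sketch does not use chardet, and it ignores the unusual 3412 and 2143 byte orders.

```python
import codecs

# Checked in order: UTF-32 before UTF-16, because the UTF-32 LE BOM
# (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
)

def sniff_bom(raw_data):
    """Hypothetical helper: return (codec_name, bom_length), or (None, 0)."""
    for bom, name in _BOMS:
        if raw_data.startswith(bom):
            return name, len(bom)
    return None, 0

data = codecs.BOM_UTF16_LE + 'hello'.encode('utf-16-le')
name, skip = sniff_bom(data)
# Skip the BOM bytes before decoding with the endian-specific codec.
print(name, data[skip:].decode(name))
```

Data without a BOM returns `(None, 0)`, which is where a statistical detector such as chardet or a trial decode has to take over.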

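Chinese character sets:

What are the main differences between the GB2312, GBK, and GB18030 character sets?

The GB 2312 standard contains 6763 Chinese characters, of which 3755 are level-1 characters and 3008 are level-2 characters. It also contains 682 other characters, including Latin letters, Greek letters, Japanese hiragana and katakana, and Russian Cyrillic letters.

GB 2312 represents every graphic character with two bytes.

GBK is the Chinese Internal Code Extension Specification (汉字内码扩展规范). The K is the initial of "Kuo" in the pinyin Kuo Zhan (扩展, "extension"). Its full English name is Chinese Internal Code Specification.

GBK contains 21886 Chinese characters and graphic symbols in total.

GBK uses a two-byte representation.

GB 18030, in full the national standard GB 18030-2005 "Information technology: Chinese coded character set", was the most recent internal code character set of the People's Republic of China at the time of writing. It is a revision of GB 18030-2000 "Information technology: Chinese ideograms coded character set for information interchange, extension for the basic set".
GB 18030 is compatible with GB 2312-1980 and GBK, and contains 70244 Chinese characters.

GB 18030 is a variable-length encoding that uses one, two, or four bytes.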
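The size differences described above can be checked with Python's built-in codecs. This is a small sketch, not part of the original post. The helper name `encoded_size` and the sample characters are my own choices. Note that Python's `gb2312` and `gbk` codecs encode ASCII characters as a single byte.

```python
def encoded_size(ch, codec):
    """Hypothetical helper: bytes needed for ch in codec, or None if not encodable."""
    try:
        return len(ch.encode(codec))
    except UnicodeEncodeError:
        return None

# 'A': ASCII; '中': in GB 2312; '體': traditional form, not in GB 2312;
# '😀': outside the Basic Multilingual Plane, only reachable in GB 18030.
for ch in ('A', '中', '體', '😀'):
    for codec in ('gb2312', 'gbk', 'gb18030'):
        # Print the code point instead of the character so the output
        # does not depend on the console encoding.
        print('U+%04X %-8s %s' % (ord(ch), codec, encoded_size(ch, codec)))
```

Because each codec in this family accepts every byte sequence its predecessor produces, bytes that decode as `gb2312` also decode as `gbk` and `gb18030`.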

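Is chardet from PyPI the only option? As the saying goes, all roads lead to Rome: different scenarios have their own solutions. Besides local files, the approach below has so far also worked in tests on web page HTML. Test environment: Windows 10, Python 3.7.*. Enough talk; let's get started.

As is well known, the command line on Windows and Linux-like systems follows a familiar unwritten debugging rule: stay silent unless something fails. A command prints a message only when it goes wrong and prints nothing when it succeeds. A blessing for "lazy" people! We can borrow the same idea here.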

'''
author: mrn6
'''

'''
@param: raw_data (bytes)
'''
def test_charset(raw_data, charset='utf-8'):
    # Raises UnicodeDecodeError if raw_data is not valid in this charset.
    text = raw_data.decode(charset)
    return text

def test_decode(raw_data, charsets=('utf-8', 'gb2312', 'gbk', 'big5', 'gb18030')):
    text = ''
    for charset in charsets:
        try:
            text = test_charset(raw_data, charset)
            if text is not None and len(text) > 0:
                break
        except (UnicodeDecodeError, LookupError):
            # This charset does not fit (or is unknown); try the next one.
            continue
    return text

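In repeated practice this method has worked every time. Its drawback is just as clear: for large texts the running time grows linearly with the text size. The worst-case time is n = len(charsets) * t, that is, the number of charsets multiplied by the time t of a single decode. Still, so far it has proved widely applicable, short, and accurate.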
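Since the question is how to obtain the charset itself, a small variation can return the name of the charset that worked together with the text. This is a sketch under the same assumptions as the code above. The function name `decode_with_charset` is hypothetical. A successful decode only shows that the bytes are valid in that charset, so the order of the candidates matters.

```python
def decode_with_charset(raw_data,
                        charsets=('utf-8', 'gb2312', 'gbk', 'big5', 'gb18030')):
    """Hypothetical variant: return (text, charset) for the first charset
    that decodes raw_data without error, or (None, None) if none does."""
    for charset in charsets:
        try:
            return raw_data.decode(charset), charset
        except UnicodeDecodeError:
            # Not valid in this charset; try the next candidate.
            continue
    return None, None

raw = '中文'.encode('gbk')
text, charset = decode_with_charset(raw)
# These GBK bytes are not valid UTF-8, so the first match is 'gb2312'.
print(charset)
```

Strict candidates such as `utf-8` are best placed first. A permissive single-byte codec like `latin-1` accepts any input, so it would only make sense as the last entry.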

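Now let's look at another scenario: the byte stream of a standard HTML file.

Python 2: print raw_data; Python 3: print(raw_data)

Print the byte stream raw_data of a standard HTML file directly, and the English (ASCII) characters in it show up exactly as they are! (Surprised? :) I won't go into why; a few print() calls will make it clear.) From here the remaining steps are really simple, and the extraction code is omitted. Compared with the method above, this one is much more pleasant: simple steps, fast, little code, and high accuracy. The drawback is that it covers a single scenario and only applies to standard HTML files.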
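The post leaves out the extraction code. One plausible reading is that, because the ASCII parts of the byte stream are readable as-is, the charset declared in a `<meta>` tag can be pulled out with a regular expression applied to the raw bytes. The sketch below follows that assumption. The function name `charset_from_html`, the pattern, and the `scan_limit` of 4096 bytes are my own choices, not the author's code.

```python
import re

# Matches both <meta charset="gbk"> and
# <meta http-equiv="Content-Type" content="text/html; charset=gbk">.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def charset_from_html(raw_data, scan_limit=4096):
    """Hypothetical helper: return the declared charset in lowercase, or None."""
    # The declaration normally sits near the top of the document,
    # so only a prefix of the byte stream is scanned.
    match = _META_CHARSET.search(raw_data[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode('ascii').lower()

html = (b'<html><head><meta charset="GBK">'
        b'<title>\xd6\xd0\xce\xc4</title></head></html>')
charset = charset_from_html(html)
text = html.decode(charset)  # decode with the declared charset
print(charset)
```

If the page declares no charset, or declares the wrong one, the function returns `None` or the decode fails, and the trial-decode approach above remains the fallback.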