Detecting the encoding correctly every time is impossible.
(From the chardet FAQ:)
However, some encodings are optimized
for specific languages, and languages
are not random. Some character
sequences pop up all the time, while
other sequences make no sense. A
person fluent in English who opens a
newspaper and finds “txzqJv 2!dasd0a
QqdKjvz” will instantly recognize that
that isn’t English (even though it is
composed entirely of English letters).
By studying lots of “typical” text, a
computer algorithm can simulate this
kind of fluency and make an educated
guess about a text’s language.
The chardet library uses that kind of study to try to detect the encoding. chardet is a port of the auto-detection code in Mozilla.
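As a rough illustration of how chardet is typically called, here is a minimal sketch. It assumes the third-party chardet package is installed (the import is guarded so the snippet still runs without it); `guess_encoding` and the sample text are hypothetical names chosen for this example, and the result is only a guess with a confidence score, not a guarantee.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None  # the sketch degrades gracefully if chardet is missing


def guess_encoding(data: bytes):
    """Return (encoding, confidence) as guessed by chardet, or None if chardet is unavailable."""
    if chardet is None:
        return None
    # chardet.detect returns a dict with at least 'encoding' and 'confidence' keys.
    result = chardet.detect(data)
    return result["encoding"], result["confidence"]


# Hypothetical sample: French text encoded as Windows-1252.
sample = "Ça coûte très cher, même à l'été.".encode("windows-1252")
print(guess_encoding(sample))
```

Short inputs give the detector little to work with, so the reported confidence is worth checking before trusting the guess.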
You can also use UnicodeDammit. It tries the following methods, in order:
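> An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it ignores any encoding it finds in the document.
> An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
> An encoding sniffed by the chardet library, if you have it installed.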
> UTF-8
> Windows-1252
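The fallback order above can be approximated with the standard library alone. The following is a simplified sketch, not Beautiful Soup's actual implementation: it skips the in-document declaration and chardet steps, only sniffs byte order marks, and `decode_with_fallbacks` is a hypothetical name introduced for illustration.

```python
import codecs

# Byte order marks, with UTF-32 LE listed before UTF-16 LE because the
# UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_with_fallbacks(data: bytes):
    """Return (text, encoding_used), trying BOM sniffing, then UTF-8, then Windows-1252."""
    # Step 1: sniff the first few bytes for a byte order mark.
    for bom, name in _BOMS:
        if data.startswith(bom):
            try:
                return data.decode(name), name
            except UnicodeDecodeError:
                break  # the mark was misleading; fall through to the next steps

    # Step 2: strict UTF-8. Invalid byte sequences raise an error.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # Step 3: Windows-1252 as the last resort. A handful of byte values are
    # undefined in this code page, so they are replaced rather than raising.
    return data.decode("windows-1252", errors="replace"), "windows-1252"


print(decode_with_fallbacks("héllo".encode("utf-8")))
print(decode_with_fallbacks(b"caf\xe9"))
```

Putting Windows-1252 last makes sense because almost any byte string decodes under it, so it can never signal failure the way strict UTF-8 can.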