You mentioned in a comment that you only need to detect UTF-8. If you know the only alternative is a single-byte encoding, then there is a solution that works reliably.
If you know the file is either UTF-8 or a single-byte encoding like Latin-1, try opening it as UTF-8 first, and fall back to the other encoding if that fails. If the file contains only ASCII characters, it will open successfully as UTF-8 even if it was written in the other encoding. If it contains any non-ASCII characters, this will almost always pick the correct encoding of the two.
try:
    # use codecs.open on Python <= 2.5,
    # io.open on Python 2.6 and 2.7
    filedata = open(filename, encoding='UTF-8').read()
except UnicodeDecodeError:
    filedata = open(filename, encoding='other-single-byte-encoding').read()
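A runnable sketch of this fallback, assuming the two candidate encodings are UTF-8 and Latin-1 (the helper name and the temp-file demo are illustrative, not from the original):

```python
import os
import tempfile

def read_with_fallback(filename):
    """Try strict UTF-8 first; fall back to Latin-1 on decode failure."""
    try:
        with open(filename, encoding='UTF-8') as f:
            return f.read(), 'UTF-8'
    except UnicodeDecodeError:
        # Every byte sequence is valid Latin-1, so this cannot fail.
        with open(filename, encoding='latin-1') as f:
            return f.read(), 'latin-1'

# Demo: b'\xe9' is 'é' in Latin-1 but is not valid UTF-8,
# so the UTF-8 attempt raises and the fallback branch runs.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write('café'.encode('latin-1'))
text, used = read_with_fallback(tmp.name)
os.unlink(tmp.name)
```

Note that the fallback can never raise: Latin-1 maps every possible byte to a character, which is exactly why this trick only distinguishes UTF-8 from *one* single-byte alternative, not between several.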
Otherwise, your best bet is to use the chardet package from PyPI, either directly or via UnicodeDammit from BeautifulSoup:
chardet 1.0.1
Universal encoding detector
Detects:
ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
EUC-JP, SHIFT_JIS, ISO-2022-JP (Japanese)
EUC-KR, ISO-2022-KR (Korean)
KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
ISO-8859-2, windows-1250 (Hungarian)
ISO-8859-5, windows-1251 (Bulgarian)
windows-1252 (English)
ISO-8859-7, windows-1253 (Greek)
ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
TIS-620 (Thai)
Requires Python 2.1 or later
However, some files are valid in multiple encodings, so chardet is not a silver bullet.
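A minimal sketch of using chardet directly (requires the third-party chardet package, e.g. via pip; the sample bytes are an arbitrary illustration):

```python
import chardet

# Raw bytes in an unknown encoding (here: UTF-8-encoded Cyrillic text).
raw = 'Привет, мир! Это пример текста.'.encode('utf-8')

# chardet.detect returns a dict with the guessed encoding
# and a confidence score between 0 and 1.
guess = chardet.detect(raw)
text = raw.decode(guess['encoding'])
```

UnicodeDammit wraps the same idea but also tries declared encodings (such as an XML declaration or HTML meta tag) before falling back to statistical detection.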