Detecting a text file's encoding step by step in Python: what are the pitfalls in this detection code?

This document explores a simple Python approach to text-encoding detection that guesses a file's encoding by counting common characters. The method may work for most English text, but it can be inaccurate for languages that use special characters and may fail outright on non-ASCII input. A more specialized library such as `chardet` is the usual recommendation, but the author avoids external libraries because of dependency and environment constraints. In short, the method is suited to quickly identifying text files composed mainly of standard ASCII characters.

I know more about bicycle repair, chainsaw use and trench safety than I do Python or text encoding; with that in mind...

Python text encoding seems to be a perennial issue (my own question: Searching text files' contents with various encodings with Python?; others I've read: 1, 2). I've taken a crack at writing some code to guess the encoding below.

In limited testing this code seems to work for my purposes* without my having to know much about the first three bytes of a text encoding and the situations where those bytes aren't informative.
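For reference, the "first three bytes" alluded to here are byte-order marks (BOMs), and checking for them needs no third-party code; the signatures are already in the standard library. A minimal sketch (the function name `sniff_bom` is mine, not from the post); note that the longer signatures must be tested first, because the UTF-32 LE BOM begins with the UTF-16 LE BOM:

```python
import codecs

# Known BOM signatures, longest first: the UTF-32 LE BOM (ff fe 00 00)
# starts with the UTF-16 LE BOM (ff fe), so order matters.
BOMS = [
    (codecs.BOM_UTF32_LE, 'utf_32_le'),
    (codecs.BOM_UTF32_BE, 'utf_32_be'),
    (codecs.BOM_UTF8, 'utf_8_sig'),
    (codecs.BOM_UTF16_LE, 'utf_16_le'),
    (codecs.BOM_UTF16_BE, 'utf_16_be'),
]

def sniff_bom(file_path):
    """Return an encoding name if the file starts with a known BOM, else None."""
    with open(file_path, 'rb') as f:
        head = f.read(4)
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None  # no BOM: the leading bytes are uninformative, as the text notes
```

When `sniff_bom` returns None (the common case for files written on Windows or Unix without a BOM), a heuristic like the one below is all that is left.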

*My purposes are:

Have a dependency-free snippet I can use with a moderate-to-high degree of success;

Scan a local workstation for text-based log files of any encoding and identify the ones I am interested in based on their contents (which requires opening each file with the proper encoding); and

The challenge of getting this to work.

Question: What are the pitfalls of using what I assume to be a klutzy method of comparing and counting characters like the one below? Any input is greatly appreciated.
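To make the central pitfall concrete before the code: many single-byte codecs decode *any* byte sequence without raising an error, so "it decoded successfully" proves nothing by itself. A short stdlib-only illustration (the sample text is mine, not from the post):

```python
# Several 8-bit codecs decode arbitrary bytes without raising, so a
# successful decode is not evidence of correctness -- which is exactly
# why a character-counting heuristic is needed at all.
raw = 'café'.encode('utf_8')  # four characters become five bytes

decoded = {}
for enc in ('utf_8', 'cp1252', 'mac_roman', 'latin_1'):
    decoded[enc] = raw.decode(enc)  # none of these raises an exception

# Only the utf_8 result is the text that was actually written; the
# others silently produce mojibake.
```

This is why the code below ranks candidate encodings rather than stopping at the first one that decodes without error.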

def guess_encoding_debug(file_path):
    """
    DEBUG - returns many 2-value tuples.

    Will return a list of all possible text encodings with a count of the
    number of chars read that are common characters, which might be a
    symptom of success.

    SEE warnings in sister function.
    """
    import codecs
    import string
    from operator import itemgetter

    READ_LEN = 1000
    ENCODINGS = ['ascii', 'cp1252', 'mac_roman', 'utf_8', 'utf_16',
                 'utf_16_le', 'utf_16_be', 'utf_32', 'utf_32_le', 'utf_32_be']

    # Chars in the regular ascii printable set are BY FAR the most common
    # in most files written in English, so their presence suggests the file
    # was decoded correctly.
    nonsuspect_chars = string.printable

    # To be a list of 2-value tuples.
    results = []

    for e in ENCODINGS:
        # Some encodings will raise an exception on an incompatible file;
        # they are invalid encodings for it, so use try to exclude them
        # from results. Catch only decode/lookup errors so that genuine
        # I/O problems are not silently swallowed.
        try:
            with codecs.open(file_path, 'r', e) as f:
                # Sample from the beginning of the file.
                data = f.read(READ_LEN)
            nonsuspect_sum = 0
            # Count the number of printable ascii chars in the
            # READ_LEN-sized sample of the file.
            for n in nonsuspect_chars:
                nonsuspect_sum += data.count(n)
            # If there are more chars than READ_LEN,
            # the encoding is wrong and bloating the data.
            if nonsuspect_sum <= READ_LEN:
                results.append((e, nonsuspect_sum))
        except (UnicodeError, LookupError):
            pass

    # Sort results descending based on the nonsuspect_sum portion of
    # each tuple (itemgetter index 1).
    results = sorted(results, key=itemgetter(1), reverse=True)
    return results


def guess_encoding(file_path):
    """
    Stupid, simple, slow, brute and yet slightly accurate text file
    encoding guessing.

    Will return one likely text encoding, though there may be others
    just as likely.

    WARNING: DO NOT use if your file uses any significant number of
    characters outside the standard ASCII printable characters!

    WARNING: DO NOT use for critical applications; this code will fail you.
    """
    results = guess_encoding_debug(file_path)
    # Return the encoding string (second 0 index) from the first
    # result in the descending list of encodings (first 0 index).
    return results[0][0]

I am assuming it would be slow compared to chardet, which I am not particularly familiar with, and also less accurate. The way it is designed, any Roman-character-based language that uses accents, umlauts, etc. will not work, or at least not well, and it will be hard to know when it fails. However, most text in English, including most programming code, is largely written with the characters in string.printable, on which this code depends.
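The "hard to know when it fails" point can be shown directly: for accented text, the printable-ASCII count is identical for a correct and an incorrect decode, because the accented characters fall outside string.printable either way. A small illustration (the sample text and codec pair are mine, chosen from the ENCODINGS list above):

```python
import string

# For accented text the printable-ASCII count cannot separate the right
# codec from a wrong one: the non-ASCII characters score zero either way.
raw = 'naïve café\n'.encode('cp1252')

def printable_count(text):
    # same counting idea as the code above, reduced to one expression
    return sum(text.count(c) for c in string.printable)

right = raw.decode('cp1252')     # correct: 'naïve café\n'
wrong = raw.decode('mac_roman')  # decodes without error, but garbles ï and é

# The two decodings differ, yet they score identically, so the heuristic
# has no way to flag its own failure here.
```

This is the precise mechanism behind the accents-and-umlauts caveat: the tie is invisible to the scoring function.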

External libraries may be an option in the future, but for now I want to avoid them because:

This script will be run on multiple company computers, on and off the network, with various versions of Python, so the fewer complications the better. When I say 'company' I mean a small non-profit of social scientists.

I am in charge of collecting the logs from GPS data processing, but I am not the systems administrator; she is not a Python programmer, and the less of her time I take up the better.

The installation of Python that is generally available at my company comes bundled with a GIS software package and is generally better left alone.

My requirements aren't too strict; I just want to identify the files I am interested in and use other methods to copy them to an archive. I am not reading the full contents into memory to manipulate them, nor appending to or rewriting the contents.
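Given those requirements, the whole workflow can be sketched in a few lines of stdlib code: sample each file, decode-and-score it, look for a content marker, and copy matches untouched. Everything here is illustrative, not the author's actual values. In particular MARKER, the encoding list, and the 90% printable threshold are my assumptions:

```python
import codecs
import shutil
import string

MARKER = 'GPS'          # hypothetical string that identifies an interesting log
ENCODINGS = ('ascii', 'cp1252', 'utf_8', 'utf_16')  # illustrative subset
READ_LEN = 1000

def is_interesting(file_path):
    """Read only a small sample, never the whole file, per the requirements."""
    for enc in ENCODINGS:
        try:
            with codecs.open(file_path, 'r', enc) as f:
                sample = f.read(READ_LEN)
        except (UnicodeError, LookupError):
            continue  # this codec can't decode the file at all
        # require the sample to look mostly printable before trusting the
        # decode, then check the contents for the marker
        printable = sum(sample.count(c) for c in string.printable)
        if sample and printable >= 0.9 * len(sample) and MARKER in sample:
            return True
    return False

def archive(file_path, archive_dir):
    if is_interesting(file_path):
        # copy the file byte-for-byte; nothing is rewritten or appended
        shutil.copy2(file_path, archive_dir)
```

The 90% threshold matters: decoding a UTF-16 log as cp1252 "succeeds" but interleaves NUL characters, which both breaks the marker match and drops the printable fraction to roughly half.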

It seems like a high-level programming language should have some way of accomplishing this on its own. While "seems like" is a shaky foundation for any endeavor, I wanted to try and see if I could get it to work.

Solution

Probably the simplest way to find out how well your code works is to take the test suites for the other existing libraries and use those as a base to create your own comprehensive test suite. Then you will know whether your code works for all of those cases, and you can also test for all of the cases you care about.
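A minimal version of such a suite is just a round-trip check: write known text in a known encoding, then ask the detector what it sees. The sketch below uses a stand-in detector of my own (not the question's exact function); the one deliberate difference is that it scores the printable count as a *fraction* of the decoded sample, so a codec that halves the real character count (e.g. ascii applied to utf_16 bytes, where the NULs decode "successfully") cannot tie with the correct one:

```python
import codecs
import os
import string
import tempfile

def guess_encoding(file_path, encodings=('ascii', 'cp1252', 'utf_8',
                                         'utf_16_le', 'utf_16_be')):
    """Stand-in detector for the harness: the same printable-count idea,
    scored as a fraction of the decoded sample to break ties."""
    best_enc, best_score = None, -1.0
    for enc in encodings:
        try:
            with codecs.open(file_path, 'r', enc) as f:
                data = f.read(1000)
        except (UnicodeError, LookupError):
            continue
        if not data:
            continue
        score = sum(data.count(c) for c in string.printable) / float(len(data))
        if score > best_score:
            best_enc, best_score = enc, score
    return best_enc

def round_trip(enc, sample='plain ascii log line 42\n' * 20):
    """Write known text in a known encoding, then ask the detector."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with codecs.open(path, 'w', enc) as f:
            f.write(sample)
        return guess_encoding(path)
    finally:
        os.remove(path)

# a "comprehensive suite" is then just more encodings and more sample texts
assert round_trip('utf_16_le') == 'utf_16_le'
assert round_trip('utf_16_be') == 'utf_16_be'
# pure-ASCII bytes are valid in several codecs, so any of these is correct
assert round_trip('ascii') in ('ascii', 'cp1252', 'utf_8')
```

Note the last assertion: for pure-ASCII files, several answers are equally correct, so a good suite should accept a set of valid outcomes rather than one exact string.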
