Python open("x", "r"): how do I know or control which encoding the file is supposed to have?

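This post discusses how to determine the encoding of a text file in Python 2.7. When a file is read with open in 'r' mode, its encoding has to be known in advance, because it cannot be detected automatically. If the file is known to be UTF-8 or ASCII, it can be decoded accordingly. If the encoding is unknown, the chardet library can be used to guess it. The post also touches on how the Mercurial version control system handles the encoding of file names.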

If a python script uses the open("filename", "r") function to open, and subsequently read, the contents of a text file, how can I tell which encoding this file is supposed to have?

Note that since I'm executing this script from my own program, if there is any way to control this through environment variables, then that is good enough for me.

This is Python 2.7 by the way.

The code in question comes from Mercurial, which can be given a list of files to, say, add to the repository through a file on disk, instead of having them passed on the command line.

So basically, instead of this:

hg add A B C

I can write out A, B and C to a file, with newlines between each, and then execute the following:

hg add listfile:input.txt

The code that ends up reading this file is this:

files = open(name, 'r').read().split(delimiter)

Hence my question. The answer I was given on IRC when I asked which encoding I should use was this:

it is the same encoding than the one you use on command line when passing a file argument

I take this to mean that it is the same encoding I "use" when I execute Mercurial (hg). I have no idea which encoding that is, since I just hand everything to the .NET Process object, so I am asking here.

Solution

You can't. Reading a file is independent of its encoding; you'll need to know the encoding in advance in order to properly interpret the bytes you read in.
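To illustrate why the encoding has to be known up front, here is a minimal sketch. It uses Python 3 syntax, where bytes and text are separate types, and the sample bytes are made up for illustration. The same two bytes produce different text depending on which codec is assumed:

```python
# The same raw bytes, interpreted under two different encodings.
raw = b"\xc3\xa9"  # hypothetical file contents: two bytes

as_utf8 = raw.decode("utf-8")      # one character: U+00E9
as_latin1 = raw.decode("latin-1")  # two characters: U+00C3 and U+00A9

print(repr(as_utf8), len(as_utf8))
print(repr(as_latin1), len(as_latin1))
```

Neither decode raises an error, so nothing in the bytes themselves says which reading is the intended one.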

For example, if you know the file is encoded in UTF-8:

with open('filename', 'rb') as f:
    contents = f.read().decode('utf-8-sig') # -sig deals with BOM, if present

Or if you know the file is ASCII only:

with open('filename', 'r') as f:
    contents = f.read() # results in a str object

If you really don't know the encoding of the file, then there's obviously no guarantee that you can read it properly; however, you can guess at the encoding using a tool like chardet.
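As a rough illustration of what "guessing" means, the sketch below uses only the standard library rather than chardet: it looks for a UTF-16 byte order mark, then tries UTF-8, then falls back to Latin-1. The helper name guess_decode and the choice of candidates are assumptions made for this example. It is written in Python 3 syntax, and it is far cruder than the statistical detection chardet performs.

```python
import codecs

def guess_decode(data):
    """Hypothetical helper: return (text, codec_name) for raw bytes."""
    # A UTF-16 byte order mark, when present, is a strong hint.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        candidates = ("utf-16", "latin-1")
    else:
        # utf-8-sig accepts UTF-8 with or without a BOM (and plain ASCII).
        candidates = ("utf-8-sig", "latin-1")
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # Not reached in practice: latin-1 maps every byte to a character,
    # so it never fails, although its result may be wrong.
    raise ValueError("could not decode data")
```

Because the Latin-1 fallback always succeeds, a successful return only means the bytes could be decoded, not that the guess is right.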

UPDATE:

I think I understand your question now. I thought you had a file you needed to write code for, but it seems you have code you need to write a file for ;-)

The code in question probably only deals properly with plain ASCII (it's possible the strings are converted later, but unlikely I think). So you'll want to make a text file that contains only ASCII (codepoint < 128) characters, and make sure it is saved in an ASCII encoding (i.e. not UTF-16 or anything like that). This is a little unfortunate considering that Mercurial deals with filenames, which can contain Unicode characters.
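A minimal sketch of producing such a file, assuming the delimiter is a newline as described in the question. The helper name write_listfile and the temporary path are made up for this example, and the code uses Python 3 syntax. Encoding to ASCII before opening the file means a non-ASCII name raises UnicodeEncodeError instead of silently producing bytes the reader may misinterpret:

```python
import os
import tempfile

def write_listfile(path, names):
    """Hypothetical helper: write names separated by newlines, ASCII only."""
    # Encode first so a non-ASCII name fails before the file is touched.
    payload = "\n".join(names).encode("ascii")
    # Binary mode avoids platform newline translation (e.g. \r\n on Windows).
    with open(path, "wb") as f:
        f.write(payload)

tmp_dir = tempfile.mkdtemp()
list_path = os.path.join(tmp_dir, "input.txt")
write_listfile(list_path, ["A", "B", "C"])

with open(list_path, "rb") as f:
    print(f.read().split(b"\n"))
```

The resulting file could then be passed as in the question, with hg add listfile:input.txt.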