How can I detect the encoding/codepage of a text file
In our application, we receive text files (`.txt`, `.csv`, etc.) from diverse sources. When reading, these files sometimes contain garbage, because the files were created in a different/unknown codepage.
Is there a way to (automatically) detect the codepage of a text file?
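To make the "garbage" concrete, here is a minimal sketch in Python (chosen only for illustration; the question itself concerns .NET). The sample string and the choice of the two code pages are assumptions, not details taken from the real files.

```python
# Hypothetical sample text; any string with non-ASCII characters would do.
original = "François"

# Bytes as a tool writing in code page 850 (IBM/DOS Latin-1) would store them.
raw = original.encode("cp850")

# The same bytes read back under the wrong assumption (Windows-1252).
misread = raw.decode("cp1252")

# The same bytes read back with the code page they were written in.
reread = raw.decode("cp850")

print(misread)  # the "ç" comes back as a different character
print(reread)   # the original text
```

Both decodes succeed without an error, which is why the problem only shows up as odd characters in the text and not as a failure when reading the file.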
The `detectEncodingFromByteOrderMarks` parameter on the `StreamReader` constructor works for UTF-8 and other Unicode files that carry a byte order mark, but I'm looking for a way to detect code pages, like `ibm850` or `windows1252`.
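As a rough illustration of why byte-order-mark detection cannot help with legacy code pages, the following Python sketch checks the start of a byte string against the BOM constants in the standard `codecs` module. This is not the `StreamReader` implementation; the function name `sniff_bom` and the sample text are hypothetical.

```python
import codecs

# Longer BOMs are checked first, because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes):
    """Return a Python codec name if data starts with a known BOM, else None."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return None


sample = "François"  # hypothetical sample text
print(sniff_bom(sample.encode("utf-8-sig")))  # a BOM is present
print(sniff_bom(sample.encode("cp850")))      # no BOM to find
print(sniff_bom(sample.encode("cp1252")))     # no BOM to find
```

Single-byte code pages such as cp850 and cp1252 put no marker at the start of the file, so a check of this kind has nothing to go on and any detection has to rely on the content itself.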
Thanks for your answers, this is what I've done.
The files we receive are from end-users, who do not have a clue about codepages. The receivers are also end-users; by now this is what they know about codepages: codepages exist, and are annoying.
Solution:
- Open the received file in Notepad and look at a garbled piece of text. If somebody is called François or something, with your human intelligence you can guess this.