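How to Detect the Encoding of a Text File
Disclaimer: the tip described here is not a silver bullet. There are many code pages and encodings out there that it cannot handle. If you work in a Windows environment, however, it may help.
When working with text or CSV files, I usually encounter files encoded in one of three ways:
- ANSI, or more precisely: Windows-1252
- UTF-8 with a byte order mark (BOM)
- UTF-8 without a BOM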
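To make the difference between the three cases concrete, here is a small sketch that prints the bytes produced for the single character 'ä'. The class and method names (EncodingBytesDemo, ToHex, Run) are made up for this illustration. It also assumes that Latin-1 (code page 28591) may stand in for Windows-1252, because both encode 'ä' as the same single byte and Latin-1 is available on every .NET runtime without extra setup.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of the original tip.
public static class EncodingBytesDemo
{
    // Returns the bytes of `text` in `encoding`, optionally prefixed with the
    // encoding's preamble (the BOM), formatted as hex pairs such as "C3-A4".
    public static string ToHex(string text, Encoding encoding, bool withPreamble)
    {
        byte[] preamble = withPreamble ? encoding.GetPreamble() : new byte[0];
        byte[] body = encoding.GetBytes(text);

        byte[] all = new byte[preamble.Length + body.Length];
        Buffer.BlockCopy(preamble, 0, all, 0, preamble.Length);
        Buffer.BlockCopy(body, 0, all, preamble.Length, body.Length);
        return BitConverter.ToString(all);
    }

    public static void Run()
    {
        // Assumption: Latin-1 stands in for Windows-1252 (same byte for 'ä').
        Encoding ansiLike = Encoding.GetEncoding(28591);
        Encoding utf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);

        string text = "\u00E4"; // 'ä'
        Console.WriteLine(ToHex(text, ansiLike, false)); // E4
        Console.WriteLine(ToHex(text, utf8, true));      // EF-BB-BF-C3-A4
        Console.WriteLine(ToHex(text, utf8, false));     // C3-A4
    }
}
```

Only the second variant carries the three marker bytes EF BB BF at the start; the other two have to be told apart by their content alone.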
To read such files, I used to use StreamReader like this:
// requires: using System.Text;
public string ReadFile(string path)
{
    using (var sr = new System.IO.StreamReader(path: path,
        encoding: Encoding.Default,
        detectEncodingFromByteOrderMarks: true))
    {
        return sr.ReadToEnd();
    }
}
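If a BOM is present, StreamReader uses it to pick the correct encoding. In every other case, it uses the default encoding that was passed in (here Windows-1252). This works well for the first two cases, but fails for UTF-8 without a BOM.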
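The failure is easy to reproduce. The following sketch decodes UTF-8 bytes with a single-byte encoding, which is what happens when no BOM is found and the reader falls back to ANSI. The names MojibakeDemo, DecodeUtf8BytesAsSingleByte and Run are hypothetical. Latin-1 (code page 28591) is again used as a stand-in for Windows-1252, on the assumption that this keeps the example independent of what Encoding.Default returns on a given runtime (on .NET Core and .NET 5+ it is UTF-8, not the system's ANSI code page).

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of the original tip.
public static class MojibakeDemo
{
    // Encodes `text` as UTF-8 without a BOM and then decodes those bytes
    // with a single-byte encoding, one character per byte.
    public static string DecodeUtf8BytesAsSingleByte(string text)
    {
        byte[] utf8Bytes = new UTF8Encoding(false).GetBytes(text);

        // Assumption: Latin-1 stands in for Windows-1252; both map the
        // bytes 0xC3 and 0xA4 to the same two characters.
        Encoding ansiLike = Encoding.GetEncoding(28591);
        return ansiLike.GetString(utf8Bytes);
    }

    public static void Run()
    {
        // 'ä' is stored as the two bytes C3 A4, which come back as "Ã¤".
        Console.WriteLine(DecodeUtf8BytesAsSingleByte("\u00E4"));
    }
}
```

No exception is thrown in this direction, because every byte is a valid character in a single-byte code page. The text is simply wrong.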
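The question is:
How can ANSI and UTF-8 be told apart when there is no BOM?
One approach goes like this:
- Read the file with UTF-8 encoding and catch DecoderFallbackException.
- If such an exception is thrown, read the file again with ANSI encoding, which is then probably the right choice.
However, this means the file or stream has to be read a second time, and the same work should not be done twice!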
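The two steps above can be sketched as follows. This is a minimal illustration under a few assumptions: the name TwoPassDecoder.Decode is made up, the whole content is already in memory as a byte array (so the second pass re-decodes the buffer instead of re-reading a file), and BOM handling is left out.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of the original tip.
public static class TwoPassDecoder
{
    // First pass: strict UTF-8, which throws on any invalid byte sequence.
    // Second pass: the caller-supplied fallback (for example an ANSI code page).
    public static string Decode(byte[] bytes, Encoding fallback)
    {
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            // The bytes are not valid UTF-8, so decode them all over again.
            return fallback.GetString(bytes);
        }
    }
}
```

For example, Decode(new byte[] { 0x61, 0xE4 }, Encoding.GetEncoding(28591)) takes the fallback path, because a lone 0xE4 is not a complete UTF-8 sequence. The cost is visible in the catch block: everything decoded so far is thrown away and the input is processed a second time.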
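In the end, I came up with a different solution.
FlexiStreamReader
The stock StreamReader does a good job of choosing the correct encoding based on the BOM, so we only need to handle the case where the BOM is missing and StreamReader falls back to the encoding it was given.
So we need a new Encoding to hand to StreamReader; this is EncodingProvider:
This class derives from System.Text.Encoding. Because it is only used to get characters from a byte stream, only the methods GetCharCount, GetChars and GetMaxCharCount need a real implementation. The remaining abstract members are overridden only because the compiler requires it.
These methods handle DecoderFallbackException; this is where the actual encoding is switched from UTF-8 to Default.
So when StreamReader reads from the stream through EncodingProvider, it starts with UTF-8 and switches to Default (Windows-1252) as soon as the exception occurs. The stream is not touched and the position within it does not need to change, so this also works for forward-only streams.
Because EncodingProvider serves a very special purpose and some of its methods are not even implemented, it should not be a public class. Instead, I chose to create FlexiStreamReader and make EncodingProvider a private class nested inside it: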
using System;
using System.Text;
using System.IO;

namespace MyClassLibrary
{
    /// <summary>
    /// StreamReader that is to some extent capable of detecting the encoding of a stream.
    /// </summary>
    public class FlexiStreamReader : StreamReader
    {
        /// <summary>
        /// Initializes a new instance of <see cref="FlexiStreamReader"/>
        /// class for the specified stream. The character encoding will be
        /// detected by the byte order mark. In addition the reader is capable
        /// of distinguishing between UTF8 and Default encoding.
        /// </summary>
        /// <param name="stream">The stream to read.</param>
        public FlexiStreamReader(Stream stream) :
            base(stream, new EncodingProvider(), detectEncodingFromByteOrderMarks: true)
        {
        }

        /// <summary>
        /// Initializes a new instance of <see cref="FlexiStreamReader"/>
        /// class for the specified file name. The character encoding will be
        /// detected by the byte order mark. In addition the reader is
        /// capable of distinguishing between UTF8 and Default encoding.
        /// </summary>
        /// <param name="path">The path of the file to read.</param>
        public FlexiStreamReader(string path) :
            base(path, new EncodingProvider(), detectEncodingFromByteOrderMarks: true)
        {
        }

        /// <inheritdoc />
        public override Encoding CurrentEncoding
        {
            get
            {
                // Without a BOM the base reader still holds our EncodingProvider;
                // report the encoding it actually ended up using.
                var enc = base.CurrentEncoding as EncodingProvider;
                if (enc != null)
                {
                    return enc.InternalEncoding;
                }
                return base.CurrentEncoding;
            }
        }

        /// <summary>
        /// Internal Encoding:
        /// Starts with UTF-8 and switches to Default
        /// in case of a DecoderFallbackException.
        /// </summary>
        private class EncodingProvider : System.Text.Encoding
        {
            private Encoding m_internalEncoding;
            private bool m_useAnsi = false;

            public EncodingProvider()
            {
                // throwOnInvalidBytes: true makes invalid UTF-8 raise
                // DecoderFallbackException instead of being replaced silently.
                m_internalEncoding = new UTF8Encoding
                    (encoderShouldEmitUTF8Identifier: true, throwOnInvalidBytes: true);
            }

            public override int GetByteCount(char[] chars, int index, int count)
            {
                // no need for an implementation:
                throw new NotImplementedException();
            }

            public override int GetBytes(char[] chars,
                int charIndex, int charCount, byte[] bytes, int byteIndex)
            {
                // no need for an implementation:
                throw new NotImplementedException();
            }

            public override int GetCharCount(byte[] bytes, int index, int count)
            {
                try
                {
                    return m_internalEncoding.GetCharCount(bytes, index, count);
                }
                catch (DecoderFallbackException)
                {
                    if (m_useAnsi)
                    {
                        throw;
                    }
                    // Not valid UTF-8: switch to Default and count again.
                    m_useAnsi = true;
                    m_internalEncoding = System.Text.Encoding.Default;
                    return m_internalEncoding.GetCharCount(bytes, index, count);
                }
            }

            public override int GetChars(byte[] bytes, int byteIndex,
                int byteCount, char[] chars, int charIndex)
            {
                try
                {
                    return m_internalEncoding.GetChars
                        (bytes, byteIndex, byteCount, chars, charIndex);
                }
                catch (DecoderFallbackException)
                {
                    if (m_useAnsi)
                    {
                        throw;
                    }
                    // Not valid UTF-8: switch to Default and decode the same bytes again.
                    m_useAnsi = true;
                    m_internalEncoding = System.Text.Encoding.Default;
                    return m_internalEncoding.GetChars
                        (bytes, byteIndex, byteCount, chars, charIndex);
                }
            }

            public override int GetMaxByteCount(int charCount)
            {
                // Computes a size only; no bytes are decoded here.
                try
                {
                    return m_internalEncoding.GetMaxByteCount(charCount);
                }
                catch (DecoderFallbackException)
                {
                    if (m_useAnsi)
                    {
                        throw;
                    }
                    m_useAnsi = true;
                    m_internalEncoding = System.Text.Encoding.Default;
                    return m_internalEncoding.GetMaxByteCount(charCount);
                }
            }

            public override int GetMaxCharCount(int byteCount)
            {
                // Computes a size only; no bytes are decoded here.
                try
                {
                    return m_internalEncoding.GetMaxCharCount(byteCount);
                }
                catch (DecoderFallbackException)
                {
                    if (m_useAnsi)
                    {
                        throw;
                    }
                    m_useAnsi = true;
                    m_internalEncoding = System.Text.Encoding.Default;
                    return m_internalEncoding.GetMaxCharCount(byteCount);
                }
            }

            /// <summary>Returns the encoding that was actually used.
            /// Has to be used after reading!</summary>
            public Encoding InternalEncoding
            {
                get { return m_internalEncoding; }
            }
        }
    }
}
Using the code
Copy the class FlexiStreamReader into your project, adjust the namespace and use it:
public string ReadFile(string path)
{
    using (var sr = new FlexiStreamReader(path))
    {
        return sr.ReadToEnd();
    }
}
If you need to know the encoding, you can query it after the stream has been read:
// requires: using System.Diagnostics;
public string ReadFile(string path)
{
    using (var sr = new FlexiStreamReader(path))
    {
        var result = sr.ReadToEnd();
        Debug.WriteLine(sr.CurrentEncoding.EncodingName);
        return result;
    }
}
Unit tests
In case you have any doubts ...
Here are some tests for FlexiStreamReader that you can add to your test project:
using Microsoft.VisualStudio.TestTools.UnitTesting;
using MyClassLibrary;
using System;
using System.IO;
using System.Text;

namespace MyClassLibrary.Tests
{
    [TestClass]
    public class FlexiStreamReaderTests
    {
        [TestMethod]
        public void ReadUTF8()
        {
            var text = "abcdäöü";
            using (var s = GetStream(new UTF8Encoding(false), text))
            using (var r = new FlexiStreamReader(s))
            {
                var result = r.ReadToEnd();
                Assert.AreEqual(text, result);
                Assert.AreEqual(Encoding.UTF8.EncodingName,
                    r.CurrentEncoding.EncodingName);
            }
        }

        [TestMethod]
        public void ReadUTF8_long()
        {
            var text = GetTestString(10000);
            using (var s = GetStream(new UTF8Encoding(false), text))
            using (var r = new FlexiStreamReader(s))
            {
                var result = r.ReadToEnd();
                Assert.AreEqual(text, result);
                Assert.AreEqual(Encoding.UTF8.EncodingName,
                    r.CurrentEncoding.EncodingName);
            }
        }

        [TestMethod]
        public void ReadUTF8_BOM()
        {
            var text = "abcdäöü";
            using (var s = GetStream(new UTF8Encoding(true), text))
            using (var r = new FlexiStreamReader(s))
            {
                var result = r.ReadToEnd();
                Assert.AreEqual(text, result);
                Assert.AreEqual(Encoding.UTF8, r.CurrentEncoding);
            }
        }

        [TestMethod]
        public void ReadAnsi()
        {
            var text = "abcdäöü";
            using (var s = GetStream(Encoding.Default, text))
            using (var r = new FlexiStreamReader(s))
            {
                var result = r.ReadToEnd();
                Assert.AreEqual(text, result);
                Assert.AreEqual(Encoding.Default, r.CurrentEncoding);
            }
        }

        [TestMethod]
        public void ReadUnicode()
        {
            var text = "abcdäöü";
            using (var s = GetStream(Encoding.Unicode, text))
            using (var r = new FlexiStreamReader(s))
            {
                var result = r.ReadToEnd();
                Assert.AreEqual(text, result);
                Assert.AreEqual(Encoding.Unicode, r.CurrentEncoding);
            }
        }

        [TestMethod]
        public void ReadBigEndianUnicode()
        {
            var text = "abcdäöü";
            using (var s = GetStream(Encoding.BigEndianUnicode, text))
            using (var r = new FlexiStreamReader(s))
            {
                var result = r.ReadToEnd();
                Assert.AreEqual(text, result);
                Assert.AreEqual(Encoding.BigEndianUnicode, r.CurrentEncoding);
            }
        }

        // Writes the text into a MemoryStream using the given encoding
        // (including its BOM, if it has one) and rewinds the stream.
        private static Stream GetStream(Encoding enc, string text)
        {
            var ms = new MemoryStream();
            // The writer is not disposed on purpose: that would close the stream.
            var sw = new StreamWriter(ms, enc);
            sw.Write(text);
            sw.Flush();
            ms.Position = 0;
            return ms;
        }

        private static string GetTestString(int length)
        {
            // get a long string with funky characters at the end of it:
            var l2 = Math.Min(length / 10, 10);
            var l1 = length - l2;
            var p1 = new string('a', l1);
            var p2 = new string('ö', l2);
            return string.Concat(p1, p2);
        }
    }
}
https://www.codeproject.com/Tips/5359193/ANSI-or-UTF8-without-BOM-Read-Text-with-the-Right