ANSI or UTF8 Without BOM: Read Text with the Right Encoding

Contents

How to Detect the Encoding of a Text File

How to Distinguish Between ANSI and UTF8 Without a BOM

FlexiStreamReader

Using the Code

Unit Tests


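How to Detect the Encoding of a Text File

Disclaimer: The hints described here are not a silver bullet. There are many code pages and encodings out there that this approach cannot handle. But if you work in a Windows environment, it may help.

When working with text or csv files, I usually encounter files encoded in one of three ways:

  1. ANSI, or more precisely: Windows-1252
  2. UTF8 with a byte order mark (BOM)
  3. UTF8 without BOM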
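As a small illustration of what distinguishes cases 2 and 3, the sketch below prints the BOM (called the "preamble" in .NET) that an encoding would write and checks whether a byte buffer starts with the UTF8 BOM. The class and method names (BomDemo, PreambleHex, StartsWithUtf8Bom) are hypothetical helpers for this illustration, not part of the original article.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class BomDemo
{
    // Returns the BOM (preamble) bytes of an encoding as a hex string,
    // e.g. "EF-BB-BF" for UTF8 with BOM, or "" if there is no preamble.
    public static string PreambleHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    // True if the buffer starts with the three UTF8 BOM bytes EF BB BF.
    public static bool StartsWithUtf8Bom(byte[] data)
    {
        return data.Length >= 3
            && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
    }
}
```

For example, BomDemo.PreambleHex(new UTF8Encoding(true)) yields "EF-BB-BF", while new UTF8Encoding(false) has an empty preamble. A file without those leading bytes therefore carries no marker that tells ANSI and UTF8 apart.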

To read such files, I used to use a StreamReader like this:

public string ReadFile(string path)
{
    // Encoding.Default is only used when the file has no BOM.
    using (var sr = new System.IO.StreamReader(path: path,
        encoding: System.Text.Encoding.Default,
        detectEncodingFromByteOrderMarks: true))
    {
        return sr.ReadToEnd();
    }
}

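If a BOM is present, StreamReader uses it to pick the right encoding. In every other case, it uses the given default encoding (here Windows-1252). This works well for the first two cases but fails for UTF8 without BOM. Note that Encoding.Default returns the system ANSI code page only on .NET Framework; on .NET Core and later it is always UTF8.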
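To make the failure concrete, here is a minimal sketch (not from the original article) that encodes a string as UTF8 without BOM and then decodes the bytes with a single-byte encoding, as a reader with the wrong default would. MojibakeDemo and DecodeUtf8BytesAs are hypothetical names. In the usage note, ISO-8859-1 stands in for Windows-1252, because the two agree on the bytes involved and ISO-8859-1 is available without registering additional code pages on newer .NET versions.

```csharp
using System.Text;

// Hypothetical helper class, for illustration only.
public static class MojibakeDemo
{
    // Encodes the text as UTF8 without BOM, then decodes those bytes
    // with the given (wrong) encoding.
    public static string DecodeUtf8BytesAs(Encoding wrongEncoding, string text)
    {
        byte[] utf8Bytes = new UTF8Encoding(false).GetBytes(text);
        return wrongEncoding.GetString(utf8Bytes);
    }
}
```

Calling MojibakeDemo.DecodeUtf8BytesAs(Encoding.GetEncoding("iso-8859-1"), "ä") returns the two characters "Ã¤": the UTF8 bytes C3 A4 are each read as a separate single-byte character. No exception is thrown, which is why this kind of error usually goes unnoticed until someone looks at the text.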

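The question is:

How to Distinguish Between ANSI and UTF8 Without a BOM

One approach looks like this:

  1. Read the file with UTF8 encoding and catch the DecoderFallbackException.
  2. If such an exception is thrown, read the file again with ANSI encoding, which is then probably the right choice.

However, this means the file or stream has to be read a second time, and reading twice is something to avoid.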
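The two steps above could be sketched roughly as follows. This is an illustration under assumptions, not the author's code: TwoPassReader is a hypothetical name, and the fallback encoding is passed in by the caller instead of being hard-wired to Encoding.Default.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class TwoPassReader
{
    // Pass 1: read with a strict UTF8 encoding that throws on invalid bytes.
    // Pass 2: only if pass 1 failed, open the file again with the fallback.
    public static string ReadAllText(string path, Encoding fallback)
    {
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            using (var sr = new StreamReader(path, strictUtf8,
                detectEncodingFromByteOrderMarks: true))
            {
                return sr.ReadToEnd();
            }
        }
        catch (DecoderFallbackException)
        {
            // The file is opened and read a second time here.
            using (var sr = new StreamReader(path, fallback,
                detectEncodingFromByteOrderMarks: true))
            {
                return sr.ReadToEnd();
            }
        }
    }
}
```

The second using block is the cost of this approach: it needs a path (or a seekable stream) so that reading can start over from the beginning.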

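Finally, I came up with another solution.

FlexiStreamReader

The original StreamReader does a good job of choosing the right encoding based on the BOM, so we only have to handle the case where the BOM is missing and StreamReader uses the given encoding.

So we need a new encoding for the StreamReader: the EncodingProvider.

This class derives from System.Text.Encoding, and because it is only used to get characters from a byte stream, we only need to implement the methods GetCharCount, GetChars and GetMaxCharCount.

Inside these methods, the DecoderFallbackException is handled; this is where the actual encoding is switched from UTF8 to Default.

So when the StreamReader reads from the stream using the EncodingProvider, it starts with UTF8 and switches to Default (Windows-1252) as soon as the exception occurs. The stream is not affected and the position within the stream does not need to change, so this also works with forward-only streams.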
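The core of the idea, decoding bytes that are already in memory a second time instead of rewinding the stream, can be shown in isolation. The following is a simplified sketch under assumptions, not the class presented in this article: BufferDecoder and Decode are hypothetical names, and the fallback encoding is a parameter.

```csharp
using System.Text;

// Hypothetical helper class, for illustration only.
public static class BufferDecoder
{
    // Decodes one byte buffer: strict UTF8 first; if the bytes are not
    // valid UTF8, the same buffer is decoded again with the fallback.
    // 'used' reports which encoding produced the result.
    public static string Decode(byte[] buffer, Encoding fallback, out Encoding used)
    {
        var strictUtf8 = new UTF8Encoding(false, throwOnInvalidBytes: true);
        try
        {
            used = strictUtf8;
            return strictUtf8.GetString(buffer);
        }
        catch (DecoderFallbackException)
        {
            // No stream access is needed: the bytes are still in the buffer.
            used = fallback;
            return fallback.GetString(buffer);
        }
    }
}
```

Bytes that were decoded as UTF8 before the switch need no special treatment in the common case: a buffer that passed strict UTF8 validation and contained only ASCII yields the same characters in Windows-1252.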

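Because EncodingProvider serves a very special purpose and some of its methods are not even implemented, it should not be a public class. Instead, I chose to create FlexiStreamReader and make EncodingProvider a private nested class of it: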

using System;
using System.Text;
using System.IO;
 
namespace MyClassLibrary
{
    /// <summary>
    /// StreamReader that is, to some extent, capable of detecting the encoding of a stream.
    /// </summary>
    public class FlexiStreamReader : StreamReader
    {
        /// <summary>
        /// Initializes a new instance of <see cref="FlexiStreamReader"/> 
        /// class for the specified stream. The character encoding will be 
        /// detected by the byte order mark. In addition, the reader is able
        /// to distinguish between UTF8 and Default encoding.
        /// </summary>
        /// <param name="stream"></param>
        public FlexiStreamReader(Stream stream) : 
         base(stream, new EncodingProvider(), detectEncodingFromByteOrderMarks: true)
        {
        }
 
        /// <summary>
        /// Initializes a new instance of <see cref="FlexiStreamReader"/> 
        /// class for the specified file name. The character encoding will be 
        /// detected by the byte order mark. In addition, the reader is
        /// able to distinguish between UTF8 and Default encoding.
        /// </summary>
        /// <param name="path"></param>
        public FlexiStreamReader(string path) : 
        base(path, new EncodingProvider(), detectEncodingFromByteOrderMarks: true)
        {
        } 
 
        /// <inheritdoc />
        public override Encoding CurrentEncoding
        {
            get
            {
                var enc = base.CurrentEncoding as EncodingProvider;
                if (enc != null)
                {
                    return enc.InternalEncoding;
                }
                return base.CurrentEncoding;
            }
        } 
 
        /// <summary>
        /// Internal Encoding:
        /// Starts with UTF-8 and switches to Default 
        /// in case of a DecoderFallbackException.
        /// </summary>
        private class EncodingProvider : System.Text.Encoding
        {
            private Encoding m_internalEncoding;
            private bool m_useAnsi = false;
 
            public EncodingProvider()
            {
                m_internalEncoding = new UTF8Encoding
                (encoderShouldEmitUTF8Identifier: true, throwOnInvalidBytes: true);
            }
 
            public override int GetByteCount(char[] chars, int index, int count)
            {
                // no need for an implementation:
                throw new NotImplementedException();
            }
 
            public override int GetBytes(char[] chars, 
                   int charIndex, int charCount, byte[] bytes, int byteIndex)
            {
                // no need for an implementation:
                throw new NotImplementedException();
            }
 
            public override int GetCharCount(byte[] bytes, int index, int count)
            {
                try
                {
                    return m_internalEncoding.GetCharCount(bytes, index, count);
                }
                catch (DecoderFallbackException)
                {
                    if (m_useAnsi)
                    {
                        throw;
                    }
                    m_useAnsi = true;
                    m_internalEncoding = System.Text.Encoding.Default;
                    return m_internalEncoding.GetCharCount(bytes, index, count);
                }
            }
 
            public override int GetChars(byte[] bytes, int byteIndex, 
                            int byteCount, char[] chars, int charIndex)
            {
                try
                {
                    return m_internalEncoding.GetChars
                           (bytes, byteIndex, byteCount, chars, charIndex);
                }
                catch (DecoderFallbackException)
                {
                    if (m_useAnsi)
                    {
                        throw;
                    }
                    m_useAnsi = true;
                    m_internalEncoding = System.Text.Encoding.Default;
                    return m_internalEncoding.GetChars
                           (bytes, byteIndex, byteCount, chars, charIndex);
                }
            }
 
            public override int GetMaxByteCount(int charCount)
            {
                try
                {
                    return m_internalEncoding.GetMaxByteCount(charCount);
                }
                catch (DecoderFallbackException)
                {
                    if (m_useAnsi)
                    {
                        throw;
                    }
                    m_useAnsi = true;
                    m_internalEncoding = System.Text.Encoding.Default;
                    return m_internalEncoding.GetMaxByteCount(charCount);
                }
            }
 
            public override int GetMaxCharCount(int byteCount)
            {
                try
                {
                    return m_internalEncoding.GetMaxCharCount(byteCount);
                }
                catch (DecoderFallbackException)
                {
                    if (m_useAnsi)
                    {
                        throw;
                    }
                    m_useAnsi = true;
                    m_internalEncoding = System.Text.Encoding.Default;
                    return m_internalEncoding.GetMaxCharCount(byteCount);
                }
            }
 
            /// <summary>Returns the actual used Encoding. 
            /// Has to be used after reading!</summary>
            public Encoding InternalEncoding
            {
                get { return m_internalEncoding; }
            }
        } 
    } 
}

Using the Code

Copy the class FlexiStreamReader into your project, adjust the namespace and use it:

public string ReadFile(string path)
{
    using (var sr = new FlexiStreamReader(path))
    {
        return sr.ReadToEnd();
    }
}  

If you need to know the encoding, you can query it after the stream has been read:

public string ReadFile(string path)
{
    using (var sr = new FlexiStreamReader(path))
    {
        var result = sr.ReadToEnd();
        System.Diagnostics.Debug.WriteLine(sr.CurrentEncoding.EncodingName);
        return result;
    }
}

Unit Tests

In case you have any doubts...

Here are some tests for FlexiStreamReader that you can add to your test project:

using Microsoft.VisualStudio.TestTools.UnitTesting;
using MyClassLibrary;
using System;
using System.IO;
using System.Text; 
 
namespace MyClassLibrary.Tests
{
    [TestClass()]
    public class FlexiStreamReaderTests
    { 
        [TestMethod]
        public void ReadUTF8()
        {
            var text = "abcdäöü";
            var result = string.Empty;
            using (var s = GetStream(new UTF8Encoding(false), text))
            using (var r = new FlexiStreamReader(s))
            {
                result = r.ReadToEnd();
                Assert.AreEqual(text, result);
                Assert.AreEqual(Encoding.UTF8.EncodingName, 
                                r.CurrentEncoding.EncodingName);
            }
        }
 
        [TestMethod]
        public void ReadUTF8_long()
        {
            var text = GetTestString(10000);
            var result = string.Empty;
            using (var s = GetStream(new UTF8Encoding(false), text))
            using (var r = new FlexiStreamReader(s))
            {
                result = r.ReadToEnd();
                Assert.AreEqual(text, result);
                Assert.AreEqual(Encoding.UTF8.EncodingName, 
                                r.CurrentEncoding.EncodingName);
            }
        }
 
        [TestMethod]
        public void ReadUTF8_BOM()
        {
            var text = "abcdäöü";
            var result = string.Empty;
            using (var s = GetStream(new UTF8Encoding(true), text))
            using (var r = new FlexiStreamReader(s))
            {
                result = r.ReadToEnd();
                Assert.AreEqual(text, result);
                Assert.AreEqual(Encoding.UTF8, r.CurrentEncoding);
            }
        } 
 
        [TestMethod]
        public void ReadAnsi()
        {
            var text = "abcdäöü";
            var result = string.Empty;
            using (var s = GetStream(Encoding.Default, text))
            using (var r = new FlexiStreamReader(s))
            {
                result = r.ReadToEnd();
                Assert.AreEqual(text, result);
                Assert.AreEqual(Encoding.Default, r.CurrentEncoding);
            }
        }
 
        [TestMethod]
        public void ReadUnicode()
        {
            var text = "abcdäöü";
            var result = string.Empty;
            using (var s = GetStream(Encoding.Unicode, text))
            using (var r = new FlexiStreamReader(s))
            {
                result = r.ReadToEnd();
                Assert.AreEqual(text, result);
                Assert.AreEqual(Encoding.Unicode, r.CurrentEncoding);
            }
        }
 
        [TestMethod]
        public void ReadBigEndianUnicode()
        {
            var text = "abcdäöü";
            var result = string.Empty;
            using (var s = GetStream(Encoding.BigEndianUnicode, text))
            using (var r = new FlexiStreamReader(s))
            {
                result = r.ReadToEnd();
                Assert.AreEqual(text, result);
                Assert.AreEqual(Encoding.BigEndianUnicode, r.CurrentEncoding);
            }
        } 
 
        private static Stream GetStream(Encoding enc, String text)
        {
            var ms = new MemoryStream();
            var sw = new StreamWriter(ms, enc);
            sw.Write(text);
            sw.Flush();
            ms.Position = 0;
            return ms;
        }
 
        private static string GetTestString(int length)
        {
            // get a long string with funky character at end of it:
            var l2 = Math.Min(length / 10, 10);
            var l1 = length - l2;
 
            var p1 = new string('a', l1);
            var p2 = new string('ö', l2);
            return string.Concat(p1, p2);
        }
    }
}

https://www.codeproject.com/Tips/5359193/ANSI-or-UTF8-without-BOM-Read-Text-with-the-Right
