Detecting Whether a Byte Stream Is UTF-8 Encoded

土戈

Original link: https://www.cnblogs.com/powertoolsteam/archive/2010/09/20/1831638.html

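The UTF-8 encoding rules can be summarized as follows:

ASCII characters (U+0000 - U+007F) are stored unchanged, one byte each.

All other code points are encoded as follows:

• The first byte consists of n 1 bits followed by a single 0 bit (n >= 2, and at most 4 under RFC 3629). The bits after the 0 hold part of the actual code point, and n is the total number of bytes in the multi-byte sequence, including the first byte.
• It is followed by n - 1 bytes that each start with 10; the low 6 bits of each hold the rest of the code point.
Scanning the whole byte stream against these rules therefore tells us whether it can be UTF-8 encoded.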
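
To make the bit patterns concrete, here is a minimal sketch that maps a first byte to the sequence length it announces. The class and method names (Utf8LeadByte, SequenceLength) are illustrative assumptions, not part of the original article, and the sketch assumes the 4-byte limit of RFC 3629.

```csharp
using System;

public static class Utf8LeadByte
{
    // Hypothetical helper: returns the total number of bytes announced by
    // a first byte (1-4), or 0 if the byte cannot start a sequence.
    public static int SequenceLength(byte b)
    {
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII, stored as-is
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx: one continuation byte follows
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx: two continuation bytes follow
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx: three continuation bytes follow
        return 0;                         // 10xxxxxx (continuation) or 11111xxx (invalid)
    }
}
```

For example, the character U+4E2D is encoded as E4 B8 AD. SequenceLength(0xE4) yields 3, and the two bytes that follow (B8 and AD) both start with the bits 10.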

Based on these rules, here is my C# implementation:

public static bool IsTextUTF8(ref byte[] inputStream)
{
    // Number of continuation bytes still expected for the current character.
    int encodingBytesCount = 0;
    bool allTextsAreASCIIChars = true;

    for (int i = 0; i < inputStream.Length; i++)
    {
        byte current = inputStream[i];

        if ((current & 0x80) == 0x80)
        {
            allTextsAreASCIIChars = false;
        }

        if (encodingBytesCount == 0)
        {
            // First byte of a character.
            if ((current & 0x80) == 0)
            {
                // ASCII char, 0x00-0x7F.
                continue;
            }

            if ((current & 0xE0) == 0xC0)
            {
                encodingBytesCount = 1; // 110xxxxx
            }
            else if ((current & 0xF0) == 0xE0)
            {
                encodingBytesCount = 2; // 1110xxxx
            }
            else if ((current & 0xF8) == 0xF0)
            {
                encodingBytesCount = 3; // 11110xxx
            }
            else
            {
                // Invalid bit structure for UTF-8: a stray continuation byte
                // (10xxxxxx) or a first byte with five or more leading 1 bits,
                // which RFC 3629 does not allow.
                return false;
            }
        }
        else
        {
            // Following bytes must start with 10.
            if ((current & 0xC0) == 0x80)
            {
                encodingBytesCount--;
            }
            else
            {
                // Invalid bit structure for UTF-8.
                return false;
            }
        }
    }

    if (encodingBytesCount != 0)
    {
        // The stream ended in the middle of a multi-byte character.
        return false;
    }

    // This is a structural check only: it does not reject overlong forms
    // (such as 0xC0 0x80) or encoded surrogates.
    // Although UTF-8 can encode ASCII chars, an input stream whose contents
    // are all ASCII is treated as the default encoding rather than UTF-8.
    return !allTextsAreASCIIChars;
}
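
As a cross-check, the strict mode of the .NET UTF8Encoding class can do the validation instead of a hand-written scan. This is a different technique from the function above, shown only as a sketch; the names Utf8StrictCheck and IsValidUtf8 are hypothetical. Unlike the function above, it reports pure ASCII input as valid, since ASCII is a subset of UTF-8.

```csharp
using System;
using System.Text;

public static class Utf8StrictCheck
{
    // Hypothetical helper: true if the bytes decode as UTF-8 without errors.
    public static bool IsValidUtf8(byte[] data)
    {
        // throwOnInvalidBytes: true makes the decoder throw on malformed
        // input instead of silently substituting U+FFFD.
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            // Walks the whole buffer; the resulting count itself is not needed.
            strict.GetCharCount(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

A caller that wants the original behavior for ASCII-only input could combine this with a separate check for any byte >= 0x80.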

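Additionally:

If the goal is to decide whether a file is UTF-8 encoded, this scan is not always necessary: files saved as UTF-8 often begin with a BOM, the three bytes EF BB BF, which marks the file as UTF-8. The BOM is optional, though, so a file without one may still be UTF-8, and scanning the bytes remains the fallback.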
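
A BOM check can be sketched in a few lines. The UTF-8 BOM is the three-byte sequence EF BB BF (the encoding of U+FEFF); the names Utf8Bom and HasUtf8Bom are illustrative assumptions, and the caller is assumed to have already read at least the start of the file into a byte array.

```csharp
using System;

public static class Utf8Bom
{
    // Hypothetical helper: true if the buffer starts with the UTF-8 BOM.
    public static bool HasUtf8Bom(byte[] data)
    {
        return data != null
            && data.Length >= 3
            && data[0] == 0xEF
            && data[1] == 0xBB
            && data[2] == 0xBF;
    }
}
```

A false result only means that no BOM is present; it says nothing about whether the rest of the content is UTF-8.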
