UTF8的编码规则总结起来如下:
ASCII码(U+0000 - U+007F),不编码
其余编码规则为
•第一个Byte二进制以形式为n个1紧跟个0 (n >= 2), 0后面的位数用来存储真正的字符编码,n的个数说明了这个多Byte字节组字节数(包括第一个Byte)
•接下来会有n个以10开头的Byte,后6个bit存储真正的字符编码。
因此对整个编码byte流进行分析可以得出是否是UTF8编码的判断。
根据这个规则,我给出的C#代码如下:
public static bool IsTextUTF8(ref byte[] inputStream)
{
int encodingBytesCount = 0;
bool allTextsAreASCIIChars = true;
for (int i = 0; i < inputStream.Length; i++)
{
byte current = inputStream[i];
if ((current & 0x80) == 0x80)
{
allTextsAreASCIIChars = false;
}
// First byte
if (encodingBytesCount == 0)
{
if ((current & 0x80) == 0)
{
// ASCII chars, from 0x00-0x7F
continue;
}
if ((current & 0xC0) == 0xC0)
{
encodingBytesCount = 1;
current <<= 2;
// More than two bytes used to encoding a unicode char.
// Calculate the real length.
while ((current & 0x80) == 0x80)
{
current <<= 1;
encodingBytesCount++;
}
}
else
{
// Invalid bits structure for UTF8 encoding rule.
return false;
}
}
else
{
// Following bytes, must start with 10.
if ((current & 0xC0) == 0x80)
{
encodingBytesCount--;
}
else
{
// Invalid bits structure for UTF8 encoding rule.
return false;
}
}
}
if (encodingBytesCount != 0)
{
// Invalid bits structure for UTF8 encoding rule.
// Wrong following bytes count.
return false;
}
// Although UTF8 supports encoding for ASCII chars, we regard as a input stream, whose contents are all ASCII as default encoding.
return !allTextsAreASCIIChars;
}
另:
如果是判断一个文件是否使用了UTF8编码,不一定非用这种方法,因为通常以UTF8格式保存的文件最初两个字符是BOM头,标示该文件使用了UTF8编码。