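Mozilla has a module called universalchardet that detects which character encoding a piece of text uses. It combines three methods: 1) the coding scheme method, 2) character distribution, and 3) 2-char sequence distribution. Its main flow is as follows:
HandleData(batch_of_text)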
{
    if (batch_of_text contains BOM)
        report UCS2;
    if ((inputState is PureAscii) || (inputState is EscAscii))
        if (batch_of_text contains 8-bit byte)
            inputState = HighByte;
        else if ((inputState is PureAscii) && (batch_of_text contains Esc_Sequence))
            inputState = EscAscii;
    if (inputState is HighByte)
    {
        Remove ASCII characters that are not adjacent to an 8-bit byte
        For each prober in multibyte_probers
            Prober.HandleData(batch_of_text);
        For each prober in singlebyte_probers
            Prober.HandleData(batch_of_text);
    }
    else if (inputState is EscAscii)
    {
        For each prober in (ISO2022_XX or HZ)
            Prober.HandleData(batch_of_text);
    }
}
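
The state handling in the pseudocode above can be sketched in Python as follows. This is only a minimal illustration, not Mozilla's actual code: the names `InputState`, `next_state`, `drop_isolated_ascii` and `prober_groups` are hypothetical, and treating the ESC byte (0x1B) or the HZ shift-in `~{` as the "Esc_Sequence" is an assumption. The probers themselves are left out; the sketch only shows which group of probers would receive the data.

```python
from enum import Enum


class InputState(Enum):
    PURE_ASCII = "PureAscii"
    ESC_ASCII = "EscAscii"
    HIGH_BYTE = "HighByte"


ESC = 0x1B  # assumed marker for ISO-2022 style escape sequences


def next_state(state, batch):
    """Return the input state after looking at one batch of bytes."""
    if state in (InputState.PURE_ASCII, InputState.ESC_ASCII):
        # Any byte with the high bit set moves us to HighByte.
        if any(b >= 0x80 for b in batch):
            return InputState.HIGH_BYTE
        # Assumption: ESC or the HZ shift-in "~{" counts as an escape sequence.
        if state is InputState.PURE_ASCII and (ESC in batch or b"~{" in batch):
            return InputState.ESC_ASCII
    # HighByte is never left once entered.
    return state


def drop_isolated_ascii(batch):
    """Keep 8-bit bytes and only the ASCII bytes directly next to one."""
    kept = bytearray()
    for i, b in enumerate(batch):
        prev_high = i > 0 and batch[i - 1] >= 0x80
        next_high = i + 1 < len(batch) and batch[i + 1] >= 0x80
        if b >= 0x80 or prev_high or next_high:
            kept.append(b)
    return bytes(kept)


def prober_groups(state):
    """Name the prober groups that would be fed in the given state."""
    if state is InputState.HIGH_BYTE:
        return ["multibyte", "singlebyte"]
    if state is InputState.ESC_ASCII:
        return ["escape"]  # ISO2022_XX or HZ
    return []
```

For example, feeding `b"hello"` and then `b"caf\xe9"` starting from `InputState.PURE_ASCII` stays in `PURE_ASCII` after the first batch and moves to `HIGH_BYTE` after the second, at which point both the multibyte and single-byte groups would be consulted.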