Studying universalchardet

Mozilla has a module called universalchardet that detects which character encoding a piece of text uses. Its main flow looks like this:

HandleData(batch_of_text)
{
    if (batch_of_text contains BOM)
        report UCS2;
    if ((inputState is PureAscii) || (inputState is EscAscii))
        if (batch_of_text contains 8-bits-byte)
            inputState = HighByte;
        else if ((inputState is PureAscii) && (batch_of_text contains Esc_Sequence))
            inputState = EscAscii;

    if (inputState is HighByte)
    {
        Remove Ascii character that is not neighboring to 8-bits byte
        For each prober in multibyte_probers
            Prober.HandleData(batch_of_text);
        For each prober in singlebyte_probers
            Prober.HandleData(batch_of_text);
    }
    else if (inputState is EscAscii)
    {
        For each prober in (ISO2022_XX or HZ)
            Prober.HandleData(batch_of_text);
    }
}
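
The state transition at the top of the pseudocode can be sketched in Python. This is a minimal sketch, not Mozilla's implementation: the names `InputState` and `next_input_state` are hypothetical, and treating the ESC byte (0x1B) or the HZ opener `~{` as an "escape sequence" is an assumption of this sketch.

```python
from enum import Enum


class InputState(Enum):
    PURE_ASCII = 1  # only 7-bit bytes seen so far, no escape sequence
    ESC_ASCII = 2   # 7-bit bytes plus an escape sequence (ISO-2022-XX or HZ)
    HIGH_BYTE = 3   # at least one byte with the high bit set


ESC = 0x1B  # assumption: ESC introduces ISO-2022 sequences


def next_input_state(state: InputState, chunk: bytes) -> InputState:
    """Advance the input state for one chunk of text."""
    if state in (InputState.PURE_ASCII, InputState.ESC_ASCII):
        # Any 8-bit byte moves us to HighByte, even from EscAscii.
        if any(b & 0x80 for b in chunk):
            return InputState.HIGH_BYTE
        # Only PureAscii can be promoted to EscAscii.
        if state is InputState.PURE_ASCII and (ESC in chunk or b"~{" in chunk):
            return InputState.ESC_ASCII
    # HighByte is terminal: once reached, the state never changes again.
    return state
```

Once the state is `HIGH_BYTE`, the multibyte and single-byte probers receive the data; in `ESC_ASCII`, only the escape-based probers do.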

It uses three methods: 1) Coding scheme method, 2) Character Distribution, 3) 2-Char Sequence Distribution
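
The idea behind the first method, the coding scheme method, is to rule out any encoding in which the input contains an illegal byte sequence. The sketch below illustrates that idea only. It substitutes Python's built-in strict decoders for universalchardet's own per-encoding checks, and the function name `legal_encodings` is hypothetical.

```python
def legal_encodings(data: bytes, candidates: list[str]) -> list[str]:
    """Return the candidate encodings under which `data` decodes without error."""
    survivors = []
    for name in candidates:
        try:
            # Strict decoding raises on any byte sequence illegal in `name`.
            data.decode(name)
        except UnicodeDecodeError:
            continue  # illegal sequence found: eliminate this candidate
        survivors.append(name)
    return survivors


sample = "中文".encode("utf-8")
print(legal_encodings(sample, ["ascii", "utf-8"]))
```

Elimination alone often leaves several candidates, because many byte strings are legal in more than one encoding. That is why the two distribution-based methods are needed as well.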