Python chardet module: how Python encoding detection works and how to use chardet

Latest recommended article published 2024-07-04 14:52:15

weixin_39867594

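Sometimes you need to detect a file's encoding first and then convert it to another encoding. This is where chardet comes in: chardet is a third-party Python library and a very capable encoding-detection module.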
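The detect-then-convert workflow described above can be sketched roughly as follows. This is a minimal sketch, not chardet's own API: the helper name `transcode`, its `fallback` parameter, and the choice of `errors="replace"` are assumptions made for illustration. The detector is passed in as a callable so that any function returning a dict with an `encoding` key (such as `chardet.detect`) can be used.

```python
def transcode(raw, detect, target="utf-8", fallback="utf-8"):
    """Decode raw bytes with the detected encoding, then re-encode to target.

    detect is any callable that takes bytes and returns a dict with an
    'encoding' key, for example chardet.detect (hypothetical helper).
    """
    guess = detect(raw)
    # The detector may report None when it has no idea; use the fallback then.
    source = guess.get("encoding") or fallback
    # errors="replace" keeps going even if the guess was slightly wrong.
    text = raw.decode(source, errors="replace")
    return text.encode(target)
```

With chardet installed, a call would look like `transcode(open('songs.txt', 'rb').read(), chardet.detect)`.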

chardet offers two ways to detect a file's encoding:

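1. Call chardet.detect() on the file's raw bytes:

>>> import chardet
>>> f = open('songs.txt', 'rb')  # binary mode: on Python 3, detect() expects bytes, not str
>>> result = chardet.detect(f.read())
>>> f.close()
>>> result
{'confidence': 0.99, 'encoding': 'utf-8'}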
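The result dict carries a confidence value alongside the encoding, and the encoding can be `None` when nothing matches. A small sketch of how a caller might act on that is shown here. The helper name `pick_encoding`, the 0.5 threshold, and the UTF-8 fallback are illustrative assumptions, not recommendations from chardet.

```python
def pick_encoding(result, min_confidence=0.5, fallback="utf-8"):
    """Return the detected encoding if it is trustworthy enough, else fallback.

    result is a dict shaped like chardet.detect() output, with 'encoding'
    and 'confidence' keys. The threshold is an arbitrary example value.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding and confidence >= min_confidence:
        return encoding
    return fallback
```

For the result shown above, `pick_encoding(result)` would hand back `'utf-8'`, which can then be passed to `bytes.decode()`.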

2. chardet comes with a command-line script which reports on the encodings of one or more files:

% chardetect.py somefile someotherfile
somefile: windows-1252 with confidence 0.5
someotherfile: ascii with confidence 1.0

The script is built around this function:

from chardet.universaldetector import UniversalDetector

def description_of(file, name='stdin'):
    """Return a string describing the probable encoding of a file."""
    u = UniversalDetector()
    for line in file:  # file must yield bytes, so open it in binary mode
        u.feed(line)
    u.close()
    result = u.result
    if result['encoding']:
        return '%s: %s with confidence %s' % (name,
                                              result['encoding'],
                                              result['confidence'])
    else:
        return '%s: no result' % name
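`UniversalDetector` also exposes a `done` attribute that becomes true once the detector is confident, so a caller can stop reading early instead of feeding the whole file. The sketch below illustrates that pattern. The helper name `detect_file`, the `detector` parameter, and the 4096-byte chunk size are assumptions for illustration; chardet is imported lazily so that any object with `feed`, `done`, `close`, and `result` can stand in for it.

```python
def detect_file(path, detector=None, chunk_size=4096):
    """Feed a file to a detector chunk by chunk, stopping once it is done."""
    if detector is None:
        # Third-party dependency, imported only when no detector is supplied.
        from chardet.universaldetector import UniversalDetector
        detector = UniversalDetector()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            detector.feed(chunk)
            if detector.done:  # confident already, no need to read further
                break
    detector.close()
    return detector.result
```

Reading fixed-size chunks rather than lines avoids surprises with files that contain no newline bytes at all.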

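A guess: the first method may work somewhat like vim. It starts parsing the data with a small character set (ASCII, for example) and computes the decoding error rate; if the error rate exceeds a threshold, it switches to a larger character set, and so on until it gets a tolerable decoding result. That would make it slow and not very suitable for large files. Is that how it works?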
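To make the guessed scheme concrete, here is a sketch of trial decoding with an error-rate threshold. It illustrates the hypothesis in the paragraph above only; it is not how chardet is implemented. The function name, the candidate list, and the default threshold are all assumptions.

```python
def guess_by_trial_decoding(raw,
                            candidates=("ascii", "utf-8", "gb18030", "latin-1"),
                            max_error_rate=0.0):
    """Try encodings from narrow to broad; return the first tolerable one.

    An undecodable byte sequence becomes U+FFFD under errors="replace",
    so counting that character approximates the decoding error rate.
    """
    for encoding in candidates:
        text = raw.decode(encoding, errors="replace")
        errors = text.count("\ufffd")
        rate = errors / max(len(text), 1)  # avoid dividing by zero on empty input
        if rate <= max_error_rate:
            return encoding
    return None  # no candidate was tolerable
```

Each failed candidate costs a full pass over the data, which is where the suspected slowness on large files would come from in a scheme like this.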

Question: what is the principle behind method 2? Does it parse the data line by line and check whether any decoding errors occur?
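One observation that may help with this question: going by the chardet source, `chardet.detect()` appears to be a thin wrapper that creates a `UniversalDetector`, feeds it the whole buffer in one call, and closes it. If that reading is right (worth checking against the installed version), the two methods share the same detector and differ only in how the bytes are fed. The sketch below mirrors that one-shot pattern; the helper name and the `detector` parameter are hypothetical.

```python
def detect_one_shot(raw, detector=None):
    """Feed the entire buffer in a single call and return the result.

    This mirrors what chardet.detect() is understood to do internally;
    treat that equivalence as an assumption, not a guarantee.
    """
    if detector is None:
        # Third-party dependency, imported only when no detector is supplied.
        from chardet.universaldetector import UniversalDetector
        detector = UniversalDetector()
    detector.feed(raw)   # one feed() call instead of one per line
    detector.close()     # finalizes detector.result
    return detector.result
```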

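When detecting with method 2, a small text sometimes takes much longer than a large one (where the large text simply repeats the content of the small text).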
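A timing observation like this is easier to pin down with a repeatable measurement. The sketch below times any detection callable on a given buffer and keeps the best of several runs. The helper name `time_detection` and the `repeats` default are assumptions for illustration.

```python
import time

def time_detection(detect, raw, repeats=3):
    """Return (best_seconds, result) for calling detect(raw) several times.

    Taking the minimum over a few runs reduces noise from other processes.
    """
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = detect(raw)
        best = min(best, time.perf_counter() - start)
    return best, result
```

With chardet installed, comparing `time_detection(chardet.detect, small)` against `time_detection(chardet.detect, small * 1000)` would show whether the small text really is slower on a given machine and chardet version.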
