Python universal encoding: Chardet is a universal character encoding detector for Python 2/3

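chardet 4.0.0 has been released, bringing performance gains and several improvements. Single-byte charset probers now use nested dictionaries, which makes them faster; the CharsetGroupProber class now short-circuits correctly once a prober in the group reports a definite match, which improves efficiency. A new `chardet.detect_all` function returns the possible encodings of the input along with their confidences. Support for Python 2.6, 3.4, and 3.5 has been dropped.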

⚠️ This will be the last release of chardet to support Python 2.7. chardet 5.0 will only support Python 3.6+. ⚠️

Major Changes

This release is multiple years in the making, and provides some quality of life improvements to chardet. The primary user-facing changes are:

Single-byte charset probers now use nested dictionaries under the hood, so they are usually a little faster than before. (See #121 for details)

The CharsetGroupProber class now properly short-circuits when one of the probers in the group is considered a definite match. This led to a substantial speedup.

There is now a chardet.detect_all function that returns a list of possible encodings for the input with associated confidences.

We have dropped support for Python 2.6, 3.4, and 3.5 as they are all past end-of-life.
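As a rough illustration of the `chardet.detect_all` function described above, the sketch below picks the most confident candidate from its result list and uses it to decode the input. It assumes each result is a dict with `encoding`, `confidence`, and `language` keys; the helper names `best_candidate` and `decode_with_chardet`, the 0.5 threshold, and the UTF-8 fallback are hypothetical choices made for this example, not part of chardet.

```python
def best_candidate(candidates, min_confidence=0.5):
    """Return the most confident candidate dict, or None if none qualifies.

    `candidates` is a list shaped like the output of chardet.detect_all:
    dicts with 'encoding', 'confidence' and 'language' keys.
    The 0.5 default threshold is an arbitrary choice for this sketch.
    """
    usable = [
        c for c in candidates
        if c.get("encoding") and c.get("confidence", 0.0) >= min_confidence
    ]
    if not usable:
        return None
    return max(usable, key=lambda c: c["confidence"])


def decode_with_chardet(data, min_confidence=0.5, fallback="utf-8"):
    """Decode bytes with the most confident guess (hypothetical helper)."""
    # Imported lazily so best_candidate stays usable without chardet installed.
    import chardet  # detect_all requires chardet >= 4.0.0

    guess = best_candidate(chardet.detect_all(data), min_confidence)
    encoding = guess["encoding"] if guess else fallback
    try:
        return data.decode(encoding, errors="replace")
    except LookupError:
        # chardet may report an encoding name that Python has no codec for.
        return data.decode(fallback, errors="replace")
```

Keeping the ranking logic in a separate function makes it easy to inspect or log the runner-up candidates before committing to one encoding.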

The changes in this release have also laid the groundwork for retraining the models to make them more accurate, and to support some more encodings/languages (see #99 for progress). This is our main focus for chardet 5.0 (beyond dropping Python 2 support).
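The short-circuiting behaviour mentioned for CharsetGroupProber can be pictured with a toy model. This is not chardet's code: the `State`, `KeywordProber`, and `GroupProber` names are invented for this sketch, and the "prober" simply looks for a marker byte sequence. It only shows the idea that a group stops feeding its members once one of them reports a definite match, and that the group records that state itself.

```python
from enum import Enum


class State(Enum):
    DETECTING = 0
    FOUND_IT = 1


class KeywordProber:
    """Toy prober: reports FOUND_IT once its marker bytes appear."""

    def __init__(self, name, marker):
        self.name = name
        self.marker = marker
        self.feed_calls = 0  # counts how much work this prober was asked to do

    def feed(self, data):
        self.feed_calls += 1
        return State.FOUND_IT if self.marker in data else State.DETECTING


class GroupProber:
    """Toy group prober that stops as soon as one member is certain."""

    def __init__(self, probers):
        self.probers = probers
        self.state = State.DETECTING
        self.best = None

    def feed(self, data):
        if self.state is State.FOUND_IT:
            return self.state  # already decided: skip all further work
        for prober in self.probers:
            if prober.feed(data) is State.FOUND_IT:
                self.best = prober
                self.state = State.FOUND_IT  # record the match on the group
                break  # do not feed the remaining probers
        return self.state
```

In this toy version, any prober listed after the matching one is never fed, and later calls to `feed` return immediately, which is where the saving comes from.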

Benchmarks

Running on a MacBook Pro (15-inch, 2018) with 2.2GHz 6-core i7 processor and 32GB RAM

old version (chardet 3.0.4)

Benchmarking chardet 3.0.4 on CPython 3.7.5 (default, Sep 8 2020, 12:19:42)

[Clang 11.0.3 (clang-1103.0.32.62)]

--------------------------------------------------------------------------------

Calls per second for each encoding:

ascii: 25559.439366240098

big5: 7.187002209518091

cp932: 4.71090956645177

cp949: 2.937256786994428

euc-jp: 4.870580412090848

euc-kr: 6.6910755971933416

euc-tw: 87.71098043480079

gb2312: 6.614302607154443

ibm855: 27.595893549680685

ibm866: 29.93483661732791

iso-2022-jp: 3379.5052775763434

iso-2022-kr: 26181.67290886392

iso-8859-1: 120.63424740403983

iso-8859-5: 32.65106262196898

iso-8859-7: 62.480089080556084

koi8-r: 13.72481001727257

maccyrillic: 33.018537255804496

shift_jis: 4.996013583677438

tis-620: 14.323112928341818

utf-16: 166771.53081510935

utf-32: 198782.18009478672

utf-8: 13.966236809766901

utf-8-sig: 193732.28637413395

windows-1251: 23.038910006925768

windows-1252: 99.48409117053738

windows-1255: 6.336261495718825

Total time: 357.05358052253723s (10.054513372323958 calls per second)

new version (chardet 4.0.0)

Benchmarking chardet 4.0.0 on CPython 3.7.5 (default, Sep 8 2020, 12:19:42)

[Clang 11.0.3 (clang-1103.0.32.62)]

--------------------------------------------------------------------------------

Calls per second for each encoding:

ascii: 38176.31067961165

big5: 12.86915132656389

cp932: 4.656400877065864

cp949: 7.282976434315926

euc-jp: 4.329381447610525

euc-kr: 8.16386823884839

euc-tw: 90.230745070368

gb2312: 14.248865889128146

ibm855: 33.30225548069821

ibm866: 44.181691968506

iso-2022-jp: 3024.2295767539117

iso-2022-kr: 25055.57945041816

iso-8859-1: 59.25262902122995

iso-8859-5: 39.7069713674529

iso-8859-7: 61.008422013862194

koi8-r: 41.21560517643845

maccyrillic: 31.402474369805002

shift_jis: 4.9091652743515155

tis-620: 14.408875278821073

utf-16: 177349.00634249471

utf-32: 186413.51111111112

utf-8: 108.62174360115105

utf-8-sig: 181965.46637744035

windows-1251: 43.16933400329809

windows-1252: 211.27653358317968

windows-1255: 16.15113643694104

Total time: 268.0230791568756s (13.394368915143872 calls per second)
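To make the two listings easier to compare, the following sketch computes speedup ratios from a handful of the figures printed above. The selection of encodings is arbitrary and the variable names are made up for this example; the numbers themselves are copied from the benchmark output. Overall, the total time drops by a factor of about 1.33, utf-8 detection is nearly eight times faster, and iso-8859-1 runs at roughly half its previous rate.

```python
# Calls per second copied from the chardet 3.0.4 and 4.0.0 listings above.
old_rates = {
    "utf-8": 13.966236809766901,
    "koi8-r": 13.72481001727257,
    "windows-1252": 99.48409117053738,
    "iso-8859-1": 120.63424740403983,
}
new_rates = {
    "utf-8": 108.62174360115105,
    "koi8-r": 41.21560517643845,
    "windows-1252": 211.27653358317968,
    "iso-8859-1": 59.25262902122995,
}

# Total wall-clock time in seconds for each full benchmark run.
old_total_seconds = 357.05358052253723
new_total_seconds = 268.0230791568756

# A ratio above 1 means 4.0.0 is faster; below 1 means it is slower.
overall_speedup = old_total_seconds / new_total_seconds
per_encoding_speedup = {
    name: new_rates[name] / old_rates[name] for name in old_rates
}

print(f"overall: {overall_speedup:.2f}x")
for name, ratio in sorted(per_encoding_speedup.items()):
    print(f"{name}: {ratio:.2f}x")
```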

Thank you to @aaaxx, @edumco, @hrnciar, @hroncok, @jdufresne, @mdamien, @saintamh, and @xeor for submitting pull requests, and to all of our users for being patient with how long this release has taken.

Full changelog

Convert single-byte charset probers to use nested dicts for language models (#121) @dan-blanchard

Add API option to get all the encodings confidence (#111) @mdamien

Make sure pyc files are not in tarballs (d7c7343) @dan-blanchard

Include license file in the generated wheel package (#141) @jdufresne

Drop support for Python 2.6 (#143) @jdufresne

Remove unused coverage configuration (#142) @jdufresne

Doc the chardet package suitable for production (#144) @jdufresne

Pass python_requires argument to setuptools (#150) @jdufresne

Update pypi.python.org URL to pypi.org (#155) @jdufresne

Support pytest 4, don't apply marks directly to parameters (PR #174, Issue #173) @hroncok

Test Python 3.7 and 3.8 and document support (#175) @jdufresne

Drop support for end-of-life Python 3.4 (#181) @jdufresne

Workaround for distutils bug in python 2.7 (#165) @xeor

Remove deprecated license_file from setup.cfg (#182) @jdufresne

Remove deprecated 'sudo: false' from Travis configuration (#200) @jdufresne

Add testing for Python 3.9 (#201) @jdufresne

Adds explicit os and distro definitions (#140) @edumco

Remove shebang from nonexecutable script (#192) @hrnciar

Remove use of deprecated 'setup.py test' (#187) @jdufresne

Remove unnecessary numeric placeholders from format strings (#176) @jdufresne

Update links (#152) @aaaxx

Remove shebang and executable bit from chardet/cli/chardetect.py (#171) @jdufresne

Handle weird logging edge case in universaldetector.py (056a2a4) @dan-blanchard

Switch from Travis to GitHub Actions (#204) @dan-blanchard

Properly set CharsetGroupProber.state to FOUND_IT (PR #203, Issue #202) @dan-blanchard

Add language to detect_all output (1e208b7) @dan-blanchard
