Using chardet in Python

水星灭绝

Article link: https://blog.csdn.net/wulong710/article/details/109047099

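# coding=utf-8

import os
import chardet

# Look for lst.txt in the current working directory.
path = os.path.join(os.getcwd(), "lst.txt")
with open(path, "rb") as f:
    data = f.read()  # read raw bytes; the encoding is not known yet

# detect() returns a dict with "encoding", "confidence" and "language" keys.
en = chardet.detect(data)
# "encoding" is None when chardet cannot make a guess, so fall back to UTF-8.
print(data.decode(en["encoding"] or "utf-8"))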
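The dict returned by chardet.detect() carries a confidence score as well as the guessed encoding, and the encoding can be None when chardet has no guess. Below is a minimal sketch of decoding defensively with that result. The helper names decode_with_guess and read_text, the 0.5 threshold, and the UTF-8 fallback are all assumptions made for illustration; they are not part of chardet.

```python
def decode_with_guess(data, guess, min_confidence=0.5, fallback="utf-8"):
    """Decode bytes using a chardet-style result dict, with a fallback.

    `guess` is expected to look like what chardet.detect() returns:
    {"encoding": ..., "confidence": ..., "language": ...}.
    The threshold and the fallback encoding are illustrative assumptions.
    """
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0
    # Fall back when there is no guess or the guess is not confident enough.
    if encoding is None or confidence < min_confidence:
        encoding = fallback
    try:
        # errors="replace" substitutes U+FFFD for bytes the codec cannot decode.
        return data.decode(encoding, errors="replace")
    except LookupError:
        # The reported name may not be a codec that Python knows.
        return data.decode(fallback, errors="replace")


def read_text(path):
    """Read a file as text, guessing its encoding with chardet."""
    import chardet  # third-party package: pip install chardet

    with open(path, "rb") as f:
        data = f.read()
    return decode_with_guess(data, chardet.detect(data))
```

With this in place, read_text("lst.txt") would return the decoded text instead of raising when chardet reports no encoding.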

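For detecting the encoding of a large file, see https://www.cnblogs.com/Neeo/articles/11528011.html:

Detecting the encoding of a large file

The example above reads the whole file in one go and then runs the detection, which is not suitable for large files.

So instead we read the data in chunks and feed each chunk to the detector. Once the detector has been fed enough data to make a high-confidence judgment, detector.done becomes True, and at that point we can get the file's encoding.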

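import requests
from chardet.universaldetector import UniversalDetector

url = 'https://chardet.readthedocs.io/en/latest/index.html'

# stream=True avoids downloading the whole body up front.
response = requests.get(url=url, stream=True)

detector = UniversalDetector()
# iter_lines() yields the body line by line as bytes.
for line in response.iter_lines():
    detector.feed(line)
    if detector.done:  # the detector is confident enough, stop feeding
        break
detector.close()  # finalize the result before reading it
print(detector.result)  # {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}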
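The same idea applies to a large file on disk. Here is a rough sketch, assuming a local path and a 64 KiB chunk size. The function name detect_file_encoding and its optional detector parameter are made up for this example; only UniversalDetector with its feed(), done, close() and result members comes from chardet.

```python
def detect_file_encoding(path, chunk_size=64 * 1024, detector=None):
    """Guess a file's encoding by feeding it to a detector chunk by chunk.

    `detector` is an illustrative hook: any object with feed(), done,
    close() and result works. By default a chardet UniversalDetector is used.
    """
    if detector is None:
        # Imported lazily so a caller can supply its own detector object.
        from chardet.universaldetector import UniversalDetector  # pip install chardet
        detector = UniversalDetector()
    with open(path, "rb") as f:
        # iter() with a sentinel yields chunks until read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            detector.feed(chunk)
            if detector.done:
                break  # enough data has been seen, stop reading early
    detector.close()  # finalizes detector.result
    return detector.result
```

Calling detect_file_encoding("lst.txt") should return a dict of the same shape as the one printed above, without ever holding the whole file in memory.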
