Scrapy Part 13: Encoding Detection with cchardet

文子阳

Last modified 2022-08-18 16:02:48

Category: scrapy

First published 2022-08-18 10:44:17

Article link: https://blog.csdn.net/wenxingchen/article/details/126400968

scrapy cchardet python

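cchardet is a faster alternative to chardet with the same detect() interface: it guesses the encoding of a byte string. (chardet is the detector that requests has traditionally depended on; recent requests releases default to charset-normalizer instead.) Because the detection engine is implemented in C and C++, cchardet is much faster than the pure-Python chardet, which makes it a good fit for working out page encodings in a crawler.
Remember: do not blindly trust the encoding that requests reports. It is taken from the Content-Type header, and for a text response that declares no charset it falls back to ISO-8859-1, so checking the bytes yourself is safer.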
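To see why a wrong encoding guess matters, here is a minimal sketch that needs no network access. It assumes a page whose bytes are GBK-encoded (the sample string is chosen only for illustration) and decodes them once with ISO-8859-1 and once with GBK.

```python
# Bytes as a GBK-encoded page would send them (sample text chosen for illustration)
raw: bytes = "百度新闻".encode("gbk")

# ISO-8859-1 maps every byte to exactly one character, so decoding never raises;
# it just silently produces mojibake
wrong: str = raw.decode("iso-8859-1")

# Decoding with the encoding the bytes were really written in restores the text
right: str = raw.decode("gbk")

print(repr(wrong))
print(right)
```

The important point is that the wrong decode does not fail loudly, so the garbled text can flow straight into your parsed items unnoticed.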

1. First, how requests, the staple crawler dependency, decodes a response.

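import requests
from requests import Response

if __name__ == '__main__':
    # Fetch the Baidu News home page
    response: Response = requests.get('http://news.baidu.com/')

    # The text property returns a str directly, decoded with response.encoding,
    # but that encoding may be wrong, so this approach is not reliable.
    html: str = response.text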
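For context, requests fills in response.encoding from the charset parameter of the Content-Type header. The sketch below is not requests' own code; it uses the standard library's email.message.Message to illustrate the same kind of header check, and declared_charset is a hypothetical helper name.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset declared in a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns None when no
    # charset parameter is present
    return msg.get_content_charset()


print(declared_charset("text/html; charset=GBK"))
print(declared_charset("text/html"))
```

When the header carries no charset, nothing in the header tells you how the body was encoded, and that is exactly the case where inspecting the bytes is worthwhile.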

2. Dependency

pip install cchardet

3. Decoding manually is safer:

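import cchardet
import requests
from requests import Response

if __name__ == '__main__':
    # Fetch the Baidu News home page
    response: Response = requests.get('http://news.baidu.com/')

    # Automatic decoding: not recommended.
    # The text property returns a str directly, but this approach is not reliable.
    html: str = response.text

    # Manual decoding: recommended.
    # Get the raw bytes
    html_bytes = response.content
    # Detect the encoding
    detect = cchardet.detect(html_bytes)
    # Decode; fall back to UTF-8 if detection returned no encoding
    html_ = html_bytes.decode(detect['encoding'] or 'utf-8', errors='replace')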
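Detection is only a guess: the 'encoding' entry can come back empty (for an empty body, for example), and a detector may report a name that Python has no codec for. A minimal sketch of a helper that handles both cases in one place follows. decode_html, its detect parameter, and the UTF-8 fallback are assumptions made for this illustration, not part of cchardet.

```python
from typing import Callable, Optional


def decode_html(
    data: bytes,
    detect: Optional[Callable[[bytes], dict]] = None,
    fallback: str = "utf-8",
) -> str:
    """Decode bytes using a detected encoding, falling back when detection fails."""
    if detect is None:
        # Imported lazily so the helper can also be exercised with a stub detector
        import cchardet
        detect = cchardet.detect
    # Use the fallback when the detector reports no encoding at all
    guess = detect(data).get("encoding") or fallback
    try:
        return data.decode(guess, errors="replace")
    except LookupError:
        # The detector named an encoding that Python has no codec for
        return data.decode(fallback, errors="replace")
```

In a crawler this would be called as decode_html(response.content). Passing errors="replace" means a few bad bytes produce replacement characters rather than an exception that aborts the whole page.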

4. Detecting the encoding of a file's bytes

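# -*- coding: utf-8 -*-
import cchardet as chardet

# Open in binary mode so the raw bytes are read without any decoding
with open(r"util.py", "rb") as f:
    msg = f.read()
    # result is a dict with 'encoding' and 'confidence' keys
    result = chardet.detect(msg)
    print(result)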
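For large files you may not want to read everything into memory just to guess the encoding. One possible approach, sketched below, is to sample only the first chunk of the file. detect_file_encoding and the 64 KiB sample size are assumptions for this illustration, and a short sample can make the guess less reliable.

```python
from typing import Callable, Optional, Tuple


def detect_file_encoding(
    path: str,
    detect: Callable[[bytes], dict],
    sample_size: int = 64 * 1024,
) -> Tuple[Optional[str], Optional[float]]:
    """Guess a file's encoding from at most sample_size leading bytes."""
    # Binary mode: the detector needs raw bytes, not decoded text
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    result = detect(sample)
    return result.get("encoding"), result.get("confidence")
```

With cchardet installed, usage would look like detect_file_encoding("util.py", cchardet.detect). The returned confidence lets the caller decide whether to trust the guess or fall back to a default.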