Fixing the UTF-8 decode error in a web crawler

Latest recommended article published on 2024-04-30 19:24:04

拢龙木嘛

Category: Crawlers. Tags: python, html

Article link: https://blog.csdn.net/weixin_37896489/article/details/106976889

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 160: invalid start byte

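Problem description

While crawling a novel website, I used urllib to send the request, and the bytes read from the response had to be decoded. The statement was:
html = reponse.read().decode("utf-8")
With "utf-8" as the encoding, running it raises the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 160: invalid start byte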
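
The error can be reproduced without any network access. The sketch below is only an illustration: the sample bytes and the helper name `try_decode` are made up for this example and are not taken from the crawled site. It decodes the same bytes once as UTF-8 and once as GBK.

```python
# Hypothetical sample: ASCII text followed by 0xA1 0xA1, a two-byte GBK sequence.
raw = b"abc\xa1\xa1"


def try_decode(data: bytes, encoding: str):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


utf8_text, utf8_error = try_decode(raw, "utf-8")
gbk_text, gbk_error = try_decode(raw, "gbk")

# UTF-8 rejects the data at the first 0xA1 byte; GBK accepts it.
print("utf-8:", utf8_error)
print("gbk:  ", repr(gbk_text))
```

The UTF-8 attempt fails at the first 0xA1 byte, which is the same kind of failure as in the traceback, while the GBK attempt succeeds.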

Solution

Change the statement: html = reponse.read().decode("utf-8")
to: html = reponse.read().decode("GBK")
That is, replace "utf-8" with "GBK", the Chinese national-standard encoding.
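
Hard-coding "GBK" works for this one site. If the same crawler visits pages in different encodings, one possible approach is to try a short list of candidate encodings in order. The following is a minimal sketch under that assumption; the function name `decode_with_fallback` and the candidate list are hypothetical, not part of the original code.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "gbk")) -> str:
    """Try each candidate encoding in order and return the first clean decode."""
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError as exc:
            # Remember the failure and move on to the next candidate.
            last_error = exc
    # Every candidate failed, so re-raise the most recent error.
    raise last_error


# Usage with made-up sample bytes: the first is not valid UTF-8, the second is.
print(repr(decode_with_fallback(b"abc\xa1\xa1")))
print(decode_with_fallback("plain text".encode("utf-8")))
```

A clean decode does not prove the guess was right: some GBK byte sequences also happen to be valid UTF-8 and would come out as wrong characters without raising an error. When the page declares its encoding, that declaration is the more reliable source.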

The problematic code is as follows

import urllib.error
import urllib.request


def askURL(url):
    head = {
        "User-Agent": "Mozillacko) Chrome6"}
    request = urllib.request.Request(url, headers=head)
    html = ""
    try:
        reponse = urllib.request.urlopen(request)
        # The next line is the one that raises the error!
        # Cause: decode() must be given the encoding the page actually uses.
        html = reponse.read().decode("utf-8")
        # For the novel page crawled here, the following line is the correct one:
        # html = reponse.read().decode("GBK")
        print("html", html)
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html

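Analysis

In the vast majority of cases this error occurs because the content is not UTF-8 encoded (for example, it may be GBK encoded). In UTF-8, bytes in the range 0x80 to 0xBF can only continue a multi-byte character and can never start one, so a byte such as 0xa1 in a start position is rejected; in GBK, 0xa1 is a valid lead byte. The fix is to decode with the encoding that the content actually uses.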
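
When the server declares the encoding in its Content-Type header, the crawler can read it instead of guessing. The sketch below assumes the header is present and correct, which is not guaranteed for every site. The helper names `pick_encoding` and `fetch_html` and the default of "utf-8" are illustrative choices, not part of the original code. It relies on `urlopen` returning a response whose `headers` object provides `get_content_charset()`.

```python
import urllib.request
from email.message import Message


def pick_encoding(headers: Message, default: str = "utf-8") -> str:
    """Return the charset declared in Content-Type, or the default if none is given."""
    return headers.get_content_charset() or default


def fetch_html(url: str) -> str:
    """Fetch a page and decode it with the encoding the server declares."""
    request = urllib.request.Request(url)
    with urllib.request.urlopen(request) as response:
        encoding = pick_encoding(response.headers)
        return response.read().decode(encoding)


# Offline check of the header logic with a hand-built header object.
sample = Message()
sample["Content-Type"] = "text/html; charset=GBK"
print(pick_encoding(sample))
```

Some sites omit the charset from the header and declare it only in an HTML meta tag; in that case this sketch falls back to the default, and the encoding still has to be set by hand as described above.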

References:

https://www.cnblogs.com/zhangshitong/p/11281312.html
https://blog.csdn.net/qq_37701443/article/details/84964684
https://blog.csdn.net/wang7807564/article/details/78164855/

Here is a fairly good tutorial from the web on crawling novels:

Crawler: Novels
