A simple fix for httplib.IncompleteRead() in a Python 3 request crawler


NeverSettle101

Category: Crawlers

Article link: https://blog.csdn.net/qq_21265915/article/details/79324742

Background

In a crawler that fetches pages in a loop, an httplib.IncompleteRead() error (http.client.IncompleteRead in Python 3) shows up at random.

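Analysis

After going through a lot of material, I learned that the error is caused by an incomplete chunked transfer encoding. So how do we deal with it? By the time the error is raised the data has in fact already been received, but http_client considers the response unfinished, which is why it raises. The full analysis is covered in detail in another blog post (blog post link).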
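
To make the failure mode concrete, here is a minimal sketch that reproduces it without any network access. It feeds http.client a hand-written chunked response whose second chunk announces 10 bytes but delivers only 3. FakeSocket, the raw bytes, and read_truncated_chunked are made up for this illustration and are not part of the original project.

```python
import io
from http import client as http_client


class FakeSocket:
    """Hypothetical stand-in for a socket: HTTPResponse only needs makefile()."""

    def __init__(self, data):
        self._file = io.BytesIO(data)

    def makefile(self, *args, **kwargs):
        return self._file


# A chunked response that is cut off in the middle of its second chunk.
RAW_RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
    b"5\r\nhello\r\n"   # complete chunk of 5 bytes
    b"a\r\nwor"         # announces 10 (0xa) bytes, but the stream ends after 3
)


def read_truncated_chunked():
    """Return (body_bytes, was_truncated) for the canned response above."""
    resp = http_client.HTTPResponse(FakeSocket(RAW_RESPONSE))
    resp.begin()  # parse the status line and headers
    try:
        return resp.read(), False
    except http_client.IncompleteRead as err:
        # err.partial holds the body bytes http.client kept before giving up
        return err.partial, True


partial, was_truncated = read_truncated_chunked()
print(was_truncated, partial)
```

The exception carries the bytes read so far in its partial attribute, which is what the fix in the next section relies on.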

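Fix

The project uses Request everywhere, so switching libraries is not practical, and the files involved are small. I therefore went with the simplest possible workaround (source of this approach). Be aware that it can lose data, especially with large files.
Catch the error yourself.
Read the content from the error with .partial.decode('utf-8').
Example code: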

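import gzip
import json
from io import BytesIO
from http import client as http_client


# Base class assumed here; the original snippet omits the class line.
class FanfouHTTPError(Exception):
    def __init__(self, e, uri, format, uriparts):
        self.e = e
        self.uri = uri
        self.format = format
        self.uriparts = uriparts
        try:
            data = self.e.fp.read()
        except http_client.IncompleteRead as err:
            # The chunked body ended early; keep the bytes that did arrive
            data = err.partial
        # Decompress gzip-encoded bodies before decoding
        if self.e.headers.get('Content-Encoding') == 'gzip':
            buf = BytesIO(data)
            f = gzip.GzipFile(fileobj=buf)
            data = f.read()
        if len(data) == 0:
            data = {}
        else:
            data = data.decode('utf-8')
            if self.format == 'json':
                try:
                    data = json.loads(data)
                except ValueError:
                    # Not valid JSON (possibly truncated); keep the raw text
                    pass
        self.response_data = data
        super(FanfouHTTPError, self).__init__(str(self))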
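
The same pattern can be pulled out into a standalone helper, which makes it easier to try in isolation. This is only a sketch under assumptions: read_body and TruncatedBody are hypothetical names invented for illustration, and the helper accepts any object with a read() method rather than a real HTTP response.

```python
import gzip
import json
from http import client as http_client


def read_body(fp, content_encoding=None, fmt="json"):
    """Read fp fully; on IncompleteRead fall back to the bytes received so far."""
    try:
        data = fp.read()
    except http_client.IncompleteRead as err:
        data = err.partial
    if content_encoding == "gzip":
        # A truncated gzip stream will still fail here; this sketch does not handle that
        data = gzip.decompress(data)
    if len(data) == 0:
        return {}
    text = data.decode("utf-8")
    if fmt == "json":
        try:
            return json.loads(text)
        except ValueError:
            # Truncated or non-JSON text: hand back the raw string instead
            pass
    return text


class TruncatedBody:
    """Hypothetical stand-in for a response whose read() is cut short."""

    def __init__(self, partial):
        self._partial = partial

    def read(self):
        raise http_client.IncompleteRead(self._partial)


result = read_body(TruncatedBody(b'{"ok": true}'))
print(result)
```

If the salvaged bytes happen to form complete JSON, the caller gets a parsed object. Otherwise it gets the raw text and has to decide whether that is usable.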

Code from this article:

    # Assumes these module-level imports:
    #   import json
    #   import urllib.request as req
    #   from urllib.parse import quote
    #   from http import client as http_client
    def get_cartoon_json(self, categories):
        for category in categories:
            has_more = True
            index = 1
            while has_more:
                link = self.__baseUrl + '/' + category['name'] + '/' + str(index)
                # Percent-encode the Chinese characters in the URL, leaving ':' and '/' intact
                link = quote(link, safe=':/')
                print(link)
                try:
                    # Open the link
                    conn = req.urlopen(link)
                    # Read the page content as UTF-8
                    content = conn.read().decode('utf-8')
                except http_client.IncompleteRead as e:
                    # Handle the chunked read error; everything here is JSON, so no gzip check
                    content = e.partial.decode('utf-8')
                    if len(content) == 0:
                        content = '{}'

                print(content)
                jsonUtils = BaiDuCartoonUtils(self.__filePath)
                jsonUtils.write_json_to_jsonFile(content, self.__filePath + category['name'] + '/list/',
                                                 'data' + str(index) + '.json')

                # The '{}' fallback has no 'data' key, so stop paging instead of raising KeyError
                data = json.loads(content)
                if data.get('data', {}).get('hasMore') != 1:
                    has_more = False
                index = index + 1
                # jsonUtils.write_cartoon_to_json('/api/query/'+self.__categoryName, content)
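
Because a salvaged body may be cut off in the middle of the JSON, the paging check is the part most likely to break. Below is a small sketch of a tolerant check. The helper name has_more is hypothetical, and the data/hasMore layout is taken from the code above.

```python
import json


def has_more(content):
    """Return True only if content is valid JSON with data.hasMore == 1."""
    try:
        payload = json.loads(content)
    except ValueError:
        # Truncated or otherwise invalid JSON: treat it as the last page
        return False
    if not isinstance(payload, dict):
        return False
    data = payload.get("data")
    return isinstance(data, dict) and data.get("hasMore") == 1


print(has_more('{"data": {"hasMore": 1}}'))
print(has_more('{"data": {"has'))
```

Treating a damaged page as the last one keeps the loop from crashing, at the cost of possibly stopping early. Whether that trade-off is acceptable depends on how much missing data the crawler can tolerate.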

If you spot any problems, corrections are welcome.
