tarfile.ReadError: not a gzip file

懒蛤蟆吃天鹅肉

Category: Python. Tags: python, algorithms, artificial intelligence

Article link: https://blog.csdn.net/a15779627836/article/details/126248048

Analysis of the ReadError: not a gzip file error when reading a tgz file with tarfile

While studying machine learning, I needed to download a housing dataset. Following the steps in the book produced this error:

tarfile.ReadError: not a gzip file

At first I assumed I was calling tarfile incorrectly, but after some digging that turned out not to be the case.

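Analysis:

The package downloaded by the tool is 139 KB, whether urllib.request.urlretrieve or requests.get is used.

The manually downloaded package is 400 KB.

(The manual download was done by clicking the download address.)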
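
Beyond comparing sizes, it can help to look at the first bytes of each file. The sketch below is only an illustration: describe_download is a hypothetical helper name, and the path under the main guard is assumed to match the script in this post. It relies on the fact that every gzip stream starts with the two bytes 0x1f 0x8b.

```python
import os

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def describe_download(path):
    """Return (size_in_bytes, looks_like_gzip, first_bytes) for a file.

    describe_download is a hypothetical helper name used for illustration.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(16)
    return size, head.startswith(GZIP_MAGIC), head


if __name__ == "__main__":
    # Assumed location of the downloaded file, matching the script in this post.
    tgz_path = os.path.join("datasets", "housing", "housing.tgz")
    if os.path.exists(tgz_path):
        size, is_gzip, head = describe_download(tgz_path)
        print(f"{size} bytes, gzip header: {is_gzip}, first bytes: {head!r}")
```

If the reported header is not gzip and the first bytes look like text, the file is something other than the archive, whatever its extension says.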

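Suspected cause

From the analysis above, the file downloaded by the tool is not the complete archive. It carries a .tgz name, but in effect it is a corrupted tgz file, so tarfile cannot open it correctly and raises tarfile.ReadError: not a gzip file. With a download address of the form github.com/.../tree/master/..., a likely explanation is that the server returns GitHub's HTML page for the file rather than the archive itself.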
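
The behaviour can be reproduced without any download. This minimal sketch writes a few bytes of HTML under a .tgz name in a temporary directory and tries to open it the same way the script does. try_open_as_tgz is a hypothetical helper, and the HTML content is only an assumed stand-in for whatever the tool actually saved.

```python
import os
import tarfile
import tempfile


def try_open_as_tgz(path):
    """Return None if path opens as a gzip-compressed tar, else the error text."""
    try:
        with tarfile.open(path, "r:gz"):
            return None
    except tarfile.ReadError as exc:
        return str(exc)


# Write something that is not gzip data (here: HTML text) under a .tgz name.
with tempfile.TemporaryDirectory() as tmp_dir:
    fake_path = os.path.join(tmp_dir, "housing.tgz")
    with open(fake_path, "wb") as f:
        f.write(b"<!DOCTYPE html><html><body>not an archive</body></html>")
    error_text = try_open_as_tgz(fake_path)

print(error_text)
```

The file extension plays no role here: tarfile looks at the content, and anything that does not begin with a gzip header is rejected in 'r:gz' mode.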

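Solution 1

Once the cause is known, we can manually download the tgz file, place it in the folder created by the script, and then extract it programmatically.
Admittedly this is a rather pointless approach, since you could simply download the csv file directly. Still, I was stubborn enough about this problem to download the tgz file and read it with tarfile.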
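
For the manual route, the extraction step on its own might look like the sketch below. extract_local_tgz is a hypothetical helper name, and the paths under the main guard assume the manually downloaded file was saved as datasets/housing/housing.tgz.

```python
import os
import tarfile


def extract_local_tgz(tgz_path, target_dir):
    """Extract a gzip-compressed tar archive that is already on disk.

    Returns the member names so the caller can see what was unpacked.
    """
    if not tarfile.is_tarfile(tgz_path):
        raise ValueError(f"{tgz_path} is not a readable tar archive")
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(tgz_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(path=target_dir)
    return names


if __name__ == "__main__":
    housing_path = os.path.join("datasets", "housing")
    tgz_path = os.path.join(housing_path, "housing.tgz")  # manually downloaded file
    if os.path.exists(tgz_path):
        print(extract_local_tgz(tgz_path, housing_path))
```

Checking with tarfile.is_tarfile first turns a damaged file into an explicit error message instead of a ReadError raised from inside tarfile.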

Solution 2

You can also add some handling to the download step itself to make sure the downloaded file is correct and complete.
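
One possible form of such handling, sketched under assumptions: download to a temporary name, check that the result really is a tar archive, and only then give it the final name. download_verified_tgz and the ".part" suffix are hypothetical choices for illustration, not part of the original script, and the target directory is assumed to exist already.

```python
import os
import shutil
import tarfile
import urllib.request


def download_verified_tgz(url, output_path):
    """Download url to output_path, keeping the file only if it is a tar archive.

    download_verified_tgz is a hypothetical helper, not part of the original script.
    """
    tmp_path = output_path + ".part"
    with urllib.request.urlopen(url) as response, open(tmp_path, "wb") as f:
        shutil.copyfileobj(response, f)  # stream the body to disk
    if not tarfile.is_tarfile(tmp_path):
        size = os.path.getsize(tmp_path)
        os.remove(tmp_path)
        raise ValueError(f"{url} returned {size} bytes that are not a tar archive")
    os.replace(tmp_path, output_path)  # only a verified file gets the final name
    return output_path
```

With this shape, a bad response never ends up at the final path, so a later "file already exists" check cannot mistake it for a finished download.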

Source code

# Download the data
import os
import tarfile
import urllib.request
import requests

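# A github.com/.../tree/master/ address serves GitHub's HTML page, not the file,
# so the raw-content host is used here instead.
download_link = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"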
housing_path = os.path.join("datasets", "housing")
housing_url = download_link + "datasets/housing/housing.tgz"


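def download_source(url, output_path):
    """Download url to output_path; raise on an HTTP error status."""
    response = requests.get(url, stream=False, timeout=30)
    response.raise_for_status()
    with open(output_path, mode='wb') as f:
        f.write(response.content)
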

def fetch_housing_data(housing_url=housing_url, housing_path=housing_path):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
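    # Note: a broken housing.tgz left over from an earlier run must be deleted first,
    # otherwise the download is skipped.
    if not os.path.exists(tgz_path):
        download_source(housing_url, tgz_path)  # using requests.get
        # urllib.request.urlretrieve(housing_url, tgz_path)  # using urllib.request.urlretrieve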

    # The with block closes the archive even if extraction fails
    with tarfile.open(tgz_path, 'r:gz') as housing_tgz:
        housing_tgz.extractall(path=housing_path)


fetch_housing_data()