Fixing dataset download and loading problems in TensorFlow

Author: H_Mike

Category: Python. Tags: tensorflow

Article link: https://blog.csdn.net/qq_45488482/article/details/105330976

I. Downloading datasets for TensorFlow

1 Downloading from the official site

Official link to the handwritten digit dataset

2 Downloading through TensorFlow

import tensorflow as tf
(x_train, y_train), (x_test, y_test)=tf.keras.datasets.mnist.load_data()

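If the dataset has already been downloaded, the cached mnist.npz file is loaded; otherwise it is downloaded automatically from the hosting site.

The download can run into network trouble: the dataset is hosted overseas, so the request may time out

and leave an incomplete file behind. Rerunning the code then fails with the following error: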

Compressed file ended before the end-of-stream marker was reached

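To fix this, find the folder the data was downloaded to, delete the incomplete dataset file, and run the code again.

On my machine the dataset is stored in C:\Users\mike\.keras\datasets. In general, look for the .keras\datasets folder under your user directory and you will find it.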
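The cleanup step above can also be scripted. This is only a minimal sketch: the helper name `remove_cached_dataset` is made up for illustration, and it assumes the default Keras cache folder (`.keras/datasets` under the user's home directory), which may differ if the cache location was changed.

```python
from pathlib import Path


def remove_cached_dataset(filename="mnist.npz", cache_dir=None):
    """Delete a possibly incomplete dataset file from the Keras cache.

    Returns True if a file was deleted, False if there was nothing to delete.
    """
    # Assumption: the default cache folder is ~/.keras/datasets
    if cache_dir is None:
        cache_dir = Path.home() / ".keras" / "datasets"
    target = Path(cache_dir) / filename
    if target.is_file():
        target.unlink()  # remove the broken download
        return True
    return False
```

After calling `remove_cached_dataset()`, rerunning `tf.keras.datasets.mnist.load_data()` should trigger a fresh download.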
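Downloading through TensorFlow's dataset loader was painfully slow for me, so I do not recommend it. If you have a proxy, you can try downloading through it.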

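II. Loading the datasets

1 Datasets in gzip format

The data downloaded from the official site is in gzip format.
The official site also documents how the image and label files are laid out.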

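(1) Method 1

Reading the image data

In the format description of the image file, the first four entries are not pixel data. They are the magic number, the number of images, the number of rows, and the number of columns, each a 32-bit integer, 16 bytes in total.

The image data begins at the fifth entry, at offset 0016.

From the fifth entry on, each pixel is stored as an unsigned byte (one byte per pixel), so this is where reading the pixel data has to start.

First, decompress the downloaded gzip archives and put the results in a folder named dataset.

Now for reading the file.

Start by opening the file in binary mode with open() and reading its contents.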

# Open the decompressed file (no .gz suffix; adjust the name to what your extraction tool produced)
binary_data=open('./dataset/train-images-idx3-ubyte','rb').read()

binary_data now holds the raw bytes of the file.
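If you would rather skip the manual decompression step, Python's standard `gzip` module can decompress while reading. A minimal sketch, assuming the file name ends in `.gz` exactly when the file is still compressed; the helper name `read_idx_bytes` is made up for illustration:

```python
import gzip


def read_idx_bytes(path):
    """Return the raw file bytes, decompressing on the fly for .gz files."""
    # gzip.open and the built-in open share the same interface in 'rb' mode
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as f:
        return f.read()
```

For example, `binary_data = read_idx_bytes('./dataset/train-images-idx3-ubyte.gz')` would give the same bytes as reading the decompressed file.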

For how struct works, see the Python documentation for the struct module, in particular its table of format characters.

Unpacking the binary data

Read the four header fields

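import struct
offset = 0
fmt_header = '!iiii'  # '!' = network (big-endian) byte order, 'i' = 4-byte integer
# magic number, number of images, rows per image, columns per image
magic_num,image_num,row_num,column_num=struct.unpack_from(fmt_header,binary_data,offset)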
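To see how the format string maps onto bytes without needing the real file, here is a small sketch that packs a synthetic 16-byte header and unpacks it again. The values (2051, 60000, 28, 28) are what I expect for the MNIST training image file; treat them as an assumption and compare with what your own file prints.

```python
import struct

fmt_header = "!iiii"  # four 4-byte big-endian integers
# Synthetic header: magic number, image count, rows, columns
header = struct.pack(fmt_header, 2051, 60000, 28, 28)

magic_num, image_num, row_num, column_num = struct.unpack_from(fmt_header, header, 0)
header_size = struct.calcsize(fmt_header)  # number of bytes the header occupies

print(magic_num, image_num, row_num, column_num)
print("header size in bytes:", header_size)
```

The header size computed here is the offset at which the pixel data starts.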

Read the image data

import numpy as np
images=np.empty((image_num,row_num,column_num))  # uninitialized array that will hold the image data
print('Shape of the image array:',images.shape)
offset_image=struct.calcsize(fmt_header)  # offset at which the image data starts
print('Image data starts at offset: %d'%offset_image)
fmt_image='!'+str(row_num*column_num)+'B'  # format string: one 28*28 image needs 28*28 'B' (unsigned byte) codes
print('Format string:',fmt_image)
for i in range(image_num):
    images[i]=np.array(struct.unpack_from(fmt_image,binary_data,offset_image)).reshape(row_num,column_num)
    offset_image=offset_image+struct.calcsize(fmt_image)

Display one of the images

import matplotlib.pyplot as plt
plt.imshow(images[0])
plt.show()  # needed when running as a script; notebooks render the figure without it

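Reading the label data

The steps are similar to reading the image data.

The first two header entries (the magic number and the number of labels) take up 8 bytes.
The label data starts at the third entry.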

import struct
import numpy as np
binary_data=open('./dataset/train-labels.idx1-ubyte','rb').read()
offset_head = 0
fmt_header = '!ii'  # format string: two 4-byte big-endian integers
magic_num,label_num=struct.unpack_from(fmt_header,binary_data,offset_head)  # magic number, number of labels
print(magic_num,label_num)
labels=np.empty(label_num)  # uninitialized array that will hold the labels
print('Shape of the label array:',labels.shape)
offset_label=struct.calcsize(fmt_header)  # offset at which the label data starts
print('Label data starts at offset: %d'%offset_label)
fmt_label='!B'  # format string: one unsigned byte per label
print('Format string:',fmt_label)
for i in range(label_num):
    labels[i]=struct.unpack_from(fmt_label,binary_data,offset_label)[0]  # unpack_from returns a 1-tuple
    offset_label=offset_label+struct.calcsize(fmt_label)

You can then inspect any single label, for example labels[0].

(2) Method 2

Using np.frombuffer()

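import numpy as np
import os
# Sort the names: os.listdir() returns them in arbitrary order, and the indices below assume
# the folder holds only the four decompressed files: t10k-images, t10k-labels, train-images, train-labels
fnames=sorted(os.listdir(path='./dataset/'))
path=[]
for fname in fnames:
    path.append(os.path.join('./dataset/',fname))
print(path)
# Label files have an 8-byte header, image files a 16-byte header
y_train = np.frombuffer(open(path[3],'rb').read(), np.uint8, offset=8)
x_train = np.frombuffer(open(path[2],'rb').read(), np.uint8, offset=16).reshape(len(y_train), 28, 28)
y_test = np.frombuffer(open(path[1],'rb').read(), np.uint8, offset=8)
x_test = np.frombuffer(open(path[0],'rb').read(), np.uint8, offset=16).reshape(len(y_test), 28, 28)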

Or using np.fromfile()

import numpy as np
import os
# Sorted for the same reason as above: the indices assume a fixed file order
fnames=sorted(os.listdir(path='./dataset/'))
path=[]
for fname in fnames:
    path.append(os.path.join('./dataset/',fname))
print(path)
# Read the labels first so their count can be used to reshape the images
y_train=np.fromfile(path[3],np.uint8,offset=8)
x_train=np.fromfile(path[2],np.uint8,offset=16).reshape(len(y_train),28,28)
y_test=np.fromfile(path[1],np.uint8,offset=8)
x_test=np.fromfile(path[0],np.uint8,offset=16).reshape(len(y_test),28,28)
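Both variants above hard-code the header sizes (8 and 16) and the 28x28 shape. As a rough sketch of an alternative, the header can be read from the file itself: in the IDX layout the fourth byte of the magic number gives the number of dimensions, and each dimension size follows as a 4-byte big-endian integer. The function name `parse_idx` is made up for illustration, and the sketch only handles the unsigned-byte type that MNIST uses.

```python
import struct

import numpy as np


def parse_idx(data):
    """Parse IDX-formatted bytes into a uint8 NumPy array."""
    # Magic number layout: two zero bytes, a type code, the number of dimensions
    zero1, zero2, type_code, ndim = struct.unpack_from(">BBBB", data, 0)
    if zero1 != 0 or zero2 != 0:
        raise ValueError("not an IDX file (is it still gzip-compressed?)")
    if type_code != 0x08:
        raise ValueError("only unsigned-byte IDX data is handled in this sketch")
    # Each dimension size is a 4-byte big-endian unsigned integer
    shape = struct.unpack_from(">" + "I" * ndim, data, 4)
    offset = 4 + 4 * ndim  # 8 for label files, 16 for image files
    return np.frombuffer(data, dtype=np.uint8, offset=offset).reshape(shape)
```

With this, `parse_idx(open(path[2],'rb').read())` would return the training images without relying on a fixed offset, and passing a file that is still compressed raises an error instead of returning garbage.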

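2 Datasets in npz format

The data downloaded through TensorFlow is stored as an npz file.

npz, npy: NumPy's binary file formats, i.e. files that store NumPy arrays on disk in binary form. An npy file holds a single array; an npz file is an archive bundling several named arrays.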
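To get a feel for the npz format before touching the real dataset, here is a small round trip with toy arrays. The array contents and the file name toy.npz are invented for the example; only the key names mirror the ones used by mnist.npz.

```python
import os
import tempfile

import numpy as np

# Toy stand-ins for the real data: two blank 28x28 images and their labels
images = np.zeros((2, 28, 28), dtype=np.uint8)
labels = np.array([5, 0], dtype=np.uint8)

with tempfile.TemporaryDirectory() as tmp:
    npz_path = os.path.join(tmp, "toy.npz")
    # Each keyword argument becomes a named array inside the archive
    np.savez(npz_path, x_train=images, y_train=labels)
    with np.load(npz_path) as f:
        keys = sorted(f.files)       # names of the stored arrays
        loaded_images = f["x_train"]
        loaded_labels = f["y_train"]

print(keys, loaded_images.shape, loaded_labels)
```

Indexing the loaded file by name is the same access pattern that `load_data()` uses on mnist.npz.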

(x_train, y_train), (x_test, y_test)=tf.keras.datasets.mnist.load_data()

Its source code is:

import numpy as np
from tensorflow.python.keras.utils.data_utils import get_file
from tensorflow.python.util.tf_export import keras_export
@keras_export('keras.datasets.mnist.load_data')
def load_data(path='mnist.npz'):
    origin_folder = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/'
    path = get_file(
        path,
        origin=origin_folder + 'mnist.npz',
        file_hash=
        '731c5ac602752760c8e48fbffcf8c3b850d9dc2a2aedcf2cc48468fc17b673d1')
    with np.load(path, allow_pickle=True) as f:
        x_train, y_train = f['x_train'], f['y_train']
        x_test, y_test = f['x_test'], f['y_test']
    return (x_train, y_train), (x_test, y_test)
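The `file_hash` in the source above is 64 hexadecimal characters long, which matches the length of a SHA-256 digest. Assuming that is what it is, you can check a manually downloaded or suspicious mnist.npz yourself before loading it. A minimal sketch; the helper name `sha256_of_file` is made up for illustration:

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

If `sha256_of_file(path_to_mnist_npz)` does not equal the `file_hash` string from the source, the file is probably incomplete and should be deleted and downloaded again.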