How to download CIFAR-10 and split it into training and testing data in Python


License: CC 4.0 BY-SA

Included in the column 机器学习小算法 (Small Machine Learning Algorithms)


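This article explains how to download the CIFAR-10 dataset with Python and a shell script, covering how to extract the archive, how to separate the training data from the testing data, and how to preprocess the data. CIFAR-10 is already divided into a training set and a test set, so the scripts only need to read them.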

How to use CIFAR-10 in Python

- Step one: download CIFAR-10 with a shell script
- How to split CIFAR-10 into training data and testing data
- How to convert the data into a more convenient form

Step one: download CIFAR-10 with a shell script

#!/usr/bin/env bash
if ! [ -d "cifar-10-batches-py" ]; then
        wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
        tar xvzf cifar-10-python.tar.gz
        rm -f cifar-10-python.tar.gz
fi

Line 1 declares that this is a bash shell script.
Line 2 checks whether the folder cifar-10-batches-py already exists; the download only runs if it does not.
Line 3 downloads the compressed archive from the University of Toronto site.
Line 4 extracts the archive into the current folder.
Line 5 deletes the compressed file, which is no longer needed after extraction.
The last line, fi, closes the if block and ends the script.
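If you prefer to stay in Python, the same steps can be sketched with the standard library only. This is a minimal sketch and not part of the original script: the function name download_cifar10 and its parameters are hypothetical, and it assumes the archive unpacks into a folder named cifar-10-batches-py, as the shell script expects.

```python
import tarfile
import urllib.request
from pathlib import Path

CIFAR10_URL = "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"


def download_cifar10(url=CIFAR10_URL, dest_dir=".", folder_name="cifar-10-batches-py"):
    """Download and extract the archive unless the folder already exists."""
    dest = Path(dest_dir)
    target = dest / folder_name
    # Same guard as the shell script: skip everything if the folder is there.
    if target.is_dir():
        return target
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / "cifar-10-python.tar.gz"
    # Equivalent of wget: fetch the archive into dest_dir.
    urllib.request.urlretrieve(url, str(archive))
    # Equivalent of tar xvzf: unpack next to the archive.
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    # Equivalent of rm -f: remove the compressed file.
    archive.unlink()
    return target
```

Calling download_cifar10(dest_dir="./data") would then leave the batches in ./data/cifar-10-batches-py, which is the path the loading code in this article reads from.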

Suppose you have created a folder named data and run the script inside it. If you then cd data, you will see a folder named cifar-10-batches-py.
If you cd into cifar-10-batches-py, you will see the batch files: data_batch_1 to data_batch_5 are the training data, and test_batch is the testing data.
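The folder layout described above can also be checked from Python. The sketch below is only an illustration: find_batch_files is a hypothetical helper, and the default path assumes the script was run inside a folder named data under the current directory.

```python
from pathlib import Path


def find_batch_files(folder="./data/cifar-10-batches-py"):
    """Return (training batch paths, test batch path) for an extracted CIFAR-10 folder."""
    folder = Path(folder)
    # data_batch_1 ... data_batch_5 hold the training data.
    train_files = sorted(folder.glob("data_batch_*"))
    # test_batch holds the testing data.
    test_file = folder / "test_batch"
    return train_files, test_file
```

With the full dataset extracted, train_files should contain five paths and test_file should exist.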

How to split CIFAR-10 into training data and testing data

A model is fitted on the training data and evaluated on the testing data, so the dataset has to be divided into these two parts. Fortunately, CIFAR-10 is already split into training data and testing data, so all you need to do is read each part from its own files.

import pickle

import numpy as np


# The folder shown above holds five training files, so nbbatch defaults to 5.
def load_cifar10_2(nbbatch=5):
    all_data = []    # training data, one array per batch file
    all_labels = []  # training labels, one array per batch file

    # Read the training batches data_batch_1 ... data_batch_<nbbatch> in order.
    for i in range(nbbatch):
        # 'rb' opens the file read-only in binary mode, which pickle requires.
        with open("./data/cifar-10-batches-py/data_batch_%s" % (i + 1), 'rb') as f:
            # encoding='bytes' returns a dict whose keys are bytes, e.g. b'data'.
            batch = pickle.load(f, encoding='bytes')
        all_data.append(batch[b'data'])
        # Convert the list of labels into a column array of shape (N, 1).
        all_labels.append(np.asarray(batch[b'labels']).reshape((-1, 1)))

    # Read the single test batch the same way.
    with open("./data/cifar-10-batches-py/test_batch", 'rb') as f:
        batch = pickle.load(f, encoding='bytes')
    test_data = batch[b'data']
    test_labels = np.asarray(batch[b'labels']).reshape((-1, 1))

    # Concatenate the per-batch training arrays along the sample axis.
    all_data = np.concatenate(all_data, axis=0)
    all_labels = np.concatenate(all_labels, axis=0)
    return (all_data, all_labels, test_data, test_labels)
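The labels returned by the loader are integers from 0 to 9. The extracted folder also contains a file named batches.meta, which stores the class names under the key b'label_names' when read with encoding='bytes'. A minimal sketch for reading it, where load_label_names is a hypothetical helper and the default path follows the one used in the loader:

```python
import pickle


def load_label_names(meta_path="./data/cifar-10-batches-py/batches.meta"):
    """Return the class names as a list of str, indexed by integer label."""
    with open(meta_path, "rb") as f:
        meta = pickle.load(f, encoding="bytes")
    # The names are stored as bytes, so decode them for printing.
    return [name.decode("utf-8") for name in meta[b"label_names"]]
```

The name of the class for a given sample is then load_label_names()[label], where label is one of the integers produced by the loader.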

How to convert the data into a more convenient form


def cifar10_proper_array(data):
    all_red = data[:,:1024].reshape(-1, 32, 32)
    all_green = data[:,1024:2048].reshape(-1, 32, 32)
    all_blue = data[:,2048:].reshape(-1, 32, 32)
    return np.stack([all_red, all_green, all_blue], axis=1) / 255.0

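The snippet above reshapes the data and normalizes it. Each row of the raw array holds 3072 values: the first 1024 are the red channel, the next 1024 the green channel, and the last 1024 the blue channel of one 32x32 image. The function turns every row into an array of shape (3, 32, 32) and divides by 255 so that the pixel values lie in the range [0, 1].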
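Because the three channels are stored one after another in each row, the slicing and stacking in cifar10_proper_array gives the same result as a single reshape. The sketch below checks this on random stand-in data (not real CIFAR-10 images) and shows how to move the channel axis last, which plotting libraries usually expect; to_channels_last is a hypothetical helper.

```python
import numpy as np


def to_channels_last(images_nchw):
    """Convert (N, 3, 32, 32) images to (N, 32, 32, 3)."""
    return np.transpose(images_nchw, (0, 2, 3, 1))


# Random stand-in for four raw CIFAR-10 rows of 3072 bytes each.
rng = np.random.default_rng(0)
fake_rows = rng.integers(0, 256, size=(4, 3072), dtype=np.uint8)

# Same slicing as cifar10_proper_array.
stacked = np.stack(
    [
        fake_rows[:, :1024].reshape(-1, 32, 32),
        fake_rows[:, 1024:2048].reshape(-1, 32, 32),
        fake_rows[:, 2048:].reshape(-1, 32, 32),
    ],
    axis=1,
) / 255.0

# A single reshape produces the same channels-first layout.
reshaped = fake_rows.reshape(-1, 3, 32, 32) / 255.0
assert np.array_equal(stacked, reshaped)

print(to_channels_last(reshaped).shape)  # (4, 32, 32, 3)
```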

data, labels, test_data, test_label = load_cifar10_2()
labels = labels.reshape(-1)
test_label = test_label.reshape(-1)

data = cifar10_proper_array(data)
test_data = cifar10_proper_array(test_data)

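The code above is the main part of the script: it loads the four arrays, flattens the labels into 1-D arrays, and converts the training and testing images into normalized arrays of shape (N, 3, 32, 32).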
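After running the main part, a quick sanity check can catch a wrong path or a partial download. The sketch below is an optional addition under stated assumptions: check_cifar10_arrays is a hypothetical helper, and it assumes the images were already converted to shape (N, 3, 32, 32) with values in [0, 1] and the labels flattened to 1-D.

```python
import numpy as np


def check_cifar10_arrays(images, labels, n_expected):
    """Raise ValueError if the arrays do not look like processed CIFAR-10 data."""
    if images.shape != (n_expected, 3, 32, 32):
        raise ValueError("unexpected image shape: %s" % (images.shape,))
    if labels.shape != (n_expected,):
        raise ValueError("unexpected label shape: %s" % (labels.shape,))
    if images.min() < 0.0 or images.max() > 1.0:
        raise ValueError("pixel values are not in [0, 1]")
    if labels.min() < 0 or labels.max() > 9:
        raise ValueError("labels are not in the range 0..9")
    return True
```

CIFAR-10 has 50000 training images and 10000 testing images, so with the full dataset the calls would be check_cifar10_arrays(data, labels, 50000) and check_cifar10_arrays(test_data, test_label, 10000).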