Making Your Own Data (CIFAR-10 Format) in Python

hanlinger_

Article link: https://blog.csdn.net/hanlinger_/article/details/100126641

Introduction to CIFAR-10

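The dataset contains 60,000 color images of 32x32 pixels, divided into 10 classes with 6,000 images per class. Of these, 50,000 are used for training and are split into 5 training batches of 10,000 images each; the remaining 10,000 form a single test batch. The test batch contains exactly 1,000 randomly selected images from each of the 10 classes. The training batches contain the remaining images in random order, so a single training batch does not necessarily hold the same number of images from every class; taken together, the training batches contain exactly 5,000 images per class.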

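The figure below lists the 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck) and shows 10 random images from each: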

Note that the 10 classes are mutually exclusive: there is no overlap, and each image belongs to exactly one class.
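The class balance described above (individual batches may be uneven, the totals are even) can be checked by counting labels. The following is a minimal sketch; `class_counts` and the toy batches are hypothetical names and data made up for illustration, with only the `labels` key taken from the real batch format.

```python
from collections import Counter

def class_counts(batches):
    """Count images per class label across a list of batch dicts."""
    counts = Counter()
    for batch in batches:
        counts.update(batch["labels"])
    return counts

# Tiny synthetic stand-in for real batches (hypothetical data, 3 classes):
# each batch is unbalanced on its own, but the totals are balanced.
toy_batches = [
    {"labels": [0, 0, 1, 2]},
    {"labels": [1, 2, 2, 0]},
    {"labels": [1, 1, 2, 0]},
]
per_batch = [class_counts([b]) for b in toy_batches]
total = class_counts(toy_batches)
```

With the real training batches loaded, the same function would be expected to report 5,000 images for each of the 10 labels.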

Download location for the data (three versions are available: Python, MATLAB, and a binary version suitable for C programs):

===================== Using the Python version as an example ======================

The official site already describes the data format of the Python version of CIFAR-10:

data – a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.
labels – a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.
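The channel-plane layout quoted above can be turned into an ordinary height x width x channel image with a reshape and a transpose. This is a minimal sketch using NumPy; the helper names `row_to_image` and `image_to_row` are assumptions for illustration and are not part of the dataset or any library.

```python
import numpy as np

def row_to_image(row):
    """Convert one 3072-value CIFAR-10 row into a 32x32x3 (H, W, C) image."""
    row = np.asarray(row, dtype=np.uint8)
    if row.shape != (3072,):
        raise ValueError("expected a flat row of 3072 values")
    # The row is laid out as [R plane | G plane | B plane], each plane row-major.
    return row.reshape(3, 32, 32).transpose(1, 2, 0)

def image_to_row(image):
    """Inverse of row_to_image: flatten a 32x32x3 image to a 3072-value row."""
    image = np.asarray(image, dtype=np.uint8)
    if image.shape != (32, 32, 3):
        raise ValueError("expected an image of shape (32, 32, 3)")
    return image.transpose(2, 0, 1).reshape(-1)
```

For a whole batch, the same idea applies to the `data` array at once: `data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)`.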

After downloading and extracting the archive, the contents look like this:

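The dataset contains the files data_batch_1 through data_batch_5, plus test_batch. Each of them is a serialized Python object produced with the cPickle library (for details on pickle, see https://docs.python.org/3/library/pickle.html).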

Taking data_batch_1 as an example, its data format is:

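As shown, the Python version is stored as a dict whose keys are:

data, the image data, an n x 3072 array of uint8 values (one row per image);

labels, the label of each image, a list of n integers in the range 0-9;

batch_label, a descriptive string for the batch;

filenames, the list of image file names.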

{'data': array([[ 59,  43,  50, ..., 140,  84,  72],
       [154, 126, 105, ..., 139, 142, 144],
       [255, 253, 253, ...,  83,  83,  84],
       ..., 
       [ 71,  60,  74, ...,  68,  69,  68],
       [250, 254, 211, ..., 215, 255, 254],
       [ 62,  61,  60, ..., 130, 130, 131]], dtype=uint8), 
'labels': [6, 9, 9, 4, 1, 1, 2, 7, 8, 3, 4, 7, 7, 2, 9, 9, 9, 3, 2, 6, 4, 3, 6, 6, 2, 6, 3, 5, 4, 0, 0, 9, 1, 3, 4, 0, 3, 7, 3, 3, 5, 2, 2, 7, 1, 1, 1, 2, 2, 0, 9, 5, 7, 9, 2, 2, 5, 2, 4, 3, 1, 1, 8, 2, 1, 1, 4, 9, 7, 8, 5, 9, 6, 7, 3, 1, 9, 0, 3, 1, 3, 5, 4, 5, 7, 7,  ... , 9, 8, 9, 4, 4, 7, 1, 0, 4, 3, 6, 3, 9, 8, 3, 6, 8, 3, 6, 6, 2, 6, 7, 3, 0, 0, 0, 2, 5, 1, 2, 9, 2, 2, 1, 6, 3, 9, 1, 1, 5], 
'batch_label': 'training batch 1 of 5', 
'filenames': ['leptodactylus_pentadactylus_s_000004.png', 'camion_s_000148.png', 'tipper_truck_s_001250.png', ... , 'truck_s_000036.png', 'car_s_002296.png', 'estate_car_s_001433.png', 'cur_s_000170.png']}

Below are routines for Python 2 and Python 3 that open such a pickled file and return its contents as a dictionary:

Python2:

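import cPickle

def unpickle(file):
    # Load one pickled CIFAR-10 batch file and return it as a dict (Python 2).
    with open(file, 'rb') as fo:
        batch = cPickle.load(fo)  # avoid shadowing the built-in name dict
    return batch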

Python3:

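import pickle

def unpickle(file):
    # Load one pickled CIFAR-10 batch file and return it as a dict (Python 3).
    with open(file, 'rb') as fo:
        # encoding='bytes' reads the Python 2 strings as bytes,
        # so the keys come back as b'data', b'labels', and so on.
        batch = pickle.load(fo, encoding='bytes')
    return batch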
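One practical consequence of `encoding='bytes'` in Python 3 is that the dictionary keys are bytes objects, so `batch['data']` raises a KeyError while `batch[b'data']` works. The sketch below is one possible way to get string keys back; the helper name `unpickle_str_keys` is hypothetical, and it only converts the keys, leaving values such as `batch_label` and `filenames` as bytes.

```python
import pickle

def unpickle_str_keys(path):
    """Load a CIFAR-10 batch in Python 3 and convert bytes keys to str."""
    with open(path, "rb") as fo:
        raw = pickle.load(fo, encoding="bytes")
    # Only the top-level keys are decoded; the values are returned untouched.
    return {k.decode("utf-8") if isinstance(k, bytes) else k: v
            for k, v in raw.items()}
```

After loading with this helper, the image array would be reachable as `batch["data"]` and the labels as `batch["labels"]`.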

Besides the 6 batch files, the dataset contains one more file, batches.meta. It also holds a Python dictionary object, with the following entry:

label_names, a list of 10 elements giving the class name that corresponds to each numeric label in the labels array. For example: label_names[0] == "airplane", label_names[1] == "automobile".
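As an illustration of how `label_names` is used, the sketch below maps a list of integer labels to class names. The function name `labels_to_names` is an assumption for this example; it accepts a meta dict loaded either with string keys or with the bytes keys that `encoding='bytes'` produces.

```python
def labels_to_names(meta, labels):
    """Map integer labels to class names using the batches.meta dict."""
    # Accept both str keys and the bytes keys produced by encoding='bytes'.
    names = meta.get("label_names", meta.get(b"label_names"))
    if names is None:
        raise KeyError("label_names not found in meta dict")
    # Names may themselves be bytes when loaded with encoding='bytes'.
    decoded = [n.decode("utf-8") if isinstance(n, bytes) else n for n in names]
    return [decoded[i] for i in labels]
```

For example, with the real batches.meta loaded into `meta`, `labels_to_names(meta, [0, 1])` would be expected to give the names for airplane and automobile.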

Code for building the dataset
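The following is a minimal sketch of how custom images could be packed into the batch format described earlier; it is not the original author's code. It assumes the images have already been resized to 32x32 RGB and are available as a uint8 array of shape (n, 32, 32, 3). The names `make_batch` and `save_batch` are hypothetical. Note that this sketch writes string keys, whereas the official files loaded with `encoding='bytes'` yield bytes keys.

```python
import pickle
import numpy as np

def make_batch(images, labels, filenames, batch_label):
    """Pack (n, 32, 32, 3) uint8 images into a CIFAR-10-style batch dict."""
    images = np.asarray(images, dtype=np.uint8)
    if images.ndim != 4 or images.shape[1:] != (32, 32, 3):
        raise ValueError("images must have shape (n, 32, 32, 3)")
    if not (len(images) == len(labels) == len(filenames)):
        raise ValueError("images, labels and filenames must have equal length")
    # (n, H, W, C) -> (n, C, H, W) -> (n, 3072): all red, then green, then blue.
    data = images.transpose(0, 3, 1, 2).reshape(len(images), 3072)
    return {
        "data": data,
        "labels": [int(label) for label in labels],
        "batch_label": batch_label,
        "filenames": list(filenames),
    }

def save_batch(batch, path):
    """Serialize a batch dict with pickle, like the official batch files."""
    with open(path, "wb") as fo:
        pickle.dump(batch, fo)
```

A file written this way can be read back with plain `pickle.load(fo)` in Python 3, since it was produced by Python 3 and needs no `encoding` argument.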

Some reference links on feeding data to PyTorch:
