Using the CIFAR-100 Dataset

汀桦坞

Category: Paddle. Tags: Paddle

Article link: https://blog.csdn.net/wiborgite/article/details/79779426

Background

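The CIFAR-100 dataset contains 100 fine-grained classes with 600 images each: 500 training images and 100 test images per class. The 100 classes are grouped into 20 superclasses. Every image carries one "fine" label (its class) and one "coarse" label (its superclass).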
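The counts above determine the split sizes directly. A quick arithmetic check (the variable names are only illustrative):

```python
# Figures taken from the dataset description above.
num_fine_classes = 100
train_per_class = 500
test_per_class = 100
num_coarse_classes = 20

num_train = num_fine_classes * train_per_class   # total training images
num_test = num_fine_classes * test_per_class     # total test images
fine_per_coarse = num_fine_classes // num_coarse_classes  # classes per superclass

print(num_train, num_test, fine_per_coarse)  # 50000 10000 5
```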

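Because this is one of the most widely used datasets in machine learning, it is worth understanding its structure in detail. The downloaded archive stores the data as pickled binary files, so they have to be loaded in Python before they can be used. This post gives some simple loading code and wraps it as a reader for use with PaddlePaddle. (The approach comes from PaddlePaddle's official source files; this is only a record of my own learning process.)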

Code

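# Note: the original version targeted Python 2 (cPickle, itertools.izip);
# this is the Python 3 equivalent.
import pickle
import tarfile

import numpy

# Open the archive, find the member whose name contains sub_name,
# and return its unpickled contents as a dict.
def get_dict(filename, sub_name):
    with tarfile.open(filename, mode='r') as f:
        names = (each_item.name for each_item in f
                 if sub_name in each_item.name)
        for name in names:
            # encoding='latin1' lets Python 3 load the Python 2 pickle with str keys
            dictfile = pickle.load(f.extractfile(name), encoding='latin1')
            # Only the first matching member is returned
            return dictfile

# Take a dict and pair up its 'data' entry with its labels ('labels' if present,
# otherwise 'fine_labels'). Returns a generator of (sample, label) tuples.
def dict2generator(dictfile):
    data = dictfile['data']
    labels = dictfile.get('labels', dictfile.get('fine_labels', None))
    assert labels is not None
    for sample, label in zip(data, labels):
        # Scale pixel values from [0, 255] to [0, 1]
        yield (sample / 255.0).astype(numpy.float32), int(label)

# Define the reader used with PaddlePaddle
def reader(filename, sub_name):
    batch = get_dict(filename, sub_name)
    for item in dict2generator(batch):
        yield item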
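Each item yielded by the generator is a flat float32 vector plus an integer label. Assuming the standard CIFAR layout (3072 values per image: 1024 red, then 1024 green, then 1024 blue, each channel stored as a row-major 32x32 plane), a sample can be turned back into an image array roughly as follows. The random row is only a stand-in for a real row of `data`, and `to_chw` / `to_hwc` are hypothetical helper names.

```python
import numpy

def to_chw(sample):
    """Reshape a flat 3072-vector into (channels, height, width)."""
    return sample.reshape(3, 32, 32)

def to_hwc(sample):
    """Reshape to (height, width, channels), the layout most image viewers expect."""
    return to_chw(sample).transpose(1, 2, 0)

# Stand-in for one uint8 row of dictfile['data'].
raw = numpy.random.RandomState(0).randint(0, 256, size=3072).astype(numpy.uint8)
sample = (raw / 255.0).astype(numpy.float32)  # same normalization as dict2generator

chw = to_chw(sample)
hwc = to_hwc(sample)
print(chw.shape, hwc.shape, chw.dtype)
```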

Usage

filename = '/root/.cache/paddle/dataset/cifar/cifar-100-python.tar.gz'
sub_name = 'train'
train_reader = reader(filename, sub_name)
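One caveat: `train_reader` here is a generator object, so it is exhausted after a single pass. PaddlePaddle's built-in dataset modules, as far as I can tell, instead hand out a zero-argument callable (a "reader") so that every epoch gets a fresh iterator. A minimal sketch of that wrapping pattern follows; `reader_creator` and `toy_samples` are illustrative names, and `toy_samples` merely stands in for `reader(filename, sub_name)`.

```python
def reader_creator(generator_fn, *args, **kwargs):
    """Wrap a generator function so each call starts a fresh pass over the data."""
    def reader():
        return generator_fn(*args, **kwargs)
    return reader

# Toy stand-in for the generator function defined in this post.
def toy_samples(n):
    for i in range(n):
        yield [float(i)], i

train_reader = reader_creator(toy_samples, 3)

first_pass = list(train_reader())   # one epoch
second_pass = list(train_reader())  # a second epoch sees the same data again
print(len(first_pass), len(second_pass))
```

With the functions from this post, the equivalent call would be `reader_creator(reader, filename, sub_name)`.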

Other notes

The get_dict() function loads the training file or the test file from the archive into a dict, which makes it easy to inspect the data structure:

Inspect the dictionary keys:
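The sketch below exercises the same lookup on a tiny synthetic archive, so it runs without downloading anything. The toy contents are made up; only the key names mirror what I would expect in the real CIFAR-100 training file (`filenames`, `batch_label`, `fine_labels`, `coarse_labels`, `data`). It assumes Python 3, where `encoding='latin1'` keeps the keys as `str`.

```python
import io
import os
import pickle
import tarfile
import tempfile

# Build a tiny stand-in archive; the real one is cifar-100-python.tar.gz.
fake_train = {
    'filenames': ['a.png', 'b.png'],
    'batch_label': 'toy batch',
    'fine_labels': [3, 7],
    'coarse_labels': [0, 1],
    'data': [[0] * 3072, [255] * 3072],
}
payload = pickle.dumps(fake_train, protocol=2)
archive_path = os.path.join(tempfile.mkdtemp(), 'toy-cifar.tar.gz')
with tarfile.open(archive_path, mode='w:gz') as tar:
    info = tarfile.TarInfo(name='cifar-100-python/train')
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Same lookup as get_dict(): take the first member whose name contains sub_name.
with tarfile.open(archive_path, mode='r') as tar:
    member = next(m for m in tar if 'train' in m.name)
    loaded = pickle.load(tar.extractfile(member), encoding='latin1')

keys = sorted(loaded.keys())
print(keys)
# ['batch_label', 'coarse_labels', 'data', 'filenames', 'fine_labels']
```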

Inspect the sizes of data and the labels in the dictionary:
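A small helper for this check might look like the following. It runs on a stand-in dict so that it is self-contained; `describe` is a hypothetical name, and for the real training file I would expect `data` to have shape (50000, 3072) with 50000 labels, and (10000, 3072) with 10000 labels for the test file.

```python
import numpy

n = 4  # stand-in row count; see the expected real sizes in the text above
fake = {
    'data': numpy.zeros((n, 3072), dtype=numpy.uint8),
    'fine_labels': list(range(n)),
    'coarse_labels': [0] * n,
}

def describe(d):
    """Return (data shape, data dtype, number of fine labels, number of coarse labels)."""
    return (d['data'].shape, str(d['data'].dtype),
            len(d['fine_labels']), len(d['coarse_labels']))

summary = describe(fake)
print(summary)  # ((4, 3072), 'uint8', 4, 4)
```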
