(Repost) Converting CIFAR-10 bin files to LMDB format

zyb418

Source: https://blog.csdn.net/xxiaozr/article/details/80426417

CIFAR-10 contains 60,000 32x32 colour images in 10 classes, with 6,000 images per class. The training set holds 50,000 images and the test set 10,000.

The binary version ships as several files: data_batch_1.bin through data_batch_5.bin are the training set, and test_batch.bin is the test set.

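Each binary file is a sequence of fixed-size records, one per image. The first byte of a record is the image's label, a number from 0 to 9. The next 3072 bytes are the image's pixel values: the first 1024 bytes are the red channel, the next 1024 the green channel, and the last 1024 the blue channel, each stored in row-major order.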

Each binary file contains 10000 images.
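The record layout described above can be sketched in a few lines of plain Python, without Caffe or NumPy. This is only an illustration: `split_records` and `pixel_at` are hypothetical helper names, and the two-record blob is synthetic data standing in for a real bin file.

```python
RECORD_BYTES = 1 + 3 * 32 * 32  # 1 label byte + 3072 pixel bytes


def split_records(blob):
    """Yield (label, pixels) pairs from the raw bytes of one bin file."""
    if len(blob) % RECORD_BYTES != 0:
        raise ValueError("size is not a multiple of the record size")
    for start in range(0, len(blob), RECORD_BYTES):
        record = blob[start:start + RECORD_BYTES]
        # Indexing bytes yields an int in Python 3, so record[0] is the label.
        yield record[0], record[1:]


def pixel_at(pixels, channel, row, col):
    """Return one value; channel 0/1/2 is red/green/blue, rows are row-major."""
    return pixels[channel * 1024 + row * 32 + col]


# Synthetic blob (assumption: not real CIFAR-10 data): two records with
# labels 3 and 7, each image having R=10, G=20, B=30 everywhere.
fake_image = bytes([10] * 1024 + [20] * 1024 + [30] * 1024)
blob = bytes([3]) + fake_image + bytes([7]) + fake_image
records = list(split_records(blob))
```

With a real file, `blob` would be the result of reading the file in `'rb'` mode, and `records` would have 10000 entries.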

import os
import caffe
import numpy as np
import lmdb

# Path to the CIFAR-10 bin files
cifar_directory = os.path.abspath("/home/shuai/cifar10/cifar10_data/cifar-10-batches-bin")
# Path of the training LMDB; the directory is created if it does not exist
lmdb_directory = os.path.abspath("cifar10_train_lmdb")


def load_data(train_batches):
    images = []
    labels = []
    for batch in train_batches:
        path = os.path.join(cifar_directory, batch)
        # Open the binary file and read its 10000 records
        # (1 label byte + 3072 pixel bytes each)
        with open(path, 'rb') as f:
            bytestream = f.read(10000 * (1 + 32 * 32 * 3))
        # np.frombuffer(buffer, dtype=...) turns the buffer into a 1-D array
        data = np.frombuffer(bytestream, dtype=np.uint8)
        data = data.reshape(10000, 1 + 3 * 32 * 32)
        # Split the label column from the pixel columns;
        # hsplit is equivalent to split(axis=1)
        label, image = np.hsplit(data, [1])
        images.append(image)
        labels.append(label)
        print('completed ' + batch)
    # Concatenate the per-batch arrays along axis 0
    images = np.concatenate(images)
    labels = np.concatenate(labels)
    print(labels.shape)
    length = labels.shape[0]
    images = images.reshape(length, 3, 32, 32)
    return images, labels

X,Y = load_data(['data_batch_{}.bin'.format(i) for i in range(1,6)])
Xt,Yt = load_data(['test_batch.bin'])

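# Write the images and labels to LMDB. lmdb.open() opens the environment and
# creates the database files (data.mdb and lock.mdb) if they are missing; an
# existing database is not overwritten. map_size sets the maximum size in bytes.
env = lmdb.open(lmdb_directory, map_size=50000 * 1000 * 5)
# env.begin() starts a transaction; each array is converted to a Datum,
# serialized, and added to the LMDB
txn = env.begin(write=True)
count = 0
for i in range(X.shape[0]):
    # Y has shape (N, 1), so take the single label value explicitly
    datum = caffe.io.array_to_datum(X[i], int(Y[i, 0]))
    str_id = '{:08}'.format(count)
    # LMDB keys must be bytes under Python 3
    txn.put(str_id.encode('ascii'), datum.SerializeToString())
    count += 1
    if count % 1000 == 0:
        print('already handled with {} pictures'.format(count))
        # commit() writes the changes; afterwards start a new
        # transaction with env.begin()
        txn.commit()
        txn = env.begin(write=True)
txn.commit()
# Close the environment
env.close()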

# Same procedure for the test set
env = lmdb.open('cifar10_test_lmdb', map_size=10000 * 1000 * 5)
txn = env.begin(write=True)
count = 0
for i in range(Xt.shape[0]):
    datum = caffe.io.array_to_datum(Xt[i], int(Yt[i, 0]))
    str_id = '{:08}'.format(count)
    # LMDB keys must be bytes under Python 3
    txn.put(str_id.encode('ascii'), datum.SerializeToString())
    count += 1
    if count % 1000 == 0:
        print('already handled with {} pictures'.format(count))
        txn.commit()
        txn = env.begin(write=True)
txn.commit()
env.close()
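Two choices in the script are worth a quick sanity check: the zero-padded keys and the map_size value. LMDB orders keys lexicographically by bytes by default, so padding the counter to a fixed width keeps the records in insertion order when they are read back sequentially. The sketch below uses only the standard library; `make_key` is a hypothetical helper name, and the raw-size figure is a rough lower bound that ignores the few extra bytes each serialized Datum carries.

```python
NUM_TRAIN = 50000
RECORD_BYTES = 1 + 3 * 32 * 32  # label + pixels, as in the bin files


def make_key(index):
    """Build the same 8-digit zero-padded key the script uses, as bytes."""
    return '{:08}'.format(index).encode('ascii')


sample = (0, 9, 10, 999, 1000, 49999)

# Padded keys sort in the same order as the integers they encode.
keys = [make_key(i) for i in sample]

# Unpadded keys do not: b'10' sorts before b'9', for example.
unpadded = [str(i).encode('ascii') for i in sample]
unpadded_sorted = sorted(unpadded)

# Rough size check for the training set (lower bound on the stored bytes).
raw_bytes = NUM_TRAIN * RECORD_BYTES
map_size = 50000 * 1000 * 5
headroom = map_size - raw_bytes
```

Here `raw_bytes` comes to about 154 MB against a map_size of 250 MB, so the training set fits with some room for per-record and B-tree overhead.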
