Preface
Deep learning datasets often contain hundreds of thousands of samples, and loading the whole set into memory at once frequently crashes with a MemoryError. After researching many options, Python's h5py package turned out to be a very convenient way to handle large datasets: data stored in an HDF5 file is read lazily, so opening the file does not pull the whole dataset into memory.
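A minimal sketch of that lazy behavior (the file name `demo.h5` and the array shapes are just placeholders): an h5py dataset behaves like an on-disk array, and only the slice you index is actually read.

```python
import h5py
import numpy as np

# Write a small demo file standing in for a real dataset.
with h5py.File("demo.h5", "w") as f:
    f.create_dataset("x", data=np.arange(1000, dtype=np.float32).reshape(100, 10))

with h5py.File("demo.h5", "r") as f:
    dset = f["x"]       # no array data loaded yet, just a handle
    batch = dset[0:10]  # only these 10 rows are read from disk
    print(batch.shape)  # (10, 10)
```

Slicing `dset[0:10]` returns an ordinary NumPy array holding just that chunk, which is what makes batch-wise processing of 50,000+ images feasible on limited RAM.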
Example
While using ResNet for transfer learning to classify images, I first loaded the data straight into memory, which failed with an out-of-memory error. After trying many approaches, h5py solved the problem. The core code for building the dataset is as follows:
import h5py
import numpy as np
from scipy import ndimage  # ndimage.imread was removed in SciPy >= 1.2; use imageio.imread there

for times in range(539):  # 50,000+ images in total, written 100 at a time
    if times == 0:  # first batch: create the file and resizable datasets
        h5f = h5py.File("data/train_data.h5", "w")
        x = h5f.create_dataset("x_train", (100, 299, 299, 3),
                               maxshape=(None, 299, 299, 3),
                               dtype=np.float32)
        y = h5f.create_dataset("y_train", (100, 80),
                               maxshape=(None, 80),
                               dtype=np.int32)
    else:
        h5f = h5py.File("data/train_data.h5", "a")  # append mode
        x = h5f["x_train"]
        y = h5f["y_train"]
    # read the next 100 images (the final batch holds only 79)
    image = np.array(list(map(lambda p: ndimage.imread(p, mode='RGB'),
                              image_path[times * 100:(times + 1) * 100]))).astype(np.float32)
    image = preprocess_input(image)
    ytem = label[times * 100:(times + 1) * 100]  # the matching labels
    if times != 538:  # the total image count is not divisible by 100
        x.resize([times * 100 + 100, 299, 299, 3])
        y.resize([times * 100 + 100, 80])
        x[times * 100:times * 100 + 100] = image
        y[times * 100:times * 100 + 100] = ytem  # write this batch to the datasets
        print('%d batches processed' % (times + 1))
    else:  # last batch: only 79 images remain
        x.resize([times * 100 + 79, 299, 299, 3])
        y.resize([times * 100 + 79, 80])
        x[times * 100:times * 100 + 79] = image
        y[times * 100:times * 100 + 79] = ytem
        print('%d batches processed' % (times + 1))
    h5f.close()  # close after each batch so data is flushed to disk