Preface
Deep learning datasets often contain hundreds of thousands of samples, and loading the whole set into memory at once frequently crashes with a MemoryError. After researching many options, Python's h5py package turned out to be a very convenient way to handle large datasets: data stored in an HDF5 file is read lazily, so opening the file does not pull the whole dataset into memory.
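A minimal sketch of that lazy behavior (the file name `demo.h5` and the array shapes are just placeholders): an h5py dataset behaves like an on-disk array, and only the slice you index is actually read.

```python
import h5py
import numpy as np

# Write a small demo file standing in for a real dataset.
with h5py.File("demo.h5", "w") as f:
    f.create_dataset("x", data=np.arange(1000, dtype=np.float32).reshape(100, 10))

with h5py.File("demo.h5", "r") as f:
    dset = f["x"]       # no array data loaded yet, just a handle
    batch = dset[0:10]  # only these 10 rows are read from disk
    print(batch.shape)  # (10, 10)
```

Slicing `dset[0:10]` returns an ordinary NumPy array holding just that chunk, which is what makes batch-wise processing of 50,000+ images feasible on limited RAM.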
Example
While using ResNet for transfer learning to classify images, I first loaded the data straight into memory, which failed with an out-of-memory error. After trying many approaches, h5py solved the problem. The core code for building the dataset is as follows:
import h5py
import numpy as np
from scipy import ndimage  # ndimage.imread was removed in SciPy >= 1.2; use imageio.imread there

for times in range(539):  # 50,000+ images in total, written 100 at a time
    if times == 0:  # first batch: create the file and resizable datasets
        h5f = h5py.File("data/train_data.h5", "w")
        x = h5f.create_dataset("x_train", (100, 299, 299, 3),
                               maxshape=(None, 299, 299, 3),
                               dtype=np.float32)
        y = h5f.create_dataset("y_train", (100, 80),
                               maxshape=(None, 80),
                               dtype=np.int32)
    else:
        h5f = h5py.File("data/train_data.h5", "a")  # append mode
        x = h5f["x_train"]
        y = h5f["y_train"]
    # read the next 100 images (the final batch holds only 79)
    image = np.array(list(map(lambda p: ndimage.imread(p, mode='RGB'),
                              image_path[times * 100:(times + 1) * 100]))).astype(np.float32)
    image = preprocess_input(image)
    ytem = label[times * 100:(times + 1) * 100]  # the matching labels
    if times != 538:  # the total image count is not divisible by 100
        x.resize([times * 100 + 100, 299, 299, 3])
        y.resize([times * 100 + 100, 80])
        x[times * 100:times * 100 + 100] = image
        y[times * 100:times * 100 + 100] = ytem  # write this batch to the datasets
        print('%d batches processed' % (times + 1))
    else:  # last batch: only 79 images remain
        x.resize([times * 100 + 79, 299, 299, 3])
        y.resize([times * 100 + 79, 80])
        x[times * 100:times * 100 + 79] = image
        y[times * 100:times * 100 + 79] = ytem
        print('%d batches processed' % (times + 1))
    h5f.close()  # close after each batch so data is flushed to disk