Using h5py to import your own image training and test sets

aban_77

Categories: TensorFlow, Data Processing

Article link: https://blog.csdn.net/weixin_43615222/article/details/84577293


Reference: https://blog.csdn.net/chenkz123/article/details/79640658
When training and testing on images, you can use h5py to pack them into a single dataset file that is easy to load for training.

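Organizing the dataset

First, sort the images by label into three folders. This example sets up a three-class rock-paper-scissors problem, so all images that share a label go into the same folder, as the collected dataset in the figure shows.

Figure: images with the same label are placed in the same folder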

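Getting the file list for each class

Before converting to h5py, we need to collect the list of data file names under each folder and attach a label to each one.
First, import the dependencies: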

import os
import numpy as np
from PIL import Image
import h5py
import matplotlib.pyplot as plt  # plt.imread is used later to load the images

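Next, write a helper function that collects the file list and shuffles it, which makes it easy to split into training and test sets.
Reference: https://blog.csdn.net/chenkz123/article/details/79640658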

def get_files(file_dir):
    stone = []
    label_stone = []
    cut = []
    label_cut = []
    cloth = []
    label_cloth = []

    # Collect file paths and labels: 0 = rock (stone), 1 = scissors (cut), 2 = paper (cloth)
    for file in os.listdir(file_dir + '/stone'):
        stone.append(file_dir + '/stone' + '/' + file)
        label_stone.append(0)
    for file in os.listdir(file_dir + '/cut'):
        cut.append(file_dir + '/cut' + '/' + file)
        label_cut.append(1)
    for file in os.listdir(file_dir + '/cloth'):
        cloth.append(file_dir + '/cloth' + '/' + file)
        label_cloth.append(2)

    # Merge all classes into one list of paths and one list of labels
    image_list = np.hstack((stone, cut, cloth))
    label_list = np.hstack((label_stone, label_cut, label_cloth))

    # Pair each path with its label, then shuffle the pairs (rows) together
    temp = np.array([image_list, label_list])
    temp = temp.transpose()
    np.random.shuffle(temp)

    # Split the shuffled pairs back into two lists (labels come back as strings)
    image_list = list(temp[:, 0])
    label_list = list(temp[:, 1])
    label_list = [int(i) for i in label_list]

    # Return two lists, image file paths and their labels, in shuffled order
    return image_list, label_list

This gives us two lists, the image file names and their labels, both shuffled in the same order.
Let's test it by running the following code:

train_dir = '.'
image_list,label_list = get_files(train_dir) 
print(len(image_list))
print(len(label_list))

Output: 1124 1124
The labels and the data have the same length, so everything checks out.
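
Equal lengths do not show whether the three classes are balanced. As a rough extra check, one could count how often each label occurs. The helper name count_labels below is hypothetical (it is not part of the original post), and the example uses a tiny hand-made list rather than the real dataset:

```python
from collections import Counter


def count_labels(label_list):
    """Return a dict mapping each label to the number of times it occurs."""
    return dict(Counter(label_list))


# Tiny hand-made label list, standing in for the real label_list
counts = count_labels([0, 1, 1, 2, 2, 2])
print(counts)
```

Calling count_labels(label_list) on the list returned by get_files would show how many images ended up under labels 0, 1 and 2.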

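Splitting into test and training sets

We take part of the list as the test set. In this code the test-set size is hard-coded to 124; change it as needed.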

n_test = 124  # test-set size, hard-coded here; change as needed
n_train = len(image_list) - n_test

# Pre-allocate the arrays; every image is assumed to be 240 x 320 RGB
Train_image = np.zeros((n_train, 240, 320, 3), dtype='float32')
Train_label = np.zeros((n_train, 1), dtype='float32')

Test_image = np.zeros((n_test, 240, 320, 3), dtype='float32')
Test_label = np.zeros((n_test, 1), dtype='float32')

# The list is already shuffled, so the first n_train items form the training set
for i in range(n_train):
    Train_image[i] = np.array(plt.imread(image_list[i]))
    Train_label[i] = np.array(label_list[i])

# The remaining n_test items form the test set
for i in range(n_train, len(image_list)):
    Test_image[i - n_train] = np.array(plt.imread(image_list[i]))
    Test_label[i - n_train] = np.array(label_list[i])
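
The index arithmetic in the split is easy to get wrong, so here is a minimal sketch of the same idea on plain lists: the last n_test entries of the shuffled lists become the test set. The function name split_lists and the file names in the example are hypothetical and only serve as illustration:

```python
def split_lists(image_list, label_list, n_test):
    """Split already-shuffled lists: the last n_test items become the test set."""
    if not 0 < n_test < len(image_list):
        raise ValueError("n_test must be between 1 and len(image_list) - 1")
    n_train = len(image_list) - n_test
    train = (image_list[:n_train], label_list[:n_train])
    test = (image_list[n_train:], label_list[n_train:])
    return train, test


# Made-up file names and labels, standing in for the output of get_files
(train_files, train_labels), (test_files, test_labels) = split_lists(
    ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg'], [0, 1, 2, 0], n_test=1)
print(len(train_files), len(test_files))
```

With the real lists, n_test=124 reproduces the 1000/124 split used in this post.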

Writing to an h5 file

Name each array, save it to the file, and then run a read/write test.

# Write the four arrays into data.h5 under named datasets
f = h5py.File('data.h5', 'w')
f.create_dataset('X_train', data=Train_image)
f.create_dataset('y_train', data=Train_label)
f.create_dataset('X_test', data=Test_image)
f.create_dataset('y_test', data=Test_label)
f.close()

# Read the file back
train_dataset = h5py.File('data.h5', 'r')
train_set_x_orig = np.array(train_dataset['X_train'][:])  # training set features
train_set_y_orig = np.array(train_dataset['y_train'][:])  # training set labels
test_set_x_orig = np.array(train_dataset['X_test'][:])  # test set features
test_set_y_orig = np.array(train_dataset['y_test'][:])  # test set labels
train_dataset.close()  # close the handle that was opened for reading

# Read/write test
print(train_set_x_orig.shape)
print(train_set_y_orig.shape)

print(train_set_x_orig.max())
print(train_set_x_orig.min())

print(test_set_x_orig.shape)
print(test_set_y_orig.shape)

Output:
(1000, 240, 320, 3)
(1000, 1)
255.0
0.0
(124, 240, 320, 3)
(124, 1)
The training and test sets contain 1000 and 124 samples respectively, matching the split above, so everything checks out.
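
For later training scripts, the read step can be wrapped in a function that uses a with block, so the file is closed even if reading fails. This is only a sketch: the function name load_dataset and the stand-in file demo.h5 are assumptions made for illustration, while the dataset names are the ones used in this post.

```python
import os
import tempfile

import h5py
import numpy as np


def load_dataset(path):
    """Load the four arrays; the file is closed when the with block exits."""
    with h5py.File(path, 'r') as f:
        X_train = f['X_train'][:]
        y_train = f['y_train'][:]
        X_test = f['X_test'][:]
        y_test = f['y_test'][:]
    return X_train, y_train, X_test, y_test


# Tiny stand-in file (hypothetical name) so the sketch runs on its own
demo_path = os.path.join(tempfile.gettempdir(), 'demo.h5')
with h5py.File(demo_path, 'w') as f:
    f.create_dataset('X_train', data=np.zeros((4, 2, 2, 3), dtype='float32'))
    f.create_dataset('y_train', data=np.zeros((4, 1), dtype='float32'))
    f.create_dataset('X_test', data=np.zeros((2, 2, 2, 3), dtype='float32'))
    f.create_dataset('y_test', data=np.zeros((2, 1), dtype='float32'))

X_train, y_train, X_test, y_test = load_dataset(demo_path)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```

Calling load_dataset('data.h5') on the file written earlier should return the same four arrays as the read/write test.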

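Summary

At this point we have generated an h5 file. However, roughly 20 MB of JPG images grew to about 1 GB after conversion to h5. That is expected, because the file stores every pixel uncompressed as float32: 1124 × 240 × 320 × 3 × 4 bytes ≈ 1.04 GB. As written, this method does not seem suitable for very large datasets.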
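
One possible way to shrink the file, which the original post does not try, is to store the pixels as uint8 instead of float32 (a 4x reduction, since the values are integers in 0..255) and to turn on h5py's built-in gzip compression. The sketch below uses the hypothetical names save_compact and compact_demo.h5 and flat gray stand-in images; real photos will compress far less well, so the actual gain from gzip is not known here.

```python
import os
import tempfile

import h5py
import numpy as np


def save_compact(path, images, labels):
    """Store images as gzip-compressed uint8 rather than raw float32."""
    with h5py.File(path, 'w') as f:
        f.create_dataset('X', data=images.astype('uint8'),
                         compression='gzip', compression_opts=4)
        f.create_dataset('y', data=labels.astype('uint8'))


# Stand-in data: 8 flat gray images with integer pixel values
images = np.full((8, 240, 320, 3), 128, dtype='float32')
labels = np.arange(8) % 3
path = os.path.join(tempfile.gettempdir(), 'compact_demo.h5')
save_compact(path, images, labels)

# Convert back to float32 after loading if the model expects floats
with h5py.File(path, 'r') as f:
    restored = f['X'][:].astype('float32')
print(np.array_equal(restored, images))
print(os.path.getsize(path), images.nbytes)
```

The conversion to uint8 is lossless only when the pixel values are already integers in the range 0 to 255, which matches the 0.0 and 255.0 extremes printed in the read/write test.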
