Introduction
When we talk about deep learning, usually the first thing that comes to mind is a huge amount of data or a large number of images (e.g. a couple of million images in ImageNet). In such a situation, it is not very smart or efficient to load every single image from the hard drive separately, apply image preprocessing, and then pass it to the network to train, validate, or test. Besides the time required for preprocessing, it is far more time consuming to read multiple images from a hard drive than to have them all in a single file and read them as a single bunch of data. Fortunately, there are different data models and libraries that work in our favor, such as HDF5 and TFRecord. In this post we learn how to save a large number of images in a single HDF5 file and then load them from the file batch by batch. It does not matter how big the data is, or whether it is larger than your memory or not. HDF5 provides tools to manage, manipulate, view, compress, and save the data. We will cover the same topic using TFRecord in our next post.
In this post, we load, resize and save all the images inside the train folder of the well-known Dogs vs. Cats data set. To follow the rest of this post you need to download the train part of the Dogs vs. Cats data set.
List images and their labels
First, we need to list all the images and label them. We give each cat image a label of 0 and each dog image a label of 1. The following code lists all the images, gives them proper labels, and then shuffles the data. We also divide the data set into three parts: train (60%), validation (20%), and test (20%).
```python
from random import shuffle
import glob

shuffle_data = True  # shuffle the addresses before saving
hdf5_path = 'Cat vs Dog/dataset.hdf5'  # address to where you want to save the hdf5 file
cat_dog_train_path = 'Cat vs Dog/train/*.jpg'

# read addresses and labels from the 'train' folder
addrs = glob.glob(cat_dog_train_path)
labels = [0 if 'cat' in addr else 1 for addr in addrs]  # 0 = Cat, 1 = Dog

# to shuffle data
if shuffle_data:
    c = list(zip(addrs, labels))
    shuffle(c)
    addrs, labels = zip(*c)

# Divide the data into 60% train, 20% validation, and 20% test
train_addrs = addrs[0:int(0.6 * len(addrs))]
train_labels = labels[0:int(0.6 * len(labels))]
val_addrs = addrs[int(0.6 * len(addrs)):int(0.8 * len(addrs))]
val_labels = labels[int(0.6 * len(labels)):int(0.8 * len(labels))]
test_addrs = addrs[int(0.8 * len(addrs)):]
test_labels = labels[int(0.8 * len(labels)):]
```
Create an HDF5 file
There are two main libraries that let you work with the HDF5 format, namely h5py and tables (PyTables). We are going to explain how to work with each of them below. The first step is to create an HDF5 file. To store the images, we should define an array for each of the train, validation, and test sets with the shape of (number of data, image_height, image_width, image_depth) in TensorFlow order, or (number of data, image_depth, image_height, image_width) in Theano order. For the labels we also need an array for each of the train, validation, and test sets with the shape of (number of data). Finally, we calculate the pixel-wise mean of the train set and save it in an array with the shape of (1, image_height, image_width, image_depth). Note that you should always determine the type of data (dtype) when you create an array for it.
- tables: In tables we can use create_earray, which creates an empty, extendable array (number of data = 0) that we can append data to later. For labels, it is more convenient here to use create_array, as it lets us write the labels at the moment the array is created. To set the dtype of an array, you can use tables dtypes, such as tables.UInt8Atom() for uint8. The first argument of the create_earray and create_array methods is the data group (we create the arrays in the root group), which lets you manage your data by creating different data groups. You can think of groups as something like folders inside your HDF5 file.
- h5py: In h5py we create an array using create_dataset. Note that we must determine the exact size of the array when defining it. We can use create_dataset for the labels as well and write the labels to it immediately. You can set the dtype of an array directly using numpy dtypes.
```python
import numpy as np
import tables

data_order = 'tf'  # 'th' for Theano, 'tf' for TensorFlow
img_dtype = tables.UInt8Atom()  # dtype in which the images will be saved

# check the order of data and choose the proper data shape to save images
if data_order == 'th':
    data_shape = (0, 3, 224, 224)
elif data_order == 'tf':
    data_shape = (0, 224, 224, 3)

# open a hdf5 file and create earrays
hdf5_file = tables.open_file(hdf5_path, mode='w')
train_storage = hdf5_file.create_earray(hdf5_file.root, 'train_img', img_dtype, shape=data_shape)
val_storage = hdf5_file.create_earray(hdf5_file.root, 'val_img', img_dtype, shape=data_shape)
test_storage = hdf5_file.create_earray(hdf5_file.root, 'test_img', img_dtype, shape=data_shape)
# the mean is a float, so it needs a float atom (uint8 would truncate it)
mean_storage = hdf5_file.create_earray(hdf5_file.root, 'train_mean', tables.Float32Atom(), shape=data_shape)

# create the label arrays and copy the labels data in them
hdf5_file.create_array(hdf5_file.root, 'train_labels', train_labels)
hdf5_file.create_array(hdf5_file.root, 'val_labels', val_labels)
hdf5_file.create_array(hdf5_file.root, 'test_labels', test_labels)
```
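For comparison, the h5py version of the same step could look roughly like the sketch below. Since h5py needs the exact array size at creation time, the shapes here use a small hypothetical image count and placeholder labels instead of the real address lists; the file name is also made up for this sketch:

```python
import numpy as np
import h5py

hdf5_path = 'dataset_h5py.hdf5'    # hypothetical path for this sketch
train_shape = (6, 224, 224, 3)     # pretend we have 6 training images, TensorFlow order
train_labels = [0, 1, 0, 1, 0, 1]  # placeholder labels

hdf5_file = h5py.File(hdf5_path, mode='w')

# unlike tables' create_earray, the full array size is fixed at creation time
hdf5_file.create_dataset('train_img', train_shape, np.uint8)
hdf5_file.create_dataset('train_mean', (1,) + train_shape[1:], np.float32)

# labels can be created and filled right away
hdf5_file.create_dataset('train_labels', (len(train_labels),), np.int8)
hdf5_file['train_labels'][...] = train_labels

hdf5_file.close()
```

The real code would create val_img, test_img, and the remaining label datasets the same way, with sizes taken from the address lists built earlier.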
Now it's time to read the images one by one, apply preprocessing (only resizing in our code), and then save them.
```python
import cv2
import numpy as np

# a numpy array to save the mean of the images
mean = np.zeros(data_shape[1:], np.float32)

# loop over train addresses
for i in range(len(train_addrs)):
    # print how many images are saved every 1000 images
    if i % 1000 == 0 and i > 1:
        print('Train data: {}/{}'.format(i, len(train_addrs)))
    # read an image and resize to (224, 224)
    # cv2 loads images as BGR, convert to RGB
    addr = train_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # add any image pre-processing here
    # if the data order is Theano, axis orders should change
    if data_order == 'th':
        img = np.rollaxis(img, 2)
    # save the image and calculate the mean so far
    train_storage.append(img[None])
    mean += img / float(len(train_labels))

# loop over validation addresses (same preprocessing as the training loop)
for i in range(len(val_addrs)):
    if i % 1000 == 0 and i > 1:
        print('Validation data: {}/{}'.format(i, len(val_addrs)))
    addr = val_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    if data_order == 'th':
        img = np.rollaxis(img, 2)
    # save the image
    val_storage.append(img[None])

# loop over test addresses (same preprocessing as the training loop)
for i in range(len(test_addrs)):
    if i % 1000 == 0 and i > 1:
        print('Test data: {}/{}'.format(i, len(test_addrs)))
    addr = test_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    if data_order == 'th':
        img = np.rollaxis(img, 2)
    # save the image
    test_storage.append(img[None])

# save the mean and close the hdf5 file
mean_storage.append(mean[None])
hdf5_file.close()
```
Read the HDF5 file
It's time to check whether the data was saved properly in the HDF5 file. To do so, we load the data in batches of an arbitrary size and plot the first image of the first 5 batches. We also check the label of each image. We define a variable, subtract_mean, which indicates whether we want to subtract the mean of the training set before showing the image. In tables we access each array by calling its name after its data group (like hdf5_file.group.arrayname), and we can index it like a numpy array. In h5py, however, we access an array using its name like a dictionary key (hdf5_file["arrayname"]). In either case, you have access to the shape of the array through .shape, like a numpy array.
```python
import tables
import numpy as np

hdf5_path = 'Cat vs Dog/dataset.hdf5'
subtract_mean = False

# open the hdf5 file
hdf5_file = tables.open_file(hdf5_path, mode='r')

# subtract the training mean
if subtract_mean:
    mm = hdf5_file.root.train_mean[0]
    mm = mm[np.newaxis, ...]

# Total number of samples
data_num = hdf5_file.root.train_img.shape[0]
```
Now we create a list of batch indices and shuffle it. Then we loop over the batches and read all the images in each batch at once.
```python
from random import shuffle
from math import ceil
import matplotlib.pyplot as plt

batch_size = 64  # an arbitrary batch size
nb_class = 2     # number of classes (cat, dog)

# create list of batches to shuffle the data
batches_list = list(range(int(ceil(float(data_num) / batch_size))))
shuffle(batches_list)

# loop over batches
for n, i in enumerate(batches_list):
    i_s = i * batch_size  # index of the first image in this batch
    i_e = min([(i + 1) * batch_size, data_num])  # index of the last image in this batch

    # read batch images and remove training mean
    images = hdf5_file.root.train_img[i_s:i_e]
    if subtract_mean:
        images = images - mm  # not in-place, so the uint8 images are cast to float

    # read labels and convert to one hot encoding
    labels = hdf5_file.root.train_labels[i_s:i_e]
    labels_one_hot = np.zeros((len(labels), nb_class))  # len(labels) handles a smaller last batch
    labels_one_hot[np.arange(len(labels)), labels] = 1

    print('{}/{}'.format(n + 1, len(batches_list)))
    print(labels[0], labels_one_hot[0, :])
    plt.imshow(images[0])
    plt.show()

    if n == 5:  # break after 5 batches
        break

hdf5_file.close()
```
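With h5py, the same batch reading uses dictionary-style access instead of hdf5_file.root. The sketch below writes a tiny dummy file first so that it runs on its own; with the real data you would open 'Cat vs Dog/dataset.hdf5' instead and keep the shuffling and one-hot encoding from the loop above:

```python
import numpy as np
import h5py

hdf5_path = 'demo.hdf5'  # stand-in for 'Cat vs Dog/dataset.hdf5'

# build a tiny dummy file so the sketch is self-contained
with h5py.File(hdf5_path, 'w') as f:
    f.create_dataset('train_img', data=np.zeros((6, 224, 224, 3), np.uint8))
    f.create_dataset('train_labels', data=np.array([0, 1, 0, 1, 0, 1]))

# arrays are accessed like dictionary entries and sliced like numpy arrays
hdf5_file = h5py.File(hdf5_path, 'r')
data_num = hdf5_file['train_img'].shape[0]

batch_size = 2
for i_s in range(0, data_num, batch_size):
    i_e = min(i_s + batch_size, data_num)
    images = hdf5_file['train_img'][i_s:i_e]  # slicing reads only this batch into memory
    labels = hdf5_file['train_labels'][i_s:i_e]
    print(images.shape, list(labels))

hdf5_file.close()
```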
You can download the code for this post from our GitHub page.
Source: http://machinelearninguru.com/deep_learning/data_preparation/hdf5/hdf5.html