Introduction
When we talk about deep learning, usually the first thing that comes to mind is a huge amount of data or a large number of images (e.g. a couple of million images in ImageNet). In such a situation, it is not very smart or efficient to load every single image from the hard drive separately, apply image preprocessing, and then pass it to the network to train, validate, or test. Besides the time required for preprocessing, it is far more time consuming to read multiple images from a hard drive than to have them all in a single file and read them as a single bunch of data. Fortunately, there are different data models and libraries that work in our favor, such as HDF5 and TFRecord. In this post we learn how to save a large number of images in a single HDF5 file and then load them from the file batch by batch. It does not matter how big the data is, or whether it is larger than your memory or not. HDF5 provides tools to manage, manipulate, view, compress, and save the data. We will cover the same topic using TFRecord in our next post.
In this post, we load, resize and save all the images inside the train folder of the well-known Dogs vs. Cats data set. To follow the rest of this post you need to download the train part of the Dogs vs. Cats data set.
List images and their labels
First, we need to list all the images and label them. We give each cat image a label of 0 and each dog image a label of 1. The following code lists all the images, gives them proper labels, and then shuffles the data. We also divide the data set into three parts: train (60%), validation (20%), and test (20%).
```python
from random import shuffle
import glob

shuffle_data = True  # shuffle the addresses before saving
hdf5_path = 'Cat vs Dog/dataset.hdf5'  # address to where you want to save the hdf5 file
cat_dog_train_path = 'Cat vs Dog/train/*.jpg'

# read addresses and labels from the 'train' folder
addrs = glob.glob(cat_dog_train_path)
labels = [0 if 'cat' in addr else 1 for addr in addrs]  # 0 = Cat, 1 = Dog

# to shuffle data
if shuffle_data:
    c = list(zip(addrs, labels))
    shuffle(c)
    addrs, labels = zip(*c)

# Divide the data into 60% train, 20% validation, and 20% test
train_addrs = addrs[0:int(0.6 * len(addrs))]
train_labels = labels[0:int(0.6 * len(labels))]
val_addrs = addrs[int(0.6 * len(addrs)):int(0.8 * len(addrs))]
val_labels = labels[int(0.6 * len(labels)):int(0.8 * len(labels))]
test_addrs = addrs[int(0.8 * len(addrs)):]
test_labels = labels[int(0.8 * len(labels)):]
```
Create an HDF5 file
There are two main libraries that let you work with the HDF5 format, namely h5py and tables (PyTables). We are going to explain how to work with each of them below. The first step is to create an HDF5 file. To store the images, we should define an array for each of the train, validation, and test sets with the shape of (number of data, image_height, image_width, image_depth) in TensorFlow order, or (number of data, image_depth, image_height, image_width) in Theano order. For the labels we also need an array for each of the train, validation, and test sets with the shape of (number of data). Finally, we calculate the pixel-wise mean of the train set and save it in an array with the shape of (1, image_height, image_width, image_depth). Note that you should always determine the type of data (dtype) when you create an array for it.
- tables: In tables we can use create_earray, which creates an empty, extendable array (number of data = 0) that we can append data to later. For labels, it is more convenient here to use create_array, as it lets us write the labels at the moment the array is created. To set the dtype of an array, you can use tables dtypes, such as tables.UInt8Atom() for uint8. The first argument of the create_earray and create_array methods is the data group (we create the arrays in the root group), which lets you manage your data by creating different data groups. You can think of groups as something like folders inside your HDF5 file.
- h5py: In h5py we create an array using create_dataset. Note that we must determine the exact size of the array when defining it. We can use create_dataset for the labels as well and write the labels to it immediately. You can set the dtype of an array directly using numpy dtypes.
```python
import numpy as np
import tables

data_order = 'tf'  # 'th' for Theano, 'tf' for TensorFlow
img_dtype = tables.UInt8Atom()  # dtype in which the images will be saved

# check the order of data and choose the proper data shape to save images
if data_order == 'th':
    data_shape = (0, 3, 224, 224)
elif data_order == 'tf':
    data_shape = (0, 224, 224, 3)

# open a hdf5 file and create earrays
hdf5_file = tables.open_file(hdf5_path, mode='w')
train_storage = hdf5_file.create_earray(hdf5_file.root, 'train_img', img_dtype, shape=data_shape)
val_storage = hdf5_file.create_earray(hdf5_file.root, 'val_img', img_dtype, shape=data_shape)
test_storage = hdf5_file.create_earray(hdf5_file.root, 'test_img', img_dtype, shape=data_shape)
# the mean is a float, so it needs a float atom (uint8 would truncate it)
mean_storage = hdf5_file.create_earray(hdf5_file.root, 'train_mean', tables.Float32Atom(), shape=data_shape)

# create the label arrays and copy the labels data in them
hdf5_file.create_array(hdf5_file.root, 'train_labels', train_labels)
hdf5_file.create_array(hdf5_file.root, 'val_labels', val_labels)
hdf5_file.create_array(hdf5_file.root, 'test_labels', test_labels)
```
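For comparison, the h5py version of the same step could look roughly like the sketch below. Since h5py needs the exact array size at creation time, the shapes here use a small hypothetical image count and placeholder labels instead of the real address lists; the file name is also made up for this sketch:

```python
import numpy as np
import h5py

hdf5_path = 'dataset_h5py.hdf5'    # hypothetical path for this sketch
train_shape = (6, 224, 224, 3)     # pretend we have 6 training images, TensorFlow order
train_labels = [0, 1, 0, 1, 0, 1]  # placeholder labels

hdf5_file = h5py.File(hdf5_path, mode='w')

# unlike tables' create_earray, the full array size is fixed at creation time
hdf5_file.create_dataset('train_img', train_shape, np.uint8)
hdf5_file.create_dataset('train_mean', (1,) + train_shape[1:], np.float32)

# labels can be created and filled right away
hdf5_file.create_dataset('train_labels', (len(train_labels),), np.int8)
hdf5_file['train_labels'][...] = train_labels

hdf5_file.close()
```

The real code would create val_img, test_img, and the remaining label datasets the same way, with sizes taken from the address lists built earlier.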
Now it's time to read the images one by one, apply preprocessing (only resizing in our code), and then save them.
```python
import cv2
import numpy as np

# a numpy array to save the mean of the images
mean = np.zeros(data_shape[1:], np.float32)

# loop over train addresses
for i in range(len(train_addrs)):
    # print how many images are saved every 1000 images
    if i % 1000 == 0 and i > 1:
        print('Train data: {}/{}'.format(i, len(train_addrs)))
    # read an image and resize to (224, 224)
    # cv2 loads images as BGR, convert to RGB
    addr = train_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # add any image pre-processing here
    # if the data order is Theano, axis orders should change
    if data_order == 'th':
        img = np.rollaxis(img, 2)
    # save the image and calculate the mean so far
    train_storage.append(img[None])
    mean += img / float(len(train_labels))

# loop over validation addresses (same preprocessing as the training loop)
for i in range(len(val_addrs)):
    if i % 1000 == 0 and i > 1:
        print('Validation data: {}/{}'.format(i, len(val_addrs)))
    addr = val_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    if data_order == 'th':
        img = np.rollaxis(img, 2)
    # save the image
    val_storage.append(img[None])

# loop over test addresses (same preprocessing as the training loop)
for i in range(len(test_addrs)):
    if i % 1000 == 0 and i > 1:
        print('Test data: {}/{}'.format(i, len(test_addrs)))
    addr = test_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    if data_order == 'th':
        img = np.rollaxis(img, 2)
    # save the image
    test_storage.append(img[None])

# save the mean and close the hdf5 file
mean_storage.append(mean[None])
hdf5_file.close()
```
Read the HDF5 file
It's time to check whether the data was saved properly in the HDF5 file. To do so, we load the data in batches of an arbitrary size and plot the first image of the first 5 batches. We also check the label of each image. We define a variable, subtract_mean, which indicates whether we want to subtract the mean of the training set before showing the image. In tables we access each array by calling its name after its data group (like hdf5_file.group.arrayname), and we can index it like a numpy array. In h5py, however, we access an array using its name like a dictionary key (hdf5_file["arrayname"]). In either case, you have access to the shape of the array through .shape, like a numpy array.
```python
import tables
import numpy as np

hdf5_path = 'Cat vs Dog/dataset.hdf5'
subtract_mean = False

# open the hdf5 file
hdf5_file = tables.open_file(hdf5_path, mode='r')

# subtract the training mean
if subtract_mean:
    mm = hdf5_file.root.train_mean[0]
    mm = mm[np.newaxis, ...]

# Total number of samples
data_num = hdf5_file.root.train_img.shape[0]
```
Now we create a list of batch indices and shuffle it. Then we loop over the batches and read all the images in each batch at once.
```python
from random import shuffle
from math import ceil
import matplotlib.pyplot as plt

batch_size = 64  # an arbitrary batch size
nb_class = 2     # number of classes (cat, dog)

# create list of batches to shuffle the data
batches_list = list(range(int(ceil(float(data_num) / batch_size))))
shuffle(batches_list)

# loop over batches
for n, i in enumerate(batches_list):
    i_s = i * batch_size  # index of the first image in this batch
    i_e = min([(i + 1) * batch_size, data_num])  # index of the last image in this batch

    # read batch images and remove training mean
    images = hdf5_file.root.train_img[i_s:i_e]
    if subtract_mean:
        images = images - mm  # not in-place, so the uint8 images are cast to float

    # read labels and convert to one hot encoding
    labels = hdf5_file.root.train_labels[i_s:i_e]
    labels_one_hot = np.zeros((len(labels), nb_class))  # len(labels) handles a smaller last batch
    labels_one_hot[np.arange(len(labels)), labels] = 1

    print('{}/{}'.format(n + 1, len(batches_list)))
    print(labels[0], labels_one_hot[0, :])
    plt.imshow(images[0])
    plt.show()

    if n == 5:  # break after 5 batches
        break

hdf5_file.close()
```
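With h5py, the same batch reading uses dictionary-style access instead of hdf5_file.root. The sketch below writes a tiny dummy file first so that it runs on its own; with the real data you would open 'Cat vs Dog/dataset.hdf5' instead and keep the shuffling and one-hot encoding from the loop above:

```python
import numpy as np
import h5py

hdf5_path = 'demo.hdf5'  # stand-in for 'Cat vs Dog/dataset.hdf5'

# build a tiny dummy file so the sketch is self-contained
with h5py.File(hdf5_path, 'w') as f:
    f.create_dataset('train_img', data=np.zeros((6, 224, 224, 3), np.uint8))
    f.create_dataset('train_labels', data=np.array([0, 1, 0, 1, 0, 1]))

# arrays are accessed like dictionary entries and sliced like numpy arrays
hdf5_file = h5py.File(hdf5_path, 'r')
data_num = hdf5_file['train_img'].shape[0]

batch_size = 2
for i_s in range(0, data_num, batch_size):
    i_e = min(i_s + batch_size, data_num)
    images = hdf5_file['train_img'][i_s:i_e]  # slicing reads only this batch into memory
    labels = hdf5_file['train_labels'][i_s:i_e]
    print(images.shape, list(labels))

hdf5_file.close()
```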
You can download the code for this post from our GitHub page.
Source: http://machinelearninguru.com/deep_learning/data_preparation/hdf5/hdf5.html