Converting Images to an HDF5 File (Loading and Saving)

Introduction

When we talk about deep learning, usually the first thing that comes to mind is a huge amount of data or a large number of images (e.g. a couple of million images in ImageNet). In such a situation, it is not very smart or efficient to load every single image from the hard drive separately, apply image preprocessing, and then pass it to the network to train, validate, or test. Besides the time required for preprocessing, it is far more time-consuming to read multiple images from a hard drive than to have them all in a single file and read them as a single bunch of data. Fortunately, there are different data models and libraries that work in our favor, such as HDF5 and TFRecord. In this post we learn how to save a large number of images in a single HDF5 file and then load them from the file batch-wise. It does not matter how big the data is, or whether it is larger than your memory. HDF5 provides tools to manage, manipulate, view, compress, and save the data. We will cover the same topic using TFRecord in our next post.

In this post, we load, resize, and save all the images inside the train folder of the well-known Dogs vs. Cats data set. To follow the rest of this post, you need to download the train part of the Dogs vs. Cats data set.

List images and their labels

First, we need to list all the images and label them. We give each cat image the label 0 and each dog image the label 1. The following code lists all the images, gives them proper labels, and then shuffles the data. We also divide the data set into three parts: train (60%), validation (20%), and test (20%).

List images and label them
from random import shuffle
import glob

shuffle_data = True  # shuffle the addresses before saving
hdf5_path = 'Cat vs Dog/dataset.hdf5'  # address to where you want to save the hdf5 file
cat_dog_train_path = 'Cat vs Dog/train/*.jpg'

# read addresses and labels from the 'train' folder
addrs = glob.glob(cat_dog_train_path)
labels = [0 if 'cat' in addr else 1 for addr in addrs]  # 0 = Cat, 1 = Dog

# to shuffle data
if shuffle_data:
    c = list(zip(addrs, labels))
    shuffle(c)
    addrs, labels = zip(*c)

# Divide the data into 60% train, 20% validation, and 20% test
train_addrs = addrs[0:int(0.6 * len(addrs))]
train_labels = labels[0:int(0.6 * len(labels))]
val_addrs = addrs[int(0.6 * len(addrs)):int(0.8 * len(addrs))]
val_labels = labels[int(0.6 * len(labels)):int(0.8 * len(labels))]
test_addrs = addrs[int(0.8 * len(addrs)):]
test_labels = labels[int(0.8 * len(labels)):]

Create an HDF5 file

There are two main libraries that let you work with the HDF5 format, namely h5py and tables (PyTables). We are going to explain how to work with each of them in the following. The first step is to create an HDF5 file. To store the images, we should define an array for each of the train, validation, and test sets with the shape (number of data, image_height, image_width, image_depth) in Tensorflow order, or (number of data, image_depth, image_height, image_width) in Theano order. For the labels we also need an array for each of the train, validation, and test sets with the shape (number of data,). Finally, we calculate the pixel-wise mean of the train set and save it in an array with the shape (1, image_height, image_width, image_depth). Note that you should always determine the data type (dtype) when you create an array.

  • tables: In tables we can use create_earray, which creates an extendable empty array (number of data = 0) to which we can append data later. For the labels, it is more convenient to use create_array, as it lets us write the labels at the moment the array is created. To set the dtype of an array, you use a tables atom such as tables.UInt8Atom() for uint8. The first argument of the create_earray and create_array methods is the data group (we create the arrays in the root group), which lets you manage your data by creating different data groups. You can think of groups as something like folders in your HDF5 file.
  • h5py: In h5py we create an array using create_dataset. Note that we must determine the exact size of the array when defining it. We can use create_dataset for the labels as well and immediately write the labels into it. You set the dtype of an array directly using numpy dtypes. (A minimal h5py sketch follows the tables code below.)
Creating an HDF5 file (tables)
import numpy as np
import tables

data_order = 'tf'  # 'th' for Theano, 'tf' for Tensorflow
img_dtype = tables.UInt8Atom()  # dtype in which the images will be saved

# check the order of data and choose the proper data shape to save images
if data_order == 'th':
    data_shape = (0, 3, 224, 224)
elif data_order == 'tf':
    data_shape = (0, 224, 224, 3)

# open an hdf5 file and create earrays
hdf5_file = tables.open_file(hdf5_path, mode='w')
train_storage = hdf5_file.create_earray(hdf5_file.root, 'train_img', img_dtype, shape=data_shape)
val_storage = hdf5_file.create_earray(hdf5_file.root, 'val_img', img_dtype, shape=data_shape)
test_storage = hdf5_file.create_earray(hdf5_file.root, 'test_img', img_dtype, shape=data_shape)
# the mean is a float, so store it with a float atom rather than the uint8 image dtype
mean_storage = hdf5_file.create_earray(hdf5_file.root, 'train_mean', tables.Float32Atom(), shape=data_shape)

# create the label arrays and copy the labels data into them
hdf5_file.create_array(hdf5_file.root, 'train_labels', train_labels)
hdf5_file.create_array(hdf5_file.root, 'val_labels', val_labels)
hdf5_file.create_array(hdf5_file.root, 'test_labels', test_labels)
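The h5py version of the same step can be sketched as follows. This is a minimal sketch, assuming the hdf5_path, address lists, and label lists defined earlier and the Tensorflow data order; with h5py the full dataset size must be given up front.

import numpy as np
import h5py

train_shape = (len(train_addrs), 224, 224, 3)
val_shape = (len(val_addrs), 224, 224, 3)
test_shape = (len(test_addrs), 224, 224, 3)

# open an hdf5 file and create the fixed-size datasets
hdf5_file = h5py.File(hdf5_path, mode='w')
hdf5_file.create_dataset('train_img', train_shape, np.uint8)
hdf5_file.create_dataset('val_img', val_shape, np.uint8)
hdf5_file.create_dataset('test_img', test_shape, np.uint8)
hdf5_file.create_dataset('train_mean', (1,) + train_shape[1:], np.float32)

# create the label datasets and write the labels immediately
hdf5_file.create_dataset('train_labels', (len(train_addrs),), np.int8)
hdf5_file['train_labels'][...] = train_labels
hdf5_file.create_dataset('val_labels', (len(val_addrs),), np.int8)
hdf5_file['val_labels'][...] = val_labels
hdf5_file.create_dataset('test_labels', (len(test_addrs),), np.int8)
hdf5_file['test_labels'][...] = test_labels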

Now it's time to read the images one by one, apply the preprocessing (only resizing in our code), and then save them.

Load images and save them (tables)
import cv2

# a numpy array to save the mean of the images
mean = np.zeros(data_shape[1:], np.float32)

# loop over train addresses
for i in range(len(train_addrs)):
    # print how many images are saved every 1000 images
    if i % 1000 == 0 and i > 1:
        print('Train data: {}/{}'.format(i, len(train_addrs)))
    # read an image and resize to (224, 224)
    # cv2 loads images as BGR, convert it to RGB
    addr = train_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # add any image pre-processing here
    # if the data order is Theano, the axis order should change
    if data_order == 'th':
        img = np.rollaxis(img, 2)
    # save the image and calculate the mean so far
    train_storage.append(img[None])
    mean += img / float(len(train_labels))

# loop over validation addresses
for i in range(len(val_addrs)):
    # print how many images are saved every 1000 images
    if i % 1000 == 0 and i > 1:
        print('Validation data: {}/{}'.format(i, len(val_addrs)))
    # read an image and resize to (224, 224)
    # cv2 loads images as BGR, convert it to RGB
    addr = val_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # add any image pre-processing here
    # if the data order is Theano, the axis order should change
    if data_order == 'th':
        img = np.rollaxis(img, 2)
    # save the image
    val_storage.append(img[None])

# loop over test addresses
for i in range(len(test_addrs)):
    # print how many images are saved every 1000 images
    if i % 1000 == 0 and i > 1:
        print('Test data: {}/{}'.format(i, len(test_addrs)))
    # read an image and resize to (224, 224)
    # cv2 loads images as BGR, convert it to RGB
    addr = test_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # add any image pre-processing here
    # if the data order is Theano, the axis order should change
    if data_order == 'th':
        img = np.rollaxis(img, 2)
    # save the image
    test_storage.append(img[None])

# save the mean and close the hdf5 file
mean_storage.append(mean[None])
hdf5_file.close()
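With h5py there is no append; each image is written into its row of the pre-allocated dataset by index. Here is a minimal sketch of the train loop under the assumptions above; the validation and test loops are analogous, and the mean goes into row 0 of train_mean.

import cv2
import numpy as np

mean = np.zeros((224, 224, 3), np.float32)
for i in range(len(train_addrs)):
    if i % 1000 == 0 and i > 1:
        print('Train data: {}/{}'.format(i, len(train_addrs)))
    # read an image, resize to (224, 224), and convert BGR to RGB
    img = cv2.imread(train_addrs[i])
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # write the image into row i of the pre-allocated dataset
    hdf5_file['train_img'][i, ...] = img
    mean += img / float(len(train_addrs))

# save the mean and close the file
hdf5_file['train_mean'][0, ...] = mean
hdf5_file.close()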

Read the HDF5 file

It's time to check whether the data has been saved properly in the HDF5 file. To do so, we load the data in batches of an arbitrary size and plot the first image of the first five batches. We also check the label of each image. We define a variable, subtract_mean, which indicates whether we want to subtract the mean of the training set before showing the image. In tables we access each array by calling its name after its data group (like hdf5_file.group.arrayname), and you can index it like a numpy array. In h5py, however, we access an array by its name, like a dictionary key (hdf5_file["arrayname"]). In either case, you have access to the shape of the array through .shape, like a numpy array.

Open the HDF5 file for reading (tables)
import tables
import numpy as np

hdf5_path = 'Cat vs Dog/dataset.hdf5'
subtract_mean = False
batch_size = 50  # an arbitrary batch size
nb_class = 2  # number of classes (cat, dog)

# open the hdf5 file
hdf5_file = tables.open_file(hdf5_path, mode='r')

# subtract the training mean
if subtract_mean:
    mm = hdf5_file.root.train_mean[0]
    mm = mm[np.newaxis, ...]

# Total number of samples
data_num = hdf5_file.root.train_img.shape[0]
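The h5py counterpart uses dictionary-style access instead of hdf5_file.root. A minimal sketch, assuming the dataset names used earlier:

import h5py
import numpy as np

hdf5_path = 'Cat vs Dog/dataset.hdf5'
subtract_mean = False

# open the hdf5 file
hdf5_file = h5py.File(hdf5_path, mode='r')

# subtract the training mean
if subtract_mean:
    mm = hdf5_file['train_mean'][0, ...]
    mm = mm[np.newaxis, ...]

# total number of samples
data_num = hdf5_file['train_img'].shape[0]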

Next, we create a list of batch indices and shuffle it. Then we loop over the batches and read all the images in each batch at once.

Read the images batch by batch (tables)
from random import shuffle
from math import ceil
import matplotlib.pyplot as plt

# create a list of batches to shuffle the data
batches_list = list(range(int(ceil(float(data_num) / batch_size))))
shuffle(batches_list)

# loop over batches
for n, i in enumerate(batches_list):
    i_s = i * batch_size  # index of the first image in this batch
    i_e = min([(i + 1) * batch_size, data_num])  # index of the last image in this batch

    # read batch images and remove the training mean
    images = hdf5_file.root.train_img[i_s:i_e]
    if subtract_mean:
        images = images - mm  # not in-place: the images are uint8, the mean is float

    # read labels and convert to one-hot encoding
    labels = hdf5_file.root.train_labels[i_s:i_e]
    labels_one_hot = np.zeros((i_e - i_s, nb_class))  # the last batch can be smaller than batch_size
    labels_one_hot[np.arange(i_e - i_s), labels] = 1

    print('{}/{}'.format(n + 1, len(batches_list)))
    print(labels[0], labels_one_hot[0, :])
    plt.imshow(images[0])
    plt.show()

    if n == 4:  # break after 5 batches
        break

hdf5_file.close()
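In the h5py version, only the array access inside the loop changes. A minimal sketch, reusing data_num, batch_size, and the h5py hdf5_file from the sketch above:

from random import shuffle
from math import ceil

batches_list = list(range(int(ceil(float(data_num) / batch_size))))
shuffle(batches_list)

for n, i in enumerate(batches_list):
    i_s = i * batch_size
    i_e = min([(i + 1) * batch_size, data_num])
    # dictionary-style access replaces hdf5_file.root
    images = hdf5_file['train_img'][i_s:i_e, ...]
    labels = hdf5_file['train_labels'][i_s:i_e]
    print('{}/{}: first label in batch = {}'.format(n + 1, len(batches_list), labels[0]))
    if n == 4:  # stop after 5 batches
        break

hdf5_file.close()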

You can download the code from this post on our GitHub page.


Source: http://machinelearninguru.com/deep_learning/data_preparation/hdf5/hdf5.html
