Displaying HDF5 files in HTML: converting images to an HDF5 file (loading and saving)

Latest recommended article published on 2023-06-06 22:07:03

会计星球


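This article explains how to manage and load large amounts of image data efficiently in deep learning, in particular for datasets at the scale of ImageNet. Using the HDF5 file format speeds up data reading and simplifies the preprocessing pipeline. The article shows in detail how to save images to an HDF5 file and how to load them from the file in batches. It also discusses how to organize data, including images and labels, inside an HDF5 file, and gives example code using the h5py and tables libraries. Finally, it demonstrates how to read and display data from an HDF5 file.

Summary generated automatically by CSDN.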

Translated from http://machinelearninguru.com/deep_learning/data_preparation/hdf5/hdf5.html

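When we talk about deep learning, the first thing that usually comes to mind is a huge amount of data or a large number of images (for example, the millions of images in ImageNet). In that situation it is neither wise nor efficient to load every image separately from the hard disk, apply image preprocessing, and then pass it to the network for training, validation, or testing. Although preprocessing takes time, reading many individual image files from disk takes far longer than keeping them all in a single file and reading them as one group of data. Fortunately, there are data models and libraries for exactly this purpose, such as HDF5 and TFRecord. In this post we will learn how to save a large number of images in a single HDF5 file and then load them from the file in batches. The size of the data does not matter: it can be larger or smaller than the available memory. HDF5 provides tools to manage, manipulate, view, compress, and save data. We will cover the same topic using TFRecord in the next post.

In this post we load all the images in the train folder of the well-known Dogs vs. Cats dataset, resize them, and save them. To follow the rest of the post you need to download the training part of the Dogs vs. Cats dataset.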

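List images and their labels

First, we need to list all the images and label them. We give each cat image the label 0 and each dog image the label 1. The code below lists all the images, assigns them the proper labels, and then shuffles the data. We also split the dataset into a training part (60%), a validation part (20%), and a test part (20%).

List images and label them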

from random import shuffle
import glob

shuffle_data = True  # shuffle the addresses before saving
hdf5_path = 'Cat vs Dog/dataset.hdf5'  # address to where you want to save the hdf5 file
cat_dog_train_path = 'Cat vs Dog/train/*.jpg'

# read addresses and labels from the 'train' folder
addrs = glob.glob(cat_dog_train_path)
labels = [0 if 'cat' in addr else 1 for addr in addrs]  # 0 = Cat, 1 = Dog

# to shuffle data
if shuffle_data:
    c = list(zip(addrs, labels))
    shuffle(c)
    addrs, labels = zip(*c)

# Divide the data into 60% train, 20% validation, and 20% test
train_addrs = addrs[0:int(0.6*len(addrs))]
train_labels = labels[0:int(0.6*len(labels))]

val_addrs = addrs[int(0.6*len(addrs)):int(0.8*len(addrs))]
val_labels = labels[int(0.6*len(addrs)):int(0.8*len(addrs))]

test_addrs = addrs[int(0.8*len(addrs)):]
test_labels = labels[int(0.8*len(labels)):]
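The labelling and 60/20/20 split can also be wrapped in a small helper. The sketch below assumes that file names follow the `cat.N.jpg` / `dog.N.jpg` pattern of the Dogs vs. Cats training set. The names `label_for` and `split_dataset` are hypothetical and not part of the original post. Taking the label from the file name rather than from the full path avoids mislabelling when a directory name happens to contain "cat".

```python
import os
import random


def label_for(path):
    """Return 0 for a cat image and 1 for a dog image, based on the file name only."""
    name = os.path.basename(path).lower()
    return 0 if name.startswith('cat') else 1


def split_dataset(paths, train_frac=0.6, val_frac=0.2, seed=None):
    """Shuffle paths and split them into (train, val, test) lists of (path, label)."""
    pairs = [(p, label_for(p)) for p in paths]
    random.Random(seed).shuffle(pairs)
    n_train = int(train_frac * len(pairs))
    n_val = n_train + int(val_frac * len(pairs))
    return pairs[:n_train], pairs[n_train:n_val], pairs[n_val:]


# Hypothetical file names, used only to illustrate the split sizes.
demo_paths = (['Cat vs Dog/train/cat.%d.jpg' % i for i in range(5)] +
              ['Cat vs Dog/train/dog.%d.jpg' % i for i in range(5)])
train, val, test = split_dataset(demo_paths, seed=0)
print(len(train), len(val), len(test))
```

Passing a fixed `seed` makes the split reproducible, which helps when the HDF5 file has to be regenerated later.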

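Create an HDF5 file

Two main libraries let you work with the HDF5 format: h5py and tables (PyTables). Below we explain how to work with both. The first step is to create an HDF5 file. To store the images we define one array for each of the training, validation, and test sets, shaped either in TensorFlow order (number of data, image_height, image_width, image_depth) or in Theano order (number of data, image_depth, image_height, image_width). For the labels we also need one array for each of the training, validation, and test sets, of shape (number of data). Finally, we compute the pixel-wise mean of the training set and save it as an array of shape (1, image_height, image_width, image_depth). Note that you should always specify the data type (dtype) when you create an array.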
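To make the two layouts concrete, here is a small NumPy sketch that converts one image from TensorFlow order to Theano order and adds the leading "number of data" axis. The random image is placeholder data and does not come from the dataset.

```python
import numpy as np

# Placeholder image in TensorFlow order: (image_height, image_width, image_depth).
img_tf = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

# Theano order puts the channel axis first: (image_depth, image_height, image_width).
img_th = np.rollaxis(img_tf, 2)

# img[None] adds a leading axis of length 1, giving the per-sample shape
# that fits an array of shape (number of data, ...).
batch_tf = img_tf[None]
batch_th = img_th[None]

print(img_tf.shape, img_th.shape)      # (224, 224, 3) (3, 224, 224)
print(batch_tf.shape, batch_th.shape)  # (1, 224, 224, 3) (1, 3, 224, 224)
```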

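tables: In tables we can use create_earray to create an empty array (number of data = 0) and append data to it later. For the labels it is more convenient to use create_array, because it lets us write the labels at the moment the array is created. To set the dtype of an array, use a tables atom, such as tables.UInt8Atom() for uint8.

The first argument of create_earray and create_array is the data group (we create the arrays in the root group), which lets you manage your data by creating different data groups. You can think of groups as folders inside the HDF5 file.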
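As an illustration of groups, the following sketch stores a tiny extendable image array and its labels under a `train` group instead of the root group. The file location, the group name, and the 8x8 image size are assumptions made only for this example; the post itself keeps everything in the root group.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical file location and group name, chosen only for this illustration.
demo_path = os.path.join(tempfile.mkdtemp(), 'groups_demo.hdf5')

with tables.open_file(demo_path, mode='w') as f:
    # A group behaves like a folder inside the HDF5 file.
    train_group = f.create_group(f.root, 'train')
    # An extendable array starts with 0 rows along the first axis.
    images = f.create_earray(train_group, 'images', tables.UInt8Atom(),
                             shape=(0, 8, 8, 3))
    images.append(np.zeros((2, 8, 8, 3), dtype=np.uint8))
    # create_array stores the labels at creation time.
    f.create_array(train_group, 'labels', [0, 1])

with tables.open_file(demo_path, mode='r') as f:
    # Arrays are reached through their group, like a path.
    stored_shape = tuple(f.root.train.images.shape)
    stored_labels = list(f.root.train.labels[:])

print(stored_shape, stored_labels)
```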

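h5py: In h5py we create an array with create_dataset. Note that the exact size of the array must be specified when it is defined. We can also use create_dataset for the labels and write the labels into it immediately. You can set the dtype of an array directly with a numpy dtype.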
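A minimal h5py counterpart might look like the sketch below. The dataset names mirror the ones used with tables in this post, while the file location, the number of images, and the 8x8 image size are assumptions made to keep the example small.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical sizes and file location, chosen only for this illustration.
n_train = 4
train_shape = (n_train, 8, 8, 3)
train_labels = [0, 1, 1, 0]
demo_path = os.path.join(tempfile.mkdtemp(), 'h5py_demo.hdf5')

with h5py.File(demo_path, mode='w') as f:
    # The full shape must be given up front; the dtype is a plain NumPy dtype.
    f.create_dataset('train_img', train_shape, np.uint8)
    f.create_dataset('train_mean', train_shape[1:], np.float32)
    # For labels, create the dataset and fill it immediately.
    f.create_dataset('train_labels', (n_train,), np.int8)
    f['train_labels'][...] = train_labels
    # Images are written by index rather than appended.
    f['train_img'][0, ...] = np.full(train_shape[1:], 255, dtype=np.uint8)

with h5py.File(demo_path, mode='r') as f:
    labels_back = f['train_labels'][:].tolist()
    first_pixel = int(f['train_img'][0, 0, 0, 0])

print(labels_back, first_pixel)
```

Because the shape is fixed at creation time, the number of images has to be known before the file is written, which is the main practical difference from the extendable arrays in tables.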

Example using tables

import numpy as np
import tables

data_order = 'tf'  # 'th' for Theano, 'tf' for Tensorflow
img_dtype = tables.UInt8Atom()  # dtype in which the images will be saved

# check the order of data and choose proper data shape to save images
if data_order == 'th':
    data_shape = (0, 3, 224, 224)
elif data_order == 'tf':
    data_shape = (0, 224, 224, 3)

# open a hdf5 file and create earrays
hdf5_file = tables.open_file(hdf5_path, mode='w')
train_storage = hdf5_file.create_earray(hdf5_file.root, 'train_img',
                                        img_dtype, shape=data_shape)
val_storage = hdf5_file.create_earray(hdf5_file.root, 'val_img',
                                      img_dtype, shape=data_shape)
test_storage = hdf5_file.create_earray(hdf5_file.root, 'test_img',
                                       img_dtype, shape=data_shape)
# the mean is fractional, so it is stored as float32 here
# (a change from the uint8 atom used for the images, which would truncate it)
mean_storage = hdf5_file.create_earray(hdf5_file.root, 'train_mean',
                                       tables.Float32Atom(), shape=data_shape)

# create the label arrays and copy the labels data in them
hdf5_file.create_array(hdf5_file.root, 'train_labels', train_labels)
hdf5_file.create_array(hdf5_file.root, 'val_labels', val_labels)
hdf5_file.create_array(hdf5_file.root, 'test_labels', test_labels)

Now it is time to read the images one by one, apply preprocessing (only resizing in our code), and save them.

import cv2

# a numpy array to save the mean of the images
mean = np.zeros(data_shape[1:], np.float32)

# loop over train addresses
for i in range(len(train_addrs)):
    # print how many images are saved every 1000 images
    if i % 1000 == 0 and i > 1:
        print('Train data: {}/{}'.format(i, len(train_addrs)))

    # read an image and resize to (224, 224)
    # cv2 load images as BGR, convert it to RGB
    addr = train_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # add any image pre-processing here

    # if the data order is Theano, axis orders should change
    if data_order == 'th':
        img = np.rollaxis(img, 2)

    # save the image and calculate the mean so far
    train_storage.append(img[None])
    mean += img / float(len(train_labels))

# loop over validation addresses
for i in range(len(val_addrs)):
    # print how many images are saved every 1000 images
    if i % 1000 == 0 and i > 1:
        print('Validation data: {}/{}'.format(i, len(val_addrs)))

    # read an image and resize to (224, 224)
    # cv2 load images as BGR, convert it to RGB
    addr = val_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # add any image pre-processing here

    # if the data order is Theano, axis orders should change
    if data_order == 'th':
        img = np.rollaxis(img, 2)

    # save the image
    val_storage.append(img[None])

# loop over test addresses
for i in range(len(test_addrs)):
    # print how many images are saved every 1000 images
    if i % 1000 == 0 and i > 1:
        print('Test data: {}/{}'.format(i, len(test_addrs)))

    # read an image and resize to (224, 224)
    # cv2 load images as BGR, convert it to RGB
    addr = test_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # add any image pre-processing here

    # if the data order is Theano, axis orders should change
    if data_order == 'th':
        img = np.rollaxis(img, 2)

    # save the image
    test_storage.append(img[None])

# save the mean and close the hdf5 file
mean_storage.append(mean[None])
hdf5_file.close()
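The training loop accumulates `img / N` for every image instead of keeping all images in memory at once. The short check below, run on synthetic data rather than the real dataset, suggests that this running sum matches the mean computed in a single call.

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic stand-in for a handful of training images in 'tf' order.
images = rng.randint(0, 256, size=(10, 4, 4, 3)).astype(np.uint8)

# Accumulate in float32 so the uint8 pixel values neither overflow nor truncate.
running_mean = np.zeros(images.shape[1:], np.float32)
for img in images:
    running_mean += img / float(len(images))

# Reference: the mean over the sample axis computed in one call.
direct_mean = images.mean(axis=0)

means_match = np.allclose(running_mean, direct_mean, atol=1e-3)
print(means_match)
```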

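Read the HDF5 file

It is time to check whether the data was saved correctly in the HDF5 file. To do so, we load the data in batches of arbitrary size and plot the first image of the first 5 batches. We also check the label of each image. We define a variable subtract_mean, which indicates whether we want to subtract the training-set mean before displaying the images. In tables, we access each array by its name after its data group (like hdf5_file.group.arrayname), and you can index it like a numpy array. In h5py, we access an array by its name, used like a dictionary key (hdf5_file["arrayname"]). In either case, the shape of the array is available through .shape, just like a numpy array.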
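Loading in batches of arbitrary size comes down to computing index ranges over the first axis. The helper below is a sketch of that step; the name `batch_ranges` is hypothetical and not taken from the original post.

```python
from math import ceil


def batch_ranges(data_num, batch_size):
    """Return (start, end) index pairs that cover data_num samples in order."""
    n_batches = int(ceil(float(data_num) / batch_size))
    return [(i * batch_size, min((i + 1) * batch_size, data_num))
            for i in range(n_batches)]


# 25 samples in batches of 10: the last batch is shorter.
print(batch_ranges(25, 10))  # [(0, 10), (10, 20), (20, 25)]
```

Each pair can then slice the stored array, for example `hdf5_file.root.train_img[start:end]` in tables or `hdf5_file["train_img"][start:end]` in h5py, so only one batch is read from disk at a time.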

import tables
import numpy as np

hdf5_path = 'Cat vs Dog/dataset.hdf5'
subtract_mean = False

# open the hdf5 file
hdf5_file = tables.open_file(hdf5_path, mode='r')

# subtract the training mean
if subtract_mean:
    mm = hdf5_file.root.train_mean[0]
    mm = mm[np.newaxis, ...]

# Total number of samples
data_num = hdf5_file.root.train_img.shape[0]