Working with HDF5 Files in Python: The h5py Library

Latest recommended article published on 2025-01-08 23:26:51

一只干巴巴的海绵

Article link: https://blog.csdn.net/Hanx09/article/details/107511115

h5py official documentation

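Introduction

HDF (Hierarchical Data Format) is a file format, together with its supporting libraries, designed for storing and processing large volumes of scientific data. It was originally developed at the National Center for Supercomputing Applications (NCSA) in the United States and is now maintained and developed by the non-profit HDF Group. HDF is supported on many commercial and non-commercial software platforms, including MATLAB, Java, Python, R and Julia, and Spark support is now available as well. There are two versions, HDF4 and HDF5, of which HDF5 is the one in common use today. Python has a number of tools for working with HDF5 data; the most widely used are h5py and PyTables.

An HDF5 file is a container for two kinds of data objects, datasets and groups, and working with it resembles standard Python file handling. The File object is itself a group, named /, and serves as the entry point for traversing the file.

dataset: comparable to a NumPy array. Every dataset has a name, a shape and a type (dtype), and supports slicing;
group: comparable to a dictionary, and acts as a folder-like container. A group can hold datasets or other groups; the keys are the names of its members, and the values are the member objects themselves (groups or datasets).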
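
The two object types described above can be sketched as follows. This is a minimal illustration, not code from the original article: the temporary file location and the names `measurements` and `temperature` are made up for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Illustrative file location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "example.h5")

with h5py.File(path, "w") as f:
    # The File object is the root group, named "/".
    root_name = f.name

    # A group behaves like a dictionary whose keys are member names.
    grp = f.create_group("measurements")
    dset = grp.create_dataset("temperature", data=np.arange(10, dtype="f4"))
    member_names = list(grp.keys())

    # A dataset has a name, a shape and a dtype, and supports slicing.
    full_name = dset.name
    shape, dtype = dset.shape, dset.dtype
    first_three = dset[:3]

print(root_name, member_names, full_name, shape, dtype, first_three)
```

Slicing a dataset returns a NumPy array, so `first_three` remains usable after the file has been closed.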

Basic h5py operations

Installing h5py

conda install h5py

Creating an HDF5 file with h5py

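import h5py

# Option 1: open the file explicitly and close it yourself when done
f = h5py.File("myh5py1.h5", "w")
f.close()

# Option 2: use a context manager, which closes the file automatically
with h5py.File("myh5py2.h5", "w") as f:
    pass  # create datasets and groups here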

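Parameters:

First argument: the file name, as either a byte string or a unicode string;
Second argument: mode

mode	Description
r	Read-only; the file must exist (the default since h5py 3.0)
r+	Read/write; the file must exist
w	Create the file; an existing file is overwritten
w- / x	Create the file; fail if it already exists
a	Read/write if the file exists, otherwise create a new file (the default before h5py 3.0)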
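
The behaviour of the modes in the table can be checked with a short sketch such as the one below. The file path and the dataset names `x` and `y` are illustrative assumptions. The code catches `OSError` because the more specific exception class raised by h5py varies between versions.

```python
import os
import tempfile

import h5py

path = os.path.join(tempfile.mkdtemp(), "modes.h5")  # illustrative path

# "r" requires the file to exist, so this attempt fails.
try:
    h5py.File(path, "r")
    read_missing_failed = False
except OSError:
    read_missing_failed = True

# "w" creates the file (and would truncate an existing one).
with h5py.File(path, "w") as f:
    f.create_dataset("x", data=[1, 2, 3])

# "x" (same as "w-") refuses to overwrite an existing file.
try:
    h5py.File(path, "x")
    exclusive_create_failed = False
except OSError:
    exclusive_create_failed = True

# "a" opens the existing file for read/write and keeps its contents.
with h5py.File(path, "a") as f:
    f.create_dataset("y", data=[4, 5])
    names_after_append = sorted(f.keys())

# Opening with "w" again discards the old contents.
with h5py.File(path, "w") as f:
    names_after_overwrite = list(f.keys())

print(read_missing_failed, exclusive_create_failed)
print(names_after_append, names_after_overwrite)
```

Passing the mode explicitly, as done here, avoids depending on the default, which differs between h5py versions.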

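Creating a dataset

Function: h5py.File.create_dataset(self, name, shape=None, dtype=None, data=None, **kwds)

Method 1: create an empty dataset, then assign values

To create an empty dataset, only the name and shape are required. The dtype defaults to np.float32 and the default fill value is 0; the fill value can be changed with the keyword argument fillvalue.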

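import h5py
import numpy as np
f1 = h5py.File("myh5py1.h5", "w")
d = f1.create_dataset("dset1", (100,), 'i')  # 'i' is the NumPy type code for a C int
d[...] = np.arange(100)  # fill the whole dataset
f1.close()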
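
The defaults mentioned above (float32 dtype, fill value 0, and the fillvalue keyword) can be observed directly. This is a small sketch under assumed names: the file path and the dataset names `default` and `filled` are not from the original.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "fill.h5")  # illustrative path

with h5py.File(path, "w") as f:
    # Only name and shape are given: dtype falls back to float32, contents read as 0.
    d_default = f.create_dataset("default", (4,))
    # An explicit dtype and a custom fill value.
    d_filled = f.create_dataset("filled", (4,), dtype="i4", fillvalue=-1)

    default_dtype = d_default.dtype
    default_values = d_default[...]
    filled_values = d_filled[...]

print(default_dtype, default_values, filled_values)
```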

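Method 2: create a dataset directly from data

To create a non-empty dataset, only the name and the data itself are required. The shape and dtype are taken from data automatically, although the storage type can also be specified explicitly to save space (single-precision floats take half the space of double-precision floats).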

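import h5py
import numpy as np

f2 = h5py.File("myh5py2.h5", "w")
# 1. Assign an array to a name; h5py creates the dataset implicitly
f2["dset2"] = np.arange(100)
# 2. Call create_dataset with data= (under a different name, since "dset2" now exists)
arr = np.arange(100)
dset = f2.create_dataset('dset3', data=arr)

for key in f2.keys():
    print(f2[key].name)
    # Dataset.value was removed in h5py 3.0; index with [()] to read all the data
    print(f2[key][()])
f2.close()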
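
The remark about choosing the storage type explicitly can be illustrated with a short sketch. The file path and dataset names are assumptions made for the example, and the byte counts are computed from the element count and item size rather than measured on disk.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "dtype.h5")  # illustrative path
data = np.linspace(0.0, 1.0, 1000)  # NumPy creates float64 by default

with h5py.File(path, "w") as f:
    # dtype is inferred from the data: float64
    d64 = f.create_dataset("as_float64", data=data)
    # dtype given explicitly: values are converted to float32 when stored
    d32 = f.create_dataset("as_float32", data=data, dtype="f4")

    bytes64 = d64.size * d64.dtype.itemsize
    bytes32 = d32.size * d32.dtype.itemsize

print(bytes64, bytes32)
```

The saving comes at the cost of precision, so it only makes sense when float32 accuracy is sufficient for the data.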

Deep Learning: A 10-Minute Introduction to h5py

Creating a group

import h5py
import numpy as np

f = h5py.File("myh5py.h5", "w")

# Create the groups bar1 and bar2 and the dataset dset
g1 = f.create_group("bar1")
g2 = f.create_group("bar2")
d = f.create_dataset("dset", data=np.arange(10))

# Inside bar1, create a group car1 and a dataset dset1
c1 = g1.create_group("car1")
d1 = g1.create_dataset("dset1", data=np.arange(10))

# Inside bar2, create a group car2 and a dataset dset2
c2 = g2.create_group("car2")
d2 = g2.create_dataset("dset2", data=np.arange(10))

# Groups and datasets directly under the root
print(".............")
for key in f.keys():
    print(f[key].name)

# Groups and datasets under bar1
print(".............")
for key in g1.keys():
    print(g1[key].name)

# Groups and datasets under bar2
print(".............")
for key in g2.keys():
    print(g2[key].name)

# See what car1 and car2 contain (both are empty)
print(".............")
print(c1.keys())
print(c2.keys())

f.close()
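
Instead of looping over each group by hand, the whole hierarchy can be walked in one call. The sketch below rebuilds the same bar1/bar2 layout in a temporary file (the path is an assumption) and uses `visit`, which calls a function with the name of every object below the starting group, as well as path-style access with `/` separators.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "groups.h5")  # illustrative path

with h5py.File(path, "w") as f:
    g1 = f.create_group("bar1")
    g2 = f.create_group("bar2")
    f.create_dataset("dset", data=np.arange(10))
    g1.create_group("car1")
    g1.create_dataset("dset1", data=np.arange(10))
    g2.create_group("car2")
    g2.create_dataset("dset2", data=np.arange(10))

    # visit() passes each member's path (relative to f) to the callable.
    names = []
    f.visit(names.append)

    # Nested members can be addressed with a path, like files in folders.
    nested_name = f["bar1/dset1"].name
    has_car2 = "bar2/car2" in f
    group_names = sorted(n for n in names if isinstance(f[n], h5py.Group))

print(sorted(names))
print(nested_name, has_car2, group_names)
```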

Reading and writing HDF5 files in batches

# -*- coding: utf-8 -*-
import h5py
import numpy as np


# Write in batches: create a resizable dataset on the first call, append afterwards
def save_h5(h5f, data, target):
    shape_list = list(data.shape)
    if target not in h5f:
        shape_list[0] = None  # no size limit along the first axis
        h5f.create_dataset(target, data=data, maxshape=tuple(shape_list), chunks=True)
        return
    dataset = h5f[target]
    len_old = dataset.shape[0]
    len_new = len_old + data.shape[0]
    shape_list[0] = len_new
    dataset.resize(tuple(shape_list))
    dataset[len_old:len_new] = data


# Read in batches: return up to `length` entries starting at `start`
def getDataFromH5py(fileName, target, start, length):
    with h5py.File(fileName, 'r') as h5f:
        if target not in h5f:
            res = []
        else:
            # Dataset.value was removed in h5py 3.0. Slicing the dataset directly
            # also reads only the requested entries instead of the whole dataset.
            stop = min(start + length, h5f[target].shape[0])
            res = h5f[target][start:stop]
    return res


# Usage
file_name = './data.h5'
with h5py.File(file_name, 'w') as h5f:
    features = np.arange(100)
    for _ in range(5):  # append the same 100 values five times
        save_h5(h5f, data=np.array(features), target='mnist_features')
for i in range(10):
    d = getDataFromH5py('./data.h5', 'mnist_features', i * 5, 5)  # read 5 values per batch
    print(d)

Output:

[0 1 2 3 4]
[5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]
[20 21 22 23 24]
[25 26 27 28 29]
[30 31 32 33 34]
[35 36 37 38 39]
[40 41 42 43 44]
[45 46 47 48 49]
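
The reader above reopens the file for every batch. One possible alternative, sketched here under assumed names (`iter_batches`, the dataset name `values`, and the temporary path are hypothetical, not from the original), is a generator that opens the file once and yields consecutive slices.

```python
import os
import tempfile

import h5py
import numpy as np


def iter_batches(file_name, target, batch_size):
    """Yield consecutive slices of a dataset along its first axis."""
    with h5py.File(file_name, "r") as h5f:
        dataset = h5f[target]
        total = dataset.shape[0]
        for start in range(0, total, batch_size):
            stop = min(start + batch_size, total)
            # Only the requested slice is read from disk.
            yield dataset[start:stop]


path = os.path.join(tempfile.mkdtemp(), "batches.h5")  # illustrative path
with h5py.File(path, "w") as h5f:
    h5f.create_dataset("values", data=np.arange(12))

batches = [batch.tolist() for batch in iter_batches(path, "values", 5)]
print(batches)  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]
```

The file stays open only while the generator is being consumed, and the last batch is simply shorter when the total is not a multiple of the batch size.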

As a special case, when the data written in batches are images:

# Write image data and the matching labels to one HDF5 file, batch by batch.
# Not shown in the original and assumed to be defined beforehand:
#   path  - prefix of the image files, named <path>0.jpg, <path>1.jpg, ...
#   label - array of shape (img_total, LABEL_DIMS) holding the labels
import cv2
import h5py
import numpy as np
from keras.preprocessing.image import img_to_array  # Keras 2.x import path (assumed)

IMAGE_DIMS = (80, 60, 3)  # image dimensions (height, width, channels)
LABEL_DIMS = 3  # label dimension
img_total = 40442  # total number of images
img_batch = 100  # number of images per batch
img_num = img_total // img_batch + 1  # number of batches
img_res = img_total - img_batch * (img_num - 1)  # number of images in the last batch

for img_i in range(img_num):  # img_i: index of the current batch
    if img_i == 0:
        h5f = h5py.File("./dataset/train_data.h5", "w")  # build File object
        x = h5f.create_dataset("x_train", (img_batch, IMAGE_DIMS[0], IMAGE_DIMS[1], IMAGE_DIMS[2]),
                               maxshape=(None, IMAGE_DIMS[0], IMAGE_DIMS[1], IMAGE_DIMS[2]),
                               dtype=np.float32)  # build x_train dataset
        y = h5f.create_dataset("y_train", (img_batch, LABEL_DIMS), maxshape=(None, LABEL_DIMS),
                               dtype=np.int32)  # build y_train dataset
    else:
        h5f = h5py.File("./dataset/train_data.h5", "a")  # append mode
        x = h5f["x_train"]
        y = h5f["y_train"]

    ytem = label[img_i * img_batch:(img_i + 1) * img_batch]
    image = []
    for i in range(img_i * img_batch, (img_i + 1) * img_batch):
        if i >= img_total:
            break
        img = cv2.imread(path + str(i) + '.jpg')
        img = cv2.resize(img, (IMAGE_DIMS[1], IMAGE_DIMS[0]))  # cv2.resize takes (width, height)
        img = img_to_array(img)
        image.append(img)
    image = np.array(image, dtype="float") / 255.0

    # Every batch holds img_batch images except the last one, which holds img_res
    n = img_batch if img_i != img_num - 1 else img_res
    start = img_i * img_batch
    x.resize([start + n, IMAGE_DIMS[0], IMAGE_DIMS[1], IMAGE_DIMS[2]])
    y.resize([start + n, LABEL_DIMS])

    x[start:start + n] = image  # write the batch to the datasets
    y[start:start + n] = ytem

    print('batch {} has been processed'.format(img_i))
    h5f.close()  # close the file before the next iteration reopens it
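
Because the image script depends on image files, labels and OpenCV, it cannot be run as is. The following self-contained sketch shows the same append-by-resize pattern with random arrays standing in for images. The reduced sizes, the random data and the temporary path are assumptions made so that the example runs quickly; the dataset names follow the script above.

```python
import os
import tempfile

import h5py
import numpy as np

IMAGE_DIMS = (8, 6, 3)  # small stand-in for (80, 60, 3)
LABEL_DIMS = 3
img_total = 25          # stand-in for the real image count
img_batch = 10

rng = np.random.default_rng(0)
path = os.path.join(tempfile.mkdtemp(), "train_data.h5")  # illustrative path

with h5py.File(path, "w") as h5f:
    # Start empty; maxshape=None on axis 0 lets the datasets grow.
    x = h5f.create_dataset("x_train", shape=(0,) + IMAGE_DIMS,
                           maxshape=(None,) + IMAGE_DIMS, dtype=np.float32)
    y = h5f.create_dataset("y_train", shape=(0, LABEL_DIMS),
                           maxshape=(None, LABEL_DIMS), dtype=np.int32)

    for start in range(0, img_total, img_batch):
        stop = min(start + img_batch, img_total)  # the last batch may be shorter
        images = rng.random((stop - start,) + IMAGE_DIMS, dtype=np.float32)
        labels = rng.integers(0, 2, size=(stop - start, LABEL_DIMS), dtype=np.int32)

        # Grow along the first axis, then write the new rows.
        x.resize(stop, axis=0)
        y.resize(stop, axis=0)
        x[start:stop] = images
        y[start:stop] = labels

with h5py.File(path, "r") as h5f:
    x_shape = h5f["x_train"].shape
    y_shape = h5f["y_train"].shape
    x_max = float(h5f["x_train"][...].max())

print(x_shape, y_shape)
```

Opening the file once in a `with` block avoids reopening it for every batch and guarantees it is closed even if a batch fails.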