h5py vs npy/npz


Data engineering for computer vision

Lately, I've been thinking hard about the best way to organize my data before feeding it to a machine learning classifier or regressor. I have a few guiding principles in mind:

  • Keep the number of data files to a minimum
  • Ease common operations (train/test split, class selection)
  • Make sure loading the data is not a bottleneck

I first started with NumPy's savez function, which lets you store several arrays in a single archive under named keys. A typical use looks like this:

import numpy as np

# Assume you already have train/test arrays and labels
# (random placeholders here so the snippet runs standalone):
arr_train = np.random.rand(100, 10)
label_train = np.random.randint(0, 2, 100)
arr_test = np.random.rand(20, 10)
label_test = np.random.randint(0, 2, 20)

# Gather everything in a dict and save it as a single .npz archive
d = {"arr_train": arr_train,
     "arr_test": arr_test,
     "label_train": label_train,
     "label_test": label_test}

np.savez("data.npz", **d)

Which works just fine.
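Loading the archive back mirrors the save (a minimal sketch, assuming the data.npz file written above):

import numpy as np

# np.load on an .npz returns a lazy NpzFile; arrays are read on key access
data = np.load("data.npz")
arr_train = data["arr_train"]
label_train = data["label_train"]
print(sorted(data.keys()))  # ['arr_test', 'arr_train', 'label_test', 'label_train']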

However, two limitations quickly became apparent:

  • When you need to store a lot of metadata, it quickly becomes a pain to organize.
  • Loading the data back from disk is actually quite slow (see the snippet below).

import numpy as np
import time

# Build a large random array and save it in both formats
arr = np.random.randint(0, 1000, (25000, 5000))
np.savez("arr.npz", **{"arr": arr})
np.save("arr.npy", arr)

# Time 20 full loads of the .npy file
ltime = []
for i in range(20):
    start = time.time()
    arr = np.load("arr.npy")[:, :]
    ltime.append(time.time() - start)

print("npy time:", np.mean(ltime))

# Time 20 full loads of the .npz archive
ltime = []
for i in range(20):
    start = time.time()
    arr = np.load("arr.npz")["arr"][:, :]
    ltime.append(time.time() - start)

print("npz time:", np.mean(ltime))

which gave me the following times:

npy time: 0.483348703384
npz time: 1.47687283754

That's quite a difference! (Part of the gap is inherent to the format: an .npz file is a zip archive, so NumPy has to go through the zip layer before it can read the raw array bytes.)

Clearly, another approach is needed. So far, I have settled on the excellent and simple h5py module, which stores the data in HDF5 format while staying very transparent to NumPy.

Here's how it goes:

import h5py
import numpy as np

arr = np.random.randint(0, 1000, (25000, 5000))

# Write the array to a single dataset inside an HDF5 file
with h5py.File("arr.h5", "w") as hf:
    hf.create_dataset("arr", data=arr)

And that's it!
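A nice property in passing: h5py datasets accept NumPy-style slicing directly, so you can pull out just the rows or columns you need without reading the whole array (a small sketch, assuming the arr.h5 file written above):

import h5py

with h5py.File("arr.h5", "r") as hf:
    # Only the requested slice is read from disk, not the full array
    first_rows = hf["arr"][:100]       # first 100 rows
    some_cols = hf["arr"][:, :10]      # first 10 columns of every row

This is what makes operations like a train/test split or per-class selection cheap on large arrays.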

You can also easily add metadata to each of your datasets:

import h5py
import numpy as np

arr = np.random.randint(0, 1000, (25000, 5000))

with h5py.File("arr.h5", "w") as hf:
    dset = hf.create_dataset("arr", data=arr)
    # Attach arbitrary metadata as HDF5 attributes on the dataset
    dset.attrs['author'] = "pony"
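
Reading the dataset and its metadata back is just as direct (again a minimal sketch against the arr.h5 file from above):

import h5py

with h5py.File("arr.h5", "r") as hf:
    arr = hf["arr"][:]                  # load the full dataset into memory
    author = hf["arr"].attrs['author']  # read the attribute back

print("author:", author)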

My only gripe with the module was an ill-fated attempt at writing a file in parallel from several sources: you need to rebuild h5py to support parallelism (my Anaconda distribution did not ship with it), and that takes you to a world of pain with conflicts between Anaconda's own HDF5 library and the new parallel one you build. The only workaround I found involved reinstalling h5py outside of Anaconda, but it messed with my MPI setup.
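For reference, the kind of parallel write I was attempting looks roughly like this; it only works with an MPI-enabled h5py build, and the file name, dataset name, and shape here are just placeholders:

from mpi4py import MPI
import h5py

# Requires h5py compiled against parallel HDF5 (driver='mpio')
comm = MPI.COMM_WORLD
with h5py.File("parallel.h5", "w", driver="mpio", comm=comm) as hf:
    dset = hf.create_dataset("test", (comm.size,), dtype="i")
    # Each MPI rank writes its own entry of the shared dataset
    dset[comm.rank] = comm.rank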

Anyway, let's test the speed of this new design:

import h5py
import numpy as np
import time

arr = np.random.randint(0, 1000, (25000, 5000))

with h5py.File("arr.h5", "w") as hf:
    dset = hf.create_dataset("arr", data=arr)
    dset.attrs['author'] = "pony"

# Time 20 full reads of the HDF5 dataset (file open included)
ltime = []
for i in range(20):
    start = time.time()
    with h5py.File("arr.h5", "r") as hf:
        arr = hf["arr"][:, :]
        ltime.append(time.time() - start)

print("hdf5 time:", np.mean(ltime))

This gave me:

hdf5 time: 0.386118304729

Which is even faster than the .npy version!

Later on, I'll try to give more details on my data pipeline.
