h5py vs npy/npz


Data engineering for computer vision

Lately, I've been thinking hard about the best way to organize my data before feeding it to a machine learning classifier or regressor. I have a few guiding principles in mind:

  • Keep the number of data files to a minimum
  • Ease common operations (train/test split, class selection)
  • Make sure loading the data is not a bottleneck

I first started with NumPy's savez function, which lets you store several arrays in a single archive under named keys. A typical use looks like this:

import numpy as np

# Assume you already have train/test arrays and labels
# (random placeholders here so the snippet runs standalone):
arr_train = np.random.rand(100, 10)
label_train = np.random.randint(0, 2, 100)
arr_test = np.random.rand(20, 10)
label_test = np.random.randint(0, 2, 20)

# Gather everything in a dict and save it as a single .npz archive
d = {"arr_train": arr_train,
     "arr_test": arr_test,
     "label_train": label_train,
     "label_test": label_test}

np.savez("data.npz", **d)

Which works just fine.
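Loading the archive back mirrors the save (a minimal sketch, assuming the data.npz file written above):

import numpy as np

# np.load on an .npz returns a lazy NpzFile; arrays are read on key access
data = np.load("data.npz")
arr_train = data["arr_train"]
label_train = data["label_train"]
print(sorted(data.keys()))  # ['arr_test', 'arr_train', 'label_test', 'label_train']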

However, two limitations quickly became apparent:

  • When you need to store a lot of metadata, it quickly becomes a pain to organize.
  • Loading the data back from disk is actually quite slow (see the snippet below).

import numpy as np
import time

# Build a large random array and save it in both formats
arr = np.random.randint(0, 1000, (25000, 5000))
np.savez("arr.npz", **{"arr": arr})
np.save("arr.npy", arr)

# Time 20 full loads of the .npy file
ltime = []
for i in range(20):
    start = time.time()
    arr = np.load("arr.npy")[:, :]
    ltime.append(time.time() - start)

print("npy time:", np.mean(ltime))

# Time 20 full loads of the .npz archive
ltime = []
for i in range(20):
    start = time.time()
    arr = np.load("arr.npz")["arr"][:, :]
    ltime.append(time.time() - start)

print("npz time:", np.mean(ltime))

which gave me the following times:

npy time: 0.483348703384
npz time: 1.47687283754

That's quite a difference! (Part of the gap is inherent to the format: an .npz file is a zip archive, so NumPy has to go through the zip layer before it can read the raw array bytes.)

Clearly, another approach is needed. So far, I have settled on the excellent and simple h5py module, which stores the data in HDF5 format while staying very transparent to NumPy.

Here's how it goes:

import h5py
import numpy as np

arr = np.random.randint(0, 1000, (25000, 5000))

# Write the array to a single dataset inside an HDF5 file
with h5py.File("arr.h5", "w") as hf:
    hf.create_dataset("arr", data=arr)

And that's it!
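A nice property in passing: h5py datasets accept NumPy-style slicing directly, so you can pull out just the rows or columns you need without reading the whole array (a small sketch, assuming the arr.h5 file written above):

import h5py

with h5py.File("arr.h5", "r") as hf:
    # Only the requested slice is read from disk, not the full array
    first_rows = hf["arr"][:100]       # first 100 rows
    some_cols = hf["arr"][:, :10]      # first 10 columns of every row

This is what makes operations like a train/test split or per-class selection cheap on large arrays.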

You can also easily add metadata to each of your datasets:

import h5py
import numpy as np

arr = np.random.randint(0, 1000, (25000, 5000))

with h5py.File("arr.h5", "w") as hf:
    dset = hf.create_dataset("arr", data=arr)
    # Attach arbitrary metadata as HDF5 attributes on the dataset
    dset.attrs['author'] = "pony"
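
Reading the dataset and its metadata back is just as direct (again a minimal sketch against the arr.h5 file from above):

import h5py

with h5py.File("arr.h5", "r") as hf:
    arr = hf["arr"][:]                  # load the full dataset into memory
    author = hf["arr"].attrs['author']  # read the attribute back

print("author:", author)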

My only gripe with the module was an ill-fated attempt at writing a file in parallel from several sources: you need to rebuild h5py to support parallelism (my Anaconda distribution did not ship with it), and that takes you to a world of pain with conflicts between Anaconda's own HDF5 library and the new parallel one you build. The only workaround I found involved reinstalling h5py outside of Anaconda, but it messed with my MPI setup.
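For reference, the kind of parallel write I was attempting looks roughly like this; it only works with an MPI-enabled h5py build, and the file name, dataset name, and shape here are just placeholders:

from mpi4py import MPI
import h5py

# Requires h5py compiled against parallel HDF5 (driver='mpio')
comm = MPI.COMM_WORLD
with h5py.File("parallel.h5", "w", driver="mpio", comm=comm) as hf:
    dset = hf.create_dataset("test", (comm.size,), dtype="i")
    # Each MPI rank writes its own entry of the shared dataset
    dset[comm.rank] = comm.rank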

Anyway, let's test the speed of this new design:

import h5py
import numpy as np
import time

arr = np.random.randint(0, 1000, (25000, 5000))

with h5py.File("arr.h5", "w") as hf:
    dset = hf.create_dataset("arr", data=arr)
    dset.attrs['author'] = "pony"

# Time 20 full reads of the HDF5 dataset (file open included)
ltime = []
for i in range(20):
    start = time.time()
    with h5py.File("arr.h5", "r") as hf:
        arr = hf["arr"][:, :]
        ltime.append(time.time() - start)

print("hdf5 time:", np.mean(ltime))

This gave me:

hdf5 time: 0.386118304729

Which is even faster than the .npy version!

Later on, I'll try to give more details on my data pipeline.
