Fastest way to write an HDF5 file with Python?


Given a large (10s of GB) CSV file of mixed text/numbers, what is the fastest way to create an HDF5 file with the same content, while keeping the memory usage reasonable?

I'd like to use the h5py module if possible.

In the toy example below, I've found an incredibly slow and incredibly fast way to write data to HDF5. Would it be best practice to write to HDF5 in chunks of 10,000 rows or so? Or is there a better way to write a massive amount of data to such a file?

import h5py

n = 10000000

f = h5py.File('foo.h5', 'w')
dset = f.create_dataset('int', (n,), 'i')

# this is terribly slow
for i in xrange(n):
    dset[i] = i

# instantaneous
dset[...] = 42
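For reference, the "chunks of 10,000 rows or so" idea from the question amounts to assigning whole slices rather than individual elements, which keeps the Python loop out of the per-element write path. A minimal h5py sketch of that pattern (the block size and the integer dummy data mirror the toy example above, not the real CSV):

import numpy as np
import h5py

n = 10000000
block = 10000  # rows written per slice

with h5py.File('foo.h5', 'w') as f:
    dset = f.create_dataset('int', (n,), 'i')
    for start in range(0, n, block):
        stop = min(start + block, n)
        # build one block in memory, then write it as a single slice
        dset[start:stop] = np.arange(start, stop, dtype='i')

The same slice-assignment pattern applies when the blocks come from a CSV reader instead of np.arange.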

Solution

I would avoid chunking the data and would instead store it as a series of single-array datasets (along the lines of what Benjamin is suggesting). I just finished loading the output of an enterprise app I've been working on into HDF5, and was able to pack about 4.5 billion compound-datatype records into 450,000 datasets, each containing an array of 10,000 values. Writes and reads now seem fairly instantaneous, but were painfully slow when I initially tried to chunk the data.

Just a thought!

Update:

These are a couple of snippets lifted from my actual code (I'm coding in C rather than Python, but you should get the idea of what I'm doing) and modified for clarity. I'm just writing long unsigned integers into arrays (10,000 values per array) and reading them back when I need an actual value.

This is my typical writer code. In this case, I'm simply writing a running sequence of long unsigned integers into a series of arrays and loading each array into HDF5 as it is created.

//Our dummy data: a rolling count of long unsigned integers
long unsigned int k = 0UL;

//We'll use this to store our dummy data, 10,000 at a time
long unsigned int kValues[NUMPERDATASET];

//Create the SS data file.
hid_t ssdb = H5Fcreate(SSHDF, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

//NUMPERDATASET = 10,000, so we get a 1 x 10,000 array
hsize_t dsDim[1] = {NUMPERDATASET};

//Create the data space.
hid_t dSpace = H5Screate_simple(1, dsDim, NULL);

//NUMDATASETS = MAXSSVALUE / NUMPERDATASET, where MAXSSVALUE = 4,500,000,000
for (unsigned long int i = 0UL; i < NUMDATASETS; i++){
    //Fill the buffer with the next 10,000 values of the rolling count.
    for (unsigned long int j = 0UL; j < NUMPERDATASET; j++){
        kValues[j] = k;
        k += 1UL;
    }

    //Create the data set, named after its position in the sequence.
    hid_t dssSet = H5Dcreate2(ssdb, g_strdup_printf("%lu", i), H5T_NATIVE_ULONG, dSpace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    //Write data to the data set.
    H5Dwrite(dssSet, H5T_NATIVE_ULONG, H5S_ALL, H5S_ALL, H5P_DEFAULT, kValues);

    //Close the data set.
    H5Dclose(dssSet);
}

//Release the data space.
H5Sclose(dSpace);

//Close the data file.
H5Fclose(ssdb);
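For comparison, a rough h5py equivalent of this writer might look like the sketch below; it is not the answerer's code, and the constants and file name simply mirror the C version above.

import numpy as np
import h5py

NUMPERDATASET = 10000
MAXSSVALUE = 4500000000
NUMDATASETS = MAXSSVALUE // NUMPERDATASET

k = 0
with h5py.File('ss.h5', 'w') as f:
    for i in range(NUMDATASETS):
        # one 10,000-value array per dataset, named by its position in the sequence
        f.create_dataset(str(i), data=np.arange(k, k + NUMPERDATASET, dtype=np.uint64))
        k += NUMPERDATASET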

This is a slightly modified version of my reader code. There are more elegant ways of doing this (e.g., I could use hyperslabs to get the value), but this was the cleanest solution with respect to my fairly disciplined Agile/BDD development process.

unsigned long int getValueByIndex(unsigned long int nnValue){
    //NUMPERDATASET = 10,000
    unsigned long int ssValue[NUMPERDATASET];

    //MAXSSVALUE = 4,500,000,000; i takes the smaller of MAXSSVALUE - 1 and nnValue
    //to avoid an index-out-of-range error
    unsigned long int i = MIN(MAXSSVALUE - 1, nnValue);

    //Open the data file in read-only mode.
    hid_t db = H5Fopen(_indexFilePath, H5F_ACC_RDONLY, H5P_DEFAULT);

    //Open the data set. Each dataset consists of an array of 10,000 unsigned long ints
    //and is named after the integer division of i by the number per dataset.
    hid_t dSet = H5Dopen(db, g_strdup_printf("%lu", i / NUMPERDATASET), H5P_DEFAULT);

    //Read the data set array.
    H5Dread(dSet, H5T_NATIVE_ULONG, H5S_ALL, H5S_ALL, H5P_DEFAULT, ssValue);

    //Close the data set.
    H5Dclose(dSet);

    //Close the data file.
    H5Fclose(db);

    //Return the indexed value: i modulo the number of values per dataset.
    return ssValue[i % NUMPERDATASET];
}

The main take-aways are the inner loop in the writer code and the integer-division and modulus operations used to pick the dataset and the index of the desired value within that dataset's array. Let me know whether this is clear enough for you to put together something similar, or better, in h5py. In C this is dead simple and gives me significantly better read/write times than a chunked-dataset solution. Besides, since I can't use compression with compound datasets anyway, the apparent upside of chunking is a moot point, so all my compounds are stored the same way.
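In h5py, the integer-division/modulus lookup from getValueByIndex might translate roughly as follows (a sketch assuming the dataset naming scheme used above, not code from the answer):

import h5py

NUMPERDATASET = 10000
MAXSSVALUE = 4500000000

def get_value_by_index(nn_value, index_file_path='ss.h5'):
    # clamp to the largest stored index to avoid reading past the last dataset
    i = min(MAXSSVALUE - 1, nn_value)
    with h5py.File(index_file_path, 'r') as f:
        # dataset name is the integer-division part; position within it is the modulus
        return f[str(i // NUMPERDATASET)][i % NUMPERDATASET]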
