Handling HDF5-format data with Fuel in Python

Original article: http://fuel.readthedocs.io/en/latest/h5py_dataset.html

Before reading this, you may want to read my previous overview post first.


Let's start by creating some example data:

# -*- coding: utf-8 -*-
import numpy

# 90 training and 10 test examples for each of three sources:
# dense feature vectors, small binary images, and class targets.
numpy.save(
    'train_vector_features.npy',
    numpy.random.normal(size=(90, 10)).astype('float32'))
numpy.save(
    'test_vector_features.npy',
    numpy.random.normal(size=(10, 10)).astype('float32'))
numpy.save(
    'train_image_features.npy',
    numpy.random.randint(2, size=(90, 3, 5, 5)).astype('uint8'))
numpy.save(
    'test_image_features.npy',
    numpy.random.randint(2, size=(10, 3, 5, 5)).astype('uint8'))
numpy.save(
    'train_targets.npy',
    numpy.random.randint(10, size=(90, 1)).astype('uint8'))
numpy.save(
    'test_targets.npy',
    numpy.random.randint(10, size=(10, 1)).astype('uint8'))
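As a quick sanity check, one of these arrays can be round-tripped through a .npy file to confirm that numpy.save/numpy.load preserves shape and dtype (a minimal sketch that writes to a temporary directory rather than the working directory):

```python
import os
import tempfile

import numpy

# Round-trip one array through a .npy file in a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'train_vector_features.npy')
    arr = numpy.random.normal(size=(90, 10)).astype('float32')
    numpy.save(path, arr)
    loaded = numpy.load(path)

assert loaded.shape == (90, 10)
assert loaded.dtype == numpy.float32
assert numpy.array_equal(arr, loaded)
```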



Next, let's look at Fuel's HDF5 dataset support.

The recommended way to load the data is through the H5PYDataset class, which interacts with HDF5 files via the h5py module.

It imposes the following requirements:

1. All data is stored in a single HDF5 file.

2. Data sources live in the root group, and their names define the source names.

3. Splits are not stored explicitly as separate HDF5 datasets. Instead, they are defined in a 'split' attribute of the root group. H5PYDataset expects this 'split' attribute to be a one-dimensional numpy array whose compound dtype has seven fields, described below.

(1) split: string identifier of the split; this is what distinguishes, say, the training set from the test set

(2) source: string identifier of the source, e.g. image features or targets

(3) start: start index (inclusive) of this split within this source; only used when indices is empty

(4) stop: stop index (exclusive) of this split within this source; only used when indices is empty

(5) indices: an h5py.Reference to a dataset containing the indices of the subset belonging to this split/source pair

(6) available: boolean; False means this source is not available for this split

(7) comment: a comment string.
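To make the layout concrete, here is a minimal pure-NumPy sketch of a single such row. Note that the indices field is replaced by a plain object placeholder here purely for illustration; the real attribute uses an h5py.Reference dtype, as shown later.

```python
import numpy

# Simplified stand-in for one row of the 'split' attribute.
# 'indices' is an object field here only for illustration; the
# real dtype uses h5py.special_dtype(ref=h5py.Reference).
row_dtype = numpy.dtype([
    ('split', 'S5'),
    ('source', 'S15'),
    ('start', numpy.int64),
    ('stop', numpy.int64),
    ('indices', object),
    ('available', numpy.bool_),
    ('comment', 'S1')])

row = numpy.empty(1, dtype=row_dtype)
row[0] = (b'train', b'targets', 0, 90, None, True, b'.')
assert row[0]['split'] == b'train'
assert row[0]['stop'] - row[0]['start'] == 90
```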




Now convert the .npy data to HDF5. First, load the saved arrays back:

train_vector_features = numpy.load('train_vector_features.npy')
test_vector_features = numpy.load('test_vector_features.npy')
train_image_features = numpy.load('train_image_features.npy')
test_image_features = numpy.load('test_image_features.npy')
train_targets = numpy.load('train_targets.npy')
test_targets = numpy.load('test_targets.npy')


Open the file and create the (still empty) sources, i.e. the datasets that will hold the data:

import h5py
f = h5py.File('dataset.hdf5', mode='w')
vector_features = f.create_dataset(
    'vector_features', (100, 10), dtype='float32')
image_features = f.create_dataset(
    'image_features', (100, 3, 5, 5), dtype='uint8')
targets = f.create_dataset(
    'targets', (100, 1), dtype='uint8')

Each dataset holds 100 rows: 90 training plus 10 test examples. Fill them by stacking the numpy arrays, training rows first:
vector_features[...] = numpy.vstack(
    [train_vector_features, test_vector_features])
image_features[...] = numpy.vstack(
    [train_image_features, test_image_features])
targets[...] = numpy.vstack([train_targets, test_targets])
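Stacking puts the 90 training rows first and the 10 test rows last, which is exactly what the start/stop indices in the split attribute will rely on. A quick pure-NumPy check of that layout:

```python
import numpy

train = numpy.random.normal(size=(90, 10)).astype('float32')
test = numpy.random.normal(size=(10, 10)).astype('float32')

combined = numpy.vstack([train, test])  # rows 0-89 train, 90-99 test
assert combined.shape == (100, 10)
assert numpy.array_equal(combined[:90], train)
assert numpy.array_equal(combined[90:], test)
```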


Next, label the axes of each dataset; think of it as annotating the columns of a spreadsheet:
vector_features.dims[0].label = 'batch'
vector_features.dims[1].label = 'feature'
image_features.dims[0].label = 'batch'
image_features.dims[1].label = 'channel'
image_features.dims[2].label = 'height'
image_features.dims[3].label = 'width'
targets.dims[0].label = 'batch'
targets.dims[1].label = 'index'
The choice of labels is arbitrary, and you can omit them entirely. However, specific external frameworks may impose constraints on the label choice.
The last thing to do is to tell H5PYDataset how the data is split: set a 'split' attribute on the root group. Here the split array has 6 elements, each with the 7 fields described above. 6 = 2 × 3: two splits (train and test) times three sources (vector_features, image_features, targets). These six rows name every split/source pair; the actual row ranges are then given via start/stop (or indices). The array carries only split metadata and uses a compound dtype. Finally, assign it to f's 'split' attribute:
split_array = numpy.empty(
    6,
    dtype=numpy.dtype([
        ('split', 'a', 5),
        ('source', 'a', 15),
        ('start', numpy.int64, 1),
        ('stop', numpy.int64, 1),
        ('indices', h5py.special_dtype(ref=h5py.Reference)),
        ('available', bool, 1),  # numpy.bool was an alias for bool; removed in NumPy 1.24
        ('comment', 'a', 1)]))
split_array[0:3]['split'] = 'train'.encode('utf8')
split_array[3:6]['split'] = 'test'.encode('utf8')
split_array[0:6:3]['source'] = 'vector_features'.encode('utf8')
split_array[1:6:3]['source'] = 'image_features'.encode('utf8')
split_array[2:6:3]['source'] = 'targets'.encode('utf8')
split_array[0:3]['start'] = 0
split_array[0:3]['stop'] = 90
split_array[3:6]['start'] = 90
split_array[3:6]['stop'] = 100
split_array[:]['indices'] = h5py.Reference()
split_array[:]['available'] = True
split_array[:]['comment'] = '.'.encode('utf8')
f.attrs['split'] = split_array

The compound dtype is (str, str, int, int, h5py.Reference, bool, str).
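The strided slices used above interleave the sources: [0:6:3] selects rows 0 and 3, [1:6:3] rows 1 and 4, and [2:6:3] rows 2 and 5, so every split/source pair gets exactly one row. A small NumPy sketch of the same assignment pattern:

```python
import numpy

splits = numpy.empty(6, dtype='S5')
splits[0:3] = b'train'            # rows 0, 1, 2
splits[3:6] = b'test'             # rows 3, 4, 5

sources = numpy.empty(6, dtype='S15')
sources[0:6:3] = b'vector'        # rows 0 and 3
sources[1:6:3] = b'image'         # rows 1 and 4
sources[2:6:3] = b'targets'       # rows 2 and 5

assert list(splits) == [b'train'] * 3 + [b'test'] * 3
assert list(sources) == [b'vector', b'image', b'targets'] * 2
```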


   
   

Due to limitations in h5py, you must make sure to use bytes for the split, source and comment fields.

H5PYDataset expects the split attribute to contain as many elements as the Cartesian product of splits and sources. If some split/source pair has no data, set its available field to False; for instance, if the test set has no labels, the ('test', 'targets') pair should be empty.

# split_array=[ ('train', 'vector_features', 0, 90, <HDF5 object reference (null)>, True, '.')
#  ('train', 'image_features', 0, 90, <HDF5 object reference (null)>, True, '.')
#  ('train', 'targets', 0, 90, <HDF5 object reference (null)>, True, '.')
#  ('test', 'vector_features', 90, 100, <HDF5 object reference (null)>, True, '.')
#  ('test', 'image_features', 90, 100, <HDF5 object reference (null)>, True, '.')
#  ('test', 'targets', 90, 100, <HDF5 object reference (null)>, True, '.')]
# The result is a 6-element numpy array, each element carrying 7 fields.
# Because of h5py limitations, split, source and comment must be bytes.


An alternative approach: use H5PYDataset.create_split_array()
>>> from fuel.datasets.hdf5 import H5PYDataset
>>> split_dict = {
...     'train': {'vector_features': (0, 90), 'image_features': (0, 90),
...               'targets': (0, 90)},
...     'test': {'vector_features': (90, 100), 'image_features': (90, 100),
...              'targets': (90, 100)}}
>>> f.attrs['split'] = H5PYDataset.create_split_array(split_dict)

This method automatically sets available to False for any split/source pair that is missing from the dictionary.
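For example, if the test set had no targets, the dictionary would simply omit that entry and the corresponding row would come out with available set to False. The bookkeeping can be sketched in pure Python (this helper is a hypothetical illustration, not Fuel's actual implementation, which also fills in the h5py.Reference field):

```python
from itertools import product

# Hypothetical split dictionary: the test set has no targets.
split_dict = {
    'train': {'vector_features': (0, 90), 'image_features': (0, 90),
              'targets': (0, 90)},
    'test': {'vector_features': (90, 100), 'image_features': (90, 100)}}

# Collect every source mentioned anywhere, then walk the full
# split x source Cartesian product, flagging missing pairs.
sources = sorted({s for bounds in split_dict.values() for s in bounds})
rows = []
for split, source in product(sorted(split_dict), sources):
    bounds = split_dict[split].get(source)
    available = bounds is not None
    start, stop = bounds if available else (0, 0)
    rows.append((split, source, start, stop, available))

assert ('test', 'targets', 0, 0, False) in rows
assert ('train', 'targets', 0, 90, True) in rows
```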

Tip

By default, H5PYDataset sorts sources in alphabetical order, and data requests are also returned in that order. If sources is passed as argument upon instantiation, H5PYDataset will use the order of sources instead. This means that if you want to force a particular source order, you can do so by explicitly passing the sources argument with the desired ordering. For example, if your dataset has two sources named 'features' and 'targets' and you'd like the targets to be returned first, you need to pass sources=('targets', 'features') as a constructor argument.


Finally, flush and close the file:
>>> f.flush()
>>> f.close()


Using the H5PYDataset we just created

Inspect the file:
>>> train_set = H5PYDataset('dataset.hdf5', which_sets=('train',))
>>> print(train_set.num_examples)
90
>>> test_set = H5PYDataset('dataset.hdf5', which_sets=('test',))
>>> print(test_set.num_examples)
10

Carve a validation set out of the training set:
>>> train_set = H5PYDataset(
...     'dataset.hdf5', which_sets=('train',), subset=slice(0, 80))
>>> print(train_set.num_examples)
80
>>> valid_set = H5PYDataset(
...     'dataset.hdf5', which_sets=('train',), subset=slice(80, 90))
>>> print(valid_set.num_examples)
10
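The two subset slices partition the 90 training examples without overlap, which can be verified with plain slice arithmetic:

```python
n_train_total = 90

train_idx = range(n_train_total)[slice(0, 80)]
valid_idx = range(n_train_total)[slice(80, 90)]

assert len(train_idx) == 80 and len(valid_idx) == 10
assert set(train_idx).isdisjoint(valid_idx)
assert len(train_idx) + len(valid_idx) == n_train_total
```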

The sources each split provides can be discovered automatically:
>>> print(train_set.provides_sources)
('image_features', 'targets', 'vector_features')
Axis label information for each source is also available:
>>> print(train_set.axis_labels['image_features'])
('batch', 'channel', 'height', 'width')
>>> print(train_set.axis_labels['vector_features'])
('batch', 'feature')
>>> print(train_set.axis_labels['targets'])
('batch', 'index')
Data is requested in the usual way:
>>> handle = train_set.open()
>>> data = train_set.get_data(handle, slice(0, 10))
>>> print((data[0].shape, data[1].shape, data[2].shape))
((10, 3, 5, 5), (10, 1), (10, 10))
>>> train_set.close(handle)
To request only the vector features:
>>> train_vector_features = H5PYDataset(
...     'dataset.hdf5', which_sets=('train',), subset=slice(0, 80),
...     sources=['vector_features'])
>>> handle = train_vector_features.open()
>>> data, = train_vector_features.get_data(handle, slice(0, 10))
>>> print(data.shape)
(10, 10)
>>> train_vector_features.close(handle)

To be continued.


