Handling HDF5-format data with Fuel in Python

Original article: http://fuel.readthedocs.io/en/latest/h5py_dataset.html

Before reading this, you may want to read my previous overview post first.


Let's start by creating some example data:

# -*- coding: utf-8 -*-
import numpy

# 90 training and 10 test examples for each of three sources:
# dense feature vectors, small binary images, and class targets.
numpy.save(
    'train_vector_features.npy',
    numpy.random.normal(size=(90, 10)).astype('float32'))
numpy.save(
    'test_vector_features.npy',
    numpy.random.normal(size=(10, 10)).astype('float32'))
numpy.save(
    'train_image_features.npy',
    numpy.random.randint(2, size=(90, 3, 5, 5)).astype('uint8'))
numpy.save(
    'test_image_features.npy',
    numpy.random.randint(2, size=(10, 3, 5, 5)).astype('uint8'))
numpy.save(
    'train_targets.npy',
    numpy.random.randint(10, size=(90, 1)).astype('uint8'))
numpy.save(
    'test_targets.npy',
    numpy.random.randint(10, size=(10, 1)).astype('uint8'))
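As a quick sanity check, one of these arrays can be round-tripped through a .npy file to confirm that numpy.save/numpy.load preserves shape and dtype (a minimal sketch that writes to a temporary directory rather than the working directory):

```python
import os
import tempfile

import numpy

# Round-trip one array through a .npy file in a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'train_vector_features.npy')
    arr = numpy.random.normal(size=(90, 10)).astype('float32')
    numpy.save(path, arr)
    loaded = numpy.load(path)

assert loaded.shape == (90, 10)
assert loaded.dtype == numpy.float32
assert numpy.array_equal(arr, loaded)
```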



Next, let's look at Fuel's HDF5 dataset support.

The recommended way to load the data is through the H5PYDataset class, which interacts with HDF5 files via the h5py module.

It imposes the following requirements:

1. All data is stored in a single HDF5 file.

2. Data sources live in the root group, and their names define the source names.

3. Splits are not stored explicitly as separate HDF5 datasets. Instead, they are defined in a 'split' attribute of the root group. H5PYDataset expects this 'split' attribute to be a one-dimensional numpy array whose compound dtype has seven fields, described below.

(1) split: string identifier of the split; this is what distinguishes, say, the training set from the test set

(2) source: string identifier of the source, e.g. image features or targets

(3) start: start index (inclusive) of this split within this source; only used when indices is empty

(4) stop: stop index (exclusive) of this split within this source; only used when indices is empty

(5) indices: an h5py.Reference to a dataset containing the indices of the subset belonging to this split/source pair

(6) available: boolean; False means this source is not available for this split

(7) comment: a comment string.
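To make the layout concrete, here is a minimal pure-NumPy sketch of a single such row. Note that the indices field is replaced by a plain object placeholder here purely for illustration; the real attribute uses an h5py.Reference dtype, as shown later.

```python
import numpy

# Simplified stand-in for one row of the 'split' attribute.
# 'indices' is an object field here only for illustration; the
# real dtype uses h5py.special_dtype(ref=h5py.Reference).
row_dtype = numpy.dtype([
    ('split', 'S5'),
    ('source', 'S15'),
    ('start', numpy.int64),
    ('stop', numpy.int64),
    ('indices', object),
    ('available', numpy.bool_),
    ('comment', 'S1')])

row = numpy.empty(1, dtype=row_dtype)
row[0] = (b'train', b'targets', 0, 90, None, True, b'.')
assert row[0]['split'] == b'train'
assert row[0]['stop'] - row[0]['start'] == 90
```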




Now convert the .npy data to HDF5. First, load the saved arrays back:

train_vector_features = numpy.load('train_vector_features.npy')
test_vector_features = numpy.load('test_vector_features.npy')
train_image_features = numpy.load('train_image_features.npy')
test_image_features = numpy.load('test_image_features.npy')
train_targets = numpy.load('train_targets.npy')
test_targets = numpy.load('test_targets.npy')


Open the file and create the (still empty) sources, i.e. the datasets that will hold the data:

import h5py
f = h5py.File('dataset.hdf5', mode='w')
vector_features = f.create_dataset(
    'vector_features', (100, 10), dtype='float32')
image_features = f.create_dataset(
    'image_features', (100, 3, 5, 5), dtype='uint8')
targets = f.create_dataset(
    'targets', (100, 1), dtype='uint8')

Each dataset holds 100 rows: 90 training plus 10 test examples. Fill them by stacking the numpy arrays, training rows first:
vector_features[...] = numpy.vstack(
    [train_vector_features, test_vector_features])
image_features[...] = numpy.vstack(
    [train_image_features, test_image_features])
targets[...] = numpy.vstack([train_targets, test_targets])
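Stacking puts the 90 training rows first and the 10 test rows last, which is exactly what the start/stop indices in the split attribute will rely on. A quick pure-NumPy check of that layout:

```python
import numpy

train = numpy.random.normal(size=(90, 10)).astype('float32')
test = numpy.random.normal(size=(10, 10)).astype('float32')

combined = numpy.vstack([train, test])  # rows 0-89 train, 90-99 test
assert combined.shape == (100, 10)
assert numpy.array_equal(combined[:90], train)
assert numpy.array_equal(combined[90:], test)
```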


Next, label the axes of each dataset; think of it as annotating the columns of a spreadsheet:
vector_features.dims[0].label = 'batch'
vector_features.dims[1].label = 'feature'
image_features.dims[0].label = 'batch'
image_features.dims[1].label = 'channel'
image_features.dims[2].label = 'height'
image_features.dims[3].label = 'width'
targets.dims[0].label = 'batch'
targets.dims[1].label = 'index'
The choice of labels is arbitrary, and you can omit them entirely. However, specific external frameworks may impose constraints on the label choice.
The last thing to do is to tell H5PYDataset how the data is split: set a 'split' attribute on the root group. Here the split array has 6 elements, each with the 7 fields described above. 6 = 2 × 3: two splits (train and test) times three sources (vector_features, image_features, targets). These six rows name every split/source pair; the actual row ranges are then given via start/stop (or indices). The array carries only split metadata and uses a compound dtype. Finally, assign it to f's 'split' attribute:
split_array = numpy.empty(
    6,
    dtype=numpy.dtype([
        ('split', 'a', 5),
        ('source', 'a', 15),
        ('start', numpy.int64, 1),
        ('stop', numpy.int64, 1),
        ('indices', h5py.special_dtype(ref=h5py.Reference)),
        ('available', bool, 1),  # numpy.bool was an alias for bool; removed in NumPy 1.24
        ('comment', 'a', 1)]))
split_array[0:3]['split'] = 'train'.encode('utf8')
split_array[3:6]['split'] = 'test'.encode('utf8')
split_array[0:6:3]['source'] = 'vector_features'.encode('utf8')
split_array[1:6:3]['source'] = 'image_features'.encode('utf8')
split_array[2:6:3]['source'] = 'targets'.encode('utf8')
split_array[0:3]['start'] = 0
split_array[0:3]['stop'] = 90
split_array[3:6]['start'] = 90
split_array[3:6]['stop'] = 100
split_array[:]['indices'] = h5py.Reference()
split_array[:]['available'] = True
split_array[:]['comment'] = '.'.encode('utf8')
f.attrs['split'] = split_array

The compound dtype is (str, str, int, int, h5py.Reference, bool, str).
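The strided slices used above interleave the sources: [0:6:3] selects rows 0 and 3, [1:6:3] rows 1 and 4, and [2:6:3] rows 2 and 5, so every split/source pair gets exactly one row. A small NumPy sketch of the same assignment pattern:

```python
import numpy

splits = numpy.empty(6, dtype='S5')
splits[0:3] = b'train'            # rows 0, 1, 2
splits[3:6] = b'test'             # rows 3, 4, 5

sources = numpy.empty(6, dtype='S15')
sources[0:6:3] = b'vector'        # rows 0 and 3
sources[1:6:3] = b'image'         # rows 1 and 4
sources[2:6:3] = b'targets'       # rows 2 and 5

assert list(splits) == [b'train'] * 3 + [b'test'] * 3
assert list(sources) == [b'vector', b'image', b'targets'] * 2
```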


   
   

Due to limitations in h5py, you must make sure to use bytes for the split, source and comment fields.

H5PYDataset expects the split attribute to contain as many elements as the Cartesian product of splits and sources. If some split/source pair has no data, set its available field to False; for instance, if the test set has no labels, the ('test', 'targets') pair should be empty.

# split_array=[ ('train', 'vector_features', 0, 90, <HDF5 object reference (null)>, True, '.')
#  ('train', 'image_features', 0, 90, <HDF5 object reference (null)>, True, '.')
#  ('train', 'targets', 0, 90, <HDF5 object reference (null)>, True, '.')
#  ('test', 'vector_features', 90, 100, <HDF5 object reference (null)>, True, '.')
#  ('test', 'image_features', 90, 100, <HDF5 object reference (null)>, True, '.')
#  ('test', 'targets', 90, 100, <HDF5 object reference (null)>, True, '.')]
# The result is a 6-element numpy array, each element carrying 7 fields.
# Because of h5py limitations, split, source and comment must be bytes.


An alternative approach: use H5PYDataset.create_split_array()
>>> from fuel.datasets.hdf5 import H5PYDataset
>>> split_dict = {
...     'train': {'vector_features': (0, 90), 'image_features': (0, 90),
...               'targets': (0, 90)},
...     'test': {'vector_features': (90, 100), 'image_features': (90, 100),
...              'targets': (90, 100)}}
>>> f.attrs['split'] = H5PYDataset.create_split_array(split_dict)

This method automatically sets available to False for any split/source pair that is missing from the dictionary.
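For example, if the test set had no targets, the dictionary would simply omit that entry and the corresponding row would come out with available set to False. The bookkeeping can be sketched in pure Python (this helper is a hypothetical illustration, not Fuel's actual implementation, which also fills in the h5py.Reference field):

```python
from itertools import product

# Hypothetical split dictionary: the test set has no targets.
split_dict = {
    'train': {'vector_features': (0, 90), 'image_features': (0, 90),
              'targets': (0, 90)},
    'test': {'vector_features': (90, 100), 'image_features': (90, 100)}}

# Collect every source mentioned anywhere, then walk the full
# split x source Cartesian product, flagging missing pairs.
sources = sorted({s for bounds in split_dict.values() for s in bounds})
rows = []
for split, source in product(sorted(split_dict), sources):
    bounds = split_dict[split].get(source)
    available = bounds is not None
    start, stop = bounds if available else (0, 0)
    rows.append((split, source, start, stop, available))

assert ('test', 'targets', 0, 0, False) in rows
assert ('train', 'targets', 0, 90, True) in rows
```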

Tip

By default, H5PYDataset sorts sources in alphabetical order, and data requests are also returned in that order. If sources is passed as argument upon instantiation, H5PYDataset will use the order of sources instead. This means that if you want to force a particular source order, you can do so by explicitly passing the sources argument with the desired ordering. For example, if your dataset has two sources named 'features' and 'targets' and you'd like the targets to be returned first, you need to pass sources=('targets', 'features') as a constructor argument.


Finally, flush and close the file:
>>> f.flush()
>>> f.close()


Using the H5PYDataset we just created

Inspect the file:
>>> train_set = H5PYDataset('dataset.hdf5', which_sets=('train',))
>>> print(train_set.num_examples)
90
>>> test_set = H5PYDataset('dataset.hdf5', which_sets=('test',))
>>> print(test_set.num_examples)
10

Carve a validation set out of the training set:
>>> train_set = H5PYDataset(
...     'dataset.hdf5', which_sets=('train',), subset=slice(0, 80))
>>> print(train_set.num_examples)
80
>>> valid_set = H5PYDataset(
...     'dataset.hdf5', which_sets=('train',), subset=slice(80, 90))
>>> print(valid_set.num_examples)
10
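The two subset slices partition the 90 training examples without overlap, which can be verified with plain slice arithmetic:

```python
n_train_total = 90

train_idx = range(n_train_total)[slice(0, 80)]
valid_idx = range(n_train_total)[slice(80, 90)]

assert len(train_idx) == 80 and len(valid_idx) == 10
assert set(train_idx).isdisjoint(valid_idx)
assert len(train_idx) + len(valid_idx) == n_train_total
```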

The sources each split provides can be discovered automatically:
>>> print(train_set.provides_sources)
('image_features', 'targets', 'vector_features')
Axis label information for each source is also available:
>>> print(train_set.axis_labels['image_features'])
('batch', 'channel', 'height', 'width')
>>> print(train_set.axis_labels['vector_features'])
('batch', 'feature')
>>> print(train_set.axis_labels['targets'])
('batch', 'index')
Data is requested in the usual way:
>>> handle = train_set.open()
>>> data = train_set.get_data(handle, slice(0, 10))
>>> print((data[0].shape, data[1].shape, data[2].shape))
((10, 3, 5, 5), (10, 1), (10, 10))
>>> train_set.close(handle)
To request only the vector features:
>>> train_vector_features = H5PYDataset(
...     'dataset.hdf5', which_sets=('train',), subset=slice(0, 80),
...     sources=['vector_features'])
>>> handle = train_vector_features.open()
>>> data, = train_vector_features.get_data(handle, slice(0, 10))
>>> print(data.shape)
(10, 10)
>>> train_vector_features.close(handle)

To be continued.


