Reading and Writing HDF5 Files with pandas

Johnson0722

Article link: https://blog.csdn.net/john_xyz/article/details/96829337

Introduction to HDF5

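HDF5 is a high-performance storage format designed for storing arrays of tabular data. The pandas HDFStore class can store a DataFrame in an HDF5 file so that it can be accessed efficiently while still preserving column types and other metadata. It is a dict-like class, so you can read from and write to it much as you would with a Python dict object.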
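The dict-like behaviour can be sketched as follows. This is a minimal illustration, assuming PyTables is installed; the file name `dict_like_demo.h5` and the key `numbers` are hypothetical names chosen only for this example.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical file name and key, used only for this illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "dict_like_demo.h5")
df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["A", "B"])

with pd.HDFStore(demo_path, mode="w") as store:
    store["numbers"] = df           # same effect as store.put("numbers", df)
    has_key = "numbers" in store    # membership test, as with a dict
    restored = store["numbers"]     # same effect as store.get("numbers")
    del store["numbers"]            # same effect as store.remove("numbers")
    remaining_keys = store.keys()

print(has_key)
print(restored.equals(df))
print(remaining_keys)
```

Using the store as a context manager closes the file automatically, even if an exception is raised inside the block.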

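HDF5 supports compressed storage. blosc is one of the compression libraries pandas supports and is generally among the fastest; note that if complevel is set without complib, pandas falls back to zlib. Compression saves space at the cost of slightly slower reads and writes.
The benefit is small for small datasets and only becomes noticeable as the data grows. I also found that a file opened with mode 'w' is effectively write-once: you can append and put while it is open, but reopening it with 'w' after closing discards the existing content. To add more data later, reopen the store with mode 'a' (the default).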
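Whether a closed store can be written to again depends on the mode used to reopen it. The sketch below, which assumes PyTables is installed and uses a hypothetical file name and key, contrasts mode 'a' (existing content is kept and can be extended) with mode 'w' (the file is recreated empty).

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name and key, used only for this illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "reopen_demo.h5")
df1 = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]})
df2 = pd.DataFrame({"A": [7.0, 8.0, 9.0], "B": [10.0, 11.0, 12.0]})

# First session: create the file and write a table, then close it.
with pd.HDFStore(demo_path, mode="w") as store:
    store.put("data", df1, format="table")

# Second session: mode "a" keeps the existing table and allows appending.
with pd.HDFStore(demo_path, mode="a") as store:
    store.append("data", df2)
    rows_after_append = len(store["data"])

# Third session: mode "w" truncates the file, so nothing is left.
with pd.HDFStore(demo_path, mode="w") as store:
    keys_after_truncate = store.keys()

print(rows_after_append)
print(keys_after_truncate)
```

Appending only works for objects stored in the table format; objects written in the default fixed format have to be rewritten in full.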

Common HDFStore methods

Read a file (compressed stores are supported)

# Read from the store, close it if we opened it.
read_hdf(path_or_buf[, key, mode])
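As a rough illustration of the read path, the sketch below writes a compressed file with `DataFrame.to_hdf` and reads it back with `pd.read_hdf`. It assumes PyTables with blosc support is installed; the file name and key are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical file name and key, used only for this illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "read_hdf_demo.h5")
df = pd.DataFrame(np.arange(8, dtype="float64").reshape(4, 2), columns=["A", "B"])

# Write with compression enabled; read_hdf needs no extra arguments to decompress.
df.to_hdf(demo_path, key="df", mode="w", complevel=4, complib="blosc")
loaded = pd.read_hdf(demo_path, key="df")

print(loaded.equals(df))
```

`read_hdf` opens the store, reads the object under the given key, and closes the store again, so it is convenient for one-off reads.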

Add a table to the HDF store, indexed by key

# Store object in HDFStore
HDFStore.put(self, key, value[, format, append])

Append data to an existing table

# Append to Table in file.
HDFStore.append(self, key, value[, format, …])

Retrieve data by key; returns a DataFrame

# Retrieve pandas object stored in file
HDFStore.get(self, key)

Filter data by condition

# Retrieve pandas object stored in file, optionally based on where criteria
HDFStore.select(self, key[, where, start, …])
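A minimal sketch of filtering with `select`, assuming PyTables is installed. The file name, key, and sample values are hypothetical. Querying on a column requires the object to be stored with format='table' and the column to be declared as a data column (here via data_columns=True).

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name, key, and data, used only for this illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "select_demo.h5")
df = pd.DataFrame({"A": [-1.0, 0.5, 2.0, -3.0], "B": [10, 20, 30, 40]})

with pd.HDFStore(demo_path, mode="w") as store:
    # data_columns=True makes every column queryable in a where clause.
    store.put("df", df, format="table", data_columns=True)
    positive = store.select("df", where="A > 0")     # rows where column A is positive
    first_two = store.select("df", start=0, stop=2)  # rows 0 and 1 by position

print(positive)
print(first_two)
```

The filtering happens while reading from disk, so only the matching rows are loaded into memory.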

View information about the HDF store

# Print detailed information on the store.
HDFStore.info(self)

List all keys in the HDF store

# Return a (potentially unordered) list of the keys corresponding to the objects stored in the HDFStore.
HDFStore.keys(self)

# Return a list of all the top-level nodes (that are not themselves a pandas storage object).
HDFStore.groups(self)

# Walk the PyTables group hierarchy for pandas objects.
HDFStore.walk(self[, where])
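The inspection methods can be tried on a small store with nested keys. This is a sketch assuming PyTables is installed; the file name and the keys `raw/prices`, `raw/volumes`, and `summary` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name and keys, used only for this illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "inspect_demo.h5")
df = pd.DataFrame({"A": [1, 2, 3]})

with pd.HDFStore(demo_path, mode="w") as store:
    store.put("raw/prices", df)
    store.put("raw/volumes", df)
    store.put("summary", df)

    keys = sorted(store.keys())   # full paths of all stored pandas objects
    info_text = store.info()      # human-readable summary, returned as a string
    # walk() yields (group path, child group names, pandas object names).
    walked = {
        group_path: (sorted(subgroups), sorted(leaves))
        for group_path, subgroups, leaves in store.walk()
    }

print(keys)
print(info_text)
print(walked)
```

Keys containing '/' create nested groups, which is why `walk` reports `raw` as a group and `summary` as a leaf of the root.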

Example

import numpy as np
import pandas as pd

# Open an HDF file for writing
hdf = pd.HDFStore('test.hdf','w')
df1 = pd.DataFrame(np.random.standard_normal((3,2)), columns=['A','B'])
df2 = pd.DataFrame(np.random.standard_normal((3,2)), columns=['A','B'])
hdf.put(key='key1', value=df1, format='table', data_columns=True)
hdf.put(key='key2', value=df2, format='table', data_columns=True)

print(hdf.info())

<class 'pandas.io.pytables.HDFStore'>
File path: test.hdf
/key1            frame        (shape->[3,2])
/key2            frame        (shape->[3,2])

print(hdf.keys())

['/key1', '/key2']

print(hdf.get('key1')) # equal to hdf['key1']
print(hdf.get('key2')) # equal to hdf['key2']

          A         B    
0  0.257239  1.684300 
1  0.076235 -0.071744 
2 -0.266105 -0.874081

          A         B   
0  1.178982 -0.517734
1  0.713010 -0.484248 
2  0.741703 -0.650327

# Append df2 to the table stored under key1
hdf.append(key='key1', value=df2, format='table', data_columns=True)
print(hdf.get('key1'))

          A         B    
0  0.257239  1.684300 
1  0.076235 -0.071744 
2 -0.266105 -0.874081
0  1.178982 -0.517734
1  0.713010 -0.484248 
2  0.741703 -0.650327

Storing with compression

# 90,000,000 x 4 float64 values: roughly 2.9 GB in memory
large_data = pd.DataFrame(np.random.standard_normal((90000000,4)))
# Uncompressed storage:
hdf1 = pd.HDFStore('test1.h5','w')
hdf1.put(key='data', value=large_data)
hdf1.close()
# Compressed storage:
hdf2 = pd.HDFStore('test2.h5','w', complevel=4, complib='blosc')
hdf2.put(key='data', value=large_data)
hdf2.close()

In my run, test2.h5 came out about 700 MB smaller than test1.h5, which saves storage space.
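To check the size difference without allocating several gigabytes, the comparison can be sketched on a much smaller frame. This assumes PyTables with blosc support is installed; the file names are hypothetical, and the all-zero data is an assumption chosen because it compresses very well, so the ratio will be far more favourable than for random values like those above.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical file names, used only for this illustration.
workdir = tempfile.mkdtemp()
plain_path = os.path.join(workdir, "plain.h5")
packed_path = os.path.join(workdir, "packed.h5")

# Highly compressible data (all zeros); random data would shrink far less.
df = pd.DataFrame(np.zeros((200_000, 4)))

# Uncompressed storage.
with pd.HDFStore(plain_path, mode="w") as store:
    store.put("data", df)

# Compressed storage with the same settings as in the example above.
with pd.HDFStore(packed_path, mode="w", complevel=4, complib="blosc") as store:
    store.put("data", df)

plain_size = os.path.getsize(plain_path)
packed_size = os.path.getsize(packed_path)
print(f"uncompressed: {plain_size} bytes, compressed: {packed_size} bytes")
```

How much space is saved depends heavily on the data: repetitive or low-entropy columns shrink a lot, while random floating-point values shrink comparatively little.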

