Which loads faster in Python: pickle or HDF5?

For a 1.5 GB list of pandas dataframes, the author compares load speeds across formats, including pickle, HDF5, and gzip-compressed CSV. The results show that HDF5 (PyTables) is usually faster than pickle under several compression settings, especially with zlib. However, the best option can vary by data type, so it is recommended to benchmark with your actual data.

Given a 1.5 GB list of pandas dataframes, which format is fastest for loading compressed data:

pickle (via cPickle), hdf5, or something else in Python?

I only care about fastest speed to load the data into memory

I don't care about dumping the data; it's slow, but I only do it once.

I don't care about file size on disk

Solution

I would consider only two storage formats: HDF5 (PyTables) and Feather

Here are results of my read and write comparison for the DF (shape: 4000000 x 6, size in memory 183.1 MB, size of uncompressed CSV - 492 MB).

Comparison for the following storage formats: (CSV, CSV.gzip, Pickle, HDF5 [various compression]):

storage             read_s   write_s   size_ratio_to_CSV
CSV                 17.900     69.00               1.000
CSV.gzip            18.900    186.00               0.047
Pickle               0.173      1.77               0.374
HDF_fixed            0.196      2.03               0.435
HDF_tab              0.230      2.60               0.437
HDF_tab_zlib_c5      0.845      5.44               0.035
HDF_tab_zlib_c9      0.860      5.95               0.035
HDF_tab_bzip2_c5     2.500     36.50               0.011
HDF_tab_bzip2_c9     2.500     36.50               0.011

But it might be different for you, because all my data was of the datetime dtype, so it's always better to run such a comparison with your real data, or at least with similar data.
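A comparison like the one above can be reproduced with a small timing harness; this sketch (my own, not the answerer's original script) benchmarks CSV and pickle, the two formats that need only pandas itself, and other formats can be added as extra entries:

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Synthetic stand-in data; replace with your real dataframe.
df = pd.DataFrame({
    "ts": pd.date_range("2020-01-01", periods=100_000, freq="s"),
    "val": np.random.rand(100_000),
})

def bench(write, read, path):
    """Time one write and one read of `df`; return (read_s, write_s, bytes on disk)."""
    t0 = time.perf_counter(); write(path); write_s = time.perf_counter() - t0
    t0 = time.perf_counter(); read(path); read_s = time.perf_counter() - t0
    return read_s, write_s, os.path.getsize(path)

tmp = tempfile.mkdtemp()
results = {
    "CSV": bench(lambda p: df.to_csv(p, index=False), pd.read_csv,
                 os.path.join(tmp, "df.csv")),
    "Pickle": bench(df.to_pickle, pd.read_pickle,
                    os.path.join(tmp, "df.pkl")),
}
for fmt, (read_s, write_s, size) in results.items():
    print(f"{fmt:8s} read {read_s:.3f}s  write {write_s:.3f}s  {size / 1e6:.1f} MB")
```

For a fair test, run each format several times and take the minimum, and benchmark at a size close to your real data, since small files understate CSV's parsing overhead.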
