Python and Julia: Comparing the Parquet and Feather Formats

songroom

Article link: https://blog.csdn.net/wowotuo/article/details/109828399

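This post compares the Parquet and Feather data formats hands-on. Using a CSV file of about 590,000 rows and 14 columns, it tests read/write performance and file size. The results show that Parquet saves space and performs well, while Feather reads quickly and is less sensitive to strings. Cross-language interoperability is also tested, and the recommendation is to choose a format according to need.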

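I recently came across the Arrow format and was impressed by its design. I won't introduce it in detail here; instead I tried it out hands-on. One note: the Python library in question is pyarrow, not the arrow library (which is a date/time library). pyarrow is powerful and can handle many data formats.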

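I. Preparing the Data
I prepared a CSV file with about 590,000 rows and 14 columns, roughly 61 MB in size, in the following format:

[Image: screenshot of the CSV file contents]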

table shape row:  589680
table shape col:  14

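For testing purposes, the last column, open_interest, was filled with 0 to reduce missing-value handling.
With this CSV in hand, it can be loaded into a DataFrame and then converted to Parquet or Feather.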

II. The Code

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import time as t

# Read the CSV into a DataFrame and convert it to an Arrow table
print("write parquet file....")
csv_path = "C:\\Users\\songroom\\Desktop\\test.csv"
time_0 = t.time()
df = pd.read_csv(csv_path)
time_1 =t.time()
print("read csv cost :", time_1-time_0)
print("type of df : ",type(df))
time_2 = t.time()
table = pa.Table.from_pandas(df)
time_3 = t.time()
print("type of table :",type(table))
print("Dataframe convert table:", time_3-time_2)
#
print("write to parquet to disk.....")
time_4 = t.time()
pq_path = "C:\\Users\\songroom\\Desktop\\test.parquet"
pq.write_table(table, pq_path)
time_5 = t.time()
print("write  parquet cost :", time_5-time_4)

print("read parquet file from disk ....")
time_5 = t.time()  # restart the timer so the print above is not counted
table2 = pq.read_table(pq_path)
time_6 = t.time()
print("read  parquet cost :", time_6-time_5)
print("type of table2 :",type(table2))
print("table shape row: ",table2.shape[0])
print("table shape col: ",table2.shape[1])
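The script above brackets every step with a pair of t.time() calls. As a possible tidy-up, here is a minimal sketch of a reusable timer built on time.perf_counter, which is better suited to measuring short intervals. The names timed and results are my own, not part of pyarrow or the original script, and the timed workload is just a placeholder.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block with a monotonic clock and print the result."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the number for later comparison
        print(f"{label}: {elapsed:.4f} s")


# Each step gets its own timer, so print calls between steps
# are never counted as part of a measurement.
results = {}
with timed("sum of squares", results):
    total = sum(i * i for i in range(100_000))
```

In the script above, a call such as pq.read_table(pq_path) would go inside the with block in place of the placeholder sum.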

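III. File Size and Performance Comparison

1. Read/write performance and file size
The generated Parquet file is about 11.3 MB, roughly 20% of the original CSV. That is a big space saving.

The read/write timings are as follows: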

write parquet file....
read csv cost : 1.0619995594024658
type of df :  <class 'pandas.core.frame.DataFrame'>
type of table : <class 'pyarrow.lib.Table'>
Dataframe convert table: 0.08900094032287598
write to parquet to disk.....
write  parquet cost : 0.3249986171722412
read parquet file from disk ....
read  parquet cost : 0.05600690841674805
type of table2 : <class 'pyarrow.lib.Table'>
table2 shape row:  589680
table2 shape col:  14

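In other words, for this file of about 590,000 rows:
CSV vs Parquet read time (1.06 s vs 0.056 s): Parquet takes about 5% of the time
CSV vs Parquet file size (61 MB vs 11.3 MB): Parquet takes about 20% of the space
Overall, Parquet performs well: it is both fast and compact.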
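For the record, the two percentages are simple ratios of the figures measured above. A short sketch of the arithmetic (the variable names are mine):

```python
# Figures taken from the timing output and file sizes reported above.
csv_read_s, parquet_read_s = 1.06, 0.056
csv_mb, parquet_mb = 61.0, 11.3

time_ratio = parquet_read_s / csv_read_s   # share of the CSV read time
size_ratio = parquet_mb / csv_mb           # share of the CSV file size
speedup = csv_read_s / parquet_read_s      # how many times faster the read is

print(f"read time: {time_ratio:.1%} of CSV")   # read time: 5.3% of CSV
print(f"file size: {size_ratio:.1%} of CSV")   # file size: 18.5% of CSV
print(f"speedup:   {speedup:.1f}x")            # speedup:   18.9x
```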

Of course, these numbers will vary with data size and runtime environment, so treat them as a reference only.

Note that a Parquet file does not shrink much further when run through a compression tool: roughly 11.3 MB -> 7 MB.

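2. Parquet and strings
The first column of the sample CSV is a timestamp string in the form "2010-01-01 9:30:00". If this column is converted to a numeric type, things get slightly faster.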

(base) D:\py_joinquant>python -u "d:\py_joinquant\test.py"
write parquet file....
read csv cost : 0.6745004653930664
type of df :  <class 'pandas.core.frame.DataFrame'>
type of table : <class 'pyarrow.lib.Table'>        
Dataframe convert table: 0.010273456573486328      
write to parquet to disk.....
write  parquet cost : 0.28311967849731445
read parquet file from disk ....
read  parquet cost : 0.04045557975769043    
type of table2 : <class 'pyarrow.lib.Table'>
table shape row:  589680
table shape col:  14

Roughly 0.056 s vs 0.040 s for the read. Overall, strings do not put much pressure on Parquet.

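IV. Comparison with Feather

Using the same CSV file, let's run it through Feather and compare read speed. Note that Feather reads back a DataFrame, whereas Parquet (via pq.read_table) reads back a Table.

1. Testing Feather in Julia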

using DataFrames
using CSV
using Feather

csv_path = raw"C:\Users\songroom\Desktop\test.csv"  # raw string: backslashes are kept literally
println("csv => DataFrame: ")
df = @time CSV.File(csv_path) |> DataFrame;
ft_path = raw"C:\Users\songroom\Desktop\ft.ft"
println("DataFrame=> Feather:")
@time ft_file = Feather.write(ft_path,df)
println("read Feather")
@time ft = Feather.read(ft_path)
println("type of ft :",typeof(ft))

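Test results from the Julia program on the same machine:

[Image: screenshot of the Julia timing output]

For the same file, Feather reads faster than Parquet: 0.022 s vs 0.056 s. Another advantage of Feather is that it is less picky about strings; compared with Parquet, it is less sensitive to them. As an aside, HDF5 is the most sensitive to strings.

In terms of speed, Julia Feather beats Python Parquet. It is not clear, however, how much of the gap comes from Julia itself being faster than Python, so the only tentative conclusion is that Feather should read faster than Parquet.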

Note that Feather files are highly compressible: after compression they take only about 5-8% of the original file size.

2. Reading a Python-generated Parquet file in Julia

Julia has the Parquet.jl library, but it still has some problems, mainly that it is very slow.

(1) Performance issues

# test.jl
using Parquet
filename = raw"C:\Users\songroom\Desktop\test.parquet";
@time pq = ParFile(filename) ;
println("type of pq : ",typeof(pq));
println("c_names: ", colnames(pq));
println("rows : ",nrows(pq));
println("cols : ",ncols(pq));

julia> @run test
  0.001709 seconds (3.13 k allocations: 197.570 KiB)
type of pq : ParFile
c_names: [["dt"], ["open"], ["close"], ["low"], ["high"], ["volume"], ["money"], ["factor"], ["high_limit"], ["low_limit"], ["avg"], ["pre_close"], ["paused"], ["open_interest"]]
rows : 589680
cols : 14

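At first glance this looks extremely fast, far faster than Python!

However, the timing only covers opening the file: judging by the 197 KiB allocation, ParFile reads the file's metadata (column names, row and column counts) rather than the column data. A ParFile is also not as convenient to work with as a DataFrame. So this number cannot be compared directly with the Python read time.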

Using the ParquetFiles library to load the file into a DataFrame:

using DataFrames
using ParquetFiles
filename = raw"C:\Users\songroom\Desktop\test.parquet";
println("ParquetFiles : parquet => df")
@time df = load(filename) |> DataFrame

Result:

ParquetFiles : parquet => df
 21.306288 seconds (210.99 M allocations: 8.567 GiB, 8.11% gc time)

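As this shows, ParquetFiles and Parquet currently have serious, unresolved performance problems. Hopefully they get fixed! Others have reported the same performance problems in the GitHub issues. [Note to self, 2020-11-22: performance problem!]

(2) Cross-language interoperability: works
Reading a Python-written Parquet file with the Julia libraries currently works, as the experiment above shows. In other words, Parquet files are portable across languages, much like CSV files.

3. Testing Feather in Python
Python has a feather library. The code is as follows: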

import feather
print("now is feather....");
ft_path = "C:\\Users\\songroom\\Desktop\\test.feather"
time_7 = t.time()
feather.write_dataframe(df, ft_path)
time_8 = t.time()
print("feather write_dataframe : " ,time_8-time_7)
time_9 = t.time()
df2 = feather.read_dataframe(ft_path)
time_10 = t.time()
print("feather read_dataframe : ",time_10-time_9)

Output:

now is feather....
feather write_dataframe :  0.06961703300476074
feather read_dataframe :  0.1556386947631836

Alternatively, the code above can be replaced with the following, which does the same thing:

print("pyarrow.feather.....")
import pyarrow.feather as feather
time_11 = t.time()
feather.write_feather(df, ft_path)
time_12 = t.time()
print("write feather cost :",time_12-time_11)
time_13 = t.time()
feather.read_feather(ft_path);
time_14 = t.time()
print("read feather cost :",time_14-time_13)