Speeding up CSV reading in Python: accelerating pandas read_csv

weixin_39950081

Tags: python, speeding up CSV reading

Article link: https://blog.csdn.net/weixin_39950081/article/details/114913545

Let's try it out!

Generating the data:

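import numpy as np
import pandas as pd

sz = 10**3

# two random integer columns
df = pd.DataFrame(np.random.randint(0, 10**6, (sz, 2)), columns=['i1', 'i2'])
# two datetime columns
df['date'] = pd.date_range('2000-01-01', freq='1S', periods=len(df))
df['dt2'] = pd.date_range('1980-01-01', freq='999S', periods=len(df))
# two float columns
df['f1'] = np.random.rand(len(df))
df['f2'] = np.random.rand(len(df))

# generate 10 string columns
# (rands_array is a private pandas testing helper; it may be unavailable in newer releases)
for i in range(1, 11):
    df['s{}'.format(i)] = pd.util.testing.rands_array(10, len(df))

# repeat the 1,000 rows 1,000 times (1,000,000 rows in total) and shuffle them
df = pd.concat([df] * 10**3, ignore_index=True).sample(frac=1)
# use the sorted 'date' column as the index
df = df.set_index(df.pop('date').sort_values())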
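The string columns above rely on `pd.util.testing.rands_array`, a private testing helper that may be missing from newer pandas releases. If it is unavailable, a stand-in along the following lines should work. This is only a sketch that assumes NumPy alone, and the name `random_strings` is hypothetical, not part of pandas:

```python
import string

import numpy as np


def random_strings(n_chars, size, seed=None):
    """Return `size` random alphanumeric strings, each `n_chars` long.

    Hypothetical stand-in for pandas' private rands_array helper.
    """
    rng = np.random.default_rng(seed)
    alphabet = np.array(list(string.ascii_letters + string.digits))
    # Draw a (size, n_chars) grid of single characters, then join each row.
    chars = rng.choice(alphabet, size=(size, n_chars))
    return np.array(["".join(row) for row in chars], dtype=object)
```

Inside the loop it would be used as `df['s{}'.format(i)] = random_strings(10, len(df))`.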

This produces the following DataFrame:

In [59]: df

Out[59]:

i1 i2 dt2 f1 ... s7 s8 s9 s10

date ...

2000-01-01 00:00:00 216625 4179 1980-01-04 04:35:24 0.679989 ... 7G8rLnoocA E7Ot7oPsJ6 puQamLn0I2 zxHrATQn0m

2000-01-01 00:00:00 374740 967991 1980-01-09 11:07:48 0.202064 ... wLETO2g8uL MhtzNLPXCH PW1uKxY0df wTakdCe6nK

2000-01-01 00:00:00 152181 627451 1980-01-10 11:49:39 0.956117 ... mXOsfUPqOy 6IIst7UFDT nL6XZxrT3r BxPCFNdZTK

2000-01-01 00:00:00 915732 730737 1980-01-06 10:25:30 0.854145 ... Crh94m085p M1tbrorxGT XWSKk3b8Pv M9FWQtPzaa

2000-01-01 00:00:00 590262 248378 1980-01-06 11:48:45 0.307373 ... wRnMPxeopd JF24uTUwJC 2CRrs9yB2N hxYrXFnT1H

2000-01-01 00:00:00 161183 620876 1980-01-08 21:48:36 0.207536 ... cyN0AExPO2 POaldI6Y0l TDc13rPdT0 xgoDOW8Y1L

2000-01-01 00:00:00 589696 784856 1980-01-12 02:07:21 0.909340 ... GIRAAVBRpj xwcnpwFohz wqcoTMjQ4S GTcIWXElo7

... ... ... ... ... ... ... ... ... ...

2000-01-01 00:16:39 773606 205714 1980-01-12 07:40:21 0.895944 ... HEkXfD7pku 1ogy12wBom OT3KmQRFGz Dp1cK5R4Gq

2000-01-01 00:16:39 915732 730737 1980-01-06 10:25:30 0.854145 ... Crh94m085p M1tbrorxGT XWSKk3b8Pv M9FWQtPzaa

2000-01-01 00:16:39 990722 567886 1980-01-03 05:50:06 0.676511 ... gVO3g0I97R yCqOhTVeEi imCCeQa0WG 9tslOJGWDJ

2000-01-01 00:16:39 531778 438944 1980-01-04 20:07:48 0.190714 ... rbLmkbnO5G ATm3BpWLC0 moLkyY2Msc 7A2UJERrBG

2000-01-01 00:16:39 880791 245911 1980-01-02 15:57:36 0.014967 ... bZuKNBvrEF K84u9HyAmG 4yy2bsUVNn WZQ5Vvl9zD

2000-01-01 00:16:39 239866 425516 1980-01-10 05:26:42 0.667183 ... 6xukg6TVah VEUz4d92B8 zHDxty6U3d ItztnI5LmJ

2000-01-01 00:16:39 338368 804695 1980-01-12 05:27:09 0.084818 ... NM4fdjKBuW LXGUbLIuw9 SHdpnttX6q 4oXKMsaOJ5

[1000000 rows x 15 columns]

In [60]: df.shape

Out[60]: (1000000, 15)

In [61]: df.info()

<class 'pandas.core.frame.DataFrame'>

DatetimeIndex: 1000000 entries, 2000-01-01 00:00:00 to 2000-01-01 00:16:39

Data columns (total 15 columns):

i1 1000000 non-null int32

i2 1000000 non-null int32

dt2 1000000 non-null datetime64[ns]

f1 1000000 non-null float64

f2 1000000 non-null float64

s1 1000000 non-null object

s2 1000000 non-null object

s3 1000000 non-null object

s4 1000000 non-null object

s5 1000000 non-null object

s6 1000000 non-null object

s7 1000000 non-null object

s8 1000000 non-null object

s9 1000000 non-null object

s10 1000000 non-null object

dtypes: datetime64[ns](1), float64(2), int32(2), object(10)

memory usage: 114.4+ MB

#print(df.shape)

#print(df.info())

Let's write it to disk in several formats (CSV, HDF5 fixed, HDF5 table, Feather):

# CSV

df.to_csv('c:/tmp/test.csv')

# HDF5 table format

df.to_hdf('c:/tmp/test.h5', 'test', format='t')

# HDF5 fixed format

df.to_hdf('c:/tmp/test_fix.h5', 'test')

# Feather format

import feather

feather.write_dataframe(df, 'c:/tmp/test.feather')

Timing:

Now we can measure how long reading each file back from disk takes:

In [54]: # CSV

...: %timeit pd.read_csv('c:/tmp/test.csv', parse_dates=['date', 'dt2'], index_col=0)

1 loop, best of 3: 12.3 s per loop # 3rd place

In [55]: # HDF5 fixed format

...: %timeit pd.read_hdf('c:/tmp/test_fix.h5', 'test')

1 loop, best of 3: 1.85 s per loop # 1st place

In [56]: # HDF5 table format

...: %timeit pd.read_hdf('c:/tmp/test.h5', 'test')

1 loop, best of 3: 24.2 s per loop # 4th place

In [57]: # Feather

...: %timeit feather.read_dataframe('c:/tmp/test.feather')

1 loop, best of 3: 3.21 s per loop # 2nd place
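The `%timeit` figures above report the best of three runs. Outside IPython, a comparable measurement can be sketched with the standard library alone. The helper name `best_of` is hypothetical, and the numbers it produces will depend on the machine, disk, and library versions:

```python
import time


def best_of(func, repeat=3):
    """Call func() `repeat` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Example, using the paths from this article:
# import pandas as pd
# best_of(lambda: pd.read_csv('c:/tmp/test.csv', parse_dates=['date', 'dt2'], index_col=0))
# best_of(lambda: pd.read_hdf('c:/tmp/test_fix.h5', 'test'))
```

Taking the minimum rather than the mean mirrors what `%timeit` printed here and reduces the influence of one-off slowdowns such as a cold disk cache.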

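If you do not always need to read all of the data, it makes sense to store it in HDF5 table format (and to use the data_columns parameter to index the columns that will be used for filtering).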
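As a rough illustration of that last point, the sketch below writes a DataFrame in table format with one indexed column and then reads back only the matching rows. It assumes PyTables is installed. The function names, the choice of `i1` as the data column, and the `i1 > 500000` filter are all hypothetical:

```python
import pandas as pd


def write_indexed_table(df, path, key="test", data_columns=("i1",)):
    """Store df in HDF5 table format, making the given columns queryable on disk."""
    df.to_hdf(path, key=key, format="table", data_columns=list(data_columns))


def read_filtered(path, key="test", where="i1 > 500000"):
    """Read back only the rows matching `where`; the filter is applied by PyTables."""
    return pd.read_hdf(path, key=key, where=where)
```

For example, `write_indexed_table(df, 'c:/tmp/test_dc.h5')` followed by `read_filtered('c:/tmp/test_dc.h5')` would load only the rows whose `i1` exceeds 500000 instead of the whole file. Whether this beats a full read depends on how selective the filter is.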
