This is probably because the rename method is applied to every partition of the dataframe, and I believe that per-partition work accounts for the overhead of dd.rename.
Consider this:

In [45]: %time (dd.demo.daily_stock('GOOG', '2008', '2010', freq='1s',
    ...:  random_state=1234).repartition(npartitions=1).rename(columns={col:
    ...:  col.upper() for col in df.columns}).CLOSE.mean().compute())
CPU times: user 11.7 s, sys: 4.65 s, total: 16.3 s
Wall time: 9.23 s
Out[45]: 450.46079905299979
In [46]: %time (dd.demo.daily_stock('GOOG', '2008', '2010', freq='1s',
random_state=1234).repartition(npartitions=1).close.mean().compute())
CPU times: user 11.3 s, sys: 4.63 s, total: 15.9 s
Wall time: 8.8 s
Out[46]: 450.46079905299979
With npartitions set to 1, the rename overhead no longer shows up the way it did in the original example.
Update 1: add a parquet example
Update 2: specify columns explicitly
Following Matt's answer above, avoiding reading all the columns of the parquet file looks like this:

%time dd.read_parquet('df', columns=['close']).rename(columns={'close': 'CLOSE'}).CLOSE.mean().compute()
CPU times: user 4.65 s, sys: 801 ms, total: 5.45 s
Wall time: 2.71 s
which is comparable to:

%time dd.read_parquet('df', columns=['close']).close.mean().compute()
CPU times: user 4.46 s, sys: 795 ms, total: 5.25 s
Wall time: 2.51 s
Out[110]: 450.46079905300002
Aside: rename plus task scheduling has about 40 ms of overhead on a single data partition on my machine:

In [114]: %timeit -n 3 dd.read_parquet('df', columns=['close']).repartition(npartitions=1).rename(columns={
     ...: 'close': 'CLOSE'}).CLOSE.mean().compute()
3 loops, best of 3: 2.36 s per loop

In [115]: %timeit -n 3 dd.read_parquet('df', columns=['close']).repartition(npartitions=1).close.mean().compu
     ...: te()
3 loops, best of 3: 2.32 s per loop
Applied across 500 partitions, that comes to roughly 20 seconds. Leaving this here in case it helps someone in the future.
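The 20-second figure follows directly from the per-partition timings above; a quick back-of-the-envelope check:

```python
# ~40 ms of rename + task-scheduling overhead per partition
# (2.36 s vs 2.32 s measured above), multiplied across 500 partitions.
per_partition_s = 2.36 - 2.32   # ~0.04 s per partition
total_s = per_partition_s * 500
print(round(total_s, 1))        # ~20.0 seconds in aggregate
```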