When a CSV file is very large, reading the whole thing with pandas can be very slow. For example, here I have a file of about 5.77 GB:
!wc -l x.csv
The row count, 2,390,492, is also very large.
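If `wc` is not available (for example on Windows), a minimal stdlib sketch can count lines the same way without loading the file into memory; the filename is just a placeholder:

```python
def count_lines(path, chunk_size=1 << 20):
    """Count newline characters by streaming the file in binary chunks,
    so memory use stays constant regardless of file size."""
    lines = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            lines += chunk.count(b"\n")
    return lines

# Equivalent of `!wc -l x.csv`:
# print(count_lines("x.csv"))
```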
Loading x.csv with pandas took nearly 2 minutes. To speed this up, we will use the Python package datatable.
import datatable as dt
%%time
train_data_datatable = dt.fread('x.csv')
CPU times: user 27.6 s, sys: 3.31 s, total: 30.9 s
Wall time: 8.04 s
Convert the data to a pandas DataFrame:
%%time
train_data = train_data_datatable.to_pandas()
CPU times: user 7.04 s, sys: 3.37 s, total: 10.4 s
Wall time: 5.24 s
We have now loaded x.csv in about 13 seconds of wall time (8.04 s + 5.24 s), compared with nearly 2 minutes using pandas alone.
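Note that the `%%time` magic above only works inside IPython/Jupyter. In a plain Python script, the same wall-time measurement can be sketched with the stdlib; `dt.fread` in the usage comment assumes datatable is installed:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn, print its wall-clock time (similar to %%time), and return its result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"Wall time: {elapsed:.2f} s")
    return result

# Usage (assuming datatable is installed):
# import datatable as dt
# train_data = timed(lambda: dt.fread("x.csv").to_pandas())
```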
References:
- https://www.kaggle.com/carlmcbrideellis/jane-street-eda-of-day-0-and-feature-importance
- https://www.kaggle.com/rohanrao/tutorial-on-reading-large-datasets