Tips for Reading and Storing Large Datasets
养生的控制人
PhD student in Control Science and Engineering, Zhejiang University
Source: https://www.kaggle.com/rohanrao/tutorial-on-reading-large-datasets
This article uses the recent Riiid competition on Kaggle as an example to introduce several techniques for reading large datasets, because at this scale a plain pd.read_csv will easily run out of memory.
Different libraries read data in different ways; the following are covered here:
pandas
Dask
Datatable
Rapids
Besides reading from a CSV file, another common practice is to convert the dataset to a different format that takes up less disk space and reads faster. The following formats are considered:
csv
feather
hdf5
jay
parquet
pickle
Different scenarios suit different methods, and no single method is always best, so choose according to your needs. If you would rather skip straight to the conclusion, jump to the recommendation at the end.
Libraries
Pandas
import pandas as pd
%%time
dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "boolean",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32",
    "prior_question_had_explanation": "boolean"
}
data = pd.read_csv("../input/riiid-test-answer-prediction/train.csv", dtype=dtypes)
print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 8min 11s, sys: 10.8 s, total: 8min 22s
Wall time: 8min 22s
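Specifying dtypes up front already cuts pandas' memory use considerably, but the full table still has to fit in RAM at once. If it doesn't, pandas can also stream the file in chunks; below is a minimal sketch (the chunk size of 10M rows is an arbitrary choice, and the loop body is a placeholder for whatever filtering or aggregation you actually need):
import pandas as pd

# Stream the CSV in 10M-row pieces so peak memory stays bounded.
# Reuses the dtypes dict defined above.
n_rows = 0
for chunk in pd.read_csv(
    "../input/riiid-test-answer-prediction/train.csv",
    dtype=dtypes,
    chunksize=10_000_000,
):
    # placeholder: do per-chunk work here (filtering, aggregation, ...)
    n_rows += len(chunk)
print("Total rows:", n_rows)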
Dask
import dask.dataframe as dd
%%time
dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "boolean",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32",
    "prior_question_had_explanation": "boolean"
}
data = dd.read_csv("../input/riiid-test-answer-prediction/train.csv", dtype=dtypes).compute()
print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 9min 24s, sys: 28.8 s, total: 9min 52s
Wall time: 7min 41s
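One thing to keep in mind: Dask builds a lazy task graph, and it is the .compute() call above that materializes the whole table as a single in-memory pandas DataFrame. If all you need is an aggregate, you can keep the pipeline lazy so that only the result is ever held in memory. A minimal sketch, reusing the dtypes dict from above (the per-user accuracy is just an illustrative aggregation):
import dask.dataframe as dd

# Nothing is read until .compute(); only the aggregated result
# (one value per user) is materialized, not the full table.
ddf = dd.read_csv("../input/riiid-test-answer-prediction/train.csv", dtype=dtypes)
per_user = ddf.groupby("user_id")["answered_correctly"].mean().compute()
print(per_user.head())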
Datatable
import datatable as dt
%%time
data = dt.fread("../input/riiid-test-answer-prediction/train.csv")
print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 52.5 s, sys: 18.4 s, total: 1min 10s
Wall time: 20.5 s
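Two practical caveats: dt.fread returns a datatable Frame, not a pandas DataFrame, and converting to pandas costs additional time and memory on top of the 20s above. A minimal sketch (the column subset is purely illustrative):
import datatable as dt

data = dt.fread("../input/riiid-test-answer-prediction/train.csv")

# Column selection uses datatable's own [rows, columns] syntax ...
subset = data[:, ["user_id", "answered_correctly"]]

# ... and pandas interop needs an explicit, non-free conversion.
df = subset.to_pandas()
print("Subset size:", df.shape)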
Rapids
import cudf
%%time
data = cudf.read_csv("../input/riiid-test-answer-prediction/train.csv")
print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 4.58 s, sys: 3.31 s, total: 7.89 s
Wall time: 30.7 s
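Rapids' cudf parses the file on the GPU, so it is only an option when a GPU is available and the dataset fits in GPU memory; anything downstream that expects pandas needs an explicit transfer back to the host. A minimal sketch:
import cudf

# Parse on the GPU, then copy the result back to host memory.
gdf = cudf.read_csv("../input/riiid-test-answer-prediction/train.csv")
df = gdf.to_pandas()  # host transfer: costs both time and RAM
print("Train size:", df.shape)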
File formats
First, here is how to save the data in each of the different formats.
Read the CSV with datatable and convert it to a pandas DataFrame:
data = dt.fread("../input/riiid-test-answer-prediction/train.csv").to_pandas()
Write the data as CSV:
data.to_csv("riiid_train.csv", index=False)
Write the data as HDF5 (the second argument is the key under which the table is stored):
data.to_hdf("riiid_train.h5", "riiid_train")
Write the data as feather:
data.to_feather("riiid_train.feather")
Write the data as parquet:
data.to_parquet("riiid_train.parquet")
Write the data as pickle (note: pandas infers gzip compression from a ".gz" suffix, not ".gzip", so despite its name this file is written uncompressed unless compression is passed explicitly):
data.to_pickle("riiid_train.pkl.gzip")
Write the data as jay:
dt.Frame(data).to_jay("riiid_train.jay")
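Since smaller disk footprint was one of the motivations for converting formats, it is worth comparing the resulting file sizes too. A small sketch using only the standard library (the filenames match the writes above):
import os

files = [
    "riiid_train.csv",
    "riiid_train.h5",
    "riiid_train.feather",
    "riiid_train.parquet",
    "riiid_train.pkl.gzip",
    "riiid_train.jay",
]
for name in files:
    # Report each file's on-disk size in GB.
    print(f"{name}: {os.path.getsize(name) / 1024 ** 3:.2f} GB")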
Next, let's look at how read times differ across formats.
csv
%%time
dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "boolean",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32",
    "prior_question_had_explanation": "boolean"
}
data = pd.read_csv("../input/riiid-test-answer-prediction/train.csv", dtype=dtypes)
print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 8min 36s, sys: 11.3 s, total: 8min 48s
Wall time: 8min 49s
feather
%%time
data = pd.read_feather("../input/riiid-train-data-multiple-formats/riiid_train.feather")
print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 2.59 s, sys: 8.91 s, total: 11.5 s
Wall time: 5.19 s
hdf5
%%time
data = pd.read_hdf("../input/riiid-train-data-multiple-formats/riiid_train.h5", "riiid_train")
print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 8.16 s, sys: 10.7 s, total: 18.9 s
Wall time: 19.8 s
jay
%%time
data = dt.fread("../input/riiid-train-data-multiple-formats/riiid_train.jay")
print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 4.88 ms, sys: 7.35 ms, total: 12.2 ms
Wall time: 38 ms
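The jay number looks almost too good to be true, and in a sense it is: .jay files are memory-mapped, so dt.fread only has to map the file rather than parse or copy it, and the real cost is deferred until the data is actually used. In particular, converting the Frame to pandas for downstream work still takes a noticeable amount of time, roughly as sketched:
import datatable as dt

# Near-instant memory-mapped open ...
frame = dt.fread("../input/riiid-train-data-multiple-formats/riiid_train.jay")

# ... but materializing it as pandas is where the time goes.
df = frame.to_pandas()
print("Train size:", df.shape)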
parquet
%%time
data = pd.read_parquet("../input/riiid-train-data-multiple-formats/riiid_train.parquet")
print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 29.9 s, sys: 20.5 s, total: 50.4 s
Wall time: 27.3 s
pickle
%%time
data = pd.read_pickle("../input/riiid-train-data-multiple-formats/riiid_train.pkl.gzip")
print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 5.65 s, sys: 7.08 s, total: 12.7 s
Wall time: 15 s
Personal recommendation: when the table you are dealing with is fairly large, use Datatable for reading, and use jay or feather for storing the data.
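Putting that together: pay the CSV parsing cost once with datatable, persist the result as jay, and every later run reloads almost instantly. A minimal end-to-end sketch:
import datatable as dt

# One-off: parse the raw CSV and persist it as .jay.
dt.fread("../input/riiid-test-answer-prediction/train.csv").to_jay("riiid_train.jay")

# Every subsequent run: memory-mapped reload, convert if pandas is needed.
data = dt.fread("riiid_train.jay").to_pandas()
print("Train size:", data.shape)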