python: Tips for Reading and Saving Large Datasets

养生的控制人

PhD student in Control Science and Engineering, Zhejiang University
Source: https://www.kaggle.com/rohanrao/tutorial-on-reading-large-datasets

This post uses the recent Riiid competition on Kaggle as an example to introduce several techniques for reading large datasets, because at this scale a plain pd.read_csv can easily run out of memory.

Different libraries read data differently; the following are covered here:

pandas
Dask
Datatable
Rapids

Besides reading directly from a csv file, another common practice is to convert the dataset into a different format so that it takes up less disk space and loads faster. The formats considered here are:

csv
feather
hdf5
jay
parquet
pickle

Different methods suit different scenarios, and none is always the best; choose according to your needs. If you would rather skip straight to the conclusion, see the recommendation at the end.

Pandas

import pandas as pd
%%time

dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "boolean",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32", 
    "prior_question_had_explanation": "boolean"
}

data = pd.read_csv("../input/riiid-test-answer-prediction/train.csv", dtype=dtypes)

print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 8min 11s, sys: 10.8 s, total: 8min 22s
Wall time: 8min 22s
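The dtypes dict above is what keeps the pandas read from blowing up memory. As a rough illustration of why (a minimal sketch on synthetic data, not the Riiid file), downcasting int64 columns to the smallest type that fits cuts the footprint substantially:

```python
import numpy as np
import pandas as pd

# synthetic stand-in for a large training table: pandas defaults to int64
df = pd.DataFrame({
    "user_id": np.random.randint(0, 1_000_000, size=100_000),     # fits in int32
    "content_id": np.random.randint(0, 30_000, size=100_000),     # fits in int16
    "answered_correctly": np.random.randint(0, 2, size=100_000),  # fits in int8
})

before = df.memory_usage(deep=True).sum()
df = df.astype({"user_id": "int32", "content_id": "int16", "answered_correctly": "int8"})
after = df.memory_usage(deep=True).sum()

# roughly 2.4 MB down to 0.7 MB for these three columns
print(f"before: {before / 1e6:.1f} MB, after: {after / 1e6:.1f} MB")
```

Passing the same mapping as `dtype=` to `read_csv` applies the narrow types during parsing, so the int64 intermediates are never allocated at all.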

Dask

import dask.dataframe as dd
%%time

dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "boolean",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32", 
    "prior_question_had_explanation": "boolean"
}

data = dd.read_csv("../input/riiid-test-answer-prediction/train.csv", dtype=dtypes).compute()

print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 9min 24s, sys: 28.8 s, total: 9min 52s
Wall time: 7min 41s
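Note that calling .compute() immediately, as above, still materializes the full frame in memory; Dask's advantage shows when operations stay lazy. If Dask is not available at all, pandas' own `chunksize` parameter gives a similar streaming read, so the whole file never has to fit in memory at once. A minimal sketch on a small temporary csv (not the Riiid data):

```python
import os
import tempfile
import pandas as pd

# write a small csv to stand in for a large one
tmp = os.path.join(tempfile.mkdtemp(), "train.csv")
pd.DataFrame({
    "user_id": range(1000),
    "answered_correctly": [i % 2 for i in range(1000)],
}).to_csv(tmp, index=False)

# stream the file 100 rows at a time, aggregating as we go
total, correct = 0, 0
for chunk in pd.read_csv(tmp, chunksize=100,
                         dtype={"user_id": "int32", "answered_correctly": "int8"}):
    total += len(chunk)
    correct += chunk["answered_correctly"].sum()

print(total, correct / total)  # 1000 0.5
```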

Datatable

import datatable as dt
%%time

data = dt.fread("../input/riiid-test-answer-prediction/train.csv")

print("Train size:", data.shape)

Train size: (101230332, 10)
CPU times: user 52.5 s, sys: 18.4 s, total: 1min 10s
Wall time: 20.5 s

Rapids

import cudf
%%time

data = cudf.read_csv("../input/riiid-test-answer-prediction/train.csv")

print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 4.58 s, sys: 3.31 s, total: 7.89 s
Wall time: 30.7 s

File formats

First, here is how to save the data in each of these formats.

Read the csv with datatable and convert to pandas:

# read the csv with datatable and convert to a pandas DataFrame
data = dt.fread("../input/riiid-test-answer-prediction/train.csv").to_pandas()

Write the data as csv:

data.to_csv("riiid_train.csv", index=False)

Write the data as hdf5:

data.to_hdf("riiid_train.h5", "riiid_train")

Write the data as feather:

data.to_feather("riiid_train.feather")

Write the data as parquet:

data.to_parquet("riiid_train.parquet")

Write the data as pickle:

data.to_pickle("riiid_train.pkl.gzip")

(Note: pandas only infers gzip compression from a `.gz` suffix, so despite its name this file is written uncompressed.)

Write the data as jay:

dt.Frame(data).to_jay("riiid_train.jay")

Now compare how read times differ across these formats.

csv

%%time

dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "boolean",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32", 
    "prior_question_had_explanation": "boolean"
}

data = pd.read_csv("../input/riiid-test-answer-prediction/train.csv", dtype=dtypes)

print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 8min 36s, sys: 11.3 s, total: 8min 48s
Wall time: 8min 49s
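Another way to keep a csv read tractable in pandas is to load only the columns and rows you actually need; `usecols` and `nrows` are standard `read_csv` parameters. A minimal sketch on a temporary file:

```python
import os
import tempfile
import pandas as pd

tmp = os.path.join(tempfile.mkdtemp(), "train.csv")
pd.DataFrame({
    "row_id": range(500),
    "user_id": range(500),
    "user_answer": [i % 4 for i in range(500)],
}).to_csv(tmp, index=False)

# read only two of the three columns, and only the first 100 rows
data = pd.read_csv(tmp, usecols=["user_id", "user_answer"], nrows=100)
print(data.shape)  # (100, 2)
```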

feather

%%time

data = pd.read_feather("../input/riiid-train-data-multiple-formats/riiid_train.feather")

print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 2.59 s, sys: 8.91 s, total: 11.5 s
Wall time: 5.19 s

hdf5

%%time

data = pd.read_hdf("../input/riiid-train-data-multiple-formats/riiid_train.h5", "riiid_train")

print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 8.16 s, sys: 10.7 s, total: 18.9 s
Wall time: 19.8 s

jay

%%time

data = dt.fread("../input/riiid-train-data-multiple-formats/riiid_train.jay")

print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 4.88 ms, sys: 7.35 ms, total: 12.2 ms
Wall time: 38 ms

The jay file opens in milliseconds because datatable memory-maps it instead of parsing and copying the data up front.

parquet

%%time

data = pd.read_parquet("../input/riiid-train-data-multiple-formats/riiid_train.parquet")

print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 29.9 s, sys: 20.5 s, total: 50.4 s
Wall time: 27.3 s

pickle

%%time

data = pd.read_pickle("../input/riiid-train-data-multiple-formats/riiid_train.pkl.gzip")

print("Train size:", data.shape)
Train size: (101230332, 10)
CPU times: user 5.65 s, sys: 7.08 s, total: 12.7 s
Wall time: 15 s

My recommendation: when working with a large table, read it with Datatable, and store it as jay or feather.
