Pandas File-Reading Efficiency: CSV vs. Pickle

Reading the CSV file

import pandas as pd
csv_path = 'gun_deaths_in_america.csv'
data_csv = pd.read_csv(csv_path,header=0)
data_csv.head()

data_csv.shape
(100798, 10)

%timeit pd.read_csv(csv_path,header=0)
114 ms ± 5.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
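
`%timeit` only works inside IPython or Jupyter. As a rough sketch of how the same measurement could be taken in a plain Python script, the standard-library `timeit` module can be used. The in-memory CSV text and the `load_csv` helper below are made up for illustration so the snippet runs without the dataset; they are not part of the original notebook.

```python
import io
import timeit

import pandas as pd

# Hypothetical in-memory CSV, so the sketch runs without the real file.
csv_text = "year,intent\n" + "2012,Suicide\n" * 1000


def load_csv():
    # io.StringIO stands in for a file path here.
    return pd.read_csv(io.StringIO(csv_text), header=0)


# 7 rounds of 10 calls each, mirroring "7 runs, 10 loops each" in the %timeit output.
runs = timeit.repeat(load_csv, repeat=7, number=10)
per_call_ms = [t / 10 * 1000 for t in runs]
print(f"fastest round: {min(per_call_ms):.2f} ms per call")
```

The absolute numbers depend on the machine and the file, so only the relative differences between formats are meaningful.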

Checking sizes

File size on disk

import os
os.stat('gun_deaths_in_america.csv').st_size  # size in bytes

4824404

Memory usage

data_csv.memory_usage(deep=True).sum()

30368107

Memory usage per column

  • object columns take up a lot of memory
  • int/float columns take up little memory

data_csv.memory_usage(deep=True)

Index             80
year          806384
month         806384
intent       6495168
police        806384
sex          6249476
age           806384
race         6322009
hispanic      806384
place        6463070
education     806384
dtype: int64

# Check the data type of each column
data_csv.dtypes

year           int64
month          int64
intent        object
police         int64
sex           object
age          float64
race          object
hispanic       int64
place         object
education    float64
dtype: object
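
To illustrate why the object columns dominate, here is a minimal sketch on a small made-up DataFrame (the column names are borrowed from the dataset, the values are not). It compares `memory_usage` with and without `deep=True`: without it, an object column is counted only as one 8-byte pointer per row; with it, the Python string objects behind those pointers are included as well.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: one numeric and one text column.
df = pd.DataFrame({
    "year": [2012, 2013, 2014] * 1000,
    "intent": pd.Series(["Suicide", "Homicide", "Accidental"] * 1000, dtype=object),
})

# deep=False: object columns are counted as pointers only.
shallow = df.memory_usage(deep=False, index=False)
# deep=True: the string objects themselves are counted too.
deep = df.memory_usage(deep=True, index=False)

print(pd.DataFrame({"shallow": shallow, "deep": deep}))
```

The int64 column costs 8 bytes per row either way, while the object column grows several times over once the strings are counted, which matches the pattern in the per-column figures above.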

Saving as a Pickle file

Saving directly as a Pickle file

Once written to disk, the pickle file is larger than the original CSV (5656925 bytes vs. 4824404 bytes).

pkl_path_before = 'gun_deaths_in_america_before_transform.pkl'
data_csv.to_pickle(pkl_path_before)
os.stat(pkl_path_before).st_size

5656925

Comparing read speed

Reading the pickle file is roughly three times as fast as reading the CSV (32.4 ms vs. 102 ms per loop).

%timeit pd.read_csv(csv_path,header=0)
102 ms ± 7.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.read_pickle(pkl_path_before)
32.4 ms ± 5.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
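
A plausible reason for the gap is that a pickle stores the DataFrame's data already typed, so `read_pickle` can skip the text parsing and type inference that `read_csv` has to do. The round trip can be sketched on a tiny made-up DataFrame written to a temporary directory; the file names here are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Tiny hypothetical DataFrame, just to exercise the round trip.
df = pd.DataFrame({"year": [2012, 2013], "intent": ["Suicide", "Homicide"]})

with tempfile.TemporaryDirectory() as tmp:
    csv_file = os.path.join(tmp, "sample.csv")
    pkl_file = os.path.join(tmp, "sample.pkl")

    df.to_csv(csv_file, index=False)
    df.to_pickle(pkl_file)

    from_csv = pd.read_csv(csv_file, header=0)  # parses text, infers dtypes
    from_pkl = pd.read_pickle(pkl_file)         # restores the stored object

# equals() checks values and dtypes together.
roundtrip_ok = from_pkl.equals(df)
print(roundtrip_ok)
```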

Saving as Pickle after type conversion

As seen above, object columns use a lot of memory. They can be converted to the category type.

data_csv.intent.astype('category').head()

0    Suicide
1    Suicide
2    Suicide
3    Suicide
4    Suicide
Name: intent, dtype: category
Categories (4, object): [Accidental, Homicide, Suicide, Undetermined]

Convert the intent column first. As object it took 6495168 bytes; as category it takes about 1/64 of that.

data_csv.intent.astype('category').memory_usage(deep=True)

101303
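
The saving comes from how a categorical column is laid out: each distinct value is stored once, and every row holds only a small integer code pointing at it. The sketch below shows this on a made-up series (values chosen for illustration, not taken from the dataset).

```python
import pandas as pd

# Hypothetical text column with only three distinct values.
intent = pd.Series(["Suicide", "Homicide", "Suicide", "Accidental"] * 1000, dtype=object)
as_cat = intent.astype("category")

codes = as_cat.cat.codes            # one small integer per row
categories = as_cat.cat.categories  # each distinct value stored once

print(categories.tolist())
print(codes.head(4).tolist(), codes.dtype)
print(intent.memory_usage(deep=True), as_cat.memory_usage(deep=True))
```

With fewer than 128 categories the codes fit in int8, so each row costs about one byte. That is consistent with the 101303 bytes reported above for 100798 rows.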

Convert every column to the category type

for col in data_csv.columns:
    data_csv[col] = data_csv[col].astype('category')

Check memory usage after conversion. Compared with 30368107 bytes before conversion, it is now about 30 times smaller.

data_csv.memory_usage(deep=True).sum()

1018587
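
The loop above converts every column, including numeric ones such as age. Once a numeric column is categorical, arithmetic like `mean()` no longer works on it directly. One possible alternative, not the approach used in this post, is to convert only the object columns. The sketch below assumes a small made-up DataFrame; it will generally save less memory than converting everything.

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column.
df = pd.DataFrame({
    "age": [34.0, 21.0, 60.0, 21.0],
    "sex": pd.Series(["M", "F", "M", "M"], dtype=object),
})

# Assumption: only text-like (object) columns should become categorical.
text_cols = df.select_dtypes(include="object").columns
converted = df.astype({col: "category" for col in text_cols})

print(converted.dtypes)
print(converted["age"].mean())  # numeric columns still support arithmetic
```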

Save the converted data as a pickle file and check its size on disk. Compared with the original CSV (4824404 bytes), the converted file is about 4.8 times smaller.

pkl_path_after = 'gun_deaths_in_america_after_transform.pkl'
data_csv.to_pickle(pkl_path_after)
os.stat(pkl_path_after).st_size

1012643

Compare read speed: the converted pickle loads about 41 times faster than the CSV (2.57 ms vs. 106 ms per loop).

%timeit pd.read_pickle(pkl_path_after)
2.57 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit pd.read_csv(csv_path,header=0)
106 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Overall comparison

files = [csv_path,pkl_path_before,pkl_path_after]

Comparing file size on disk

The converted pickle takes the least disk space, about 4.8 times less than the original CSV, which is very useful when storing large amounts of data.

for file in files:
    print('File size of {0}: {1} bytes'.format(file, os.stat(file).st_size))

File size of gun_deaths_in_america.csv: 4824404 bytes
File size of gun_deaths_in_america_before_transform.pkl: 5656925 bytes
File size of gun_deaths_in_america_after_transform.pkl: 1012643 bytes

Comparing read speed

The converted pickle reads about 45 times faster than the plain CSV (2.18 ms vs. 97.5 ms per loop in this run).

%timeit pd.read_csv(csv_path,header=0)
97.5 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.read_pickle(pkl_path_before)
28.5 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.read_pickle(pkl_path_after)
2.18 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Comparing memory usage

After conversion, memory usage is about 30 times smaller than before.

for file in files:
    if os.path.splitext(file)[1]=='.csv':
        print('memory_usage of the {0} is : {1}'. 
            format(file,pd.read_csv(file,header=0).memory_usage(deep=True).sum()))
    else:
        print('memory_usage of the {0} is : {1}'. 
            format(file,pd.read_pickle(file).memory_usage(deep=True).sum()))

memory_usage of the gun_deaths_in_america.csv is : 30368107
memory_usage of the gun_deaths_in_america_before_transform.pkl is : 30368107
memory_usage of the gun_deaths_in_america_after_transform.pkl is : 1010827
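
As a quick arithmetic check, the ratios can be recomputed directly from the figures printed in this section. The numbers below are copied from the outputs above; only the variable names are new.

```python
# Figures copied from the outputs reported in this section.
csv_size, pkl_before_size, pkl_after_size = 4824404, 5656925, 1012643  # bytes on disk
csv_ms, pkl_before_ms, pkl_after_ms = 97.5, 28.5, 2.18                 # mean read time per loop
mem_before, mem_after = 30368107, 1010827                              # bytes in memory

size_ratio = csv_size / pkl_after_size   # CSV vs. converted pickle on disk
speedup_plain = csv_ms / pkl_before_ms   # CSV vs. unconverted pickle
speedup_cat = csv_ms / pkl_after_ms      # CSV vs. converted pickle
mem_ratio = mem_before / mem_after       # memory before vs. after conversion

print(f"disk: {size_ratio:.1f}x smaller")
print(f"read: {speedup_plain:.1f}x (plain pickle), {speedup_cat:.1f}x (category pickle)")
print(f"memory: {mem_ratio:.1f}x smaller")
```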

The data read back is the same in all three cases; only the dtypes differ.

pd.read_csv(csv_path,header=0).head(2)

pd.read_pickle(pkl_path_before).head(2)

pd.read_pickle(pkl_path_after).head(2)
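
Rather than comparing the `head(2)` output by eye, the "same values, different dtypes" claim could be checked programmatically. This is a minimal sketch on a made-up two-row DataFrame: it casts the categorical version back to the original dtypes and compares.

```python
import pandas as pd

# Hypothetical plain DataFrame and its all-category counterpart.
plain = pd.DataFrame({
    "intent": pd.Series(["Suicide", "Homicide"], dtype=object),
    "year": [2012, 2013],
})
categorical = plain.astype("category")

# Cast back to the original dtypes, then compare values and dtypes together.
same_values = categorical.astype(plain.dtypes.to_dict()).equals(plain)

# Compare dtype names column by column.
plain_types = [str(t) for t in plain.dtypes]
cat_types = [str(t) for t in categorical.dtypes]

print(same_values, plain_types, cat_types)
```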
