Reducing pandas memory usage

Kevin Davis

Last modified: 2022-08-23 13:50:24

First published: 2022-08-23 13:30:03

Article link: https://blog.csdn.net/weixin_44590417/article/details/126480257

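Datasets in data-mining competitions are usually large, often several GB. Reading all of the data at once can easily exhaust memory, which is a real problem on machines with little RAM. The solution comes first and the explanation after, so take whichever part you need.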

1. Solution

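Common remedies include reading the data in chunks and reducing its memory footprint. This section shows how to shrink the memory a DataFrame uses once pandas has read it from a CSV file.

Note: the code below comes from 鱼佬 of Datawhale; I only explain it.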

import numpy as np
import pandas as pd


def reduce_mem_usage(df, verbose=True):
    """Downcast numeric columns to the smallest dtype that covers their value range."""
    # int8 is not listed because it is already the smallest integer type.
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    # Total memory before conversion, in MB.
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # The iinfo bounds are inclusive, so compare with >= and <=.
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                # Otherwise the column needs the int64 range and is left unchanged.
            else:
                # Only the range is checked here. float16 keeps roughly 3 significant
                # decimal digits, so values can be rounded by this conversion.
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    # Total memory after conversion, in MB.
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(
            end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

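The input df is a pandas DataFrame. verbose is a bool that controls whether the function prints a summary (the memory usage after conversion and the percentage saved).

The idea is to pick a suitable dtype from each column's minimum and maximum. For example, if a column of df holds integers with a maximum of 1 and a minimum of 0, the function converts it to int8, the smallest integer type, which needs 1 byte per value instead of the 8 bytes of int64.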
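
The same range-based idea can be tried on a single column with pandas' built-in `pd.to_numeric(..., downcast=...)`. This is a minimal sketch, not part of the original function; the Series names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical example column: integers that only take the values 0 and 1.
flags = pd.Series([0, 1, 1, 0, 1], dtype="int64")

# downcast="integer" asks pandas for the smallest signed integer dtype
# that can hold every value in the column.
small_flags = pd.to_numeric(flags, downcast="integer")

# Hypothetical float column: downcast="float" goes no lower than float32.
prices = pd.Series([0.5, 12.25, 99.0], dtype="float64")
small_prices = pd.to_numeric(prices, downcast="float")

print(flags.dtype, "->", small_flags.dtype)
print(prices.dtype, "->", small_prices.dtype)
```

Unlike the function above, this route never produces float16, so it avoids the coarse rounding of that type at the cost of a smaller saving on float columns.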

Usage example:

Here is how I used the function in a recommendation competition. Replace data_ads with your own DataFrame.

import pandas as pd

# Assumes reduce_mem_usage from the listing above has already been defined.
data_ads = pd.read_csv('./data_ads.csv')

print('Before compression, data_ads uses {:.2f} MB'.format(data_ads.memory_usage().sum() / 1024**2))

data_ads = reduce_mem_usage(data_ads)

print('After compression, data_ads uses {:.2f} MB'.format(data_ads.memory_usage().sum() / 1024**2))

Output:

Before compression, data_ads uses 2376.23 MB
Mem. usage decreased to 618.81 Mb (74.0% reduction)
After compression, data_ads uses 618.81 MB

Before compression data_ads took up about 2.3 GB of memory; afterwards it took about 0.6 GB, a 74% reduction. Very elegant!
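
When the value ranges are already known, the dtypes can also be given to `pd.read_csv` directly through its `dtype` parameter, so the wide int64/float64 columns are never created. The sketch below is only an illustration: the CSV content, column names, and chosen dtypes are assumptions, and an in-memory string stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = "user_id,clicked,score\n1,0,0.5\n2,1,0.25\n3,1,0.75\n"

# Default read: pandas picks int64 and float64.
default_df = pd.read_csv(io.StringIO(csv_text))

# Same data with explicit, narrower dtypes (assumed to fit the real value ranges).
compact_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"user_id": "int32", "clicked": "int8", "score": "float32"},
)

# index=False excludes the index so that only column data is counted.
default_bytes = default_df.memory_usage(index=False).sum()
compact_bytes = compact_df.memory_usage(index=False).sum()

print(default_df.dtypes.to_dict())
print(compact_df.dtypes.to_dict())
print(default_bytes, "bytes ->", compact_bytes, "bytes")
```

This only works if every value really fits the declared type, so it suits columns whose ranges are documented or were checked on a sample first.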

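2. How it works

First, let's look at the value ranges of NumPy's integer and floating-point types, along with the size that sys.getsizeof reports for a single scalar of each type: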

import sys

import numpy as np

# sys.getsizeof measures a standalone scalar object, including Python object overhead.
ints = [np.uint8(1), np.uint16(1), np.int8(1), np.int16(1), np.int32(1), np.int64(1)]
for i in ints:
    print(f'{type(i)} occupies {sys.getsizeof(i)} bytes')
    print(np.iinfo(i))

floats = [np.float16(1), np.float32(1), np.float64(1)]
for i in floats:
    print(f'{type(i)} occupies {sys.getsizeof(i)} bytes')
    print(np.finfo(i))

Output:

<class 'numpy.uint8'> occupies 25 bytes
Machine parameters for uint8
---------------------------------------------------------------
min = 0
max = 255
---------------------------------------------------------------

<class 'numpy.uint16'> occupies 26 bytes
Machine parameters for uint16
---------------------------------------------------------------
min = 0
max = 65535
---------------------------------------------------------------

<class 'numpy.int8'> occupies 25 bytes
Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------

<class 'numpy.int16'> occupies 26 bytes
Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------

<class 'numpy.int32'> occupies 28 bytes
Machine parameters for int32
---------------------------------------------------------------
min = -2147483648
max = 2147483647
---------------------------------------------------------------

<class 'numpy.int64'> occupies 32 bytes
Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------

<class 'numpy.float16'> occupies 26 bytes
Machine parameters for float16
---------------------------------------------------------------
precision =   3   resolution = 1.00040e-03
machep =    -10   eps =        9.76562e-04
negep =     -11   epsneg =     4.88281e-04
minexp =    -14   tiny =       6.10352e-05
maxexp =     16   max =        6.55040e+04
nexp =        5   min =        -max
smallest_normal = 6.10352e-05   smallest_subnormal = 5.96046e-08
---------------------------------------------------------------

<class 'numpy.float32'> occupies 28 bytes
Machine parameters for float32
---------------------------------------------------------------
precision =   6   resolution = 1.0000000e-06
machep =    -23   eps =        1.1920929e-07
negep =     -24   epsneg =     5.9604645e-08
minexp =   -126   tiny =       1.1754944e-38
maxexp =    128   max =        3.4028235e+38
nexp =        8   min =        -max
smallest_normal = 1.1754944e-38   smallest_subnormal = 1.4012985e-45
---------------------------------------------------------------

<class 'numpy.float64'> occupies 32 bytes
Machine parameters for float64
---------------------------------------------------------------
precision =  15   resolution = 1.0000000000000001e-15
machep =    -52   eps =        2.2204460492503131e-16
negep =     -53   epsneg =     1.1102230246251565e-16
minexp =  -1022   tiny =       2.2250738585072014e-308
maxexp =   1024   max =        1.7976931348623157e+308
nexp =       11   min =        -max
smallest_normal = 2.2250738585072014e-308   smallest_subnormal = 4.9406564584124654e-324
---------------------------------------------------------------

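The results are summarized in the table below. Keep in mind that sys.getsizeof reports the size of a standalone NumPy scalar object, which includes a fixed object overhead. Inside an array or a DataFrame column, each value takes only the dtype's item size, shown in the last column.

Data type	Lower bound (inclusive)	Upper bound (inclusive)	sys.getsizeof of one scalar (bytes)	Bytes per element in an array
uint8	0	255 ($2^{8}-1$)	25	1
uint16	0	65535 ($2^{16}-1$)	26	2
int8	-128 ($-2^{7}$)	127 ($2^{7}-1$)	25	1
int16	-32768 ($-2^{15}$)	32767 ($2^{15}-1$)	26	2
int32	-2147483648 ($-2^{31}$)	2147483647 ($2^{31}-1$)	28	4
int64	-9223372036854775808 ($-2^{63}$)	9223372036854775807 ($2^{63}-1$)	32	8
float16	-65504	65504	26	2
float32	-3.4028235e+38	3.4028235e+38	28	4
float64	-1.7976931348623157e+308	1.7976931348623157e+308	32	8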
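
The sizes that matter for a DataFrame are the per-element sizes inside an array, not the sizes of standalone scalar objects. As a small sketch, `np.dtype(...).itemsize` and `ndarray.nbytes` show the difference; the array length of one million is an arbitrary choice for illustration.

```python
import sys

import numpy as np

for dtype in (np.int8, np.int16, np.int32, np.int64,
              np.float16, np.float32, np.float64):
    per_element = np.dtype(dtype).itemsize      # bytes per value inside an array
    scalar_object = sys.getsizeof(dtype(1))     # bytes for one standalone scalar object
    print(f"{np.dtype(dtype).name}: {per_element} per element, "
          f"{scalar_object} as a scalar object")

# One million values: the array cost is the element count times the item size.
million_int8 = np.ones(1_000_000, dtype=np.int8)
million_int64 = np.ones(1_000_000, dtype=np.int64)
print(million_int8.nbytes, million_int64.nbytes)
```

This is why converting an int64 column to int8 can shrink it by a factor of 8, even though the scalar sizes reported by sys.getsizeof differ by much less.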

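Note that int and float types are not ranked against each other by size. An int type stores whole numbers exactly, while a float type can also represent fractional values, with a limited number of significant digits, which makes it the usual choice for calculations that need fractions.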
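
Fitting inside a float type's range does not mean a value is stored exactly. The sketch below, using values chosen purely for illustration, shows that float16 rounds a number well within its range, which is worth checking before accepting a float16 conversion.

```python
import numpy as np

# 2049 is far below the float16 maximum, so a range check alone would accept it.
value = 2049.0
f16_max = float(np.finfo(np.float16).max)

as_f16 = np.float16(value)
as_f32 = np.float32(value)

print("float16 max:", f16_max)
print("stored as float16:", float(as_f16))
print("stored as float32:", float(as_f32))

# Fractions are affected too: 0.1 is rounded more coarsely in float16 than in float64.
print("0.1 as float16:", float(np.float16(0.1)))
```

For identifiers, prices, or anything compared for equality, float32 or float64 is the safer target even when the range would allow float16.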

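pandas stores and computes its numeric data with NumPy. When it reads numeric data, it defaults to the widest types, int64 or float64: a column of small integers such as 0 and 1 is stored as int64, and if that column contains missing values it is stored as float64. This keeps later calculations safe, because results are less likely to exceed the type's range (floats would overflow to inf, and integers would wrap around), but these 8-byte types cost more memory than the data often needs.

So if a column only takes integer values in $[0, 1]$, it can be stored as int8, which uses far less memory. That is the idea behind the solution above.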
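
The defaults are easy to check on a small example. The sketch below uses a made-up two-column CSV in which column b has one missing value; the nullable `Int8` extension dtype used at the end is an option that the original function does not use, shown here only as one possible way to keep such a column compact.

```python
import io

import pandas as pd

# Hypothetical CSV: column a is complete, column b has one missing value.
csv_text = "a,b\n1,1\n0,\n1,0\n"
df = pd.read_csv(io.StringIO(csv_text))

# a is read as int64; b becomes float64 because NaN marks the missing value.
print(df.dtypes.to_dict())

# Column a fits in int8 without any loss.
a_small = df["a"].astype("int8")

# Column b cannot use plain int8 because of the NaN; pandas' nullable
# "Int8" dtype stores the integers plus a separate missing-value mask.
b_small = df["b"].astype("Int8")

print(a_small.dtype, b_small.dtype)
print(b_small.isna().sum(), "missing value kept")
```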