Pandas读取数据内存优化

最新推荐文章于 2024-05-24 18:44:55 发布

dzysunshine

最新推荐文章于 2024-05-24 18:44:55 发布

阅读量1.4k

点赞数 1

分类专栏： python相关用法文章标签： python 机器学习

本文链接：https://blog.csdn.net/dzysunshine/article/details/108077867

版权

python相关用法专栏收录该内容

18 篇文章 3 订阅

订阅专栏

在处理数据的时候，往往会遇到数据量比较大的情况，而使用pandas读取数据通常会占用内存过大，导致数据无法正常读取或者在进行特征工程时速度很慢，本文从两个角度简单介绍下应对大数据的处理方法。

1. 分块读取数据

读数据时只读取其中某些重要的列

df = pd.read_csv('demo.csv',usecols=['column1', 'column2', 'column3'])

读数据时只读取限定的行

df = pd.read_csv('demo.csv',nrows=1000,usecols=['column1', 'column2', 'column3'])

使用chunksize分块读取数据

data_path = r'E:\python\Study\BiGData\demo.csv'
def read_bigfile(path):
    # 分块，每一块是一个chunk，之后将chunk进行拼接
    df = pd.read_csv(path, engine='python', encoding='gbk', iterator=True)  
    loop = True
    chunkSize = 10000
    chunks = []
    while loop:
        try:
            chunk = df.get_chunk(chunkSize)
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped.")
    df = pd.concat(chunks, ignore_index=True)
after_df = read_bigfile(path = data_path)

或者

    def read_big_table(path): # 定义一个函数对数据进行分块读取，以免因数据过多而无法完成读取
        reader = pd.read_table(path, header=None, chunksize=10000)
        data = pd.concat(reader, axis=0, ignore_index=True)
        return data

2. 读取数据后的内存优化

当我们明确知道要加载数据的范围，使用pd.read_table读取数据时，可以用其中的dtype参数来手动指定类型。比如某一列的数据范围肯定在0~255之中，那么我们可以指定为np.uint8类型，如果不手动指定的话默认为np.int64类型，这之间的差距巨大。

通过numpy中的函数来查看范围

import numpy as np
# 查看int16的范围
ii16 = np.iinfo(np.int16)
ii16.min
-32768
# 与iinfo相应，finfo可以查看float类型的范围
fi16 = np.finfo(np.float16)
fin16.min
-3.4028235e+38

通过上面的例子可以看出，如果某一列数据是在一定范围内，可对其类型进行限制，实现内存优化。

# @from: https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65/code
# @liscense: Apache 2.0
# @author: weijian
def reduce_mem_usage(props):
    # 计算当前内存
    start_mem_usg = props.memory_usage().sum() / 1024 ** 2
    print("Memory usage of the dataframe is :", start_mem_usg, "MB")
    
    # 哪些列包含空值，空值用-999填充。why：因为np.nan当做float处理
    NAlist = []
    for col in props.columns:
        # 这里只过滤了object格式，如果你的代码中还包含其他类型，请一并过滤
        if (props[col].dtypes != object):
            
            print("**************************")
            print("columns: ", col)
            print("dtype before", props[col].dtype)
            
            # 判断是否是int类型
            isInt = False
            mmax = props[col].max()
            mmin = props[col].min()
            
            # Integer does not support NA, therefore Na needs to be filled
            if not np.isfinite(props[col]).all():
                NAlist.append(col)
                props[col].fillna(-999, inplace=True) # 用-999填充
                
            # test if column can be converted to an integer
            asint = props[col].fillna(0).astype(np.int64)
            result = np.fabs(props[col] - asint)
            result = result.sum()
            if result < 0.01: # 绝对误差和小于0.01认为可以转换的，要根据task修改
                isInt = True
            
            # make interger / unsigned Integer datatypes
            if isInt:
                if mmin >= 0: # 最小值大于0，转换成无符号整型
                    if mmax <= 255:
                        props[col] = props[col].astype(np.uint8)
                    elif mmax <= 65535:
                        props[col] = props[col].astype(np.uint16)
                    elif mmax <= 4294967295:
                        props[col] = props[col].astype(np.uint32)
                    else:
                        props[col] = props[col].astype(np.uint64)
                else: # 转换成有符号整型
                    if mmin > np.iinfo(np.int8).min and mmax < np.iinfo(np.int8).max:
                        props[col] = props[col].astype(np.int8)
                    elif mmin > np.iinfo(np.int16).min and mmax < np.iinfo(np.int16).max:
                        props[col] = props[col].astype(np.int16)
                    elif mmin > np.iinfo(np.int32).min and mmax < np.iinfo(np.int32).max:
                        props[col] = props[col].astype(np.int32)
                    elif mmin > np.iinfo(np.int64).min and mmax < np.iinfo(np.int64).max:
                        props[col] = props[col].astype(np.int64)  
            else: # 注意：这里对于float都转换成float16，需要根据你的情况自己更改
                props[col] = props[col].astype(np.float16)
            
            print("dtype after", props[col].dtype)
            print("********************************")
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = props.memory_usage().sum() / 1024**2 
    print("Memory usage is: ",mem_usg," MB")
    print("This is ",100*mem_usg/start_mem_usg,"% of the initial size")

当然，我们可以只修改数值型字段的数据类型

def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df