pandas: Automatically Optimizing Data Types

june_francis

Last modified: 2022-09-09 11:24:52


First published: 2022-09-09 11:16:16

Original article: https://mp.weixin.qq.com/s?__biz=MzUzODYwMDAzNA==&mid=2247549551&idx=1&sn=5ffa78258aabe6df446c0ff622492811&chksm=fad77562cda0fc74e0e75d4e3260fe4d39636dee9a121071f57362f19e0d8e290b2df3160643&scene=178&cur_album_id=1699019347278561282#rd


Preface

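pandas is an excellent data-processing library and keeps gaining users, but on larger datasets it can run short of memory. Most of us have probably tried a few optimizations already, for example:

Delete unneeded objects explicitly with del some_garbage, then call gc.collect() from Python's gc module to reclaim the freed memory;
Replace for loops and apply calls with NumPy array operations wherever possible;
Store data in more efficient file formats such as feather or parquet.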
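As a rough illustration of the first two tips, the sketch below computes the same column once with a row-wise apply and once with a single NumPy operation, then releases an intermediate result. The DataFrame, its column names, and its size are made up for the example.

```python
import gc

import numpy as np
import pandas as pd

# Hypothetical sample data: the column names and sizes are invented.
df = pd.DataFrame({
    "price": np.arange(1, 1001, dtype=np.float64),
    "qty": np.arange(1000, 0, -1, dtype=np.int64),
})

# Row-wise apply: a Python function is called once per row.
total_apply = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Vectorized: one NumPy multiplication over the whole columns.
total_vectorized = pd.Series(
    df["price"].to_numpy() * df["qty"].to_numpy(), index=df.index
)

# Both approaches give the same numbers.
same_result = bool(np.allclose(total_apply, total_vectorized))

# Drop the intermediate we no longer need, then trigger a collection pass.
del total_apply
gc.collect()
```

On a frame this small the difference hardly matters; the gap between the two approaches tends to grow with the number of rows.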

Data type conversion

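Beyond the methods above, we can also work on the data types themselves. In some cases the capacity of pandas' default dtypes is far larger than the data actually needs, so we can replace them with narrower subtypes: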

Pandas Type	Numpy Type	Python Type	Usage
object	string_, unicode	str	Text
int64	int, int8, int16, int32, int64, uint8, uint16, uint32, uint64	int	Integer numbers
float64	float, float16, float32, float64	float	Floating point numbers
bool	bool_	bool	True/False values
datetime64	datetime64[ns]	NA	Date and time values
timedelta[ns]	NA	NA	Difference between two datetimes
category	NA	NA	Finite list of text values

Table source: http://pbpython.com/pandas_dtypes.html
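To make the gap between the default type and a narrower subtype concrete, here is a minimal sketch that stores the same one million small integers as int64 and as int8. The series is hypothetical; only the dtypes matter.

```python
import numpy as np
import pandas as pd

# Hypothetical column: one million integers, all between 0 and 99.
ages = pd.Series(np.arange(1_000_000, dtype=np.int64) % 100)

# The same values fit comfortably into a single byte each.
small = ages.astype(np.int8)

# memory_usage(index=False) counts only the bytes of the values.
bytes_int64 = ages.memory_usage(index=False)
bytes_int8 = small.memory_usage(index=False)

print(f"int64: {bytes_int64} bytes, int8: {bytes_int8} bytes")
```

Each int64 value takes 8 bytes and each int8 value takes 1, so the narrower column is one eighth of the size while holding identical values.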

Data type	Description
bool_	Boolean (True or False) stored as a byte
int_	Default integer type (same as C long; normally either int64 or int32)
intc	Identical to C int (normally int32 or int64)
intp	Integer used for indexing (same as C ssize_t; normally either int32 or int64)
int8	Byte (-128 to 127)
int16	Integer (-32768 to 32767)
int32	Integer (-2147483648 to 2147483647)
int64	Integer (-9223372036854775808 to 9223372036854775807)
uint8	Unsigned integer (0 to 255)
uint16	Unsigned integer (0 to 65535)
uint32	Unsigned integer (0 to 4294967295)
uint64	Unsigned integer (0 to 18446744073709551615)
float_	Shorthand for float64.
float16	Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
float32	Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
float64	Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
complex_	Shorthand for complex128.
complex64	Complex number, represented by two 32-bit floats (real and imaginary components)
complex128	Complex number, represented by two 64-bit floats (real and imaginary components)

Table source: https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html
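The ranges in the table do not need to be memorized: np.iinfo and np.finfo report them for any integer or floating-point type. A small sketch (the dictionary names are just for this example):

```python
import numpy as np

# Collect the representable range of each integer type.
int_ranges = {}
for int_type in (np.int8, np.int16, np.int32, np.int64, np.uint8, np.uint16):
    info = np.iinfo(int_type)
    int_ranges[int_type.__name__] = (int(info.min), int(info.max))

# Same idea for floating-point types, using np.finfo.
float_max = {}
for float_type in (np.float16, np.float32, np.float64):
    float_max[float_type.__name__] = float(np.finfo(float_type).max)

for name, (low, high) in int_ranges.items():
    print(f"{name:>7}: {low} to {high}")
for name, high in float_max.items():
    print(f"{name:>7}: largest finite value {high}")
```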

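We want to simplify the types in order to save memory: convert floating-point columns to float16/float32, convert columns that contain both positive and negative integers to int8/int16/int32, store Boolean flags that are kept as integers in uint8 (a genuine bool column already uses one byte per value), and use unsigned types for columns that hold only non-negative integers to reduce memory consumption further.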
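A short sketch of the unsigned case: values from 0 to 200 do not fit into int8 (maximum 127), but they do fit into uint8. It uses pd.to_numeric with its downcast parameter as one way to pick the smallest type; the sample values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical non-negative column with a maximum of 200.
scores = pd.Series([0, 37, 128, 200], dtype=np.int64)

# Smallest signed type that holds 200 is int16 (int8 stops at 127).
as_signed = pd.to_numeric(scores, downcast="integer")

# Smallest unsigned type that holds 200 is uint8 (0 to 255).
as_unsigned = pd.to_numeric(scores, downcast="unsigned")

print(as_signed.dtype, as_unsigned.dtype)
```

For this column the unsigned type halves the storage again compared with the smallest signed type.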

Based on the type tables above, we can convert floating-point and integer columns to the smallest subtype that holds their values:

import numpy as np


def reduce_memory_usage(df, verbose=True):
    """Downcast numeric columns to the smallest dtype that covers their value range.

    Modifies df in place and also returns it.
    """
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                # Try signed integer types from narrowest to widest.
                # The bounds are inclusive: -128 and 127 both fit into int8.
                for int_type in (np.int8, np.int16, np.int32):
                    info = np.iinfo(int_type)
                    if c_min >= info.min and c_max <= info.max:
                        df[col] = df[col].astype(int_type)
                        break
            else:
                # Only the value range is checked here. float16 and float32
                # keep fewer significant digits, so values may be rounded.
                for float_type in (np.float16, np.float32):
                    info = np.finfo(float_type)
                    if c_min >= info.min and c_max <= info.max:
                        df[col] = df[col].astype(float_type)
                        break
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df
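One caveat worth checking before applying a function like this: a range test alone does not guarantee the values survive the conversion. The sketch below, with invented values, shows numbers that are well inside the float16 range yet change when cast, because float16 has only 10 mantissa bits.

```python
import numpy as np
import pandas as pd

# Hypothetical values, all far below the float16 maximum of 65504.
prices = pd.Series([4097.0, 1000.1, 0.5], dtype=np.float64)

# The range check passes, so a range-based downcast would pick float16.
fits_range = bool(prices.max() <= np.finfo(np.float16).max)

as_half = prices.astype(np.float16)

# Compare against the original values to see the rounding error.
max_error = float((prices - as_half.astype(np.float64)).abs().max())

print(as_half.tolist(), max_error)
```

If exact values matter (identifiers, money, accumulated sums), it may be safer to stop at float32 or leave the column as float64.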

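The code above only handles numeric data. In real production environments, the dtype we run into most often is probably object, which is a real headache. Is there a good way to convert it?
Yes: pandas has a built-in method that handles dtype conversion automatically: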

df.convert_dtypes()
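A minimal sketch of what convert_dtypes does on a small, made-up frame: it picks nullable dtypes that support pd.NA for each column. Note that this changes the dtypes but does not necessarily shrink memory; for a text column with few distinct values, astype("category") is shown as a separate, commonly used option (an addition here, not part of the original article's method).

```python
import numpy as np
import pandas as pd

# Hypothetical data: column names and values are invented.
raw = pd.DataFrame({
    "city": ["Beijing", "Shanghai", "Beijing", None],
    "count": [1.0, 2.0, np.nan, 4.0],
    "flag": [True, False, True, None],
})

converted = raw.convert_dtypes()
print(raw.dtypes)
print(converted.dtypes)

# Separate option for repetitive text: store each distinct value once.
repeated = pd.Series(["Beijing", "Shanghai", "Shenzhen"] * 10_000, dtype=object)
as_category = repeated.astype("category")

# deep=True also counts the Python string objects behind an object column.
object_bytes = repeated.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)
print(object_bytes, category_bytes)
```

Here the whole-number float column becomes a nullable integer column and the True/False/None column becomes a nullable Boolean column, with missing entries shown as pd.NA.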

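References

WeChat official account 东哥起飞:
爆减内存！pandas自动优化骚操作 (Slashing memory use: pandas automatic optimization tricks)
变量类型自动转换 (Automatic variable type conversion)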
