Memory usage in pandas

Author: 只要开始永远不晚

Last modified: 2022-02-07 14:05:08

First published: 2022-01-27 20:01:03

Article link: https://blog.csdn.net/haohaizijhz/article/details/122722847

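This article looks at how pandas reports memory usage through the `info()` and `memory_usage()` methods. Calling `info()` gives an overview of a DataFrame's memory footprint, while `memory_usage()` gives per-column figures. For a more accurate report, enable the deep option (`memory_usage="deep"` for `info()`, `deep=True` for `memory_usage()`). The choice of data type directly affects memory consumption: different NumPy dtypes take up different amounts of memory.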

Measuring memory usage

info

Calling info() on a DataFrame object shows the DataFrame's memory usage (including the index).
For example, calling info() on the DataFrame below reports its memory usage:

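import numpy as np
import pandas as pd

# One column per dtype, so the footprints can be compared side by side.
dtypes = [
    "int8",
    "uint8",
    "int16",
    "int32",
    "int64",
    "float64",
    "datetime64[ns]",
    "timedelta64[ns]",
    "complex128",
    "object",
    "bool",
]
n = 5000  # number of rows

# Random integers in [0, 100), cast to each target dtype.
data = {"col_" + t: np.random.randint(100, size=n).astype(t) for t in dtypes}

df = pd.DataFrame(data)

# Add a categorical version of the object column.
df["categorical"] = df["col_object"].astype("category")

df.info()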

# output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype          
---  ------               --------------  -----          
 0   col_int8             5000 non-null   int8           
 1   col_uint8            5000 non-null   uint8          
 2   col_int16            5000 non-null   int16          
 3   col_int32            5000 non-null   int32          
 4   col_int64            5000 non-null   int64          
 5   col_float64          5000 non-null   float64        
 6   col_datetime64[ns]   5000 non-null   datetime64[ns] 
 7   col_timedelta64[ns]  5000 non-null   timedelta64[ns]
 8   col_complex128       5000 non-null   complex128     
 9   col_object           5000 non-null   object         
 10  col_bool             5000 non-null   bool           
 11  categorical          5000 non-null   category       
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int16(1), int32(1), int64(1), int8(1), object(1), timedelta64[ns](1), uint8(1)
memory usage: 327.2+ KB

The + sign indicates that the actual memory usage may be higher, because pandas does not count the memory used by the values in columns with dtype=object.

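Passing memory_usage='deep' enables a more accurate memory usage report that accounts for the full usage of the contained objects. It is optional because this deeper introspection can be expensive.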

df.info(memory_usage="deep")
# output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype          
---  ------               --------------  -----          
 0   col_int8             5000 non-null   int8           
 1   col_uint8            5000 non-null   uint8          
 2   col_int16            5000 non-null   int16          
 3   col_int32            5000 non-null   int32          
 4   col_int64            5000 non-null   int64          
 5   col_float64          5000 non-null   float64        
 6   col_datetime64[ns]   5000 non-null   datetime64[ns] 
 7   col_timedelta64[ns]  5000 non-null   timedelta64[ns]
 8   col_complex128       5000 non-null   complex128     
 9   col_object           5000 non-null   object         
 10  col_bool             5000 non-null   bool           
 11  categorical          5000 non-null   category       
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int16(1), int32(1), int64(1), int8(1), object(1), timedelta64[ns](1), uint8(1)
memory usage: 463.8 KB
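
The gap between the shallow and deep figures can be reproduced on a single object column. The following is a minimal sketch, not taken from the original article: the Series `words` and its contents are made up for illustration, and the column is forced to dtype=object so that it behaves like `col_object` above.

```python
import sys

import numpy as np
import pandas as pd

# Hypothetical data: 1000 distinct Python strings stored in an object column.
words = pd.Series([f"row-{i}" for i in range(1000)], dtype=object)

# Shallow count: only the array of object pointers (one pointer per row).
shallow = words.memory_usage(index=False, deep=False)

# Deep count: pandas also inspects every Python object the pointers refer to.
deep = words.memory_usage(index=False, deep=True)

pointer_bytes = np.dtype(object).itemsize  # 8 on a typical 64-bit build
print("shallow:", shallow, "rows * pointer size:", len(words) * pointer_bytes)
print("deep:", deep)
print("extra bytes found by deep:", deep - shallow)
print("sum of sys.getsizeof over the values:", sum(sys.getsizeof(w) for w in words))
```

The shallow figure grows only with the number of rows, while the deep figure also depends on what each row holds. This is why info() appends a + when it has skipped the object contents.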

memory_usage

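The memory usage of each column can be obtained by calling the memory_usage() method. It returns a Series whose index holds the column names and whose values are the memory usage of each column in bytes. For the DataFrame above, memory_usage() gives the usage of every column, and summing the result gives the total: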

To get accurate memory usage figures, pass deep=True:

df.memory_usage(deep=True)
# output
Index                     128
col_int8                 5000
col_uint8                5000
col_int16               10000
col_int32               20000
col_int64               40000
col_float64             40000
col_datetime64[ns]      40000
col_timedelta64[ns]     40000
col_complex128          80000
col_object             179800
col_bool                 5000
categorical              9968
dtype: int64


df.memory_usage(deep=True).sum()
# output
474896
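
For fixed-width dtypes, the per-column numbers above are the row count multiplied by the element size, and the figure printed by info() appears to be the byte total divided by 1024 (474896 / 1024 ≈ 463.8). The sketch below checks both on a small, made-up DataFrame; the name `small` and its columns are illustrative only.

```python
import numpy as np
import pandas as pd

n = 5000

# Hypothetical frame with three fixed-width columns.
small = pd.DataFrame(
    {
        "a_int8": np.zeros(n, dtype="int8"),        # 1 byte per value
        "b_int64": np.zeros(n, dtype="int64"),      # 8 bytes per value
        "c_float64": np.zeros(n, dtype="float64"),  # 8 bytes per value
    }
)

# index=False leaves the index out, so only the column data is counted.
per_column = small.memory_usage(index=False)

# Expected size of each column: number of rows times bytes per element.
expected = {name: n * small[name].dtype.itemsize for name in small.columns}

for name in small.columns:
    print(name, per_column[name], expected[name])

# Total including the index, converted from bytes to KB as info() reports it.
total_kb = small.memory_usage(index=True).sum() / 1024
print(f"total: {total_kb:.1f} KB")
```

The index is left out of the comparison because its reported size can differ between pandas versions.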

How data types relate to memory

Data type	Description
`bool_`	Boolean (True or False) stored as a byte
`int_`	Default integer type (same as C `long`; normally either `int64` or `int32`)
`intc`	Identical to C `int` (normally `int32` or `int64`)
`intp`	Integer used for indexing (same as C `ssize_t`; normally either `int32` or `int64`)
`int8`	Byte (-128 to 127)
`int16`	Integer (-32768 to 32767)
`int32`	Integer (-2147483648 to 2147483647)
`int64`	Integer (-9223372036854775808 to 9223372036854775807)
`uint8`	Unsigned integer (0 to 255)
`uint16`	Unsigned integer (0 to 65535)
`uint32`	Unsigned integer (0 to 4294967295)
`uint64`	Unsigned integer (0 to 18446744073709551615)
`float_`	Shorthand for `float64` (alias removed in NumPy 2.0; use `float64`).
`float16`	Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
`float32`	Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
`float64`	Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
`complex_`	Shorthand for `complex128` (alias removed in NumPy 2.0; use `complex128`).
`complex64`	Complex number, represented by two 32-bit floats
`complex128`	Complex number, represented by two 64-bit floats
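
The table gives value ranges but not sizes. The bytes per element can be read from `np.dtype(...).itemsize`, and a column whose values fit a narrower type can be converted to save memory. The sketch below is one possible illustration and is not from the original article: the Series `values` is made up, and `astype` and `pd.to_numeric(..., downcast="integer")` are just two ways to do the conversion.

```python
import numpy as np
import pandas as pd

# Bytes per element for several of the dtypes listed above.
names = [
    "bool", "int8", "int16", "int32", "int64",
    "float16", "float32", "float64", "complex64", "complex128",
]
sizes = {name: np.dtype(name).itemsize for name in names}
for name, size in sizes.items():
    print(f"{name:<12}{size} byte(s) per element")

# Hypothetical column: 100 integers in [0, 99] stored as int64.
values = pd.Series(np.arange(100, dtype="int64"))

# Explicit conversion: every value fits in int8 (-128 to 127).
as_int8 = values.astype("int8")

# Alternative: let pandas pick the smallest integer dtype that fits.
downcast = pd.to_numeric(values, downcast="integer")

print(values.memory_usage(index=False))    # 100 values * 8 bytes
print(as_int8.memory_usage(index=False))   # 100 values * 1 byte
print(downcast.dtype)
```

astype does not check for overflow when narrowing integers, so it is worth confirming the value range first. The downcast option only narrows when the data fits.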