Pandas.DataFrame.describe(): Descriptive Statistics Explained, with Code and Test Data (Updated Alongside Pandas Releases)

数象限

Last modified: 2024-01-22 20:12:10


First published: 2024-01-19 14:33:30

Article link: https://blog.csdn.net/mingqinsky/article/details/135697384


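This article explains how to use DataFrame.describe() in the Pandas library: its default behavior, including and excluding dtypes, setting custom percentiles, and how boolean values, complex numbers, and missing values are handled. Worked examples show how to adjust the parameters to get the descriptive statistics you need.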


Pandas version: this article is based on pandas 2.2.0.

Updates: this article is revised and extended as new stable versions of pandas are released.


Pandas.DataFrame.describe()

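Pandas.DataFrame.describe generates descriptive statistics for a DataFrame. It returns a table with one row per statistic: count, mean, standard deviation, minimum, quartiles, maximum, and so on.

Missing values (NaN) in the described columns are excluded from the calculations.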
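As a minimal sketch of this behavior (the column name x and its values are made up for illustration), count and mean below are computed from the two non-missing values only:

```python
import numpy as np
import pandas as pd

# Hypothetical one-column frame with a single missing value
df_nan = pd.DataFrame({"x": [1.0, np.nan, 3.0]})
summary = df_nan.describe()

# NaN is skipped: only the two remaining values are counted and averaged
print(summary.loc["count", "x"])
print(summary.loc["mean", "x"])
```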

Syntax:

DataFrame.describe(percentiles=None, include=None, exclude=None)

Returns:

Series or DataFrame

The return type matches the object the method is called on: DataFrame.describe returns a DataFrame, while the same method called on a Series returns a Series.
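A short sketch of the two return types, using a small made-up frame (the names frame, a, and b are illustrative only):

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

frame_summary = frame.describe()        # called on a DataFrame
series_summary = frame["a"].describe()  # called on a single column (a Series)

print(type(frame_summary).__name__)
print(type(series_summary).__name__)
```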

Parameters:

include: dtype whitelist

include : 'all', list-like of dtypes or None (default), optional

The include parameter specifies which dtypes take part in the description. A column is described if its dtype appears in this whitelist.

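By default, include=None means that only numeric (numpy.number) columns are described (see: Pandas NumPy dtype quick reference):
- None: the default; only numeric (numpy.number) columns are described. (See Example 1.)
- 'all': columns of every dtype are described. (See Example 1-3.)
- list-like of dtypes: pass the dtypes to describe as a list-like; columns matching any of them are described. (See Example 1-4.)
⚠️ Note:
- Use numpy.number to include all numeric columns. (See: Pandas NumPy dtype quick reference.)
  - bool is not a subtype of numpy.number, so boolean columns are not selected by numpy.number. The actual behavior of Pandas.DataFrame.describe confirms this. (See Example 1-2.)
  - numpy.number does include complex numbers (np.complexfloating), but Pandas.DataFrame.describe only supports computation on real numbers. If the DataFrame contains a complex column that is not excluded, the call raises TypeError: a must be an array of real numbers. (See Example 4.)
- Use the object dtype (object, numpy.object_, or the string 'O') to include object columns, which hold strings or mixed types. For example: df.describe(include=['O']). The old alias numpy.object was removed in NumPy 1.24.
- Use 'category' to include categorical columns.
- include and exclude can be combined. (See Example 2-2.)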
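The notes above can be checked with a small sketch. The frame demo and its column names are made up for illustration, and the text column is created with an explicit object dtype so the example does not depend on how a given pandas version infers string columns:

```python
import numpy as np
import pandas as pd

# How NumPy's abstract type hierarchy classifies bool and complex
print(np.issubdtype(np.bool_, np.number))       # is bool a numpy.number?
print(np.issubdtype(np.complex128, np.number))  # is complex a numpy.number?

demo = pd.DataFrame(
    {
        "n": [1, 2, 3],
        "flag": [True, False, True],
        "text": pd.Series(["a", "b", "c"], dtype=object),
    }
)

numeric_only = demo.describe(include=[np.number])       # boolean column is left out
with_bool = demo.describe(include=[np.number, "bool"])  # boolean column requested explicitly
object_only = demo.describe(include=[object])           # object columns only

print(list(numeric_only.columns))
print(list(with_bool.columns))
print(list(object_only.columns))
```

Listing "bool" next to numpy.number is one way to bring boolean columns into the description without switching to include='all'.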

exclude: dtype blacklist

exclude : list-like of dtypes or None (default), optional

The exclude parameter specifies a blacklist of dtypes to leave out. A column whose dtype appears in this blacklist is not described.

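By default, exclude=None excludes nothing:
- list-like of dtypes: pass the dtypes to exclude as a list-like; columns of those dtypes are not described. (See Example 2.)
⚠️ Note:
- Use numpy.number to exclude numeric columns. (See: Pandas NumPy dtype quick reference.)
  - In Pandas.DataFrame.describe, bool is not treated as part of np.number, so with exclude=np.number the boolean columns remain in the result. (See Example 2-1.)
- Use the object dtype (object or numpy.object_) to exclude object columns. The string form also works, for example df.describe(exclude=['O']).
- Use 'category' to exclude categorical columns.
- include and exclude can be combined. (See Example 2-2.)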
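A minimal sketch of the string form of exclude, assuming a made-up two-column frame (demo, n, and text are illustrative names; the object dtype is set explicitly):

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "n": [1, 2, 3],
        "text": pd.Series(["a", "b", "c"], dtype=object),
    }
)

# Leave out object columns by passing the string alias 'O'
without_object = demo.describe(exclude=["O"])
print(list(without_object.columns))
```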

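percentiles: custom percentiles

percentiles : list-like of numbers, optional

The percentiles parameter customizes which percentiles are reported:
- list-like: pass the percentiles as a list-like; every element must lie between 0 and 1. By default only [0.25, 0.5, 0.75] (the first, second, and third quartiles) are returned.
⚠️ Note:

You can specify any number of percentiles. (See Example 3.)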
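A short sketch of the 0 to 1 convention, using made-up data (values and v are illustrative names). Percentiles are given as fractions, and the result rows are labeled as percentages:

```python
import pandas as pd

values = pd.DataFrame({"v": [1, 2, 3, 4, 5]})

# Request the 10th and 90th percentiles as fractions in [0, 1]
summary = values.describe(percentiles=[0.1, 0.9])
print(summary.loc[["10%", "90%"]])

# Passing 10 instead of 0.1 falls outside [0, 1] and is rejected
try:
    values.describe(percentiles=[10])
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
print(out_of_range_rejected)
```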

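⚠️ Note:

For numeric data, the result's index includes count, mean, std, min, max, and the lower, 50, and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50th percentile is the same as the median.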
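To illustrate, a minimal sketch with a made-up Series s shows the default index labels and compares the 50% row with the median:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
summary = s.describe()

# Default row labels for numeric data
print(list(summary.index))

# The 50% row and the median are the same number
print(summary["50%"], s.median())
```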

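For object data (for example, strings), the result's index includes count, unique, top, and freq. top is the most common value and freq is how often that value occurs. In pandas 1.x, timestamps described this way also reported the first and last items; in pandas 2.x, datetime columns are instead summarized like numeric data (count, mean, min, percentiles, max).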
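A minimal sketch of the object-data summary, using a made-up Series of labels with an explicit object dtype:

```python
import pandas as pd

labels = pd.Series(["a", "a", "b"], dtype=object)
summary = labels.describe()

# Rows: count, unique, top (most common value), freq (its number of occurrences)
print(summary)
```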

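If several object values share the highest count, the count and top results are chosen arbitrarily from among those values.

For a DataFrame with mixed dtypes, only the numeric columns are analyzed by default. If the DataFrame contains only object ('object') and categorical ('category') columns and no numeric columns, the default is to analyze those object and categorical columns instead. With include='all', the result contains the union of the attributes of each type.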
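The fallback for frames without numeric columns can be sketched as follows (only_labels, text, and cat are made-up names for illustration):

```python
import pandas as pd

# A frame with no numeric columns at all
only_labels = pd.DataFrame(
    {
        "text": pd.Series(["x", "y", "x"], dtype=object),
        "cat": pd.Categorical(["p", "q", "p"]),
    }
)

# With default arguments, the object and categorical columns are described
summary = only_labels.describe()
print(summary)
```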

The include and exclude parameters restrict which columns of a DataFrame are analyzed. They are ignored when describing a Series.
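A quick sketch of the Series case, with a made-up numeric Series s. Passing include does not change the result:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

default_summary = s.describe()
with_include = s.describe(include=["O"])  # ignored for a Series

print(default_summary.equals(with_include))
```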

Examples:


Example 1: By default, only columns with a numeric dtype (`numpy.number`) are described.

Example 1-1. Build the demo data and inspect its contents

import numpy as np
import pandas as pd

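# Build the demo data (column names stay in Chinese so they match the output below)
df = pd.DataFrame(
    {
        "分类": pd.Categorical(["d", "e", "f"]),  # category
        "整数": [1, 2, 3],  # integer
        "浮点数": [1.5, 2.5, 3.5],  # float
        "布尔": [True, True, False],  # boolean
        "object": ["a", "b", "c"],
    }
)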

df

	分类	整数	浮点数	布尔	object
0	d	1	1.5	True	a
1	e	2	2.5	True	b
2	f	3	3.5	False	c

Next, inspect the dtypes. to_frame() converts the result to a table so it is easier to read:

df.dtypes.to_frame()

	0
分类	category
整数	int64
浮点数	float64
布尔	bool
object	object

Example 1-2. Run describe with every parameter left at its default. Note that the boolean column is excluded by default.

df.describe()

	整数	浮点数
count	3.0	3.0
mean	2.0	2.5
std	1.0	1.0
min	1.0	1.5
25%	1.5	2.0
50%	2.0	2.5
75%	2.5	3.0
max	3.0	3.5

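As the result shows, only the integer column (整数) and the float column (浮点数) appear in the description. This matches the default behavior of df.describe().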
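One way to preview which columns the default call will pick up is to run select_dtypes with numpy.number first. The sketch below uses a made-up frame (demo) with English column names and no datetime columns:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        "i": [1, 2, 3],
        "f": [1.5, 2.5, 3.5],
        "flag": [True, True, False],
        "text": pd.Series(["a", "b", "c"], dtype=object),
    }
)

# Columns whose dtype is a subtype of numpy.number
numeric_columns = list(demo.select_dtypes(include=np.number).columns)

# Columns that the default describe() call reports on
described_columns = list(demo.describe().columns)

print(numeric_columns)
print(described_columns)
```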

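Example 1-3: With `include='all'`, columns of every dtype are described. Statistics that cannot be computed for a column are shown as missing values (`NaN`); for example, a mean cannot be computed for string data.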

df.describe(include="all")

	分类	整数	浮点数	布尔	object
count	3	3.0	3.0	3	3
unique	3	NaN	NaN	2	3
top	d	NaN	NaN	True	a
freq	1	NaN	NaN	2	1
mean	NaN	2.0	2.5	NaN	NaN
std	NaN	1.0	1.0	NaN	NaN
min	NaN	1.0	1.5	NaN	NaN
25%	NaN	1.5	2.0	NaN	NaN
50%	NaN	2.0	2.5	NaN	NaN
75%	NaN	2.5	3.0	NaN	NaN
max	NaN	3.0	3.5	NaN	NaN

Example 1-4: Describe only the `category` columns and the numeric columns

df.describe(include=[np.number, "category"])

	分类	整数	浮点数
count	3	3.0	3.0
unique	3	NaN	NaN
top	d	NaN	NaN
freq	1	NaN	NaN
mean	NaN	2.0	2.5
std	NaN	1.0	1.0
min	NaN	1.0	1.5
25%	NaN	1.5	2.0
50%	NaN	2.0	2.5
75%	NaN	2.5	3.0
max	NaN	3.0	3.5

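Example 2: Use the `exclude` parameter to leave out columns of certain dtypes

Example 2-1. Using the same data as Example 1, exclude the numeric columns and look at the result. Note that the boolean column now appears.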

df.describe(exclude=np.number)

	分类	布尔	object
count	3	3	3
unique	3	2	3
top	d	True	a
freq	1	2	1

Example 2-2. Combining `include` and `exclude`

df.describe(include=["category"], exclude=[np.number])

	分类
count	3
unique	3
top	d
freq	1
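When combining the two parameters, the same dtype should not appear in both lists. The sketch below assumes that describe selects columns with the same rules as DataFrame.select_dtypes, which rejects overlapping include and exclude sets with a ValueError (demo is a made-up frame):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"i": [1, 2, 3], "cat": pd.Categorical(["d", "e", "f"])})

# Asking to both include and exclude numpy.number is contradictory
try:
    demo.describe(include=[np.number], exclude=[np.number])
    overlap_rejected = False
except ValueError as err:
    overlap_rejected = True
    print(err)
```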

Example 3: Custom percentiles

df.describe(include=[np.number], percentiles=[0.1, 0.4, 0.7, 0.8, 0.85])

	整数	浮点数
count	3.0	3.0
mean	2.0	2.5
std	1.0	1.0
min	1.0	1.5
10%	1.2	1.7
40%	1.8	2.3
50%	2.0	2.5
70%	2.4	2.9
80%	2.6	3.1
85%	2.7	3.2
max	3.0	3.5

Example 4: Describing complex numbers

Example 4-1. Build a DataFrame that contains complex numbers

import numpy as np
import pandas as pd

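# Build the demo data (column names stay in Chinese so they match the output below)
df = pd.DataFrame(
    {
        "分类": pd.Categorical(["d", "e", "f"]),  # category
        "整数": [1, 2, 3],  # integer
        "浮点数": [1.5, 2.5, 3.5],  # float
        "布尔": [True, True, False],  # boolean
        "复数": [1 + 1j, 2 + 2j, 3 + 3j],  # complex
        "object": ["a", "b", "c"],
    }
)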

df

	分类	整数	浮点数	布尔	复数	object
0	d	1	1.5	True	1.0+1.0j	a
1	e	2	2.5	True	2.0+2.0j	b
2	f	3	3.5	False	3.0+3.0j	c

Example 4-2. If the complex column is not excluded, an error is raised

df.describe()

D:\miniconda3\envs\python3.12\Lib\site-packages\numpy\core\_methods.py:49: ComplexWarning: Casting complex values to real discards the imaginary part
  return umr_sum(a, axis, dtype, out, keepdims, initial, where)
D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\nanops.py:944: RuntimeWarning: invalid value encountered in sqrt
  result = np.sqrt(nanvar(values, axis=axis, skipna=skipna, ddof=ddof, mask=mask))



---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

Cell In[59], line 1
----> 1 df.describe()


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\generic.py:11544, in NDFrame.describe(self, percentiles, include, exclude)
  11302 @final
  11303 def describe(
  11304     self,
   (...)
  11307     exclude=None,
  11308 ) -> Self:
  11309     """
  11310     Generate descriptive statistics.
  11311 
   (...)
  11542     max            NaN      3.0
  11543     """
> 11544     return describe_ndframe(
  11545         obj=self,
  11546         include=include,
  11547         exclude=exclude,
  11548         percentiles=percentiles,
  11549     ).__finalize__(self, method="describe")


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\methods\describe.py:97, in describe_ndframe(obj, include, exclude, percentiles)
     90 else:
     91     describer = DataFrameDescriber(
     92         obj=cast("DataFrame", obj),
     93         include=include,
     94         exclude=exclude,
     95     )
---> 97 result = describer.describe(percentiles=percentiles)
     98 return cast(NDFrameT, result)


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\methods\describe.py:170, in DataFrameDescriber.describe(self, percentiles)
    168 for _, series in data.items():
    169     describe_func = select_describe_func(series)
--> 170     ldesc.append(describe_func(series, percentiles))
    172 col_names = reorder_columns(ldesc)
    173 d = concat(
    174     [x.reindex(col_names, copy=False) for x in ldesc],
    175     axis=1,
    176     sort=False,
    177 )


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\methods\describe.py:232, in describe_numeric_1d(series, percentiles)
    227 formatted_percentiles = format_percentiles(percentiles)
    229 stat_index = ["count", "mean", "std", "min"] + formatted_percentiles + ["max"]
    230 d = (
    231     [series.count(), series.mean(), series.std(), series.min()]
--> 232     + series.quantile(percentiles).tolist()
    233     + [series.max()]
    234 )
    235 # GH#48340 - always return float on non-complex numeric data
    236 dtype: DtypeObj | None


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\series.py:2769, in Series.quantile(self, q, interpolation)
   2765 # We dispatch to DataFrame so that core.internals only has to worry
   2766 #  about 2D cases.
   2767 df = self.to_frame()
-> 2769 result = df.quantile(q=q, interpolation=interpolation, numeric_only=False)
   2770 if result.ndim == 2:
   2771     result = result.iloc[:, 0]


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\frame.py:11831, in DataFrame.quantile(self, q, axis, numeric_only, interpolation, method)
  11827     raise ValueError(
  11828         f"Invalid method: {method}. Method must be in {valid_method}."
  11829     )
  11830 if method == "single":
> 11831     res = data._mgr.quantile(qs=q, interpolation=interpolation)
  11832 elif method == "table":
  11833     valid_interpolation = {"nearest", "lower", "higher"}


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\internals\managers.py:1508, in BlockManager.quantile(self, qs, interpolation)
   1504 new_axes = list(self.axes)
   1505 new_axes[1] = Index(qs, dtype=np.float64)
   1507 blocks = [
-> 1508     blk.quantile(qs=qs, interpolation=interpolation) for blk in self.blocks
   1509 ]
   1511 return type(self)(blocks, new_axes)


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\internals\blocks.py:1587, in Block.quantile(self, qs, interpolation)
   1584 assert self.ndim == 2
   1585 assert is_list_like(qs)  # caller is responsible for this
-> 1587 result = quantile_compat(self.values, np.asarray(qs._values), interpolation)
   1588 # ensure_block_shape needed for cases where we start with EA and result
   1589 #  is ndarray, e.g. IntegerArray, SparseArray
   1590 result = ensure_block_shape(result, ndim=2)


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\array_algos\quantile.py:39, in quantile_compat(values, qs, interpolation)
     37     fill_value = na_value_for_dtype(values.dtype, compat=False)
     38     mask = isna(values)
---> 39     return quantile_with_mask(values, mask, fill_value, qs, interpolation)
     40 else:
     41     return values._quantile(qs, interpolation)


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\array_algos\quantile.py:97, in quantile_with_mask(values, mask, fill_value, qs, interpolation)
     95     result = np.repeat(flat, len(values)).reshape(len(values), len(qs))
     96 else:
---> 97     result = _nanpercentile(
     98         values,
     99         qs * 100.0,
    100         na_value=fill_value,
    101         mask=mask,
    102         interpolation=interpolation,
    103     )
    105     result = np.array(result, copy=False)
    106     result = result.T


File D:\miniconda3\envs\python3.12\Lib\site-packages\pandas\core\array_algos\quantile.py:218, in _nanpercentile(values, qs, na_value, mask, interpolation)
    216     return result
    217 else:
--> 218     return np.percentile(
    219         values,
    220         qs,
    221         axis=1,
    222         # error: No overload variant of "percentile" matches argument types
    223         # "ndarray[Any, Any]", "ndarray[Any, dtype[floating[_64Bit]]]",
    224         # "int", "Dict[str, str]"  [call-overload]
    225         method=interpolation,  # type: ignore[call-overload]
    226     )


File D:\miniconda3\envs\python3.12\Lib\site-packages\numpy\lib\function_base.py:4277, in percentile(a, q, axis, out, overwrite_input, method, keepdims, interpolation)
   4275 a = np.asanyarray(a)
   4276 if a.dtype.kind == "c":
-> 4277     raise TypeError("a must be an array of real numbers")
   4279 q = np.true_divide(q, 100)
   4280 q = asanyarray(q)  # undo any decay that the ufunc performed (see gh-13105)


TypeError: a must be an array of real numbers
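One possible way to avoid this error is to keep numpy.number in include while excluding the complex subtype. This is a sketch, not the only option; dropping the complex column before calling describe would work as well. The frame demo and its column names are made up for illustration:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        "i": [1, 2, 3],
        "f": [1.5, 2.5, 3.5],
        "z": [1 + 1j, 2 + 2j, 3 + 3j],  # complex column that would trigger the error
    }
)

# Keep numeric columns, but leave out anything complex-valued
real_summary = demo.describe(include=[np.number], exclude=[np.complexfloating])
print(real_summary)
```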