16_Pandas.DataFrame: Computing Statistics and Grouping with GroupBy

饺子大人

Category: Pandas. Tags: python, machine learning

Article link: https://blog.csdn.net/qq_18351157/article/details/106118984

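Data can be grouped with the groupby() method of pandas.DataFrame and pandas.Series. Once grouped, the data in each group can be aggregated: you can compute statistics such as the mean, minimum, maximum, and sum, or apply any function of your own.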

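This article covers the following topics.

The iris dataset
Grouping with groupby()
Computing the mean, minimum, maximum, sum, and so on
Aggregating with an arbitrary function: agg()
Computing the main summary statistics at once: describe()
Plotting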

The iris dataset

The iris dataset is used as the example.

Here we use the copy bundled with seaborn as sample data.

import pandas as pd
import seaborn as sns
import numpy as np

df = sns.load_dataset("iris")
print(df.shape)
# (150, 5)

print(df.head(5))
#    sepal_length  sepal_width  petal_length  petal_width species
# 0           5.1          3.5           1.4          0.2  setosa
# 1           4.9          3.0           1.4          0.2  setosa
# 2           4.7          3.2           1.3          0.2  setosa
# 3           4.6          3.1           1.5          0.2  setosa
# 4           5.0          3.6           1.4          0.2  setosa

The column names are shortened to save space.

df.columns = ['sl', 'sw', 'pl', 'pw', 'species']
print(df.head(5))
#     sl   sw   pl   pw species
# 0  5.1  3.5  1.4  0.2  setosa
# 1  4.9  3.0  1.4  0.2  setosa
# 2  4.7  3.2  1.3  0.2  setosa
# 3  4.6  3.1  1.5  0.2  setosa
# 4  5.0  3.6  1.4  0.2  setosa
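
As an alternative sketch, the columns can also be renamed with an explicit mapping through rename(), which does not depend on the order of the columns. The small DataFrame below is an assumption that stands in for the iris data so the example runs without seaborn.

```python
import pandas as pd

# Toy frame with the same column names as the seaborn iris data
# (the values are illustrative, not the full dataset).
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "sepal_width": [3.5, 3.0],
    "petal_length": [1.4, 1.4],
    "petal_width": [0.2, 0.2],
    "species": ["setosa", "setosa"],
})

# Map each long name to its short form; columns not listed stay unchanged.
short_names = {
    "sepal_length": "sl",
    "sepal_width": "sw",
    "petal_length": "pl",
    "petal_width": "pw",
}
df = df.rename(columns=short_names)
print(list(df.columns))
# ['sl', 'sw', 'pl', 'pw', 'species']
```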

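Grouping with groupby()

Group the data with the groupby() method of pandas.DataFrame.

If a column name is passed as the argument, the rows are grouped by each distinct value in that column.

The return value is a GroupBy object. Passing it to print() shows only the object itself, not the grouped data.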

grouped = df.groupby('species')
print(grouped)
# <pandas.core.groupby.groupby.DataFrameGroupBy object at 0x10c69f6a0>

print(type(grouped))
# <class 'pandas.core.groupby.groupby.DataFrameGroupBy'>

The size() method returns the number of samples in each group.

print(grouped.size())
# species
# setosa        50
# versicolor    50
# virginica     50
# dtype: int64
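
Because printing a GroupBy object does not show the data, it can help to look inside the groups directly. The following is a minimal sketch on a toy DataFrame (an assumption standing in for the iris data) that iterates over the groups and pulls out a single group with get_group().

```python
import pandas as pd

# Toy data standing in for the iris DataFrame (values are illustrative).
df = pd.DataFrame({
    "sl": [5.1, 4.9, 7.0, 6.4, 6.3],
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica"],
})

grouped = df.groupby("species")

# Iterating over a GroupBy object yields (group key, sub-DataFrame) pairs.
group_sizes = {name: len(group) for name, group in grouped}
print(group_sizes)
# {'setosa': 2, 'versicolor': 2, 'virginica': 1}

# get_group() returns only the rows that belong to one key.
setosa = grouped.get_group("setosa")
print(setosa)
```

The counts collected in the loop match what size() reports for the same data.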

Computing the mean, minimum, maximum, sum, and so on

Calling mean(), min(), max(), or sum() on the GroupBy object computes the corresponding statistic for each group.

print(grouped.mean())
#                sl     sw     pl     pw
# species                               
# setosa      5.006  3.428  1.462  0.246
# versicolor  5.936  2.770  4.260  1.326
# virginica   6.588  2.974  5.552  2.026

print(grouped.min())
#              sl   sw   pl   pw
# species                       
# setosa      4.3  2.3  1.0  0.1
# versicolor  4.9  2.0  3.0  1.0
# virginica   4.9  2.2  4.5  1.4

print(grouped.max())
#              sl   sw   pl   pw
# species                       
# setosa      5.8  4.4  1.9  0.6
# versicolor  7.0  3.4  5.1  1.8
# virginica   7.9  3.8  6.9  2.5

print(grouped.sum())
#                sl     sw     pl     pw
# species                               
# setosa      250.3  171.4   73.1   12.3
# versicolor  296.8  138.5  213.0   66.3
# virginica   329.4  148.7  277.6  101.3

The standard deviation std() and the variance var() are also available. Each of these methods returns a new pandas.DataFrame.

print(type(grouped.mean()))
# <class 'pandas.core.frame.DataFrame'>
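
As a short sketch of std() and var(), the example below uses a toy DataFrame (an assumption, not the iris data) with values chosen so the results are easy to check by hand. Both methods use the sample definition (ddof=1) by default.

```python
import pandas as pd

# Toy data: two groups with two rows each (values are illustrative).
df = pd.DataFrame({
    "sl": [5.0, 5.2, 6.0, 7.0],
    "sw": [3.0, 3.4, 2.8, 3.2],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})
grouped = df.groupby("species")

std_df = grouped.std()  # sample standard deviation per group (ddof=1)
var_df = grouped.var()  # sample variance per group (ddof=1)

print(std_df)
print(var_df)

# For versicolor, sl is [6.0, 7.0]: the mean is 6.5, so the sample variance
# is ((-0.5) ** 2 + 0.5 ** 2) / (2 - 1) = 0.5.
print(var_df.loc["versicolor", "sl"])
```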

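Aggregating with an arbitrary function: agg()

The agg() method of the GroupBy object applies an arbitrary aggregation.

Pass the function to apply as the argument. It can be a callable object such as a function, or a string containing the function's name.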

print(grouped.agg(min))
#              sl   sw   pl   pw
# species                       
# setosa      4.3  2.3  1.0  0.1
# versicolor  4.9  2.0  3.0  1.0
# virginica   4.9  2.2  4.5  1.4

print(grouped.agg('max'))
#              sl   sw   pl   pw
# species                       
# setosa      5.8  4.4  1.9  0.6
# versicolor  7.0  3.4  5.1  1.8
# virginica   7.9  3.8  6.9  2.5

Note that mean is not a Python built-in function, so passing the bare name mean raises a NameError. Pass the NumPy function np.mean or the string 'mean' instead.

# print(grouped.agg(mean))
# NameError: name 'mean' is not defined

print(grouped.agg(np.mean))
#                sl     sw     pl     pw
# species                               
# setosa      5.006  3.428  1.462  0.246
# versicolor  5.936  2.770  4.260  1.326
# virginica   6.588  2.974  5.552  2.026

print(grouped.agg('mean'))
#                sl     sw     pl     pw
# species                               
# setosa      5.006  3.428  1.462  0.246
# versicolor  5.936  2.770  4.260  1.326
# virginica   6.588  2.974  5.552  2.026
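
The sketch below checks, on a toy DataFrame (an assumption standing in for the iris data), that the string form of agg() gives the same result as calling the method directly. Recent pandas releases (2.1 and later, as far as I am aware) emit a FutureWarning when a built-in such as min or a NumPy function such as np.mean is passed to agg(), so the string names are probably the safer choice in new code.

```python
import pandas as pd

# Toy data standing in for the iris DataFrame (values are illustrative).
df = pd.DataFrame({
    "sl": [5.0, 5.2, 6.0, 7.0],
    "sw": [3.0, 3.4, 2.8, 3.2],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})
grouped = df.groupby("species")

# A string is resolved to the GroupBy method of the same name.
by_name = grouped.agg("mean")
by_method = grouped.mean()

print(by_name)
print(by_method)
```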

Passing a list applies several functions at once. In that case the columns of the resulting pandas.DataFrame form a MultiIndex.

print(grouped.agg([min, 'max']))
#              sl        sw        pl        pw     
#             min  max  min  max  min  max  min  max
# species                                           
# setosa      4.3  5.8  2.3  4.4  1.0  1.9  0.1  0.6
# versicolor  4.9  7.0  2.0  3.4  3.0  5.1  1.0  1.8
# virginica   4.9  7.9  2.2  3.8  4.5  6.9  1.4  2.5
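
MultiIndex columns can be awkward to work with afterwards. The following sketch, on a toy DataFrame (an assumption, not the iris data), shows one way to select a single column by its (column, function) tuple and one way to flatten the two levels into plain names. The underscore-joined naming scheme is only an illustrative choice.

```python
import pandas as pd

# Toy data standing in for the iris DataFrame (values are illustrative).
df = pd.DataFrame({
    "sl": [5.0, 5.2, 6.0, 7.0],
    "sw": [3.0, 3.4, 2.8, 3.2],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})

result = df.groupby("species").agg(["min", "max"])

# Select one column of the MultiIndex with a (column, function) tuple.
sl_max = result[("sl", "max")]
print(sl_max)

# Flatten the two column levels into names such as 'sl_min'.
flat = result.copy()
flat.columns = ["_".join(col) for col in result.columns]
print(flat.columns.tolist())
# ['sl_min', 'sl_max', 'sw_min', 'sw_max']
```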

You can also pass a dictionary (a dict object) keyed by column name to apply a different function to each column.

print(grouped.agg({'sl': min, 'sw': max, 'pl': np.mean, 'pw': 'mean'}))
#              sl   sw     pl     pw
# species                           
# setosa      4.3  4.4  1.462  0.246
# versicolor  4.9  3.4  4.260  1.326
# virginica   4.9  3.8  5.552  2.026
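
A related option, not used in the original article, is named aggregation, where each keyword argument to agg() names an output column and maps it to a (column, function) pair. The sketch below uses a toy DataFrame and hypothetical output names (sl_min, sl_max, sw_mean) chosen only for illustration.

```python
import pandas as pd

# Toy data standing in for the iris DataFrame (values are illustrative).
df = pd.DataFrame({
    "sl": [5.0, 5.2, 6.0, 7.0],
    "sw": [3.0, 3.5, 2.5, 3.5],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})

# Each keyword is the output column name; the tuple is (input column, function).
summary = df.groupby("species").agg(
    sl_min=("sl", "min"),
    sl_max=("sl", "max"),
    sw_mean=("sw", "mean"),
)
print(summary)
```

Compared with a dictionary, this form lets the same input column appear several times while keeping the result columns flat.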

Anonymous functions (lambda expressions) also work.

print(grouped.agg(lambda x: max(x) - min(x)))
#              sl   sw   pl   pw
# species                       
# setosa      1.5  2.1  0.9  0.5
# versicolor  2.1  1.4  2.1  0.8
# virginica   3.0  1.6  2.4  1.1

With a lambda expression, the values of each group are passed in as a pandas.Series.

print(grouped.agg(lambda x: type(x))['sl'])
# species
# setosa        <class 'pandas.core.series.Series'>
# versicolor    <class 'pandas.core.series.Series'>
# virginica     <class 'pandas.core.series.Series'>
# Name: sl, dtype: object

Note that the lambda expression must take a pandas.Series and return a single aggregated value. Otherwise an error is raised.

# print(grouped.agg(lambda x: x + 1))
# Exception: Must produce aggregated value
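
To make the distinction concrete, the sketch below contrasts a lambda that reduces each group to one value, which is what agg() expects, with an element-wise operation. For the element-wise case it uses transform(), which the original article does not cover, so treat that part as a suggested alternative rather than the author's method. The toy DataFrame is likewise an assumption.

```python
import pandas as pd

# Toy data standing in for the iris DataFrame (values are illustrative).
df = pd.DataFrame({
    "sl": [5.0, 5.5, 6.0, 7.0],
    "sw": [3.0, 3.5, 2.5, 3.5],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})
grouped = df.groupby("species")

# agg(): the lambda receives one Series per group and column,
# and returns a single value (here the range max - min).
value_range = grouped.agg(lambda x: x.max() - x.min())
print(value_range)

# transform(): the lambda returns a Series of the same length as its input,
# so the result keeps one row per original row.
shifted = grouped.transform(lambda x: x + 1)
print(shifted)
```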

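Computing the main summary statistics at once: describe()

The describe() method computes the main summary statistics for every group in one call.

The following example prints only the result for the sl column.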

print(grouped.describe()['sl']) 
#             count   mean       std  min    25%  50%  75%  max
# species                                                      
# setosa       50.0  5.006  0.352490  4.3  4.800  5.0  5.2  5.8
# versicolor   50.0  5.936  0.516171  4.9  5.600  5.9  6.3  7.0
# virginica    50.0  6.588  0.635880  4.9  6.225  6.5  6.9  7.9
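
If only one column is of interest, it can also be selected from the GroupBy object before calling describe(), which avoids computing statistics for the other columns. The sketch below, on a toy DataFrame (an assumption standing in for the iris data), compares the two orders of selection.

```python
import pandas as pd

# Toy data standing in for the iris DataFrame (values are illustrative).
df = pd.DataFrame({
    "sl": [5.0, 5.5, 6.0, 7.0],
    "sw": [3.0, 3.5, 2.5, 3.5],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})
grouped = df.groupby("species")

# Describe everything, then pick the 'sl' block of the result.
after = grouped.describe()["sl"]

# Pick the 'sl' column first, then describe only that column.
before = grouped["sl"].describe()

print(before)
```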

Plotting

As noted above, applying a method such as mean(), min(), max(), or sum() to a GroupBy object returns a pandas.DataFrame, so its plot() method can be used directly to visualize the result.

print(type(grouped.max()))
# <class 'pandas.core.frame.DataFrame'>

ax = grouped.max().plot.bar(rot=0)
fig = ax.get_figure()
fig.savefig('./data/16/iris_pandas_groupby_max.png')
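
Note that savefig() fails if the target directory does not exist. The following is a self-contained sketch that creates the directory first and uses a non-interactive matplotlib backend so it can run without a display. The toy DataFrame, the temporary output location, and the file name are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display needed
import pandas as pd

# Toy data standing in for the iris DataFrame (values are illustrative).
df = pd.DataFrame({
    "sl": [5.0, 5.5, 6.0, 7.0],
    "sw": [3.0, 3.5, 2.5, 3.5],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})

# Hypothetical output location inside a temporary directory.
out_dir = Path(tempfile.mkdtemp()) / "data" / "16"
out_dir.mkdir(parents=True, exist_ok=True)  # create missing parent folders
out_path = out_dir / "groupby_max.png"

# One bar group per species, one bar per column.
ax = df.groupby("species").max().plot.bar(rot=0)
ax.set_ylabel("max value")
fig = ax.get_figure()
fig.savefig(out_path)
print(out_path.exists())
```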