Common pandas Aggregation Operations

给我起把狙

Published 2024-09-02 11:25:25

Tags: pandas

Article link: https://blog.csdn.net/weixin_48232453/article/details/141815420

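pandas provides a range of aggregation operations that group rows and compute a summary value for each group. Common ones include the minimum, maximum, mean, sum, and count.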

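1. Common aggregation operations

1.1 `min` and `max`

Purpose: compute the minimum and maximum value of each group, respectively.
Usage: pass 'min' or 'max' as the function name in agg.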

import pandas as pd

data = {
    'powerplant': ['A', 'A', 'B', 'B'],
    'dtime': ['2021-01-01', '2021-01-02', '2021-01-01', '2021-01-02'],
    'temperature': [15, 17, 14, 16],
    'wspd': [5, 6, 7, 8]
}

df = pd.DataFrame(data)

# Group by 'powerplant', then compute each group's minimum temperature and maximum wind speed
result = df.groupby('powerplant').agg(
    temperature_min=('temperature', 'min'),
    wspd_max=('wspd', 'max')
)

print(result)

Output:

            temperature_min  wspd_max
powerplant                           
A                       15         6
B                       14         8
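
When only one column is involved, the same numbers can be obtained by selecting that column from the grouped object and calling the reduction directly. The following is a minimal sketch that rebuilds the sample data; the variable names `min_temp` and `max_wspd` are illustrative and not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'powerplant': ['A', 'A', 'B', 'B'],
    'temperature': [15, 17, 14, 16],
    'wspd': [5, 6, 7, 8],
})

# Select a single column from the grouped object, then reduce it directly.
# Each call returns a Series indexed by 'powerplant'.
min_temp = df.groupby('powerplant')['temperature'].min()
max_wspd = df.groupby('powerplant')['wspd'].max()

print(min_temp)
print(max_wspd)
```

The agg form is still the better fit when several columns or several output names are needed in one result.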

1.2 `mean`

Purpose: compute the mean of each group.
Usage: pass 'mean' as the function name in agg.

# Mean temperature and mean wind speed for each power plant
result = df.groupby('powerplant').agg(
    temperature_mean=('temperature', 'mean'),
    wspd_mean=('wspd', 'mean')
)

print(result)

Output:

            temperature_mean  wspd_mean
powerplant                             
A                       16.0        5.5
B                       15.0        7.5
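
One detail that matters with real data: the grouped mean skips missing values (NaN) rather than propagating them. The sketch below illustrates this with a made-up frame, `df_nan`, in which one temperature reading for plant A is missing; the frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: plant A has one missing temperature reading
df_nan = pd.DataFrame({
    'powerplant': ['A', 'A', 'B', 'B'],
    'temperature': [15.0, np.nan, 14.0, 16.0],
})

# The NaN is ignored, so A's mean is computed from the single valid value
mean_temp = df_nan.groupby('powerplant')['temperature'].mean()

print(mean_temp)
```

Here plant A averages to 15.0 (one valid reading) and plant B to 15.0 (two readings).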

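1.3 `sum`

Purpose: compute the sum of each group.
Usage: pass 'sum' as the function name in agg.

# Total temperature and total wind speed for each power plant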
result = df.groupby('powerplant').agg(
    temperature_sum=('temperature', 'sum'),
    wspd_sum=('wspd', 'sum')
)

print(result)

Output:

            temperature_sum  wspd_sum
powerplant                           
A                       32        11
B                       30        15

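1.4 `count`

Purpose: count the non-null values in each group.
Usage: pass 'count' as the function name in agg.

# Number of data points (non-null values) for each power plant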
result = df.groupby('powerplant').agg(
    temperature_count=('temperature', 'count'),
    wspd_count=('wspd', 'count')
)

print(result)

Output:

            temperature_count  wspd_count
powerplant                                
A                           2           2
B                           2           2
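
Because count only counts non-null values, it can differ from the number of rows in a group. The sketch below contrasts 'count' with 'size' on a made-up frame, `df_nan`, that contains one missing value; the frame and the output column name `n_rows` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: plant A has one missing temperature reading
df_nan = pd.DataFrame({
    'powerplant': ['A', 'A', 'B', 'B'],
    'temperature': [15.0, np.nan, 14.0, 16.0],
})

result = df_nan.groupby('powerplant').agg(
    temperature_count=('temperature', 'count'),  # non-null values only
    n_rows=('temperature', 'size'),              # all rows, NaN included
)

print(result)
```

For plant A this gives a count of 1 but a size of 2; for plant B both are 2.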

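1.5 `median`

Purpose: compute the median of each group.
Usage: pass 'median' as the function name in agg.

# Median temperature and median wind speed for each power plant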
result = df.groupby('powerplant').agg(
    temperature_median=('temperature', 'median'),
    wspd_median=('wspd', 'median')
)

print(result)

Output:

            temperature_median  wspd_median
powerplant                                 
A                         16.0          5.5
B                         15.0          7.5

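1.6 Custom functions

Purpose: agg also accepts a user-defined function, which allows more complex aggregations.
Usage: pass the function (for example, a lambda) in place of a function name in agg.

# Use a custom function to compute each plant's temperature range (max - min)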
result = df.groupby('powerplant').agg(
    temperature_range=('temperature', lambda x: x.max() - x.min())
)

print(result)

Output:

            temperature_range
powerplant                   
A                           2
B                           2
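
A lambda works, but a named function is easier to reuse and document. The sketch below is one possible variant of the same range calculation; the helper name `value_range` is hypothetical and not from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'powerplant': ['A', 'A', 'B', 'B'],
    'temperature': [15, 17, 14, 16],
})

def value_range(s: pd.Series):
    """Return the spread (max - min) of one group's values."""
    return s.max() - s.min()

# agg calls value_range once per group, passing that group's values as a Series
result = df.groupby('powerplant').agg(
    temperature_range=('temperature', value_range)
)

print(result)
```

The function receives each group's values as a Series and must return a single value per group.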

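1.7 `mode` (most frequent value)

Purpose: find the most frequent value (mode) in each group.
Usage: Series.mode() returns a Series, since a group can have several modes, so wrap it in a lambda inside agg and pick one value.

# Assume the 'wea_types' column holds the weather type of each record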
df['wea_types'] = ['Sunny', 'Rainy', 'Sunny', 'Rainy']

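# Most common weather type (mode) for each power plant.
# mode() returns tied values in sorted order, so [0] picks the first of them;
# in this sample each plant has a tie between 'Rainy' and 'Sunny'.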
result = df.groupby('powerplant').agg(
    wea_type_mode=('wea_types', lambda x: x.mode()[0] if not x.mode().empty else None)
)

print(result)

Output:

            wea_type_mode
powerplant               
A                   Rainy
B                   Rainy
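
Picking the first element hides ties. If ties matter, one option is to report every mode of a group. The sketch below joins them into a single string; the sample data and the helper name `all_modes` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: plant A has a clear mode, plant B has a tie
df = pd.DataFrame({
    'powerplant': ['A', 'A', 'A', 'B', 'B'],
    'wea_types': ['Sunny', 'Sunny', 'Rainy', 'Sunny', 'Rainy'],
})

def all_modes(s: pd.Series) -> str:
    """Join every mode of the group into one comma-separated string."""
    return ', '.join(s.mode())

result = df.groupby('powerplant').agg(
    wea_type_modes=('wea_types', all_modes)
)

print(result)
```

Plant A yields 'Sunny', while plant B yields 'Rainy, Sunny' because both values occur once.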

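2. Multiple aggregations on one column

In pandas, several different aggregations can also be applied to the same column. Each one gets its own named entry in the agg call.

# Apply several aggregations to each plant's temperature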
result = df.groupby('powerplant').agg(
    temperature_min=('temperature', 'min'),
    temperature_max=('temperature', 'max'),
    temperature_mean=('temperature', 'mean')
)

print(result)

Output:

            temperature_min  temperature_max  temperature_mean
powerplant                                                   
A                       15               17              16.0
B                       14               16              15.0
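
If custom output names are not needed, a shorter form is to select the column and pass a list of function names to agg; the result columns are then named after the functions. A minimal sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'powerplant': ['A', 'A', 'B', 'B'],
    'temperature': [15, 17, 14, 16],
})

# A list of function names produces one output column per function
result = df.groupby('powerplant')['temperature'].agg(['min', 'max', 'mean'])

print(result)
```

The resulting columns are 'min', 'max', and 'mean', in the order given in the list.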

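3. Applying different aggregations to different columns

Sometimes different columns of a DataFrame need different aggregations. Passing a dictionary to agg handles this: each key is a column name, and each value is a function name or a list of function names. The named form used earlier (new_name=(column, function)) is documented as taking a single function per entry, so a list of functions goes through the dictionary form instead. The result then has two-level column labels.

# Apply different aggregations to different columns
result = df.groupby('powerplant').agg({
    'temperature': ['min', 'max', 'mean'],
    'wspd': 'sum'
})

print(result)

Output:

            temperature             wspd
                    min  max  mean   sum
powerplant
A                    15   17  16.0    11
B                    14   16  15.0    15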
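
A dictionary aggregation that includes a list of functions produces two-level column labels, which can be awkward to work with downstream. One common way to flatten them is to join the two levels with an underscore. The sketch below shows this under the assumption that every label is a pair of strings.

```python
import pandas as pd

df = pd.DataFrame({
    'powerplant': ['A', 'A', 'B', 'B'],
    'temperature': [15, 17, 14, 16],
    'wspd': [5, 6, 7, 8],
})

result = df.groupby('powerplant').agg({
    'temperature': ['min', 'max', 'mean'],
    'wspd': 'sum',
})

# Each column label is a (column, function) tuple; join the parts into one name
result.columns = ['_'.join(col) for col in result.columns]

print(result)
```

After flattening, the columns are temperature_min, temperature_max, temperature_mean, and wspd_sum.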

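4. Aggregating over multiple grouping columns

pandas also supports grouping by several columns at once. In that case the aggregated result is indexed by a multi-level row index (MultiIndex), with one level per grouping column.

# Group by power plant and date, then aggregate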
result = df.groupby(['powerplant', 'dtime']).agg(
    temperature_mean=('temperature', 'mean'),
    wspd_sum=('wspd', 'sum')
)

print(result)

Output:

                       temperature_mean  wspd_sum
powerplant dtime
A          2021-01-01              15.0         5
           2021-01-02              17.0         6
B          2021-01-01              14.0         7
           2021-01-02              16.0         8
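
With a multi-level index, a single row is addressed by a tuple of keys, and reset_index turns the index levels back into ordinary columns. The sketch below shows both on the same sample data; the variable names `wspd_a_day2` and `flat` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'powerplant': ['A', 'A', 'B', 'B'],
    'dtime': ['2021-01-01', '2021-01-02', '2021-01-01', '2021-01-02'],
    'temperature': [15, 17, 14, 16],
    'wspd': [5, 6, 7, 8],
})

result = df.groupby(['powerplant', 'dtime']).agg(
    temperature_mean=('temperature', 'mean'),
    wspd_sum=('wspd', 'sum'),
)

# Look up one cell by passing a (powerplant, dtime) tuple as the row key
wspd_a_day2 = result.loc[('A', '2021-01-02'), 'wspd_sum']

# Move both index levels back into regular columns
flat = result.reset_index()

print(wspd_a_day2)
print(flat)
```

The flat form is often more convenient for exporting or merging, while the MultiIndex form is convenient for lookups by group key.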