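Group operations follow three steps: split, apply, combine.
Split: decide what to group the rows by.
Apply: decide which computation to run on each group.
Combine: merge the per-group results into a single result.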
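The three steps above can be spelled out by hand. The following is only a sketch for illustration: the name df_demo and its fixed values are assumptions chosen to make the result reproducible, and the manual loop is not how pandas implements grouping internally.

```python
import pandas as pd

# Hypothetical fixed data, so the result is the same on every run.
df_demo = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1, 5, 8, 5, 2],
})

# Split: collect the data1 values that belong to each key1 label.
pieces = {}
for label, value in zip(df_demo['key1'], df_demo['data1']):
    pieces.setdefault(label, []).append(value)

# Apply: reduce each piece independently (here, with a sum).
totals = {label: sum(values) for label, values in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(totals, name='data1')
manual.index.name = 'key1'

# groupby performs the same three steps in a single call.
builtin = df_demo.groupby('key1')['data1'].sum()
print(manual)
print(builtin)
```

Both printed Series should hold the same per-group totals; the manual version only makes the intermediate steps visible.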
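import numpy as np
import pandas as pd

# data1 and data2 are random integers in [1, 10), so the printed values differ between runs
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randint(1, 10, 5),
                   'data2': np.random.randint(1, 10, 5)})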
print(df)
  key1 key2  data1  data2
0    a  one      1      7
1    a  two      5      4
2    b  one      8      9
3    b  two      5      7
4    a  one      2      4
Group data1 by key1 and sum each group
print(df['data1'].groupby(df['key1']).sum())
key1
a 8
b 13
Name: data1, dtype: int32
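Grouping a column by another Series, as above, can also be written by grouping the whole frame by column name and then selecting the column. A small sketch, assuming a frame named df_demo (a hypothetical name) filled with the values from the printed table rather than random ones:

```python
import pandas as pd

# Fixed values copied from the printed table, so the result is reproducible.
df_demo = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1, 5, 8, 5, 2],
    'data2': [7, 4, 9, 7, 4],
})

# Spelling 1: select the column, then group it by another Series.
by_series = df_demo['data1'].groupby(df_demo['key1']).sum()

# Spelling 2: group the frame by column name, then select the column.
by_name = df_demo.groupby('key1')['data1'].sum()
print(by_series.equals(by_name))

# Grouping the whole frame sums every remaining numeric column at once.
all_sums = df_demo.groupby('key1').sum()
print(all_sums)
```

The second spelling is usually easier to read when the key is a column of the same frame.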
Group by an explicit list of keys (one label per row, matched by position) and sum
key = [1, 2, 1, 1, 2]
print(df['data1'].groupby(key).sum())
1 20
2 6
Name: data1, dtype: int32
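To make the positional matching concrete, here is a sketch with fixed values in place of the random ones (the values are taken from the first printed table and are only an assumption for illustration). Rows 0, 2 and 3 receive label 1, and rows 1 and 4 receive label 2.

```python
import pandas as pd

data1 = pd.Series([1, 5, 8, 5, 2], name='data1')
key = [1, 2, 1, 1, 2]  # one label per row, matched by position

# Label 1 collects 1 + 8 + 5, label 2 collects 5 + 2.
by_list = data1.groupby(key).sum()
print(by_list)
```

The key list must have the same length as the Series being grouped.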
Multi-level grouping: pass a list of keys
print(df['data1'].groupby([df['key1'], df['key2']]).sum())
key1  key2
a     one     4
      two     7
b     one     5
      two     1
# Sum data1 for each (key1, key2) pair, then pivot the inner level key2 into columns
sums = df.groupby(['key1', 'key2'])['data1'].sum()
print(sums.unstack())
key2  one  two
key1
a      15    9
b       6    4
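The relation between the two-level result and its unstacked form is easier to follow with fixed numbers. A sketch, assuming a frame named df_demo (hypothetical) that holds the values from the first printed table:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1, 5, 8, 5, 2],
})

# Two keys produce a Series indexed by (key1, key2) pairs.
sums = df_demo.groupby(['key1', 'key2'])['data1'].sum()
print(sums)

# unstack() moves the inner level (key2) from the rows to the columns.
wide = sums.unstack()
print(wide)
```

Each cell of wide is the sum for one (key1, key2) pair, for example rows 0 and 4 for ('a', 'one').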
# Iterating over a GroupBy yields (group name, sub-DataFrame) pairs
for name, group in df.groupby('key1'):
    print(name)
    print(group)
a
key1 key2 data1 data2
0 a one 5 1
1 a two 7 6
4 a one 8 1
b
key1 key2 data1 data2
2 b one 6 4
3 b two 8 7
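Because iteration yields (name, sub-DataFrame) pairs, the groups can be collected into a dictionary for later lookup. A sketch, assuming a frame named df_demo (hypothetical) with fixed values:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1, 5, 8, 5, 2],
})

# Collect every group into a dict keyed by group name.
pieces = {name: group for name, group in df_demo.groupby('key1')}
print(pieces['b'])

# get_group fetches a single group without iterating over all of them.
only_b = df_demo.groupby('key1').get_group('b')
print(only_b.equals(pieces['b']))
```

Each sub-DataFrame keeps the original row labels, which is why group 'a' in the output above shows rows 0, 1 and 4.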
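Group the columns and sum within each group. The df used here is a different DataFrame (its definition is not shown), with columns 'a' through 'e' and one row per name.
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'orange', 'e': 'blue'}
# groupby(..., axis=1) is deprecated since pandas 2.1; df.T.groupby(mapping).sum().T gives the same result
grouped = df.groupby(mapping, axis=1)
print(grouped.sum())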
blue orange red
Alice 14 3 14
Bob 15 2 11
Candy 13 7 8
Dark 13 5 13
Emily 7 2 12
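Since the frame behind this output is not shown, here is a self-contained sketch of column grouping. The name people and its values are assumptions for illustration. It groups the transposed frame instead of passing axis=1, which avoids the deprecated argument.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame whose definition is not shown.
people = pd.DataFrame(np.arange(25).reshape(5, 5),
                      columns=list('abcde'),
                      index=['Alice', 'Bob', 'Candy', 'Dark', 'Emily'])
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'orange', 'e': 'blue'}

# Transpose so the columns become rows, group those rows through the
# mapping, sum each group, then transpose back.
by_color = people.T.groupby(mapping).sum().T
print(by_color)
```

For the Alice row (0, 1, 2, 3, 4), red is a + b, blue is c + e, and orange is d alone.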
Inspect the grouping (number of columns in each group)
blue 2
orange 1
red 2
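The call that printed these counts is not shown. One way to obtain the same kind of result, offered only as an assumption, is size() on the grouped object. The sketch below uses a hypothetical frame named people and groups the transposed frame so that it does not rely on axis=1.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same column names as in the notes.
people = pd.DataFrame(np.arange(25).reshape(5, 5),
                      columns=list('abcde'),
                      index=['Alice', 'Bob', 'Candy', 'Dark', 'Emily'])
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'orange', 'e': 'blue'}

# size() counts how many columns were mapped to each group label.
sizes = people.T.groupby(mapping).size()
print(sizes)
```

The counts depend only on mapping, not on the values stored in the frame.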
Multi-level index
One level of row index, two levels of column index
columns = pd.MultiIndex.from_arrays([['China', 'USA', 'China', 'US', 'China'],
                                     ['A', 'A', 'B', 'C', 'B']],
                                    names=['country', 'index'])
df = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
print(df)
country China USA China US China
index       A   A     B  C     B
0           3   8     4  4     3
1           6   1     6  6     2
2           9   8     7  5     5
3           2   1     1  4     3
4           2   3     5  5     6
Group by the country level of the column index and sum
country China US USA
0 16 2 7
1 13 5 8
2 11 8 1
3 23 9 6
4 13 7 3
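The code for this last step does not appear in the notes. The following sketch shows one way to group by a level of the column index; it is an assumption, not necessarily the call that produced the output above. It uses fixed values in a hypothetical frame named scores and groups the transposed frame by level name.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays([['China', 'USA', 'China', 'US', 'China'],
                                     ['A', 'A', 'B', 'C', 'B']],
                                    names=['country', 'index'])
# Fixed values (0..24) instead of random ones, so the sums can be checked.
scores = pd.DataFrame(np.arange(25).reshape(5, 5), columns=columns)

# Transpose so the column levels become row levels, group on the
# 'country' level, sum each group, then transpose back.
by_country = scores.T.groupby(level='country').sum().T
print(by_country)
```

Note that 'US' and 'USA' are distinct labels, so they form separate groups, just as in the output above.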