groupby: the most commonly used grouping function in pandas
(1) Grouping by column
import pandas as pd
import numpy as np
df = pd.DataFrame({'key1':['a','a','b','b','a'],'key2':['one','two','one','two','one'],'data1':np.random.randn(5),'data2':np.random.randn(5)})
df
      data1     data2 key1 key2
0 -1.488061 -0.002241    a  one
1  0.707773  0.338733    a  two
2 -1.689161  0.647643    b  one
3  0.987463 -0.584322    b  two
4 -0.560973 -1.147191    a  one
Group by a single column name, 'key1':
group1 = df.groupby('key1')
[x for x in group1]
[('a',       data1     data2 key1 key2
  0 -1.488061 -0.002241    a  one
  1  0.707773  0.338733    a  two
  4 -0.560973 -1.147191    a  one),
 ('b',       data1     data2 key1 key2
  2 -1.689161  0.647643    b  one
  3  0.987463 -0.584322    b  two)]
Group by multiple column names, ['key1', 'key2']:
group2 = df.groupby(['key1','key2'])
[x for x in group2]
[(('a', 'one'),       data1     data2 key1 key2
  0 -1.488061 -0.002241    a  one
  4 -0.560973 -1.147191    a  one),
 (('a', 'two'),       data1     data2 key1 key2
  1  0.707773  0.338733    a  two),
 (('b', 'one'),       data1     data2 key1 key2
  2 -1.689161  0.647643    b  one),
 (('b', 'two'),       data1     data2 key1 key2
  3  0.987463 -0.584322    b  two)]
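Here group1 is an intermediate grouping object of type GroupBy.
The comprehension [x for x in group1] displays the contents of the grouping as (key, sub-DataFrame) pairs.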
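Besides the list comprehension, a GroupBy object can be inspected in a few other ways. The following is a minimal sketch, assuming fixed illustrative numbers in place of np.random.randn so that the results are reproducible (these values are not from the original data). It loops over the (key, sub-DataFrame) pairs, fetches one group with get_group(), and reads the groups attribute.

```python
import pandas as pd

# Fixed illustrative values (an assumption; the original uses np.random.randn).
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 50.0],
})

group1 = df.groupby('key1')

# Iterating yields (group key, sub-DataFrame) pairs.
group_sizes = {}
for key, sub_df in group1:
    group_sizes[key] = len(sub_df)

# get_group() returns the rows that belong to a single group.
rows_a = group1.get_group('a')

# groups maps each key to the index labels of its rows.
labels_b = list(group1.groups['b'])

print(group_sizes)
print(rows_a)
print(labels_b)
```

get_group() is convenient when only one group is of interest, since it avoids materializing every group in a list.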
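(2) Group statistics
Applying statistical functions such as size(), sum(), and count() to group1 and group2 gives, respectively, the number of rows in each group, the per-group sum of each column, and the per-group count of non-missing values in each column.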
group1.size()
key1
a    3
b    2
dtype: int64
group1.sum()
         data1     data2
key1
a    -1.341260 -0.810698
b    -0.701698  0.063321
group2.size()
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
group2.count()
           data1  data2
key1 key2
a    one       2      2
     two       1      1
b    one       1      1
     two       1      1
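size() and count() give the same numbers above only because the data has no missing values. The sketch below, which assumes a small hypothetical frame df_nan with one NaN (not part of the original data), shows where they differ: size() counts rows, while count() counts non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in group 'a'.
df_nan = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, np.nan, 3.0, 4.0, 5.0],
})

g = df_nan.groupby('key1')

sizes = g.size()              # rows per group, NaN rows included
counts = g['data1'].count()   # non-missing values per group
sums = g['data1'].sum()       # NaN is skipped by default

print(sizes)
print(counts)
print(sums)
```

Selecting the column first, as in g['data1'], also keeps non-numeric columns out of the reduction.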
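(3) agg()
agg(func) applies func to one or more columns of the grouped data, and it extends to applying several functions to several columns at once.
Example: mean of the 'data1' column after grouping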
group1['data1'].agg('mean')
key1
a   -0.447087
b   -0.350849
Name: data1, dtype: float64
Example: mean and sum of both the 'data1' and 'data2' columns after grouping
group1[['data1','data2']].agg(['mean','sum'])
         data1               data2
          mean       sum      mean       sum
key1
a    -0.447087 -1.341260 -0.270233 -0.810698
b    -0.350849 -0.701698  0.031660  0.063321
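agg() also accepts a dict that maps columns to functions, and (since pandas 0.25) keyword arguments of the form output_name=(column, function). The following is a minimal sketch with fixed illustrative numbers (an assumption, not the original random data); the output names data1_mean and data2_sum are hypothetical.

```python
import pandas as pd

# Fixed illustrative values (an assumption; the original uses np.random.randn).
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 50.0],
})
group1 = df.groupby('key1')

# Dict form: a different function (or list of functions) per column.
per_column = group1.agg({'data1': 'mean', 'data2': ['min', 'max']})

# Named aggregation: output_name=(column, function) gives flat column names.
named = group1.agg(
    data1_mean=('data1', 'mean'),
    data2_sum=('data2', 'sum'),
)

print(per_column)
print(named)
```

The named form avoids the two-level column header that the list form produces, which can make the result easier to save or join later.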
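(4) apply()
Difference from agg(): agg() applies a function to each selected column separately and expects one value per column, whereas apply() passes each group to the function as a whole DataFrame, so the function sees all of the group's columns at once.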
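# select the numeric columns; g.mean() is the column-wise mean of each group
df.groupby('key1')[['data1','data2']].apply(lambda g: g.mean())
         data1     data2
key1
a    -0.447087 -0.270233
b    -0.350849  0.031660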
df.groupby(['key1','key2'])[['data1','data2']].apply(lambda g: g.mean())
              data1     data2
key1 key2
a    one  -1.024517 -0.574716
     two   0.707773  0.338733
b    one  -1.689161  0.647643
     two   0.987463 -0.584322
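Because apply() receives each group as a DataFrame, the function can combine several columns or return several rows, which a per-column agg() function cannot do. The sketch below uses fixed illustrative numbers (an assumption), and the helper name product_sum is hypothetical.

```python
import pandas as pd

# Fixed illustrative values (an assumption; the original uses np.random.randn).
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 50.0],
})
grouped = df.groupby('key1')[['data1', 'data2']]

def product_sum(group):
    # Uses two columns of the same group at once.
    return (group['data1'] * group['data2']).sum()

# One scalar per group -> a Series indexed by key1.
weighted = grouped.apply(product_sum)

# A DataFrame per group -> rows stacked under a (key1, original index) index.
top = grouped.apply(lambda g: g.nlargest(1, 'data1'))

print(weighted)
print(top)
```

Selecting the numeric columns before apply() keeps the grouping column out of the function's input, which sidesteps version-dependent behaviour around whether group keys are passed to the function.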
(5) reset_index()
reset_index() moves the group keys from the index of a groupby() result back into ordinary columns, giving a flat DataFrame that is convenient to save.
group1[['data1','data2']].agg(['mean','sum']).reset_index()
  key1     data1               data2
            mean       sum      mean       sum
0    a -0.447087 -1.341260 -0.270233 -0.810698
1    b -0.350849 -0.701698  0.031660  0.063321
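The result above still has a two-level column header, which is awkward in a CSV file. The following is a minimal sketch of one way to flatten it before saving, again with fixed illustrative numbers (an assumption); the joined names such as data1_mean are a naming choice made here, not something pandas produces by itself. It also shows as_index=False as an alternative to calling reset_index() afterwards.

```python
import io
import pandas as pd

# Fixed illustrative values (an assumption; the original uses np.random.randn).
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 50.0],
})

stats = df.groupby('key1')[['data1', 'data2']].agg(['mean', 'sum'])

# Join the two column levels, e.g. ('data1', 'mean') -> 'data1_mean'.
stats.columns = ['_'.join(col) for col in stats.columns]
flat = stats.reset_index()   # key1 becomes an ordinary column

# Write to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
flat.to_csv(buffer, index=False)
header = buffer.getvalue().splitlines()[0]
print(header)

# Alternative: keep the key as a column from the start.
alt = df.groupby('key1', as_index=False)[['data1', 'data2']].mean()
print(alt)
```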