现在考虑以下这个简单的DataFrame:
df = pd.DataFrame({'key1':list('aabba'),
'key2':['one','two','one','two','one'],
'data1':np.random.randn(5),
'data2':np.random.randn(5)})
假设我们需要按key1计算data1列的均值,其中一种是调用groupby方法:
grouped = df['data1'].groupby(df['key1'])
print(grouped)
# <pandas.core.groupby.generic.SeriesGroupBy object at 0x0000020C98733278>
print(grouped.mean())
# key1
# a 1.102942
# b -0.397096
grouped是一个GroupBy对象。可以理解为以key1的值作为索引生成了多个新的Series对象,如果我们将多个数组作为列表传入,则可以得到以下结果:
grouped = df['data1'].groupby([df['key1'],df['key2']])
print(grouped.mean())
# key1 key2
# a one -0.151180
# two 0.348812
# b one -1.013545
# two -0.165107
# Name: data1, dtype: float64
groupby方法中分组键也可以是任意长度的数组:
states = np.array(['O','C','C','O','O'])
year = np.array([2005,2005,2006,2005,2006])
grouped = df['data1'].groupby([states,year])
print(grouped.mean())
# C 2005 -0.092444
# 2006 0.026081
# O 2005 -0.770975
# 2006 1.242315
# Name: data1, dtype: float64
如果你想将DataFrame中所有的数据按照key1分组,则可以使用以下方式:
grouped = df.groupby('key1')
print(grouped.mean())
# data1 data2
# key1
# a -0.505483 -0.273290
# b -0.747742 0.042547
注意到结果里面并没有key2列,这是因为key2列中内容并非为数值数据。类似的还可以传入列名列表:
grouped = df.groupby(['key1','key2'])
print(grouped.mean())
# data1 data2
# key1 key2
# a one -1.635699 0.811572
# two -0.185163 0.254038
# b one -0.235654 -0.935724
# two 1.243953 -0.165800
使用size方法则可以轻松的查看各个分组的数据个数(不包含NaN):
grouped = df.groupby(['key1','key2'])
print(grouped.size())
# key1 key2
# a one 2
# two 1
# b one 1
# two 1
# dtype: int64
遍历各个分组
既然已经完成了分组,那么我们自然而然就会想到对结果进行遍历访问,可以使用以下代码进行遍历:
grouped = df.groupby('key1')
for name,group in grouped:
print(name)
print(group)
# a
# key1 key2 data1 data2
# 0 a one -1.441728 0.255910
# 1 a two -1.802737 -2.371888
# 4 a one 1.054068 0.894644
# b
# key1 key2 data1 data2
# 2 b one -0.165366 0.327962
# 3 b two -2.213890 -1.306378
倘若按照了多个键进行分组,返回的第一个元素是一个元组,则可以使用以下方式进行遍历:
grouped = df.groupby(['key1','key2'])
for (k1,k2),group in grouped:
print((k1,k2))
print(group)
# ('a', 'one')
# key1 key2 data1 data2
# 0 a one 0.014639 1.766811
# 4 a one 0.083017 1.360163
# ('a', 'two')
# key1 key2 data1 data2
# 1 a two -0.186763 -0.16364
# ('b', 'one')
# key1 key2 data1 data2
# 2 b one -0.050916 1.504799
# ('b', 'two')
# key1 key2 data1 data2
# 3 b two 1.401061 -0.659461
为了方便进行随机访问,可以将按以下操作将分组结果装入一个字典中:
grouped = df.groupby('key1')
pieces = dict(list(grouped))
print(pieces['b'])
# key1 key2 data1 data2
# 2 b one 0.389972 1.179354
# 3 b two 1.011641 0.832874
选择一列或者所有列的子集
对于大型数据集,可能只需要聚合少部分列,例如,在处理数据时,计算data1的均值,可以写为:
print(df['data1'].groupby(df['key1']).mean())
# key1 返回的是Series形式
# a 0.700501
# b 0.121686
# Name: data1, dtype: float64
print(df[['data1']].groupby(df['key1']).mean())
# data1 返回的是DataFrame形式
# key1
# a -0.991542
# b -1.008923
以上两种写法比较复杂,因此开发者提供了一个语法糖来进行快速的选择:
print(df.groupby(['key1','key2'])[['data1']].mean())
# data1 结果为DataFrame形式
# key1 key2
# a one 0.285455
# two -0.331077
# b one 1.145625
# two 0.050510
print(df.groupby(['key1','key2'])['data1'].mean())
# key1 key2 结果为Sereis形式
# a one 0.285455
# two -0.331077
# b one 1.145625
# two 0.050510
# Name: data1, dtype: float64
使用字典和Series进行分组
考虑下面这个小型的DataFrame:
people = pd.DataFrame(np.random.randn(5,5),
columns=list('abcde'),
index=['Joe','Steve','Wes','Jim','Travis'])
people.iloc[2:3,[1,2]] = np.nan
现在假设有一个各列的分组对应关系,并且想把各列像累加,可以进行以下的操作:
mapping = {"a":'red',"b":'red',"c":"blue","d":"blue","e":"red","f":'orange'}
by_columns = people.groupby(mapping,axis=1)
print(by_columns.sum())
# blue red
# Joe -1.043278 1.362828
# Steve -0.779433 -3.263282
# Wes -0.250116 -0.177344
# Jim 0.339998 1.843258
# Travis -1.310913 3.238224
Series也可以传入,与之前类似不在赘述。
使用函数分组
可以传入len来进行按照名字的长度来进行分组:
print(people.groupby(len).sum())
#计算索引长度,并按索引长度进行分组
# a b c d e
# 3 -1.816681 -0.172404 -0.956793 -0.472355 -0.518942
# 5 1.772594 0.523826 -0.073321 0.466430 0.854293
# 6 -0.300037 1.399494 0.899126 -0.908313 -0.643516
当然我们可以将函数分组和列表或者Series混合使用:
print(people.groupby([len,key_list]).sum())
# a b c d e
# 3 one 0.477578 0.206624 2.464722 -0.628018 0.752694
# two 1.634067 -0.259040 0.702192 -1.371848 0.279807
# 5 one -0.770036 -0.630894 2.214618 0.506788 -1.556065
# 6 two 0.424615 0.829693 -0.276156 -0.457296 -0.203110
根据索引层级分组
当拥有多层索引时,可以使用以下方法方便的按某个层级进行分组:
colunms = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],
[1,3,5,1,3]],
names = ['cty','tenor'])
hier_df = pd.DataFrame(np.random.randn(4,5),columns=colunms)
print(hier_df)
# cty US JP
# tenor 1 3 5 1 3
# 0 -0.444301 0.473209 -0.092656 0.148257 -0.703641
# 1 0.821177 -0.146728 -0.028356 -0.777455 -2.556110
# 2 -0.298212 0.663500 0.037683 -0.860470 -0.097507
# 3 -1.075454 0.090151 1.353959 -0.538752 0.199268
print(hier_df.groupby(level='cty',axis=1).count())
# cty JP US
# 0 2 3
# 1 2 3
# 2 2 3
# 3 2 3