1. Group-wise Operations and Transformations
Aggregation is only one kind of grouped operation. This section introduces the transform and apply methods, which can perform many other kinds of grouped operations.
Suppose we want to add a column to a DataFrame containing the mean of the group each row belongs to. One way to do this is to aggregate first, then merge:
print(df)
k1_means = df.groupby('key1').mean(numeric_only=True).add_prefix('mean_')
print(k1_means)
print(pd.merge(df, k1_means, left_on='key1', right_index=True))
The result is:
data1 data2 key1 key2
0 1.297237 1.069077 a one
1 1.586896 -0.679487 a two
2 -0.866223 0.597460 b one
3 1.440054 0.304970 b two
4 -0.705452 -0.104290 a one
mean_data1 mean_data2
key1
a 0.726227 0.095100
b 0.286915 0.451215
data1 data2 key1 key2 mean_data1 mean_data2
0 1.297237 1.069077 a one 0.726227 0.095100
1 1.586896 -0.679487 a two 0.726227 0.095100
4 -0.705452 -0.104290 a one 0.726227 0.095100
2 -0.866223 0.597460 b one 0.286915 0.451215
3 1.440054 0.304970 b two 0.286915 0.451215
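The aggregate-then-merge pattern above can be reproduced as a self-contained sketch (the key and column names follow the example; the random values will differ from those shown):

```python
import numpy as np
import pandas as pd

# Build a small DataFrame shaped like the one in the example above.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

# Aggregate: per-group means, prefixed so the merged columns are distinct.
k1_means = df.groupby('key1').mean(numeric_only=True).add_prefix('mean_')

# Merge the group means back onto the original rows by the key column.
merged = pd.merge(df, k1_means, left_on='key1', right_index=True)
print(merged)
```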
This works, but it is somewhat inflexible. You can think of the process as transforming the two data columns with the np.mean function. This time, take the people DataFrame and group it by a list of keys:
key = ['one', 'two', 'one', 'two', 'one']
print(people)
print(people.groupby(key).mean())
print(people.groupby(key).transform(np.mean))
The result is:
a b c d e
Joe 0.286169 -1.354119 0.171155 -1.654205 0.034074
Steve 1.436373 0.746910 -0.010747 0.481846 -0.291208
Wes -1.678330 NaN NaN -0.659455 0.502947
Jim 0.097660 0.010846 0.770573 0.451625 -0.867913
Travis -0.029736 -0.743793 -0.490787 0.204776 0.275947
a b c d e
one -0.473966 -1.048956 -0.159816 -0.702961 0.270989
two 0.767017 0.378878 0.379913 0.466736 -0.579561
a b c d e
Joe -0.473966 -1.048956 -0.159816 -0.702961 0.270989
Steve 0.767017 0.378878 0.379913 0.466736 -0.579561
Wes -0.473966 -1.048956 -0.159816 -0.702961 0.270989
Jim 0.767017 0.378878 0.379913 0.466736 -0.579561
Travis -0.473966 -1.048956 -0.159816 -0.702961 0.270989
transform applies a function to each group, then places the results in the appropriate locations. If each group produces a scalar value, that value is broadcast across the group.
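The broadcasting behavior can be seen in a minimal sketch (the data and key here are made up for illustration):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
key = ['x', 'x', 'y', 'y']

# 'mean' produces one scalar per group; transform broadcasts that
# scalar back to every row of its group, preserving the shape of s.
out = s.groupby(key).transform('mean')
print(out.tolist())  # [1.5, 1.5, 3.5, 3.5]
```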
Suppose instead you want to subtract the mean value from each group. To do this, create a demeaning function and pass it to transform:
def demean(arr):
    return arr - arr.mean()

demeaned = people.groupby(key).transform(demean)
print(demeaned)
The result is:
a b c d e
Joe 0.067355 -0.120862 0.487208 0.190557 -2.555154
Steve -0.751458 -0.616967 0.113476 -0.433799 0.949293
Wes 0.062807 NaN NaN 0.917042 2.490994
Jim 0.751458 0.616967 -0.113476 0.433799 -0.949293
Travis -0.130163 0.120862 -0.487208 -1.107599 0.064159
You can check that demeaned now has zero group means:
print(demeaned.groupby(key).mean())
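As a self-contained check (with made-up random data standing in for people), the group means of a demeaned result are zero up to floating-point error:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(np.random.randn(5, 3), columns=['a', 'b', 'c'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']

def demean(arr):
    # Subtract each group's mean from the group's values.
    return arr - arr.mean()

demeaned = people.groupby(key).transform(demean)

# Every group mean of the demeaned data should be ~0.
print(demeaned.groupby(key).mean())
```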
2. apply: General "Split-Apply-Combine"
Like aggregate, transform is a specialized function with rigid requirements: the function you pass must either produce a scalar value that can be broadcast (like np.mean) or a transformed array of the same size. apply, by contrast, is fully general: it splits the object into pieces, invokes the function on each piece, and then glues the pieces back together.
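The contrast can be sketched with made-up data: transform keeps the original shape, while apply may return a result of any shape:

```python
import pandas as pd

s = pd.Series([3.0, 1.0, 2.0, 5.0], index=list('abcd'))
key = ['x', 'x', 'y', 'y']

# transform: the result has the same shape and index as s.
print(s.groupby(key).transform('mean'))

# apply: each group may shrink, e.g. keep only its largest value;
# the pieces are then concatenated under the group labels.
print(s.groupby(key).apply(lambda g: g.nlargest(1)))
```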
Returning to the tips dataset from before, suppose you wanted to select the top five tip_pct values by group. First, write a function that selects the rows with the largest values in a given column:
def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]

print(top(tips, n=6))
The result is:
total_bill tip sex smoker day time size tip_pct
109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
232 11.61 3.39 Male No Sat Dinner 2 0.291990
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
Now, if we group by smoker and call apply with this function, we get:
print(tips.groupby('smoker').apply(top))
The result is:
total_bill tip sex smoker day time size tip_pct
smoker
No 88 24.71 5.85 Male No Thur Lunch 2 0.236746
185 20.69 5.00 Male No Sun Dinner 5 0.241663
51 10.29 2.60 Female No Sun Dinner 2 0.252672
149 7.51 2.00 Male No Thur Lunch 2 0.266312
232 11.61 3.39 Male No Sat Dinner 2 0.291990
Yes 109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
The top function is called on each piece of the DataFrame, and the results are glued together by pandas.concat and labeled with the group names. The final result therefore has a hierarchical index whose inner level contains index values from the original DataFrame.
If the function passed to apply takes other arguments or keywords, you can pass these after the function name:
print(tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill'))
The result is:
total_bill tip sex smoker day time size \
smoker day
No Fri 94 22.75 3.25 Female No Fri Dinner 2
Sat 212 48.33 9.00 Male No Sat Dinner 4
Sun 156 48.17 5.00 Male No Sun Dinner 6
Thur 142 41.19 5.00 Male No Thur Lunch 5
Yes Fri 95 40.17 4.73 Male Yes Fri Dinner 4
Sat 170 50.81 10.00 Male Yes Sat Dinner 3
Sun 182 45.35 3.50 Male Yes Sun Dinner 3
Thur 197 43.11 5.00 Female Yes Thur Lunch 4
tip_pct
smoker day
No Fri 94 0.142857
Sat 212 0.186220
Sun 156 0.103799
Thur 142 0.121389
Yes Fri 95 0.117750
Sat 170 0.196812
Sun 182 0.077178
Thur 197 0.115982
Earlier, I called describe on a GroupBy object:
result = tips.groupby('smoker')['tip_pct'].describe()
print(result)
print(result.unstack('smoker'))
The result is:
smoker
No count 151.000000
mean 0.159328
std 0.039910
min 0.056797
25% 0.136906
50% 0.155625
75% 0.185014
max 0.291990
Yes count 93.000000
mean 0.163196
std 0.085119
min 0.035638
25% 0.106771
50% 0.153846
75% 0.195059
max 0.710345
dtype: float64
smoker No Yes
count 151.000000 93.000000
mean 0.159328 0.163196
std 0.039910 0.085119
min 0.056797 0.035638
25% 0.136906 0.106771
50% 0.155625 0.153846
75% 0.185014 0.195059
max 0.291990 0.710345
Inside GroupBy, when you invoke a method like describe, it is actually just a shortcut for the following two lines:
f = lambda x: x.describe()
grouped.apply(f)
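A quick self-contained check of that equivalence, using made-up data (the unstacked columns are re-aligned to describe's column order before comparing):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(20))
key = np.repeat(['No', 'Yes'], 10)
grouped = s.groupby(key)

# describe called as a GroupBy method...
direct = grouped.describe()

# ...versus the explicit apply shortcut it corresponds to.
f = lambda x: x.describe()
via_apply = grouped.apply(f)

# Unstack the (key, statistic) index into columns, align column order
# with describe's output, and compare value for value.
wide = via_apply.unstack()[direct.columns]
print(np.allclose(wide.to_numpy(), direct.to_numpy()))  # True
```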
3. Suppressing the Group Keys
In the preceding examples, you can see that the group keys, together with the indexes of each piece of the original object, form a hierarchical index on the result. You can disable this behavior by passing group_keys=False to groupby:
print(tips.groupby('smoker', group_keys=False).apply(top))
The result is:
total_bill tip sex smoker day time size tip_pct
88 24.71 5.85 Male No Thur Lunch 2 0.236746
185 20.69 5.00 Male No Sun Dinner 5 0.241663
51 10.29 2.60 Female No Sun Dinner 2 0.252672
149 7.51 2.00 Male No Thur Lunch 2 0.266312
232 11.61 3.39 Male No Sat Dinner 2 0.291990
109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
4. Quantile and Bucket Analysis
pandas has tools (notably cut and qcut) for slicing data into pieces by specified bins or by sample quantiles. Combining these functions with groupby makes it easy to perform bucket or quantile analysis on a dataset. Take the following simple random dataset as an example, which we will slice into equal-length buckets using cut:
frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})
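Continuing the idea, here is a sketch of how cut combines with groupby for a bucket analysis (the choice of four buckets and the aggregation functions are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})

# cut returns a Categorical describing which of 4 equal-length
# buckets each data1 value falls into.
factor = pd.cut(frame['data1'], 4)

# Group data2 by those buckets and compute a few statistics per bucket.
stats = frame.groupby(factor, observed=False)['data2'].agg(
    ['min', 'max', 'count', 'mean'])
print(stats)
```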