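This post continues the series from the previous article. The details follow.
Selecting a Column or a Subset of Columns
For a GroupBy object created from a DataFrame, indexing it with a single column name (a string) or a list of column names (an array of strings) selects only those columns for aggregation.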
In [28]: import numpy as np
...: import pandas as pd
...: from pandas import Series,DataFrame
...: df=DataFrame({'key1':['a','a','b','b','a'],'key2':['one','two','one','two','one'],
...:               'data1':np.random.randn(5),'data2':np.random.randn(5)})
In [29]: df
Out[29]:
  key1 key2     data1     data2
0    a  one  1.523269 -0.063037
1    a  two -0.368826 -0.406932
2    b  one  0.753990 -1.866066
3    b  two  0.602979 -1.923139
4    a  one -1.712829  0.600494
In [30]: df.groupby('key1')['data1']
Out[30]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x119874f28>
In [31]: df.groupby('key1')[['data2']]
Out[31]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x1198a52e8>
The two indexing forms above are syntactic sugar for the following longer forms:
In [32]: df['data1'].groupby(df['key1'])
Out[32]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x1198a5dd8>
In [33]: df[['data2']].groupby(df['key1'])
Out[33]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x119874e48>
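As a quick check of that equivalence, here is a minimal sketch that aggregates with both spellings and compares the results. The fixed data values and the names `sugar` and `explicit` are assumptions for illustration and are not part of the session above.

```python
import pandas as pd

# Small frame with fixed values (illustrative, not the random data above)
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Short form: index the GroupBy object with a column name
sugar = df.groupby('key1')['data1'].mean()

# Long form: select the column first, then group by the key Series
explicit = df['data1'].groupby(df['key1']).mean()

# Compare the two results element by element
print(sugar.equals(explicit))
```

The short form is usually easier to read, and it avoids repeating the DataFrame name for every key.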
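Especially with large datasets, you often need to aggregate only a few of the columns. For example, to compute the mean of just the data2 column and get the result as a DataFrame: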
In [34]: df.groupby(['key1','key2'])[['data2']].mean()
Out[34]:
              data2
key1 key2
a    one   0.268729
     two  -0.406932
b    one  -1.866066
     two  -1.923139
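Indexing a GroupBy object this way returns a grouped DataFrame if you pass a list or array of column names, or a grouped Series if you pass a single column name as a scalar: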
In [35]: s_grouped=df.groupby(['key1','key2'])['data2']
In [36]: s_grouped
Out[36]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x119840198>
In [37]: s_grouped.mean()
Out[37]:
key1  key2
a     one     0.268729
      two    -0.406932
b     one    -1.866066
      two    -1.923139
Name: data2, dtype: float64
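The scalar-versus-list distinction can be confirmed directly by inspecting the result types. The sketch below uses fixed values instead of random data, and the names `as_series` and `as_frame` are hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Fixed values so the group means are easy to verify by hand
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 50.0],
})

grouped = df.groupby(['key1', 'key2'])

# A scalar column name gives a SeriesGroupBy, so the aggregate is a Series
as_series = grouped['data2'].mean()

# A list of column names gives a DataFrameGroupBy, so the aggregate is a DataFrame
as_frame = grouped[['data2']].mean()

print(type(as_series).__name__)
print(type(as_frame).__name__)
```

Both results carry the same numbers. Use the list form when later steps expect a DataFrame, for example when more columns will be added to the selection.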
Grouping with Dicts and Series
In [38]: people=DataFrame(np.random.randn(5,5),columns=['a','b','c','d','e'],
    ...:                  index=['Joe','Steve','Wes','Jim','Travis'])
In [39]: people
Out[39]:
               a         b         c         d         e
Joe     0.798251 -2.181076 -1.002941 -1.216847 -1.306730
Steve  -0.384873 -0.712956  2.000298 -0.201616  0.618182
Wes     2.630382  0.643784 -0.609386  1.151120 -0.780954
Jim     0.226097 -0.177566  0.615332 -0.199972 -0.995759
Travis  0.283909  0.601818 -0.264502 -0.997113  0.205104
# Add a few NA values (.ix was removed in pandas 1.0, so use .iloc for positional selection)
In [40]: people.iloc[2:3,[1,2]]=np.nan
In [41]: people
Out[41]:
               a         b         c         d         e
Joe     0.798251 -2.181076 -1.002941 -1.216847 -1.306730
Steve  -0.384873 -0.712956  2.000298 -0.201616  0.618182
Wes     2.630382       NaN       NaN  1.151120 -0.780954
Jim     0.226097 -0.177566  0.615332 -0.199972 -0.995759
Travis  0.283909  0.601818 -0.264502 -0.997113  0.205104
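Suppose the group each column belongs to is known, and we want to sum the columns within each group:
In [42]: mapping={'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
Just pass this dict to groupby (the unused key 'f' is fine, since keys that match no column are ignored):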
In [43]: by_column=people.groupby(mapping,axis=1)
In [44]: by_column.sum()
Out[44]:
            blue       red
Joe    -2.219788 -2.689556
Steve   1.798682 -0.479647
Wes     1.151120  1.849428
Jim     0.415360 -0.947227
Travis -1.261615  1.090831
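Recent pandas releases deprecate the axis argument of DataFrame.groupby (since version 2.1, as far as I can tell), so the axis=1 call above may warn or fail depending on the installed version. One possible workaround, sketched below under that assumption, is to transpose, group the rows, and transpose back. The integer-valued frame and the name `by_color` are illustrative only.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random `people` frame: values 0..24 as floats
people = pd.DataFrame(
    np.arange(25, dtype=float).reshape(5, 5),
    columns=['a', 'b', 'c', 'd', 'e'],
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'],
)

# Column-to-group mapping; 'f' matches no column and is simply ignored
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# Transpose so the column labels become the row index,
# group those rows with the dict, sum, then transpose back
by_color = people.T.groupby(mapping).sum().T

print(by_color)
```

The result has one column per group (blue and red) and one row per person, the same shape as the axis=1 output.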
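A Series offers the same functionality, since it can be viewed as a fixed-size mapping.
When a Series is used as the group key, pandas checks it to make sure its index is aligned with the axis being grouped:
In [45]: map_series=Series(mapping)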
In [46]: map_series
Out[46]:
a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object
In [47]: people.groupby(map_series,axis=1).count()
Out[47]:
        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3
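The counts for Wes are lower because count() skips NA values. The following sketch reproduces that behaviour with a Series as the group key. It uses a deterministic frame and the transpose approach instead of axis=1 so that it also runs on newer pandas versions; both choices are assumptions for illustration, not the original session.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random `people` frame
people = pd.DataFrame(
    np.arange(25, dtype=float).reshape(5, 5),
    columns=['a', 'b', 'c', 'd', 'e'],
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'],
)

# Blank out columns b and c for Wes (row position 2), as in the session above
people.iloc[2, [1, 2]] = np.nan

# Build the group key as a Series; its index holds the column labels
map_series = pd.Series({'a': 'red', 'b': 'red', 'c': 'blue',
                        'd': 'blue', 'e': 'red', 'f': 'orange'})

# Group the transposed frame by the Series (aligned on the index),
# count non-NA values per group, then transpose back
counts = people.T.groupby(map_series).count().T

print(counts)
```

The extra label 'f' in map_series has no matching column, so alignment drops it and no orange group appears.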