df1 is a DataFrame with 4 columns.
I want to create a new DataFrame (df2) by grouping df1 on column 'A' and applying multi-column operations to columns 'C' and 'D':
Column 'AA' = mean(C)+mean(D)
Column 'BB' = std(D)
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8)})
     A      B         C         D
0  foo    one  1.652675 -1.983378
1  bar    one  0.926656 -0.598756
2  foo    two  0.131381  0.604803
3  bar  three -0.436376 -1.186363
4  foo    two  0.487161 -0.650876
5  bar    two  0.358007  0.249967
6  foo    one -1.150428  2.275528
7  foo  three  0.202677 -1.408699
def fun1(gg):  # this does not work
    return pd.DataFrame({'AA':C.mean()+gg.C.std(), 'BB':gg.C.std()})

dg1 = df1.groupby('A')
df2 = dg1.apply(fun1)
This does not work. It seems that aggregate() only works on one Series at a time, so multi-column operations are not possible,
and apply() only produces Series output for multi-column operations.
Is there another way to produce multi-column output (a DataFrame) from a multi-column operation?
Solution
Do you have a typo in your fun1 function? Should AA be C.mean() + C.std() or C.mean() + D.mean()?
In the first case, AA = C.mean() + C.std():
In [91]: df = df1.groupby('A').agg({'C': lambda x: x.mean() + x.std(),
                                    'D': lambda x: x.std()})
In [92]: df
Out[92]:
            C         D
A
bar  1.255506  0.588981
foo  1.775945  0.442724
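If you want the result columns to be called AA and BB rather than C and D, named aggregation may help. The following is a minimal sketch, assuming pandas 0.25 or later; the fixed seed is an assumption added only so the example is reproducible, and the values will differ from the output shown above.

```python
import numpy as np
import pandas as pd

# Same layout as df1 in the question; the seed is an assumption for reproducibility.
rng = np.random.default_rng(0)
df1 = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': rng.standard_normal(8),
    'D': rng.standard_normal(8),
})

# Named aggregation: output_name=(input_column, function).
# Each function still sees only a single column of the group.
df = df1.groupby('A').agg(
    AA=('C', lambda x: x.mean() + x.std()),
    BB=('D', 'std'),
)
print(df)
```

Each output column here is still computed from exactly one input column, so this only covers the C.mean() + C.std() reading of the question.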
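For the second case, AA = C.mean() + D.mean(), things aren't quite as nice. When you pass a dict to .agg on a groupby object, each function receives only its own column, so I don't think there's a way to combine values from two columns. You can compute the group statistics separately and build the DataFrame yourself instead: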
In [108]: g = df1.groupby('A')[['C', 'D']]  # numeric columns only, so mean() and std() skip 'B'
In [109]: df = pd.DataFrame({"AA": g.mean()['C'] + g.mean()['D'], "BB": g.std()['D']})
In [110]: df
Out[110]:
           AA        BB
A
bar  0.532263  0.721351
foo  0.427608  0.494980
You may want to assign g.mean() (and g.std(), if you reuse it) to temporary variables to avoid calculating them more than once.
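The temporary-variable idea can be sketched as follows, together with an alternative that uses apply() with a function returning a Series per group. This is only a sketch under stated assumptions: the helper name summarize is hypothetical, the fixed seed is added for reproducibility, and the numbers will not match the output above.

```python
import numpy as np
import pandas as pd

# Same layout as df1 in the question; the seed is an assumption for reproducibility.
rng = np.random.default_rng(0)
df1 = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': rng.standard_normal(8),
    'D': rng.standard_normal(8),
})

# Approach 1: compute each group statistic once, then combine the columns.
g = df1.groupby('A')[['C', 'D']]  # restrict to the numeric columns
means = g.mean()
stds = g.std()
df2 = pd.DataFrame({'AA': means['C'] + means['D'], 'BB': stds['D']})

# Approach 2: apply() a function that returns one Series per group;
# pandas combines those Series into a DataFrame with one row per group.
def summarize(group):  # hypothetical helper name
    return pd.Series({
        'AA': group['C'].mean() + group['D'].mean(),
        'BB': group['D'].std(),
    })

df2_apply = df1.groupby('A')[['C', 'D']].apply(summarize)

print(df2)
print(df2_apply)
```

Both approaches should give the same table. The first is usually faster because it relies on the built-in group reductions, while the second keeps all of the per-group logic in one function.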