pandas-groupby函数使用以及聚合函数使用(比赛中常用的写法!!)(transform的解释)


import pandas as pd
 
df = pd.DataFrame({'Country':['China','China', 'India', 'India', 'America', 'Japan', 'China', 'India'], 
                   'Income':[10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
                    'Age':[5000, 4321, 1234, 4010, 250, 250, 4500, 4321]})
df
AgeCountryIncome
05000China10000
14321China10000
21234India5000
34010India5002
4250America40000
5250Japan50000
64500China8000
74321India5000

主要就是理解下面这句话!!!后面的是一些细节!!

一个基础的语法!!!

a=df.groupby('Country', as_index = False)['Income'].agg({ 'Income_sum':'sum'})
a
CountryIncome_sum
0America40000
1China28000
2India15002
3Japan50000

上面代码的意思:统计每一种Conutry下Income的总和!!!

具体起到的作用:
①groupby(‘Country’, as_index = False) :按每一种Country进行统计!!!!!
②[‘Income’]:统计的是每一种Country下Income的相关特征!!!
③agg函数中 ‘Income_sum’:‘sum’,第一个字段’Income_sum’,其实你也可以写’sum’,但是如果写’Income_sum’,pandas会将
“每一个country下的Income总和”这个特征的特证名添为’Income_sum’

可以看到的是,这样造出来的 Income_sum特征,需要拼接回原来的df上去
也就是

data_merged = pd.merge(df, a, on=['Country'], how='left')
data_merged
AgeCountryIncomeIncome_sum
05000China1000028000
14321China1000028000
21234India500015002
34010India500215002
4250America4000040000
5250Japan5000050000
64500China800028000
74321India500015002

比赛中经常用到的写法

df.groupby('Country').Income.transform('sum')
0    28000
1    28000
2    15002
3    15002
4    40000
5    50000
6    28000
7    15002
Name: Income, dtype: int64

可以看到,上面的数据,如第0行,和China的Income总和是一样的,而
df的第0行,实际上也是China!!,也就是说,他已经按Country字段给merge好了!!!

-------------------------------------------------分割线,下面是原理和细节

a=df.groupby('Country')
a.max()
AgeIncome
Country
America25040000
China500010000
India43215002
Japan25050000
a.agg(['min', 'mean', 'max'])
AgeIncome
minmeanmaxminmeanmax
Country
America250250.0000002504000040000.00000040000
China43214607.000000500080009333.33333310000
India12343188.333333432150005000.6666675002
Japan250250.0000002505000050000.00000050000

下面这个就是单独的取了一个字段的聚合结果
下面的结果表示,a[‘Age’]是单独的一个字段的group,对单独的一个字段的group我们可以对其聚合进行改名
比如下面 num_agg = {‘p’:[‘min’, ‘mean’, ‘max’]} 其实那个’p’按常规来说应该写成Age,但是此处这么写,就相当于将表中的Age字段直接更名为’p’了

num_agg = {'p':['min', 'mean', 'max']}
a['Age'].agg(num_agg)############ 只拿出了Age这一列!!!!!,也就是一个字段的时候可以改名
/home/ubuntu/anaconda3/lib/python3.5/site-packages/ipykernel_launcher.py:2: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
p
minmeanmax
Country
America250250.000000250
China43214607.0000005000
India12343188.3333334321
Japan250250.000000250

要注意 a是一个包含两个字段的(Income和Age)的group,
agg函数不支持对含两个以上字段的聚合进行名字更改(如下图)

num_agg = {'Age_1':['min', 'mean', 'max']}
a.agg(num_agg)######  a包含了两个以上的字段了!
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-7-8131bee46d39> in <module>()
      1 num_agg = {'Age_1':['min', 'mean', 'max']}
----> 2 a.agg(num_agg)######  a包含了两个以上的字段了!


~/.local/lib/python3.5/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
   1313     @Appender(_shared_docs['aggregate'])
   1314     def aggregate(self, arg, *args, **kwargs):
-> 1315         return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
   1316 
   1317     agg = aggregate


~/.local/lib/python3.5/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
    184 
    185         _level = kwargs.pop('_level', None)
--> 186         result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
    187         if how is None:
    188             return result


~/.local/lib/python3.5/site-packages/pandas/core/base.py in _aggregate(self, arg, *args, **kwargs)
    408                           k not in obj.columns):
    409                         raise KeyError(
--> 410                             "Column '{col}' does not exist!".format(col=k))
    411 
    412                 arg = new_arg


KeyError: "Column 'Age_1' does not exist!"
num_agg = {'Age':['min', 'mean', 'max'], 'Income':['min', 'max']}
a.agg(num_agg,as_index=False)
IncomeAge
minmaxminmeanmax
Country
America4000040000250250.000000250
China80001000043214607.0000005000
India5000500212343188.3333334321
Japan5000050000250250.000000250
num_agg = {'Age':['min', 'mean', 'max'], 'Income':['min', 'max']}
a.agg(num_agg,as_index=True)
IncomeAge
minmaxminmeanmax
Country
America4000040000250250.000000250
China80001000043214607.0000005000
India5000500212343188.3333334321
Japan5000050000250250.000000250
num_agg = {'Age':['min', 'mean', 'max']}
a.agg(num_agg,as_index=False)
Age
minmeanmax
Country
America250250.000000250
China43214607.0000005000
India12343188.3333334321
Japan250250.000000250
num_agg = {'Age':['min', 'mean', 'max']}
a['Age'].agg(num_agg,as_index=True)
/home/ubuntu/anaconda3/lib/python3.5/site-packages/ipykernel_launcher.py:2: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
Age
minmeanmax
Country
America250250.000000250
China43214607.0000005000
India12343188.3333334321
Japan250250.000000250
num_agg = {'min', 'mean', 'max'}#正规用法!!!!!!
a['Age'].agg(num_agg,as_index=False)
minmaxmean
Country
America250250250.000000
China432150004607.000000
India123443213188.333333
Japan250250250.000000
  • 9
    点赞
  • 18
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值