import pandas as pd
df = pd. DataFrame( { 'Country' : [ 'China' , 'China' , 'India' , 'India' , 'America' , 'Japan' , 'China' , 'India' ] ,
'Income' : [ 10000 , 10000 , 5000 , 5002 , 40000 , 50000 , 8000 , 5000 ] ,
'Age' : [ 5000 , 4321 , 1234 , 4010 , 250 , 250 , 4500 , 4321 ] } )
df
Age Country Income 0 5000 China 10000 1 4321 China 10000 2 1234 India 5000 3 4010 India 5002 4 250 America 40000 5 250 Japan 50000 6 4500 China 8000 7 4321 India 5000
主要就是理解下面这句话!!!后面的是一些细节!!
一个基础的语法!!!
a= df. groupby( 'Country' , as_index = False ) [ 'Income' ] . agg( { 'Income_sum' : 'sum' } )
a
Country Income_sum 0 America 40000 1 China 28000 2 India 15002 3 Japan 50000
上面代码的意思:统计每一种Conutry下Income的总和!!! 具体起到的作用: ①groupby(‘Country’, as_index = False) :按每一种Country进行统计!!!!! ②[‘Income’]:统计的是每一种Country下Income的相关特征!!! ③agg函数中 ‘Income_sum’:‘sum’,第一个字段’Income_sum’,其实你也可以写’sum’,但是如果写’Income_sum’,pandas会将 “每一个country下的Income总和”这个特征的特证名添为’Income_sum’
可以看到的是,这样造出来的 Income_sum特征,需要拼接回原来的df上去 也就是
data_merged = pd. merge( df, a, on= [ 'Country' ] , how= 'left' )
data_merged
Age Country Income Income_sum 0 5000 China 10000 28000 1 4321 China 10000 28000 2 1234 India 5000 15002 3 4010 India 5002 15002 4 250 America 40000 40000 5 250 Japan 50000 50000 6 4500 China 8000 28000 7 4321 India 5000 15002
比赛中经常用到的写法
df. groupby( 'Country' ) . Income. transform( 'sum' )
0 28000
1 28000
2 15002
3 15002
4 40000
5 50000
6 28000
7 15002
Name: Income, dtype: int64
可以看到,上面的数据,如第0行,和China的Income总和是一样的,而 df的第0行,实际上也是China!!,也就是说,他已经按Country字段给merge好了!!!
-------------------------------------------------分割线,下面是原理和细节
a= df. groupby( 'Country' )
a. max ( )
Age Income Country America 250 40000 China 5000 10000 India 4321 5002 Japan 250 50000
a. agg( [ 'min' , 'mean' , 'max' ] )
Age Income min mean max min mean max Country America 250 250.000000 250 40000 40000.000000 40000 China 4321 4607.000000 5000 8000 9333.333333 10000 India 1234 3188.333333 4321 5000 5000.666667 5002 Japan 250 250.000000 250 50000 50000.000000 50000
下面这个就是单独的取了一个字段的聚合结果 下面的结果表示,a[‘Age’]是单独的一个字段的group,对单独的一个字段的group我们可以对其聚合进行改名 比如下面 num_agg = {‘p’:[‘min’, ‘mean’, ‘max’]} 其实那个’p’按常规来说应该写成Age,但是此处这么写,就相当于将表中的Age字段直接更名为’p’了
num_agg = { 'p' : [ 'min' , 'mean' , 'max' ] }
a[ 'Age' ] . agg( num_agg)
/home/ubuntu/anaconda3/lib/python3.5/site-packages/ipykernel_launcher.py:2: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
p min mean max Country America 250 250.000000 250 China 4321 4607.000000 5000 India 1234 3188.333333 4321 Japan 250 250.000000 250
要注意 a是一个包含两个字段的(Income和Age)的group, agg函数不支持对含两个以上字段的聚合进行名字更改(如下图)
num_agg = { 'Age_1' : [ 'min' , 'mean' , 'max' ] }
a. agg( num_agg)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-7-8131bee46d39> in <module>()
1 num_agg = {'Age_1':['min', 'mean', 'max']}
----> 2 a.agg(num_agg)###### a包含了两个以上的字段了!
~/.local/lib/python3.5/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
1313 @Appender(_shared_docs['aggregate'])
1314 def aggregate(self, arg, *args, **kwargs):
-> 1315 return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
1316
1317 agg = aggregate
~/.local/lib/python3.5/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
184
185 _level = kwargs.pop('_level', None)
--> 186 result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
187 if how is None:
188 return result
~/.local/lib/python3.5/site-packages/pandas/core/base.py in _aggregate(self, arg, *args, **kwargs)
408 k not in obj.columns):
409 raise KeyError(
--> 410 "Column '{col}' does not exist!".format(col=k))
411
412 arg = new_arg
KeyError: "Column 'Age_1' does not exist!"
num_agg = { 'Age' : [ 'min' , 'mean' , 'max' ] , 'Income' : [ 'min' , 'max' ] }
a. agg( num_agg, as_index= False )
Income Age min max min mean max Country America 40000 40000 250 250.000000 250 China 8000 10000 4321 4607.000000 5000 India 5000 5002 1234 3188.333333 4321 Japan 50000 50000 250 250.000000 250
num_agg = { 'Age' : [ 'min' , 'mean' , 'max' ] , 'Income' : [ 'min' , 'max' ] }
a. agg( num_agg, as_index= True )
Income Age min max min mean max Country America 40000 40000 250 250.000000 250 China 8000 10000 4321 4607.000000 5000 India 5000 5002 1234 3188.333333 4321 Japan 50000 50000 250 250.000000 250
num_agg = { 'Age' : [ 'min' , 'mean' , 'max' ] }
a. agg( num_agg, as_index= False )
Age min mean max Country America 250 250.000000 250 China 4321 4607.000000 5000 India 1234 3188.333333 4321 Japan 250 250.000000 250
num_agg = { 'Age' : [ 'min' , 'mean' , 'max' ] }
a[ 'Age' ] . agg( num_agg, as_index= True )
/home/ubuntu/anaconda3/lib/python3.5/site-packages/ipykernel_launcher.py:2: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
Age min mean max Country America 250 250.000000 250 China 4321 4607.000000 5000 India 1234 3188.333333 4321 Japan 250 250.000000 250
num_agg = { 'min' , 'mean' , 'max' }
a[ 'Age' ] . agg( num_agg, as_index= False )
min max mean Country America 250 250 250.000000 China 4321 5000 4607.000000 India 1234 4321 3188.333333 Japan 250 250 250.000000