数据聚合技术Aggregation

数据聚合技术Aggregation

引入相关库

import numpy as np
import pandas as pd
from pandas import Series,DataFrame

数据获取

df=pd.read_csv('../homework/city_weather.csv')
g=df.groupby('city')

数据聚合

通过agg函数求最大值

g.agg('max')
datetemperaturewind
city
BJ31/01/2016195
GZ31/07/2016255
SH27/03/2016205
SZ25/09/2016204

通过agg函数求最小值

g.agg('min')
datetemperaturewind
city
BJ03/01/2016-32
GZ14/08/2016-12
SH03/07/2016-102
SZ11/09/2016-101

自定义一个聚合函数

def foo(attr):
    print(type(attr))
    return np.nan

向agg方法传入该函数

g.agg(foo)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
datetemperaturewind
city
BJNaNNaNNaN
GZNaNNaNNaN
SHNaNNaNNaN
SZNaNNaNNaN

查看每一行结果Series里面存放的什么

def foo(attr):
    print(type(attr)),print(attr)
    return np.nan
g.agg(foo)
<class 'pandas.core.series.Series'>
Series([], Name: date, dtype: object)
<class 'pandas.core.series.Series'>
0    03/01/2016
1    17/01/2016
2    31/01/2016
3    14/02/2016
4    28/02/2016
5    13/03/2016
Name: date, dtype: object
<class 'pandas.core.series.Series'>
14    17/07/2016
15    31/07/2016
16    14/08/2016
17    28/08/2016
Name: date, dtype: object
<class 'pandas.core.series.Series'>
6     27/03/2016
7     10/04/2016
8     24/04/2016
9     08/05/2016
10    22/05/2016
11    05/06/2016
12    19/06/2016
13    03/07/2016
Name: date, dtype: object
<class 'pandas.core.series.Series'>
18    11/09/2016
19    25/09/2016
Name: date, dtype: object
<class 'pandas.core.series.Series'>
Series([], Name: temperature, dtype: int64)
<class 'pandas.core.series.Series'>
0     8
1    12
2    19
3    -3
4    19
5     5
Name: temperature, dtype: int64
<class 'pandas.core.series.Series'>
14    10
15    -1
16     1
17    25
Name: temperature, dtype: int64
<class 'pandas.core.series.Series'>
6     -4
7     19
8     20
9     17
10     4
11   -10
12     0
13    -9
Name: temperature, dtype: int64
<class 'pandas.core.series.Series'>
18    20
19   -10
Name: temperature, dtype: int64
<class 'pandas.core.series.Series'>
Series([], Name: wind, dtype: int64)
<class 'pandas.core.series.Series'>
0    5
1    2
2    2
3    3
4    2
5    3
Name: wind, dtype: int64
<class 'pandas.core.series.Series'>
14    2
15    5
16    5
17    4
Name: wind, dtype: int64
<class 'pandas.core.series.Series'>
6     4
7     3
8     3
9     3
10    2
11    4
12    5
13    5
Name: wind, dtype: int64
<class 'pandas.core.series.Series'>
18    1
19    4
Name: wind, dtype: int64
datetemperaturewind
city
BJNaNNaNNaN
GZNaNNaNNaN
SHNaNNaNNaN
SZNaNNaNNaN

把函数改成attr的最大值减去最小值,例如北京的wind为max-min=5-2=3

def foo(attr):
    return attr.max()-attr.min()
g.agg(foo)
temperaturewind
city
BJ223
GZ263
SH303
SZ303

查看原始的DataFrame

df
datecitytemperaturewind
003/01/2016BJ85
117/01/2016BJ122
231/01/2016BJ192
314/02/2016BJ-33
428/02/2016BJ192
513/03/2016BJ53
627/03/2016SH-44
710/04/2016SH193
824/04/2016SH203
908/05/2016SH173
1022/05/2016SH42
1105/06/2016SH-104
1219/06/2016SH05
1303/07/2016SH-95
1417/07/2016GZ102
1531/07/2016GZ-15
1614/08/2016GZ15
1728/08/2016GZ254
1811/09/2016SZ201
1925/09/2016SZ-104

对df做groupby,传入两个columns,‘city’和‘wind’,产生一个groupby对象

g_new=df.groupby(['city','wind'])
g_new
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000017B3ADF1E08>

对新产生的groupby对象做group,比原始的g的groups要多,因为加上了wind,对于每个城市不同的wind又会分group

g_new.groups
{('BJ', 2): Int64Index([1, 2, 4], dtype='int64'),
 ('BJ', 3): Int64Index([3, 5], dtype='int64'),
 ('BJ', 5): Int64Index([0], dtype='int64'),
 ('GZ', 2): Int64Index([14], dtype='int64'),
 ('GZ', 4): Int64Index([17], dtype='int64'),
 ('GZ', 5): Int64Index([15, 16], dtype='int64'),
 ('SH', 2): Int64Index([10], dtype='int64'),
 ('SH', 3): Int64Index([7, 8, 9], dtype='int64'),
 ('SH', 4): Int64Index([6, 11], dtype='int64'),
 ('SH', 5): Int64Index([12, 13], dtype='int64'),
 ('SZ', 1): Int64Index([18], dtype='int64'),
 ('SZ', 4): Int64Index([19], dtype='int64')}
g.groups
{('BJ': Int64Index([0,1,2,3,4,5], dtype='int64'),
  'GZ': Int64Index([14,15,16,17], dtype='int64'),
  'SH': Int64Index([6, 7, 8, 9, 10, 11, 12, 13], dtype='int64'),
  'SZ': Int64Index([18, 19], dtype='int64'),

在g获取BJ这一group

g.get_group('BJ')
datetemperaturewind
003/01/201685
117/01/2016122
231/01/2016192
314/02/2016-33
428/02/2016192
513/03/201653

在g_new获取BJ,wind为2这一group

g_new.get_group(('BJ',2))
datecitytemperaturewind
117/01/2016BJ122
231/01/2016BJ192
428/02/2016BJ192

在g_new获取BJ,wind为3这一group

g_new.get_group(('BJ',3))
datecitytemperaturewind
314/02/2016BJ-33
513/03/2016BJ53

用for循环访问的按多个columns做group的所有group

for (name_1,name_2),group in g_new:
    print(name_1,name_2)
    print(group)
BJ 2
         date city  temperature  wind
1  17/01/2016   BJ           12     2
2  31/01/2016   BJ           19     2
4  28/02/2016   BJ           19     2
BJ 3
         date city  temperature  wind
3  14/02/2016   BJ           -3     3
5  13/03/2016   BJ            5     3
BJ 5
         date city  temperature  wind
0  03/01/2016   BJ            8     5
GZ 2
          date city  temperature  wind
14  17/07/2016   GZ           10     2
GZ 4
          date city  temperature  wind
17  28/08/2016   GZ           25     4
GZ 5
          date city  temperature  wind
15  31/07/2016   GZ           -1     5
16  14/08/2016   GZ            1     5
SH 2
          date city  temperature  wind
10  22/05/2016   SH            4     2
SH 3
         date city  temperature  wind
7  10/04/2016   SH           19     3
8  24/04/2016   SH           20     3
9  08/05/2016   SH           17     3
SH 4
          date city  temperature  wind
6   27/03/2016   SH           -4     4
11  05/06/2016   SH          -10     4
SH 5
          date city  temperature  wind
12  19/06/2016   SH            0     5
13  03/07/2016   SH           -9     5
SZ 1
          date city  temperature  wind
18  11/09/2016   SZ           20     1
SZ 4
          date city  temperature  wind
19  25/09/2016   SZ          -10     4
  • 1
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值