Chapter 10 Data Aggregation and Group Operations
Splitting a dataset into groups and computing statistics on each group is one of the key steps in data analysis. pandas provides this capability through its groupby machinery.
10.1 GroupBy Mechanics
The model is "split-apply-combine": first split the data into groups according to your needs or the requirements of the algorithm, then apply an analysis method to each group, and finally combine the results.
import numpy as np
import pandas as pd
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
print(df)
key1 key2 data1 data2
0 a one -1.287078 0.667990
1 a two 0.073181 -0.501827
2 b one -0.354324 1.296282
3 b two 0.980430 0.907485
4 a one 1.610517 -0.461716
# Group df by key2 and compute the mean of data2 for each group.
# Select the data2 column and group it by the key2 column:
grouped = df['data2'].groupby(df['key2'])
grouped.mean()
key2
one 0.500852
two 0.202829
Name: data2, dtype: float64
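Since the grouping key here is a column of the same DataFrame, an equivalent and more common shorthand is to pass the column name to groupby directly. A sketch on hypothetical fixed data so the result is reproducible:

```python
import pandas as pd

df = pd.DataFrame({'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data2': [0.5, -0.5, 1.0, 0.9, -0.4]})

# Passing the column name 'key2' is shorthand for passing df['key2']
result = df.groupby('key2')['data2'].mean()
```

Both spellings produce the same Series indexed by the distinct key2 values.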
If multiple arrays are passed as grouping keys, the data is grouped by each distinct combination of key values.
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
print(means)
print('\n')
print(means.unstack())
key1 key2
a one 0.161719
two 0.073181
b one -0.354324
two 0.980430
Name: data1, dtype: float64
key2 one two
key1
a 0.161719 0.073181
b -0.354324 0.980430
# The grouping keys need not be columns of the DataFrame; any arrays of the right length can be passed.
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
print(df['data1'].groupby([states, years]).mean())
California 2005 0.073181
2006 -0.354324
Ohio 2005 -0.153324
2006 1.610517
Name: data1, dtype: float64
The GroupBy size method returns a Series containing the size of each group:
print(df.groupby(['key1', 'key2']).size())
key1 key2
a one 2
two 1
b one 1
two 1
dtype: int64
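Note that size counts every row in a group, while the related count aggregation skips missing values. A small sketch of the difference on hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'],
                   'val': [1.0, np.nan, 2.0]})

# size counts all rows per group; count counts only non-NA values
sizes = df.groupby('key').size()
counts = df.groupby('key')['val'].count()
```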
Iterating over groups
for name, group in df.groupby('key1'):
print(name)
print(group)
a
key1 key2 data1 data2
0 a one -1.287078 0.667990
1 a two 0.073181 -0.501827
4 a one 1.610517 -0.461716
b
key1 key2 data1 data2
2 b one -0.354324 1.296282
3 b two 0.980430 0.907485
for (k1, k2), group in df.groupby(['key1', 'key2']):
print((k1,k2))
print(group)
('a', 'one')
key1 key2 data1 data2
0 a one -1.287078 0.667990
4 a one 1.610517 -0.461716
('a', 'two')
key1 key2 data1 data2
1 a two 0.073181 -0.501827
('b', 'one')
key1 key2 data1 data2
2 b one -0.354324 1.296282
('b', 'two')
key1 key2 data1 data2
3 b two 0.98043 0.907485
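A useful recipe built on this iteration is to collect the group pieces into a dict for later lookup. A sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Materialize the group pieces as a dict keyed by group name
pieces = dict(list(df.groupby('key1')))
```

Each value in `pieces` is the sub-DataFrame for that group, keeping its original row index.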
Selecting a column or subset of columns
A single column (or subset of columns) can be selected from the GroupBy object as follows:
s_grouped = df.groupby(['key1', 'key2'])['data2']
print(s_grouped.mean())
key1 key2
a one 0.103137
two -0.501827
b one 1.296282
two 0.907485
Name: data2, dtype: float64
Grouping with dicts and Series
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan
print(people)
a b c d e
Joe -0.582748 -0.005825 -0.957329 -0.840149 -0.325898
Steve 1.112183 0.624433 2.891562 -0.420787 0.539915
Wes 0.445727 NaN NaN -0.295581 -1.019202
Jim -1.778101 0.924397 0.772355 0.920256 -0.049690
Travis 0.079208 0.737500 -0.065959 0.679241 -0.262730
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}
by_column = people.groupby(mapping, axis=1)
print(by_column.sum())
blue red
Joe -1.797478 -0.914472
Steve 2.470775 2.276531
Wes -0.295581 -0.573475
Jim 1.692611 -0.903393
Travis 0.613282 0.553978
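The same grouping works if a Series is used in place of the dict, since pandas treats its values as the group labels. A sketch on hypothetical fixed data (transposing and grouping the rows, which is equivalent to the column-wise axis=1 form used above):

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(np.arange(25.).reshape(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red'}

# A Series works like a dict: its values label the groups
map_series = pd.Series(mapping)
counts = people.T.groupby(map_series).count().T
```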
Grouping with functions
To group by the length of the index strings, there is no need to build an array of lengths first; the simpler method is to pass the len function:
print(people.groupby(len).sum())
a b c d e
3 -1.915123 0.918572 -0.184974 -0.215474 -1.394790
5 1.112183 0.624433 2.891562 -0.420787 0.539915
6 0.079208 0.737500 -0.065959 0.679241 -0.262730
# Functions can also be mixed with arrays, dicts, or Series to group on several keys at once:
key_list = ['one', 'one', 'one', 'two', 'two']
print( people.groupby([len, key_list]).min())
a b c d e
3 one -0.582748 -0.005825 -0.957329 -0.840149 -1.019202
two -1.778101 0.924397 0.772355 0.920256 -0.049690
5 one 1.112183 0.624433 2.891562 -0.420787 0.539915
6 two 0.079208 0.737500 -0.065959 0.679241 -0.262730
Grouping by index levels
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
print(hier_df)
cty US JP
tenor 1 3 5 1 3
0 -0.474091 -1.441095 -0.027898 0.411016 0.830097
1 -0.173916 -0.729263 -0.885296 -0.574423 -1.892896
2 -1.158821 -0.939684 -1.185016 -0.915995 -0.892493
3 -1.066578 1.639527 0.651484 -0.054653 0.572186
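To actually group by a level of this hierarchical index, pass the level keyword. A minimal sketch (with zero-filled data standing in for the random values, and a transpose standing in for axis=1 column grouping):

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.zeros((4, 5)), columns=columns)

# Group the columns by the 'cty' level of the MultiIndex and count per group
level_counts = hier_df.T.groupby(level='cty').count().T
```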
10.2 Data Aggregation
Aggregation refers to any data transformation that produces scalar values from arrays, such as the common summary statistics below.
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)
key1
a 1.303050
b 0.846955
Name: data1, dtype: float64
print( grouped.describe())
data1 \
count mean std min 25% 50% 75%
key1
a 3.0 0.132207 1.449699 -1.287078 -0.606948 0.073181 0.841849
b 2.0 0.313053 0.943814 -0.354324 -0.020636 0.313053 0.646742
data2 \
max count mean std min 25% 50%
key1
a 1.610517 3.0 -0.098518 0.664118 -0.501827 -0.481771 -0.461716
b 0.980430 2.0 1.101883 0.274921 0.907485 1.004684 1.101883
75% max
key1
a 0.103137 0.667990
b 1.199083 1.296282
Column-wise and multiple function application
tips = pd.read_csv('examples/tips.csv')
tips['tip_pct'] = tips['tip'] / tips['total_bill']
print(tips[:6])
total_bill tip smoker day time size tip_pct
0 16.99 1.01 No Sun Dinner 2 0.059447
1 10.34 1.66 No Sun Dinner 3 0.160542
2 21.01 3.50 No Sun Dinner 3 0.166587
3 23.68 3.31 No Sun Dinner 2 0.139780
4 24.59 3.61 No Sun Dinner 4 0.146808
5 25.29 4.71 No Sun Dinner 4 0.186240
# Group the tips data by day and smoker.
grouped = tips.groupby(['day', 'smoker'])
grouped_pct = grouped['tip_pct']
print(grouped_pct.agg('mean'))
day smoker
Fri No 0.151650
Yes 0.174783
Sat No 0.158048
Yes 0.147906
Sun No 0.160113
Yes 0.187250
Thur No 0.160298
Yes 0.163863
Name: tip_pct, dtype: float64
# If a list of functions or function names is passed instead, the resulting DataFrame
# has one column per function, named accordingly:
print(grouped_pct.agg(['mean', 'std']))  # note: an attempt to pass a custom peak_to_peak function failed here
mean std
day smoker
Fri No 0.151650 0.028123
Yes 0.174783 0.051293
Sat No 0.158048 0.039767
Yes 0.147906 0.061375
Sun No 0.160113 0.042347
Yes 0.187250 0.154134
Thur No 0.160298 0.038774
Yes 0.163863 0.039389
If a list of (name, function) tuples is passed,
the first element of each tuple is used as the corresponding column name in the resulting DataFrame.
print(grouped_pct.agg([('foo', 'mean'), ('bar', np.std)]))
foo bar
day smoker
Fri No 0.151650 0.028123
Yes 0.174783 0.051293
Sat No 0.158048 0.039767
Yes 0.147906 0.061375
Sun No 0.160113 0.042347
Yes 0.187250 0.154134
Thur No 0.160298 0.038774
Yes 0.163863 0.039389
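Regarding the failed peak_to_peak attempt noted above: a custom aggregation function can be passed to agg alongside built-in names, as long as it reduces an array to a scalar. A working sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'val': [1.0, 3.0, 2.0, 6.0]})

def peak_to_peak(arr):
    # range of each group: max minus min
    return arr.max() - arr.min()

# Custom functions mix freely with built-in names; the result column is
# named after the function's __name__
result = df.groupby('key')['val'].agg(['mean', peak_to_peak])
```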
# To compute the same three statistics for two columns at once:
functions = ['count', 'mean', 'max']
result = grouped[['tip_pct', 'total_bill']].agg(functions)
print(result)
tip_pct total_bill
count mean max count mean max
day smoker
Fri No 4 0.151650 0.187735 4 18.420000 22.75
Yes 15 0.174783 0.263480 15 16.813333 40.17
Sat No 45 0.158048 0.291990 45 19.661778 48.33
Yes 42 0.147906 0.325733 42 21.276667 50.81
Sun No 57 0.160113 0.252672 57 20.506667 48.17
Yes 19 0.187250 0.710345 19 24.120000 45.35
Thur No 45 0.160298 0.266312 45 17.113111 41.19
Yes 17 0.163863 0.241255 17 19.190588 43.11
Returning aggregated data without row indexes
Passing as_index=False returns the result with a default integer index instead of using the group keys as the index:
print(tips.groupby(['day', 'smoker'], as_index=False).mean(numeric_only=True))
day smoker total_bill tip size tip_pct
0 Fri No 18.420000 2.812500 2.250000 0.151650
1 Fri Yes 16.813333 2.714000 2.066667 0.174783
2 Sat No 19.661778 3.102889 2.555556 0.158048
3 Sat Yes 21.276667 2.875476 2.476190 0.147906
4 Sun No 20.506667 3.167895 2.929825 0.160113
5 Sun Yes 24.120000 3.516842 2.578947 0.187250
6 Thur No 17.113111 2.673778 2.488889 0.160298
7 Thur Yes 19.190588 3.030000 2.352941 0.163863
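The same result can always be obtained by aggregating with the default index and then calling reset_index. A sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'],
                   'val': [1.0, 3.0, 2.0]})

# as_index=False is equivalent to aggregating and then resetting the index
r1 = df.groupby('key', as_index=False).mean()
r2 = df.groupby('key').mean().reset_index()
```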
10.3 apply: General split-apply-combine
# A function that selects the rows with the largest values in a given column:
def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]
print(top(tips, n=6))
total_bill tip smoker day time size tip_pct
109 14.31 4.00 Yes Sat Dinner 2 0.279525
183 23.17 6.50 Yes Sun Dinner 4 0.280535
232 11.61 3.39 No Sat Dinner 2 0.291990
67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345
# Group by smoker and call apply with this function.
print( tips.groupby('smoker').apply(top))
# top is called on each row-piece of the DataFrame, and the results are glued together with pandas.concat
total_bill tip smoker day time size tip_pct
smoker
No 88 24.71 5.85 No Thur Lunch 2 0.236746
185 20.69 5.00 No Sun Dinner 5 0.241663
51 10.29 2.60 No Sun Dinner 2 0.252672
149 7.51 2.00 No Thur Lunch 2 0.266312
232 11.61 3.39 No Sat Dinner 2 0.291990
Yes 109 14.31 4.00 Yes Sat Dinner 2 0.279525
183 23.17 6.50 Yes Sun Dinner 4 0.280535
67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345
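Any extra arguments after the function name are forwarded to it on each call, so the same function can be reused with different parameters. A sketch on hypothetical data, using a simplified top that sorts a val column:

```python
import pandas as pd

def top(df, n=2, column='val'):
    # rows with the n largest values in `column`
    return df.sort_values(by=column)[-n:]

df = pd.DataFrame({'key': ['a', 'a', 'a', 'b', 'b'],
                   'val': [3.0, 1.0, 2.0, 5.0, 4.0]})

# Keyword arguments after the function are passed through to top on each group
result = df.groupby('key').apply(top, n=1)
```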
Suppressing the group keys
Passing group_keys=False disables the hierarchical index formed from the group keys, so all rows appear in one flat table:
print( tips.groupby('smoker', group_keys=False).apply(top))
total_bill tip smoker day time size tip_pct
88 24.71 5.85 No Thur Lunch 2 0.236746
185 20.69 5.00 No Sun Dinner 5 0.241663
51 10.29 2.60 No Sun Dinner 2 0.252672
149 7.51 2.00 No Thur Lunch 2 0.266312
232 11.61 3.39 No Sat Dinner 2 0.291990
109 14.31 4.00 Yes Sat Dinner 2 0.279525
183 23.17 6.50 Yes Sun Dinner 4 0.280535
67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345
Quantile and bucket analysis
The following example cuts the data into buckets, groups by bucket, and then computes summary statistics for each group.
frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})
quartiles = pd.cut(frame.data1, 4)
print( quartiles[:10])
0 (-0.392, 1.249]
1 (-2.034, -0.392]
2 (-2.034, -0.392]
3 (-0.392, 1.249]
4 (-0.392, 1.249]
5 (1.249, 2.89]
6 (-2.034, -0.392]
7 (-2.034, -0.392]
8 (-0.392, 1.249]
9 (-0.392, 1.249]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.681, -2.034] < (-2.034, -0.392] < (-0.392, 1.249] < (1.249, 2.89]]
def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}
grouped = frame.data2.groupby(quartiles)
print( grouped.apply(get_stats).unstack())
min max count mean
data1
(-3.681, -2.034] -2.434236 0.861348 19.0 -0.192120
(-2.034, -0.392] -2.856124 3.208903 313.0 0.036098
(-0.392, 1.249] -2.630524 2.791874 557.0 0.061876
(1.249, 2.89] -2.068452 2.237532 111.0 -0.009870
# To get equal-size buckets based on sample quantiles, use qcut instead.
# Passing labels=False returns just the quantile numbers:
grouping = pd.qcut(frame.data1, 10, labels=False)
grouped = frame.data2.groupby(grouping)
print( grouped.apply(get_stats).unstack())
min max count mean
data1
0 -2.434236 2.511737 100.0 -0.079724
1 -2.709602 2.556729 100.0 -0.019378
2 -2.632848 3.208903 100.0 0.130136
3 -2.856124 2.791874 100.0 0.024044
4 -1.812431 2.503049 100.0 0.142189
5 -2.304221 2.321085 100.0 -0.054853
6 -1.763090 2.657159 100.0 0.163413
7 -1.816266 2.559731 100.0 0.162461
8 -2.630524 2.039064 100.0 -0.101672
9 -2.068452 2.237532 100.0 0.043562
Filling missing values with group-specific values
Sometimes you may want to fill NA values with a fixed value or with a value derived from the data itself. fillna is the tool to use; for example, here the mean is used to fill in the NA values:
s = pd.Series(np.random.randn(6))
s[::2] = np.nan
print(s)
print('\n')
print( s.fillna(s.mean()))
0 NaN
1 -1.017690
2 NaN
3 -0.324837
4 NaN
5 -0.441826
dtype: float64
0 -0.594784
1 -1.017690
2 -0.594784
3 -0.324837
4 -0.594784
5 -0.441826
dtype: float64
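To perform the group-specific fill this section is named for, fill each group's missing values with that group's own mean rather than the global mean. A sketch on hypothetical data, using transform so the original index is preserved:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, 10.0, np.nan, 30.0])
group_key = ['east', 'east', 'east', 'west', 'west', 'west']

# The fill function runs once per group; each NaN gets its own group's mean
filled = s.groupby(group_key).transform(lambda g: g.fillna(g.mean()))
```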
Random sampling and permutation
There are many ways to draw a random sample; one method is to call the sample method on a Series:
suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)
deck = pd.Series(card_val, index=cards)
print(deck[:13])
AH 1
2H 2
3H 3
4H 4
5H 5
6H 6
7H 7
8H 8
9H 9
10H 10
JH 10
KH 10
QH 10
dtype: int64
def draw(deck, n=5):
    return deck.sample(n)
draw(deck)
KC 10
QC 10
JD 10
8D 8
QS 10
dtype: int64
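To draw a random hand from each suit, combine sample with groupby, using the suit character at the end of each card name as the key:

```python
import pandas as pd

suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)
deck = pd.Series(card_val, index=cards)

# The suit is the last character of each card name; draw two cards per suit
get_suit = lambda card: card[-1]
drawn = deck.groupby(get_suit, group_keys=False).apply(lambda g: g.sample(n=2))
```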
Group weighted average and correlation
df = pd.DataFrame({'category': ['a', 'a', 'a', 'a',
                                'b', 'b', 'b', 'b'],
                   'data': np.random.randn(8),
                   'weights': np.random.rand(8)})
print(df)
category data weights
0 a 1.154081 0.056365
1 a 0.689828 0.459289
2 a 1.556843 0.570800
3 a -0.133847 0.449101
4 b -0.925937 0.381151
5 b -1.068985 0.839105
6 b -0.261693 0.932000
7 b -1.052958 0.847123
# Compute the group weighted average by category:
grouped = df.groupby('category')
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
print( grouped.apply(get_wavg))
category
a 0.788259
b -0.795430
dtype: float64
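The weighted average can be sanity-checked on hypothetical fixed data, where the expected result is easy to compute by hand:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                   'data': [1.0, 3.0, 2.0, 4.0],
                   'weights': [1.0, 3.0, 1.0, 1.0]})

# np.average with the weights column gives a per-group weighted mean:
# group 'a': (1*1 + 3*3) / (1 + 3) = 2.5; group 'b': (2 + 4) / 2 = 3.0
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
wavg = df.groupby('category').apply(get_wavg)
```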
close_px = pd.read_csv('examples/stock_px_2.csv', parse_dates=True,
                       index_col=0)
print(close_px[-4:])
AAPL MSFT XOM SPX
2011-10-11 400.29 27.00 76.27 1195.54
2011-10-12 402.19 26.96 77.16 1207.25
2011-10-13 408.43 27.18 76.37 1203.66
2011-10-14 422.00 27.27 78.11 1224.58
spx_corr = lambda x: x.corrwith(x['SPX'])  # pairwise correlation of each column with the SPX column
rets = close_px.pct_change().dropna()  # pct_change computes the percent changes of close_px
get_year = lambda x: x.year
by_year = rets.groupby(get_year)
print(by_year.apply(spx_corr))
AAPL MSFT XOM SPX
2003 0.541124 0.745174 0.661265 1.0
2004 0.374283 0.588531 0.557742 1.0
2005 0.467540 0.562374 0.631010 1.0
2006 0.428267 0.406126 0.518514 1.0
2007 0.508118 0.658770 0.786264 1.0
2008 0.681434 0.804626 0.828303 1.0
2009 0.707103 0.654902 0.797921 1.0
2010 0.710105 0.730118 0.839057 1.0
2011 0.691931 0.800996 0.859975 1.0
print(by_year.apply(lambda g: g['AAPL'].corr(g['MSFT'])))
2003 0.480868
2004 0.259024
2005 0.300093
2006 0.161735
2007 0.417738
2008 0.611901
2009 0.432738
2010 0.571946
2011 0.581987
dtype: float64
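The corrwith behavior can be illustrated on a tiny hypothetical frame where one column is a scaled copy of SPX, so its correlation with SPX is exactly 1:

```python
import pandas as pd

rets = pd.DataFrame({'AAPL': [0.02, 0.04, -0.02, 0.06],   # 2 x SPX
                     'SPX': [0.01, 0.02, -0.01, 0.03]})

# corrwith computes the pairwise correlation of each column with the SPX column
corrs = rets.corrwith(rets['SPX'])
```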
Group-wise linear regression
import statsmodels.api as sm
# The regress function runs an ordinary least squares (OLS) regression on each data chunk.
def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params
print(by_year.apply(regress, 'AAPL', ['SPX']))
SPX intercept
2003 1.195406 0.000710
2004 1.363463 0.004201
2005 1.766415 0.003246
2006 1.645496 0.000080
2007 1.198761 0.003438
2008 0.968016 -0.001110
2009 0.879103 0.002954
2010 1.052608 0.001261
2011 0.806605 0.001514
10.4 Pivot Tables and Cross-Tabulation
print(tips.pivot_table(index=['day', 'smoker']))
size tip tip_pct total_bill
day smoker
Fri No 2.250000 2.812500 0.151650 18.420000
Yes 2.066667 2.714000 0.174783 16.813333
Sat No 2.555556 3.102889 0.158048 19.661778
Yes 2.476190 2.875476 0.147906 21.276667
Sun No 2.929825 3.167895 0.160113 20.506667
Yes 2.578947 3.516842 0.187250 24.120000
Thur No 2.488889 2.673778 0.160298 17.113111
Yes 2.352941 3.030000 0.163863 19.190588
# Aggregate only tip_pct and size, additionally grouping by time:
print(tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'],columns='smoker'))
size tip_pct
smoker No Yes No Yes
time day
Dinner Fri 2.000000 2.222222 0.139622 0.165347
Sat 2.555556 2.476190 0.158048 0.147906
Sun 2.929825 2.578947 0.160113 0.187250
Thur 2.000000 NaN 0.159744 NaN
Lunch Fri 3.000000 1.833333 0.187735 0.188937
Thur 2.500000 2.352941 0.160311 0.163863
Cross-tabulation: crosstab
data = pd.DataFrame({'q': np.random.randn(6),
                     'p': np.random.randn(6),
                     'mm': ['a', 'b', 'a', 'b', 'a', 'a'],
                     'nn': ['bob', 'h', 'g', 'h', 'h', 'bob']})
print(data)
q p mm nn
0 1.178628 0.373467 a bob
1 0.344182 -0.030927 b h
2 -1.837187 0.209537 a g
3 1.131483 0.848201 b h
4 0.386403 -0.199555 a h
5 -0.989318 -0.038694 a bob
print(pd.crosstab(data.mm, data.nn, margins=True))
nn bob g h All
mm
a 2 1 1 4
b 0 0 2 2
All 2 1 3 6
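Only the mm and nn columns matter for the counts, so the same table can be reproduced from just those two columns; margins=True appends the 'All' row and column totals:

```python
import pandas as pd

data = pd.DataFrame({'mm': ['a', 'b', 'a', 'b', 'a', 'a'],
                     'nn': ['bob', 'h', 'g', 'h', 'h', 'bob']})

# crosstab counts the co-occurrences of the two key columns
ct = pd.crosstab(data.mm, data.nn, margins=True)
```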