pandas对每个分组分别统计百分比
数据如下:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({
'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id':
list('ABCDEF') * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]
}).sort_values(['state', 'office_id'], ignore_index=True)
print(df)
'''
state office_id sales
0 AZ B 222579
1 AZ D 252315
2 AZ F 835831
3 CA A 405711
4 CA C 710581
5 CA E 982371
6 CO A 404137
7 CO C 217952
8 CO E 474564
9 WA B 535829
10 WA D 548242
11 WA F 459783
'''
目标:根据每个state分别统计其下的每个office_id占比
首先想到的方法是df['sales'] / df.groupby('state')['sales'].sum()
,但由于行数不一致且索引也无法对齐,得到的结果都是NaN
,因此需先将state
和office_id
字段设置为索引后再操作
df['%'] = 100 * (df[['sales']] / df.groupby('state')['sales'].sum())
print(df)
'''
state office_id sales %
0 AZ B 222579 NaN
1 AZ D 252315 NaN
2 AZ F 835831 NaN
3 CA A 405711 NaN
4 CA C 710581 NaN
5 CA E 982371 NaN
6 CO A 404137 NaN
7 CO C 217952 NaN
8 CO E 474564 NaN
9 WA B 535829 NaN
10 WA D 548242 NaN
11 WA F 459783 NaN
'''
df = df.set_index(['state','office_id'])
df['%'] = 100 * (df / df.groupby('state').sum())
print(df)
'''
sales %
state office_id
AZ B 222579 16.981365
D 252315 19.250033
F 835831 63.768601
CA A 405711 19.331879
C 710581 33.858747
E 982371 46.809373
CO A 404137 36.851857
C 217952 19.874290
E 474564 43.273852
WA B 535829 34.707233
D 548242 35.511259
F 459783 29.781508
'''
我们还有另外一种方法transform
,比较下两种方法的结果,可以发现transform
生成了与原数据相同行数的结果,且不会丢弃非数值字段:
df.groupby('state').transform('sum')
'''
office_id sales
0 BDF 1310725
1 BDF 1310725
2 BDF 1310725
3 ACE 2098663
4 ACE 2098663
5 ACE 2098663
6 ACE 1096653
7 ACE 1096653
8 ACE 1096653
9 BDF 1543854
10 BDF 1543854
11 BDF 1543854
'''
df.groupby('state').sum()
'''
sales
state
AZ 1310725
CA 2098663
CO 1096653
WA 1543854
'''
由于transform
方法得到结果行数与原数据一致,因此可以直接进行运算:
df['%'] = 100 * (df['sales'] / df.groupby('state')['sales'].transform('sum'))
print(df)
'''
state office_id sales %
0 AZ B 222579 16.981365
1 AZ D 252315 19.250033
2 AZ F 835831 63.768601
3 CA A 405711 19.331879
4 CA C 710581 33.858747
5 CA E 982371 46.809373
6 CO A 404137 36.851857
7 CO C 217952 19.874290
8 CO E 474564 43.273852
9 WA B 535829 34.707233
10 WA D 548242 35.511259
11 WA F 459783 29.781508
'''
使用交叉表和透视图也可以实现目标:
df = pd.crosstab(index=df['state'],
columns=df['office_id'],
values=df['sales'],
aggfunc='sum',
normalize='index')
print(df)
'''
office_id A B C D E F
state
AZ 0.000000 0.169814 0.000000 0.192500 0.000000 0.637686
CA 0.193319 0.000000 0.338587 0.000000 0.468094 0.000000
CO 0.368519 0.000000 0.198743 0.000000 0.432739 0.000000
WA 0.000000 0.347072 0.000000 0.355113 0.000000 0.297815
'''
df = df.pivot_table(index='state',
columns='office_id').apply(lambda x: x / x.sum(), axis=1)
print(df)
'''
sales
office_id A B C D E F
state
AZ NaN 0.169814 NaN 0.192500 NaN 0.637686
CA 0.193319 NaN 0.338587 NaN 0.468094 NaN
CO 0.368519 NaN 0.198743 NaN 0.432739 NaN
WA NaN 0.347072 NaN 0.355113 NaN 0.297815
'''