第六章 数据分组与聚合
6.1数据分组
6.1.1GroupBy简介
>>> import pandas as pd
>>> tips = pd.read_csv(open('/Users/anchror/Documents/python/20200311/tips.csv'))
>>> grouped = tips['tip'].groupby([tips['day'],tips['time']]).mean()
>>> grouped.plot(kind='barh')
>>> import matplotlib.pyplot as plt
>>> plt.show()
6.1.2 按列名分组
>>> size_mean1 = tips['tip'].groupby(tips['size']).mean()
>>> size_mean2 = tips.groupby('size')['tip'].mean()
>>> size_mean1
size
1 1.437500
2 2.582308
3 3.393158
4 4.135405
5 4.028000
6 5.225000
Name: tip, dtype: float64
>>> size_mean2
size
1 1.437500
2 2.582308
3 3.393158
4 4.135405
5 4.028000
6 5.225000
Name: tip, dtype: float64
6.1.3按列表或元组分组
>>> df = DataFrame(np.arange(16).reshape(4,4))
>>> df
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
>>> list = ['a','b','a','b']
>>> df.groupby(list).sum()
0 1 2 3
a 8 10 12 14
b 16 18 20 22
6.1.4按字典分组
>>> df = DataFrame(np.arange(16).reshape(4,4),index=['a','b','a','b'])
>>> df
0 1 2 3
a 0 1 2 3
b 4 5 6 7
a 8 9 10 11
b 12 13 14 15
>>> dict1={'a':'one','b':'one'}
>>> df.groupby(dict1).sum()
0 1 2 3
one 24 28 32 36
6.1.5按函数分组
>>> dt = DataFrame(np.random.randn(4,4))
>>> dt
0 1 2 3
0 0.971588 0.951163 -1.235710 0.276430
1 -1.092289 -0.199758 -0.880682 0.994877
2 0.946338 1.048721 -1.050854 -0.058461
3 1.270285 0.621590 -0.679906 -0.179442
>>> def jug(x):
... if x>=0:
... return 'a'
... else:
... return 'b'
...
>>> dt.groupby(dt[3].map(jug)).sum()
0 1 2 3
3
a -0.120701 0.751405 -2.116392 1.271307
b 2.216623 1.670311 -1.730760 -0.237903
>>> dt[3].groupby(dt[3].map(jug)).sum()
3
a 1.271307
b -0.237903
Name: 3, dtype: float64
>>> df
apple orange
red green red green
one a 0 1 2 3
b 4 5 6 7
two a 8 9 10 11
b 12 13 14 15
>>> df.groupby(level=1).sum()
apple orange
red green red green
a 8 10 12 14
b 16 18 20 22
>>> df.groupby(level=1,axis=1).sum()
green red
one a 4 2
b 12 10
two a 20 18
b 28 26
6.2聚合运算
6.2.1聚合运算
参数 | 使用说明 |
---|---|
count | 计数 |
sum | 求和 |
mean | 求平均值 |
median | 求算数中位数 |
std\var | 无偏标准差和方差 |
min\max | 求最小值和最大值 |
prod | 求积 |
first\last | 第一个和最后一个 |
6.2.2多函数应用
1.一列多函数
>>> tips.groupby(['sex','smoker'])['tip'].agg(['mean','std'])
mean std
sex smoker
Female No 2.773519 1.128425
Yes 2.931515 1.219916
Male No 3.113402 1.489559
Yes 3.051167 1.500120
2.多列多函数
>>> tips.groupby(['day','time'])['total_bill','tip'].agg([('tip-mean','mean'),('tip-std','std')])
total_bill tip
tip-mean tip-std tip-mean tip-std
day time
Fri Dinner 19.663333 9.471753 2.940000 1.156098
Lunch 12.845714 2.842228 2.382857 0.662966
Sat Dinner 20.441379 9.480419 2.993103 1.631014
Sun Dinner 21.410000 8.832122 3.255132 1.234880
Thur Dinner 18.780000 NaN 3.000000 NaN
Lunch 17.664754 7.950334 2.767705 1.250162
3.不同列不同函数
>>> tips.groupby(['day','time'])['total_bill','tip'].agg({'total_bill':['mean','sum'],'tip':['mean']})
<stdin>:1: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
total_bill tip
mean sum mean
day time
Fri Dinner 19.663333 235.96 2.940000
Lunch 12.845714 89.92 2.382857
Sat Dinner 20.441379 1778.40 2.993103
Sun Dinner 21.410000 1627.16 3.255132
Thur Dinner 18.780000 18.78 3.000000
Lunch 17.664754 1077.55 2.767705
6.3分组运算
6.3.1transform方法
>>> data
{'name': ['张三', '李四', 'peter', '王五', '小明', '小红'], 'sex': ['female', 'female', 'male', 'male', 'male', 'female'], 'math': [67, 72, nan, 82, 90, nan]}
>>> DataFrame(data)
name sex math
0 张三 female 67.0
1 李四 female 72.0
2 peter male NaN
3 王五 male 82.0
4 小明 male 90.0
5 小红 female NaN
>>> data=DataFrame(data)
>>> data
name sex math
0 张三 female 67.0
1 李四 female 72.0
2 peter male NaN
3 王五 male 82.0
4 小明 male 90.0
5 小红 female NaN
>>> tt=data.fillna(data['math'].mean())
>>> tt
name sex math
0 张三 female 67.00
1 李四 female 72.00
2 peter male 77.75
3 王五 male 82.00
4 小明 male 90.00
5 小红 female 77.75
>>> f= lambda x:x.fillna(x.mean())
>>> data.groupby('sex').apply(f)
name sex math
sex
female 0 张三 female 67.0
1 李四 female 72.0
5 小红 female 69.5
male 2 peter male 86.0
3 王五 male 82.0
4 小明 male 90.0
6.4数据透视表
6.4.1透视图
pivot_table函数的常用参数:
value代表的是值
index为行
columns为列
计算默认是平均值,可以通过aggfunc参数来指定
margins加入分项小计
>>> tips.pivot_table(values='tip',index='sex',columns='smoker')
smoker No Yes
sex
Female 2.773519 2.931515
Male 3.113402 3.051167
>>> tips.pivot_table(values='tip',index='sex',columns='smoker',aggfunc='sum')
smoker No Yes
sex
Female 149.77 96.74
Male 302.00 183.07
>>> tips.pivot_table(values='tip',index='sex',columns='smoker',aggfunc='sum',margins=True)
smoker No Yes All
sex
Female 149.77 96.74 246.51
Male 302.00 183.07 485.07
All 451.77 279.81 731.58
6.4.2交叉表
>>> cross_table=pd.crosstab(index=tips['day'],columns=tips['size'])
>>> pd.crosstab(index=tips['day'],columns=tips['size'])
size 1 2 3 4 5 6
day
Fri 1 16 1 1 0 0
Sat 2 53 18 13 1 0
Sun 0 39 15 18 3 1
Thur 1 48 4 5 1 3
>>> df = cross_table.div(cross_table.sum(1),axis=0)
>>> df
size 1 2 3 4 5 6
day
Fri 0.052632 0.842105 0.052632 0.052632 0.000000 0.000000
Sat 0.022989 0.609195 0.206897 0.149425 0.011494 0.000000
Sun 0.000000 0.513158 0.197368 0.236842 0.039474 0.013158
Thur 0.016129 0.774194 0.064516 0.080645 0.016129 0.048387
>>> df.plot(kind='bar',stacked=True)
6.5综合实例——巴尔的摩公务员工资数据集
6.5.1数据来源
6.5.2定义问题
6.5.3数据清洗
>>> salary.isnull().sum()
Name 0
JobTitle 0
AgencyID 0
Agency 0
HireDate 0
AnnualSalary 0
GrossPay 272
dtype: int64
>>> salary=salary.dropna()
>>> salary.isnull().sum()
Name 0
JobTitle 0
AgencyID 0
Agency 0
HireDate 0
AnnualSalary 0
GrossPay 0
dtype: int64
>>> salary['AnnualSalary']=salary['AnnualSalary'].str.strip('$')
>>> salary['GrossPay']=salary['GrossPay'].str.strip('$')
>>>> salary['GrossPay'].dtype
dtype('O')
>>> salary['GrossPay']=salary['GrossPay'].astype('float')
>>> salary['AnnualSalary']=salary['AnnualSalary'].astype('float')
>>> salary['GrossPay'].dtype
dtype('float64')
>>> salary['month']=salary['HireDate'].str.split('/').str[0]
6.5.4数据探索
salary['AnnualSalary'].hist(bins=20)
plt.show()
>>> month_count = salary['month'].value_counts()
>>> month_count.plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot object at 0x119697670>
>>> plt.show()
>>> agg_salary=salary.groupby('JobTitle')['AnnualSalary'].agg(['mean','count'])
>>> sort_salary = agg_salary.sort_values(by='mean',ascending=False)[:5]
>>> sort_salary
mean count
JobTitle
STATE'S ATTORNEY 238772.0 1
Police Commissioner 200000.0 1
Executive Director V 182500.0 1
MAYOR 171635.0 1
Executive Director III 171306.5 10
>>> sort_salary['mean'].plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot object at 0x11c451dc0>
>>> plt.show()
>>> sortcount_salary = agg_salary.sort_values(by='count',ascending=False)[:5]
>>> sortcount_salary
mean count
JobTitle
POLICE OFFICER 70396.973804 1756
LABORER (Hourly) 31274.818512 551
EMT Firefighter Suppression 63688.088319 351
RECREATION ARTS INSTRUCTOR 24420.313480 319
OFFICE SUPPORT SPECIALIST III 35560.003300 303
>>> sortcount_salary['count'].plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot object at 0x11be5ea30>
>>> plt.show()