第六章 数据分组与聚合

第六章 数据分组与聚合

6.1数据分组

6.1.1GroupBy简介

>>> import pandas as pd
>>> tips = pd.read_csv(open('/Users/anchror/Documents/python/20200311/tips.csv'))
>>> grouped = tips['tip'].groupby([tips['day'],tips['time']]).mean()
>>> grouped.plot(kind='barh')
>>> import matplotlib.pyplot as plt
>>> plt.show()

在这里插入图片描述

6.1.2 按列名分组

>>> size_mean1 = tips['tip'].groupby(tips['size']).mean()
>>> size_mean2 = tips.groupby('size')['tip'].mean()
>>> size_mean1
size
1    1.437500
2    2.582308
3    3.393158
4    4.135405
5    4.028000
6    5.225000
Name: tip, dtype: float64
>>> size_mean2
size
1    1.437500
2    2.582308
3    3.393158
4    4.135405
5    4.028000
6    5.225000
Name: tip, dtype: float64

6.1.3按列表或元组分组

>>> df = DataFrame(np.arange(16).reshape(4,4))
>>> df
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
>>> list = ['a','b','a','b']
>>> df.groupby(list).sum()
    0   1   2   3
a   8  10  12  14
b  16  18  20  22

6.1.4按字典分组

>>> df = DataFrame(np.arange(16).reshape(4,4),index=['a','b','a','b'])
>>> df
    0   1   2   3
a   0   1   2   3
b   4   5   6   7
a   8   9  10  11
b  12  13  14  15
>>> dict1={'a':'one','b':'one'}
>>> df.groupby(dict1).sum()
      0   1   2   3
one  24  28  32  36

6.1.5按函数分组

>>> dt = DataFrame(np.random.randn(4,4))
>>> dt
          0         1         2         3
0  0.971588  0.951163 -1.235710  0.276430
1 -1.092289 -0.199758 -0.880682  0.994877
2  0.946338  1.048721 -1.050854 -0.058461
3  1.270285  0.621590 -0.679906 -0.179442
>>> def jug(x):
...     if x>=0:
...             return 'a'
...     else:
...             return 'b'
... 
>>> dt.groupby(dt[3].map(jug)).sum()
          0         1         2         3
3                                        
a -0.120701  0.751405 -2.116392  1.271307
b  2.216623  1.670311 -1.730760 -0.237903
>>> dt[3].groupby(dt[3].map(jug)).sum()
3
a    1.271307
b   -0.237903
Name: 3, dtype: float64
>>> df
      apple       orange      
        red green    red green
one a     0     1      2     3
    b     4     5      6     7
two a     8     9     10    11
    b    12    13     14    15
>>> df.groupby(level=1).sum()
  apple       orange      
    red green    red green
a     8    10     12    14
b    16    18     20    22
>>> df.groupby(level=1,axis=1).sum()
       green  red
one a      4    2
    b     12   10
two a     20   18
    b     28   26

6.2聚合运算

6.2.1聚合运算

参数使用说明
count计数
sum求和
mean求平均值
median求算数中位数
std\var无偏标准差和方差
min\max求最小值和最大值
prod求积
first\last第一个和最后一个

6.2.2多函数应用

1.一列多函数
>>> tips.groupby(['sex','smoker'])['tip'].agg(['mean','std'])
                   mean       std
sex    smoker                    
Female No      2.773519  1.128425
       Yes     2.931515  1.219916
Male   No      3.113402  1.489559
       Yes     3.051167  1.500120
2.多列多函数
>>> tips.groupby(['day','time'])['total_bill','tip'].agg([('tip-mean','mean'),('tip-std','std')])
            total_bill                 tip          
              tip-mean   tip-std  tip-mean   tip-std
day  time                                           
Fri  Dinner  19.663333  9.471753  2.940000  1.156098
     Lunch   12.845714  2.842228  2.382857  0.662966
Sat  Dinner  20.441379  9.480419  2.993103  1.631014
Sun  Dinner  21.410000  8.832122  3.255132  1.234880
Thur Dinner  18.780000       NaN  3.000000       NaN
     Lunch   17.664754  7.950334  2.767705  1.250162
3.不同列不同函数
>>> tips.groupby(['day','time'])['total_bill','tip'].agg({'total_bill':['mean','sum'],'tip':['mean']})
<stdin>:1: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
            total_bill                tip
                  mean      sum      mean
day  time                                
Fri  Dinner  19.663333   235.96  2.940000
     Lunch   12.845714    89.92  2.382857
Sat  Dinner  20.441379  1778.40  2.993103
Sun  Dinner  21.410000  1627.16  3.255132
Thur Dinner  18.780000    18.78  3.000000
     Lunch   17.664754  1077.55  2.767705

6.3分组运算

6.3.1transform方法

>>> data
{'name': ['张三', '李四', 'peter', '王五', '小明', '小红'], 'sex': ['female', 'female', 'male', 'male', 'male', 'female'], 'math': [67, 72, nan, 82, 90, nan]}
>>> DataFrame(data)
    name     sex  math
0     张三  female  67.0
1     李四  female  72.0
2  peter    male   NaN
3     王五    male  82.0
4     小明    male  90.0
5     小红  female   NaN
>>> data=DataFrame(data)
>>> data
    name     sex  math
0     张三  female  67.0
1     李四  female  72.0
2  peter    male   NaN
3     王五    male  82.0
4     小明    male  90.0
5     小红  female   NaN
>>> tt=data.fillna(data['math'].mean())
>>> tt
    name     sex   math
0     张三  female  67.00
1     李四  female  72.00
2  peter    male  77.75
3     王五    male  82.00
4     小明    male  90.00
5     小红  female  77.75
>>> f= lambda x:x.fillna(x.mean())
>>> data.groupby('sex').apply(f)
           name     sex  math
sex                          
female 0     张三  female  67.0
       1     李四  female  72.0
       5     小红  female  69.5
male   2  peter    male  86.0
       3     王五    male  82.0
       4     小明    male  90.0

6.4数据透视表

6.4.1透视图

pivot_table函数的常用参数:
value代表的是值
index为行
columns为列
计算默认是平均值,可以通过aggfunc参数来指定
margins加入分项小计

>>> tips.pivot_table(values='tip',index='sex',columns='smoker')
smoker        No       Yes
sex                       
Female  2.773519  2.931515
Male    3.113402  3.051167
>>> tips.pivot_table(values='tip',index='sex',columns='smoker',aggfunc='sum')
smoker      No     Yes
sex                   
Female  149.77   96.74
Male    302.00  183.07
>>> tips.pivot_table(values='tip',index='sex',columns='smoker',aggfunc='sum',margins=True)
smoker      No     Yes     All
sex                           
Female  149.77   96.74  246.51
Male    302.00  183.07  485.07
All     451.77  279.81  731.58

6.4.2交叉表

>>> cross_table=pd.crosstab(index=tips['day'],columns=tips['size'])
>>> pd.crosstab(index=tips['day'],columns=tips['size'])
size  1   2   3   4  5  6
day                      
Fri   1  16   1   1  0  0
Sat   2  53  18  13  1  0
Sun   0  39  15  18  3  1
Thur  1  48   4   5  1  3
>>> df = cross_table.div(cross_table.sum(1),axis=0)
>>> df
size         1         2         3         4         5         6
day                                                             
Fri   0.052632  0.842105  0.052632  0.052632  0.000000  0.000000
Sat   0.022989  0.609195  0.206897  0.149425  0.011494  0.000000
Sun   0.000000  0.513158  0.197368  0.236842  0.039474  0.013158
Thur  0.016129  0.774194  0.064516  0.080645  0.016129  0.048387
>>> df.plot(kind='bar',stacked=True)

在这里插入图片描述

6.5综合实例——巴尔的摩公务员工资数据集

6.5.1数据来源

6.5.2定义问题

6.5.3数据清洗

>>> salary.isnull().sum()
Name              0
JobTitle          0
AgencyID          0
Agency            0
HireDate          0
AnnualSalary      0
GrossPay        272
dtype: int64
>>> salary=salary.dropna()
>>> salary.isnull().sum()                                            
Name            0
JobTitle        0
AgencyID        0
Agency          0
HireDate        0
AnnualSalary    0
GrossPay        0
dtype: int64
>>> salary['AnnualSalary']=salary['AnnualSalary'].str.strip('$')
>>> salary['GrossPay']=salary['GrossPay'].str.strip('$')
>>>> salary['GrossPay'].dtype
dtype('O')
>>> salary['GrossPay']=salary['GrossPay'].astype('float')
>>> salary['AnnualSalary']=salary['AnnualSalary'].astype('float')
>>> salary['GrossPay'].dtype
dtype('float64')
>>> salary['month']=salary['HireDate'].str.split('/').str[0]

6.5.4数据探索

salary['AnnualSalary'].hist(bins=20)
plt.show()

在这里插入图片描述

>>> month_count = salary['month'].value_counts()
>>> month_count.plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot object at 0x119697670>
>>> plt.show()

在这里插入图片描述

>>> agg_salary=salary.groupby('JobTitle')['AnnualSalary'].agg(['mean','count'])
>>> sort_salary = agg_salary.sort_values(by='mean',ascending=False)[:5]
>>> sort_salary
                            mean  count
JobTitle                               
STATE'S ATTORNEY        238772.0      1
Police Commissioner     200000.0      1
Executive Director V    182500.0      1
MAYOR                   171635.0      1
Executive Director III  171306.5     10
>>> sort_salary['mean'].plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot object at 0x11c451dc0>
>>> plt.show()

在这里插入图片描述

>>> sortcount_salary = agg_salary.sort_values(by='count',ascending=False)[:5]
>>> sortcount_salary
                                       mean  count
JobTitle                                          
POLICE OFFICER                 70396.973804   1756
LABORER (Hourly)               31274.818512    551
EMT Firefighter Suppression    63688.088319    351
RECREATION ARTS INSTRUCTOR     24420.313480    319
OFFICE SUPPORT SPECIALIST III  35560.003300    303
>>> sortcount_salary['count'].plot(kind='bar')                       
<matplotlib.axes._subplots.AxesSubplot object at 0x11be5ea30>
>>> plt.show()

在这里插入图片描述

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值