【cookbook pandas】chapter 9: grouping by continuous variables, etc.

calculating weighted mean SAT scores per state with apply

  • about the ‘.apply’ method
    the groupby object has four methods that accept a function (or functions) to perform a calculation on each group.
    these four methods are .agg, .filter, .transform, and .apply.
    for each of the first three methods, the function must return a specific kind of output:
    the function passed to .agg must return a scalar value,
    the function passed to .filter must return a Boolean,
    the function passed to .transform must return a Series or DataFrame with the same length as the passed group.
    the .apply method, however, may return a scalar value, a Series, or even a DataFrame of any shape, which makes it very flexible.
    the .apply method is also called only once per group (on a DataFrame), while .transform and .agg get called once for each aggregating column (on a Series).
    a key strength of .apply is its ability to return a single object while operating on multiple columns at the same time.
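a minimal sketch of the four contracts on a toy DataFrame (hypothetical data, not from the book):

```python
import pandas as pd

# toy data: two groups, one value column
df = pd.DataFrame({'grp': ['a', 'a', 'b', 'b'],
                   'val': [1, 2, 3, 40]})
g = df.groupby('grp')

# .agg: the function must reduce each column of a group to a scalar
agg_res = g['val'].agg(lambda s: s.max() - s.min())

# .filter: the function must return a single Boolean per group;
# groups where it is False are dropped from the result
filt_res = g.filter(lambda d: d['val'].sum() > 10)

# .transform: the function must return one value per row (same length as the group)
trans_res = g['val'].transform(lambda s: s - s.mean())

# .apply: the function may return a scalar, a Series, or a DataFrame
apply_res = g.apply(lambda d: d['val'].sum())
```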
# Task: calculate the weighted average of both the math and verbal SAT scores per state from the college dataset
# weight the scores by the population of undergraduate students per school (UGDS)
import pandas as pd
college=pd.read_csv('college.csv')
subset=['UGDS','SATMTMID','SATVRMID']
# drop any rows that have missing values in the subset
# .dropna, by default, drops rows that have one or more missing values
college2=college.dropna(subset=subset)
college.shape
Out[13]: (7535, 27)
college2.shape
Out[14]: (1184, 27)
# create a user-defined function to calculate the weighted average of the SAT math scores
# this function gets passed a DataFrame of all the original columns for each group, and returns a single scalar
def weighted_math_average(df):
    weight_math=df['UGDS']*df['SATMTMID']
    return int(weight_math.sum()/df['UGDS'].sum())
# group by state and pass this function to the .apply method
# because each group has multiple columns and we want to reduce those to a single value,
# we need to use .apply
# the weighted_math_average function will be called once for each group (not once per column in the group)
# we cannot use the .agg method since it returns a value for each of its aggregating columns
college2.groupby('STABBR').apply(weighted_math_average)
Out[22]: 
STABBR
AK    503
AL    536
AR    529
AZ    569
CA    564
...   ...
WV    500
WY    540
dtype: int64
# we successfully returned a scalar value for each group
# take a detour to look at what happens when calling the .agg method instead
college2.groupby('STABBR').agg(weighted_math_average)
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-
  ....
KeyError: 'UGDS'
# this error occurs because .agg passes each aggregating column to the function one at a time as a Series, so indexing with 'UGDS' fails

# the weighted_math_average function gets applied to each aggregating column in the DataFrame individually
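.agg does work when the passed function only needs the one Series it is handed; a small sketch on a toy stand-in for the college data (hypothetical values):

```python
import pandas as pd

# toy stand-in for the college data
df = pd.DataFrame({'STABBR': ['AK', 'AL', 'AL'],
                   'SATMTMID': [500, 520, 540],
                   'UGDS': [100, 200, 300]})

# works: the lambda only uses the Series it is handed
ok = df.groupby('STABBR')['SATMTMID'].agg(lambda s: s.mean())

# fails: .agg hands the function one column at a time as a Series,
# so DataFrame-style indexing like x['UGDS'] raises KeyError
try:
    df.groupby('STABBR').agg(lambda x: x['UGDS'].sum())
except KeyError as e:
    err = e
```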
# the nice feature of .apply is that you can create multiple new columns by returning a Series
# the index of this returned Series becomes the column names
# modify the function to calculate the weighted and arithmetic averages for both SAT scores, along with the count of institutions in each group
def weighted_average(df):
    weight_m=df['UGDS']*df['SATMTMID']
    weight_v=df['UGDS']*df['SATVRMID']
    wm_avg=weight_m.sum()/df['UGDS'].sum()
    wv_avg=weight_v.sum()/df['UGDS'].sum()
    data={'w_math_avg':wm_avg,
          'w_verbal_avg':wv_avg,
          'math_avg':df['SATMTMID'].mean(),
          'verbal_avg':df['SATVRMID'].mean(),
          'count':len(df)}    
    return pd.Series(data)
college2.groupby('STABBR').apply(weighted_average)
Out[35]: 
        w_math_avg  w_verbal_avg    math_avg  verbal_avg  count
STABBR                                                         
AK      503.000000    555.000000  503.000000  555.000000    1.0
AL      536.137917    533.383387  504.285714  508.476190   21.0
AR      529.112332    504.876157  515.937500  491.875000   16.0
AZ      569.313985    557.303350  536.666667  538.333333    6.0
CA      564.945420    539.316605  562.902778  549.083333   72.0
CO      553.123820    547.033996  540.214286  537.714286   14.0
...  ...  ... 
WI      593.562862    556.348332  545.071429  516.857143   14.0
WV      500.725781    487.680187  481.705882  473.411765   17.0
WY      540.000000    535.000000  540.000000  535.000000    1.0
# to use .apply to create multiple columns, you must return a Series.
# the index values are used as column names in the resulting DataFrame
# you can return as many values as you want with this method
college2.groupby('STABBR').apply(weighted_average).astype(int)
Out[36]: 
        w_math_avg  w_verbal_avg  math_avg  verbal_avg  count
STABBR                                                       
AK             503           555       503         555      1
AL             536           533       504         508     21
AR             529           504       515         491     16
AZ             569           557       536         538      6

...
WY             540           535       540         535      1


# it's possible to return any number of rows and columns for each group by returning a DataFrame
# Task: also find the geometric and harmonic means for both SAT columns
# and return the results as a DataFrame with rows as the name of the type of mean and columns as the SAT type
# use the NumPy function `average` to compute the weighted average
# and the SciPy functions `gmean` and `hmean` for the geometric and harmonic means
from scipy.stats import gmean,hmean       
def calculate_means(df):
    df_mean=pd.DataFrame(index=['Arithmetic','Weight','Geometric','Harmonic'])
    cols=['SATMTMID','SATVRMID']
    for col in cols:
        arithmetic=df[col].mean()
        weighted=np.average(df[col],weights=df["UGDS"])
        geometric=gmean(df[col])
        harmonic=hmean(df[col])
        df_mean[col]=[arithmetic,weighted,geometric,harmonic]
    df_mean['count']=len(df)
    return df_mean.astype(int)

import numpy as np
college2.groupby('STABBR').apply(calculate_means)
Out[48]: 
                   SATMTMID  SATVRMID  count
STABBR                                      
AK     Arithmetic       503       555      1
       Weight           503       555      1
       Geometric        503       555      1
       Harmonic         503       555      1
AL     Arithmetic       504       508     21
                     ...       ...    ...
WV     Harmonic         480       472     17
WY     Arithmetic       540       535      1
       Weight           540       535      1
       Geometric        540       534      1
       Harmonic         540       535      1
[212 rows x 3 columns]
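as a sanity check on the definitions: the geometric mean is the exponential of the mean of the logs, and the harmonic mean is the reciprocal of the mean of the reciprocals. a quick hand check against SciPy on made-up scores:

```python
import numpy as np
from scipy.stats import gmean, hmean

scores = np.array([500.0, 520.0, 560.0])  # hypothetical SAT midpoints

# geometric mean: exp of the mean of logs
g_manual = np.exp(np.log(scores).mean())
# harmonic mean: reciprocal of the mean of reciprocals
h_manual = 1 / (1 / scores).mean()

assert np.isclose(g_manual, gmean(scores))
assert np.isclose(h_manual, hmean(scores))
```

for positive data the three always order as harmonic <= geometric <= arithmetic, which is why the rows in the table above differ only slightly when the scores in a state are close together.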


Grouping by continuous variables

import pandas as pd
import numpy as np
flights=pd.read_csv('flights.csv')
flights
Out[5]: 
       MONTH  DAY  WEEKDAY AIRLINE  ... SCHED_ARR ARR_DELAY  DIVERTED  CANCELLED
0          1    1        4      WN  ...      1905      65.0         0          0
1          1    1        4      UA  ...      1333     -13.0         0          0
2          1    1        4      MQ  ...      1453      35.0         0          0
3          1    1        4      AA  ...      1935      -7.0         0          0
4          1    1        4      WN  ...      2225      39.0         0          0
      ...  ...      ...     ...  ...       ...       ...       ...        ...
58487     12   31        4      AA  ...      1045     -19.0         0          0
58488     12   31        4      F9  ...      2050       4.0         0          0
58489     12   31        4      OO  ...      1956      -5.0         0          0
58490     12   31        4      WN  ...       855      34.0         0          0
58491     12   31        4      OO  ...      1146      -1.0         0          0
[58492 rows x 14 columns]
# if we want to find the distribution of airlines over a range of distances,
# we need to place the values of the 'DIST' column into discrete bins
# use the pandas `cut` function to split the data into five bins
bins=[-np.inf,200,500,1000,2000,np.inf]
# the bins are created by a sequence of six numbers defining the edges
# you always need one more edge than the number of bins
# the `cut` function places each value of the `DIST` column into one of the five bins
cuts=pd.cut(flights['DIST'],bins=bins)
cuts
Out[13]: 
0         (500.0, 1000.0]
1        (1000.0, 2000.0]
2         (500.0, 1000.0]
3        (1000.0, 2000.0]
4        (1000.0, 2000.0]
               ...       
58487    (1000.0, 2000.0]
58488      (200.0, 500.0]
58489      (200.0, 500.0]
58490     (500.0, 1000.0]
58491     (500.0, 1000.0]
Name: DIST, Length: 58492, dtype: category
Categories (5, interval[float64]): [(-inf, 200.0] < (200.0, 500.0] < (500.0, 1000.0] <
                                    (1000.0, 2000.0] < (2000.0, inf]]
                                   
# the `cuts` variable is now a Series of five ordered categories
# alternatively, pass the bins parameter a single integer, which automatically creates that number of equal-width bins
# negative and positive infinity are available in NumPy and ensure that all values get placed in a bin
# if you have values that fall outside the bin edges, they are set to missing and not placed in a bin
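a tiny sketch of that edge behavior (hypothetical values):

```python
import pandas as pd
import numpy as np

s = pd.Series([50, 250, 3000])

# finite edges: 3000 falls outside the bins and becomes NaN
c1 = pd.cut(s, bins=[0, 100, 500])

# infinite edges guarantee every value lands in a bin
c2 = pd.cut(s, bins=[-np.inf, 100, 500, np.inf])
```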
 
pd.cut(flights['DIST'],bins=5)
Out[15]: 
0        (62.565, 954.0]
1        (954.0, 1841.0]
2        (62.565, 954.0]
3        (954.0, 1841.0]
4        (954.0, 1841.0]
              ...       
58487    (954.0, 1841.0]
58488    (62.565, 954.0]
58489    (62.565, 954.0]
58490    (62.565, 954.0]
58491    (62.565, 954.0]
Name: DIST, Length: 58492, dtype: category
Categories (5, interval[float64]): [(62.565, 954.0] < (954.0, 1841.0] < (1841.0, 2728.0] <
                                    (2728.0, 3615.0] < (3615.0, 4502.0]]
# an ordered categorical Series is created
# to help get an idea of what happened, count the values of each category
cuts.value_counts()
Out[18]: 
(500.0, 1000.0]     20659
(200.0, 500.0]      15874
(1000.0, 2000.0]    14186
(2000.0, inf]        4054
(-inf, 200.0]        3719
Name: DIST, dtype: int64
# the `cuts` Series can now be used to form groups
# pandas allows you to pass many types of objects to the .groupby method
flights.groupby(cuts)['AIRLINE'].value_counts(normalize=True)
Out[21]: 
DIST              AIRLINE
(-inf, 200.0]     OO         0.325625
                  EV         0.289325
                  MQ         0.210809
                  DL         0.086045
                  AA         0.052165
                  UA         0.027427
                  WN         0.008604
(200.0, 500.0]    WN         0.193902
                  DL         0.188736
                  OO         0.158687
                  EV         0.156293
                  MQ         0.100164
                  AA         0.071375
                  UA         0.062051
                  VX         0.028222
                  US         0.016001
                  NK         0.011843
                  B6         0.006867
                  F9         0.004914
                  AS         0.000945
(500.0, 1000.0]   DL         0.205625
                  AA         0.143908
                  WN         0.138196
                  UA         0.131129
                  OO         0.106443
                  EV         0.100683
                  MQ         0.051213
                  F9         0.038192
                  NK         0.029527
                  US         0.025316
                  AS         0.023234
                  VX         0.003582
                  B6         0.002953
(1000.0, 2000.0]  AA         0.263781
                  UA         0.199070
                  DL         0.165092
                  WN         0.159664
                  OO         0.046454
                  NK         0.045115
                  US         0.040462
                  F9         0.030664
                  AS         0.015931
                  EV         0.015579
                  VX         0.012125
                  B6         0.003313
                  MQ         0.002749
(2000.0, inf]     UA         0.289097
                  AA         0.211643
                  DL         0.171436
                  B6         0.080414
                  VX         0.073754
                  US         0.065121
                  WN         0.046374
                  HA         0.027627
                  NK         0.019240
                  AS         0.011593
                  F9         0.003700
Name: AIRLINE, dtype: float64
flights.groupby(cuts)['AIRLINE'].value_counts(normalize=True).round(3).head(3)
Out[22]: 
DIST           AIRLINE
(-inf, 200.0]  OO         0.326
               EV         0.289
               MQ         0.211
Name: AIRLINE, dtype: float64
# the .value_counts method is used to get a sense of the distribution
# the .groupby method allows you to pass any object to group on
# this means that you are able to form groups from something completely unrelated to the DataFrame
# here we group by the values in the `cuts` variable
###############################
# create informative string labels when using the `cut` function
# these labels replace the interval notation found in the index
# chain the .unstack method, which transposes the inner index level to column names
labels=['Under an Hour','1 Hour','1-2 Hours','2-4 Hours','4+ Hours']
cuts2=pd.cut(flights['DIST'],labels=labels,bins=bins)
flights.groupby(cuts2)['AIRLINE'].value_counts(normalize=True).round(3)
Out[33]: 
DIST           AIRLINE
Under an Hour  OO         0.326
               EV         0.289
               MQ         0.211
               DL         0.086
               AA         0.052
               UA         0.027
               WN         0.009
1 Hour         WN         0.194
               DL         0.189
               OO         0.159
               EV         0.156
               MQ         0.100
               AA         0.071
               UA         0.062
               VX         0.028
               US         0.016
               NK         0.012
               B6         0.007
               F9         0.005
               AS         0.001
1-2 Hours      DL         0.206
               AA         0.144
               WN         0.138
               UA         0.131
               OO         0.106
               EV         0.101
               MQ         0.051
               F9         0.038
               NK         0.030
               US         0.025
               AS         0.023
               VX         0.004
               B6         0.003
2-4 Hours      AA         0.264
               UA         0.199
               DL         0.165
               WN         0.160
               OO         0.046
               NK         0.045
               US         0.040
               F9         0.031
               AS         0.016
               EV         0.016
               VX         0.012
               B6         0.003
               MQ         0.003
4+ Hours       UA         0.289
               AA         0.212
               DL         0.171
               B6         0.080
               VX         0.074
               US         0.065
               WN         0.046
               HA         0.028
               NK         0.019
               AS         0.012
               F9         0.004
Name: AIRLINE, dtype: float64
flights.groupby(cuts2)['AIRLINE'].value_counts(normalize=True).round(3).unstack()
Out[34]: 
AIRLINE           AA     AS     B6     DL  ...     UA     US     VX     WN
DIST                                       ...                            
Under an Hour  0.052    NaN    NaN  0.086  ...  0.027    NaN    NaN  0.009
1 Hour         0.071  0.001  0.007  0.189  ...  0.062  0.016  0.028  0.194
1-2 Hours      0.144  0.023  0.003  0.206  ...  0.131  0.025  0.004  0.138
2-4 Hours      0.264  0.016  0.003  0.165  ...  0.199  0.040  0.012  0.160
4+ Hours       0.212  0.012  0.080  0.171  ...  0.289  0.065  0.074  0.046
[5 rows x 14 columns]
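the NaN cells mean that airline flew no flights in that distance bin; if a true zero share is preferred, .unstack accepts a fill_value argument. a small sketch on made-up data:

```python
import pandas as pd

# hypothetical mini version: not every airline flies in every distance bin
df = pd.DataFrame({'bin': ['short', 'short', 'long'],
                   'airline': ['AA', 'OO', 'AA']})
prop = df.groupby('bin')['airline'].value_counts(normalize=True)

# plain .unstack leaves missing combinations as NaN
wide = prop.unstack()

# fill_value=0 marks them as a true zero share instead
wide0 = prop.unstack(fill_value=0)
```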

Counting the total number of flights between two cities

generic: ordinary, general

data=pd.DataFrame(index=list('abc'),data=[1,2,3])
data
Out[36]: 
   0
a  1
b  2
c  3
data.reset_index(drop=True)
Out[37]: 
   0
0  1
1  2
2  3
# .reset_index(drop=True) discards the old index, replacing it with the default integer index
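a short sketch of why the drop matters inside the row-sorting lambda coming up: without it, the sorted values keep their original labels and would realign back to the original column order when pandas reassembles the rows (hypothetical row):

```python
import pandas as pd

row = pd.Series({'ORG_AIR': 'LAX', 'DEST_AIR': 'ATL'})

# sorting reorders the values AND their labels
sorted_row = row.sort_values()
kept = list(sorted_row.index)  # labels travel with the values

# dropping the index gives positional labels 0 and 1,
# so the sorted order survives when the rows are reassembled
dropped = row.sort_values().reset_index(drop=True)
```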

flights[['ORG_AIR','DEST_AIR']]
Out[41]: 
      ORG_AIR DEST_AIR
0         LAX      SLC
1         DEN      IAD
2         DFW      VPS
3         DFW      DCA
4         LAX      MCI
       ...      ...
58487     SFO      DFW
58488     LAS      SFO
58489     SFO      SBA
58490     MSP      ATL
58491     SFO      BOI
[58492 rows x 2 columns]
# apply the .size method to the groupby object, which returns the number of rows in each group
flights_ct=flights.groupby(['ORG_AIR','DEST_AIR']).size()
flights_ct
Out[45]: 
ORG_AIR  DEST_AIR
ATL      ABE          31
         ABQ          16
         ABY          19
         ACY           6
         AEX          40
                    ... 
SFO      SNA         122
         STL          20
         SUN          10
         TUS          20
         XNA           2
Length: 1130, dtype: int64
# select the total number of flights between IAH and ATL in both directions
# one way to select rows from a MultiIndex is to pass the `.loc` operator a list of tuples of the exact level values
flights_ct.loc[[('ATL','IAH'),('IAH','ATL')]]
Out[48]: 
ORG_AIR  DEST_AIR
ATL      IAH         121
IAH      ATL         148
dtype: int64
# we could simply sum these two numbers together to find the total number of flights between the two cities
# but there is a more efficient and automated solution that works for all flights
# we would like to have just one label for all flights between Atlanta and Houston
# to get that, sort the origin and destination columns within each row alphabetically
# the .apply method on a DataFrame must be passed a function
# by default, this function is passed each column;
# we can change the direction of computation with axis='columns' (or axis=1)
# the lambda function then has each row of data passed to it implicitly as a Series,
# and it returns a Series with sorted airport codes
# call .reset_index(drop=True) so that the sorted values do not realign to their original column labels
# the .apply call iterates over all rows using the lambda function
f_part3=(flights[['ORG_AIR','DEST_AIR']]
    .apply(lambda ser:ser.sort_values().reset_index(drop=True),axis='columns' )
         )
f_part3
Out[54]: 
         0    1
0      LAX  SLC
1      DEN  IAD
2      DFW  VPS
3      DCA  DFW
4      LAX  MCI
    ...  ...
58487  DFW  SFO
58488  LAS  SFO
58489  SBA  SFO
58490  ATL  MSP
58491  BOI  SFO
[58492 rows x 2 columns]
# rename the columns to something more generic
# and again find the total number of flights between cities
rename_dirt={0:'AIR1',1:'AIR2'}
f_part3.rename(columns=rename_dirt).groupby(['AIR1','AIR2']).size()
Out[58]: 
AIR1  AIR2
ABE   ATL      31
      ORD      24
ABI   DFW      74
ABQ   ATL      16
      DEN      46
             ... 
SFO   SNA     122
      STL      20
      SUN      10
      TUS      20
      XNA       2
Length: 1085, dtype: int64
(f_part3.rename(columns=rename_dirt).groupby(['AIR1','AIR2']).size()
 .loc[('ATL','IAH')]
)
Out[59]: 269
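the row-by-row .apply above is slow on 58,492 rows; a common faster alternative (my own sketch, not from this recipe) is to sort each row with NumPy in a single vectorized call and rebuild the DataFrame. a sketch on a made-up mini table:

```python
import numpy as np
import pandas as pd

# hypothetical mini flights table
flights = pd.DataFrame({'ORG_AIR': ['ATL', 'IAH', 'ATL'],
                        'DEST_AIR': ['IAH', 'ATL', 'MSP']})

# np.sort sorts each row of the underlying array in one vectorized call,
# avoiding the per-row Python function that .apply(axis='columns') runs
arr = np.sort(flights[['ORG_AIR', 'DEST_AIR']].to_numpy(), axis=1)
f_sorted = pd.DataFrame(arr, columns=['AIR1', 'AIR2'])

# both ATL->IAH and IAH->ATL now collapse into the single ('ATL', 'IAH') label
counts = f_sorted.groupby(['AIR1', 'AIR2']).size()
```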

least performant operations: the lowest-performing (slowest) operations
consecutive: continuous, one after another without interruption
