注:本教程为系列教程此章节接前面第一弹
16 分组聚合、过滤、转换
16.1 准备数据
pd.options.display.max_columns = 8
collage_data = pd.read_csv("pandasLearnData/college.csv")
collage_data.head(5)
| INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
---|
0 | Alabama A & M University | Normal | AL | 1.0 | ... | 0.8284 | 0.1049 | 30300 | 33888 |
1 | University of Alabama at Birmingham | Birmingham | AL | 0.0 | ... | 0.5214 | 0.2422 | 39700 | 21941.5 |
2 | Amridge University | Montgomery | AL | 0.0 | ... | 0.7795 | 0.8540 | 40100 | 23370 |
3 | University of Alabama in Huntsville | Huntsville | AL | 0.0 | ... | 0.4596 | 0.2640 | 45500 | 24097 |
4 | Alabama State University | Montgomery | AL | 1.0 | ... | 0.7554 | 0.1270 | 26600 | 33118.5 |
5 rows × 27 columns
16.2 定义聚合
gb_STABBR = collage_data.groupby("STABBR")
16.3 聚合的属性原理
16.3.1 聚合类别
type(gb_STABBR)
pandas.core.groupby.generic.DataFrameGroupBy
16.3.2 聚合类型的所用方法和属性
for attr in dir(gb_STABBR):
print("" if attr.startswith("_") else attr+"\t",end="")
CITY CURROPER DISTANCEONLY GRAD_DEBT_MDN_SUPP HBCU INSTNM MD_EARN_WNE_P10 MENONLY PCTFLOAN PCTPELL PPTUG_EF RELAFFIL SATMTMID SATVRMID STABBR UG25ABV UGDS UGDS_2MOR UGDS_AIAN UGDS_ASIAN UGDS_BLACK UGDS_HISP UGDS_NHPI UGDS_NRA UGDS_UNKN UGDS_WHITE WOMENONLY agg aggregate all any apply backfill bfill boxplot corr corrwith count cov cumcount cummax cummin cumprod cumsum describe diff dtypes expanding ffill fillna filter first get_group groups head hist idxmax idxmin indices last mad max mean median min ndim ngroup ngroups nth nunique ohlc pad pct_change pipe plot prod quantile rank resample rolling sem shift size skew std sum tail take transform tshift var
16.3.3 查看分组数量
gb_STABBR.ngroups
59
16.3.4 返回分组的键值
gb_STABBR.groups.keys()
dict_keys(['AK', 'AL', 'AR', 'AS', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'FM', 'GA', 'GU', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO', 'MP', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR', 'PW', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VI', 'VT', 'WA', 'WI', 'WV', 'WY'])
16.3.5 根据键获取某一个分组
gb_STABBR.get_group("AK")
| INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
---|
60 | University of Alaska Anchorage | Anchorage | AK | 0.0 | ... | 0.2647 | 0.4386 | 42500 | 19449.5 |
61 | Alaska Bible College | Palmer | AK | 0.0 | ... | 0.2857 | 0.4286 | NaN | PrivacySuppressed |
62 | University of Alaska Fairbanks | Fairbanks | AK | 0.0 | ... | 0.2550 | 0.4519 | 36200 | 19355 |
63 | University of Alaska Southeast | Juneau | AK | 0.0 | ... | 0.1996 | 0.5550 | 37400 | 16875 |
64 | Alaska Pacific University | Anchorage | AK | 0.0 | ... | 0.5297 | 0.4910 | 47000 | 23250 |
65 | AVTEC-Alaska's Institute of Technology | Seward | AK | 0.0 | ... | 0.0664 | 0.7127 | 33500 | PrivacySuppressed |
66 | Charter College-Anchorage | Anchorage | AK | 0.0 | ... | 0.7503 | 0.5472 | 39200 | 13875 |
67 | Alaska Career College | Anchorage | AK | 0.0 | ... | 0.7860 | 0.5612 | 28700 | 8994 |
5171 | Ilisagvik College | Barrow | AK | 0.0 | ... | 0.0000 | 0.6498 | 24900 | PrivacySuppressed |
5417 | Alaska Christian College | Soldotna | AK | 0.0 | ... | 0.6792 | 0.2264 | NaN | PrivacySuppressed |
10 rows × 27 columns
16.3.6 DataFrameGroupBy对象是一个可迭代对象
from collections import Iterable
isinstance(gb_STABBR,Iterable)
True
i = 0
for name,group in gb_STABBR:
i += 1
print("组名:",name,"\t\t类型:",type(group))
display(group.head(2))
if i >= 3:
break
组名: AK 类型: <class 'pandas.core.frame.DataFrame'>
| INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
---|
60 | University of Alaska Anchorage | Anchorage | AK | 0.0 | ... | 0.2647 | 0.4386 | 42500 | 19449.5 |
61 | Alaska Bible College | Palmer | AK | 0.0 | ... | 0.2857 | 0.4286 | NaN | PrivacySuppressed |
2 rows × 27 columns
组名: AL 类型: <class 'pandas.core.frame.DataFrame'>
| INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
---|
0 | Alabama A & M University | Normal | AL | 1.0 | ... | 0.8284 | 0.1049 | 30300 | 33888 |
1 | University of Alabama at Birmingham | Birmingham | AL | 0.0 | ... | 0.5214 | 0.2422 | 39700 | 21941.5 |
2 rows × 27 columns
组名: AR 类型: <class 'pandas.core.frame.DataFrame'>
| INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
---|
128 | University of Arkansas at Little Rock | Little Rock | AR | 0.0 | ... | 0.4775 | 0.4062 | 33900 | 21736 |
129 | University of Arkansas for Medical Sciences | Little Rock | AR | 0.0 | ... | 0.6144 | 0.5133 | 61400 | 12500 |
2 rows × 27 columns
16.3.7 head显示每个分组的头几行
gb_STABBR.head(2).head(5)
| INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
---|
0 | Alabama A & M University | Normal | AL | 1.0 | ... | 0.8284 | 0.1049 | 30300 | 33888 |
1 | University of Alabama at Birmingham | Birmingham | AL | 0.0 | ... | 0.5214 | 0.2422 | 39700 | 21941.5 |
43 | Prince Institute-Southeast | Elmhurst | IL | 0.0 | ... | 0.9375 | 0.6569 | PrivacySuppressed | 20992 |
60 | University of Alaska Anchorage | Anchorage | AK | 0.0 | ... | 0.2647 | 0.4386 | 42500 | 19449.5 |
61 | Alaska Bible College | Palmer | AK | 0.0 | ... | 0.2857 | 0.4286 | NaN | PrivacySuppressed |
5 rows × 27 columns
16.3.8 nth指定显示分组内的相应行
gb_STABBR.nth([1,-1]).head(5)
| INSTNM | CITY | HBCU | MENONLY | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
---|
STABBR | | | | | | | | | |
---|
AK | Alaska Bible College | Palmer | 0.0 | 0.0 | ... | 0.2857 | 0.4286 | NaN | PrivacySuppressed |
AK | Alaska Christian College | Soldotna | 0.0 | 0.0 | ... | 0.6792 | 0.2264 | NaN | PrivacySuppressed |
AL | University of Alabama at Birmingham | Birmingham | 0.0 | 0.0 | ... | 0.5214 | 0.2422 | 39700 | 21941.5 |
AL | Strayer University-Huntsville Campus | Huntsville | NaN | NaN | ... | NaN | NaN | 49200 | 36173.5 |
AR | Career Academy of Hair Design-Fayetteville | Fayetteville | NaN | NaN | ... | NaN | NaN | NaN | 6365 |
5 rows × 26 columns
16.4 对分组对象使用聚合函数
16.4.1 agg函数传入聚合函数名称
gb_STABBR["UGDS"].agg("sum").head(5)
STABBR
AK 24932.0
AL 248298.0
AR 134820.0
AS 1276.0
AZ 520439.0
Name: UGDS, dtype: float64
16.4.2 直接使用统计函数
gb_STABBR["UGDS"].sum().head(5)
STABBR
AK 24932.0
AL 248298.0
AR 134820.0
AS 1276.0
AZ 520439.0
Name: UGDS, dtype: float64
16.4.3 agg传入多种聚合函数进行统计
gb_STABBR["UGDS"].agg(["mean",np.sum,"std"]).head(5)
| mean | sum | std |
---|
STABBR | | | |
---|
AK | 2493.200000 | 24932.0 | 4051.726650 |
AL | 2789.865169 | 248298.0 | 4657.877043 |
AR | 1644.146341 | 134820.0 | 3142.774213 |
AS | 1276.000000 | 1276.0 | NaN |
AZ | 4130.468254 | 520439.0 | 14893.640651 |
16.5 对多个字段进行分组
16.5.1 分组
collage_data.groupby(["STABBR","CITY"]).head(2).head(5)
| INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
---|
0 | Alabama A & M University | Normal | AL | 1.0 | ... | 0.8284 | 0.1049 | 30300 | 33888 |
1 | University of Alabama at Birmingham | Birmingham | AL | 0.0 | ... | 0.5214 | 0.2422 | 39700 | 21941.5 |
2 | Amridge University | Montgomery | AL | 0.0 | ... | 0.7795 | 0.8540 | 40100 | 23370 |
3 | University of Alabama in Huntsville | Huntsville | AL | 0.0 | ... | 0.4596 | 0.2640 | 45500 | 24097 |
4 | Alabama State University | Montgomery | AL | 1.0 | ... | 0.7554 | 0.1270 | 26600 | 33118.5 |
5 rows × 27 columns
16.5.2 对多个字段分组并使用多种聚合方法
grouped = collage_data.groupby(["STABBR","CITY"]).agg({"INSTNM":"size","UGDS":["mean","std","sum"]})
grouped.head(5)
| | INSTNM | UGDS |
---|
| | size | mean | std | sum |
---|
STABBR | CITY | | | | |
---|
AK | Anchorage | 4 | 4218.75 | 5922.370605 | 16875.0 |
Barrow | 1 | 109.00 | NaN | 109.0 |
Fairbanks | 1 | 5536.00 | NaN | 5536.0 |
Juneau | 1 | 1428.00 | NaN | 1428.0 |
Palmer | 1 | 27.00 | NaN | 27.0 |
16.6 消除多级索引
16.6.1 查看索引结构
grouped.columns
MultiIndex([('INSTNM', 'size'),
( 'UGDS', 'mean'),
( 'UGDS', 'std'),
( 'UGDS', 'sum')],
)
16.6.2 获取一级索引
columns_level1 = grouped.columns.get_level_values(0)
columns_level1
Index(['INSTNM', 'UGDS', 'UGDS', 'UGDS'], dtype='object')
16.6.3 获取二级索引
columns_level2 = grouped.columns.get_level_values(1)
columns_level2
Index(['size', 'mean', 'std', 'sum'], dtype='object')
16.6.4 合并索引
grouped.columns = columns_level1 + "_" + columns_level2
grouped.head(5)
| | INSTNM_size | UGDS_mean | UGDS_std | UGDS_sum |
---|
STABBR | CITY | | | | |
---|
AK | Anchorage | 4 | 4218.75 | 5922.370605 | 16875.0 |
Barrow | 1 | 109.00 | NaN | 109.0 |
Fairbanks | 1 | 5536.00 | NaN | 5536.0 |
Juneau | 1 | 1428.00 | NaN | 1428.0 |
Palmer | 1 | 27.00 | NaN | 27.0 |
16.6.5 行索引可直接利用reset_index清除
grouped.reset_index().head(5)
| STABBR | CITY | INSTNM_size | UGDS_mean | UGDS_std | UGDS_sum |
---|
0 | AK | Anchorage | 4 | 4218.75 | 5922.370605 | 16875.0 |
1 | AK | Barrow | 1 | 109.00 | NaN | 109.0 |
2 | AK | Fairbanks | 1 | 5536.00 | NaN | 5536.0 |
3 | AK | Juneau | 1 | 1428.00 | NaN | 1428.0 |
4 | AK | Palmer | 1 | 27.00 | NaN | 27.0 |
16.6.6 在使用groupby方法时,传入参数不增加索引
collage_data.groupby(["STABBR","CITY"],as_index=False,sort=False)\
.agg({"INSTNM":"size","UGDS":["mean","std","sum"]})\
.head(5)
| STABBR | CITY | INSTNM | UGDS |
---|
| | | size | mean | std | sum |
---|
0 | AL | Normal | 1 | 4206.000000 | NaN | 4206.0 |
1 | AL | Birmingham | 14 | 2236.428571 | 3314.611283 | 31310.0 |
2 | AL | Montgomery | 11 | 1562.300000 | 1717.282482 | 15623.0 |
3 | AL | Huntsville | 7 | 1511.500000 | 2042.258627 | 9069.0 |
4 | AL | Tuscaloosa | 3 | 11887.333333 | 15666.545258 | 35662.0 |
16.7 自定义聚合函数
16.7.1 定义聚合函数
def my_max(s):
max_value = 0
for i in s:
max_value = max_value if max_value > i and i != np.NAN else i
return s.max()
16.7.2 使用自定义聚合函数
collage_data.groupby("STABBR")[["UGDS"]].agg(my_max).head(5)
| UGDS |
---|
STABBR | |
---|
AK | 12865.0 |
AL | 29851.0 |
AR | 21405.0 |
AS | 1276.0 |
AZ | 151558.0 |
16.7.3 和自带的聚合函数混合使用
collage_data.groupby("STABBR")["UGDS"].agg(["mean","sum",my_max]).head(5)
| mean | sum | my_max |
---|
STABBR | | | |
---|
AK | 2493.200000 | 24932.0 | 12865.0 |
AL | 2789.865169 | 248298.0 | 29851.0 |
AR | 1644.146341 | 134820.0 | 21405.0 |
AS | 1276.000000 | 1276.0 | 1276.0 |
AZ | 4130.468254 | 520439.0 | 151558.0 |
16.7.4 修改列名
my_max.__name__ = "max"
collage_data.groupby("STABBR")["UGDS"].agg(["mean","sum",my_max]).head(5)
| mean | sum | max |
---|
STABBR | | | |
---|
AK | 2493.200000 | 24932.0 | 12865.0 |
AL | 2789.865169 | 248298.0 | 29851.0 |
AR | 1644.146341 | 134820.0 | 21405.0 |
AS | 1276.000000 | 1276.0 | 1276.0 |
AZ | 4130.468254 | 520439.0 | 151558.0 |
collage_data.groupby("STABBR")["UGDS"].agg(["mean","sum",my_max]).rename({"my_max":"max"},inplace=False).head(5)
| mean | sum | max |
---|
STABBR | | | |
---|
AK | 2493.200000 | 24932.0 | 12865.0 |
AL | 2789.865169 | 248298.0 | 29851.0 |
AR | 1644.146341 | 134820.0 | 21405.0 |
AS | 1276.000000 | 1276.0 | 1276.0 |
AZ | 4130.468254 | 520439.0 | 151558.0 |
16.8 用 args 和 *kwargs 自定义聚合函数
16.8.1 定义聚合函数
def my_between_cnt(s,low,high):
return s.between(low,high).sum()
16.8.2 使用自定义聚合函数
collage_data.groupby("STABBR")["UGDS"].agg(my_between_cnt,10000,20000).head(5)
STABBR
AK 1.0
AL 4.0
AR 0.0
AS 0.0
AZ 4.0
Name: UGDS, dtype: float64
16.8.3 使用闭包简化代码
def init_agg_func(func,col_name,*args,**kwargs):
def wrapper(s):
return func(s,*args,**kwargs)
wrapper.__name__ = col_name
return wrapper
my_max = init_agg_func(np.max,"max")
between_10_20k = init_agg_func(my_between_cnt,"10k~20k",10000,20000)
between_20_30k = init_agg_func(my_between_cnt,"20k~30k",20000,30000)
collage_data.groupby("STABBR")["UGDS"].agg([my_max,between_10_20k,between_20_30k]).head(5)
| max | 10k~20k | 20k~30k |
---|
STABBR | | | |
---|
AK | 12865.0 | 1.0 | 0.0 |
AL | 29851.0 | 4.0 | 2.0 |
AR | 21405.0 | 0.0 | 1.0 |
AS | 1276.0 | 0.0 | 0.0 |
AZ | 151558.0 | 4.0 | 2.0 |
16.9 使用filter对分组进行过滤
collage_data.groupby("STABBR").filter(lambda x:x["UGDS"].sum() > 1000000)
| INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
---|
192 | Academy of Art University | San Francisco | CA | 0.0 | ... | 0.5524 | 0.4043 | 36000 | 35093 |
193 | ITT Technical Institute-Rancho Cordova | Rancho Cordova | CA | 0.0 | ... | 0.7667 | 0.7235 | 38800 | 25827.5 |
194 | Academy of Chinese Culture and Health Sciences | Oakland | CA | 0.0 | ... | NaN | NaN | NaN | PrivacySuppressed |
195 | The Academy of Radio and TV Broadcasting | Huntington Beach | CA | 0.0 | ... | 1.0000 | 0.4545 | 28400 | 9500 |
196 | Avalon School of Cosmetology-Alameda | Alameda | CA | 0.0 | ... | 0.6768 | 0.3387 | 21600 | 9860 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7528 | WestMed College - Merced | Merced | CA | NaN | ... | NaN | NaN | NaN | 15623.5 |
7529 | Vantage College | El Paso | TX | NaN | ... | NaN | NaN | NaN | 9500 |
7530 | SAE Institute of Technology San Francisco | Emeryville | CA | NaN | ... | NaN | NaN | NaN | 9500 |
7533 | Bay Area Medical Academy - San Jose Satellite ... | San Jose | CA | NaN | ... | NaN | NaN | NaN | PrivacySuppressed |
7534 | Excel Learning Center-San Antonio South | San Antonio | TX | NaN | ... | NaN | NaN | NaN | 12125 |
1245 rows × 27 columns
| INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
---|
0 | Alabama A & M University | Normal | AL | 1.0 | ... | 0.8284 | 0.1049 | 30300 | 33888 |
1 | University of Alabama at Birmingham | Birmingham | AL | 0.0 | ... | 0.5214 | 0.2422 | 39700 | 21941.5 |
2 | Amridge University | Montgomery | AL | 0.0 | ... | 0.7795 | 0.8540 | 40100 | 23370 |
3 | University of Alabama in Huntsville | Huntsville | AL | 0.0 | ... | 0.4596 | 0.2640 | 45500 | 24097 |
4 | Alabama State University | Montgomery | AL | 1.0 | ... | 0.7554 | 0.1270 | 26600 | 33118.5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7530 | SAE Institute of Technology San Francisco | Emeryville | CA | NaN | ... | NaN | NaN | NaN | 9500 |
7531 | Rasmussen College - Overland Park | Overland Park | KS | NaN | ... | NaN | NaN | NaN | 21163 |
7532 | National Personal Training Institute of Cleveland | Highland Heights | OH | NaN | ... | NaN | NaN | NaN | 6333 |
7533 | Bay Area Medical Academy - San Jose Satellite ... | San Jose | CA | NaN | ... | NaN | NaN | NaN | PrivacySuppressed |
7534 | Excel Learning Center-San Antonio South | San Antonio | TX | NaN | ... | NaN | NaN | NaN | 12125 |
7535 rows × 27 columns
16.10 分组对象的 apply方法的使用
collage_data.groupby("STABBR").apply(lambda df:df.mean()).head(5)
| HBCU | MENONLY | WOMENONLY | RELAFFIL | ... | CURROPER | PCTPELL | PCTFLOAN | UG25ABV |
---|
STABBR | | | | | | | | | |
---|
AK | 0.000000 | 0.0 | 0.000000 | 0.300000 | ... | 1.000000 | 0.394530 | 0.381660 | 0.506240 |
AL | 0.166667 | 0.0 | 0.011111 | 0.250000 | ... | 0.937500 | 0.603621 | 0.509734 | 0.387039 |
AR | 0.048780 | 0.0 | 0.000000 | 0.209302 | ... | 0.965116 | 0.581470 | 0.505556 | 0.356059 |
AS | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 1.000000 | 0.724500 | 0.000000 | 0.177400 |
AZ | 0.000000 | 0.0 | 0.000000 | 0.067669 | ... | 0.879699 | 0.549792 | 0.543702 | 0.480859 |
5 rows × 22 columns
16.11 用连续变量分组
16.11.1 cut概念
my_cut = pd.cut(collage_data["UGDS"],bins=[-np.Inf,10000,20000,30000,40000,50000,np.Inf])
my_cut
0 (-inf, 10000.0]
1 (10000.0, 20000.0]
2 (-inf, 10000.0]
3 (-inf, 10000.0]
4 (-inf, 10000.0]
...
7530 NaN
7531 NaN
7532 NaN
7533 NaN
7534 NaN
Name: UGDS, Length: 7535, dtype: category
Categories (6, interval[float64]): [(-inf, 10000.0] < (10000.0, 20000.0] < (20000.0, 30000.0] < (30000.0, 40000.0] < (40000.0, 50000.0] < (50000.0, inf]]
16.11.2 使用cut进行分组
collage_data.groupby(my_cut,as_index=True)[["UGDS","INSTNM"]].max()
| UGDS | INSTNM |
---|
UGDS | | |
---|
(-inf, 10000.0] | 9999.0 | eClips School of Cosmetology and Barbering |
(10000.0, 20000.0] | 19979.0 | Youngstown State University |
(20000.0, 30000.0] | 29977.0 | West Virginia University |
(30000.0, 40000.0] | 39958.0 | Valencia College |
(40000.0, 50000.0] | 49340.0 | Western Governors University |
(50000.0, inf] | 151558.0 | University of Phoenix-Arizona |