pandas学习笔记(第五弹)

注:本教程为系列教程此章节接前面第一弹

16 分组聚合、过滤、转换

16.1 准备数据

# 设置最多显示8列数据
pd.options.display.max_columns = 8

collage_data = pd.read_csv("pandasLearnData/college.csv")
collage_data.head(5)
INSTNMCITYSTABBRHBCU...PCTFLOANUG25ABVMD_EARN_WNE_P10GRAD_DEBT_MDN_SUPP
0Alabama A & M UniversityNormalAL1.0...0.82840.10493030033888
1University of Alabama at BirminghamBirminghamAL0.0...0.52140.24223970021941.5
2Amridge UniversityMontgomeryAL0.0...0.77950.85404010023370
3University of Alabama in HuntsvilleHuntsvilleAL0.0...0.45960.26404550024097
4Alabama State UniversityMontgomeryAL1.0...0.75540.12702660033118.5

5 rows × 27 columns

16.2 定义聚合

gb_STABBR = collage_data.groupby("STABBR")

16.3 聚合的属性原理

16.3.1 聚合类别

type(gb_STABBR)
pandas.core.groupby.generic.DataFrameGroupBy

16.3.2 聚合类型的所用方法和属性

for attr in dir(gb_STABBR):
    print("" if attr.startswith("_") else attr+"\t",end="")
CITY	CURROPER	DISTANCEONLY	GRAD_DEBT_MDN_SUPP	HBCU	INSTNM	MD_EARN_WNE_P10	MENONLY	PCTFLOAN	PCTPELL	PPTUG_EF	RELAFFIL	SATMTMID	SATVRMID	STABBR	UG25ABV	UGDS	UGDS_2MOR	UGDS_AIAN	UGDS_ASIAN	UGDS_BLACK	UGDS_HISP	UGDS_NHPI	UGDS_NRA	UGDS_UNKN	UGDS_WHITE	WOMENONLY	agg	aggregate	all	any	apply	backfill	bfill	boxplot	corr	corrwith	count	cov	cumcount	cummax	cummin	cumprod	cumsum	describe	diff	dtypes	expanding	ffill	fillna	filter	first	get_group	groups	head	hist	idxmax	idxmin	indices	last	mad	max	mean	median	min	ndim	ngroup	ngroups	nth	nunique	ohlc	pad	pct_change	pipe	plot	prod	quantile	rank	resample	rolling	sem	shift	size	skew	std	sum	tail	take	transform	tshift	var	

16.3.3 查看分组数量

gb_STABBR.ngroups
59

16.3.4 返回分组的键值

# groups是返回一个字典 {"分组键":索引列表, ...}
gb_STABBR.groups.keys()
dict_keys(['AK', 'AL', 'AR', 'AS', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'FM', 'GA', 'GU', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO', 'MP', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR', 'PW', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VI', 'VT', 'WA', 'WI', 'WV', 'WY'])

16.3.5 根据键获取某一个分组

gb_STABBR.get_group("AK")
INSTNMCITYSTABBRHBCU...PCTFLOANUG25ABVMD_EARN_WNE_P10GRAD_DEBT_MDN_SUPP
60University of Alaska AnchorageAnchorageAK0.0...0.26470.43864250019449.5
61Alaska Bible CollegePalmerAK0.0...0.28570.4286NaNPrivacySuppressed
62University of Alaska FairbanksFairbanksAK0.0...0.25500.45193620019355
63University of Alaska SoutheastJuneauAK0.0...0.19960.55503740016875
64Alaska Pacific UniversityAnchorageAK0.0...0.52970.49104700023250
65AVTEC-Alaska's Institute of TechnologySewardAK0.0...0.06640.712733500PrivacySuppressed
66Charter College-AnchorageAnchorageAK0.0...0.75030.54723920013875
67Alaska Career CollegeAnchorageAK0.0...0.78600.5612287008994
5171Ilisagvik CollegeBarrowAK0.0...0.00000.649824900PrivacySuppressed
5417Alaska Christian CollegeSoldotnaAK0.0...0.67920.2264NaNPrivacySuppressed

10 rows × 27 columns

16.3.6 DataFrameGroupBy对象是一个可迭代对象

from collections import Iterable
isinstance(gb_STABBR,Iterable)
True
# 显示前3个分组
i  = 0
for name,group in gb_STABBR:
    i += 1
    print("组名:",name,"\t\t类型:",type(group))
    display(group.head(2))
    if i >= 3:
        break
组名: AK 		类型: <class 'pandas.core.frame.DataFrame'>
INSTNMCITYSTABBRHBCU...PCTFLOANUG25ABVMD_EARN_WNE_P10GRAD_DEBT_MDN_SUPP
60University of Alaska AnchorageAnchorageAK0.0...0.26470.43864250019449.5
61Alaska Bible CollegePalmerAK0.0...0.28570.4286NaNPrivacySuppressed

2 rows × 27 columns

组名: AL 		类型: <class 'pandas.core.frame.DataFrame'>
INSTNMCITYSTABBRHBCU...PCTFLOANUG25ABVMD_EARN_WNE_P10GRAD_DEBT_MDN_SUPP
0Alabama A & M UniversityNormalAL1.0...0.82840.10493030033888
1University of Alabama at BirminghamBirminghamAL0.0...0.52140.24223970021941.5

2 rows × 27 columns

组名: AR 		类型: <class 'pandas.core.frame.DataFrame'>
INSTNMCITYSTABBRHBCU...PCTFLOANUG25ABVMD_EARN_WNE_P10GRAD_DEBT_MDN_SUPP
128University of Arkansas at Little RockLittle RockAR0.0...0.47750.40623390021736
129University of Arkansas for Medical SciencesLittle RockAR0.0...0.61440.51336140012500

2 rows × 27 columns

16.3.7 head显示每个分组的头几行

# 前面一个head用于空值每个分组内显示前几行,后面表示总共显示前几行
gb_STABBR.head(2).head(5)
INSTNMCITYSTABBRHBCU...PCTFLOANUG25ABVMD_EARN_WNE_P10GRAD_DEBT_MDN_SUPP
0Alabama A & M UniversityNormalAL1.0...0.82840.10493030033888
1University of Alabama at BirminghamBirminghamAL0.0...0.52140.24223970021941.5
43Prince Institute-SoutheastElmhurstIL0.0...0.93750.6569PrivacySuppressed20992
60University of Alaska AnchorageAnchorageAK0.0...0.26470.43864250019449.5
61Alaska Bible CollegePalmerAK0.0...0.28570.4286NaNPrivacySuppressed

5 rows × 27 columns

16.3.8 nth指定显示分组内的相应行

# 显示每个分组的第一行,最后一行
gb_STABBR.nth([1,-1]).head(5)
INSTNMCITYHBCUMENONLY...PCTFLOANUG25ABVMD_EARN_WNE_P10GRAD_DEBT_MDN_SUPP
STABBR
AKAlaska Bible CollegePalmer0.00.0...0.28570.4286NaNPrivacySuppressed
AKAlaska Christian CollegeSoldotna0.00.0...0.67920.2264NaNPrivacySuppressed
ALUniversity of Alabama at BirminghamBirmingham0.00.0...0.52140.24223970021941.5
ALStrayer University-Huntsville CampusHuntsvilleNaNNaN...NaNNaN4920036173.5
ARCareer Academy of Hair Design-FayettevilleFayettevilleNaNNaN...NaNNaNNaN6365

5 rows × 26 columns

16.4 对分组对象使用聚合函数

16.4.1 agg函数传入聚合函数名称

# 统计每个每个州的本科生人数
gb_STABBR["UGDS"].agg("sum").head(5)
STABBR
AK     24932.0
AL    248298.0
AR    134820.0
AS      1276.0
AZ    520439.0
Name: UGDS, dtype: float64

16.4.2 直接使用统计函数

gb_STABBR["UGDS"].sum().head(5)
STABBR
AK     24932.0
AL    248298.0
AR    134820.0
AS      1276.0
AZ    520439.0
Name: UGDS, dtype: float64

16.4.3 agg传入多种聚合函数进行统计

gb_STABBR["UGDS"].agg(["mean",np.sum,"std"]).head(5)
meansumstd
STABBR
AK2493.20000024932.04051.726650
AL2789.865169248298.04657.877043
AR1644.146341134820.03142.774213
AS1276.0000001276.0NaN
AZ4130.468254520439.014893.640651

16.5 对多个字段进行分组

16.5.1 分组

#对州和城市进行分组
collage_data.groupby(["STABBR","CITY"]).head(2).head(5)
INSTNMCITYSTABBRHBCU...PCTFLOANUG25ABVMD_EARN_WNE_P10GRAD_DEBT_MDN_SUPP
0Alabama A & M UniversityNormalAL1.0...0.82840.10493030033888
1University of Alabama at BirminghamBirminghamAL0.0...0.52140.24223970021941.5
2Amridge UniversityMontgomeryAL0.0...0.77950.85404010023370
3University of Alabama in HuntsvilleHuntsvilleAL0.0...0.45960.26404550024097
4Alabama State UniversityMontgomeryAL1.0...0.75540.12702660033118.5

5 rows × 27 columns

16.5.2 对多个字段分组并使用多种聚合方法

grouped = collage_data.groupby(["STABBR","CITY"]).agg({"INSTNM":"size","UGDS":["mean","std","sum"]})
grouped.head(5)
INSTNMUGDS
sizemeanstdsum
STABBRCITY
AKAnchorage44218.755922.37060516875.0
Barrow1109.00NaN109.0
Fairbanks15536.00NaN5536.0
Juneau11428.00NaN1428.0
Palmer127.00NaN27.0

16.6 消除多级索引

16.6.1 查看索引结构

# 可以看到列索引具有两级
grouped.columns
MultiIndex([('INSTNM', 'size'),
            (  'UGDS', 'mean'),
            (  'UGDS',  'std'),
            (  'UGDS',  'sum')],
           )

16.6.2 获取一级索引

columns_level1 = grouped.columns.get_level_values(0)
columns_level1
Index(['INSTNM', 'UGDS', 'UGDS', 'UGDS'], dtype='object')

16.6.3 获取二级索引

columns_level2 = grouped.columns.get_level_values(1)
columns_level2
Index(['size', 'mean', 'std', 'sum'], dtype='object')

16.6.4 合并索引

grouped.columns = columns_level1 + "_" + columns_level2
grouped.head(5)
INSTNM_sizeUGDS_meanUGDS_stdUGDS_sum
STABBRCITY
AKAnchorage44218.755922.37060516875.0
Barrow1109.00NaN109.0
Fairbanks15536.00NaN5536.0
Juneau11428.00NaN1428.0
Palmer127.00NaN27.0

16.6.5 行索引可直接利用reset_index清除

grouped.reset_index().head(5)
STABBRCITYINSTNM_sizeUGDS_meanUGDS_stdUGDS_sum
0AKAnchorage44218.755922.37060516875.0
1AKBarrow1109.00NaN109.0
2AKFairbanks15536.00NaN5536.0
3AKJuneau11428.00NaN1428.0
4AKPalmer127.00NaN27.0

16.6.6 在使用groupby方法时,传入参数不增加索引

# as_index:表示是否改变索引,sort:表示是否根据分组字段进行排序
collage_data.groupby(["STABBR","CITY"],as_index=False,sort=False)\
.agg({"INSTNM":"size","UGDS":["mean","std","sum"]})\
.head(5)
STABBRCITYINSTNMUGDS
sizemeanstdsum
0ALNormal14206.000000NaN4206.0
1ALBirmingham142236.4285713314.61128331310.0
2ALMontgomery111562.3000001717.28248215623.0
3ALHuntsville71511.5000002042.2586279069.0
4ALTuscaloosa311887.33333315666.54525835662.0

16.7 自定义聚合函数

16.7.1 定义聚合函数

def my_max(s):
    max_value = 0
    for i in  s:
        max_value = max_value if max_value > i and i != np.NAN else i
    return s.max()

16.7.2 使用自定义聚合函数

collage_data.groupby("STABBR")[["UGDS"]].agg(my_max).head(5)
UGDS
STABBR
AK12865.0
AL29851.0
AR21405.0
AS1276.0
AZ151558.0

16.7.3 和自带的聚合函数混合使用

collage_data.groupby("STABBR")["UGDS"].agg(["mean","sum",my_max]).head(5)
meansummy_max
STABBR
AK2493.20000024932.012865.0
AL2789.865169248298.029851.0
AR1644.146341134820.021405.0
AS1276.0000001276.01276.0
AZ4130.468254520439.0151558.0

16.7.4 修改列名

# 默认就是显示该聚合函数的名称(__name__属性)
# 方法1: 修改__name__属性即可
my_max.__name__ = "max"
collage_data.groupby("STABBR")["UGDS"].agg(["mean","sum",my_max]).head(5)
meansummax
STABBR
AK2493.20000024932.012865.0
AL2789.865169248298.029851.0
AR1644.146341134820.021405.0
AS1276.0000001276.01276.0
AZ4130.468254520439.0151558.0
# 方法2: 使用DataFrame的rename方法修改列名称
# inplace 参数表示是否在操作的DataFrame中进行修改
collage_data.groupby("STABBR")["UGDS"].agg(["mean","sum",my_max]).rename({"my_max":"max"},inplace=False).head(5)
meansummax
STABBR
AK2493.20000024932.012865.0
AL2789.865169248298.029851.0
AR1644.146341134820.021405.0
AS1276.0000001276.01276.0
AZ4130.468254520439.0151558.0

16.8 用 args 和 *kwargs 自定义聚合函数

16.8.1 定义聚合函数

def  my_between_cnt(s,low,high):
    return s.between(low,high).sum()

16.8.2 使用自定义聚合函数

collage_data.groupby("STABBR")["UGDS"].agg(my_between_cnt,10000,20000).head(5)
STABBR
AK    1.0
AL    4.0
AR    0.0
AS    0.0
AZ    4.0
Name: UGDS, dtype: float64

16.8.3 使用闭包简化代码

def init_agg_func(func,col_name,*args,**kwargs):
    def wrapper(s):
        return func(s,*args,**kwargs)
    wrapper.__name__ = col_name
    return wrapper

my_max = init_agg_func(np.max,"max")
between_10_20k = init_agg_func(my_between_cnt,"10k~20k",10000,20000)
between_20_30k = init_agg_func(my_between_cnt,"20k~30k",20000,30000)

# 统计每个州本科生人数的最大值,和在10k-20k,20k-30k之间的人数
collage_data.groupby("STABBR")["UGDS"].agg([my_max,between_10_20k,between_20_30k]).head(5)
max10k~20k20k~30k
STABBR
AK12865.01.00.0
AL29851.04.02.0
AR21405.00.01.0
AS1276.00.00.0
AZ151558.04.02.0

16.9 使用filter对分组进行过滤

# 对DataFrameGroupBy对象使用过滤时,会根据回调函数返回的布尔值决定是否保留该分组
# 显示每个州本科生人数和大于100万的州
collage_data.groupby("STABBR").filter(lambda x:x["UGDS"].sum() > 1000000)
INSTNMCITYSTABBRHBCU...PCTFLOANUG25ABVMD_EARN_WNE_P10GRAD_DEBT_MDN_SUPP
192Academy of Art UniversitySan FranciscoCA0.0...0.55240.40433600035093
193ITT Technical Institute-Rancho CordovaRancho CordovaCA0.0...0.76670.72353880025827.5
194Academy of Chinese Culture and Health SciencesOaklandCA0.0...NaNNaNNaNPrivacySuppressed
195The Academy of Radio and TV BroadcastingHuntington BeachCA0.0...1.00000.4545284009500
196Avalon School of Cosmetology-AlamedaAlamedaCA0.0...0.67680.3387216009860
..............................
7528WestMed College - MercedMercedCANaN...NaNNaNNaN15623.5
7529Vantage CollegeEl PasoTXNaN...NaNNaNNaN9500
7530SAE Institute of Technology San FranciscoEmeryvilleCANaN...NaNNaNNaN9500
7533Bay Area Medical Academy - San Jose Satellite ...San JoseCANaN...NaNNaNNaNPrivacySuppressed
7534Excel Learning Center-San Antonio SouthSan AntonioTXNaN...NaNNaNNaN12125

1245 rows × 27 columns


INSTNMCITYSTABBRHBCU...PCTFLOANUG25ABVMD_EARN_WNE_P10GRAD_DEBT_MDN_SUPP
0Alabama A & M UniversityNormalAL1.0...0.82840.10493030033888
1University of Alabama at BirminghamBirminghamAL0.0...0.52140.24223970021941.5
2Amridge UniversityMontgomeryAL0.0...0.77950.85404010023370
3University of Alabama in HuntsvilleHuntsvilleAL0.0...0.45960.26404550024097
4Alabama State UniversityMontgomeryAL1.0...0.75540.12702660033118.5
..............................
7530SAE Institute of Technology San FranciscoEmeryvilleCANaN...NaNNaNNaN9500
7531Rasmussen College - Overland ParkOverland ParkKSNaN...NaNNaNNaN21163
7532National Personal Training Institute of ClevelandHighland HeightsOHNaN...NaNNaNNaN6333
7533Bay Area Medical Academy - San Jose Satellite ...San JoseCANaN...NaNNaNNaNPrivacySuppressed
7534Excel Learning Center-San Antonio SouthSan AntonioTXNaN...NaNNaNNaN12125

7535 rows × 27 columns

16.10 分组对象的 apply方法的使用

# 注意这里apply和DataFrame中的apply方法不一样,这里传入回调函数的是一个分组的DataFrame
# 你需要返回的是一个Series
# 求每个分组各个字段的平均值
collage_data.groupby("STABBR").apply(lambda df:df.mean()).head(5)
HBCUMENONLYWOMENONLYRELAFFIL...CURROPERPCTPELLPCTFLOANUG25ABV
STABBR
AK0.0000000.00.0000000.300000...1.0000000.3945300.3816600.506240
AL0.1666670.00.0111110.250000...0.9375000.6036210.5097340.387039
AR0.0487800.00.0000000.209302...0.9651160.5814700.5055560.356059
AS0.0000000.00.0000000.000000...1.0000000.7245000.0000000.177400
AZ0.0000000.00.0000000.067669...0.8796990.5497920.5437020.480859

5 rows × 22 columns

16.11 用连续变量分组

16.11.1 cut概念

# 被划分成了六个片元(就是六个区间,将这些区间映射到传入的Series中)
my_cut = pd.cut(collage_data["UGDS"],bins=[-np.Inf,10000,20000,30000,40000,50000,np.Inf])
my_cut
0          (-inf, 10000.0]
1       (10000.0, 20000.0]
2          (-inf, 10000.0]
3          (-inf, 10000.0]
4          (-inf, 10000.0]
               ...        
7530                   NaN
7531                   NaN
7532                   NaN
7533                   NaN
7534                   NaN
Name: UGDS, Length: 7535, dtype: category
Categories (6, interval[float64]): [(-inf, 10000.0] < (10000.0, 20000.0] < (20000.0, 30000.0] < (30000.0, 40000.0] < (40000.0, 50000.0] < (50000.0, inf]]

16.11.2 使用cut进行分组

collage_data.groupby(my_cut,as_index=True)[["UGDS","INSTNM"]].max()
UGDSINSTNM
UGDS
(-inf, 10000.0]9999.0eClips School of Cosmetology and Barbering
(10000.0, 20000.0]19979.0Youngstown State University
(20000.0, 30000.0]29977.0West Virginia University
(30000.0, 40000.0]39958.0Valencia College
(40000.0, 50000.0]49340.0Western Governors University
(50000.0, inf]151558.0University of Phoenix-Arizona
  • 2
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值