第四章 分组
一、分组模式及其对象
1 、分组的一般模式
想要实现分组操作,必须明确三个元素:分组依据、数据来源、操作及返回结果,分组代码的一般模式即:
df.groupby(分组依据)[数据来源].使用操作
依据性别分组,统计全国人口寿命的平均值:
df.groupby('Gender')['Longevity'].mean( )
按照性别统计身高中位数:
df = pd.read_csv('joyful-pandas/data/learn_pandas.csv')
df.groupby('Gender')['Height'].median()
# Gender
# Female 159.6
# Male 173.4
# Name: Height, dtype: float64
2、分组依据的本质
在gruopby中传入多个列名构成的列表,即可实现根据多个维度进行分组。
根据学校和性别进行分组,统计身高的均值:
df.groupby(['School', 'Gender'])['Height'].mean()
# School Gender
# Fudan University Female 158.776923
# Male 174.212500
# Peking University Female 158.666667
# Male 172.030000
# Shanghai Jiao Tong University Female 159.122500
# Male 176.760000
# Tsinghua University Female 159.753333
# Male 171.638889
# Name: Height, dtype: float64
更复杂的分组条件:根据学生体重是否超过总体均值来分组,计算身高的均值
condition = df.Weight > df.Weight.mean()
df.groupby(condition)['Height'].mean()
# Weight
# False 159.034646
# True 172.705357
# Name: Height, dtype: float64
【练一练】
请根据上下四分位数分隔,将体重分为high、normal、low三组,统计身高的均值
# 根据上下四分位数,先将体重分为三组,新建一列命名为cond
condition1 = df.Weight > df.Weight.quantile(0.75)
condition2 = (df.Weight >= df.Weight.quantile(0.25)) & (df.Weight <= df.Weight.quantile(0.75))
condition3 = df.Weight < df.Weight.quantile(0.25)
df['cond'] = df['Weight'].mask(condition1,'high').mask(condition2,'normal').mask(condition3,'low')
# 分组计算身高均值
df.groupby('cond')['Height'].mean()
# cond
# high 174.935714
# low 153.753659
# normal 161.883516
# Name: Height, dtype: float64
从索引可以看出,其实最后产生的结果就是按照条件列表中元素的值(此处是True和False)来分组,下面用随机传入字母序列来验证这一想法
item = np.random.choice(list('abc'), df.shape[0])
df.groupby(item)['Height'].mean()
# a 164.407143
# b 161.930508
# c 163.355882
# Name: Height, dtype: float64
此处的索引就是原先item中的元素,如果传入多个序列进入groupby,那么最后分组的依据就是这两个序列对应行的唯一组合:
df.groupby([condition, item])['Height'].mean()
# Weight
# False a 159.822222
# b 158.338636
# c 159.082979
# True a 172.660000
# b 172.466667
# c 172.919048
# Name: Height, dtype: float64
由此可以看出,之前传入列名只是一种简便的记号,事实上等价于传入的是一个或多个列,最后分组的依据来自于数据来源组合的unique值,通过drop_duplicates
就能知道具体的组类别:
df[['School', 'Gender']].drop_duplicates()
# School Gender
# 0 Shanghai Jiao Tong University Female
# 1 Peking University Male
# 2 Shanghai Jiao Tong University Male
# 3 Fudan University Female
# 4 Fudan University Male
# 5 Tsinghua University Female
# 9 Peking University Female
# 16 Tsinghua University Male
df.groupby([df['School'], df['Gender']])['Height'].mean()
# School Gender
# Fudan University Female 158.776923
# Male 174.212500
# Peking University Female 158.666667
# Male 172.030000
# Shanghai Jiao Tong University Fem