Grouping
import numpy as np
import pandas as pd
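I. Grouping patterns and the groupby object
1. The general grouping pattern
A grouping operation has three elements: the grouping key, the data source, and the operation together with the result it returns.
General pattern: df.groupby(grouping key)[data source].operation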
df = pd.read_csv('data/learn_pandas.csv')
df.groupby('Gender')['Height'].mean()
# Mean height by gender
Gender
Female 159.19697
Male 173.62549
Name: Height, dtype: float64
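To make the three elements easier to see without the CSV file, here is a minimal sketch on a small made-up table. The `toy` DataFrame and its values are hypothetical and are not taken from learn_pandas.csv.

```python
import pandas as pd

# Hypothetical toy data standing in for learn_pandas.csv
toy = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Female', 'Male'],
    'Height': [160.0, 175.0, 158.0, 171.0],
})

# grouping key: 'Gender' -> data source: 'Height' -> operation: mean()
mean_height = toy.groupby('Gender')['Height'].mean()
print(mean_height)
```

The result is a Series indexed by the unique values of the grouping key, one row per group.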
df.head(1)
| | School | Grade | Name | Gender | Height | Weight | Transfer | Test_Number | Test_Date | Time_Record |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Shanghai Jiao Tong University | Freshman | Gaopeng Yang | Female | 158.9 | 46.0 | N | 1 | 2019/10/5 | 0:04:34 |
2. The nature of the grouping key
1. To group by several dimensions, pass a list of the corresponding column names to groupby
df.groupby(['School','Gender'])['Height'].mean()
School Gender
Fudan University Female 158.776923
Male 174.212500
Peking University Female 158.666667
Male 172.030000
Shanghai Jiao Tong University Female 159.122500
Male 176.760000
Tsinghua University Female 159.753333
Male 171.638889
Name: Height, dtype: float64
2. Grouping by more complex logic
First write out the grouping condition
condition = df.Weight > df.Weight.mean()
df.groupby(condition)['Height'].mean()
Weight
False 159.034646
True 172.705357
Name: Height, dtype: float64
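The same idea can be tried on a tiny made-up table where the group means are easy to check by hand. This is only a sketch: the `toy` DataFrame and the name `heavy` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the mean weight is 60.0
toy = pd.DataFrame({
    'Weight': [45.0, 50.0, 70.0, 75.0],
    'Height': [155.0, 160.0, 172.0, 178.0],
})

# Boolean Series: True where the weight is above the overall mean
heavy = toy['Weight'] > toy['Weight'].mean()

# The boolean Series itself is the grouping key; its unique values
# (False and True) become the group labels
result = toy.groupby(heavy)['Height'].mean()
print(result)
```

Because the key is a Series named Weight, the index of the result is labelled Weight, which matches the output shown above.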
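3. Exercise 1
Using the lower and upper quartiles as cut points, split the weights into three groups (high, normal, low) and compute the mean height of each group.
Approach: I first tried to classify the weights with an if statement, but a plain if cannot be applied to the whole Weight Series at once. Since apply iterates over the elements, I wrapped the condition in a function and passed that function to apply. While writing the function I used print where I should have used return, so the function returned None for every row. This is worth watching out for.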
data = df.copy()
def my_condition(x):
    # Compare a single weight against the lower and upper quartiles
    if x <= df.Weight.quantile(0.25):
        return 'low'
    elif x >= df.Weight.quantile(0.75):
        return 'high'
    else:
        return 'normal'
new = data.Weight.apply(my_condition)
data.groupby(new)['Height'].mean()
Weight
high 174.511364
low 154.119149
normal 162.465217
Name: Height, dtype: float64
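As an alternative to apply, the same three-way split can be sketched in vectorized form with `np.select`. This is not the approach used above, only a possible variant. The `toy` DataFrame is hypothetical, and the quartiles are computed once up front and not once per row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the quartiles of Weight are 50.0 and 70.0
toy = pd.DataFrame({
    'Weight': [40.0, 50.0, 60.0, 70.0, 80.0],
    'Height': [150.0, 160.0, 165.0, 175.0, 185.0],
})

# Compute both quartiles once
q25, q75 = toy['Weight'].quantile([0.25, 0.75])

# The first matching condition wins; rows matching neither get the default
labels = np.select(
    [toy['Weight'] <= q25, toy['Weight'] >= q75],
    ['low', 'high'],
    default='normal',
)

result = toy.groupby(labels)['Height'].mean()
print(result)
```

As with the apply version, a missing weight fails both comparisons and therefore ends up in the normal group, so it may be worth handling missing values separately on real data.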
3. Passing in a random sequence of letters
item = np.random.choice(list('abc'),df.shape[0])
item.shape
(200,)
df.shape
(200, 10)
df.groupby(item)['Height'].mean()
a 163.770667
b 162.486885
c 163.285106
Name: Height, dtype: float64
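My understanding is that groupby only needs a sequence with as many elements as the original DataFrame has rows; groupby then matches that sequence to the rows of the DataFrame in order.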
df.groupby([condition,item])['Weight'].mean()
Weight
False a 47.541667
b 46.558140
c 48.000000
True a 69.500000
b 70.571429
c 74.928571
Name: Weight, dtype: float64
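A small sketch of how external keys are matched to rows, on made-up data. The names `toy`, `letters` and `key_series` are hypothetical. A NumPy array is matched to the rows by position, while a Series used as a key is aligned on its index labels first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a non-default index
toy = pd.DataFrame({'Weight': [40.0, 50.0, 60.0, 70.0]},
                   index=[10, 11, 12, 13])

# A NumPy array of the same length is matched to the rows by position
letters = np.array(['a', 'b', 'a', 'b'])
by_array = toy.groupby(letters)['Weight'].mean()

# A Series key is aligned on index labels, so its row order does not matter
key_series = pd.Series(['b', 'a', 'b', 'a'], index=[11, 10, 13, 12])
by_series = toy.groupby(key_series)['Weight'].mean()

print(by_array)
print(by_series)
```

Both calls put rows 10 and 12 in group a and rows 11 and 13 in group b.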
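This shows that passing column names is just convenient shorthand: it is equivalent to passing the column or columns themselves, and the groups are ultimately determined by the unique combinations of values in those key columns.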
4. Using drop_duplicates to see the actual group categories
df[['School','Gender']].drop_duplicates()
| | School | Gender |
|---|---|---|
| 0 | Shanghai Jiao Tong University | Female |
| 1 | Peking University | Male |
| 2 | Shanghai Jiao Tong University | Male |
| 3 | Fudan University | Female |
| 4 | Fudan University | Male |
| 5 | Tsinghua University | Female |
| 9 | Peking University | Female |
| 16 | Tsinghua University | Male |
df.groupby([df['School'],df['Gender']])['Height'].mean()
School Gender
Fudan University Female 158.776923
Male 174.212500
Peking University Female 158.666667
Male 172.030000
Shanghai Jiao Tong University Female 159.122500
Male 176.760000
Tsinghua University Female 159.753333
Male 171.638889
Name: Height, dtype: float64
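The equivalence between passing column names and passing the columns themselves can also be checked programmatically. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'School': ['A', 'A', 'B', 'B'],
    'Gender': ['Female', 'Male', 'Female', 'Female'],
    'Height': [160.0, 175.0, 158.0, 162.0],
})

# Grouping by column names ...
by_name = toy.groupby(['School', 'Gender'])['Height'].mean()
# ... and grouping by the column Series themselves
by_series = toy.groupby([toy['School'], toy['Gender']])['Height'].mean()

# Raises AssertionError if the two results differ
pd.testing.assert_series_equal(by_name, by_series)
print(by_name)
```

Only combinations that actually occur in the data appear in the result; here there is no (B, Male) row.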
4. The groupby object
gb = df.groupby(['School','Grade'])
gb
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000244F75A50A0>
1. The ngroups attribute gives the number of groups
gb.ngroups
16
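2. The groups attribute returns a dictionary that maps each group name to the list of index labels in that group
In other words, it returns every group together with the row labels it contains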
res = gb.groups
res.keys()
dict_keys([('Fudan University', 'Freshman'), ('Fudan University', 'Junior'), ('Fudan University', 'Senior'), ('Fudan University', 'Sophomore'), ('Peking University', 'Freshman'), ('Peking University', 'Junior'), ('Peking University', 'Senior'), ('Peking University', 'Sophomore'), ('Shanghai Jiao Tong University', 'Freshman'), ('Shanghai Jiao Tong University', 'Junior'), ('Shanghai Jiao Tong University', 'Senior'), ('Shanghai Jiao Tong University', 'Sophomore'), ('Tsinghua University', 'Freshman'), ('Tsinghua University', 'Junior'), ('Tsinghua University', 'Senior'), ('Tsinghua University', 'Sophomore')])
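To see what the values of this dictionary look like, here is a minimal sketch on a hypothetical toy table (`toy` and `gb_toy` are made-up names). Each key is a tuple of group labels and each value holds the row labels that fall into that group.

```python
import pandas as pd

# Hypothetical toy data with a default 0..4 index
toy = pd.DataFrame({
    'School': ['A', 'A', 'B', 'B', 'A'],
    'Grade': ['Freshman', 'Senior', 'Freshman', 'Freshman', 'Freshman'],
    'Height': [160.0, 170.0, 165.0, 175.0, 180.0],
})

gb_toy = toy.groupby(['School', 'Grade'])
groups = gb_toy.groups

# Row labels belonging to one group
freshman_a_rows = list(groups[('A', 'Freshman')])
print(freshman_a_rows)

# get_group returns the corresponding rows as a DataFrame
freshman_a = gb_toy.get_group(('A', 'Freshman'))
print(freshman_a)
```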
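5. Exercise 2
The previous subsection showed that drop_duplicates can be used to get the actual group categories. Now use the groups attribute to achieve something similar.
Approach: groups gives the group combinations. Take its keys and convert them to a DataFrame. Note that the resulting index differs from the one drop_duplicates produces.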
df_demo = df.groupby(['School','Gender'])
res = df_demo.groups
pd.DataFrame(res.keys(),columns=['School','Gender'])
| | School | Gender |
|---|---|---|
| 0 | Fudan University | Female |
| 1 | Fudan University | Male |
| 2 | Peking University | Female |
| 3 | Peking University | Male |
| 4 | Shanghai Jiao Tong University | Female |
| 5 | Shanghai Jiao Tong University | Male |
| 6 | Tsinghua University | Female |
| 7 | Tsinghua University | Male |
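If the index should match the drop_duplicates output as well, one possible sketch is to label each group with the smallest row label it contains and then sort by that label. This assumes the default integer index, where the smallest label is the first occurrence; the `toy` data and the names `first_rows` and `rebuilt` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with a default 0..4 index
toy = pd.DataFrame({
    'School': ['B', 'A', 'B', 'A', 'A'],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female'],
})

groups = toy.groupby(['School', 'Gender']).groups
keys = list(groups.keys())

# Smallest row label in each group = first occurrence of that combination
first_rows = [min(groups[k]) for k in keys]

# Use those labels as the index and restore the order of first appearance
rebuilt = pd.DataFrame(keys, columns=['School', 'Gender'],
                       index=first_rows).sort_index()

expected = toy[['School', 'Gender']].drop_duplicates()
print(rebuilt)
print(expected)
```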
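3. On a DataFrame, size is an attribute that returns the number of rows multiplied by the number of columns; on a groupby object, size() is a method that counts the elements in each group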
gb.size()
School Grade
Fu