Datawhale Task04 分组模式及其对象 打卡

本文详细介绍了Pandas库中关于数据分组(groupby)、聚合函数(如mean、quantile等)的应用,包括条件分组、自定义函数和跨列操作,以及transform和filter方法的使用。通过实例演示了如何利用这些功能进行数据处理和分析。
摘要由CSDN通过智能技术生成

import numpy as np
import pandas as pd

4.1 分组模式及其对象

4.1.1 分组的一般模式

df.groupby(分组依据)[数据来源].使用操作

df = pd.read_csv(r'joyful-pandas-master\data\learn_pandas.csv')
df.groupby('Gender')['Height'].median()
Gender
Female    159.6
Male      173.4
Name: Height, dtype: float64

4.1.2 分组依据的本质

df.groupby(['School','Gender'])['Height'].mean()
School                         Gender
Fudan University               Female    158.776923
                               Male      174.212500
Peking University              Female    158.666667
                               Male      172.030000
Shanghai Jiao Tong University  Female    159.122500
                               Male      176.760000
Tsinghua University            Female    159.753333
                               Male      171.638889
Name: Height, dtype: float64
condition = df.Weight > df.Weight.mean()
df.groupby(condition)['Height'].mean()   #大于weight平均值的weight中的height的平均值
Weight
False    159.034646
True     172.705357
Name: Height, dtype: float64
items = np.random.choice(list('abc'),df.shape[0])
items
array(['c', 'c', 'b', 'c', 'c', 'b', 'b', 'b', 'a', 'a', 'a', 'c', 'c',
       'c', 'b', 'b', 'c', 'a', 'b', 'a', 'c', 'b', 'c', 'c', 'b', 'a',
       'c', 'a', 'c', 'c', 'a', 'c', 'c', 'c', 'a', 'c', 'c', 'a', 'a',
       'c', 'b', 'c', 'c', 'b', 'b', 'c', 'a', 'b', 'b', 'b', 'a', 'c',
       'b', 'b', 'b', 'a', 'b', 'b', 'c', 'b', 'b', 'b', 'b', 'c', 'a',
       'c', 'a', 'a', 'c', 'c', 'c', 'a', 'c', 'b', 'a', 'b', 'b', 'c',
       'a', 'b', 'b', 'a', 'b', 'c', 'a', 'a', 'a', 'a', 'a', 'b', 'a',
       'b', 'a', 'b', 'a', 'c', 'c', 'b', 'c', 'b', 'a', 'b', 'a', 'b',
       'a', 'c', 'c', 'a', 'b', 'b', 'c', 'c', 'b', 'b', 'b', 'a', 'b',
       'b', 'a', 'c', 'c', 'b', 'b', 'b', 'b', 'c', 'a', 'c', 'b', 'b',
       'a', 'b', 'c', 'a', 'a', 'b', 'b', 'b', 'c', 'a', 'a', 'a', 'a',
       'a', 'c', 'a', 'b', 'b', 'a', 'b', 'c', 'a', 'b', 'a', 'a', 'c',
       'b', 'c', 'b', 'c', 'c', 'c', 'b', 'c', 'a', 'b', 'b', 'a', 'b',
       'c', 'b', 'b', 'b', 'b', 'a', 'b', 'a', 'b', 'b', 'b', 'a', 'c',
       'b', 'a', 'b', 'a', 'b', 'c', 'a', 'b', 'b', 'c', 'c', 'a', 'b',
       'c', 'a', 'c', 'c', 'c'], dtype='<U1')
df.groupby(items)['Height'].mean()
a    163.283333
b    163.984932
c    162.155357
Name: Height, dtype: float64
df.groupby([condition,items])['Height'].mean()
Weight   
False   a    159.609524
        b    158.879545
        c    158.612195
True    a    176.141667
        b    171.731034
        c    171.840000
Name: Height, dtype: float64
df[['School','Gender']].drop_duplicates()
SchoolGender
0Shanghai Jiao Tong UniversityFemale
1Peking UniversityMale
2Shanghai Jiao Tong UniversityMale
3Fudan UniversityFemale
4Fudan UniversityMale
5Tsinghua UniversityFemale
9Peking UniversityFemale
16Tsinghua UniversityMale
df.groupby([df['School'],df['Gender']])['Height'].mean()
School                         Gender
Fudan University               Female    158.776923
                               Male      174.212500
Peking University              Female    158.666667
                               Male      172.030000
Shanghai Jiao Tong University  Female    159.122500
                               Male      176.760000
Tsinghua University            Female    159.753333
                               Male      171.638889
Name: Height, dtype: float64

4.1.3 Groupby 对象

gb = df.groupby(['School','Grade'])
gb
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000205BA959108>
gb.ngroups
16
res = gb.groups
res  #组名:组索引
{('Fudan University', 'Freshman'): [15, 28, 63, 70, 73, 105, 108, 157, 186], ('Fudan University', 'Junior'): [26, 41, 82, 84, 90, 107, 145, 152, 173, 187, 189, 195], ('Fudan University', 'Senior'): [39, 46, 49, 52, 66, 77, 112, 129, 131, 138, 144], ('Fudan University', 'Sophomore'): [3, 4, 37, 48, 68, 98, 135, 170], ('Peking University', 'Freshman'): [1, 32, 35, 36, 38, 45, 54, 57, 88, 96, 99, 140, 185], ('Peking University', 'Junior'): [9, 20, 59, 72, 75, 102, 159, 183], ('Peking University', 'Senior'): [30, 86, 116, 127, 130, 132, 147, 194], ('Peking University', 'Sophomore'): [29, 61, 83, 101, 120], ('Shanghai Jiao Tong University', 'Freshman'): [0, 6, 10, 60, 114, 117, 119, 121, 141, 148, 149, 153, 184], ('Shanghai Jiao Tong University', 'Junior'): [31, 42, 50, 56, 58, 64, 85, 93, 115, 122, 143, 155, 164, 172, 174, 188, 190], ('Shanghai Jiao Tong University', 'Senior'): [2, 12, 19, 21, 22, 23, 79, 87, 89, 103, 104, 109, 123, 134, 156, 161, 165, 166, 171, 192, 197, 198], ('Shanghai Jiao Tong University', 'Sophomore'): [13, 65, 71, 124, 167], ('Tsinghua University', 'Freshman'): [5, 8, 33, 34, 43, 44, 47, 51, 62, 67, 81, 111, 125, 133, 136, 142, 146], ('Tsinghua University', 'Junior'): [7, 11, 16, 17, 27, 69, 94, 95, 113, 118, 128, 137, 150, 154, 158, 160, 162, 163, 169, 176, 177, 191], ('Tsinghua University', 'Senior'): [14, 18, 24, 25, 78, 92, 100, 126, 168, 175, 179, 180, 193, 196], ('Tsinghua University', 'Sophomore'): [40, 53, 55, 74, 76, 80, 91, 97, 106, 110, 139, 151, 178, 181, 182, 199]}
gb.size()
School                         Grade    
Fudan University               Freshman      9
                               Junior       12
                               Senior       11
                               Sophomore     8
Peking University              Freshman     13
                               Junior        8
                               Senior        8
                               Sophomore     5
Shanghai Jiao Tong University  Freshman     13
                               Junior       17
                               Senior       22
                               Sophomore     5
Tsinghua University            Freshman     17
                               Junior       22
                               Senior       14
                               Sophomore    16
dtype: int64
gb.get_group(('Fudan University','Freshman')).iloc[:4,:6] #行 列
SchoolGradeNameGenderHeightWeight
15Fudan UniversityFreshmanChangqiang YangFemale156.049.0
28Fudan UniversityFreshmanGaoqiang QinFemale170.263.0
63Fudan UniversityFreshmanGaofeng ZhaoFemale152.243.0
70Fudan UniversityFreshmanYanquan WangFemale163.555.0

4.1.4 分组的三大操作

聚合 变换 过滤

4.2 聚合函数

4.2.1 内置聚合函数

gb = df.groupby('Gender')['Height']
gb.idxmin() #计算能够获取到最小值的索引位置(整数)
Gender
Female    143
Male      199
Name: Height, dtype: int64
gb.quantile(0.95)
Gender
Female    166.8
Male      185.9
Name: Height, dtype: float64
gb = df.groupby('Gender')[['Height','Weight']]
gb.max()
HeightWeight
Gender
Female170.263.0
Male193.989.0

all 是否全为True
any 是否有一个为True
prod 乘积
skew 偏度
df.mean() # 返回所有列的均值
df.mean(1) # 返回所有行的均值,下同
df.corr() # 返回列与列之间的相关系数
df.count() # 返回每一列中的非空值的个数
df.max() # 返回每一列的最大值
df.min() # 返回每一列的最小值
df.abs() # 绝对值
df.median() # 返回每一列的中位数
df.std() # 返回每一列的标准差, 贝塞尔校正的样本标准偏差
df.var() # 无偏方差
df.sem() # 平均值的标准误差
df.mode() # 众数
df.prod() # 连乘
df.mad() # 平均绝对偏差
df.cumprod() # 累积连乘,累乘
df.cumsum(axis=0) # 累积连加,累加
df.nunique() # 去重数量,不同值的量
df.idxmax() # 每列最大的值的索引名
df.idxmin() # 最小
df.cummax() # 累积最大值
df.cummin() # 累积最小值
df.skew() # 样本偏度 (第三阶)
df.kurt() # 样本峰度 (第四阶)
df.quantile() # 样本分位数 (不同 % 的值)

4.2.2 age方法

#使用多个函数
gb.agg(['sum','idxmax','skew'])
HeightWeight
sumidxmaxskewsumidxmaxskew
Gender
Female21014.028-0.2192536469.028-0.268482
Male8854.91930.4375353929.02-0.332393
#对特定的列使用特定的聚合函数
gb.agg({'Height':['mean','max'],'Weight':'count'})
HeightWeight
meanmaxcount
Gender
Female159.19697170.2135
Male173.62549193.954

练一练

gb.agg({'Height':['sum','idxmax','skew'],'Weight':['sum','idxmax','skew']})
HeightWeight
sumidxmaxskewsumidxmaxskew
Gender
Female21014.028-0.2192536469.028-0.268482
Male8854.91930.4375353929.02-0.332393
#使用自定义的函数
gb.agg(lambda x: x.max()-x.min())
HeightWeight
Gender
Female24.829.0
Male38.238.0
def my_func(s):      #看不懂这个代码
    res = 'High'
    if s.mean() <= df[s.name].mean():
        res = 'low'
    return res
gb.agg(my_func)
HeightWeight
Gender
Femalelowlow
MaleHighHigh
#聚合结果重命名
gb.agg([('range',lambda x: x.max()-x.min()),('my_sum','sum')])
HeightWeight
rangemy_sumrangemy_sum
Gender
Female24.821014.029.06469.0
Male38.28854.938.03929.0
gb.agg({'Height':[('my_func',my_func),'sum'],
       'Weight':lambda x:x.max()})
HeightWeight
my_funcsum<lambda>
Gender
Femalelow21014.063.0
MaleHigh8854.989.0

4.3变换和过滤

4.3.1 变换函数与transform方法

gb.cummax().head()
HeightWeight
0158.946.0
1166.570.0
2188.989.0
3NaN46.0
4188.989.0
gb.transform(lambda x: (x-x.mean())/x.std()).head()
HeightWeight
0-0.058760-0.354888
1-1.010925-0.355000
22.1670632.089498
3NaN-1.279789
40.0531330.159631
gb.transform('mean').head()  #标量广播的技巧在特征工程中常用
HeightWeight
0159.1969747.918519
1173.6254972.759259
2173.6254972.759259
3159.1969747.918519
4173.6254972.759259
gb.filter(lambda x:x.shape[0] > 100).head()  #过滤得到所有容量大于100的组
HeightWeight
0158.946.0
3NaN41.0
5158.051.0
6162.552.0
7161.950.0

4.4 跨列分组

4.4.1 apply的引入

4.4.2 apply的使用

def BMI(x):
    Height = x['Height']/100
    Weight = x['Weight']
    BMI_value = Weight/Height**2
    return BMI_value.mean()
gb.apply(BMI)
Gender
Female    18.860930
Male      24.318654
dtype: float64
gb = df.groupby(['Gender','Test_Number'])[['Height','Weight']]
gb.apply(lambda x:0)
Gender  Test_Number
Female  1              0
        2              0
        3              0
Male    1              0
        2              0
        3              0
dtype: int64

评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值