import numpy as np
import pandas as pd
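df = pd.read_csv('data/learn_pandas.csv')
I. Grouping patterns and their objects
1. The general pattern of grouping
A grouping operation needs three elements to be specified: the grouping key, the data source, and the operation together with its return value. Grouping amounts to taking an operation that would normally run on a whole column and running it on each part of that column instead. The example below computes the median height by gender. Gender is the grouping key: the values in the Height column are first split into two parts by gender (Gender takes only two values in this dataset), and the operation is then applied to each part separately.
df.groupby('Gender')['Height'].median()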
Gender
Female 159.6
Male 173.4
Name: Height, dtype: float64
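To make the three elements concrete, here is a minimal sketch of the same split-then-operate idea done by hand. The `toy` DataFrame is made-up data standing in for learn_pandas.csv (only the column names Gender and Height are taken from these notes), so the numbers are purely illustrative.

```python
import pandas as pd

# Hypothetical toy data; not the contents of learn_pandas.csv
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female"],
    "Height": [158.0, 172.0, 160.0, 176.0, 162.0],
})

# Grouped version: key = Gender, data source = Height, operation = median
grouped = toy.groupby("Gender")["Height"].median()

# Manual version: split the Height column by key, then operate on each part
manual = {
    key: toy.loc[toy["Gender"] == key, "Height"].median()
    for key in toy["Gender"].unique()
}

print(grouped)
print(manual)
```

Both versions should agree; groupby simply does the splitting and the recombining in one call.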
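2. The nature of the grouping key
To group by several dimensions, just pass a list to groupby. The key does not have to be a column name: any array with the same length as the DataFrame can serve as the key, as the randomly generated labels below show.
item = np.random.choice(list('abc'), df.shape[0])
df.groupby(item)['Height'].mean()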
a 163.571429
b 163.244444
c 162.962195
Name: Height, dtype: float64
df.groupby([df['School'], df['Gender']])['Height'].mean()
School                         Gender
Fudan University               Female    158.776923
                               Male      174.212500
Peking University              Female    158.666667
                               Male      172.030000
Shanghai Jiao Tong University  Female    159.122500
                               Male      176.760000
Tsinghua University            Female    159.753333
                               Male      171.638889
Name: Height, dtype: float64
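The sketch below, on made-up data, checks that passing column names and passing the corresponding Series give the same grouping, and that an external label array which is not a column of the DataFrame also works as a key. The names `toy` and `labels` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the contents of learn_pandas.csv
toy = pd.DataFrame({
    "School": ["A", "A", "B", "B"],
    "Gender": ["Female", "Male", "Female", "Female"],
    "Height": [160.0, 175.0, 158.0, 162.0],
})

# Column names and the Series themselves describe the same key
by_name = toy.groupby(["School", "Gender"])["Height"].mean()
by_series = toy.groupby([toy["School"], toy["Gender"]])["Height"].mean()
print(by_name.equals(by_series))

# An external array of labels, one per row, is also a valid key
labels = np.array(["x", "y", "x", "y"])
by_labels = toy.groupby(labels)["Height"].mean()
print(by_labels)
```

In other words, what groupby really needs is one label per row; a column name is just a shorthand for the labels stored in that column.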
3. The GroupBy object
gb = df.groupby(['School', 'Gender'])
gb.ngroups
8
res = gb.groups
res.keys()
dict_keys([('Fudan University', 'Female'), ('Fudan University', 'Male'), ('Peking University', 'Female'), ('Peking University', 'Male'), ('Shanghai Jiao Tong University', 'Female'), ('Shanghai Jiao Tong University', 'Male'), ('Tsinghua University', 'Female'), ('Tsinghua University', 'Male')])
gb.size()
School                         Gender
Fudan University               Female    30
                               Male      10
Peking University              Female    22
                               Male      12
Shanghai Jiao Tong University  Female    41
                               Male      16
Tsinghua University            Female    48
                               Male      21
dtype: int64
gb.get_group(('Fudan University', 'Female')).iloc[:3, :3]
              School      Grade             Name
3   Fudan University  Sophomore     Xiaojuan Sun
15  Fudan University   Freshman  Changqiang Yang
26  Fudan University     Junior        Yanli You
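As a small self-contained recap of the attributes used above (ngroups, groups, size, get_group), here is a sketch on made-up data. The DataFrame `toy` and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; not the contents of learn_pandas.csv
toy = pd.DataFrame({
    "School": ["A", "A", "B", "B", "B"],
    "Gender": ["Female", "Male", "Female", "Female", "Male"],
    "Height": [160.0, 175.0, 158.0, 162.0, 171.0],
})
gb_toy = toy.groupby(["School", "Gender"])

# Number of distinct (School, Gender) combinations present in the data
n_groups = gb_toy.ngroups

# groups maps each key tuple to the row labels belonging to that group
group_rows = {key: list(idx) for key, idx in gb_toy.groups.items()}

# size counts the rows in each group
sizes = gb_toy.size()

# get_group returns the sub-DataFrame for a single key
b_female = gb_toy.get_group(("B", "Female"))

print(n_groups)
print(group_rows)
print(sizes)
print(b_female)
```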
II. Aggregation functions
1. Built-in aggregation functions
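These notes do not list the built-in aggregations, so the following is only a hedged sketch of a few that pandas provides on GroupBy objects (max, mean, quantile), run on made-up data. The DataFrame `toy` is an assumption, not the course dataset.

```python
import pandas as pd

# Hypothetical toy data; not the contents of learn_pandas.csv
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Height": [158.0, 172.0, 162.0, 176.0],
    "Weight": [46.0, 70.0, 50.0, 74.0],
})
gb_toy = toy.groupby("Gender")[["Height", "Weight"]]

# Each built-in aggregation returns one row per group
max_by_gender = gb_toy.max()
mean_by_gender = gb_toy.mean()
median_by_gender = gb_toy.quantile(0.5)  # 0.5 quantile, i.e. the median

print(max_by_gender)
print(mean_by_gender)
print(median_by_gender)
```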
2. The agg method
[a] Using multiple functions
[b] Applying specific aggregation functions to specific columns
[c] Using custom functions
gb = df.groupby('Gender')[['Height', 'Weight']]
gb.agg(['mean', 'idxmax', 'skew'])
gb.agg({'Height': ['mean', 'max'], 'Weight': 'count'})
gb.agg(lambda x: x.mean() - x.min())
          Height     Weight
Gender
Female  13.79697  13.918519
Male    17.92549  21.759259
gb.agg([('range', lambda x: x.max() - x.min())])
       Height Weight
        range  range
Gender
Female   24.8   29.0
Male     38.2   38.0
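The agg variants above can be tried on a small made-up DataFrame where the results are easy to verify by eye. This is only a sketch; `toy` and the result variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data; not the contents of learn_pandas.csv
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Height": [158.0, 172.0, 162.0, 176.0],
    "Weight": [46.0, 70.0, 50.0, 74.0],
})
gb_toy = toy.groupby("Gender")[["Height", "Weight"]]

# [a] Several functions at once: columns become (column, function) pairs
multi = gb_toy.agg(["mean", "max"])

# [b] Different functions for different columns via a dict
per_column = gb_toy.agg({"Height": ["mean", "max"], "Weight": "count"})

# [c] A custom function, called on each column of each group
custom = gb_toy.agg(lambda x: x.max() - x.min())

# A (name, function) tuple gives the custom result a readable column name
named = gb_toy.agg([("range", lambda x: x.max() - x.min())])

print(multi)
print(per_column)
print(custom)
print(named)
```

The tuple form is mainly useful with lambdas, whose default column label would otherwise be `<lambda>`.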
III. Transformation and filtering
1. Transformation functions and the transform method
gb.cummax().head()
   Height  Weight
0   158.9    46.0
1   166.5    70.0
2   188.9    89.0
3     NaN    46.0
4   188.9    89.0
gb.transform(lambda x: (x - x.mean()) / x.std()).head()
     Height    Weight
0 -0.058760 -0.354888
1 -1.010925 -0.355000
2  2.167063  2.089498
3       NaN -1.279789
4  0.053133  0.159631
gb.transform('mean').head()
      Height     Weight
0  159.19697  47.918519
1  173.62549  72.759259
2  173.62549  72.759259
3  159.19697  47.918519
4  173.62549  72.759259
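What distinguishes a transformation from an aggregation is that the result keeps the shape and index of the input instead of collapsing to one row per group. The sketch below shows this on made-up data; `toy` and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical toy data; not the contents of learn_pandas.csv
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Height": [162.0, 172.0, 158.0, 176.0],
})
gb_toy = toy.groupby("Gender")[["Height"]]

# Running maximum within each group, aligned to the original rows
running_max = gb_toy.cummax()

# Within-group standardization: subtract the group mean, divide by the group std
z_scores = gb_toy.transform(lambda x: (x - x.mean()) / x.std())

# A scalar result per group is broadcast back to every row of that group
group_mean = gb_toy.transform("mean")

print(running_max)
print(z_scores)
print(group_mean)
```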
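2. Group indexing and filtering
Ordinary indexing selects individual rows, whereas group filtering keeps or drops all rows of a group at once: filter evaluates a function on each group and retains the whole group only if the result is True. The call below keeps only the groups that contain more than 100 rows.
gb.filter(lambda x: x.shape[0] > 100).head()
   Height  Weight
0   158.9    46.0
3     NaN    41.0
5   158.0    51.0
6   162.5    52.0
7   161.9    50.0
df[['Height', 'Weight']].shape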
(200, 2)
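Here is a minimal sketch of group filtering on made-up data, with a size threshold of 2 instead of 100 so the effect is visible on five rows. The DataFrame `toy` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data; not the contents of learn_pandas.csv
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female"],
    "Height": [158.0, 172.0, 160.0, 176.0, 162.0],
})

# Keep only the groups with more than 2 rows; the Male group (2 rows) is dropped
kept = toy.groupby("Gender")[["Height"]].filter(lambda x: x.shape[0] > 2)

print(kept)
```

The surviving rows keep their original index and order, which makes it easy to see which rows belonged to the dropped group.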
IV. Cross-column grouping
1. Using apply
gb.apply(lambda x: (x['Weight'] / (x['Height'] / 100) ** 2).mean())
Gender
Female 18.860930
Male 24.318654
dtype: float64
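The BMI computation needs two columns at once, which is why apply is used here: the function passed to apply receives each group as a DataFrame. Below is a sketch on made-up data; `toy` and the helper `mean_bmi` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical toy data; not the contents of learn_pandas.csv
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Height": [160.0, 180.0, 150.0, 200.0],   # centimetres
    "Weight": [51.2, 81.0, 45.0, 100.0],      # kilograms
})

def mean_bmi(group):
    """Mean BMI of one group: weight (kg) / height (m) squared, averaged."""
    return (group["Weight"] / (group["Height"] / 100) ** 2).mean()

# apply passes each group's sub-DataFrame (both columns) to mean_bmi
bmi = toy.groupby("Gender")[["Height", "Weight"]].apply(mean_bmi)

print(bmi)
```

With these made-up numbers the mean BMI comes out at roughly 20 for Female and roughly 25 for Male, up to floating-point rounding.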
gb = df.groupby(['Gender', 'Test_Number'])[['Height', 'Weight']]
gb.apply(lambda x: 0)
Gender  Test_Number
Female  1              0
        2              0
        3              0
Male    1              0
        2              0
        3              0
dtype: int64
gb.apply(lambda x: [0, 0])
Gender  Test_Number
Female  1              [0, 0]
        2              [0, 0]
        3              [0, 0]
Male    1              [0, 0]
        2              [0, 0]
        3              [0, 0]
dtype: object
gb.apply(lambda x: pd.Series([0, 0], index=['a', 'b']))
                    a  b
Gender Test_Number
Female 1            0  0
       2            0  0
       3            0  0
Male   1            0  0
       2            0  0
       3            0  0
gb.apply(lambda x: pd.DataFrame(np.ones((2, 2)),
                                index=['a', 'b'],
                                columns=pd.Index([('w', 'x'), ('y', 'z')])))
                        w    y
                        x    z
Gender Test_Number
Female 1           a  1.0  1.0
                   b  1.0  1.0
       2           a  1.0  1.0
                   b  1.0  1.0
       3           a  1.0  1.0
                   b  1.0  1.0
Male   1           a  1.0  1.0
                   b  1.0  1.0
       2           a  1.0  1.0
                   b  1.0  1.0
       3           a  1.0  1.0
                   b  1.0  1.0
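The outputs above suggest a pattern: the shape of what the function returns decides the shape of the apply result. The sketch below reproduces the three cases on made-up data with a single grouping key; `toy` and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the contents of learn_pandas.csv
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Height": [158.0, 172.0, 162.0, 176.0],
    "Weight": [46.0, 70.0, 50.0, 74.0],
})
gb_toy = toy.groupby("Gender")[["Height", "Weight"]]

# Scalar per group: a Series indexed by the group key
scalar_result = gb_toy.apply(lambda x: 0)

# Series per group: a DataFrame whose columns are the Series index
series_result = gb_toy.apply(lambda x: pd.Series([0, 0], index=["a", "b"]))

# DataFrame per group: the group key is stacked on top of the frame's own index
frame_result = gb_toy.apply(
    lambda x: pd.DataFrame(np.ones((2, 2)), index=["a", "b"], columns=["w", "y"])
)

print(scalar_result)
print(series_result)
print(frame_result)
```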
Summary: I have not yet fully understood much of this material, and these notes still need to be extended.