import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
GroupBy
groupby works in three steps under the hood: (1) split: partition the data by some criterion; (2) apply: apply a function to each group, e.g. agg or transform; (3) combine: merge the results.
In the apply step we can perform: (1) aggregation: map each group's data to a single number, e.g. the group mean or the number of elements in the group; (2) transformation: e.g. normalizing the data within each group; (3) filtration: dropping groups with few elements or dropping small elements; this requires passing a func that returns a boolean.
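A minimal sketch of the three apply-step flavors on a toy frame (the column and group names here are illustrative, not from the examples below):

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b", "b", "b"], "val": [1, 2, 3, 4, 5]})
gb = df.groupby("key")

# aggregation: one number per group
agg_out = gb["val"].mean()
# transformation: same length as the input, mapped within each group
trans_out = gb["val"].transform(lambda s: s - s.mean())
# filtration: keep only groups with at least 3 rows
filt_out = gb.filter(lambda g: len(g) >= 3)
```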
1 - Splitting into groups
groupby has two important parameters, by and axis:
(1) by specifies the grouping criterion and accepts a column name, list, dict, or func:
column name: rows are grouped by the values of that column's Series; rows with equal values fall into the same group;
list: grouping follows the contents of the list;
dict: keys are axis labels (index or column labels) and values are target labels; labels mapped to the same target form one group;
func: maps each axis label to a target label; labels with the same result form one group.
(2) axis determines the direction of the split: axis=0/"index" splits row by row, while axis=1/"columns" splits column by column. Don't let habit mislead you: in this respect a DataFrame is completely symmetric.
df = pd.DataFrame({
    "class": ["bird", "bird", "mammal", "mammal", "mammal"],
    "order": ["Falconiformes", "Psittaciformes", "Carnivora", "Primates", "Carnivora"],
    "max_speed": [389.0, 24.0, 80.2, np.nan, 58]
}, index=["falcon", "parrot", "lion", "monkey", "leopard"])
df
          class           order  max_speed
falcon     bird   Falconiformes      389.0
parrot     bird  Psittaciformes       24.0
lion     mammal       Carnivora       80.2
monkey   mammal        Primates        NaN
leopard  mammal       Carnivora       58.0
Splitting by rows
df.groupby("class", axis="index")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fea93816ca0>
df.groupby("class", axis=0).groups
{'bird': ['falcon', 'parrot'], 'mammal': ['lion', 'monkey', 'leopard']}
df.groupby(["class", "order"])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fea93861580>
df.groupby(["class", "order"]).groups
{('bird', 'Falconiformes'): ['falcon'], ('bird', 'Psittaciformes'): ['parrot'], ('mammal', 'Carnivora'): ['lion', 'leopard'], ('mammal', 'Primates'): ['monkey']}
df = pd.DataFrame(
    {
        "A": np.random.randn(5),
        "B": np.random.randn(5),
        "C": np.random.randn(5),
        "D": np.random.randn(5),
    }
)
df
          A         B         C         D
0  1.047022 -0.266527  1.659578 -1.731920
1 -0.726851 -0.931228 -2.774025  0.443112
2  0.265430 -0.172112  0.278415  0.333350
3  1.128021 -0.499495 -0.694711 -0.884655
4  0.180662  0.802036 -1.210244 -1.251326
Splitting by columns
def get_letter_type(letter):
    if letter.lower() in "aeiou":
        return "vowel"
    else:
        return "consonant"

grouped = df.groupby(get_letter_type, axis=1)
grouped.groups
{'consonant': ['B', 'C', 'D'], 'vowel': ['A']}
grouped.get_group("consonant").head(3)
          B         C         D
0 -0.266527  1.659578 -1.731920
1 -0.931228 -2.774025  0.443112
2 -0.172112  0.278415  0.333350
Grouping by a dict
df.groupby({"A": "A", "B": "A", "C": "C", "D": "C"}, axis=1).get_group("A").head(3)
          A         B
0  1.047022 -0.266527
1 -0.726851 -0.931228
2  0.265430 -0.172112
df.groupby({"A": "A", "B": "A", "C": "C", "D": "C"}, axis=1).get_group("C")
          C         D
0  1.659578 -1.731920
1 -2.774025  0.443112
2  0.278415  0.333350
3 -0.694711 -0.884655
4 -1.210244 -1.251326
df.groupby({"A": "A", "B": "A", "C": "C", "D": "C"}, axis=1).agg(max)
          A         C
0  1.047022  1.659578
1 -0.726851  0.443112
2  0.265430  0.333350
3  1.128021 -0.694711
4  0.802036 -1.210244
Summary:
Key attribute: groups returns a dict whose keys are the group keys and whose values are the matched axis labels.
Key method: get_group() returns all the data for a given group key.
Note: a pandas index may contain duplicate labels; if an index with duplicates is used as the groupby key, values under the same label are assigned to the same group, so the grouped result has no duplicates.
Question: when splitting by rows we can aggregate based on column names; when splitting by columns, do we aggregate based on the index? The last example shows that we do.
Also, groupby on a DataFrame returns a DataFrameGroupBy object (dfgb), and on a Series a SeriesGroupBy (sgb); although a dfgb wraps DataFrame data, it must be understood group by group.
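The point about duplicate index labels can be checked directly: grouping by a level of the index collapses the repeats (toy data, assumed for illustration):

```python
import pandas as pd

# an index with the label "x" repeated
s = pd.Series([1, 2, 3], index=["x", "x", "y"])
# group by the index itself: both "x" rows land in one group
out = s.groupby(level=0).sum()
```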
GroupBy Sort
groupby produces a dfgb object; an aggregation then produces a DataFrame, which can be printed. groupby sorts the keys by default; if you don't need this, set sort=False to speed things up.
df2 = pd.DataFrame({"X": ["B", "B", "A", "A"], "Y": [1, 2, 3, 4]})
b = df2.groupby("X").sum()
type(b)
pandas.core.frame.DataFrame
b
df2.groupby("X", sort=False).sum()
Dropping NaN in groupby
When grouping, the dropna parameter (True or False) controls whether NaN values may serve as a group key.
df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])
df_dropna
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2
The group-key column contains NaN
df_dropna.groupby("b").sum()
df_dropna.groupby("b", dropna=False).sum()
Within each group, aggregation functions simply skip the NaN values themselves without disturbing the other values in the same row.
df_dropna.groupby("a").sum()
Column selection on a DataFrameGroupBy object
After grouping a DataFrame you may want to apply different operations to different columns; column selection makes this possible, and later sections build on it. For GroupBy with a MultiIndex, see: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
df_dropna
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2
grouped = df_dropna.groupby("a")
grouped["c"]
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fbc71f67d60>
grouped["c"].get_group(1)
0 3
1 4
3 2
Name: c, dtype: int64
Iterating over a groupby object
As noted above, a groupby object can be viewed as key-value pairs, where the key is the group key and the value is the DataFrame formed by the remaining data.
for key, group in grouped:
    print(key)
    print(group)
    print(type(group))
    print("")
1
a b c
0 1 2.0 3
1 1 NaN 4
3 1 2.0 2
<class 'pandas.core.frame.DataFrame'>
2
a b c
2 2 1.0 3
<class 'pandas.core.frame.DataFrame'>
Aggregation
On a DataFrameGroupBy object, an aggregation function collapses each group into a summary for display. You can call its member functions directly, e.g. sum/mean/max/size/count, but remember to add the parentheses.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
df
     A      B         C         D
0  foo    one  0.351405  0.425799
1  bar    one -2.175271 -0.278632
2  foo    two -0.175300 -0.910150
3  bar  three -1.723109  0.614626
4  foo    two -1.765967 -0.465801
5  bar    two  0.653539  1.727326
6  foo    one  0.037191 -0.381093
7  foo  three  2.056519 -1.862873
df.groupby("A").sum()
            C         D
A
bar -3.244841  2.063321
foo  0.503848 -3.194118
Aside: aggregation yields a DataFrame, which the to_dict method can then turn into a dict.
df.groupby("A").sum().to_dict("index")
{'bar': {'C': -3.2448414872221907, 'D': 2.063321126560749},
'foo': {'C': 0.5038481220391884, 'D': -3.194117627544297}}
df.groupby("A").sum().to_dict()
{'C': {'bar': -3.2448414872221907, 'foo': 0.5038481220391884},
'D': {'bar': 2.063321126560749, 'foo': -3.194117627544297}}
As shown, after aggregation the group names appear as the new index. If you don't want them as the index, pass as_index=False or call the reset_index method.
df.groupby("A", as_index=False).agg(np.sum)
     A         C         D
0  bar  0.512978  1.766982
1  foo  2.510977 -3.129734
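The reset_index route mentioned above produces the same shape as as_index=False; a sketch on a small frame of the same kind (values assumed):

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo"], "C": [1.0, 2.0, 3.0]})

# keep the group key as a column rather than the index, two equivalent ways
via_flag = df.groupby("A", as_index=False).sum()
via_reset = df.groupby("A").sum().reset_index()
```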
df.groupby("A").size()
A
bar 3
foo 5
dtype: int64
df.groupby("A").count()
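size() and count() differ in how they treat NaN: size counts all rows in each group, while count counts only the non-null values per column (toy frame, assumed for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": ["x", "x", "y"], "B": [1.0, np.nan, 2.0]})

sizes = df.groupby("A").size()    # the NaN row is still counted
counts = df.groupby("A").count()  # NaN is excluded, per column
```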
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
df
     A      B         C         D
0  foo    one  0.596527 -1.168976
1  bar    one  1.387488  0.439306
2  foo    two -0.105645 -0.085638
3  bar  three  0.019256  0.758142
4  foo    two -0.038975 -0.505880
5  bar    two -0.212490  1.759912
6  foo    one  0.259668 -0.550867
7  foo  three -1.184676 -1.158491
grouped = df.groupby("A")
Applying multiple functions at once
grouped.agg([np.sum, np.mean])
            C                   D
          sum      mean       sum      mean
A
bar  0.952934  0.317645  1.565201  0.521734
foo -3.357455 -0.671491 -1.722682 -0.344536
grouped.agg([np.sum, np.mean]).to_excel("groupby_test.xlsx")
(Screenshot of the exported groupby_test.xlsx omitted.)
grouped["C"].agg([np.sum, np.max, np.mean])
          sum      amax      mean
A
bar  0.952934  0.456841  0.317645
foo -3.357455  0.979221 -0.671491
The names shown after agg are determined by the functions; to rename them, use the rename method. Series and DataFrame work similarly:
grouped["C"].agg([np.sum, np.mean]).rename(columns={"sum": "和", "mean": "平均值"})
            和    平均值
A
bar  0.952934  0.317645
foo -3.357455 -0.671491
Named Aggregation
To support column-oriented aggregation, pandas provides Named Aggregation: each keyword is an output column name, and its value has two parts, the column to aggregate and the aggfunc to use, as in the examples below. The logic: after groupby, each key maps the height and weight columns to Series, and Named Aggregation lets you attach a different aggregation function to each of them.
animals = pd.DataFrame(
    {
        "kind": ["cat", "dog", "cat", "dog"],
        "height": [9.1, 6.0, 9.5, 34.0],
        "weight": [7.9, 7.5, 9.9, 198.0],
    }
)
animals
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0
animals.groupby("kind").agg(
    min_height=pd.NamedAgg(column="height", aggfunc="min"),
    max_height=pd.NamedAgg(column="height", aggfunc="max"),
    average_weight=pd.NamedAgg(column="weight", aggfunc="mean"),
)
      min_height  max_height  average_weight
kind
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75
NamedAgg above is a namedtuple; the equivalent plain-tuple form is:
animals.groupby("kind").agg(
    min_height=("height", "min"),
    max_weight=("weight", "max"),
)
      min_height  max_weight
kind
cat          9.1         9.9
dog          6.0       198.0
When the desired keyword is not a valid Python identifier, use a dict plus unpacking:
animals.groupby("kind").agg(
    **{"min height": ("height", "min")},
    max_weight=("weight", "max"),
)
      min height  max_weight
kind
cat          9.1         9.9
dog          6.0       198.0
If you only care about one column, use named aggregation on the SeriesGroupBy object; since the column is already fixed, the value part is just the function.
animals.groupby("kind").height.agg(
    min_height="min",
    max_height="max",
)
      min_height  max_height
kind
cat          9.1         9.5
dog          6.0        34.0
animals.groupby("kind")["height"].agg(
    min_height="min",
    max_height="max",
)
      min_height  max_height
kind
cat          9.1         9.5
dog          6.0        34.0
The same can be achieved with a dict:
animals.groupby("kind").agg({"height": [min], "weight": max}).rename(columns={"height": "高度", "min": "最小值"})
       高度  weight
      最小值     max
kind
cat    9.1     9.9
dog    6.0   198.0
Note: the func passed in can be a NumPy function or a member function of the GroupBy class; member functions can be referenced either by name as a string or passed directly.
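That note can be verified directly: a string naming a GroupBy method, a NumPy function, and a plain callable all produce the same aggregate (sketch on toy data):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, 2.0, 3.0, 4.0])
g = pd.Series(list("aabb"))

by_string = s.groupby(g).agg("sum")            # string name of a GroupBy method
by_numpy = s.groupby(g).agg(np.sum)            # NumPy function
by_callable = s.groupby(g).agg(lambda x: x.sum())  # plain callable
```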
Aggregating with a user-defined function
First, be clear about which data structure is returned in each case. groupby on a DataFrame returns a DataFrameGroupBy; after that: 1. selecting one column with ["column"] returns a SeriesGroupBy; 2. selecting with [["column"]] returns a DataFrameGroupBy. This mirrors column selection on a plain DataFrame.
animals.groupby("kind")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fea94016490>
animals.groupby("kind")[["weight"]]
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fea94016a00>
animals.groupby("kind")["weight"]
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fea94025460>
animals.groupby("kind")[["weight"]].agg(lambda x: set(x))
            weight
kind
cat     {9.9, 7.9}
dog   {198.0, 7.5}
Summary: 1. one function over multiple columns; 2. a different function per column; 3. several functions over one column; 4. Named Aggregation; 5. user-defined functions.
Transformation
Maps the data within each group. Unlike aggregation, transform does not reduce dimensionality; it maps every individual value.
index = pd.date_range("10/1/1999", periods=1100, freq="D")
ts = pd.Series(np.random.normal(0, 2, 1100), index=index)
ts.head()
1999-10-01 -0.215051
1999-10-02 -1.241018
1999-10-03 -3.258130
1999-10-04 -0.441858
1999-10-05 -2.059294
Freq: D, dtype: float64
grouped = ts.groupby(lambda x: x.year)
grouped.mean()
1999 -0.080469
2000 -0.009197
2001 0.030337
2002 0.165433
dtype: float64
grouped.var()
1999 4.574915
2000 3.531059
2001 4.272156
2002 4.283034
dtype: float64
transformer = grouped.transform(lambda x: (x - np.mean(x)) / np.std(x))
type(transformer)
pandas.core.series.Series
transformer
1999-10-01 -0.063266
1999-10-02 -0.545563
1999-10-03 -1.493789
1999-10-04 -0.169886
1999-10-05 -0.930227
...
2002-09-30 0.305898
2002-10-01 -2.120204
2002-10-02 0.328765
2002-10-03 -1.115765
2002-10-04 -0.054039
Freq: D, Length: 1100, dtype: float64
transformer.mean()
-3.087176978701998e-17
transformer.var()
1.0009099181073695
compare = pd.DataFrame({"original": ts, "Transformed": transformer})
compare.plot()
When necessary, transform will broadcast:
ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())
1999-10-01 11.571518
1999-10-02 11.571518
1999-10-03 11.571518
1999-10-04 11.571518
1999-10-05 11.571518
...
2002-09-30 12.554169
2002-10-01 12.554169
2002-10-02 12.554169
2002-10-03 12.554169
2002-10-04 12.554169
Freq: D, Length: 1100, dtype: float64
ts.groupby(lambda x: x.year).transform(lambda x: max(x))
1999-10-01 6.142405
1999-10-02 6.142405
1999-10-03 6.142405
1999-10-04 6.142405
1999-10-05 6.142405
...
2002-09-30 6.754179
2002-10-01 6.754179
2002-10-02 6.754179
2002-10-03 6.754179
2002-10-04 6.754179
Freq: D, Length: 1100, dtype: float64
Filling missing values with transform
df = pd.DataFrame([[np.nan, 2, 0, 0],
                   [3, 4, 0, 1],
                   [np.nan, np.nan, 1, 5],
                   [5, 3, 2, 4]],
                  columns=list("ABCD"))
df
     A    B  C  D
0  NaN  2.0  0  0
1  3.0  4.0  0  1
2  NaN  NaN  1  5
3  5.0  3.0  2  4
grouped = df.groupby([0, 0, 1, 1])
grouped.transform(lambda x: x.fillna(x.mean()))
     A    B  C  D
0  3.0  2.0  0  0
1  3.0  4.0  0  1
2  5.0  3.0  1  5
3  5.0  3.0  2  4
Rolling-window operations
df = pd.DataFrame({"A": [1] * 10 + [5] * 10, "B": np.arange(20)})
df
    A   B
0   1   0
1   1   1
2   1   2
3   1   3
4   1   4
5   1   5
6   1   6
7   1   7
8   1   8
9   1   9
10  5  10
11  5  11
12  5  12
13  5  13
14  5  14
15  5  15
16  5  16
17  5  17
18  5  18
19  5  19
df.groupby("A").rolling(4).B.mean()
A
1 0 NaN
1 NaN
2 NaN
3 1.5
4 2.5
5 3.5
6 4.5
7 5.5
8 6.5
9 7.5
5 10 NaN
11 NaN
12 NaN
13 11.5
14 12.5
15 13.5
16 14.5
17 15.5
18 16.5
19 17.5
Name: B, dtype: float64
df.groupby("A").expanding().sum()
           B
A
1 0      0.0
  1      1.0
  2      3.0
  3      6.0
  4     10.0
  5     15.0
  6     21.0
  7     28.0
  8     36.0
  9     45.0
5 10    10.0
  11    21.0
  12    33.0
  13    46.0
  14    60.0
  15    75.0
  16    91.0
  17   108.0
  18   126.0
  19   145.0
The resample function: resampling the data (fewer or more points)
df_re = pd.DataFrame(
    {
        "date": pd.date_range(start="2016-01-01", periods=4, freq="W"),
        "group": [1, 1, 2, 2],
        "val": [5, 6, 7, 8],
    }
).set_index("date")
df_re
            group  val
date
2016-01-03      1    5
2016-01-10      1    6
2016-01-17      2    7
2016-01-24      2    8
df_re.resample("2D").ffill()
            group  val
date
2016-01-03      1    5
2016-01-05      1    5
2016-01-07      1    5
2016-01-09      1    5
2016-01-11      1    6
2016-01-13      1    6
2016-01-15      1    6
2016-01-17      2    7
2016-01-19      2    7
2016-01-21      2    7
2016-01-23      2    7
Filtration
filter must be passed a func as its argument; the func receives each group of the DataFrameGroupBy in turn and must return a boolean.
dff = pd.DataFrame({"A": np.arange(8), "B": list("aaabbbbc")})
dff.groupby("B").filter(lambda x: len(x) >= 3)
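The predicate passed to filter can test anything that maps a group to a boolean, not just its length; for instance, keeping only groups whose mean of column A exceeds a threshold (the threshold here is illustrative):

```python
import pandas as pd

dff = pd.DataFrame({"A": range(8), "B": list("aaabbbbc")})

# keep whole groups where the group mean of A is above 2;
# group "a" (mean 1.0) is dropped, "b" (mean 4.5) and "c" (mean 7.0) survive
kept = dff.groupby("B").filter(lambda g: g["A"].mean() > 2)
```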
Dispatching
Here are some convenient instance-method dispatches on a SeriesGroupBy: std, nlargest, nsmallest.
df
    A   B
0   1   0
1   1   1
2   1   2
3   1   3
4   1   4
5   1   5
6   1   6
7   1   7
8   1   8
9   1   9
10  5  10
11  5  11
12  5  12
13  5  13
14  5  14
15  5  15
16  5  16
17  5  17
18  5  18
19  5  19
df.groupby("A").std()
df.groupby("A").agg(np.std)
df.groupby("A").agg(lambda x: x.std())
s = pd.Series([9, 8, 7, 5, 19, 1, 4.2, 3.3])
g = pd.Series(list("abababab"))
gb = s.groupby(g)
gb.nlargest(3)
a 4 19.0
0 9.0
2 7.0
b 1 8.0
3 5.0
7 3.3
dtype: float64
Flexible Apply
If the agg, filter, and transform operations introduced above don't meet your needs, consider the apply function.
s = pd.Series(np.random.rand(5))
s
0 0.309940
1 0.801719
2 0.225639
3 0.622519
4 0.545119
dtype: float64
def f(x):
    return pd.Series([x, x**2], index=["x", "x^2"])

s.apply(f)
          x       x^2
0  0.309940  0.096063
1  0.801719  0.642753
2  0.225639  0.050913
3  0.622519  0.387530
4  0.545119  0.297155
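The same idea applies to a GroupBy object: the function passed to apply receives each group's DataFrame and may return a scalar, Series, or DataFrame, and pandas combines the pieces. A sketch with an assumed toy frame (newer pandas versions may warn about the grouping column being included in each group):

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1.0, 2.0, 3.0]})

def span(g):
    # scalar per group: max minus min of the "val" column
    return g["val"].max() - g["val"].min()

out = df.groupby("key").apply(span)  # Series indexed by group key
```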