import pandas as pd
问题说明
数据中样本年份不连续,所以想办法知道每个样本的最大连续年份,比如说下面的df中A是3年,B是5年
df = pd. DataFrame( { 'id' : 'A' , 'year' : [ 2000 , 2001 , 2003 , 2005 , 2006 , 2007 ] } ) . append(
pd. DataFrame( { 'id' : 'B' , 'year' : [ 2001 , 2003 , 2004 , 2005 , 2006 , 2007 , 2009 ] } ) )
df
id year 0 A 2000 1 A 2001 2 A 2003 3 A 2005 4 A 2006 5 A 2007 0 B 2001 1 B 2003 2 B 2004 3 B 2005 4 B 2006 5 B 2007 6 B 2009
方法1:使用分组循环
参考 : https://www.jianshu.com/p/718eb7a83c8e
思路,通过标记前后两个年份差大于1的为1,然后记性累计加总,最后使用value_counts得到小组最多的
计算较慢,但是方便得到其他信息
% % time
for name, group in df. groupby( 'id' ) :
group[ 's' ] = ( ( group[ 'year' ] . shift( 1 ) . fillna( 0 ) + 1 ) . astype( int ) != group[ 'year' ] ) . cumsum( )
print ( 'group:{}' . format ( name) )
print ( group[ 's' ] . value_counts( ) )
group:A
3 3
1 2
2 1
Name: s, dtype: int64
group:B
2 5
3 1
1 1
Name: s, dtype: int64
Wall time: 1.99 s
方法2:使用agg聚合计算
使用agg,改善上一个方法计算太慢的缺陷
速度较大提升
% % time
df. groupby( 'id' ) . agg( lambda x: ( x. diff( ) . fillna( 0 ) - 1 ) . cumsum( ) . value_counts( ) . max ( ) )
Wall time: 13 ms
% % time
df_min_year = df. groupby( 'id' ) . agg( lambda x: x[ ( x. diff( ) . fillna( 0 ) - 1 ) . cumsum( ) == (
x. diff( ) . fillna( 0 ) - 1 ) . cumsum( ) . value_counts( ) . nlargest( 1 ) . index[ 0 ] ] . values. min ( ) )
df_max_year = df. groupby( 'id' ) . agg( lambda x: x[ ( x. diff( ) . fillna( 0 ) - 1 ) . cumsum( ) == (
x. diff( ) . fillna( 0 ) - 1 ) . cumsum( ) . value_counts( ) . nlargest( 1 ) . index[ 0 ] ] . values. max ( ) )
df_min_year. join( df_max_year, rsuffix= '_max' )
Wall time: 32 ms
year year_max id A 2005 2007 B 2003 2007
方法3:使用agg加numpy
咨询大佬的答案,速度更快但
% % time
df. groupby( 'id' ) . agg( lambda x: np. diff( np. flatnonzero( np. diff(
np. r_[ 0 , ( x. diff( ) . fillna( 1 ) - 1 ) == 0 , 0 ] ) ) . reshape( - 1 , 2 ) , axis= 1 ) . max ( ) + 1 )
Wall time: 9 ms
tips:apply和agg
对于df.groupby.apply(lambda x:fun(x)中x为dataframe,即对dataframe操作
对于df.groupby.agg(lambda x:fun(x)中x为(n-1)个series,n是列数量
使用内置函数速度大于自定义函数
df[ 'make' ] = 3
df
id year make 0 A 2000 3 1 A 2001 3 2 A 2003 3 3 A 2005 3 4 A 2006 3 5 A 2007 3 0 B 2001 3 1 B 2003 3 2 B 2004 3 3 B 2005 3 4 B 2006 3 5 B 2007 3 6 B 2009 3
df. groupby( 'id' ) . agg( lambda x: print ( x) )
0 2000
1 2001
2 2003
3 2005
4 2006
5 2007
Name: year, dtype: int64
0 2001
1 2003
2 2004
3 2005
4 2006
5 2007
6 2009
Name: year, dtype: int64
0 3
1 3
2 3
3 3
4 3
5 3
Name: make, dtype: int64
0 3
1 3
2 3
3 3
4 3
5 3
6 3
Name: make, dtype: int64
year make id A None None B None None
df. groupby( 'id' ) . apply ( lambda x: print ( x) )
id year make
0 A 2000 3
1 A 2001 3
2 A 2003 3
3 A 2005 3
4 A 2006 3
5 A 2007 3
id year make
0 A 2000 3
1 A 2001 3
2 A 2003 3
3 A 2005 3
4 A 2006 3
5 A 2007 3
id year make
0 B 2001 3
1 B 2003 3
2 B 2004 3
3 B 2005 3
4 B 2006 3
5 B 2007 3
6 B 2009 3
% % time
df. groupby( 'id' ) . agg( 'sum' )
Wall time: 6 ms
year make id A 12022 18 B 14035 21
% % time
df. groupby( 'id' ) . agg( lambda x: x. sum ( ) )
Wall time: 12 ms
year make id A 12022 18 B 14035 21
% % time
df. groupby( 'id' ) . apply ( lambda x: x. sum ( ) )
Wall time: 9.01 ms
id year make id A AAAAAA 12022 18 B BBBBBBB 14035 21