数据来源于网上,是用户在一家单车网站上的消费记录
【目录】
import pandas as pd
import numpy as np
columns= [ 'user_id' , 'order_dt' , 'order_products' , 'order_amount' ]
df= pd. read_table( 'bicycle_master.txt' , names= columns, sep= '\s+' )
order_dt:购买日期 order_products:购买产品数 order_amount:购买金额
df. info( )
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69659 entries, 0 to 69658
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 69659 non-null int64
1 order_dt 69659 non-null int64
2 order_products 69659 non-null int64
3 order_amount 69659 non-null float64
dtypes: float64(1), int64(3)
memory usage: 2.1 MB
df. head( )
user_id order_dt order_products order_amount 0 1 19970101 1 11.77 1 2 19970112 1 12.00 2 2 19970112 5 77.00 3 3 19970102 2 20.76 4 3 19970330 2 20.76
df. describe( )
user_id order_dt order_products order_amount count 69659.000000 6.965900e+04 69659.000000 69659.000000 mean 11470.854592 1.997228e+07 2.410040 35.893648 std 6819.904848 3.837735e+03 2.333924 36.281942 min 1.000000 1.997010e+07 1.000000 0.000000 25% 5506.000000 1.997022e+07 1.000000 14.490000 50% 11410.000000 1.997042e+07 2.000000 25.980000 75% 17273.000000 1.997111e+07 3.000000 43.700000 max 23570.000000 1.998063e+07 99.000000 1286.010000
大部分订单只消费了少量商品(平均2.4),有一定值干扰 用户的消费金额比较稳定,平均消费35元,中位数在25元,有一定极值的干扰
df[ 'order_dt' ] = pd. to_datetime( df. order_dt, format = '%Y%m%d' )
df[ 'month' ] = df. order_dt. astype( 'datetime64[M]' )
df. month
0 1997-01-01
1 1997-01-01
2 1997-01-01
3 1997-01-01
4 1997-03-01
...
69654 1997-04-01
69655 1997-04-01
69656 1997-03-01
69657 1997-03-01
69658 1997-03-01
Name: month, Length: 69659, dtype: datetime64[ns]
对时间进行解析,一开始默认的格式是 int64的,可以用 df.info() 来查看,默认会是 datetime64[ns] 类型,后面中括号表示 时间间隔是 ns 下面是对 order_dt 列(新版本可以不用取values直接用astype强制转换),转换类型为datetime64[M],默认就会是每月的第一天了,同理设置为[Y]就是每年的1月1日,然后生成新的一列 month
1.进行用户消费趋势的分析(按月) - 每月的消费总金额 - 每月的消费次数 - 每月的产品购买量 - 每月的消费人数
每月消费总金额
month_order_amount= df. groupby( 'month' ) . order_amount. sum ( )
month_order_amount
month
1997-01-01 299060.17
1997-02-01 379590.03
1997-03-01 393155.27
1997-04-01 142824.49
1997-05-01 107933.30
1997-06-01 108395.87
1997-07-01 122078.88
1997-08-01 88367.69
1997-09-01 81948.80
1997-10-01 89780.77
1997-11-01 115448.64
1997-12-01 95577.35
1998-01-01 76756.78
1998-02-01 77096.96
1998-03-01 108970.15
1998-04-01 66231.52
1998-05-01 70989.66
1998-06-01 76109.30
Name: order_amount, dtype: float64
每月消费次数
month_order_fq= df. groupby( 'month' ) . order_amount. count( )
month_order_fq
month
1997-01-01 8928
1997-02-01 11272
1997-03-01 11598
1997-04-01 3781
1997-05-01 2895
1997-06-01 3054
1997-07-01 2942
1997-08-01 2320
1997-09-01 2296
1997-10-01 2562
1997-11-01 2750
1997-12-01 2504
1998-01-01 2032
1998-02-01 2026
1998-03-01 2793
1998-04-01 1878
1998-05-01 1985
1998-06-01 2043
Name: order_amount, dtype: int64
每月产品购买量
month_order_products= df. groupby( 'month' ) . order_products. sum ( )
month_order_products
month
1997-01-01 19416
1997-02-01 24921
1997-03-01 26159
1997-04-01 9729
1997-05-01 7275
1997-06-01 7301
1997-07-01 8131
1997-08-01 5851
1997-09-01 5729
1997-10-01 6203
1997-11-01 7812
1997-12-01 6418
1998-01-01 5278
1998-02-01 5340
1998-03-01 7431
1998-04-01 4697
1998-05-01 4903
1998-06-01 5287
Name: order_products, dtype: int64
一次性取出每月消费总金额、每月消费次数、每月产品购买量多组数据
month_info= df. groupby( 'month' ) [ [ 'order_amount' , 'user_id' , 'order_products' ] ] . agg( { 'order_amount' : sum , 'user_id' : 'count' , 'order_products' : sum } )
month_info
order_amount user_id order_products month 1997-01-01 299060.17 8928 19416 1997-02-01 379590.03 11272 24921 1997-03-01 393155.27 11598 26159 1997-04-01 142824.49 3781 9729 1997-05-01 107933.30 2895 7275 1997-06-01 108395.87 3054 7301 1997-07-01 122078.88 2942 8131 1997-08-01 88367.69 2320 5851 1997-09-01 81948.80 2296 5729 1997-10-01 89780.77 2562 6203 1997-11-01 115448.64 2750 7812 1997-12-01 95577.35 2504 6418 1998-01-01 76756.78 2032 5278 1998-02-01 77096.96 2026 5340 1998-03-01 108970.15 2793 7431 1998-04-01 66231.52 1878 4697 1998-05-01 70989.66 1985 4903 1998-06-01 76109.30 2043 5287
month_info. rename( columns= { 'order_amount' : '月度消费总金额' , 'user_id' : '月度消费次数' , 'order_products' : '月度购买量' } , inplace= True )
month_info. head( )
月度消费总金额 月度消费次数 月度购买量 month 1997-01-01 299060.17 8928 19416 1997-02-01 379590.03 11272 24921 1997-03-01 393155.27 11598 26159 1997-04-01 142824.49 3781 9729 1997-05-01 107933.30 2895 7275
每月消费人数
month_info[ '月度消费人数' ] = df. groupby( 'month' ) . user_id. nunique( )
month_info. head( )
月度消费总金额 月度消费次数 月度购买量 月度消费人数 month 1997-01-01 299060.17 8928 19416 7846 1997-02-01 379590.03 11272 24921 9633 1997-03-01 393155.27 11598 26159 9524 1997-04-01 142824.49 3781 9729 2822 1997-05-01 107933.30 2895 7275 2214
month_info= month_info. reset_index( )
month_info[ 'month' ] = month_info. astype( str )
month_info. head( )
month 月度消费总金额 月度消费次数 月度购买量 月度消费人数 0 1997-01-01 299060.17 8928 19416 7846 1 1997-02-01 379590.03 11272 24921 9633 2 1997-03-01 393155.27 11598 26159 9524 3 1997-04-01 142824.49 3781 9729 2822 4 1997-05-01 107933.30 2895 7275 2214
month_info. to_excel( r'.\月度销售趋势分析.xlsx' )
去重的方法有多种,这里也可以使用数据库思想,df.groupby([‘month’,‘user_id’]).count().reset_index() 每月消费人数低于每月消费次数,但差异不大 前三个月每月的消费人数在8000-10000之间,后续月份,平均消费人数在2000不到
df. pivot_table( index= 'month' ,
values= [ 'order_amount' , 'user_id' , 'order_products' ] ,
aggfunc= { 'order_amount' : sum , 'user_id' : 'count' , 'order_products' : sum } ) . head( )
order_amount order_products user_id month 1997-01-01 299060.17 19416 8928 1997-02-01 379590.03 24921 11272 1997-03-01 393155.27 26159 11598 1997-04-01 142824.49 9729 3781 1997-05-01 107933.30 7275 2895
每月用户平均消费金额的趋势 每月用户平均消费次数的趋势
user_month_trends= df. groupby( 'month' ) [ [ 'order_amount' , 'order_products' ] ] . agg( np. mean)
user_month_trends. rename( columns= { 'order_amount' : '用户每月平均消费金额的趋势' , 'order_products' : '用户每月平均消费次数的趋势' } , inplace= True )
user_month_trends. head( )
用户每月平均消费金额的趋势 用户每月平均消费次数的趋势 month 1997-01-01 33.496883 2.174731 1997-02-01 33.675482 2.210877 1997-03-01 33.898540 2.255475 1997-04-01 37.774263 2.573129 1997-05-01 37.282660 2.512953
2.用户个体消费分析
用户消费金额,消费次数的描述统计(可绘制散点图) 用户消费金额的分布 用户消费次数的分布 用户累计消费金额占比(百分之多少的用户占了百分之多少的消费额)
df. columns
Index(['user_id', 'order_dt', 'order_products', 'order_amount', 'month'], dtype='object')
用户消费金额,消费次数的描述统计
group_user= df. groupby( 'user_id' )
group_user. sum ( ) . describe( )
order_products order_amount count 23570.000000 23570.000000 mean 7.122656 106.080426 std 16.983531 240.925195 min 1.000000 0.000000 25% 1.000000 19.970000 50% 3.000000 43.395000 75% 7.000000 106.475000 max 1033.000000 13990.930000
user_info= group_user. sum ( )
user_info. to_excel( r'.\用户个体消费行为分析.xlsx' )
用户消费金额的分布
group_user. order_amount. sum ( )
user_id
1 11.77
2 89.00
3 156.46
4 100.50
5 385.61
...
23566 36.00
23567 20.97
23568 121.70
23569 25.74
23570 94.08
Name: order_amount, Length: 23570, dtype: float64
group_user_order_amount_sum_lst= [ i for i in range ( 0 , int ( group_user. order_amount. sum ( ) . max ( ) + 50 ) , 50 ) ]
group_user_order_amount_sum= pd. cut( group_user. order_amount. sum ( ) ,
bins= group_user_order_amount_sum_lst,
labels= group_user_order_amount_sum_lst[ 1 : ] )
group_user_order_amount_sum
user_id
1 50
2 100
3 200
4 150
5 400
...
23566 50
23567 50
23568 150
23569 50
23570 100
Name: order_amount, Length: 23570, dtype: category
Categories (280, int64): [50 < 100 < 150 < 200 ... 13850 < 13900 < 13950 < 14000]
group_user_order_amount_sum. to_excel( r'.\用户消费金额的分布.xlsx' )
用户消费次数的分布
group_user. order_amount. count( )
user_id
1 1
2 2
3 6
4 4
5 11
..
23566 1
23567 1
23568 3
23569 1
23570 2
Name: order_amount, Length: 23570, dtype: int64
group_user_order_amount_count_lst= [ i for i in range ( 0 , int ( group_user. order_amount. count( ) . max ( ) + 2 ) , 2 ) ]
group_user_order_amount_count= pd. cut( group_user. order_amount. count( ) ,
bins= group_user_order_amount_count_lst,
labels= group_user_order_amount_count_lst[ 1 : ] )
group_user_order_amount_count
user_id
1 2
2 2
3 6
4 4
5 12
..
23566 2
23567 2
23568 4
23569 2
23570 2
Name: order_amount, Length: 23570, dtype: category
Categories (109, int64): [2 < 4 < 6 < 8 ... 212 < 214 < 216 < 218]
group_user_order_amount_count. to_excel( r'.\用户消费次数的分布.xlsx' )
用户累计消费金额占比
user_cumsum= group_user. sum ( ) . sort_values( 'order_amount' ) . apply ( lambda x: x. cumsum( ) / x. sum ( ) )
user_cumsum. reset_index( ) . order_amount
0 0.000000
1 0.000000
2 0.000000
3 0.000000
4 0.000000
...
23565 0.985405
23566 0.988025
23567 0.990814
23568 0.994404
23569 1.000000
Name: order_amount, Length: 23570, dtype: float64
user_cumsum. reset_index( ) . order_amount. to_excel( r'.\用户累计消费金额占比.xlsx' )
3.用户消费行为
用户第一次消费(首购) 用户最后一次消费 新老客户消费比 多少用户仅消费一次 每月新客占比 用户分层 RFM模型 新、老、活跃、回流、流失 用户购买周期(按订单) 用户消费周期描述 用户消费周期分布 用户生命周期(按第一次和最后一次消费) -用户生命周期描述 用户生命周期分布
用户第一次消费(首购)
group_user_dtmin= group_user. order_dt. min ( ) . value_counts( )
group_user_dtmin= group_user_dtmin. reset_index( ) . rename( columns= { 'index' : 'first_date' , 'order_dt' : 'quantities' } )
group_user_dtmin. first_date= group_user_dtmin. first_date. astype( str )
group_user_dtmin
first_date quantities 0 1997-02-08 363 1 1997-02-24 347 2 1997-02-04 346 3 1997-02-06 346 4 1997-03-04 340 ... ... ... 79 1997-01-08 213 80 1997-03-21 213 81 1997-01-07 211 82 1997-01-01 209 83 1997-01-04 174
84 rows × 2 columns
group_user_dtmin. to_excel( r'.\用户首购.xlsx' )
用户最后一次消费
group_user_maxdt= group_user. order_dt. max ( ) . value_counts( )
group_user_maxdt= group_user_maxdt. reset_index( ) . rename( columns= { 'index' : 'last_date' , 'order_dt' : 'quantities' } )
group_user_maxdt. last_date= group_user_maxdt. last_date. astype( str )
group_user_maxdt
last_date quantities 0 1997-02-08 221 1 1997-03-12 213 2 1997-02-04 210 3 1997-03-06 204 4 1997-02-27 202 ... ... ... 541 1997-07-04 4 542 1997-10-19 4 543 1997-07-18 4 544 1997-06-18 4 545 1997-07-13 4
546 rows × 2 columns
group_user_maxdt. to_excel( r'.\用户最后一次购买.xlsx' )
user_life= df. groupby( 'user_id' ) . order_dt. agg( [ min , max ] )
user_life. head( )
min max user_id 1 1997-01-01 1997-01-01 2 1997-01-12 1997-01-12 3 1997-01-02 1998-05-28 4 1997-01-01 1997-12-12 5 1997-01-01 1998-01-03
( user_life[ 'min' ] == user_life[ 'max' ] )
user_id
1 True
2 True
3 False
4 False
5 False
...
23566 True
23567 True
23568 False
23569 True
23570 False
Length: 23570, dtype: bool
( user_life[ 'min' ] == user_life[ 'max' ] ) . value_counts( )
True 12054
False 11516
dtype: int64
有一半以上的用户只消费了一次
每月新客占比
group_um= df. groupby( [ 'month' , 'user_id' ] ) . order_dt. agg( [ min , max ] )
group_um
min max month user_id 1997-01-01 1 1997-01-01 1997-01-01 2 1997-01-12 1997-01-12 3 1997-01-02 1997-01-02 4 1997-01-01 1997-01-18 5 1997-01-01 1997-01-14 ... ... ... ... 1998-06-01 23444 1998-06-28 1998-06-28 23489 1998-06-21 1998-06-21 23513 1998-06-14 1998-06-14 23555 1998-06-10 1998-06-10 23556 1998-06-07 1998-06-07
55379 rows × 2 columns
group_um[ 'new' ] = ( group_um[ 'min' ] == group_um[ 'max' ] )
group_um. reset_index( ) . groupby( 'month' ) . new. value_counts( )
month new
1997-01-01 True 7093
False 753
1997-02-01 True 8571
False 1062
1997-03-01 True 8154
False 1370
1997-04-01 True 2228
False 594
1997-05-01 True 1801
False 413
1997-06-01 True 1912
False 427
1997-07-01 True 1743
False 437
1997-08-01 True 1450
False 322
1997-09-01 True 1410
False 329
1997-10-01 True 1489
False 350
1997-11-01 True 1654
False 374
1997-12-01 True 1493
False 371
1998-01-01 True 1242
False 295
1998-02-01 True 1264
False 287
1998-03-01 True 1627
False 433
1998-04-01 True 1175
False 262
1998-05-01 True 1209
False 279
1998-06-01 True 1212
False 294
Name: new, dtype: int64
用户分层
rfm= df. pivot_table( index= 'user_id' ,
values= [ 'order_products' , 'order_amount' , 'order_dt' ] ,
aggfunc= { 'order_dt' : max , 'order_amount' : sum , 'order_products' : sum } )
rfm. head( )
order_amount order_dt order_products user_id 1 11.77 1997-01-01 1 2 89.00 1997-01-12 6 3 156.46 1998-05-28 16 4 100.50 1997-12-12 7 5 385.61 1998-01-03 29
rfm. order_dt. max ( )
Timestamp('1998-06-30 00:00:00')
rfm. order_dt- rfm. order_dt. max ( )
user_id
1 -545 days
2 -534 days
3 -33 days
4 -200 days
5 -178 days
...
23566 -462 days
23567 -462 days
23568 -434 days
23569 -462 days
23570 -461 days
Name: order_dt, Length: 23570, dtype: timedelta64[ns]
rfm[ 'R' ] = ( rfm. order_dt- rfm. order_dt. max ( ) ) / np. timedelta64( 1 , 'D' )
rfm. rename( columns= { 'order_products' : 'F' , 'order_amount' : 'M' } , inplace= True )
rfm. head( )
M order_dt F R user_id 1 11.77 1997-01-01 1 -545.0 2 89.00 1997-01-12 6 -534.0 3 156.46 1998-05-28 16 -33.0 4 100.50 1997-12-12 7 -200.0 5 385.61 1998-01-03 29 -178.0
rfm[ [ 'R' , 'F' , 'M' ] ] . apply ( lambda x: x- x. mean( ) )
R F M user_id 1 -177.778362 -6.122656 -94.310426 2 -166.778362 -1.122656 -17.080426 3 334.221638 8.877344 50.379574 4 167.221638 -0.122656 -5.580426 5 189.221638 21.877344 279.529574 ... ... ... ... 23566 -94.778362 -5.122656 -70.080426 23567 -94.778362 -6.122656 -85.110426 23568 -66.778362 -1.122656 15.619574 23569 -94.778362 -5.122656 -80.340426 23570 -93.778362 -2.122656 -12.000426
23570 rows × 3 columns
def rfm_func ( x) :
level= x. apply ( lambda x: '1' if x> 0 else '0' )
label= level. R+ level. F+ level. M
d= { '111' : '重要价值用户' ,
'011' : '重要保持用户' ,
'101' : '重要发展用户' ,
'001' : '重要挽留用户' ,
'110' : '一般价值用户' ,
'010' : '一般保持用户' ,
'100' : '一般发展用户' ,
'000' : '一般挽留用户' ,
}
result= d[ label]
return result
rfm[ 'label' ] = rfm[ [ 'R' , 'F' , 'M' ] ] . apply ( lambda x: x- x. mean( ) ) . apply ( rfm_func, axis= 1 )
M order_dt F R label color user_id 1 11.77 1997-01-01 1 -545.0 一般挽留用户 非重要价值客户 2 89.00 1997-01-12 6 -534.0 一般挽留用户 非重要价值客户 3 156.46 1998-05-28 16 -33.0 重要价值用户 非重要价值客户 4 100.50 1997-12-12 7 -200.0 一般发展用户 非重要价值客户 5 385.61 1998-01-03 29 -178.0 重要价值用户 非重要价值客户 ... ... ... ... ... ... ... 23566 36.00 1997-03-25 2 -462.0 一般挽留用户 非重要价值客户 23567 20.97 1997-03-25 1 -462.0 一般挽留用户 非重要价值客户 23568 121.70 1997-04-22 6 -434.0 重要挽留用户 非重要价值客户 23569 25.74 1997-03-25 2 -462.0 一般挽留用户 非重要价值客户 23570 94.08 1997-03-26 5 -461.0 一般挽留用户 非重要价值客户
23570 rows × 6 columns
rfm. loc[ rfm. label== '重要价值用户' , 'color' ] = '重要价值用户'
rfm. loc[ ~ ( rfm. label== '重要价值用户' ) , 'color' ] = '非重要价值用户'
rfm
M order_dt F R label color user_id 1 11.77 1997-01-01 1 -545.0 一般挽留用户 非重要价值用户 2 89.00 1997-01-12 6 -534.0 一般挽留用户 非重要价值用户 3 156.46 1998-05-28 16 -33.0 重要价值用户 重要价值用户 4 100.50 1997-12-12 7 -200.0 一般发展用户 非重要价值用户 5 385.61 1998-01-03 29 -178.0 重要价值用户 重要价值用户 ... ... ... ... ... ... ... 23566 36.00 1997-03-25 2 -462.0 一般挽留用户 非重要价值用户 23567 20.97 1997-03-25 1 -462.0 一般挽留用户 非重要价值用户 23568 121.70 1997-04-22 6 -434.0 重要挽留用户 非重要价值用户 23569 25.74 1997-03-25 2 -462.0 一般挽留用户 非重要价值用户 23570 94.08 1997-03-26 5 -461.0 一般挽留用户 非重要价值用户
23570 rows × 6 columns
rfm. to_excel( r'.\RFM模型.xlsx' )
rfm. groupby( 'label' ) . sum ( )
M F R label 一般价值用户 19937.45 1712 -29448.0 一般保持用户 7181.28 650 -36295.0 一般发展用户 196971.23 13977 -591108.0 一般挽留用户 438291.81 29346 -6951815.0 重要价值用户 1592039.62 107789 -517267.0 重要保持用户 167080.83 11121 -358363.0 重要发展用户 45785.01 2023 -56636.0 重要挽留用户 33028.40 1263 -114482.0
从RFM 分层可知,大部分用户是重要保持客户,但是这是由于极值的影响,所以 RFM 的划分标准应该以业务为准,也可以通过切比雪夫去除极值后求均值,并且 RFM 的各个划分标准可以都不一样
pivot_count= df. pivot_table( index= 'user_id' ,
columns= 'month' ,
values= 'order_dt' ,
aggfunc= 'count' ) . fillna( 0 )
pivot_count. head( )
month 1997-01-01 1997-02-01 1997-03-01 1997-04-01 1997-05-01 1997-06-01 1997-07-01 1997-08-01 1997-09-01 1997-10-01 1997-11-01 1997-12-01 1998-01-01 1998-02-01 1998-03-01 1998-04-01 1998-05-01 1998-06-01 user_id 1 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 4 2.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 5 2.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 2.0 1.0 0.0 0.0 0.0 0.0 0.0
df_purchase= pivot_count. applymap( lambda x: 1 if x> 0 else 0 )
df_purchase. head( )
month 1997-01-01 1997-02-01 1997-03-01 1997-04-01 1997-05-01 1997-06-01 1997-07-01 1997-08-01 1997-09-01 1997-10-01 1997-11-01 1997-12-01 1998-01-01 1998-02-01 1998-03-01 1998-04-01 1998-05-01 1998-06-01 user_id 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 4 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 5 1 1 0 1 1 1 1 0 1 0 0 1 1 0 0 0 0 0
len ( df_purchase. columns)
18
def active_status ( data) :
status= [ ]
for i in range ( len ( df_purchase. columns) ) :
if data[ i] == 0 :
if len ( status) > 0 :
if status[ i- 1 ] == 'unreg' :
status. append( 'unreg' )
else :
status. append( 'unactive' )
else :
status. append( 'unreg' )
else :
if len ( status) == 0 :
status. append( 'new' )
else :
if status[ i- 1 ] == 'unactive' :
status. append( 'return' )
elif status[ i- 1 ] == 'unreg' :
status. append( 'new' )
else :
status. append( 'acitve' )
return status
若本月没有消费,这里只是和上个月判断是否注册,有缺陷,可以判断是否存在就可以了
若之前是未注册,则依旧为未注册 若之前有消费,则为流失/不活跃 其他情况,为未注册
若本月有消费
若是第一次消费,则为新用户 如果之前有过消费,则上个月为不活跃,则为回流 如果上个月为未注册,则为新用户 初次之外,为活跃
return:回流
new:新客
unreg:未注册
active:活跃
purchase_status= df_purchase. apply ( lambda x: pd. Series( active_status( x) , index= df_purchase. columns) , axis= 1 )
purchase_status. head( )
month 1997-01-01 1997-02-01 1997-03-01 1997-04-01 1997-05-01 1997-06-01 1997-07-01 1997-08-01 1997-09-01 1997-10-01 1997-11-01 1997-12-01 1998-01-01 1998-02-01 1998-03-01 1998-04-01 1998-05-01 1998-06-01 user_id 1 new unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive 2 new unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive 3 new unactive return acitve unactive unactive unactive unactive unactive unactive return unactive unactive unactive unactive unactive return unactive 4 new unactive unactive unactive unactive unactive unactive return unactive unactive unactive return unactive unactive unactive unactive unactive unactive 5 new acitve unactive return acitve acitve acitve unactive return unactive unactive return acitve unactive unactive unactive unactive unactive
purchase_status. shape
(23570, 18)
purchase_status_count= purchase_status. replace( 'unreg' , np. NAN) . apply ( lambda x: pd. value_counts( x) )
purchase_status_count
month 1997-01-01 1997-02-01 1997-03-01 1997-04-01 1997-05-01 1997-06-01 1997-07-01 1997-08-01 1997-09-01 1997-10-01 1997-11-01 1997-12-01 1998-01-01 1998-02-01 1998-03-01 1998-04-01 1998-05-01 1998-06-01 acitve NaN 1157.0 1681 1773.0 852.0 747.0 746.0 604.0 528.0 532.0 624.0 632.0 512.0 472.0 571.0 518.0 459.0 446.0 new 7846.0 8476.0 7248 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN return NaN NaN 595 1049.0 1362.0 1592.0 1434.0 1168.0 1211.0 1307.0 1404.0 1232.0 1025.0 1079.0 1489.0 919.0 1029.0 1060.0 unactive NaN 6689.0 14046 20748.0 21356.0 21231.0 21390.0 21798.0 21831.0 21731.0 21542.0 21706.0 22033.0 22019.0 21510.0 22133.0 22082.0 22064.0
return_rate= purchase_status_count. apply ( lambda x: x/ x. sum ( ) , axis= 0 )
return_rate
month 1997-01-01 1997-02-01 1997-03-01 1997-04-01 1997-05-01 1997-06-01 1997-07-01 1997-08-01 1997-09-01 1997-10-01 1997-11-01 1997-12-01 1998-01-01 1998-02-01 1998-03-01 1998-04-01 1998-05-01 1998-06-01 acitve NaN 0.070886 0.071319 0.075223 0.036148 0.031693 0.03165 0.025626 0.022401 0.022571 0.026474 0.026814 0.021723 0.020025 0.024226 0.021977 0.019474 0.018922 new 1.0 0.519299 0.307510 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN return NaN NaN 0.025244 0.044506 0.057785 0.067543 0.06084 0.049555 0.051379 0.055452 0.059567 0.052270 0.043487 0.045779 0.063174 0.038990 0.043657 0.044972 unactive NaN 0.409815 0.595927 0.880272 0.906067 0.900764 0.90751 0.924820 0.926220 0.921977 0.913958 0.920916 0.934790 0.934196 0.912601 0.939033 0.936869 0.936105
purchase_status_count_info= purchase_status_count. fillna( 0 ) . T
purchase_status_count_info. reset_index( 'month' ) . astype( str )
purchase_status_count_info
acitve new return unactive month 1997-01-01 0.0 7846.0 0.0 0.0 1997-02-01 1157.0 8476.0 0.0 6689.0 1997-03-01 1681.0 7248.0 595.0 14046.0 1997-04-01 1773.0 0.0 1049.0 20748.0 1997-05-01 852.0 0.0 1362.0 21356.0 1997-06-01 747.0 0.0 1592.0 21231.0 1997-07-01 746.0 0.0 1434.0 21390.0 1997-08-01 604.0 0.0 1168.0 21798.0 1997-09-01 528.0 0.0 1211.0 21831.0 1997-10-01 532.0 0.0 1307.0 21731.0 1997-11-01 624.0 0.0 1404.0 21542.0 1997-12-01 632.0 0.0 1232.0 21706.0 1998-01-01 512.0 0.0 1025.0 22033.0 1998-02-01 472.0 0.0 1079.0 22019.0 1998-03-01 571.0 0.0 1489.0 21510.0 1998-04-01 518.0 0.0 919.0 22133.0 1998-05-01 459.0 0.0 1029.0 22082.0 1998-06-01 446.0 0.0 1060.0 22064.0
purchase_status_count_info. to_excel( r'.\用户分层-新、活跃、流失、回流.xlsx' )
purchase_status_count_T= purchase_status_count. fillna( 0 ) . T. apply ( lambda x: x/ x. sum ( ) , axis= 1 )
purchase_status_count_T
acitve new return unactive month order_dt 1997-01-01 0.000000 1.000000 0.000000 0.000000 1997-02-01 0.070886 0.519299 0.000000 0.409815 1997-03-01 0.071319 0.307510 0.025244 0.595927 1997-04-01 0.075223 0.000000 0.044506 0.880272 1997-05-01 0.036148 0.000000 0.057785 0.906067 1997-06-01 0.031693 0.000000 0.067543 0.900764 1997-07-01 0.031650 0.000000 0.060840 0.907510 1997-08-01 0.025626 0.000000 0.049555 0.924820 1997-09-01 0.022401 0.000000 0.051379 0.926220 1997-10-01 0.022571 0.000000 0.055452 0.921977 1997-11-01 0.026474 0.000000 0.059567 0.913958 1997-12-01 0.026814 0.000000 0.052270 0.920916 1998-01-01 0.021723 0.000000 0.043487 0.934790 1998-02-01 0.020025 0.000000 0.045779 0.934196 1998-03-01 0.024226 0.000000 0.063174 0.912601 1998-04-01 0.021977 0.000000 0.038990 0.939033 1998-05-01 0.019474 0.000000 0.043657 0.936869 1998-06-01 0.018922 0.000000 0.044972 0.936105
purchase_status_count_T/ purchase_status_count_T. shift( )
acitve new return unactive month order_dt 1997-01-01 NaN NaN NaN NaN 1997-02-01 inf 0.519299 NaN inf 1997-03-01 1.006116 0.592163 inf 1.454137 1997-04-01 1.054729 0.000000 1.763025 1.477147 1997-05-01 0.480541 NaN 1.298379 1.029304 1997-06-01 0.876761 NaN 1.168869 0.994147 1997-07-01 0.998661 NaN 0.900754 1.007489 1997-08-01 0.809651 NaN 0.814505 1.019074 1997-09-01 0.874172 NaN 1.036815 1.001514 1997-10-01 1.007576 NaN 1.079273 0.995419 1997-11-01 1.172932 NaN 1.074216 0.991303 1997-12-01 1.012821 NaN 0.877493 1.007613 1998-01-01 0.810127 NaN 0.831981 1.015065 1998-02-01 0.921875 NaN 1.052683 0.999365 1998-03-01 1.209746 NaN 1.379981 0.976884 1998-04-01 0.907180 NaN 0.617193 1.028963 1998-05-01 0.886100 NaN 1.119695 0.997696 1998-06-01 0.971678 NaN 1.030126 0.999185
4.用户周期
order_diff= df. groupby( 'user_id' ) . apply ( lambda x: x. order_dt- x. order_dt. shift( ) )
order_diff. head( 20 )
user_id
1 0 NaT
2 1 NaT
2 0 days
3 3 NaT
4 87 days
5 3 days
6 227 days
7 10 days
8 184 days
4 9 NaT
10 17 days
11 196 days
12 132 days
5 13 NaT
14 13 days
15 21 days
16 66 days
17 50 days
18 16 days
19 36 days
Name: order_dt, dtype: timedelta64[ns]
order_diff. describe( )
count 46089
mean 68 days 23:22:13.567662566
std 91 days 00:47:33.924168893
min 0 days 00:00:00
25% 10 days 00:00:00
50% 31 days 00:00:00
75% 89 days 00:00:00
max 533 days 00:00:00
Name: order_dt, dtype: object
order_diff. reset_index( )
user_id level_1 order_dt 0 1 0 NaT 1 2 1 NaT 2 2 2 0 days 3 3 3 NaT 4 3 4 87 days ... ... ... ... 69654 23568 69654 11 days 69655 23568 69655 17 days 69656 23569 69656 NaT 69657 23570 69657 NaT 69658 23570 69658 1 days
69659 rows × 3 columns
order_diff
user_id
1 0 NaT
2 1 NaT
2 0 days
3 3 NaT
4 87 days
...
23568 69654 11 days
69655 17 days
23569 69656 NaT
23570 69657 NaT
69658 1 days
Name: order_dt, Length: 69659, dtype: timedelta64[ns]
( order_diff/ np. timedelta64( 1 , 'D' ) )
user_id
1 0 NaN
2 1 NaN
2 0.0
3 3 NaN
4 87.0
...
23568 69654 11.0
69655 17.0
23569 69656 NaN
23570 69657 NaN
69658 1.0
Name: order_dt, Length: 69659, dtype: float64
order_diff_info = ( order_diff/ np. timedelta64( 1 , 'D' ) )
order_diff_info_lst= [ i for i in range ( 0 , int ( order_diff_info. max ( ) + 1 ) , 10 ) ]
order_diff_info_hist= pd. cut( order_diff_info, bins= order_diff_info_lst, labels= order_diff_info_lst[ 1 : ] )
order_diff_info_hist
user_id
1 0 NaN
2 1 NaN
2 NaN
3 3 NaN
4 90
...
23568 69654 20
69655 20
23569 69656 NaN
23570 69657 NaN
69658 10
Name: order_dt, Length: 69659, dtype: category
Categories (53, int64): [10 < 20 < 30 < 40 ... 500 < 510 < 520 < 530]
order_diff_info_hist= order_diff_info_hist. fillna( 10 )
order_diff_info_hist. to_excel( r'.\用户购买周期时间差频率直方图.xlsx' )
订单周期呈指数分布 用户的平均购买周期是68天 绝大部分用户的购买周期都低于100天
user_life
min max user_id 1 1997-01-01 1997-01-01 2 1997-01-12 1997-01-12 3 1997-01-02 1998-05-28 4 1997-01-01 1997-12-12 5 1997-01-01 1998-01-03 ... ... ... 23566 1997-03-25 1997-03-25 23567 1997-03-25 1997-03-25 23568 1997-03-25 1997-04-22 23569 1997-03-25 1997-03-25 23570 1997-03-25 1997-03-26
23570 rows × 2 columns
( user_life[ 'max' ] - user_life[ 'min' ] ) . describe( )
count 23570
mean 134 days 20:55:36.987696224
std 180 days 13:46:43.039788104
min 0 days 00:00:00
25% 0 days 00:00:00
50% 0 days 00:00:00
75% 294 days 00:00:00
max 544 days 00:00:00
dtype: object
( user_life[ 'max' ] - user_life[ 'min' ] )
user_id
1 0 days
2 0 days
3 511 days
4 345 days
5 367 days
...
23566 0 days
23567 0 days
23568 28 days
23569 0 days
23570 1 days
Length: 23570, dtype: timedelta64[ns]
user_life_info = ( ( user_life[ 'max' ] - user_life[ 'min' ] ) / np. timedelta64( 1 , "D" ) )
user_life_lst = [ i for i in range ( 0 , int ( user_life_info. max ( ) ) + 1 , 10 ) ]
user_life_info_hist = pd. cut( user_life_info, bins= user_life_lst, labels= user_life_lst[ 1 : ] )
user_life_info_hist
user_id
1 NaN
2 NaN
3 520
4 350
5 370
...
23566 NaN
23567 NaN
23568 30
23569 NaN
23570 10
Length: 23570, dtype: category
Categories (54, int64): [10 < 20 < 30 < 40 ... 510 < 520 < 530 < 540]
user_life_info_hist_2 = user_life_info_hist. fillna( 10 )
user_life_info_hist_2. to_excel( r'.\用户生命周期频率直方图.xlsx' )
用户的生命周期受只购买一次的用户影响比较大 用户均消费134天,中位数仅0天
user_life[ "差值" ] = ( user_life[ "max" ] - user_life[ "min" ] )
user_life. head( )
min max 差值 user_id 1 1997-01-01 1997-01-01 0 days 2 1997-01-12 1997-01-12 0 days 3 1997-01-02 1998-05-28 511 days 4 1997-01-01 1997-12-12 345 days 5 1997-01-01 1998-01-03 367 days
user_life[ "差值" ] = ( user_life[ "max" ] - user_life[ "min" ] ) / np. timedelta64( 1 , "D" )
user_life. head( 5 )
min max 差值 user_id 1 1997-01-01 1997-01-01 0.0 2 1997-01-12 1997-01-12 0.0 3 1997-01-02 1998-05-28 511.0 4 1997-01-01 1997-12-12 345.0 5 1997-01-01 1998-01-03 367.0
user_life_info_hist
user_id
1 NaN
2 NaN
3 520
4 350
5 370
...
23566 NaN
23567 NaN
23568 30
23569 NaN
23570 10
Length: 23570, dtype: category
Categories (54, int64): [10 < 20 < 30 < 40 ... 510 < 520 < 530 < 540]
user_life_info_hist. to_excel( r'.\用户生命周期频率直方图(忽略一次购买).xlsx' )
5.复购率和回购率分析
复购率
自然月内,购买多次的用户占比(即,购买了两次以上)
pivot_count. head( )
month 1997-01-01 1997-02-01 1997-03-01 1997-04-01 1997-05-01 1997-06-01 1997-07-01 1997-08-01 1997-09-01 1997-10-01 1997-11-01 1997-12-01 1998-01-01 1998-02-01 1998-03-01 1998-04-01 1998-05-01 1998-06-01 user_id 1 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 4 2.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 5 2.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 2.0 1.0 0.0 0.0 0.0 0.0 0.0
purchase_r= pivot_count. applymap( lambda x: 1 if x> 1 else ( 0 if x== 1 else np. NaN) )
purchase_r. head( )
month 1997-01-01 1997-02-01 1997-03-01 1997-04-01 1997-05-01 1997-06-01 1997-07-01 1997-08-01 1997-09-01 1997-10-01 1997-11-01 1997-12-01 1998-01-01 1998-02-01 1998-03-01 1998-04-01 1998-05-01 1998-06-01 user_id 1 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3 0.0 NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN 0.0 NaN 4 1.0 NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN 5 1.0 0.0 NaN 0.0 0.0 0.0 0.0 NaN 0.0 NaN NaN 1.0 0.0 NaN NaN NaN NaN NaN
purchase_r_reshop= ( purchase_r. sum ( ) / purchase_r. count( ) ) . reset_index( name= 'reshop' )
purchase_r_reshop
month reshop 0 1997-01-01 0.107571 1 1997-02-01 0.122288 2 1997-03-01 0.155292 3 1997-04-01 0.223600 4 1997-05-01 0.196929 5 1997-06-01 0.195810 6 1997-07-01 0.215138 7 1997-08-01 0.200339 8 1997-09-01 0.202415 9 1997-10-01 0.206634 10 1997-11-01 0.202170 11 1997-12-01 0.219957 12 1998-01-01 0.210800 13 1998-02-01 0.203095 14 1998-03-01 0.229612 15 1998-04-01 0.199026 16 1998-05-01 0.200269 17 1998-06-01 0.214475
purchase_r_reshop[ 'month' ] = purchase_r_reshop[ 'month' ] . astype( str )
purchase_r_reshop. to_excel( r'.\复购人数与总消费人数比例.xlsx' )
回购率
曾经购买过的用户在某一时期的再次购买的占比(可能是在三个月内)
df_purchase. head( )
month 1997-01-01 1997-02-01 1997-03-01 1997-04-01 1997-05-01 1997-06-01 1997-07-01 1997-08-01 1997-09-01 1997-10-01 1997-11-01 1997-12-01 1998-01-01 1998-02-01 1998-03-01 1998-04-01 1998-05-01 1998-06-01 user_id 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 4 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 5 1 1 0 1 1 1 1 0 1 0 0 1 1 0 0 0 0 0
def purchase_back ( data) :
'''判断每一个月是否是回购,根据上个月是否购买来判断,上个月消费下个月没有购买就不是回购'''
status= [ ]
for i in range ( 17 ) :
if data[ i] == 1 :
if data[ i+ 1 ] == 1 :
status. append( 1 )
if data[ i+ 1 ] == 0 :
status. append( 0 )
else :
status. append( np. NaN)
status. append( np. NaN)
return status
index= df[ 'month' ] . sort_values( ) . astype( 'str' ) . unique( )
purchase_b= df_purchase. apply ( lambda x: pd. Series( purchase_back( x) , index= index) , axis= 1 )
purchase_b. head( )
1997-01-01 1997-02-01 1997-03-01 1997-04-01 1997-05-01 1997-06-01 1997-07-01 1997-08-01 1997-09-01 1997-10-01 1997-11-01 1997-12-01 1998-01-01 1998-02-01 1998-03-01 1998-04-01 1998-05-01 1998-06-01 user_id 1 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3 0.0 NaN 1.0 0.0 NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN 0.0 NaN 4 0.0 NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN 5 1.0 0.0 NaN 1.0 1.0 1.0 0.0 NaN 0.0 NaN NaN 1.0 0.0 NaN NaN NaN NaN NaN
purchase_b_backshop= ( purchase_b. sum ( ) / purchase_b. count( ) )
purchase_b_backshop
1997-01-01 0.147464
1997-02-01 0.174504
1997-03-01 0.186161
1997-04-01 0.301914
1997-05-01 0.337398
1997-06-01 0.318940
1997-07-01 0.277064
1997-08-01 0.297968
1997-09-01 0.305923
1997-10-01 0.339315
1997-11-01 0.311637
1997-12-01 0.274678
1998-01-01 0.307092
1998-02-01 0.368150
1998-03-01 0.251456
1998-04-01 0.319415
1998-05-01 0.299731
1998-06-01 NaN
dtype: float64
purchase_b_backshop. to_excel( r'.\回购率.xlsx' )
总结
1.月度消费统计情况
消费金额在前三个月达到最高峰,后续消费较为稳定,有轻微下降趋势;
产品购买量在前三个月达到最高峰,后续消费较为稳定,有轻微下降趋势;
前三个月消费订单人数在10000笔左右,后续月份的平均消费人数则在2500人
2.用户消费金额与频次情况
用户平均消费106元,中位值有43,说明小部分用户购买了大量货物
从直方图可知,用户消费金额,绝大部分呈现集中趋势,小部分异常值干扰了判断,可以使用过滤操作排除异常
按照用户消费金额进行升序排序,由图可以知道50%的用户仅贡献了11%的消费额度,80%的用户仅贡献了32%的消费额度,排名前5000的用户就贡献了60%的消费额度
3.用户消费行为
1997年3月下旬出现用户数量的断崖式下跌,大部分最后一次购买,集中在前三个月,说明很多用户购买了一次后就不再进行购买
有一半的用户,只消费了一次
随着时间的递增,最后一次购买数量也在递增,消费呈现流失上升的状况(这也是正常,随着时间的增长,可能运营每跟上,或者用户忠诚度下降了)
从RFM 分层可知,大部分用户是重要保持客户
4.用户周期
用户的购买周期呈指数分布,用户的平均购买周期是68天,绝大部分用户的购买周期都低于100天
用户的生命周期受只购买一次的用户影响比较显著,排除只购买一次用户后,用户平均生命周期134天
5.复购率与和回购率
复购率稳定在20%左右,前一个月因为有大量新用户涌入,而这批用户只购买了一次,所以导致复购率降低
前三个月回购率呈上升趋势,随着用户数量大幅下跌,回购率后期波动较大