用户购买CD消费行为分析

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['font.family'] = 'SimHei'              # 解决不能输出中文的问题。不区分大小写,即SimHei’效果等价于‘simhei’,中括号可以不要
plt.rcParams['figure.autolayout'] = True              # 解决不能完整显示的问题(比如因为饼图太大,显示窗口太小)
import seaborn as sns
columns=['user_id','order_dt','order_products','order_amount']
df = pd.read_table('CDNOW_master.txt',names=columns,sep='\s+',parse_dates=['order_dt'],infer_datetime_format=True)
  • user_id:用户id
  • order_dt:订单日期
  • order_products:购买产品数
  • order_amount:购买金额
df.head()
user_idorder_dtorder_productsorder_amount
011997-01-01111.77
121997-01-12112.00
221997-01-12577.00
331997-01-02220.76
431997-03-30220.76
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69659 entries, 0 to 69658
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   user_id         69659 non-null  int64         
 1   order_dt        69659 non-null  datetime64[ns]
 2   order_products  69659 non-null  int64         
 3   order_amount    69659 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 2.1 MB
df.describe()
user_idorder_productsorder_amount
count69659.00000069659.00000069659.000000
mean11470.8545922.41004035.893648
std6819.9048482.33392436.281942
min1.0000001.0000000.000000
25%5506.0000001.00000014.490000
50%11410.0000002.00000025.980000
75%17273.0000003.00000043.700000
max23570.00000099.0000001286.010000
  • 大部分订单只消费少量商品(平均2.4),存在极值干扰
  • 用户的消费金额存在右偏分布,平均消费35.8,中位数为25.9
#第二种转换为时间格式的方法:df['order_dt'] = pd.to_datetime(df.order_dt,format='%Y%m%d')
df['month'] = df.order_dt.values.astype('datetime64[M]')
df
user_idorder_dtorder_productsorder_amountmonth
011997-01-01111.771997-01-01
121997-01-12112.001997-01-01
221997-01-12577.001997-01-01
331997-01-02220.761997-01-01
431997-03-30220.761997-03-01
..................
69654235681997-04-05483.741997-04-01
69655235681997-04-22114.991997-04-01
69656235691997-03-25225.741997-03-01
69657235701997-03-25351.121997-03-01
69658235701997-03-26242.961997-03-01

69659 rows × 5 columns

df.month.value_counts()
1997-03-01    11598
1997-02-01    11272
1997-01-01     8928
1997-04-01     3781
1997-06-01     3054
1997-07-01     2942
1997-05-01     2895
1998-03-01     2793
1997-11-01     2750
1997-10-01     2562
1997-12-01     2504
1997-08-01     2320
1997-09-01     2296
1998-06-01     2043
1998-01-01     2032
1998-02-01     2026
1998-05-01     1985
1998-04-01     1878
Name: month, dtype: int64

1.进行用户消费趋势的分析(按月)

  • 每月的消费总金额
  • 每月的消费次数
  • 每月的产品购买量
  • 每月的消费人数

1.1每月的消费总金额

grouped_month = df.groupby('month')
order_month_amount = grouped_month.order_amount.sum()
order_month_amount.head()
month
1997-01-01    299060.17
1997-02-01    379590.03
1997-03-01    393155.27
1997-04-01    142824.49
1997-05-01    107933.30
Name: order_amount, dtype: float64
order_month_amount.plot()

在这里插入图片描述

由上图可知,前两个月消费金额达到顶峰,三月开始下降,四月后消费金额较为稳定,有略微下降趋势

1.2每月的消费次数

grouped_month.user_id.count().plot()

在这里插入图片描述

前三个月 消费次数 在10000次左右,后续月份在2500左右

1.3每月的产品购买量

grouped_month.order_products.sum().plot()

在这里插入图片描述

第一季度产品购买数在20000件左右,后续开始从7500到5000缓慢下降

1.4每月的消费人数

grouped_month.user_id.apply(lambda x:len(x.drop_duplicates())).plot()

在这里插入图片描述

每月消费人数比消费次数略微下降,但差别不大
前三个月的每月消费人数在8000~10000之间,后续月份在2000左右

df.groupby(['month','user_id']).count().reset_index().groupby('month').user_id.count() == grouped_month.user_id.apply(lambda x:len(x.drop_duplicates()))
month
1997-01-01    True
1997-02-01    True
1997-03-01    True
1997-04-01    True
1997-05-01    True
1997-06-01    True
1997-07-01    True
1997-08-01    True
1997-09-01    True
1997-10-01    True
1997-11-01    True
1997-12-01    True
1998-01-01    True
1998-02-01    True
1998-03-01    True
1998-04-01    True
1998-05-01    True
1998-06-01    True
Name: user_id, dtype: bool
grouped_month1 = pd.pivot_table(df,index='month',
                               values=['user_id','order_products','order_amount'],
                               aggfunc={'order_amount':'sum',
                                       'user_id':'count',
                                       'order_products':'sum'}
                              )
grouped_month1['user_num'] = df.groupby('month').user_id.apply(lambda x:len(x.drop_duplicates()))
grouped_month1
order_amountorder_productsuser_iduser_num
month
1997-01-01299060.171941689287846
1997-02-01379590.0324921112729633
1997-03-01393155.2726159115989524
1997-04-01142824.49972937812822
1997-05-01107933.30727528952214
1997-06-01108395.87730130542339
1997-07-01122078.88813129422180
1997-08-0188367.69585123201772
1997-09-0181948.80572922961739
1997-10-0189780.77620325621839
1997-11-01115448.64781227502028
1997-12-0195577.35641825041864
1998-01-0176756.78527820321537
1998-02-0177096.96534020261551
1998-03-01108970.15743127932060
1998-04-0166231.52469718781437
1998-05-0170989.66490319851488
1998-06-0176109.30528720431506
plt.rcParams['font.family'] = 'SimHei'              # 解决不能输出中文的问题。不区分大小写,即SimHei’效果等价于‘simhei’,中括号可以不要
plt.rcParams['figure.autolayout'] = True              # 解决不能完整显示的问题(比如因为饼图太大,显示窗口太小)

fig,axes = plt.subplots(3,1,figsize=(8,12))
for i in range(2):
    ax=axes[i]
    ax.plot(grouped_month1.iloc[:,i])
ax=axes[2]
ax.plot(grouped_month1.user_id,label='消费次数')
ax.plot(grouped_month1.user_num,label='消费人数')
ax.legend()

在这里插入图片描述

2.用户个体消费分析

  • 用户消费金额、消费次数、产品购买量的描述性统计
  • 用户消费金额和产品购买量的散点图
  • 用户产品购买量的分布图
  • 用户累计消费金额占比(百分之多少的用户占了百分之多少的消费额)

2.1用户消费金额、消费次数、产品购买量的描述性统计

grouped_user1 = pd.pivot_table(df,index='user_id',
                               values=['order_dt','order_products','order_amount'],
                               aggfunc={'order_amount':'sum',
                                       'order_dt':'count',
                                       'order_products':'sum'}
                              )
grouped_user1.rename(columns={'order_dt':'order_num'},inplace=True)
grouped_user1.describe()
order_amountorder_numorder_products
count23570.00000023570.00000023570.000000
mean106.0804262.9554097.122656
std240.9251954.73655816.983531
min0.0000001.0000001.000000
25%19.9700001.0000001.000000
50%43.3950001.0000003.000000
75%106.4750003.0000007.000000
max13990.930000217.0000001033.000000
grouped_user = df.groupby('user_id')
grouped_user.sum().describe([0.1,0.25,0.5,0.75,0.9,0.95])
order_productsorder_amount
count23570.00000023570.000000
mean7.122656106.080426
std16.983531240.925195
min1.0000000.000000
10%1.00000012.970000
25%1.00000019.970000
50%3.00000043.395000
75%7.000000106.475000
90%16.000000242.332000
95%26.000000380.923500
max1033.00000013990.930000

消费金额的平均值为106元,而中位数为43,右偏分布,说明小部分用户消费了大部分金额
用户的平均消费次数约为3次,而一半用户消费次数为1次
同理,用户平均购买了7张CD,中位数为3,有极值干扰

2.2用户消费金额和产品购买量的散点图

grouped_user.sum().plot.scatter(x='order_amount',y='order_products')

在这里插入图片描述

grouped_user.sum().query('order_amount < 4000').plot.scatter(x='order_amount',y='order_products')

在这里插入图片描述

grouped_user1.query('order_amount<6000').plot.scatter(x='order_amount',y='order_products')

在这里插入图片描述

sns.jointplot(grouped_user1.query('order_amount<829').order_products,
              grouped_user1.query('order_amount<829').order_amount, 
              kind='reg')

在这里插入图片描述

2.3用户产品购买量的分布图

grouped_user.sum().order_products.plot.hist(bins=20)

在这里插入图片描述

从产品消费数量直方图可知,产品消费数量,绝大部分呈现集中趋势,小部分异常值干扰了判断,通过过滤操作排除异常

可以根据描述性统计过滤掉异常值,比如计算产品消费数量的95%数据分布情况,即过滤超过26之后的值

grouped_user.sum().query('order_products<30').order_products.plot.hist(bins=15)

在这里插入图片描述

grouped_user1.query('order_amount<400').order_amount.plot.hist(bins=20)

在这里插入图片描述

2.4用户累计消费金额占比(百分之多少的用户占了百分之多少的消费额)

user_cumsum = grouped_user1.sort_values(by='order_amount').apply(lambda x:x.cumsum()/x.sum()).reset_index()
user_cumsum
user_idorder_amountorder_numorder_products
0101750.0000000.0000140.000006
145590.0000000.0000290.000012
219480.0000000.0000430.000018
39250.0000000.0000570.000024
4107980.0000000.0000720.000030
...............
2356579310.9854050.9910560.982940
23566193390.9880250.9918600.985192
2356779830.9908140.9939990.988385
23568140480.9944040.9971150.994538
2356975921.0000001.0000001.000000

23570 rows × 4 columns

user_cumsum.order_amount.plot(xticks=range(0,23570,2000))

在这里插入图片描述

#订单量的累计百分比
user_cumsum.order_num.plot(xticks=range(0,23570,2000))

在这里插入图片描述

按照用户消费金额升序排列,50%的用户仅贡献了10%左右的消费金额,而前3570名用户(15%)贡献了60%的消费额,以及将近50%订单量

3.用户行为分析

  • 用户第一次消费
  • 用户最后一次消费
  • 新老客户消费比
    • 多少客户仅消费了一次
    • 每月新客占比
  • 用户分层
    • RFM用户分层
    • 用户状态分析:注册、活跃、回流、流失(不活跃)
  • 用户购买周期(按订单)
    • 用户消费周期描述
    • 用户消费周期分布
  • 用户生命周期(按第一次&最后一次消费)
    • 用户生命周期描述
    • 用户生命周期分布

3.1用户第一次消费

grouped_user.order_dt.min().value_counts()
1997-02-08    363
1997-02-24    347
1997-02-04    346
1997-02-06    346
1997-03-04    340
             ... 
1997-01-08    213
1997-03-21    213
1997-01-07    211
1997-01-01    209
1997-01-04    174
Name: order_dt, Length: 84, dtype: int64
#用户第一次消费
grouped_user.order_dt.min().value_counts().plot()

在这里插入图片描述

新增用户集中在前三个月,后续没有新用户进入
在1997-02-15号左右两周时间存在‘W’型剧烈波动

3.2用户最后一次消费

#用户最后一次消费
grouped_user.order_dt.max().value_counts().plot()

在这里插入图片描述

大部分用户最后一次购买集中在前三月,说明用户多购买一次就不在此继续消费
从3月开始消费用户存在断崖式下跌,随着时间的递增,最后一次购买数也在增加,用户流失呈上升趋势(当然也有可能是因为活动,购买数上升,需结合具体情况分析)

3.3新老客户消费比

3.3.1多少客户仅消费了一次
#多少客户仅消费了一次
user_life = grouped_user.order_dt.agg(['min','max'])
user_life
minmax
user_id
11997-01-011997-01-01
21997-01-121997-01-12
31997-01-021998-05-28
41997-01-011997-12-12
51997-01-011998-01-03
.........
235661997-03-251997-03-25
235671997-03-251997-03-25
235681997-03-251997-04-22
235691997-03-251997-03-25
235701997-03-251997-03-26

23570 rows × 2 columns

user_life_rate = (user_life['min'] == user_life['max']).value_counts()
user_life_rate
True     12054
False    11516
dtype: int64
print('仅消费一次用户占比{:.2%}'.format(user_life_rate[1]/user_life.shape[0]))

仅消费一次用户占比51.14%

user_life_rate.plot.pie(labels=['仅消费一次用户','消费多次用户'],autopct='%2.2f%%')
plt.legend()

在这里插入图片描述

3.3.2每月新客占比
grouped_user.min()
order_dtorder_productsorder_amountmonth
user_id
11997-01-01111.771997-01-01
21997-01-12112.001997-01-01
31997-01-02116.991997-01-01
41997-01-01114.961997-01-01
51997-01-01113.971997-01-01
...............
235661997-03-25236.001997-03-01
235671997-03-25120.971997-03-01
235681997-03-25114.991997-03-01
235691997-03-25225.741997-03-01
235701997-03-25242.961997-03-01

23570 rows × 4 columns

user_new = grouped_user.min().groupby('month').order_dt.count()
user_new
month
1997-01-01    7846
1997-02-01    8476
1997-03-01    7248
Name: order_dt, dtype: int64
user_new_ = df.drop_duplicates('user_id').groupby('month').order_dt.count()
user_new_
month
1997-01-01    7846
1997-02-01    8476
1997-03-01    7248
Name: order_dt, dtype: int64
user_sum = df.groupby('month').order_dt.count()
user_sum
month
1997-01-01     8928
1997-02-01    11272
1997-03-01    11598
1997-04-01     3781
1997-05-01     2895
1997-06-01     3054
1997-07-01     2942
1997-08-01     2320
1997-09-01     2296
1997-10-01     2562
1997-11-01     2750
1997-12-01     2504
1998-01-01     2032
1998-02-01     2026
1998-03-01     2793
1998-04-01     1878
1998-05-01     1985
1998-06-01     2043
Name: order_dt, dtype: int64
(user_new/user_sum).fillna(0).plot()

在这里插入图片描述

每月新老用户占比到4月为0,说明1997.4后无新用户

3.4用户分层

3.4.1RFM用户分层
rfm = pd.pivot_table(df,index='user_id',
                               values=['order_dt','order_products','order_amount'],
                               aggfunc={'order_dt':'max',
                                       'order_amount':'sum',
                                       'order_products':'count'})
rfm
order_amountorder_dtorder_products
user_id
111.771997-01-011
289.001997-01-122
3156.461998-05-286
4100.501997-12-124
5385.611998-01-0311
............
2356636.001997-03-251
2356720.971997-03-251
23568121.701997-04-223
2356925.741997-03-251
2357094.081997-03-262

23570 rows × 3 columns

rfm.order_dt.max()-rfm.order_dt
user_id
1       545 days
2       534 days
3        33 days
4       200 days
5       178 days
          ...   
23566   462 days
23567   462 days
23568   434 days
23569   462 days
23570   461 days
Name: order_dt, Length: 23570, dtype: timedelta64[ns]
#将时间间隔转化为数值型
(rfm.order_dt.max()-rfm.order_dt).apply(lambda x:x.days)
user_id
1        545
2        534
3         33
4        200
5        178
        ... 
23566    462
23567    462
23568    434
23569    462
23570    461
Name: order_dt, Length: 23570, dtype: int64
rfm['R'] =  (rfm.order_dt.max()-rfm.order_dt).apply(lambda x:x.days)
rfm.rename(columns={'order_amount':'M','order_products':'F'},inplace=True)
rfm
Morder_dtFR
user_id
111.771997-01-011545
289.001997-01-122534
3156.461998-05-28633
4100.501997-12-124200
5385.611998-01-0311178
...............
2356636.001997-03-251462
2356720.971997-03-251462
23568121.701997-04-223434
2356925.741997-03-251462
2357094.081997-03-262461

23570 rows × 4 columns

rfm.describe()
MFR
count23570.00000023570.00000023570.000000
mean106.0804262.955409367.221638
std240.9251954.736558181.211177
min0.0000001.0000000.000000
25%19.9700001.000000207.000000
50%43.3950001.000000471.000000
75%106.4750003.000000505.000000
max13990.930000217.000000545.000000
r_bins = [-1,367,545]
f_bins = [0,3,217]
m_bins = [-1,106,13991]
rfm['r_score'] = pd.cut(rfm.R,r_bins,labels=[1,0])
rfm['f_score'] = pd.cut(rfm.F,f_bins,labels=[0,1])
rfm['m_score'] = pd.cut(rfm.M,m_bins,labels=[0,1])
rfm
Morder_dtFRr_scoref_scorem_score
user_id
111.771997-01-011545000
289.001997-01-122534000
3156.461998-05-28633111
4100.501997-12-124200110
5385.611998-01-0311178111
........................
2356636.001997-03-251462000
2356720.971997-03-251462000
23568121.701997-04-223434001
2356925.741997-03-251462000
2357094.081997-03-262461000

23570 rows × 7 columns

col = ['r_score','f_score','m_score']
for i in col:
    rfm[i] = rfm[i].astype(np.str)
rfm['rfm_group'] = rfm['r_score']+rfm['f_score']+rfm['m_score']
rfm['r_score'].str.cat(rfm['f_score']).str.cat(rfm['m_score'])
user_id
1        000
2        000
3        111
4        110
5        111
        ... 
23566    000
23567    000
23568    001
23569    000
23570    000
Name: r_score, Length: 23570, dtype: object
rfm
Morder_dtFRr_scoref_scorem_scorerfm_group
user_id
111.771997-01-011545000000
289.001997-01-122534000000
3156.461998-05-28633111111
4100.501997-12-124200110110
5385.611998-01-0311178111111
...........................
2356636.001997-03-251462000000
2356720.971997-03-251462000000
23568121.701997-04-223434001001
2356925.741997-03-251462000000
2357094.081997-03-262461000000

23570 rows × 8 columns

def rfm_f(x):
    d = {
        '111':'重要价值客户',
        '011':'重要保持客户',
        '101':'重要挽留客户',
        '001':'重要发展客户',
        '110':'一般价值客户',
        '010':'一般保持客户',
        '100':'一般挽留客户',
        '000':'一般发展客户'
    }
    return d[x.rfm_group]
rfm['label'] = rfm.apply(rfm_f,axis=1)
rfm
Morder_dtFRr_scoref_scorem_scorerfm_grouplabel
user_id
111.771997-01-011545000000一般发展客户
289.001997-01-122534000000一般发展客户
3156.461998-05-28633111111重要价值客户
4100.501997-12-124200110110一般价值客户
5385.611998-01-0311178111111重要价值客户
..............................
2356636.001997-03-251462000000一般发展客户
2356720.971997-03-251462000000一般发展客户
23568121.701997-04-223434001001重要发展客户
2356925.741997-03-251462000000一般发展客户
2357094.081997-03-262461000000一般发展客户

23570 rows × 9 columns

rfm.loc[rfm.label == '重要价值客户','color'] = 'g'
rfm.loc[~(rfm.label == '重要价值客户'),'color'] = 'r'
rfm.plot.scatter('R','F',c=rfm.color)

在这里插入图片描述

rfm_count = rfm.groupby('label').R.count()
rfm_count
label
一般价值客户      907
一般保持客户      119
一般发展客户    14031
一般挽留客户     2598
重要价值客户     4084
重要保持客户      256
重要发展客户      773
重要挽留客户      802
Name: R, dtype: int64
plt.figure(figsize=(6,6),dpi=160)
rfm_count.plot.pie(autopct='%2.2f%%',labels=rfm_count.index)
plt.legend(bbox_to_anchor=(1.5, 1.2))

在这里插入图片描述

由图可知,一般发展客户占到59.53%
重要价值客户占比第二,为17.33%
一般保持和重要保持客户占比较低

3.4.2用户状态分析:注册、活跃、回流、流失(不活跃)
pivoted_counts = pd.pivot_table(df,index='user_id',columns='month',values='order_dt',aggfunc='count').fillna(0)
pivoted_counts
month1997-01-011997-02-011997-03-011997-04-011997-05-011997-06-011997-07-011997-08-011997-09-011997-10-011997-11-011997-12-011998-01-011998-02-011998-03-011998-04-011998-05-011998-06-01
user_id
11.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
22.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
31.00.01.01.00.00.00.00.00.00.02.00.00.00.00.00.01.00.0
42.00.00.00.00.00.00.01.00.00.00.01.00.00.00.00.00.00.0
52.01.00.01.01.01.01.00.01.00.00.02.01.00.00.00.00.00.0
.........................................................
235660.00.01.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
235670.00.01.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
235680.00.01.02.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
235690.00.01.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
235700.00.02.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0

23570 rows × 18 columns

#applymap作用于每一个元素
df_purchase = pivoted_counts.applymap(lambda x:1 if x>0 else 0)
df_purchase
month1997-01-011997-02-011997-03-011997-04-011997-05-011997-06-011997-07-011997-08-011997-09-011997-10-011997-11-011997-12-011998-01-011998-02-011998-03-011998-04-011998-05-011998-06-01
user_id
1100000000000000000
2100000000000000000
3101100000010000010
4100000010001000000
5110111101001100000
.........................................................
23566001000000000000000
23567001000000000000000
23568001100000000000000
23569001000000000000000
23570001000000000000000

23570 rows × 18 columns

def active_status(data):
    status = []
    x = len(data)
    for i in range(x):
        #未消费
        if data[i] == 0:
            if len(status) == 0:
                status.append('unreg')
            else:
                if status[i-1] == 'unreg':
                    status.append('unreg')
                else:
                    status.append('unactive')
        #消费过
        else:
            if len(status) == 0:
                status.append('new')
            else:
                if status[i-1] == 'unreg':
                    status.append('new')
                elif status[i-1] == 'unactive':
                    status.append('return')
                else:
                    status.append('active')
    return status                 
purchase_status = df_purchase.apply(active_status,axis=1,raw=True)
purchase_status
month1997-01-011997-02-011997-03-011997-04-011997-05-011997-06-011997-07-011997-08-011997-09-011997-10-011997-11-011997-12-011998-01-011998-02-011998-03-011998-04-011998-05-011998-06-01
user_id
1newunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactive
2newunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactive
3newunactivereturnactiveunactiveunactiveunactiveunactiveunactiveunactivereturnunactiveunactiveunactiveunactiveunactivereturnunactive
4newunactiveunactiveunactiveunactiveunactiveunactivereturnunactiveunactiveunactivereturnunactiveunactiveunactiveunactiveunactiveunactive
5newactiveunactivereturnactiveactiveactiveunactivereturnunactiveunactivereturnactiveunactiveunactiveunactiveunactiveunactive
.........................................................
23566unregunregnewunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactive
23567unregunregnewunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactive
23568unregunregnewactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactive
23569unregunregnewunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactive
23570unregunregnewunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactiveunactive

23570 rows × 18 columns

purchase_status_ct = purchase_status.replace('unreg',np.nan).apply(lambda x:x.value_counts())
purchase_status_ct
month1997-01-011997-02-011997-03-011997-04-011997-05-011997-06-011997-07-011997-08-011997-09-011997-10-011997-11-011997-12-011998-01-011998-02-011998-03-011998-04-011998-05-011998-06-01
activeNaN1157.016811773.0852.0747.0746.0604.0528.0532.0624.0632.0512.0472.0571.0518.0459.0446.0
new7846.08476.07248NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
returnNaNNaN5951049.01362.01592.01434.01168.01211.01307.01404.01232.01025.01079.01489.0919.01029.01060.0
unactiveNaN6689.01404620748.021356.021231.021390.021798.021831.021731.021542.021706.022033.022019.021510.022133.022082.022064.0
purchase_stack = purchase_status_ct.fillna(0).T
plt.figure(figsize=(6,3),dpi=150)
plt.stackplot(purchase_stack.index,purchase_stack['active'],purchase_stack['new'],purchase_stack['return'],purchase_stack['unactive'],labels=purchase_stack.columns)
plt.legend()
plt.show()
#purchase_stack.plot.area()

在这里插入图片描述

根据每月不同用户的计数统计做面积图:

  • 前三个月用户人数不断增加,新增用户数量占比较大,活跃用户数量上升
  • 1997年4月开始无新用户注册
  • 1997年4月开始活跃用户数量下降,最后趋于稳定,回流用户趋于稳定水平,该两层的用户为消费主力,约2000人上下浮动
  • 不活跃用户始终占大部分
  • 注:回流用户为上月未消费,本月消费用户
purchase_status_ct.fillna(0).T.head()
activenewreturnunactive
month
1997-01-010.07846.00.00.0
1997-02-011157.08476.00.06689.0
1997-03-011681.07248.0595.014046.0
1997-04-011773.00.01049.020748.0
1997-05-01852.00.01362.021356.0
purchase_status_ct.fillna(0).T.apply(lambda x:x/x.sum(),axis=1).head(10)
activenewreturnunactive
month
1997-01-010.0000001.0000000.0000000.000000
1997-02-010.0708860.5192990.0000000.409815
1997-03-010.0713190.3075100.0252440.595927
1997-04-010.0752230.0000000.0445060.880272
1997-05-010.0361480.0000000.0577850.906067
1997-06-010.0316930.0000000.0675430.900764
1997-07-010.0316500.0000000.0608400.907510
1997-08-010.0256260.0000000.0495550.924820
1997-09-010.0224010.0000000.0513790.926220
1997-10-010.0225710.0000000.0554520.921977
#消费用户中不同类型人占比
purchase_status_rate = purchase_status_ct.fillna(0).T.drop(columns=['unactive']).apply(lambda x:x/x.sum(),axis=1)
purchase_status_rate
activenewreturn
month
1997-01-010.0000001.0000000.000000
1997-02-010.1201080.8798920.000000
1997-03-010.1765010.7610250.062474
1997-04-010.6282780.0000000.371722
1997-05-010.3848240.0000000.615176
1997-06-010.3193670.0000000.680633
1997-07-010.3422020.0000000.657798
1997-08-010.3408580.0000000.659142
1997-09-010.3036230.0000000.696377
1997-10-010.2892880.0000000.710712
1997-11-010.3076920.0000000.692308
1997-12-010.3390560.0000000.660944
1998-01-010.3331160.0000000.666884
1998-02-010.3043200.0000000.695680
1998-03-010.2771840.0000000.722816
1998-04-010.3604730.0000000.639527
1998-05-010.3084680.0000000.691532
1998-06-010.2961490.0000000.703851
plt.figure(figsize=(6,3),dpi=150)
plt.plot(purchase_status_rate)
#plt.plot(purchase_status_rate['return'])
plt.legend(purchase_status_rate.columns)
plt.show()

在这里插入图片描述

在消费用户中,4月后只存在活跃用户和回流用户进行消费
后期消费用户中,回流用户占比较大,整体消费用户质量一般

3.5用户购买周期(按订单)

3.5.1用户消费周期描述
order_diff = df.groupby('user_id').apply(lambda x:x.order_dt - x.order_dt.shift(1))
order_diff.head()
user_id   
1        0       NaT
2        1       NaT
         2    0 days
3        3       NaT
         4   87 days
Name: order_dt, dtype: timedelta64[ns]
order_diff.describe()
count                      46089
mean     68 days 23:22:13.567662
std      91 days 00:47:33.924168
min              0 days 00:00:00
25%             10 days 00:00:00
50%             31 days 00:00:00
75%             89 days 00:00:00
max            533 days 00:00:00
Name: order_dt, dtype: object

用户平均购买时间间隔为68天(可以根据品类进行分析是否合理)
用户购买周期中位数为31天,远小于平均值,数据为右偏分布,存在极大值干扰

3.5.2用户消费周期分布
order_diff.apply(lambda x:x.days).hist(bins=20)

在这里插入图片描述

用户消费周期呈指数分布,绝大部分用户消费周期都在100天以内

3.6用户生命周期(按第一次&最后一次消费)

3.6.1用户生命周期描述
user_life = df.groupby('user_id').order_dt.agg(['max','min'])
user_life
maxmin
user_id
11997-01-011997-01-01
21997-01-121997-01-12
31998-05-281997-01-02
41997-12-121997-01-01
51998-01-031997-01-01
.........
235661997-03-251997-03-25
235671997-03-251997-03-25
235681997-04-221997-03-25
235691997-03-251997-03-25
235701997-03-261997-03-25

23570 rows × 2 columns

user_cycle = (user_life['max']-user_life['min'])
user_cycle
user_id
1         0 days
2         0 days
3       511 days
4       345 days
5       367 days
          ...   
23566     0 days
23567     0 days
23568    28 days
23569     0 days
23570     1 days
Length: 23570, dtype: timedelta64[ns]
user_cycle.describe()
count                       23570
mean     134 days 20:55:36.987696
std      180 days 13:46:43.039788
min               0 days 00:00:00
25%               0 days 00:00:00
50%               0 days 00:00:00
75%             294 days 00:00:00
max             544 days 00:00:00
dtype: object

用户平均生命周期为134天,中位数为0天,说明大部分用户仅消费一次就不再进行消费

3.6.2用户生命周期分布
plt.figure()
user_cycle.apply(lambda x:x.days).hist(bins=20)
plt.title('用户生命周期分布')
plt.xlabel('天数')
plt.ylabel('人数')
plt.show()

在这里插入图片描述

用户生命周期受只购买一次的用户影响很大,可以过滤生命周期为0的用户

user_cycle.apply(lambda x:x.days)[user_cycle.apply(lambda x:x.days) > 0].hist(bins=20)

在这里插入图片描述

过滤掉生命周期为0的用户后的分布图:

  • 图像呈现双峰趋势,很多用户虽然消费多次,但其生命周期小于一个月,应该采取引导消费、促进活跃、防止流失等措施
  • 生命周期在350-500天的用户较为稳定,呈伪正态分布,属于忠诚用户,应重点维持

4.用户消费指标

  • 留存率
    • 用户在第一次消费后,进行第二次消费的比率
  • 流失率
    • 未消费用户占比
  • 复购率
    • 自然月内购买多次的用户占比
  • 回购率
    • 曾经购买过的用户在某一段时期内再次购买的占比

4.1留存率

user_purchase = df[['user_id','order_products','order_amount','order_dt']]
order_date_min = user_purchase.groupby('user_id').order_dt.min()
order_date_min
user_id
1       1997-01-01
2       1997-01-12
3       1997-01-02
4       1997-01-01
5       1997-01-01
           ...    
23566   1997-03-25
23567   1997-03-25
23568   1997-03-25
23569   1997-03-25
23570   1997-03-25
Name: order_dt, Length: 23570, dtype: datetime64[ns]
user_purchase_retention = pd.merge(user_purchase,order_date_min,on='user_id',how='left',suffixes=('', '_min'))
user_purchase_retention
user_idorder_productsorder_amountorder_dtorder_dt_min
01111.771997-01-011997-01-01
12112.001997-01-121997-01-12
22577.001997-01-121997-01-12
33220.761997-01-021997-01-02
43220.761997-03-301997-01-02
..................
6965423568483.741997-04-051997-03-25
6965523568114.991997-04-221997-03-25
6965623569225.741997-03-251997-03-25
6965723570351.121997-03-251997-03-25
6965823570242.961997-03-261997-03-25

69659 rows × 5 columns

user_purchase_retention['dtdiff'] = (user_purchase_retention.order_dt - user_purchase_retention.order_dt_min).apply(lambda x:x.days)
user_purchase_retention
user_idorder_productsorder_amountorder_dtorder_dt_mindtdiff
01111.771997-01-011997-01-010
12112.001997-01-121997-01-120
22577.001997-01-121997-01-120
33220.761997-01-021997-01-020
43220.761997-03-301997-01-0287
.....................
6965423568483.741997-04-051997-03-2511
6965523568114.991997-04-221997-03-2528
6965623569225.741997-03-251997-03-250
6965723570351.121997-03-251997-03-250
6965823570242.961997-03-261997-03-251

69659 rows × 6 columns

bins = [0,3,7,15,30,60,90,180,360,540]
user_purchase_retention['dtdiff_bin'] = pd.cut(user_purchase_retention.dtdiff, bins = bins)
user_purchase_retention
user_idorder_productsorder_amountorder_dtorder_dt_mindtdiffdtdiff_bin
01111.771997-01-011997-01-010NaN
12112.001997-01-121997-01-120NaN
22577.001997-01-121997-01-120NaN
33220.761997-01-021997-01-020NaN
43220.761997-03-301997-01-0287(60.0, 90.0]
........................
6965423568483.741997-04-051997-03-2511(7.0, 15.0]
6965523568114.991997-04-221997-03-2528(15.0, 30.0]
6965623569225.741997-03-251997-03-250NaN
6965723570351.121997-03-251997-03-250NaN
6965823570242.961997-03-261997-03-251(0.0, 3.0]

69659 rows × 7 columns

pivoted_retention= user_purchase_retention.pivot_table(index='user_id', 
                                                       columns='dtdiff_bin', 
                                                       values='order_dt',
                                                       aggfunc='count',
                                                       dropna=False)
pivoted_retention
dtdiff_bin(0, 3](3, 7](7, 15](15, 30](30, 60](60, 90](90, 180](180, 360](360, 540]
user_id
1NaNNaNNaNNaNNaNNaNNaNNaNNaN
2NaNNaNNaNNaNNaNNaNNaNNaNNaN
3NaNNaNNaNNaNNaN2.0NaN2.01.0
4NaNNaNNaN1.0NaNNaNNaN2.0NaN
5NaNNaN1.0NaN1.0NaN3.04.01.0
..............................
23566NaNNaNNaNNaNNaNNaNNaNNaNNaN
23567NaNNaNNaNNaNNaNNaNNaNNaNNaN
23568NaNNaN1.01.0NaNNaNNaNNaNNaN
23569NaNNaNNaNNaNNaNNaNNaNNaNNaN
235701.0NaNNaNNaNNaNNaNNaNNaNNaN

23570 rows × 9 columns

pivoted_retention.applymap(lambda x:1 if x>0 else 0).mean().plot.bar(figsize=(10,5))

在这里插入图片描述

从柱状图可以发现,2.68%的用户在首次购买后,会在1到3天内再次消费,随着时间范围加大,人数逐渐增多,12.96%的用户在首次购买后,会在一个月到两个月内再次消费,而在半年到一年之间再次消费的用户达到了25.64%
也可知CD营销确实不是高频次消费行为,若要使利益最大化,应该加强新用户的补充和培养忠诚用户

pivoted_retention.applymap(lambda x:1 if x>0 else 0).mean()
dtdiff_bin
(0, 3]        0.026856
(3, 7]        0.035129
(7, 15]       0.060798
(15, 30]      0.090539
(30, 60]      0.129699
(60, 90]      0.099703
(90, 180]     0.197030
(180, 360]    0.256428
(360, 540]    0.197412
dtype: float64

4.2流失率

purchase_status_ct.fillna(0).T.apply(lambda x:x/x.sum(),axis=1)[['unactive']]
unactive
month
1997-01-010.000000
1997-02-010.409815
1997-03-010.595927
1997-04-010.880272
1997-05-010.906067
1997-06-010.900764
1997-07-010.907510
1997-08-010.924820
1997-09-010.926220
1997-10-010.921977
1997-11-010.913958
1997-12-010.920916
1998-01-010.934790
1998-02-010.934196
1998-03-010.912601
1998-04-010.939033
1998-05-010.936869
1998-06-010.936105
purchase_status_ct.fillna(0).T.apply(lambda x:x/x.sum(),axis=1)[['unactive']].plot.bar(figsize=(10,5))

在这里插入图片描述

4.3复购率

pivoted_counts
month1997-01-011997-02-011997-03-011997-04-011997-05-011997-06-011997-07-011997-08-011997-09-011997-10-011997-11-011997-12-011998-01-011998-02-011998-03-011998-04-011998-05-011998-06-01
user_id
11.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
22.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
31.00.01.01.00.00.00.00.00.00.02.00.00.00.00.00.01.00.0
42.00.00.00.00.00.00.01.00.00.00.01.00.00.00.00.00.00.0
52.01.00.01.01.01.01.00.01.00.00.02.01.00.00.00.00.00.0
.........................................................
235660.00.01.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
235670.00.01.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
235680.00.01.02.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
235690.00.01.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0
235700.00.02.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0

23570 rows × 18 columns

purchase_r = pivoted_counts.applymap(lambda x: 1 if x>1 else np.NaN if x==0 else 0)
purchase_r
month1997-01-011997-02-011997-03-011997-04-011997-05-011997-06-011997-07-011997-08-011997-09-011997-10-011997-11-011997-12-011998-01-011998-02-011998-03-011998-04-011998-05-011998-06-01
user_id
10.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
21.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
30.0NaN0.00.0NaNNaNNaNNaNNaNNaN1.0NaNNaNNaNNaNNaN0.0NaN
41.0NaNNaNNaNNaNNaNNaN0.0NaNNaNNaN0.0NaNNaNNaNNaNNaNNaN
51.00.0NaN0.00.00.00.0NaN0.0NaNNaN1.00.0NaNNaNNaNNaNNaN
.........................................................
23566NaNNaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
23567NaNNaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
23568NaNNaN0.01.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
23569NaNNaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
23570NaNNaN1.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN

23570 rows × 18 columns

(purchase_r.sum()/purchase_r.count()).plot()

在这里插入图片描述

后期复购率稳定在20%左右,前三个月有大量用户涌入,而这批用户多只购买一次,所以复购率较低

repurchase = df.groupby('user_id').month.count()
repurchase
user_id
1         1
2         2
3         6
4         4
5        11
         ..
23566     1
23567     1
23568     3
23569     1
23570     2
Name: month, Length: 23570, dtype: int64
repurchase[repurchase>1].count()/repurchase.count()

0.4947815019092066

plt.figure(figsize=(6,3),dpi=150)
plt.plot(purchase_r.count())
plt.plot(purchase_r.sum())
plt.xlabel('时间(月)') 
plt.ylabel('用户数(人)') 
plt.legend(['消费人数', '二次消费以上人数'])
plt.show()

在这里插入图片描述

用户消费三个月后,人数迅速下降,最后稳定在2000人左右
而两次消费以上用户数,先随着新用户进入有上升趋势,随后开始缓慢下降,需要采取促活行为

purchase_r.sum()
month
1997-01-01     844.0
1997-02-01    1178.0
1997-03-01    1479.0
1997-04-01     631.0
1997-05-01     436.0
1997-06-01     458.0
1997-07-01     469.0
1997-08-01     355.0
1997-09-01     352.0
1997-10-01     380.0
1997-11-01     410.0
1997-12-01     410.0
1998-01-01     324.0
1998-02-01     315.0
1998-03-01     473.0
1998-04-01     286.0
1998-05-01     298.0
1998-06-01     323.0
dtype: float64

4.4回购率

df_purchase
month1997-01-011997-02-011997-03-011997-04-011997-05-011997-06-011997-07-011997-08-011997-09-011997-10-011997-11-011997-12-011998-01-011998-02-011998-03-011998-04-011998-05-011998-06-01
user_id
1100000000000000000
2100000000000000000
3101100000010000010
4100000010001000000
5110111101001100000
.........................................................
23566001000000000000000
23567001000000000000000
23568001100000000000000
23569001000000000000000
23570001000000000000000

23570 rows × 18 columns

def purchse_back(data):
    status = [np.nan]
    l = len(data)-1
    for i in range(1,l):
        if data[i] == 1:
            if data[i-1] ==1:
                status.append(1)
            else:
                status.append(0)
        else:
            status.append(np.nan)
    status.append(np.nan)
    return status
df_purchase
month1997-01-011997-02-011997-03-011997-04-011997-05-011997-06-011997-07-011997-08-011997-09-011997-10-011997-11-011997-12-011998-01-011998-02-011998-03-011998-04-011998-05-011998-06-01
user_id
1100000000000000000
2100000000000000000
3101100000010000010
4100000010001000000
5110111101001100000
.........................................................
23566001000000000000000
23567001000000000000000
23568001100000000000000
23569001000000000000000
23570001000000000000000

23570 rows × 18 columns

purchase_b = df_purchase.apply(purchse_back,axis=1,raw=True)
purchase_b
month1997-01-011997-02-011997-03-011997-04-011997-05-011997-06-011997-07-011997-08-011997-09-011997-10-011997-11-011997-12-011998-01-011998-02-011998-03-011998-04-011998-05-011998-06-01
user_id
1NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
3NaNNaN0.01.0NaNNaNNaNNaNNaNNaN0.0NaNNaNNaNNaNNaN0.0NaN
4NaNNaNNaNNaNNaNNaNNaN0.0NaNNaNNaN0.0NaNNaNNaNNaNNaNNaN
5NaN1.0NaN0.01.01.01.0NaN0.0NaNNaN0.01.0NaNNaNNaNNaNNaN
.........................................................
23566NaNNaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
23567NaNNaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
23568NaNNaN0.01.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
23569NaNNaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
23570NaNNaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN

23570 rows × 18 columns

(purchase_b.sum()/purchase_b.count()).plot(figsize=(10,5))

在这里插入图片描述

前三个月因为有新用户涌入,但是多只购买一次之后不再消费,说明了对于新用户,在其第一次消费后的三个月内是一段重要的时期,需要营销策略积极引导其再次消费及持续消费
随着消费总人数下降,而回购人数上升,在四月回购率达到顶峰
但随着用户流失,最后忠诚用户回购率稳定在30%左右,对于老客,也要适时推出反馈老客户的优惠活动,以加强老客的忠诚度

plt.figure(figsize=(6,3),dpi=150)
plt.plot(purchase_b.count())
plt.plot(purchase_b.sum())
plt.xlabel('时间(月)') 
plt.ylabel('用户数(人)') 
plt.legend(['消费人数', '回购人数'])
plt.show()

在这里插入图片描述

消费人数从三月份开始,因为没有新用户进入,出现断崖式下跌
而回购人数先缓慢增长,从四月份开始逐渐下降并稳定在500左右

5.总结

  • 网站新用户集中分布在前三月,产品缺乏价值点或难以满足用户需求,从而导致大量用户流失,1997年4月后消费群体全部为老用户
  • CD产品的消费呈现小额低频的特征,且用户消费相关数据多为右偏分布,排名前15%的用户贡献了60%的消费额和50%的订单量,反应出消费也呈现“二八”规律,
  • 消费用户中的回流用户与活跃用户后期均有下降趋势,需要对客户流失进行预警,建议应该优化产品,培养用户忠诚度
  • 新用户大多数只消费过一次,回购率和复购率偏低,整体质量低于老用户,建议推出优惠活动以吸引用户,提高用户粘性,并适时推出反馈老用户的优惠活动,以加强老用户的忠诚度
  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值