User Behavior Analysis of a CD Website's Data

weixin_44236148

Tags: data analysis

Article link: https://blog.csdn.net/weixin_44236148/article/details/88358016

	user_id	order_dt	order_products	order_amount
0	1	19970101	1	11.77
1	2	19970112	1	12.00
2	2	19970112	5	77.00
3	3	19970102	2	20.76
4	3	19970330	2	20.76

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69659 entries, 0 to 69658
Data columns (total 4 columns):
user_id           69659 non-null int64
order_dt          69659 non-null int64
order_products    69659 non-null int64
order_amount      69659 non-null float64
dtypes: float64(1), int64(3)
memory usage: 2.1 MB
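The loading step that produced this DataFrame is not shown here. A minimal sketch of one way to build it, assuming a whitespace-delimited text file with no header row; the inline sample simply repeats the five rows printed above, and a real run would pass a file path (not given in the post) instead of the `io.StringIO` object.

```python
import io

import pandas as pd

# Five sample rows copied from the df.head() output; stand-in for the real file.
sample = io.StringIO(
    "1 19970101 1 11.77\n"
    "2 19970112 1 12.00\n"
    "2 19970112 5 77.00\n"
    "3 19970102 2 20.76\n"
    "3 19970330 2 20.76\n"
)

columns = ["user_id", "order_dt", "order_products", "order_amount"]

# sep=r"\s+" splits on any run of whitespace; header=None because the
# assumed file carries no header row, so names= supplies the column names.
df = pd.read_csv(sample, sep=r"\s+", header=None, names=columns)

df.info()
```

At this point `order_dt` is still an integer such as 19970101, which matches the `int64` dtype reported by `df.info()`.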

df.describe()

	user_id	order_dt	order_products	order_amount
count	69659.000000	6.965900e+04	69659.000000	69659.000000
mean	11470.854592	1.997228e+07	2.410040	35.893648
std	6819.904848	3.837735e+03	2.333924	36.281942
min	1.000000	1.997010e+07	1.000000	0.000000
25%	5506.000000	1.997022e+07	1.000000	14.490000
50%	11410.000000	1.997042e+07	2.000000	25.980000
75%	17273.000000	1.997111e+07	3.000000	43.700000
max	23570.000000	1.998063e+07	99.000000	1286.010000

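Most orders contain only a few items (mean 2.4): 50% of orders contain 2 items or fewer and 75% contain 3 or fewer, while the largest order contains 99 items, so the mean item count is pulled up by extreme values;
The mean order amount is about 36 yuan and the median about 26 yuan; 75% of orders are 43.70 yuan or less, and the largest order is 1286.01 yuan, so the amounts are also skewed by extreme values.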

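import numpy as np
import pandas as pd
df['order_dt'] = pd.to_datetime(df.order_dt, format='%Y%m%d')  # parse 19970101-style integers as dates
df['month'] = df.order_dt.values.astype('datetime64[M]')  # truncate each date to the first day of its month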

df.head()

	user_id	order_dt	order_products	order_amount	month
0	1	1997-01-01	1	11.77	1997-01-01
1	2	1997-01-12	1	12.00	1997-01-01
2	2	1997-01-12	5	77.00	1997-01-01
3	3	1997-01-02	2	20.76	1997-01-01
4	3	1997-03-30	2	20.76	1997-03-01
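The month column above is produced by casting to NumPy's `datetime64[M]`. An equivalent way to get the same first-of-month timestamps with pandas accessors is sketched below; the small `orders` frame is made up for illustration and only reuses dates from the sample rows.

```python
import pandas as pd

# Illustrative frame with the same integer date format as the raw data.
orders = pd.DataFrame({"order_dt": [19970101, 19970112, 19970330]})

# Parse the integers as dates, then truncate each date to its month start.
orders["order_dt"] = pd.to_datetime(orders["order_dt"], format="%Y%m%d")
orders["month"] = orders["order_dt"].dt.to_period("M").dt.to_timestamp()

print(orders)
```

Either approach yields one timestamp per calendar month, which is what the later `groupby('month')` calls rely on.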

2. User Consumption Trend Analysis

2.1 Total monthly spending

grouped_month = df.groupby('month')
order_month_amount = grouped_month.order_amount.sum()
order_month_amount.head()

month
1997-01-01    299060.17
1997-02-01    379590.03
1997-03-01    393155.27
1997-04-01    142824.49
1997-05-01    107933.30
Name: order_amount, dtype: float64

import matplotlib.pyplot as plt
%matplotlib inline
# change the plot style
plt.style.use('ggplot')
order_month_amount.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x2f08d6c8710>

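Total monthly spending

Spending peaks in the first three months and then drops sharply; after April 1997 the monthly total is fairly stable, with a slight downward trend.

2.2 Total monthly orders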

grouped_month.user_id.count().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x2f08da00908>

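Monthly order count over time

The first three months average about 10,000 orders per month; later months average about 2,500 orders per month.

2.3 Total products purchased per month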

grouped_month.order_products.sum().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x2f08da99da0>

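Total products purchased per month

The number of products purchased peaks in the first three months at about 24,000 per month on average; later months average about 6,000 per month.

2.4 Number of buyers per month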

df.groupby('month').user_id.apply(lambda x: len(x.drop_duplicates())).plot()

<matplotlib.axes._subplots.AxesSubplot at 0x2f08db04400>

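Number of buyers per month

The number of buyers per month is lower than the number of orders per month (the number of purchases), but the gap is small;
In the first three months there are roughly 8,000 to 10,000 buyers per month; in later months the average is under 2,000.

2.5 Average spending per buyer per month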

grouped_month.apply( lambda x : x.order_amount.sum() / x.user_id.nunique() ).plot()

<matplotlib.axes._subplots.AxesSubplot at 0x2f091378f60>

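Average spending per buyer per month

The average is somewhat lower in the first three months; in later months the average spending per buyer fluctuates between 47 and 57 with little change.

2.6 Average number of orders per buyer per month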

grouped_month.apply( lambda x : x.user_id.count() / x.user_id.nunique() ).plot()

<matplotlib.axes._subplots.AxesSubplot at 0x2f092ef2978>

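Average number of orders per buyer per month

In the first three months the average is roughly 1.14 to 1.22 orders per buyer (for example, 8,928 orders from 7,846 buyers in January). Because new users poured in during those months and many of them bought only once, the frequency is slightly lower;
In later months the average frequency is fairly stable and slightly higher than in the first three months.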
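The monthly metrics in sections 2.1 to 2.6 can also be computed in a single pass. The following is a sketch using pandas named aggregation on a tiny made-up `orders` frame; the output column names (`total_amount`, `orders`, `buyers`, and the two ratios) are chosen here for illustration and do not appear in the post.

```python
import pandas as pd

# Toy orders: two buyers in January (one ordered twice), one buyer in March.
orders = pd.DataFrame({
    "month": pd.to_datetime(["1997-01-01"] * 3 + ["1997-03-01"]),
    "user_id": [1, 2, 2, 3],
    "order_amount": [11.77, 12.00, 77.00, 20.76],
})

# One groupby produces the amount, order count and distinct buyer count.
monthly = orders.groupby("month").agg(
    total_amount=("order_amount", "sum"),
    orders=("user_id", "count"),
    buyers=("user_id", "nunique"),
)

# Per-buyer ratios, matching sections 2.5 and 2.6.
monthly["amount_per_buyer"] = monthly["total_amount"] / monthly["buyers"]
monthly["orders_per_buyer"] = monthly["orders"] / monthly["buyers"]

print(monthly)
```

Keeping the counts and the ratios in one table makes it easy to check a ratio against the numbers it was derived from.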

# A pivot table gives the same monthly totals
df.pivot_table(index='month',
               values=['order_products','order_amount','user_id'],
               aggfunc={'order_products':'sum','order_amount':'sum','user_id':'count'}).head()

	order_amount	order_products	user_id
month
1997-01-01	299060.17	19416	8928
1997-02-01	379590.03	24921	11272
1997-03-01	393155.27	26159	11598
1997-04-01	142824.49	9729	3781
1997-05-01	107933.30	7275	2895

3. Per-User Consumption Analysis

grouped_user = df.groupby('user_id')

grouped_user.sum().describe()

	order_products	order_amount
count	23570.000000	23570.000000
mean	7.122656	106.080426
std	16.983531	240.925195
min	1.000000	0.000000
25%	1.000000	19.970000
50%	3.000000	43.395000
75%	7.000000	106.475000
max	1033.000000	13990.930000

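On average each user bought 7 CDs, but the median is only 3, which shows that a small share of users bought a large number of CDs;
The mean spend per user is 106 yuan and the median is 43; the 75th percentile is about 106, almost equal to the mean, so about 25% of users spend more than the average. The maximum spend is 13,990 (a true CD fanatic), which supports the conclusion that a small share of users bought a large number of CDs. The data is skewed by extreme values.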

3.1 Scatter plot of user spending vs. number of products purchased

grouped_user.sum().query('order_amount<4000').plot.scatter(x='order_amount',y='order_products')

<matplotlib.axes._subplots.AxesSubplot at 0x2f08ecd7898>

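Scatter plot of user spending vs. number of products purchased (1)

A few extreme values at the far end stretch the plot area, so the dense region is squeezed into the lower-left corner. Filter to exclude the extreme values: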

grouped_user.sum().query('order_products < 350').plot.scatter(x = 'order_products', y = 'order_amount')

<matplotlib.axes._subplots.AxesSubplot at 0x2f095b70550>

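Scatter plot of user spending vs. number of products purchased (2)

The scatter plot shows a clear linear relationship, which suggests that the CD catalogue is fairly uniform and unit prices are stable.

3.2 Distribution of user spending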

grouped_user.sum().order_amount.plot.hist(bins=100)

<matplotlib.axes._subplots.AxesSubplot at 0x2f08ed6b2b0>

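Distribution of user spending (1)

The histogram shows that user spending is heavily concentrated: most users spent less than 1,500, and a small number of outliers distort the picture.
Chebyshev's inequality is used to filter outliers: at least 96% (1 − 1/5²) of the data lies within 5 standard deviations of the mean. From the descriptive statistics of per-user spending and product counts at the start of Section 3, the bulk of users therefore spend less than 106 + 5 × 240 = 1306: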

grouped_user.sum().query('order_amount < 1306').order_amount.plot.hist(bins = 20)

<matplotlib.axes._subplots.AxesSubplot at 0x2f09568f748>

Distribution of user spending (2)
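Chebyshev's inequality bounds the share of values lying more than k standard deviations from the mean by 1/k², whatever the shape of the distribution, so k = 5 keeps at least 96% of the data. The sketch below computes such a cutoff; the helper name `chebyshev_cutoff` and the synthetic exponential data are assumptions for illustration only. With the unrounded mean and standard deviation from the describe() output, the cutoff comes to about 1311 rather than the rounded 1306.

```python
import numpy as np


def chebyshev_cutoff(values, k=5.0):
    """Upper cutoff at mean + k sample standard deviations."""
    values = np.asarray(values, dtype=float)
    return values.mean() + k * values.std(ddof=1)


# Synthetic, right-skewed "spend per user" values (not the real data).
rng = np.random.default_rng(0)
spend = rng.exponential(scale=100.0, size=10_000)

cutoff = chebyshev_cutoff(spend, k=5)
share_kept = (spend < cutoff).mean()  # fraction of users below the cutoff

# Cutoff implied by the per-user mean and std reported for order_amount.
post_cutoff = 106.080426 + 5 * 240.925195

print(round(cutoff, 2), share_kept, round(post_cutoff, 2))
```

The bound is deliberately loose: for most real distributions far more than 96% of the values fall below the cutoff, so this filter only trims the most extreme users.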

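3.3 Distribution of products purchased per user

# As above, plot the distribution of products purchased per user after filtering out extreme values: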
grouped_user.sum().query('order_products < 100').order_products.plot.hist(bins = 20)

<matplotlib.axes._subplots.AxesSubplot at 0x2f0934cbf60>

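Distribution of products purchased per user

Most users bought only 1 to 5 CDs; overall, the number of users falls off roughly exponentially as purchase quantity and spending increase.

3.4 Cumulative share of spending by user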

user_cumsum = grouped_user.sum().sort_values('order_amount').apply(lambda x: x.cumsum()/x.sum())
user_cumsum.reset_index().order_amount.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x2f08efb5908>

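Cumulative share of spending by user

With users sorted by spending in ascending order, the bottom 50% of users contribute only about 15% of total spending, while the top 20% of users contribute about 60%.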
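The share contributed by the top spenders can be read off the cumulative curve, or computed directly. A minimal sketch on made-up per-user totals follows; `top_share` is a hypothetical helper, not part of the post.

```python
import pandas as pd


def top_share(amounts, top_fraction=0.2):
    """Share of total spending contributed by the top `top_fraction` of users."""
    ordered = amounts.sort_values(ascending=False)
    n_top = max(1, int(round(len(ordered) * top_fraction)))
    return ordered.iloc[:n_top].sum() / ordered.sum()


# Made-up per-user totals that sum to 1000.
spend = pd.Series([500.0, 200.0, 100.0, 50.0, 50.0, 40.0, 30.0, 20.0, 5.0, 5.0])

share = top_share(spend, 0.2)  # top 2 of 10 users
print(share)
```

On the real data the same function would be applied to the per-user `order_amount` totals.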

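4. User Purchase Behavior Analysis

4.1 First purchase

Note: the first purchase is an important dimension and is closely tied to acquisition channels. Especially in industries where the average order value is high and retention is low, first-purchase analysis can identify the channels that bring in first-time buyers and inform how operations are extended.
The distribution of first purchase dates is obtained by taking each user's earliest order date: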

grouped_user.min().order_dt.value_counts().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x2f08f09c588>

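First purchase by date

First purchases are concentrated in the first three months, with a sharp fluctuation between February 11 and February 15.

4.2 Most recent purchase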

grouped_user.max().order_dt.value_counts().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x2f08f055ba8>

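Most recent purchase by date

The distribution of last purchases is wider than that of first purchases;
Most last purchases still fall in the first three months, which indicates that many users bought once and never came back;
As time goes on the number of last purchases also rises, which means churn is increasing.

4.3 New vs. returning customers

4.3.1 How many users purchased only once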

user_life = grouped_user.order_dt.agg(['min','max'])
user_life.head()

	min	max
user_id
1	1997-01-01	1997-01-01
2	1997-01-12	1997-01-12
3	1997-01-02	1998-05-28
4	1997-01-01	1997-12-12
5	1997-01-01	1998-01-03

(user_life['min']== user_life['max']).value_counts()

True     12054
False    11516
dtype: int64

More than half of the users (12,054 of 23,570, about 51%) purchased only once (strictly, on only one date).
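Comparing the earliest and latest order dates counts a user with several orders on the same day as a one-time buyer; user 2 in the sample rows, with two orders on 1997-01-12, is such a case. The sketch below contrasts that test with a count of orders per user, using a small frame built from the sample rows shown at the top of the post.

```python
import pandas as pd

orders = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 3],
    "order_dt": pd.to_datetime(
        ["1997-01-01", "1997-01-12", "1997-01-12", "1997-01-02", "1997-03-30"]
    ),
})

# Test used above: first and last order fall on the same date.
span = orders.groupby("user_id")["order_dt"].agg(["min", "max"])
single_day = span["min"] == span["max"]

# Alternative: exactly one order row for the user.
single_order = orders.groupby("user_id").size() == 1

print(pd.DataFrame({"single_day": single_day, "single_order": single_order}))
```

Which definition is appropriate depends on whether several orders on one day should count as repeat business.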

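4.3.2 Share of new customers per month

grouped_month_user = df.groupby(['month', 'user_id'])  # group by month and user ID
tmp = grouped_month_user.order_dt.agg(['min']).join(grouped_user.order_dt.min())  # each user's first order date within the month, joined with the user's first order date overall
tmp['is_new'] = (tmp['min'] ==  tmp['order_dt'])  # new column: True if the overall first purchase falls in this month, i.e. the user is a new customer
tmp.reset_index().groupby('month').apply(lambda x : x.is_new.sum() / x.user_id.count() ).plot()  # share of new customers per month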

<matplotlib.axes._subplots.AxesSubplot at 0x2f08f10d780>

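Share of new customers per month

New customers are concentrated in the first three months; all later purchases come from existing users.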
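The same new-customer share can be sketched without a join by attaching each user's first purchase month to every order. The toy `orders` frame and the column name `first_month` below are assumptions for illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    "user_id": [1, 2, 1, 3, 2],
    "month": pd.to_datetime(
        ["1997-01-01", "1997-01-01", "1997-02-01", "1997-02-01", "1997-03-01"]
    ),
})

# First purchase month per user, broadcast back onto every order row.
orders["first_month"] = orders.groupby("user_id")["month"].transform("min")

# One row per (month, user) so that each buyer is counted once per month.
buyers = orders.drop_duplicates(["month", "user_id"])

# A buyer is new in a month when that month is their first purchase month.
is_new = buyers["month"] == buyers["first_month"]
new_share = is_new.groupby(buyers["month"]).mean()

print(new_share)
```

Taking the mean of a boolean column gives the fraction of True values, so no explicit division is needed.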

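4.4 User segmentation

R: recency (time since the most recent purchase)
F: frequency (approximated here by the total number of CDs purchased)
M: monetary value (total amount spent)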

rfm = df.pivot_table(index = 'user_id',
                     values = ['order_products','order_amount','order_dt'],
                     aggfunc = {'order_products':'sum','order_amount':'sum','order_dt':'max'})
rfm.head()

	order_amount	order_dt	order_products
user_id
1	11.77	1997-01-01	1
2	89.00	1997-01-12	6
3	156.46	1998-05-28	16
4	100.50	1997-12-12	7
5	385.61	1998-01-03	29

rfm['R'] = -(rfm.order_dt - rfm.order_dt.max()) / np.timedelta64(1,'D')
rfm.rename(columns = {'order_products':'F','order_amount':'M'},inplace = True)

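def rfm_func(x):
    # x holds one user's mean-centred R, F and M; flag '1' if the value exceeds 1, else '0'
    level = x.apply(lambda x:'1' if x>1 else '0')
    label = level.R + level.F + level.M
    # Keys are the R, F, M flags in that order. R counts days since the last
    # purchase, so R = '0' means the user bought more recently than average.
    d = {
        '111':'重要价值客户',  # important high-value customers
        '011':'重要保持客户',  # important retention customers
        '101':'重要发展客户',  # important development customers
        '001':'重要挽留客户',  # important win-back customers
        '110':'一般价值客户',  # general high-value customers
        '010':'一般保持客户',  # general retention customers
        '100':'一般发展客户',  # general development customers
        '000':'一般挽留客户'   # general win-back customers
    }
    result = d[label]
    return result
rfm['label'] = rfm[['R','F','M']].apply(lambda x:x - x.mean()).apply(rfm_func,axis = 1)
# Split users into below-average and above-average groups on each of R, F and M (the right thresholds depend on the business):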

rfm.groupby('label').sum()

	M	F	R
label
一般价值客户	1767.11	182	8512.0
一般保持客户	5100.77	492	7782.0
一般发展客户	445233.28	29915	6983699.0
一般挽留客户	215075.77	15428	621894.0
重要价值客户	147180.09	9849	286676.0
重要保持客户	1555586.51	105509	476502.0
重要发展客户	49905.80	2322	174340.0
重要挽留客户	80466.30	4184	96009.0

rfm.groupby('label').count()

	M	order_dt	F	R	color
label
一般价值客户	18	18	18	18	18
一般保持客户	53	53	53	53	53
一般发展客户	14138	14138	14138	14138	14138
一般挽留客户	3493	3493	3493	3493	3493
重要价值客户	631	631	631	631	631
重要保持客户	4267	4267	4267	4267	4267
重要发展客户	371	371	371	371	371
重要挽留客户	599	599	599	599	599

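According to the RFM segmentation, "重要保持客户" (important retention customers; label 011, meaning R below average and F and M above average) are the main source of revenue: users whose last purchase is more recent, who buy more and who spend more are the high-quality customers.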

rfm.loc[ rfm.label == '重要保持客户', 'color'] = 'brown'
rfm.loc[ rfm.label == '重要发展客户', 'color'] = 'grey'
rfm.loc[ rfm.label == '重要价值客户', 'color'] = 'c'
rfm.loc[ rfm.label == '重要挽留客户', 'color'] = 'r'
rfm.loc[ rfm.label == '一般保持客户', 'color'] = 'k'
rfm.loc[ rfm.label == '一般发展客户', 'color'] = 'y'
rfm.loc[ rfm.label == '一般价值客户', 'color'] = 'm'
rfm.loc[ rfm.label == '一般挽留客户', 'color'] = 'w'

rfm.plot.scatter('F', 'R', c=rfm.color)

<matplotlib.axes._subplots.AxesSubplot at 0x2f090f08cf8>

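Distribution of customers by label

The plot shows that for most users F is below 200. A few extreme values can bias an RFM split based on the mean (the mean is too high for most users), so if the mean is used as the threshold, extreme values should first be removed, for example with the Chebyshev bound.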
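To see how sensitive the labels are to the threshold, the three-digit code can be computed with different cut points. The sketch below compares the mean with the median on made-up numbers; `rfm_code` is a hypothetical helper, and the median is swapped in here purely for comparison. It is not the method used in the post, which suggests trimming extreme values before taking the mean.

```python
import pandas as pd


def rfm_code(rfm, center="mean"):
    """Return a 3-character code per user: '1' where R, F, M exceed the cut."""
    cols = rfm[["R", "F", "M"]]
    cut = cols.agg(center)            # one threshold per column
    above = cols.gt(cut, axis=1)      # compare each column with its threshold
    bits = above.astype(int).astype(str)
    return bits["R"] + bits["F"] + bits["M"]


# Made-up users; the last one is an extreme buyer that inflates the means.
rfm = pd.DataFrame({
    "R": [10, 300, 20, 400],
    "F": [1, 2, 3, 50],
    "M": [10.0, 20.0, 30.0, 1000.0],
})

by_mean = rfm_code(rfm, "mean")
by_median = rfm_code(rfm, "median")
print(pd.DataFrame({"mean": by_mean, "median": by_median}))
```

In this toy example the third user is below the mean on F and M only because of the single extreme buyer, and moves to a different segment once the median is used.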

4.5 User lifecycle analysis

pivoted_counts = df.pivot_table(index = 'user_id',
                                columns = 'month',
                                values = 'order_dt',
                                aggfunc = 'count').fillna(0)
pivoted_counts.head()

month	1997-01-01 00:00:00	1997-02-01 00:00:00	1997-03-01 00:00:00	1997-04-01 00:00:00	1997-05-01 00:00:00	1997-06-01 00:00:00	1997-07-01 00:00:00	1997-08-01 00:00:00	1997-09-01 00:00:00	1997-10-01 00:00:00	1997-11-01 00:00:00	1997-12-01 00:00:00	1998-01-01 00:00:00	1998-02-01 00:00:00	1998-03-01 00:00:00	1998-04-01 00:00:00	1998-05-01 00:00:00	1998-06-01 00:00:00
user_id
1	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	2.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
3	1.0	0.0	1.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	2.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0
4	2.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0
5	2.0	1.0	0.0	1.0	1.0	1.0	1.0	0.0	1.0	0.0	0.0	2.0	1.0	0.0	0.0	0.0	0.0	0.0

df_purchase = pivoted_counts.applymap(lambda x: 1 if x>0 else 0)
df_purchase.tail()

month	1997-01-01 00:00:00	1997-02-01 00:00:00	1997-03-01 00:00:00	1997-04-01 00:00:00	1997-05-01 00:00:00	1997-06-01 00:00:00	1997-07-01 00:00:00	1997-08-01 00:00:00	1997-09-01 00:00:00	1997-10-01 00:00:00	1997-11-01 00:00:00	1997-12-01 00:00:00	1998-01-01 00:00:00	1998-02-01 00:00:00	1998-03-01 00:00:00	1998-04-01 00:00:00	1998-05-01 00:00:00	1998-06-01 00:00:00
user_id
23566	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
23567	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
23568	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
23569	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
23570	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

def active_status(data):
    status = []
    for i in range(18):  # 18 monthly columns, 1997-01 to 1998-06

        # no purchase this month
        if data.iloc[i] == 0:
            if len(status) > 0:
                if status[i-1] == 'unreg':
                    status.append('unreg')
                else:
                    status.append('unactive')
            else:
                status.append('unreg')

        # purchase this month
        else:
            if len(status) == 0:
                status.append('new')
            else:
                if status[i-1] == 'unactive':
                    status.append('return')
                elif status[i-1] == 'unreg':
                    status.append('new')
                else:
                    status.append('active')
    return status

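Function logic:
If there is no purchase this month:

if the previous status was unregistered, the user stays unregistered (unreg)
if the user has purchased before, the status is churned/inactive (unactive)
otherwise (first month, no purchase yet), the status is unregistered (unreg)

If there is a purchase this month:

if this is the first purchase, the user is new (new)
if the user has purchased before and was inactive last month, the user is returning (return)
if the user was unregistered last month, the user is new (new)
otherwise, the user is active (active)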
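The rules above form a small state machine in which each month's status depends only on the previous status and on whether the user bought this month. A compact, self-contained sketch of the same rules follows; `label_months` is a hypothetical name, and it takes a plain list of 0/1 flags rather than a DataFrame row.

```python
def label_months(flags):
    """Map a sequence of monthly 0/1 purchase flags to lifecycle statuses."""
    status = []
    for bought in flags:
        # Before the first month the user is treated as unregistered.
        prev = status[-1] if status else "unreg"
        if not bought:
            # Never bought so far: still unregistered; otherwise inactive.
            status.append("unreg" if prev == "unreg" else "unactive")
        elif prev == "unreg":
            status.append("new")       # first purchase ever
        elif prev == "unactive":
            status.append("return")    # bought again after a gap
        else:
            status.append("active")    # bought in consecutive months
    return status


# First five months of user 3 from the purchase table above.
print(label_months([1, 0, 1, 1, 0]))
```

For user 3 this reproduces the first five statuses shown in the `purchase_stats` table (new, unactive, return, active, unactive).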

indexs=df['month'].sort_values().astype('str').unique()
purchase_stats = df_purchase.apply(lambda x:pd.Series(active_status(x),index=indexs),axis=1)
purchase_stats.head()

	1997-01-01	1997-02-01	1997-03-01	1997-04-01	1997-05-01	1997-06-01	1997-07-01	1997-08-01	1997-09-01	1997-10-01	1997-11-01	1997-12-01	1998-01-01	1998-02-01	1998-03-01	1998-04-01	1998-05-01	1998-06-01
user_id
1	new	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive
2	new	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive
3	new	unactive	return	active	unactive	unactive	unactive	unactive	unactive	unactive	return	unactive	unactive	unactive	unactive	unactive	return	unactive
4	new	unactive	unactive	unactive	unactive	unactive	unactive	return	unactive	unactive	unactive	return	unactive	unactive	unactive	unactive	unactive	unactive
5	new	active	unactive	return	active	active	active	unactive	return	unactive	unactive	return	active	unactive	unactive	unactive	unactive	unactive

purchase_stats_ct = purchase_stats.replace('unreg', np.nan).apply(lambda x: x.value_counts())
purchase_stats_ct

	1997-01-01	1997-02-01	1997-03-01	1997-04-01	1997-05-01	1997-06-01	1997-07-01	1997-08-01	1997-09-01	1997-10-01	1997-11-01	1997-12-01	1998-01-01	1998-02-01	1998-03-01	1998-04-01	1998-05-01	1998-06-01
active	NaN	1157.0	1681	1773.0	852.0	747.0	746.0	604.0	528.0	532.0	624.0	632.0	512.0	472.0	571.0	518.0	459.0	446.0
new	7846.0	8476.0	7248	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
return	NaN	NaN	595	1049.0	1362.0	1592.0	1434.0	1168.0	1211.0	1307.0	1404.0	1232.0	1025.0	1079.0	1489.0	919.0	1029.0	1060.0
unactive	NaN	6689.0	14046	20748.0	21356.0	21231.0	21390.0	21798.0	21831.0	21731.0	21542.0	21706.0	22033.0	22019.0	21510.0	22133.0	22082.0	22064.0

purchase_stats_ct.fillna(0).T.head()

	active	new	return	unactive
1997-01-01	0.0	7846.0	0.0	0.0
1997-02-01	1157.0	8476.0	0.0	6689.0
1997-03-01	1681.0	7248.0	595.0	14046.0
1997-04-01	1773.0	0.0	1049.0	20748.0
1997-05-01	852.0	0.0	1362.0	21356.0

purchase_stats_ct.fillna(0).T.plot.area()

<matplotlib.axes._subplots.AxesSubplot at 0x2f09297e0f0>

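User lifecycle

As the chart shows, the number of new users grows very quickly during the first three months. From March onward users start to churn rapidly, and in the later months inactive users make up the vast majority.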

purchase_stats_ct.fillna(0).T.apply(lambda x:x/x.sum(),axis=1)

	active	new	return	unactive
1997-01-01	0.000000	1.000000	0.000000	0.000000
1997-02-01	0.070886	0.519299	0.000000	0.409815
1997-03-01	0.071319	0.307510	0.025244	0.595927
1997-04-01	0.075223	0.000000	0.044506	0.880272
1997-05-01	0.036148	0.000000	0.057785	0.906067
1997-06-01	0.031693	0.000000	0.067543	0.900764
1997-07-01	0.031650	0.000000	0.060840	0.907510
1997-08-01	0.025626	0.000000	0.049555	0.924820
1997-09-01	0.022401	0.000000	0.051379	0.926220
1997-10-01	0.022571	0.000000	0.055452	0.921977
1997-11-01	0.026474	0.000000	0.059567	0.913958
1997-12-01	0.026814	0.000000	0.052270	0.920916
1998-01-01	0.021723	0.000000	0.043487	0.934790
1998-02-01	0.020025	0.000000	0.045779	0.934196
1998-03-01	0.024226	0.000000	0.063174	0.912601
1998-04-01	0.021977	0.000000	0.038990	0.939033
1998-05-01	0.019474	0.000000	0.043657	0.936869
1998-06-01	0.018922	0.000000	0.044972	0.936105

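The table above shows how the mix of user purchase statuses changes month by month:

Active users: users who keep purchasing; this reflects the quality of ongoing customer operations
Returning users: did not purchase previously and purchased this month; this reflects win-back operations
Inactive users: this reflects churn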

order_diff = grouped_user.apply(lambda x:x.order_dt - x.order_dt.shift())
order_diff.head(10)

user_id   
1        0        NaT
2        1        NaT
         2     0 days
3        3        NaT
         4    87 days
         5     3 days
         6   227 days
         7    10 days
         8   184 days
4        9        NaT
Name: order_dt, dtype: timedelta64[ns]

order_diff.describe()

count                      46089
mean     68 days 23:22:13.567662
std      91 days 00:47:33.924168
min              0 days 00:00:00
25%             10 days 00:00:00
50%             31 days 00:00:00
75%             89 days 00:00:00
max            533 days 00:00:00
Name: order_dt, dtype: object

(order_diff/np.timedelta64(1,'D')).hist(bins=20)

<matplotlib.axes._subplots.AxesSubplot at 0x2f090431940>

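Distribution of order intervals

Order intervals follow a roughly exponential distribution
The mean interval between a user's consecutive orders is about 69 days, and the median is 31 days
Most intervals are under 100 days (the 75th percentile is 89 days)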
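The intervals between consecutive orders can also be obtained with a grouped `diff`, which avoids the per-group `apply`. A sketch on a small frame built from the sample rows at the top of the post; the names `ordered` and `gap_days` are chosen here for illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 3],
    "order_dt": pd.to_datetime(
        ["1997-01-01", "1997-01-12", "1997-01-12", "1997-01-02", "1997-03-30"]
    ),
})

# Sort so that diff() compares each order with the same user's previous order.
ordered = orders.sort_values(["user_id", "order_dt"])

# Days since the user's previous order; NaN for each user's first order.
gap_days = ordered.groupby("user_id")["order_dt"].diff().dt.days

summary = gap_days.describe()
print(summary)
```

Reporting the mean and the median side by side, as `describe()` does, matters here because the interval distribution is strongly right-skewed.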

(user_life['max'] - user_life['min']).describe()

count                       23570
mean     134 days 20:55:36.987696
std      180 days 13:46:43.039788
min               0 days 00:00:00
25%               0 days 00:00:00
50%               0 days 00:00:00
75%             294 days 00:00:00
max             544 days 00:00:00
dtype: object

((user_life['max'] - user_life['min'])/np.timedelta64(1,'D')).hist(bins=40)

<matplotlib.axes._subplots.AxesSubplot at 0x2f093b8ca58>

# filter out zero values (users whose first and last purchase fall on the same day)
u_1 = (user_life['max'] - user_life['min'])/np.timedelta64(1,'D')
u_1[u_1>0].hist(bins=40)

<matplotlib.axes._subplots.AxesSubplot at 0x2f090d3fc18>

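The user lifetime distribution is strongly affected by users who purchased only once (these can be excluded)
The mean user lifetime is 134 days, while the median is only 0 days

5. Repeat Purchase Rate and Repurchase Rate Analysis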

pivoted_counts.head()

month	1997-01-01 00:00:00	1997-02-01 00:00:00	1997-03-01 00:00:00	1997-04-01 00:00:00	1997-05-01 00:00:00	1997-06-01 00:00:00	1997-07-01 00:00:00	1997-08-01 00:00:00	1997-09-01 00:00:00	1997-10-01 00:00:00	1997-11-01 00:00:00	1997-12-01 00:00:00	1998-01-01 00:00:00	1998-02-01 00:00:00	1998-03-01 00:00:00	1998-04-01 00:00:00	1998-05-01 00:00:00	1998-06-01 00:00:00
user_id
1	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	2.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
3	1.0	0.0	1.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	2.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0
4	2.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0
5	2.0	1.0	0.0	1.0	1.0	1.0	1.0	0.0	1.0	0.0	0.0	2.0	1.0	0.0	0.0	0.0	0.0	0.0

purchase_r = pivoted_counts.applymap(lambda x: 1 if x>1 else np.nan if x==0 else 0)
purchase_r.head()

month	1997-01-01 00:00:00	1997-02-01 00:00:00	1997-03-01 00:00:00	1997-04-01 00:00:00	1997-05-01 00:00:00	1997-06-01 00:00:00	1997-07-01 00:00:00	1997-08-01 00:00:00	1997-09-01 00:00:00	1997-10-01 00:00:00	1997-11-01 00:00:00	1997-12-01 00:00:00	1998-01-01 00:00:00	1998-02-01 00:00:00	1998-03-01 00:00:00	1998-04-01 00:00:00	1998-05-01 00:00:00	1998-06-01 00:00:00
user_id
1	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	0.0	NaN	0.0	0.0	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	0.0	NaN
4	1.0	NaN	NaN	NaN	NaN	NaN	NaN	0.0	NaN	NaN	NaN	0.0	NaN	NaN	NaN	NaN	NaN	NaN
5	1.0	0.0	NaN	0.0	0.0	0.0	0.0	NaN	0.0	NaN	NaN	1.0	0.0	NaN	NaN	NaN	NaN	NaN

5.1 Repeat purchase rate (share of a month's buyers who purchased more than once within that calendar month)

(purchase_r.sum()/purchase_r.count()).plot(figsize=(10,4))

<matplotlib.axes._subplots.AxesSubplot at 0x2f09130db38>

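The repeat purchase rate is low in the first three months and rises quickly, mainly because the large influx of new users in those months mostly bought only once; it stabilizes at about 20% from around April.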
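The repeat purchase rate is the number of users with more than one order in a month divided by the number of users with at least one order in that month. A sketch of that ratio on a toy order-count table (users as rows, months as columns); the table values and column labels are made up.

```python
import pandas as pd

# Orders per user (rows) and month (columns); 0 means no purchase.
counts = pd.DataFrame(
    {"1997-01": [1, 2, 1, 0], "1997-02": [0, 3, 1, 2]},
    index=pd.Index([1, 2, 3, 4], name="user_id"),
)

buyers = counts.gt(0).sum()          # users with at least one order
repeat_buyers = counts.gt(1).sum()   # users with two or more orders
repeat_rate = repeat_buyers / buyers

print(repeat_rate)
```

This gives the same result as encoding non-buyers as NaN and dividing the column sum by the column count, since NaN entries are skipped by both.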

5.2 Repurchase rate (share of users who bought in one period and bought again in the next)

df_purchase = df_purchase.fillna(0)
df_purchase.head()

month	1997-01-01 00:00:00	1997-02-01 00:00:00	1997-03-01 00:00:00	1997-04-01 00:00:00	1997-05-01 00:00:00	1997-06-01 00:00:00	1997-07-01 00:00:00	1997-08-01 00:00:00	1997-09-01 00:00:00	1997-10-01 00:00:00	1997-11-01 00:00:00	1997-12-01 00:00:00	1998-01-01 00:00:00	1998-02-01 00:00:00	1998-03-01 00:00:00	1998-04-01 00:00:00	1998-05-01 00:00:00	1998-06-01 00:00:00
user_id
1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
2	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3	1	0	1	1	0	0	0	0	0	0	1	0	0	0	0	0	1	0
4	1	0	0	0	0	0	0	1	0	0	0	1	0	0	0	0	0	0
5	1	1	0	1	1	1	1	0	1	0	0	1	1	0	0	0	0	0

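def purchase_back(data):
    status = []
    for i in range(17):  # the last of the 18 months has no following month
        if data.iloc[i] == 1:
            # bought this month: 1 if the user also bought next month, else 0
            status.append(1 if data.iloc[i+1] == 1 else 0)
        else:
            # did not buy this month: excluded from the rate
            status.append(np.nan)
    status.append(np.nan)
    return status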

#purchase_b = df_purchase.apply(purchase_back,axis=1)
indexs=df['month'].sort_values().astype('str').unique()
purchase_b = df_purchase.apply(lambda x:pd.Series(purchase_back(x),index=indexs),axis=1)
purchase_b.head()

	1997-01-01	1997-02-01	1997-03-01	1997-04-01	1997-05-01	1997-06-01	1997-07-01	1997-08-01	1997-09-01	1997-10-01	1997-11-01	1997-12-01	1998-01-01	1998-02-01	1998-03-01	1998-04-01	1998-05-01	1998-06-01
user_id
1	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	0.0	NaN	1.0	0.0	NaN	NaN	NaN	NaN	NaN	NaN	0.0	NaN	NaN	NaN	NaN	NaN	0.0	NaN
4	0.0	NaN	NaN	NaN	NaN	NaN	NaN	0.0	NaN	NaN	NaN	0.0	NaN	NaN	NaN	NaN	NaN	NaN
5	1.0	0.0	NaN	1.0	1.0	1.0	0.0	NaN	0.0	NaN	NaN	1.0	0.0	NaN	NaN	NaN	NaN	NaN

(purchase_b.sum()/purchase_b.count()).plot(figsize=(10,4))

<matplotlib.axes._subplots.AxesSubplot at 0x2f093343b00>

The repurchase rate rises quickly over the first three months, reaches about 30% by April, and then fluctuates around 30%.
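The month-over-month repurchase rate can also be computed without a row-wise function by comparing each month's 0/1 purchase flags with the next month's. A sketch on a toy flag table; the variable names and values are assumptions for illustration, and the division assumes every month has at least one buyer.

```python
import pandas as pd

# 1 if the user (row) bought in the month (column), else 0.
flags = pd.DataFrame(
    [[1, 0, 1], [1, 1, 0], [1, 1, 1], [0, 1, 0]],
    columns=["1997-01", "1997-02", "1997-03"],
    index=pd.Index([1, 2, 3, 4], name="user_id"),
)

this_month = flags.iloc[:, :-1].to_numpy()
next_month = flags.iloc[:, 1:].to_numpy()

# Users who bought this month and again next month, per month.
came_back = (this_month * next_month).sum(axis=0)
buyers = this_month.sum(axis=0)

# The last month has no following month, so it gets no rate.
repurchase_rate = pd.Series(came_back / buyers, index=flags.columns[:-1])
print(repurchase_rate)
```

The element-wise product is 1 only where both months have a purchase, which is the same condition the loop-based function checks.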
