RFM-Based User Management

Article link: https://blog.csdn.net/lvhuike/article/details/107191339

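Case Background

Segmenting users by value is an important way to understand how much each user is worth. Sales-driven companies care most about order transactions, so a value model built on order data fits their operational needs best. The RFM (Recency, Frequency, Monetary) model is the one most commonly used for analyzing transaction data: it is simple, easy to understand, and straightforward to put into practice.

I. Import Libraries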

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier

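II. Load the Data

This project uses four years of order data (2015 to 2018), so the groups can be compared across years to see how the membership fluctuates over time.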

col_ = ['userID','orderID','date','bill']

df_2015 = pd.read_excel('sales.xlsx',sheet_name='2015', names=col_)

df_2016 = pd.read_excel('sales.xlsx',sheet_name='2016', names=col_)

df_2017 = pd.read_excel('sales.xlsx',sheet_name='2017', names=col_)

df_2018 = pd.read_excel('sales.xlsx',sheet_name='2018', names=col_)

df_member = pd.read_excel('sales.xlsx',sheet_name='会员等级')

df0 = pd.concat([df_2015,df_2016,df_2017,df_2018], axis=0)

df = df0.copy()

III. Data Inspection

1. Data Overview

df.head()

	userID	orderID	date	bill
0	15278002468	3000304681	2015-01-01	499.0
1	39236378972	3000305791	2015-01-01	2588.0
2	38722039578	3000641787	2015-01-01	498.0
3	11049640063	3000798913	2015-01-01	1572.0
4	35038752292	3000821546	2015-01-01	10.1

df.shape

(204240, 4)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 204240 entries, 0 to 81348
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype         
---  ------   --------------   -----         
 0   userID   204240 non-null  int64         
 1   orderID  204240 non-null  int64         
 2   date     204240 non-null  datetime64[ns]
 3   bill     204238 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 7.8 MB

The order amount column (bill) has 2 missing values.

df.describe()

	userID	orderID	bill
count	2.042400e+05	2.042400e+05	204238.000000
mean	2.901064e+10	4.287966e+09	963.079622
std	1.399716e+10	1.527312e+08	2236.971821
min	8.100000e+01	3.000305e+09	0.000000
25%	1.900445e+10	4.317356e+09	59.525000
50%	3.727031e+10	4.334091e+09	148.000000
75%	3.923266e+10	4.348166e+09	899.000000
max	3.954614e+10	4.354235e+09	174900.000000

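The results above lead to the following conclusions:

Every sheet was read and parsed correctly, with no errors.
The date column was automatically recognized as a datetime type, so no conversion is needed later.
The order amounts are unevenly distributed, with obvious extreme values at both ends. The very large amounts are most likely customers buying several high-value items in one order, so they are meaningful. The minimum of 0 is the amount paid when an order was covered by coupons or discounts, which carries no real meaning. Records with an order amount below 1 are dropped during data processing.
There are missing values, but only 2 of them, so records containing missing values are dropped during data processing.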
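
The checks behind these conclusions can be reproduced with a few pandas calls. The sketch below is only an illustration: `toy` is a made-up table, not the article's data, and it applies the "below 1" rule as stated in the text (the article's own filter later is `bill > 1`, which also drops orders of exactly 1).

```python
import numpy as np
import pandas as pd

# Hypothetical toy orders standing in for the real sales.xlsx data.
toy = pd.DataFrame({
    "userID": [1, 2, 3, 4, 5],
    "bill": [499.0, 0.0, np.nan, 0.5, 2588.0],
})

# Count missing values per column.
missing_per_column = toy.isna().sum()

# Count orders whose amount is below 1 (NaN compares as False, so it is not counted).
low_amount_orders = int((toy["bill"] < 1).sum())

# Drop rows with missing values, then keep orders of at least 1.
cleaned = toy.dropna()
cleaned = cleaned[cleaned["bill"] >= 1]

print(missing_per_column["bill"], low_amount_orders, len(cleaned))
```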

IV. Data Processing

1. Remove Duplicates and Missing Values

df = df.drop_duplicates()

df = df.dropna()

2. Create New Features

# Keep only orders with an amount above 1; .copy() avoids a SettingWithCopyWarning on the next assignment
data = df[df['bill']>1].copy()

# Year of each order
data['year'] = data['date'].dt.year

# Latest order date within each year
df_lastestdate = data.groupby(['year'],as_index=False)['date'].max()

df_all = pd.merge(data, df_lastestdate, how='left', on='year')

# Days between each order (date_x) and the latest order of the same year (date_y)
df_all['datediff'] = (df_all['date_y']-df_all['date_x']).dt.days

df_all = df_all.drop(['date_y'], axis=1)

df_all.rename({'date_x':'date'}, axis=1, inplace=True)
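
The recency feature built above is the number of days between each order and the latest order date within the same year. As an alternative sketch (not the article's code), the same quantity can be computed without a merge by using `groupby(...).transform('max')`. The `toy` frame and the `year_last` name are made up for illustration.

```python
import pandas as pd

# Hypothetical toy orders, two per year.
toy = pd.DataFrame({
    "userID": [1, 1, 2, 2],
    "date": pd.to_datetime(["2015-03-01", "2015-12-31", "2016-01-10", "2016-01-20"]),
})

toy["year"] = toy["date"].dt.year

# Latest order date within each year, broadcast back to every row.
year_last = toy.groupby("year")["date"].transform("max")

# Days from each order to the latest order of the same year.
toy["datediff"] = (year_last - toy["date"]).dt.days

print(toy)
```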

3. Aggregate by userID

rfm_gb = df_all.groupby(['year','userID'],as_index=False).agg({'datediff':'min','date':'count','bill':'sum'})

rfm_gb.columns = ['year','userID','r','f','m']

rfm_gb.head()

	year	userID	r	f	m
0	2015	267	197	2	105.0
1	2015	282	251	1	29.7
2	2015	283	340	1	5398.0
3	2015	343	300	1	118.0
4	2015	525	37	3	213.0

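V. Determine the RFM Bin Boundaries

The basic logic of RFM segmentation is to bin (discretize) R, F, and M separately, which turns each of them into a discrete score.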

rfm_gb.describe()

	year	userID	r	f	m
count	148591.000000	1.485910e+05	148591.000000	148591.000000	148591.000000
mean	2016.773075	2.811669e+10	165.524043	1.365002	1323.741329
std	1.129317	1.477660e+10	101.988472	2.626953	3753.906883
min	2015.000000	8.100000e+01	0.000000	1.000000	1.500000
25%	2016.000000	1.728262e+10	79.000000	1.000000	69.000000
50%	2017.000000	3.689151e+10	156.000000	1.000000	189.000000
75%	2018.000000	3.923337e+10	255.000000	1.000000	1199.000000
max	2018.000000	3.954614e+10	365.000000	130.000000	206251.800000

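The summary shows about 149,000 rows in total (148,591). The r and m values are fairly spread out: min, 25%, 50%, 75%, and max are not clustered together. For f (purchase frequency), by contrast, most users sit at 1: min, 25%, 50%, and 75% are all 1, and the mean is only about 1.37.

We discretize r, f, and m into 3 intervals each, which yields at most 27 user groups. Too many intervals make the user groups hard to separate and act on, while too few may leave users insufficiently differentiated on each feature.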
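
The figure of 27 is simply 3 × 3 × 3. A minimal sketch that enumerates every possible group label, assuming the scores are 1, 2, and 3 and the label is the three digits concatenated in r, f, m order:

```python
from itertools import product

scores = (1, 2, 3)  # three score levels per dimension

# Every (r_score, f_score, m_score) combination, written as a label such as '212'.
groups = ["".join(map(str, combo)) for combo in product(scores, repeat=3)]

print(len(groups))
print(groups[:4])
```

Not every combination has to occur in practice: the result table later in the article lists 25 groups.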

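We plan to use the 25% and 75% quantiles as the two boundary values. The problem is that r and m separate users reasonably well on their own, whereas f does not (most users have only one order). Settling the boundaries for f requires talking to the business side. Because of the industry (large household appliances), repeat purchases really are rare and one purchase per year is the norm, so 2 and 5 are chosen as the boundaries: 2 because the business generally treats a user who buys twice or more in a year as a repeat customer, and 5 because the business regards five purchases as already very high for an ordinary user, with anyone above that belonging to the very-high-value group. Both boundaries come from business experience.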
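
It is worth checking where individual f values land once these boundaries are applied. The sketch below assumes the boundaries are passed to `pd.cut` with its default right-closed intervals, as the article does later. With edges 2 and 5, a user with exactly 2 orders falls in the lowest bin, so the "repeat customer" score only starts at 3 orders. The alternative edges (1 and 5) are a hypothetical variant shown for comparison, not the article's choice.

```python
import pandas as pd

f = pd.Series([1, 2, 3, 5, 6], name="f")

# Boundaries as used in the article: (0, 2], (2, 5], (5, 130]
article_scores = pd.cut(f, bins=[0, 2, 5, 130], labels=[1, 2, 3]).tolist()

# Hypothetical alternative: (0, 1], (1, 5], (5, 130], so f = 2 already scores 2
alt_scores = pd.cut(f, bins=[0, 1, 5, 130], labels=[1, 2, 3]).tolist()

print(article_scores)
print(alt_scores)
```

Which variant is right depends on how the business wants "two or more purchases" to be scored; the published results in this article are based on the first one.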

rbins = [rfm_gb['r'].quantile(0)-1,rfm_gb['r'].quantile(0.25),rfm_gb['r'].quantile(0.75),rfm_gb['r'].quantile(1)]

mbins = [rfm_gb['m'].quantile(0)-1,rfm_gb['m'].quantile(0.25),rfm_gb['m'].quantile(0.75),rfm_gb['m'].quantile(1)]

fbins = [rfm_gb['f'].quantile(0)-1,2,5,rfm_gb['f'].quantile(1)]

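Why is the lowest boundary set below the feature's minimum?
The pd.cut method used later treats custom boundaries as left-open, right-closed intervals, so a value equal to the leftmost boundary would not fall into any interval. The lowest boundary therefore has to be set below the feature's minimum.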
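
A small sketch of this behavior, using the min, 25%, 75%, and max values of r from the summary above as edges. It also shows `include_lowest=True`, a `pd.cut` parameter that is an alternative to shifting the lowest edge (the article itself uses the shift).

```python
import pandas as pd

r = pd.Series([0, 79, 156, 255, 365], name="r")

# Lowest edge equal to the minimum: 0 is outside (0, 79] and becomes NaN.
naive = pd.cut(r, bins=[0, 79, 255, 365])

# Lowest edge below the minimum, as in the article: 0 falls in (-1, 79].
shifted = pd.cut(r, bins=[-1, 79, 255, 365])

# Alternative: keep the edges and make the first interval include its left edge.
inclusive = pd.cut(r, bins=[0, 79, 255, 365], include_lowest=True)

print(naive.isna().sum(), shifted.isna().sum(), inclusive.isna().sum())
```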

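VI. Compute RFM Factor Weights

When computing a combined RFM score, the three scores can either be concatenated directly into a new group label or combined by a weighted sum into a single RFM score. The weighted sum requires a weight for each factor.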
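
The two options can be sketched for a single user as follows. The function names and `example_weights` are hypothetical and chosen only for illustration; the real weights are derived in the rest of this section.

```python
def rfm_weighted_score(r_score, f_score, m_score, weights):
    """Weighted sum of the three scores; weights is (w_r, w_f, w_m)."""
    w_r, w_f, w_m = weights
    return r_score * w_r + f_score * w_f + m_score * w_m


def rfm_group(r_score, f_score, m_score):
    """Concatenate the three scores into a group label such as '212'."""
    return f"{r_score}{f_score}{m_score}"


example_weights = (0.4, 0.1, 0.5)  # hypothetical weights that sum to 1
score = rfm_weighted_score(2, 1, 2, example_weights)
group = rfm_group(2, 1, 2)

print(group, round(score, 2))
```

The group label keeps the three dimensions separately readable, while the weighted score gives one number that can be sorted or thresholded.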

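This project includes member-level data, so the weights of the three RFM features can be derived from it. The idea is to build a classification model that predicts member level from R, F, and M, and take the weights from the model's output.

Without such member data, the weights can be assigned based on business experience.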
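
As a rough illustration of the model-based approach, the sketch below fits a random forest on synthetic data and reads `feature_importances_`, which scikit-learn normalizes to sum to 1. Everything here is an assumption made for the example: the data is random, the synthetic "member level" `y` is driven only by m, and `random_state` is fixed so the run is repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300

# Synthetic r, f, m columns (not the article's data).
X = np.column_stack([
    rng.integers(0, 366, n),   # r: days since the last order
    rng.integers(1, 6, n),     # f: number of orders
    rng.uniform(1, 5000, n),   # m: total amount
])

# Synthetic "member level" that depends only on m, for illustration.
y = (X[:, 2] > 2500).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
weights = clf.feature_importances_  # order matches the columns: r, f, m

print(weights)
```

Because the importances follow the column order passed to `fit`, the index used for each weight later has to match that order.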

df_member.rename({'会员ID':'userID','会员等级':'class'},axis=1,inplace=True)

rfm = pd.merge(rfm_gb,df_member,how='inner',on='userID')

rfm.head()

	year	userID	r	f	m	class
0	2015	267	197	2	105.0	1
1	2015	282	251	1	29.7	5
2	2017	282	314	2	12992.0	5
3	2018	282	19	5	30027.0	5
4	2015	283	340	1	5398.0	4

rfm.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 142292 entries, 0 to 142291
Data columns (total 6 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   year    142292 non-null  int64  
 1   userID  142292 non-null  int64  
 2   r       142292 non-null  int64  
 3   f       142292 non-null  int64  
 4   m       142292 non-null  float64
 5   class   142292 non-null  int64  
dtypes: float64(1), int64(5)
memory usage: 7.6 MB

clf = RandomForestClassifier().fit(rfm[['r','f','m']], rfm['class'])

weights = clf.feature_importances_

weights

array([0.4036885 , 0.00640852, 0.58990298])

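These results show that, among the three RFM features, the member level depends first on the member's value contribution (the actual order amount), then on recency, and last on frequency. This logic matches how many companies define their overall member levels.

VII. RFM Calculation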

rfm_gb['r_score'] = pd.cut(rfm_gb['r'], bins=rbins, labels=[i for i in range(len(rbins)-1,0,-1)])

rfm_gb['f_score'] = pd.cut(rfm_gb['f'], bins=fbins, labels=[i+1 for i in range(len(fbins)-1)])

rfm_gb['m_score'] = pd.cut(rfm_gb['m'], bins=mbins, labels=[i+1 for i in range(len(mbins)-1)])

rfm_gb.head()

	year	userID	r	f	m	r_score	f_score	m_score
0	2015	267	197	2	105.0	2	1	2
1	2015	282	251	1	29.7	2	1	1
2	2015	283	340	1	5398.0	1	1	3
3	2015	343	300	1	118.0	1	1	2
4	2015	525	37	3	213.0	3	2	2

# Convert only the score columns to integers; casting every column to int32 would truncate m and overflow large userIDs
for col in ['r_score','f_score','m_score']:
    rfm_gb[col] = rfm_gb[col].astype(int)

# Weighted score: weights[0], weights[1], weights[2] are the r, f, m weights
rfm_gb['rfm_score'] = rfm_gb['r_score']*weights[0] + rfm_gb['f_score']*weights[1] + rfm_gb['m_score']*weights[2]

rfm_gb.head().round(4)

	year	userID	r	f	m	r_score	f_score	m_score	rfm_score
0	2015	267	197	2	105.0	2	1	2	1.9936
1	2015	282	251	1	29.7	2	1	1	1.4037
2	2015	283	340	1	5398.0	1	1	3	2.1798
3	2015	343	300	1	118.0	1	1	2	1.5899
4	2015	525	37	3	213.0	3	2	2	2.4037

# Combine the R, F, M scores into a group label such as '212'
rfm_gb['rfm_group'] = rfm_gb['r_score'].astype(str) + rfm_gb['f_score'].astype(str) + rfm_gb['m_score'].astype(str)

rfm_gb.head().round(4)

	year	userID	r	f	m	r_score	f_score	m_score	rfm_score	rfm_group
0	2015	267	197	2	105.0	2	1	2	1.9936	212
1	2015	282	251	1	29.7	2	1	1	1.4037	211
2	2015	283	340	1	5398.0	1	1	3	2.1798	113
3	2015	343	300	1	118.0	1	1	2	1.5899	112
4	2015	525	37	3	213.0	3	2	2	2.4037	322

VIII. RFM Visualization

display_df = rfm_gb.groupby(['rfm_group','year'], as_index=False)['userID'].count()

display_df.rename({'userID':'number'},axis=1,inplace=True)

display_df2 = display_df.pivot_table(index='rfm_group',columns='year',values='number')

display_df2.plot.bar()

[Figure: bar chart of the number of users in each rfm_group, with one bar per year]

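IX. Data Analysis

1. Chart-Based Analysis

Key group: a quick look at the bar chart shows that, among all groups, the 212 group is relatively concentrated and changes the most. Its size changed little from 2016 to 2017 but doubled in 2018, so this group is a primary focus of the analysis.

Other major groups: besides 212, the bar chart shows that the 312, 213, 211, and 112 groups account for a large share in every year. Each is smaller than 212, but together they outnumber it, so they also need to be analyzed later.

2. Statistics-Based Analysis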

result_df = display_df.groupby('rfm_group')['number'].sum()

result_1 = result_df.sort_values(ascending=False)/result_df.sum()*100

result_2 = result_1.cumsum()

result = pd.concat([result_1,result_2,result_df],axis=1)

result.columns = ['ratio','cumsum_ratio','number']

result

	ratio	cumsum_ratio	number
212	24.792215	24.792215	36839
211	12.802256	37.594471	19023
312	12.554596	50.149067	18655
112	11.337160	61.486227	16846
213	11.016818	72.503045	16370
311	6.241293	78.744338	9274
111	6.136307	84.880646	9118
313	5.613395	90.494041	8341
113	5.070967	95.565007	7535
123	1.300213	96.865221	1932
233	0.703946	97.569166	1046
122	0.683083	98.252249	1015
333	0.326399	98.578649	485
133	0.317650	98.896299	472
322	0.275925	99.172224	410
222	0.251698	99.423922	374
223	0.249006	99.672928	370
323	0.246314	99.919241	366
332	0.024901	99.944142	37
321	0.022882	99.967024	34
221	0.016152	99.983175	24
232	0.008749	99.991924	13
121	0.006057	99.997981	9
132	0.001346	99.999327	2
331	0.000673	100.000000	1

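The statistics show that the top 9 user groups together account for nearly 96% of users, which agrees with the bar chart, so the analysis focuses on these 9 groups.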
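
The 96% figure can be checked directly from the counts in the table. In the sketch below the nine largest groups are copied from the table, and the remaining 16 groups are pooled into a single hypothetical `other` entry (6,590 users in total).

```python
import pandas as pd

# User counts per group, copied from the table above; "other" pools the 16 smallest groups.
counts = pd.Series({
    "212": 36839, "211": 19023, "312": 18655, "112": 16846, "213": 16370,
    "311": 9274, "111": 9118, "313": 8341, "113": 7535, "other": 6590,
})

ratio = counts / counts.sum() * 100   # share of each group, in percent
cumulative = ratio.cumsum()           # running total, largest group first
top9_share = cumulative["113"]        # cumulative share after the ninth group

print(round(top9_share, 3))
```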

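3. RFM User Profile Analysis

First tier: groups that each account for more than 10% of users. With this many users, operational strategies must be rolled out in bulk and cannot rely mainly on manual work.

212 group: a general group that can be developed. Recency and order amount are at the medium level, and purchase frequency is low. Use routine measures such as gift redemption and giveaways or free shipping for event check-ins to lift their spending.
211 group: a low-value group that can be developed. Compared with the 212 group it is slightly weaker on order amount. On top of the 212 strategy, add order-related incentives such as bundle coupons or buying products with points.
312 group: a general group with potential. Recency is high, so these users still have familiar touchpoints with the company and awareness of it; purchase frequency is low, so loyalty to the site is average; order amount is at the medium level, with room to grow. Based on their most recent purchases, recommend related products and use upselling to raise purchase frequency and order amount.
112 group: a general group that can be won back. Recency is low: a long time has passed since the last purchase, so these users may be dormant, about to churn, or already churned. Purchase frequency is low, so loyalty to the site is average; order amount is at the medium level, with room to grow. First reach and win back these users through multiple channels such as email, SMS, and phone, then stimulate spending with offers reserved for lapsed users (coupons and the like). Increase contact frequency and incentive strength to raise repeat purchases.
213 group: a high-value group that can be developed. Purchase frequency is low. Design different campaigns to reach these users (holiday promotions, weekly new arrivals, products reserved for high-value users, and so on) to prompt return visits and purchases.

Second tier: groups that each account for 1% to 10% of users. The user counts are moderate, so manual work can assist the rollout.

311 group: a general group with potential. Similar to the 211 group but with better recency, so the 211 strategy applies. Increase advertising and marketing spend on the channels these users touched most recently to bring them back to the site to buy.
111 group: a group that is weak on every dimension. Consider it only after the strategies for the other groups are in place. The main approach is still to win users back through multiple channels first, then push current best-sellers or heavily discounted products. Get them to complete a purchase under the stimulus of coupons and discounted products first, and only then work on purchase frequency and order amount.
313 group: a high-value group with potential. Purchase frequency is low and needs to be raised. Besides increasing exposure on the most recent contact channel, consider increasing marketing spend on related channels as well. The 213 strategy also applies.
113 group: a high-value group that can be won back. Similar to the 112 group; in addition to the 112 strategy, add some manual involvement (offline interviews, phone calls) to win back high-value users.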
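
The group labels used above can be decoded mechanically. A minimal sketch, assuming the digits are the r, f, and m scores in that order and that a score of 3 is the best level on each dimension (for recency, 3 means the most recent purchase). `LEVELS` and `describe_group` are hypothetical names.

```python
LEVELS = {1: "low", 2: "medium", 3: "high"}


def describe_group(group):
    """Turn a label such as '312' into a readable recency/frequency/monetary summary."""
    r, f, m = (int(ch) for ch in group)
    return f"recency {LEVELS[r]}, frequency {LEVELS[f]}, monetary {LEVELS[m]}"


for label in ["212", "312", "113"]:
    print(label, "->", describe_group(label))
```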

Third tier: groups with a very small share that are nonetheless very important.