Case 1
Score each of the three RFM indicators on a 1-5 scale according to hand-picked rules, then compute the mean score for each indicator. Within each indicator, customers scoring above the mean are marked 1 and the rest 0.
Finally, the three binary layers combine to split the customers into 8 groups.
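The final 8-group split falls out of the cross product of the three 0/1 flags; a quick sanity check:

```python
from itertools import product

# Binarizing each of R, F, M to 0/1 yields 2**3 = 8 possible segments
segments = list(product([0, 1], repeat=3))
print(len(segments))  # 8
```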
1. Data format
Assume the data has already been cleaned, with the following format:
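The cleaned table itself is not reproduced here (it was presumably shown as an image). Judging from the code later in the post, it contains at least the following columns; the rows below are hypothetical examples:

```python
import pandas as pd

# Hypothetical example rows; the column names match those used by the code later in this post
customer_data = pd.DataFrame({
    '客户ID': ['C001', 'C002'],
    '最近一次购买天数': [12, 210],   # R: days since the last purchase
    '订单数': [8, 1],                # F: number of orders
    '总消费金额': [4350.0, 980.0],   # M: total spend (GBP)
})
print(customer_data.head())
```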
2. Distribution of the R, F, M indicators
Total spend
As the chart below shows, total spend is clearly right-skewed: about 70% of customers spent under £3,000, while roughly 10% spent more than £8,000. This matches everyday intuition.
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams['font.size'] = 15
# The counts below were pre-computed from the dataset above, so we just use them directly.
lbs = ['1000英镑以内', '(1000, 3000]英镑', '(3000, 5000]英镑', '(5000, 8000]英镑', '(8000, 10000]英镑']
x= [2677, 2216, 837, 552, 685]
plt.figure(figsize=(6, 6))
plt.title('消费金额占比', fontsize = 18)
plt.pie(x, labels = lbs, autopct='%1.1f%%', colors = sns.color_palette('Blues'))
Purchase frequency
In this dataset, 1,493 customers (about 34.4%) made only a single purchase; the other 65.6% made repeat purchases during the period.
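The one-time-buyer share quoted above can be computed directly from the order-count column ('订单数', as used later in this post); a minimal sketch on toy data:

```python
import pandas as pd

# Toy stand-in for customer_data; the real post uses an order-count column named '订单数'
customer_data = pd.DataFrame({'订单数': [1, 1, 3, 5, 1, 2]})

one_time_share = (customer_data['订单数'] == 1).mean()
print(f'{one_time_share:.1%} of customers bought only once')  # 50.0% on this toy data
```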
Days since last purchase
This indicator is also clearly right-skewed.
count 4339.000000
mean 91.038258
std 100.010502
min -1.000000
25% 16.000000
50% 49.000000
75% 140.500000
max 372.000000
Name: 最近一次购买天数, dtype: float64
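The summary above looks like the output of pandas' describe(); note the minimum of -1 days, which hints at a data-quality issue worth checking before modeling. A minimal sketch of how such a summary is produced, on toy data:

```python
import pandas as pd

# Toy series standing in for customer_data['最近一次购买天数']
days = pd.Series([16, 49, 140, 372, -1], name='最近一次购买天数')
print(days.describe())  # count, mean, std, min, quartiles, max
```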
3. Building the model
3.1 Rule-based scoring of R, F, M
- Scoring rules
def days_score(x):
    # Recency: more recent purchases earn higher scores
    if x <= 30:
        score = 5
    elif (x > 30) and (x <= 90):
        score = 4
    elif (x > 90) and (x <= 180):
        score = 3
    elif (x > 180) and (x <= 365):
        score = 2
    else:
        score = 1
    return score

def frequency_score(x):
    # Frequency: more orders earn higher scores
    if x <= 10:
        score = 1
    elif (x > 10) and (x <= 30):
        score = 2
    elif (x > 30) and (x <= 50):
        score = 3
    elif (x > 50) and (x <= 80):
        score = 4
    else:
        score = 5
    return score

def revenue_score(x):
    # Monetary: higher spend earns higher scores
    if x <= 1000.0:
        score = 1
    elif (x > 1000) and (x <= 3000):
        score = 2
    elif (x > 3000) and (x <= 5000):
        score = 3
    elif (x > 5000) and (x <= 8000):
        score = 4
    else:
        score = 5
    return score
- Apply the scores along the three dimensions
customer_data['days_score'] = customer_data['最近一次购买天数'].apply(days_score)
customer_data['frequency_score'] = customer_data['订单数'].apply(frequency_score)
customer_data['revenue_score'] = customer_data['总消费金额'].apply(revenue_score)
customer_data.head()
3.2 Compute the mean score of each indicator
avg_r_score = customer_data['days_score'].mean()
avg_f_score = customer_data['frequency_score'].mean()
avg_m_score = customer_data['revenue_score'].mean()
print('平均最近一次消费时间间隔 R得分:', avg_r_score)
print('平均消费频次 F得分:', avg_f_score)
print('平均消费金额 M得分:', avg_m_score)
'''
平均最近一次消费时间间隔 R得分: 4.39525236229546
平均消费频次 F得分: 1.0778981332104172
平均消费金额 M得分: 1.60659138050242
'''
3.3 Layer by whether each score exceeds its mean
customer_data['R'] = customer_data['days_score'].apply(lambda x: 1 if x>round(avg_r_score, 1) else 0)
customer_data['F'] = customer_data['frequency_score'].apply(lambda x: 1 if x>round(avg_f_score, 1) else 0)
customer_data['M'] = customer_data['revenue_score'].apply(lambda x: 1 if x>round(avg_m_score, 1) else 0)
customer_data.head()
3.4 Generate labels from the layering result
customer_data.loc[((customer_data['R']==1) & (customer_data['F']==1) & (customer_data['M']==1)), 'customer_type'] = '重要价值客户'
customer_data.loc[((customer_data['R']==0) & (customer_data['F']==1) & (customer_data['M']==1)), 'customer_type'] = '重要保持客户'
customer_data.loc[((customer_data['R']==1) & (customer_data['F']==0) & (customer_data['M']==1)), 'customer_type'] = '重要发展客户'
customer_data.loc[((customer_data['R']==0) & (customer_data['F']==0) & (customer_data['M']==1)), 'customer_type'] = '重要挽留客户'
customer_data.loc[((customer_data['R']==1) & (customer_data['F']==1) & (customer_data['M']==0)), 'customer_type'] = '一般价值客户'
customer_data.loc[((customer_data['R']==0) & (customer_data['F']==1) & (customer_data['M']==0)), 'customer_type'] = '一般保持客户'
customer_data.loc[((customer_data['R']==1) & (customer_data['F']==0) & (customer_data['M']==0)), 'customer_type'] = '一般发展客户'
customer_data.loc[((customer_data['R']==0) & (customer_data['F']==0) & (customer_data['M']==0)), 'customer_type'] = '流失客户'
customer_data.head()
4. Inspect segment sizes
customer_data.customer_type.value_counts()
customer_data.customer_type.value_counts().plot(kind = 'pie')
# Zoom in on churned customers for further analysis
customer_data[customer_data['customer_type']=='流失客户']
Case 2
Split each of R, F, M into two equal-depth bins and extract the middle boundary value (threshold), then binarize the bin assignments to 0 and 1.
Finally, combine the three binary layers to split the customers into 8 groups.
The advantage is that no hand-picked scoring rules are needed, and right-skewed data causes no trouble. The trade-off of equal-depth binning is that each individual indicator is always split 50:50.
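The qcut call used below returns the bin edges when retbins=True; for 2 equal-depth bins the middle edge is simply the median, regardless of how skewed the data is. A quick illustration on toy data:

```python
import pandas as pd

# A strongly right-skewed toy series
s = pd.Series([1, 2, 2, 3, 4, 5, 8, 20, 50, 400])

# Equal-depth split into 2 bins: the middle bin edge is the median,
# so each bin holds half the data
threshold = pd.qcut(s, 2, retbins=True)[1][1]
print(threshold)  # 4.5, the median of s
```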
1. Base data format
2. Compute the R, F, M indicators
2.1 F reflects the customer's preference for discounted products: interest
- Note that in this example F is the proportion of discounted orders, not the overall purchase frequency
import pandas as pd

F = trad_flow.groupby(['cumid', 'type'])[['transID']].count()
display(F.head())
F_trans = pd.pivot_table(F, index='cumid', columns='type', values='transID')
# print(F_trans.head())
F_trans['Special_offer'] = F_trans['Special_offer'].fillna(0)
# print(F_trans.head())
F_trans["interest"] = F_trans['Special_offer'] / (F_trans['Special_offer'] + F_trans['Normal'])
F_trans.head()
- Indicator distribution
As the chart below shows, the share of discounted orders per customer is clearly right-skewed.
F_trans['interest'].plot(kind='hist',bins=20,figsize=(15,6))
plt.title('interest')
2.2 M reflects the customer's total spend: value
A customer's total spend is the sum of spend across all order types, so a simple sum is enough.
M = trad_flow.groupby(['cumid', 'type'])[['amount']].sum()
display(M.head())
M_trans = pd.pivot_table(M, index='cumid', columns='type', values='amount')
M_trans['Special_offer'] = M_trans['Special_offer'].fillna(0)
M_trans['returned_goods'] = M_trans['returned_goods'].fillna(0)
M_trans["value"] = M_trans['Normal'] + M_trans['Special_offer'] + M_trans['returned_goods']
M_trans.head()
- Data distribution
By contrast, total spend in this dataset shows no obvious right skew; the distribution is close to normal.
M_trans['value'].plot(kind='hist',bins=20,figsize=(15,6))
plt.title('value')
2.3 R reflects whether the customer has gone silent: time_new
- Converted to a Unix timestamp here so the qcut function can bin it later
# First clean up the time column in the dataset
# Define a function that converts the text into a timestamp
import time
def to_time(t):
    out_t = time.mktime(time.strptime(t, '%d%b%y:%H:%M:%S'))  # convert to a timestamp so qcut can bin it later
    return out_t
trad_flow["time_new"] = trad_flow['time'].apply(to_time)
R = trad_flow.groupby(['cumid'])[['time_new']].max()
R.head()
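The format string '%d%b%y:%H:%M:%S' matches SAS-style datetimes; the example value below is hypothetical:

```python
import time

# Parse a SAS-style datetime string into a struct_time
t = time.strptime('14JUN09:17:58:34', '%d%b%y:%H:%M:%S')
print(t.tm_year, t.tm_mon, t.tm_mday)  # 2009 6 14
```

Note that time.mktime interprets the struct in the local timezone, so the absolute timestamps shift with the machine's timezone; that is harmless here because qcut only needs their relative order.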
- Data distribution
We skip converting to days-since-today for now and look at the raw timestamps directly. As the chart below shows, last-purchase recency is also fairly right-skewed.
R["time_new"].plot(kind='hist',bins=20,figsize=(15,6))
plt.title('time_new')
3. Building the model
3.1 Equal-depth binning and binarization
- Split each of R, F, M into two equal-depth bins, then binarize
from sklearn import preprocessing
threshold = pd.qcut(F_trans['interest'], 2, retbins=True)[1][1]  # equal-depth split into two bins
print(f'\nthreshold: {threshold:.5f}')
binarizer = preprocessing.Binarizer(threshold=threshold)  # binarize
interest_q = pd.DataFrame(binarizer.transform(F_trans['interest'].values.reshape(-1, 1)))
interest_q.index = F_trans.index
interest_q.columns = ["interest"]
# print(interest_q[:5])
display(interest_q['interest'].value_counts())
threshold = pd.qcut(M_trans['value'], 2, retbins=True)[1][1]  # equal-depth split into two bins
print(f'\nthreshold: {threshold:.2f}')
binarizer = preprocessing.Binarizer(threshold=threshold)  # binarize
value_q = pd.DataFrame(binarizer.transform(M_trans['value'].values.reshape(-1, 1)))
value_q.index = M_trans.index
value_q.columns = ["value"]
# print(value_q[:5])
display(value_q['value'].value_counts())
threshold = pd.qcut(R["time_new"], 2, retbins=True)[1][1]  # equal-depth split into two bins
print(f'\nthreshold: {threshold:.0f}')
binarizer = preprocessing.Binarizer(threshold=threshold)  # binarize
time_new_q = pd.DataFrame(binarizer.transform(R["time_new"].values.reshape(-1, 1)))
time_new_q.index = R.index
time_new_q.columns = ["time"]
# print(time_new_q[:5])
display(time_new_q['time'].value_counts())
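The three near-identical blocks above can be folded into one helper. The sketch below replaces sklearn's Binarizer with a plain pandas comparison, which is equivalent here since Binarizer also maps values strictly greater than the threshold to 1:

```python
import pandas as pd

def equal_depth_binarize(s: pd.Series, out_name: str) -> pd.DataFrame:
    """Split a series at its median (equal-depth, 2 bins): 1 above the threshold, else 0."""
    threshold = pd.qcut(s, 2, retbins=True)[1][1]
    return (s > threshold).astype(int).rename(out_name).to_frame()

# Toy data standing in for F_trans['interest']; the same helper would
# apply to M_trans['value'] and R['time_new']
s = pd.Series([0.0, 0.1, 0.2, 0.6, 0.8, 0.9], name='interest')
interest_q = equal_depth_binarize(s, 'interest')
print(interest_q['interest'].tolist())  # [0, 0, 0, 1, 1, 1]
```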
3.2 Combined layering results
analysis = pd.concat([interest_q, value_q, time_new_q], axis=1)
analysis.head()
3.3 Generate labels
# Generate the RFM labels
label = {
(0, 0, 0): '无兴趣-低价值-沉默',
(1, 0, 0): '有兴趣-低价值-沉默',
(1, 0, 1): '有兴趣-低价值-活跃',
(0, 0, 1): '无兴趣-低价值-活跃',
(0, 1, 0): '无兴趣-高价值-沉默',
(1, 1, 0): '有兴趣-高价值-沉默',
(1, 1, 1): '有兴趣-高价值-活跃',
(0, 1, 1): '无兴趣-高价值-活跃'
}
analysis['label'] = analysis[['interest', 'value', 'time']].apply(lambda x: label[tuple(x)], axis=1)
analysis.head()
4. Inspect segment sizes
With this layering, the segment sizes come out fairly balanced (each indicator is split into two halves at its median). If you don't want segments this even, define your own layering rules instead (see Case 1).
output = analysis['label'].value_counts().rename('people_cnt').to_frame()
output['percentage'] = output['people_cnt'] / output['people_cnt'].sum()
output.sort_index(ascending=False)