5. Determine the RFM bin boundaries
Inspect the basic distribution of the data, then use it to discretize R, F, and M.
# Inspect the data distribution
desc_pd = rfm_gb.iloc[:,2:].describe().T
print(desc_pd)
# Define the bin edges
r_bins = [-1,79,255,365] # note: the first edge must be below the minimum value
f_bins = [0,2,5,130]
m_bins = [0,69,1199,206252]
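Rules for choosing the edges:
Interior edges: r and m are fairly spread out, so their 25% and 75% quantiles are used as the interior edges; the edges for f (purchase frequency) depend on the industry and are defined by the business department.
min: smaller than the minimum of each dimension
max: greater than or equal to the maximum
(pd.cut uses left-open, right-closed intervals by default, so the lowest edge must be smaller than the minimum value.)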
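The effect of the left-open, right-closed rule can be checked on a small example. This is a minimal sketch: the recency values below are hypothetical and only the r edges are taken from the text above.

```python
import pandas as pd

# Hypothetical recency values (days since last order); not from the source data.
r = pd.Series([0, 30, 79, 80, 255, 365], name="r")

# Lowest edge equal to the minimum: 0 falls outside (0, 79] and becomes NaN.
bad = pd.cut(r, bins=[0, 79, 255, 365])

# Lowest edge below the minimum: every value lands in an interval.
good = pd.cut(r, bins=[-1, 79, 255, 365])

print(bad.isna().sum())    # 1
print(good.isna().sum())   # 0

# Quartile-based interior edges, following the rule described above.
q25, q75 = r.quantile([0.25, 0.75])
derived = pd.cut(r, bins=[r.min() - 1, q25, q75, r.max()])
print(derived.value_counts(sort=False))
```

Note that 79 and 255 fall into the lower of their two neighbouring bins, because each interval includes its right edge.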
6. Compute the RFM weights
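from sklearn.ensemble import RandomForestClassifier

# Join the RFM values with the membership levels
rfm_merge = pd.merge(rfm_gb,sheet_datas[-1],on='会员ID',how='inner')
# Use a random forest to estimate the importance of each RFM factor
clf = RandomForestClassifier()
clf = clf.fit(rfm_merge[['r','f','m']],rfm_merge['会员等级'])
weights = clf.feature_importances_
print('feature importance:',weights)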
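First create the random forest model object, then train it with the three columns r, f, and m as features and the membership level (会员等级) as the target. The weights are read from feature_importances_, in the same order as the feature columns.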
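The same idea can be tried on synthetic data. The sketch below is an illustration only: the data frame `demo`, the rule that generates `level`, and the fixed `random_state` are all assumptions, not part of the original data set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for rfm_merge; all values are made up.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "r": rng.integers(0, 366, size=200),
    "f": rng.integers(1, 20, size=200),
    "m": rng.uniform(1, 5000, size=200),
})
# Hypothetical membership level that depends only on monetary value.
demo["level"] = pd.cut(demo["m"], bins=[0, 1000, 3000, 5000],
                       labels=[1, 2, 3]).astype(int)

# A fixed random_state makes the importances reproducible between runs.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(demo[["r", "f", "m"]], demo["level"])

# Importances are non-negative and sum to 1, so they can serve as weights.
demo_weights = pd.Series(clf.feature_importances_, index=["r", "f", "m"])
print(demo_weights)
print(demo_weights.sum())
```

Because the synthetic level is derived from m alone, m should receive by far the largest weight here. On real data the weights reflect how the membership levels were actually assigned.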
7. Compute the RFM scores
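# Bin each RFM dimension into a score
rfm_gb['r_score'] = pd.cut(rfm_gb['r'], r_bins, labels=[i for i in range(len(r_bins)-1,0,-1)]) # R score (reversed: more recent means higher)
rfm_gb['f_score'] = pd.cut(rfm_gb['f'], f_bins, labels=[i+1 for i in range(len(f_bins)-1)]) # F score
rfm_gb['m_score'] = pd.cut(rfm_gb['m'], m_bins, labels=[i+1 for i in range(len(m_bins)-1)]) # M score
# Compute the overall RFM score
# Method 1: weighted score
# Convert only the categorical score columns to integers, so the other columns (e.g. m) are not truncated
for col in ['r_score', 'f_score', 'm_score']:
    rfm_gb[col] = rfm_gb[col].astype(np.int32)
rfm_gb['rfm_score'] = rfm_gb['r_score'] * weights[0] + rfm_gb['f_score'] * weights[1] + rfm_gb['m_score'] * weights[2]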
# Method 2: RFM combination (concatenate the three scores into one group label)
rfm_gb['r_score'] = rfm_gb['r_score'].astype(np.str_)
rfm_gb['f_score'] = rfm_gb['f_score'].astype(np.str_)
rfm_gb['m_score'] = rfm_gb['m_score'].astype(np.str_)
rfm_gb['rfm_group'] = rfm_gb['r_score'].str.cat(rfm_gb['f_score']).str.cat(rfm_gb['m_score'])
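To make the two scoring methods concrete, here is a small worked example. The three members and the weights are hypothetical; only the bin edges are the ones defined in step 5.

```python
import pandas as pd

# Hypothetical members and weights; none of these values come from the source data.
demo = pd.DataFrame({
    "r": [10, 100, 300],
    "f": [1, 3, 10],
    "m": [50.0, 500.0, 5000.0],
})
demo_weights = [0.4, 0.4, 0.2]  # assumed weights for r, f, m

r_bins = [-1, 79, 255, 365]
f_bins = [0, 2, 5, 130]
m_bins = [0, 69, 1199, 206252]

# R labels are reversed: the most recent buyers get the highest score.
demo["r_score"] = pd.cut(demo["r"], r_bins, labels=[3, 2, 1]).astype(int)
demo["f_score"] = pd.cut(demo["f"], f_bins, labels=[1, 2, 3]).astype(int)
demo["m_score"] = pd.cut(demo["m"], m_bins, labels=[1, 2, 3]).astype(int)

# Method 1: weighted sum of the three scores.
demo["rfm_score"] = (demo["r_score"] * demo_weights[0]
                     + demo["f_score"] * demo_weights[1]
                     + demo["m_score"] * demo_weights[2])

# Method 2: concatenate the three scores into a group label.
demo["rfm_group"] = (demo["r_score"].astype(str)
                     + demo["f_score"].astype(str)
                     + demo["m_score"].astype(str))

print(demo[["rfm_score", "rfm_group"]])
# rfm_group is 311, 222, 133; rfm_score is approximately 1.8, 2.0, 2.2
```

The weighted score gives a single number that can be ranked, while the group label keeps the three dimensions separate, so "311" (recent, low frequency, low spend) stays distinguishable from "133" (lapsed, high frequency, high spend).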
8. Save the results to Excel
rfm_gb.to_excel('sales_rfm_score.xlsx')