Electronics sales analysis for an online store: the RFM method

GitHub repo for this article: DataSicence
Data download: link
Reference material: link

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns 
# seaborn needs to be a recent version (relplot/displot are used below)

Data loading

  • event_time - purchase time
  • order_id - order ID
  • product_id - product ID
  • category_id - product category ID
  • category_code - product category taxonomy (code name)
  • brand - brand name
  • price - product price
  • user_id - user ID
df = pd.read_csv(r'D:\Code\Github\data\kz.csv', sep=',')
df.head()
| | event_time | order_id | product_id | category_id | category_code | brand | price | user_id |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020-04-24 11:50:39 UTC | 2294359932054536986 | 1515966223509089906 | 2.268105e+18 | electronics.tablet | samsung | 162.01 | 1.515916e+18 |
| 1 | 2020-04-24 11:50:39 UTC | 2294359932054536986 | 1515966223509089906 | 2.268105e+18 | electronics.tablet | samsung | 162.01 | 1.515916e+18 |
| 2 | 2020-04-24 14:37:43 UTC | 2294444024058086220 | 2273948319057183658 | 2.268105e+18 | electronics.audio.headphone | huawei | 77.52 | 1.515916e+18 |
| 3 | 2020-04-24 14:37:43 UTC | 2294444024058086220 | 2273948319057183658 | 2.268105e+18 | electronics.audio.headphone | huawei | 77.52 | 1.515916e+18 |
| 4 | 2020-04-24 19:16:21 UTC | 2294584263154074236 | 2273948316817424439 | 2.268105e+18 | NaN | karcher | 217.57 | 1.515916e+18 |
df.shape
(2633521, 8)

Check for missing values

df.isnull().sum()
event_time             0
order_id               0
product_id             0
category_id       431954
category_code     612202
brand             506005
price             431954
user_id          2069352
dtype: int64

For quick processing, drop every row that contains a missing value (a less aggressive alternative is sketched after the output below).

df = df.dropna()
df.shape
(420718, 8)
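The blanket dropna() keeps only 420,718 of the original 2,633,521 rows. A gentler option, sketched below as my own variant rather than what this walkthrough does (df_raw and df_alt are names I introduce here), is to require only the columns the later analysis actually uses:

# Sketch: reload the raw file, then drop rows only where the needed fields are missing
df_raw = pd.read_csv(r'D:\Code\Github\data\kz.csv', sep=',')
df_alt = df_raw.dropna(subset=['event_time', 'order_id', 'price', 'user_id'])
df_alt.shape  # typically keeps more rows than dropping on every column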

Check data types

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 420718 entries, 0 to 2633520
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   event_time     420718 non-null  object 
 1   order_id       420718 non-null  int64  
 2   product_id     420718 non-null  int64  
 3   category_id    420718 non-null  float64
 4   category_code  420718 non-null  object 
 5   brand          420718 non-null  object 
 6   price          420718 non-null  float64
 7   user_id        420718 non-null  float64
dtypes: float64(3), int64(2), object(3)
memory usage: 28.9+ MB
# Convert the time column to datetime
df['event_time'] = pd.to_datetime(df.event_time)
print(df.event_time.max(),df.event_time.min())
2020-11-21 10:10:30+00:00 1970-01-01 00:33:40+00:00
# Extract the month from the date
df['month'] = df.event_time.dt.month
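The earliest timestamp printed above is 1970-01-01, a Unix-epoch placeholder rather than a real order date, which is why later cells filter on event_time.dt.year == 2020. A sketch (my own simplification, not in the original notebook) that applies that filter once up front:

# Sketch: keep only 2020 orders so every later groupby sees clean dates
df = df.loc[df.event_time.dt.year == 2020].copy()
df['month'] = df.event_time.dt.month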

User spending trends

df_month = df.loc[df.event_time.dt.year == 2020].groupby(['month'])

Total spending per month

df_month_sum = df_month.sum().reset_index().rename(columns = {'price':'销售额','month':'月份'})
plt.rcParams['font.sans-serif']=['SimSun'] # display Chinese labels correctly
plt.rcParams['axes.unicode_minus']=False # display the minus sign correctly
%matplotlib inline
plt.style.use("ggplot")
#plt.figure(figsize = (15,8))
sns.relplot(x='月份',y = '销售额',data= df_month_sum,kind='line',height=4,aspect=15/8)
plt.title('每月消费总金额')
Text(0.5, 1.0, '每月消费总金额')

[Figure: line chart of total monthly spending (每月消费总金额)]

Monthly buyer count

df_month_count = df_month.count().reset_index().rename(columns = {'price':'活跃人数','month':'月份'})
plt.rcParams['font.sans-serif']=['SimSun'] # display Chinese labels correctly
plt.rcParams['axes.unicode_minus']=False # display the minus sign correctly
%matplotlib inline
plt.style.use("ggplot")

sns.relplot(x='月份',y = '活跃人数',data= df_month_count,kind='line',height=4,aspect=15/8)
plt.title('每月消费人数')
Text(0.5, 1.0, '每月消费人数')

[Figure: line chart of monthly purchase counts (每月消费人数)]
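One thing to note: count() above counts order rows per month, so the column labelled '活跃人数' is really the number of purchase records, not distinct buyers. A sketch of counting unique users instead (my own variant; the column name '购买人数' is mine):

# Sketch: distinct buyers per month rather than number of order rows
df_month_users = (df.loc[df.event_time.dt.year == 2020]
                    .groupby('month')['user_id'].nunique()
                    .reset_index()
                    .rename(columns={'month': '月份', 'user_id': '购买人数'}))
sns.relplot(x='月份', y='购买人数', data=df_month_users, kind='line', height=4, aspect=15/8)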

Comparing the two

df_month = df.groupby(['month'])['price'].agg(['sum','count']).reset_index().rename(columns = {'sum':'销售额','month':'月份','count':'活跃人数'})
df_month
| | 月份 | 销售额 | 活跃人数 |
|---|---|---|---|
| 0 | 1 | 1.670965e+06 | 9982 |
| 1 | 2 | 1.928107e+06 | 11566 |
| 2 | 3 | 2.532487e+06 | 12461 |
| 3 | 4 | 1.550330e+06 | 8807 |
| 4 | 5 | 7.180919e+06 | 30826 |
| 5 | 6 | 6.834606e+06 | 29750 |
| 6 | 7 | 1.511283e+07 | 61037 |
| 7 | 8 | 2.601872e+07 | 82198 |
| 8 | 9 | 1.571325e+07 | 53591 |
| 9 | 10 | 1.763243e+07 | 74107 |
| 10 | 11 | 1.075389e+07 | 46393 |
# Reshape the table to long format
df_month_melt = df_month.melt(id_vars=['月份'],value_vars=['销售额','活跃人数'],var_name='cal',value_name='value')
sns.relplot(data=df_month_melt,x = '月份',y = 'value',col = 'cal',col_wrap=1,height=4,aspect=15/8,kind='line',facet_kws = {'sharey':False})
<seaborn.axisgrid.FacetGrid at 0x21e94bd46d0>

[Figure: faceted line charts of monthly 销售额 and 活跃人数]

  • July through October is the peak season; sales and purchase counts in the other months are comparatively low

Spending by brand

Sales by brand

df_grand = df.loc[df.event_time.dt.year == 2020].groupby('brand')['price'].agg(['sum']).reset_index().sort_values('sum',ascending = False).rename(
    columns = {'brand':'品牌','sum':'销售额'})
df_grand
| | 品牌 | 销售额 |
|---|---|---|
| 443 | samsung | 2.872334e+07 |
| 31 | apple | 2.590539e+07 |
| 300 | lg | 7.726328e+06 |
| 43 | asus | 5.072569e+06 |
| 299 | lenovo | 4.565506e+06 |
| ... | ... | ... |
| 47 | att | 9.200000e-01 |
| 163 | elfe | 8.800000e-01 |
| 547 | wurth | 6.700000e-01 |
| 102 | celebrat | 2.300000e-01 |
| 390 | pedigree | 2.300000e-01 |

570 rows × 2 columns

sns.barplot(x = '销售额',y='品牌',data=df_grand.iloc[:15,:])
<matplotlib.axes._subplots.AxesSubplot at 0x21e9445f2b0>

[Figure: bar chart of the top 15 brands by sales]

  • Samsung, Apple and LG have the highest sales

Number of users per brand

df_grand = df.loc[df.event_time.dt.year == 2020].groupby('brand')['user_id'].agg(pd.Series.nunique).reset_index().sort_values('user_id',ascending = False).rename(
    columns = {'brand':'品牌','user_id':'用户数量'})
df_grand
| | 品牌 | 用户数量 |
|---|---|---|
| 443 | samsung | 34602.0 |
| 31 | apple | 18441.0 |
| 50 | ava | 10095.0 |
| 300 | lg | 8243.0 |
| 554 | xiaomi | 7627.0 |
| ... | ... | ... |
| 227 | highwaybaby | 1.0 |
| 226 | herschel | 1.0 |
| 445 | sandisk | 1.0 |
| 320 | matrix | 1.0 |
| 80 | blackvue | 1.0 |

570 rows × 2 columns

g = sns.barplot(x = '用户数量',y='品牌',data=df_grand.iloc[:15,:])

[Figure: bar chart of the top 15 brands by user count]

Per-user sales for each brand

df_grand = df.loc[df.event_time.dt.year == 2020].groupby('brand')[['price','user_id']].agg({'price':'sum','user_id':pd.Series.nunique}).reset_index()
df_grand.head()
| | brand | price | user_id |
|---|---|---|---|
| 0 | acana | 175.21 | 2.0 |
| 1 | adguard | 2.75 | 1.0 |
| 2 | aeg | 50246.59 | 38.0 |
| 3 | aerocool | 81000.46 | 519.0 |
| 4 | agu | 99.54 | 2.0 |
df_grand['人均销售额'] = df_grand.price/df_grand.user_id
sns.barplot(x = '人均销售额',y='brand',data=df_grand.sort_values('人均销售额',ascending = False).iloc[:30,:])
<matplotlib.axes._subplots.AxesSubplot at 0x21e949fd4c0>

[Figure: bar chart of the top 30 brands by per-user sales]

Per-user spending

Scatter plot of purchase count vs. spend per user

data = df.groupby('user_id')[['order_id','price']].agg({'order_id':'count','price':'sum'}).rename(columns = {'order_id':'消费次数','price':'消费金额'})
data
| user_id | 消费次数 | 消费金额 |
|---|---|---|
| 1.515916e+18 | 1 | 416.64 |
| 1.515916e+18 | 2 | 56.43 |
| 1.515916e+18 | 12 | 5984.92 |
| 1.515916e+18 | 7 | 3785.72 |
| 1.515916e+18 | 2 | 182.83 |
| ... | ... | ... |
| 1.515916e+18 | 1 | 208.31 |
| 1.515916e+18 | 1 | 3472.20 |
| 1.515916e+18 | 2 | 277.74 |
| 1.515916e+18 | 1 | 925.67 |
| 1.515916e+18 | 1 | 418.96 |

90800 rows × 2 columns

sns.scatterplot(x = '消费次数',y = '消费金额',data = data)
plt.title('用户消费次数与消费金额之间的关系') 
Text(0.5, 1.0, '用户消费次数与消费金额之间的关系')

sns.displot(data=data.query('消费金额<10000'),x = '消费金额')
<seaborn.axisgrid.FacetGrid at 0x192e63e9c10>
[Figure: histogram of per-user spend below 10,000]
sns.displot(data=data.query('消费次数<50'),x = '消费次数',kind='hist',bins = 20)
<seaborn.axisgrid.FacetGrid at 0x192e6ecc370>
[Figure: histogram of per-user purchase counts below 50]
data_cum = data.sort_values('消费金额')['消费金额'].cumsum()/data['消费金额'].sum()
data_cum.reset_index()['消费金额'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x192f8331460>
[Figure: cumulative share of revenue across users sorted by spend]
  • The bottom 50% of users by spend contribute only about 10% of revenue, while more than half of all revenue comes from fewer than 10% of users, so purchasing power varies widely between users (a quick check is sketched below)
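A quick numeric check of that claim (my own sketch, reusing the data frame built above):

# Sketch: revenue share of the bottom 50% and the top 10% of users by spend
spend = data['消费金额'].sort_values()
cum_share = spend.cumsum() / spend.sum()
n = len(spend)
print('bottom 50% of users contribute:', round(cum_share.iloc[n // 2 - 1], 3))
print('top 10% of users contribute:', round(1 - cum_share.iloc[int(n * 0.9) - 1], 3))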

User purchase behaviour

New users over time

df.loc[df.event_time.dt.year == 2020].groupby('user_id')['event_time'].min().value_counts().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1928877a520>
[Figure: counts of users by first purchase time]
data = df.loc[df.event_time.dt.year == 2020].groupby('user_id')['event_time'].min().value_counts().reset_index()
data = data.groupby(data['index'].dt.month)['event_time'].sum().reset_index().rename(columns = {'index':'月份','event_time':'新增人数'})
data 
| | 月份 | 新增人数 |
|---|---|---|
| 0 | 1 | 1431 |
| 1 | 2 | 1390 |
| 2 | 3 | 1393 |
| 3 | 4 | 3776 |
| 4 | 5 | 13046 |
| 5 | 6 | 8427 |
| 6 | 7 | 21540 |
| 7 | 8 | 22496 |
| 8 | 9 | 8552 |
| 9 | 10 | 5771 |
| 10 | 11 | 2959 |
sns.barplot(data = data,x = '月份',y = '新增人数')
<matplotlib.axes._subplots.AxesSubplot at 0x1928a1c9fa0>
[Figure: bar chart of new users per month]
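The value_counts() detour above works, but the same monthly first-purchase counts can be read off more directly; a sketch of that variant (mine, not the original notebook's):

# Sketch: month of each user's first 2020 purchase, counted directly
first_purchase = df.loc[df.event_time.dt.year == 2020].groupby('user_id')['event_time'].min()
first_purchase.dt.month.value_counts().sort_index()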

Churn timing (each user's last purchase time)

df.loc[df.event_time.dt.year == 2020].groupby('user_id')['event_time'].max().value_counts().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x192f7eba6a0>
[Figure: counts of users by last purchase time]
data = df.loc[df.event_time.dt.year == 2020].groupby('user_id')['event_time'].max().value_counts().reset_index()
data = data.groupby(data['index'].dt.month)['event_time'].sum().reset_index().rename(columns = {'index':'月份','event_time':'最后一次购买人数'})
data 
| | 月份 | 最后一次购买人数 |
|---|---|---|
| 0 | 1 | 168 |
| 1 | 2 | 241 |
| 2 | 3 | 302 |
| 3 | 4 | 1747 |
| 4 | 5 | 8300 |
| 5 | 6 | 6043 |
| 6 | 7 | 17485 |
| 7 | 8 | 25120 |
| 8 | 9 | 13075 |
| 9 | 10 | 10594 |
| 10 | 11 | 7706 |
sns.barplot(data = data,x = '月份',y = '最后一次购买人数')
<matplotlib.axes._subplots.AxesSubplot at 0x1928bd589d0>
[Figure: bar chart of last-purchase counts per month]
  • The first-purchase and last-purchase distributions are broadly similar, suggesting that in 2020 most users bought electronics only once

Share of one-time vs. repeat buyers

data = df.groupby('user_id')['event_time'].agg(['min','max'])
data.head()
| user_id | min | max |
|---|---|---|
| 1.515916e+18 | 2020-07-09 06:35:18+00:00 | 2020-07-09 06:35:18+00:00 |
| 1.515916e+18 | 2020-09-22 15:11:15+00:00 | 2020-10-28 05:53:47+00:00 |
| 1.515916e+18 | 2020-10-23 03:51:26+00:00 | 2020-11-16 15:49:50+00:00 |
| 1.515916e+18 | 2020-06-10 21:37:30+00:00 | 2020-10-06 05:59:30+00:00 |
| 1.515916e+18 | 2020-05-16 16:09:13+00:00 | 2020-07-14 13:04:12+00:00 |
data = data.reset_index()
data['is_new'] = (data['min'] == data['max'])
data.head()
| | user_id | min | max | is_new |
|---|---|---|---|---|
| 0 | 1.515916e+18 | 2020-07-09 06:35:18+00:00 | 2020-07-09 06:35:18+00:00 | True |
| 1 | 1.515916e+18 | 2020-09-22 15:11:15+00:00 | 2020-10-28 05:53:47+00:00 | False |
| 2 | 1.515916e+18 | 2020-10-23 03:51:26+00:00 | 2020-11-16 15:49:50+00:00 | False |
| 3 | 1.515916e+18 | 2020-06-10 21:37:30+00:00 | 2020-10-06 05:59:30+00:00 | False |
| 4 | 1.515916e+18 | 2020-05-16 16:09:13+00:00 | 2020-07-14 13:04:12+00:00 | False |
data = data.is_new.value_counts().rename(index={False:'多次购买',True:'一次用户'}).to_frame().reset_index().rename(columns={'index':'用户类型','is_new':'用户数量'})
data 
| | 用户类型 | 用户数量 |
|---|---|---|
| 0 | 多次购买 | 48071 |
| 1 | 一次用户 | 42729 |
sns.barplot(data=data,x = '用户类型',y = '用户数量')
<matplotlib.axes._subplots.AxesSubplot at 0x1928650d280>
[Figure: bar chart of one-time vs. repeat buyers]
  • About half of all users bought only once; the repeat-purchase rate is low, and converting new users into repeat buyers needs further work (a variant of this calculation is sketched below)
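Note that is_new flags users whose first and last purchase timestamps coincide, so a user who placed several orders at the same instant would still count as a one-time buyer. A sketch of a variant based on distinct order IDs (my own refinement, not part of the original analysis):

# Sketch: repeat-purchase rate measured by number of distinct orders per user
orders_per_user = df.groupby('user_id')['order_id'].nunique()
repeat_rate = (orders_per_user > 1).mean()
print(f'repeat purchase rate: {repeat_rate:.2%}')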

User segmentation

[Figure: RFM customer-segmentation matrix, from the referenced article “问他咋做数据分析,张口就来RFM,结果还用错!”]

  • Recency - time since the most recent purchase
  • Frequency - how often the customer purchases
  • Monetary - how much the customer spends in total
data = df.pivot_table(index='user_id',values = ['price','order_id','event_time'],aggfunc=
{'price':'sum','order_id':'count','event_time':'max'}).reset_index().rename(columns={'event_time':'最后购买日期','order_id':'购买次数','price':'消费总金额'})
data.head()
| | user_id | 最后购买日期 | 购买次数 | 消费总金额 |
|---|---|---|---|---|
| 0 | 1.515916e+18 | 2020-07-09 06:35:18+00:00 | 1 | 416.64 |
| 1 | 1.515916e+18 | 2020-10-28 05:53:47+00:00 | 2 | 56.43 |
| 2 | 1.515916e+18 | 2020-11-16 15:49:50+00:00 | 12 | 5984.92 |
| 3 | 1.515916e+18 | 2020-10-06 05:59:30+00:00 | 7 | 3785.72 |
| 4 | 1.515916e+18 | 2020-07-14 13:04:12+00:00 | 2 | 182.83 |
data['最后一次购买间隔'] = -(data['最后购买日期']-data['最后购买日期'].max())/np.timedelta64(1,'D')
data
| | user_id | 最后购买日期 | 购买次数 | 消费总金额 | 最后一次购买间隔 |
|---|---|---|---|---|---|
| 0 | 1.515916e+18 | 2020-07-09 06:35:18+00:00 | 1 | 416.64 | 135.149444 |
| 1 | 1.515916e+18 | 2020-10-28 05:53:47+00:00 | 2 | 56.43 | 24.178275 |
| 2 | 1.515916e+18 | 2020-11-16 15:49:50+00:00 | 12 | 5984.92 | 4.764352 |
| 3 | 1.515916e+18 | 2020-10-06 05:59:30+00:00 | 7 | 3785.72 | 46.174306 |
| 4 | 1.515916e+18 | 2020-07-14 13:04:12+00:00 | 2 | 182.83 | 129.879375 |
| ... | ... | ... | ... | ... | ... |
| 90795 | 1.515916e+18 | 2020-11-21 09:13:23+00:00 | 1 | 208.31 | 0.039664 |
| 90796 | 1.515916e+18 | 2020-11-21 09:18:31+00:00 | 1 | 3472.20 | 0.036100 |
| 90797 | 1.515916e+18 | 2020-11-21 10:10:01+00:00 | 2 | 277.74 | 0.000336 |
| 90798 | 1.515916e+18 | 2020-11-21 10:04:42+00:00 | 1 | 925.67 | 0.004028 |
| 90799 | 1.515916e+18 | 2020-11-21 10:10:13+00:00 | 1 | 418.96 | 0.000197 |

90800 rows × 5 columns

Classifying customers with the RFM method

def rfm_func(x):
    # '1' if the mean-centred value is non-negative (at or above average), else '0'
    level = x.apply(lambda v: '1' if v >= 0 else '0')
    label = level['最后一次购买间隔'] + level['购买次数'] + level['消费总金额']
    d = {
        '111' : '重要价值客户'
        ,'011': '重要保持客户' 
        ,'101': '重要发展客户'
        ,'001': '重要挽留客户'
        ,'110': '一般价值客户'
        ,'010': '一般保持客户'
        ,'100': '一般发展客户'
        ,'000': '一般挽留客户'
    }
    return d[label]
# Centre each dimension on its mean, then map each user's '0'/'1' pattern to a segment label
data['label'] = data[['最后一次购买间隔','购买次数','消费总金额']].apply(lambda x: x - x.mean()).apply(rfm_func, axis = 1)
data.head() 
data.head() 
| | user_id | 最后购买日期 | 购买次数 | 消费总金额 | 最后一次购买间隔 | label |
|---|---|---|---|---|---|---|
| 0 | 1.515916e+18 | 2020-07-09 06:35:18+00:00 | 1 | 416.64 | 135.149444 | 一般发展客户 |
| 1 | 1.515916e+18 | 2020-10-28 05:53:47+00:00 | 2 | 56.43 | 24.178275 | 一般挽留客户 |
| 2 | 1.515916e+18 | 2020-11-16 15:49:50+00:00 | 12 | 5984.92 | 4.764352 | 重要保持客户 |
| 3 | 1.515916e+18 | 2020-10-06 05:59:30+00:00 | 7 | 3785.72 | 46.174306 | 重要保持客户 |
| 4 | 1.515916e+18 | 2020-07-14 13:04:12+00:00 | 2 | 182.83 | 129.879375 | 一般发展客户 |
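One caveat, in the spirit of the article referenced above: 最后一次购买间隔 measures days since the last purchase, so a value above the mean means a less recent customer, yet the label dictionary reads a '1' in the R position as "recent". A sketch of a variant that flips the R dimension first so that '1' consistently means "better than average" (the label_v2 column name is my own, and this is an assumption about the intended semantics, not the author's code):

# Sketch: negate the recency interval so larger values mean more recent purchases
rfm = data[['最后一次购买间隔', '购买次数', '消费总金额']].copy()
rfm['最后一次购买间隔'] = -rfm['最后一次购买间隔']
data['label_v2'] = rfm.apply(lambda x: x - x.mean()).apply(rfm_func, axis=1)
data['label_v2'].value_counts()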
data.groupby('label')[['最后一次购买间隔','购买次数','消费总金额']].agg(['count','sum','mean'])
| label | 最后一次购买间隔 count | sum | mean | 购买次数 count | sum | mean | 消费总金额 count | sum | mean |
|---|---|---|---|---|---|---|---|---|---|
| 一般价值客户 | 1456 | 1.934602e+05 | 132.871021 | 1456 | 9019 | 6.194368 | 1456 | 952390.01 | 654.114018 |
| 一般保持客户 | 3367 | 1.921168e+05 | 57.058746 | 3367 | 21993 | 6.531928 | 3367 | 2427027.58 | 720.827912 |
| 一般发展客户 | 36611 | 5.695113e+06 | 155.557437 | 36611 | 56771 | 1.550654 | 36611 | 11417308.76 | 311.854600 |
| 一般挽留客户 | 28052 | 1.831255e+06 | 65.280719 | 28052 | 52324 | 1.865250 | 28052 | 11691016.52 | 416.762317 |
| 重要价值客户 | 2080 | 2.580487e+05 | 124.061853 | 2080 | 18024 | 8.665385 | 2080 | 5904772.28 | 2838.832827 |
| 重要保持客户 | 10315 | 4.886402e+05 | 47.371809 | 10315 | 238375 | 23.109549 | 10315 | 57886716.90 | 5611.896937 |
| 重要发展客户 | 2927 | 4.261963e+05 | 145.608568 | 2927 | 7467 | 2.551076 | 2927 | 5133287.36 | 1753.770878 |
| 重要挽留客户 | 5992 | 3.576382e+05 | 59.685946 | 5992 | 16745 | 2.794559 | 5992 | 11516020.92 | 1921.899352 |
data.groupby('label')['user_id'].agg(['count']).sort_values('count').plot(kind = 'barh')
<matplotlib.axes._subplots.AxesSubplot at 0x1928f663880>
[Figure: horizontal bar chart of user counts per RFM segment]
  • Different incentives should be designed for each customer segment in order to raise customer value

User retention time

data = df.groupby('user_id')['event_time'].agg(['min','max'])
data = data.reset_index()
data['留存天数'] = (data['max'] - data['min']).dt.days
data.head()
| | user_id | min | max | 留存天数 |
|---|---|---|---|---|
| 0 | 1.515916e+18 | 2020-07-09 06:35:18+00:00 | 2020-07-09 06:35:18+00:00 | 0 |
| 1 | 1.515916e+18 | 2020-09-22 15:11:15+00:00 | 2020-10-28 05:53:47+00:00 | 35 |
| 2 | 1.515916e+18 | 2020-10-23 03:51:26+00:00 | 2020-11-16 15:49:50+00:00 | 24 |
| 3 | 1.515916e+18 | 2020-06-10 21:37:30+00:00 | 2020-10-06 05:59:30+00:00 | 117 |
| 4 | 1.515916e+18 | 2020-05-16 16:09:13+00:00 | 2020-07-14 13:04:12+00:00 | 58 |
sns.histplot(data = data.query('留存天数<200'),x ='留存天数',bins =100)
<matplotlib.axes._subplots.AxesSubplot at 0x1928324d190>
[Figure: histogram of retention days below 200]
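To put numbers behind the histogram, a short summary (my own sketch, reusing the data frame above):

# Sketch: share of users whose first and last purchases fall on the same day, plus median retention
print('0-day users:', round((data['留存天数'] == 0).mean(), 3))
print('median retention days:', data['留存天数'].median())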