RFM Analysis of an Online Store's Electronics Sales Data

GitHub repo for this article: DataSicence
Data download: link
Reference: link

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns 
# the seaborn plotting package should be the latest version

Loading the data

  • event_time - purchase time
  • order_id - order ID
  • product_id - product ID
  • category_id - product category ID
  • category_code - product category taxonomy (code name)
  • brand - brand name
  • price - product price
  • user_id - user ID
# raw string avoids backslash-escape issues in the Windows path
df = pd.read_csv(r'D:\Code\Github\data\kz.csv', sep=',')
df.head()
                event_time             order_id           product_id   category_id                category_code    brand   price       user_id
0  2020-04-24 11:50:39 UTC  2294359932054536986  1515966223509089906  2.268105e+18           electronics.tablet  samsung  162.01  1.515916e+18
1  2020-04-24 11:50:39 UTC  2294359932054536986  1515966223509089906  2.268105e+18           electronics.tablet  samsung  162.01  1.515916e+18
2  2020-04-24 14:37:43 UTC  2294444024058086220  2273948319057183658  2.268105e+18  electronics.audio.headphone   huawei   77.52  1.515916e+18
3  2020-04-24 14:37:43 UTC  2294444024058086220  2273948319057183658  2.268105e+18  electronics.audio.headphone   huawei   77.52  1.515916e+18
4  2020-04-24 19:16:21 UTC  2294584263154074236  2273948316817424439  2.268105e+18                          NaN  karcher  217.57  1.515916e+18
df.shape
(2633521, 8)

Checking for missing values

df.isnull().sum()
event_time             0
order_id               0
product_id             0
category_id       431954
category_code     612202
brand             506005
price             431954
user_id          2069352
dtype: int64

To keep things simple, drop every row that contains a missing value.

df = df.dropna()
df.shape
(420718, 8)
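Note that dropping any row with a NaN keeps only 420,718 of 2,633,521 rows. A gentler alternative, sketched on toy data (the column values below are made up for illustration), is to require only the analysis-critical fields and fill gaps in descriptive ones:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking kz.csv's columns; values are made up for illustration
toy = pd.DataFrame({
    'order_id': [1, 2, 3],
    'price':    [10.0, np.nan, 30.0],
    'brand':    ['samsung', 'apple', None],
    'user_id':  [100.0, 101.0, 102.0],
})

# dropna() discards every row with at least one NaN -> only row 0 survives
strict = toy.dropna()

# keep rows whose analysis-critical fields (price, user_id) are present,
# and fill the missing descriptive field instead of dropping the row
lenient = toy.dropna(subset=['price', 'user_id']).fillna({'brand': 'unknown'})
```

With this toy frame, `strict` retains one row while `lenient` retains two, one of them with brand filled in as 'unknown'.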

Checking data types

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 420718 entries, 0 to 2633520
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   event_time     420718 non-null  object 
 1   order_id       420718 non-null  int64  
 2   product_id     420718 non-null  int64  
 3   category_id    420718 non-null  float64
 4   category_code  420718 non-null  object 
 5   brand          420718 non-null  object 
 6   price          420718 non-null  float64
 7   user_id        420718 non-null  float64
dtypes: float64(3), int64(2), object(3)
memory usage: 28.9+ MB
# Convert the event_time column to datetime
df['event_time'] = pd.to_datetime(df.event_time)
print(df.event_time.max(),df.event_time.min())
2020-11-21 10:10:30+00:00 1970-01-01 00:33:40+00:00
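The minimum timestamp printed above is the Unix epoch (1970-01-01), which indicates that some rows carry invalid timestamps. A sketch, on toy data, of keeping only plausible 2020 rows (the later groupbys apply this same year filter):

```python
import pandas as pd

# A few timestamps like those printed above; the 1970 value is epoch junk
ts = pd.Series(pd.to_datetime([
    '1970-01-01 00:33:40+00:00',
    '2020-04-24 11:50:39+00:00',
    '2020-11-21 10:10:30+00:00',
]))
valid = ts[ts.dt.year == 2020]  # the year filter used in the analysis below
```

Filtering on `dt.year == 2020` discards the epoch row and keeps the two genuine 2020 timestamps.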
# Extract the month from each date
df['month'] = df.event_time.dt.month

User spending trends

df_month = df.loc[df.event_time.dt.year == 2020].groupby(['month'])

Total spend per month

df_month_sum = df_month['price'].sum().reset_index().rename(columns = {'price':'销售额','month':'月份'})
plt.rcParams['font.sans-serif']=['SimSun'] # render Chinese labels correctly
plt.rcParams['axes.unicode_minus']=False # render the minus sign correctly
%matplotlib inline
plt.style.use("ggplot")
#plt.figure(figsize = (15,8))
sns.relplot(x='月份',y = '销售额',data= df_month_sum,kind='line',height=4,aspect=15/8)
plt.title('每月消费总金额')
Text(0.5, 1.0, '每月消费总金额')

png

Number of active buyers per month

df_month_count = df_month['price'].count().reset_index().rename(columns = {'price':'活跃人数','month':'月份'})
plt.rcParams['font.sans-serif']=['SimSun'] # render Chinese labels correctly
plt.rcParams['axes.unicode_minus']=False # render the minus sign correctly
%matplotlib inline
plt.style.use("ggplot")

sns.relplot(x='月份',y = '活跃人数',data= df_month_count,kind='line',height=4,aspect=15/8)
plt.title('每月消费人数')
Text(0.5, 1.0, '每月消费人数')

png
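One caveat: `.count()` on the groupby counts order rows, so the '活跃人数' series above is really a per-month transaction count; counting distinct buyers would use `nunique()` on `user_id`. A toy illustration of the difference:

```python
import pandas as pd

# Toy data: user 100 makes two purchases in month 1
toy = pd.DataFrame({'month':   [1, 1, 1, 2],
                    'user_id': [100, 100, 101, 100],
                    'price':   [5.0, 6.0, 7.0, 8.0]})

rows_per_month   = toy.groupby('month')['price'].count()      # transactions
buyers_per_month = toy.groupby('month')['user_id'].nunique()  # distinct buyers
```

In month 1 there are three transactions but only two distinct buyers, so the two measures diverge whenever a user buys more than once in a month.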

Comparing the two

df_month = df.groupby(['month'])['price'].agg(['sum','count']).reset_index().rename(columns = {'sum':'销售额','month':'月份','count':'活跃人数'})
df_month
    月份           销售额   活跃人数
0    1  1.670965e+06   9982
1    2  1.928107e+06  11566
2    3  2.532487e+06  12461
3    4  1.550330e+06   8807
4    5  7.180919e+06  30826
5    6  6.834606e+06  29750
6    7  1.511283e+07  61037
7    8  2.601872e+07  82198
8    9  1.571325e+07  53591
9   10  1.763243e+07  74107
10  11  1.075389e+07  46393
# Melt the table into long format
df_month_melt = df_month.melt(id_vars=['月份'],value_vars=['销售额','活跃人数'],var_name='cal',value_name='value')
sns.relplot(data=df_month_melt,x = '月份',y = 'value',col = 'cal',col_wrap=1,height=4,aspect=15/8,kind='line',facet_kws = {'sharey':False})
<seaborn.axisgrid.FacetGrid at 0x21e94bd46d0>

png

  • July through October is the peak season; sales and active-buyer counts in the other months are comparatively low

Spending by brand

Sales by brand

df_grand = df.loc[df.event_time.dt.year == 2020].groupby('brand')['price'].agg(['sum']).reset_index().sort_values('sum',ascending = False).rename(
    columns = {'brand':'品牌','sum':'销售额'})
df_grand
           品牌           销售额
443   samsung  2.872334e+07
31      apple  2.590539e+07
300        lg  7.726328e+06
43       asus  5.072569e+06
299    lenovo  4.565506e+06
..        ...           ...
47        att  9.200000e-01
163      elfe  8.800000e-01
547     wurth  6.700000e-01
102  celebrat  2.300000e-01
390  pedigree  2.300000e-01

[570 rows x 2 columns]

sns.barplot(x = '销售额',y='品牌',data=df_grand.iloc[:15,:])
<matplotlib.axes._subplots.AxesSubplot at 0x21e9445f2b0>

png

  • Samsung, Apple, and LG have the highest sales

Number of buyers per brand

df_grand = df.loc[df.event_time.dt.year == 2020].groupby('brand')['user_id'].agg(pd.Series.nunique).reset_index().sort_values('user_id',ascending = False).rename(
    columns = {'brand':'品牌','user_id':'用户数量'})
df_grand
              品牌     用户数量
443      samsung  34602.0
31         apple  18441.0
50           ava  10095.0
300           lg   8243.0
554       xiaomi   7627.0
..           ...      ...
227  highwaybaby      1.0
226     herschel      1.0
445      sandisk      1.0
320       matrix      1.0
80      blackvue      1.0

[570 rows x 2 columns]

g = sns.barplot(x = '用户数量',y='品牌',data=df_grand.iloc[:15,:])

png

Average sales per buyer, by brand

# calling .agg directly on the groupby avoids the deprecated multi-key indexing
df_grand = df.loc[df.event_time.dt.year == 2020].groupby('brand').agg({'price':'sum','user_id':pd.Series.nunique}).reset_index()
df_grand.head()
      brand     price  user_id
0     acana    175.21      2.0
1   adguard      2.75      1.0
2       aeg  50246.59     38.0
3  aerocool  81000.46    519.0
4       agu     99.54      2.0
df_grand['人均销售额'] = df_grand.price/df_grand.user_id
sns.barplot(x = '人均销售额',y='brand',data=df_grand.sort_values('人均销售额',ascending = False).iloc[:30,:])
<matplotlib.axes._subplots.AxesSubplot at 0x21e949fd4c0>

png

Per-user sales

Scatter plot of per-user purchase count vs. spend

data = df.groupby('user_id').agg({'order_id':'count','price':'sum'}).rename(columns = {'order_id':'消费次数','price':'消费金额'})
data
              消费次数     消费金额
user_id
1.515916e+18     1   416.64
1.515916e+18     2    56.43
1.515916e+18    12  5984.92
1.515916e+18     7  3785.72
1.515916e+18     2   182.83
...            ...      ...
1.515916e+18     1   208.31
1.515916e+18     1  3472.20
1.515916e+18     2   277.74
1.515916e+18     1   925.67
1.515916e+18     1   418.96

[90800 rows x 2 columns]

sns.scatterplot(x = '消费次数',y = '消费金额',data = data)
plt.title('用户消费次数与消费金额之间的关系') 
Text(0.5, 1.0, '用户消费次数与消费金额之间的关系')

sns.displot(data=data.query('消费金额<10000'),x = '消费金额')
<seaborn.axisgrid.FacetGrid at 0x192e63e9c10>
svg
sns.displot(data=data.query('消费次数<50'),x = '消费次数',kind='hist',bins = 20)
<seaborn.axisgrid.FacetGrid at 0x192e6ecc370>
svg
data_cum = data.sort_values('消费金额')['消费金额'].cumsum()/data['消费金额'].sum()
data_cum.reset_index()['消费金额'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x192f8331460>
svg
  • The bottom 50% of users by spend contribute only about 10% of revenue, while more than half of revenue comes from fewer than 10% of users; purchasing power clearly varies a great deal between users
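The cumulative-share (Pareto) calculation above can be sketched end-to-end on toy data: sort users by spend ascending, take the cumulative sum, and divide by total revenue:

```python
import pandas as pd

# Five toy users: one whale and four small spenders
spend = pd.Series([1.0, 1.0, 1.0, 1.0, 96.0], name='消费金额')

# ascending sort -> cumulative sum -> share of total revenue
cum_share = spend.sort_values().cumsum() / spend.sum()
```

Here the bottom four of five users (80%) contribute only 4% of revenue, the kind of concentration the real data shows.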

User purchase behavior

New users (first purchase dates)

df.loc[df.event_time.dt.year == 2020].groupby('user_id')['event_time'].min().value_counts().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1928877a520>
svg
data = df.loc[df.event_time.dt.year == 2020].groupby('user_id')['event_time'].min().value_counts().reset_index()
data = data.groupby(data['index'].dt.month)['event_time'].sum().reset_index().rename(columns = {'index':'月份','event_time':'新增人数'})
data 
    月份   新增人数
0    1   1431
1    2   1390
2    3   1393
3    4   3776
4    5  13046
5    6   8427
6    7  21540
7    8  22496
8    9   8552
9   10   5771
10  11   2959
sns.barplot(data = data,x = '月份',y = '新增人数')
<matplotlib.axes._subplots.AxesSubplot at 0x1928a1c9fa0>
svg

User churn (each user's last purchase date)

df.loc[df.event_time.dt.year == 2020].groupby('user_id')['event_time'].max().value_counts().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x192f7eba6a0>
svg
data = df.loc[df.event_time.dt.year == 2020].groupby('user_id')['event_time'].max().value_counts().reset_index()
data = data.groupby(data['index'].dt.month)['event_time'].sum().reset_index().rename(columns = {'index':'月份','event_time':'最后一次购买人数'})
data 
    月份  最后一次购买人数
0    1       168
1    2       241
2    3       302
3    4      1747
4    5      8300
5    6      6043
6    7     17485
7    8     25120
8    9     13075
9   10     10594
10  11      7706
sns.barplot(data = data,x = '月份',y = '最后一次购买人数')
<matplotlib.axes._subplots.AxesSubplot at 0x1928bd589d0>
svg
  • The first-purchase and last-purchase counts track each other closely, which suggests that in 2020 most users bought electronics only once

Proportion of one-time vs. repeat users

data = df.groupby('user_id')['event_time'].agg(['min','max'])
data.head()
                                    min                        max
user_id
1.515916e+18  2020-07-09 06:35:18+00:00  2020-07-09 06:35:18+00:00
1.515916e+18  2020-09-22 15:11:15+00:00  2020-10-28 05:53:47+00:00
1.515916e+18  2020-10-23 03:51:26+00:00  2020-11-16 15:49:50+00:00
1.515916e+18  2020-06-10 21:37:30+00:00  2020-10-06 05:59:30+00:00
1.515916e+18  2020-05-16 16:09:13+00:00  2020-07-14 13:04:12+00:00
data = data.reset_index()
data['is_new'] = (data['min'] == data['max'])
data.head()
        user_id                        min                        max  is_new
0  1.515916e+18  2020-07-09 06:35:18+00:00  2020-07-09 06:35:18+00:00    True
1  1.515916e+18  2020-09-22 15:11:15+00:00  2020-10-28 05:53:47+00:00   False
2  1.515916e+18  2020-10-23 03:51:26+00:00  2020-11-16 15:49:50+00:00   False
3  1.515916e+18  2020-06-10 21:37:30+00:00  2020-10-06 05:59:30+00:00   False
4  1.515916e+18  2020-05-16 16:09:13+00:00  2020-07-14 13:04:12+00:00   False
data = data.is_new.value_counts().rename(index={False:'多次购买',True:'一次用户'}).to_frame().reset_index().rename(columns={'index':'用户类型','is_new':'用户数量'})
data 
   用户类型   用户数量
0  多次购买  48071
1  一次用户  42729
sns.barplot(data=data,x = '用户类型',y = '用户数量')
<matplotlib.axes._subplots.AxesSubplot at 0x1928650d280>
svg
  • Roughly half of all users bought only once; the repurchase rate is low, so more work is needed on converting new users into repeat buyers

User segmentation

img


  • Recency (time since last purchase)
  • Frequency (number of purchases)
  • Monetary (total spend)
data = df.pivot_table(index='user_id',values = ['price','order_id','event_time'],aggfunc=
{'price':'sum','order_id':'count','event_time':'max'}).reset_index().rename(columns={'event_time':'最后购买日期','order_id':'购买次数','price':'消费总金额'})
data.head()
        user_id                     最后购买日期  购买次数    消费总金额
0  1.515916e+18  2020-07-09 06:35:18+00:00     1   416.64
1  1.515916e+18  2020-10-28 05:53:47+00:00     2    56.43
2  1.515916e+18  2020-11-16 15:49:50+00:00    12  5984.92
3  1.515916e+18  2020-10-06 05:59:30+00:00     7  3785.72
4  1.515916e+18  2020-07-14 13:04:12+00:00     2   182.83
data['最后一次购买间隔'] = -(data['最后购买日期']-data['最后购买日期'].max())/np.timedelta64(1,'D')
data
            user_id                     最后购买日期  购买次数    消费总金额    最后一次购买间隔
0      1.515916e+18  2020-07-09 06:35:18+00:00     1   416.64  135.149444
1      1.515916e+18  2020-10-28 05:53:47+00:00     2    56.43   24.178275
2      1.515916e+18  2020-11-16 15:49:50+00:00    12  5984.92    4.764352
3      1.515916e+18  2020-10-06 05:59:30+00:00     7  3785.72   46.174306
4      1.515916e+18  2020-07-14 13:04:12+00:00     2   182.83  129.879375
...             ...                        ...   ...      ...         ...
90795  1.515916e+18  2020-11-21 09:13:23+00:00     1   208.31    0.039664
90796  1.515916e+18  2020-11-21 09:18:31+00:00     1  3472.20    0.036100
90797  1.515916e+18  2020-11-21 10:10:01+00:00     2   277.74    0.000336
90798  1.515916e+18  2020-11-21 10:04:42+00:00     1   925.67    0.004028
90799  1.515916e+18  2020-11-21 10:10:13+00:00     1   418.96    0.000197

[90800 rows x 5 columns]
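Dividing a Timedelta by np.timedelta64(1, 'D'), as in the 最后一次购买间隔 column above, keeps fractional days, unlike the .days accessor, which truncates. A minimal check:

```python
import numpy as np
import pandas as pd

delta = pd.to_datetime('2020-11-21 12:00:00') - pd.to_datetime('2020-11-20 00:00:00')

frac_days = delta / np.timedelta64(1, 'D')  # fractional days preserved
whole_days = delta.days                     # truncates toward zero
```

For a 36-hour gap this yields 1.5 fractional days but 1 whole day, which is why the column above keeps values like 0.039664 for same-day buyers.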

Classifying customers with RFM

def rfm_func(x):
    # score each dimension '1' if at or above its mean, else '0'
    level = x.apply(lambda x: '1' if x >= 0 else '0')
    label = level['最后一次购买间隔'] + level['购买次数'] + level['消费总金额']
    # Note: 最后一次购买间隔 grows as users become *less* recent, so the first
    # digit is '1' for inactive users; textbook RFM would invert this flag.
    d = {
        '111': '重要价值客户',
        '011': '重要保持客户',
        '101': '重要发展客户',
        '001': '重要挽留客户',
        '110': '一般价值客户',
        '010': '一般保持客户',
        '100': '一般发展客户',
        '000': '一般挽留客户',
    }
    return d[label]
data['label' ] = data[['最后一次购买间隔','购买次数','消费总金额']].apply(lambda x:x - x.mean()).apply(rfm_func,axis = 1)
data.head() 
        user_id                     最后购买日期  购买次数    消费总金额    最后一次购买间隔   label
0  1.515916e+18  2020-07-09 06:35:18+00:00     1   416.64  135.149444  一般发展客户
1  1.515916e+18  2020-10-28 05:53:47+00:00     2    56.43   24.178275  一般挽留客户
2  1.515916e+18  2020-11-16 15:49:50+00:00    12  5984.92    4.764352  重要保持客户
3  1.515916e+18  2020-10-06 05:59:30+00:00     7  3785.72   46.174306  重要保持客户
4  1.515916e+18  2020-07-14 13:04:12+00:00     2   182.83  129.879375  一般发展客户
data.groupby('label')[['最后一次购买间隔','购买次数','消费总金额']].agg(['count','sum','mean'])
        最后一次购买间隔                            购买次数                      消费总金额
          count           sum        mean  count     sum       mean  count          sum         mean
label
一般价值客户   1456  1.934602e+05  132.871021   1456    9019   6.194368   1456    952390.01   654.114018
一般保持客户   3367  1.921168e+05   57.058746   3367   21993   6.531928   3367   2427027.58   720.827912
一般发展客户  36611  5.695113e+06  155.557437  36611   56771   1.550654  36611  11417308.76   311.854600
一般挽留客户  28052  1.831255e+06   65.280719  28052   52324   1.865250  28052  11691016.52   416.762317
重要价值客户   2080  2.580487e+05  124.061853   2080   18024   8.665385   2080   5904772.28  2838.832827
重要保持客户  10315  4.886402e+05   47.371809  10315  238375  23.109549  10315  57886716.90  5611.896937
重要发展客户   2927  4.261963e+05  145.608568   2927    7467   2.551076   2927   5133287.36  1753.770878
重要挽留客户   5992  3.576382e+05   59.685946   5992   16745   2.794559   5992  11516020.92  1921.899352
data.groupby('label')['user_id'].agg(['count']).sort_values('count').plot(kind = 'barh')
<matplotlib.axes._subplots.AxesSubplot at 0x1928f663880>
svg
  • Different incentive schemes should be designed for each customer segment to raise overall customer value
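One caveat: in rfm_func a '1' in the first position marks a longer-than-average gap since the last purchase, i.e. a less active user, whereas textbook RFM gives the top R score to recent buyers. A toy sketch of the inverted-R variant (the column names, thresholds, and the two labels shown here are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy RFM table: one recent, frequent, high-spend user and one lapsed user
rfm = pd.DataFrame({'recency_days': [5.0, 200.0],
                    'frequency':    [10, 1],
                    'monetary':     [5000.0, 50.0]})
centered = rfm - rfm.mean()  # center each dimension on its mean

def rfm_label(row):
    # Textbook RFM: a *below*-mean recency gap (recent buyer) scores '1';
    # frequency and monetary score '1' when at or above the mean.
    r = '1' if row['recency_days'] < 0 else '0'
    f = '1' if row['frequency'] >= 0 else '0'
    m = '1' if row['monetary'] >= 0 else '0'
    return {'111': '重要价值客户', '000': '一般挽留客户'}.get(r + f + m, '其他')

labels = centered.apply(rfm_label, axis=1)
```

With the R flag inverted this way, the recent heavy buyer lands in 重要价值客户 and the lapsed light buyer in 一般挽留客户, matching the usual reading of the RFM quadrant chart.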

User retention period

data = df.groupby('user_id')['event_time'].agg(['min','max'])
data = data.reset_index()
data['留存天数'] = (data['max'] - data['min']).dt.days
data.head()
        user_id                        min                        max  留存天数
0  1.515916e+18  2020-07-09 06:35:18+00:00  2020-07-09 06:35:18+00:00     0
1  1.515916e+18  2020-09-22 15:11:15+00:00  2020-10-28 05:53:47+00:00    35
2  1.515916e+18  2020-10-23 03:51:26+00:00  2020-11-16 15:49:50+00:00    24
3  1.515916e+18  2020-06-10 21:37:30+00:00  2020-10-06 05:59:30+00:00   117
4  1.515916e+18  2020-05-16 16:09:13+00:00  2020-07-14 13:04:12+00:00    58
sns.histplot(data = data.query('留存天数<200'),x ='留存天数',bins =100)
<matplotlib.axes._subplots.AxesSubplot at 0x1928324d190>
svg