python数据分析实战---秦璐

最新推荐文章于 2022-08-01 11:28:12 发布

Luara_lyy

最新推荐文章于 2022-08-01 11:28:12 发布

阅读量1.5k

点赞数

分类专栏： python

本文链接：https://blog.csdn.net/weixin_40801364/article/details/100550047

版权

python 专栏收录该内容

10 篇文章 1 订阅

订阅专栏

讲解： 一份数据分析报告 #既包含代码又包含数据分析结果的PDF形式

Markdown写法：说明
code import pandas as pd
快捷切换 esc+m 变为Markdown/esc+m 变为code
图形化呈现：结果

练习1 这是一份用户消费行为的分析报告(数据来源于网络，用户在一家CD网站上的消费)

import pandas as pd
import numpy as np
#加载数据可视化包
import matplotlib.pyplot as plt
#可视化显示页面#%内置的符号，为jupyter专用，做好图之后可以直接在html页面显示
%matplotlib inline
#更改设计
plt.style.use('ggplot')#r语言风格

columns = ['user_id','order_dt','order_products','order_amount']
df = pd.read_table('CDNOW_master.txt',names = columns,sep = '\s+',parse_dates= True)

user_id 用户名
order_dt 购买日期
order_products 购买产品数
order_amount 购买金额

df.head()#显示字段

	user_id	order_dt	order_products	order_amount
0	1	19970101	1	11.77
1	2	19970112	1	12.00
2	2	19970112	5	77.00
3	3	19970102	2	20.76
4	3	19970330	2	20.76

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69659 entries, 0 to 69658
Data columns (total 4 columns):
user_id           69659 non-null int64
order_dt          69659 non-null int64
order_products    69659 non-null int64
order_amount      69659 non-null float64
dtypes: float64(1), int64(3)
memory usage: 2.1 MB

需要对order_dt进行清理

df.describe()

	user_id	order_dt	order_products	order_amount
count	69659.000000	6.965900e+04	69659.000000	69659.000000
mean	11470.854592	1.997228e+07	2.410040	35.893648
std	6819.904848	3.837735e+03	2.333924	36.281942
min	1.000000	1.997010e+07	1.000000	0.000000
25%	5506.000000	1.997022e+07	1.000000	14.490000
50%	11410.000000	1.997042e+07	2.000000	25.980000
75%	17273.000000	1.997111e+07	3.000000	43.700000
max	23570.000000	1.998063e+07	99.000000	1286.010000

大部分订单只消费了少量商品（平均2.4），有一定极值干扰
用户的消费金额比较稳定，平均消费35元，中位数在35，有一定极值干扰

df['order_dt'] = pd.to_datetime(df.order_dt,format = "%Y%m%d") #解析函数，转换某个字段的格式
df['month'] = df.order_dt.values.astype('datetime64[M]')

df

	user_id	order_dt	order_products	order_amount	month
0	1	1997-01-01	1	11.77	1997-01-01
1	2	1997-01-12	1	12.00	1997-01-01
2	2	1997-01-12	5	77.00	1997-01-01
3	3	1997-01-02	2	20.76	1997-01-01
4	3	1997-03-30	2	20.76	1997-03-01
5	3	1997-04-02	2	19.54	1997-04-01
6	3	1997-11-15	5	57.45	1997-11-01
7	3	1997-11-25	4	20.96	1997-11-01
8	3	1998-05-28	1	16.99	1998-05-01
9	4	1997-01-01	2	29.33	1997-01-01
10	4	1997-01-18	2	29.73	1997-01-01
11	4	1997-08-02	1	14.96	1997-08-01
12	4	1997-12-12	2	26.48	1997-12-01
13	5	1997-01-01	2	29.33	1997-01-01
14	5	1997-01-14	1	13.97	1997-01-01
15	5	1997-02-04	3	38.90	1997-02-01
16	5	1997-04-11	3	45.55	1997-04-01
17	5	1997-05-31	3	38.71	1997-05-01
18	5	1997-06-16	2	26.14	1997-06-01
19	5	1997-07-22	2	28.14	1997-07-01
20	5	1997-09-15	3	40.47	1997-09-01
21	5	1997-12-08	4	46.46	1997-12-01
22	5	1997-12-12	3	40.47	1997-12-01
23	5	1998-01-03	3	37.47	1998-01-01
24	6	1997-01-01	1	20.99	1997-01-01
25	7	1997-01-01	2	28.74	1997-01-01
26	7	1997-10-11	7	97.43	1997-10-01
27	7	1998-03-22	9	138.50	1998-03-01
28	8	1997-01-01	1	9.77	1997-01-01
29	8	1997-02-13	1	13.97	1997-02-01
...	...	...	...	...	...
69629	23556	1997-09-27	3	31.47	1997-09-01
69630	23556	1998-01-03	2	28.98	1998-01-01
69631	23556	1998-06-07	2	28.98	1998-06-01
69632	23557	1997-03-25	1	14.37	1997-03-01
69633	23558	1997-03-25	2	28.13	1997-03-01
69634	23558	1997-05-18	3	45.51	1997-05-01
69635	23558	1997-06-24	2	23.74	1997-06-01
69636	23558	1998-02-25	4	48.22	1998-02-01
69637	23559	1997-03-25	2	23.54	1997-03-01
69638	23559	1997-05-18	3	35.31	1997-05-01
69639	23559	1997-06-27	3	52.80	1997-06-01
69640	23560	1997-03-25	1	18.36	1997-03-01
69641	23561	1997-03-25	2	30.92	1997-03-01
69642	23561	1998-01-28	1	15.49	1998-01-01
69643	23561	1998-05-29	3	37.05	1998-05-01
69644	23562	1997-03-25	2	29.33	1997-03-01
69645	23563	1997-03-25	1	10.77	1997-03-01
69646	23563	1997-10-04	2	47.98	1997-10-01
69647	23564	1997-03-25	1	11.77	1997-03-01
69648	23564	1997-05-21	1	11.77	1997-05-01
69649	23564	1997-11-30	3	46.47	1997-11-01
69650	23565	1997-03-25	1	11.77	1997-03-01
69651	23566	1997-03-25	2	36.00	1997-03-01
69652	23567	1997-03-25	1	20.97	1997-03-01
69653	23568	1997-03-25	1	22.97	1997-03-01
69654	23568	1997-04-05	4	83.74	1997-04-01
69655	23568	1997-04-22	1	14.99	1997-04-01
69656	23569	1997-03-25	2	25.74	1997-03-01
69657	23570	1997-03-25	3	51.12	1997-03-01
69658	23570	1997-03-26	2	42.96	1997-03-01

69659 rows × 5 columns

1.进行用户消费趋势的分析(按月)

每月的消费总金额
每月的消费次数
每月的产品购买量
每月的消费人数

grouped_month = df.groupby('month')#数据的聚合
order_month_amount = grouped_month.order_amount.sum()#将order_amount按照grouped_month进行切片，求和
order_month_amount.head()#读取前5行

month
1997-01-01    299060.17
1997-02-01    379590.03
1997-03-01    393155.27
1997-04-01    142824.49
1997-05-01    107933.30
Name: order_amount, dtype: float64

#加载数据可视化包
import matplotlib.pyplot as plt
#可视化显示页面
%matplotlib inline 
#更改设计
plt.style.use('ggplot')#r语言风格
order_month_amount.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x228154dfba8>

[外链图片转存失败(img-QacSZRio-1567614745664)(output_14_1.png)]

由上图可知，消费金额在前三个月达到高峰，后续消费较为稳定，有轻微下降趋势

grouped_month.user_id.count().plot()#求出每月里的订单数

<matplotlib.axes._subplots.AxesSubplot at 0x228150d0898>

[外链图片转存失败(img-jCtrxOWZ-1567614745665)(output_16_1.png)]

前三个月消费订单数在10000笔左右，后续月份的平均消费人数则在2500人

grouped_month.order_products.sum().plot()#产品数需要使用sum,一单不只一个产品

<matplotlib.axes._subplots.AxesSubplot at 0x2281502c588>

在这里插入图片描述

df.groupby('month').user_id.apply(lambda x :len(x.drop_duplicates())).plot() #apply去重的操作,数组uesr_id作为输入，去重，len求个数

<matplotlib.axes._subplots.AxesSubplot at 0x228139590b8>

[外链图片转存失败(img-dQrBlh6t-1567614745665)(output_19_1.png)]

df.groupby(['month','user_id']).count().reset_index()  #第二种,通过groupby去重,重置索引
df.groupby('month').user_id.count().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x2281579fe80>

[外链图片转存失败(img-JUEtZ9Rt-1567614745666)(output_20_1.png)]

每月的消费人数低于每月的消费次数，但差异不大；
前三个月每月的消费人数在8000-10000之间，后续月份，平均消费人数在2000人不到

#数据透视
df.pivot_table(index = 'month',
              values = ['order_products','order_amount','user_id'],
              aggfunc = {'order_products':'sum',
                         'order_amount':'sum',
                         'user_id':'count'}).head()#透视去重比较麻烦，不建议

	order_amount	order_products	user_id
month
1997-01-01	299060.17	19416	8928
1997-02-01	379590.03	24921	11272
1997-03-01	393155.27	26159	11598
1997-04-01	142824.49	9729	3781
1997-05-01	107933.30	7275	2895

问题：

每月用户的平均消费金额趋势
每月用户平均消费次数的趋势

df.groupby('month').order_amount.mean().plot()#可能还要考虑用户相同的情况

<matplotlib.axes._subplots.AxesSubplot at 0x2281582eb00>

[外链图片转存失败(img-ZfOGSWZ8-1567614745666)(output_24_1.png)]

2.用户个体消费分析

用户消费金额，消费次数的描述统计
用户消费金额和消费次数的散点图
用户消费金额的分布图
用户消费次数的分布图（二八分布）
用户累计消费金额占比（百分之多少的用户占百分之多少的消费额），头部用户的多寡，实现运营策略，制定方向

grouped_user = df.groupby('user_id')

grouped_user.sum().describe()

	order_products	order_amount
count	23570.000000	23570.000000
mean	7.122656	106.080426
std	16.983531	240.925195
min	1.000000	0.000000
25%	1.000000	19.970000
50%	3.000000	43.395000
75%	7.000000	106.475000
max	1033.000000	13990.930000

用户平均购买了7张CD，但是中位值只有3，说明小部分用户购买了大量的CD
用户平均消费106元，中位数有43，判断同上，有极值干扰（一个人买了很多）
基本上符合一个2-8分布

#使用用户的散点图观察#sum求出每个用户的明细数据，plot函数方法
grouped_user.sum().plot.scatter(x = 'order_amount',y = 'order_products')
grouped_user.sum().query('order_amount <4000').plot.scatter(x = 'order_amount',y = 'order_products')

<matplotlib.axes._subplots.AxesSubplot at 0x2281846c278>

[外链图片转存失败(img-GbIY7NFU-1567614745666)(output_29_1.png)]

[外链图片转存失败(img-v4RcYBMu-1567614745666)(output_29_2.png)]

散点图对极值比较敏感，在开始时可以进行过滤,query(‘order_amount <4000’)
函数时有顺序的，plot.scatter
CD 数量和客单价有近似正比的关系，可以说明95%的CD的客单价都是差不多的，单一的产品，可能存在极少的珍藏版CD；
如果店家里有多种品类，如CD机，音乐唱片等，可能呈现扇形的扩散状态
散点图揭示的是一种规律

#分布关系
grouped_user.sum().order_amount.plot.hist(bins = 100)

<matplotlib.axes._subplots.AxesSubplot at 0x2281863ceb8>

[外链图片转存失败(img-GXTx3BdJ-1567614745667)(output_31_1.png)]

从直方图中，用户的消费金额，绝大多数呈现集中趋势，小部分异常值干扰判断，可以用**过滤操作**排除异常

grouped_user.sum().query('order_products<100').order_products.plot.hist(bins = 40)

<matplotlib.axes._subplots.AxesSubplot at 0x228189cff98>

在这里插入图片描述

使用切比雪夫过滤掉异常值，计算95%的数据分布情况，去掉5个均值以外的结果

#升序排列，滚动求和
user_cumsum = grouped_user.sum().sort_values('order_amount').apply(lambda x:x.cumsum()/x.sum())#滚动求和
#user_cumsum.order_amount.cumsum()/user_cumsum.order_amount.sum() 
user_cumsum
user_cumsum.reset_index().order_amount.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x22818ab2198>

在这里插入图片描述

按用户消费金额进行升序排列，有图可知50%的用户仅贡献了15%的消费额度，而排名前5000的用户就贡献了60%的消费额（如果运营这60%的用户就可能是效益更高）

3.用户的消费行为

用户第一次消费（收购）–和渠道相关，客单价比较高，用户留存率比较低行业，首购的用户从渠道来，可以拓展运营方式和业务
用户最后一次消费–与用户的流失息息相关
新老客消费比
- 多少用户仅消费了一次？
- 每月新客占比？
用户分层
- RFM
- 新、老、活跃、回流、流失
用户购买周期（按订单）
- 用户消费周期描述
- 用户消费周期分布
用户生命周期(按第一次&最后一次消费)
- 用户生命周期描述
- 用户生命周期分布

(1) 用户第一次消费

grouped_user.min().order_dt.value_counts().plot()#最小的时间，第一次购买

在这里插入图片描述

用户第一次购买分布，集中在前三个月
其中，在2月11日至2月15日有一次剧烈的波动（渠道或奖励机制发生了变化）

(2) 用户最后一次消费

grouped_user.max().order_dt.value_counts().plot()#最近一次消费

在这里插入图片描述

用户最后一次购买的分布比第一次分布广
大部分最后一次购买，集中在前三个月，说明有很多用户购买了一次就不再进行购买（在真实的场景中，可能是逐渐的缓慢累加）
随着时间递增，最后一次购买数也在递增，消费呈现流失上升的状况（运营的用户忠诚度下降）

(3）新老客消费比

user_life = grouped_user.order_dt.agg(['min','max'])#第一次消费时间，最后一次消费时间
user_life.head()

	min	max
user_id
1	1997-01-01	1997-01-01
2	1997-01-12	1997-01-12
3	1997-01-02	1998-05-28
4	1997-01-01	1997-12-12
5	1997-01-01	1998-01-03

(user_life['min'] == user_life['max']).value_counts()
#判断新老客户的对比（理性角度多少用户消费消费一次会逐渐降低，流失得越来越多）--多少用户消费一次的时间占比

True     12054
False    11516
dtype: int64

有一半用户，就消费了一次

（4）用户分层 -RFM(用户消费额，用户消费次数，用户最近一次消费)-象限法

①RFM模型分析

rfm = df.pivot_table(index = 'user_id',
              values = ['order_products','order_amount','order_dt'],
              aggfunc = {'order_dt':'max',
                         'order_amount':'sum',
                         'order_products':'sum'})
rfm.head()

	order_amount	order_dt	order_products
user_id
1	11.77	1997-01-01	1
2	89.00	1997-01-12	6
3	156.46	1998-05-28	16
4	100.50	1997-12-12	7
5	385.61	1998-01-03	29

rfm['R'] = - (rfm.order_dt-rfm.order_dt.max())/np.timedelta64(1,'D')
rfm.rename(columns = {'order_products':'F','order_amount':'M'},inplace = True)
rfm.head()

	M	order_dt	F	R
user_id
1	11.77	1997-01-01	1	545.0
2	89.00	1997-01-12	6	534.0
3	156.46	1998-05-28	16	33.0
4	100.50	1997-12-12	7	200.0
5	385.61	1998-01-03	29	178.0

def rfm_func(x):
    level = x.apply(lambda x:'1'if x>= 0 else '0')
    label = level.R +level.F + level.M 
    d = {'111':'重要价值客户',
         '011':'重要保持客户',
         '101':'重要挽留客户',
         '001':'重要发展客户',
         '110':'一般价值客户',
         '010':'一般保持客户',
         '100':'一般挽留客户',
         '000':'一般发展客户',
    }
    result = d[label]
    return result

rfm['label'] = rfm[['R','F','M']].apply(lambda x:x-x.mean()).apply(rfm_func,axis =1)
rfm.head()#由于极值得影响，这里得平均值偏高，使结果不太准确，改成中位数等，
#5等分等，人工排除（切比雪夫）

#有价值需发展
#有一段时间没消费，比平均消费频次低，金额高
#最近消费过
#+字符串连接

	M	order_dt	F	R	label
user_id
1	11.77	1997-01-01	1	545.0	一般挽留客户
2	89.00	1997-01-12	6	534.0	一般挽留客户
3	156.46	1998-05-28	16	33.0	重要保持客户
4	100.50	1997-12-12	7	200.0	一般发展客户
5	385.61	1998-01-03	29	178.0	重要保持客户

rfm.groupby('label').sum()

	M	F	R
label
一般价值客户	7181.28	650	36295.0
一般保持客户	19937.45	1712	29448.0
一般发展客户	196971.23	13977	591108.0
一般挽留客户	438291.81	29346	6951815.0
重要价值客户	167080.83	11121	358363.0
重要保持客户	1592039.62	107789	517267.0
重要发展客户	45785.01	2023	56636.0
重要挽留客户	33028.40	1263	114482.0

从RFM分层可知，大部分用户为重要保持用户，这是由于极值的影响，所有RFM的划分标准应该为业务为准

尽量用小部分的用户覆盖大部分的额度
不要为了数据好看划分等级

rfm.loc[rfm.label == '重要价值客户','color'] = 'k'
rfm.loc[rfm.label == '重要保持客户','color'] = 'r'
rfm.loc[rfm.label == '重要发展客户','color'] = 'gold'
rfm.loc[rfm.label == '重要挽留客户','color'] = 'y'
rfm.loc[rfm.label == '一般保持客户','color'] = 'g'
rfm.loc[rfm.label == '一般价值客户','color'] = 'b'
rfm.loc[rfm.label == '一般发展客户','color'] = 'c'
rfm.loc[rfm.label == '一般挽留客户','color'] = 'm'
rfm.plot.scatter('F','R',c = rfm.color)

在这里插入图片描述

rfm.head()

	M	order_dt	F	R	label	color
user_id
1	11.77	1997-01-01	1	545.0	一般挽留客户	m
2	89.00	1997-01-12	6	534.0	一般挽留客户	m
3	156.46	1998-05-28	16	33.0	重要保持客户	r
4	100.50	1997-12-12	7	200.0	一般发展客户	c
5	385.61	1998-01-03	29	178.0	重要保持客户	r

② 用户分层 -用户生命周期相关（新客/老客…）

新用户：第一次消费
活跃用户：一直持续消费
回流：之前没有消费，有一段时间没消费，现在又消费了（时间间隔不定）
流失/不活跃：有一短时间没消费，或者上月消费这月不消费

#数据透视
pivoted_counts = df.pivot_table(index = 'user_id',
               columns = 'month',
               values = 'order_dt',
              aggfunc = 'count').fillna(0)
pivoted_counts.head()

month	1997-01-01 00:00:00	1997-02-01 00:00:00	1997-03-01 00:00:00	1997-04-01 00:00:00	1997-05-01 00:00:00	1997-06-01 00:00:00	1997-07-01 00:00:00	1997-08-01 00:00:00	1997-09-01 00:00:00	1997-10-01 00:00:00	1997-11-01 00:00:00	1997-12-01 00:00:00	1998-01-01 00:00:00	1998-02-01 00:00:00	1998-03-01 00:00:00	1998-04-01 00:00:00	1998-05-01 00:00:00	1998-06-01 00:00:00
user_id
1	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	2.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
3	1.0	0.0	1.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	2.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0
4	2.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0
5	2.0	1.0	0.0	1.0	1.0	1.0	1.0	0.0	1.0	0.0	0.0	2.0	1.0	0.0	0.0	0.0	0.0	0.0

df_purchase = pivoted_counts.applymap(lambda x: 1 if x >0 else 0)
df_purchase.tail()
#简化，在一月内有消费为1，没消费为0
#透视会补上1，2月份的消费情况，为0，实质从3月份开始才生命周期开始，多重判断

month	1997-01-01 00:00:00	1997-02-01 00:00:00	1997-03-01 00:00:00	1997-04-01 00:00:00	1997-05-01 00:00:00	1997-06-01 00:00:00	1997-07-01 00:00:00	1997-08-01 00:00:00	1997-09-01 00:00:00	1997-10-01 00:00:00	1997-11-01 00:00:00	1997-12-01 00:00:00	1998-01-01 00:00:00	1998-02-01 00:00:00	1998-03-01 00:00:00	1998-04-01 00:00:00	1998-05-01 00:00:00	1998-06-01 00:00:00
user_id
23566	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
23567	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
23568	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
23569	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
23570	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

def active_status(data):
    status = []
    for i in range(18):
        
        #若本月没有消费,输入输出的判断，18个月作为一个输入，每个月的状态都判断
        if data[i] == 0:
            if len(status) > 0 :
                if status[i-1] == 'unreg':
                    status.append('unreg')
                else:
                    status.append('unactive')
            else:
                #针对第一次没消费的进行判断
                status.append('unreg')
            #之前没有注册，当前没有消费，就是未注册用户，如果之前注册，当前未消费，就是不活跃 #若本月消费
    else:
        if len(status) == 0:
            status.append('new')
            #之前的状态是0，为首购用户
        else:
            #上个月是不活跃，本月有状态，是回流
            if status[i-1] == 'unactive':
                status.append('return')
            #上月未注册，就是新用户
            elif status[i-1] == 'unreg':
                status.append('new')
            #上月活跃，活跃用户
            else:
                status.append('active')
    return status

代码逻辑
① 若本月没有消费

若之前是未注册，则依旧未未注册
若之前有消费，则为流失/补活跃
其他情况，为未注册

② 若本月有消费

若是第一次消费，则为新用户
如果之前有过消费，则上个月为不活跃，则为回流
如果上个月为未注册，则为新用户
除此之外，为活跃

#表应用
#上个月的状态left join 当前状态，逻辑加减法（业界主流）
purchase_stats = df_purchase.apply(active_status,axis=1)
purchase_stats.head(5)

purchase_stats_ct = purchase_stats.replace('unreg',np.NaN).apply(lambda x:pd.value_counts(x))
#NaN在之后的计算中不被计算，不希望参与处理
#是默认按照列来计数的吗
#数据的流通和变化，不活跃的用户越来越多，新用户少，回流稳定，得到模型，运营不利等

purchase_stats_ct.fillna(0).T.head()
#对NaN填充，T专列
purchase_stats_ct.fillna(0).T.plot.area()
#填充，转列，面积图，得到每类用户的状态

purchase_stats_ct.fillna(0).T.apply(lambda x:x/x.sum(),axis =1)
#每个月份消费用户的构成（active + new + return）
#回流多：促销好；新增多：市场部门/渠道部门给了
#return/unactive

由表可知，每月的用户消费状态变化

活跃用户，持续消费的用户，对应消费运营的质量
回流用户，之前不消费本月才消费，对应的使唤回运营
不活跃用户，对应流失

（5）用户购买周期（按订单）

order_diff = grouped_user.apply(lambda x:x.order_dt -x.order_dt.shift())#shift，将所有的数据错位,偏移相减
order_diff

user_id       
1        0            NaT
2        1            NaT
         2         0 days
3        3            NaT
         4        87 days
         5         3 days
         6       227 days
         7        10 days
         8       184 days
4        9            NaT
         10       17 days
         11      196 days
         12      132 days
5        13           NaT
         14       13 days
         15       21 days
         16       66 days
         17       50 days
         18       16 days
         19       36 days
         20       55 days
         21       84 days
         22        4 days
         23       22 days
6        24           NaT
7        25           NaT
         26      283 days
         27      162 days
8        28           NaT
         29       43 days
                   ...   
23556    69629    63 days
         69630    98 days
         69631   155 days
23557    69632        NaT
23558    69633        NaT
         69634    54 days
         69635    37 days
         69636   246 days
23559    69637        NaT
         69638    54 days
         69639    40 days
23560    69640        NaT
23561    69641        NaT
         69642   309 days
         69643   121 days
23562    69644        NaT
23563    69645        NaT
         69646   193 days
23564    69647        NaT
         69648    57 days
         69649   193 days
23565    69650        NaT
23566    69651        NaT
23567    69652        NaT
23568    69653        NaT
         69654    11 days
         69655    17 days
23569    69656        NaT
23570    69657        NaT
         69658     1 days
Name: order_dt, Length: 69659, dtype: timedelta64[ns]

df.order_dt.shift().head()#错位函数

0          NaT
1   1997-01-01
2   1997-01-12
3   1997-01-12
4   1997-01-02
Name: order_dt, dtype: datetime64[ns]

df.order_dt.head()#显示前五行

0   1997-01-01
1   1997-01-12
2   1997-01-12
3   1997-01-02
4   1997-03-30
Name: order_dt, dtype: datetime64[ns]

order_diff.describe()#描述错位相减结果

count                      46089
mean     68 days 23:22:13.567662
std      91 days 00:47:33.924168
min              0 days 00:00:00
25%             10 days 00:00:00
50%             31 days 00:00:00
75%             89 days 00:00:00
max            533 days 00:00:00
Name: order_dt, dtype: object

(order_diff/np.timedelta64(1,'D')).hist(bins =20)#去掉单位，排除空值

在这里插入图片描述

user_life = grouped_user.order_dt.agg(['min','max'])#第一次消费时间，最后一次消费时间
(user_life['max'] - user_life['min']).describe()

count                       23570
mean     134 days 20:55:36.987696
std      180 days 13:46:43.039788
min               0 days 00:00:00
25%               0 days 00:00:00
50%               0 days 00:00:00
75%             294 days 00:00:00
max             544 days 00:00:00
dtype: object

((user_life['max'] - user_life['min'])/np.timedelta64(1,'D')).hist(bins =20)#去掉单位，排除空值

在这里插入图片描述

用户的生命周期受只购买一次的用户影响比较厉害(可以排除)
用户均消费134天，中位数仅0天

#提取出一次以上的用户
u_1 = ((user_life['max'] - user_life['min']).reset_index()[0]/np.timedelta64(1,'D'))
u_1[u_1>0].hist(bins = 20)

在这里插入图片描述

（6）复购率和回购率(核心电商相关指标)

复购率
- 自然月内，购买多次的用户占比
回购率
- 曾经购买的用户在某一时期内再次购买的占比

pivoted_counts.head()#用户在每个月的消费次数

month	1997-01-01 00:00:00	1997-02-01 00:00:00	1997-03-01 00:00:00	1997-04-01 00:00:00	1997-05-01 00:00:00	1997-06-01 00:00:00	1997-07-01 00:00:00	1997-08-01 00:00:00	1997-09-01 00:00:00	1997-10-01 00:00:00	1997-11-01 00:00:00	1997-12-01 00:00:00	1998-01-01 00:00:00	1998-02-01 00:00:00	1998-03-01 00:00:00	1998-04-01 00:00:00	1998-05-01 00:00:00	1998-06-01 00:00:00
user_id
1	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	2.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
3	1.0	0.0	1.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	2.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0
4	2.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0
5	2.0	1.0	0.0	1.0	1.0	1.0	1.0	0.0	1.0	0.0	0.0	2.0	1.0	0.0	0.0	0.0	0.0	0.0

purchase_r = pivoted_counts.applymap(lambda x: 1 if x>1 else np.NaN if x==0 else 0)
purchase_r.head()

month	1997-01-01 00:00:00	1997-02-01 00:00:00	1997-03-01 00:00:00	1997-04-01 00:00:00	1997-05-01 00:00:00	1997-06-01 00:00:00	1997-07-01 00:00:00	1997-08-01 00:00:00	1997-09-01 00:00:00	1997-10-01 00:00:00	1997-11-01 00:00:00	1997-12-01 00:00:00	1998-01-01 00:00:00	1998-02-01 00:00:00	1998-03-01 00:00:00	1998-04-01 00:00:00	1998-05-01 00:00:00	1998-06-01 00:00:00
user_id
1	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	0.0	NaN	0.0	0.0	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	0.0	NaN
4	1.0	NaN	NaN	NaN	NaN	NaN	NaN	0.0	NaN	NaN	NaN	0.0	NaN	NaN	NaN	NaN	NaN	NaN
5	1.0	0.0	NaN	0.0	0.0	0.0	0.0	NaN	0.0	NaN	NaN	1.0	0.0	NaN	NaN	NaN	NaN	NaN

applymap和apply的差别：

apply 用在dataframe上，用于对row或者column进行计算；
applymap 用于dataframe上，是元素级别的操作；
map （其实是python自带的）用于series上，是元素级别的操作。
链接： https://www.cnblogs.com/cymwill/p/7577369.html
这个地方是由于使用if条件语句而使用applymap?

(purchase_r.sum()/purchase_r.count()).plot(figsize =(10,4))#复购的用户/总人数，10,4分别为宽和高

在这里插入图片描述

复购率稳定在20%左右，前三个月因为有大量新用户涌入，而这批用户只购买了一次，所有导致复购率降低

df_purchase.head()#1表示本月消费过，0表示本月未消费过

month	1997-01-01 00:00:00	1997-02-01 00:00:00	1997-03-01 00:00:00	1997-04-01 00:00:00	1997-05-01 00:00:00	1997-06-01 00:00:00	1997-07-01 00:00:00	1997-08-01 00:00:00	1997-09-01 00:00:00	1997-10-01 00:00:00	1997-11-01 00:00:00	1997-12-01 00:00:00	1998-01-01 00:00:00	1998-02-01 00:00:00	1998-03-01 00:00:00	1998-04-01 00:00:00	1998-05-01 00:00:00	1998-06-01 00:00:00
user_id
1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
2	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3	1	0	1	1	0	0	0	0	0	0	1	0	0	0	0	0	1	0
4	1	0	0	0	0	0	0	1	0	0	0	1	0	0	0	0	0	0
5	1	1	0	1	1	1	1	0	1	0	0	1	1	0	0	0	0	0

def purchase_back(data):
    status = []
    for i in range(17):
        if data[i] == 1:
            if data[i+1] ==1:
                status.append(1)
            if data[i+1] ==0:
                status.append(0)
        else:
            status.append(np.NaN)
    status.append(np.NaN)
    return status
    #填充最后一个月，

import pandas as pd
import numpy as np
purchase_b = df_purchase.apply(purchase_back,axis =1)
purchase_b.head(5)

user_id
1    [0, nan, nan, nan, nan, nan, nan, nan, nan, na...
2    [0, nan, nan, nan, nan, nan, nan, nan, nan, na...
3    [0, nan, 1, 0, nan, nan, nan, nan, nan, nan, 0...
4    [0, nan, nan, nan, nan, nan, nan, 0, nan, nan,...
5    [1, 0, nan, 1, 1, 1, 0, nan, 0, nan, nan, 1, 0...
dtype: object

(purchase_b.sum()/purchase_b.count()).plot(figsize =(10,4))
##这个地方怎么回事，为解决画不出图

Luara_lyy

关注

0
点赞
踩
6

收藏

觉得还不错? 一键收藏
2
评论
python数据分析实战---秦璐

讲解：一份数据分析报告 #既包含代码又包含数据分析结果的PDF形式Markdown写法：说明code import pandas as pd快捷切换 esc+m 变为Markdown/esc+m 变为code图形化呈现：结果练习1 这是一份用户消费行为的分析报告(数据来源于网络，用户在一家CD网站上的消费)import pandas as pdimport numpy as...
复制链接

扫一扫

专栏目录