SQL+Tableau化妆品数据分析

3 篇文章 0 订阅
3 篇文章 0 订阅

一、项目介绍

本项目通过对化妆品10月、11月、12月、1月、2月的用户行为数据的探索,通过SQL数据处理以及tableau可视化,整个项目分为项目目的的确定、数据的预处理、对数据的分析和项目总结这五个部分。

二、项目流程

项目目的

从结果指标出发确定目标,通过过程指标定位问题,提出合理建议

数据来源

数据来源与和鲸社区,包含一个大型多品类在线商店5个月的用户行为数据,由Open CDP项目收集。文件中的每一行代表一个事件。所有事件都与产品和用户有关。共计9个字段。

数据预处理

因为单个月份的数据量高达300万以上,共计千万级数据集,产生较大的时间成本和内存成本,所以不将数据合并,对其分表查询,以下预处理对五个表都进行同样操作

缺失值处理

 对于dec表,分别对各个字段缺失值查取,得到结果为brand、category和user_session存在缺失值,因为后续操作用不到category和user_session,故将其过滤,而对于brand填充为'未知'

#缺失值处理
select event_time FROM dec_2019 where   event_time is null;
select event_type FROM dec_2019 where   event_type is null;
select product_id FROM dec_2019 where   product_id is null;
select category_id FROM dec_2019 where   category_id is null;
-- 缺失标记
select category_code FROM dec_2019 where   category_code is null;
-- 缺失标记
select brand FROM dec_2019 where   brand is null;
select price FROM dec_2019 where   price is null;
select user_id FROM dec_2019 where   user_id is null;
-- 缺失标记
select user_session FROM dec_2019_1 where   user_session is null;
select product_id FROM dec_2019_1 where   brand is null;
##dec_2019
-- 缺失值替换未知
UPDATE dec_2019
SET brand='未知'
where brand is NULL;

异常值处理

通过对price列探索发现存在负值,故将price列存在负值的数据删除

DELETE FROM dec_2019  where  price<0;

重复值处理

对单列重复值查询,发现存在重复值,采用distinct过滤

select DISTINCT a.event_time,a.event_type,a.product_id,a.category_id,a.brand,a.price,a.user_id from  dec_2019

添加时间维度

对于event_time列,在将数据导入时因为格式为2019-12-01 00:00:00 UTC,博主通过Navicat直接导入存在将event_time认定为null,输出1900-01-20 00:00:00,所以博主将event_time的数据类型定义为字符串,后续再将其转化为datetime,新建一个处理完的数据表,过滤不需要的字段

CREATE TABLE dec_2019_fin as 
select  DATE_FORMAT(left(a.event_time,10) ,'%Y-%m-%d') as time,
				year(left(a.event_time,10)) as year,
				MONTH(left(a.event_time,10)) as month,
				WEEK(left(a.event_time,10),1) as week,
				a.event_type,
				a.product_id,
				a.category_id,
				a.brand,
				a.price,
				a.user_id 
from  
(select DISTINCT event_time,event_type,product_id,category_id,brand,price,user_id from  dec_2019) a

可视化分析

对销售量和订单量汇总,并将其可视化,可以看到12月出现了明显下降趋势,将月份拆分,对11月和12月数据观察。

SELECT month as '月份',sum(price) as '销售额',count(1) as '订单量' FROM dec_2019_fin WHERE event_type = 'purchase' GROUP BY month
UNION ALL
SELECT month as '月份',sum(price) as '销售额',count(1) as '订单量'  FROM nov_2019_fin WHERE event_type = 'purchase' GROUP BY month
UNION ALL
SELECT month as '月份',sum(price) as '销售额',count(1) as '订单量'  FROM oct_2019_fin WHERE event_type = 'purchase' GROUP BY month
UNION ALL
SELECT month as '月份',sum(price) as '销售额',count(1) as '订单量'  FROM feb_2020_fin WHERE event_type = 'purchase' GROUP BY month
UNION ALL
SELECT month as '月份',sum(price) as '销售额',count(1) as '订单量'  FROM jan_2020_fin WHERE event_type = 'purchase' GROUP BY month

 观察对月购买人数与月客单价对比,先通过SQL计算月购买人数与月销售额,通过tableau计算字段月销售额/月购买人数以计算客单价,可以看到12月的购买人数较低,11月和1月都出现增长,可以推测11月和1月都进行活动推广,且11月的客单价较低,可能通过促销推广,12月虽然月购买人数较低,但客单价还是持正常水平,可能流失各个画像的群体都出现流失或者某个画像群体。

-- 月购买人数
SELECT month,sum(cut) 月购买人数 FROM
(SELECT month,user_id,count(1) cut FROM dec_2019_fin WHERE event_type = 'purchase' GROUP BY  month,user_id) b
GROUP BY month
UNION
SELECT month,sum(cut) 月购买人数 FROM
(SELECT month,user_id,count(1) cut FROM nov_2019_fin WHERE event_type = 'purchase' GROUP BY  month,user_id) b
GROUP BY month
UNION
SELECT month,sum(cut) 月购买人数 FROM
(SELECT month,user_id,count(1) cut FROM oct_2019_fin WHERE event_type = 'purchase' GROUP BY  month,user_id) b
GROUP BY month
UNION
SELECT month,sum(cut) 月购买人数 FROM
(SELECT month,user_id,count(1) cut FROM feb_2020_fin WHERE event_type = 'purchase' GROUP BY  month,user_id) b
GROUP BY month
UNION
SELECT month,sum(cut) 月购买人数 FROM
(SELECT month,user_id,count(1) cut FROM jan_2020_fin WHERE event_type = 'purchase' GROUP BY  month,user_id) b
GROUP BY month

 对于化妆品的PV和UV计算,可视化观察得到,11月和1月相比于前一个月份都出现上升趋势,可以推出可能11月和1月做了推广活动,12月的PV和UV的环比增长率都出现下降趋势,由以上分析可得,用户对着落页阅读浏览下降,可能的原因是着陆页的设计出现问题,可以对其网页优化。

-- PV 
SELECT month,sum(cut) pv FROM
(SELECT month,user_id,count(1) cut FROM dec_2019_fin WHERE event_type='view' GROUP BY  month,user_id) b
GROUP BY month
union
SELECT month,sum(cut) pv FROM
(SELECT month,user_id,count(1) cut FROM nov_2019_fin WHERE event_type='view' GROUP BY  month,user_id) b
GROUP BY month
union
SELECT month,sum(cut) pv FROM
(SELECT month,user_id,count(1) cut FROM oct_2019_fin WHERE event_type='view' GROUP BY  month,user_id) b
GROUP BY month
union
SELECT month,sum(cut) pv FROM
(SELECT month,user_id,count(1) cut FROM feb_2020_fin WHERE event_type='view' GROUP BY  month,user_id) b
GROUP BY month
union
SELECT month,sum(cut) pv FROM
(SELECT month,user_id,count(1) cut FROM jan_2020_fin WHERE event_type='view' GROUP BY  month,user_id) b
GROUP BY month




-- UV计算
SELECT month,sum(cut) uv FROM
(SELECT month,user_id,count(1) cut FROM dec_2019_fin GROUP BY  month,user_id) b
GROUP BY month
UNION
SELECT month,sum(cut) uv FROM
(SELECT month,user_id,count(1) cut FROM nov_2019_fin GROUP BY  month,user_id) b
GROUP BY month
UNION
SELECT month,sum(cut) uv FROM
(SELECT month,user_id,count(1) cut FROM oct_2019_fin GROUP BY  month,user_id) b
GROUP BY month
UNION
SELECT month,sum(cut) uv FROM
(SELECT month,user_id,count(1) cut FROM feb_2020_fin GROUP BY  month,user_id) b
GROUP BY month
UNION
SELECT month,sum(cut) uv FROM
(SELECT month,user_id,count(1) cut FROM jan_2020_fin GROUP BY  month,user_id) b
GROUP BY month

  对月购买人数和UV进行对比计算支付转化率,先计算SQL计算月购买人数和月UV,再通过tableau创建计算字段,月销售额/月UV,观察11月的支付转化率较1月高,说明11月的推广比1月较好,而12月转化率不低于10月转化率,可能的原因是11月的流量使得12月支付转化率相对较好。

 对复购人数和购买总人数对比,通过SQL计算复购人数和购买总人数,再通过tableau 计算字段月复购人数/购买人数,可以看到12月复购率较大,可能的原因是12月流失较多忠诚度较低的用户,而11月和1月复购率较低,可能的原因是推广带来相对较多的忠诚度较低的用户。

-- 复购率
-- 复购人数
SELECT month,count(cut) 复购人数 FROM
(SELECT month,user_id,count(1) cut FROM dec_2019_fin WHERE event_type = 'purchase' GROUP BY month,user_id) b
WHERE cut>1
GROUP BY month
UNION 
SELECT month,count(cut) 复购人数 FROM
(SELECT month,user_id,count(1) cut FROM nov_2019_fin WHERE event_type = 'purchase' GROUP BY month,user_id) b
WHERE cut>1
GROUP BY month
UNION
SELECT month,count(cut) 复购人数 FROM
(SELECT month,user_id,count(1) cut FROM oct_2019_fin WHERE event_type = 'purchase' GROUP BY month,user_id) b
WHERE cut>1
GROUP BY month
UNION
SELECT month,count(cut) 复购人数 FROM
(SELECT month,user_id,count(1) cut FROM feb_2020_fin WHERE event_type = 'purchase' GROUP BY month,user_id) b
WHERE cut>1
GROUP BY month
UNION
SELECT month,count(cut) 复购人数 FROM
(SELECT month,user_id,count(1) cut FROM jan_2020_fin WHERE event_type = 'purchase' GROUP BY month,user_id) b
WHERE cut>1
GROUP BY month

 对各月份品牌分析,可以看到对于12月可能流失italwax的用户群体,可能的原因是供应不足,对于runail、masurea等这种基本产品应该保证其供应,freedecor这种周期性较长的用户,可以加强对其推广。

SELECT b.month,b.brand,b.cut,b.row_index
FROM
(select a.month,a.brand,a.cut,ROW_NUMBER() OVER (PARTITION BY month ORDER BY cut DESC) as row_index
FROM
(SELECT month ,brand,count(1) cut  FROM dec_2019_fin WHERE  brand <>'未知' AND  event_type = 'purchase' 
GROUP BY brand
UNION ALL
SELECT month ,brand,count(1) cut  FROM nov_2019_fin  WHERE  brand <>'未知' AND  event_type = 'purchase'  
GROUP BY brand
UNION ALL
SELECT month ,brand,count(1) cut  FROM oct_2019_fin  WHERE  brand <>'未知' AND  event_type = 'purchase'  
GROUP BY brand
UNION ALL
SELECT month ,brand,count(1) cut  FROM feb_2020_fin WHERE brand <>'未知' AND  event_type = 'purchase'  
GROUP BY brand
UNION ALL
SELECT month ,brand,count(1) cut  FROM jan_2020_fin   WHERE brand <>'未知' AND  event_type = 'purchase'  
GROUP BY brand
) a

) b
where row_index<=10

对各月用户行为漏斗图分析,可以看到10月和11月加入购物车行为占比都较高,且11月最终购买行为占比最大可以看出推广效果极好,12月加入购物出的行为占比较低,且最终购买行为也较低,1月推广带来的变化不大,加入购物出行为相比12月高,但移除购物出相比12月较高。12月移除购物车比重较低,可以看出12月有较多忠诚用户,可以观察12月的忠诚用户进行分析建议。

SELECT month,event_type,count(*) cut FROM dec_2019_fin   GROUP BY event_type
UNION ALL 
SELECT month,event_type,count(*) cut FROM nov_2019_fin   GROUP BY event_type
UNION
SELECT month,event_type,count(*) cut FROM oct_2019_fin   GROUP BY event_type
UNION ALL 
SELECT month,event_type,count(*) cut FROM feb_2020_fin   GROUP BY event_type
UNION ALL 
SELECT month,event_type,count(*) cut FROM jan_2020_fin   GROUP BY event_type

 通过首购月确定新老用户,可以看到11月的新用户远远大于1月的新用户,12月和1月相对于前三个月老用户增加,出现的原因可能是10月、11月和12月带来的老用户

CREATE VIEW rfm_a AS 
SELECT time,month,event_type,user_id,brand,price from dec_2019_fin
WHERE event_type = 'purchase'
UNION
SELECT time,month,event_type,user_id,brand,price from nov_2019_fin
WHERE event_type = 'purchase'
UNION
SELECT time,month,event_type,user_id,brand,price from oct_2019_fin
WHERE event_type = 'purchase'
UNION
SELECT time,month,event_type,user_id,brand,price from feb_2020_fin
WHERE event_type = 'purchase'
UNION
SELECT time,month,event_type,user_id,brand,price from jan_2020_fin
WHERE event_type = 'purchase'


SELECT month,event_type,brand,price,user_id,count(1) cut,用户 FROM
(
SELECT c.time,c.month,c.event_type,c.brand,c.price,c.user_id,IF(c.month=b.首购月份,'新用户','老用户') 用户 FROM rfm_a c
INNER JOIN
(SELECT a.user_id,a.首购日,MONTH(首购日) 首购月份 from
(SELECT user_id,min(time) 首购日 FROM rfm_a   GROUP BY user_id) a) b
ON c.user_id=b.user_id
 ) d

GROUP BY month,user_id

 对新老用户品牌分析,新老用户都热衷于品牌为runail,可以对持品牌增大销售力度,而对于其他占比较小的品牌可以通过广告或者活动进行推广。

 构建RFM 模型

首先通过datediff获得R值,count、distinct获得F值,sum获得M值,再通过R值、F值、M值的中位数确定RFM模型的阈值(存在较大的主观和经验,需向其他人沟通确定),最终给其打上标签,得出各月份的各类用户占比图,对于11月出现的较多重要价值用户和重要保持用户,出现的原因可能是11月的重要发展用户的积累和运营,可以看出11月的推广效果极好,同时也应该对高价值用户开放更多福利以留存高价值用户,对重要保持用户对其开放会员,以及专属优惠,使其向高价值用户发展,对于一般价值用户推出裂变活动。

SELECT month,

客户类型,

COUNT(*) AS 客户数量
FROM
(SELECT

month,

R_Value,

F_Value,

M_Value,

(CASE

WHEN R_Value = 1 AND F_Value = 1 AND M_Value = 1 THEN '1重要价值客户'

WHEN R_Value = 0 AND F_Value = 1 AND M_Value = 1 THEN '2重要保持客户'

WHEN R_Value = 1 AND F_Value = 0 AND M_Value = 1 THEN '3重要发展客户'

WHEN R_Value = 0 AND F_Value = 0 AND M_Value = 1 THEN '4重要挽留客户'

WHEN R_Value = 1 AND F_Value = 1 AND M_Value = 0 THEN '5一般价值客户'

WHEN R_Value = 1 AND F_Value = 0 AND M_Value = 0 THEN '6一般发展客户'

WHEN R_Value = 0 AND F_Value = 1 AND M_Value = 0 THEN '7一般保持客户'

ELSE '8一般挽留客户'

END

) AS 客户类型

FROM 
(
SELECT month,user_id,MIN(DATEDIFF('2020-03-01',time)) AS r1,DATEDIFF('2020-03-01',MAX(time)) AS r2,
COUNT(DISTINCT price ) AS f,ROUND(sum(price),2) AS m,
IF(MIN(DATEDIFF('2020-03-01',time))>70,0,1) AS R_Value,
IF(COUNT(DISTINCT price )>3,1,0) AS F_Value,
IF(ROUND(sum(price),2) >15,1,0) AS M_Value
FROM rfm_b
WHERE  event_type = 'purchase'
GROUP BY month,user_id
ORDER BY user_id
)
tp) tp2
GROUP BY month,客户类型

 结论

      本次实验在面对千万级数据时,使用SQL略显乏力,但只要对其优化,分部查询,减少使用*,避 免%xxx式查询,or改为in等。如果非要创建表,可能会出现内存不足,以及临时表也会因为C盘内存不足而出现报错,移动data数据链接:移动MySQL的data文件夹方法_今朝鱼的博客-CSDN博客_mysql 移动data,分部查询对于全表查询查询速度更快,博主尝试全表查询,花费5分钟以上(当然可能是博主电脑配置较差),对于union all 如果分别order by 会出现报错,可以通过窗口函数排序,再取值。对于时间列,在Navicat导入时应该关注分隔符,如果识别不出,MySQL会将其1900,也可以通过将其字符串导入再在查询中将其转换。对于去重,尽量不要distinct*,本次实验最大的感悟是想念python了。

  • 2
    点赞
  • 26
    收藏
    觉得还不错? 一键收藏
  • 4
    评论
评论 4
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值