《项目实战-电子商务消费行为分析》

前言

学习过程中的小项目,记录一下,若有不足,还请谅解
要求:
 ● 掌握Zeppelin的使用
 ● 数据探索
 ● 数据清洗
 ● 基于Hive的数据分析

任务描述

对某零售企业最近1年门店收集的数据进行数据分析
 ◆ 潜在客户画像
 ◆ 用户消费统计
 ◆ 门店的资源利用率
 ◆ 消费的特征人群定位
 ◆ 数据的可视化展现

数据结构

Customer表
问题:language字段数据存在错误

ALT
Transaction表
问题:表中transaction_id有重复,但数据有效,需要修复数据


Store表
在store有多少employee

在这里插入图片描述
Review表
表中有无效的score数据
表中有将transaction_id映射到错误的store_id

数据准备和了解

我们这里使用zeppelin导入Json的模板编写
数据获取
上传数据到虚拟机指定目录下
[root@master~]mkdir -p /tmp/data
在这里插入图片描述检查数据行数
在这里插入图片描述
header行
在这里插入图片描述

上传数据到HDFS
在这里插入图片描述
创建适当的外部表来保存数据

create database shopping
use shopping
--顾客表
create external table if not exists ext_customer_details (
customer_id string,
first_name string,
last_name string,
email string,
gender string,
address string,
country string,
language string,
job string,
credit_type string,
credit_no string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/customer'
tblproperties ("skip.header.line.count"="1")

--交易明细表
create external table if not exists ext_transaction_details (
transaction_id string,
customer_id string,
store_id string,
price decimal(8,2),
product string,
purchase_date string,
purchase_time string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/transaction' 
tblproperties ("skip.header.line.count"="1")

--店铺表
create external table if not exists ext_store_details (
store_id string,
store_name string,
employee_number int
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/store' --this must tblproperties 
tblproperties ("skip.header.line.count"="1")

--评价表
create external table if not exists ext_store_review (
transaction_id string,
store_id string,
review_score int
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/review' --this must tblproperties 
tblproperties ("skip.header.line.count"="1")

■ 查看并了解数据

数据清洗

vw_customer_details

移除不需要字段(language),个人信息加密

create view if not exists vw_customer_details
as
select
customer_id,
first_name ,
unbase64(last_name) as last_name ,
unbase64(email)as email ,
gender ,
unbase64(address) as address ,
country ,
job,
credit_type,
unbase64(credit_no)as credit_no
from ext_customer_details

transaction_details

重新组织transaction数据按照日期YYYY-MM做分区

set hive.exec.dynamic.partition.mode=nonstrict

create table if not exists transaction_details (
transaction_id string,
customer_id string,
store_id string,
price decimal(8,2),
product string,
purchase_time string,
purchase_date date
)
partitioned by (purchase_month string)

导入数据时再清洗
对transaction_details中的重复数据生成新ID

with base as (
select 
transaction_id,
customer_id,
store_id,
price,
product,
purchase_time,
purchase_date,
from_unixtime(unix_timestamp(purchase_date ,'yyyy-MM-dd'), 'yyyy-MM') as 
purchase_month,
row_number() over(partition by transaction_id order by store_id) as rn
from ext_transaction_details
where customer_id<>'customer_id'
)
from base
insert overwrite table transaction_details partition(purchase_month)
select 
if(rn = 1, transaction_id, concat(transaction_id, '_fix', rn)) as transaction_id,
customer_id,
store_id,
price,
product,
purchase_time,
purchase_date,
purchase_month

vw_store_review

过滤掉store_review中没有评分的数据把清洗好的数据放到另一个表或者用View表示

create view if not exists vw_store_review
as select
transaction_id,
review_score
from
ext_store_review
where review_score <> ''

在这里插入图片描述
基本的清洗就完成了,还剩一张表不大,在查询的时候再做处理!

数据分析

Customer分析

- 6.1找出顾客最常用的信用卡

select
credit_type,count(distinct credit_no) as cnt
from vw_customer_details
group by credit_type
order by cnt desc
limit 6

在这里插入图片描述

- 6.2找出客户资料中排名前五的职位名称

select job,count(*) num
from vw_customer_details
group by job
order by num desc

在这里插入图片描述

- 6.3在美国女性最常用的信用卡

select 
credit_type,count(*) as cnt
from vw_customer_details
where gender='Female' and country='United States'
group by credit_type
order by cnt desc
limit 5

在这里插入图片描述

- 6.4按性别和国家进行客户统计

select gender,country,count(*)
from vw_customer_details
group by gender,country

在这里插入图片描述

Transaction分析-1

- 7.1计算每月总收入

select purchase_month,sum(price) 
from transaction_details
group by purchase_month

在这里插入图片描述

- 7.2计算每个季度的总收入

with base as(
select 
price,
concat_ws("-",substr(purchase_date,0,4),
cast(ceil(month(purchase_month)/3.0) as string)
)as year_quarter
from transaction_details
)
select year_quarter,
sum(price) as revenue_total
from base group by year_quarter
order by year_quarter

在这里插入图片描述

- 7.3按年计算总收入

with base as (
select price,substr(purchase_date,0,4) as year
from transaction_details
)
select year,sum(price)
from base group by year
order by year

在这里插入图片描述

- 7.4按工作日计算总收入

select dayofweek(cast(purchase_date as string))-1 work_date,sum(price)
from transaction_details
where dayofweek(cast(purchase_date as string))between 2 and 6
group by dayofweek(cast(purchase_date as string))

在这里插入图片描述

- 7.5按时间段计算总收入(需要清理数据)

--define time bucket 
--early morning: (5:00, 8:00]
--morning: (8:00, 11:00]
--noon: (11:00, 13:00]
--afternoon: (13:00, 18:00]
--evening: (18:00, 22:00]
--night: (22:00, 5:00] --make it as else, since it is not liner increasing
--We also format the time. 1st format time to 19:23 like, then compare, then convert minites to hours
-- 使用正则表达式清理数据然后使用case when 分组查询
with
t1 as(
select *, if(instr(purchase_time,'PM')>0,
				if(cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12>=24,
					0,
					cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12),
				cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)) as timeTrans
from transaction_details), t2 as(
select t1.*,case when t1.timeTrans<=8 and t1.timeTrans>5 then 'early morning'
				 when t1.timeTrans<=11 and t1.timeTrans>8 then 'morning'
				 when t1.timeTrans<=13 and t1.timeTrans>11 then 'noon'
				 when t1.timeTrans<=18 and t1.timeTrans>13 then 'afternoon'
				 when t1.timeTrans<=22 and t1.timeTrans>18 then 'evening'
				 else 'night'
			end as timeSplit
from t1)
select t2.timeSplit,sum(price)
from t2 
group by t2.timeSplit

在这里插入图片描述

- 7.6按时间段计算平均消费

with
t1 as(
select *, if(instr(purchase_time,'PM')>0,
				if(cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12>=24,
					0,
					cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12),
				cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)) as timeTrans
from transaction_details), t2 as(
select t1.*,case when t1.timeTrans<=8 and t1.timeTrans>5 then 'early morning'
				 when t1.timeTrans<=11 and t1.timeTrans>8 then 'morning'
				 when t1.timeTrans<=13 and t1.timeTrans>11 then 'noon'
				 when t1.timeTrans<=18 and t1.timeTrans>13 then 'afternoon'
				 when t1.timeTrans<=22 and t1.timeTrans>18 then 'evening'
				 else 'night'
			end as timeSplit
from t1)
select t2.timeSplit,avg(price)
from t2 
group by t2.timeSplit

在这里插入图片描述

- 7.7按工作日计算平均消费

select dayofweek(cast(purchase_date as string))-1 as workdate,
avg(price)
from transaction_details
where dayofweek(cast(purchase_date as string)) between 2 and 6
group by dayofweek(cast(purchase_date as string))

在这里插入图片描述

- 7.8计算年、月、日的交易总数

--select purchase_date,count(1) from transaction_details group by purchase_date
--select concat(year(purchase_date),"-",month(purchase_date)),count(1) from transaction_details group by year(purchase_date),month(purchase_date)
--select year(purchase_date),count(1) from transaction_details group by year(purchase_date)
select count(1) over(partition by year(purchase_date)) year,
       count(1) over(partition by year(purchase_date),month(purchase_date)) month,
       count(1) over(partition by year(purchase_date),month(purchase_date),day(purchase_date))day from transaction_details

在这里插入图片描述

- 7.9找出交易量最大的10个客户

select 
customer_id,count(1) as num
from transaction_details
group by customer_id
order by num desc limit 10

在这里插入图片描述

- 7.10找出消费最多的前10位顾客

select 
customer_id,sum(price)as n
from transaction_details
group by customer_id
order by n desc
limit 10

在这里插入图片描述

Transaction分析-2

- 7.11统计该期间交易数量最少的用户

select 
customer_id,count(1) as num
from transaction_details
group by customer_id
order by num limit 10

在这里插入图片描述

- 7.12计算每个季度的独立客户总数

select
concat(
year(purchase_date),'年',ceil(month(purchase_date)/3.0),'季度'),
count(distinct customer_id)
from transaction_details
group by year(purchase_date),ceil(month(purchase_date)/3.0)

在这里插入图片描述

- 7.13计算每周的独立客户总数

select 
concat(year(purchase_date),'年第',weekofyear(purchase_date),'周'),
count(distinct customer_id)
from transaction_details
group by year(purchase_date),weekofyear(purchase_date)

在这里插入图片描述

- 7.14计算整个活动客户平均花费的最大值

select 
customer_id,max(avg_s)
from(
select customer_id,avg(price) avg_s
from transaction_details
group by customer_id
)t
group by customer_id

在这里插入图片描述

- 7.15统计每月花费最多的客户

select b.mon,b.id,b.sum_price
from(
select a.mon,a.id,a.sum_price ,
row_number() over(partition by a.mon order by a.sum_price desc) as num
from(
select 
concat(year(purchase_date),'-',month(purchase_date)) mon,
customer_id id,
sum(price) sum_price
from transaction_details
group by year(purchase_date),month(purchase_date),customer_id
)a
)b 
where b.num=1

- 7.16统计每月访问次数最多的客户

select b.mon,b.id,b.c
from(
select a.mon,a.id,a.c ,
row_number() over(partition by a.mon order by a.c desc) as num
from(
select 
concat(year(purchase_date),'-',month(purchase_date)) mon,
customer_id id,
count(1) c
from transaction_details
group by year(purchase_date),month(purchase_date),customer_id
)a
)b 
where b.num=1

在这里插入图片描述

- 7.17按总价找出最受欢迎的5种产品

select product,sum(price) s
from transaction_details
group by product
order by s desc
limit 5

在这里插入图片描述

- 7.18根据购买频率找出最畅销的5种产品

select product,
count(1) s
from transaction_details
group by product
order by s desc
limit 5

在这里插入图片描述
- 7.19根据客户数量找出最受欢迎的5种产品

select product,
count(distinct customer_id) c
from transaction_details
group by product
order by c desc
limit 5

在这里插入图片描述

- 7.20验证前5个details

select * from transaction_details where product in ('Goat - Whole Cut')

Store分析

- 8.1按客流量找出最受欢迎的商店

select store_id,
count(1) c
from transaction_details
group by store_id
order by c desc limit 1

在这里插入图片描述

- 8.2根据顾客消费价格找出最受欢迎的商店

select store_id,
sum(price) s
from transaction_details
group by store_id
order by s desc
limit 1

在这里插入图片描述

- 8.3根据顾客交易情况找出最受欢迎的商店

select store_id,
count(1) c,
sum(price) s
from transaction_details
group by store_id
order by c desc,s desc
limit 1

在这里插入图片描述

- 8.4根据商店和唯一的顾客id获取最受欢迎的产品

select m.store_id,m.product
from(
select t.store_id,t.product,t.c,
row_number() over(partition by store_id order by t.c desc)as rn
from
(
select store_id,product,
count(distinct customer_id)c 
from transaction_details
group by store_id,product)t
)m
where m.rn=1

在这里插入图片描述

- 8.5获取每个商店的员工与顾客比

select t1.store_id,
employee_number/c 
from ext_store_details t1
join(
select store_id,
count(distinct customer_id) c
from transaction_details
group by store_id
)t2
on t1.store_id=t2.store_id

在这里插入图片描述

- 8.6按年和月计算每家店的收入

select distinct *
from(
select store_id,
year(purchase_date) year,
sum(price) over(partition by year(purchase_date)) sum_year,
month(purchase_date) month,
sum(price) over(partition by year(purchase_date),month(purchase_date)) sum_month
from transaction_details
)t

在这里插入图片描述

- 8.7按店铺制作总收益饼图

select store_id,sum(price)
from transaction_details
group by store_id

在这里插入图片描述

- 8.8找出每个商店最繁忙的时间段

t1 as(
select *, if(instr(purchase_time,'PM')>0,
				if(cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12>=24,
					0,
					cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12),
				cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)) as timeTrans
from transaction_details), t2 as(
select t1.*,case when t1.timeTrans<=8 and t1.timeTrans>5 then 'early morning'
				 when t1.timeTrans<=11 and t1.timeTrans>8 then 'morning'
				 when t1.timeTrans<=13 and t1.timeTrans>11 then 'noon'
				 when t1.timeTrans<=18 and t1.timeTrans>13 then 'afternoon'
				 when t1.timeTrans<=22 and t1.timeTrans>18 then 'evening'
				 else 'night'
			end as timeSplit
from t1),
t3 as(
select t2.store_id,t2.timeSplit,count(1) c 
from t2 
group by t2.store_id,t2.timeSplit),
t4 as(
select t3.store_id,t3.timeSplit,row_number() over(partition by store_id order by t3.timeSplit desc)as win1
from t3 )
select t4.store_id,t4.timeSplit
from t4
where t4.win1=1

在这里插入图片描述

- 8.9找出每家店的忠实顾客
购买超过6次

select a.*
from(
select store_id,customer_id,count(1) c
from transaction_details
group by store_id,customer_id)a 
where a.c>6

8.10根据每位员工的最高收入找出明星商店
求总收入与雇员比值的最大值

with base as
(
select store_id,sum(price) s
from transaction_details
group by store_id
)
select base.store_id,
base.s/store.employee_number P
from base 
join ext_store_details store 
on base.store_id=store.store_id
order by P desc
limit 1

在这里插入图片描述

Review分析

  • 9.1在ext_store_review中找出存在冲突的交易映射关系
select transaction_id
from vw_store_details
group by transaction_id
having count(1)>1

在这里插入图片描述

- 9.2了解客户评价的覆盖率
求各个店共有多少顾客评价

with t as 
(
select 
r2.store_id,r2.transaction_id,r2.customer_id
from 
vw_store_review r1
join transaction_details r2
on r1.transaction_id=r2.transaction_id
) 
select
t.store_id,count(distinct customer_id)
from t
group by t.store_id

在这里插入图片描述

- 9.3根据评分了解客户的分布情况
求每家店每个评分有多少个客户给的

with t1 as
(
select store_id,review_score,customer_id
from vw_store_review v1 
join transaction_details tr 
on v1.transaction_id=tr.transaction_id 
) 
select store_id,review_score,count(distinct customer_id) num
from t1
group by store_id,review_score

在这里插入图片描述

- 9.4根据交易了解客户的分布情况
求每家店每个客户的订单数

select store_id,customer_id,
count(1)
from transaction_details
group by store_id,customer_id

在这里插入图片描述

- 9.5客户给出的最佳评价是否总是同一家门店
每位顾客对每家店的评分只取最大值,然后筛选每家店评分为5的数量,最大就是最优店

with 
t1 as(
select tr.store_id,tr.customer_id,max(vw.review_score) max_score
from vw_store_review vw 
join transaction_details tr 
on vw.transaction_id=tr.transaction_id
group by tr.store_id,tr.customer_id
),
t2 as(
select * from t1 where t1.max_score=5
)
select store_id,count(t2.max_score) num
from t2
group by store_id
order by num desc limit 1

在这里插入图片描述

  • 4
    点赞
  • 29
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值