Hive/Spark Exercise: E-commerce Consumer Data Analysis

Background

  1. A hands-on exercise with Hive or Spark SQL: what can we actually do with Hive or Spark?
  2. A retail company wants to analyze and present the data its stores collected over the past year.

Data Overview and Preparation

  1. Environment setup

    1. Everything is written in Zeppelin with the Spark SQL interpreter, because Hive is not available on my Zeppelin instance. (screenshot)

    2. Verify that the interpreter works; as the screenshot shows, it is fine. A quick smoke test is sketched below. (screenshot)
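
      A minimal smoke test for the Spark SQL interpreter (any trivial statement will do; %spark2.sql is the interpreter prefix used throughout this post):

      %spark2.sql
      -- list the visible databases; if this returns, the interpreter is working
      show databases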

  2. The data files

    1. Customer table (the data may contain flaws)

      field          details
      customer_id    Int, 1-500
      first_name     string
      last_name      string
      email          string, e.g. willddy@mail.com
      gender         string, Male or Female
      address        string
      country        string
      language       string
      job            string, job title/position
      credit_type    string, credit card type, e.g. visa
      credit_no      string, credit card number
    2. Transaction table (some rows are duplicated but cannot simply be deleted, so the data needs repair)

      field           details
      transaction_id  Int, 1-1000
      customer_id     Int, 1-500
      store_id        Int, 1-5
      price           decimal, e.g. 5.08
      product         string, the item purchased
      date            string, purchase date
      time            string, purchase time
    3. Store table

      field            details
      store_id         Int, 1-5
      store_name       string
      employee_number  Int, number of employees in the store
    4. Review table (some values are null, and some store mappings are wrong)

      field           details
      transaction_id  Int, 1-8000
      store_id        Int, 1-5
      review_score    Int, 1-5
  3. Data preparation

    1. Download the data (I put it in my cloud drive) and check it

      1. Link: https://pan.baidu.com/s/11fa1wf0HC_I5DEW5WBor3A
        Extraction code: c2jy
      2. I checked it over and it looks fine. (screenshot)
    2. Upload the data to HDFS. If you upload from within Zeppelin you can just put it there; if you upload from a terminal, remember to switch to the zeppelin admin user first.

      1. (screenshot)

Table Creation and Data Validation

Create external tables over the data:

  1. %spark2.sql
    create external table if not exists ext_customer_details (
    customer_id string, 
    first_name string,
    last_name string,
    email string,
    gender string,
    address string,
    country string,
    language string,
    job string,
    credit_type string,
    credit_no string
    )
    row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    location '/tmp/shopping/data/customer' 
    tblproperties ("skip.header.line.count"="1")
    
    %spark2.sql
    create external table if not exists ext_transaction_details(
    transaction_id string,
    customer_id string,
    store_id string,
    price decimal(8,2),
    product string,
    purchase_date string,
    purchase_time string
    )
    row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    location '/tmp/shopping/data/transaction' 
    tblproperties ("skip.header.line.count"="1")
    
    %spark2.sql
    create external table if not exists ext_store_details(
    store_id string,
    store_name string,
    employee_number int
    )
    row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    location '/tmp/shopping/data/store' -- note: tblproperties must follow the location clause
    tblproperties ("skip.header.line.count"="1")
    
    %spark2.sql
    create external table if not exists ext_store_review(
    transaction_id string,
    store_id string,
    review_score int
    )
    row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    location '/tmp/shopping/data/review' -- note: tblproperties must follow the location clause
    tblproperties ("skip.header.line.count"="1")
    

    All four tables are populated; the row-count check sketched below confirms it. (screenshot)
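
    A quick way to confirm the loads, assuming the four external table names above (runs as a single Zeppelin paragraph):

    %spark2.sql
    -- row counts per external table; the literal labels are just for readability
    select 'customer' as tbl, count(*) as cnt from ext_customer_details
    union all
    select 'transaction', count(*) from ext_transaction_details
    union all
    select 'store', count(*) from ext_store_details
    union all
    select 'review', count(*) from ext_store_review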

Data Cleaning

Requirements:

  1. Handle duplicate rows in transaction_details by generating new IDs for the duplicates
  2. Filter out the rows in store_review that have no review score
  3. The cleaned data can be written to a separate table or exposed as a view
  4. Identify PII (personally identifiable information) or PCI (payment card) data and encrypt or hash it
  5. Reorganize the transaction data into partitions by purchase date in YYYY-MM format (a sketch covering requirements 1, 2 and 5 follows this list)
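
The author's actual repair code only appears in the screenshots in the Process section below. A minimal sketch of requirements 1, 2 and 5, assuming a shopping database, the ext_ tables created above, and illustrative names such as the '_copy_' id suffix (this is not the author's exact code):

  %spark2.sql
  -- assumes: create database if not exists shopping
  -- target table for the cleaned transactions, partitioned by YYYY-MM (requirement 5)
  create table if not exists shopping.transaction_details (
  transaction_id string,
  customer_id string,
  store_id string,
  price decimal(8,2),
  product string,
  purchase_date string,
  purchase_time string
  )
  partitioned by (purchase_month string)

  %spark2.sql
  -- requirement 1: rows sharing a transaction_id keep the first copy's id,
  -- while later copies get a new derived id instead of being deleted
  -- (dynamic partitioning may require: set hive.exec.dynamic.partition.mode=nonstrict)
  insert overwrite table shopping.transaction_details partition(purchase_month)
  select
  if(rn = 1, transaction_id, concat(transaction_id, '_copy_', cast(rn as string))) as transaction_id,
  customer_id, store_id, price, product, purchase_date, purchase_time,
  date_format(purchase_date, 'yyyy-MM') as purchase_month
  from (
  select *, row_number() over(partition by transaction_id order by purchase_time) as rn
  from ext_transaction_details
  ) t

  %spark2.sql
  -- requirement 2: keep only reviews that actually carry a score
  -- (the string comparison mirrors the review_score <> '' checks used in the queries below)
  create view if not exists vw_store_review as
  select transaction_id, store_id, review_score
  from ext_store_review
  where review_score is not null and trim(review_score) <> ''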

Process

  1. Encrypt some of the name and card columns; the effect is shown in the screenshot, and a hedged sketch of this kind of masking follows this list. (screenshot)

  2. Some data repair is needed, for example for rows like these

    1. Almost done (screenshot)

    2. Test the result: looks good (screenshot)

    3. Clean up the other table (screenshot)
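
The masking itself is only shown in the screenshots above. A minimal sketch of requirement 4, assuming the ext_customer_details table created earlier; which columns to hash, and the choice of sha2, are assumptions rather than the author's exact choices:

  %spark2.sql
  -- expose customers through a view that hashes the sensitive (PII/PCI) columns
  create view if not exists vw_customer_details as
  select
  customer_id,
  first_name,
  sha2(last_name, 256) as last_name,   -- assumption: the post only says "some names" are encrypted
  email,
  gender,
  address,
  country,
  language,
  job,
  credit_type,
  sha2(credit_no, 256) as credit_no    -- never keep clear-text card numbers
  from ext_customer_details

Because the later analysis queries read from vw_customer_details, masking at the view level leaves them unchanged.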

Data Analysis

  1. Customer-oriented analysis

    1. Find the most frequently used credit card types

      select credit_type, count(distinct credit_no) as credit_cnt from vw_customer_details group by country, credit_type order by credit_cnt desc
      

    (result screenshot)

    2. Find the top 5 job titles in the customer data

      select job, count(*) as pn from vw_customer_details group by job order by pn desc limit 5
      

(result screenshot)

    3. Which credit card type is most popular among female customers in the United States?

      select credit_type, count(*) as ct from vw_customer_details where country = 'United States' and gender = 'Female' group by credit_type order by ct desc
    
    4. Customer counts by country and gender

      select count(*), country, gender from vw_customer_details group by country, gender
    
  2. Transaction-oriented analysis

    1. Compute total revenue per month, per quarter, per year, and per day of week (effectively four sub-questions)

      select sum(price) as revenue_mom, purchase_month from transaction_details group by purchase_month order by purchase_month
      --
      with base as (
      select price, 
      concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter
      from 
      shopping.transaction_details
      )
      select 
      sum(price) as revenue_qoq, -- the quarter() function is not available before Hive 1.3
      year_quarter
      from base group by year_quarter
      --
      select sum(price) as revenue_yoy, substr(purchase_date, 1, 4) as year from transaction_details group by substr(purchase_date, 1, 4)
      --
      select sum(price) as revenue_wow, date_format(purchase_date, 'u') as weekday from transaction_details group by  date_format(purchase_date, 'u') --1 is Mon and 7 is Sun
      

      (result screenshot)

    2. Compute revenue and average spend per time-of-day bucket

      with base as (
      select price, purchase_time, if(purchase_time like '%M', from_unixtime(unix_timestamp(purchase_time,'hh:mm aa'),'HH:mm'), purchase_time) as time_format from transaction_details
      ),
      timeformat as (
      select
      purchase_time, price, (cast(split(time_format, ':')[0] as decimal(4,2)) +  cast(split(time_format, ':')[1] as decimal(4,2))/60) as purchase_time_in_hrs
      from base
      ),
      timebucket as (
      select
      price, purchase_time, purchase_time_in_hrs,
      if(purchase_time_in_hrs > 5 and purchase_time_in_hrs <=8, 'early morning',
      if(purchase_time_in_hrs > 8 and purchase_time_in_hrs <=11, 'morning',
      if(purchase_time_in_hrs > 11 and purchase_time_in_hrs <=13, 'noon',
      if(purchase_time_in_hrs > 13 and purchase_time_in_hrs <=18, 'afternoon',
      if(purchase_time_in_hrs > 18 and purchase_time_in_hrs <=22, 'evening', 'night'))))) as time_bucket
      from timeformat
      )
      --select * from timebucket --this is to verify
      select time_bucket, avg(price) as avg_spend, sum(price)/1000 as revenue_k from timebucket group by time_bucket
      
    3. Compute average spend per day of week

      select 
      avg(price) as avg_price, 
      date_format(purchase_date, 'u') as weekday
      from transaction_details
      where
      date_format(purchase_date, 'u') is not null
      group by date_format(purchase_date, 'u')
      
    4. Count transactions per year, quarter, month, and day of week

      with base as (
      select 
      transaction_id, date_format(purchase_date, 'u') as weekday, purchase_month,
      concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter, substr(purchase_date, 1, 4) as year
      from transaction_details
      where purchase_month is not null
      )
      select count(distinct transaction_id) as total, weekday, purchase_month, year_quarter, year 
      from base group by weekday, purchase_month, year_quarter, year order by year, purchase_month
      
    5. Find the 10 customers with the most transactions

      with base as (
      select 
      customer_id,
      count(distinct transaction_id) as trans_cnt,
      sum(price) as spend_total
      from transaction_details
      where purchase_month is not null
      group by customer_id
      ),
      cust_detail as (
      select
      *,
      concat_ws(' ', first_name, last_name) as cust_name
      from
      base td join vw_customer_details cd on td.customer_id = cd.customer_id
      )
      select 
      trans_cnt,
      cust_name as top10_trans_cust
      from cust_detail order by trans_cnt desc limit 10
      
    6. Find the 10 customers with the highest total spend

      with base as (
      select 
      customer_id,
      count(distinct transaction_id) as trans_cnt,
      sum(price) as spend_total
      from transaction_details
      where purchase_month is not null
      group by customer_id
      ),
      cust_detail as (
      select
      *,
      concat_ws(' ', first_name, last_name) as cust_name
      from
      base td join vw_customer_details cd on td.customer_id = cd.customer_id
      )
      select 
      spend_total,
      cust_name as top10_trans_cust
      from cust_detail order by spend_total desc limit 10
      
    7. Who had the fewest transactions during the period?

      with base as (
      select 
      customer_id,
      count(distinct transaction_id) as trans_cnt
      from
      transaction_details
      where purchase_month is not null
      group by customer_id
      )
      select
      *
      from
      base order by trans_cnt limit 10
      
    8. Count the number of unique customers per quarter and per week

      with base as (
      select 
      customer_id,
      concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter, substr(purchase_date, 1, 4) as year
      from transaction_details
      where purchase_month is not null
      )
      select 
      count(distinct customer_id) as unique_cust, year_quarter, year 
      from base 
      group by year_quarter, year 
      order by year_quarter
      
    9. Find the highest average spend among all active customers

      with base as (
      select 
      customer_id,
      avg(price) as price_avg,
      max(price) as price_max
      from
      transaction_details
      where 
      purchase_month is not null
      group by customer_id
      )
      select 
      max(price_avg) 
      from base
      
    10. Which customer spent the most (and which visited the most) in each month?

      with base as (
      select customer_id, purchase_month, sum(price) as price_sum, count(transaction_id) as trans_cnt
      from transaction_details
      where purchase_month is not null group by purchase_month, customer_id
      ),
      rank_sum as (
      select 
      rank() over(partition by purchase_month order by price_sum desc) as rn_sum,
      rank() over(partition by purchase_month order by trans_cnt desc) as rn_cnt,
      purchase_month,
      price_sum,
      trans_cnt,
      customer_id
      from base
      )
      select 
      purchase_month,
      'spend' as measure_name,
      price_sum as measure_value,
      customer_id
      from rank_sum where rn_sum = 1
      union all
      select
      purchase_month,
      'visit' as measure_name,
      trans_cnt as measure_value,
      customer_id
      from rank_sum where rn_cnt = 1 
      order by measure_name, purchase_month
      
    11. Find the 5 most popular products by total revenue, by purchase frequency, and by number of customers

      select 
      product,
      sum(price) as price_sum
      from
      transaction_details
      where 
      purchase_month is not null
      group by product
      order by price_sum desc limit 5
      --
      select 
      product,
      count(transaction_id) as freq_buy
      from
      transaction_details
      where 
      purchase_month is not null
      group by product
      order by freq_buy desc limit 5
      --
      select 
      product,
      count(customer_id) as freq_cust
      from
      transaction_details
      where 
      purchase_month is not null
      group by product
      order by freq_cust desc limit 5
      
    12. Verify the details behind one of the top-5 products

      select * from transaction_details where product in ('Goat - Whole Cut')
      
  3. Store-oriented analysis

    1. Find the most popular stores by distinct visiting customers, by total revenue, and by number of transactions (three sub-questions)

      select
      sd.store_name,
      count(distinct customer_id) as unique_visit 
      from
      transaction_details td join ext_store_details sd on td.store_id = sd.store_id
      group by store_name order by unique_visit desc
      limit 5
      --
      select
      sd.store_name,
      sum(td.price) as total_revenue 
      from
      transaction_details td join ext_store_details sd on td.store_id = sd.store_id
      group by store_name 
      order by total_revenue desc
      limit 5
      --
      select
      sd.store_name,
      count(transaction_id) as unique_purchase
      from
      transaction_details td join ext_store_details sd on td.store_id = sd.store_id
      group by store_name order by unique_purchase desc
      limit 5
      
    2. Find each store's most popular product, by number of distinct customers

      with base as (select 
      store_id,
      product,
      count(distinct customer_id) as freq_cust
      from
      transaction_details
      where 
      purchase_month is not null
      group by store_id, product
      ),
      prod_rank as (
      select
      store_id,
      product,
      freq_cust,
      rank() over(partition by store_id order by freq_cust desc) as rn
      from base
      )
      select store_name, product, freq_cust
      from prod_rank td join ext_store_details sd on td.store_id = sd.store_id
      where td.rn = 1
      
    3. Compute each store's employee-to-customer-visit ratio (computed below as customer visits per employee)

      with base as (
      select 
      store_id,
      count(distinct customer_id, purchase_date) as cust_visit
      from
      transaction_details
      where 
      purchase_month is not null
      group by store_id
      )
      select
      store_name,
      cust_visit,
      employee_number,
      round(cust_visit/employee_number,2) as cust_per_employee_within_period
      from base td join ext_store_details sd on td.store_id = sd.store_id
      
    4. Compute each store's revenue per month and per year

      select 
      store_name,
      purchase_month,
      sum(price) as revenue
      from
      transaction_details td join ext_store_details sd on td.store_id = sd.store_id
      where 
      purchase_month is not null
      group by store_name, purchase_month
      
    5. Build a pie chart of each store's total revenue

      select 
      store_name,
      sum(price) as revenue
      from
      transaction_details td join ext_store_details sd on td.store_id = sd.store_id
      where 
      purchase_month is not null
      group by store_name
      
    6. Find each store's busiest time of day

      with base as (
      select transaction_id, purchase_time, if(purchase_time like '%M', from_unixtime(unix_timestamp(purchase_time,'hh:mm aa'),'HH:mm'), purchase_time) as time_format, store_id from transaction_details
      where 
      purchase_month is not null
      ),
      timeformat as (
      select
      purchase_time, transaction_id, (cast(split(time_format, ':')[0] as decimal(4,2)) +  cast(split(time_format, ':')[1] as decimal(4,2))/60) as purchase_time_in_hrs, store_id
      from base
      ),
      timebucket as (
      select
      transaction_id, purchase_time, purchase_time_in_hrs, store_id,
      if(purchase_time_in_hrs > 5 and purchase_time_in_hrs <=8, 'early morning',
      if(purchase_time_in_hrs > 8 and purchase_time_in_hrs <=11, 'morning',
      if(purchase_time_in_hrs > 11 and purchase_time_in_hrs <=13, 'noon',
      if(purchase_time_in_hrs > 13 and purchase_time_in_hrs <=18, 'afternoon',
      if(purchase_time_in_hrs > 18 and purchase_time_in_hrs <=22, 'evening', 'night'))))) as time_bucket
      from timeformat
      )
      --select * from timebucket --this is to verify
      select sd.store_name, count(transaction_id) as tran_cnt, time_bucket 
      from timebucket td join ext_store_details sd on td.store_id = sd.store_id
      group by sd.store_name, time_bucket order by sd.store_name, tran_cnt desc
      -- from the result we could further split the afternoon and morning buckets into finer detail
      
    7. Find each store's most loyal customers

      with base as (
      select
      store_name,
      customer_id,
      sum(td.price) as total_cust_purphase
      from
      transaction_details td join ext_store_details sd on td.store_id = sd.store_id
      where 
      purchase_month is not null
      group by store_name, customer_id 
      ),
      rk_cust as (
      select 
      store_name,
      customer_id,
      total_cust_purphase,
      rank() over(partition by store_name order by total_cust_purphase desc) as rn
      from base
      )
      select * from rk_cust where rn <= 5
      
    8. Compute revenue per employee for each store to identify the star stores

      with base as (
      select 
      store_id,
      sum(price) as revenue
      from
      transaction_details
      where 
      purchase_month is not null
      group by store_id
      )
      select
      store_name,
      revenue,
      employee_number,
      round(revenue/employee_number,2) as revenue_per_employee_within_period
      from base td
      join ext_store_details sd
      on td.store_id = sd.store_id
      
  4. Review-oriented analysis

    1. Find the conflicting transactions in ext_store_review

      -- a conflicting transaction is one whose (transaction_id, store_id) pair differs between the transaction table and the review table
      select
      count(*) --725/8101
      from
      transaction_details td join ext_store_review sd on td.transaction_id = sd.transaction_id
      where 
      purchase_month is not null and td.store_id != sd.store_id
      
    2. Find the average number of reviews per customer (how many transactions have a review and how many are missing one)

      select
      count(td.transaction_id) as total_trans,
      sum(if(sd.transaction_id is null, 1, 0)) as total_review_missed,
      sum(if(sd.transaction_id is not null, 1, 0)) as total_review_exist
      from
      transaction_details td left join ext_store_review sd on td.transaction_id = sd.transaction_id
      where 
      purchase_month is not null
      
    3. Find how customers and transactions are distributed across review scores

      select
      review_score,
      count(distinct customer_id) as num_customer,
      count(*) as num_reviews
      from
      transaction_details td join ext_store_review sd on td.transaction_id = sd.transaction_id
      where 
      purchase_month is not null and review_score <> ''
      group by review_score
      
    4. Do customers always leave good reviews at the store they keep going back to?

      select -- not always
      count(*) as visit_cnt,
      customer_id,
      td.store_id
      from
      transaction_details td join ext_store_review sd on td.transaction_id = sd.transaction_id
      where 
      purchase_month is not null and review_score = '5'
      group by customer_id, td.store_id order by visit_cnt desc
      

Summary

  1. Recognize the impact of data quality problems
  2. Determine the impact of the activities being analyzed
  3. Practice SQL