案例分析之消费数据

掌握基于Hive 或 Spark SQL的数据分析

某零售企业根据最近1年门店收集的数据进行数据分析

    潜在客户画像

    用户消费统计

    门店的资源利用率

    消费的特征人群定位

    数据的可视化展现

客户细节参数:

语言数据不正确

交通参数:

根据transaction_id,该表具有重复行。

然而,这些都是有效的数据,所以我们需要修复它们

商店参数:

商店回顾参数:

该表有为null的无效分数

该表将transaction_id映射到错误的store_id

表之间的关系:

将数据传入HDFS中:

hdfs dfs -rm -r -f -skipTrash /tmp/shopping/

hdfs dfs -mkdir -p /tmp/shopping/data/customer

hdfs dfs -put customer_details.csv /tmp/shopping/data/customer/

hdfs dfs -mkdir -p /tmp/shopping/data/transaction

hdfs dfs -put transaction_details.csv /tmp/shopping/data/transaction/

hdfs dfs -mkdir -p /tmp/shopping/data/store

hdfs dfs -put store_details.csv /tmp/shopping/data/store/

hdfs dfs -mkdir -p /tmp/shopping/data/review

hdfs dfs -put store_review.csv /tmp/shopping/data/review/

hdfs dfs -ls -R /tmp/shopping/data/

在hive中创建database:

drop database if exists shopping cascade

create database if not exists shopping

创建数据表table

客户细节表:

create external table if not exists ext_customer_details (

customer_id string,

first_name string,

last_name string,

email string,

gender string,

address string,

country string,

language string,

job string,

credit_type string,

credit_no string

)

row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'

location '/tmp/shopping/data/customer'

tblproperties ("skip.header.line.count"="1")

交通表:

create external table if not exists ext_transaction_details (

transaction_id string,

customer_id string,

store_id string,

price decimal(8,2),

product string,

purchase_date string,

purchase_time string

)

row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'

location '/tmp/shopping/data/transaction' --this must tblproperties

tblproperties ("skip.header.line.count"="1")

商店表:

create external table if not exists ext_store_details (

store_id string,

store_name string,

employee_number int

)

row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'

location '/tmp/shopping/data/store' --this must tblproperties

tblproperties ("skip.header.line.count"="1")

商店回顾表:

create external table if not exists ext_store_review (

transaction_id string,

store_id string,

review_score int

)

row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'

location '/tmp/shopping/data/review' --this must tblproperties

tblproperties ("skip.header.line.count"="1")

数据清洗:

解决以下有问题的数据:

transaction_details中的重复数据,对重复数据生成新ID

过滤掉store_review中没有评分的数据

可以把清洗好的数据放到另一个表或者用View表示

找出个人信息确认:PII(personal information identification)或个人机密信息:PCI(personal confidential information)数据进行加密或hash

重新组织transaction数据按照日期YYYY-MM做分区

个人信息进行加密:

create view if not exists vw_customer_details as

select

customer_id,

first_name,

unbase64(last_name) as last_name,

unbase64(email) as email,

gender,

unbase64(address) as address,

country,

job,

credit_type,

unbase64(concat(unbase64(credit_no), 'seed')) as credit_no --better way than hash since it can be decoded base64(regexp_replace(base64(credit_no), unbase64('seed'), ''))

from ext_customer_details

ext_transaction_details中有重复的transaction_id:

先建立一个transaction_details表:

create table if not exists transaction_details (

transaction_id string,

customer_id string,

store_id string,

price decimal(8,2),

product string,

purchase_time string,

purchase_date date

)

partitioned by (purchase_month string)

 

select transaction_id, count(*) from ext_transaction_details group by transaction_id having count(*) > 1(标准的sql查重)

设置临时结果集base:查找出相同的transaction_id,相同的transaction_id会产生不同的行number

with base as (

select row_number() over(partition by transaction_id order by 1) as rn,*

from ext_transaction_details

)

select count(*) from base where rn > 1

交通细节表:生成临时子集base,使用transaction_id分组

with base as (

select

transaction_id,

customer_id,

store_id,

price,

product,

purchase_time,

purchase_date,

--对时间格式进行转换

from_unixtime(unix_timestamp(purchase_date ,'yyyy-MM-dd'), 'yyyy-MM') as purchase_month,

--以transaction_id为基础排列row-number

row_number() over(partition by transaction_id order by store_id) as rn

from ext_transaction_details

where customer_id<>'customer_id'

)

 

from base

insert overwrite table transaction_details partition(purchase_month)

select

if(rn = 1, transaction_id, concat(transaction_id, '_fix', rn)) as transaction_id,

--rn = 1的时候输出transaction_id,当rn >= 2的时候通过concat(transaction_id, '_fix', rn)连接字符串

customer_id,

store_id,

price,

product,

purchase_time,

purchase_date,

purchase_month

商户回顾表:创建视图,取出不为空的部分

--select count(*) from ext_store_review r join transaction_details t on r.transaction_id = t.transaction_id and r.store_id = t.store_id where review_score <> '' --153 row

--select count(*) from ext_store_review where review_score <> '' --937 rows

因为store_id有很多行是错误的,所以在这里我们将忽略store_id

create view if not exists vw_store_review as

select transaction_id, review_score

from ext_store_review

where review_score <> ''

分析数据:

以客户为中心:

1. 找出顾客最常用的信用卡:

select credit_type, count(distinct credit_no) as credit_cnt

from vw_customer_details group by country, credit_type order by credit_cnt desc

2. 找出客户数据中的前5个职位名称:

select job, count(*) as pn from vw_customer_details group by job order by pn desc limit 5

3. 对于美国女性来说,她们手中最受欢迎的卡片是什么

select credit_type, count(distinct credit_no) as ct

from vw_customer_details

where country = 'United States' and gender = 'Female'

group by credit_type order by ct desc

4. 按性别和国家计算客户统计

select count(*), country, gender

from vw_customer_details

group by country, gender

以交易为中心:

1. 计算每月总收入

select sum(price) as revenue_mom, purchase_month from transaction_details

group by purchase_month order by purchase_month

2. 按季度计算总收入

with base as (

select price,

concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter

from transaction_details

)

select

sum(price) as revenue_qoq, year_quarter --quarter does not support until hive 1.3

from base group by year_quarter

3. 按年计算总收入

select sum(price) as revenue_mom, substr(purchase_date, 1, 4) as year

from transaction_details

group by substr(purchase_date, 1, 4)

4. 按工作日星期计算总收入

select sum(price) as revenue_wow, date_format(purchase_date, 'u') as weekday

from transaction_details

group by  date_format(purchase_date, 'u') --1 is Mon and 7 is Sun

可以合并在一起:

with base as (

select price, date_format(purchase_date, 'u') as weekday, purchase_month,

concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter, substr(purchase_date, 1, 4) as year

from transaction_details

)

select sum(price) as total, weekday, purchase_month, year_quarter, year from base group by weekday, purchase_month, year_quarter, year

5. (需要清理数据)(按时间桶(上午、中午等)计算总收入/平均购买量)

--定义时间分桶

--early morning: (5:00, 8:00]

--morning: (8:00, 11:00]

--noon: (11:00, 13:00]

--afternoon: (13:00, 18:00]

--evening: (18:00, 22:00]

--night: (22:00, 5:00]

--因为它不是线性递增的,所以把它当作else

--我们也安排时间。第一个格式时间到19:23喜欢,然后比较,然后转换分钟到小时

with base as (

select price, purchase_time, if(purchase_time like '%M',

--12小时制转换成24小时制

from_unixtime(unix_timestamp(purchase_time,'hh:mm aa'),'HH:mm'), purchase_time) as time_format

from transaction_details   ),

timeformat as (

select

--整合时间变成13.5这种形式

purchase_time, price, (cast(split(time_format, ':')[0] as decimal(4,2)) + cast(split(time_format, ':')[1] as decimal(4,2))/60) as purchase_time_in_hrs

from base   ),

timebucket as (

select

price, purchase_time, purchase_time_in_hrs,

if(purchase_time_in_hrs > 5 and purchase_time_in_hrs <=8, 'early morning',

if(purchase_time_in_hrs > 8 and purchase_time_in_hrs <=11, 'morning',

if(purchase_time_in_hrs > 11 and purchase_time_in_hrs <=13, 'noon',

if(purchase_time_in_hrs > 13 and purchase_time_in_hrs <=18, 'afternoon',

--其他是night

if(purchase_time_in_hrs > 18 and purchase_time_in_hrs <=22, 'evening', 'night'))))) as time_bucket

from timeformat

)

--select * from timebucket 这样可以知道,时间格式太多了需要清洗

select time_bucket, avg(price) as avg_spend, sum(price)/1000 as revenue_k from timebucket group by time_bucket

-- divide 1k to see the chater more clear

6. 按工作日计算平均消费

select avg(price) as avg_price, date_format(purchase_date, 'u') as weekday

from transaction_details

where date_format(purchase_date, 'u') is not null

group by date_format(purchase_date, 'u')

7. 计算年、月、日交易总额

with base as (

select transaction_id, date_format(purchase_date, 'u') as weekday, purchase_month,

concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter, substr(purchase_date, 1, 4) as year

from transaction_details

where purchase_month is not null    )

select count(distinct transaction_id) as total, weekday, purchase_month, year_quarter, year

from base group by weekday, purchase_month, year_quarter, year order by year, purchase_month

8. 找出交易量最大的10个客户

with base as (

select customer_id, count(distinct transaction_id) as trans_cnt, sum(price) as spend_total

from transaction_details

where purchase_month is not null group by customer_id

),

cust_detail as (

select *, concat_ws(' ', first_name, last_name) as cust_name

--表连接因为信息不全

From base td join vw_customer_details cd on td.customer_id = cd.customer_id

)

select trans_cnt, cust_name as top10_trans_cust

from cust_detail order by trans_cnt desc limit 10

9. 找出消费最多的十大客户

with base as (

select

customer_id, count(distinct transaction_id) as trans_cnt, sum(price) as spend_total

from transaction_details

where purchase_month is not null group by customer_id

),

cust_detail as (

select *, concat_ws(' ', first_name, last_name) as cust_name

from base td join vw_customer_details cd on td.customer_id = cd.customer_id   )

select spend_total, cust_name as top10_trans_cust

from cust_detail order by spend_total desc limit 10

10. 谁拥有该期间的最小交易数量

with base as (

select customer_id, count(distinct transaction_id) as trans_cnt

from transaction_details

where purchase_month is not null

group by customer_id   )

select * from base order by trans_cnt limit 10

11. 按季度/年计算独立客户总数

with base as (

select transaction_id,

concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter, substr(purchase_date, 1, 4) as year

from transaction_details

where purchase_month is not null   )

select count(distinct transaction_id) as total, year_quarter, year

from base

group by year_quarter, year order by year_quarter

12. 计算整个活动中客户的平均最大值

with base as (

select customer_id, avg(price) as price_avg, max(price) as price_max

from transaction_details

where purchase_month is not null group by customer_id   )

select max(price_avg)

from base

13. 每个月谁花的钱最多

with base as (

select customer_id, purchase_month, sum(price) as price_sum, count(transaction_id) as trans_cnt

from transaction_details

where purchase_month is not null group by purchase_month, customer_id   ),

rank_sum as (

select

rank() over(partition by purchase_month order by price_sum desc) as rn_sum,

rank() over(partition by purchase_month order by trans_cnt desc) as rn_cnt,

purchase_month, price_sum, trans_cnt, customer_id

from base   )

select purchase_month, 'spend' as measure_name, price_sum as measure_value, customer_id

from rank_sum where rn_sum = 1

14. 每个月谁是最频繁的访客

with base as (

select customer_id, purchase_month, sum(price) as price_sum, count(transaction_id) as trans_cnt

from transaction_details

where purchase_month is not null group by purchase_month, customer_id   ),

rank_sum as (

select

rank() over(partition by purchase_month order by price_sum desc) as rn_sum,

rank() over(partition by purchase_month order by trans_cnt desc) as rn_cnt,

purchase_month, price_sum, trans_cnt, customer_id

from base   )

select purchase_month, 'visit' as measure_name, trans_cnt as measure_value, customer_id

from rank_sum where rn_cnt = 1 order by measure_name, purchase_month

15. 找出最受欢迎的5种产品的总价格

select product, sum(price) as price_sum

from transaction_details

where purchase_month is not null group by product

order by price_sum desc limit 5

16. 根据购买频率找出最受欢迎的5种产品

select product, count(transaction_id) as freq_buy

from transaction_details

where purchase_month is not null

group by product order by freq_buy desc limit 5

17. 根据客户数量找出最受欢迎的5种产品

select product, count(customer_id) as freq_cust

from transaction_details

where purchase_month is not null

group by product order by freq_cust desc limit 5

以商店为中心:

1. 找出最受欢迎的商店正在访问的独特的客户

select sd.store_name, count(distinct customer_id) as unique_visit

from transaction_details td join ext_store_details sd on td.store_id = sd.store_id

group by store_name order by unique_visit desc limit 5

2. 根据顾客购买的数量找出最受欢迎的商店

select sd.store_name, sum(td.price) as total_revnue

from transaction_details td join ext_store_details sd on td.store_id =sd.store_id

group by store_name

order by total_revnue desc limit 5

3. 根据顾客交易情况找出最受欢迎的商店

select sd.store_name, count(transaction_id) as unique_purchase

from transaction_details td join ext_store_details sd on td.store_id = sd.store_id

group by store_name order by unique_purchase desc limit 5

4. 通过唯一的客户id获得商店最受欢迎的产品

with base as (

select store_id, product, count(distinct customer_id) as freq_cust

from transaction_details

where purchase_month is not null group by store_id, product   ),

prod_rank as (

select store_id, product, freq_cust, rank()

over(partition by store_id order by freq_cust desc) as rn

from base   )

select store_name, product, freq_cust

from prod_rank td join ext_store_details sd on td.store_id = sd.store_id

where td.rn = 1

5. 获得每个商店的员工与客户访问率

with base as (

select store_id, count(distinct customer_id, purchase_date) as cust_visit

from transaction_details

where purchase_month is not null

group by store_id

)

Select store_name, cust_visit, employee_number,

round(cust_visit/employee_number,2) as cust_per_employee_within_period

from base td join ext_store_details sd on td.store_id = sd.store_id

6. 按年、月计算各门店收入

select store_name, purchase_month, sum(price) as revenue

from transaction_details td join ext_store_details sd on td.store_id = sd.store_id

where purchase_month is not null

group by store_name, purchase_month

7. 根据商店的总收入制作饼图

select store_name, sum(price) as revenue

from transaction_details td join ext_store_details sd on td.store_id = sd.store_id

where purchase_month is not null

group by store_name

8. 找出每个商店最繁忙的时间段

--define time bucket

--early morning: (5:00, 8:00]

--morning: (8:00, 11:00]

--noon: (11:00, 13:00]

--afternoon: (13:00, 18:00]

--evening: (18:00, 22:00]

--night: (22:00, 5:00] --make it as else, since it is not liner increasing

--We also format the time. 1st format time to 19:23 like, then compare, then convert minites to hours

with base as (

select transaction_id, purchase_time, if(purchase_time like '%M', from_unixtime(unix_timestamp(purchase_time,'hh:mm aa'),'HH:mm'), purchase_time) as time_format, store_id from transaction_details

where purchase_month is not null

),

timeformat as (

select purchase_time, transaction_id, (cast(split(time_format, ':')[0] as decimal(4,2)) +  cast(split(time_format, ':')[1] as decimal(4,2))/60) as purchase_time_in_hrs, store_id

from base

),

timebucket as (

select

transaction_id, purchase_time, purchase_time_in_hrs, store_id,

if(purchase_time_in_hrs > 5 and purchase_time_in_hrs <=8, 'early morning',

if(purchase_time_in_hrs > 8 and purchase_time_in_hrs <=11, 'morning',

if(purchase_time_in_hrs > 11 and purchase_time_in_hrs <=13, 'noon',

if(purchase_time_in_hrs > 13 and purchase_time_in_hrs <=18, 'afternoon',

if(purchase_time_in_hrs > 18 and purchase_time_in_hrs <=22, 'evening', 'night'))))) as time_bucket

from timeformat

)

--select * from timebucket 这样可以知道,时间格式太多了需要清洗

select sd.store_name, count(transaction_id) as tran_cnt, time_bucket

from timebucket td join ext_store_details sd on td.store_id = sd.store_id

group by sd.store_name, time_bucket order by sd.store_name, tran_cnt desc

9. 找出每家店的忠实顾客

--忠诚顾客是这段时间内在商店里消费最多的前5名顾客

with base as (

select store_name, customer_id, sum(td.price) as total_cust_purphase

from transaction_details td join ext_store_details sd on td.store_id = sd.store_id

where purchase_month is not null

group by store_name, customer_id

),

rk_cust as (

select store_name, customer_id, total_cust_purphase,

rank() over(partition by store_name order by total_cust_purphase desc) as rn

from base

)

select * from rk_cust where rn <= 5

10. 找出明星店的最大收入

with base as (

select store_id, sum(price) as revenue

from transaction_details

where purchase_month is not null

group by store_id

)

Select store_name, revenue, employee_number,

round(revenue/employee_number,2) as revenue_per_employee_within_period

from base td join ext_store_details sd on td.store_id = sd.store_id

以评估为中心:

1. 找出要在ext_store_review中检查映射的冲突事务

--冲突意味着事务表和检查表中的transaction_id、store_id对不同

--td.transaction_id, td.store_id, sd.store_id

select count(*) --725/8101

from transaction_details td join ext_store_review sd on td.transaction_id = sd.transaction_id

where purchase_month is not null and td.store_id != sd.store_id

2. 了解客户评审范围

Select count(td.transaction_id) as total_trans,

sum(if(sd.transaction_id is null, 1, 0)) as total_review_missed,

sum(if(sd.transaction_id is not null, 1, 0)) as total_review_exist

from transaction_details td left join ext_store_review sd on td.transaction_id = sd.transaction_id

where purchase_month is not null

3. 了解客户/交易评审分数的分布情况

Select review_score, count(distinct customer_id) as num_customer, count(*) as num_reviews

From transaction_details td join ext_store_review sd on td.transaction_id = sd.transaction_id

where purchase_month is not null and review_score <> ''

group by review_score

4. 顾客总是光顾同一家商店吗

select -- not always

count(*) as visit_cnt,

customer_id,

td.store_id

from transaction_details td join ext_store_review sd on td.transaction_id = sd.transaction_id

where

purchase_month is not null and review_score = '5'

group by customer_id, td.store_id order by visit_cnt desc

  • 3
    点赞
  • 6
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值