Hive/Spark Exercise: E-commerce Consumer Data Analysis

Background

  1. A hands-on exercise with Hive or Spark SQL: what can we actually do with Hive or Spark?
  2. A retail company wants to analyze and present the data its stores collected over the past year.

Data Overview and Preparation

  1. Environment setup

    1. Everything is written in Zeppelin with the Spark SQL interpreter, because Hive is not available on my Zeppelin instance. (screenshot)

    2. Verify that the interpreter works; as the screenshot shows, it is fine. A quick smoke test is sketched below. (screenshot)
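
      A minimal smoke test for the Spark SQL interpreter (any trivial statement will do; %spark2.sql is the interpreter prefix used throughout this post):

      %spark2.sql
      -- list the visible databases; if this returns, the interpreter is working
      show databases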

  2. The data files

    1. Customer table (the data may contain flaws)

      field          details
      customer_id    Int, 1-500
      first_name     string
      last_name      string
      email          string, e.g. willddy@mail.com
      gender         string, Male or Female
      address        string
      country        string
      language       string
      job            string, job title/position
      credit_type    string, credit card type, e.g. visa
      credit_no      string, credit card number
    2. Transaction table (some rows are duplicated but cannot simply be deleted, so the data needs repair)

      field           details
      transaction_id  Int, 1-1000
      customer_id     Int, 1-500
      store_id        Int, 1-5
      price           decimal, e.g. 5.08
      product         string, the item purchased
      date            string, purchase date
      time            string, purchase time
    3. Store table

      field            details
      store_id         Int, 1-5
      store_name       string
      employee_number  Int, number of employees in the store
    4. Review table (some values are null, and some store mappings are wrong)

      field           details
      transaction_id  Int, 1-8000
      store_id        Int, 1-5
      review_score    Int, 1-5
  3. Data preparation

    1. Download the data (I put it in my cloud drive) and check it

      1. Link: https://pan.baidu.com/s/11fa1wf0HC_I5DEW5WBor3A
        Extraction code: c2jy
      2. I checked it over and it looks fine. (screenshot)
    2. Upload the data to HDFS. If you upload from within Zeppelin you can just put it there; if you upload from a terminal, remember to switch to the zeppelin admin user first.

      1. (screenshot)

Table Creation and Data Validation

Create external tables over the data:

  1. %spark2.sql
    create external table if not exists ext_customer_details (
    customer_id string, 
    first_name string,
    last_name string,
    email string,
    gender string,
    address string,
    country string,
    language string,
    job string,
    credit_type string,
    credit_no string
    )
    row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    location '/tmp/shopping/data/customer' 
    tblproperties ("skip.header.line.count"="1")
    
    %spark2.sql
    create external table if not exists ext_transaction_details(
    transaction_id string,
    customer_id string,
    store_id string,
    price decimal(8,2),
    product string,
    purchase_date string,
    purchase_time string
    )
    row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    location '/tmp/shopping/data/transaction' 
    tblproperties ("skip.header.line.count"="1")
    
    %spark2.sql
    create external table if not exists ext_store_details(
    store_id string,
    store_name string,
    employee_number int
    )
    row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    location '/tmp/shopping/data/store' -- note: tblproperties must follow the location clause
    tblproperties ("skip.header.line.count"="1")
    
    %spark2.sql
    create external table if not exists ext_store_review(
    transaction_id string,
    store_id string,
    review_score int
    )
    row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    location '/tmp/shopping/data/review' -- note: tblproperties must follow the location clause
    tblproperties ("skip.header.line.count"="1")
    

    All four tables are populated; the row-count check sketched below confirms it. (screenshot)
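
    A quick way to confirm the loads, assuming the four external table names above (runs as a single Zeppelin paragraph):

    %spark2.sql
    -- row counts per external table; the literal labels are just for readability
    select 'customer' as tbl, count(*) as cnt from ext_customer_details
    union all
    select 'transaction', count(*) from ext_transaction_details
    union all
    select 'store', count(*) from ext_store_details
    union all
    select 'review', count(*) from ext_store_review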

Data Cleaning

Requirements:

  1. Handle duplicate rows in transaction_details by generating new IDs for the duplicates
  2. Filter out the rows in store_review that have no review score
  3. The cleaned data can be written to a separate table or exposed as a view
  4. Identify PII (personally identifiable information) or PCI (payment card) data and encrypt or hash it
  5. Reorganize the transaction data into partitions by purchase date in YYYY-MM format (a sketch covering requirements 1, 2 and 5 follows this list)
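
The author's actual repair code only appears in the screenshots in the Process section below. A minimal sketch of requirements 1, 2 and 5, assuming a shopping database, the ext_ tables created above, and illustrative names such as the '_copy_' id suffix (this is not the author's exact code):

  %spark2.sql
  -- assumes: create database if not exists shopping
  -- target table for the cleaned transactions, partitioned by YYYY-MM (requirement 5)
  create table if not exists shopping.transaction_details (
  transaction_id string,
  customer_id string,
  store_id string,
  price decimal(8,2),
  product string,
  purchase_date string,
  purchase_time string
  )
  partitioned by (purchase_month string)

  %spark2.sql
  -- requirement 1: rows sharing a transaction_id keep the first copy's id,
  -- while later copies get a new derived id instead of being deleted
  -- (dynamic partitioning may require: set hive.exec.dynamic.partition.mode=nonstrict)
  insert overwrite table shopping.transaction_details partition(purchase_month)
  select
  if(rn = 1, transaction_id, concat(transaction_id, '_copy_', cast(rn as string))) as transaction_id,
  customer_id, store_id, price, product, purchase_date, purchase_time,
  date_format(purchase_date, 'yyyy-MM') as purchase_month
  from (
  select *, row_number() over(partition by transaction_id order by purchase_time) as rn
  from ext_transaction_details
  ) t

  %spark2.sql
  -- requirement 2: keep only reviews that actually carry a score
  -- (the string comparison mirrors the review_score <> '' checks used in the queries below)
  create view if not exists vw_store_review as
  select transaction_id, store_id, review_score
  from ext_store_review
  where review_score is not null and trim(review_score) <> ''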

Process

  1. Encrypt some of the name and card columns; the effect is shown in the screenshot, and a hedged sketch of this kind of masking follows this list. (screenshot)

  2. Some data repair is needed, for example for rows like these

    1. Almost done (screenshot)

    2. Test the result: looks good (screenshot)

    3. Clean up the other table (screenshot)
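
The masking itself is only shown in the screenshots above. A minimal sketch of requirement 4, assuming the ext_customer_details table created earlier; which columns to hash, and the choice of sha2, are assumptions rather than the author's exact choices:

  %spark2.sql
  -- expose customers through a view that hashes the sensitive (PII/PCI) columns
  create view if not exists vw_customer_details as
  select
  customer_id,
  first_name,
  sha2(last_name, 256) as last_name,   -- assumption: the post only says "some names" are encrypted
  email,
  gender,
  address,
  country,
  language,
  job,
  credit_type,
  sha2(credit_no, 256) as credit_no    -- never keep clear-text card numbers
  from ext_customer_details

Because the later analysis queries read from vw_customer_details, masking at the view level leaves them unchanged.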

Data Analysis

  1. Customer-oriented analysis

    1. Find the most frequently used credit card types

      select credit_type, count(distinct credit_no) as credit_cnt from vw_customer_details group by country, credit_type order by credit_cnt desc
      

    (result screenshot)

    2. Find the top 5 job titles in the customer data

      select job, count(*) as pn from vw_customer_details group by job order by pn desc limit 5
      

(result screenshot)

    3. Which credit card type is most popular among female customers in the United States?

      select credit_type, count(*) as ct from vw_customer_details where country = 'United States' and gender = 'Female' group by credit_type order by ct desc
    
    4. Customer counts by country and gender

      select count(*), country, gender from vw_customer_details group by country, gender
    
  2. Transaction-oriented analysis

    1. Compute total revenue per month, per quarter, per year, and per day of week (effectively four sub-questions)

      select sum(price) as revenue_mom, purchase_month from transaction_details group by purchase_month order by purchase_month
      --
      with base as (
      select price, 
      concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter
      from 
      shopping.transaction_details
      )
      select 
      sum(price) as revenue_qoq, -- the quarter() function is not available before Hive 1.3
      year_quarter
      from base group by year_quarter
      --
      select sum(price) as revenue_yoy, substr(purchase_date, 1, 4) as year from transaction_details group by substr(purchase_date, 1, 4)
      --
      select sum(price) as revenue_wow, date_format(purchase_date, 'u') as weekday from transaction_details group by  date_format(purchase_date, 'u') --1 is Mon and 7 is Sun
      

      (result screenshot)

    2. Compute revenue and average spend per time-of-day bucket

      with base as (
      select price, purchase_time, if(purchase_time like '%M', from_unixtime(unix_timestamp(purchase_time,'hh:mm aa'),'HH:mm'), purchase_time) as time_format from transaction_details
      ),
      timeformat as (
      select
      purchase_time, price, (cast(split(time_format, ':')[0] as decimal(4,2)) +  cast(split(time_format, ':')[1] as decimal(4,2))/60) as purchase_time_in_hrs
      from base
      ),
      timebucket as (
      select
      price, purchase_time, purchase_time_in_hrs,
      if(purchase_time_in_hrs > 5 and purchase_time_in_hrs <=8, 'early morning',
      if(purchase_time_in_hrs > 8 and purchase_time_in_hrs <=11, 'morning',
      if(purchase_time_in_hrs > 11 and purchase_time_in_hrs <=13, 'noon',
      if(purchase_time_in_hrs > 13 and purchase_time_in_hrs <=18, 'afternoon',
      if(purchase_time_in_hrs > 18 and purchase_time_in_hrs <=22, 'evening', 'night'))))) as time_bucket
      from timeformat
      )
      --select * from timebucket --this is to verify
      select time_bucket, avg(price) as avg_spend, sum(price)/1000 as revenue_k from timebucket group by time_bucket
      
    3. Compute average spend per day of week

      select 
      avg(price) as avg_price, 
      date_format(purchase_date, 'u') as weekday
      from transaction_details
      where
      date_format(purchase_date, 'u') is not null
      group by date_format(purchase_date, 'u')
      
    4. Count transactions per year, quarter, month, and day of week

      with base as (
      select 
      transaction_id, date_format(purchase_date, 'u') as weekday, purchase_month,
      concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter, substr(purchase_date, 1, 4) as year
      from transaction_details
      where purchase_month is not null
      )
      select count(distinct transaction_id) as total, weekday, purchase_month, year_quarter, year 
      from base group by weekday, purchase_month, year_quarter, year order by year, purchase_month
      
    5. Find the 10 customers with the most transactions

      with base as (
      select 
      customer_id,
      count(distinct transaction_id) as trans_cnt,
      sum(price) as spend_total
      from transaction_details
      where purchase_month is not null
      group by customer_id
      ),
      cust_detail as (
      select
      *,
      concat_ws(' ', first_name, last_name) as cust_name
      from
      base td join vw_customer_details cd on td.customer_id = cd.customer_id
      )
      select 
      trans_cnt,
      cust_name as top10_trans_cust
      from cust_detail order by trans_cnt desc limit 10
      
    6. Find the 10 customers with the highest total spend

      with base as (
      select 
      customer_id,
      count(distinct transaction_id) as trans_cnt,
      sum(price) as spend_total
      from transaction_details
      where purchase_month is not null
      group by customer_id
      ),
      cust_detail as (
      select
      *,
      concat_ws(' ', first_name, last_name) as cust_name
      from
      base td join vw_customer_details cd on td.customer_id = cd.customer_id
      )
      select 
      spend_total,
      cust_name as top10_trans_cust
      from cust_detail order by spend_total desc limit 10
      
    7. Who had the fewest transactions during the period?

      with base as (
      select 
      customer_id,
      count(distinct transaction_id) as trans_cnt
      from
      transaction_details
      where purchase_month is not null
      group by customer_id
      )
      select
      *
      from
      base order by trans_cnt limit 10
      
    8. Count the number of unique customers per quarter and per week

      with base as (
      select 
      customer_id,
      concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter, substr(purchase_date, 1, 4) as year
      from transaction_details
      where purchase_month is not null
      )
      select 
      count(distinct customer_id) as unique_cust, year_quarter, year 
      from base 
      group by year_quarter, year 
      order by year_quarter
      
    9. Find the highest average spend among all active customers

      with base as (
      select 
      customer_id,
      avg(price) as price_avg,
      max(price) as price_max
      from
      transaction_details
      where 
      purchase_month is not null
      group by customer_id
      )
      select 
      max(price_avg) 
      from base
      
    10. Which customer spent the most (and which visited the most) in each month?

      with base as (
      select customer_id, purchase_month, sum(price) as price_sum, count(transaction_id) as trans_cnt
      from transaction_details
      where purchase_month is not null group by purchase_month, customer_id
      ),
      rank_sum as (
      select 
      rank() over(partition by purchase_month order by price_sum desc) as rn_sum,
      rank() over(partition by purchase_month order by trans_cnt desc) as rn_cnt,
      purchase_month,
      price_sum,
      trans_cnt,
      customer_id
      from base
      )
      select 
      purchase_month,
      'spend' as measure_name,
      price_sum as measure_value,
      customer_id
      from rank_sum where rn_sum = 1
      union all
      select
      purchase_month,
      'visit' as measure_name,
      trans_cnt as measure_value,
      customer_id
      from rank_sum where rn_cnt = 1 
      order by measure_name, purchase_month
      
    11. Find the 5 most popular products by total revenue, by purchase frequency, and by number of customers

      select 
      product,
      sum(price) as price_sum
      from
      transaction_details
      where 
      purchase_month is not null
      group by product
      order by price_sum desc limit 5
      --
      select 
      product,
      count(transaction_id) as freq_buy
      from
      transaction_details
      where 
      purchase_month is not null
      group by product
      order by freq_buy desc limit 5
      --
      select 
      product,
      count(customer_id) as freq_cust
      from
      transaction_details
      where 
      purchase_month is not null
      group by product
      order by freq_cust desc limit 5
      
    12. Verify the details behind one of the top-5 products

      select * from transaction_details where product in ('Goat - Whole Cut')
      
  3. Store-oriented analysis

    1. Find the most popular stores by distinct visiting customers, by total revenue, and by number of transactions (three sub-questions)

      select
      sd.store_name,
      count(distinct customer_id) as unique_visit 
      from
      transaction_details td join ext_store_details sd on td.store_id = sd.store_id
      group by store_name order by unique_visit desc
      limit 5
      --
      select
      sd.store_name,
      sum(td.price) as total_revenue 
      from
      transaction_details td join ext_store_details sd on td.store_id = sd.store_id
      group by store_name 
      order by total_revenue desc
      limit 5
      --
      select
      sd.store_name,
      count(transaction_id) as unique_purchase
      from
      transaction_details td join ext_store_details sd on td.store_id = sd.store_id
      group by store_name order by unique_purchase desc
      limit 5
      
    2. Find each store's most popular product, by number of distinct customers

      with base as (select 
      store_id,
      product,
      count(distinct customer_id) as freq_cust
      from
      transaction_details
      where 
      purchase_month is not null
      group by store_id, product
      ),
      prod_rank as (
      select
      store_id,
      product,
      freq_cust,
      rank() over(partition by store_id order by freq_cust desc) as rn
      from base
      )
      select store_name, product, freq_cust
      from prod_rank td join ext_store_details sd on td.store_id = sd.store_id
      where td.rn = 1
      
    3. Compute each store's employee-to-customer-visit ratio (computed below as customer visits per employee)

      with base as (
      select 
      store_id,
      count(distinct customer_id, purchase_date) as cust_visit
      from
      transaction_details
      where 
      purchase_month is not null
      group by store_id
      )
      select
      store_name,
      cust_visit,
      employee_number,
      round(cust_visit/employee_number,2) as cust_per_employee_within_period
      from base td join ext_store_details sd on td.store_id = sd.store_id
      
    4. Compute each store's revenue per month and per year

      select 
      store_name,
      purchase_month,
      sum(price) as revenue
      from
      transaction_details td join ext_store_details sd on td.store_id = sd.store_id
      where 
      purchase_month is not null
      group by store_name, purchase_month
      
    5. Build a pie chart of each store's total revenue

      select 
      store_name,
      sum(price) as revenue
      from
      transaction_details td join ext_store_details sd on td.store_id = sd.store_id
      where 
      purchase_month is not null
      group by store_name
      
    6. Find each store's busiest time of day

      with base as (
      select transaction_id, purchase_time, if(purchase_time like '%M', from_unixtime(unix_timestamp(purchase_time,'hh:mm aa'),'HH:mm'), purchase_time) as time_format, store_id from transaction_details
      where 
      purchase_month is not null
      ),
      timeformat as (
      select
      purchase_time, transaction_id, (cast(split(time_format, ':')[0] as decimal(4,2)) +  cast(split(time_format, ':')[1] as decimal(4,2))/60) as purchase_time_in_hrs, store_id
      from base
      ),
      timebucket as (
      select
      transaction_id, purchase_time, purchase_time_in_hrs, store_id,
      if(purchase_time_in_hrs > 5 and purchase_time_in_hrs <=8, 'early morning',
      if(purchase_time_in_hrs > 8 and purchase_time_in_hrs <=11, 'morning',
      if(purchase_time_in_hrs > 11 and purchase_time_in_hrs <=13, 'noon',
      if(purchase_time_in_hrs > 13 and purchase_time_in_hrs <=18, 'afternoon',
      if(purchase_time_in_hrs > 18 and purchase_time_in_hrs <=22, 'evening', 'night'))))) as time_bucket
      from timeformat
      )
      --select * from timebucket --this is to verify
      select sd.store_name, count(transaction_id) as tran_cnt, time_bucket 
      from timebucket td join ext_store_details sd on td.store_id = sd.store_id
      group by sd.store_name, time_bucket order by sd.store_name, tran_cnt desc
      -- from the result we could further split the afternoon and morning buckets into finer detail
      
    7. Find each store's most loyal customers

      with base as (
      select
      store_name,
      customer_id,
      sum(td.price) as total_cust_purphase
      from
      transaction_details td join ext_store_details sd on td.store_id = sd.store_id
      where 
      purchase_month is not null
      group by store_name, customer_id 
      ),
      rk_cust as (
      select 
      store_name,
      customer_id,
      total_cust_purphase,
      rank() over(partition by store_name order by total_cust_purphase desc) as rn
      from base
      )
      select * from rk_cust where rn <= 5
      
    8. Compute revenue per employee for each store to identify the star stores

      with base as (
      select 
      store_id,
      sum(price) as revenue
      from
      transaction_details
      where 
      purchase_month is not null
      group by store_id
      )
      select
      store_name,
      revenue,
      employee_number,
      round(revenue/employee_number,2) as revenue_per_employee_within_period
      from base td
      join ext_store_details sd
      on td.store_id = sd.store_id
      
  4. Review-oriented analysis

    1. Find the conflicting transactions in ext_store_review

      -- a conflicting transaction is one whose (transaction_id, store_id) pair differs between the transaction table and the review table
      select
      count(*) --725/8101
      from
      transaction_details td join ext_store_review sd on td.transaction_id = sd.transaction_id
      where 
      purchase_month is not null and td.store_id != sd.store_id
      
    2. Find the average number of reviews per customer (how many transactions have a review and how many are missing one)

      select
      count(td.transaction_id) as total_trans,
      sum(if(sd.transaction_id is null, 1, 0)) as total_review_missed,
      sum(if(sd.transaction_id is not null, 1, 0)) as total_review_exist
      from
      transaction_details td left join ext_store_review sd on td.transaction_id = sd.transaction_id
      where 
      purchase_month is not null
      
    3. Find how customers and transactions are distributed across review scores

      select
      review_score,
      count(distinct customer_id) as num_customer,
      count(*) as num_reviews
      from
      transaction_details td join ext_store_review sd on td.transaction_id = sd.transaction_id
      where 
      purchase_month is not null and review_score <> ''
      group by review_score
      
    4. Do customers always leave good reviews at the store they keep going back to?

      select -- not always
      count(*) as visit_cnt,
      customer_id,
      td.store_id
      from
      transaction_details td join ext_store_review sd on td.transaction_id = sd.transaction_id
      where 
      purchase_month is not null and review_score = '5'
      group by customer_id, td.store_id order by visit_cnt desc
      

Summary

  1. Recognize the impact of data quality problems
  2. Determine the impact of the activities being analyzed
  3. Practice SQL