2020-09-23 Class Notes (Hands-on Project: E-commerce Consumer Behavior Analysis)

cd /tmp/data
ls
wc -l customer_details.csv
wc -l store_details.csv
wc -l store_review.csv
wc -l transaction_details.csv

head -2 customer_details.csv

customer_id,first_name,last_name,email,gender,address,country,language,job,credit_type,credit_no
1,Spencer,Raffeorty,sraffeorty0@dropbox.com,Male,9274 Lyons Court,China,Khmer,Safety Technician III,jcb,3589373385487669

head -2 store_details.csv

store_id,store_name,employee_number
1,NoFrill,10

head -2 store_review.csv

transaction_id,store_id,review_score
7430,1,5

head -2 transaction_details.csv

transaction_id,customer_id,store_id,price,product,date,time
1,225,5,47.02,Bamboo Shoots - Sliced,2017-08-04,8:18

Upload the files to the HDFS directories:

cd /tmp/data
hdfs dfs -rm -r -f /tmp/shopping
hdfs dfs -mkdir -p /tmp/shopping/data/customer/
hdfs dfs -mkdir -p /tmp/shopping/data/transaction/
hdfs dfs -mkdir -p /tmp/shopping/data/store/
hdfs dfs -mkdir -p /tmp/shopping/data/review/

hdfs dfs -put customer_details.csv /tmp/shopping/data/customer/
hdfs dfs -put transaction_details.csv /tmp/shopping/data/transaction/
hdfs dfs -put store_details.csv /tmp/shopping/data/store/
hdfs dfs -put store_review.csv /tmp/shopping/data/review/

Create external tables to hold the data.
The first line of each file is the header row, not data, so it must be skipped:

create external table if not exists ext_customer_details(
customer_id string, 
first_name string,
last_name string,
email string,
gender string,
address string,
country string,
language string,
job string,
credit_type string,
credit_no string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/customer'
tblproperties("skip.header.line.count"="1");

create external table if not exists ext_transaction_details(
transaction_id string,
customer_id string,
store_id string,
price string,
product string,
`date` string,
`time` string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/transaction'
tblproperties("skip.header.line.count"="1");

create external table if not exists ext_store_details(
store_id string,
store_name string,
employee_number string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/store'
tblproperties("skip.header.line.count"="1");

create external table if not exists ext_store_review(
transaction_id string,
store_id string,
review_score string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/review'
tblproperties("skip.header.line.count"="1");
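The `skip.header.line.count=1` table property makes Hive drop the first line of each file at read time. A minimal Python sketch of the same idea, using a hypothetical two-line sample mirroring store_details.csv:

```python
import csv
import io

# Hypothetical two-line sample mirroring store_details.csv.
raw = "store_id,store_name,employee_number\n1,NoFrill,10\n"

reader = csv.reader(io.StringIO(raw))
header = next(reader)  # consume the header row, like skip.header.line.count=1
rows = list(reader)    # only data rows remain

print(header)  # ['store_id', 'store_name', 'employee_number']
print(rows)    # [['1', 'NoFrill', '10']]
```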

Verify that each table has data:

select * from ext_customer_details limit 20;
select * from ext_transaction_details limit 20;
select * from ext_store_details limit 20;
select * from ext_store_review limit 20;

Data cleaning:
First table, ext_customer_details: creating another table would copy the data a second time, so use a view or a temporary table here instead.

-- create the view vw_customer_details
create view if not exists vw_customer_details
as
select 
customer_id,
first_name,
unbase64(last_name) as last_name,
unbase64(email) as email,
gender,
unbase64(address) as address,
country,
job,
credit_type,
unbase64(credit_no) as credit_no
from ext_customer_details;
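The view assumes columns like last_name, email, address, and credit_no are stored base64-encoded; Hive's `unbase64()` returns binary, which becomes readable text once interpreted as UTF-8. A rough Python equivalent of the per-row decoding (the sample value below is hypothetical):

```python
import base64

def unbase64_to_str(value: str) -> str:
    """Mimic Hive's unbase64() followed by reading the bytes as UTF-8 text."""
    return base64.b64decode(value).decode("utf-8")

# Hypothetical example: encode an email, then decode it the way the view does.
encoded_email = base64.b64encode(b"sraffeorty0@dropbox.com").decode("ascii")
print(unbase64_to_str(encoded_email))  # sraffeorty0@dropbox.com
```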

Create a partitioned transaction-details table and load the data into it:
Load the external table's data into the partitioned table. Everything loaded from CSV came in as string, so adjust the column types here. Also, date and time are keywords, so rename those columns (purchase_date, purchase_time):

create table if not exists transaction_details(
transaction_id string,
customer_id string,
store_id string,
price decimal(8,2),
product string,
purchase_time string,
purchase_date date
)
partitioned by (purchase_month string);

Handle duplicate transaction_id values by generating new ids.
First check how many ids are duplicated:

select transaction_id, count(*) from ext_transaction_details group by transaction_id having count(*) > 1;
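The group-by/having pattern above just counts ids and keeps those seen more than once. The same logic in a small Python sketch with hypothetical ids:

```python
from collections import Counter

# Hypothetical transaction ids; "8001" appears twice, like the duplicates found above.
transaction_ids = ["1", "2", "8001", "3", "8001"]

# group by transaction_id having count(*) > 1
dupes = {tid: n for tid, n in Counter(transaction_ids).items() if n > 1}
print(dupes)  # {'8001': 2}
```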

These rows cannot simply be dropped; both rows in each duplicated pair are valid data, so the transaction_id itself needs to be repaired.
Load the partitioned table using dynamic partitioning. Whether to use static or dynamic partitioning depends on the data; when rows for different partitions are mixed together in one source, dynamic partitioning is the right fit. Set non-strict mode, then load the data:

set hive.exec.dynamic.partition.mode=nonstrict;

Ways to derive the purchase_month field:

-- purchase_date looks like 2017-12-15
substr(purchase_date, 1, 7)
-- or: convert to a Unix timestamp, then format it back the way we want
from_unixtime(unix_timestamp(purchase_date, "yyyy-MM-dd"), "yyyy-MM")
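Both Hive expressions yield the same "yyyy-MM" string. A quick Python sketch mirroring the two approaches:

```python
from datetime import datetime

purchase_date = "2017-12-15"

# Approach 1: plain substring, like substr(purchase_date, 1, 7)
month_by_substr = purchase_date[:7]

# Approach 2: parse to a date, then reformat, like
# from_unixtime(unix_timestamp(purchase_date, "yyyy-MM-dd"), "yyyy-MM")
month_by_parse = datetime.strptime(purchase_date, "%Y-%m-%d").strftime("%Y-%m")

print(month_by_substr, month_by_parse)  # 2017-12 2017-12
```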

Insert the data via dynamic partitioning: specify the partition column and the columns to insert, and repair the duplicated transaction_id values. When two rows share a transaction_id, the second one can no longer keep the bare transaction_id, so use if() to rewrite it:

with base as(
select
transaction_id,
customer_id,
store_id,
price,
product,
`time`,
`date`,
from_unixtime(unix_timestamp(`date`,"yyyy-MM-dd"),"yyyy-MM") as purchase_month,
row_number() over(partition by transaction_id order by store_id) as rn
from ext_transaction_details
)
from base 
insert overwrite table transaction_details partition(purchase_month)
select 
if(rn=1,transaction_id,concat(transaction_id,"_fix",rn)) as transaction_id,
customer_id,
store_id,
price,
product,
`time`,
`date`,
purchase_month;
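The core of the fix is row_number() within each transaction_id group: the first row keeps its id, later duplicates get an "_fixN" suffix. A Python sketch of that logic over hypothetical rows:

```python
from collections import defaultdict

# Hypothetical (transaction_id, store_id) rows; "8001" is duplicated.
rows = [("8001", "2"), ("1", "1"), ("8001", "5")]

# Emulate row_number() over (partition by transaction_id order by store_id).
groups = defaultdict(list)
for tid, store in rows:
    groups[tid].append(store)

fixed = []
for tid, stores in groups.items():
    for rn, store in enumerate(sorted(stores), start=1):
        # if(rn=1, transaction_id, concat(transaction_id, "_fix", rn))
        new_id = tid if rn == 1 else f"{tid}_fix{rn}"
        fixed.append((new_id, store))

print(sorted(fixed))  # [('1', '1'), ('8001', '2'), ('8001_fix2', '5')]
```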

Check the repaired rows:

select * from transaction_details where transaction_id like '%fix%';

In the store_review table only transaction_id is needed (transaction_id and store_id correspond to each other).
Ignore rows whose review score is empty, and create a view that keeps just the transaction_id and review_score columns:

create view if not exists vw_store_review
as select transaction_id, review_score from ext_store_review where review_score <> '';

Check the result:

select * from vw_store_review limit 20;

Data cleaning is done; list the current tables:

hive> show tables;
ext_customer_details
ext_store_details
ext_store_review
ext_transaction_details
transaction_details
vw_customer_details
vw_store_review

6.1 Find the credit card types customers use most:

select 
credit_type,count(distinct credit_no) as cnt
from vw_customer_details
group by credit_type order by cnt desc limit 6;

6.4 Count customers by gender and country:

select gender,country,count(*) as cnt
from vw_customer_details
group by gender,country order by cnt desc;

7.2 Compute total revenue per quarter.
How do we work out which quarter each month belongs to? A case when would need 12 branches; there is a shorter way.
purchase_month is a string like "2017-12".
ceil() returns an int, which has to be cast to string before concatenation.

-- months:  1 2 3 | 4 5 6 | 7 8 9 | 10 11 12
--  1/3  2/3  3/3  -> ceil gives quarter 1, e.g. ceil(1/3.0) = 1
--  4/3  5/3  6/3  -> quarter 2
--  7/3  8/3  9/3  -> quarter 3
-- 10/3 11/3 12/3  -> quarter 4

with base as(
select
price,
concat_ws("-",substr(purchase_date,1,4),cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter
from transaction_details
)
select
year_quarter,
sum(price) as revenue_total
from base group by year_quarter;
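The quarter arithmetic in the query can be checked in isolation. A Python sketch of the year_quarter expression (substr for the year, ceil(month/3.0) for the quarter):

```python
import math

def year_quarter(purchase_date: str) -> str:
    """Map 'yyyy-MM-dd' to 'yyyy-Q', mirroring the concat_ws/substr/ceil expression."""
    year = purchase_date[:4]          # substr(purchase_date, 1, 4)
    month = int(purchase_date[5:7])   # month(purchase_date)
    quarter = math.ceil(month / 3.0)  # ceil(month/3.0): 1-3 -> 1, ..., 10-12 -> 4
    return f"{year}-{quarter}"

print(year_quarter("2017-12-15"))  # 2017-4
print(year_quarter("2017-02-01"))  # 2017-1
```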