2020-09-23 Class Notes (Hands-on Project: E-commerce Consumer Behavior Analysis)

cd /tmp/data
ls
wc -l customer_details.csv
wc -l store_details.csv
wc -l store_review.csv
wc -l transaction_details.csv

head -2 customer_details.csv

customer_id,first_name,last_name,email,gender,address,country,language,job,credit_type,credit_no
1,Spencer,Raffeorty,sraffeorty0@dropbox.com,Male,9274 Lyons Court,China,Khmer,Safety Technician III,jcb,3589373385487669

head -2 store_details.csv

store_id,store_name,employee_number
1,NoFrill,10

head -2 store_review.csv

transaction_id,store_id,review_score
7430,1,5

head -2 transaction_details.csv

transaction_id,customer_id,store_id,price,product,date,time
1,225,5,47.02,Bamboo Shoots - Sliced,2017-08-04,8:18

Upload the files to the HDFS directories:

cd /tmp/data
hdfs dfs -rm -r -f /tmp/shopping
hdfs dfs -mkdir -p /tmp/shopping/data/customer/
hdfs dfs -mkdir -p /tmp/shopping/data/transaction/
hdfs dfs -mkdir -p /tmp/shopping/data/store/
hdfs dfs -mkdir -p /tmp/shopping/data/review/

hdfs dfs -put customer_details.csv /tmp/shopping/data/customer/
hdfs dfs -put transaction_details.csv /tmp/shopping/data/transaction/
hdfs dfs -put store_details.csv /tmp/shopping/data/store/
hdfs dfs -put store_review.csv /tmp/shopping/data/review/

Create external tables to hold the data.
The first line of each file is the header row, not data, so it must be skipped:

create external table if not exists ext_customer_details(
customer_id string, 
first_name string,
last_name string,
email string,
gender string,
address string,
country string,
language string,
job string,
credit_type string,
credit_no string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/customer'
tblproperties("skip.header.line.count"="1");

create external table if not exists ext_transaction_details(
transaction_id string,
customer_id string,
store_id string,
price string,
product string,
`date` string,
`time` string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/transaction'
tblproperties("skip.header.line.count"="1");

create external table if not exists ext_store_details(
store_id string,
store_name string,
employee_number string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/store'
tblproperties("skip.header.line.count"="1");

create external table if not exists ext_store_review(
transaction_id string,
store_id string,
review_score string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/review'
tblproperties("skip.header.line.count"="1");
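The `skip.header.line.count=1` table property makes Hive drop the first line of each file at read time. A minimal Python sketch of the same idea, using a hypothetical two-line sample mirroring store_details.csv:

```python
import csv
import io

# Hypothetical two-line sample mirroring store_details.csv.
raw = "store_id,store_name,employee_number\n1,NoFrill,10\n"

reader = csv.reader(io.StringIO(raw))
header = next(reader)  # consume the header row, like skip.header.line.count=1
rows = list(reader)    # only data rows remain

print(header)  # ['store_id', 'store_name', 'employee_number']
print(rows)    # [['1', 'NoFrill', '10']]
```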

Verify that each table has data:

select * from ext_customer_details limit 20;
select * from ext_transaction_details limit 20;
select * from ext_store_details limit 20;
select * from ext_store_review limit 20;

Data cleaning:
First table, ext_customer_details: creating another table would copy the data a second time, so use a view or a temporary table here instead.

-- create the view vw_customer_details
create view if not exists vw_customer_details
as
select 
customer_id,
first_name,
unbase64(last_name) as last_name,
unbase64(email) as email,
gender,
unbase64(address) as address,
country,
job,
credit_type,
unbase64(credit_no) as credit_no
from ext_customer_details;
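The view assumes columns like last_name, email, address, and credit_no are stored base64-encoded; Hive's `unbase64()` returns binary, which becomes readable text once interpreted as UTF-8. A rough Python equivalent of the per-row decoding (the sample value below is hypothetical):

```python
import base64

def unbase64_to_str(value: str) -> str:
    """Mimic Hive's unbase64() followed by reading the bytes as UTF-8 text."""
    return base64.b64decode(value).decode("utf-8")

# Hypothetical example: encode an email, then decode it the way the view does.
encoded_email = base64.b64encode(b"sraffeorty0@dropbox.com").decode("ascii")
print(unbase64_to_str(encoded_email))  # sraffeorty0@dropbox.com
```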

Create a partitioned transaction-details table and load the data into it:
Load the external table's data into the partitioned table. Everything loaded from CSV came in as string, so adjust the column types here. Also, date and time are keywords, so rename those columns (purchase_date, purchase_time):

create table if not exists transaction_details(
transaction_id string,
customer_id string,
store_id string,
price decimal(8,2),
product string,
purchase_time string,
purchase_date date
)
partitioned by (purchase_month string);

Handle duplicate transaction_id values by generating new ids.
First check how many ids are duplicated:

select transaction_id, count(*) from ext_transaction_details group by transaction_id having count(*) > 1;
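The group-by/having pattern above just counts ids and keeps those seen more than once. The same logic in a small Python sketch with hypothetical ids:

```python
from collections import Counter

# Hypothetical transaction ids; "8001" appears twice, like the duplicates found above.
transaction_ids = ["1", "2", "8001", "3", "8001"]

# group by transaction_id having count(*) > 1
dupes = {tid: n for tid, n in Counter(transaction_ids).items() if n > 1}
print(dupes)  # {'8001': 2}
```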

These rows cannot simply be dropped; both rows in each duplicated pair are valid data, so the transaction_id itself needs to be repaired.
Load the partitioned table using dynamic partitioning. Whether to use static or dynamic partitioning depends on the data; when rows for different partitions are mixed together in one source, dynamic partitioning is the right fit. Set non-strict mode, then load the data:

set hive.exec.dynamic.partition.mode=nonstrict;

Ways to derive the purchase_month field:

-- purchase_date looks like 2017-12-15
substr(purchase_date, 1, 7)
-- or: convert to a Unix timestamp, then format it back the way we want
from_unixtime(unix_timestamp(purchase_date, "yyyy-MM-dd"), "yyyy-MM")
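Both Hive expressions yield the same "yyyy-MM" string. A quick Python sketch mirroring the two approaches:

```python
from datetime import datetime

purchase_date = "2017-12-15"

# Approach 1: plain substring, like substr(purchase_date, 1, 7)
month_by_substr = purchase_date[:7]

# Approach 2: parse to a date, then reformat, like
# from_unixtime(unix_timestamp(purchase_date, "yyyy-MM-dd"), "yyyy-MM")
month_by_parse = datetime.strptime(purchase_date, "%Y-%m-%d").strftime("%Y-%m")

print(month_by_substr, month_by_parse)  # 2017-12 2017-12
```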

Insert the data via dynamic partitioning: specify the partition column and the columns to insert, and repair the duplicated transaction_id values. When two rows share a transaction_id, the second one can no longer keep the bare transaction_id, so use if() to rewrite it:

with base as(
select
transaction_id,
customer_id,
store_id,
price,
product,
`time`,
`date`,
from_unixtime(unix_timestamp(`date`,"yyyy-MM-dd"),"yyyy-MM") as purchase_month,
row_number() over(partition by transaction_id order by store_id) as rn
from ext_transaction_details
)
from base 
insert overwrite table transaction_details partition(purchase_month)
select 
if(rn=1,transaction_id,concat(transaction_id,"_fix",rn)) as transaction_id,
customer_id,
store_id,
price,
product,
`time`,
`date`,
purchase_month;
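The core of the fix is row_number() within each transaction_id group: the first row keeps its id, later duplicates get an "_fixN" suffix. A Python sketch of that logic over hypothetical rows:

```python
from collections import defaultdict

# Hypothetical (transaction_id, store_id) rows; "8001" is duplicated.
rows = [("8001", "2"), ("1", "1"), ("8001", "5")]

# Emulate row_number() over (partition by transaction_id order by store_id).
groups = defaultdict(list)
for tid, store in rows:
    groups[tid].append(store)

fixed = []
for tid, stores in groups.items():
    for rn, store in enumerate(sorted(stores), start=1):
        # if(rn=1, transaction_id, concat(transaction_id, "_fix", rn))
        new_id = tid if rn == 1 else f"{tid}_fix{rn}"
        fixed.append((new_id, store))

print(sorted(fixed))  # [('1', '1'), ('8001', '2'), ('8001_fix2', '5')]
```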

Check the repaired rows:

select * from transaction_details where transaction_id like '%fix%';

In the store_review table only transaction_id is needed (transaction_id and store_id correspond to each other).
Ignore rows whose review score is empty, and create a view that keeps just the transaction_id and review_score columns:

create view if not exists vw_store_review
as select transaction_id, review_score from ext_store_review where review_score <> '';

Check the result:

select * from vw_store_review limit 20;

Data cleaning is done; list the current tables:

hive> show tables;
ext_customer_details
ext_store_details
ext_store_review
ext_transaction_details
transaction_details
vw_customer_details
vw_store_review

6.1 Find the credit card types customers use most:

select 
credit_type,count(distinct credit_no) as cnt
from vw_customer_details
group by credit_type order by cnt desc limit 6;

6.4 Count customers by gender and country:

select gender,country,count(*) as cnt
from vw_customer_details
group by gender,country order by cnt desc;

7.2 Compute total revenue per quarter.
How do we work out which quarter each month belongs to? A case when would need 12 branches; there is a shorter way.
purchase_month is a string like "2017-12".
ceil() returns an int, which has to be cast to string before concatenation.

-- months:  1 2 3 | 4 5 6 | 7 8 9 | 10 11 12
--  1/3  2/3  3/3  -> ceil gives quarter 1, e.g. ceil(1/3.0) = 1
--  4/3  5/3  6/3  -> quarter 2
--  7/3  8/3  9/3  -> quarter 3
-- 10/3 11/3 12/3  -> quarter 4

with base as(
select
price,
concat_ws("-",substr(purchase_date,1,4),cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter
from transaction_details
)
select
year_quarter,
sum(price) as revenue_total
from base group by year_quarter;
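The quarter arithmetic in the query can be checked in isolation. A Python sketch of the year_quarter expression (substr for the year, ceil(month/3.0) for the quarter):

```python
import math

def year_quarter(purchase_date: str) -> str:
    """Map 'yyyy-MM-dd' to 'yyyy-Q', mirroring the concat_ws/substr/ceil expression."""
    year = purchase_date[:4]          # substr(purchase_date, 1, 4)
    month = int(purchase_date[5:7])   # month(purchase_date)
    quarter = math.ceil(month / 3.0)  # ceil(month/3.0): 1-3 -> 1, ..., 10-12 -> 4
    return f"{year}-{quarter}"

print(year_quarter("2017-12-15"))  # 2017-4
print(year_quarter("2017-02-01"))  # 2017-1
```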