E-commerce Consumer Behavior Analysis
- I. Introduction
- II. Prerequisites
- III. Problem Analysis
- IV. Data Preparation and Exploration
- V. Data Cleaning
- VI. Customer Analysis
- VII. Transaction Analysis
- 1. Total revenue per month
- 2. Total revenue per quarter
- 3. Total revenue per year
- 4. Total revenue per weekday
- 5. Total revenue per time bucket (requires data cleaning)
- 6. Average spend per time bucket
- 7. Average spend per weekday
- 8. Total number of transactions by year, month, and day
- 9. Top 10 customers by transaction count
- 10. Top 10 customers by total spend
- 11. Customers with the fewest transactions in the period
- 12. Total distinct customers per quarter
- 13. Total distinct customers per week
- 14. Maximum per-customer average spend over the whole period
- 15. Top-spending customer per month
- 16. Most frequent customer per month
- 17. Top 5 most popular products by total price
- 18. Top 5 best-selling products by purchase frequency
- 19. Top 5 most popular products by customer count
- 20. Verify the top 5 details
- VIII. Store Analysis
- IX. Review Analysis
I. Introduction
1. Required skills
- Using Zeppelin
- Data collection
- Data exploration (understanding the business)
- Data cleaning (dirty data, format conversion, data contents, fields), producing cleaned result tables
- Data analysis (analyzing the result tables according to the requirements)
- Visualization (presenting the analysis results visually)
2. Requirements overview
Analyze one year of data collected from the stores of a retail company:
- Profiling potential customers
- Customer spending statistics
- Store resource utilization
- Identifying characteristic consumer groups
- Visual presentation of the results
II. Prerequisites
Resource links:
Link: https://pan.baidu.com/s/1Hb206CyqzY9EwvjpQDX6FQ
Extraction code: pdu9
1. Start Hadoop: start-all.sh
2. Start HiveServer2 in the background: nohup hive --service hiveserver2 &
3. Start Zeppelin:
[root@hadoop00 ~]# cd /opt/install/zeppelin/bin
[root@hadoop00 bin]# ./zeppelin-daemon.sh start
4. Open the Zeppelin web UI: 192.168.36.38:8000
5. Verify with jps
[root@hadoop00 bin]# jps
21747 RemoteInterpreterServer
15124 DataNode
15780 NodeManager
17140 ZeppelinServer
26581 Jps
15302 SecondaryNameNode
16358 NameNode
19368 RemoteInterpreterServer
15483 ResourceManager
20044 RunJar
6. Upload the files to Linux
[root@hadoop00 data]# mkdir -p /tmp/data
Upload customer_details.csv, store_details.csv, store_review.csv, and transaction_details.csv to the /tmp/data folder
7. Import the notebook Case Study - Shopping Analysis Blank.json
(1) On the Zeppelin home page, click Import note
(2) Select and import the file Case Study - Shopping Analysis Blank.json
(3) Click the settings icon in the upper-right corner and move the hive interpreter to the top of the list
III. Problem Analysis
1. Customer table
customer_details | details |
---|---|
customer_id | Int, 1 - 500 |
first_name | string |
last_name | string |
email | string, such as willddy@gmail.com |
gender | string, Male or Female |
address | string |
country | string |
language | string |
job | string, job title/position |
credit_type | string, credit card type, such as visa |
credit_no | string, credit card number |
Issue: the language field contains incorrect data
2. Transaction table
transaction_details | details |
---|---|
transaction_id | Int, 1 - 1000 |
customer_id | Int, 1 - 500 |
store_id | Int, 1 - 5 |
price | decimal, such as 5.08 |
product | string, item purchased |
date | string, purchase date |
time | string, purchase time |
Issue: transaction_id values are duplicated, but the duplicated rows are valid; the data needs to be repaired, not dropped
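Once the external table ext_transaction_details from section IV exists, the extent of the duplication can be confirmed with a quick query. This is an illustrative sketch, not part of the original notebook:

```sql
-- transaction_ids that occur more than once, largest offenders first
select transaction_id, count(*) as cnt
from ext_transaction_details
group by transaction_id
having count(*) > 1
order by cnt desc
limit 10
```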
3. Store table
store_details | details |
---|---|
store_id | Int, 1 - 5 |
store_name | string |
employee_number | Int, number of employees in the store |
4. Review table
store_review | details |
---|---|
transaction_id | Int, 1 - 8000 |
store_id | Int, 1 - 5 |
review_score | Int, 1 - 5 |
Issues: the table contains invalid score values, and some transaction_id values are mapped to the wrong store_id
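Both issues can be sized with ad-hoc queries once the external tables from section IV exist. The two checks below are illustrative sketches (each would go in its own Zeppelin paragraph), not part of the original notebook:

```sql
-- reviews whose score is empty or outside the 1-5 range
select count(*) as bad_scores
from ext_store_review
where review_score is null
   or trim(review_score) = ''
   or cast(review_score as int) not between 1 and 5

-- reviews whose store_id disagrees with the transaction they reference
select count(*) as wrong_store
from ext_store_review r
join ext_transaction_details t on r.transaction_id = t.transaction_id
where r.store_id <> t.store_id
```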
5. Data structure
Customer table: customer_details
Transaction table: transaction_details
Store table: store_details
Review table: store_review
IV. Data Preparation and Exploration
Skip the first step (Download Data); the files have already been uploaded to the /tmp/data folder on Linux
1. Data acquisition
Upload the four CSV files to the /tmp/data folder on Linux
2. Check row counts and header lines
%sh
## /tmp/data/
cd /tmp/data/
ls
wc -l customer_details.csv
wc -l transaction_details.csv
wc -l store_details.csv
wc -l store_review.csv
# inspect the first rows to see the column layout (Zeppelin may not render this well; run it from a Linux shell instead)
head -2 customer_details.csv
head -2 transaction_details.csv
head -2 store_details.csv
head -2 store_review.csv
# output:
customer_details.csv
store_details.csv
store_review.csv
transaction_details.csv
501 customer_details.csv
8101 transaction_details.csv
6 store_details.csv
1001 store_review.csv
[root@hadoop00 data]# head -2 customer_details.csv
customer_id,first_name,last_name,email,gender,address,country,language,job,credit_type,credit_no
1,Spencer,Raffeorty,sraffeorty0@dropbox.com,Male,9274 Lyons Court,China,Khmer,Safety Technician III,jcb,3589373385487669
[root@hadoop00 data]# head -2 store_details.csv
store_id,store_name,employee_number
1,NoFrill,10
[root@hadoop00 data]# head -2 store_review.csv
transaction_id,store_id,review_score
7430,1,5
[root@hadoop00 data]# head -2 transaction_details.csv
transaction_id,customer_id,store_id,price,product,date,time
1,225,5,47.02,Bamboo Shoots - Sliced,2017-08-04,8:18
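Beyond row counts, it is worth checking that every row has the same number of fields as the header. A minimal shell sketch, assuming no field contains a quoted comma (which holds for these four files):

```shell
# Print the number of rows whose field count differs from the header's
check_csv() {
  awk -F',' 'NR==1{n=NF} NF!=n{bad++} END{print bad+0}' "$1"
}

# Demo on a throwaway file containing one malformed row:
printf 'id,name,price\n1,apple,2.5\n2,banana\n' > /tmp/demo.csv
check_csv /tmp/demo.csv   # prints 1
```

Running `check_csv` over each of the four CSVs should print 0; a nonzero count points at rows that would mis-parse under a plain comma split.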
3. Upload the data to HDFS
%sh
cd /tmp/data
# recreate the target directories
hdfs dfs -rm -r -f /tmp/shopping/data
hdfs dfs -mkdir -p /tmp/shopping/data/customer/
hdfs dfs -mkdir -p /tmp/shopping/data/transaction/
hdfs dfs -mkdir -p /tmp/shopping/data/store/
hdfs dfs -mkdir -p /tmp/shopping/data/review/
# upload each file to its corresponding directory
hdfs dfs -put customer_details.csv /tmp/shopping/data/customer/
hdfs dfs -put transaction_details.csv /tmp/shopping/data/transaction/
hdfs dfs -put store_details.csv /tmp/shopping/data/store/
hdfs dfs -put store_review.csv /tmp/shopping/data/review/
4. Create appropriate external tables to hold the data
Clear all tables if they exist
drop database if exists shopping cascade
create database if not exists shopping
use shopping
-- customer table
create external table if not exists ext_customer_details(
customer_id string,
first_name string,
last_name string,
email string,
gender string,
address string,
country string,
language string,
job string,
credit_type string,
credit_no string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' -- default field delimiter is ","
location '/tmp/shopping/data/customer'
tblproperties("skip.header.line.count"="1") -- skip the header line
-- transaction table
create external table if not exists ext_transaction_details(
transaction_id string,
customer_id string,
store_id string,
price string,
product string,
date string,
time string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' -- default field delimiter is ","
location '/tmp/shopping/data/transaction'
tblproperties("skip.header.line.count"="1")
-- store details table
create external table if not exists ext_store_details(
store_id string,
store_name string,
employee_number string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' -- default field delimiter is ","
location '/tmp/shopping/data/store'
tblproperties("skip.header.line.count"="1")
-- review table
create external table if not exists ext_store_review(
transaction_id string,
store_id string,
review_score string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' -- default field delimiter is ","
location '/tmp/shopping/data/review'
tblproperties("skip.header.line.count"="1")
5. View and explore the data
Verify that all tables were created
%hive
--select * from ext_customer_details limit 20
select * from ext_transaction_details limit 20
--select * from ext_store_details limit 20
--select * from ext_store_review limit 20
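As an extra sanity check (not in the original notebook), the table row counts can be compared against the wc -l results from step 2 minus one header line each, i.e. 500 / 8100 / 5 / 1000 rows expected:

```sql
-- row count per external table, for comparison with the CSV line counts
select 'customer' as src, count(*) as cnt from ext_customer_details
union all
select 'transaction', count(*) from ext_transaction_details
union all
select 'store', count(*) from ext_store_details
union all
select 'review', count(*) from ext_store_review
```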
V. Data Cleaning
Fix the following data issues:
- Generate new IDs for the duplicated rows in transaction_details
- Filter out store_review rows that have no score
- Put the cleaned data into new tables, or expose it through views
- Identify PII (personally identifiable information) / PCI (personal confidential information) fields and encrypt or hash them
- Reorganize the transaction data into partitions by date (YYYY-MM)
- Implement the monthly partitioning of the transaction data in Zeppelin
1.Clean and Mask customer_details
-- create the view vw_customer_details to clean the table
-- mask last_name, email, address, and credit_no; drop the unimportant language field
create view if not exists vw_customer_details
as select
customer_id ,
first_name ,
unbase64(last_name) as last_name ,
unbase64(email) as email ,
gender ,
unbase64(address) as address ,
country ,
job ,
credit_type ,
unbase64(credit_no) as credit_no
from ext_customer_details
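Note that unbase64 merely garbles the text (it treats the plaintext as if it were Base64 and decodes it), which is enough for masking in this exercise but is not encryption. If a genuine one-way hash is preferred, Hive's built-in sha2 function could be used instead. An alternative sketch (the view name vw_customer_details_hashed is hypothetical, not part of the original notebook):

```sql
-- alternative: one-way hash the PII columns instead of unbase64-masking them
create view if not exists vw_customer_details_hashed
as select
customer_id ,
first_name ,
sha2(last_name, 256) as last_name ,
sha2(email, 256) as email ,
gender ,
sha2(address, 256) as address ,
country ,
job ,
credit_type ,
sha2(credit_no, 256) as credit_no
from ext_customer_details
```

Hashing is irreversible, so it still supports grouping and joining on the masked columns while making the original values unrecoverable.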
2.Clean transaction_details into partition table
%hive
-- Partition by month. The external tables above declared every column as string because OpenCSVSerde reads all fields as strings; when cleaning, redefine the column types as appropriate for the data.
create table if not exists transaction_details(
transaction_id string,
customer_id string,
store_id string,
price decimal(8,2),
product string,
purchase_time string,
purchase_date date
)
partitioned by (purchase_month string)
-- load the data using dynamic partitioning
-- non-strict mode
set hive.exec.dynamic.partition.mode=nonstrict
-- insert the data into transaction_details
with base as(
select
transaction_id ,
customer_id ,
store_id ,
price ,
product ,
`time` ,
`date` ,
from_unixtime(unix_timestamp(`date`,"yyyy-MM-dd"),"yyyy-MM") as purchase_month,
row_number() over(partition by transaction_id order by store_id) as rn
from ext_transaction_details