Hands-On Project: E-Commerce Consumer Behavior Analysis

I. Overview

1. Skills required

  • Using Zeppelin
  • Data collection
  • Data exploration (understanding the business)
  • Data cleaning (dirty data, format conversion, field contents), yielding a cleaned result table
  • Data analysis (analyzing the result table according to the requirements)
  • Visualization (presenting the analysis results)

2. Requirements

Analyze one year of store data collected by a retail enterprise:

  • Profile potential customers
  • Compute customer consumption statistics
  • Measure store resource utilization
  • Identify the characteristics of consumer segments
  • Visualize the results

II. Prerequisites

Resource link:

Link: https://pan.baidu.com/s/1Hb206CyqzY9EwvjpQDX6FQ
Extraction code: pdu9

1. Start Hadoop: start-all.sh

2. Start HiveServer2 in the background: nohup hive --service hiveserver2 &

3. Start Zeppelin:

[root@hadoop00 ~]# cd /opt/install/zeppelin/bin
[root@hadoop00 bin]# ./zeppelin-daemon.sh  start

4. Open the Zeppelin web UI: 192.168.36.38:8000

5. Verify with jps

[root@hadoop00 bin]# jps
21747 RemoteInterpreterServer
15124 DataNode
15780 NodeManager
17140 ZeppelinServer
26581 Jps
15302 SecondaryNameNode
16358 NameNode
19368 RemoteInterpreterServer
15483 ResourceManager
20044 RunJar

6. Upload the files to Linux

[root@hadoop00 data]# mkdir -p /tmp/data

Upload customer_details.csv, store_details.csv, store_review.csv, and transaction_details.csv to the /tmp/data directory.

7. Import the notebook Case Study - Shopping Analysis Blank.json

(1) On the Zeppelin home page, click Import note.

(2) Select and import the file Case Study - Shopping Analysis Blank.json.

(3) Click the settings icon at the top right of the note and move the hive interpreter to the top of the list.

III. Problem Analysis

1. Customer table

customer_details details
customer_id Int, 1 - 500
first_name string
last_name string
email string, such as willddy@gmail.com
gender string, Male or female
address string
country string
language string
job string, job title/position
credit_type string, credit card type, such as visa
credit_no string, credit card number

Issue: the language field contains erroneous data.
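To see how the language field is polluted, a quick group-by on the external table (created in section IV below) lists each distinct value and its frequency; a minimal sketch:

%hive
-- Inspect the distinct values of the suspect language field
select language, count(*) as cnt
from ext_customer_details
group by language
order by cnt desc

Values with implausibly low counts, or values that are not language names at all, are candidates for cleaning.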

2. Transaction table

transaction_details details
transaction_id Int, 1 - 1000
customer_id Int, 1 - 500
store_id Int, 1 - 5
price decimal, such as 5.08
product string, things bought
date string, when to purchase
time string, what time to purchase

Issue: some transaction_id values are duplicated, but the rows themselves are valid; the data must be repaired by generating new IDs.
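The duplicates can be confirmed with a group-by on the external table ext_transaction_details (created in section IV); a sketch:

%hive
-- Count transaction_ids that occur more than once
select transaction_id, count(*) as cnt
from ext_transaction_details
group by transaction_id
having count(*) > 1
order by cnt desc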

3. Store table

store_details details
store_id Int, 1 - 5
store_name string
employee_number Int, number of employees in the store

4. Review table

store_review details
transaction_id Int, 1 - 8000
store_id Int, 1 - 5
review_score Int, 1 - 5

Issues: the table contains invalid (empty) score values, and some transaction_ids are mapped to the wrong store_id.
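Both issues can be checked against the external tables created in section IV; a sketch (run one statement at a time, uncommenting as needed, in the style of the verification paragraph below):

%hive
-- Reviews with an empty score
select count(*) from ext_store_review where review_score is null or review_score = ''
-- Reviews whose store_id disagrees with the transaction's store_id:
-- select count(*)
-- from ext_store_review r
-- join ext_transaction_details t on r.transaction_id = t.transaction_id
-- where r.store_id <> t.store_id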

5. Data structure

Customer table: customer_details
Transaction table: transaction_details
Store table: store_details
Review table: store_review

IV. Data Preparation and Exploration

Skip the first step (Download Data); the files have already been uploaded to /tmp/data on Linux.

1. Data acquisition

Place the four CSV files in the /tmp/data directory on Linux.

2. Check row counts and header lines

%sh
## /tmp/data/
cd /tmp/data/
ls
wc -l customer_details.csv
wc -l transaction_details.csv
wc -l store_details.csv
wc -l store_review.csv

# Inspect the first rows to see the column layout
# (Zeppelin may not render these cleanly; check from a Linux shell if needed)
head -2 customer_details.csv
head -2 transaction_details.csv
head -2 store_details.csv
head -2 store_review.csv

# Output (each wc -l count includes the header line, so e.g. customer_details.csv holds 500 data rows)
customer_details.csv
store_details.csv
store_review.csv
transaction_details.csv
501 customer_details.csv
8101 transaction_details.csv
6 store_details.csv
1001 store_review.csv

[root@hadoop00 data]# head -2 customer_details.csv
customer_id,first_name,last_name,email,gender,address,country,language,job,credit_type,credit_no
1,Spencer,Raffeorty,sraffeorty0@dropbox.com,Male,9274 Lyons Court,China,Khmer,Safety Technician III,jcb,3589373385487669

[root@hadoop00 data]# head -2 store_details.csv
store_id,store_name,employee_number
1,NoFrill,10

[root@hadoop00 data]# head -2 store_review.csv
transaction_id,store_id,review_score
7430,1,5

[root@hadoop00 data]# head -2 transaction_details.csv
transaction_id,customer_id,store_id,price,product,date,time
1,225,5,47.02,Bamboo Shoots - Sliced,2017-08-04,8:18

3. Upload the data to HDFS

%sh
cd /tmp/data
# Recreate the target directories
hdfs dfs -rm -r -f /tmp/shopping/data
hdfs dfs -mkdir -p /tmp/shopping/data/customer/
hdfs dfs -mkdir -p /tmp/shopping/data/transaction/
hdfs dfs -mkdir -p /tmp/shopping/data/store/
hdfs dfs -mkdir -p /tmp/shopping/data/review/
# Upload each file into its directory
hdfs dfs -put customer_details.csv    /tmp/shopping/data/customer/
hdfs dfs -put transaction_details.csv /tmp/shopping/data/transaction/
hdfs dfs -put store_details.csv       /tmp/shopping/data/store/
hdfs dfs -put store_review.csv        /tmp/shopping/data/review/

4. Create external tables to hold the data

Clear all tables if they exist, then recreate them:

%hive
drop database if exists shopping cascade;
create database if not exists shopping;
use shopping;
-- customer table
create external table if not exists ext_customer_details(
customer_id    string,
first_name     string,
last_name      string,
email          string,
gender         string,
address        string,
country        string,
language       string,
job            string,
credit_type    string,
credit_no      string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' -- default field separator is ","
location '/tmp/shopping/data/customer'
tblproperties("skip.header.line.count"="1"); -- skip the header row
-- transaction table (`date` and `time` are backquoted because they are reserved words in Hive)
create external table if not exists ext_transaction_details(
transaction_id string,
customer_id    string,
store_id       string,
price          string,
product        string,
`date`         string,
`time`         string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/transaction'
tblproperties("skip.header.line.count"="1");
-- store details table
create external table if not exists ext_store_details(
store_id        string,
store_name      string,
employee_number string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/store'
tblproperties("skip.header.line.count"="1");
-- review table
create external table if not exists ext_store_review(
transaction_id  string,
store_id        string,
review_score    string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/review'
tblproperties("skip.header.line.count"="1");

5. Inspect the data

Verify that all tables were created (uncomment one select at a time):

%hive
--select * from ext_customer_details limit 20
select * from ext_transaction_details limit 20
--select * from ext_store_details limit 20
--select * from ext_store_review limit 20

V. Data Cleaning

Fix the following data problems:

  • Generate new IDs for the duplicated rows in transaction_details
  • Filter out store_review rows that have no review score
  • Store the cleaned data in new tables, or expose it through views
  • Identify PII (personally identifiable information) and confidential fields such as credit card numbers, and encrypt or hash them
  • Reorganize the transaction data into partitions by month (YYYY-MM), implemented in Zeppelin

1. Clean and mask customer_details

%hive
-- Create view vw_customer_details to clean the table:
-- mask last_name, email, address, and credit_no (unbase64 turns the text into unreadable bytes),
-- and drop the unimportant language field
create view if not exists vw_customer_details
as select
customer_id,
first_name,
unbase64(last_name) as last_name,
unbase64(email) as email,
gender,
unbase64(address) as address,
country,
job,
credit_type,
unbase64(credit_no) as credit_no
from ext_customer_details
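Note that unbase64 only garbles the text; it is obfuscation, not encryption or hashing. If a proper one-way hash is preferred, Hive's built-in sha2 function could be used instead (assuming Hive 1.3+; this is an alternative sketch, not what the notebook does):

%hive
-- Alternative masking sketch: hash the PII columns instead of base64-decoding them
select
customer_id,
first_name,
sha2(last_name, 256) as last_name,
sha2(email, 256)     as email,
sha2(credit_no, 256) as credit_no
from ext_customer_details
limit 5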

2. Clean transaction_details into a partitioned table

%hive
-- Partition by month. The external tables were created with OpenCSVSerde, which stores every
-- column as string; when cleaning, redefine the column types as the data requires (e.g. price as decimal).
create table if not exists transaction_details(
transaction_id string,
customer_id    string,
store_id       string,
price          decimal(8,2),
product        string,
purchase_time  string,
purchase_date  date
)
partitioned by (purchase_month string);
-- Use dynamic partitioning for the load (non-strict mode)
set hive.exec.dynamic.partition.mode=nonstrict;
-- Load into transaction_details, generating a new ID for each duplicated transaction_id
with base as(
select
transaction_id,
customer_id,
store_id,
price,
product,
`time`,
`date`,
from_unixtime(unix_timestamp(`date`,"yyyy-MM-dd"),"yyyy-MM") as purchase_month,
row_number() over(partition by transaction_id order by store_id) as rn
from ext_transaction_details
)
from base
insert overwrite table transaction_details partition(purchase_month)
select
if(rn = 1, transaction_id, concat(transaction_id, '_fix', rn)) as transaction_id,
customer_id, store_id, price, product, `time`, `date`, purchase_month
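Once the load finishes, the partitions can be sanity-checked; a sketch:

%hive
-- Row count per month partition; the totals should sum to 8100,
-- matching the source file's wc -l count minus its header line
select purchase_month, count(*) as cnt
from transaction_details
group by purchase_month
order by purchase_month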