(I) Data Warehouse Overview
The data warehouse has 4 layers: the ODS, DWD, DWS, and APP layers.
- ODS layer: the raw data layer. Data from the source systems is ingested and stored exactly as it arrives.
- DWD layer: the detail data layer. This layer cleans the ODS data, fixing data-quality and data-completeness problems.
- DWS layer: this layer lightly aggregates the DWD data into a set of intermediate tables, which improves the reusability of common metrics and avoids repeated computation; it also builds wide tables to serve downstream business queries.
- APP layer: result tables computed from the three layers above according to business needs, ready for direct querying and presentation. The APP-layer data is usually exported to MySQL for online systems to use for reports, monitoring, and other features. Some companies call this layer the DM layer; the name differs, but the nature is the same.
(II) User Behavior Data Warehouse Development
(1) ODS layer development
- Create the database ods_mall in Hive:
hive (default)> create database ods_mall;
- Create the following tables for the user behavior data in ods_mall:
ods_user_active: user-active table (act=1)
ods_click_good: goods-click table (act=2)
ods_good_item: goods detail page table (act=3)
ods_good_list: goods list page table (act=4)
ods_app_close: APP crash table (act=5)
Notes:
- Since we create multiple databases while building the warehouse, always prefix table names with their database name when creating and using tables; otherwise you can run into avoidable problems, such as accidentally creating an ODS-layer table in the DWD-layer database.
- To make SQL statements safe to re-run, add if not exists to them.
- In Hive, dates and times can be represented with string, date, or timestamp: date uses the form yyyy-MM-dd, timestamp uses the form yyyy-MM-dd hh:mm:ss, and a string can hold either of those two forms. The three types can be converted into one another, but it is best to pick one format and stick to it. Since string can represent both of the other formats and also supports date comparison, we use string for all dates and times here.
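The later daily scripts need to convert between the compact yyyyMMdd partition form and the yyyy-MM-dd form that Hive's date functions expect. A minimal sketch of that round trip, assuming GNU date is available:

```shell
# Convert a compact partition value to Hive's yyyy-MM-dd form and back.
dt=20260201
dt_new=$(date +%Y-%m-%d --date="${dt}")
dt_back=$(date +%Y%m%d --date="${dt_new}")
echo "${dt_new} ${dt_back}"   # prints: 2026-02-01 20260201
```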
- Create the ODS-layer tables
Table name: ods_user_active
(stores the raw user-active data in JSON format)
DDL:
create external table if not exists ods_mall.ods_user_active(
log string
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/ods/user_action/';
alter table ods_mall.ods_user_active add if not exists partition(dt='20260101') location '20260101/1';
Table name: ods_click_good
create external table if not exists ods_mall.ods_click_good(
log string
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/ods/user_action/';
alter table ods_mall.ods_click_good add if not exists partition(dt='20260101') location '20260101/2';
Table name: ods_good_item
create external table if not exists ods_mall.ods_good_item(
log string
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/ods/user_action/';
alter table ods_mall.ods_good_item add if not exists partition(dt='20260101') location '20260101/3';
Table name: ods_good_list
create external table if not exists ods_mall.ods_good_list(
log string
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/ods/user_action/';
alter table ods_mall.ods_good_list add if not exists partition(dt='20260101') location '20260101/4';
Table name: ods_app_close
create external table if not exists ods_mall.ods_app_close(
log string
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/ods/user_action/';
alter table ods_mall.ods_app_close add if not exists partition(dt='20260101') location '20260101/5';
- Scripts for the ODS layer
Table initialization script (run once at setup):
ods_mall_init_table.sh
#!/bin/bash
# Initializes the ODS-layer database and tables; only needs to run once
hive -e "
create database if not exists ods_mall;
create external table if not exists ods_mall.ods_user_active(
log string
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/ods/user_action/';
create external table if not exists ods_mall.ods_click_good(
log string
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/ods/user_action/';
create external table if not exists ods_mall.ods_good_item(
log string
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/ods/user_action/';
create external table if not exists ods_mall.ods_good_list(
log string
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/ods/user_action/';
create external table if not exists ods_mall.ods_app_close(
log string
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/ods/user_action/';
"
Partition-adding script (run once per day):
add_partition.sh: this script is generic and can be used anywhere a partition needs to be added.
#!/bin/bash
# Adds a partition to an external partitioned table
# Takes three arguments
# 1: table name
# 2: value of the partition field dt, in the form 20260101
# 3: partition path (relative or absolute)
if [ $# != 3 ]
then
echo "参数异常:add_partition.sh <tabkle_name> <dt> <path>"
exit 100
fi
table_name=$1
dt=$2
path=$3
hive -e "
alter table ${table_name} add if not exists partition(dt='${dt}') location '${path}';
"
ods_mall_add_partition.sh
#!/bin/bash
# Adds partitions to the ODS-layer tables; runs once per day
# Every day after midnight, add yesterday's partition, then run the downstream computation scripts
# Defaults to yesterday's date; a specific date can also be passed as an argument
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 days ago"`
else
dt=$1
fi
#alter table ods_mall.ods_user_active add if not exists partition(dt='20260101') location '20260101/1';
#alter table ods_mall.ods_click_good add if not exists partition(dt='20260101') location '20260101/2';
#alter table ods_mall.ods_good_item add if not exists partition(dt='20260101') location '20260101/3';
#alter table ods_mall.ods_good_list add if not exists partition(dt='20260101') location '20260101/4';
#alter table ods_mall.ods_app_close add if not exists partition(dt='20260101') location '20260101/5';
sh add_partition.sh ods_mall.ods_user_active ${dt} ${dt}/1
sh add_partition.sh ods_mall.ods_click_good ${dt} ${dt}/2
sh add_partition.sh ods_mall.ods_good_item ${dt} ${dt}/3
sh add_partition.sh ods_mall.ods_good_list ${dt} ${dt}/4
sh add_partition.sh ods_mall.ods_app_close ${dt} ${dt}/5
(2) DWD layer development
Clean the data in the ODS-layer tables, following the data-cleaning rules and adjusting them to the actual data.
Note: if a cleaning rule can be implemented in SQL, implement it in SQL. If a rule is very awkward or outright impossible to express in SQL, consider cleaning the data with MapReduce or Spark code instead.
Since the data collected here is fairly well-formed and SQL is sufficient, we implement the cleaning directly in SQL.
- Create the database dwd_mall in Hive:
hive (default)> create database dwd_mall;
- Create the DWD-layer tables
Notes:
- In the raw JSON data the user id field is named uid, but in the goods-order data it is named user_id. Watch out for this: in real work, client-side and server-side data often disagree on a few field names, so it is best to unify them to make later use easier. Here we parse the data via uid and alias the field to user_id.
- Hive's timestamp type can only parse data in the yyyy-MM-dd HH:mm:ss format, so for the acttime field here we use the bigint type.
- To keep re-runs safe, prefer insert overwrite over insert into; otherwise, re-running the statement would write duplicate data.
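The effect of the two statements on a rerun is analogous to shell redirection: appending (like insert into) duplicates rows on every execution, while rewriting (like insert overwrite) is idempotent. A rough illustration:

```shell
# "insert into" behaves like >>: running the same write twice leaves two copies.
tmpdir=$(mktemp -d)
echo "row1" >> "${tmpdir}/into.txt"
echo "row1" >> "${tmpdir}/into.txt"
# "insert overwrite" behaves like >: running it twice still leaves one copy.
echo "row1" > "${tmpdir}/over.txt"
echo "row1" > "${tmpdir}/over.txt"
wc -l < "${tmpdir}/into.txt"   # 2
wc -l < "${tmpdir}/over.txt"   # 1
```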
- dwd_user_active (stores the parsed and cleaned user-active data; the data is deduplicated and records with an empty xaid are filtered out)
create external table if not exists dwd_mall.dwd_user_active(
user_id bigint,
xaid string,
platform tinyint,
ver string,
vercode string,
net bigint,
brand string,
model string,
display string,
osver string,
acttime bigint,
ad_status tinyint,
loading_time bigint
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/user_active/';
insert overwrite table dwd_mall.dwd_user_active partition(dt='20260101') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.ad_status') as ad_status,
get_json_object(log,'$.loading_time') as loading_time
from
(
select log from ods_mall.ods_user_active where dt = '20260101' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';
- Create dwd_click_good
create external table if not exists dwd_mall.dwd_click_good(
user_id bigint,
xaid string,
platform tinyint,
ver string,
vercode string,
net bigint,
brand string,
model string,
display string,
osver string,
acttime bigint,
goods_id bigint,
location tinyint
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/click_good/';
insert overwrite table dwd_mall.dwd_click_good partition(dt='20260101') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.goods_id') as goods_id,
get_json_object(log,'$.location') as location
from
(
select log from ods_mall.ods_click_good where dt = '20260101' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';
- Create dwd_good_item
create external table if not exists dwd_mall.dwd_good_item(
user_id bigint,
xaid string,
platform tinyint,
ver string,
vercode string,
net bigint,
brand string,
model string,
display string,
osver string,
acttime bigint,
goods_id bigint,
stay_time bigint,
loading_time bigint
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/good_item/';
insert overwrite table dwd_mall.dwd_good_item partition(dt='20260101') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.goods_id') as goods_id,
get_json_object(log,'$.stay_time') as stay_time,
get_json_object(log,'$.loading_time') as loading_time
from
(
select log from ods_mall.ods_good_item where dt = '20260101' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';
- Create dwd_good_list
create external table if not exists dwd_mall.dwd_good_list(
user_id bigint,
xaid string,
platform tinyint,
ver string,
vercode string,
net bigint,
brand string,
model string,
display string,
osver string,
acttime bigint,
loading_time bigint,
loading_type tinyint,
goods_num tinyint
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/good_list/';
insert overwrite table dwd_mall.dwd_good_list partition(dt='20260101') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.loading_time') as loading_time,
get_json_object(log,'$.loading_type') as loading_type,
get_json_object(log,'$.goods_num') as goods_num
from
(
select log from ods_mall.ods_good_list where dt = '20260101' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';
- Create dwd_app_close
create external table if not exists dwd_mall.dwd_app_close(
user_id bigint,
xaid string,
platform tinyint,
ver string,
vercode string,
net bigint,
brand string,
model string,
display string,
osver string,
acttime bigint
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/app_close/';
insert overwrite table dwd_mall.dwd_app_close partition(dt='20260101') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime
from
(
select log from ods_mall.ods_app_close where dt = '20260101' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';
- Scripts for the DWD layer
- Table initialization script (run once at setup):
dwd_mall_init_table.sh
#!/bin/bash
# Initializes the DWD-layer database and tables; only needs to run once
hive -e "
create database if not exists dwd_mall;
create external table if not exists dwd_mall.dwd_user_active(
user_id bigint,
xaid string,
platform tinyint,
ver string,
vercode string,
net bigint,
brand string,
model string,
display string,
osver string,
acttime bigint,
ad_status tinyint,
loading_time bigint
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/user_active/';
create external table if not exists dwd_mall.dwd_click_good(
user_id bigint,
xaid string,
platform tinyint,
ver string,
vercode string,
net bigint,
brand string,
model string,
display string,
osver string,
acttime bigint,
goods_id bigint,
location tinyint
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/click_good/';
create external table if not exists dwd_mall.dwd_good_item(
user_id bigint,
xaid string,
platform tinyint,
ver string,
vercode string,
net bigint,
brand string,
model string,
display string,
osver string,
acttime bigint,
goods_id bigint,
stay_time bigint,
loading_time bigint
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/good_item/';
create external table if not exists dwd_mall.dwd_good_list(
user_id bigint,
xaid string,
platform tinyint,
ver string,
vercode string,
net bigint,
brand string,
model string,
display string,
osver string,
acttime bigint,
loading_time bigint,
loading_type tinyint,
goods_num tinyint
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/good_list/';
create external table if not exists dwd_mall.dwd_app_close(
user_id bigint,
xaid string,
platform tinyint,
ver string,
vercode string,
net bigint,
brand string,
model string,
display string,
osver string,
acttime bigint
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/app_close/';
"
- Partition-adding script (run once per day):
dwd_mall_add_partition.sh
#!/bin/bash
# Cleans the ODS-layer tables and writes the result into the matching partitions of the DWD-layer tables
# Runs once per day after midnight
# Defaults to yesterday's date; a specific date can also be passed as an argument
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 days ago"`
else
dt=$1
fi
hive -e "
insert overwrite table dwd_mall.dwd_user_active partition(dt='${dt}') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.ad_status') as ad_status,
get_json_object(log,'$.loading_time') as loading_time
from
(
select log from ods_mall.ods_user_active where dt = '${dt}' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';
insert overwrite table dwd_mall.dwd_click_good partition(dt='${dt}') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.goods_id') as goods_id,
get_json_object(log,'$.location') as location
from
(
select log from ods_mall.ods_click_good where dt = '${dt}' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';
insert overwrite table dwd_mall.dwd_good_item partition(dt='${dt}') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.goods_id') as goods_id,
get_json_object(log,'$.stay_time') as stay_time,
get_json_object(log,'$.loading_time') as loading_time
from
(
select log from ods_mall.ods_good_item where dt = '${dt}' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';
insert overwrite table dwd_mall.dwd_good_list partition(dt='${dt}') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.loading_time') as loading_time,
get_json_object(log,'$.loading_type') as loading_type,
get_json_object(log,'$.goods_num') as goods_num
from
(
select log from ods_mall.ods_good_list where dt = '${dt}' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';
insert overwrite table dwd_mall.dwd_app_close partition(dt='${dt}') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime
from
(
select log from ods_mall.ods_app_close where dt = '${dt}' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';
"
(III) Requirements Analysis
The tables in the first two layers generally have little to do with specific requirements: they simply bring the existing data in and clean it. The DWS and APP layers, however, are tied to the business, so when building their tables we need to analyze some typical business scenarios and design the tables to be as generic as possible, so they can also satisfy similar requirements later.
In other words, when building tables for a business requirement, don't solve it with one giant SQL statement. Split complex SQL along some dimensions into intermediate tables, then compute the final result from those intermediate tables. The data in the intermediate tables can then be reused by similar requirements later.
- Requirement 1: daily new-user metrics
- Requirement 2: daily active-user metrics
- Requirement 3: push reminders for users inactive for 7 days
- Requirement 4: daily APP launch-count metrics
- Requirement 5: active users by operating system
- Requirement 6: APP crash metrics
Data used: 2026-02-01 through 2026-02-28.
(1) Requirement 1: daily new-user metrics
When counting new users, a user is identified by the device identifier field (xaid): every device carries a unique device code. Because users may never log in, and several people may share one account, counting by user id would be inaccurate.
A new user is someone who installs and uses the app for the first time; someone who uninstalls and later uses it again does not count as new.
A new user can therefore also be called a new device: one device corresponds to one user.
1: Daily new-user count
2: Day-over-day and week-over-week change of the daily new-user count
1) First, analyze the first metric in detail
Daily new-user count: in practice this metric measures how fast the product's user base is growing. If the daily new-user count keeps rising, the product has momentum; if growth slows or drops for a while, it is time to think about how to acquire new users.
As analyzed above, a new user is someone who installs and uses the APP for the first time. We have a tracking event that reports the "APP opened" action, so the new-user count is computed from that data.
The ODS-layer table is ods_user_active.
The DWD-layer table is dwd_user_active.
2) Implement Requirement 1
(1) The implementation approach for the first metric:
1: Based on the cleaned APP-open report data, build a history table containing the xaid field; each day's data is deduplicated by xaid.
2: To compute the new-user count for 2026-02-01, take the APP-open data reported that day and left join it with the history table on xaid; the rows with no match are the new users.
(1) The first step produces a history table, dws_user_active_history, with an xaid field:
dws_user_active_history
xaid
a1
b1
c1
d1
(2) The second step produces a temporary table containing that day's reported APP-open data:
dws_user_active_20260201_tmp
xaid
a1
b1
x1
y1
z1
(3) Left join the two tables:
dws_user_active_20260201_tmp.xaid dws_user_active_history.xaid
a1 a1
b1 b1
x1 null
y1 null
z1 null
The rows where dws_user_active_history.xaid is null are that day's new users, so counting them gives the daily new-user count.
3: Save the computed daily new-user records into the table dws_user_new_item, partitioned by day, so that other requirements can use it later.
4: Aggregate dws_user_new_item and save the computed new-user counts into the result table app_user_new_count.
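The left join plus is-null filter above is simply an anti-join on xaid. Using the sample xaid values from the tables above, the same result can be illustrated outside Hive with sort order and comm:

```shell
# History of all xaids seen so far, and today's active xaids (sample values, already sorted).
tmp=$(mktemp -d)
printf 'a1\nb1\nc1\nd1\n' > "${tmp}/history.txt"
printf 'a1\nb1\nx1\ny1\nz1\n' > "${tmp}/today.txt"
# comm -23 keeps lines that appear only in the first file:
# today's xaids with no match in history, i.e. the new users.
comm -23 "${tmp}/today.txt" "${tmp}/history.txt"   # prints x1, y1, z1
```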
(2) Next, the second metric: the day-over-day and week-over-week change of the daily new-user count.
- A year-over-year comparison usually sets this period's figure against the same period of a previous year, e.g. February 2026 vs February 2025; the period can also be a month or a week.
- A period-over-period comparison usually sets this period's figure against the immediately preceding period, e.g. February 2026 vs January 2026; the period can also be a week or a day.
In practice these comparisons measure how fast a metric is changing, and product managers use them when making decisions.
Day-over-day ratio = (current value - previous day's value) / previous day's value
The time unit of the day-over-day ratio is one day.
Week-over-week ratio = (current value - value one week earlier) / value one week earlier
The time unit of the week-over-week ratio is one week (7 days).
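A quick numeric check of the formula with made-up counts: if today's new-user count is 250 and yesterday's was 200, the day-over-day ratio is (250 - 200) / 200 = 0.25. In shell:

```shell
# Day-over-day ratio: (current - previous) / previous; awk handles the floating point.
awk 'BEGIN { cur = 250; prev = 200; printf "%.2f\n", (cur - prev) / prev }'   # prints 0.25
```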
Implementation approach:
Compute directly from app_user_new_count; a given day's day-over-day and week-over-week ratios can both be derived from it.
Generate a new table, app_user_new_count_ratio,
containing the date, the new-user count, the day-over-day ratio, and the week-over-week ratio.
Summary:
Based on the analysis above, we need to create three tables in the DWS layer:
dws_user_active_20260201_tmp
dws_user_new_item
dws_user_active_history
and two tables in the APP layer:
app_user_new_count
app_user_new_count_ratio
3) Create the tables
- DWS layer
create table if not exists dws_mall.dws_user_active_20260201_tmp(
xaid string,
times int
);
insert overwrite table dws_mall.dws_user_active_20260201_tmp select
xaid,
count(*) as times
from dwd_mall.dwd_user_active
where dt = '20260201'
group by xaid;
create external table if not exists dws_mall.dws_user_active_history(
xaid string,
times int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_active_history';
create external table if not exists dws_mall.dws_user_new_item(
xaid string
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_new_item';
insert overwrite table dws_mall.dws_user_new_item partition(dt='20260201') select
duat.xaid
from dws_mall.dws_user_active_20260201_tmp duat
left join (select xaid from dws_mall.dws_user_active_history group by xaid) duah
on duat.xaid = duah.xaid
where duah.xaid is null;
insert overwrite table dws_mall.dws_user_active_history partition(dt='20260201') select
xaid,
times
from dws_mall.dws_user_active_20260201_tmp;
create external table if not exists dws_mall.dws_user_lost_item(
xaid string
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_lost_item';
insert overwrite table dws_mall.dws_user_lost_item partition(dt='20260201') select
xaid
from dws_mall.dws_user_active_history
where dt >= regexp_replace(date_add('2026-02-01',-7),'-','')
group by xaid
having max(dt) = regexp_replace(date_add('2026-02-01',-7),'-','');
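The regexp_replace(date_add('2026-02-01', -7), '-', '') expression above turns the date seven days earlier into the compact partition form, 20260125. The equivalent shell computation, in the style the daily scripts use:

```shell
# Partition value (yyyyMMdd) for the date 7 days before a given day.
dt_new=2026-02-01
date +%Y%m%d --date="${dt_new} -7 days"   # prints 20260125
```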
- APP layer
create external table if not exists app_mall.app_user_new_count(
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_new_count';
insert overwrite table app_mall.app_user_new_count partition(dt='20260201') select
count(*) as num
from dws_mall.dws_user_new_item
where dt = '20260201';
create external table if not exists app_mall.app_user_new_count_ratio(
num int,
day_ratio double,
week_ratio double
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_new_count_ratio';
-- day-over-day and week-over-week ratios
insert overwrite table app_mall.app_user_new_count_ratio partition(dt='20260201') select
num,
(num-num_1)/num_1 as day_ratio,
(num-num_7)/num_7 as week_ratio
from(
select
dt,
num,
lead(num,1) over(order by dt desc) as num_1,
lead(num,7) over(order by dt desc) as num_7
from app_mall.app_user_new_count
where dt >=regexp_replace(date_add('2026-02-01',-7),'-','')
) as t
where dt = '20260201';
4) Develop the scripts
Scripts for the DWS layer
(1) Table initialization script (run once at setup):
dws_mall_init_table_1.sh
#!/bin/bash
# Requirement 1: daily new-user metrics
# Initializes the DWS-layer database and tables; only needs to run once
# The tmp table below is created fresh each day and dropped after use,
# so its create statement is placed in the partition-adding script instead:
#create table if not exists dws_mall.dws_user_active_20260201_tmp(
# xaid string,
# times int
#);
hive -e "
create database if not exists dws_mall;
create external table if not exists dws_mall.dws_user_active_history(
xaid string,
times int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_active_history/';
create external table if not exists dws_mall.dws_user_new_item(
xaid string
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_new_item/';
"
(2) Partition-adding script (run once per day):
dws_mall_add_partition_1.sh
#!/bin/bash
# Requirement 1: daily new-user metrics
# Runs once per day after midnight
# Defaults to yesterday's date; a specific date can also be passed as an argument
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 days ago"`
else
dt=$1
fi
hive -e "
create table if not exists dws_mall.dws_user_active_${dt}_tmp(
xaid string,
times int
);
insert overwrite table dws_mall.dws_user_active_${dt}_tmp select
xaid,
count(*) as times
from dwd_mall.dwd_user_active
where dt = '${dt}'
group by xaid;
-- Note: to keep re-runs safe, each execution first drops the given partition of dws_user_active_history,
-- because computing the daily new users requires a join against dws_user_active_history.
alter table dws_mall.dws_user_active_history drop partition(dt='${dt}');
insert overwrite table dws_mall.dws_user_new_item partition(dt='${dt}') select
duat.xaid
from dws_mall.dws_user_active_${dt}_tmp duat
left join (select xaid from dws_mall.dws_user_active_history group by xaid) duah
on duat.xaid = duah.xaid
where duah.xaid is null;
insert overwrite table dws_mall.dws_user_active_history partition(dt='${dt}') select
xaid,times from dws_mall.dws_user_active_${dt}_tmp;
"
Scripts for the APP layer
1: Table initialization script (run once at setup)
app_mall_init_table_1.sh
#!/bin/bash
# Requirement 1: daily new-user metrics
# Initializes the APP-layer database and tables; only needs to run once
hive -e "
create database if not exists app_mall;
create external table if not exists app_mall.app_user_new_count(
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_new_count';
create external table if not exists app_mall.app_user_new_count_ratio(
num int,
day_ratio double,
week_ratio double
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_new_count_ratio';
"
2: Partition-adding script (run once per day)
app_mall_add_partition_1.sh
#!/bin/bash
# Requirement 1: daily new-user metrics
# Runs once per day after midnight
# Defaults to yesterday's date; a specific date can also be passed as an argument
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 days ago"`
else
dt=$1
fi
# Convert the date format: 20260201 -> 2026-02-01
dt_new=`date +%Y-%m-%d --date="${dt}"`
hive -e "
insert overwrite table app_mall.app_user_new_count partition(dt='${dt}') select
count(*) as num
from dws_mall.dws_user_new_item
where dt = '${dt}';
insert overwrite table app_mall.app_user_new_count_ratio partition(dt='${dt}') select
num,
(num-num_1)/num_1 as day_ratio,
(num-num_7)/num_7 as week_ratio
from
(
select
dt,
num,
lead(num,1) over(order by dt desc) as num_1,
lead(num,7) over(order by dt desc) as num_7
from app_mall.app_user_new_count
where dt >=regexp_replace(date_add('${dt_new}',-7),'-','')
) as t
where dt = '${dt}';
"
5) Verify the results
The day-over-day ratio needs at least two days of data and the week-over-week ratio needs at least a week of data,
so we compute at least one week here.
Compute 2026-02-01 through 2026-02-09.
For the DWS layer:
sh dws_mall_add_partition_1.sh 20260201
sh dws_mall_add_partition_1.sh 20260202
sh dws_mall_add_partition_1.sh 20260203
sh dws_mall_add_partition_1.sh 20260204
sh dws_mall_add_partition_1.sh 20260205
sh dws_mall_add_partition_1.sh 20260206
sh dws_mall_add_partition_1.sh 20260207
sh dws_mall_add_partition_1.sh 20260208
sh dws_mall_add_partition_1.sh 20260209
For the APP layer:
sh app_mall_add_partition_1.sh 20260201
sh app_mall_add_partition_1.sh 20260202
sh app_mall_add_partition_1.sh 20260203
sh app_mall_add_partition_1.sh 20260204
sh app_mall_add_partition_1.sh 20260205
sh app_mall_add_partition_1.sh 20260206
sh app_mall_add_partition_1.sh 20260207
sh app_mall_add_partition_1.sh 20260208
sh app_mall_add_partition_1.sh 20260209
Finally, in the APP layer, check the data in the tables app_mall.app_user_new_count and app_mall.app_user_new_count_ratio.