离线数仓—DWS层交易域汇总表的设计实现
前言
前面使用了一个例子简单说明了DWS层的分析和设计,接下来真正进行DWS层的设计和实现。
首先进行交易域汇总表的设计。
一、交易域用户商品粒度订单汇总表
1.用户商品粒度订单表的说明
为什么要建用户商品粒度订单表?
原因:在上一节中说过,是为了增加表的公用性。
2.最近1日汇总表
2.1 建表语句
DROP TABLE IF EXISTS dws_trade_user_sku_order_1d;
CREATE EXTERNAL TABLE dws_trade_user_sku_order_1d
(
`user_id` STRING COMMENT '用户id',
`sku_id` STRING COMMENT 'sku_id',
`sku_name` STRING COMMENT 'sku名称',
`category1_id` STRING COMMENT '一级分类id',
`category1_name` STRING COMMENT '一级分类名称',
`category2_id` STRING COMMENT '一级分类id',
`category2_name` STRING COMMENT '一级分类名称',
`category3_id` STRING COMMENT '一级分类id',
`category3_name` STRING COMMENT '一级分类名称',
`tm_id` STRING COMMENT '品牌id',
`tm_name` STRING COMMENT '品牌名称',
`order_count_1d` BIGINT COMMENT '最近1日下单次数',
`order_num_1d` BIGINT COMMENT '最近1日下单件数',
`order_original_amount_1d` DECIMAL(16, 2) COMMENT '最近1日下单原始金额',
`activity_reduce_amount_1d` DECIMAL(16, 2) COMMENT '最近1日活动优惠金额',
`coupon_reduce_amount_1d` DECIMAL(16, 2) COMMENT '最近1日优惠券优惠金额',
`order_total_amount_1d` DECIMAL(16, 2) COMMENT '最近1日下单最终金额'
) COMMENT '交易域用户商品粒度订单最近1日汇总事实表'
PARTITIONED BY (`dt` STRING)
STORED AS ORC
LOCATION '/warehouse/gmall/dws/dws_trade_user_sku_order_1d'
TBLPROPERTIES ('orc.compress' = 'snappy');
语句说明:
表中sku_name这些进行了维度退化的操作,order_original_amount_1d、activity_reduce_amount_1d这些字段是为了适应更多的需求,目前来看用不到,后面用到了再回来看。
2.2 首日数据装载
假设2020-06-14是数仓的第一天,因为在业务数据库中可能有多天的业务数据,所以在DWD层中可能存在2020-06-13、2020-06-12、2020-06-11等等日期的数据,所以首日装载时,要对这些历史数据也要进行处理。
装载语句如下:
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dws_trade_user_sku_order_1d partition(dt)
select
user_id,
id,
sku_name,
category1_id,
category1_name,
category2_id,
category2_name,
category3_id,
category3_name,
tm_id,
tm_name,
order_count_1d,
order_num_1d,
order_original_amount_1d,
activity_reduce_amount_1d,
coupon_reduce_amount_1d,
order_total_amount_1d,
dt
from
(
select
dt,
user_id,
sku_id,
count(*) order_count_1d,
sum(sku_num) order_num_1d,
sum(split_original_amount) order_original_amount_1d,
sum(nvl(split_activity_amount,0.0)) activity_reduce_amount_1d,
sum(nvl(split_coupon_amount,0.0)) coupon_reduce_amount_1d,
sum(split_total_amount) order_total_amount_1d
from dwd_trade_order_detail_inc
group by dt,user_id,sku_id
)od
left join
(
select
id,
sku_name,
category1_id,
category1_name,
category2_id,
category2_name,
category3_id,
category3_name,
tm_id,
tm_name
from dim_sku_full
where dt='2020-06-14'
)sku
on od.sku_id=sku.id;
2.3 每日数据装载
每日数据装载时,不用考虑多个分区的问题,把当天的数据放到当天的分区即可。
装载语句如下:
insert overwrite table dws_trade_user_sku_order_1d partition(dt='2020-06-15')
select
user_id,
id,
sku_name,
category1_id,
category1_name,
category2_id,
category2_name,
category3_id,
category3_name,
tm_id,
tm_name,
order_count,
order_num,
order_original_amount,
activity_reduce_amount,
coupon_reduce_amount,
order_total_amount
from
(
select
user_id,
sku_id,
count(*) order_count,
sum(sku_num) order_num,
sum(split_original_amount) order_original_amount,
sum(nvl(split_activity_amount,0)) activity_reduce_amount,
sum(nvl(split_coupon_amount,0)) coupon_reduce_amount,
sum(split_total_amount) order_total_amount
from dwd_trade_order_detail_inc
where dt='2020-06-15'
group by user_id,sku_id
)od
left join
(
select
id,
sku_name,
category1_id,
category1_name,
category2_id,
category2_name,
category3_id,
category3_name,
tm_id,
tm_name
from dim_sku_full
where dt='2020-06-15'
)sku
on od.sku_id=sku.id;
注意说明:
1)在一天中,可能一个用户对一件商品下了多个订单,可能会出现一个订单使用了优惠券,一个订单没有使用优惠券的情况,没有使用优惠券时split_activity_amount的值为null,进行sum求和的结果也为null,所以要对上面的字段进行处理,活动优惠字段也一样。
sum(nvl(split_activity_amount,0)) activity_reduce_amount
3.最近n日汇总表
3.1 建表语句
DROP TABLE IF EXISTS dws_trade_user_sku_order_nd;
CREATE EXTERNAL TABLE dws_trade_user_sku_order_nd
(
`user_id` STRING COMMENT '用户id',
`sku_id` STRING COMMENT 'sku_id',
`sku_name` STRING COMMENT 'sku名称',
`category1_id` STRING COMMENT '一级分类id',
`category1_name` STRING COMMENT '一级分类名称',
`category2_id` STRING COMMENT '二级分类id',
`category2_name` STRING COMMENT '二级分类名称',
`category3_id` STRING COMMENT '三级分类id',
`category3_name` STRING COMMENT '三级分类名称',
`tm_id` STRING COMMENT '品牌id',
`tm_name` STRING COMMENT '品牌名称',
`order_count_7d` STRING COMMENT '最近7日下单次数',
`order_num_7d` BIGINT COMMENT '最近7日下单件数',
`order_original_amount_7d` DECIMAL(16, 2) COMMENT '最近7日下单原始金额',
`activity_reduce_amount_7d` DECIMAL(16, 2) COMMENT '最近7日活动优惠金额',
`coupon_reduce_amount_7d` DECIMAL(16, 2) COMMENT '最近7日优惠券优惠金额',
`order_total_amount_7d` DECIMAL(16, 2) COMMENT '最近7日下单最终金额',
`order_count_30d` BIGINT COMMENT '最近30日下单次数',
`order_num_30d` BIGINT COMMENT '最近30日下单件数',
`order_original_amount_30d` DECIMAL(16, 2) COMMENT '最近30日下单原始金额',
`activity_reduce_amount_30d` DECIMAL(16, 2) COMMENT '最近30日活动优惠金额',
`coupon_reduce_amount_30d` DECIMAL(16, 2) COMMENT '最近30日优惠券优惠金额',
`order_total_amount_30d` DECIMAL(16, 2) COMMENT '最近30日下单最终金额'
) COMMENT '交易域用户商品粒度订单最近n日汇总事实表'
PARTITIONED BY (`dt` STRING)
STORED AS ORC
LOCATION '/warehouse/gmall/dws/dws_trade_user_sku_order_nd'
TBLPROPERTIES ('orc.compress' = 'snappy');
3.2 数据装载
说明:假如第一天日期是2020-06-14,实际上数仓dws_trade_user_sku_order_1d这个表中可能有以前的数据,但是这里没有区分首日数据装载和每日数据装载,因为即使有以前的数据,算出来也没有什么意义,所以不去计算之前的数据(而且也没想到什么好的方法能一下计算出来以前的所有数据)
insert overwrite table dws_trade_user_sku_order_nd partition(dt='2020-06-14')
select
user_id,
sku_id,
sku_name,
category1_id,
category1_name,
category2_id,
category2_name,
category3_id,
category3_name,
tm_id,
tm_name,
sum(if(dt>=date_add('2020-06-14',-6),order_count_1d,0)),
sum(if(dt>=date_add('2020-06-14',-6),order_num_1d,0)),
sum(if(dt>=date_add('2020-06-14',-6),order_original_amount_1d,0)),
sum(if(dt>=date_add('2020-06-14',-6),activity_reduce_amount_1d,0)),
sum(if(dt>=date_add('2020-06-14',-6),coupon_reduce_amount_1d,0)),
sum(if(dt>=date_add('2020-06-14',-6),order_total_amount_1d,0)),
sum(order_count_1d),