数仓搭建--DWS层

中长跑路上crush

已于 2023-11-22 19:57:35 修改

阅读量270

点赞数 4

文章标签： hadoop mysql hive

于 2023-11-22 19:32:41 首次发布

本文链接：https://blog.csdn.net/weixin_58026490/article/details/134557140

版权

数仓搭建–DW层

DWS层（数据服务层）

02商品主题宽表（中等）

背景:
1.

注意：建表操作需要在hive中进行，presto不支持hive的建表语法(creta table as select语句，一般不使用)

该主题宽表是否需要建为分区表，取决与我们的商品数量的多少
如果商品量很大，且交易量很大需要建成分区表
如果商品量很少，或者每天交易量很少，则可以不必分区

思考一：

什么时候使用大sql来进行计算（销售主题），什么时候使用多个sql，利用CET来计算？
1、当所以指标分析的维度是同一个维度字段，则我们可以写成一个大的sql，这样便于分析
2、但是如果你分析的指标，她来自不同的维度，而这个维度刚好在上一层DWB层没有降维，或者说它需要来自不同的表或库，则可以写成多个sql

注意，即使我们写成了多个SQL，在最后还是可以建成一个张宽表的

思考二：

数仓分层之后，上一层的表只能查询下一层的，能不能跨层？
1、数仓分层之后，上一层的表只能查询下一层的，能不能跨层？

思考三：《两表连接》

由于需求导致我们使用的是多个sql去完成，我们个如何将这些指标计算表合并？什么样的连接方式？
内连接，左连接，左半连接，都会造成数据的“缺失”，在该场景下，我们所计算出来的指标，都是我们所需要的，不允许有缺失，所以不满足
交叉连接：笛卡尔积，能不用就不用
全连接：左右两张表的数据都保留，如果连接不成功则置为空。

-- todo 1 建表
-- todo 1.1 下单表
create table hive.db.t1(
    dt varchar,
    goods_id varchar,
    order_count int,
    order_num int,
    order_amount int
)
;

-- todo 1.2 购物车表
create table hive.db.t2(
    dt varchar,
    goods_id varchar,
    cart_count int,
    cart_num int
)
;

-- todo 1.3 收藏表
create table hive.db.t3(
    dt varchar,
    goods_id varchar,
    favor_count int
)
;

-- todo 2 加载数据
-- todo 2.1 下单表
insert into hive.db.t1 values('2022-04-05', 'g01', 100, 140, 1000);
insert into hive.db.t1 values('2022-04-05', 'g02', 80, 130, 1500);
insert into hive.db.t1 values('2022-04-06', 'g01', 110, 160, 1200);
insert into hive.db.t1 values('2022-04-06', 'g03', 30, 30, 500);

select * from hive.db_3.t1;

-- todo 2.2 购物车表
insert into hive.db.t2 values('2022-04-05', 'g01', 10, 20);
insert into hive.db.t2 values('2022-04-05', 'g03', 20, 30);
insert into hive.db.t2 values('2022-04-06', 'g02', 10, 10);
insert into hive.db.t2 values('2022-04-06', 'g04', 5, 5);

select * from hive.db_3.t2;

-- todo 2.3 收藏表
insert into hive.db.t3 values('2022-04-05', 'g01', 10);
insert into hive.db.t3 values('2022-04-05', 'g02', 50);
insert into hive.db.t3 values('2022-04-05', 'g03', 30);
insert into hive.db.t3 values('2022-04-06', 'g05', 20);

select * from hive.db.t3;

-- todo 1 两表全关联
-- todo 1.1 两表全连接
-- todo 1.2 合并操作 dt, goods_id 转化操作: 如果值为null 转化成 0

-- todo 1 两表全关联
-- todo 1.1 两表全连接
select t1.*,
       t2.*
from hive.db.t1
         full join hive.db.t2 on t1.dt = t2.dt and t1.goods_id = t2.goods_id
order by t1.dt, t1.goods_id
;

-- todo 1.2 合并操作 dt, goods_id 转化操作: 如果值为null 转化成 0
select coalesce(t1.dt, t2.dt)             as dt,
       coalesce(t1.goods_id, t2.goods_id) as goods_id,
       coalesce(order_count, 0)           as order_count,
       coalesce(order_num, 0)             as order_num,
       coalesce(order_amount, 0)          as order_amount,
       coalesce(cart_count, 0)            as cart_count,
       coalesce(cart_num, 0)              as cart_num
from hive.db.t1
         full join hive.db_3.t2 on t1.dt = t2.dt and t1.goods_id = t2.goods_id
order by t1.dt, t1.goods_id

思考四：《三表连接》

假如我们按照t1.id=t2.id t1.id=t3.id连接
会造成这样一种情况，本来t1 t2 t3 三表可以连在一起的，现在被拆分成了两行。如何解决这种问题
根据关联条件进行分组聚合。 group by(关联的字段)

-- todo 2 三表全关联
-- todo 2.1 三表全连接
-- todo 2.2 合并操作 dt, goods_id 转化操作: 如果值为null 转化成 0
-- todo 2.3 分组聚合操作

-- todo 2 三表全关联
-- todo 2.1 三表全连接
select t1.*,
       t2.*,
       t3.*
from hive.db.t1
         full join hive.db_3.t2 on t1.dt = t2.dt and t1.goods_id = t2.goods_id
         full join hive.db_3.t3 on t1.dt = t3.dt and t1.goods_id = t3.goods_id
;
-- todo 2.2 合并操作 dt, goods_id 转化操作: 如果值为null 转化成 0
select
    coalesce(t1.dt, t2.dt, t3.dt) as dt,
    coalesce(t1.goods_id, t2.goods_id, t3.goods_id) as goods_id,
    coalesce(order_count, 0) as order_count,
    coalesce(order_num, 0) as order_num,
    coalesce(order_amount, 0) as order_amount,
    coalesce(cart_count, 0) as cart_count,
    coalesce(cart_num, 0) as cart_num,
    coalesce(favor_count, 0) as favor_count
from hive.db.t1
         full join hive.db.t2 on t1.dt = t2.dt and t1.goods_id = t2.goods_id
         full join hive.db.t3 on t1.dt = t3.dt and t1.goods_id = t3.goods_id
order by dt, goods_id
;

-- todo 2.2 分组聚合操作
with t4 as (
    select
        coalesce(t1.dt, t2.dt, t3.dt) as dt,
        coalesce(t1.goods_id, t2.goods_id, t3.goods_id) as goods_id,
        coalesce(order_count, 0) as order_count,
        coalesce(order_num, 0) as order_num,
        coalesce(order_amount, 0) as order_amount,
        coalesce(cart_count, 0) as cart_count,
        coalesce(cart_num, 0) as cart_num,
        coalesce(favor_count, 0) as favor_count
    from hive.db_3.t1
             full join hive.db_3.t2 on t1.dt = t2.dt and t1.goods_id = t2.goods_id
             full join hive.db_3.t3 on t1.dt = t3.dt and t1.goods_id = t3.goods_id
    order by dt, goods_id
)
select
    dt,
    goods_id,
    sum(order_count) as order_count,
    sum(order_num) as order_num,
    sum(order_amount) as order_amount,
    sum(cart_count) as cart_count,
    sum(cart_num) as cart_num,
    sum(favor_count) as favor_count
from t4
group by dt, goods_id
order by dt, goods_id
;