【企业流行新数仓】Day02:DWS层(按日分区的宽表)、DWT层(全量累计表)、ADS层、总结...

一、DWS层

1、概括

dwd层的数据,每日轻度聚合,建宽表

表名粒度
dws_uv_detail_daycount一个设备是一行
dws_user_action_daycount(只统计今天登录的会员)一个会员是一行
dws_sku_action_daycount(只统计被下单或平均或支付或加购或收藏的商品)一个商品是一行
dws_coupon_use_daycount(只统计未过期的优惠券)一个优惠券是一行
dws_activity_info_daycount(统计所有活动)一个活动是一行
dws_sale_detail_daycount(每日购买数据)一个用户购买的一款商品是一行

2、dws_uv_detail_daycount(每日设备行为)-一台设备有多个行为,每列的多个行为进行字符串拼接

(1)建表

create external table dws_uv_detail_daycount
(
    -- 从启动日志dwd_start_log表取以下字段
    `mid_id` string COMMENT '设备唯一标识',
    `user_id` string COMMENT '用户标识',
    `version_code` string COMMENT '程序版本号', 
    `version_name` string COMMENT '程序版本名', 
    `lang` string COMMENT '系统语言', 
    `source` string COMMENT '渠道号', 
    `os` string COMMENT '安卓系统版本', 
    `area` string COMMENT '区域', 
    `model` string COMMENT '手机型号', 
    `brand` string COMMENT '手机品牌', 
    `sdk_version` string COMMENT 'sdkVersion', 
    `gmail` string COMMENT 'gmail', 
    `height_width` string COMMENT '屏幕宽高',
    `app_time` string COMMENT '客户端日志产生时的时间',
    `network` string COMMENT '网络模式',
    `lng` string COMMENT '经度',
    `lat` string COMMENT '纬度',
    -- 从启动日志dwd_start_log表按照mid_id进行聚合,之后count(*)取以下字段
    `login_count` bigint COMMENT '活跃次数'
)

(2)数据导入

insert overwrite table dws_uv_detail_daycount PARTITION(dt='2020-05-06')
select 
    mid_id,
    concat_ws('|',collect_set(user_id)),
    concat_ws('|',collect_set(version_code)),
    concat_ws('|',collect_set(version_name)),
    concat_ws('|',collect_set(lang)),
    concat_ws('|',collect_set(source)),
    concat_ws('|',collect_set(os)),
    concat_ws('|',collect_set(area)),
    concat_ws('|',collect_set(model)),
    concat_ws('|',collect_set(brand)),
    concat_ws('|',collect_set(sdk_version)),
    concat_ws('|',collect_set(gmail)),
    concat_ws('|',collect_set(height_width)),
    concat_ws('|',collect_set(app_time)),
    concat_ws('|',collect_set(network)),
    concat_ws('|',collect_set(lng)),
    concat_ws('|',collect_set(lat)),
    count(*)
FROM dwd_start_log where dt='2020-05-06'
GROUP by mid_id

3、dws_user_action_daycount(每日会员行为)

用户登录、加购、下单、支付次数及金额

导入数据时,类似于建表/视图操作

witht1 
as (select user_id,count(*) login_count  from dwd_start_log where dt='2020-05-06' and user_id is not NULL GROUP BY user_id),
t3 as (select user_id,count(*) order_count, sum(final_total_amount) order_amount from dwd_fact_order_info where dt='2020-05-06' GROUP by user_id ),
t4 as (select user_id,count(*) payment_count,sum(payment_amount) payment_amount from dwd_fact_payment_info where dt='2020-05-06' GROUP by user_id),
t2 as (select user_id,count(*) cart_count,sum(cart_price*sku_num) cart_amount from dwd_fact_cart_info where dt='2020-05-06' and date_format(create_time,'yyyy-MM-dd')='2020-05-06' GROUP by user_id )
insert overwrite TABLE dws_user_action_daycount PARTITION(dt='2020-05-06')select   t1.user_id,login_count,   nvl(cart_count,0),   nvl(cart_amount,0),   nvl(order_count,0),  
nvl(order_amount,0),   nvl(payment_count,0),   nvl(payment_amount,0)from t1 left join t2 on t1.user_id=t2.user_idleft join t3 on t1.user_id=t3.user_idleft join t4 on t1.user_id=t4.user_id

4、dws_sku_action_daycount(每日商品行为)

被下单次数、被支付、退款、加购、好评、差评次数

with 
t1 as
(select  sku_id,count(*) order_count,sum(sku_num) order_num,
sum(total_amount) order_amount
from dwd_fact_order_detail where dt='2020-05-06'
GROUP by sku_id),
t2 as
(select 
    sku_id,
    sum(sku_num)  payment_num,sum(total_amount) payment_amount,
    count(*) payment_count
from
(SELECT order_id,sku_id,sku_num,total_amount from dwd_fact_order_detail 
where dt='2020-05-06' or dt=date_sub('2020-05-06',1)) tmp1
 join
(select order_id from dwd_fact_payment_info where dt='2020-05-06') tmp2
on tmp1.order_id=tmp2.order_id
GROUP by sku_id),
t3 as 
(SELECT sku_id,
        count(*) refund_count,sum(refund_num) refund_num,
        sum(refund_amount) refund_amount
from dwd_fact_order_refund_info where dt='2020-05-06'
GROUP by sku_id),
t4 as
(select sku_id,
        count(*) cart_count,sum(sku_num) cart_num
from dwd_fact_cart_info where dt='2020-05-06' and sku_num>0
GROUP by sku_id),
t5 as
(SELECT sku_id,
        count(*) favor_count      
from dwd_fact_favor_info where dt='2020-05-06' and is_cancel=0
group by sku_id),
t6 as
(SELECT sku_id,
sum(if(appraise='1201',1,0)) appraise_good_count,
sum(if(appraise='1202',1,0)) appraise_mid_count,
sum(if(appraise='1203',1,0)) appraise_bad_count,
sum(if(appraise='1204',1,0)) appraise_default_count
from  dwd_fact_comment_info where dt='2020-05-06'
group by sku_id)
insert overwrite table dws_sku_action_daycount partition(dt='2020-05-06')
SELECT
nvl(nvl(nvl(nvl(nvl(t1.sku_id,t2.sku_id),t3.sku_id),t4.sku_id),t5.sku_id),t6.sku_id),
    nvl(order_count,0), 
    nvl(order_num,0),
    nvl(order_amount,0),
    nvl(payment_count,0),
    nvl(payment_num,0),
    nvl(payment_amount,0),
    nvl(refund_count,0),
    nvl(refund_num,0),
    nvl(refund_amount,0),
    nvl(cart_count,0),
    nvl(cart_num,0),
    nvl(favor_count,0),
    nvl(appraise_good_count,0),
    nvl(appraise_mid_count,0),
    nvl(appraise_bad_count,0),
    nvl(appraise_default_count,0)
from t1 
full join t2 on t1.sku_id=t2.sku_id
full join t3 on t1.sku_id=t3.sku_id
full join t4 on t1.sku_id=t4.sku_id
full join t5 on t1.sku_id=t5.sku_id
full join t6 on t1.sku_id=t6.sku_id

5、dws_coupon_use_daycount(每日优惠券使用行为)

范围、商品id、品牌、品类、领用次数、下单次数

insert overwrite table dws_coupon_use_daycount PARTITION(dt='2020-05-06')
select 
  t1.id coupon_id,coupon_name, coupon_type, condition_amount, 
condition_num, activity_id, benefit_amount, benefit_discount,
create_time, range_type, spu_id, tm_id, category3_id, limit_num,
    get_count,using_count, used_count
from
(SELECT *
from dwd_dim_coupon_info 
where dt='2020-05-06' and nvl(expire_time,'9999-99-99') >'2020-05-06') t1
left join
(select coupon_id,
sum(if(date_format(get_time,'yyyy-MM-dd')='2020-05-06',1,0)) get_count,
sum(if(date_format(using_time,'yyyy-MM-dd')='2020-05-06',1,0)) using_count,
sum(if(date_format(used_time,'yyyy-MM-dd')='2020-05-06',1,0)) used_count 
from dwd_fact_coupon_use
GROUP by coupon_id) t2
on t1.id=t2.coupon_id

6、dws_activity_info_daycount(每日活动行为)

活动类型、时间、下单、支付次数

with 
 t1 as
 (select 
    id,activity_name,activity_type,
    start_time,end_time,create_time  
 from dwd_dim_activity_info 
 where dt='2020-05-06'
 GROUP by id,activity_name,activity_type,
    start_time,end_time,create_time),
 t2 as 
 (select 
    activity_id,count(*) order_count
 from dwd_fact_order_info 
 where dt='2020-05-06'
 GROUP by activity_id),
  t5 as 
 (SELECT
    activity_id,count(*) payment_count
 from
 (SELECT order_id,id from dwd_fact_payment_info where dt='2020-05-06') t3
  join
 (SELECT id,activity_id from dwd_fact_order_info WHERE dt='2020-05-06' or dt=date_sub('2020-05-06',1)) t4
  on   t3.order_id=t4.id
  GROUP by activity_id)
  insert overwrite table dws_activity_info_daycount partition(dt='2020-05-06')
  SELECT
    t1.id,activity_name, activity_type, 
  start_time, end_time, create_time, 
  nvl(order_count,0),
  nvl(payment_count,0)
  from t1 
  left join t2 on t1.id=t2.activity_id
  left join t5 on t1.id=t5.activity_id

7、dws_sale_detail_daycount(每日用户购买商品详情)

用户、商品、sku、购买次数、下单次数、下单金额

GROUP by user_id,sku_id)
insert overwrite table dws_sale_detail_daycount PARTITION(dt='2020-05-06')
SELECT
    t7.user_id, t7.sku_id, user_gender, user_age, 
    user_level, order_price, sku_name, sku_tm_id,
    sku_category3_id, sku_category2_id, sku_category1_id, 
    sku_category3_name, sku_category2_name, 
    sku_category1_name, spu_id, sku_num, order_count,
    order_amount
from
(select 
    nvl(t3.user_id,t4.user_id) user_id,
    nvl(t3.sku_id,t4.sku_id) sku_id,
    nvl(order_count,0) order_count, 
    nvl(order_amount,0) order_amount,
    nvl(sku_num,0) sku_num 
FROM t3 full join t4 on 
t3.user_id=t4.user_id and t3.sku_id=t4.sku_id) t7
 join t1 on t7.user_id=t1.user_id 
 join t2 on t7.sku_id=t2.sku_id

二、DWT层-当前表及DWS表中按日期的汇总(合并及更新)

1、概述

将DWS层每日聚合的数据进行累积

不是分区表,是一个累积型全量表

累积型全量表: ①查询要改动的旧数据  ②查询新增和变化的新数据  ③新旧关联,以新换旧  ④导入覆盖

2、dwt_uv_topic

create external table dwt_uv_topic

用户及设备信息、首次活跃、当日活跃、末次活跃、累计活跃天数

今天未登录的老用户:new.mid_id is null

老用户:old.mid_id is not null

新用户:old.mid_id is null

今天登录的老用户:new.mid_id is not null and old.mid_id is not null

insert overwrite table gmall.dwt_uv_topic
select 
    nvl(old.mid_id,new.mid_id),
    concat_ws('|',old.user_id,new.user_id),
    concat_ws('|',old.version_code,new.version_code),
    concat_ws('|',old.version_name,new.version_name),
    concat_ws('|',old.lang,new.lang),
    concat_ws('|',old.source,new.source),
    concat_ws('|',old.os,new.os),
    concat_ws('|',old.area,new.area),
    concat_ws('|',old.model,new.model),
    concat_ws('|',old.brand,new.brand),
    concat_ws('|',old.sdk_version,new.sdk_version),
    concat_ws('|',old.gmail,new.gmail),
    concat_ws('|',old.height_width,new.height_width),
    concat_ws('|',old.app_time,new.app_time),
    concat_ws('|',old.network,new.network),
    concat_ws('|',old.lng,new.lng),
    concat_ws('|',old.lat,new.lat),
    nvl(old.login_date_first,'2020-05-06') login_date_first,
    IF(new.mid_id is null,old.login_date_last,'2020-05-06') login_date_last, 
    nvl(new.login_count,0) login_day_count, 
    nvl(old.login_count,0)+if(new.login_count is not null,1,0) login_count
from
dwt_uv_topic old
full join
(select * from dws_uv_detail_daycount where dt='2020-05-06') new
on old.mid_id=new.mid_id

3、dwt_user_topic

用户首末次登录、下单时间、天数,累计、最近30天下单支付金额、次数

insert overwrite table dwt_user_topic
SELECT
    t1.user_id,login_date_first, 
login_date_last, login_count, nvl(login_last_30d_count,0), 
order_date_first, order_date_last, order_count, order_amount,
nvl(order_last_30d_count,0), nvl(order_last_30d_amount,0), payment_date_first, 
payment_date_last, payment_count, payment_amount, nvl(payment_last_30d_count,0), 
nvl(payment_last_30d_amount,0)
from
 (SELECT
    nvl(old.user_id,new.user_id) user_id,
    nvl(old.login_date_first,'2020-05-06') login_date_first,
    nvl(old.order_date_first,if(new.order_count>0,'2020-05-06',null)) order_date_first,
    nvl(old.payment_date_first,if(new.payment_count>0,'2020-05-06',null))  payment_date_first,
    if(new.user_id is null,old.login_date_last,'2020-05-06') login_date_last,
    if(new.order_count>0,'2020-05-06',old.order_date_last) order_date_last,
    if(new.payment_count>0,'2020-05-06',old.payment_date_last) payment_date_last,
    nvl(old.login_count,0)+if(new.user_id is not null,1,0) login_count,
    nvl(old.order_count,0)+nvl(new.order_count,0) order_count,
    nvl(old.order_amount,0)+nvl(new.order_amount,0) order_amount,
    nvl(old.payment_count,0)+nvl(new.payment_count,0) payment_count,
    nvl(old.payment_amount,0)+nvl(new.payment_amount,0) payment_amount 
 from
 dwt_user_topic old
 full join (select * from dws_user_action_daycount where dt='2020-05-06') new
 on old.user_id=new.user_id) t1
 left join
 ( 
  SELECT
    user_id,
    sum(order_count) order_last_30d_count,
    sum(order_amount) order_last_30d_amount,
    sum(payment_count) payment_last_30d_count,
    sum(payment_amount) payment_last_30d_amount,
    count(*) login_last_30d_count
  FROM dws_user_action_daycount
  where dt BETWEEN date_sub('2020-05-06',29) and '2020-05-06'
  GROUP by user_id) t2
  on t1.user_id=t2.user_id

4、dwt_sku_topic

 最近30日及累计下单、支付、退款、加购、收藏、好中差评次数

create external table dwt_sku_topic
(
    sku_id string comment 'sku_id',
    spu_id string comment 'spu_id',
    -- 从dws_sku_action_daycount  取,where 30天之前<=dt<=今天,sum()
    order_last_30d_count bigint comment '最近30日被下单次数',
insert overwrite TABLE dwt_sku_topic
SELECT
    t2.sku_id, t2.spu_id, 
    nvl(order_last_30d_count,0), 

5、dwt_coupon_topic

优惠券当日及累计领用、下单、支付次数

insert overwrite table dwt_coupon_topic
select 
    nvl(old.coupon_id,new.coupon_id) coupon_id,
    nvl(new.get_count,0) get_day_count,
    nvl(new.using_count,0) using_day_count,
    nvl(new.used_count,0) used_day_count,
    nvl(old.get_count,0)+nvl(new.get_count,0) get_count, 
    nvl(old.get_count,0)+nvl(new.using_count,0) using_count,
    nvl(old.get_count,0)+nvl(new.used_count,0) used_count
from dwt_coupon_topic old 
full join (select * from dws_coupon_use_daycount where dt='2020-05-06')new 
on old.coupon_id=new.coupon_id

6、dwt_activity_topic

活动当日及累计下单、支付次数

insert overwrite table dwt_activity_topic
select 
    nvl(old.id,new.id) id,
    nvl(old.activity_name,new.activity_name) activity_name,
    nvl(new.order_count,0) order_day_count,
    nvl(new.payment_count,0) payment_day_count,
    nvl(old.order_count,0)+nvl(new.order_count,0) order_count, 
    nvl(old.payment_count,0)+nvl(new.payment_count,0) payment_count
from dwt_activity_topic old 
full join (select * from dws_activity_info_daycount where dt='2020-05-06')new 
on old.id=new.id

三、ADS层

1、概述

将需求根据要查询的数据源进行分类

同一类需求创建一张表进行统计

创建的表均为全量表

2、构造数据

将集群的时间,调整到要导入数据的前一天

上传jar包

启动采集通道,启动hive

执行脚本

3、设备主题

(1)活跃设备数(日、周、月)

从dws_uv_daycount 或 dwt_uv_topic 表取数据

日活、周活、月活【字段:是否是周末或月末】:至少活跃一次

create external table ads_uv_count( 
    `dt` string COMMENT '统计日期',
    // 从dws层取当天的,也可以从dwt层取
    `day_count` bigint COMMENT '当日用户数量',
    // 从dws层取当周的,也可以从dwt层取
    `wk_count`  bigint COMMENT '当周用户数量',
    // 从dws层取当月的,也可以从dwt层取
    `mn_count`  bigint COMMENT '当月用户数量',
    // 借助next_day()
    `is_weekend` string COMMENT 'Y,N是否是周末,用于得到本周最终结果',
    // 借助last_day()
    `is_monthend` string COMMENT 'Y,N是否是月末,用于得到本月最终结果' 
) COMMENT '活跃设备数'
insert into table ads_uv_count
 SELECT
    '2020-05-06',day_count,wk_count,mn_count,
    if('2020-05-06'=date_sub(next_day('2020-05-06','MO'),1),'Y','N') is_weekend,
    if('2020-05-06'=last_day('2020-05-06'),'Y','N') is_monthend
 from
 (SELECT '2020-05-06' dt,count(*)  day_count

(2)每日新增设备, login_date_first=今天

insert into ads_new_mid_count
SELECT 
    '2020-05-06' create_date, 
    count(*) new_mid_count
FROM dwt_uv_topic
where login_date_first='2020-05-06';

(3)沉默用户数

只在安装当天启动过: login_date_first='当天'= login_date_last

启动时间是在7天前: login_date_last< 今天的7天前

insert into table ads_silent_count
SELECT
    '2020-05-06',
    count(*)
from dwt_uv_topic
where login_date_first=login_date_last
      and
      login_date_last<date_sub('2020-05-06',7)

(4)本周回流用户数

本周登录过的,没在上周登录过的老用户数

本周活跃与上周活跃,两个结果集使用left join后取差集(本周活跃但上周不活跃):  

on t1.mid_id=t2.mid_id
where t2.mid_id is null

login_date_last >= date_sub(next_day('2020-05-06','MO'),7)

(5)流失用户数:连续7天未活跃的设备

login_date_last<date_sub('2020-05-06',7)

(6)留存率:留存用户 占  某天新增用户的 比率

某天新增的用户中,在n天后继续使用的用户称为留存用户

①某一天新增的人数

②留存的天数,留存的日期=新增的天数+留存的天数

③取留存日期当天的留存人数

(7)最近连续三周活跃用户数

用户在这三周中,都至少需要出现一次

(8)最近七天内连续三天活跃用户数

4、会员主题

(1)会员信息

用户新鲜率、活跃率、付费率

cast(sum(if(login_date_last='2020-05-19',1,0)) / count(*) * 100 as decimal(10,2)) day_users2users

(2)转化率

访问/加购,加购/下单,下单/支付

cast( sum(if(payment_count>0,1,0)) / sum(if(order_count>0,1,0)) * 100 as decimal(10,2))

5、商品主题

(1)商品个数信息-各个商品的种类数

insert into table ads_product_info
SELECT
    '2020-05-19' dt,
    count(*) sku_num,
    count(DISTINCT spu_id) spu_num
from dwt_sku_topic

(2)商品累积销量排名

FROM dwt_sku_topic
where payment_num>0
order by payment_num desc
limit 10

(3)商品收藏排名

(4)加入购物车排名

(5)最近30天退款率

(6)差评率排名

6、营销主题

 (1)每日下单统计ads_order_daycount

(2)每日支付统计ads_payment_daycount

(3)品牌的月复购率ads_sale_tm_category1_stat_mn

单次、多次复购率

四、总结

1、数据来源

 

 2、各层数据的来源于导入

数据源建模如何导入数据备注
hdfs  采用lzo压缩的格式
ODS原数据有几个字段是什么类型,就怎么建模必须指定ODS的表使用能够读取LZO压缩格式的输入格式,为LZO格式创建索引 
用户行为DWD用户行为数据根据不同类型数据的字段明细,进行建模启动日志: get_json_object 事件日志: 自定义UDF,UDTF,将事件日志中的每个事件,解析到一个base_event表中,再使用get_json_object展开事件明细。 
业务数据DWD维度表:维度退化,将多个同一类型维度的字段合并到一张表中。事实表:采取星型模型,基于3w原则,按照选取业务线---确认粒度---选取维度---选取度量进行建模维度表:多表Join 事实表:选择一张事实表作为主表,通过外键关联维度表,选取维度字段。再选取度量! 
  事务型事实表:选取ods层某一天分区的数据,再关联维度表,选取维度字段,再选取度量! 
  周期型快照事实表:直接从ODS层全量导入(加入购物车,收藏表) 
  累积型快照事实表: 按照事实发生最初的事件作为分区字段!①选择要覆盖的老的分区的所有数据②选取今日新增和变化的新数据③新旧交替,以新换旧④覆盖到指定的分区 
  拉链表(缓慢变化维度):old left join new ,将old中过期的数据的end_date修改为new中start_date的前一天。 再union all new。导入到临时表,再导入到原表 
dws层紧紧贴合需求。将同一类型的需求,汇总,分类,以某个需求的统计目标为主题(设备,用户,商品,优惠券,活动,购买行为),创建宽表取dwd层每日最新的分区,进行多表关联 
dwt层紧紧贴合需求。将同一类型的需求,汇总,分类,以某个需求的统计目标为主题(设备,用户,商品,优惠券,活动,购买行为),创建宽表dwt full join dws 当日分区的数据①新旧交替,以新换旧②覆盖原表 
ads紧紧贴合需求。将同一类型的需求,汇总,分类,以某个需求的统计目标为主题(用户,商品,会员,营销),取某一天的历史切片数据,从dws层取,如果要取当前的数据或累计状态,从dwt层取 
导出mysql update_mode: allowinsert update-key: dt 
  • 0
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值