50万年薪大数据大佬学习总结之电商数仓系统(二)

电商数仓系统(二)


##一、DWS/DWT准备

1.1 业务术语

-- 1. 用户
    a、统计流量相关的需求:使用设备id,如统计活跃用户
    b、统计业务相关的指标,使用用户id,如统计下单数量

--2. 新增用户
    a、如果一个用户首次打开某APP,那这个用户定义为新增用户;
    b、卸载再安装的设备,不会被算作一次新增。
    c、新增用户包括日新增用户、周新增用户、月新增用户。

-- 3. 活跃用户
	a、打开应用的用户即为活跃用户,不考虑用户的使用情况。
	b、每天一台设备打开多次会被计为一个活跃用户

-- 4. 周活跃用户
	a、某个自然周(月)内启动过应用的用户;
	b、该周(月)内的多次启动只记一个活跃用户

-- 5. 月活跃率
	a、当月活跃用户 / 总用户数

-- 6. 沉默用户
	a、用户仅在安装当天(次日)启动一次,后续时间无再启动行为
	b、指标可以反映新增用户质量和用户与APP的匹配程度

-- 7. 版本分布
	a、不同版本的周内各天新增用户数,活跃用户数和启动次数
	b、判断APP各个版本之间的优劣和用户行为习惯

-- 8. 本周回流用
	a、上周未启动应用, 本周启动了应用的用户,且不是本周新增用户

-- 9. 连续n周
	a、连续n周, 每周至少启动一次

-- 10. 忠诚用户
    a、连续活跃5周以上的用户

-- 10. 连续活跃用户
	a、连续两周以上的活跃用户

-- 11. 近期流失用户
	a、连续n(2-4)周没有启动应用的用户

-- 12. 留存用户
	a、某段时间内的新增用户,经过一段时间后,仍然使用应用的被认作是留存用户;这部分用户占当时新增用户的比例即是留存率。
	b、例如,5月份新增用户200,这200人启动情况:
	  6月份:启动人数为100人,5月份新增用户一个月后的留存率是50%
	  7月份:启动人数为80人,5月份新增用户二个月后的留存率是40%
	  8月份:启动人数为100人,5月份新增用户三个月后的留存率是50%
	c、留存用户一般是统计留存率,同时必须有两个参数:哪个月份的几月的留存率

-- 13. 用户新鲜度
	a、每天启动应用的新老用户比例
	b、用户新鲜度 = 新用户数量 / 当日活跃用户数 

-- 14. 单次使用时长
	a、每次启动使用的时间长度

-- 15. 日使用时长
	a、累计一天内的使用时间长度。

-- 16. 启动次数计算标准
   '背景':IOS平台应用退到后台,是真的退出应用,后台会释放资源
          Android平台应用退到后台,并不会立即退出应用,因为应用和应用之间存在互相唤醒的问题。
    '统计方式':
        IOS平台应用退到后台就算一次独立的启动
        Android平台我们规定,两次启动之间的间隔小于30秒,被计算一次启动

1.2 系统函数

1.2.1 celloect_set
-- 1. 作用:将一列数据收集变成一行数据,返回一个数组
-- 2. 区别:
      collect_set : 会去重
      collect_list: 不会去重
--  3. 案例:
      1. 数据准备:
          drop table if exists stud;
          create table stud (name string, area string, course string, score int);
      2. 插入数据:
         insert into table stud values('zhang3','bj','math',88);
         insert into table stud values('li4','bj','math',99);
         insert into table stud values('wang5','sh','chinese',92);
         insert into table stud values('zhao6','sh','chinese',54);
         insert into table stud values('tian7','bj','chinese',91);
      3. 查询数据:
         1. select course , collect_set(area),avg(score) from stud group by course
            '打印结果':
            course	_c1	_c2
            chinese	["sh","bj"]	79.0
            math	["bj"]	93.5
         2. select course ,collset_list(area),avg(score) from stud group by course
            '打印结果'
            course	_c1	_c2
            math	["bj","bj"]	93.5
            chinese	["sh","sh","bj"]	79.0
1.2.2 nvl
-- 1. 函数:
     nvl(表达式1,表达式2-- 2. 返回值:
      只能传递两个参数,从左往右找,找到一个非null的值,就返回,如果两个表示式都为null,则返回null
-- 3. 参数说明:
      1.表达式类型:数字型、字符型和日期型
      2.两个表达式的数据类型必须相同
 -- 4. 案例:
       1. select nvl(10,"lianzp"); =>结果:10
       2. select nvl(date_add('2020-06-25',-1),1);
       报错:The first and seconds arguments of function NLV should have the same type
       3. select 'lianzp'+10; => 结果为null
       
1.2.3 coalesce
-- 1.  coalesce(a1, a2, ...) 
-- 2.  作用:- Returns the first non-null argument,返回第一不为null的值
-- 3.  和nvl主要区别是:
       a、可以有多个参数
-- 4. 要求所有参数的类型保持一致:
       The expressions after COALESCE should all have the same type
1.2.4日期函数(重要)
-- 1. date_format函数(根据格式整理日期)
      hive (gmall)> select date_format('2020-06-14','yyyy-MM');
      =>2020-06
-- 2. date_add函数(加减日期)
      hive (gmall)> select date_add('2020-06-14',-1);
       => 2020-06-13
      hive (gmall)> select date_add('2020-06-14',1);
       => 2020-06-15
-- 3. next_day(start_date, day_of_week) 
      1. 返回当期日期的下一个周几。
      2. day_of_week:可选参数:-- 可以写前面两个字母,实际开发中,周一和周末使用频率最高
      	星期一到星期日的英文(Monday,Tuesday、Wednesday、Thursday、Friday、Saturday、Sunday)
    Returns the first date which is later than start_date and named as indicated.
      3. 需求:
          a、需求1:返回下一个周一
             select next_day('2020-07-07','mo');
			=>2020-07-13
          b、需求2:返回本周末:求出下一个周一再减1select date_add(next_day('2020-07-07','mo'),-1);
			=> 2020-07-12       
          c、需求3:返回上个周日,求出下一个周一再减8select date_add(next_day('2020-07-07','mo'),-8);
             => 2020-07-05
-- 4. last_day(date) 
     返回当前日期的最后一天
     Returns the last day of the month which the date belongs to
	select last_day('2020-07-07'); => 2020-07-31
-- 5. 当前日期:current_date
      select current_day;
      =>2020-07-07

二、DWS/DWT层

2.1 DWS/DWT思想

-- 1. dwd层和dws/dwt的区别?
     	DWD层建表时不要考虑用户的需求,而dws和dwd层,以用户需求为驱动,统计各个维度的相关指标;
    
-- 2. dws和dwt都是建宽表,建宽表的目的是优化查询,减少重复查询的步骤。

-- 3. 用户后续的需求是什么样的呢?
       主要是报表,体现的指标有个数、次数、金额等指标。
       '体现的方式':维度 + 时间 + 指标
       '案例':
           a、地区维度:北京地区今天的下单金额
           b、商品维度:1号商品今天下单数量

-- 4. 建宽表的思路:
        1. 一个维度作为一个宽表;
        2. 字段:以维度作为关键字 + 和该维度相关的所有事实表的度量值
    
-- 5. DWS和DWT的主要区别:
       1. dws的数据来源于DWD层,站在维度的角度,看事实表的度量值,统计每个主题当天的行为
       2. dwt的主要数据来源dws,站在维度的角度,看事实表的开始时间、结束时间、累积到现在的度量值,累积一段时间的度量值,
       统计每个主题累计行为,如统计最近7天的下单金额。

-- 6. 宽表是应对一些常用的需求,并不是所有的需求都会包含进来,如果一些特殊需求,可以去到dwd层获取数据

2.2 用户行为数据

2.2.1 每日设备DWS层
  • 建表语句
drop table if exists dws_uv_detail_daycount;
create external table dws_uv_detail_daycount
(
    `mid_id` string, 
    `brand` string, 
    `model` string, 
    `login_count` bigint COMMENT '活跃次数',
    `page_stats` array<struct<page_id:string,page_count:bigint>> COMMENT '页面访问统计'
)
partitioned by(dt string)
stored as parquet
location '/warehouse/gmall/dws/dws_uv_detail_daycount';
  • 加载数据
with 
tmp_start_log as (
    select 
        mid_id , 
        brand , 
        model,
        count(*) login_count
    from dwd_start_log 
    where dt = '2020-06-25'
    group by mid_id,brand,model
),
tmp_page_log as (
    select 
            mid_id , 
            brand , 
            model,      
            collect_set(named_struct("page_id",page_id,"page_count",page_count)) page_stats
    from  (
            select 
                mid_id , 
                brand , 
                model,
                page_id,
                count(*) page_count     
            from dwd_page_log
            where dt ='2020-06-25'
            group by  mid_id ,brand,model,page_id
        )tmp
    group by  mid_id ,brand,model
)

insert overwrite table dws_uv_detail_daycount partition(dt='2020-06-25')

select 
        tmp_start_log.mid_id , 
        tmp_start_log.brand , 
        tmp_start_log.model,
        login_count,
        page_stats
from  tmp_start_log 
join tmp_page_log 
on tmp_start_log.mid_id = tmp_page_log.mid_id
and tmp_start_log.brand = tmp_page_log.brand
and tmp_start_log.model = tmp_page_log.model

2.2.2 设备主题DWT层
  • 建表语句
drop table if exists dwt_uv_topic;
create external table dwt_uv_topic
(
    `mid_id` string,
    `brand` string,
    `model` string,
    `login_date_first` string  comment '首次活跃时间',
    `login_date_last` string  comment '末次活跃时间',
    `login_day_count` bigint comment '当日活跃次数',
    `login_count` bigint comment '累积活跃天数'
)
stored as parquet
location '/warehouse/gmall/dwt/dwt_uv_topic';
  • 插入数据
insert overwrite table dwt_uv_topic
select 
    nvl(old.mid_id,new.mid_id),
    nvl(old.brand,new.brand),
    nvl(old.model,new.model),
    if(old.login_date_first is null and new.login_count > 0 ,'2020-06-25',old.login_date_first),
    if(new.login_count > 0 ,'2020-06-25',old.login_date_last),
    nvl(new.login_count,0),
    nvl(old.login_count,0) + if(new.login_count> 0 , 1 ,0)
from dwt_uv_topic old 
full join (
    select 
        mid_id , 
        brand , 
        model, 
        login_count
    from dws_uv_detail_daycount
    where dt ='2020-06-25'
)new 
on old.mid_id = new.mid_id

2.3 业务数据

2.3.1 会员行为
2.3.1.1 会员DWS层
-- 1. 建表过程:
   '准备':  事实表:订单、订单详情、优惠券领用、支付、退款、收藏、加购物车、评价
   '第一步':找到和用户维度相关的所有事实表:订单、订单详情、优惠券领用、支付、退款、收藏、加购物车、评价
   '第二步': 找到这些事实表的所有度量值字段:
   '第三步':维度主键 + 第二步获取的字段作为dws宽表的字段。
   -- 备注:理论上上述所有事实表都需要进行统计,但是在本案例中,只统计了订单、订单详情表、支付,加购物车四个事实表的数据。
-- 2. 数据的来源:来自于DWD层
-- 3. 表中的数据说明:
      1. 是分区表,每个分区为当天的数据
      2. 每一行数据代表一个用户当天的行为
-- 4. 根据建表字段,确定每个字段来自于哪张表中
-- 5. 数据存储格式:列式存储 + lzo压缩
  • 建表语句
drop table if exists dws_user_action_daycount;
create external table dws_user_action_daycount
(   
    user_id string comment '用户 id',
    login_count bigint comment '登录次数',
    cart_count bigint comment '加入购物车次数',
    order_count bigint comment '下单次数',
    order_amount    decimal(16,2)  comment '下单金额',
    payment_count   bigint      comment '支付次数',
    payment_amount  decimal(16,2) comment '支付金额',
    order_detail_stats array<struct<sku_id:string,sku_num:bigint,order_count:bigint,order_amount:decimal(20,2)>> comment '下单明细统计'
) COMMENT '每日用户行为'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_user_action_daycount/'
tblproperties ("parquet.compression"="lzo");
  • 第一步:确定各个字段来自哪个表
user_id string comment '用户 id',--从dwd_start_log获取 
    login_count bigint comment '登录次数', --从dwd_start_log获取
    cart_count bigint comment '加入购物车次数',--dwd_action_log
    order_count bigint comment '下单次数',--dwd_fact_order_info 
    order_amount    decimal(16,2)  comment '下单金额',--dwd_fact_order_info 
    payment_count   bigint      comment '支付次数',--dwd_fact_order_info 
    payment_amount  decimal(16,2) comment '支付金额',--dwd_fact_order_info 
    order_detail_stats array<struct<sku_id:string,sku_num:bigint,order_count:bigint,order_amount:decimal(20,2)>> comm
	  -- dwd_fact_order_detail
	 说明:1. 加入购车的数据,需要去启动日志中获取,因为加购事实表是每天一个快照,不保留中间操作的过程,所以去到日志action
	      	获取,点击一次加购物车操作,记一次加入购物车的次数
	      2. 用户登录次数,从启动日志中获取,启动一次算作今天登录一次,但是要注意有些登录不是会员,所以需要过滤user_id为
	         null的数据;
  • 第二步:确定dws一行数据代表什么意思?
用户行为表的一行数据代表:一个用户今天登陆次数、加购物车数量、下单数量、下单金额等数据
  • 第三步:确定dwd层所有相关表的同步策略
   'dwd_start_log':每日新增数据
   'dwd_action_log':每日新增数据
   'dwd_fact_order_info':事务型
   'dwd_fact_order_detail':事务型
   'dwd_fact_payment_info':事务型
  • 第四步:从各个表中获取对应的字段
 -- 1. 获取用户id和登录次数
	select 
		  user_id,
		  count(*) login_count
	from dwd_start_log
	where dt='2020-06-25'
	and user_id is not null 
	group by user_id
	---------------------------------------------
	-- 2. 获取加购物车的次数
	
		 select
			user_id,
			count(*) cart_count
		from dwd_action_log
		where dt='2020-06-25'
		and user_id is not null
		and action_id='cart_add'
		group by user_id		
	
  ------------------------------------------ 
   -- 3. 获取下单的次数和下单金额
		
	select
		user_id,
		count(*) order_count
	sum(final_total_amount)  order_amount
	from dwd_fact_order_info
	where dt = '2020-06-25'
	group by user_id
   
   ------------------------------------------------
   -- 4. 获取支付的金额和支付次数
  
	 select
		 user_id,
		 count(*) payment_count,
		 sum(payment_amount) payment_amount
	 from dwd_fact_payment_info
	 where dt = '2020-06-25'
	 group by user_id  
   
   ----------------------------------------------
   -- 5. 获取下单明细
	   select 
			user_id,
			collect_set(named_struct("sku_id",sku_id,"sku_num",sku_num,"order_count",order_count,"order_amount",order_amount))order_detail_stats
	   from (
		   select 
			  user_id,
			  sku_id,
			  sum(sku_num) sku_num,
			  count(*) order_count,
			  cast(sum(final_amount_d) as demical(20,2)) order_amount 
		   from dwd_fact_order_detail 
		   where dt = '2020-06-25'
		   group by user_id,sku_id
	   )tmp
	   group by user_id
  • 第五步:组装以后插入数据
with 
    tmp_start_log as (
    select 
          user_id,
          count(*) login_count
    from dwd_start_log
    where dt='2020-06-25'
    and user_id is not null 
    group by user_id
    ),
    tmp_action_log as (
         select
            user_id,
            count(*) cart_count
        from dwd_action_log
        where dt='2020-06-25'
        and user_id is not null
        and action_id='cart_add'
        group by user_id        
    ),
    tmp_order_info as (
    
    select
        user_id,
        count(*) order_count,
        sum(final_total_amount)  order_amount
    from dwd_fact_order_info
    where dt = '2020-06-25'
    group by user_id
    ),
    tmp_payment_info as (
     select
         user_id,
         count(*) payment_count,
         sum(payment_amount) payment_amount
     from dwd_fact_payment_info
     where dt = '2020-06-25'
     group by user_id  
   ),
   tmp_order_detail as (
       select 
            user_id,
            collect_set(named_struct("sku_id",sku_id,"sku_num",sku_num,"order_count",order_count,"order_amount",order_amount)) order_detail_stats
       from (
           select 
              user_id,
              sku_id,
              sum(sku_num) sku_num,
              count(*) order_count,
              cast(sum(final_amount_d) as decimal(20,2)) order_amount 
           from dwd_fact_order_detail 
           where dt = '2020-06-25'
           group by user_id,sku_id
       )tmp
       group by user_id
    )
    
    insert overwrite table dws_user_action_daycount partition (dt='2020-06-25')
    select   
        tmp_start_log.user_id ,
        login_count ,
        nvl(cart_count,0) ,
        nvl(order_count,0) ,
        nvl(order_amount,0.0)   ,
        nvl(payment_count,0)  ,
        nvl(payment_amount,0.0) ,
        order_detail_stats       
    from  tmp_start_log 
    left join tmp_action_log on tmp_start_log.user_id = tmp_action_log.user_id
    left join tmp_order_info on tmp_start_log.user_id = tmp_order_info.user_id
    left join tmp_payment_info on tmp_start_log.user_id = tmp_payment_info.user_id
    left join tmp_order_detail on  tmp_start_log.user_id = tmp_order_detail.user_id
2.3.1.2 会员DWT层
-- 1. 说明:
      1. dwt和dws层的字段基本是一一对应的。
      2. dws是当天的数据,dwt是累积数据,累积涉及到时间,比如累积7天,累积3-- 2. dwt和dws的维度字段,我们也可以直接放到表中,后续统计需求可以用到。

-- 3. dwt数据说明:
     1. 是维度全量表
     2. 数据源来自于dws层
     3. 不是分区表
 
-- 4. 如何维护dwt表,即如何向这个全量表中插入数据?
    1. dwt的数据每天都需要进行更新;
    2. 更新涉及到新数据和老数据
    3. 更新方式:
       a、取累积时间周期的数据;
       b、如果是累积字段,使用聚合函数求值
       c、如果非累积字段,使用判断语句求当天的数据
       d、然后新旧数据今天合并,由于旧数据不是分区数据
         那么累积数据直接使用新表累积值,而非累积字段,采用更新的方式
  • 建表语句
drop table if exists dwt_user_topic;
create external table dwt_user_topic
(
    user_id string  comment '用户id',
    login_date_first string  comment '首次登录时间',
    login_date_last string  comment '末次登录时间',
    login_count bigint comment '累积登录天数',
    login_last_30d_count bigint comment '最近30日登录天数',
    order_date_first string  comment '首次下单时间',
    order_date_last string  comment '末次下单时间',
    order_count bigint comment '累积下单次数',
    order_amount decimal(16,2) comment '累积下单金额',
    order_last_30d_count bigint comment '最近30日下单次数',
    order_last_30d_amount bigint comment '最近30日下单金额',
    payment_date_first string  comment '首次支付时间',
    payment_date_last string  comment '末次支付时间',
    payment_count decimal(16,2) comment '累积支付次数',
    payment_amount decimal(16,2) comment '累积支付金额',
    payment_last_30d_count decimal(16,2) comment '最近30日支付次数',
    payment_last_30d_amount decimal(16,2) comment '最近30日支付金额'
 )COMMENT '用户主题宽表'
stored as parquet
location '/warehouse/gmall/dwt/dwt_user_topic/'
tblproperties ("parquet.compression"="lzo");
  • 分析
	旧表:dwt前一天的数据
   新表:从dws层获取的累积30天的数据
    user_id string  comment '用户id', --
    login_date_first string  comment '首次登录时间',--如果旧表中有登录时间就使用旧表中的时间,否则使用当天时间
    login_date_last string  comment '末次登录时间',-- 如果新表中今天的登入次数大于0,那么末次登录时间使用今天时间,否则使用旧时间
    login_count bigint comment '累积登录天数', -- 如果新表中的登入次数大于0,那么旧表中数据 + 1 
    login_last_30d_count bigint comment '最近30日登录天数', -- 使用新表中累积计算数据,如果为null,则选择0
    order_date_first string  comment '首次下单时间',-- 如果旧表中首次下单时间为null且新表中下单次数大于0,那么使用今天时间,否则使用旧表数据
    order_date_last string  comment '末次下单时间',-- 如果新表的下单数据大于0,则使用当天时间,否则使用旧表数据
    order_count bigint comment '累积下单次数',-- 将旧表下单数据和新表下单次数直接相加
    order_amount decimal(16,2) comment '累积下单金额',--将旧表下单金额和新表下单金额相加
    order_last_30d_count bigint comment '最近30日下单次数',--直接使用新表数据,如果新表数据为null,则使用0
    order_last_30d_amount bigint comment '最近30日下单金额',--直接使用新表数据,如果新表数据为null,则使用0
    payment_date_first string  comment '首次支付时间',
    payment_date_last string  comment '末次支付时间',
    payment_count decimal(16,2) comment '累积支付次数',
    payment_amount decimal(16,2) comment '累积支付金额',
    payment_last_30d_count decimal(16,2) comment '最近30日支付次数',
    payment_last_30d_amount decimal(16,2) comment '最近30日支付金额'
  • 装载数据
 insert overwrite table dwt_user_topic 
   select 
       nvl(old.user_id,new.user_id) ,
       nvl(old.login_date_first,'2020-06-25'),
       if(new.user_id is not null , '2020-06-25',old.login_date_last),
       nvl(old.login_count,0) + if(new.user_id is not null,1,0),
       nvl(new.login_last_30d_count,0),
       if(old.order_date_first is null and new.order_count > 0 , '2020-06-25',old.order_date_first),
       if(new.order_count > 0,'2020-06-25',old.order_date_first),
       nvl(old.order_count,0) + nvl(new.order_count,0),
       nvl(old.order_amount,0) + nvl(new.order_amount,0),
       nvl(new.order_last_30d_count,0),
       nvl(new.order_last_30d_amount,0),
       if(old.payment_date_first is null and new.payment_count > 0 ,'2020-06-25',old.payment_date_first),
       if(new.payment_count > 0 ,'2020-06-25',old.payment_date_last),
       nvl(old.payment_count,0) + nvl(new.payment_count,0),
       nvl(old.payment_amount,0) + nvl(new.payment_amount,0),
       nvl(new.payment_last_30d_count,0),
       nvl(new.payment_last_30d_amount,0)
   
   from  dwt_user_topic old
   full join (
       select 
           user_id,
           sum(if(dt='2020-06-25',login_count,0)) login_count,--当天登录次数
           sum(if(dt='2020-06-25',cart_count,0)) cart_count,--当天加入购物车次数
           sum(if(dt='2020-06-25',order_count,0)) order_count,--当天下单次数
           sum(if(dt='2020-06-25',order_amount,0.0)) order_amount,--当天下单金额
           sum(if(dt='2020-06-25',payment_count,0)) payment_count,--当天支付次数
           sum(if(dt='2020-06-25',payment_amount,0.0)) payment_amount,--当天支付金额
        
           sum(if(login_count > 0,1,0)) login_last_30d_count,--累积30天登录次数
           sum(order_count) order_last_30d_count,--最近30日下单次数
           sum(order_amount) order_last_30d_amount,--最近30日下单金额
           sum(payment_count) payment_last_30d_count,--最近30日下单次数
           sum(payment_amount) payment_last_30d_amount--最近30日下单金额
                     
       from dws_user_action_daycount 
       where dt > date_add(dt , -30)
       group by user_id
   )new
   on old.user_id = new.user_id
2.3.2 商品行为
2.3.2.1 商品DWS层
-- 解析:
    1. 表的字段如何创建
        '准备':  事实表:订单、订单详情、优惠券领用、支付、退款、收藏、加购物车、评价
        '第一步':找到和商品这个维度相关的所有事实表:订单详情、支付、退款、收藏、加购物车、评价
        '第二步':获取上述事实表的并取其度量值
        '第三步':将商品维度id + 第二步度量值的字段作为dws层的字段,创建商品行为的dws层表
        
    2. 如何向表中插入数据?
        '第一步':根据dws层表的字段,确定每个字段来自于哪个表
        '第二步':确定第一步中所有表的同步策略,确定表中存储的数据是什么及每行数据存储的是什么
        '第三步':一个一个字段来获取最后进行合并

    3. 表保存什么数据
       
    4. 表中每行数据是什么

    5. 存储格式
       1. 分区表,每个分区保留当前最新的数据
       2. 列式存储 + lzo压缩
  • 建表语句
drop table if exists dws_sku_action_daycount;
create external table dws_sku_action_daycount 
(   
    sku_id string comment 'sku_id',
    order_count bigint comment '被下单次数',
    order_num bigint comment '被下单件数',
    order_amount decimal(16,2) comment '被下单金额',
    payment_count bigint  comment '被支付次数',
    payment_num bigint comment '被支付件数',
    payment_amount decimal(16,2) comment '被支付金额',
    refund_count bigint  comment '被退款次数',
    refund_num bigint comment '被退款件数',
    refund_amount  decimal(16,2) comment '被退款金额',
    cart_count bigint comment '被加入购物车次数',
    favor_count bigint comment '被收藏次数',
    appraise_good_count bigint comment '好评数',
    appraise_mid_count bigint comment '中评数',
    appraise_bad_count bigint comment '差评数',
    appraise_default_count bigint comment '默认评价数'
) COMMENT '每日商品行为'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_sku_action_daycount/'
tblproperties ("parquet.compression"="lzo");
  • 分析过程

==================================================
   -- 1. 相关表:订单详情事实表
      2. 同步策略:事务型事实表,以订单的创建时间为分区数据
	  3. 保存的数据:一个分区保存当天所有的下单的信息
	  3. 相关字段如下:
   
    order_count bigint comment '被下单次数',
    order_num bigint comment '被下单件数',
    order_amount decimal(16,2) comment '被下单金额',
	
	select 
	   sku_id,
	   count(*) order_count,--被下单次数
	   sum(sku_num) order_num, --被下单件数
	   sum(original_amount_d) order_amount --被下单金额	
	from dwd_fact_order_detail
	where dt= '2020-06-25'
	group by sku_id
	-------------------------------------------------
   -- 1. 相关表:支付事实表
      2. 同步策略:事务型事实表,以订单支付时间为分区
	  3. 保存的数据:一个分区保存当天所有的支付信息
	  4. 相关字段如下:
	  5. 说明: 有一种情况,昨天晚上下单,但是今天才支付,那么下单的sku_id不在支付中,这样可能会导致数据丢失
	payment_count bigint  comment '被支付次数',
    payment_num bigint comment '被支付件数',
    payment_amount decimal(16,2) comment '被支付金额',
	
	select 	    
		sku_id,
		count(*) payment_count,
		sum(final_amount_d) payment_amount
		
	from dwd_fact_order_detail
	where dt='2020-06-25'
	and order_id in (
		select
				order_id
		from dwd_fact_payment_info
		where (dt='2020-06-25'
        or dt=date_add('2020-06-25',-1))
        and date_format(payment_time,'yyyy-MM-dd')='2020-06-25'
	)
	group by sku_id
	
	-------------------------------------------
	-- 1. 相关表:退款事实表
      2. 同步策略:事务型事实表,退款时间为分区
	  3. 保存的数据:一个分区保存当天所有退款的数据
	  4. 相关字段如下:
		refund_count bigint  comment '被退款次数',
		refund_num bigint comment '被退款件数',
		refund_amount  decimal(16,2) comment '被退款金额',
		
		select 
		      sku_id,
			  sum(refund_count) refund_num,
			  sum(refund_amount) refund_amount		  
		from  dwd_fact_order_refund_info 
		where dt = '2020-06-25'
		group by sku_id
		
  -----------------------------------------------
	-- 1. 相关表:加购车
      2. 同步策略:周期型快照事实表
	  3. 保存的数据:每天一个快照
	  4. 相关字段如下:
		    cart_count bigint comment '被加入购物车次数',

        select  
		    sku_id,
			count(*) cart_count	
		from dwd_fact_cart_info
		where create_time = '2020-06-25'
		group by sku_id

    -------------------------------------------
   -- 1. 相关表:收藏表
      2. 同步策略:周期型快照事实表
	  3. 保存的数据:每天一个快照
	  4. 相关字段如下:
		    favor_count bigint comment '被收藏次数',
			
			select
			sku_id,
			count(*) favor_count
			from dwd_fact_favor_info
			where create_time = '2020-06-25'
			group by sku_id
	
	
	-----------------------------------------------
     -- 1. 相关表:评价事实表
      2. 同步策略:事务型事实表,以订单支付时间为分区
	  3. 保存的数据:每个分区保留当天的评价数据
	  4. 相关字段如下:
	appraise_good_count bigint comment '好评数',
    appraise_mid_count bigint comment '中评数',
    appraise_bad_count bigint comment '差评数',
    appraise_default_count bigint comment '默认评价数'
	
	    select
		sku_id,
		sum(if(appraise='1201',1,0)) appraise_good_count,
		sum(if(appraise='1202',1,0)) appraise_mid_count,
		sum(if(appraise='1203',1,0)) appraise_bad_count,
		sum(if(appraise='1204',1,0)) appraise_default_count,
		from dwd_fact_comment_info 
		where dt = '2020-06-25'
		group by sku_id
  • 插入数据
with 
tmp_order as
(
    select
        sku_id,
        count(*) order_count,
        sum(sku_num) order_num,
        sum(final_amount_d) order_amount
    from dwd_fact_order_detail
    where dt='2020-06-25'
    group by sku_id
),
tmp_payment as
(
    select
        sku_id,
        count(*) payment_count,
        sum(sku_num) payment_num,
        sum(final_amount_d) payment_amount
    from dwd_fact_order_detail
    where dt='2020-06-25'
    and order_id in
    (
        select
            id
        from dwd_fact_order_info
        where (dt='2020-06-25'
        or dt=date_add('2020-06-25',-1))
        and date_format(payment_time,'yyyy-MM-dd')='2020-06-25'
    )
    group by sku_id
),
tmp_refund as
(
    select
        sku_id,
        count(*) refund_count,
        sum(refund_num) refund_num,
        sum(refund_amount) refund_amount
    from dwd_fact_order_refund_info
    where dt='2020-06-25'
    group by sku_id
),
tmp_cart as
(
    select
        item sku_id,
        count(*) cart_count
    from dwd_action_log
    where dt='2020-06-25'
    and user_id is not null
    and action_id='cart_add'
    group by item 
),tmp_favor as
(
    select
        item sku_id,
        count(*) favor_count
    from dwd_action_log
    where dt='2020-06-25'
    and user_id is not null
    and action_id='favor_add'
    group by item 
),
tmp_appraise as
(
select
    sku_id,
    sum(if(appraise='1201',1,0)) appraise_good_count,
    sum(if(appraise='1202',1,0)) appraise_mid_count,
    sum(if(appraise='1203',1,0)) appraise_bad_count,
    sum(if(appraise='1204',1,0)) appraise_default_count
from dwd_fact_comment_info
where dt='2020-06-25'
group by sku_id
)

insert overwrite table dws_sku_action_daycount partition(dt='2020-06-25')
select
    sku_id,
    sum(order_count),
    sum(order_num),
    sum(order_amount),
    sum(payment_count),
    sum(payment_num),
    sum(payment_amount),
    sum(refund_count),
    sum(refund_num),
    sum(refund_amount),
    sum(cart_count),
    sum(favor_count),
    sum(appraise_good_count),
    sum(appraise_mid_count),
    sum(appraise_bad_count),
    sum(appraise_default_count)
from
(
    select
        sku_id,
        order_count,
        order_num,
        order_amount,
        0 payment_count,
        0 payment_num,
        0 payment_amount,
        0 refund_count,
        0 refund_num,
        0 refund_amount,
        0 cart_count,
        0 favor_count,
        0 appraise_good_count,
        0 appraise_mid_count,
        0 appraise_bad_count,
        0 appraise_default_count
    from tmp_order
    union all
    select
        sku_id,
        0 order_count,
        0 order_num,
        0 order_amount,
        payment_count,
        payment_num,
        payment_amount,
        0 refund_count,
        0 refund_num,
        0 refund_amount,
        0 cart_count,
        0 favor_count,
        0 appraise_good_count,
        0 appraise_mid_count,
        0 appraise_bad_count,
        0 appraise_default_count
    from tmp_payment
    union all
    select
        sku_id,
        0 order_count,
        0 order_num,
        0 order_amount,
        0 payment_count,
        0 payment_num,
        0 payment_amount,
        refund_count,
        refund_num,
        refund_amount,
        0 cart_count,
        0 favor_count,
        0 appraise_good_count,
        0 appraise_mid_count,
        0 appraise_bad_count,
        0 appraise_default_count        
    from tmp_refund
    union all
    select
        sku_id,
        0 order_count,
        0 order_num,
        0 order_amount,
        0 payment_count,
        0 payment_num,
        0 payment_amount,
        0 refund_count,
        0 refund_num,
        0 refund_amount,
        cart_count,
        0 favor_count,
        0 appraise_good_count,
        0 appraise_mid_count,
        0 appraise_bad_count,
        0 appraise_default_count
    from tmp_cart
    union all
    select
        sku_id,
        0 order_count,
        0 order_num,
        0 order_amount,
        0 payment_count,
        0 payment_num,
        0 payment_amount,
        0 refund_count,
        0 refund_num,
        0 refund_amount,
        0 cart_count,
        favor_count,
        0 appraise_good_count,
        0 appraise_mid_count,
        0 appraise_bad_count,
        0 appraise_default_count
    from tmp_favor
    union all
    select
        sku_id,
        0 order_count,
        0 order_num,
        0 order_amount,
        0 payment_count,
        0 payment_num,
        0 payment_amount,
        0 refund_count,
        0 refund_num,
        0 refund_amount,
        0 cart_count,
        0 favor_count,
        appraise_good_count,
        appraise_mid_count,
        appraise_bad_count,
        appraise_default_count
    from tmp_appraise
)tmp
group by sku_id;
2.3.2.2 商品DWT层
  • 建表语句
drop table if exists dwt_sku_topic;
create external table dwt_sku_topic
(
    sku_id string comment 'sku_id',
    spu_id string comment 'spu_id',
    order_last_30d_count bigint comment '最近30日被下单次数',
    order_last_30d_num bigint comment '最近30日被下单件数',
    order_last_30d_amount decimal(16,2)  comment '最近30日被下单金额',
    order_count bigint comment '累积被下单次数',
    order_num bigint comment '累积被下单件数',
    order_amount decimal(16,2) comment '累积被下单金额',
    payment_last_30d_count   bigint  comment '最近30日被支付次数',
    payment_last_30d_num bigint comment '最近30日被支付件数',
    payment_last_30d_amount  decimal(16,2) comment '最近30日被支付金额',
    payment_count   bigint  comment '累积被支付次数',
    payment_num bigint comment '累积被支付件数',
    payment_amount  decimal(16,2) comment '累积被支付金额',
    refund_last_30d_count bigint comment '最近三十日退款次数',
    refund_last_30d_num bigint comment '最近三十日退款件数',
    refund_last_30d_amount decimal(16,2) comment '最近三十日退款金额',
    refund_count bigint comment '累积退款次数',
    refund_num bigint comment '累积退款件数',
    refund_amount decimal(16,2) comment '累积退款金额',
    cart_last_30d_count bigint comment '最近30日被加入购物车次数',
    cart_count bigint comment '累积被加入购物车次数',
    favor_last_30d_count bigint comment '最近30日被收藏次数',
    favor_count bigint comment '累积被收藏次数',
    appraise_last_30d_good_count bigint comment '最近30日好评数',
    appraise_last_30d_mid_count bigint comment '最近30日中评数',
    appraise_last_30d_bad_count bigint comment '最近30日差评数',
    appraise_last_30d_default_count bigint comment '最近30日默认评价数',
    appraise_good_count bigint comment '累积好评数',
    appraise_mid_count bigint comment '累积中评数',
    appraise_bad_count bigint comment '累积差评数',
    appraise_default_count bigint comment '累积默认评价数'
 )COMMENT '商品主题宽表'
stored as parquet
location '/warehouse/gmall/dwt/dwt_sku_topic/'
tblproperties ("parquet.compression"="lzo");
  • 插入数据
insert overwrite table dwt_sku_topic
select 
    nvl(new.sku_id,old.sku_id),
    sku_info.spu_id,
    nvl(new.order_count30,0),
    nvl(new.order_num30,0),
    nvl(new.order_amount30,0),
    nvl(old.order_count,0) + nvl(new.order_count,0),
    nvl(old.order_num,0) + nvl(new.order_num,0),
    nvl(old.order_amount,0) + nvl(new.order_amount,0),
    nvl(new.payment_count30,0),
    nvl(new.payment_num30,0),
    nvl(new.payment_amount30,0),
    nvl(old.payment_count,0) + nvl(new.payment_count,0),
    nvl(old.payment_num,0) + nvl(new.payment_count,0),
    nvl(old.payment_amount,0) + nvl(new.payment_count,0),
    nvl(new.refund_count30,0),
    nvl(new.refund_num30,0),
    nvl(new.refund_amount30,0),
    nvl(old.refund_count,0) + nvl(new.refund_count,0),
    nvl(old.refund_num,0) + nvl(new.refund_num,0),
    nvl(old.refund_amount,0) + nvl(new.refund_amount,0),
    nvl(new.cart_count30,0),
    nvl(old.cart_count,0) + nvl(new.cart_count,0),
    nvl(new.favor_count30,0),
    nvl(old.favor_count,0) + nvl(new.favor_count,0),
    nvl(new.appraise_good_count30,0),
    nvl(new.appraise_mid_count30,0),
    nvl(new.appraise_bad_count30,0),
    nvl(new.appraise_default_count30,0)  ,
    nvl(old.appraise_good_count,0) + nvl(new.appraise_good_count,0),
    nvl(old.appraise_mid_count,0) + nvl(new.appraise_mid_count,0),
    nvl(old.appraise_bad_count,0) + nvl(new.appraise_bad_count,0),
    nvl(old.appraise_default_count,0) + nvl(new.appraise_default_count,0) 
from 
dwt_sku_topic old
full outer join 
(
    select 
        sku_id,
        sum(if(dt='2020-06-25', order_count,0 )) order_count,
        sum(if(dt='2020-06-25',order_num ,0 ))  order_num, 
        sum(if(dt='2020-06-25',order_amount,0 )) order_amount ,
        sum(if(dt='2020-06-25',payment_count,0 )) payment_count,
        sum(if(dt='2020-06-25',payment_num,0 )) payment_num,
        sum(if(dt='2020-06-25',payment_amount,0 )) payment_amount,
        sum(if(dt='2020-06-25',refund_count,0 )) refund_count,
        sum(if(dt='2020-06-25',refund_num,0 )) refund_num,
        sum(if(dt='2020-06-25',refund_amount,0 )) refund_amount,  
        sum(if(dt='2020-06-25',cart_count,0 )) cart_count,
        sum(if(dt='2020-06-25',favor_count,0 )) favor_count,
        sum(if(dt='2020-06-25',appraise_good_count,0 )) appraise_good_count,  
        sum(if(dt='2020-06-25',appraise_mid_count,0 ) ) appraise_mid_count ,
        sum(if(dt='2020-06-25',appraise_bad_count,0 )) appraise_bad_count,  
        sum(if(dt='2020-06-25',appraise_default_count,0 )) appraise_default_count,
        sum(order_count) order_count30 ,
        sum(order_num) order_num30,
        sum(order_amount) order_amount30,
        sum(payment_count) payment_count30,
        sum(payment_num) payment_num30,
        sum(payment_amount) payment_amount30,
        sum(refund_count) refund_count30,
        sum(refund_num) refund_num30,
        sum(refund_amount) refund_amount30,
        sum(cart_count) cart_count30,
        sum(favor_count) favor_count30,
        sum(appraise_good_count) appraise_good_count30,
        sum(appraise_mid_count) appraise_mid_count30,
        sum(appraise_bad_count) appraise_bad_count30,
        sum(appraise_default_count) appraise_default_count30 
    from dws_sku_action_daycount
    where dt >= date_add ('2020-06-25', -30)
    group by sku_id    
)new 
on new.sku_id = old.sku_id
left join 
(select * from dwd_dim_sku_info where dt='2020-06-25') sku_info
on nvl(new.sku_id,old.sku_id)= sku_info.id;
2.3.4 活动统计
2.3.4.1 活动DWS层
  • 建表
drop table if exists dwt_activity_topic;
create external table dwt_activity_topic(
    `id` string COMMENT '编号',
    `activity_name` string  COMMENT '活动名称',
    `activity_type` string  COMMENT '活动类型',
    `start_time` string  COMMENT '开始时间',
    `end_time` string  COMMENT '结束时间',
    `create_time` string  COMMENT '创建时间',
    `display_day_count` bigint COMMENT '当日曝光次数',
    `order_day_count` bigint COMMENT '当日下单次数',
    `order_day_amount` decimal(20,2) COMMENT '当日下单金额',
    `payment_day_count` bigint COMMENT '当日支付次数',
    `payment_day_amount` decimal(20,2) COMMENT '当日支付金额',
    `display_count` bigint COMMENT '累积曝光次数',
    `order_count` bigint COMMENT '累积下单次数',
    `order_amount` decimal(20,2) COMMENT '累积下单金额',
    `payment_count` bigint COMMENT '累积支付次数',
    `payment_amount` decimal(20,2) COMMENT '累积支付金额'
) COMMENT '活动主题宽表'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/dwt/dwt_activity_topic/'
tblproperties ("parquet.compression"="lzo");
  1. 插入数据
with
tmp_op as
(
    select
        activity_id,
        sum(if(date_format(create_time,'yyyy-MM-dd')='2020-06-25',1,0)) order_count,
        sum(if(date_format(create_time,'yyyy-MM-dd')='2020-06-25',final_total_amount,0)) order_amount,
        sum(if(date_format(payment_time,'yyyy-MM-dd')='2020-06-25',1,0)) payment_count,
        sum(if(date_format(payment_time,'yyyy-MM-dd')='2020-06-25',final_total_amount,0)) payment_amount
    from dwd_fact_order_info
    where (dt='2020-06-25' or dt=date_add('2020-06-25',-1))
    and activity_id is not null
    group by activity_id
),
tmp_display as
(
    select
        item activity_id,
        count(*) display_count
    from dwd_display_log
    where dt='2020-06-25'
    and item_type='activity_id'
    group by item
),
tmp_activity as
(
    select
        *
    from dwd_dim_activity_info
    where dt='2020-06-25'
)
insert overwrite table dws_activity_info_daycount partition(dt='2020-06-25')
select
    nvl(tmp_op.activity_id,tmp_display.activity_id),
    tmp_activity.activity_name,
    tmp_activity.activity_type,
    tmp_activity.start_time,
    tmp_activity.end_time,
    tmp_activity.create_time,
    tmp_display.display_count,
    tmp_op.order_count,
    tmp_op.order_amount,
    tmp_op.payment_count,
    tmp_op.payment_amount
from tmp_op
full outer join tmp_display on tmp_op.activity_id=tmp_display.activity_id
left join tmp_activity on nvl(tmp_op.activity_id,tmp_display.activity_id)=tmp_activity.id;
2.3.4.2 活动DWT层
  • 建表
drop table if exists dwt_activity_topic;
create external table dwt_activity_topic(
    `id` string COMMENT '编号',
    `activity_name` string  COMMENT '活动名称',
    `activity_type` string  COMMENT '活动类型',
    `start_time` string  COMMENT '开始时间',
    `end_time` string  COMMENT '结束时间',
    `create_time` string  COMMENT '创建时间',
    `display_day_count` bigint COMMENT '当日曝光次数',
    `order_day_count` bigint COMMENT '当日下单次数',
    `order_day_amount` decimal(20,2) COMMENT '当日下单金额',
    `payment_day_count` bigint COMMENT '当日支付次数',
    `payment_day_amount` decimal(20,2) COMMENT '当日支付金额',
    `display_count` bigint COMMENT '累积曝光次数',
    `order_count` bigint COMMENT '累积下单次数',
    `order_amount` decimal(20,2) COMMENT '累积下单金额',
    `payment_count` bigint COMMENT '累积支付次数',
    `payment_amount` decimal(20,2) COMMENT '累积支付金额'
) COMMENT '活动主题宽表'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/dwt/dwt_activity_topic/'
tblproperties ("parquet.compression"="lzo");
  • 插入数据
insert overwrite table dwt_activity_topic
select
    nvl(new.id,old.id),
    nvl(new.activity_name,old.activity_name),
    nvl(new.activity_type,old.activity_type),
    nvl(new.start_time,old.start_time),
    nvl(new.end_time,old.end_time),
    nvl(new.create_time,old.create_time),
    nvl(new.display_count,0),
    nvl(new.order_count,0),
    nvl(new.order_amount,0.0),
    nvl(new.payment_count,0),
    nvl(new.payment_amount,0.0),
    nvl(new.display_count,0)+nvl(old.display_count,0),
    nvl(new.order_count,0)+nvl(old.order_count,0),
    nvl(new.order_amount,0.0)+nvl(old.order_amount,0.0),
    nvl(new.payment_count,0)+nvl(old.payment_count,0),
    nvl(new.payment_amount,0.0)+nvl(old.payment_amount,0.0)
from
(
    select
        *
    from dwt_activity_topic
)old
full outer join
(
    select
        *
    from dws_activity_info_daycount
    where dt='2020-06-25'
)new
on old.id=new.id;
2.3.5 地区统计
2.3.5.1 地区DWS层
  • 建表语句
drop table if exists dws_area_stats_daycount;
create external table dws_area_stats_daycount(
    `id` bigint COMMENT '编号',
    `province_name` string COMMENT '省份名称',
    `area_code` string COMMENT '地区编码',
    `iso_code` string COMMENT 'iso编码',
    `region_id` string COMMENT '地区ID',
    `region_name` string COMMENT '地区名称',
    `login_count` string COMMENT '活跃设备数',
    `order_count` bigint COMMENT '下单次数',
    `order_amount` decimal(20,2) COMMENT '下单金额',
    `payment_count` bigint COMMENT '支付次数',
    `payment_amount` decimal(20,2) COMMENT '支付金额'
) COMMENT '每日地区信息表'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_area_stats_daycount/'
tblproperties ("parquet.compression"="lzo");
  • 插入数据
with 
tmp_login as
(
    select
        area_code,
        count(*) login_count
    from dwd_start_log
    where dt='2020-06-25'
    group by area_code
),
tmp_op as
(
    select
        province_id,
        sum(if(date_format(create_time,'yyyy-MM-dd')='2020-06-25',1,0)) order_count,
        sum(if(date_format(create_time,'yyyy-MM-dd')='2020-06-25',final_total_amount,0)) order_amount,
        sum(if(date_format(payment_time,'yyyy-MM-dd')='2020-06-25',1,0)) payment_count,
        sum(if(date_format(payment_time,'yyyy-MM-dd')='2020-06-25',final_total_amount,0)) payment_amount
    from dwd_fact_order_info
    where (dt='2020-06-25' or dt=date_add('2020-06-25',-1))
    group by province_id
)
insert overwrite table dws_area_stats_daycount partition(dt='2020-06-25')
select
    pro.id,
    pro.province_name,
    pro.area_code,
    pro.iso_code,
    pro.region_id,
    pro.region_name,
    nvl(tmp_login.login_count,0),
    nvl(tmp_op.order_count,0),
    nvl(tmp_op.order_amount,0.0),
    nvl(tmp_op.payment_count,0),
    nvl(tmp_op.payment_amount,0.0)
from dwd_dim_base_province pro
left join tmp_login on pro.area_code=tmp_login.area_code
left join tmp_op on pro.id=tmp_op.province_id;
2.3.5.2 地区DWT层
  • 建表语句
drop table if exists dwt_area_topic;
create external table dwt_area_topic(
    `id` bigint COMMENT '编号',
    `province_name` string COMMENT '省份名称',
    `area_code` string COMMENT '地区编码',
    `iso_code` string COMMENT 'iso编码',
    `region_id` string COMMENT '地区ID',
    `region_name` string COMMENT '地区名称',
    `login_day_count` string COMMENT '当天活跃设备数',
    `login_last_30d_count` string COMMENT '最近30天活跃设备数',
    `order_day_count` bigint COMMENT '当天下单次数',
    `order_day_amount` decimal(16,2) COMMENT '当天下单金额',
    `order_last_30d_count` bigint COMMENT '最近30天下单次数',
    `order_last_30d_amount` decimal(16,2) COMMENT '最近30天下单金额',
    `payment_day_count` bigint COMMENT '当天支付次数',
    `payment_day_amount` decimal(16,2) COMMENT '当天支付金额',
    `payment_last_30d_count` bigint COMMENT '最近30天支付次数',
    `payment_last_30d_amount` decimal(16,2) COMMENT '最近30天支付金额'
) COMMENT '地区主题宽表'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/dwt/dwt_area_topic/'
tblproperties ("parquet.compression"="lzo");
  • 插入数据
insert overwrite table dwt_area_topic
select
    nvl(old.id,new.id),
    nvl(old.province_name,new.province_name),
    nvl(old.area_code,new.area_code),
    nvl(old.iso_code,new.iso_code),
    nvl(old.region_id,new.region_id),
    nvl(old.region_name,new.region_name),
    nvl(new.login_day_count,0),
    nvl(new.login_last_30d_count,0),
    nvl(new.order_day_count,0),
    nvl(new.order_day_amount,0.0),
    nvl(new.order_last_30d_count,0),
    nvl(new.order_last_30d_amount,0.0),
    nvl(new.payment_day_count,0),
    nvl(new.payment_day_amount,0.0),
    nvl(new.payment_last_30d_count,0),
    nvl(new.payment_last_30d_amount,0.0)
from 
(
    select
        *
    from dwt_area_topic
)old
full outer join
(
    select
        id,
        province_name,
        area_code,
        iso_code,
        region_id,
        region_name,
        sum(if(dt='2020-06-25',login_count,0)) login_day_count,
        sum(if(dt='2020-06-25',order_count,0)) order_day_count,
        sum(if(dt='2020-06-25',order_amount,0.0)) order_day_amount,
        sum(if(dt='2020-06-25',payment_count,0)) payment_day_count,
        sum(if(dt='2020-06-25',payment_amount,0.0)) payment_day_amount,
        sum(login_count) login_last_30d_count,
        sum(order_count) order_last_30d_count,
        sum(order_amount) order_last_30d_amount,
        sum(payment_count) payment_last_30d_count,
        sum(payment_amount) payment_last_30d_amount
    from dws_area_stats_daycount
    where dt>=date_add('2020-06-25',-30)
    group by id,province_name,area_code,iso_code,region_id,region_name
)new
on old.id=new.id;

2.4 脚本

2.4.1 DWS脚本
  1. 在/home/atguigu/bin目录下创建脚本dwd_to_dws.sh
[atguigu@hadoop102 bin]$ vim dwd_to_dws.sh
  1. 脚本内容
#!/bin/bash

APP=gmall
hive=/opt/module/hive/bin/hive

# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

sql="
set mapreduce.job.queuename=hive;
with
tmp_start as
(
    select  
        mid_id,
        brand,
        model,
        count(*) login_count
    from ${APP}.dwd_start_log
    where dt='$do_date'
    group by mid_id,brand,model
),
tmp_page as
(
    select
        mid_id,
        brand,
        model,        
        collect_set(named_struct('page_id',page_id,'page_count',page_count)) page_stats
    from
    (
        select
            mid_id,
            brand,
            model,
            page_id,
            count(*) page_count
        from ${APP}.dwd_page_log
        where dt='$do_date'
        group by mid_id,brand,model,page_id
    )tmp
    group by mid_id,brand,model
)
insert overwrite table ${APP}.dws_uv_detail_daycount partition(dt='$do_date')
select
    nvl(tmp_start.mid_id,tmp_page.mid_id),
    nvl(tmp_start.brand,tmp_page.brand),
    nvl(tmp_start.model,tmp_page.model),
    tmp_start.login_count,
    tmp_page.page_stats
from tmp_start 
full outer join tmp_page
on tmp_start.mid_id=tmp_page.mid_id
and tmp_start.brand=tmp_page.brand
and tmp_start.model=tmp_page.model;


with
tmp_login as
(
    select
        user_id,
        count(*) login_count
    from ${APP}.dwd_start_log
    where dt='$do_date'
    and user_id is not null
    group by user_id
),
tmp_cart as
(
    select
        user_id,
        count(*) cart_count
    from ${APP}.dwd_action_log
    where dt='$do_date'
    and user_id is not null
    and action_id='cart_add'
    group by user_id
),tmp_order as
(
    select
        user_id,
        count(*) order_count,
        sum(final_total_amount) order_amount
    from ${APP}.dwd_fact_order_info
    where dt='$do_date'
    group by user_id
) ,
tmp_payment as
(
    select
        user_id,
        count(*) payment_count,
        sum(payment_amount) payment_amount
    from ${APP}.dwd_fact_payment_info
    where dt='$do_date'
    group by user_id
),
tmp_order_detail as
(
    select
        user_id,
        collect_set(named_struct('sku_id',sku_id,'sku_num',sku_num,'order_count',order_count,'order_amount',order_amount)) order_stats
    from
    (
        select
            user_id,
            sku_id,
            sum(sku_num) sku_num,
            count(*) order_count,
            cast(sum(final_amount_d) as decimal(20,2)) order_amount
        from ${APP}.dwd_fact_order_detail
        where dt='$do_date'
        group by user_id,sku_id
    )tmp
    group by user_id
)

insert overwrite table ${APP}.dws_user_action_daycount partition(dt='$do_date')
select
    tmp_login.user_id,
    login_count,
    nvl(cart_count,0),
    nvl(order_count,0),
    nvl(order_amount,0.0),
    nvl(payment_count,0),
    nvl(payment_amount,0.0),
    order_stats
from tmp_login
left outer join tmp_cart on tmp_login.user_id=tmp_cart.user_id
left outer join tmp_order on tmp_login.user_id=tmp_order.user_id
left outer join tmp_payment on tmp_login.user_id=tmp_payment.user_id
left outer join tmp_order_detail on tmp_login.user_id=tmp_order_detail.user_id;

with 
tmp_order as
(
    select
        sku_id,
        count(*) order_count,
        sum(sku_num) order_num,
        sum(final_amount_d) order_amount
    from ${APP}.dwd_fact_order_detail
    where dt='$do_date'
    group by sku_id
),
tmp_payment as
(
    select
        sku_id,
        count(*) payment_count,
        sum(sku_num) payment_num,
        sum(final_amount_d) payment_amount
    from ${APP}.dwd_fact_order_detail
    where dt='$do_date'
    and order_id in
    (
        select
            id
        from ${APP}.dwd_fact_order_info
        where (dt='$do_date'
        or dt=date_add('$do_date',-1))
        and date_format(payment_time,'yyyy-MM-dd')='$do_date'
    )
    group by sku_id
),
tmp_refund as
(
    select
        sku_id,
        count(*) refund_count,
        sum(refund_num) refund_num,
        sum(refund_amount) refund_amount
    from ${APP}.dwd_fact_order_refund_info
    where dt='$do_date'
    group by sku_id
),
tmp_cart as
(
    select
        item sku_id,
        count(*) cart_count
    from ${APP}.dwd_action_log
    where dt='$do_date'
    and user_id is not null
    and action_id='cart_add'
    group by item 
),tmp_favor as
(
    select
        item sku_id,
        count(*) favor_count
    from ${APP}.dwd_action_log
    where dt='$do_date'
    and user_id is not null
    and action_id='favor_add'
    group by item 
),
tmp_appraise as
(
select
    sku_id,
    sum(if(appraise='1201',1,0)) appraise_good_count,
    sum(if(appraise='1202',1,0)) appraise_mid_count,
    sum(if(appraise='1203',1,0)) appraise_bad_count,
    sum(if(appraise='1204',1,0)) appraise_default_count
from ${APP}.dwd_fact_comment_info
where dt='$do_date'
group by sku_id
)

insert overwrite table ${APP}.dws_sku_action_daycount partition(dt='$do_date')
select
    sku_id,
    sum(order_count),
    sum(order_num),
    sum(order_amount),
    sum(payment_count),
    sum(payment_num),
    sum(payment_amount),
    sum(refund_count),
    sum(refund_num),
    sum(refund_amount),
    sum(cart_count),
    sum(favor_count),
    sum(appraise_good_count),
    sum(appraise_mid_count),
    sum(appraise_bad_count),
    sum(appraise_default_count)
from
(
    select
        sku_id,
        order_count,
        order_num,
        order_amount,
        0 payment_count,
        0 payment_num,
        0 payment_amount,
        0 refund_count,
        0 refund_num,
        0 refund_amount,
        0 cart_count,
        0 favor_count,
        0 appraise_good_count,
        0 appraise_mid_count,
        0 appraise_bad_count,
        0 appraise_default_count
    from tmp_order
    union all
    select
        sku_id,
        0 order_count,
        0 order_num,
        0 order_amount,
        payment_count,
        payment_num,
        payment_amount,
        0 refund_count,
        0 refund_num,
        0 refund_amount,
        0 cart_count,
        0 favor_count,
        0 appraise_good_count,
        0 appraise_mid_count,
        0 appraise_bad_count,
        0 appraise_default_count
    from tmp_payment
    union all
    select
        sku_id,
        0 order_count,
        0 order_num,
        0 order_amount,
        0 payment_count,
        0 payment_num,
        0 payment_amount,
        refund_count,
        refund_num,
        refund_amount,
        0 cart_count,
        0 favor_count,
        0 appraise_good_count,
        0 appraise_mid_count,
        0 appraise_bad_count,
        0 appraise_default_count        
    from tmp_refund
    union all
    select
        sku_id,
        0 order_count,
        0 order_num,
        0 order_amount,
        0 payment_count,
        0 payment_num,
        0 payment_amount,
        0 refund_count,
        0 refund_num,
        0 refund_amount,
        cart_count,
        0 favor_count,
        0 appraise_good_count,
        0 appraise_mid_count,
        0 appraise_bad_count,
        0 appraise_default_count
    from tmp_cart
    union all
    select
        sku_id,
        0 order_count,
        0 order_num,
        0 order_amount,
        0 payment_count,
        0 payment_num,
        0 payment_amount,
        0 refund_count,
        0 refund_num,
        0 refund_amount,
        0 cart_count,
        favor_count,
        0 appraise_good_count,
        0 appraise_mid_count,
        0 appraise_bad_count,
        0 appraise_default_count
    from tmp_favor
    union all
    select
        sku_id,
        0 order_count,
        0 order_num,
        0 order_amount,
        0 payment_count,
        0 payment_num,
        0 payment_amount,
        0 refund_count,
        0 refund_num,
        0 refund_amount,
        0 cart_count,
        0 favor_count,
        appraise_good_count,
        appraise_mid_count,
        appraise_bad_count,
        appraise_default_count
    from tmp_appraise
)tmp
group by sku_id;

with 
tmp_login as
(
    select
        area_code,
        count(*) login_count
    from ${APP}.dwd_start_log
    where dt='$do_date'
    group by area_code
),
tmp_op as
(
    select
        province_id,
        sum(if(date_format(create_time,'yyyy-MM-dd')='$do_date',1,0)) order_count,
        sum(if(date_format(create_time,'yyyy-MM-dd')='$do_date',final_total_amount,0)) order_amount,
        sum(if(date_format(payment_time,'yyyy-MM-dd')='$do_date',1,0)) payment_count,
        sum(if(date_format(payment_time,'yyyy-MM-dd')='$do_date',final_total_amount,0)) payment_amount
    from ${APP}.dwd_fact_order_info
    where (dt='$do_date' or dt=date_add('$do_date',-1))
    group by province_id
)
insert overwrite table ${APP}.dws_area_stats_daycount partition(dt='$do_date')
select
    pro.id,
    pro.province_name,
    pro.area_code,
    pro.iso_code,
    pro.region_id,
    pro.region_name,
    nvl(tmp_login.login_count,0),
    nvl(tmp_op.order_count,0),
    nvl(tmp_op.order_amount,0.0),
    nvl(tmp_op.payment_count,0),
    nvl(tmp_op.payment_amount,0.0)
from ${APP}.dwd_dim_base_province pro
left join tmp_login on pro.area_code=tmp_login.area_code
left join tmp_op on pro.id=tmp_op.province_id;


with
tmp_op as
(
    select
        activity_id,
        sum(if(date_format(create_time,'yyyy-MM-dd')='$do_date',1,0)) order_count,
        sum(if(date_format(create_time,'yyyy-MM-dd')='$do_date',final_total_amount,0)) order_amount,
        sum(if(date_format(payment_time,'yyyy-MM-dd')='$do_date',1,0)) payment_count,
        sum(if(date_format(payment_time,'yyyy-MM-dd')='$do_date',final_total_amount,0)) payment_amount
    from ${APP}.dwd_fact_order_info
    where (dt='$do_date' or dt=date_add('$do_date',-1))
    and activity_id is not null
    group by activity_id
),
tmp_display as
(
    select
        item activity_id,
        count(*) display_count
    from ${APP}.dwd_display_log
    where dt='$do_date'
    and item_type='activity_id'
    group by item
),
tmp_activity as
(
    select
        *
    from ${APP}.dwd_dim_activity_info
    where dt='$do_date'
)
insert overwrite table ${APP}.dws_activity_info_daycount partition(dt='$do_date')
select
    nvl(tmp_op.activity_id,tmp_display.activity_id),
    tmp_activity.activity_name,
    tmp_activity.activity_type,
    tmp_activity.start_time,
    tmp_activity.end_time,
    tmp_activity.create_time,
    tmp_display.display_count,
    tmp_op.order_count,
    tmp_op.order_amount,
    tmp_op.payment_count,
    tmp_op.payment_amount
from tmp_op
full outer join tmp_display on tmp_op.activity_id=tmp_display.activity_id
left join tmp_activity on nvl(tmp_op.activity_id,tmp_display.activity_id)=tmp_activity.id;
"

$hive -e "$sql"
2.4.2 DWT脚本
  1. 在/home/atguigu/bin目录下创建脚本dws_to_dwt.sh
[atguigu@hadoop102 bin]$ vim dws_to_dwt.sh
  1. 脚本内容
#!/bin/bash

APP=gmall
hive=/opt/module/hive/bin/hive

# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$1" ] ;then
    do_date=$1
else 
    do_date=`date -d "-1 day" +%F`
fi

sql="
set mapreduce.job.queuename=hive;
insert overwrite table ${APP}.dwt_uv_topic
select
    nvl(new.mid_id,old.mid_id),
    nvl(new.model,old.model),
    nvl(new.brand,old.brand),
    if(old.mid_id is null,'$do_date',old.login_date_first),
    if(new.mid_id is not null,'$do_date',old.login_date_last),
    if(new.mid_id is not null, new.login_count,0),
    nvl(old.login_count,0)+if(new.login_count>0,1,0)
from
(
    select
        *
    from ${APP}.dwt_uv_topic
)old
full outer join
(
    select
        *
    from ${APP}.dws_uv_detail_daycount
    where dt='$do_date'
)new
on old.mid_id=new.mid_id;

insert overwrite table ${APP}.dwt_user_topic
select
    nvl(new.user_id,old.user_id),
    if(old.login_date_first is null and new.login_count>0,'$do_date',old.login_date_first),
    if(new.login_count>0,'$do_date',old.login_date_last),
    nvl(old.login_count,0)+if(new.login_count>0,1,0),
    nvl(new.login_last_30d_count,0),
    if(old.order_date_first is null and new.order_count>0,'$do_date',old.order_date_first),
    if(new.order_count>0,'$do_date',old.order_date_last),
    nvl(old.order_count,0)+nvl(new.order_count,0),
    nvl(old.order_amount,0)+nvl(new.order_amount,0),
    nvl(new.order_last_30d_count,0),
    nvl(new.order_last_30d_amount,0),
    if(old.payment_date_first is null and new.payment_count>0,'$do_date',old.payment_date_first),
    if(new.payment_count>0,'$do_date',old.payment_date_last),
    nvl(old.payment_count,0)+nvl(new.payment_count,0),
    nvl(old.payment_amount,0)+nvl(new.payment_amount,0),
    nvl(new.payment_last_30d_count,0),
    nvl(new.payment_last_30d_amount,0)
from
${APP}.dwt_user_topic old
full outer join
(
    select
        user_id,
        sum(if(dt='$do_date',login_count,0)) login_count,
        sum(if(dt='$do_date',order_count,0)) order_count,
        sum(if(dt='$do_date',order_amount,0)) order_amount,
        sum(if(dt='$do_date',payment_count,0)) payment_count,
        sum(if(dt='$do_date',payment_amount,0)) payment_amount,
        sum(if(login_count>0,1,0)) login_last_30d_count,
        sum(order_count) order_last_30d_count,
        sum(order_amount) order_last_30d_amount,
        sum(payment_count) payment_last_30d_count,
        sum(payment_amount) payment_last_30d_amount
    from ${APP}.dws_user_action_daycount
    where dt>=date_add( '$do_date',-30)
    group by user_id
)new
on old.user_id=new.user_id;

insert overwrite table ${APP}.dwt_sku_topic
select 
    nvl(new.sku_id,old.sku_id),
    sku_info.spu_id,
    nvl(new.order_count30,0),
    nvl(new.order_num30,0),
    nvl(new.order_amount30,0),
    nvl(old.order_count,0) + nvl(new.order_count,0),
    nvl(old.order_num,0) + nvl(new.order_num,0),
    nvl(old.order_amount,0) + nvl(new.order_amount,0),
    nvl(new.payment_count30,0),
    nvl(new.payment_num30,0),
    nvl(new.payment_amount30,0),
    nvl(old.payment_count,0) + nvl(new.payment_count,0),
    nvl(old.payment_num,0) + nvl(new.payment_count,0),
    nvl(old.payment_amount,0) + nvl(new.payment_count,0),
    nvl(new.refund_count30,0),
    nvl(new.refund_num30,0),
    nvl(new.refund_amount30,0),
    nvl(old.refund_count,0) + nvl(new.refund_count,0),
    nvl(old.refund_num,0) + nvl(new.refund_num,0),
    nvl(old.refund_amount,0) + nvl(new.refund_amount,0),
    nvl(new.cart_count30,0),
    nvl(old.cart_count,0) + nvl(new.cart_count,0),
    nvl(new.favor_count30,0),
    nvl(old.favor_count,0) + nvl(new.favor_count,0),
    nvl(new.appraise_good_count30,0),
    nvl(new.appraise_mid_count30,0),
    nvl(new.appraise_bad_count30,0),
    nvl(new.appraise_default_count30,0)  ,
    nvl(old.appraise_good_count,0) + nvl(new.appraise_good_count,0),
    nvl(old.appraise_mid_count,0) + nvl(new.appraise_mid_count,0),
    nvl(old.appraise_bad_count,0) + nvl(new.appraise_bad_count,0),
    nvl(old.appraise_default_count,0) + nvl(new.appraise_default_count,0) 
from 
(
    select
        sku_id,
        spu_id,
        order_last_30d_count,
        order_last_30d_num,
        order_last_30d_amount,
        order_count,
        order_num,
        order_amount  ,
        payment_last_30d_count,
        payment_last_30d_num,
        payment_last_30d_amount,
        payment_count,
        payment_num,
        payment_amount,
        refund_last_30d_count,
        refund_last_30d_num,
        refund_last_30d_amount,
        refund_count,
        refund_num,
        refund_amount,
        cart_last_30d_count,
        cart_count,
        favor_last_30d_count,
        favor_count,
        appraise_last_30d_good_count,
        appraise_last_30d_mid_count,
        appraise_last_30d_bad_count,
        appraise_last_30d_default_count,
        appraise_good_count,
        appraise_mid_count,
        appraise_bad_count,
        appraise_default_count 
    from ${APP}.dwt_sku_topic
)old
full outer join 
(
    select 
        sku_id,
        sum(if(dt='$do_date', order_count,0 )) order_count,
        sum(if(dt='$do_date',order_num ,0 ))  order_num, 
        sum(if(dt='$do_date',order_amount,0 )) order_amount ,
        sum(if(dt='$do_date',payment_count,0 )) payment_count,
        sum(if(dt='$do_date',payment_num,0 )) payment_num,
        sum(if(dt='$do_date',payment_amount,0 )) payment_amount,
        sum(if(dt='$do_date',refund_count,0 )) refund_count,
        sum(if(dt='$do_date',refund_num,0 )) refund_num,
        sum(if(dt='$do_date',refund_amount,0 )) refund_amount,  
        sum(if(dt='$do_date',cart_count,0 )) cart_count,
        sum(if(dt='$do_date',favor_count,0 )) favor_count,
        sum(if(dt='$do_date',appraise_good_count,0 )) appraise_good_count,  
        sum(if(dt='$do_date',appraise_mid_count,0 ) ) appraise_mid_count ,
        sum(if(dt='$do_date',appraise_bad_count,0 )) appraise_bad_count,  
        sum(if(dt='$do_date',appraise_default_count,0 )) appraise_default_count,
        sum(order_count) order_count30 ,
        sum(order_num) order_num30,
        sum(order_amount) order_amount30,
        sum(payment_count) payment_count30,
        sum(payment_num) payment_num30,
        sum(payment_amount) payment_amount30,
        sum(refund_count) refund_count30,
        sum(refund_num) refund_num30,
        sum(refund_amount) refund_amount30,
        sum(cart_count) cart_count30,
        sum(favor_count) favor_count30,
        sum(appraise_good_count) appraise_good_count30,
        sum(appraise_mid_count) appraise_mid_count30,
        sum(appraise_bad_count) appraise_bad_count30,
        sum(appraise_default_count) appraise_default_count30 
    from ${APP}.dws_sku_action_daycount
    where dt >= date_add ('$do_date', -30)
    group by sku_id    
)new 
on new.sku_id = old.sku_id
left join 
(select * from ${APP}.dwd_dim_sku_info where dt='$do_date') sku_info
on nvl(new.sku_id,old.sku_id)= sku_info.id;

insert overwrite table ${APP}.dwt_activity_topic
select
    nvl(new.id,old.id),
    nvl(new.activity_name,old.activity_name),
    nvl(new.activity_type,old.activity_type),
    nvl(new.start_time,old.start_time),
    nvl(new.end_time,old.end_time),
    nvl(new.create_time,old.create_time),
    nvl(new.display_count,0),
    nvl(new.order_count,0),
    nvl(new.order_amount,0.0),
    nvl(new.payment_count,0),
    nvl(new.payment_amount,0.0),
    nvl(new.display_count,0)+nvl(old.display_count,0),
    nvl(new.order_count,0)+nvl(old.order_count,0),
    nvl(new.order_amount,0.0)+nvl(old.order_amount,0.0),
    nvl(new.payment_count,0)+nvl(old.payment_count,0),
    nvl(new.payment_amount,0.0)+nvl(old.payment_amount,0.0)
from
(
    select
        *
    from ${APP}.dwt_activity_topic
)old
full outer join
(
    select
        *
    from ${APP}.dws_activity_info_daycount
    where dt='$do_date'
)new
on old.id=new.id;

insert overwrite table ${APP}.dwt_area_topic
select
    nvl(old.id,new.id),
    nvl(old.province_name,new.province_name),
    nvl(old.area_code,new.area_code),
    nvl(old.iso_code,new.iso_code),
    nvl(old.region_id,new.region_id),
    nvl(old.region_name,new.region_name),
    nvl(new.login_day_count,0),
    nvl(new.login_last_30d_count,0),
    nvl(new.order_day_count,0),
    nvl(new.order_day_amount,0.0),
    nvl(new.order_last_30d_count,0),
    nvl(new.order_last_30d_amount,0.0),
    nvl(new.payment_day_count,0),
    nvl(new.payment_day_amount,0.0),
    nvl(new.payment_last_30d_count,0),
    nvl(new.payment_last_30d_amount,0.0)
from 
(
    select
        *
    from ${APP}.dwt_area_topic
)old
full outer join
(
    select
        id,
        province_name,
        area_code,
        iso_code,
        region_id,
        region_name,
        sum(if(dt='$do_date',login_count,0)) login_day_count,
        sum(if(dt='$do_date',order_count,0)) order_day_count,
        sum(if(dt='$do_date',order_amount,0.0)) order_day_amount,
        sum(if(dt='$do_date',payment_count,0)) payment_day_count,
        sum(if(dt='$do_date',payment_amount,0.0)) payment_day_amount,
        sum(login_count) login_last_30d_count,
        sum(order_count) order_last_30d_count,
        sum(order_amount) order_last_30d_amount,
        sum(payment_count) payment_last_30d_count,
        sum(payment_amount) payment_last_30d_amount
    from ${APP}.dws_area_stats_daycount
    where dt>=date_add('$do_date',-30)
    group by id,province_name,area_code,iso_code,region_id,region_name
)new
on old.id=new.id;
"

$hive -e "$sql"

三、数仓搭建ADS层

3.0 ADS层思想

-- 1. 解题思路:
   第一步:分析指标,明确指标具体的含义,不可有歧义
   第二步:确定数据来源,寻找数据的方式,先从dwt层 -> dws -> dwd 层依次往回找,找到能满足需求的表
   第三表: 找步骤2表的方式:找相关主题的表,因为在ads层的统计指标,也是按照维度 + 时间 + 指标的方式今天统计。
-- 2. 统计流量,使用用户行为数据,统计和业务相关的指标,使用业务数据。
-- 3. 老生常谈:知道每一张表中存储了什么数据。每行数据代表什么意思。关键、关键、关键。

3.1 设备主题

3.1.1 活跃设备数
-- 1. 什么是活跃设备
      a、打开应用的用户即为活跃用户,不考虑用户的使用情况。
	  b、每天一台设备打开多次会被计为一个活跃用户
-- 2. 需求:
     '日活':当日活跃的设备数
     '周活':当周活跃的设备数,在这一周内,多次活跃也计算为1'月活':当月活跃的设备数,在这一月内,多次活跃也计算为1
  • 建表
drop table if exists ads_uv_count;
create external table ads_uv_count(
    `dt` string COMMENT '统计日期',
    `day_count` bigint COMMENT '当日用户数量',
    `wk_count`  bigint COMMENT '当周用户数量',
    `mn_count`  bigint COMMENT '当月用户数量',
    `is_weekend` string COMMENT 'Y,N是否是周末,用于得到本周最终结果',
    `is_monthend` string COMMENT 'Y,N是否是月末,用于得到本月最终结果' 
) COMMENT '活跃设备数'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_uv_count/';
  • 插入数据
insert into ads_uv_count
select 
    '2020-06-25' dt,
    day_count,
    wk_count,
    mn_count ,       
    if('2020-06-25'=date_add(next_day('2020-06-25','mo'),-1),'Y','N')  is_weekend ,
    if('2020-06-25'=last_day('2020-06-25'),'Y','N')    is_monthend
from
    (
     select 
         '2020-06-25' dt,
         sum(if(login_day_count > 0 , 1 , 0 ))  day_count,
         sum(if(login_date_last  >= date_add(next_day('2020-06-25','mo'),-7) , 1 , 0)) wk_count,
         sum(if(date_format(login_date_last,'yyyy-MM')  = date_format('2020-06-25','yyyy-MM'),1,0))  mn_count     
     from dwt_uv_topic
    )tmp
3.1.2 每日新增设备
  • 建表
drop table if exists ads_new_mid_count;
create external table ads_new_mid_count
(
    `create_date`     string comment '创建时间' ,
    `new_mid_count`   BIGINT comment '新增设备数量' 
)  COMMENT '每日新增设备信息数量'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_new_mid_count/';
  • 插入数据
insert into table ads_new_mid_count 
select
    '2020-06-25',
    count(*)
from dwt_uv_topic
where login_date_first='2020-06-25';
3.1.3 留存率
-- 1. 什么是留存率?
    a、某段时间内的新增用户,经过一段时间后,仍然使用应用的被认作是留存用户;这部分用户占当时新增用户的比例即是留存率。
	b、例如,5月份新增用户200,这200人启动情况:
	  6月份:启动人数为100人,5月份新增用户一个月后的留存率是50%
	  7月份:启动人数为80人,5月份新增用户二个月后的留存率是40%
	  8月份:启动人数为100人,5月份新增用户三个月后的留存率是50%
	c、留存用户一般是统计留存率,同时必须有两个参数:哪个月份的几月的留存率
-- 2. 本案例需要统计的指标是:
    计算每天的123日留存率
-- 3. 实现方式
    第一步:统计当天所有的活跃用户
    第二步:统计昨天的1日留存率,求出昨天的新用户但是今天上线的用户/昨天的新用户
    第三步:统计前天的2日留存率,求出前天的新用户但是今天上线的用户/前天的新用户
    所以需要统计的数量有:
    1. 昨天的新用户但是今天上线的用户
    2. 昨天的新用户
    3. 前天的新用户但是今天上线的用户
    4. 前天的新用户

image-20200708203545115

  • 建表
drop table if exists ads_user_retention_day_rate;
create external table ads_user_retention_day_rate 
(
     `stat_date`          string comment '统计日期',
     `create_date`       string  comment '设备新增日期',
     `retention_day`     int comment '截止当前日期留存天数',
     `retention_count`    bigint comment  '留存数量',
     `new_mid_count`     bigint comment '设备新增数量',
     `retention_ratio`   decimal(16,2) comment '留存率'
)  COMMENT '每日用户留存情况'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_retention_day_rate/';
  • 插入数据
   with tmp_uv_topic
    as (
        select 
             '2020-06-26'  stat_date,
             sum(if(login_date_first = date_add('2020-06-26',-1),1,0))  1day_count , -- 昨天的新用户
             sum(if(login_date_first = date_add('2020-06-26',-1) and login_date_last = '2020-06-26',1,0)) new_1day ,--昨天的新用户但是今天上线的用户
             sum(if(login_date_first = date_add('2020-06-26',-2),1,0))  2day_count , -- 前天的新用户
             sum(if(login_date_first = date_add('2020-06-26',-2) and login_date_last = '2020-06-26',1,0)) new_2day ,--前天的新用户但是今天上线的用户
             sum(if(login_date_first = date_add('2020-06-26',-3),1,0))  3day_count , -- 大前天的新用户
             sum(if(login_date_first = date_add('2020-06-26',-3) and login_date_last = '2020-06-26',1,0)) new_3day --大前天的新用户但是今天上线的用户
         from dwt_uv_topic
    )
     insert into ads_user_retention_day_rate
     select
          '2020-06-26'  stat_date, --统计日期
          date_add('2020-06-26',-1) create_date,--设备新增日期
          1 retention_day,--截止当前日期留存天数
          new_1day retention_count,--留存数量
          1day_count new_mid_count,--设备新增数量
          new_1day * 100 /1day_count retention_ratio--留存率
     from  tmp_uv_topic
     
    union all 
     
    select
          '2020-06-26'  stat_date,--统计日期
          date_add('2020-06-26',-2) create_date,--设备新增日期
          2 retention_day,--截止当前日期留存天数
          new_2day retention_count,--留存数量
          2day_count new_mid_count,--设备新增数量
          new_2day * 100 / 2day_count retention_ratio --留存率
    from  tmp_uv_topic
    union all 
     
    select
          '2020-06-26'  stat_date,--统计日期
          date_add('2020-06-26',-3) create_date,--设备新增日期
          3 retention_day,--截止当前日期留存天数
          new_3day retention_count,--留存数量
          3day_count new_mid_count,--设备新增数量
          new_3day * 100 / 3day_count retention_ratio --留存率
    from  tmp_uv_topic
3.1.4 沉默用户数
-- 1. 什么是沉默用户?
   只在安装当天启动过,且启动时间是在7天前
-- 2. 实现过程
   1. 统计首次活跃时间 = 最后末次活跃时间,且最后活跃时间在7天前的用户
  • 建表
drop table if exists ads_silent_count;
create external table ads_silent_count( 
    `dt` string COMMENT '统计日期',
    `silent_count` bigint COMMENT '沉默设备数'
) 
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_silent_count';
  • 插入数据
insert into table ads_silent_count
select
    '2020-06-25',
    count(*) 
from dwt_uv_topic
where login_date_first=login_date_last
and login_date_last < date_add('2020-06-25',-7);
3.1.5 本周回流用户数
-- 1. 什么是本周回流用户?
   上周未活跃,本周活跃的设备,且不是本周新增设备
-- 2. 实现步骤:
   第一步:获取本周活跃的用户且不是本周新增的用户
   第二步:获取上周的活跃的用户
   第三步:第一步获取的用户减少第二步获取的用户就是本周回流的用户
  • 建表
drop table if exists ads_back_count;
create external table ads_back_count( 
    `dt` string COMMENT '统计日期',
    `wk_dt` string COMMENT '统计日期所在周',
    `wastage_count` bigint COMMENT '回流设备数'
) 
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_back_count';
  • 插入数据
insert into ads_back_count
    select
        '2020-06-25' dt ,
        concat(date_add(next_day('2020-06-25','MO'),-7),'_', date_add(next_day('2020-06-25','MO'),-1)) wk_dt,
        count(*) 
    from dwt_uv_topic
    where login_date_last >= date_add(next_day('2020-06-25','mo'),-7) 
    and login_date_first < date_add(next_day('2020-06-25','mo'),-7) 
    and mid_id not in (
        select 
             mid_id
        from dws_uv_detail_daycount 
        where dt < date_add(next_day('2020-06-25','mo'),-7) 
        and dt >= date_add(next_day('2020-06-25','mo'),-14)         
    )
3.1.6 流失用户
-- 1. 什么是流失用户
	最近7天未活跃的设备
-- 2. 实现步骤
    第一步:获取最近活跃时间小于7
  • 建表
drop table if exists ads_wastage_count;
create external table ads_wastage_count( 
    `dt` string COMMENT '统计日期',
    `wastage_count` bigint COMMENT '流失设备数'
) 
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_wastage_count';
  • 插入数据
insert into table ads_wastage_count
    select
         '2020-06-25' dt ,
         count(*) 
    from dwt_uv_topic
    where login_date_last < date_add('2020-06-25',-7)
3.1.7 最近连续三周活跃用户数
-- 1. 实现步骤
   第一步: 从dws层获取前一周、前两周以及当前周的所有活跃的用户
   第二步: 然后进行内连接,能连接上的,则说明这连续的3周都活跃了,最后按照用户进行分组去重后求count。
  • 建表
drop table if exists ads_continuity_wk_count;
create external table ads_continuity_wk_count( 
    `dt` string COMMENT '统计日期,一般用结束周周日日期,如果每天计算一次,可用当天日期',
    `wk_dt` string COMMENT '持续时间',
    `continuity_count` bigint COMMENT '活跃次数'
) 
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_continuity_wk_count';
  • 插入数据
  with 1wk_mid
    as (
        select 
            mid_id
        from dws_uv_detail_daycount 
        where dt >= date_add(next_day('2020-06-25','mo'),-7)
        group by mid_id     
    ),
    2wk_mid  as (
        select 
            mid_id
        from dws_uv_detail_daycount
        where dt >= date_add(next_day('2020-06-25','mo'),-14)
        and dt < date_add(next_day('2020-06-25','mo'),-7)
        group by mid_id
    ),
    3wk_mid  as (  
        select 
            mid_id
        from dws_uv_detail_daycount
        where dt >= date_add(next_day('2020-06-25','mo'),-21)
        and dt < date_add(next_day('2020-06-25','mo'),-14) 
        group by mid_id
    )
    
    insert into  ads_continuity_wk_count
    
    select 
        '2020-06-25',
        concat(date_add(next_day('2020-06-25','MO'),-7*3),'_',date_add(next_day('2020-06-25','MO'),-1)),
        count(*)
    from 1wk_mid 
    join 2wk_mid on 1wk_mid.mid_id = 2wk_mid.mid_id
    join 3wk_mid on 2wk_mid.mid_id = 3wk_mid.mid_id
3.1.8 最近七天内连续三天活跃
-- 1. 思路:
   第一步:从dws层获取最近7天的数据,对数据按照用户分组进行开窗,按照活跃时间进行降序排序
   第二步:使用活跃时间减去排名,获取一列
   第三步:按照用户和第三步的列进行分组,求count(*) >= 3的用户
   第四步:分组去重
   第五步:求7天内连续3天的活跃用户
  • 建表
drop table if exists ads_continuity_uv_count;
create external table ads_continuity_uv_count( 
    `dt` string COMMENT '统计日期',
    `wk_dt` string COMMENT '最近7天日期',
    `continuity_count` bigint
) COMMENT '连续活跃设备数'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_continuity_uv_count';
  • 分析
  -- 第一步:从dws层获取最近7天的数据,对数据按照用户分组进行开窗,按照活跃时间进行降序排序
    select 
		mid_id,
        dt ,
        row_number() over(partition by mid_id order by dt ) rk
    from dws_uv_detail_daycount
    where dt > date_add('2020-06-25',-7)  --t1
    
	-- 第二步:使用活跃时间减去排名,获取一列
    select 
		mid_id,
        dt ,
		date_add(dt,-rk) dt_rk
	from t1    --t2

	-- 第三步:按照用户和第三步的列进行分组,求count(*) > =3的用户
	select 
	    mid_id
	from t2 
	group by mid_id,dt_rk
    having 	count(*) >= 3--t3

	-- 第四步:分组去重
	select 
	     mid_id
	from t3
	group by mid_id  --t4

	--第五步:求7天内连续3天的活跃用户
	select 
	'2020-06-25' dt,
	concat(date_add('2020-06-25',-6),'_','2020-06-25') wk_dt ,
	count(*) continuity_count
	from t4
  • 插入数据
insert into ads_continuity_uv_count
select 
        '2020-06-25' dt,
        concat(date_add('2020-06-25',-7),'_','2020-06-25') wk_dt ,
        count(*) continuity_count
    from (
        select 
            mid_id
        from (
            select 
                mid_id
            from (
                select 
                    mid_id,
                    dt ,
                    date_add(dt,rk) dt_rk
                from (
                    select 
                        mid_id,
                        dt ,
                        row_number() over(partition by mid_id order by dt ) rk
                    from dws_uv_detail_daycount
                    where dt > date_add('2020-06-25',-7)
                    and dt < '2020-06-25'
                )t1
            )t2 
            group by mid_id,dt_rk
            having  count(*) >= 3       
        )t3
        group by mid_id
    )t4

3.2 会员主题

3.2.1 会员主题信息
-- 几个指标说明
   1. 总付费会员数:指付费的会员数量
   2. 会员活跃率 = 今天会员活跃数量 /  总的会员数量
   3. 会员付费率 = 今天会员付费人数 /  总的会员数量
   4. 会员新鲜度 = 今天新增会员数量 / 今天活跃的会员数量
  • 建表
drop table if exists ads_user_topic;
create external table ads_user_topic(
    `dt` string COMMENT '统计日期',
    `day_users` string COMMENT '活跃会员数',
    `day_new_users` string COMMENT '新增会员数',
    `day_new_payment_users` string COMMENT '新增消费会员数',
    `payment_users` string COMMENT '总付费会员数',
    `users` string COMMENT '总会员数',
    `day_users2users` decimal(16,2) COMMENT '会员活跃率',
    `payment_users2users` decimal(16,2) COMMENT '会员付费率',
    `day_new_users2users` decimal(16,2) COMMENT '会员新鲜度'
) COMMENT '会员主题信息表'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_topic';
  • 插入数据
insert into ads_user_topic
select 
   '2020-06-25' dt ,--统计日期
   sum(if(login_date_last = '2020-06-25',1,0)) day_users,--今天活跃会员数
   sum(if(login_date_first = '2020-06-25',1,0)) day_new_users,--今天新增会员数
   sum(if(payment_date_first = '2020-06-25',1,0)) day_new_payment_users,--今天新增消费会员数
   sum(if(payment_count > 0,1,0)) payment_users , --总付费会员数
   count(*) users ,-- 总会员数
   sum(if(login_date_last = '2020-06-25',1,0))*100/count(*)    day_users2users ,--会员活跃率
   sum(if(payment_count > 0,1,0))*100/count(*)  payment_users2users ,--会员付费率
   sum(if(login_date_first = '2020-06-25',1,0))*100/sum(if(login_date_last = '2020-06-25',1,0))    day_new_users2users --会员新鲜度
   
from dwt_user_topic
3.2.2 漏斗分析
-- 1. 什么是漏斗分析?
   其实是指转化率,同一个会员同一次操作业务过程为:
   浏览页面 -> 进入详情的页面 -> 加入购物车 -> 下单 -> 支付
   理论上从左往右的人数是越来越少,所以称之为漏斗
-- 2. 说明:
   理论上:进入详情页面的方式有很多,不止是通过浏览页面,所以我们在计算从浏览页面以后进入详情页面应该是连续的,才能算作从浏览
   页面到页面详情的转换率,但是在本案例中,比较简单粗暴,处理的方式是:
-- 3. 统计方式:
   1. '统计浏览首页的人数':统计今天浏览过详情页面的人数,
                          同一个人多次浏览首页算作1,数据来源每日设备行为:dwd_page_log
   2. '浏览商品详情页面的人数':统计今天浏览过详情页面的人数,
                          同一个用户多次浏览商品详情算作1次,数据来源每日设备行为:dwd_page_log
   3. '加购物车的人数':统计今天加购物车的人数,数据来源:每日会员行为dws_user_action_daycount
   4. '下单人数':统计今天下单的人数,数据来源:每日会员行为dws_user_action_daycount
   5. '支付人数':统计今天支付的人数,数据来源:每日会员行为dws_user_action_daycount
  • 建表
drop table if exists ads_user_action_convert_day;
create external  table ads_user_action_convert_day(
    `dt` string COMMENT '统计日期',
    `home_count`  bigint COMMENT '浏览首页人数',
    `good_detail_count` bigint COMMENT '浏览商品详情页人数',
    `home2good_detail_convert_ratio` decimal(16,2) COMMENT '首页到商品详情转化率',
    `cart_count` bigint COMMENT '加入购物车的人数',
    `good_detail2cart_convert_ratio` decimal(16,2) COMMENT '商品详情页到加入购物车转化率',
    `order_count` bigint     COMMENT '下单人数',
    `cart2order_convert_ratio`  decimal(16,2) COMMENT '加入购物车到下单转化率',
    `payment_amount` bigint     COMMENT '支付人数',
    `order2payment_convert_ratio` decimal(16,2) COMMENT '下单到支付的转化率'
 ) COMMENT '用户行为漏斗分析'
row format delimited  fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_action_convert_day/';
  • 插入数据
with 
tmp_action as (
  select 
    '2020-06-25' dt,
    sum(if(cart_count > 0,1,0)) cart_count,--加入购物车的人数
    sum(if(order_count > 0,1,0)) order_count , --下单人数
    sum(if(payment_count > 0,1,0)) payment_count  --支付人数
  from dws_user_action_daycount
  where dt ='2020-06-25'
),
tmp_page as (
select 
    '2020-06-25' dt , --统计日期
    sum(if(array_contains(pages,'home'),1,0)) home_count, --浏览首页人数
    sum(if(array_contains(pages,'good_detail'),1,0)) good_detail_count, --浏览商品详情页人数
    sum(if(array_contains(pages,'good_detail'),1,0)) * 100 /sum(if(array_contains(pages,'home'),1,0)) home2good_detail_convert_ratio --首页到商品详情转化率
from (
    select 
        mid_id,
        --对用户进行分组,过滤出今天进入首页和详情页的用户,获取当天用户的页面行为,去重后放到一个集合中
        -- 那么一行数据有如下2种情况
        -- 用户     page_id
        -- 243  ["good_detail","home"]
        -- 63   ["home"]
        collect_set(page_id) pages
    from dwd_page_log
    where dt = '2020-06-25'
    and page_id in ('home','good_detail') 
    group by mid_id
    )tmp
)
insert into ads_user_action_convert_day
select 
     '2020-06-25' dt , --统计日期
     home_count, --浏览首页人数
     good_detail_count, --浏览商品详情页人数
     home2good_detail_convert_ratio ,--首页到商品详情转化率
     cart_count,--加入购物车的人数
     cart_count *100/good_detail_count good_detail2cart_convert_ratio,--商品详情页到加入购物车转化率
     order_count , --下单人数
     order_count *100/cart_count  cart2order_convert_ratio ,--加入购物车到下单转化率
     payment_count,  --支付人数
     payment_count * 100 / order_count order2payment_convert_ratio --下单到支付的转化率
from tmp_action 
join tmp_page on  tmp_action.dt = tmp_page.dt  

3.3 商品主题

3.3.1 商品个数信息
-- 1. 需求分析:
    `sku_num` string COMMENT 'sku个数',
    统计到目前为止的sku数量
    `spu_num` string COMMENT 'spu个数'
    统计到目前为止的spu数量
  • 建表
drop table if exists ads_product_info;
create external table ads_product_info(
    `dt` string COMMENT '统计日期',
    `sku_num` string COMMENT 'sku个数',
    `spu_num` string COMMENT 'spu个数'
) COMMENT '商品个数信息'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_product_info';
  • 插入数据
with 
tmp_sku as(
    select 
       '2020-06-25'  dt,
       count(*) sku_num --sku个数  
    from dwt_sku_topic
  ),
tmp_spu as(
    select
       '2020-06-25'  dt,
       count(*) spu_num --spu个数
    from (
        select 
           '2020-06-25'  dt,
           spu_id
        from dwt_sku_topic
        group by spu_id
    )tmp   
  )
  insert into ads_product_info
  select
      '2020-06-25'  dt,
      sku_num,
      spu_num
  from tmp_sku 
  join tmp_spu on tmp_sku.dt = tmp_spu.dt
3.3.2 商品销量排名
-- 1. 商品销量排名:
   1. 是按照什么规则进行排名?本需求是按照当天的产品的支付金额的大小进行排名
  • 建表
drop table if exists ads_product_sale_topN;
create external table ads_product_sale_topN(
    `dt` string COMMENT '统计日期',
    `sku_id` string COMMENT '商品ID',
    `payment_amount` bigint COMMENT '销量'
) COMMENT '商品个数信息'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_product_sale_topN';
  • 插入数据
insert into ads_product_sale_topN
select 
    '2020-06-25' dt,
    sku_id,
    payment_amount 
from  dws_sku_action_daycount
where dt = '2020-06-25'
order by payment_amount desc 
limit 10
3.3.3 商品收藏排名
  • 建表
drop table if exists ads_product_favor_topN;
create external table ads_product_favor_topN(
    `dt` string COMMENT '统计日期',
    `sku_id` string COMMENT '商品ID',
    `favor_count` bigint COMMENT '收藏量'
) COMMENT '商品收藏TopN'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_product_favor_topN';
  • 插入数据
insert into table ads_product_favor_topN
select
    '2020-06-25' dt,
    sku_id,
    favor_count
from
    dws_sku_action_daycount
where
    dt='2020-06-25'
order by favor_count desc
limit 10;
3.3.4 商品加入购物车排名
  • 建表
drop table if exists ads_product_cart_topN;
create external table ads_product_cart_topN(
    `dt` string COMMENT '统计日期',
    `sku_id` string COMMENT '商品ID',
    `cart_count` bigint COMMENT '加入购物车次数'
) COMMENT '商品加入购物车TopN'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_product_cart_topN';
  • 插入数据
insert into table ads_product_cart_topN
select
    '2020-06-25' dt,
    sku_id,
    cart_count
from
    dws_sku_action_daycount
where
    dt='2020-06-25'
order by cart_count desc
limit 10;
3.3.5 商品退款率排名(近30天)
  • 建表
drop table if exists ads_product_refund_topN;
create external table ads_product_refund_topN(
    `dt` string COMMENT '统计日期',
    `sku_id` string COMMENT '商品ID',
    `refund_ratio` decimal(16,2) COMMENT '退款率'
) COMMENT '商品退款率TopN'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_product_refund_topN';
  • 插入数据
insert overwrite table ads_product_refund_topN
select
    '2020-06-25',
    sku_id,
    refund_last_30d_count/payment_last_30d_count*100 refund_ratio
from dwt_sku_topic
order by refund_ratio desc
limit 10;
3.3.6 商品差评率
  • 建表
drop table if exists ads_appraise_bad_topN;
create external table ads_appraise_bad_topN(
    `dt` string COMMENT '统计日期',
    `sku_id` string COMMENT '商品ID',
    `appraise_bad_ratio` decimal(16,2) COMMENT '差评率'
) COMMENT '商品差评率TopN'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_appraise_bad_topN';
  • 插入数据
insert into table ads_appraise_bad_topN
select
    '2020-06-25' dt,
    sku_id,
appraise_bad_count/(appraise_good_count+appraise_mid_count+appraise_bad_count+appraise_default_count) appraise_bad_ratio
from
    dws_sku_action_daycount
where
    dt='2020-06-25'
order by appraise_bad_ratio desc
limit 10;

3.4 营销主题

3.4.1 下单数目统计
  • 建表
drop table if exists ads_order_daycount;
create external table ads_order_daycount(
    dt string comment '统计日期',
    order_count bigint comment '单日下单笔数',
    order_amount bigint comment '单日下单金额',
    order_users bigint comment '单日下单用户数'
) comment '每日订单总计表'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_order_daycount';

插入数据

insert into ads_order_daycount
select 
   '2020-06-25' dt ,
    sum(order_count) order_count, --单日下单笔数
    sum(order_amount ) order_amount ,--单日下单金额
    sum(if(order_count > 0 , 1 , 0)) --单日下单用户数
from  dws_user_action_daycount
where dt = '2020-06-25'

分析

insert into table ads_order_daycount
select
    '2020-06-14',
    sum(order_count),
    sum(order_amount),
    sum(if(order_count>0,1,0))
from dws_user_action_daycount
where dt='2020-06-14';
3.4.2 支付信息统计
  • 建表
drop table if exists ads_payment_daycount;
create external table ads_payment_daycount(
    dt string comment '统计日期',
    order_count bigint comment '单日支付笔数',
    order_amount bigint comment '单日支付金额',
    payment_user_count bigint comment '单日支付人数',
    payment_sku_count bigint comment '单日支付商品数',
    payment_avg_time decimal(16,2) comment '下单到支付的平均时长,取分钟数'
) comment '每日订单总计表'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_payment_daycount';
  • 插入数据
with 
tmp_user as (
    select 
        '2020-06-25' dt , --统计日期
        sum(payment_count) order_count, --单日支付笔数
        sum(payment_amount) order_amount,--单日支付金额
        sum(if(order_count > 0 , 1 , 0)) payment_user_count --单日支付人数
    from dws_user_action_daycount
    where dt='2020-06-25'
),
tmp_action as (
    select
      '2020-06-25' dt,
      sum(if(payment_amount > 0,1,0)) payment_sku_count --单日支付商品数
    from dws_sku_action_daycount
    where dt= '2020-06-25'
),
tmp_order as (  
   select
        '2020-06-25' dt,
        sum(unix_timestamp(payment_time)-unix_timestamp(create_time))/count(*)/60 payment_avg_time --下单到支付的平均时长,取分钟数
    from dwd_fact_order_info
    where dt='2020-06-25'
    and payment_time is not null
  )
  
  insert into ads_payment_daycount
  select
        '2020-06-25' dt , --统计日期
        order_count, --单日支付笔数
        order_amount,--单日支付金额
        payment_user_count ,--单日支付人数
        payment_sku_count ,--单日支付商品数
        payment_avg_time --下单到支付的平均时长,取分钟数
  from tmp_user
  join tmp_action on tmp_user.dt = tmp_action.dt
  join tmp_order on tmp_action.dt =  tmp_order.dt
3.4.3 品牌复购率(月复购)
  • 建表
drop table ads_sale_tm_category1_stat_mn;
create external table ads_sale_tm_category1_stat_mn
(  
    tm_id string comment '品牌id',
    category1_id string comment '1级品类id ',
    category1_name string comment '1级品类名称 ',
    buycount   bigint comment  '购买人数',
    buy_twice_last bigint  comment '两次以上购买人数',
    buy_twice_last_ratio decimal(16,2)  comment  '单次复购率',
    buy_3times_last   bigint comment   '三次以上购买人数',
    buy_3times_last_ratio decimal(16,2)  comment  '多次复购率',
    stat_mn string comment '统计月份',
    stat_date string comment '统计日期' 
)   COMMENT '复购率统计'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_sale_tm_category1_stat_mn/';
  • 插入数据
with 
tmp_order as
(
    select
        user_id,--用户di
        order_stats_struct.sku_id sku_id,--商品id
        order_stats_struct.order_count order_count --商品被购买的次数
        -- 使用侧写的方式,将一个用户当天购买的每个商品的明细(数组)进行侧写,形成多行
    from dws_user_action_daycount lateral view explode(order_detail_stats) tmp as order_stats_struct 
    where date_format(dt,'yyyy-MM')=date_format('2020-06-25','yyyy-MM') --取当天的数据
),
tmp_sku as
(
    select
        id,--商品id
        tm_id, -- 品牌id
        category1_id, -- 一次品类的id
        category1_name --一次品类的名字
    from dwd_dim_sku_info
    where dt='2020-06-25'
)
insert into table ads_sale_tm_category1_stat_mn
select
    tm_id,
    category1_id,
    category1_name,
    sum(if(order_count>=1,1,0)) buycount,
    sum(if(order_count>=2,1,0)) buyTwiceLast,
    sum(if(order_count>=2,1,0))/sum( if(order_count>=1,1,0)) buyTwiceLastRatio,
    sum(if(order_count>=3,1,0))  buy3timeLast  ,
    sum(if(order_count>=3,1,0))/sum( if(order_count>=1,1,0)) buy3timeLastRatio ,
    date_format('2020-06-25' ,'yyyy-MM') stat_mn,
    '2020-06-25' stat_date
from
(
    select 
        tmp_order.user_id, --用户
        tmp_sku.category1_id, -- 一级品类
        tmp_sku.category1_name, -- 一级品类的名字
        tmp_sku.tm_id, -- 品牌
        sum(order_count) order_count -- 购买同一品牌数量
    from tmp_order
    join tmp_sku
    on tmp_order.sku_id=tmp_sku.id
    -- 按照用户 + 1级品类 + 品牌id分组,得到一个用户购买某一个1级品类的某一品牌的个数
    group by tmp_order.user_id,tmp_sku.category1_id,tmp_sku.category1_name,tmp_sku.tm_id
)tmp
-- 按照品牌进行 + 
group by tm_id, category1_id, category1_name;

3.5 地区主题

3.5.1 地区主题信息
  • 建表
drop table if exists ads_area_topic;
create external table ads_area_topic(
    `dt` string COMMENT '统计日期',
    `id` bigint COMMENT '编号',
    `province_name` string COMMENT '省份名称',
    `area_code` string COMMENT '地区编码',
    `iso_code` string COMMENT 'iso编码',
    `region_id` string COMMENT '地区ID',
    `region_name` string COMMENT '地区名称',
    `login_day_count` bigint COMMENT '当天活跃设备数',
    `order_day_count` bigint COMMENT '当天下单次数',
    `order_day_amount` decimal(16,2) COMMENT '当天下单金额',
    `payment_day_count` bigint COMMENT '当天支付次数',
    `payment_day_amount` decimal(16,2) COMMENT '当天支付金额'
) COMMENT '地区主题宽表'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_area_topic/';
  • 插入数据
insert into table ads_area_topic
select
    '2020-06-25',
    id,
    province_name,
    area_code,
    iso_code,
    region_id,
    region_name,
    login_day_count,
    order_day_count,
    order_day_amount,
    payment_day_count,
    payment_day_amount
from dwt_area_topic;

分析

insert into table ads_area_topic
select
    '2020-06-14',
    id,
    province_name,
    area_code,
    iso_code,
    region_id,
    region_name,
    login_day_count,
    order_day_count,
    order_day_amount,
    payment_day_count,
    payment_day_amount
    -- 直接导入dwt的地区表
from dwt_area_topic;

3.6 导入脚本

  1. 在/home/atguigu/bin目录下创建脚本dwt_to_ads.sh
[atguigu@hadoop102 bin]$ vim dwt_to_ads.sh
  1. 脚本内容
#!/bin/bash

hive=/opt/module/hive/bin/hive
APP=gmall
# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$1" ] ;then
    do_date=$1
else 
    do_date=`date -d "-1 day" +%F`
fi

sql="
set mapreduce.job.queuename=hive;
insert into table ${APP}.ads_uv_count 
select  
    '$do_date' dt,
    daycount.ct,
    wkcount.ct,
    mncount.ct,
    if(date_add(next_day('$do_date','MO'),-1)='$do_date','Y','N') ,
    if(last_day('$do_date')='$do_date','Y','N') 
from 
(
    select  
        '$do_date' dt,
        count(*) ct
    from ${APP}.dwt_uv_topic
    where login_date_last='$do_date'  
)daycount join 
( 
    select  
        '$do_date' dt,
        count (*) ct
    from ${APP}.dwt_uv_topic
    where login_date_last>=date_add(next_day('$do_date','MO'),-7) 
    and login_date_last<= date_add(next_day('$do_date','MO'),-1) 
) wkcount on daycount.dt=wkcount.dt
join 
( 
    select  
        '$do_date' dt,
        count (*) ct
    from ${APP}.dwt_uv_topic
    where date_format(login_date_last,'yyyy-MM')=date_format('$do_date','yyyy-MM')  
)mncount on daycount.dt=mncount.dt;

insert into table ${APP}.ads_new_mid_count 
select
    login_date_first,
    count(*)
from ${APP}.dwt_uv_topic
where login_date_first='$do_date'
group by login_date_first;

insert into table ${APP}.ads_silent_count
select
    '$do_date',
    count(*) 
from ${APP}.dwt_uv_topic
where login_date_first=login_date_last
and login_date_last<=date_add('$do_date',-7);


insert into table ${APP}.ads_back_count
select
'$do_date',
concat(date_add(next_day('$do_date','MO'),-7),'_', date_add(next_day('$do_date','MO'),-1)),
    count(*)
from
(
    select
        mid_id
    from ${APP}.dwt_uv_topic
    where login_date_last>=date_add(next_day('$do_date','MO'),-7) 
    and login_date_last<= date_add(next_day('$do_date','MO'),-1)
    and login_date_first<date_add(next_day('$do_date','MO'),-7)
)current_wk
left join
(
    select
        mid_id
    from ${APP}.dws_uv_detail_daycount
    where dt>=date_add(next_day('$do_date','MO'),-7*2) 
    and dt<= date_add(next_day('$do_date','MO'),-7-1) 
    group by mid_id
)last_wk
on current_wk.mid_id=last_wk.mid_id
where last_wk.mid_id is null;

insert into table ${APP}.ads_wastage_count
select
     '$do_date',
     count(*)
from 
(
    select 
        mid_id
    from ${APP}.dwt_uv_topic
    where login_date_last<=date_add('$do_date',-7)
    group by mid_id
)t1;

insert into table ${APP}.ads_user_retention_day_rate
select
    '$do_date',--统计日期
    date_add('$do_date',-1),--新增日期
    1,--留存天数
    sum(if(login_date_first=date_add('$do_date',-1) and login_date_last='$do_date',1,0)),--$do_date的1日留存数
    sum(if(login_date_first=date_add('$do_date',-1),1,0)),--$do_date新增
    sum(if(login_date_first=date_add('$do_date',-1) and login_date_last='$do_date',1,0))/sum(if(login_date_first=date_add('$do_date',-1),1,0))*100
from ${APP}.dwt_uv_topic

union all

select
    '$do_date',--统计日期
    date_add('$do_date',-2),--新增日期
    2,--留存天数
    sum(if(login_date_first=date_add('$do_date',-2) and login_date_last='$do_date',1,0)),--$do_date的2日留存数
    sum(if(login_date_first=date_add('$do_date',-2),1,0)),--$do_date新增
    sum(if(login_date_first=date_add('$do_date',-2) and login_date_last='$do_date',1,0))/sum(if(login_date_first=date_add('$do_date',-2),1,0))*100
from ${APP}.dwt_uv_topic

union all

select
    '$do_date',--统计日期
    date_add('$do_date',-3),--新增日期
    3,--留存天数
    sum(if(login_date_first=date_add('$do_date',-3) and login_date_last='$do_date',1,0)),--$do_date的3日留存数
    sum(if(login_date_first=date_add('$do_date',-3),1,0)),--$do_date新增
    sum(if(login_date_first=date_add('$do_date',-3) and login_date_last='$do_date',1,0))/sum(if(login_date_first=date_add('$do_date',-3),1,0))*100
from ${APP}.dwt_uv_topic;


insert into table ${APP}.ads_continuity_wk_count
select
    '$do_date',
    concat(date_add(next_day('$do_date','MO'),-7*3),'_',date_add(next_day('$do_date','MO'),-1)),
    count(*)
from
(
    select
        mid_id
    from
    (
        select
            mid_id
        from ${APP}.dws_uv_detail_daycount
        where dt>=date_add(next_day('$do_date','monday'),-7)
        and dt<=date_add(next_day('$do_date','monday'),-1)
        group by mid_id

        union all

        select
            mid_id
        from ${APP}.dws_uv_detail_daycount
        where dt>=date_add(next_day('$do_date','monday'),-7*2)
        and dt<=date_add(next_day('$do_date','monday'),-7-1)
        group by mid_id

        union all

        select
            mid_id
        from ${APP}.dws_uv_detail_daycount
        where dt>=date_add(next_day('$do_date','monday'),-7*3)
        and dt<=date_add(next_day('$do_date','monday'),-7*2-1)
        group by mid_id
    )t1
    group by mid_id
    having count(*)=3
)t2;


insert into table ${APP}.ads_continuity_uv_count
select
    '$do_date',
    concat(date_add('$do_date',-6),'_','$do_date'),
    count(*)
from
(
    select mid_id
    from
    (
        select mid_id      
        from
        (
            select 
                mid_id,
                date_sub(dt,rank) date_dif
            from
            (
                select 
                    mid_id,
                    dt,
                    rank() over(partition by mid_id order by dt) rank
                from ${APP}.dws_uv_detail_daycount
                where dt>=date_add('$do_date',-6) and dt<='$do_date'
            )t1
        )t2 
        group by mid_id,date_dif
        having count(*)>=3
    )t3 
    group by mid_id
)t4;


insert into table ${APP}.ads_user_topic
select
    '$do_date',
    sum(if(login_date_last='$do_date',1,0)),
    sum(if(login_date_first='$do_date',1,0)),
    sum(if(payment_date_first='$do_date',1,0)),
    sum(if(payment_count>0,1,0)),
    count(*),
    sum(if(login_date_last='$do_date',1,0))/count(*),
    sum(if(payment_count>0,1,0))/count(*),
    sum(if(login_date_first='$do_date',1,0))/sum(if(login_date_last='$do_date',1,0))
from ${APP}.dwt_user_topic;

with
tmp_uv as
(
    select
        '$do_date' dt,
        sum(if(array_contains(pages,'home'),1,0)) home_count,
        sum(if(array_contains(pages,'good_detail'),1,0)) good_detail_count
    from
    (
        select
            mid_id,
            collect_set(page_id) pages
        from ${APP}.dwd_page_log
        where dt='$do_date'
        and page_id in ('home','good_detail')
        group by mid_id
    )tmp
),
tmp_cop as
(
    select 
        '$do_date' dt,
        sum(if(cart_count>0,1,0)) cart_count,
        sum(if(order_count>0,1,0)) order_count,
        sum(if(payment_count>0,1,0)) payment_count
    from ${APP}.dws_user_action_daycount
    where dt='$do_date'
)
insert into table ${APP}.ads_user_action_convert_day
select
    tmp_uv.dt,
    tmp_uv.home_count,
    tmp_uv.good_detail_count,
    tmp_uv.good_detail_count/tmp_uv.home_count*100,
    tmp_cop.cart_count,
    tmp_cop.cart_count/tmp_uv.good_detail_count*100,
    tmp_cop.order_count,
    tmp_cop.order_count/tmp_cop.cart_count*100,
    tmp_cop.payment_count,
    tmp_cop.payment_count/tmp_cop.order_count*100
from tmp_uv
join tmp_cop
on tmp_uv.dt=tmp_cop.dt;

insert into table ${APP}.ads_product_info
select
    '$do_date' dt,
    sku_num,
    spu_num
from
(
    select
        '$do_date' dt,
        count(*) sku_num
    from
        ${APP}.dwt_sku_topic
) tmp_sku_num
join
(
    select
        '$do_date' dt,
        count(*) spu_num
    from
    (
        select
            spu_id
        from
            ${APP}.dwt_sku_topic
        group by
            spu_id
    ) tmp_spu_id
) tmp_spu_num
on
    tmp_sku_num.dt=tmp_spu_num.dt;


insert into table ${APP}.ads_product_sale_topN
select
    '$do_date' dt,
    sku_id,
    payment_amount
from
    ${APP}.dws_sku_action_daycount
where
    dt='$do_date'
order by payment_amount desc
limit 10;

insert into table ${APP}.ads_product_favor_topN
select
    '$do_date' dt,
    sku_id,
    favor_count
from
    ${APP}.dws_sku_action_daycount
where
    dt='$do_date'
order by favor_count desc
limit 10;

insert into table ${APP}.ads_product_cart_topN
select
    '$do_date' dt,
    sku_id,
    cart_count
from
    ${APP}.dws_sku_action_daycount
where
    dt='$do_date'
order by cart_count desc
limit 10;


insert into table ${APP}.ads_product_refund_topN
select
    '$do_date',
    sku_id,
    refund_last_30d_count/payment_last_30d_count*100 refund_ratio
from ${APP}.dwt_sku_topic
order by refund_ratio desc
limit 10;


insert into table ${APP}.ads_appraise_bad_topN
select
    '$do_date' dt,
    sku_id,
appraise_bad_count/(appraise_good_count+appraise_mid_count+appraise_bad_count+appraise_default_count) appraise_bad_ratio
from
    ${APP}.dws_sku_action_daycount
where
    dt='$do_date'
order by appraise_bad_ratio desc
limit 10;


insert into table ${APP}.ads_order_daycount
select
    '$do_date',
    sum(order_count),
    sum(order_amount),
    sum(if(order_count>0,1,0))
from ${APP}.dws_user_action_daycount
where dt='$do_date';


insert into table ${APP}.ads_payment_daycount
select
    tmp_payment.dt,
    tmp_payment.payment_count,
    tmp_payment.payment_amount,
    tmp_payment.payment_user_count,
    tmp_skucount.payment_sku_count,
    tmp_time.payment_avg_time
from
(
    select
        '$do_date' dt,
        sum(payment_count) payment_count,
        sum(payment_amount) payment_amount,
        sum(if(payment_count>0,1,0)) payment_user_count
    from ${APP}.dws_user_action_daycount
    where dt='$do_date'
)tmp_payment
join
(
    select
        '$do_date' dt,
        sum(if(payment_count>0,1,0)) payment_sku_count 
    from ${APP}.dws_sku_action_daycount
    where dt='$do_date'
)tmp_skucount on tmp_payment.dt=tmp_skucount.dt
join
(
    select
        '$do_date' dt,
        sum(unix_timestamp(payment_time)-unix_timestamp(create_time))/count(*)/60 payment_avg_time
    from ${APP}.dwd_fact_order_info
    where dt='$do_date'
    and payment_time is not null
)tmp_time on tmp_payment.dt=tmp_time.dt;


with 
tmp_order as
(
    select
        user_id,
        order_stats_struct.sku_id sku_id,
        order_stats_struct.order_count order_count
    from ${APP}.dws_user_action_daycount lateral view explode(order_detail_stats) tmp as order_stats_struct
    where date_format(dt,'yyyy-MM')=date_format('$do_date','yyyy-MM')
),
tmp_sku as
(
    select
        id,
        tm_id,
        category1_id,
        category1_name
    from ${APP}.dwd_dim_sku_info
    where dt='$do_date'
)
insert into table ${APP}.ads_sale_tm_category1_stat_mn
select
    tm_id,
    category1_id,
    category1_name,
    sum(if(order_count>=1,1,0)) buycount,
    sum(if(order_count>=2,1,0)) buyTwiceLast,
    sum(if(order_count>=2,1,0))/sum( if(order_count>=1,1,0)) buyTwiceLastRatio,
    sum(if(order_count>=3,1,0))  buy3timeLast  ,
    sum(if(order_count>=3,1,0))/sum( if(order_count>=1,1,0)) buy3timeLastRatio ,
    date_format('$do_date' ,'yyyy-MM') stat_mn,
    '$do_date' stat_date
from
(
    select 
        tmp_order.user_id,
        tmp_sku.category1_id,
        tmp_sku.category1_name,
        tmp_sku.tm_id,
        sum(order_count) order_count
    from tmp_order
    join tmp_sku
    on tmp_order.sku_id=tmp_sku.id
    group by tmp_order.user_id,tmp_sku.category1_id,tmp_sku.category1_name,tmp_sku.tm_id
)tmp
group by tm_id, category1_id, category1_name;



insert into table ${APP}.ads_area_topic
select
    '$do_date',
    id,
    province_name,
    area_code,
    iso_code,
    region_id,
    region_name,
    login_day_count,
    order_day_count,
    order_day_amount,
    payment_day_count,
    payment_day_amount
from ${APP}.dwt_area_topic;

"

$hive -e "$sql"
ods层:24张
    log:1张
    业务数据:23张。
dwd层:19张
    log:5张
    业务数据:6张维度表,8张事实表
dws和dwt层:12张
    维度:设备、会员、商品、地区、时间、活动:12张
ads层:20张
    设备:8张
    会员:2张
    商品:6张
    营销:3张
    地区:1张
共计:20 + 12 + 19 + 24 = 75张
另外:
临时表:拉链表:1张
       时间表:1张
  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论
一、课程简介随着技术的飞速发展,经过多年的数据积累,各互联网公司已保存了海量的原始数据和各种业务数据,所以数据仓库技术是各大公司目前都需要着重发展投入的技术领域。数据仓库是面向分析的集成化数据环境,为企业所有决策制定过程,提供系统数据支持的战略集合。通过对数据仓库中数据的分析,可以帮助企业改进业务流程、控制成本、提高产品质量等。、课程内容本次精心打造的数仓项目的课程,从项目架构的搭建,到数据采集模块的设计、数仓架构的设计、实战需求实现、即席查询的实现,我们针对国内目前广泛使用的Apache原生框架和CDH版本框架进行了分别介绍,Apache原生框架介绍中涉及到的技术框架包括Flume、Kafka、Sqoop、MySql、HDFS、Hive、Tez、Spark、Presto、Druid等,CDH版本框架讲解包括CM的安装部署、Hadoop、Zookeeper、Hive、Flume、Kafka、Oozie、Impala、HUE、Kudu、Spark的安装配置,透彻了解不同版本框架的区别联系,将大数据全生态系统前沿技术一网打尽。在过程中对大数据生态体系进行了系统的讲解,对实际企业数仓项目中可能涉及到的技术点都进行了深入的讲解和探讨。同时穿插了大量数仓基础理论知识,让你在掌握实战经验的同时能够打下坚实的理论基础。三、课程目标本课程以国内电商巨头实际业务应用场景为依托,对电商数仓的常见实战指标以及难点实战指标进行了详尽讲解,具体指标包括:每日、周、月活跃设备明细,留存用户比例,沉默用户、回流用户、流失用户统计,最近连续3周活跃用户统计,最近7天内连续3天活跃用户统计,GMV成交总额分析,转化率及漏斗分析,品牌复购率分析、订单表拉链表的设计等,让学生拥有更直观全面的实战经验。通过对本课程的学习,对数仓项目可以建立起清晰明确的概念,系统全面的掌握各项数仓项目技术,轻松应对各种数仓难题。四、课程亮点本课程结合国内多家企业实际项目经验,特别加入了项目架构模块,从集群规模的确定到框架版本选型以及服务器选型,手把手教你从零开始搭建大数据集群。并且总结大量项目实战中会遇到的问题,针对各个技术框架,均有调优实战经验,具体包括:常用Linux运维命令、Hadoop集群调优、Flume组件选型及性能优化、Kafka集群规模确认及关键参数调优。通过这部分学习,助学生迅速成长,获取前沿技术经验,从容解决实战问题。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

源码头

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值