Replacing a Traditional Data Warehouse with HAWQ (11): Dimension Table Techniques, Merging Dimensions

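Original post, 2017-05-31 15:03:12
        One case for merging dimensions arises when attributes that are really the same have, for one reason or another, been modeled as duplicate dimension attributes. In the sales order example, as more dimensions are added to the data warehouse, we find that some common data appears in several places. Both the customer address information and the shipping address information in the customer dimension contain a zip code, a city, and a state. This article shows how to merge the two sets of zip-code-related attributes in the customer dimension into one new dimension.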

I. Modifying the Data Warehouse Schema

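        Merging the dimensions requires a change to the data warehouse schema. Figure 1 shows the modified structure: a new zip code dimension table, zip_code_dim, is added, and the sales_order_fact fact table is changed accordingly.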

Figure 1

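        The zip_code_dim dimension table is related directly to the sales order fact table. For zip code information, this relationship replaces the path through the customer dimension (the fact table still references customer_dim through customer_sk). sales_order_fact needs two such relationships, one to the customer address zip code and one to the shipping address zip code, so two foreign key columns are added. Zip code information is assumed never to change, so zip_code_dim carries no SCD attributes such as a delete flag, version number, or effective date.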
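To make the two relationships concrete, here is a minimal sketch of one physical zip code table referenced twice by a fact table through two renamed views. It is an illustration under stated assumptions, not the article's script: it uses Python's built-in sqlite3 module as a stand-in for HAWQ, the fact table is reduced to three columns, and the sample rows are invented.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for HAWQ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table zip_code_dim (
    zip_code_sk integer primary key,  -- surrogate key
    zip_code    integer,
    city        text,
    state       text
);
-- Invented sample rows, for illustration only.
insert into zip_code_dim (zip_code, city, state) values
    (17050, 'mechanicsburg', 'pa'),
    (44102, 'cleveland', 'oh');

-- Two views over the same physical table, one per role.
create view v_customer_zip_code_dim
    (customer_zip_code_sk, customer_zip_code, customer_city, customer_state) as
select zip_code_sk, zip_code, city, state from zip_code_dim;

create view v_shipping_zip_code_dim
    (shipping_zip_code_sk, shipping_zip_code, shipping_city, shipping_state) as
select zip_code_sk, zip_code, city, state from zip_code_dim;

-- Simplified fact table: one foreign key column per role.
create table sales_order_fact (
    order_number         integer,
    customer_zip_code_sk integer,
    shipping_zip_code_sk integer
);
insert into sales_order_fact values (1, 1, 2);
""")

# Each role is joined through its own view, so the column names stay unambiguous.
rows = conn.execute("""
select f.order_number, c.customer_city, s.shipping_city
  from sales_order_fact f
  join v_customer_zip_code_dim c
    on f.customer_zip_code_sk = c.customer_zip_code_sk
  join v_shipping_zip_code_dim s
    on f.shipping_zip_code_sk = s.shipping_zip_code_sk
""").fetchall()
print(rows)
```

The views list their columns explicitly instead of using select *, which keeps the role-specific names tied to the right source columns even if the base table later gains a column.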
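        The following script modifies the data warehouse schema. It makes these changes:
  • Create the zip code dimension table zip_code_dim.
  • Perform the initial load of the zip code data.
  • Create the views v_customer_zip_code_dim and v_shipping_zip_code_dim on top of zip_code_dim.
  • Add the columns customer_zip_code_sk and shipping_zip_code_sk to sales_order_fact.
  • Populate the two zip code surrogate keys from the existing customer and shipping zip codes.
  • Drop the customer and shipping zip code columns, together with their city and state columns, from customer_dim.
  • Drop the same zip code, city, and state columns from pa_customer_dim.
  • Recreate the views that depend on the two customer dimension tables.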
set search_path=tds;
 
-- Create the zip code dimension table
create table zip_code_dim 
(
 zip_code_sk serial,    
 zip_code int,    
 city varchar(30),    
 state varchar(2)
);  

comment on table zip_code_dim is 'Zip code dimension table';
comment on column zip_code_dim.zip_code_sk is 'Zip code dimension surrogate key';
comment on column zip_code_dim.zip_code is 'Zip code';
comment on column zip_code_dim.city is 'City';
comment on column zip_code_dim.state is 'State';

-- Initial load of the zip code data
insert into zip_code_dim (zip_code, city, state)
select distinct * 
  from (select customer_zip_code, customer_city, customer_state
          from customer_dim
         where customer_zip_code is not null
         union all 
        select shipping_zip_code, shipping_city, shipping_state 
          from customer_dim
         where shipping_zip_code is not null) t1;
		 
-- Create the role-playing views
create view v_customer_zip_code_dim 
(customer_zip_code_sk, customer_zip_code, customer_city, customer_state) as    
select * from zip_code_dim;    
    
create view v_shipping_zip_code_dim 
(shipping_zip_code_sk, shipping_zip_code, shipping_city, shipping_state) as    
select * from zip_code_dim;

-- Add the zip code surrogate key columns
alter table sales_order_fact add column customer_zip_code_sk int default null;
alter table sales_order_fact add column shipping_zip_code_sk int default null;

comment on column sales_order_fact.customer_zip_code_sk is 'Customer zip code surrogate key';
comment on column sales_order_fact.shipping_zip_code_sk is 'Shipping zip code surrogate key';
    
-- Initial load of the two zip code surrogate keys
create table sales_order_fact_bak as select * from sales_order_fact;
truncate table sales_order_fact;

insert into sales_order_fact
select t1.order_number,
       t1.customer_sk,
       t1.product_sk,
       t1.order_date_sk,
       t1.year_month,
       t1.order_amount,
       t1.order_quantity,
       t1.request_delivery_date_sk,
       t1.sales_order_attribute_sk,
       t2.customer_zip_code_sk,
       t3.shipping_zip_code_sk   
  from sales_order_fact_bak t1
  left join 
  (select a.order_number order_number,c.customer_zip_code_sk customer_zip_code_sk  
    from sales_order_fact_bak a,  
         customer_dim b,  
         v_customer_zip_code_dim c  
   where a.customer_sk = b.customer_sk  
     and b.customer_zip_code = c.customer_zip_code) t2 on t1.order_number = t2.order_number
  left join
  (select a.order_number order_number,c.shipping_zip_code_sk shipping_zip_code_sk  
    from sales_order_fact_bak a,  
         customer_dim b,  
         v_shipping_zip_code_dim c  
   where a.customer_sk = b.customer_sk  
     and b.shipping_zip_code = c.shipping_zip_code) t3 on t1.order_number = t3.order_number;   

drop table sales_order_fact_bak;	 

-- Drop the customer and shipping zip code, city, and state columns from customer_dim and pa_customer_dim
alter table customer_dim drop column customer_zip_code cascade;
alter table customer_dim drop column customer_city;
alter table customer_dim drop column customer_state;
alter table customer_dim drop column shipping_zip_code;
alter table customer_dim drop column shipping_city;
alter table customer_dim drop column shipping_state; 

alter table pa_customer_dim drop column customer_zip_code;
alter table pa_customer_dim drop column customer_city;
alter table pa_customer_dim drop column customer_state;
alter table pa_customer_dim drop column shipping_zip_code;
alter table pa_customer_dim drop column shipping_city;
alter table pa_customer_dim drop column shipping_state; 
 
-- Recreate the dependent views
create or replace view v_customer_dim_latest as   
select customer_sk,  
       customer_number,   
       customer_name,  
       customer_street_address,  
       version,  
       effective_date,
       shipping_address	   
  from (select distinct on (customer_number) customer_number,   
               customer_sk,    
               customer_name,  
               customer_street_address, 
               isdelete,   
               version,  
               effective_date,
               shipping_address			   
          from customer_dim  
         order by customer_number, customer_sk desc) as latest   
  where isdelete is false;  

create or replace view v_customer_dim_his as   
select *, date(lead(effective_date,1,date '2200-01-01') over (partition by customer_number order by effective_date)) expiry_date   
  from customer_dim;  

create or replace view v_pa_customer_dim_latest as   
select customer_sk,  
       customer_number,   
       customer_name,  
       customer_street_address,  
       version,  
       effective_date,
       shipping_address	   
  from (select distinct on (customer_number) customer_number,   
               customer_sk,    
               customer_name,  
               customer_street_address, 
               isdelete,   
               version,  
               effective_date,
               shipping_address			   
          from pa_customer_dim  
         order by customer_number, customer_sk desc) as latest   
  where isdelete is false;
  
create or replace view v_pa_customer_dim_his as   
select *, date(lead(effective_date,1,date '2200-01-01') over (partition by customer_number order by effective_date)) expiry_date   
  from pa_customer_dim;  
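        Notes:
  • The initial data for the zip code dimension comes from the customer dimension table, purely to demonstrate the loading process. The customers' zip codes are unlikely to cover every zip code, so a better approach is to load a complete zip code reference table. Because customer addresses and shipping addresses can overlap, distinct is used to remove duplicates. The three shipping address columns were added later, so rows loaded before that have NULL shipping addresses; since the zip code dimension must not contain NULL values, the filter where shipping_zip_code is not null removes the rows whose zip code is NULL.
  • The customer zip code and shipping zip code views are created on the zip code dimension table and serve as role-playing dimensions for the two kinds of geographic information.
  • The data in the backup table sales_order_fact_bak is loaded back into the sales order fact table. The load joins the two zip code role-playing views to look up the two surrogate keys and writes them to the fact table. Note that the old fact table is linked to the new zip code dimension only through the customer dimension table, so each subquery needs a three-table join; two left outer joins then return every row of the original fact table for loading into the new fact table with its added zip code surrogate keys.
  • When dropping columns from customer_dim, the cascade clause is needed so that the views depending on them are dropped as well; the views are then recreated.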
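The initial load and the surrogate key backfill described in these notes can be tried on a small scale. The sketch below is an illustration under stated assumptions: Python's built-in sqlite3 module stands in for HAWQ, the tables are cut down to the columns that matter, the sample rows are invented, and the backfill left-joins the dimension directly instead of going through the two subqueries used in the script above.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for HAWQ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customer_dim (
    customer_sk integer primary key,
    customer_zip_code integer, customer_city text, customer_state text,
    shipping_zip_code integer, shipping_city text, shipping_state text
);
-- Invented rows; customer 3 was loaded before the shipping columns existed.
insert into customer_dim values
    (1, 17050, 'mechanicsburg', 'pa', 17050, 'mechanicsburg', 'pa'),
    (2, 44102, 'cleveland', 'oh', 17050, 'mechanicsburg', 'pa'),
    (3, 44102, 'cleveland', 'oh', null, null, null);

create table zip_code_dim (
    zip_code_sk integer primary key, zip_code integer, city text, state text
);

-- Initial load: combine both address roles, filter NULLs, deduplicate.
insert into zip_code_dim (zip_code, city, state)
select distinct zip_code, city, state
  from (select customer_zip_code as zip_code,
               customer_city as city,
               customer_state as state
          from customer_dim
         where customer_zip_code is not null
         union all
        select shipping_zip_code, shipping_city, shipping_state
          from customer_dim
         where shipping_zip_code is not null) t
 order by zip_code;

create table sales_order_fact_bak (order_number integer, customer_sk integer);
insert into sales_order_fact_bak values (101, 1), (102, 2), (103, 3);

create table sales_order_fact (
    order_number integer, customer_sk integer,
    customer_zip_code_sk integer, shipping_zip_code_sk integer
);

-- Backfill: the old fact rows reach the zip code dimension only through
-- customer_dim. Left joins keep orders whose shipping zip code is NULL.
insert into sales_order_fact
select f.order_number, f.customer_sk, cz.zip_code_sk, sz.zip_code_sk
  from sales_order_fact_bak f
  join customer_dim c on f.customer_sk = c.customer_sk
  left join zip_code_dim cz on c.customer_zip_code = cz.zip_code
  left join zip_code_dim sz on c.shipping_zip_code = sz.zip_code;
""")

zip_rows = conn.execute(
    "select zip_code_sk, zip_code from zip_code_dim order by zip_code_sk"
).fetchall()
fact_rows = conn.execute(
    "select * from sales_order_fact order by order_number"
).fetchall()
print(zip_rows)
print(fact_rows)
```

Five address rows collapse to two dimension rows, and order 103 survives the backfill with a NULL shipping surrogate key. If the same zip code ever appeared with two different cities, distinct would keep both rows and the backfill join would duplicate fact rows, which is one more reason to prefer a complete, curated zip code table.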

II. Modifying the Regular Load Function

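        The regular load function changes in three places:
  •  Remove every zip-code-related column from the customer dimension load, because the customer dimension no longer holds customer or shipping zip code information.
  •  In the fact table load, reference the surrogate keys from the customer zip code view and the shipping zip code view.
  •  Change the pa_customer_dim load, because the customer zip code must now be obtained through customer_zip_code_sk in the sales order fact table.
        The modified fn_regular_load function is as follows.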
create or replace function fn_regular_load ()              
returns void as              
$$              
declare              
    -- Set the SCD effective dates
    v_cur_date date := current_date;                
    v_pre_date date := current_date - 1;            
    v_last_load date;            
begin            
    -- Analyze the external tables
    analyze ext.customer;            
    analyze ext.product;            
    analyze ext.sales_order;            
            
    -- Load the external table data into the raw data store (rds) tables
    truncate table rds.customer;              
    truncate table rds.product;             
            
    insert into rds.customer select * from ext.customer;             
    insert into rds.product select * from ext.product;            
    insert into rds.sales_order       
    select order_number,      
           customer_number,      
           product_code,      
           order_date,      
           entry_date,      
           order_amount,      
           order_quantity,      
           request_delivery_date,  
           verification_ind,  
           credit_check_flag,  
           new_customer_ind,  
           web_order_flag  
      from ext.sales_order;            
                
    -- Analyze the tables in the rds schema
    analyze rds.customer;            
    analyze rds.product;            
    analyze rds.sales_order;            
            
    -- Set the upper bound of the CDC window
    select last_load into v_last_load from rds.cdc_time;            
    truncate table rds.cdc_time;            
    insert into rds.cdc_time select v_last_load, v_cur_date;            
            
    -- Load the customer dimension
    insert into tds.customer_dim            
    (customer_number,            
     customer_name,            
     customer_street_address,            
     shipping_address,           
     isdelete,            
     version,            
     effective_date)            
    select case flag             
                when 'D' then a_customer_number            
                else b_customer_number            
            end customer_number,            
           case flag             
                when 'D' then a_customer_name            
                else b_customer_name            
            end customer_name,            
           case flag             
                when 'D' then a_customer_street_address            
                else b_customer_street_address            
            end customer_street_address,            
           case flag             
                when 'D' then a_shipping_address            
                else b_shipping_address            
            end shipping_address,          
           case flag             
                when 'D' then true            
                else false            
            end isdelete,            
           case flag             
                when 'D' then a_version            
                when 'I' then 1            
                else a_version + 1            
            end v,            
           v_pre_date            
      from (select a.customer_number a_customer_number,            
                   a.customer_name a_customer_name,            
                   a.customer_street_address a_customer_street_address,            
                   a.shipping_address a_shipping_address,            
                   a.version a_version,            
                   b.customer_number b_customer_number,            
                   b.customer_name b_customer_name,            
                   b.customer_street_address b_customer_street_address,            
                   b.shipping_address b_shipping_address,            
                   case when a.customer_number is null then 'I'            
                        when b.customer_number is null then 'D'            
                        else 'U'             
                    end flag            
              from v_customer_dim_latest a             
              full join rds.customer b on a.customer_number = b.customer_number             
             where a.customer_number is null -- inserted
                or b.customer_number is null -- deleted
                or (a.customer_number = b.customer_number             
                    and not             
                           (coalesce(a.customer_name,'') = coalesce(b.customer_name,'')             
                        and coalesce(a.customer_street_address,'') = coalesce(b.customer_street_address,'')             
                        and coalesce(a.shipping_address,'') = coalesce(b.shipping_address,'')             
                        ))) t            
             order by coalesce(a_customer_number, 999999999999), b_customer_number limit 999999999999;            
         
    -- Load the product dimension
    insert into tds.product_dim            
    (product_code,            
     product_name,            
     product_category,                 
     isdelete,            
     version,            
     effective_date)            
    select case flag             
                when 'D' then a_product_code            
                else b_product_code            
            end product_code,            
           case flag             
                when 'D' then a_product_name            
                else b_product_name            
            end product_name,            
           case flag             
                when 'D' then a_product_category            
                else b_product_category            
            end product_category,            
           case flag             
                when 'D' then true            
                else false            
            end isdelete,            
           case flag             
                when 'D' then a_version            
                when 'I' then 1            
                else a_version + 1            
            end v,            
           v_pre_date            
      from (select a.product_code a_product_code,            
                   a.product_name a_product_name,            
                   a.product_category a_product_category,            
                   a.version a_version,            
                   b.product_code b_product_code,            
                   b.product_name b_product_name,            
                   b.product_category b_product_category,                           
                   case when a.product_code is null then 'I'            
                        when b.product_code is null then 'D'            
                        else 'U'             
                    end flag            
              from v_product_dim_latest a             
              full join rds.product b on a.product_code = b.product_code             
             where a.product_code is null -- inserted
                or b.product_code is null -- deleted
                or (a.product_code = b.product_code             
                    and not             
                           (a.product_name = b.product_name             
                        and a.product_category = b.product_category))) t            
             order by coalesce(a_product_code, 999999999999), b_product_code limit 999999999999;            
    
    -- Load the sales order fact table
    insert into sales_order_fact              
    select a.order_number,              
           customer_sk,              
           product_sk,   
           e.date_sk,            
           e.year * 100 + e.month,                 
           order_amount,          
           order_quantity,      
           f.date_sk,  
           g.sales_order_attribute_sk,
           h.customer_zip_code_sk,    
           i.shipping_zip_code_sk
      from rds.sales_order a,             
           v_customer_dim_his c,              
           v_product_dim_his d,              
           date_dim e,       
           date_dim f,    
           sales_order_attribute_dim g, 
           v_customer_zip_code_dim h,    
           v_shipping_zip_code_dim i,    
           rds.customer j,
           rds.cdc_time k  
     where a.customer_number = c.customer_number              
       and a.order_date >= c.effective_date            
       and a.order_date < c.expiry_date               
       and a.product_code = d.product_code              
       and a.order_date >= d.effective_date            
       and a.order_date < d.expiry_date               
       and date(a.order_date) = e.date        
       and date(a.request_delivery_date) = f.date  
       and a.verification_ind = g.verification_ind      
       and a.credit_check_flag = g.credit_check_flag      
       and a.new_customer_ind = g.new_customer_ind      
       and a.web_order_flag = g.web_order_flag 
       and a.customer_number = j.customer_number    
       and j.customer_zip_code = h.customer_zip_code
       and j.shipping_zip_code = i.shipping_zip_code 
       and a.entry_date >= k.last_load and a.entry_date < k.current_load;                          
    
    -- Reload the PA customer dimension
    truncate table pa_customer_dim;            
    insert into pa_customer_dim            
    select distinct a.*              
      from customer_dim a,  
           sales_order_fact b,  
           v_customer_zip_code_dim c     
     where c.customer_state = 'pa'   
       and b.customer_zip_code_sk = c.customer_zip_code_sk  
       and a.customer_sk = b.customer_sk;  
	 
    -- Analyze the tables in the tds schema
    analyze customer_dim;            
    analyze product_dim;            
    analyze sales_order_fact; 
    analyze pa_customer_dim;	
            
    -- Update the last_load column of the timestamp table
    truncate table rds.cdc_time;            
    insert into rds.cdc_time select v_cur_date, v_cur_date;            
            
end;              
$$              
language plpgsql;
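        Two points in the function above deserve attention. First, when loading the fact table, the query joins not only the two zip code dimension views but also the staging table rds.customer. To obtain a zip code surrogate key it must join on the zip code itself, and the zip code has been dropped from the customer dimension table; it now survives only in the source customer table. Second, the load of the PA subset dimension changes. The state code has been removed from the customer dimension table and moved to the new zip code dimension table, and the customer and zip code dimensions have no direct relationship; they are linked only through the customer surrogate key and the zip code surrogate key in the fact table. Retrieving the PA subset therefore requires joining three tables: the fact table, the customer dimension table, and the zip code dimension table. This is also why the PA subset dimension is loaded after the fact table.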
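Both points can be reproduced on a toy data set. The following sketch is an illustration under stated assumptions, not the article's function: Python's built-in sqlite3 module stands in for HAWQ, the staging tables are named rds_customer and rds_sales_order because the sketch does not use schemas, the zip code dimension is joined directly instead of through the two views, and all rows are invented.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for HAWQ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table zip_code_dim (
    zip_code_sk integer primary key, zip_code integer, city text, state text
);
insert into zip_code_dim values
    (1, 17050, 'mechanicsburg', 'pa'),
    (2, 44102, 'cleveland', 'oh');

-- Customer dimension after the merge: no zip code, city, or state columns.
create table customer_dim (
    customer_sk integer primary key, customer_number integer, customer_name text
);
insert into customer_dim values (10, 1, 'customer a'), (20, 2, 'customer b');

-- The staging copy of the source customer table still carries the zip codes.
create table rds_customer (
    customer_number integer, customer_zip_code integer, shipping_zip_code integer
);
insert into rds_customer values (1, 17050, 17050), (2, 44102, 17050);

create table rds_sales_order (order_number integer, customer_number integer);
insert into rds_sales_order values (101, 1), (102, 2);

create table sales_order_fact (
    order_number integer, customer_sk integer,
    customer_zip_code_sk integer, shipping_zip_code_sk integer
);

-- Point 1: zip codes come from staging, surrogate keys from the dimension.
insert into sales_order_fact
select o.order_number, c.customer_sk, cz.zip_code_sk, sz.zip_code_sk
  from rds_sales_order o
  join customer_dim c  on o.customer_number = c.customer_number
  join rds_customer j  on o.customer_number = j.customer_number
  join zip_code_dim cz on j.customer_zip_code = cz.zip_code
  join zip_code_dim sz on j.shipping_zip_code = sz.zip_code;

-- Point 2: customer and zip code dimensions meet only in the fact table,
-- so the PA subset has to be built after the fact load.
create table pa_customer_dim as
select distinct c.*
  from customer_dim c
  join sales_order_fact f on c.customer_sk = f.customer_sk
  join zip_code_dim z     on f.customer_zip_code_sk = z.zip_code_sk
 where z.state = 'pa';
""")

pa_customers = conn.execute(
    "select customer_number from pa_customer_dim"
).fetchall()
print(pa_customers)
```

Only customer 1 lands in the subset: customer 2 ships to a PA zip code but has an Ohio customer address, and the filter follows customer_zip_code_sk. The sketch also makes one consequence of the design visible: a customer with no rows in the fact table cannot appear in the subset, because the fact table is the only bridge between the two dimensions.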

III. Testing

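        Follow these steps to test the modified regular load script.
  1. Make some changes to the customer zip code information in the source data.
  2. Before loading the new customer data, query the most recent customer and shipping zip codes so that the output can later be compared with the changed information.
  3. Add new sales orders to the source data.
  4. Run the regular load.
  5. Query the customer dimension table, the sales order fact table, and the PA subset dimension table to confirm that the data was loaded correctly.
        Run the following statements to make two changes to the source customer data: update the customer and shipping zip code information for customer number 4, and add a new customer with number 15.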
update source.customer   
   set customer_street_address = '9999 louise dr.',  
       customer_zip_code = 17055,   
       customer_city = 'pittsburgh',  
       shipping_address = '9999 louise dr.',  
       shipping_zip_code = 17055,  
       shipping_city = 'pittsburgh'  
 where customer_number = 4;  
  
insert into source.customer   
values(15, 'super stores', '1000 woodland st.', 17055, 'pittsburgh', 'pa', '1000 woodland st.', 17055, 'pittsburgh', 'pa');  
  
commit;
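        Now, before loading the new customer data, query the most recent customer and shipping zip codes. The output can later be compared with the changed information. The query is as follows.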
select order_date_sk odsk,  
       customer_number cn,  
       customer_zip_code czc,  
       shipping_zip_code szc  
  from v_customer_zip_code_dim a,  
       v_shipping_zip_code_dim b,  
       sales_order_fact c,  
       customer_dim d  
 where a.customer_zip_code_sk = c.customer_zip_code_sk  
   and b.shipping_zip_code_sk = c.shipping_zip_code_sk  
   and d.customer_sk = c.customer_sk
 order by odsk;
        Then add two sales orders with the following statements, which use MySQL syntax and run against the source database.
set @order_date := from_unixtime(unix_timestamp('2017-05-30 00:00:01') + rand() * (unix_timestamp('2017-05-30 12:00:00') - unix_timestamp('2017-05-30 00:00:01')));  
set @request_delivery_date := from_unixtime(unix_timestamp(date_add(current_date, interval 5 day)) + rand() * 86400);       
set @amount := floor(1000 + rand() * 9000);       
set @quantity := floor(10 + rand() * 90);    

insert into source.sales_order values  
  (null, 4, 3, 'y', 'y', 'y', 'n',  @order_date, @request_delivery_date,  
        @order_date, @amount, @quantity);  
          
set @order_date := from_unixtime(unix_timestamp('2017-05-30 12:00:00') + rand() * (unix_timestamp('2017-05-31 00:00:00') - unix_timestamp('2017-05-30 12:00:00')));        
set @request_delivery_date := from_unixtime(unix_timestamp(date_add(current_date, interval 5 day)) + rand() * 86400);
set @amount := floor(1000 + rand() * 9000);       
set @quantity := floor(10 + rand() * 90);    
  
insert into source.sales_order values  
  (null, 15, 4, 'y', 'n', 'y', 'n', @order_date, @request_delivery_date,  
       @order_date, @amount, @quantity);  

commit;
        Run the following command to perform the regular load.
~/regular_etl.sh
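        Query the customer dimension through the v_customer_dim_his view to confirm that the two changed customers, numbers 4 and 15, were loaded correctly.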
select customer_sk csk,  
       customer_number cnum,  
       customer_name cnam,  
       customer_street_address csd,  
       shipping_address sd,  
       version,  
       effective_date,  
       expiry_date  
  from v_customer_dim_his   
 where customer_number in (4, 15);
        The result is shown in Figure 2.
Figure 2

        Query the two new sales orders in sales_order_fact to confirm that the zip codes were loaded correctly.
select a.order_number onum,  
       e.customer_number cnum,  
       b.customer_zip_code czc,  
       c.shipping_zip_code szc,  
       f.product_code pc,  
       d.order_date od,  
       a.order_amount,  
       a.order_quantity  
  from sales_order_fact a,  
       v_customer_zip_code_dim b,  
       v_shipping_zip_code_dim c,  
       v_order_date_dim d,  
       customer_dim e,  
       product_dim f  
 where a.customer_sk = e.customer_sk  
   and a.product_sk = f.product_sk  
   and a.customer_zip_code_sk = b.customer_zip_code_sk  
   and a.shipping_zip_code_sk = c.shipping_zip_code_sk  
   and a.order_date_sk = d.order_date_sk  
 order by a.order_number desc 
 limit 2;  
        The result is shown in Figure 3.
Figure 3

        Query the v_pa_customer_dim_his view to confirm that the PA customers were loaded correctly.
select customer_sk csk,  
       customer_number cnum,  
       customer_name cnam,  
       customer_street_address csa,  
       shipping_address sad,  
       version,  
       effective_date,  
       expiry_date  
  from v_pa_customer_dim_his
 order by customer_sk;
        The result is shown in Figure 4.
Figure 4
Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission.
