HAWQ取代传统数仓实践（十三）——事实表技术之周期快照

最新推荐文章于 2024-03-28 22:22:04 发布

秉寒-CHO

最新推荐文章于 2024-03-28 22:22:04 发布

阅读量192

点赞数

分类专栏： DW

DW 专栏收录该内容

43 篇文章 4 订阅

订阅专栏

https://blog.csdn.net/wzy0623/article/details/72844714

一、周期快照简介
周期快照事实表中的每行汇总了发生在某一标准周期，如一天、一周或一月的多个度量。其粒度是周期性的时间段，而不是单个事务。周期快照事实表通常包含许多数据的总计，因为任何与事实表时间范围一致的记录都会被包含在内。在这些事实表中，外键的密度是均匀的，因为即使周期内没有活动发生，通常也会在事实表中为每个维度插入包含0或空值的行。
周期快照是在一个给定的时间对事实表进行一段时期的总计。有些数据仓库用户，尤其是业务管理者或者运营部门，经常要看某个特定时间点的汇总数据。下面在示例数据仓库中创建一个月销售订单周期快照，用于按产品统计每个月总的销售订单金额和产品销售数量。

二、建立周期快照表
假设需求是要按产品统计每个月的销售金额和销售数量。单从功能上看，此数据能够从事务事实表中直接查询得到。例如，要取得2017年5月的销售数据，可以使用以下的语句查询：

    select b.month_sk, a.product_sk, sum(order_amount), sum(order_quantity)
      from sales_order_fact a,
           month_dim b,
           v_order_date_dim d
     where a.order_date_sk = d.order_date_sk
       and b.month = d.month
       and b.year = d.year
       and b.month = 5
       and b.year = 2017
     group by b.month_sk, a.product_sk ;

只要将年、月参数传递给这条查询语句，就可以获得任何年月的统计数据。但即便是在如此简单的场景下，我们仍然需要建立独立的周期快照事实表。事务事实表的数据量都会很大，如果每当需要月销售统计数据时，都从最细粒度的事实表查询，那么性能将会差到不堪忍受的程度。再者，月统计数据往往只是下一步数据分析的输入信息，有时把更复杂的逻辑放到一个单一的查询语句中效率会更差。因此，好的做法是将事务型事实表作为一个基石事实数据，以此为基础，向上逐层建立需要的快照事实表。
新的周期快照事实表中有两个度量值，month_order_amount和month_order_quantity。这两个值是不能加到sales_order_fact表中的，因为sales_order_fact表和新的度量值有不同的时间属性，也即数据的粒度不同。sales_order_fact表包含的是单一事务记录。新的度量值是每月的汇总数据，它们是可加的。使用下面的语句建立month_end_sales_order_fact表。

    set search_path=tds;

    create table month_end_sales_order_fact (
        year_month int,
        product_sk bigint,
        month_order_amount numeric(10,2),
        month_order_quantity bigint
    )
    partition by range (year_month)
    ( partition p201601 start (201601) inclusive ,
      partition p201602 start (201602) inclusive ,
      partition p201603 start (201603) inclusive ,
      partition p201604 start (201604) inclusive ,
      partition p201605 start (201605) inclusive ,
      partition p201606 start (201606) inclusive ,
      partition p201607 start (201607) inclusive ,
      partition p201608 start (201608) inclusive ,
      partition p201609 start (201609) inclusive ,
      partition p201610 start (201610) inclusive ,
      partition p201611 start (201611) inclusive ,
      partition p201612 start (201612) inclusive ,
      partition p201701 start (201701) inclusive ,
      partition p201702 start (201702) inclusive ,
      partition p201703 start (201703) inclusive ,
      partition p201704 start (201704) inclusive ,
      partition p201705 start (201705) inclusive ,
      partition p201706 start (201706) inclusive ,
      partition p201707 start (201707) inclusive ,
      partition p201708 start (201708) inclusive ,
      partition p201709 start (201709) inclusive ,
      partition p201710 start (201710) inclusive ,
      partition p201711 start (201711) inclusive ,
      partition p201712 start (201712) inclusive
                        end (201801) exclusive );

    comment on table month_end_sales_order_fact is '月销售周期快照表';
    comment on column month_end_sales_order_fact.year_month is '年月';
    comment on column month_end_sales_order_fact.product_sk is '产品代理键';
    comment on column month_end_sales_order_fact.month_order_amount is '月销售金额';
    comment on column month_end_sales_order_fact.month_order_quantity is '月销售数量';

和销售订单事实表一样，月销售周期快照表也以年月做分区。这样做主要有两点好处：

按年月查询周期快照表时，可以利用分区消除提高性能。
便于实现重复执行定期装载过程。HAWQ没有DELETE语句，但是可以单独清空分区对应的子表。

三、装载周期快照表
建立了month_end_sales_order_fact表后，现在需要向表中装载数据。实际装载时，月销售周期快照事实表的数据源是已有的销售订单事务事实表，而不需要关联产品维度表。之所以可以这样做，是因为总是先处理事务事实表，再处理周期快照事实表，并且事务事实表中的产品代理键就是当时有效的产品描述。这样做还有一个好处是，不必要非在1号装载上月的数据，这点在后面修改工作流时详细说明。
执行下面的语句初始装载月销售数据。

    insert into month_end_sales_order_fact
    select year_month,product_sk,sum(order_amount),sum(order_quantity)
      from sales_order_fact
     group by year_month,product_sk;

fn_month_sum函数用于定期装载月销售订单周期快照事实表，函数定义如下。

    create or replace function tds.fn_month_sum(p_year_month int)
    returns void as
    $$
    declare
        sqlstring varchar(1000);
    begin
        -- 幂等操作，先删除上月数据
        sqlstring := 'truncate table month_end_sales_order_fact_1_prt_p' || cast(p_year_month as varchar);
        execute sqlstring;

        -- 插入上月销售汇总数据
        insert into month_end_sales_order_fact
        select t1.year_month,
               t2.product_sk,
               coalesce(t2.month_order_amount,0),
               coalesce(t2.month_order_quantity,0)
          from (select year * 100 + month year_month
                  from month_dim
                 where year * 100 + month = p_year_month) t1
          left join (select year_month, product_sk, sum(order_amount) month_order_amount, sum(order_quantity) month_order_quantity
                       from sales_order_fact
                      where year_month = p_year_month
                      group by year_month,product_sk) t2
               on t1.year_month = t2.year_month;

    end;
    $$
    language plpgsql;

执行以下语句装载上个月的销售汇总数据。该语句可以重复执行，汇总数据不会重复累加。

select tds.fn_month_sum(cast(extract(year from current_date - interval '1 month') * 100 + extract(month from current_date - interval '1 month') as int));

周期快照表的外键密度是均匀的，因此这里使用外连接关联月份维度和事务事实表。即使上个月没有任何销售记录，周期快照中仍然会有一行记录。在这种情况下，周期快照记录中只有年月，而产品代理键的值为空，度量为0。查询销售订单事实表时可以利用分区消除提高性能。
每个月给定的任何一天，在每天销售订单定期装载执行完后，执行fn_month_sum函数，装载上个月的销售订单汇总数据。为此需要修改Oozie的工作流定义。

四、修改工作流

1. 修改Oozie工作流作业配置文件
需要在“HAWQ取代传统数仓实践（五）——自动调度工作流（Oozie、Falcon）”中创建的workflow.xml工作流定义文件中增加月底销售周期快照的数据装载部分，修改后的文件内容如下：

<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.4" name="RegularETL">
    <start to="hdfsCommands"/>
    <action name="hdfsCommands">
        <fs>
            <delete path='${nameNode}/data/ext/sales_order/*'/>
        </fs>
        <ok to="fork-node"/>
        <error to="fail"/>
    </action>
    <fork name="fork-node">
        <path start="sqoop-customer" />
        <path start="sqoop-product" />
        <path start="sqoop-sales_order" />
    </fork>
    <action name="sqoop-customer">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <arg>import</arg>
            <arg>--connect</arg>
            <arg>jdbc:mysql://172.16.1.127:3306/source?useSSL=false</arg>
            <arg>--username</arg>
            <arg>dwtest</arg>
            <arg>--password</arg>
            <arg>123456</arg>
            <arg>--table</arg>
            <arg>customer</arg>
            <arg>--target-dir</arg>
            <arg>/data/ext/customer</arg>
            <arg>--delete-target-dir</arg>
            <arg>--compress</arg>
        </sqoop>
        <ok to="joining"/>
        <error to="fail"/>
    </action>
    <action name="sqoop-product">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <arg>import</arg>
            <arg>--connect</arg>
            <arg>jdbc:mysql://172.16.1.127:3306/source?useSSL=false</arg>
            <arg>--username</arg>
            <arg>dwtest</arg>
            <arg>--password</arg>
            <arg>123456</arg>
            <arg>--table</arg>
            <arg>product</arg>
            <arg>--target-dir</arg>
            <arg>/data/ext/product</arg>
            <arg>--delete-target-dir</arg>
            <arg>--compress</arg>
        </sqoop>
        <ok to="joining"/>
        <error to="fail"/>
    </action>
    <action name="sqoop-sales_order">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>job --meta-connect jdbc:mysql://hdp2/sqoop?user=sqoop&password=sqoop --exec myjob_incremental_import</command>
            <archive>/user/oozie/share/lib/lib_20170208131207/sqoop/java-json.jar#java-json.jar</archive>
        </sqoop>
        <ok to="joining"/>
        <error to="fail"/>
    </action>
    <join name="joining" to="psql-node"/>
    <action name="psql-node">
        <ssh xmlns="uri:oozie:ssh-action:0.1">
            <host>${focusNodeLogin}</host>
            <command>${myScript}</command>
            <capture-output/>
        </ssh>
        <ok to="decision-node"/>
        <error to="fail"/>
    </action>

    <decision name="decision-node">
       <switch>
         <case to="month-sum">
             ${date eq '02'}
         </case>
         <default to="end"/>
       </switch>
    </decision>

    <action name="month-sum">
        <ssh xmlns="uri:oozie:ssh-action:0.1">
            <host>${focusNodeLogin}</host>
            <command>${myScript1}</command>
            <capture-output/>
        </ssh>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Sqoop failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

在该配置文件中增加了一个名为decision-node的decision控制节点，用来判断date参数的值。当date等于'02'时，转到month-sum动作节点，否则转到end节点结束工作流。month-sum是一个SSH动作节点，执行fn_month_sum函数装载周期快照事实表，成功执行后转到end节点结束。很明显，本例中decision节点的作用就是控制在并且只在一个月当中的某一天执行周期快照表的数据装载，其它日期不做这步操作。之所以这里是'02'是为了方便测试。fn_month_sum函数接收年月作为参数，因此不必要非得1号执行，任何一天都可以。这个工作流定义保证了每月汇总只有在每天汇总执行完后才执行，并且每月只执行一次。工作流的DAG如图1所示。

图1

2. 部署工作流

hdfs dfs -put -f workflow.xml /user/oozie/

3. 在Falcon process的ADVANCED OPTIONS中增加属性
需要在调度作业配置中增加myScript1和date两个属性的定义，如图2所示。
图2

myScript1属性的值为/root/regular_etl_month.sh，是调用psql的shell脚本文件。date属性的值为${coord:formatTime(coord:actualTime(), "dd")}，用Oozie的系统函数取得工作流执行时的月中日期。Falcon调度执行工作流时，这些属性的值会作为实参传入workflow.xml工作流定义文件中。

4. 编写快照表数据装载脚本
/root/regular_etl_month.sh文件的内容如下所示：

#!/bin/bash

# 使用gpadmin用户执行月周期快照装载函数
su - gpadmin -c 'export PGPASSWORD=123456;psql -U dwtest -d dw -h hdp3 -c "set search_path=tds;select fn_month_sum(cast(extract(year from current_date - interval '\''1 month'\'') * 100 + extract(month from current_date - interval '\''1 month'\'') as int))"'

该文件以root用户执行，需要注意shell中引号嵌套的用法。

五、测试
首先清空上个月的周期快照数据：

truncate table month_end_sales_order_fact_1_prt_p201705;

然后在Falcon Web UI中执行process。执行成功后查看月周期快照如图3所示。
图3

可以看到，已经生成了上个月的销售汇总周期快照数据。