一种基于SparkSQL的Hive数据仓库拉链表缓慢变化维(SCD2+SCD1)的示例实现

  1. 背景
  2. 准备基础维度表 base_dim
    • 业务字段:id, name, city, st.
    • 维度表默认字段:sk(surrogate key 代理键,全局唯一), scd_update_date, scd_version, scd_valid_flag, scd_start_date, scd_end_date
    • scd1字段 city, st; scd2字段 name.
    • scd1 字段变动不记录scd_version, 但会记录变更时间scd_update_date
             import spark.sql
             import spark.implicits._
             // scd1: city, st; scd2: name
             sql("drop table if exists base_dim")
             sql(
                 s"""
                    | create table base_dim
                    |    (sk int, id int, name string, city string, st string, scd_update_date string,scd_version int,scd_valid_flag string,scd_start_date string, scd_end_date string)
                    | stored as orc
                  """.stripMargin)
             sql(
                 s"""
                    | insert into table base_dim
                    | values
                    |    (1,1,"zhangsan","us","ca","2019-01-01",1,"Y","2019-01-01","9999-12-31"),
                    |    (2,2,"lisi","us","cb","2019-01-01",1,"Y","2019-01-01","9999-12-31"),
                    |    (3,3,"wangwu","ca","bb","2019-01-01",1,"Y","2019-01-01","9999-12-31"),
                    |    (4,4,"zhaoliu","ca","bc","2019-01-01",1,"Y","2019-01-01","9999-12-31"),
                    |    (5,5,"mazi","aa","aa","2019-01-01",1,"Y","2019-01-01","9999-12-31")
                  """.stripMargin)

    +---+---+--------+----+---+---------------+-----------+--------------+--------------+------------+
    | sk| id| name|city| st|scd_update_date|scd_version|scd_valid_flag|scd_start_date|scd_end_date|
    +---+---+--------+----+---+---------------+-----------+--------------+--------------+------------+
    | 3| 3| wangwu| ca| bb| 2019-01-01| 1| Y| 2019-01-01| 9999-12-31|
    | 4| 4| zhaoliu| ca| bc| 2019-01-01| 1| Y| 2019-01-01| 9999-12-31|
    | 5| 5| mazi| aa| aa| 2019-01-01| 1| Y| 2019-01-01| 9999-12-31|
    | 1| 1|zhangsan| us| ca| 2019-01-01| 1| Y| 2019-01-01| 9999-12-31|
    | 2| 2| lisi| us| cb| 2019-01-01| 1| Y| 2019-01-01| 9999-12-31|
    +---+---+--------+----+---+---------------+-----------+--------------+--------------+------------+

  3. 阶段表 stage
    • 数据每行对应的变化:1: scd2 + scd1; 2: expired(snapshot) | unchanged (incrementing mode); 3: unchanged; 4: scd1; 5: scd2; 6: new record
      1) 快照模式(snapshot):不包含在该表中的记录视为已过期的。
      2) 增量变动模式(incrementing mode):不包含在该表中的记录仍然视为有效。
    • 除了1中所述业务字段外,还有表征每行记录的时间字段
             sql("drop table if exists stage")
             sql("create table stage (id int, name string, city string, st string, update_date string) stored as orc")
             // 1: scd2 + scd1; 2: expired(snapshot) | unchanged (incrementing mode); 3: unchanged; 4: scd1; 5: scd2; 6: new record
             val dt = getDate()
             sql(
                 s"""
                    | insert into table stage
                    | values
                    |    (1,"zhang","u","c",‘$dt’),
                    |    (3,"wangwu","ca","bb",‘$dt‘),
                    |    (4,"zhaoliu","ac","cb",$dt’),
                    |    (5,"ma","aa","aa",‘$dt’),
                    |    (6,"laoyang","dd","dd",’$dt‘)
                  """.stripMargin)

    +---+-------+----+---+-----------+
    | id| name|city| st|update_date|
    +---+-------+----+---+-----------+
    | 1| zhang| u| c| 2019-08-06|
    | 3| wangwu| ca| bb| 2019-08-06|
    | 4|zhaoliu| ac| cb| 2019-08-06|
    | 5| ma| aa| aa| 2019-08-06|
    | 6|laoyang| dd| dd| 2019-08-06|
    +---+-------+----+---+-----------+

  4. 获取不变的记录
    1) 快照模式: 在快照中,且规定的特定字段均不变(业务主键+SCD2+SCD1)的记录
            val unchangedDF = sql(
                s"""
                   |SELECT
                   |    bd.sk,
                   |    s.*,
                   |    bd.scd_version,
                   |    bd.scd_valid_flag,
                   |    bd.scd_start_date,
                   |    bd.scd_end_date
                   |FROM
                   |    base_dim bd
                   |    JOIN stage s
                   |    ON bd.id=s.id AND bd.name=s.name AND bd.city=s.city AND bd.st=s.st
                 """.stripMargin)
+---+---+------+----+---+-----------+-----------+--------------+--------------+------------+
| sk| id|  name|city| st|update_date|scd_version|scd_valid_flag|scd_start_date|scd_end_date|
+---+---+------+----+---+-----------+-----------+--------------+--------------+------------+
|  3|  3|wangwu|  ca| bb|       2006|          1|             Y|    2019-01-01|  9999-12-31|
+---+---+------+----+---+-----------+-----------+--------------+--------------+------------+

    2) 增量变动模式:  1. 在阶段表中且规定的特定字段均不变(业务主键+SCD2+SCD1)的记录;2. 不在阶段表中的记录
            val unchangedDF = sql(
                s"""
                   |SELECT
                   |    bd.*
                   |FROM
                   |    base_dim bd
                   |    LEFT JOIN stage s
                   |    ON bd.id=s.id
                   |    WHERE s.id IS NULL OR (bd.name=s.name AND bd.city=s.city AND bd.st=s.st)
                 """.stripMargin)
+---+---+------+----+---+---------------+-----------+--------------+--------------+------------+
| sk| id|  name|city| st|scd_update_date|scd_version|scd_valid_flag|scd_start_date|scd_end_date|
+---+---+------+----+---+---------------+-----------+--------------+--------------+------------+
|  3|  3|wangwu|  ca| bb|     2019-01-01|          1|             Y|    2019-01-01|  9999-12-31|
|  2|  2|  lisi|  us| cb|     2019-01-01|          1|             Y|    2019-01-01|  9999-12-31|
+---+---+------+----+---+---------------+-----------+--------------+--------------+------------+
  1. 标记过期的记录
    1) 快照模式: 1. 不在快照中的记录 2. 发生SCD2的旧的记录
            val pre_date = getPreDate()
            val max_date = "9999-12-31"
            val expireDF = sql(
                s"""
                   |SELECT
                   |    bd.sk sk,
                   |    bd.id id,
                   |    bd.name name,
                   |    bd.city city,
                   |    bd.st st,
                   |    bd.scd_update_date scd_update_date,
                   |    bd.scd_version scd_version,
                   |    'N' scd_valid_flag,
                   |    bd.scd_start_date scd_start_date,
                   |    '$pre_date' scd_end_date
                   |FROM
                   |    (
                   |        SELECT *
                   |        FROM base_dim
                   |        WHERE scd_end_date='$max_date'
                   |    ) bd
                   |    LEFT JOIN stage s
                   |    ON bd.id=s.id
                   |    WHERE s.id IS NULL OR bd.name <> s.name
                 """.stripMargin)
+---+---+--------+----+---+---------------+-----------+--------------+--------------+------------+
| sk| id|    name|city| st|scd_update_date|scd_version|scd_valid_flag|scd_start_date|scd_end_date|
+---+---+--------+----+---+---------------+-----------+--------------+--------------+------------+
|  5|  5|    mazi|  aa| aa|     2019-01-01|          1|             N|    2019-01-01|  2019-08-05|
|  1|  1|zhangsan|  us| ca|     2019-01-01|          1|             N|    2019-01-01|  2019-08-05|
|  2|  2|    lisi|  us| cb|     2019-01-01|          1|             N|    2019-01-01|  2019-08-05|
+---+---+--------+----+---+---------------+-----------+--------------+--------------+------------+
    2) 增量变动模式:  1. 发生SCD2的旧的记录
            val expireDF = sql(
                s"""
                   |SELECT
                   |    bd.sk sk,
                   |    bd.id id,
                   |    bd.name name,
                   |    bd.city city,
                   |    bd.st st,
                   |    bd.scd_update_date scd_update_date,
                   |    bd.scd_version scd_version,
                   |    'N' scd_valid_flag,
                   |    bd.scd_start_date scd_start_date,
                   |    '$pre_date' scd_end_date
                   |FROM
                   |    (
                   |        SELECT *
                   |        FROM base_dim
                   |        WHERE scd_end_date='$max_date'
                   |    ) bd
                   |    LEFT JOIN stage s
                   |    ON bd.id=s.id
                   |    WHERE bd.name <> s.name
                 """.stripMargin)
+---+---+--------+----+---+---------------+-----------+--------------+--------------+------------+
| sk| id|    name|city| st|scd_update_date|scd_version|scd_valid_flag|scd_start_date|scd_end_date|
+---+---+--------+----+---+---------------+-----------+--------------+--------------+------------+
|  5|  5|    mazi|  aa| aa|     2019-01-01|          1|             N|    2019-01-01|  2019-08-05|
|  1|  1|zhangsan|  us| ca|     2019-01-01|          1|             N|    2019-01-01|  2019-08-05|
+---+---+--------+----+---+---------------+-----------+--------------+--------------+------------+
  1. 获取SCD2的新纪录
    • 也可能包含SCD2新行中的SCD1变化
      ```scala
      val scd2NewLineDF = sql(
      s"""
      | SELECT
      | ROW_NUMBER() OVER (ORDER BY t1.id) + t2.sk_max sk,
      | t1.id,
      | t1.name,
      | t1.city,
      | t1.st,
      | t1.scd_update_date,
      | t1.scd_version,
      | t1.scd_valid_flag,
      | t1.scd_start_date,
      | t1.scd_end_date
      | FROM
      | (
      | SELECT
      | t2.id id,
      | t2.name name,
      | t2.city city,
      | t2.st st,
      | t2.update_date scd_update_date,
      | t1.scd_version + 1 scd_version,
      | "Y" scd_valid_flag,
      | t1.scd_start_date scd_start_date,
      | '$max_date' scd_end_date
      | FROM
      | base_dim t1
      | JOIN
      | stage t2
      | ON
      | t1.id=t2.id AND t1.name<>t2.name
      | ) t1
      | CROSS JOIN (SELECT COALESCE(MAX(sk),0) sk_max FROM base_dim) t2
      |
      """.stripMargin)

```
+---+---+-----+----+---+---------------+-----------+--------------+--------------+------------+
| sk| id| name|city| st|scd_update_date|scd_version|scd_valid_flag|scd_start_date|scd_end_date|
+---+---+-----+----+---+---------------+-----------+--------------+--------------+------------+
| 6| 1|zhang| u| c| 2019-08-06| 2| Y| 2019-01-01| 9999-12-31|
| 7| 5| ma| aa| aa| 2019-08-06| 2| Y| 2019-01-01| 9999-12-31|
+---+---+-----+----+---+---------------+-----------+--------------+--------------+------------+

  1. 获取SCD1变化的记录
            val scd1UpdateDF = sql(
                s"""
                   | SELECT
                   |    bd.sk,
                   |    bd.id,
                   |    bd.name,
                   |    s.city,
                   |    s.st,
                   |    s.update_date,
                   |    bd.scd_version,
                   |    bd.scd_valid_flag,
                   |    bd.scd_start_date,
                   |    bd.scd_end_date
                   | FROM
                   |    base_dim bd
                   |    JOIN
                   |        stage s
                   |    ON
                   |        bd.id=s.id AND bd.name=s.name
                   |    WHERE
                   |        bd.city<>s.city OR bd.st<>s.st
                 """.stripMargin)
+---+---+-------+----+---+-----------+-----------+--------------+--------------+------------+
| sk| id|   name|city| st|update_date|scd_version|scd_valid_flag|scd_start_date|scd_end_date|
+---+---+-------+----+---+-----------+-----------+--------------+--------------+------------+
|  4|  4|zhaoliu|  ac| cb| 2019-08-06|          1|             Y|    2019-01-01|  9999-12-31|
+---+---+-------+----+---+-----------+-----------+--------------+--------------+------------+
  1. 获取全新的记录,并汇聚成最终结果
            val scdDF = unchangedDF.union(expireDF).union(scd2NewLineDF).union(scd1UpdateDF)
            scdDF.createOrReplaceTempView("new_base_dim")
            // grand-new records
            val brandNewDF = sql(
                s"""
                   | SELECT
                   |    ROW_NUMBER() OVER (ORDER BY t1.id) + t2.sk_max sk,
                   |    t1.*,
                   |    1,
                   |    "Y",
                   |    '$pre_date',
                   |    '$max_date'
                   | FROM
                   |    (
                   |        stage s
                   |        LEFT ANTI JOIN
                   |            base_dim bd
                   |        ON
                   |            s.id=bd.id
                   |    ) t1
                   |    CROSS JOIN (SELECT COALESCE(MAX(sk),0) sk_max FROM new_base_dim) t2
                 """.stripMargin)
            brandNewDF.show()
            val finalDF = scdDF.union(brandNewDF)
            finalDF.show()
+---+---+-------+----+---+-----------+---+---+-----------+----------+
| sk| id|   name|city| st|update_date|  1|  Y|update_date|9999-12-31|
+---+---+-------+----+---+-----------+---+---+-----------+----------+
|  8|  6|laoyang|  dd| dd| 2019-08-06|  1|  Y| 2019-08-06|9999-12-31|
+---+---+-------+----+---+-----------+---+---+-----------+----------+

8.最终结果
1) 快照模式:

+---+---+--------+----+---+-----------+-----------+--------------+--------------+------------+
| sk| id|    name|city| st|update_date|scd_version|scd_valid_flag|scd_start_date|scd_end_date|
+---+---+--------+----+---+-----------+-----------+--------------+--------------+------------+
|  3|  3|  wangwu|  ca| bb| 2019-08-06|          1|             Y|    2019-01-01|  9999-12-31|
|  5|  5|    mazi|  aa| aa| 2019-01-01|          1|             N|    2019-01-01|  2019-08-05|
|  1|  1|zhangsan|  us| ca| 2019-01-01|          1|             N|    2019-01-01|  2019-08-05|
|  2|  2|    lisi|  us| cb| 2019-01-01|          1|             N|    2019-01-01|  2019-08-05|
|  6|  1|   zhang|   u|  c| 2019-08-06|          2|             Y|    2019-01-01|  9999-12-31|
|  7|  5|      ma|  aa| aa| 2019-08-06|          2|             Y|    2019-01-01|  9999-12-31|
|  4|  4| zhaoliu|  ac| cb| 2019-08-06|          1|             Y|    2019-01-01|  9999-12-31|
|  8|  6| laoyang|  dd| dd| 2019-08-06|          1|             Y|    2019-08-06|  9999-12-31|
+---+---+--------+----+---+-----------+-----------+--------------+--------------+------------+
2)增量变动模式:

+---+---+--------+----+---+---------------+-----------+--------------+--------------+------------+
| sk| id|    name|city| st|scd_update_date|scd_version|scd_valid_flag|scd_start_date|scd_end_date|
+---+---+--------+----+---+---------------+-----------+--------------+--------------+------------+
|  3|  3|  wangwu|  ca| bb|     2019-01-01|          1|             Y|    2019-01-01|  9999-12-31|
|  2|  2|    lisi|  us| cb|     2019-01-01|          1|             Y|    2019-01-01|  9999-12-31|
|  5|  5|    mazi|  aa| aa|     2019-01-01|          1|             N|    2019-01-01|  2019-08-05|
|  1|  1|zhangsan|  us| ca|     2019-01-01|          1|             N|    2019-01-01|  2019-08-05|
|  6|  1|   zhang|   u|  c|     2019-08-06|          2|             Y|    2019-01-01|  9999-12-31|
|  7|  5|      ma|  aa| aa|     2019-08-06|          2|             Y|    2019-01-01|  9999-12-31|
|  4|  4| zhaoliu|  ac| cb|     2019-08-06|          1|             Y|    2019-01-01|  9999-12-31|
|  8|  6| laoyang|  dd| dd|     2019-08-06|          1|             Y|    2019-08-06|  9999-12-31|
+---+---+--------+----+---+---------------+-----------+--------------+--------------+------------+

本文通过手动地实现SCD2+SCD1的中间变化操作,方便大家理解缓慢变化维度的概念和实现过程。另外,本文中的转换都是Spark在内存中的操作,在数据量大的情况下可能会影响性能,建议考虑用临时表做中间转换的可能性。后续将对快照式或增量变动式的阶段表进行数据插入操作的中间细节进行包装,实现自动化SCD引擎,敬请关注。

转发自同名博客:https://blog.csdn.net/fzlulee/article/details/98489677

转载于:https://www.cnblogs.com/HanseyLee/p/11310397.html

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值