【数仓宽表设计】我对宽表设计的思考

一、背景

在数仓开发过程中,当业务过程中存在相似性时,我们会采用宽表设计,这样可以提升数据整合性、提高数据查询效率,并降低维护成本。同时,宽表设计也便于跨系统和跨部门的数据整合与调用。但是宽表的设计从维度建模的思想来说适当破坏了规范,这一层要根据具体场景去平衡。

二、不同场景下的宽表思考

1.电商业务/金融业务

在电商场景,我们进行维度建模时首先要选择业务过程,比如下单业务,涉及订单事实表,维度包括时间/地区/优惠券/商品等,度量值有运费/优惠金额/原始金额/最终金额等,在这个特定的场景事实表单一,我们按照维度建模正常构建就可以,我认为没有宽表的设计必要;但是如果在金融业务,不仅有订单事实表,还会有借款审批事实表、还款计划事实表等,那我们能想到这里的业务过程是存在相关性的,在订单交易过程中,我们会将借款申请和借款审批信息纳入还款计划,同时将贷后数据如还款状态和还款时间也包括在内。因此这里就有必要整合宽表。

2.社区业务

在社区场景,我们进行维度建模时首先要选择业务过程,比如评论业务,涉及评论表事实表、帖子表事实表、回复事实表,这个过程我认为可以构建宽表,以评论表为主表,关联帖子表,然后关联评论回复表。显然,用户可以评论帖子也可以回复评论。下面是该场景的SQL示例:

select a.global_comment_id                                               as global_comment_id,
       a.business_id                                                     as business_id,
       a.ctime                                                           as ctime,
       a.mtime                                                           as mtime,
       a.is_deleted                                                      as is_deleted,
       a.relation_business_id                                            as relation_business_id,
       a.relation_business_type                                          as relation_business_type,
       a.target_user_account                                             as target_user_account,
       a.target_work_code                                                as target_work_code,
       a.source_user_account                                             as source_user_account,
       xxx as uid_leader,
       a.source_work_code                                                as source_work_code,
       a.parent_business_id                                              as parent_business_id,
       a.content                                                         as content,
       a.comment_status                                                  as comment_status,
       a.comment_time                                                    as comment_time,
       a.is_publisher                                                    as is_publisher,
       a.is_read                                                         as is_read,
       a.floor_num                                                       as floor_num,
       a.audit_status                                                    as audit_status,
       a.comment_type                                                    as comment_type,
       a.like_count                                                      as like_count,
       a.target_nickname                                                 as target_nickname,
       a.source_nickname                                                 as source_nickname,
       a.is_top                                                          as is_top,
       a.article_business_id                                             as article_business_id,
       a.top_time                                                        as top_time,
       b.global_article_id                                               as comment_article_id,
       b.business_id                                                     as comment_article_business_id,
       case when b.business_id = d.pitcure_business_id then 1 else 0 end as comment_article_is_pitcure_business_id,
       b.ctime                                                           as comment_article_ctime,
       b.mtime                                                           as comment_article_mtime,
       b.is_deleted                                                      as comment_article_is_deleted,
       b.article_status                                                  as comment_article_article_status,
       b.simple_content                                                  as comment_article_simple_content,
       b.title                                                           as comment_article_title,
       b.publish_time                                                    as comment_article_publish_time,
       b.latest_reply_time                                               as comment_article_latest_reply_time,
       b.user_account                                                    as comment_article_user_account,
       b.work_code                                                       as comment_article_work_code,
       b.lock_description                                                as comment_article_lock_description,
       b.audit_status                                                    as comment_article_audit_status,
       b.view_count                                                      as comment_article_view_count,
       b.comment_count                                                   as comment_article_comment_count,
       b.like_count                                                      as comment_article_like_count,
       b.favorite_count                                                  as comment_article_favorite_count,
       b.nickname                                                        as comment_article_nickname,
       b.is_top                                                          as comment_article_is_top,
       b.top_time                                                        as comment_article_top_time,
       b.score                                                           as comment_article_score,
       c.global_comment_id                                               as reply_id,
       c.business_id                                                     as reply_business_id,
       c.ctime                                                           as reply_ctime,
       c.mtime                                                           as reply_mtime,
       c.is_deleted                                                      as reply_is_deleted,
       c.relation_business_id                                            as reply_relation_business_id,
       c.relation_business_type                                          as reply_relation_business_type,
       c.target_user_account                                             as reply_target_user_account,
       c.target_work_code                                                as reply_target_work_code,
       c.source_user_account                                             as reply_source_user_account,
       c.source_work_code                                                as reply_source_work_code,
       c.parent_business_id                                              as reply_parent_business_id,
       c.content                                                         as reply_content,
       c.comment_status                                                  as reply_comment_status,
       c.comment_time                                                    as reply_comment_time,
       c.is_publisher                                                    as reply_is_publisher,
       c.is_read                                                         as reply_is_read,
       c.floor_num                                                       as reply_floor_num,
       c.audit_status                                                    as reply_audit_status,
       c.comment_type                                                    as reply_comment_type,
       c.like_count                                                      as reply_like_count,
       c.target_nickname                                                 as reply_target_nickname,
       c.source_nickname                                                 as reply_source_nickname,
       c.is_top                                                          as reply_is_top,
       c.article_business_id                                             as reply_article_business_id,
       c.top_time                                                        as reply_top_time,
       current_timestamp()                                               as create_time,
       date_format(date_add(current_date(), -1), "yyyyMMdd")                log_date
from (
         select global_comment_id,
                business_id,
                ctime,
                mtime,
                is_deleted,
                relation_business_id,
                relation_business_type,
                target_user_account,
                target_work_code,
                source_user_account,
                source_work_code,
                parent_business_id,
                content,
                comment_status,
                comment_time,
                is_publisher,
                is_read,
                floor_num,
                audit_status,
                comment_type,
                like_count,
                target_nickname,
                source_nickname,
                is_top,
                article_business_id,
                top_time
         from dwd.dwd_plnt_comt_info_global_comment_a_d --评论主表
         where log_date = date_format(date_add(current_date(), -1), "yyyyMMdd")
     ) a
         left join
     (
         select global_article_id,
                business_id,
                ctime,
                mtime,
                is_deleted,
                article_status,
                simple_content,
                title,
                publish_time,
                latest_reply_time,
                user_account,
                work_code,
                lock_description,
                audit_status,
                view_count,
                comment_count,
                like_count,
                favorite_count,
                nickname,
                is_top,
                top_time,
                score
         from dwd.dwd_plnt_atcl_info_global_article_a_d --帖子表
         where log_date = date_format(date_add(current_date(), -1), "yyyyMMdd")
     ) b on a.article_business_id = b.business_id and a.relation_business_type = 0
         left join
     (
         select global_comment_id,
                business_id,
                ctime,
                mtime,
                is_deleted,
                relation_business_id,
                relation_business_type,
                target_user_account,
                target_work_code,
                source_user_account,
                source_work_code,
                parent_business_id,
                content,
                comment_status,
                comment_time,
                is_publisher,
                is_read,
                floor_num,
                audit_status,
                comment_type,
                like_count,
                target_nickname,
                source_nickname,
                is_top,
                article_business_id,
                top_time
         from dwd.dwd_plnt_comt_info_global_comment_reply_a_d --评论回复表
         where log_date = date_format(date_add(current_date (), -1), "yyyyMMdd")
     ) c
     on a.business_id = c.relation_business_id and a.relation_business_type = 2
         left join
     (
         select pitcure_business_id
         from b_dwb.dwb_plnt_atcl_info_pic_plnt_atcl_a_d
         where log_date = date_format(date_add(current_date(), -1), "yyyyMMdd")
     ) d
     on a.relation_business_id = d.pitcure_business_id

3. 车联网业务

相比上述两种业务,车辆网的业务没有那么重的业务属性,这个场景的宽表更多的是弱业务性强工具性的宽表,比如下面的例子:

set hive.execution.engine = spark;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwb.dwb_plt_voyance_status_acc_1d_d partition (dt)
select car_time,
       carno,
       workoff,
       sum(if(diff_workoff >= 0, diff_workoff, 0)) over (partition by carno order by car_time )       acc_workoff,
       workon,
       sum(if(diff_workon >= 0, diff_workon, 0)) over (partition by carno order by car_time )         acc_workon,
       manual,
       sum(if(diff_manual >= 0, diff_manual, 0)) over (partition by carno order by car_time )         acc_manual, -- 累计里程
       autonomous,
       sum(if(diff_autonomous >= 0, diff_autonomous, 0)) over (partition by carno order by car_time ) acc_autonomous,
       remotecontrol,
       sum(if(diff_remotecontrol >= 0, diff_remotecontrol, 0))
           over (partition by carno order by car_time )                                               acc_remotecontrol,
       date_add(`current_date`(), -1)                                                                 dt
from (
         select car_time,
                carno,
                manual,
                lead_manual - manual as            diff_manual, --当前里程和上一条里程差分
                autonomous,
                lead_autonomous - autonomous       diff_autonomous,
                workon,
                lead_workon - workon               diff_workon,
                workoff,
                lead_workoff - workoff             diff_workoff,
                remotecontrol,
                lead_remotecontrol - remotecontrol diff_remotecontrol
         from (
                  select car_time,
                         carno,
                         nvl(manual, 0)                                                      manual,
                         lead(nvl(manual, 0), 1) over (partition by carno order by car_time) lead_manual,----手动里程整体上移一行
                         autonomous,
                         lead(autonomous, 1) over (partition by carno order by car_time)     lead_autonomous,
                         workon,
                         lead(workon, 1) over (partition by carno order by car_time)         lead_workon,
                         workoff,
                         lead(workoff, 1) over (partition by carno order by car_time)        lead_workoff,
                         remotecontrol,
                         lead(remotecontrol, 1) over (partition by carno order by car_time)  lead_remotecontrol
                  from (
                           select car_time,
                                  carno,
                                  cast(work_on_mile as double)  workon,
                                  cast(work_off_mile as double) workoff,
                                  cast(manual_mile as double)   manual, --手动里程
                                  cast(auto_mile as double)     autonomous,
                                  cast(remote_mile as double)   remotecontrol
                           from dwd.dwd_plt_vehicle_voyance_status_info_1d_d
                           where dt = date_add(`current_date`(), -1)
                             and substring(car_time, 1, 10) = date_add(`current_date`(), -1)
                             and carno is not null
                       ) l1
              ) l2
     ) l3;

这里的手动瞬时里程manual在车端上传时不是累计增长的,当车辆重新上电后里程会重新增长,为了便于后续DWS层聚合,这里将明细的里程制作为了累计增长的里程,这种宽表对业务同样重要。

三、总结

宽表设计是数仓开发的灵魂,具体怎么使用要根据业务制定和取舍,扬长避短发挥业务价值。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值