一、背景
在数仓开发过程中,当业务过程中存在相似性时,我们会采用宽表设计,这样可以提升数据整合性、提高数据查询效率,并降低维护成本。同时,宽表设计也便于跨系统和跨部门的数据整合与调用。但是宽表的设计从维度建模的思想来说适当破坏了规范,这一层要根据具体场景去平衡。
二、不同场景下的宽表思考
1.电商业务/金融业务
在电商场景,我们进行维度建模时首先要选择业务过程,比如下单业务,涉及订单事实表,维度包括时间/地区/优惠券/商品等,度量值有运费/优惠金额/原始金额/最终金额等,在这个特定的场景事实表单一,我们按照维度建模正常构建就可以,我认为没有宽表的设计必要;但是如果在金融业务,不仅有订单事实表,还会有借款审批事实表、还款计划事实表等,那我们能想到这里的业务过程是存在相关性的,在订单交易过程中,我们会将借款申请和借款审批信息纳入还款计划,同时将贷后数据如还款状态和还款时间也包括在内。因此这里就有必要整合宽表。
2.社区业务
在社区场景,我们进行维度建模时首先要选择业务过程,比如评论业务,涉及评论表事实表、帖子表事实表、回复事实表,这个过程我认为可以构建宽表,以评论表为主表,关联帖子表,然后关联评论回复表。显然,用户可以评论帖子也可以回复评论。下面是该场景的SQL示例:
select a.global_comment_id as global_comment_id,
a.business_id as business_id,
a.ctime as ctime,
a.mtime as mtime,
a.is_deleted as is_deleted,
a.relation_business_id as relation_business_id,
a.relation_business_type as relation_business_type,
a.target_user_account as target_user_account,
a.target_work_code as target_work_code,
a.source_user_account as source_user_account,
xxx as uid_leader,
a.source_work_code as source_work_code,
a.parent_business_id as parent_business_id,
a.content as content,
a.comment_status as comment_status,
a.comment_time as comment_time,
a.is_publisher as is_publisher,
a.is_read as is_read,
a.floor_num as floor_num,
a.audit_status as audit_status,
a.comment_type as comment_type,
a.like_count as like_count,
a.target_nickname as target_nickname,
a.source_nickname as source_nickname,
a.is_top as is_top,
a.article_business_id as article_business_id,
a.top_time as top_time,
b.global_article_id as comment_article_id,
b.business_id as comment_article_business_id,
case when b.business_id = d.pitcure_business_id then 1 else 0 end as comment_article_is_pitcure_business_id,
b.ctime as comment_article_ctime,
b.mtime as comment_article_mtime,
b.is_deleted as comment_article_is_deleted,
b.article_status as comment_article_article_status,
b.simple_content as comment_article_simple_content,
b.title as comment_article_title,
b.publish_time as comment_article_publish_time,
b.latest_reply_time as comment_article_latest_reply_time,
b.user_account as comment_article_user_account,
b.work_code as comment_article_work_code,
b.lock_description as comment_article_lock_description,
b.audit_status as comment_article_audit_status,
b.view_count as comment_article_view_count,
b.comment_count as comment_article_comment_count,
b.like_count as comment_article_like_count,
b.favorite_count as comment_article_favorite_count,
b.nickname as comment_article_nickname,
b.is_top as comment_article_is_top,
b.top_time as comment_article_top_time,
b.score as comment_article_score,
c.global_comment_id as reply_id,
c.business_id as reply_business_id,
c.ctime as reply_ctime,
c.mtime as reply_mtime,
c.is_deleted as reply_is_deleted,
c.relation_business_id as reply_relation_business_id,
c.relation_business_type as reply_relation_business_type,
c.target_user_account as reply_target_user_account,
c.target_work_code as reply_target_work_code,
c.source_user_account as reply_source_user_account,
c.source_work_code as reply_source_work_code,
c.parent_business_id as reply_parent_business_id,
c.content as reply_content,
c.comment_status as reply_comment_status,
c.comment_time as reply_comment_time,
c.is_publisher as reply_is_publisher,
c.is_read as reply_is_read,
c.floor_num as reply_floor_num,
c.audit_status as reply_audit_status,
c.comment_type as reply_comment_type,
c.like_count as reply_like_count,
c.target_nickname as reply_target_nickname,
c.source_nickname as reply_source_nickname,
c.is_top as reply_is_top,
c.article_business_id as reply_article_business_id,
c.top_time as reply_top_time,
current_timestamp() as create_time,
date_format(date_add(current_date(), -1), "yyyyMMdd") log_date
from (
select global_comment_id,
business_id,
ctime,
mtime,
is_deleted,
relation_business_id,
relation_business_type,
target_user_account,
target_work_code,
source_user_account,
source_work_code,
parent_business_id,
content,
comment_status,
comment_time,
is_publisher,
is_read,
floor_num,
audit_status,
comment_type,
like_count,
target_nickname,
source_nickname,
is_top,
article_business_id,
top_time
from dwd.dwd_plnt_comt_info_global_comment_a_d --评论主表
where log_date = date_format(date_add(current_date(), -1), "yyyyMMdd")
) a
left join
(
select global_article_id,
business_id,
ctime,
mtime,
is_deleted,
article_status,
simple_content,
title,
publish_time,
latest_reply_time,
user_account,
work_code,
lock_description,
audit_status,
view_count,
comment_count,
like_count,
favorite_count,
nickname,
is_top,
top_time,
score
from dwd.dwd_plnt_atcl_info_global_article_a_d --帖子表
where log_date = date_format(date_add(current_date(), -1), "yyyyMMdd")
) b on a.article_business_id = b.business_id and a.relation_business_type = 0
left join
(
select global_comment_id,
business_id,
ctime,
mtime,
is_deleted,
relation_business_id,
relation_business_type,
target_user_account,
target_work_code,
source_user_account,
source_work_code,
parent_business_id,
content,
comment_status,
comment_time,
is_publisher,
is_read,
floor_num,
audit_status,
comment_type,
like_count,
target_nickname,
source_nickname,
is_top,
article_business_id,
top_time
from dwd.dwd_plnt_comt_info_global_comment_reply_a_d --评论回复表
where log_date = date_format(date_add(current_date (), -1), "yyyyMMdd")
) c
on a.business_id = c.relation_business_id and a.relation_business_type = 2
left join
(
select pitcure_business_id
from b_dwb.dwb_plnt_atcl_info_pic_plnt_atcl_a_d
where log_date = date_format(date_add(current_date(), -1), "yyyyMMdd")
) d
on a.relation_business_id = d.pitcure_business_id
3. 车联网业务
相比上述两种业务,车辆网的业务没有那么重的业务属性,这个场景的宽表更多的是弱业务性强工具性的宽表,比如下面的例子:
set hive.execution.engine = spark;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwb.dwb_plt_voyance_status_acc_1d_d partition (dt)
select car_time,
carno,
workoff,
sum(if(diff_workoff >= 0, diff_workoff, 0)) over (partition by carno order by car_time ) acc_workoff,
workon,
sum(if(diff_workon >= 0, diff_workon, 0)) over (partition by carno order by car_time ) acc_workon,
manual,
sum(if(diff_manual >= 0, diff_manual, 0)) over (partition by carno order by car_time ) acc_manual, -- 累计里程
autonomous,
sum(if(diff_autonomous >= 0, diff_autonomous, 0)) over (partition by carno order by car_time ) acc_autonomous,
remotecontrol,
sum(if(diff_remotecontrol >= 0, diff_remotecontrol, 0))
over (partition by carno order by car_time ) acc_remotecontrol,
date_add(`current_date`(), -1) dt
from (
select car_time,
carno,
manual,
lead_manual - manual as diff_manual, --当前里程和上一条里程差分
autonomous,
lead_autonomous - autonomous diff_autonomous,
workon,
lead_workon - workon diff_workon,
workoff,
lead_workoff - workoff diff_workoff,
remotecontrol,
lead_remotecontrol - remotecontrol diff_remotecontrol
from (
select car_time,
carno,
nvl(manual, 0) manual,
lead(nvl(manual, 0), 1) over (partition by carno order by car_time) lead_manual,----手动里程整体上移一行
autonomous,
lead(autonomous, 1) over (partition by carno order by car_time) lead_autonomous,
workon,
lead(workon, 1) over (partition by carno order by car_time) lead_workon,
workoff,
lead(workoff, 1) over (partition by carno order by car_time) lead_workoff,
remotecontrol,
lead(remotecontrol, 1) over (partition by carno order by car_time) lead_remotecontrol
from (
select car_time,
carno,
cast(work_on_mile as double) workon,
cast(work_off_mile as double) workoff,
cast(manual_mile as double) manual, --手动里程
cast(auto_mile as double) autonomous,
cast(remote_mile as double) remotecontrol
from dwd.dwd_plt_vehicle_voyance_status_info_1d_d
where dt = date_add(`current_date`(), -1)
and substring(car_time, 1, 10) = date_add(`current_date`(), -1)
and carno is not null
) l1
) l2
) l3;
这里的手动瞬时里程manual在车端上传时不是累计增长的,当车辆重新上电后里程会重新增长,为了便于后续DWS层聚合,这里将明细的里程制作为了累计增长的里程,这种宽表对业务同样重要。
三、总结
宽表设计是数仓开发的灵魂,具体怎么使用要根据业务制定和取舍,扬长避短发挥业务价值。