Data Layering in Doris: Minute-Level Freshness for Hundred-Million-Row Data

This article describes how to use the Bitmap data type when layering data in Doris, so that statistics over hundred-million-row data stay fresh to within minutes.

1. The Bitmap data type

A BITMAP column cannot be used as a key column; when creating the table it is paired with the BITMAP_UNION aggregation type.

On insert, rows sharing the same key have their Bitmap columns merged into one, and the merge deduplicates automatically; this behavior is provided by Doris's aggregate model.

CREATE TABLE test_bitmap (
  datekey int,
  hour int,
  device_id bitmap BITMAP_UNION
)
AGGREGATE KEY (datekey, hour)
DISTRIBUTED BY HASH(datekey, hour) BUCKETS 1
PROPERTIES(
  "replication_num" = "1"
);
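As a rough illustration (not Doris internals), the merge-and-dedup behavior of an aggregate-key table with a BITMAP_UNION column can be modeled in Python, with a plain set standing in for the roaring bitmap:

```python
# Toy model of a Doris aggregate-key table whose metric column is
# BITMAP_UNION; a Python set stands in for the roaring bitmap.
table = {}  # (datekey, hour) -> set of device ids

def insert_row(datekey, hour, device_id):
    # Rows with the same aggregate key are merged on write; the bitmap
    # union deduplicates device ids automatically.
    table.setdefault((datekey, hour), set()).add(device_id)

insert_row(20240116, 11, 1001)
insert_row(20240116, 11, 1002)
insert_row(20240116, 11, 1001)  # same device again: absorbed by the union
insert_row(20240116, 12, 1001)

print(len(table[(20240116, 11)]))  # 2 distinct devices in hour 11
```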

1.1 bitmap_count

Returns the cardinality of a Bitmap that has already been aggregated into a single row. Because the rows were merged and deduplicated at write time, the query is fast. This function is what the incremental path uses to reach minute-level freshness; see Part 2 for details.

1.2 bitmap_union_count

Re-aggregates by the GROUP BY columns, unions the Bitmaps within each group (deduplicating across rows), and only then counts them. Because the merge and deduplication happen at query time, this is noticeably slower than bitmap_count.
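In rough Python terms (sets standing in for bitmaps, purely illustrative), the difference between the two functions looks like this:

```python
# bitmap_count: the row's bitmap is already merged and deduplicated,
# so the function only reads its cardinality.
def bitmap_count(bm):
    return len(bm)

# bitmap_union_count: the bitmaps of all rows in a GROUP BY group must
# first be unioned (cross-row dedup) before counting - the costly part.
def bitmap_union_count(bitmaps):
    merged = set()
    for bm in bitmaps:
        merged |= bm
    return len(merged)

print(bitmap_count({104, 105, 106}))                 # 3: a cardinality read
print(bitmap_union_count([{104, 105}, {105, 106}]))  # 3 after deduplication
```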

2. Layered architecture

ODS: detail layer on the unique model, which keeps the data free of duplicates. Flink syncs MySQL data to Doris one-to-one, filling in a few extra columns to avoid joins downstream.
DWS: aggregate model with multiple dimension columns and a single Bitmap metric column; different reports can reuse it.
Incremental acceleration layer: aggregate model used to speed up incremental statistics. It keeps only the dimensions a report needs, with Bitmap metric columns; computing a metric then requires no GROUP BY, just a bitmap_count call, which is where the speedup comes from.
ADS: unique model that stores only the report dimensions plus the final metric values, for frontend queries. The numbers are precomputed, so the frontend renders them directly with little extra work, and queries are fast.

3. Case study

Requirement: count SMS send successes and failures by channel.
The requirement is simple: one dimension (channel) and two metrics (success count and failure count).

  1. ODS layer: unique model; Flink syncs the source data in real time, as-is, plus a few extended columns.
-- table DDL
CREATE TABLE `msg_sms_ods` (
  `id` bigint(20) NOT NULL COMMENT "",
  `post_time` datetime NULL COMMENT "submit time",
  `state` int(11) NULL COMMENT "state",
  `channel_id` int(11) NULL COMMENT "channel id",
  `user_id` int(11) NULL COMMENT "user id",
  `update_time` DATETIME NULL COMMENT "update time"
) ENGINE=OLAP
UNIQUE KEY(`id`,`post_time`)
COMMENT "OLAP"
PARTITION BY RANGE(`post_time`)()
DISTRIBUTED BY HASH(`id`) BUCKETS AUTO
PROPERTIES (
"bloom_filter_columns"="update_time",
"replication_allocation" = "tag.location.default: 3",
"in_memory" = "false",
"storage_format" = "V2",
"enable_unique_key_merge_on_write" = "true",
"estimate_partition_size" = "3G",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.start" = "-93",
"dynamic_partition.end" = "3",
"dynamic_partition.prefix" = "p_",
"dynamic_partition.buckets" = "2",
"dynamic_partition.create_history_partition"="true",
"dynamic_partition.history_partition_num"= "93"
);

-- seed data: four rows in this layer, synced in real time by Flink
INSERT INTO `msg_sms_ods`(`id`, `post_time`, `state`, `channel_id`, `user_id`, `update_time`) 
VALUES (1193913585990959104, '2024-01-16 11:58:00', 1, 111, 1, '2024-01-16 12:00:00');
INSERT INTO `msg_sms_ods`(`id`, `post_time`, `state`, `channel_id`, `user_id`, `update_time`) 
VALUES (1193913585990959105, '2024-01-16 11:58:00', 1, 111, 3, '2024-01-16 12:00:00');
INSERT INTO `msg_sms_ods`(`id`, `post_time`, `state`, `channel_id`, `user_id`, `update_time`) 
VALUES (1193913585990959106, '2024-01-16 11:58:00', 2, 222, 2, '2024-01-16 12:00:00');
INSERT INTO `msg_sms_ods`(`id`, `post_time`, `state`, `channel_id`, `user_id`, `update_time`) 
VALUES (1193913585990959107, '2024-01-16 11:58:00', 2, 222, 2, '2024-01-16 12:00:00');

  2. DWS: multi-dimension aggregation layer; aggregate model with a Bitmap metric column; reusable across reports.
-- table DDL
CREATE TABLE IF NOT EXISTS msg_sms_dws
(
  `user_id` int NOT NULL COMMENT "user id",
  `channel_id` int NOT NULL COMMENT "channel id",
  `state` int NOT NULL COMMENT "state",
  `stat_date` date NOT NULL COMMENT "stat date",
  `state_count` bitmap BITMAP_UNION NULL COMMENT ""
)
AGGREGATE KEY( `user_id`, `channel_id`, `state`, `stat_date`)
PARTITION BY RANGE(`stat_date`)()
DISTRIBUTED BY HASH(`user_id`, `channel_id`, `state`) BUCKETS AUTO
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"in_memory" = "false",
"storage_format" = "V2",
"estimate_partition_size" = "3G",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "WEEK",
"dynamic_partition.end" = "2",
"dynamic_partition.prefix" = "p_",
"dynamic_partition.buckets" = "10",
"dynamic_partition.create_history_partition"="true",
"dynamic_partition.history_partition_num"= "53"
);

-- full (batch) recount: ODS -> DWS
delete from msg_sms_dws where stat_date>='2024-01-16'
and stat_date<'2024-01-17';

insert into msg_sms_dws(user_id,channel_id,state,stat_date,state_count)
select
t.user_id,t.channel_id,t.state,date(t.post_time),
to_bitmap(t.id)
from msg_sms_ods t
where post_time>='2024-01-16' and post_time<'2024-01-17';

Resulting data: (screenshot omitted)
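Using the four sample ODS rows above, the rollup into the DWS aggregate key can be sketched in Python (sets standing in for bitmaps; a simplified model, not actual Doris behavior):

```python
# The four ODS rows from the example: (id, state, channel_id, user_id).
ods = [
    (1193913585990959104, 1, 111, 1),
    (1193913585990959105, 1, 111, 3),
    (1193913585990959106, 2, 222, 2),
    (1193913585990959107, 2, 222, 2),
]

# DWS aggregate key is (user_id, channel_id, state, stat_date);
# to_bitmap(id) values sharing a key are merged by BITMAP_UNION.
dws = {}
for msg_id, state, channel_id, user_id in ods:
    key = (user_id, channel_id, state, "2024-01-16")
    dws.setdefault(key, set()).add(msg_id)

for key, bitmap in sorted(dws.items()):
    print(key, len(bitmap))
# Three DWS rows result; user 2 / channel 222 holds both failed ids.
```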

  3. Incremental acceleration layer: speeds up incremental statistics, solving the slow DWS-to-ADS incremental path; this layer keeps only one day of data.
-- table DDL
CREATE TABLE `msg_sms_increment` (
`stat_date` date NOT NULL COMMENT 'stat date',
`channel_id` INT NOT NULL COMMENT 'channel id',
`send_success_num_bit_map` bitmap BITMAP_UNION NULL COMMENT 'ticket send-success bitmap',
`send_fail_num_bit_map` bitmap BITMAP_UNION NULL COMMENT 'ticket send-failure bitmap'
) ENGINE=OLAP
AGGREGATE KEY(`stat_date`, `channel_id`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`channel_id`) BUCKETS AUTO
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"in_memory" = "false",
"storage_format" = "V2",
"disable_auto_compaction" = "false"
);


-- incremental stats: ODS -> acceleration layer, driven by update_time
insert into msg_sms_increment(stat_date,channel_id,send_success_num_bit_map,send_fail_num_bit_map)
select
date(t.post_time),t.channel_id,
(CASE WHEN t.state=1 THEN to_bitmap(t.id) ELSE to_bitmap(-1) END),
(CASE WHEN t.state!=1 THEN to_bitmap(t.id) ELSE to_bitmap(-1) END)
from msg_sms_ods t where post_time>='2024-01-16'
and post_time<'2024-01-17'
and t.update_time >= date_sub(now(), INTERVAL 10 MINUTE);




Resulting data: (screenshot omitted)
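The CASE split above can be sketched in Python (sets standing in for bitmaps; the ELSE to_bitmap(-1) branch is a placeholder, modeled here as adding nothing, on the assumption that the out-of-range value is effectively ignored by the union):

```python
ods = [  # (id, state, channel_id) for the four sample rows
    (1193913585990959104, 1, 111),
    (1193913585990959105, 1, 111),
    (1193913585990959106, 2, 222),
    (1193913585990959107, 2, 222),
]

# Acceleration-layer key is (stat_date, channel_id); each key keeps a
# success bitmap and a failure bitmap, split by the CASE on state.
increment = {}
for msg_id, state, channel_id in ods:
    success, fail = increment.setdefault(("2024-01-16", channel_id),
                                         (set(), set()))
    (success if state == 1 else fail).add(msg_id)

for (date, ch), (success, fail) in sorted(increment.items()):
    print(ch, len(success), len(fail))  # per-channel success/fail sizes
```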

  4. ADS layer: unique model that stores the final statistics; the frontend queries only this layer.

Incremental stats read the acceleration layer's output and need no GROUP BY.
Full (batch) stats read the DWS output and do need a GROUP BY. The acceleration layer cannot serve both paths: it holds multiple Bitmap columns, so inserting a full day of data (hundreds of millions of rows) takes hours, whereas the DWS layer has a single Bitmap column and finishes in ten-odd minutes. Full recounts therefore run off DWS.

-- table DDL
CREATE TABLE IF NOT EXISTS msg_sms_ads
(
`stat_date` date NOT NULL COMMENT 'stat date',
`channel_id` INT NOT NULL COMMENT 'channel id',
`send_success_num` BIGINT NOT NULL COMMENT 'ticket send success count',
`send_fail_num` BIGINT NOT NULL COMMENT 'ticket send failure count'
)
UNIQUE KEY(`stat_date`,`channel_id`)
PARTITION BY RANGE(`stat_date`)()
DISTRIBUTED BY HASH(`stat_date`,`channel_id`) BUCKETS AUTO
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"in_memory" = "false",
"storage_format" = "V2",
"estimate_partition_size" = "3G",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "MONTH",
"dynamic_partition.end" = "2",
"dynamic_partition.prefix" = "p_",
"dynamic_partition.buckets" = "4",
"dynamic_partition.create_history_partition"="true",
"dynamic_partition.history_partition_num"= "12",
"enable_unique_key_merge_on_write" = "true",
"disable_auto_compaction" = "false"
);


-- acceleration layer -> ADS, no GROUP BY needed
insert into msg_sms_ads(stat_date,channel_id,send_success_num,send_fail_num)
select
t.stat_date,t.channel_id,
bitmap_count(send_success_num_bit_map) as send_success_num,
bitmap_count(send_fail_num_bit_map) AS send_fail_num
from msg_sms_increment t where stat_date>='2024-01-16'
and stat_date<'2024-01-17';


-- full (batch) recount: DWS -> ADS
delete from msg_sms_increment where stat_date>='2024-01-16' and stat_date<'2024-01-17';

delete from msg_sms_ads where stat_date>='2024-01-16' and stat_date<'2024-01-17';

insert into msg_sms_ads(stat_date,channel_id,send_success_num,send_fail_num)
select
dws.stat_date,dws.channel_id,
bitmap_union_count(CASE WHEN state=1 THEN state_count ELSE to_bitmap(-1) END) as send_success_num,
bitmap_union_count(CASE WHEN state!=1 THEN state_count ELSE to_bitmap(-1) END) AS send_fail_num
from msg_sms_dws dws where stat_date>='2024-01-16'
and stat_date<'2024-01-17'
group by channel_id,stat_date;

Resulting data: (screenshot omitted)
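The full recount from DWS can be sketched the same way: within each (channel_id, stat_date) group, the state=1 bitmaps are unioned into the success metric and the rest into the failure metric (sets standing in for bitmaps, illustrative only):

```python
# DWS rows keyed by (user_id, channel_id, state, stat_date).
dws = {
    (1, 111, 1, "2024-01-16"): {1193913585990959104},
    (3, 111, 1, "2024-01-16"): {1193913585990959105},
    (2, 222, 2, "2024-01-16"): {1193913585990959106, 1193913585990959107},
}

# GROUP BY (stat_date, channel_id); bitmap_union_count per CASE branch.
ads = {}
for (user_id, channel_id, state, stat_date), bitmap in dws.items():
    success, fail = ads.setdefault((stat_date, channel_id), (set(), set()))
    (success if state == 1 else fail).update(bitmap)

for (date, ch), (success, fail) in sorted(ads.items()):
    print(ch, len(success), len(fail))  # send_success_num / send_fail_num
```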

4. Summary

This article covered how to layer data in Doris, how to use the Bitmap data type, the respective use cases of bitmap_count and bitmap_union_count, and, through the case study, how incremental statistics over hundred-million-row data can deliver results within minutes.
