This article describes how to use the Bitmap data structure when layering data in the Doris database, bringing the freshness of statistics over hundreds of millions of rows down to minute level.
1. The bitmap data structure
A BITMAP column cannot be used as a key column; when creating a table it must be paired with the BITMAP_UNION aggregate type.
On insert, rows with the same key have their Bitmap fields merged into one, and duplicates are removed automatically during the merge; this is handled by Doris's aggregate model.
CREATE TABLE test_bitmap (
datekey int,
hour int,
device_id bitmap BITMAP_UNION
)
AGGREGATE KEY (datekey, hour)
DISTRIBUTED BY HASH(datekey, hour) BUCKETS 1
PROPERTIES (
"replication_num" = "1"
);
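A quick sketch of the dedup behavior against the test_bitmap table above (the values are made up for illustration): two separate loads both write device id 2 for the same key, but the aggregate model merges everything into a single row, so the duplicate disappears.
// Example: duplicate device ids collapse after BITMAP_UNION aggregation
insert into test_bitmap values (20240116, 11, to_bitmap(1)), (20240116, 11, to_bitmap(2));
insert into test_bitmap values (20240116, 11, to_bitmap(2));
-- one row per key remains; bitmap_count(device_id) returns 2, not 3
select datekey, hour, bitmap_count(device_id) from test_bitmap;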
1.1 bitmap_count
Returns the cardinality of a single, already aggregated Bitmap. Because rows with the same key have already been merged into one row and deduplicated, the query is fast. This function is also what the optimization below relies on to reach minute-level freshness; see section 2 for details.
1.2 bitmap_union_count
Re-aggregates the Bitmaps according to the GROUP BY columns and then returns the deduplicated cardinality of the merged result. Because it must re-aggregate and deduplicate before counting, it is noticeably slower than bitmap_count.
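Side by side, against the test_bitmap table from section 1 (a sketch; column names come from that DDL):
-- fast: each row already holds one merged, deduplicated bitmap
select datekey, hour, bitmap_count(device_id) from test_bitmap;
-- slower: bitmaps must first be re-merged per datekey before the deduplicated count
select datekey, bitmap_union_count(device_id) from test_bitmap group by datekey;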
2. Layered architecture
ODS: detail layer; a unique model keeps the data free of duplicates. Flink syncs MySQL data one-to-one into Doris, filling in a few extra columns so that later joins can be avoided.
DWS: aggregate model with several dimension columns and a single Bitmap column as the metric; different reports can reuse this layer.
Incremental acceleration layer: aggregate model used to speed up incremental statistics. It contains only the dimensions a given report needs, with Bitmap columns as metrics; computing a metric needs no GROUP BY, just a call to bitmap_count, which is where the speedup comes from.
ADS: unique model that stores only the report dimensions plus the final metric values, for the front end to query. The metrics are precomputed, so the front end displays them directly with little work at query time, and queries are fast.
3. Case study
Requirement: count successful and failed SMS sends per channel type.
The requirement is simple: one dimension (channel) and two metrics (successful sends and failed sends).
- ODS layer: unique model, synced in real time by Flink; rows are copied as-is, with a few extra columns filled in
// DDL
CREATE TABLE `msg_sms_ods` (
`id` bigint(20) NOT NULL COMMENT "",
`post_time` datetime NULL COMMENT "submit time",
`state` int(11) NULL COMMENT "state",
`channel_id` int(11) NULL COMMENT "channel id",
`user_id` int(11) NULL COMMENT "user id",
`update_time` DATETIME NULL COMMENT "update time"
) ENGINE=OLAP
UNIQUE KEY(`id`,`post_time`)
COMMENT "OLAP"
PARTITION BY RANGE(`post_time`)()
DISTRIBUTED BY HASH(`id`) BUCKETS AUTO
PROPERTIES (
"bloom_filter_columns"="update_time",
"replication_allocation" = "tag.location.default: 3",
"in_memory" = "false",
"storage_format" = "V2",
"enable_unique_key_merge_on_write" = "true",
"estimate_partition_size" = "3G",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.start" = "-93",
"dynamic_partition.end" = "3",
"dynamic_partition.prefix" = "p_",
"dynamic_partition.buckets" = "2",
"dynamic_partition.create_history_partition"="true",
"dynamic_partition.history_partition_num"= "93"
);
// Seed data: four rows in this layer, normally synced in real time by Flink
INSERT INTO `msg_sms_ods`(`id`, `post_time`, `state`, `channel_id`, `user_id`, `update_time`)
VALUES (1193913585990959104, '2024-01-16 11:58:00', 1, 111, 1, '2024-01-16 12:00:00');
INSERT INTO `msg_sms_ods`(`id`, `post_time`, `state`, `channel_id`, `user_id`, `update_time`)
VALUES (1193913585990959105, '2024-01-16 11:58:00', 1, 111, 3, '2024-01-16 12:00:00');
INSERT INTO `msg_sms_ods`(`id`, `post_time`, `state`, `channel_id`, `user_id`, `update_time`)
VALUES (1193913585990959106, '2024-01-16 11:58:00', 2, 222, 2, '2024-01-16 12:00:00');
INSERT INTO `msg_sms_ods`(`id`, `post_time`, `state`, `channel_id`, `user_id`, `update_time`)
VALUES (1193913585990959107, '2024-01-16 11:58:00', 2, 222, 2, '2024-01-16 12:00:00');
- DWS: multi-dimension aggregation layer; aggregate model with a bitmap-typed metric column, reusable across reports
// DDL
CREATE TABLE IF NOT EXISTS msg_sms_dws
(
`user_id` int NOT NULL COMMENT "user id",
`channel_id` int NOT NULL COMMENT "channel id",
`state` int NOT NULL COMMENT "state",
`stat_date` date NOT NULL COMMENT "stat date",
`state_count` bitmap BITMAP_UNION NULL COMMENT ""
)
AGGREGATE KEY( `user_id`, `channel_id`, `state`, `stat_date`)
PARTITION BY RANGE(`stat_date`)()
DISTRIBUTED BY HASH(`user_id`, `channel_id`, `state`) BUCKETS AUTO
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"in_memory" = "false",
"storage_format" = "V2",
"estimate_partition_size" = "3G",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "WEEK",
"dynamic_partition.end" = "2",
"dynamic_partition.prefix" = "p_",
"dynamic_partition.buckets" = "10",
"dynamic_partition.create_history_partition"="true",
"dynamic_partition.history_partition_num"= "53"
);
// Full recomputation SQL: clear the day, then rebuild it from ODS
delete from msg_sms_dws where stat_date>='2024-01-16'
and stat_date<'2024-01-17';
insert into msg_sms_dws(user_id,channel_id,state,stat_date,state_count)
select
t.user_id,t.channel_id,t.state,date(t.post_time),
to_bitmap(t.id)
from msg_sms_ods t
where post_time>='2024-01-16' and post_time<'2024-01-17';
Resulting data:
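The generated rows can be inspected with a query like the following (bitmap_to_string is a built-in Doris function that prints the ids held in a bitmap); with the four seed rows above, the two state-2 sends of user 2 on channel 222 end up in a single row whose bitmap holds both ids.
select user_id, channel_id, state, stat_date, bitmap_to_string(state_count)
from msg_sms_dws
where stat_date>='2024-01-16' and stat_date<'2024-01-17';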
- Incremental acceleration layer: speeds up incremental statistics, solving the slowness of incremental DWS-to-ADS computation; this layer keeps only one day of data
// DDL
CREATE TABLE `msg_sms_increment` (
`stat_date` date NOT NULL COMMENT 'stat date',
`channel_id` INT NOT NULL COMMENT "channel id",
`send_success_num_bit_map` bitmap BITMAP_UNION NULL COMMENT 'bitmap of successful sends',
`send_fail_num_bit_map` bitmap BITMAP_UNION NULL COMMENT 'bitmap of failed sends'
) ENGINE=OLAP
AGGREGATE KEY(`stat_date`, `channel_id`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`channel_id`) BUCKETS AUTO
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"in_memory" = "false",
"storage_format" = "V2",
"disable_auto_compaction" = "false"
);
-- Incremental statistics: ODS -> incremental acceleration layer (only rows updated in the last 10 minutes)
insert into msg_sms_increment(stat_date,channel_id,send_success_num_bit_map,send_fail_num_bit_map)
select
date(t.post_time),t.channel_id,
(CASE WHEN t.state=1 THEN to_bitmap(t.id) ELSE bitmap_empty() END),
(CASE WHEN t.state!=1 THEN to_bitmap(t.id) ELSE bitmap_empty() END)
from msg_sms_ods t where post_time>='2024-01-16'
and post_time<'2024-01-17'
and t.update_time >= date_sub(now(), INTERVAL 10 MINUTE);
Resulting data:
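A sketch of a check on this layer: each (stat_date, channel_id) key has already been merged into one row, so a plain bitmap_count is enough, with no GROUP BY.
select stat_date, channel_id,
bitmap_count(send_success_num_bit_map) as send_success_num,
bitmap_count(send_fail_num_bit_map) as send_fail_num
from msg_sms_increment where stat_date>='2024-01-16';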
- ADS layer: unique model; stores the final statistics, and the front end queries only this layer
Incremental statistics: read from the acceleration layer; no GROUP BY needed.
Full statistics: read from DWS; GROUP BY is required. The acceleration layer cannot be used for full recomputation as well, because it holds several bitmap columns: inserting a full day of data (hundreds of millions of rows) into it takes hours, whereas the DWS layer has a single bitmap column and takes a little over ten minutes, so full statistics are generated from DWS.
// DDL
CREATE TABLE IF NOT EXISTS msg_sms_ads
(
`stat_date` date NOT NULL COMMENT 'stat date',
`channel_id` INT NOT NULL COMMENT "channel id",
`send_success_num` BIGINT NOT NULL COMMENT 'successful sends',
`send_fail_num` BIGINT NOT NULL COMMENT 'failed sends'
)
UNIQUE KEY(`stat_date`,`channel_id`)
PARTITION BY RANGE(`stat_date`)()
DISTRIBUTED BY HASH(`stat_date`,`channel_id`) BUCKETS AUTO
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"in_memory" = "false",
"storage_format" = "V2",
"estimate_partition_size" = "3G",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "MONTH",
"dynamic_partition.end" = "2",
"dynamic_partition.prefix" = "p_",
"dynamic_partition.buckets" = "4",
"dynamic_partition.create_history_partition"="true",
"dynamic_partition.history_partition_num"= "12",
"enable_unique_key_merge_on_write" = "true",
"disable_auto_compaction" = "false"
);
-- Acceleration layer -> ADS; no GROUP BY needed
insert into msg_sms_ads(stat_date,channel_id,send_success_num,send_fail_num)
select
t.stat_date,t.channel_id,
bitmap_count(send_success_num_bit_map) as send_success_sum,
bitmap_count(send_fail_num_bit_map) AS send_fail_sum
from msg_sms_increment t where stat_date>='2024-01-16'
and stat_date<'2024-01-17';
-- Full recomputation
delete from msg_sms_increment where stat_date>='2024-01-16' and stat_date<'2024-01-17';
delete from msg_sms_ads where stat_date>='2024-01-16' and stat_date<'2024-01-17';
insert into msg_sms_ads(stat_date,channel_id,send_success_num,send_fail_num)
select
dws.stat_date,dws.channel_id,
bitmap_union_count(CASE WHEN state=1 THEN state_count ELSE bitmap_empty() END) as send_success_sum,
bitmap_union_count(CASE WHEN state!=1 THEN state_count ELSE bitmap_empty() END) AS send_fail_sum
from msg_sms_dws dws where stat_date>='2024-01-16'
and stat_date<'2024-01-17'
group by channel_id,stat_date;
Resulting data:
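The front end then needs nothing more than a plain lookup against this layer, for example:
select stat_date, channel_id, send_success_num, send_fail_num
from msg_sms_ads
where stat_date>='2024-01-16' and stat_date<'2024-01-17';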
4. Summary
This article covered how to layer data in Doris, how to use the bitmap data structure, the use cases of bitmap_count versus bitmap_union_count, and, through a case study, how incremental statistics over hundreds of millions of rows can produce results within minutes.