1.ReplacingMergeTree
- 删除具有相同排序键值的重复项。
- 数据去重是在合并期间进行的。
- 后台的合并操作在未知的时间触发,因此用户无法对合并进行计划。
- 可使用OPTIMIZE语句运行计划外的合并,但OPTIMIZE是一个很重的操作。
- ReplacingMergeTree适合清除后台的重复数据以节省空间。
指定表引擎:
ENGINE = ReplacingMergeTree([ver])
参数:ver,版本列。版本列的类型为UInt*、Date或DateTime。可选参数。合并的时候,ReplacingTree从所有相同主键的行中选择一行留下:如果ver未指定,选择最后一条。如果指定了ver列,选择ver值最大的版本。
2.示例1:根据排序键去重
建表:
DROP TABLE replacingMergeTreeDemo; CREATE TABLE replacingMergeTreeDemo ( UserID UInt32, CounterID UInt32, UserName String, EventDate Date ) ENGINE = ReplacingMergeTree() ORDER BY (UserID, CounterID) PRIMARY KEY (UserID);
执行效果:
xxxxx :) CREATE TABLE replacingMergeTreeDemo :-] ( :-] UserID UInt32, :-] CounterID UInt32, :-] UserName String, :-] EventDate Date :-] ) ENGINE = ReplacingMergeTree() :-] ORDER BY (UserID, CounterID) :-] PRIMARY KEY (UserID); CREATE TABLE replacingMergeTreeDemo ( `UserID` UInt32, `CounterID` UInt32, `UserName` String, `EventDate` Date ) ENGINE = ReplacingMergeTree() PRIMARY KEY UserID ORDER BY (UserID, CounterID) Ok. 0 rows in set. Elapsed: 0.055 sec. xxxxx :)
插入数据:
分三批次插入数据,为了观察效果,下面的三段脚本依次按顺序分别执行,不要一起执行:
批次1:insert into replacingMergeTreeDemo values(1,100,'xiaohe1','2020-04-21'); insert into replacingMergeTreeDemo values(2,100,'xiaohe2','2020-04-22'); insert into replacingMergeTreeDemo values(3,102,'xiaohe3','2020-04-23');
插入后的效果为:
xxxxx :) select * from replacingMergeTreeDemo; SELECT * FROM replacingMergeTreeDemo ┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐ │ 3 │ 102 │ xiaohe3 │ 2020-04-23 │ └────────┴───────────┴──────────┴────────────┘ ┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐ │ 2 │ 100 │ xiaohe2 │ 2020-04-22 │ └────────┴───────────┴──────────┴────────────┘ ┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐ │ 1 │ 100 │ xiaohe1 │ 2020-04-21 │ └────────┴───────────┴──────────┴────────────┘ 3 rows in set. Elapsed: 0.009 sec. xxxxx :)
批次2:
insert into replacingMergeTreeDemo values(3,103,'xiaohe3','2020-04-23'); insert into replacingMergeTreeDemo values(2,101,'xiaohe2','2020-04-24'); insert into replacingMergeTreeDemo values(1,100,'xiaohe1','2020-04-20');
插入后的效果为:
xxxxx :) select * from replacingMergeTreeDemo; SELECT * FROM replacingMergeTreeDemo ┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐ │ 1 │ 100 │ xiaohe1 │ 2020-04-20 │ └────────┴───────────┴──────────┴────────────┘ ┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐ │ 1 │ 100 │ xiaohe1 │ 2020-04-21 │ │ 2 │ 100 │ xiaohe2 │ 2020-04-22 │ │ 2 │ 101 │ xiaohe2 │ 2020-04-24 │ │ 3 │ 102 │ xiaohe3 │ 2020-04-23 │ │ 3 │ 103 │ xiaohe3 │ 2020-04-23 │ └────────┴───────────┴──────────┴────────────┘
批次3:
insert into replacingMergeTreeDemo values(2,100,'xiaohe2','2020-04-23');
插入后的效果:
xxxxx :) select * from replacingMergeTreeDemo;
SELECT * FROM replacingMergeTreeDemo ┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐ │ 2 │ 100 │ xiaohe2 │ 2020-04-23 │ └────────┴───────────┴──────────┴────────────┘ ┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐ │ 1 │ 100 │ xiaohe1 │ 2020-04-20 │ └────────┴───────────┴──────────┴────────────┘ ┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐ │ 1 │ 100 │ xiaohe1 │ 2020-04-21 │ │ 2 │ 100 │ xiaohe2 │ 2020-04-22 │ │ 2 │ 101 │ xiaohe2 │ 2020-04-24 │ │ 3 │ 102 │ xiaohe3 │ 2020-04-23 │ │ 3 │ 103 │ xiaohe3 │ 2020-04-23 │ └────────┴───────────┴──────────┴────────────┘ 7 rows in set. Elapsed: 0.007 sec. xxxxx :)
如上所示,它并没有合并数据,这时候需要手工执行计划外合并。
这时候就要手工执行计划外合并:
执行命令:optimize table replacingMergeTreeDemo
xxxxx :) optimize table replacingMergeTreeDemo; OPTIMIZE TABLE replacingMergeTreeDemo Ok. 0 rows in set. Elapsed: 0.003 sec. xxxxx :) select * from replacingMergeTreeDemo; SELECT * FROM replacingMergeTreeDemo ┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐ │ 1 │ 100 │ xiaohe1 │ 2020-04-20 │ │ 2 │ 100 │ xiaohe2 │ 2020-04-23 │ │ 2 │ 101 │ xiaohe2 │ 2020-04-24 │ │ 3 │ 102 │ xiaohe3 │ 2020-04-23 │ │ 3 │ 103 │ xiaohe3 │ 2020-04-23 │ └────────┴───────────┴──────────┴────────────┘ 5 rows in set. Elapsed: 0.006 sec. xxxxx :)
从这个示例可以发现,ReplacingMergeTree是根据排序项对数据去重的,而不是根据主键。
3.示例2:使用版本参数的去重
在示例1中, 我们简单观察可以发现, 去重后,保留的记录是最后一条记录。
如果指定了版本,则保留版本列的值最大的记录。1.建表
使用EventDate列作为版本字段。
DROP TABLE replacingMergeTreeDemo; CREATE TABLE replacingMergeTreeDemo ( UserID UInt32, CounterID UInt32, UserName String, EventDate Date ) ENGINE = ReplacingMergeTree(EventDate) ORDER BY (UserID, CounterID) PRIMARY KEY (UserID);
2.插入数据
批次1:
insert into replacingMergeTreeDemo values(1,100,'xiaohe1','2020-06-21'); insert into replacingMergeTreeDemo values(2,100,'xiaohe2','2020-11-27'); insert into replacingMergeTreeDemo values(3,102,'xiaohe3','2020-04-23');
批次2:
insert into replacingMergeTreeDemo values(3,103,'xiaohe3','2020-04-23'); insert into replacingMergeTreeDemo values(2,101,'xiaohe2','2020-04-24'); insert into replacingMergeTreeDemo values(1,100,'xiaohe1','2020-04-20');
批次3:
insert into replacingMergeTreeDemo values(2,100,'xiaohe2','2020-04-23');
查看数据:
xxxxx :) select * from replacingMergeTreeDemo; SELECT * FROM replacingMergeTreeDemo ┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐ │ 2 │ 100 │ xiaohe2 │ 2020-04-23 │ └────────┴───────────┴──────────┴────────────┘ ┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐ │ 1 │ 100 │ xiaohe1 │ 2020-04-20 │ └────────┴───────────┴──────────┴────────────┘ ┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐ │ 1 │ 100 │ xiaohe1 │ 2020-06-21 │ │ 2 │ 100 │ xiaohe2 │ 2020-11-27 │ │ 2 │ 101 │ xiaohe2 │ 2020-04-24 │ │ 3 │ 102 │ xiaohe3 │ 2020-04-23 │ │ 3 │ 103 │ xiaohe3 │ 2020-04-23 │ └────────┴───────────┴──────────┴────────────┘ 7 rows in set. Elapsed: 0.009 sec. xxxxx :) optimize table replacingMergeTreeDemo; OPTIMIZE TABLE replacingMergeTreeDemo Ok. 0 rows in set. Elapsed: 0.002 sec. xxxxx :) select * from replacingMergeTreeDemo; SELECT * FROM replacingMergeTreeDemo ┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐ │ 1 │ 100 │ xiaohe1 │ 2020-06-21 │ │ 2 │ 100 │ xiaohe2 │ 2020-11-27 │ │ 2 │ 101 │ xiaohe2 │ 2020-04-24 │ │ 3 │ 102 │ xiaohe3 │ 2020-04-23 │ │ 3 │ 103 │ xiaohe3 │ 2020-04-23 │ └────────┴───────────┴──────────┴────────────┘ 5 rows in set. Elapsed: 0.005 sec. xxxxx :)
经过上面案例,可以看出,最终保留了如下数据:
从这个示例可以发现, ReplacingMergeTree根据排序项对数据去重的, 保留版本字段列对应的最大值的记录。