Background:
While testing ClickHouse delete performance, we re-inserted data identical in size, content, and order to a previous batch, and found rows missing from the table: the insert reported success, but the data did not show up in queries:
How to reproduce:
Create the table:
CREATE TABLE ck_test.t_local_1
(
`id` Int32,
`day` Int32
)
ENGINE = ReplicatedMergeTree('/clickhouse/ck_test/t_local_1/{layer}-{shard}/', '{replica}')
PARTITION BY day
ORDER BY (id, day)
SETTINGS index_granularity = 8192
Insert data:
insert into ck_test.t_local_1 (id, day) values(1 , 202010),(2 , 202010),(3 , 202010);
hadoop001 :) select * from t_local_1;
SELECT *
FROM t_local_1
┌─id─┬────day─┐
│ 1 │ 202010 │
│ 2 │ 202010 │
│ 3 │ 202010 │
└────┴────────┘
3 rows in set. Elapsed: 0.002 sec.
Delete data:
ALTER TABLE ck_test.t_local_1 DELETE WHERE id=1
Insert another row with id=1 and query again:
hadoop001 :) insert into ck_test.t_local_1 (id, day) values(1 , 202010)
INSERT INTO ck_test.t_local_1 (id, day) VALUES
Ok.
1 rows in set. Elapsed: 0.003 sec.
hadoop001 :) select * from t_local_1;
SELECT *
FROM t_local_1
┌─id─┬────day─┐
│ 2 │ 202010 │
│ 3 │ 202010 │
└────┴────────┘
2 rows in set. Elapsed: 0.001 sec.
The row with id=1 was reported as inserted successfully, yet it does not appear in the query results. Checking the server log turned up the entries below for that insert: a block with exactly the same ID already existed, so the insert was silently ignored.
2020.10.30 16:46:59.072137 [ 6521 ] {be6de9bf-b86e-4d50-ba34-8cfd5b059839} <Debug> executeQuery: (from 127.0.0.1:11532) insert into ck_test.t_local_1 (id, day) values
2020.10.30 16:46:59.072198 [ 6521 ] {be6de9bf-b86e-4d50-ba34-8cfd5b059839} <Trace> AccessRightsContext (default): Access granted: INSERT(id, day) ON ck_test.t_local_1
2020.10.30 16:46:59.072567 [ 6521 ] {be6de9bf-b86e-4d50-ba34-8cfd5b059839} <Debug> DiskLocal: Reserving 1.00 MiB on disk `default`, having unreserved 10.56 TiB.
2020.10.30 16:46:59.072907 [ 6521 ] {be6de9bf-b86e-4d50-ba34-8cfd5b059839} <Debug> ck_test.t_local_1 (Replicated OutputStream): Wrote block with ID '202010_10217750497126877342_4223329090016823837', 1 rows
2020.10.30 16:46:59.074224 [ 6521 ] {be6de9bf-b86e-4d50-ba34-8cfd5b059839} <Information> ck_test.t_local_1 (Replicated OutputStream): Block with ID 202010_10217750497126877342_4223329090016823837 already exists; ignoring it.
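The block ID in the log is derived from the partition ID plus a hash of the block's exact contents. A minimal sketch of this dedup scheme (illustrative only: the names are made up and ClickHouse actually uses SipHash-128 over the serialized columns, not SHA-256):

```python
import hashlib

def block_id(partition_id, rows):
    """Derive a dedup ID from the partition plus the exact rows and their order.
    (Simplified stand-in for ClickHouse's real content hash.)"""
    h = hashlib.sha256(repr(rows).encode()).hexdigest()
    return f"{partition_id}_{h}"

class DedupLog:
    """Keeps previously seen block IDs; an identical repeated batch is ignored."""
    def __init__(self):
        self.seen = set()
        self.table = []

    def insert(self, partition_id, rows):
        bid = block_id(partition_id, rows)
        if bid in self.seen:
            return False          # duplicate block -> insert silently skipped
        self.seen.add(bid)
        self.table.extend(rows)
        return True

log = DedupLog()
assert log.insert("202010", [(1, 202010)]) is True   # first insert lands
assert log.insert("202010", [(1, 202010)]) is False  # identical batch ignored
```

This is exactly the behavior in the transcript: the second INSERT of `(1, 202010)` produced the same block ID as the first one, which was still recorded in ZooKeeper even after the DELETE mutation removed the row.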
Key point:
ClickHouse has no notion of transactions, but to keep retried INSERTs idempotent, ReplicatedMergeTree detects duplicate blocks and skips them.
To disable this check, run SET insert_deduplicate = 0.
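Besides the session-level SET shown below, the setting can also be scoped to a single statement with a SETTINGS clause (syntax available in recent ClickHouse versions; verify against your server version):

```sql
INSERT INTO ck_test.t_local_1 (id, day)
SETTINGS insert_deduplicate = 0
VALUES (1, 202010);
```

Scoping it per-query avoids accidentally leaving deduplication off for the rest of the session.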
The following session of inserts, queries, and setting changes confirms the conclusion:
hadoop001 :) insert into ck_test.t_local_1 (id, day) values(1 , 202010)
INSERT INTO ck_test.t_local_1 (id, day) VALUES
Ok.
1 rows in set. Elapsed: 0.003 sec.
hadoop001 :) select * from t_local_1;
SELECT *
FROM t_local_1
┌─id─┬────day─┐
│ 2 │ 202010 │
│ 3 │ 202010 │
└────┴────────┘
2 rows in set. Elapsed: 0.001 sec.
hadoop001 :) SET insert_deduplicate=0
SET insert_deduplicate = 0
Ok.
0 rows in set. Elapsed: 0.008 sec.
hadoop001 :) insert into ck_test.t_local_1 (id, day) values(1 , 202010)
INSERT INTO ck_test.t_local_1 (id, day) VALUES
Ok.
1 rows in set. Elapsed: 0.004 sec.
hadoop001 :) select * from t_local_1;
SELECT *
FROM t_local_1
┌─id─┬────day─┐
│ 1 │ 202010 │
└────┴────────┘
┌─id─┬────day─┐
│ 2 │ 202010 │
│ 3 │ 202010 │
└────┴────────┘
3 rows in set. Elapsed: 0.003 sec.
Note that disabling deduplication is not recommended, because it is the only mechanism that guarantees INSERT idempotency. In practice, deduplication after a DELETE is rarely a problem, since data is normally inserted in large batches, **and deduplication only triggers when exactly the same batch (exactly the same rows in exactly the same order) is received twice**.
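In other words, a batch containing the same rows in a different order (or split into different blocks) gets a different block ID and is not deduplicated. A hedged illustration under the same simplified hash model as above (not ClickHouse's actual SipHash):

```python
import hashlib

def block_id(partition_id, rows):
    # Simplified stand-in for ClickHouse's content hash: the ID depends on
    # the exact rows AND their order within the inserted block.
    return partition_id + "_" + hashlib.sha256(repr(rows).encode()).hexdigest()

same       = block_id("202010", [(1, 202010), (2, 202010)])
same_again = block_id("202010", [(1, 202010), (2, 202010)])
reordered  = block_id("202010", [(2, 202010), (1, 202010)])

assert same == same_again   # identical batch -> identical ID -> deduplicated
assert same != reordered    # same rows, different order -> treated as a new block
```

This is why the issue mainly bites in tests that replay small, byte-identical batches; real-world ingest pipelines rarely reproduce a batch exactly.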