While using ClickHouse, we inserted two records with similar content, in two separate statements, into a table backed by the ReplicatedMergeTree engine. Only one record made it into the table; the second one was missing, and the log showed the error: Block with ID xx already exists locally as part xx, ignoring it. In other words, the second record was judged to be a duplicate and silently dropped. Experimentation showed that disabling MergeTree's InMemory part type in config.xml works around the problem: with part_type Wide or Compact, both similar rows insert normally; only the InMemory mode shows the error.
Upstream code change: https://github.com/ClickHouse/ClickHouse/pull/47121#event-8675523528
Reproducing the Problem
ClickHouse source version: 21.8.3
Table DDL:
create table if not exists default.t_write_local on CLUSTER sharding_cluster(
id String,
report_time Int64
) ENGINE = ReplicatedMergeTree('/adbtest1.0/{shard}/t_write_local', '{replica}')
PARTITION BY (toYYYYMMDD(toDateTime(report_time/1000)))
ORDER BY (report_time,id)
TTL toDateTime(report_time/1000) + INTERVAL 30 DAY;
Insert two similar rows:
insert into default.t_write_local(id, report_time) VALUES('adcdefghijklmnopqrstuvwxyz', '1675326231000');
insert into default.t_write_local(id, report_time) VALUES('a1234567890123456789012345', '1675326231000');
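Both rows carry the same report_time (1675326231000), so they land in the same partition. As a quick sanity check on the partition key, here is a minimal standalone sketch (assuming UTC; ClickHouse actually evaluates toDateTime in the server's time zone, which could shift the date for other timestamps):
#include <cstdio>
#include <ctime>

int main()
{
    /// report_time from the two inserts above, divided by 1000 as in the PARTITION BY expression.
    std::time_t t = 1675326231000LL / 1000;
    std::tm * d = std::gmtime(&t);
    /// Equivalent of toYYYYMMDD: prints 20230202, the partition_id that also
    /// appears as the prefix of the block ID in the log below.
    std::printf("%04d%02d%02d\n", d->tm_year + 1900, d->tm_mon + 1, d->tm_mday);
}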
ClickHouse log from the second insert:
2023.02.12 14:40:15.738909 [ 12573 ] {81583f23-604a-490e-ae36-b5a3967f6acb} <Debug> default.t_write_local (01f997c1-2d37-4ac1-81f9-97c12d37fac1) (Replicated OutputStream): Wrote block with ID '20230202_17874882959606962441_9047980079478079928', 1 rows
2023.02.12 14:40:15.744609 [ 12573 ] {81583f23-604a-490e-ae36-b5a3967f6acb} <Information> default.t_write_local (01f997c1-2d37-4ac1-81f9-97c12d37fac1) (Replicated OutputStream): Block with ID 20230202_17874882959606962441_9047980079478079928 already exists locally as part 20230202_0_0_0; ignoring it.
In config.xml, setting merge_tree's min_rows_for_compact_part to 0 turns InMemory mode off; setting it to some value N turns InMemory mode on, with a part only written to disk once it reaches N rows.
<merge_tree>
<min_rows_for_compact_part>100</min_rows_for_compact_part>
</merge_tree>
Setting min_rows_for_compact_part here to 0 and retrying, the data inserts correctly.
Root Cause Analysis
The analysis below walks through the 21.8.3 source.
1. Locating the write entry point
The log line "Wrote block with ID '{}', {} rows" leads to the write function in ReplicatedMergeTreeBlockOutputStream.cpp:
void ReplicatedMergeTreeBlockOutputStream::write(const Block & block)
{
    last_block_is_duplicate = false;

    auto zookeeper = storage.getZooKeeper();
    assertSessionIsNotExpired(zookeeper);

    /** If write is with quorum, then we check that the required number of replicas is now live,
      * and also that for all previous parts for which quorum is required, this quorum is reached.
      * And also check that during the insertion, the replica was not reinitialized or disabled (by the value of `is_active` node).
      * TODO Too complex logic, you can do better.
      */
    if (quorum)
        checkQuorumPrecondition(zookeeper);

    auto part_blocks = storage.writer.splitBlockIntoParts(block, max_parts_per_block, metadata_snapshot, context);

    for (auto & current_block : part_blocks)
    {
        Stopwatch watch;

        /// Write part to the filesystem under temporary name. Calculate a checksum.
        MergeTreeData::MutableDataPartPtr part = storage.writer.writeTempPart(current_block, metadata_snapshot, context);

        /// If optimize_on_insert setting is true, current_block could become empty after merge
        /// and we didn't create part.
        if (!part)
            continue;

        String block_id;

        if (deduplicate)
        {
            /// We add the hash from the data and partition identifier to deduplication ID.
            /// That is, do not insert the same data to the same partition twice.
            block_id = part->getZeroLevelPartBlockID();

            LOG_DEBUG(log, "Wrote block with ID '{}', {} rows", block_id, current_block.block.rows());
        }
        else
        {
            LOG_DEBUG(log, "Wrote block with {} rows", current_block.block.rows());
        }

        try
        {
            commitPart(zookeeper, part, block_id);

            /// Set a special error code if the block is duplicate
            int error = (deduplicate && last_block_is_duplicate) ? ErrorCodes::INSERT_WAS_DEDUPLICATED : 0;
            PartLog::addNewPart(storage.getContext(), part, watch.elapsed(), ExecutionStatus(error));
        }
        catch (...)
        {
            PartLog::addNewPart(storage.getContext(), part, watch.elapsed(), ExecutionStatus::fromCurrentException(__PRETTY_FUNCTION__));
            throw;
        }
    }
}
The write path is: split the block into parts (splitBlockIntoParts) -> iterate over the parts -> write each part to a temporary location via writeTempPart -> for deduplication, compute the part's block ID via getZeroLevelPartBlockID -> commit the data via commitPart.
The functions chiefly responsible for "Block with ID xx already exists locally as part xxx, ignoring it." are the pair writeTempPart + getZeroLevelPartBlockID:
- writeTempPart: writes the part's data to a temporary location; this is where the inputs to the block ID originate
- getZeroLevelPartBlockID: computes the part's block ID
2. How the block ID is computed
The method is getZeroLevelPartBlockID in IMergeTreeDataPart.cpp:
String IMergeTreeDataPart::getZeroLevelPartBlockID() const
{
    if (info.level != 0)
        throw Exception(ErrorCodes::LOGICAL_ERROR, "Trying to get block id for non zero level part {}", name);

    SipHash hash;
    checksums.computeTotalChecksumDataOnly(hash);

    union
    {
        char bytes[16];
        UInt64 words[2];
    } hash_value;
    hash.get128(hash_value.bytes);

    return info.partition_id + "_" + toString(hash_value.words[0]) + "_" + toString(hash_value.words[1]);
}
So the block ID has two components: partition_id plus the hash of the part's checksums.
Since the two inserted rows share the same partition_id, the focus shifts to whatever difference the parts' checksums contribute.
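This is exactly the shape of the ID in the log; a trivial standalone sketch of the same concatenation, with the two hash words copied from the log line above:
#include <iostream>
#include <string>

int main()
{
    /// Values copied from the "Wrote block with ID ..." log line above.
    std::string partition_id = "20230202";
    unsigned long long words[2] = {17874882959606962441ULL, 9047980079478079928ULL};

    /// Same concatenation as getZeroLevelPartBlockID().
    std::cout << partition_id + "_" + std::to_string(words[0]) + "_" + std::to_string(words[1]) << "\n";
    /// Prints: 20230202_17874882959606962441_9047980079478079928
}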
Next, which attributes of checksums influence this hash? Follow computeTotalChecksumDataOnly, in MergeTreeDataPartChecksum.cpp:
void MergeTreeDataPartChecksums::computeTotalChecksumDataOnly(SipHash & hash) const
{
    /// We use fact that iteration is in deterministic (lexicographical) order.
    for (const auto & it : files)
    {
        const String & name = it.first;
        const Checksum & sum = it.second;

        if (!endsWith(name, ".bin"))
            continue;

        UInt64 len = name.size();
        hash.update(len);
        hash.update(name.data(), len);
        hash.update(sum.uncompressed_size);
        hash.update(sum.uncompressed_hash);
    }
}
So the elements feeding the checksums hash are, for each entry in files whose name ends in .bin: the name's length, the name's bytes, the uncompressed_size and the uncompressed_hash. checksums.files is a map from file name to Checksum. Since our two rows differ only in the content of the id field, name, name.size() and name.data() are identical across the two inserts; the difference must come from sum. So the question becomes how sum is produced, which we can answer by following the temporary write of the data.
3. How the part is written to temporary storage
The entry point is writeTempPart in MergeTreeDataWriter.cpp:
MergeTreeData::MutableDataPartPtr MergeTreeDataWriter::writeTempPart(
    BlockWithPartition & block_with_partition, const StorageMetadataPtr & metadata_snapshot, ContextPtr context)
{
    Block & block = block_with_partition.block;

    static const String TMP_PREFIX = "tmp_insert_";

    // ... omitted

    out.writePrefix();
    out.writeWithPermutation(block, perm_ptr);
    out.writeSuffixAndFinalizePart(new_data_part, sync_on_insert);

    ProfileEvents::increment(ProfileEvents::MergeTreeDataWriterRows, block.rows());
    ProfileEvents::increment(ProfileEvents::MergeTreeDataWriterUncompressedBytes, block.bytes());
    ProfileEvents::increment(ProfileEvents::MergeTreeDataWriterCompressedBytes, new_data_part->getBytesOnDisk());

    return new_data_part;
}
This function persists the part under a temporary name in ClickHouse's storage format. Keeping our focus on how checksums evolves: nothing in this function modifies checksums explicitly, so we follow writeSuffixAndFinalizePart, which receives the part and modifies its checksums inside.
writeSuffixAndFinalizePart is in MergedBlockOutputStream.cpp:
void MergedBlockOutputStream::writeSuffixAndFinalizePart(
        MergeTreeData::MutableDataPartPtr & new_part,
        bool sync,
        const NamesAndTypesList * total_columns_list,
        MergeTreeData::DataPart::Checksums * additional_column_checksums)
{
    /// Finish write and get checksums.
    MergeTreeData::DataPart::Checksums checksums;

    if (additional_column_checksums)
        checksums = std::move(*additional_column_checksums);

    /// Finish columns serialization.
    writer->finish(checksums, sync);

    // ... omitted

    new_part->checksums = checksums;
    new_part->setBytesOnDisk(checksums.getTotalSizeOnDisk());
    new_part->index_granularity = writer->getIndexGranularity();
    new_part->calculateColumnsSizesOnDisk();

    if (default_codec != nullptr)
        new_part->default_codec = default_codec;

    new_part->storage.lockSharedData(*new_part);
}
Here the part's checksums are assigned, with the data produced by writer->finish. Continuing into finish:
For the InMemory mode, finish is implemented in MergeTreeDataPartWriterInMemory.cpp:
void MergeTreeDataPartWriterInMemory::finish(IMergeTreeDataPart::Checksums & checksums, bool /* sync */)
{
    /// If part is empty we still need to initialize block by empty columns.
    if (!part_in_memory->block)
        for (const auto & column : columns_list)
            part_in_memory->block.insert(ColumnWithTypeAndName{column.type, column.name});

    checksums.files["data.bin"] = part_in_memory->calculateBlockChecksum();
}
So in InMemory mode the temporarily stored part has exactly one checksum, named "data.bin", whose contents come from calculateBlockChecksum.
Next, calculateBlockChecksum in MergeTreeDataPartInMemory.cpp:
IMergeTreeDataPart::Checksum MergeTreeDataPartInMemory::calculateBlockChecksum() const
{
    SipHash hash;
    IMergeTreeDataPart::Checksum checksum;

    for (const auto & column : block)
        column.column->updateHashFast(hash);

    checksum.uncompressed_size = block.bytes();
    hash.get128(checksum.uncompressed_hash);
    return checksum;
}
So the checksum is derived by hashing every column of the part's block. The data that triggers our problem lives in a String column, so we look at the String implementation of updateHashFast to understand how different inputs can hash to the same value.
updateHashFast in ColumnString.h:
void updateHashFast(SipHash & hash) const override
{
    hash.update(reinterpret_cast<const char *>(offsets.data()), size() * sizeof(offsets[0]));
    hash.update(reinterpret_cast<const char *>(chars.data()), size() * sizeof(chars[0]));
}
So a String column's hash is determined by its offsets (the row boundaries, i.e. string lengths) and its chars (the value bytes). Debugging shows:
- sizeof(offsets[0]) = 8, matching its type UInt64.
- sizeof(chars[0]) = 1, matching its type UInt8.
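For orientation, here is a minimal standalone sketch of how a one-row String column lays out its data (plain std::vector standing in for ClickHouse's internal arrays; ColumnString stores each value in chars followed by a terminating zero byte, and offsets holds each row's cumulative end position):
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    /// Mimics a one-row ColumnString holding our first test value.
    std::string value = "adcdefghijklmnopqrstuvwxyz";
    std::vector<char> chars(value.begin(), value.end());
    chars.push_back('\0');                        /// value bytes plus terminating zero
    std::vector<uint64_t> offsets{chars.size()};  /// one cumulative end position per row

    std::cout << "rows = " << offsets.size()              /// 1
              << ", offsets[0] = " << offsets[0]          /// 27
              << ", chars bytes = " << chars.size() << "\n"; /// 27
}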
Now look at what size() returns:
size_t size() const override
{
    return offsets.size();
}
size() returns the length of offsets, i.e. the number of rows; debugging confirms offsets.size() = 1 for our single-row part (consistent with the layout sketch above). This raises a question: why do both the offsets update and the chars update use offsets' size as the element count? Substituting offsets.size() = 1, hashing a String column can effectively be read as:
void updateHashFast(SipHash & hash) const override
{
    hash.update(reinterpret_cast<const char *>(offsets.data()), 8);
    hash.update(reinterpret_cast<const char *>(chars.data()), 1);
}
We know offsets here is a single UInt64 holding a length, so updating the hash with offsets and a size of 8 feeds in the right data with the right size. But what is the effect of updating with chars and a size of 1? That requires looking at the update implementation.
Here is update in SipHash.h:
void update(const char * data, UInt64 size)
{
    const char * end = data + size;

    /// We'll finish to process the remainder of the previous update, if any.
    if (cnt & 7)
    {
        while (cnt & 7 && data < end)
        {
            current_bytes[cnt & 7] = *data;
            ++data;
            ++cnt;
        }

        /// If we still do not have enough bytes to an 8-byte word.
        if (cnt & 7)
            return;

        v3 ^= current_word;
        SIPROUND;
        SIPROUND;
        v0 ^= current_word;
    }

    cnt += end - data;

    while (data + 8 <= end)
    {
        current_word = unalignedLoad<UInt64>(data);

        v3 ^= current_word;
        SIPROUND;
        SIPROUND;
        v0 ^= current_word;

        data += 8;
    }

    /// Pad the remainder, which is missing up to an 8-byte word.
    current_word = 0;
    switch (end - data)
    {
        case 7: current_bytes[6] = data[6]; [[fallthrough]];
        case 6: current_bytes[5] = data[5]; [[fallthrough]];
        case 5: current_bytes[4] = data[4]; [[fallthrough]];
        case 4: current_bytes[3] = data[3]; [[fallthrough]];
        case 3: current_bytes[2] = data[2]; [[fallthrough]];
        case 2: current_bytes[1] = data[1]; [[fallthrough]];
        case 1: current_bytes[0] = data[0]; [[fallthrough]];
        case 0: break;
    }
}
update treats all input as raw bytes (char *) and consumes them in 8-byte words. The flow is (see the sketch after this list):
- 1. If the previous update left a remainder, top it up to 8 bytes in current_bytes using the new input; if there still are not 8 bytes, return and wait for the next update; once the word is complete, run the hash rounds on it.
- 2. Walk the rest of the input in 8-byte steps, hashing one word per step.
- 3. Stash the trailing bytes (fewer than 8) in current_bytes for the next update.
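A minimal standalone stand-in (not ClickHouse code; the two SIPROUNDs are replaced by a trivial mixing step, since only the byte-buffering behaviour matters here) shows that feeding the same bytes in different-sized chunks yields the same state:
#include <cstdint>
#include <cstring>
#include <iostream>

/// Mirrors the chunking skeleton of the update() above with a trivial mix.
struct ChunkedHash
{
    uint64_t state = 0;
    uint64_t cnt = 0;
    union
    {
        uint64_t current_word = 0;
        uint8_t current_bytes[8];
    };

    void mix() { state = (state ^ current_word) * 0x9E3779B97F4A7C15ULL; }

    void update(const char * data, uint64_t size)
    {
        const char * end = data + size;

        /// Step 1: finish the remainder left over from the previous update, if any.
        if (cnt & 7)
        {
            while ((cnt & 7) && data < end)
            {
                current_bytes[cnt & 7] = static_cast<uint8_t>(*data);
                ++data;
                ++cnt;
            }
            if (cnt & 7)
                return; /// Still fewer than 8 bytes accumulated.
            mix();
        }

        cnt += end - data;

        /// Step 2: consume full 8-byte words.
        while (data + 8 <= end)
        {
            std::memcpy(&current_word, data, 8);
            mix();
            data += 8;
        }

        /// Step 3: stash the tail (< 8 bytes) for the next update.
        current_word = 0;
        for (int i = 0; i < end - data; ++i)
            current_bytes[i] = static_cast<uint8_t>(data[i]);
    }
};

int main()
{
    ChunkedHash a, b;
    a.update("abcde", 5);
    a.update("fghij", 5);       /// ten bytes fed in two calls...
    b.update("abcdefghij", 10); /// ...match ten bytes fed in one call
    std::cout << (a.state == b.state ? "equal" : "different") << "\n"; /// prints "equal"
}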
So when chars is hashed with a size of 1, only the first byte of a String column's data ever reaches SipHash; everything else is ignored.
From this conclusion we can predict:
- If each insert gives the String field data of a different length, the insert succeeds (offsets differ, which changes the hash).
- If each insert gives the String field exactly one byte of data, the insert succeeds (that single byte of chars is fully covered by the hash).
- If the inserts give the String field data of the same length but with different first bytes, the insert succeeds (the first byte of chars changes the hash).
- If the inserts give the String field data of the same length and with the same first byte, the insert fails with "Block ID already exists" (same offsets and same first byte of chars, so the hashes cannot be told apart).
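Our two test rows hit exactly the last case: 'adcdefghijklmnopqrstuvwxyz' and 'a1234567890123456789012345' are both 26 bytes long and both start with 'a'. A standalone sketch (not ClickHouse code; FNV-1a stands in for SipHash, since only the number of bytes fed to the hash matters) reproduces the collision and shows that hashing all of chars removes it:
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

/// FNV-1a as a stand-in for SipHash.
struct Fnv1a
{
    uint64_t state = 1469598103934665603ULL;
    void update(const char * data, uint64_t size)
    {
        for (uint64_t i = 0; i < size; ++i)
        {
            state ^= static_cast<unsigned char>(data[i]);
            state *= 1099511628211ULL;
        }
    }
};

/// Hashes a one-row string column the way updateHashFast does: first the
/// offsets array, then the chars array. With `buggy` set, the chars update
/// uses offsets.size() (i.e. 1 byte here) as the element count, as in 21.8.3.
uint64_t hash_one_row(const std::string & s, bool buggy)
{
    std::vector<uint64_t> offsets{s.size() + 1}; /// end position incl. terminating zero
    std::vector<char> chars(s.begin(), s.end());
    chars.push_back('\0');

    Fnv1a h;
    h.update(reinterpret_cast<const char *>(offsets.data()), offsets.size() * sizeof(offsets[0]));
    h.update(chars.data(), (buggy ? offsets.size() : chars.size()) * sizeof(chars[0]));
    return h.state;
}

int main()
{
    std::string a = "adcdefghijklmnopqrstuvwxyz"; /// the two test rows: same length,
    std::string b = "a1234567890123456789012345"; /// same first byte 'a'

    std::cout << "buggy size: " << (hash_one_row(a, true) == hash_one_row(b, true) ? "collision" : "distinct") << "\n";   /// collision
    std::cout << "fixed size: " << (hash_one_row(a, false) == hash_one_row(b, false) ? "collision" : "distinct") << "\n"; /// distinct
}
The first three cases above can be checked the same way by varying the test strings.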
The Fix
We now have the root cause: in ColumnString's updateHashFast, the size passed when hashing the column's contents is wrong. The most fundamental fix is to correct exactly that. Modify updateHashFast in ColumnString.h:
void updateHashFast(SipHash & hash) const override
{
    hash.update(reinterpret_cast<const char *>(offsets.data()), offsets.size() * sizeof(offsets[0]));
    hash.update(reinterpret_cast<const char *>(chars.data()), chars.size() * sizeof(chars[0]));
}
Giving offsets and chars each their own size is all it takes: with this change, the two test rows produce different block IDs and both inserts succeed.