While using ClickHouse, we inserted two records with similar content, in two separate statements, into a table backed by the ReplicatedMergeTree engine. Only one record made it into the table; the second one was missing, and the log showed the error: Block with ID xx already exists locally as part xx, ignoring it. In other words, the second record was judged to be a duplicate and silently dropped. Experimentation showed that disabling MergeTree's InMemory part type in config.xml works around the problem: with part_type Wide or Compact, both similar rows insert normally; only the InMemory mode shows the error.
Upstream code change: https://github.com/ClickHouse/ClickHouse/pull/47121#event-8675523528
Reproducing the Problem
ClickHouse source version: 21.8.3
Table DDL:
create table if not exists default.t_write_local on CLUSTER sharding_cluster(
id String,
report_time Int64
) ENGINE = ReplicatedMergeTree('/adbtest1.0/{shard}/t_write_local', '{replica}')
PARTITION BY (toYYYYMMDD(toDateTime(report_time/1000)))
ORDER BY (report_time,id)
TTL toDateTime(report_time/1000) + INTERVAL 30 DAY;
Insert two similar rows:
insert into default.t_write_local(id, report_time) VALUES('adcdefghijklmnopqrstuvwxyz', '1675326231000');
insert into default.t_write_local(id, report_time) VALUES('a1234567890123456789012345', '1675326231000');
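Both rows carry the same report_time (1675326231000), so they land in the same partition. As a quick sanity check on the partition key, here is a minimal standalone sketch (assuming UTC; ClickHouse actually evaluates toDateTime in the server's time zone, which could shift the date for other timestamps):
#include <cstdio>
#include <ctime>

int main()
{
    /// report_time from the two inserts above, divided by 1000 as in the PARTITION BY expression.
    std::time_t t = 1675326231000LL / 1000;
    std::tm * d = std::gmtime(&t);
    /// Equivalent of toYYYYMMDD: prints 20230202, the partition_id that also
    /// appears as the prefix of the block ID in the log below.
    std::printf("%04d%02d%02d\n", d->tm_year + 1900, d->tm_mon + 1, d->tm_mday);
}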
ClickHouse log from the second insert:
2023.02.12 14:40:15.738909 [ 12573 ] {81583f23-604a-490e-ae36-b5a3967f6acb} <Debug> default.t_write_local (01f997c1-2d37-4ac1-81f9-97c12d37fac1) (Replicated OutputStream): Wrote block with ID '20230202_17874882959606962441_9047980079478079928', 1 rows
2023.02.12 14:40:15.744609 [ 12573 ] {81583f23-604a-490e-ae36-b5a3967f6acb} <Information> default.t_write_local (01f997c1-2d37-4ac1-81f9-97c12d37fac1) (Replicated OutputStream): Block with ID 20230202_17874882959606962441_9047980079478079928 already exists locally as part 20230202_0_0_0; ignoring it.
In config.xml, setting merge_tree's min_rows_for_compact_part to 0 turns InMemory mode off; setting it to some value N turns InMemory mode on, with a part only written to disk once it reaches N rows.
<merge_tree>
<min_rows_for_compact_part>100</min_rows_for_compact_part>
</merge_tree>
Setting min_rows_for_compact_part here to 0 and retrying, the data inserts correctly.
Root Cause Analysis
The analysis below walks through the 21.8.3 source.
1. Locating the write entry point
The log line "Wrote block with ID '{}', {} rows" leads to the write function in ReplicatedMergeTreeBlockOutputStream.cpp:
void ReplicatedMergeTreeBlockOutputStream::write(const Block & block)
{
    last_block_is_duplicate = false;

    auto zookeeper = storage.getZooKeeper();
    assertSessionIsNotExpired(zookeeper);

    /** If write is with quorum, then we check that the required number of replicas is now live,
      * and also that for all previous parts for which quorum is required, this quorum is reached.
      * And also check that during the insertion, the replica was not reinitialized or disabled (by the value of `is_active` node).
      * TODO Too complex logic, you can do better.
      */
    if (quorum)
        checkQuorumPrecondition(zookeeper);

    auto part_blocks = storage.writer.splitBlockIntoParts(block, max_parts_per_block, metadata_snapshot, context);

    for (auto & current_block : part_blocks)
    {
        Stopwatch watch;

        /// Write part to the filesystem under temporary name. Calculate a checksum.
        MergeTreeData::MutableDataPartPtr part = storage.writer.writeTempPart(current_block, metadata_snapshot, context);

        /// If optimize_on_insert setting is true, current_block could become empty after merge
        /// and we didn't create part.
        if (!part)
            continue;

        String block_id;

        if (deduplicate)
        {
            /// We add the hash from the data and partition identifier to deduplication ID.
            /// That is, do not insert the same data to the same partition twice.
            block_id = part->getZeroLevelPartBlockID();

            LOG_DEBUG(log, "Wrote block with ID '{}', {} rows", block_id, current_block.block.rows());
        }
        else
        {
            LOG_DEBUG(log, "Wrote block with {} rows", current_block.block.rows());
        }

        try
        {
            commitPart(zookeeper, part, block_id);

            /// Set a special error code if the block is duplicate
            int error = (deduplicate && last_block_is_duplicate) ? ErrorCodes::INSERT_WAS_DEDUPLICATED : 0;
            PartLog::addNewPart(storage.getContext(), part, watch.elapsed(), ExecutionStatus(error));
        }
        catch (...)
        {
            PartLog::addNewPart(storage.getContext(), part, watch.elapsed(), ExecutionStatus::fromCurrentException(__PRETTY_FUNCTION__));
            throw;
        }
    }
}
The write path is: split the block into parts (splitBlockIntoParts) -> iterate over the parts -> write each part to a temporary location via writeTempPart -> for deduplication, compute the part's block ID via getZeroLevelPartBlockID -> commit the data via commitPart.
The functions chiefly responsible for "Block with ID xx already exists locally as part xxx, ignoring it." are the pair writeTempPart + getZeroLevelPartBlockID:
- writeTempPart: writes the part's data to a temporary location; this is where the inputs to the block ID originate
- getZeroLevelPartBlockID: computes the part's block ID
2. How the block ID is computed
The method is getZeroLevelPartBlockID in IMergeTreeDataPart.cpp:
String IMergeTreeDataPart::getZeroLevelPartBlockID() const
{
    if (info.level != 0)
        throw Exception(ErrorCodes::LOGICAL_ERROR, "Trying to get block id for non zero level part {}", name);

    SipHash hash;
    checksums.computeTotalChecksumDataOnly(hash);

    union
    {
        char bytes[16];
        UInt64 words[2];
    } hash_value;
    hash.get128(hash_value.bytes);

    return info.partition_id + "_" + toString(hash_value.words[0]) + "_" + toString(hash_value.words[1]);
}
So the block ID has two components: partition_id plus the hash of the part's checksums.
Since the two inserted rows share the same partition_id, the focus shifts to whatever difference the parts' checksums contribute.
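This is exactly the shape of the ID in the log; a trivial standalone sketch of the same concatenation, with the two hash words copied from the log line above:
#include <iostream>
#include <string>

int main()
{
    /// Values copied from the "Wrote block with ID ..." log line above.
    std::string partition_id = "20230202";
    unsigned long long words[2] = {17874882959606962441ULL, 9047980079478079928ULL};

    /// Same concatenation as getZeroLevelPartBlockID().
    std::cout << partition_id + "_" + std::to_string(words[0]) + "_" + std::to_string(words[1]) << "\n";
    /// Prints: 20230202_17874882959606962441_9047980079478079928
}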
Next, which attributes of checksums influence this hash? Follow computeTotalChecksumDataOnly, in MergeTreeDataPartChecksum.cpp:
void MergeTreeDataPartChecksums::computeTotalChecksumDataOnly(SipHash & hash) const
{
    /// We use fact that iteration is in deterministic (lexicographical) order.
    for (const auto & it : files)
    {
        const String & name = it.first;
        const Checksum & sum = it.second;

        if (!endsWith(name, ".bin"))
            continue;

        UInt64 len = name.size();
        hash.update(len);
        hash.update(name.data(), len);
        hash.update(sum.uncompressed_size);
        hash.update(sum.uncompressed_hash);
    }
}
So the elements feeding the checksums hash are, for each entry in files whose name ends in .bin: the name's length, the name's bytes, the uncompressed_size and the uncompressed_hash. checksums.files is a map from file name to Checksum. Since our two rows differ only in the content of the id field, name, name.size() and name.data() are identical across the two inserts; the difference must come from sum. So the question becomes how sum is produced, which we can answer by following the temporary write of the data.
3. How the part is written to temporary storage
The entry point is writeTempPart in MergeTreeDataWriter.cpp:
MergeTreeData::MutableDataPartPtr MergeTreeDataWriter::writeTempPart(
    BlockWithPartition & block_with_partition, const StorageMetadataPtr & metadata_snapshot, ContextPtr context)
{
    Block & block = block_with_partition.block;

    static const String TMP_PREFIX = "tmp_insert_";

    // ... omitted

    out.writePrefix();
    out.writeWithPermutation(block, perm_ptr);
    out.writeSuffixAndFinalizePart(new_data_part, sync_on_insert);

    ProfileEvents::increment(ProfileEvents::MergeTreeDataWriterRows, block.rows());
    ProfileEvents::increment(ProfileEvents::MergeTreeDataWriterUncompressedBytes, block.bytes());
    ProfileEvents::increment(ProfileEvents::MergeTreeDataWriterCompressedBytes, new_data_part->getBytesOnDisk());

    return new_data_part;
}
This function persists the part under a temporary name in ClickHouse's storage format. Keeping our focus on how checksums evolves: nothing in this function modifies checksums explicitly, so we follow writeSuffixAndFinalizePart, which receives the part and modifies its checksums inside.
writeSuffixAndFinalizePart is in MergedBlockOutputStream.cpp:
void MergedBlockOutputStream::writeSuffixAndFinalizePart(
        MergeTreeData::MutableDataPartPtr & new_part,
        bool sync,
        const NamesAndTypesList * total_columns_list,
        MergeTreeData::DataPart::Checksums * additional_column_checksums)
{
    /// Finish write and get checksums.
    MergeTreeData::DataPart::Checksums checksums;

    if (additional_column_checksums)
        checksums = std::move(*additional_column_checksums);

    /// Finish columns serialization.
    writer->finish(checksums, sync);

    // ... omitted

    new_part->checksums = checksums;
    new_part->setBytesOnDisk(checksums.getTotalSizeOnDisk());
    new_part->index_granularity = writer->getIndexGranularity();
    new_part->calculateColumnsSizesOnDisk();

    if (default_codec != nullptr)
        new_part->default_codec = default_codec;

    new_part->storage.lockSharedData(*new_part);
}
Here the part's checksums are assigned, with the data produced by writer->finish. Continuing into finish:
For the InMemory mode, finish is implemented in MergeTreeDataPartWriterInMemory.cpp:
void MergeTreeDataPartWriterInMemory::finish(IMergeTreeDataPart::Checksums & checksums, bool /* sync */)
{
    /// If part is empty we still need to initialize block by empty columns.
    if (!part_in_memory->block)
        for (const auto & column : columns_list)
            part_in_memory->block.insert(ColumnWithTypeAndName{column.type, column.name});

    checksums.files["data.bin"] = part_in_memory->calculateBlockChecksum();
}
So in InMemory mode the temporarily stored part has exactly one checksum, named "data.bin", whose contents come from calculateBlockChecksum.
Next, calculateBlockChecksum in MergeTreeDataPartInMemory.cpp:
IMergeTreeDataPart::Checksum MergeTreeDataPartInMemory::calculateBlockChecksum() const
{
    SipHash hash;
    IMergeTreeDataPart::Checksum checksum;

    for (const auto & column : block)
        column.column->updateHashFast(hash);

    checksum.uncompressed_size = block.bytes();
    hash.get128(checksum.uncompressed_hash);
    return checksum;
}
So the checksum is derived by hashing every column of the part's block. The data that triggers our problem lives in a String column, so we look at the String implementation of updateHashFast to understand how different inputs can hash to the same value.
updateHashFast in ColumnString.h:
void updateHashFast(SipHash & hash) const override
{
    hash.update(reinterpret_cast<const char *>(offsets.data()), size() * sizeof(offsets[0]));
    hash.update(reinterpret_cast<const char *>(chars.data()), size() * sizeof(chars[0]));
}
So a String column's hash is determined by its offsets (the row boundaries, i.e. string lengths) and its chars (the value bytes). Debugging shows:
- sizeof(offsets[0]) = 8, matching its type UInt64.
- sizeof(chars[0]) = 1, matching its type UInt8.
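For orientation, here is a minimal standalone sketch of how a one-row String column lays out its data (plain std::vector standing in for ClickHouse's internal arrays; ColumnString stores each value in chars followed by a terminating zero byte, and offsets holds each row's cumulative end position):
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    /// Mimics a one-row ColumnString holding our first test value.
    std::string value = "adcdefghijklmnopqrstuvwxyz";
    std::vector<char> chars(value.begin(), value.end());
    chars.push_back('\0');                        /// value bytes plus terminating zero
    std::vector<uint64_t> offsets{chars.size()};  /// one cumulative end position per row

    std::cout << "rows = " << offsets.size()              /// 1
              << ", offsets[0] = " << offsets[0]          /// 27
              << ", chars bytes = " << chars.size() << "\n"; /// 27
}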
Now look at what size() returns:
size_t size() const override
{
    return offsets.size();
}
size() returns the length of offsets, i.e. the number of rows; debugging confirms offsets.size() = 1 for our single-row part (consistent with the layout sketch above). This raises a question: why do both the offsets update and the chars update use offsets' size as the element count? Substituting offsets.size() = 1, hashing a String column can effectively be read as:
void updateHashFast(SipHash & hash) const override
{
    hash.update(reinterpret_cast<const char *>(offsets.data()), 8);
    hash.update(reinterpret_cast<const char *>(chars.data()), 1);
}
We know offsets here is a single UInt64 holding a length, so updating the hash with offsets and a size of 8 feeds in the right data with the right size. But what is the effect of updating with chars and a size of 1? That requires looking at the update implementation.
Here is update in SipHash.h:
void update(const char * data, UInt64 size)
{
    const char * end = data + size;

    /// We'll finish to process the remainder of the previous update, if any.
    if (cnt & 7)
    {
        while (cnt & 7 && data < end)
        {
            current_bytes[cnt & 7] = *data;
            ++data;
            ++cnt;
        }

        /// If we still do not have enough bytes to an 8-byte word.
        if (cnt & 7)
            return;

        v3 ^= current_word;
        SIPROUND;
        SIPROUND;
        v0 ^= current_word;
    }

    cnt += end - data;

    while (data + 8 <= end)
    {
        current_word = unalignedLoad<UInt64>(data);

        v3 ^= current_word;
        SIPROUND;
        SIPROUND;
        v0 ^= current_word;

        data += 8;
    }

    /// Pad the remainder, which is missing up to an 8-byte word.
    current_word = 0;
    switch (end - data)
    {
        case 7: current_bytes[6] = data[6]; [[fallthrough]];
        case 6: current_bytes[5] = data[5]; [[fallthrough]];
        case 5: current_bytes[4] = data[4]; [[fallthrough]];
        case 4: current_bytes[3] = data[3]; [[fallthrough]];
        case 3: current_bytes[2] = data[2]; [[fallthrough]];
        case 2: current_bytes[1] = data[1]; [[fallthrough]];
        case 1: current_bytes[0] = data[0]; [[fallthrough]];
        case 0: break;
    }
}
update treats all input as raw bytes (char *) and consumes them in 8-byte words. The flow is (see the sketch after this list):
- 1. If the previous update left a remainder, top it up to 8 bytes in current_bytes using the new input; if there still are not 8 bytes, return and wait for the next update; once the word is complete, run the hash rounds on it.
- 2. Walk the rest of the input in 8-byte steps, hashing one word per step.
- 3. Stash the trailing bytes (fewer than 8) in current_bytes for the next update.
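A minimal standalone stand-in (not ClickHouse code; the two SIPROUNDs are replaced by a trivial mixing step, since only the byte-buffering behaviour matters here) shows that feeding the same bytes in different-sized chunks yields the same state:
#include <cstdint>
#include <cstring>
#include <iostream>

/// Mirrors the chunking skeleton of the update() above with a trivial mix.
struct ChunkedHash
{
    uint64_t state = 0;
    uint64_t cnt = 0;
    union
    {
        uint64_t current_word = 0;
        uint8_t current_bytes[8];
    };

    void mix() { state = (state ^ current_word) * 0x9E3779B97F4A7C15ULL; }

    void update(const char * data, uint64_t size)
    {
        const char * end = data + size;

        /// Step 1: finish the remainder left over from the previous update, if any.
        if (cnt & 7)
        {
            while ((cnt & 7) && data < end)
            {
                current_bytes[cnt & 7] = static_cast<uint8_t>(*data);
                ++data;
                ++cnt;
            }
            if (cnt & 7)
                return; /// Still fewer than 8 bytes accumulated.
            mix();
        }

        cnt += end - data;

        /// Step 2: consume full 8-byte words.
        while (data + 8 <= end)
        {
            std::memcpy(&current_word, data, 8);
            mix();
            data += 8;
        }

        /// Step 3: stash the tail (< 8 bytes) for the next update.
        current_word = 0;
        for (int i = 0; i < end - data; ++i)
            current_bytes[i] = static_cast<uint8_t>(data[i]);
    }
};

int main()
{
    ChunkedHash a, b;
    a.update("abcde", 5);
    a.update("fghij", 5);       /// ten bytes fed in two calls...
    b.update("abcdefghij", 10); /// ...match ten bytes fed in one call
    std::cout << (a.state == b.state ? "equal" : "different") << "\n"; /// prints "equal"
}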
So when chars is hashed with a size of 1, only the first byte of a String column's data ever reaches SipHash; everything else is ignored.
From this conclusion we can predict:
- If each insert gives the String field data of a different length, the insert succeeds (offsets differ, which changes the hash).
- If each insert gives the String field exactly one byte of data, the insert succeeds (that single byte of chars is fully covered by the hash).
- If the inserts give the String field data of the same length but with different first bytes, the insert succeeds (the first byte of chars changes the hash).
- If the inserts give the String field data of the same length and with the same first byte, the insert fails with "Block ID already exists" (same offsets and same first byte of chars, so the hashes cannot be told apart).
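Our two test rows hit exactly the last case: 'adcdefghijklmnopqrstuvwxyz' and 'a1234567890123456789012345' are both 26 bytes long and both start with 'a'. A standalone sketch (not ClickHouse code; FNV-1a stands in for SipHash, since only the number of bytes fed to the hash matters) reproduces the collision and shows that hashing all of chars removes it:
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

/// FNV-1a as a stand-in for SipHash.
struct Fnv1a
{
    uint64_t state = 1469598103934665603ULL;
    void update(const char * data, uint64_t size)
    {
        for (uint64_t i = 0; i < size; ++i)
        {
            state ^= static_cast<unsigned char>(data[i]);
            state *= 1099511628211ULL;
        }
    }
};

/// Hashes a one-row string column the way updateHashFast does: first the
/// offsets array, then the chars array. With `buggy` set, the chars update
/// uses offsets.size() (i.e. 1 byte here) as the element count, as in 21.8.3.
uint64_t hash_one_row(const std::string & s, bool buggy)
{
    std::vector<uint64_t> offsets{s.size() + 1}; /// end position incl. terminating zero
    std::vector<char> chars(s.begin(), s.end());
    chars.push_back('\0');

    Fnv1a h;
    h.update(reinterpret_cast<const char *>(offsets.data()), offsets.size() * sizeof(offsets[0]));
    h.update(chars.data(), (buggy ? offsets.size() : chars.size()) * sizeof(chars[0]));
    return h.state;
}

int main()
{
    std::string a = "adcdefghijklmnopqrstuvwxyz"; /// the two test rows: same length,
    std::string b = "a1234567890123456789012345"; /// same first byte 'a'

    std::cout << "buggy size: " << (hash_one_row(a, true) == hash_one_row(b, true) ? "collision" : "distinct") << "\n";   /// collision
    std::cout << "fixed size: " << (hash_one_row(a, false) == hash_one_row(b, false) ? "collision" : "distinct") << "\n"; /// distinct
}
The first three cases above can be checked the same way by varying the test strings.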
The Fix
We now have the root cause: in ColumnString's updateHashFast, the size passed when hashing the column's contents is wrong. The most fundamental fix is to correct exactly that. Modify updateHashFast in ColumnString.h:
void updateHashFast(SipHash & hash) const override
{
    hash.update(reinterpret_cast<const char *>(offsets.data()), offsets.size() * sizeof(offsets[0]));
    hash.update(reinterpret_cast<const char *>(chars.data()), chars.size() * sizeof(chars[0]));
}
Giving offsets and chars each their own size is all it takes: with this change, the two test rows produce different block IDs and both inserts succeed.