[ClickHouse] Inserting two similar rows one after another reports: Block with ID xx already exists locally as part xx, ignoring it.


While using ClickHouse, two similar rows were inserted, one insert at a time, into a table using the ReplicatedMergeTree engine. Only one row ended up in the table; the second row was missing, and the log showed the error: Block with ID xx already exists locally as part xx, ignoring it. In other words, the second insert was judged to be a duplicate and ignored. Experimenting with the MergeTree settings in config.xml showed that disabling the InMemory part type works around the problem: with part_type Wide or Compact, two similar rows are inserted normally; only InMemory mode triggers the error.

Upstream fix: https://github.com/ClickHouse/ClickHouse/pull/47121#event-8675523528

Reproducing the problem

ClickHouse version: 21.8.3
Table DDL:

create table if not exists default.t_write_local on CLUSTER sharding_cluster(
    id String,
    report_time Int64
) ENGINE = ReplicatedMergeTree('/adbtest1.0/{shard}/t_write_local', '{replica}')
PARTITION BY (toYYYYMMDD(toDateTime(report_time/1000)))
ORDER BY (report_time,id)
TTL toDateTime(report_time/1000) + INTERVAL 30 DAY;

Insert two similar rows:

insert into default.t_write_local(id, report_time) VALUES('adcdefghijklmnopqrstuvwxyz', '1675326231000');
insert into default.t_write_local(id, report_time) VALUES('a1234567890123456789012345', '1675326231000');

ClickHouse log for the second insert:

2023.02.12 14:40:15.738909 [ 12573 ] {81583f23-604a-490e-ae36-b5a3967f6acb} <Debug> default.t_write_local (01f997c1-2d37-4ac1-81f9-97c12d37fac1) (Replicated OutputStream): Wrote block with ID '20230202_17874882959606962441_9047980079478079928', 1 rows
2023.02.12 14:40:15.744609 [ 12573 ] {81583f23-604a-490e-ae36-b5a3967f6acb} <Information> default.t_write_local (01f997c1-2d37-4ac1-81f9-97c12d37fac1) (Replicated OutputStream): Block with ID 20230202_17874882959606962441_9047980079478079928 already exists locally as part 20230202_0_0_0; ignoring it.

In config.xml, setting the merge_tree setting min_rows_for_compact_part to 0 turns InMemory mode off; setting it to some value N turns InMemory mode on, and the data is only written to disk once a part reaches N rows.

<merge_tree>
	<min_rows_for_compact_part>100</min_rows_for_compact_part>
</merge_tree>

With min_rows_for_compact_part set to 0 here, retrying the inserts stores both rows correctly.

Root cause analysis in the code

The analysis below uses the 21.8.3 source.

1. Locating the entry point for writing data

The log line "Wrote block with ID 'xxx', xx rows" leads to the write function in ReplicatedMergeTreeBlockOutputStream.cpp.

void ReplicatedMergeTreeBlockOutputStream::write(const Block & block)
{
    last_block_is_duplicate = false;

    auto zookeeper = storage.getZooKeeper();
    assertSessionIsNotExpired(zookeeper);

    /** If write is with quorum, then we check that the required number of replicas is now live,
      *  and also that for all previous parts for which quorum is required, this quorum is reached.
      * And also check that during the insertion, the replica was not reinitialized or disabled (by the value of `is_active` node).
      * TODO Too complex logic, you can do better.
      */
    if (quorum)
        checkQuorumPrecondition(zookeeper);

    auto part_blocks = storage.writer.splitBlockIntoParts(block, max_parts_per_block, metadata_snapshot, context);

    for (auto & current_block : part_blocks)
    {
        Stopwatch watch;

        /// Write part to the filesystem under temporary name. Calculate a checksum.
        MergeTreeData::MutableDataPartPtr part = storage.writer.writeTempPart(current_block, metadata_snapshot, context);

        /// If optimize_on_insert setting is true, current_block could become empty after merge
        /// and we didn't create part.
        if (!part)
            continue;

        String block_id;

        if (deduplicate)
        {
            /// We add the hash from the data and partition identifier to deduplication ID.
            /// That is, do not insert the same data to the same partition twice.
            block_id = part->getZeroLevelPartBlockID();

            LOG_DEBUG(log, "Wrote block with ID '{}', {} rows", block_id, current_block.block.rows());
        }
        else
        {
            LOG_DEBUG(log, "Wrote block with {} rows", current_block.block.rows());
        }

        try
        {
            commitPart(zookeeper, part, block_id);

            /// Set a special error code if the block is duplicate
            int error = (deduplicate && last_block_is_duplicate) ? ErrorCodes::INSERT_WAS_DEDUPLICATED : 0;
            PartLog::addNewPart(storage.getContext(), part, watch.elapsed(), ExecutionStatus(error));
        }
        catch (...)
        {
            PartLog::addNewPart(storage.getContext(), part, watch.elapsed(), ExecutionStatus::fromCurrentException(__PRETTY_FUNCTION__));
            throw;
        }
    }
}

The code shows the write path: split the block into parts -> for each part, write the part data under a temporary name via writeTempPart -> for deduplication, compute the part's block ID via getZeroLevelPartBlockID -> commit the part via commitPart.
The functions chiefly responsible for the "Block with ID xx already exists locally as part xxx, ignoring it." message are writeTempPart and getZeroLevelPartBlockID (a simplified simulation of the resulting dedup decision follows the list below):

  • writeTempPart: writes the part data to temporary storage; this is where the inputs to the block ID come from
  • getZeroLevelPartBlockID: computes the part's block ID
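
Before digging into those two functions, here is a self-contained simulation, not ClickHouse code, of the dedup decision that commitPart ultimately makes: the block ID acts as a key in the table's block registry (a ZooKeeper node under the table's replication path in the real implementation; a std::map stands in for it here), and a second insert that produces an already-registered block ID is ignored as a duplicate.

#include <iostream>
#include <map>
#include <string>

int main()
{
    // Stand-in for the shared block registry that ReplicatedMergeTree keeps in ZooKeeper.
    std::map<std::string, std::string> blocks;  // block_id -> part name

    auto commit = [&](const std::string & block_id, const std::string & part_name)
    {
        auto [it, inserted] = blocks.emplace(block_id, part_name);
        if (!inserted)
            std::cout << "Block with ID " << block_id
                      << " already exists locally as part " << it->second << "; ignoring it.\n";
        else
            std::cout << "Committed part " << part_name << "\n";
    };

    // Both inserts from the reproduction produced the same (buggy) block ID,
    // so the second part is never committed.
    commit("20230202_17874882959606962441_9047980079478079928", "20230202_0_0_0");
    commit("20230202_17874882959606962441_9047980079478079928", "20230202_1_1_0");
}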

2. The method that computes the block ID

The getZeroLevelPartBlockID method in IMergeTreeDataPart.cpp:

String IMergeTreeDataPart::getZeroLevelPartBlockID() const
{
    if (info.level != 0)
        throw Exception(ErrorCodes::LOGICAL_ERROR, "Trying to get block id for non zero level part {}", name);

    SipHash hash;
    checksums.computeTotalChecksumDataOnly(hash);
    union
    {
        char bytes[16];
        UInt64 words[2];
    } hash_value;
    hash.get128(hash_value.bytes);

    return info.partition_id + "_" + toString(hash_value.words[0]) + "_" + toString(hash_value.words[1]);
}

The code shows that the block ID has two components: partition_id plus a hash of the checksums.
Since both inserted rows have the same partition_id, the difference has to come from the part's checksums.
The next step is to find out which checksum fields influence this hash value, by following checksums' computeTotalChecksumDataOnly function in MergeTreeDataPartChecksum.cpp:

void MergeTreeDataPartChecksums::computeTotalChecksumDataOnly(SipHash & hash) const
{
    /// We use fact that iteration is in deterministic (lexicographical) order.
    for (const auto & it : files)
    {
        const String & name = it.first;
        const Checksum & sum = it.second;

        if (!endsWith(name, ".bin"))
            continue;
        UInt64 len = name.size();
        hash.update(len);
        hash.update(name.data(), len);
        hash.update(sum.uncompressed_size);
        hash.update(sum.uncompressed_hash);
    }
}

The code shows that the elements feeding the checksums hash are the entries of its files map (checksums is a map) whose names end in .bin: for each such entry, the name, its length, and the checksum's uncompressed_size and uncompressed_hash are hashed. Since the two inserted rows differ only in the content of the id column, name, name.size() and name.data() are identical for both inserts; the only thing that can differ is sum. So the question becomes how sum is generated, which the temporary write path for the part data answers below.
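
Before moving on, a minimal standalone sketch, with simplified types and std::hash standing in for SipHash, ties the two fragments above together: the block ID is the partition ID plus a digest of the .bin checksum entries, so two parts whose data.bin entries carry the same uncompressed_size and uncompressed_hash necessarily get the same block ID.

#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>

// Simplified checksum entry: only the fields that computeTotalChecksumDataOnly hashes.
struct Checksum { uint64_t uncompressed_size; uint64_t uncompressed_hash; };

std::string blockID(const std::string & partition_id, const std::map<std::string, Checksum> & files)
{
    std::string fed;  // the bytes a real SipHash would consume
    for (const auto & [name, sum] : files)
    {
        if (name.size() < 4 || name.compare(name.size() - 4, 4, ".bin") != 0)
            continue;  // only *.bin entries participate
        fed += name;
        fed += std::to_string(sum.uncompressed_size);
        fed += std::to_string(sum.uncompressed_hash);
    }
    return partition_id + "_" + std::to_string(std::hash<std::string>{}(fed));
}

int main()
{
    // An InMemory part carries a single "data.bin" entry; if the buggy column hash
    // produces the same uncompressed_hash for both inserts, the block IDs collide.
    std::map<std::string, Checksum> first  {{"data.bin", {54, 0xABCD}}};
    std::map<std::string, Checksum> second {{"data.bin", {54, 0xABCD}}};
    std::cout << (blockID("20230202", first) == blockID("20230202", second)) << "\n";  // prints 1
}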

3. How the data is written to a temporary part

Start with the entry point for writing the temporary part, the writeTempPart function in MergeTreeDataWriter.cpp:

MergeTreeData::MutableDataPartPtr MergeTreeDataWriter::writeTempPart(
    BlockWithPartition & block_with_partition, const StorageMetadataPtr & metadata_snapshot, ContextPtr context)
{
    Block & block = block_with_partition.block;

    static const String TMP_PREFIX = "tmp_insert_";

    // ... omitted
    
    out.writePrefix();
    out.writeWithPermutation(block, perm_ptr);
    out.writeSuffixAndFinalizePart(new_data_part, sync_on_insert);

    ProfileEvents::increment(ProfileEvents::MergeTreeDataWriterRows, block.rows());
    ProfileEvents::increment(ProfileEvents::MergeTreeDataWriterUncompressedBytes, block.bytes());
    ProfileEvents::increment(ProfileEvents::MergeTreeDataWriterCompressedBytes, new_data_part->getBytesOnDisk());

    return new_data_part;
}

This function persists the part data in ClickHouse's storage format under a temporary name. Our focus is on how checksums gets populated; nothing in this snippet modifies checksums explicitly, so the place to look is writeSuffixAndFinalizePart, which receives the part and fills in its checksums.
Following it into writeSuffixAndFinalizePart in MergedBlockOutputStream.cpp:

void MergedBlockOutputStream::writeSuffixAndFinalizePart(
        MergeTreeData::MutableDataPartPtr & new_part,
        bool sync,
        const NamesAndTypesList * total_columns_list,
        MergeTreeData::DataPart::Checksums * additional_column_checksums)
{
    /// Finish write and get checksums.
    MergeTreeData::DataPart::Checksums checksums;

    if (additional_column_checksums)
        checksums = std::move(*additional_column_checksums);

    /// Finish columns serialization.
    writer->finish(checksums, sync);
    
	// ... omitted
	
    new_part->checksums = checksums;
    new_part->setBytesOnDisk(checksums.getTotalSizeOnDisk());
    new_part->index_granularity = writer->getIndexGranularity();
    new_part->calculateColumnsSizesOnDisk();
    if (default_codec != nullptr)
        new_part->default_codec = default_codec;
    new_part->storage.lockSharedData(*new_part);
}

Here new_part->checksums is assigned, with the data coming from writer->finish. For the InMemory part type the writer is MergeTreeDataPartWriterInMemory, whose finish implementation lives in MergeTreeDataPartWriterInMemory.cpp:

void MergeTreeDataPartWriterInMemory::finish(IMergeTreeDataPart::Checksums & checksums, bool /* sync */)
{
    /// If part is empty we still need to initialize block by empty columns.
    if (!part_in_memory->block)
        for (const auto & column : columns_list)
            part_in_memory->block.insert(ColumnWithTypeAndName{column.type, column.name});

    checksums.files["data.bin"] = part_in_memory->calculateBlockChecksum();
}

So in InMemory mode the temporarily stored part has exactly one checksum entry, named "data.bin", whose content comes from calculateBlockChecksum.
Continuing into calculateBlockChecksum in MergeTreeDataPartInMemory.cpp:

IMergeTreeDataPart::Checksum MergeTreeDataPartInMemory::calculateBlockChecksum() const
{
    SipHash hash;
    IMergeTreeDataPart::Checksum checksum;
    for (const auto & column : block)
        column.column->updateHashFast(hash);

    checksum.uncompressed_size = block.bytes();
    hash.get128(checksum.uncompressed_hash);
    return checksum;
}

So the checksum comes from hashing every column of the part. The inserts that trigger the problem differ only in a String column, so the next step is the String type's updateHashFast function, to see why different inputs produce the same digest.
In ColumnString.h:

    void updateHashFast(SipHash & hash) const override
    {
        hash.update(reinterpret_cast<const char *>(offsets.data()), size() * sizeof(offsets[0]));
        hash.update(reinterpret_cast<const char *>(chars.data()), size() * sizeof(chars[0]));
    }

The code shows that a String column's hash depends on the row lengths (offsets) and the character data (chars). Debugging gives:

  • sizeof(offsets[0]) = 8, matching its type UInt64.
  • sizeof(chars[0]) = 1, matching its type UInt8.

Now look at what size() returns:

    size_t size() const override
    {
        return offsets.size();
    }

size() returns the length of offsets, which is the number of rows. That raises a question: why is offsets' size used as the byte count for hashing both offsets and chars? Debugging shows that offsets has length 1 here (one row per insert), so hashing this String column effectively becomes:

    void updateHashFast(SipHash & hash) const override
    {
        hash.update(offsets.data(), 8);
        hash.update(chars.data(), 1);
    }

offsets is just one UInt64 holding the row's end offset, so updating the hash with 8 bytes of offsets is correct. But what is the effect of updating chars with a size of only 1? That requires looking at the implementation of update.
In SipHash.h:

    void update(const char * data, UInt64 size)
    {
        const char * end = data + size;

        /// We'll finish to process the remainder of the previous update, if any.
        if (cnt & 7)
        {
            while (cnt & 7 && data < end)
            {
                current_bytes[cnt & 7] = *data;
                ++data;
                ++cnt;
            }

            /// If we still do not have enough bytes to an 8-byte word.
            if (cnt & 7) 
                return;
     
            v3 ^= current_word;
            SIPROUND;
            SIPROUND;
            v0 ^= current_word;
        }

        cnt += end - data;

        while (data + 8 <= end)
        {
            current_word = unalignedLoad<UInt64>(data);

            v3 ^= current_word;
            SIPROUND;
            SIPROUND;
            v0 ^= current_word;
        
            data += 8;
        }

        /// Pad the remainder, which is missing up to an 8-byte word.
        current_word = 0;
        switch (end - data)
        {
            case 7: current_bytes[6] = data[6]; [[fallthrough]];
            case 6: current_bytes[5] = data[5]; [[fallthrough]];
            case 5: current_bytes[4] = data[4]; [[fallthrough]];
            case 4: current_bytes[3] = data[3]; [[fallthrough]];
            case 3: current_bytes[2] = data[2]; [[fallthrough]];
            case 2: current_bytes[1] = data[1]; [[fallthrough]];
            case 1: current_bytes[0] = data[0]; [[fallthrough]];
            case 0: break;
        }
    }

The code shows that update treats all input as raw char* data and mixes it into the hash state in 8-byte steps. The flow is:

  • 1. If the previous update left a partial word behind, top it up in current_bytes with bytes from this input; if a full 8 bytes still cannot be reached, return and wait for the next update; if it can, mix the completed 8-byte word into the hash state.
  • 2. Then walk the rest of this input in 8-byte chunks, mixing each word into the state.
  • 3. Stash the remaining bytes (fewer than 8) in current_bytes for use by the next update (or by finalization).
    So when chars is hashed with a size argument of 1, only the first byte of the String column's character data ever reaches the SipHash; everything after that first byte is ignored. (A short standalone trace of these two update calls follows.)
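
Here is that trace as a tiny standalone program. It re-implements only the byte-counting bookkeeping of update (not the SIPROUND mixing), which is enough to show what the two calls made for a one-row String column actually feed in.

#include <cstdint>
#include <iostream>

int main()
{
    uint64_t cnt = 0;  // total bytes seen so far, as in SipHash::update
    auto update = [&](const char * what, uint64_t size)
    {
        uint64_t words_mixed = (cnt % 8 + size) / 8;  // complete 8-byte words mixed by this call
        cnt += size;
        std::cout << what << ": fed " << size << " byte(s), mixed " << words_mixed
                  << " word(s), " << cnt % 8 << " byte(s) left buffered in current_bytes\n";
    };

    // One row: updateHashFast calls update(offsets.data(), 1 * 8) then update(chars.data(), 1 * 1).
    update("offsets", 8);  // the single UInt64 offset is mixed in completely
    update("chars", 1);    // only 1 byte of the 26-character id value is even seen
}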

From this we can conclude (a standalone demo of the colliding and non-colliding cases follows this list):

  • If each insert gives the String column data of a different length, the insert succeeds (offsets differ, which changes the hash).
  • If each insert gives the String column only a single byte of data, the insert succeeds (that single byte of chars is fully covered by the hash).
  • If each insert gives data of the same length but with a different first byte, the insert succeeds (the first byte of chars changes the hash).
  • If each insert gives data of the same length and with the same first byte, the insert is rejected because the block ID already exists and the row is not stored (offsets are identical and the first byte of chars is identical, so the hashes cannot differ).
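
The following standalone sketch, with std::vector/std::string standing in for the column buffers and std::hash for SipHash, reproduces the byte feed of the buggy updateHashFast: only offsets.size() bytes of chars, i.e. one byte per row, take part in the digest.

#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Reproduce the bytes the buggy updateHashFast feeds into the hash for one String column.
size_t buggyFeed(const std::vector<uint64_t> & offsets, const std::string & chars)
{
    size_t rows = offsets.size();  // size() == number of rows
    std::string fed(reinterpret_cast<const char *>(offsets.data()), rows * sizeof(uint64_t));
    fed.append(chars.data(), rows * sizeof(char));  // only `rows` byte(s) of chars!
    return std::hash<std::string>{}(fed);
}

int main()
{
    std::vector<uint64_t> offsets{27};  // one 26-character row (ColumnString also stores a terminating zero byte)

    // Same length, same first byte, different content: identical feed, prints 1 (collision).
    std::cout << (buggyFeed(offsets, "adcdefghijklmnopqrstuvwxyz") ==
                  buggyFeed(offsets, "a1234567890123456789012345")) << "\n";

    // Same length, different first byte: prints 0 (the feeds differ), matching the third case above.
    std::cout << (buggyFeed(offsets, "adcdefghijklmnopqrstuvwxyz") ==
                  buggyFeed(offsets, "bdcdefghijklmnopqrstuvwxyz")) << "\n";
}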

How to fix it

The root cause is now known: in the String column's updateHashFast function, the wrong size is passed when hashing the character data. The most direct fix is to correct that size. Modify updateHashFast in ColumnString.h:

    void updateHashFast(SipHash & hash) const override
    {
        hash.update(reinterpret_cast<const char *>(offsets.data()), offsets.size() * sizeof(offsets[0]));
        hash.update(reinterpret_cast<const char *>(chars.data()), chars.size() * sizeof(chars[0]));
    }

Giving offsets and chars each their own size is enough to resolve the problem.
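
As a quick sanity check, the same standalone stand-in as above, but with the corrected byte counts, now distinguishes the two rows from the reproduction.

#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Bytes fed by the fixed updateHashFast: every offset and every character byte.
size_t fixedFeed(const std::vector<uint64_t> & offsets, const std::string & chars)
{
    std::string fed(reinterpret_cast<const char *>(offsets.data()),
                    offsets.size() * sizeof(uint64_t));  // offsets.size() * sizeof(offsets[0])
    fed.append(chars);                                   // chars.size() * sizeof(chars[0])
    return std::hash<std::string>{}(fed);
}

int main()
{
    std::vector<uint64_t> offsets{27};  // same single-row layout as before
    std::cout << (fixedFeed(offsets, "adcdefghijklmnopqrstuvwxyz") ==
                  fixedFeed(offsets, "a1234567890123456789012345")) << "\n";  // prints 0: digests differ
}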
