[ClickHouse Source Code] The Data Synchronization Flow of ReplicatedMergeTree

The Data Synchronization Flow of ReplicatedMergeTree

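Once a ReplicatedMergeTree table has been created, several taskHolders run in the background: they watch the log in ZooKeeper and append new entries to the queue, and they watch for changes to mutations and trigger the corresponding mutation work. Mutations are not analyzed here; this article covers the normal data insertion path and the normal data replication path.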

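First, get to know one taskHolder:

queue_task_handle: picks entries from the queue and executes them

How is this task started? It is started by the startup() method, which is called when a table is created or when the server restarts. Without going into detail, see the following two methods: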

The first:

BlockIO InterpreterCreateQuery::createTable(ASTCreateQuery & create)
{
    ...

    {
        ...
        // On table creation, startup() is called here; for a replicated table this is StorageReplicatedMergeTree::startup()
        res->startup();
    }

    ...
}

The second:

void DatabaseOrdinary::startupTables(ThreadPool & thread_pool)
{
    LOG_INFO(log, "Starting up tables.");

    ...

    auto startupOneTable = [&](const StoragePtr & table)
    {
        // On server restart, every table of every database gets its background threads and tasks started again
        // by calling that table's startup() method
        table->startup();

        ...
    };

    ...
}

The StorageReplicatedMergeTree::startup() method looks like this:

void StorageReplicatedMergeTree::startup()
{
    if (is_readonly)
        return;

    if (set_table_structure_at_startup)
        set_table_structure_at_startup();

    queue.initialize(
        zookeeper_path, replica_path,
        database_name + "." + table_name + " (ReplicatedMergeTreeQueue)",
        getDataParts());

    StoragePtr ptr = shared_from_this();
    InterserverIOEndpointPtr data_parts_exchange_endpoint = std::make_shared<DataPartsExchange::Service>(*this, ptr);
    data_parts_exchange_endpoint_holder = std::make_shared<InterserverIOEndpointHolder>(
        data_parts_exchange_endpoint->getId(replica_path), data_parts_exchange_endpoint, global_context.getInterserverIOHandler());
    // Register queueTask() with the background pool; the returned handle is kept in queue_task_handle
    queue_task_handle = global_context.getBackgroundPool().addTask([this] { return queueTask(); });
    // Likewise, movePartsTask() is registered and its handle is kept in move_parts_task_handle
    move_parts_task_handle = global_context.getBackgroundPool().addTask([this] { return movePartsTask(); });

    // Start the restarting thread, which activates the replica
    restarting_thread.start();

    startup_event.wait();
}

queueTask() is the method that processes the entries previously added to the queue:

BackgroundProcessingPoolTaskResult StorageReplicatedMergeTree::queueTask()
{

    if (queue.actions_blocker.isCancelled())
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
        return BackgroundProcessingPoolTaskResult::SUCCESS;
    }

    // selected holds the entry chosen for execution, which keeps the code below readable
    ReplicatedMergeTreeQueue::SelectedEntry selected;

    try
    {
        // Select the entry to process
        selected = queue.selectEntryToProcess(merger_mutator, *this);
    }
    catch (...)
    {
        tryLogCurrentException(log, __PRETTY_FUNCTION__);
    }

    LogEntryPtr & entry = selected.first;

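    // If there is nothing to process, return immediately
    if (!entry)
        return BackgroundProcessingPoolTaskResult::NOTHING_TO_DO;
    // Otherwise, process the entry
    time_t prev_attempt_time = entry->last_attempt_time;

    // This is where the entry is actually processed; note the lambda passed as the last argument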
    bool res = queue.processEntry([this]{ return getZooKeeper(); }, entry, [&](LogEntryPtr & entry_to_process)
    {
        try
        {
            // executeLogEntry() is the actual executor
            return executeLogEntry(*entry_to_process);
        }
        catch (const Exception & e)
        {
            ...
            throw;
        }
        catch (...)
        {
            ...
            throw;
        }
    });

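    // Sleep only if processing failed AND this entry was already attempted less than 10 seconds ago
    bool need_sleep = !res && (entry->last_attempt_time - prev_attempt_time < 10);

    // need_sleep is reported as ERROR so that the pool waits before the next attempt; otherwise report SUCCESS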
    return need_sleep ? BackgroundProcessingPoolTaskResult::ERROR : BackgroundProcessingPoolTaskResult::SUCCESS;
}
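
The return-value rule at the end of queueTask() is easy to misread, so here is a minimal standalone sketch of it. TaskResult and resultAfterProcessing are hypothetical names introduced only for illustration; the 10-second comparison is the one from the code above.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical stand-in for BackgroundProcessingPoolTaskResult.
enum class TaskResult { Success, Error, NothingToDo };

// Mirrors the last two statements of queueTask(): report Error only when
// processing failed AND the previous attempt on this entry was under 10 s ago.
TaskResult resultAfterProcessing(bool processed_ok,
                                 std::time_t prev_attempt_time,
                                 std::time_t last_attempt_time)
{
    const bool need_sleep = !processed_ok && (last_attempt_time - prev_attempt_time < 10);
    return need_sleep ? TaskResult::Error : TaskResult::Success;
}
```

One consequence visible in the sketch: a failure whose previous attempt lies 10 or more seconds back is still reported as Success, so only rapid repeated failures on the same entry trigger the wait.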

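executeLogEntry() mainly dispatches on the type of the entry. The types are:

  • EMPTY: not used
  • GET_PART: fetch the part from another replica
  • MERGE_PARTS: merge parts
  • DROP_RANGE: delete the parts within the specified range of the specified partition
  • CLEAR_COLUMN: drop the specified column from the specified partition
  • CLEAR_INDEX: drop the specified index from the specified partition
  • REPLACE_RANGE: replace the specified range of the original partition with the specified range of the new one
  • MUTATE_PART: apply one or more mutations to the part

bool StorageReplicatedMergeTree::executeLogEntry(LogEntry & entry)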
{
    // Dispatch on the entry type
    if (entry.type == LogEntry::DROP_RANGE)
    {
        executeDropRange(entry);
        return true;
    }

    if (entry.type == LogEntry::CLEAR_COLUMN || entry.type == LogEntry::CLEAR_INDEX)
    {
        executeClearColumnOrIndexInPartition(entry);
        return true;
    }

    if (entry.type == LogEntry::REPLACE_RANGE)
    {
        executeReplaceRange(entry);
        return true;
    }

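    // These three types all produce a part; if this replica already has it (or a part covering it), nothing needs to be done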
    if (entry.type == LogEntry::GET_PART ||
        entry.type == LogEntry::MERGE_PARTS ||
        entry.type == LogEntry::MUTATE_PART)
    {
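        // A part being processed moves from PreCommitted to Committed, so look among the PreCommitted parts first
        DataPartPtr existing_part = getPartIfExists(entry.new_part_name, {MergeTreeDataPartState::PreCommitted});
        if (!existing_part)
            // Not found among the PreCommitted parts: look for an active part that contains it
            existing_part = getActiveContainingPart(entry.new_part_name);

        // An entry for data inserted on this very replica also reaches this point; the part already exists locally
        // and is registered in ZooKeeper, so the entry is skipped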
        if (existing_part && getZooKeeper()->exists(replica_path + "/parts/" + existing_part->name))
        {
            if (!(entry.type == LogEntry::GET_PART && entry.source_replica == replica_name))
            {
                LOG_DEBUG(log, "Skipping action for part " << entry.new_part_name << " because part " + existing_part->name + " already exists.");
            }
            return true;
        }
    }

    if (entry.type == LogEntry::GET_PART && entry.source_replica == replica_name)
        LOG_WARNING(log, "Part " << entry.new_part_name << " from own log doesn't exist.");

    if (entry.quorum && getZooKeeper()->exists(zookeeper_path + "/quorum/failed_parts/" + entry.new_part_name))
    {
        LOG_DEBUG(log, "Skipping action for part " << entry.new_part_name << " because quorum for that part was failed.");
        return true;   
    }

    bool do_fetch = false;
    if (entry.type == LogEntry::GET_PART)
    {
        do_fetch = true;
    }
    else if (entry.type == LogEntry::MERGE_PARTS)
    {
        do_fetch = !tryExecuteMerge(entry);
    }
    else if (entry.type == LogEntry::MUTATE_PART)
    {
        do_fetch = !tryExecutePartMutation(entry);
    }
    else
    {
        throw Exception("Unexpected log entry type: " + toString(static_cast<int>(entry.type)), ErrorCodes::LOGICAL_ERROR);
    }

    // do_fetch was decided above; if it is true, the part is actually fetched from another replica
    if (do_fetch)
        return executeFetch(entry);

    return true;
}
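
To make the order of the checks in executeLogEntry() easier to follow, the sketch below reduces it to a pure function. EntryType, Outcome and decide are hypothetical names for illustration; the three bool parameters stand in for the ZooKeeper lookups and for the results of tryExecuteMerge / tryExecutePartMutation, which are not modelled here.

```cpp
#include <cassert>
#include <stdexcept>

// Simplified mirror of the entry types listed above (illustrative only).
enum class EntryType { Empty, GetPart, MergeParts, DropRange, ClearColumn, ClearIndex, ReplaceRange, MutatePart };

enum class Outcome { DedicatedHandler, SkipAlreadyPresent, SkipQuorumFailed, DoneLocally, Fetch };

// part_present:  the part (or a covering part) exists locally and is registered in ZooKeeper
// quorum_failed: the entry uses quorum and the part is listed under quorum/failed_parts
// local_work_ok: the local merge or mutation succeeded
Outcome decide(EntryType type, bool part_present, bool quorum_failed, bool local_work_ok)
{
    // 1. Range and column/index operations go to their own executors.
    switch (type)
    {
        case EntryType::DropRange:
        case EntryType::ClearColumn:
        case EntryType::ClearIndex:
        case EntryType::ReplaceRange:
            return Outcome::DedicatedHandler;
        default:
            break;
    }

    // 2. Part-producing entries are skipped when the part is already there.
    const bool produces_part = type == EntryType::GetPart
        || type == EntryType::MergeParts
        || type == EntryType::MutatePart;
    if (produces_part && part_present)
        return Outcome::SkipAlreadyPresent;

    // 3. A part whose quorum write failed is skipped as well.
    if (quorum_failed)
        return Outcome::SkipQuorumFailed;

    // 4. GET_PART always fetches; merge and mutation fetch only as a fallback.
    if (type == EntryType::GetPart)
        return Outcome::Fetch;
    if (type == EntryType::MergeParts || type == EntryType::MutatePart)
        return local_work_ok ? Outcome::DoneLocally : Outcome::Fetch;

    throw std::logic_error("Unexpected log entry type");
}
```

The point of the sketch is step 4: a failed local merge or mutation does not fail the entry, it turns the entry into a fetch of the finished part from another replica.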

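executeFetch(entry) also contains quite a bit of logic. At this point we have seen how ReplicatedMergeTree triggers data synchronization and how each entry type is handled by its own method. Next, the executeFetch(entry) method: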

bool StorageReplicatedMergeTree::executeFetch(LogEntry & entry)
{
    /// Find a replica that has the required part or a part covering it
    String replica = findReplicaHavingCoveringPart(entry, true);
    const auto storage_settings_ptr = getSettings();

    // Enforce the fetch concurrency limits: replicated_max_parallel_fetches (counter shared by all tables)
    // and replicated_max_parallel_fetches_for_table (counter of this table)
    static std::atomic_uint total_fetches {0};
    if (storage_settings_ptr->replicated_max_parallel_fetches && total_fetches >= storage_settings_ptr->replicated_max_parallel_fetches)
    {
        throw Exception("Too many total fetches from replicas, maximum: " + storage_settings_ptr->replicated_max_parallel_fetches.toString(),
            ErrorCodes::TOO_MANY_FETCHES);
    }

    ++total_fetches;
    SCOPE_EXIT({--total_fetches;});

    if (storage_settings_ptr->replicated_max_parallel_fetches_for_table && current_table_fetches >= storage_settings_ptr->replicated_max_parallel_fetches_for_table)
    {
        throw Exception("Too many fetches from replicas for table, maximum: " + storage_settings_ptr->replicated_max_parallel_fetches_for_table.toString(),
            ErrorCodes::TOO_MANY_FETCHES);
    }

    ++current_table_fetches;
    SCOPE_EXIT({--current_table_fetches;});

    try
    {
        if (replica.empty())
        {
            // Related to the quorum (multi-replica acknowledgement) mechanism; fairly complex, skipped for now
            if (entry.quorum)
            {
                ...
            }

            if (replica.empty())
            {
                ProfileEvents::increment(ProfileEvents::ReplicatedPartFailedFetches);
                throw Exception("No active replica has part " + entry.new_part_name + " or covering part", ErrorCodes::NO_REPLICA_HAS_PART);
            }
        }

        try
        {
            String part_name = entry.actual_new_part_name.empty() ? entry.new_part_name : entry.actual_new_part_name;
            // Fetch the part
            if (!fetchPart(part_name, zookeeper_path + "/replicas/" + replica, false, entry.quorum))
                return false;
        }
        catch (Exception & e)
        {
            /// No stacktrace, just log message
            if (e.code() == ErrorCodes::RECEIVED_ERROR_TOO_MANY_REQUESTS)
                e.addMessage("Too busy replica. Will try later.");
            throw;
        }

        if (entry.type == LogEntry::MERGE_PARTS)
            ProfileEvents::increment(ProfileEvents::ReplicatedPartFetchesOfMerged);
    }
    catch (...)
    {
        ...
    }

    return true;
}
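
The "check the limit, increment, decrement on scope exit" pattern used twice in executeFetch() can be sketched with a small RAII guard. FetchSlot is a hypothetical class written for this illustration; ClickHouse itself uses a plain increment plus the SCOPE_EXIT macro, as shown above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical guard playing the role of "++counter; SCOPE_EXIT({--counter;});".
class FetchSlot
{
public:
    // limit == 0 means "no limit", matching how the settings are tested above.
    FetchSlot(std::atomic<std::size_t> & counter_, std::size_t limit)
        : counter(counter_)
    {
        // If this throws, the destructor never runs, so the counter stays untouched.
        if (limit && counter.load() >= limit)
            throw std::runtime_error("Too many fetches from replicas");
        ++counter;
    }

    ~FetchSlot() { --counter; }

    FetchSlot(const FetchSlot &) = delete;
    FetchSlot & operator=(const FetchSlot &) = delete;

private:
    std::atomic<std::size_t> & counter;
};
```

As in the original code, the check and the increment are two separate steps, so under contention the limit can be briefly exceeded; it acts as a soft cap rather than a strict one.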

As an aside, here is findReplicaHavingCoveringPart, to give a rough idea of its logic:

String StorageReplicatedMergeTree::findReplicaHavingCoveringPart(LogEntry & entry, bool active)
{
    auto zookeeper = getZooKeeper();
    Strings replicas = zookeeper->getChildren(zookeeper_path + "/replicas");

    /// Shuffle the replicas so that they are tried in random order
    std::shuffle(replicas.begin(), replicas.end(), thread_local_rng);

    for (const String & replica : replicas)
    {
        // Skip ourselves
        if (replica == replica_name)
            continue;
        // Skip replicas that are not active
        if (active && !zookeeper->exists(zookeeper_path + "/replicas/" + replica + "/is_active"))
            continue;

        String largest_part_found;
        // List all parts of this replica
        Strings parts = zookeeper->getChildren(zookeeper_path + "/replicas/" + replica + "/parts");
        for (const String & part_on_replica : parts)
        {
            if (part_on_replica == entry.new_part_name
                || MergeTreePartInfo::contains(part_on_replica, entry.new_part_name, format_version))
            {
                if (largest_part_found.empty()
                    || MergeTreePartInfo::contains(part_on_replica, largest_part_found, format_version))
                {
                    largest_part_found = part_on_replica;
                }
            }
        }

        if (!largest_part_found.empty())
        {
            bool the_same_part = largest_part_found == entry.new_part_name;

            /// If the selected part is a covering part rather than the requested part itself, do an extra check against the queue
            if (!the_same_part)
            {
                String reject_reason;
                if (!queue.addFuturePartIfNotCoveredByThem(largest_part_found, entry, reject_reason))
                {
                    LOG_INFO(log, "Will not fetch part " << largest_part_found << " covering " << entry.new_part_name << ". " << reject_reason);
                    return {};
                }
            }

            return replica;
        }
    }

    return {};
}
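
The inner loop above depends on what it means for one part to "contain" another. Below is a minimal sketch of that idea, assuming part names of the form partition_min_max_level with an optional _mutation suffix and a partition id without underscores. PartInfo, parsePartName, contains and largestCoveringPart are simplified stand-ins written for this illustration, not the real MergeTreePartInfo API, and older name formats (see format_version) are not handled.

```cpp
#include <cassert>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

struct PartInfo
{
    std::string partition_id;
    long min_block = 0;
    long max_block = 0;
    long level = 0;
    long mutation = 0;
};

// Parse "partition_min_max_level[_mutation]"; returns false on malformed names.
bool parsePartName(const std::string & name, PartInfo & out)
{
    std::vector<std::string> tokens;
    std::stringstream ss(name);
    for (std::string token; std::getline(ss, token, '_');)
        tokens.push_back(token);
    if (tokens.size() != 4 && tokens.size() != 5)
        return false;
    try
    {
        out.partition_id = tokens[0];
        out.min_block = std::stol(tokens[1]);
        out.max_block = std::stol(tokens[2]);
        out.level = std::stol(tokens[3]);
        out.mutation = tokens.size() == 5 ? std::stol(tokens[4]) : 0;
    }
    catch (const std::exception &)
    {
        return false;
    }
    return true;
}

// outer covers inner: same partition, wider block range, level and mutation not lower.
bool contains(const PartInfo & outer, const PartInfo & inner)
{
    return outer.partition_id == inner.partition_id
        && outer.min_block <= inner.min_block
        && outer.max_block >= inner.max_block
        && outer.level >= inner.level
        && outer.mutation >= inner.mutation;
}

// Mirrors the inner loop: keep the widest part that still covers the wanted one.
std::string largestCoveringPart(const std::vector<std::string> & parts_on_replica, const std::string & wanted)
{
    PartInfo wanted_info;
    if (!parsePartName(wanted, wanted_info))
        return {};

    std::string best;
    PartInfo best_info;
    for (const auto & name : parts_on_replica)
    {
        PartInfo info;
        if (!parsePartName(name, info) || !contains(info, wanted_info))
            continue;
        if (best.empty() || contains(info, best_info))
        {
            best = name;
            best_info = info;
        }
    }
    return best;
}
```

For example, with the parts all_1_1_0, all_2_3_1 and all_1_5_2 on a replica and all_2_3_1 requested, the sketch selects all_1_5_2, because it covers blocks 2 to 3 and is wider than the exact match.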

Back to the main topic. Once execution reaches fetchPart, the individual part is actually fetched. The code:

bool StorageReplicatedMergeTree::fetchPart(const String & part_name, const String & source_replica_path, bool to_detached, size_t quorum)
{
    ...

    std::function<MutableDataPartPtr()> get_part;
    if (part_to_clone)
    {
        get_part = [&, part_to_clone]()
        {
            return cloneAndLoadDataPart(part_to_clone, "tmp_clone_", part_info);
        };
    }
    else
    {
        // Get the address of the replica to fetch the part from
        ReplicatedMergeTreeAddress address(getZooKeeper()->get(source_replica_path + "/host"));
        auto timeouts = ConnectionTimeouts::getHTTPTimeouts(global_context);
        auto user_password = global_context.getInterserverCredentials();
        String interserver_scheme = global_context.getInterserverScheme();

        get_part = [&, address, timeouts, user_password, interserver_scheme]()
        {
            if (interserver_scheme != address.scheme)
                throw Exception("Interserver schemes are different: '" + interserver_scheme
                    + "' != '" + address.scheme + "', can't fetch part from " + address.host,
                    ErrorCodes::LOGICAL_ERROR);

            // fetcher.fetchPart mainly builds the HTTP parameters, opens the connection and actually pulls the data
            return fetcher.fetchPart(
                part_name, source_replica_path,
                address.host, address.replication_port,
                timeouts, user_password.first, user_password.second, interserver_scheme, to_detached);
        };
    }

    ...

    return true;
}
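
fetchPart() does not fetch right away: it first stores the chosen strategy in a std::function and calls it later. The sketch below shows only that pattern. cloneLocalPart, fetchRemotePart and makeGetPart are hypothetical helpers that return strings where ClickHouse returns a MutableDataPartPtr, and how part_to_clone is found is elided in the code above and not reconstructed here.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-ins for the two ways of obtaining a part.
std::string cloneLocalPart(const std::string & source_part)
{
    return "tmp_clone_" + source_part;
}

std::string fetchRemotePart(const std::string & part_name, const std::string & host)
{
    return "fetched " + part_name + " from " + host;
}

// Pick the strategy now, run it later: an empty local_source_part means
// there is nothing to clone locally, so the part has to come over the network.
std::function<std::string()> makeGetPart(const std::string & part_name,
                                         const std::string & local_source_part,
                                         const std::string & host)
{
    if (!local_source_part.empty())
        return [local_source_part] { return cloneLocalPart(local_source_part); };

    // Capture by value so the callable stays valid after the arguments go away.
    return [part_name, host] { return fetchRemotePart(part_name, host); };
}
```

Deferring the call this way lets the code around it stay identical for both strategies, which is presumably why the original wraps them in get_part.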

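That completes the main logic of the asynchronous data replication flow of ReplicatedMergeTree. Many details, such as quorum, were left out here and will be added later when time permits.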
