[ClickHouse Source Code] How MergeTree Loads Data on Restart

ClickHouse ships many table engines. The most widely used is ReplicatedMergeTree, but ReplicatedMergeTree is itself an enhancement built on top of MergeTree, so the core is still MergeTree. This article walks through how MergeTree loads its data when ClickHouse restarts.

The MergeTree data loading process

Server startup begins in the main() method of Server.cpp, so the data loading process is embedded there. The code is long, so only the key parts are kept:

int Server::main(const std::vector<std::string> & /*args*/)
{
    ......
        
    /// Build the absolute path for format_schemas
    auto format_schema_path = Poco::File(config().getString("format_schema_path", path + "format_schemas/"));
    global_context->setFormatSchemaPath(format_schema_path.path());
    format_schema_path.createDirectories();

    LOG_INFO(log, "Loading metadata from " + path);

    try
    {
        loadMetadataSystem(*global_context);
        /// After attaching system databases we can initialize system log.
        global_context->initializeSystemLogs();
        /// After the system database is created, attach virtual system tables (in addition to query_log and part_log)
        attachSystemTablesServer(*global_context->getDatabase("system"), has_zookeeper);
        /// Then, load remaining databases
        loadMetadata(*global_context);
    }
    catch (...)
    {
        tryLogCurrentException(log, "Caught exception while loading metadata");
        throw;
    }
    LOG_DEBUG(log, "Loaded metadata.");
    
    ......
}

It contains three main steps:

  1. loadMetadataSystem(): loads the system database and the metadata of some of its system tables, e.g. query_log, trace_log, metric_log and the other system tables configured in config.xml; it also generates .sql files under /var/lib/clickhouse/metadata/system
  2. attachSystemTablesServer(): attaches the remaining system tables of the system database, e.g. clusters, macros, parts, i.e. the system tables that do not get .sql files under /var/lib/clickhouse/metadata/system
  3. loadMetadata(): loads the database and table metadata of all databases other than the system database; this method is the focus of this article

The loadMetadata() method

void loadMetadata(Context & context)
{
    String path = context.getPath() + "metadata";

    /** There may exist 'force_restore_data' file, that means,
      *  skip safety threshold on difference of data parts while initializing tables.
      * This file is deleted after successful loading of tables.
      * (flag is "one-shot")
      */
    Poco::File force_restore_data_flag_file(context.getFlagsPath() + "force_restore_data");
    bool has_force_restore_data_flag = force_restore_data_flag_file.exists();

    /// Loop over databases.
    std::map<String, String> databases;
    Poco::DirectoryIterator dir_end;
    for (Poco::DirectoryIterator it(path); it != dir_end; ++it)
    {
        if (!it->isDirectory())
            continue;

        /// For '.svn', '.gitignore' directory and similar.
        if (it.name().at(0) == '.')
            continue;

        if (it.name() == SYSTEM_DATABASE)
            continue;

        databases.emplace(unescapeForFileName(it.name()), it.path().toString());
    }

    for (const auto & [name, db_path] : databases)
        loadDatabase(context, name, db_path, has_force_restore_data_flag);

    if (has_force_restore_data_flag)
    {
        try
        {
            force_restore_data_flag_file.remove();
        }
        catch (...)
        {
            tryLogCurrentException("Load metadata", "Can't remove force restore file to enable data santity checks");
        }
    }
}

It contains three main steps:

  1. Check whether a force_restore_data flag file exists, i.e. whether data should be force-restored; the flag is passed down step by step until it is assigned to the has_force_restore_data_flag attribute of InterpreterCreateQuery, for later logic to use
  2. Iterate over the non-system databases under the metadata directory and call loadDatabase() on each to load its tables; this method is the key part and is analyzed below
  3. If the above flow succeeds, try to delete the force_restore_data file, so that the next restart does not enter the force-restore logic again
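
The flag is just an empty file under the server's flags directory (by default /var/lib/clickhouse/flags/force_restore_data). Below is a minimal sketch of the same one-shot flag pattern, using std::filesystem in place of Poco and assuming the default installation layout:

#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

int main()
{
    // Default flags location, assuming the standard /var/lib/clickhouse layout.
    const fs::path flag = "/var/lib/clickhouse/flags/force_restore_data";

    // One-shot flag: read it once at startup...
    const bool has_force_restore_data_flag = fs::exists(flag);
    std::cout << "force restore: " << has_force_restore_data_flag << '\n';

    // ... load all databases here, passing the flag down ...

    // ...then remove it after a successful load so the next restart
    // goes back to the normal sanity checks. A failed removal is only
    // logged in the real code, so errors are swallowed here too.
    if (has_force_restore_data_flag)
    {
        std::error_code ec;
        fs::remove(flag, ec);
    }
}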

The loadDatabase() method

static void loadDatabase(
    Context & context,
    const String & database,
    const String & database_path,
    bool force_restore_data)
{
    /// There may exist .sql file with database creation statement.
    /// Or, if it is absent, then database with default engine is created.

    String database_attach_query;
    String database_metadata_file = database_path + ".sql";

    if (Poco::File(database_metadata_file).exists())
    {
        ReadBufferFromFile in(database_metadata_file, 1024);
        readStringUntilEOF(database_attach_query, in);
    }
    else
        database_attach_query = "ATTACH DATABASE " + backQuoteIfNeed(database);

    executeCreateQuery(database_attach_query, context, database,
                       database_metadata_file, force_restore_data);
}

This method takes the .sql content read above, turns it back into a CreateQuery, and executes it as a creation-type query, passing force_restore_data along. Note that at this point the SQL is no longer create ... but attach ... (if no .sql file exists, a default statement is built instead, e.g. ATTACH DATABASE test for a database named test).

executeCreateQuery() in turn calls InterpreterCreateQuery::execute(); apart from assembling the arguments it contains no other logic.

The InterpreterCreateQuery::execute() method

BlockIO InterpreterCreateQuery::execute()
{
    auto & create = query_ptr->as<ASTCreateQuery &>();
    checkAccess(create);
    ASTQueryWithOutput::resetOutputASTIfExist(create);

    /// CREATE|ATTACH DATABASE
    if (!create.database.empty() && create.table.empty())
        return createDatabase(create);
    else if (!create.is_dictionary)
        return createTable(create);
    else
        return createDictionary(create);
}

This method covers creating a database, creating a table, and creating a dictionary.

When the server is restarting, for the tables of an ordinary database both create.database and create.table are non-empty, so execution enters createTable(). createTable() is the method that implements the normal create table logic and also handles the attach table case. The method is fairly long; for the purposes of this article, this level of understanding is enough.

The StorageMergeTree::StorageMergeTree() constructor

Execution has now reached createTable(), and as noted above the SQL at this point is attach rather than create. During the attach a StorageMergeTree is instantiated, which triggers the StorageMergeTree constructor.

StorageMergeTree::StorageMergeTree(
    const String & database_name_,
    const String & table_name_,
    const ColumnsDescription & columns_,
    const IndicesDescription & indices_,
    const ConstraintsDescription & constraints_,
    bool attach,
    Context & context_,
    const String & date_column_name,
    const ASTPtr & partition_by_ast_,
    const ASTPtr & order_by_ast_,
    const ASTPtr & primary_key_ast_,
    const ASTPtr & sample_by_ast_, /// nullptr, if sampling is not supported.
    const ASTPtr & ttl_table_ast_,
    const MergingParams & merging_params_,
    std::unique_ptr<MergeTreeSettings> storage_settings_,
    bool has_force_restore_data_flag)
        : MergeTreeData(database_name_, table_name_,
            columns_, indices_, constraints_,
            context_, date_column_name, partition_by_ast_, order_by_ast_, primary_key_ast_,
            sample_by_ast_, ttl_table_ast_, merging_params_,
            std::move(storage_settings_), false, attach),
        reader(*this), writer(*this),
        merger_mutator(*this, global_context.getBackgroundPool().getNumberOfThreads())
{
    loadDataParts(has_force_restore_data_flag);

    if (!attach && !getDataParts().empty())
        throw Exception("Data directory for table already containing data parts - probably it was unclean DROP table or manual intervention. You must either clear directory by hand or use ATTACH TABLE instead of CREATE TABLE if you need to use that parts.", ErrorCodes::INCORRECT_DATA);

    increment.set(getMaxBlockNumber());

    loadMutations();
}

It contains three main steps:

  1. loadDataParts(): loads all of the table's parts
  2. increment.set(getMaxBlockNumber()): sets max_block_number
  3. loadMutations(): loads all mutations (see the sketch after this list)
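
loadMutations() scans the table's data directories for persisted mutation entries, which non-replicated MergeTree keeps as files named like mutation_<version>.txt next to the parts. A minimal sketch of that scan under those assumptions, using std::filesystem and a hypothetical table path:

#include <filesystem>
#include <iostream>
#include <string>

namespace fs = std::filesystem;

int main()
{
    // Hypothetical table data directory.
    const fs::path table_path = "/var/lib/clickhouse/data/default/test_table";
    if (!fs::exists(table_path))
        return 0;

    for (const auto & entry : fs::directory_iterator(table_path))
    {
        const std::string name = entry.path().filename().string();
        // Mutation entries are persisted as mutation_<version>.txt files.
        if (name.rfind("mutation_", 0) == 0)
            std::cout << "found mutation entry: " << name << '\n';
    }
}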

The loadDataParts() method

void MergeTreeData::loadDataParts(bool skip_sanity_checks)
{
    LOG_DEBUG(log, "Loading data parts");

    ......
    /* step1 */
    auto disks = storage_policy->getDisks();

    /// Reversed order to load part from low priority disks firstly.
    /// Used for keep part on low priority disk if duplication found
    for (auto disk_it = disks.rbegin(); disk_it != disks.rend(); ++disk_it)
    {
        auto disk_ptr = *disk_it;
        for (Poco::DirectoryIterator it(getFullPathOnDisk(disk_ptr)); it != end; ++it)
        {
            /// Skip temporary directories.
            if (startsWith(it.name(), "tmp"))
                continue;

            part_names_with_disks.emplace_back(it.name(), disk_ptr);
        }
    }

    ......
    /* step2 */
    ThreadPool pool(num_threads);

    for (size_t i = 0; i < part_names_with_disks.size(); ++i)
    {
        pool.scheduleOrThrowOnError([&, i]
        {
            const auto & part_name = part_names_with_disks[i].first;
            const auto part_disk_ptr = part_names_with_disks[i].second;
            ......
            /* step3 */
            Poco::Path part_path(getFullPathOnDisk(part_disk_ptr), part_name);
            Poco::Path marker_path(part_path, DELETE_ON_DESTROY_MARKER_PATH);
            if (Poco::File(marker_path).exists())
            {
                LOG_WARNING(log, "Detaching stale part " << getFullPathOnDisk(part_disk_ptr) << part_name << ", which should have been deleted after a move. That can only happen after unclean restart of ClickHouse after move of a part having an operation blocking that stale copy of part.");
                std::lock_guard loading_lock(mutex);
                broken_parts_to_detach.push_back(part);
                ++suspicious_broken_parts;
                return;
            }

            /* step4 */
            try
            {
                part->loadColumnsChecksumsIndexes(require_part_metadata, true);
            }
            catch (const Exception & e)
            {
                ......
            }


            ......
            /* step5 */    
            /// Ignore and possibly delete broken parts that can appear as a result of hard server restart.
            if (broken)
            {
                /* step6 */
                if (part->info.level == 0)
                {
                    ......
                    std::lock_guard loading_lock(mutex);
                    broken_parts_to_remove.push_back(part);
                }
                else
                {
                    ......
                    /* step7 */
                    for (const auto & [contained_name, contained_disk_ptr] : part_names_with_disks)
                    {
                        if (contained_name == part_name)
                            continue;

                        MergeTreePartInfo contained_part_info;
                        if (!MergeTreePartInfo::tryParsePartName(contained_name, &contained_part_info, format_version))
                            continue;

                        if (part->info.contains(contained_part_info))
                        {
                            ......
                            ++contained_parts;
                        }
                    }

                    /* step8 */
                    if (contained_parts >= 2)
                    {
                        ......
                        std::lock_guard loading_lock(mutex);
                        broken_parts_to_remove.push_back(part);
                    }
                    else
                    {
                        ......
                        std::lock_guard loading_lock(mutex);
                        broken_parts_to_detach.push_back(part);
                        ++suspicious_broken_parts;
                    }
                }

                return;
            }
            ......
        });
    }

    pool.wait();

    ......
    /* step10 */
    if (suspicious_broken_parts > settings->max_suspicious_broken_parts && !skip_sanity_checks)
        throw Exception("Suspiciously many (" + toString(suspicious_broken_parts) + ") broken parts to remove.",
            ErrorCodes::TOO_MANY_UNEXPECTED_DATA_PARTS);
    ......
    /* step9 */
    for (auto & part : broken_parts_to_remove)
        part->remove();
    for (auto & part : broken_parts_to_detach)
        part->renameToDetached("");

    /* step11 */
    /// Delete from the set of current parts those parts that are covered by another part (those parts that
    /// were merged), but that for some reason are still not deleted from the filesystem.
    /// Deletion of files will be performed later in the clearOldParts() method.

    ......

    /* step12 */
    calculateColumnSizesImpl();

    LOG_DEBUG(log, "Loaded data parts (" << data_parts_indexes.size() << " items)");
}

The code of this method is long too, so some of it is omitted above and only the main logic is kept. It breaks down into 12 logical steps, marked with step comments at the corresponding places in the code. Let's look at what each step does:

step1: from storage_policy (see the official documentation on storage policy configuration for what it contains), find out which disks hold the table's parts, and collect (part_name, disk_ptr) pairs, i.e. each part name paired with a pointer to the disk it resides on

step2: create a thread pool so the parts can be loaded in parallel; pool.scheduleOrThrowOnError([&, i]{}) submits a task to the pool (a standard-library sketch of this pattern follows)
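
A minimal sketch of this fan-out/wait pattern, with std::async standing in for ClickHouse's ThreadPool and hypothetical part names:

#include <future>
#include <iostream>
#include <string>
#include <vector>

// Stand-in for loading one part; the real code loads the part's
// columns, checksums and indexes here.
static void loadPart(const std::string & part_name)
{
    std::cout << "loading " << part_name << '\n';
}

int main()
{
    // Hypothetical part directory names.
    const std::vector<std::string> part_names = {"20190101_1_1_0", "20190101_2_2_0"};

    // Submit one task per part, then wait for all of them,
    // mirroring scheduleOrThrowOnError(...) followed by pool.wait().
    std::vector<std::future<void>> tasks;
    for (const auto & name : part_names)
        tasks.push_back(std::async(std::launch::async, loadPart, name));

    for (auto & task : tasks)
        task.get(); // rethrows any exception from the task
}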

step3: check whether a DELETE_ON_DESTROY_MARKER_PATH (delete-on-destroy.txt) file exists; if it does, put the part into broken_parts_to_detach for later handling and increment the broken-part counter suspicious_broken_parts

step4: the actual data loading. As the method name suggests, it loads the part's columns, checksums, indexes and so on; concretely this covers loadColumns(require_columns_checksums), loadChecksums(require_columns_checksums), loadIndexGranularity(), loadIndex(), loadRowsCount(), loadPartitionAndMinMaxIndex() and loadTTLInfos()

step5: if step4 hit an error, broken is true and the step5 logic is triggered to account for the broken part

step6: if the broken part's level is 0, it is put straight into broken_parts_to_remove for later processing

step7: if the level is not 0, the part is the result of merging several parts; the part's contains() method is used to count how many of the currently present parts it covers, giving contained_parts

step8: if contained_parts is greater than or equal to 2, the part was merged from multiple parts that are still present, so the damage was most likely caused by the merge and the part can be deleted outright without losing data, i.e. it joins broken_parts_to_remove; if contained_parts is less than 2, the cause is unknown and no other parts can reconstruct all of this part's data, so it joins broken_parts_to_detach (a sketch of the covering check follows)
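
For context: a new-style MergeTree part name encodes <partition_id>_<min_block>_<max_block>_<level> (old date-partitioned names carry an extra min/max date pair), and a part covers another when they share a partition and its block range encloses the other's. A simplified sketch of that covering check; the real MergeTreePartInfo::tryParsePartName is considerably stricter:

#include <cstdint>
#include <iostream>
#include <sstream>
#include <string>

struct PartInfo
{
    std::string partition_id;
    int64_t min_block = 0;
    int64_t max_block = 0;
    uint32_t level = 0;

    // A part covers another if they share a partition, its block range
    // encloses the other's, and it has been through at least as many merges.
    bool contains(const PartInfo & rhs) const
    {
        return partition_id == rhs.partition_id
            && min_block <= rhs.min_block
            && max_block >= rhs.max_block
            && level >= rhs.level;
    }
};

// Parse "<partition_id>_<min_block>_<max_block>_<level>", e.g. "20190101_1_10_2".
// Error handling is deliberately minimal for this sketch.
static bool tryParsePartName(const std::string & name, PartInfo & info)
{
    std::istringstream ss(name);
    std::string min_s, max_s, level_s;
    if (!std::getline(ss, info.partition_id, '_') || !std::getline(ss, min_s, '_')
        || !std::getline(ss, max_s, '_') || !std::getline(ss, level_s, '_'))
        return false;
    info.min_block = std::stoll(min_s);
    info.max_block = std::stoll(max_s);
    info.level = static_cast<uint32_t>(std::stoul(level_s));
    return true;
}

int main()
{
    PartInfo merged, piece;
    tryParsePartName("20190101_1_10_2", merged);
    tryParsePartName("20190101_3_5_1", piece);
    std::cout << std::boolalpha << merged.contains(piece) << '\n'; // true
}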

step9: actually execute remove() and renameToDetached()

step10: depending on how many broken parts there were, decide whether to keep going or to throw and terminate the server process; max_suspicious_broken_parts defaults to 10 (note that, as the markers in the code show, this check actually runs before the step9 removals)

step11: delete from the current part set those inactive parts that were already merged (i.e. are covered by another part) but for some reason have not yet been deleted from the filesystem; the actual file deletion happens later in clearOldParts()

step12: compute the current column sizes; the results show up in the system.columns system table (a sketch of the aggregation follows)
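
Conceptually, calculateColumnSizesImpl() sums each column's on-disk footprint across all loaded parts; the per-column totals are what system.columns later reports. A purely illustrative sketch of that aggregation (the field names are made up, loosely mirroring system.columns):

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct ColumnSize
{
    uint64_t data_compressed = 0;
    uint64_t data_uncompressed = 0;
    uint64_t marks = 0;
};

int main()
{
    // Hypothetical per-part, per-column sizes.
    const std::vector<std::map<std::string, ColumnSize>> parts = {
        {{"id", {100, 400, 8}}, {"value", {300, 1200, 8}}},
        {{"id", {50, 200, 8}}, {"value", {150, 600, 8}}},
    };

    // Aggregate across parts, as happens once all parts are loaded.
    std::map<std::string, ColumnSize> totals;
    for (const auto & part : parts)
        for (const auto & [name, size] : part)
        {
            totals[name].data_compressed += size.data_compressed;
            totals[name].data_uncompressed += size.data_uncompressed;
            totals[name].marks += size.marks;
        }

    for (const auto & [name, size] : totals)
        std::cout << name << ": compressed=" << size.data_compressed
                  << " uncompressed=" << size.data_uncompressed << '\n';
}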
