MergeTree
1. BackgroundJobExecutor
MergeTree is designed along the lines of an LSM tree: data is written as parts, which are immutable once written. To support updates and deletes, operations such as alter, merge and mutation are required. For this, ClickHouse implements a task framework that abstracts these operations into jobs managed by a unified pool. Let's look at how MergeTree implements it.
1.1 startup
void StorageMergeTree::startup()
{
......
try
{
background_executor.start();
startBackgroundMovesIfNeeded();
}
catch (...)
{
......
}
}
StorageMergeTree implements a startup method (some unrelated logic is omitted above). The key call is background_executor.start(): this is the background job executor, started together with the table engine, which later runs the merge and mutation jobs.
1.2 IBackgroundJobExecutor::start()
void IBackgroundJobExecutor::start()
{
std::lock_guard lock(scheduling_task_mutex);
if (!scheduling_task)
{
scheduling_task = global_context.getSchedulePool().createTask(
getBackgroundTaskName(), [this]{ jobExecutingTask(); });
}
scheduling_task->activateAndSchedule();
}
background_executor.start() calls start() of the base class IBackgroundJobExecutor: it creates the scheduling task if it does not exist yet, then activates and schedules it. In the background, this task repeatedly runs jobExecutingTask to process jobs.
1.3 jobExecutingTask
void IBackgroundJobExecutor::jobExecutingTask()
try
{
auto job_and_pool = getBackgroundJob();
if (job_and_pool) /// If we have job, then try to assign into background pool
{
......
pools[job_and_pool->pool_type].scheduleOrThrowOnError([this, pool_config, job{std::move(job_and_pool->job)}] ()
{
try /// We don't want exceptions in background pool
{
bool job_success = job();
/// Job done, decrement metric and reset no_work counter
CurrentMetrics::values[pool_config.tasks_metric]--;
if (job_success)
{
/// Job done, new empty space in pool, schedule background task
runTaskWithoutDelay();
}
else
{
/// Job done, but failed, schedule with backoff
scheduleTask(/* with_backoff = */ true);
}
}
catch (...)
{
tryLogCurrentException(__PRETTY_FUNCTION__);
CurrentMetrics::values[pool_config.tasks_metric]--;
scheduleTask(/* with_backoff = */ true);
}
});
......
}
catch (...) /// Exception while we looking for a task, reschedule
{
tryLogCurrentException(__PRETTY_FUNCTION__);
scheduleTask(/* with_backoff = */ true);
}
This method fetches the next job via getBackgroundJob, takes the job out of job_and_pool, and runs it by calling job(). Depending on the result job_success, scheduleTask decides whether the next execution should be delayed.
If the job succeeded, the next round is scheduled without delay (runTaskWithoutDelay);
if it failed, a delay is computed from the number of consecutive failed rounds and the task is put on the delayed_tasks queue, which is ordered by wake-up time; when a task's deadline is reached, the BackgroundJobExecutor scheduling thread is woken up to run it.
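The success/backoff scheduling described above can be sketched with a tiny delayed-task queue. The names, the base delay and the cap below are invented for illustration and are not ClickHouse's actual values:

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical sketch of the delayed_tasks idea: each task carries a wake-up
// time, and the queue is ordered so the earliest deadline is popped first.
struct DelayedTask
{
    std::chrono::steady_clock::time_point wake_at;
    std::function<void()> run;
    bool operator>(const DelayedTask & other) const { return wake_at > other.wake_at; }
};

// Backoff grows exponentially with the number of consecutive failed rounds,
// capped at an (assumed) maximum of 5 s.
std::chrono::milliseconds backoffDelay(unsigned consecutive_failures)
{
    auto delay = std::chrono::milliseconds(100) * (1u << std::min(consecutive_failures, 6u));
    return std::min(delay, std::chrono::milliseconds(5000));
}

// Min-heap by wake-up time: top() is always the task to run next.
using DelayedQueue = std::priority_queue<DelayedTask, std::vector<DelayedTask>, std::greater<>>;
```

When a job fails, the executor would push a task with `wake_at = now + backoffDelay(n)`; a success resets the failure counter so the next delay starts small again.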
getBackgroundJob ends up calling getDataProcessingJob of StorageMergeTree, which produces the merge/mutation entry to execute.
1.4 getDataProcessingJob
getDataProcessingJob here is StorageMergeTree's implementation; ClickHouse has another one in StorageReplicatedMergeTree, which is somewhat more complex and additionally produces fetch jobs.
std::optional<JobAndPool> StorageMergeTree::getDataProcessingJob()
{
if (shutdown_called)
return {};
if (merger_mutator.merges_blocker.isCancelled())
return {};
auto metadata_snapshot = getInMemoryMetadataPtr();
std::shared_ptr<MergeMutateSelectedEntry> merge_entry, mutate_entry;
auto share_lock = lockForShare(RWLockImpl::NO_QUERY, getSettings()->lock_acquire_timeout_for_background_operations);
merge_entry = selectPartsToMerge(metadata_snapshot, false, {}, false, nullptr, share_lock);
if (!merge_entry)
mutate_entry = selectPartsToMutate(metadata_snapshot, nullptr, share_lock);
if (merge_entry || mutate_entry)
{
return JobAndPool{[this, metadata_snapshot, merge_entry, mutate_entry, share_lock] () mutable
{
if (merge_entry)
return mergeSelectedParts(metadata_snapshot, false, {}, *merge_entry, share_lock);
else if (mutate_entry)
return mutateSelectedPart(metadata_snapshot, *mutate_entry, share_lock);
__builtin_unreachable();
}, PoolType::MERGE_MUTATE};
}
else if (auto lock = time_after_previous_cleanup.compareAndRestartDeferred(1))
{
return JobAndPool{[this, share_lock] ()
{
/// All use relative_data_path which changes during rename
/// so execute under share lock.
clearOldPartsFromFilesystem();
clearOldTemporaryDirectories();
clearOldWriteAheadLogs();
clearOldMutations();
clearEmptyParts();
return true;
}, PoolType::MERGE_MUTATE};
}
return {};
}
The method consists of two parts:
- select a merge_entry via selectPartsToMerge or, failing that, a mutate_entry via selectPartsToMutate, and submit the resulting job to the background thread pool for execution;
- if the cleanup period has elapsed, submit a cleanup job instead. It removes outdated parts, stale tmp directories, WAL segments whose parts have already been flushed to disk, finished mutations, and empty parts. As for why empty parts can exist at all, see the separate post on that topic.
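The decision order of getDataProcessingJob can be condensed into a small sketch. The enum and function below are illustrative stand-ins, not ClickHouse types:

```cpp
// Simplified decision order of getDataProcessingJob: a merge candidate is
// preferred over a mutation; if neither exists, a cleanup job is produced
// once per cleanup period; otherwise there is nothing to do.
enum class JobKind { Merge, Mutate, Cleanup, None };

JobKind chooseJob(bool has_merge_entry, bool has_mutate_entry, bool cleanup_period_elapsed)
{
    if (has_merge_entry)
        return JobKind::Merge;      // selectPartsToMerge found parts
    if (has_mutate_entry)
        return JobKind::Mutate;     // selectPartsToMutate found a part + mutations
    if (cleanup_period_elapsed)
        return JobKind::Cleanup;    // clearOldPartsFromFilesystem() etc.
    return JobKind::None;
}
```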
2. alter
void StorageMergeTree::alter(
const AlterCommands & commands,
const Context & context,
TableLockHolder & table_lock_holder)
{
auto table_id = getStorageID();
StorageInMemoryMetadata new_metadata = getInMemoryMetadata();
StorageInMemoryMetadata old_metadata = getInMemoryMetadata();
/// 1. Collect the mutation commands, if any
auto maybe_mutation_commands = commands.getMutationCommands(new_metadata, context.getSettingsRef().materialize_ttl_after_modify, context);
String mutation_file_name;
Int64 mutation_version = -1;
commands.apply(new_metadata, context);
/// 2. Check whether this alter only modifies settings (MODIFY_SETTING)
if (commands.isSettingsAlter())
/// If so, just changeSettings and alterTable
{
changeSettings(new_metadata.settings_changes, table_lock_holder);
DatabaseCatalog::instance().getDatabase(table_id.database_name)->alterTable(context, table_id, new_metadata);
}
else
/// Otherwise, after changeSettings and alterTable, a mutation must also be run and waited for synchronously
{
{
changeSettings(new_metadata.settings_changes, table_lock_holder);
checkTTLExpressions(new_metadata, old_metadata);
/// Reinitialize primary key because primary key column types might have changed.
setProperties(new_metadata, old_metadata);
DatabaseCatalog::instance().getDatabase(table_id.database_name)->alterTable(context, table_id, new_metadata);
if (!maybe_mutation_commands.empty())
mutation_version = startMutation(maybe_mutation_commands, mutation_file_name);
}
/// Always execute required mutations synchronously, because alters
/// should be executed in sequential order.
if (!maybe_mutation_commands.empty())
waitForMutation(mutation_version, mutation_file_name);
}
}
2.1 getMutationCommands
MutationCommands AlterCommands::getMutationCommands(StorageInMemoryMetadata metadata, bool materialize_ttl, const Context & context) const
{
MutationCommands result;
for (const auto & alter_cmd : *this)
if (auto mutation_cmd = alter_cmd.tryConvertToMutationCommand(metadata, context); mutation_cmd)
result.push_back(*mutation_cmd);
if (materialize_ttl)
{
for (const auto & alter_cmd : *this)
{
if (alter_cmd.isTTLAlter(metadata))
{
result.push_back(createMaterializeTTLCommand());
break;
}
}
}
return result;
}
tryConvertToMutationCommand tries to turn an alter command into a mutation command, but in some cases returns empty. Whether it does mainly depends on the isRequireMutationStage check and on whether the command is a valid mutation command at all: if no mutation stage is needed (for example, IF NOT EXISTS was specified in the SQL, or the column does not exist), it returns empty; otherwise it returns a mutation command matching the alter type, such as READ_COLUMN, DROP_COLUMN, DROP_INDEX or RENAME_COLUMN.
materialize_ttl comes from the setting materialize_ttl_after_modify (default true); if any of the commands is a TTL alter, a single MATERIALIZE TTL command is appended.
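A much-simplified sketch of this conversion logic, with invented types standing in for AlterCommand/MutationCommand:

```cpp
#include <optional>
#include <string>
#include <vector>

// Illustrative sketch, not ClickHouse's real types: each alter command either
// converts to a mutation command or is skipped, and a MATERIALIZE TTL command
// is appended at most once if any command is a TTL alter.
enum class AlterType { DropColumn, ModifySetting, ModifyTTL, Comment };

struct MutationCmd { std::string kind; };

// Only commands that rewrite data on disk need a mutation stage.
std::optional<MutationCmd> tryConvertToMutation(AlterType t, bool column_exists)
{
    if (t == AlterType::DropColumn && column_exists)
        return MutationCmd{"DROP_COLUMN"};
    return std::nullopt;  // e.g. setting changes touch only metadata
}

std::vector<MutationCmd> getMutations(const std::vector<AlterType> & cmds, bool materialize_ttl)
{
    std::vector<MutationCmd> result;
    bool has_ttl_alter = false;
    for (auto t : cmds)
    {
        if (auto m = tryConvertToMutation(t, /*column_exists=*/true))
            result.push_back(*m);
        if (t == AlterType::ModifyTTL)
            has_ttl_alter = true;
    }
    if (materialize_ttl && has_ttl_alter)
        result.push_back(MutationCmd{"MATERIALIZE_TTL"});
    return result;
}
```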
2.2 changeSettings
changeSettings does two things: it updates the in-memory metadata and re-applies the storage policy.
If the storage policy changed, startBackgroundMovesIfNeeded is called to move parts in the background.
2.3 alterTable
alterTable persists the in-memory metadata into the table's local metadata file, again by writing a tmp file first and then renaming it.
void DatabaseOrdinary::alterTable(const Context & context, const StorageID & table_id, const StorageInMemoryMetadata & metadata)
{
String table_name = table_id.table_name;
/// Read the definition of the table and replace the necessary parts with new ones.
String table_metadata_path = getObjectMetadataPath(table_name);
String table_metadata_tmp_path = table_metadata_path + ".tmp";
String statement;
{
ReadBufferFromFile in(table_metadata_path, METADATA_FILE_BUFFER_SIZE);
readStringUntilEOF(statement, in);
}
ParserCreateQuery parser;
ASTPtr ast = parseQuery(
parser,
statement.data(),
statement.data() + statement.size(),
"in file " + table_metadata_path,
0,
context.getSettingsRef().max_parser_depth);
applyMetadataChangesToCreateQuery(ast, metadata);
statement = getObjectDefinitionFromCreateQuery(ast);
{
WriteBufferFromFile out(table_metadata_tmp_path, statement.size(), O_WRONLY | O_CREAT | O_EXCL);
writeString(statement, out);
out.next();
if (context.getSettingsRef().fsync_metadata)
out.sync();
out.close();
}
commitAlterTable(table_id, table_metadata_tmp_path, table_metadata_path, statement, context);
}
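The write-tmp-then-rename pattern used here is generic and can be sketched independently of ClickHouse. The function name is invented, and the sketch is simplified (for instance, it does not fsync the directory):

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Write the full new definition to "<path>.tmp", flush it, then rename over
// the old file. On POSIX, rename() atomically replaces the destination, so a
// crash never leaves a half-written metadata file behind.
bool writeMetadataAtomically(const std::string & path, const std::string & statement)
{
    const std::string tmp_path = path + ".tmp";
    {
        std::ofstream out(tmp_path, std::ios::trunc);
        if (!out)
            return false;
        out << statement;
        out.flush();           // analogous to out.next() / out.sync() above
        if (!out)
            return false;
    }
    return std::rename(tmp_path.c_str(), path.c_str()) == 0;
}
```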
2.4 startMutation
This method wraps the mutation into a MergeTreeMutationEntry, stores it in current_mutations_by_version, and then wakes up the background executor.
Int64 StorageMergeTree::startMutation(const MutationCommands & commands, String & mutation_file_name)
{
/// Choose any disk, because when we load mutations we search them at each disk
/// where storage can be placed. See loadMutations().
auto disk = getStoragePolicy()->getAnyDisk();
Int64 version;
{
std::lock_guard lock(currently_processing_in_background_mutex);
MergeTreeMutationEntry entry(commands, disk, relative_data_path, insert_increment.get());
version = increment.get();
entry.commit(version);
mutation_file_name = entry.file_name;
auto insertion = current_mutations_by_id.emplace(mutation_file_name, std::move(entry));
current_mutations_by_version.emplace(version, insertion.first->second);
LOG_INFO(log, "Added mutation: {}", mutation_file_name);
}
background_executor.triggerTask();
return version;
}
When inserting the mutation entry into current_mutations_by_version (a map keyed by mutation version, with the mutation entry as value), the mutex must be held to prevent multiple threads from modifying the map concurrently.
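The locking and version bookkeeping can be sketched with a toy registry (not ClickHouse's actual classes):

```cpp
#include <map>
#include <mutex>
#include <string>

// Minimal sketch of startMutation's bookkeeping: allocate a monotonically
// increasing version under a mutex and record the entry in a version-ordered
// map (std::map keeps keys sorted, so iteration order == execution order).
struct MutationEntry { std::string file_name; };

class MutationRegistry
{
public:
    long add(const std::string & file_name)
    {
        std::lock_guard<std::mutex> lock(mutex);
        long version = ++increment;          // analogous to increment.get()
        by_version.emplace(version, MutationEntry{file_name});
        return version;
    }

    size_t pendingCount() const
    {
        std::lock_guard<std::mutex> lock(mutex);
        return by_version.size();
    }

private:
    mutable std::mutex mutex;
    long increment = 0;
    std::map<long, MutationEntry> by_version;  // version -> entry, sorted
};
```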
2.5 waitForMutation
void StorageMergeTree::waitForMutation(Int64 version, const String & file_name)
{
LOG_INFO(log, "Waiting mutation: {}", file_name);
{
auto check = [version, this]()
{
if (shutdown_called)
return true;
auto mutation_status = getIncompleteMutationsStatus(version);
return !mutation_status || mutation_status->is_done || !mutation_status->latest_fail_reason.empty();
};
std::unique_lock lock(mutation_wait_mutex);
mutation_wait_event.wait(lock, check);
}
/// At least we have our current mutation
std::set<String> mutation_ids;
mutation_ids.insert(file_name);
auto mutation_status = getIncompleteMutationsStatus(version, &mutation_ids);
checkMutationStatus(mutation_status, mutation_ids);
LOG_INFO(log, "Mutation {} done", file_name);
}
Whether a mutation has finished is determined by getIncompleteMutationsStatus; its core logic loops over the parts and checks whether every part's block number (data version) is already greater than the mutation version. If so, the mutation is done; otherwise it is still in progress.
waitForMutation also guarantees that mutations run serially. The current mutation's file name is added to the mutation_ids set so that its outcome can be checked: the condition variable can only tell that the mutation finished, not whether it succeeded or failed, so checkMutationStatus is called afterwards to obtain the result.
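The wait itself is the standard condition-variable-with-predicate pattern; a minimal standalone sketch (invented struct, with a single done flag standing in for the full status check):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Sketch of the waitForMutation pattern: a waiter blocks on a condition
// variable with a predicate ("is the mutation done / failed / are we shutting
// down?"); whoever finishes the mutation sets the flag and notifies, as
// mutation_wait_event does in StorageMergeTree.
struct MutationWaiter
{
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void wait()
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return done; });  // predicate re-checked on every wakeup
    }

    void finish()
    {
        {
            std::lock_guard<std::mutex> lock(m);
            done = true;
        }
        cv.notify_all();
    }
};
```

The predicate form of wait() handles spurious wakeups automatically, which is why the real code passes a check lambda rather than calling wait() bare.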
3. merge
The regular asynchronous background merges are driven by getDataProcessingJob via selectPartsToMerge, but StorageMergeTree also has a standalone merge method, used mainly by OPTIMIZE. Its logic is essentially the same as the background path, so it is a good entry point for understanding how merging works.
bool StorageMergeTree::merge(
bool aggressive,
const String & partition_id,
bool final,
bool deduplicate,
const Names & deduplicate_by_columns,
String * out_disable_reason,
bool optimize_skip_merged_partitions)
{
auto table_lock_holder = lockForShare(RWLockImpl::NO_QUERY, getSettings()->lock_acquire_timeout_for_background_operations);
auto metadata_snapshot = getInMemoryMetadataPtr();
SelectPartsDecision select_decision;
auto merge_mutate_entry = selectPartsToMerge(metadata_snapshot, aggressive, partition_id, final, out_disable_reason, table_lock_holder, optimize_skip_merged_partitions, &select_decision);
/// If there is nothing to merge then we treat this merge as successful (needed for optimize final optimization)
if (select_decision == SelectPartsDecision::NOTHING_TO_MERGE)
return true;
if (!merge_mutate_entry)
return false;
return mergeSelectedParts(metadata_snapshot, deduplicate, deduplicate_by_columns, *merge_mutate_entry, table_lock_holder);
}
3.1 selectPartsToMerge
This method selects the parts to merge and creates the merge_mutate_entry. There are two paths depending on whether a partition_id is supplied: with a partition_id, selectAllPartsToMergeWithinPartition selects parts within that single partition; without one, the selection is global. Finally the chosen parts are registered in merging_tagger, which tracks future_parts (parts being merged but not yet committed) and reserves enough disk space to process them.
std::shared_ptr<StorageMergeTree::MergeMutateSelectedEntry> StorageMergeTree::selectPartsToMerge(
const StorageMetadataPtr & metadata_snapshot, bool aggressive, const String & partition_id, bool final, String * out_disable_reason, TableLockHolder & /* table_lock_holder */, bool optimize_skip_merged_partitions, SelectPartsDecision * select_decision_out)
{
......
if (partition_id.empty())
{
UInt64 max_source_parts_size = merger_mutator.getMaxSourcePartsSizeForMerge();
bool merge_with_ttl_allowed = getTotalMergesWithTTLInMergeList() < data_settings->max_number_of_merges_with_ttl_in_pool;
/// TTL requirements are much stricter than for a regular merge, so
/// if a regular merge is not possible, a merge with TTL is not
/// possible either.
if (max_source_parts_size > 0)
{
select_decision = merger_mutator.selectPartsToMerge(
future_part,
aggressive,
max_source_parts_size,
can_merge,
merge_with_ttl_allowed,
out_disable_reason);
}
else if (out_disable_reason)
*out_disable_reason = "Current value of max_source_parts_size is zero";
}
else
{
while (true)
{
UInt64 disk_space = getStoragePolicy()->getMaxUnreservedFreeSpace();
select_decision = merger_mutator.selectAllPartsToMergeWithinPartition(
future_part, disk_space, can_merge, partition_id, final, metadata_snapshot, out_disable_reason, optimize_skip_merged_partitions);
......
}
}
......
merging_tagger = std::make_unique<CurrentlyMergingPartsTagger>(future_part, MergeTreeDataMergerMutator::estimateNeededDiskSpace(future_part.parts), *this, metadata_snapshot, false);
return std::make_shared<MergeMutateSelectedEntry>(future_part, std::move(merging_tagger), MutationCommands{});
}
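The merging_tagger mentioned above is essentially an RAII guard. A simplified standalone sketch of the idea (invented class; the real CurrentlyMergingPartsTagger also handles disk space reservation):

```cpp
#include <set>
#include <string>

// While the tagger is alive, the chosen parts are marked busy so no other
// merge/mutation selects them; the destructor releases them even if the
// merge job throws.
class PartsTagger
{
public:
    PartsTagger(std::set<std::string> & busy_, std::set<std::string> parts_)
        : busy(busy_), parts(std::move(parts_))
    {
        for (const auto & p : parts)
            busy.insert(p);
    }

    ~PartsTagger()
    {
        for (const auto & p : parts)
            busy.erase(p);
    }

private:
    std::set<std::string> & busy;  // shared "currently merging" set
    std::set<std::string> parts;   // parts owned by this merge
};
```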
3.2 selectPartsToMerge
This method picks, among all parts, the ones most worth merging. Part selection in ClickHouse is done through a MergeSelector; the selectors implementing the IMergeSelector interface include SimpleMergeSelector, TTLDeleteMergeSelector and TTLRecompressMergeSelector, and this method uses the TTL selectors together with SimpleMergeSelector. These selectors apply heuristics to pick an optimal PartsRange for the background threads to merge.
SelectPartsDecision MergeTreeDataMergerMutator::selectPartsToMerge(
FutureMergedMutatedPart & future_part,
bool aggressive,
size_t max_total_size_to_merge,
const AllowedMergingPredicate & can_merge_callback,
bool merge_with_ttl_allowed,
String * out_disable_reason)
{
......
IMergeSelector::PartsRange parts_to_merge;
if (metadata_snapshot->hasAnyTTL() && merge_with_ttl_allowed && !ttl_merges_blocker.isCancelled())
{
/// TTL delete is preferred to recompression
TTLDeleteMergeSelector delete_ttl_selector(
next_delete_ttl_merge_times_by_partition,
current_time,
data_settings->merge_with_ttl_timeout,
data_settings->ttl_only_drop_parts);
parts_to_merge = delete_ttl_selector.select(parts_ranges, max_total_size_to_merge);
if (!parts_to_merge.empty())
{
future_part.merge_type = MergeType::TTL_DELETE;
}
else if (metadata_snapshot->hasAnyRecompressionTTL())
{
TTLRecompressMergeSelector recompress_ttl_selector(
next_recompress_ttl_merge_times_by_partition,
current_time,
data_settings->merge_with_recompression_ttl_timeout,
metadata_snapshot->getRecompressionTTLs());
parts_to_merge = recompress_ttl_selector.select(parts_ranges, max_total_size_to_merge);
if (!parts_to_merge.empty())
future_part.merge_type = MergeType::TTL_RECOMPRESS;
}
}
if (parts_to_merge.empty())
{
SimpleMergeSelector::Settings merge_settings;
if (aggressive)
merge_settings.base = 1;
parts_to_merge = SimpleMergeSelector(merge_settings)
.select(parts_ranges, max_total_size_to_merge);
/// Do not allow to "merge" part with itself for regular merges, unless it is a TTL-merge where it is ok to remove some values with expired ttl
if (parts_to_merge.size() == 1)
throw Exception("Logical error: merge selector returned only one part to merge", ErrorCodes::LOGICAL_ERROR);
if (parts_to_merge.empty())
{
if (out_disable_reason)
*out_disable_reason = "There is no need to merge parts according to merge selector algorithm";
return SelectPartsDecision::CANNOT_SELECT;
}
}
......
return SelectPartsDecision::SELECTED;
}
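To get a feel for what such a selector does, here is a toy size-based selector; the fixed window and the balance threshold are invented for illustration and are far cruder than SimpleMergeSelector's actual heuristic:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Scan windows of adjacent parts and pick the first window whose total size
// fits the budget and whose largest part is not too dominant (merging one
// huge part with tiny ones mostly rewrites the huge part for little gain).
// Returns a half-open index range [begin, end), or {0, 0} if nothing fits.
std::pair<size_t, size_t> selectRange(const std::vector<size_t> & part_sizes,
                                      size_t max_total, size_t window)
{
    for (size_t begin = 0; begin + window <= part_sizes.size(); ++begin)
    {
        size_t total = 0, largest = 0;
        for (size_t i = begin; i < begin + window; ++i)
        {
            total += part_sizes[i];
            largest = std::max(largest, part_sizes[i]);
        }
        // Require a balanced window: largest part at most half the total.
        if (total <= max_total && largest * 2 <= total)
            return {begin, begin + window};
    }
    return {0, 0};
}
```

A real selector varies the window length, weighs part age, and scores candidate ranges against each other instead of taking the first acceptable one.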
3.3 selectAllPartsToMergeWithinPartition
This method runs only when a partition_id is passed in. Instead of using a MergeSelector to pick a PartsRange, it fetches all parts belonging to that partition via selectAllPartsFromPartition and merges them directly. Before that, some preconditions are checked, for example whether the partition consists of only one part and whether there is enough disk space for the resulting range; only if all checks pass is the partition-wide merge executed.
SelectPartsDecision MergeTreeDataMergerMutator::selectAllPartsToMergeWithinPartition(
FutureMergedMutatedPart & future_part,
UInt64 & available_disk_space,
const AllowedMergingPredicate & can_merge,
const String & partition_id,
bool final,
const StorageMetadataPtr & metadata_snapshot,
String * out_disable_reason,
bool optimize_skip_merged_partitions)
{
MergeTreeData::DataPartsVector parts = selectAllPartsFromPartition(partition_id);
......
auto it = parts.begin();
auto prev_it = it;
UInt64 sum_bytes = 0;
while (it != parts.end())
{
/// For the case of one part, we check that it can be merged "with itself".
if ((it != parts.begin() || parts.size() == 1) && !can_merge(*prev_it, *it, out_disable_reason))
{
return SelectPartsDecision::CANNOT_SELECT;
}
sum_bytes += (*it)->getBytesOnDisk();
prev_it = it;
++it;
}
/// Enough disk space to cover the new merge with a margin.
auto required_disk_space = sum_bytes * DISK_USAGE_COEFFICIENT_TO_SELECT;
......
LOG_DEBUG(log, "Selected {} parts from {} to {}", parts.size(), parts.front()->name, parts.back()->name);
future_part.assign(std::move(parts));
available_disk_space -= required_disk_space;
return SelectPartsDecision::SELECTED;
}
4. mutate
Like merge, the regular asynchronous mutate path also lives in getDataProcessingJob and is driven by selectPartsToMutate, but StorageMergeTree additionally exposes a standalone mutate method, used for data-modifying (non-schema) operations such as ALTER TABLE ... UPDATE/DELETE.
void StorageMergeTree::mutate(const MutationCommands & commands, const Context & query_context)
{
String mutation_file_name;
Int64 version = startMutation(commands, mutation_file_name);
if (query_context.getSettingsRef().mutations_sync > 0)
waitForMutation(version, mutation_file_name);
}
4.1 selectPartsToMutate
Every flow involving mutations wakes the background thread via background_executor.triggerTask(); the actual logic starts in selectPartsToMutate.
std::shared_ptr<StorageMergeTree::MergeMutateSelectedEntry> StorageMergeTree::selectPartsToMutate(
const StorageMetadataPtr & metadata_snapshot, String * /* disable_reason */, TableLockHolder & /* table_lock_holder */)
{
std::lock_guard lock(currently_processing_in_background_mutex);
size_t max_ast_elements = global_context.getSettingsRef().max_expanded_ast_elements;
FutureMergedMutatedPart future_part;
if (storage_settings.get()->assign_part_uuids)
future_part.uuid = UUIDHelpers::generateV4();
MutationCommands commands;
CurrentlyMergingPartsTaggerPtr tagger;
if (current_mutations_by_version.empty())
return {};
auto mutations_end_it = current_mutations_by_version.end();
for (const auto & part : getDataPartsVector())
{
if (currently_merging_mutating_parts.count(part))
continue;
auto mutations_begin_it = current_mutations_by_version.upper_bound(part->info.getDataVersion());
if (mutations_begin_it == mutations_end_it)
continue;
size_t max_source_part_size = merger_mutator.getMaxSourcePartSizeForMutation();
if (max_source_part_size < part->getBytesOnDisk())
{
LOG_DEBUG(log, "Current max source part size for mutation is {} but part size {}. Will not mutate part {}. "
"Max size depends not only on available space, but also on settings "
"'number_of_free_entries_in_pool_to_execute_mutation' and 'background_pool_size'",
max_source_part_size, part->getBytesOnDisk(), part->name);
continue;
}
size_t current_ast_elements = 0;
for (auto it = mutations_begin_it; it != mutations_end_it; ++it)
{
size_t commands_size = 0;
MutationCommands commands_for_size_validation;
for (const auto & command : it->second.commands)
{
if (command.type != MutationCommand::Type::DROP_COLUMN
&& command.type != MutationCommand::Type::DROP_INDEX
&& command.type != MutationCommand::Type::RENAME_COLUMN)
{
commands_for_size_validation.push_back(command);
}
else
{
commands_size += command.ast->size();
}
}
if (!commands_for_size_validation.empty())
{
MutationsInterpreter interpreter(
shared_from_this(), metadata_snapshot, commands_for_size_validation, global_context, false);
commands_size += interpreter.evaluateCommandsSize();
}
if (current_ast_elements + commands_size >= max_ast_elements)
break;
current_ast_elements += commands_size;
commands.insert(commands.end(), it->second.commands.begin(), it->second.commands.end());
}
auto new_part_info = part->info;
new_part_info.mutation = current_mutations_by_version.rbegin()->first;
future_part.parts.push_back(part);
future_part.part_info = new_part_info;
future_part.name = part->getNewName(new_part_info);
future_part.type = part->getType();
tagger = std::make_unique<CurrentlyMergingPartsTagger>(future_part, MergeTreeDataMergerMutator::estimateNeededDiskSpace({part}), *this, metadata_snapshot, true);
return std::make_shared<MergeMutateSelectedEntry>(future_part, std::move(tagger), commands);
}
return {};
}
When a mutation is created, it takes the current maximum block number plus one as its mutation version. Multiple mutations are stored in current_mutations_by_version ordered by version and are executed from the smallest version to the largest. This ordering actually applies at the part level: if two mutations both affect partA, partA must apply the first one before the second. ClickHouse optimizes this by combining all pending mutations that apply to the same part and executing them in one pass. The loop also has several bail-out conditions, such as the source part being too large or the number of AST elements exceeding the limit.
Finally, the entry is registered in a CurrentlyMergingPartsTagger, both to prevent duplicate execution and to reserve the disk space needed for processing.
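The upper_bound-based matching of pending mutations to a part can be sketched as follows (toy types; the rule is that a mutation with version V applies to every part whose data version is below V):

```cpp
#include <map>
#include <string>
#include <vector>

// Mirror of the lookup in selectPartsToMutate: upper_bound(data_version)
// yields the first mutation strictly newer than the part, and all later
// entries are batched together, as ClickHouse merges them into one pass.
std::vector<long> applicableMutations(const std::map<long, std::string> & by_version,
                                      long part_data_version)
{
    std::vector<long> versions;
    for (auto it = by_version.upper_bound(part_data_version); it != by_version.end(); ++it)
        versions.push_back(it->first);  // collected in ascending version order
    return versions;
}
```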