This post analyzes, at the source-code level, the parts of RocksDB's write path that involve WriteGroup. pipelined_write and 2PC are out of scope, and the code version is v7.7.4.
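The main job of the WAL is to make the data in the memtable recoverable after RocksDB exits abnormally, so by default every user write is also appended to the WAL. Each time the memtable backed by the current WAL is flushed to disk, a new WAL is created, so one memtable corresponds to one WAL. Every WAL ends up in its own WAL file; these files are stored under options.wal_dir and are named after their log_number.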
WriteImpl
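As we know, every writer thread owns a WriteBatch, and the write itself is carried out by DBImpl::WriteImpl(). A WriteGroup is also built and torn down inside that function, so this post uses it as the entry point.
Inside the function, after skipping the option checks and the unordered_write and pipelined_write branches, we reach the code that wraps the batch in a Writer: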
// DBImpl::WriteImpl()
WriteThread::Writer w(write_options, my_batch, callback, log_ref,
disable_memtable, batch_cnt, pre_release_callback,
post_memtable_callback);
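The layout of Writer was covered in the previous post; its core members are WriteBatch*, link_older, and link_newer, so it is not repeated here. Once the Writer is constructed, it is appended to the Writer list held by the DB object. The previous post called this list the WriteLink, and this post keeps that name.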
// DBImpl::WriteImpl()
write_thread_.JoinBatchGroup(&w);
JoinBatchGroup
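The full source of the function is as follows (jbg_ctx is a static AdaptationContext defined just above the function in the source file):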
void WriteThread::JoinBatchGroup(Writer* w) {
TEST_SYNC_POINT_CALLBACK("WriteThread::JoinBatchGroup:Start", w);
assert(w->batch != nullptr);
bool linked_as_leader = LinkOne(w, &newest_writer_);
if (linked_as_leader) {
SetState(w, STATE_GROUP_LEADER);
}
TEST_SYNC_POINT_CALLBACK("WriteThread::JoinBatchGroup:Wait", w);
TEST_SYNC_POINT_CALLBACK("WriteThread::JoinBatchGroup:Wait2", w);
if (!linked_as_leader) {
/**
* Wait util:
* 1) An existing leader pick us as the new leader when it finishes
* 2) An existing leader pick us as its follewer and
* 2.1) finishes the memtable writes on our behalf
* 2.2) Or tell us to finish the memtable writes in pralallel
* 3) (pipelined write) An existing leader pick us as its follower and
* finish book-keeping and WAL write for us, enqueue us as pending
* memtable writer, and
* 3.1) we become memtable writer group leader, or
* 3.2) an existing memtable writer group leader tell us to finish memtable
* writes in parallel.
*/
TEST_SYNC_POINT_CALLBACK("WriteThread::JoinBatchGroup:BeganWaiting", w);
AwaitState(w, STATE_GROUP_LEADER | STATE_MEMTABLE_WRITER_LEADER |
STATE_PARALLEL_MEMTABLE_WRITER | STATE_COMPLETED,
&jbg_ctx);
TEST_SYNC_POINT_CALLBACK("WriteThread::JoinBatchGroup:DoneWaiting", w);
}
}
// ...
bool WriteThread::LinkOne(Writer* w, std::atomic<Writer*>* newest_writer) {
assert(newest_writer != nullptr);
assert(w->state == STATE_INIT);
Writer* writers = newest_writer->load(std::memory_order_relaxed);
while (true) {
// If write stall in effect, and w->no_slowdown is not true,
// block here until stall is cleared. If its true, then return
// immediately
if (writers == &write_stall_dummy_) {
if (w->no_slowdown) {
w->status = Status::Incomplete("Write stall");
SetState(w, STATE_COMPLETED);
return false;
}
// Since no_slowdown is false, wait here to be notified of the write
// stall clearing
{
MutexLock lock(&stall_mu_);
writers = newest_writer->load(std::memory_order_relaxed);
if (writers == &write_stall_dummy_) {
TEST_SYNC_POINT_CALLBACK("WriteThread::WriteStall::Wait", w);
stall_cv_.Wait();
// Load newest_writers_ again since it may have changed
writers = newest_writer->load(std::memory_order_relaxed);
continue;
}
}
}
w->link_older = writers;
if (newest_writer->compare_exchange_weak(writers, w)) {
return (writers == nullptr);
}
}
}
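The logic is simple. LinkOne() first loads newest_writer_. If it equals &write_stall_dummy_, a write stall is in effect: with no_slowdown set, the Writer is completed immediately with Status::Incomplete; otherwise the thread waits on stall_cv_ until the stall clears. After that, the current Writer is pushed onto the WriteLink: its link_older is pointed at the old newest_writer_, and a compare-and-swap makes the Writer itself the new newest_writer_. If the old newest_writer_ was null, the current Writer is the first in the list, so the function returns true to indicate that it is the leader; otherwise it returns false.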
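The push itself can be illustrated in isolation. The following is a minimal sketch, not the RocksDB implementation: the Writer struct is reduced to its backward link, write-stall handling is left out, and the memory orderings are simply the defaults.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical, stripped-down Writer: only the backward link is modelled.
struct Writer {
  Writer* link_older = nullptr;
};

// Pushes w onto the list headed by *newest and returns true if the list was
// empty beforehand, i.e. w became the leader. Write-stall handling is omitted.
bool LinkOne(Writer* w, std::atomic<Writer*>* newest) {
  Writer* head = newest->load(std::memory_order_relaxed);
  while (true) {
    w->link_older = head;
    // On failure, compare_exchange_weak reloads the current value into head,
    // so the next iteration retries against the fresh head.
    if (newest->compare_exchange_weak(head, w)) {
      return head == nullptr;
    }
  }
}
```

Because a failed compare-and-swap refreshes head, the loop needs no explicit reload, which is also why the excerpt above only loads newest_writer_ once outside the stall path.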
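Once linked, a leader sets its state to STATE_GROUP_LEADER. A non-leader calls AwaitState() and blocks until a leader sets its state, which wakes it up. Leaving pipelined write aside, there are two reasons to be woken, as the comment says:
- The Writer is not in the current WriteGroup, and that group's leader picks it as the new leader.
- The Writer is in the WriteGroup and is woken by the leader, either because its write has been completed on its behalf or because it should write the memtable itself in parallel.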
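The goal passed to AwaitState() is a bitmask, so a single wait can cover all of these wake-up reasons. A rough sketch of that matching follows; the state names come from the excerpt, while the numeric values and the helper ReachedGoal are assumptions made for illustration, and the blocking itself is not modelled.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative state flags; the names follow the excerpt, the numeric values
// are an assumption of this sketch.
enum State : uint8_t {
  STATE_INIT = 1,
  STATE_GROUP_LEADER = 2,
  STATE_MEMTABLE_WRITER_LEADER = 4,
  STATE_PARALLEL_MEMTABLE_WRITER = 8,
  STATE_COMPLETED = 16,
};

// Hypothetical helper: true when `state` is one of the states in `goal_mask`.
bool ReachedGoal(uint8_t state, uint8_t goal_mask) {
  return (state & goal_mask) != 0;
}

// The mask a follower waits on in JoinBatchGroup().
constexpr uint8_t kJoinGoal = STATE_GROUP_LEADER |
                              STATE_MEMTABLE_WRITER_LEADER |
                              STATE_PARALLEL_MEMTABLE_WRITER | STATE_COMPLETED;
```

A Writer still in STATE_INIT does not match kJoinGoal, so it keeps waiting; any of the four states in the mask lets it return from JoinBatchGroup().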
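Note that JoinBatchGroup() blocks non-leaders, so the code that follows is executed only by the leader or by a Writer that a leader has woken. Let us start from the non-leader's point of view.
The non-leader's view
After JoinBatchGroup() returns, RocksDB checks the Writer's state twice: once for STATE_PARALLEL_MEMTABLE_WRITER and once for STATE_COMPLETED. We look at them in turn.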
STATE_PARALLEL_MEMTABLE_WRITER
First comes STATE_PARALLEL_MEMTABLE_WRITER. The source is:
// DBImpl::WriteImpl()
if (w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER) {
// we are a non-leader in a parallel group
if (w.ShouldWriteToMemtable()) {
PERF_TIMER_STOP(write_pre_and_post_process_time);
PERF_TIMER_GUARD(write_memtable_time);
ColumnFamilyMemTablesImpl column_family_memtables(
versions_->GetColumnFamilySet());
w.status = WriteBatchInternal::InsertInto(
&w, w.sequence, &column_family_memtables, &flush_scheduler_,
&trim_history_scheduler_,
write_options.ignore_missing_column_families, 0 /*log_number*/, this,
true /*concurrent_memtable_writes*/, seq_per_batch_, w.batch_cnt,
batch_per_txn_, write_options.memtable_insert_hint_per_batch);
PERF_TIMER_START(write_pre_and_post_process_time);
}
if (write_thread_.CompleteParallelMemTableWriter(&w)) {
// we're responsible for exit batch group
// TODO(myabandeh): propagate status to write_group
auto last_sequence = w.write_group->last_sequence;
for (auto* tmp_w : *(w.write_group)) {
assert(tmp_w);
if (tmp_w->post_memtable_callback) {
Status tmp_s =
(*tmp_w->post_memtable_callback)(last_sequence, disable_memtable);
// TODO: propagate the execution status of post_memtable_callback to
// caller.
assert(tmp_s.ok());
}
}
versions_->SetLastSequence(last_sequence);
MemTableInsertStatusCheck(w.status);
write_thread_.ExitAsBatchGroupFollower(&w);
}
assert(w.state == WriteThread::STATE_COMPLETED);
// STATE_COMPLETED conditional below handles exit
}
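STATE_PARALLEL_MEMTABLE_WRITER means that the Writer is in a WriteGroup but is not its leader, that the leader has already woken it, and that the group is being written in parallel mode. In other words, in parallel mode the leader wakes its followers and asks each of them to insert its own batch into the memtable concurrently; the insertion itself is implemented in WriteBatchInternal::InsertInto(). The second half of the code is guarded by CompleteParallelMemTableWriter(). Here is its comment: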
// Reports the completion of w's batch to the parallel group leader, and
// waits for the rest of the parallel batch to complete. Returns true
// if this thread is the last to complete, and hence should advance
// the sequence number and then call EarlyExitParallelGroup, false if
// someone else has already taken responsibility for that.
bool CompleteParallelMemTableWriter(Writer* w);
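The key point is that the function returns true if the current Writer is the last one in the parallel group to finish. So the second job of the code block above is this: the Writer that finishes last is responsible for cleaning up the WriteGroup, which includes advancing the global sequence number and calling ExitAsBatchGroupFollower(). ExitAsBatchGroupFollower() mainly helps select the next WriteGroup; it was mentioned in the previous post and is discussed in detail later.
This part is reachable only when parallel writes are enabled; otherwise it is skipped entirely.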
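The "last one to finish" test can be pictured as a countdown on the group's running counter. Below is a simplified sketch under that assumption: WriteGroup is reduced to the counter, CompleteParallelSketch is a hypothetical name, and the wait that the real function performs for the non-last Writers is omitted.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the group: only the outstanding-writer counter.
struct WriteGroup {
  std::atomic<size_t> running{0};
};

// Returns true only for the caller that finishes last. The real function
// additionally blocks the non-last callers until the group completes; that
// wait is left out here.
bool CompleteParallelSketch(WriteGroup* group) {
  // fetch_sub returns the value before the decrement.
  return group->running.fetch_sub(1) == 1;
}
```

With running initialised to the group size, exactly one caller observes the value 1 before its decrement, so exactly one thread takes over the cleanup.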
STATE_COMPLETED
As the name suggests, STATE_COMPLETED means the Writer's write has already been done, so this branch does nothing important and simply returns:
// DBImpl::WriteImpl()
if (w.state == WriteThread::STATE_COMPLETED) {
if (log_used != nullptr) {
*log_used = w.log_used;
}
if (seq_used != nullptr) {
*seq_used = w.sequence;
}
// write is complete and leader has updated sequence
return w.FinalStatus();
}
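Both branches above can only be reached by non-leaders. Next, we continue from the leader's point of view.
The leader's view
JoinBatchGroup() blocks every Writer except the leader. The leader's state is STATE_GROUP_LEADER, so it skips both branches above and runs the code that follows. It first creates an empty WriteGroup and then fills it step by step.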
// DBImpl::WriteImpl()
WriteThread::WriteGroup write_group;
last_batch_group_size_ =
write_thread_.EnterAsBatchGroupLeader(&w, &write_group);
EnterAsBatchGroupLeader() is the core function that builds the WriteGroup.
EnterAsBatchGroupLeader
With the option checks in the middle left out, the source of the function is:
size_t WriteThread::EnterAsBatchGroupLeader(Writer* leader,
WriteGroup* write_group) {
assert(leader->link_older == nullptr);
assert(leader->batch != nullptr);
assert(write_group != nullptr);
size_t size = WriteBatchInternal::ByteSize(leader->batch);
// Allow the group to grow up to a maximum size, but if the
// original write is small, limit the growth so we do not slow
// down the small write too much.
size_t max_size = max_write_batch_group_size_bytes;
const uint64_t min_batch_size_bytes = max_write_batch_group_size_bytes / 8;
if (size <= min_batch_size_bytes) {
max_size = size + min_batch_size_bytes;
}
leader->write_group = write_group;
write_group->leader = leader;
write_group->last_writer = leader;
write_group->size = 1;
Writer* newest_writer = newest_writer_.load(std::memory_order_acquire);
// This is safe regardless of any db mutex status of the caller. Previous
// calls to ExitAsGroupLeader either didn't call CreateMissingNewerLinks
// (they emptied the list and then we added ourself as leader) or had to
// explicitly wake us up (the list was non-empty when we added ourself,
// so we have already received our MarkJoined).
CreateMissingNewerLinks(newest_writer);
// Tricky. Iteration start (leader) is exclusive and finish
// (newest_writer) is inclusive. Iteration goes from old to new.
Writer* w = leader;
while (w != newest_writer) {
assert(w->link_newer);
w = w->link_newer;
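// Various if-checks omitted: if w's options are incompatible with the
// leader's, or adding w would exceed max_size, break. batch_size (the byte
// size of w->batch) is computed in the omitted code.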
w->write_group = write_group;
size += batch_size;
write_group->last_writer = w;
write_group->size++;
}
TEST_SYNC_POINT_CALLBACK("WriteThread::EnterAsBatchGroupLeader:End", w);
return size;
}
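First, the function loads newest_writer_ and calls WriteThread::CreateMissingNewerLinks(). JoinBatchGroup() only builds a singly linked list through the backward pointer link_older. CreateMissingNewerLinks() walks that list from the tail (the newest Writer) and fills in each Writer's link_newer, turning the singly linked list into a doubly linked one. The implementation is simple, so it is not reproduced here.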
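For readers who want to see the idea anyway, here is a small sketch of such a back-fill. It is an illustration written for this post rather than a copy of the RocksDB function, and the Writer struct contains only the two links.

```cpp
#include <cassert>

// Hypothetical, stripped-down Writer with both links.
struct Writer {
  Writer* link_older = nullptr;
  Writer* link_newer = nullptr;
};

// Walks from the newest writer toward older ones, filling in link_newer.
// Stops at the oldest writer or at the first one whose link_newer is already
// set, since everything older was linked by an earlier call.
void CreateMissingNewerLinks(Writer* head) {
  while (true) {
    Writer* next = head->link_older;
    if (next == nullptr || next->link_newer != nullptr) {
      break;
    }
    next->link_newer = head;
    head = next;
  }
}
```

The early stop is what keeps repeated calls cheap: each call only touches the Writers that were enqueued since the previous call.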
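Next comes the loop, which starts from the leader (the current Writer) and walks toward newer Writers. If w's options are incompatible with the leader's, the loop breaks, because all Writers in a WriteGroup must share the same options; it also breaks once adding w would push the group past max_size. Otherwise w joins the WriteGroup, and so on. At the end, last_writer marks the final Writer in the group.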
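The size cap is the part of the loop that is easiest to reason about with numbers. The sketch below models only that cap; the struct, MaxGroupBytes, and GroupSize are hypothetical names for this post, and the option-compatibility checks are deliberately not modelled.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical Writer: batch_bytes stands in for ByteSize(w->batch).
struct Writer {
  size_t batch_bytes = 0;
  Writer* link_newer = nullptr;
};

// Size cap for a group: a small leader only allows limited growth, so the
// small write is not delayed too much by a large group.
size_t MaxGroupBytes(size_t leader_bytes, size_t max_group_bytes) {
  const size_t min_batch = max_group_bytes / 8;
  return leader_bytes <= min_batch ? leader_bytes + min_batch
                                   : max_group_bytes;
}

// Counts how many writers, starting at the leader, fit into one group.
// Only the size check is modelled; the option-compatibility checks are not.
size_t GroupSize(Writer* leader, Writer* newest, size_t max_group_bytes) {
  size_t bytes = leader->batch_bytes;
  const size_t cap = MaxGroupBytes(bytes, max_group_bytes);
  size_t count = 1;
  for (Writer* w = leader; w != newest;) {
    w = w->link_newer;
    if (bytes + w->batch_bytes > cap) break;
    bytes += w->batch_bytes;
    ++count;
  }
  return count;
}
```

For example, with a 1 MiB limit and a 100-byte leader, the cap becomes 100 + 131072 bytes, so a 100000-byte follower still fits but a second one does not.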
Once the leader has built the WriteGroup, it performs the write.
Parallel or not
A WriteGroup is written either in parallel or not. The decision is made as follows:
// DBImpl::WriteImpl()
// Rules for when we can update the memtable concurrently
// 1. supported by memtable
// 2. Puts are not okay if inplace_update_support
// 3. Merges are not okay
//
// Rules 1..2 are enforced by checking the options
// during startup (CheckConcurrentWritesSupported), so if
// options.allow_concurrent_memtable_write is true then they can be
// assumed to be true. Rule 3 is checked for each batch. We could
// relax rules 2 if we could prevent write batches from referring
// more than once to a particular key.
bool parallel = immutable_db_options_.allow_concurrent_memtable_write &&
write_group.size > 1;
size_t total_count = 0;
size_t valid_batches = 0;
size_t total_byte_size = 0;
size_t pre_release_callback_cnt = 0;
for (auto* writer : write_group) {
assert(writer);
if (writer->CheckCallback(this)) {
valid_batches += writer->batch_cnt;
if (writer->ShouldWriteToMemtable()) {
total_count += WriteBatchInternal::Count(writer->batch);
parallel = parallel && !writer->batch->HasMerge();
}
total_byte_size = WriteBatchInternal::AppendedByteSize(
total_byte_size, WriteBatchInternal::ByteSize(writer->batch));
if (writer->pre_release_callback) {
pre_release_callback_cnt++;
}
}
}
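Besides requiring more than one Writer in the group, the code above enforces three rules:
- allow_concurrent_memtable_write must be set, and the memtable implementation must support concurrent inserts.
- Updates must be out-of-place; inplace_update_support must not be used.
- No Writer in the group may contain a Merge operation.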
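The per-group part of this decision can be condensed into a few lines. The following sketch assumes a hypothetical WriterInfo summary in place of the real Writer and only mirrors the boolean logic of the excerpt; the counters computed in the same loop are left out.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-writer summary for this sketch.
struct WriterInfo {
  bool callback_ok;         // stand-in for CheckCallback()
  bool writes_to_memtable;  // stand-in for ShouldWriteToMemtable()
  bool has_merge;           // stand-in for batch->HasMerge()
};

// Mirrors the `parallel` flag: the option must be on, the group must hold
// more than one writer, and no memtable-bound batch may contain a Merge.
bool CanWriteInParallel(bool allow_concurrent_memtable_write,
                        const std::vector<WriterInfo>& group) {
  bool parallel = allow_concurrent_memtable_write && group.size() > 1;
  for (const WriterInfo& w : group) {
    if (w.callback_ok && w.writes_to_memtable) {
      parallel = parallel && !w.has_merge;
    }
  }
  return parallel;
}
```

Note that a Merge only disqualifies the group if the batch containing it is actually going to the memtable, which matches the nesting of the checks in the excerpt.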
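Consider the second rule. With out-of-place updates, one key can have several versions in the memtable, for example (keyX, seq1, val1), (keyX, seq2, val2), (keyX, seq3, val3). When several Writers insert into the skip list concurrently, the entries for the same key are still ordered by sequence number no matter which Writer inserts first (RocksDB's internal key order puts the larger seq first), which keeps MVCC and transactions correct. With in-place updates, a key maps to a single node in the skip list, and when several Writers write concurrently there is no guarantee that the Writer with the largest seq is the last to update that node. That is why in-place updates rule out concurrent memtable writes.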
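To make the ordering argument concrete, here is a toy model that uses std::set in place of the skip list. Entry, EntryLess, and Latest are hypothetical names, and the comparator only imitates the "user key ascending, sequence number descending" order; it is a sketch of the reasoning, not of RocksDB's memtable.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>

// Hypothetical memtable entry: one version of a key.
struct Entry {
  std::string user_key;
  uint64_t seq;
  std::string value;
};

// User key ascending, then sequence number descending (newest first).
struct EntryLess {
  bool operator()(const Entry& a, const Entry& b) const {
    if (a.user_key != b.user_key) return a.user_key < b.user_key;
    return a.seq > b.seq;
  }
};

// Newest value visible for `key`, or an empty string if the key is absent.
std::string Latest(const std::set<Entry, EntryLess>& table,
                   const std::string& key) {
  auto it = table.lower_bound(Entry{key, UINT64_MAX, ""});
  if (it != table.end() && it->user_key == key) {
    return it->value;
  }
  return std::string();
}
```

Inserting the three versions of keyX in any order leaves the seq3 version first among them, so a lookup finds the newest value regardless of which Writer ran first. An in-place update has no such ordering to fall back on.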
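Writing the WAL
Unless disableWAL is set, the WriteGroup is written to the WAL next. The code branches on two_write_queues_, which this post treats as the 2PC and non-2PC cases.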
// DBImpl::WriteImpl()
if (!two_write_queues_) {
if (status.ok() && !write_options.disableWAL) {
assert(log_context.log_file_number_size);
LogFileNumberSize& log_file_number_size =
*(log_context.log_file_number_size);
PERF_TIMER_GUARD(write_wal_time);
io_s =
WriteToWAL(write_group, log_context.writer, log_used,
log_context.need_log_sync, log_context.need_log_dir_sync,
last_sequence + 1, log_file_number_size);
}
} else {
if (status.ok() && !write_options.disableWAL) {
PERF_TIMER_GUARD(write_wal_time);
// LastAllocatedSequence is increased inside WriteToWAL under
// wal_write_mutex_ to ensure ordered events in WAL
io_s = ConcurrentWriteToWAL(write_group, log_used, &last_sequence,
seq_inc);
} else {
// Otherwise we inc seq number for memtable writes
last_sequence = versions_->FetchAddLastAllocatedSequence(seq_inc);
}
}
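Without 2PC, WriteToWAL() is called; with 2PC, ConcurrentWriteToWAL() is called. Both receive the whole WriteGroup. How the WAL is actually written is not covered here; the next post analyzes the WAL write path in the source.
After the WAL write, the sequence number is advanced in a few steps that we skip for now. Then the memtable write begins.
Writing the memtable
The memtable write has two branches, parallel and non-parallel. The source is: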
// DBImpl::WriteImpl()
if (!parallel) {
// w.sequence will be set inside InsertInto
w.status = WriteBatchInternal::InsertInto(
write_group, current_sequence, column_family_memtables_.get(),
&flush_scheduler_, &trim_history_scheduler_,
write_options.ignore_missing_column_families,
0 /*recovery_log_number*/, this, parallel, seq_per_batch_,
batch_per_txn_);
} else {
write_group.last_sequence = last_sequence;
write_thread_.LaunchParallelMemTableWriters(&write_group);
in_parallel_group = true;
// Each parallel follower is doing each own writes. The leader should
// also do its own.
if (w.ShouldWriteToMemtable()) {
ColumnFamilyMemTablesImpl column_family_memtables(
versions_->GetColumnFamilySet());
assert(w.sequence == current_sequence);
w.status = WriteBatchInternal::InsertInto(
&w, w.sequence, &column_family_memtables, &flush_scheduler_,
&trim_history_scheduler_,
write_options.ignore_missing_column_families, 0 /*log_number*/,
this, true /*concurrent_memtable_writes*/, seq_per_batch_,
w.batch_cnt, batch_per_txn_,
write_options.memtable_insert_hint_per_batch);
}
}
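Before analyzing this, note that WriteBatchInternal::InsertInto() is the entry point for memtable writes. It has three overloads, for a WriteGroup, a WriteBatch, and a Writer respectively (parameter lists abbreviated as xxx):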
static Status InsertInto(
WriteThread::WriteGroup& write_group, xxx);
// Convenience form of InsertInto when you have only one batch
// next_seq returns the seq after last sequence number used in MemTable insert
static Status InsertInto(
const WriteBatch* batch, xxx);
static Status InsertInto(
WriteThread::Writer* writer, xxx);
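The non-parallel case is straightforward: the leader is solely responsible for the whole WriteGroup, so it calls the first overload of WriteBatchInternal::InsertInto() and writes the entire group by itself.
The parallel mode deserves a closer look.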
// DBImpl::WriteImpl()
write_thread_.LaunchParallelMemTableWriters(&write_group);
This call wakes every Writer in the WriteGroup by setting its state to STATE_PARALLEL_MEMTABLE_WRITER. The source is:
void WriteThread::LaunchParallelMemTableWriters(WriteGroup* write_group) {
assert(write_group != nullptr);
write_group->running.store(write_group->size);
for (auto w : *write_group) {
SetState(w, STATE_PARALLEL_MEMTABLE_WRITER);
}
}
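The woken Writers now return from JoinBatchGroup(), enter the branch described in the non-leader's view, and perform their own writes. The leader behaves the same way: it is no longer responsible for the whole WriteGroup and only has to finish its own write.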
// DBImpl::WriteImpl()
// Each parallel follower is doing each own writes. The leader should
// also do its own.
if (w.ShouldWriteToMemtable()) {
ColumnFamilyMemTablesImpl column_family_memtables(
versions_->GetColumnFamilySet());
assert(w.sequence == current_sequence);
w.status = WriteBatchInternal::InsertInto(
&w, w.sequence, &column_family_memtables, &flush_scheduler_,
&trim_history_scheduler_,
write_options.ignore_missing_column_families, 0 /*log_number*/,
this, true /*concurrent_memtable_writes*/, seq_per_batch_,
w.batch_cnt, batch_per_txn_,
write_options.memtable_insert_hint_per_batch);
}
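Whether the first overload or the third (Writer*) overload of WriteBatchInternal::InsertInto() is used, the core logic is the same. It is not explored here and will be analyzed in the post on memtable writes.
After the memtable write, the leader performs some log_sync work, which we skip for now.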
ExitAsBatchGroupLeader
At this point the WriteGroup's work is done and the cleanup begins. In parallel mode it is performed by the last Writer to finish its write; otherwise the leader performs it directly.
// DBImpl::WriteImpl()
bool should_exit_batch_group = true;
if (in_parallel_group) {
// CompleteParallelWorker returns true if this thread should
// handle exit, false means somebody else did
should_exit_batch_group = write_thread_.CompleteParallelMemTableWriter(&w);
}
if (should_exit_batch_group) {
if (status.ok()) {
for (auto* tmp_w : write_group) {
assert(tmp_w);
if (tmp_w->post_memtable_callback) {
Status tmp_s =
(*tmp_w->post_memtable_callback)(last_sequence, disable_memtable);
// TODO: propagate the execution status of post_memtable_callback to
// caller.
assert(tmp_s.ok());
}
}
// Note: if we are to resume after non-OK statuses we need to revisit how
// we reacts to non-OK statuses here.
versions_->SetLastSequence(last_sequence);
}
MemTableInsertStatusCheck(w.status);
write_thread_.ExitAsBatchGroupLeader(write_group, status);
}
The cleanup consists of two tasks:
- Update last_sequence_ in versions_ (the VersionSet).
- Help form the next WriteGroup.
The non-pipelined part of ExitAsBatchGroupLeader is excerpted below:
void WriteThread::ExitAsBatchGroupLeader(WriteGroup& write_group,
Status& status) {
Writer* leader = write_group.leader;
Writer* last_writer = write_group.last_writer;
// ...
if (enable_pipelined_write_) {
// ...
} else {
Writer* head = newest_writer_.load(std::memory_order_acquire);
if (head != last_writer ||
!newest_writer_.compare_exchange_strong(head, nullptr)) {
// Either last_writer wasn't the head during the load(), or it was the
// head during the load() but somebody else pushed onto the list before
// we did the compare_exchange_strong (causing it to fail). In the
// latter case compare_exchange_strong has the effect of re-reading
// its first param (head). No need to retry a failing CAS, because
// only a departing leader (which we are at the moment) can remove
// nodes from the list.
assert(head != last_writer);
// After walking link_older starting from head (if not already done)
// we will be able to traverse w->link_newer below. This function
// can only be called from an active leader, only a leader can
// clear newest_writer_, we didn't, and only a clear newest_writer_
// could cause the next leader to start their work without a call
// to MarkJoined, so we can definitely conclude that no other leader
// work is going on here (with or without db mutex).
CreateMissingNewerLinks(head);
assert(last_writer->link_newer != nullptr);
assert(last_writer->link_newer->link_older == last_writer);
last_writer->link_newer->link_older = nullptr;
// Next leader didn't self-identify, because newest_writer_ wasn't
// nullptr when they enqueued (we were definitely enqueued before them
// and are still in the list). That means leader handoff occurs when
// we call MarkJoined
SetState(last_writer->link_newer, STATE_GROUP_LEADER);
}
// else nobody else was waiting, although there might already be a new
// leader now
while (last_writer != leader) {
assert(last_writer);
last_writer->status = status;
// we need to read link_older before calling SetState, because as soon
// as it is marked committed the other thread's Await may return and
// deallocate the Writer.
auto next = last_writer->link_older;
SetState(last_writer, STATE_COMPLETED);
last_writer = next;
}
}  // end of the non-pipelined branch
}
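First, although EnterAsBatchGroupLeader() already called CreateMissingNewerLinks() to turn the WriteLink from a singly linked list into a doubly linked one, new Writers may well have joined the WriteLink while the WriteGroup was being written, and that new segment is still singly linked. So on exit CreateMissingNewerLinks() is called again to make sure the whole WriteLink is doubly linked.
Next, it selects the new leader, which is simply the Writer right after last_writer, and sets the new leader's link_older to null, which disconnects the old WriteGroup from the new one. If no Writer is queued behind last_writer, the compare-and-swap just resets newest_writer_ to null and this step is skipped.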
// WriteThread::ExitAsBatchGroupLeader()
last_writer->link_newer->link_older = nullptr;
SetState(last_writer->link_newer, STATE_GROUP_LEADER);
Finally, every Writer in the group other than the leader has its state set to STATE_COMPLETED, which means its write is done.
// WriteThread::ExitAsBatchGroupLeader()
while (last_writer != leader) {
assert(last_writer);
last_writer->status = status;
// we need to read link_older before calling SetState, because as soon
// as it is marked committed the other thread's Await may return and
// deallocate the Writer.
auto next = last_writer->link_older;
SetState(last_writer, STATE_COMPLETED);
last_writer = next;
}
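Note that once the new leader has been chosen, it is woken up, returns from JoinBatchGroup(), enters the leader's branch, and repeats the steps above to build a new WriteGroup, and the cycle continues.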
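The handoff can be summarised in a single-threaded sketch. Everything here is a simplification made for this post: Writer carries a boolean is_leader instead of a state, FillNewerLinks and HandOff are hypothetical names, and marking followers as completed is not modelled.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical Writer: is_leader stands in for STATE_GROUP_LEADER.
struct Writer {
  Writer* link_older = nullptr;
  Writer* link_newer = nullptr;
  bool is_leader = false;
};

// Same idea as CreateMissingNewerLinks: fill link_newer from newest to oldest
// until an already linked writer (or the end of the list) is reached.
void FillNewerLinks(Writer* head) {
  for (Writer* next = head->link_older;
       next != nullptr && next->link_newer == nullptr;
       head = next, next = head->link_older) {
    next->link_newer = head;
  }
}

// Detaches the finished group ending at last_writer and returns the next
// leader, or nullptr if no writer was queued behind the group.
Writer* HandOff(std::atomic<Writer*>* newest, Writer* last_writer) {
  Writer* head = newest->load(std::memory_order_acquire);
  if (head == last_writer &&
      newest->compare_exchange_strong(head, nullptr)) {
    return nullptr;  // list emptied; the next arrival elects itself
  }
  // Either head was never last_writer, or the failed CAS refreshed head.
  FillNewerLinks(head);
  Writer* next_leader = last_writer->link_newer;
  next_leader->link_older = nullptr;  // cut the old group off
  next_leader->is_leader = true;
  return next_leader;
}
```

The two return paths correspond to the two ways a Writer becomes leader in this post: by finding an empty list in LinkOne(), or by being promoted by the departing leader.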
This concludes the source analysis of WriteGroup. The next post will analyze writes to the WAL.