RocksDB Source Code Study (6): Writes (2) - WriteGroup

_CLAY_

Tags: database, C++

Article link: https://blog.csdn.net/weixin_46322986/article/details/128104406

Contents

- WriteImpl

This post analyzes, at the source level, the parts of the RocksDB write path that involve WriteGroup. It does not cover pipelined_write or 2pc. The code version used is v7.7.4.

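The main job of the WAL is to let RocksDB recover, after an abnormal exit, the memtable data that existed before the failure. For that reason, by default every user write is also written to the WAL. Each time the memtable covered by the current WAL is flushed to disk, a new WAL is created, so one memtable corresponds to one WAL. Each WAL ends up in its own WAL file. These files live in options.wal_dir and are named after their log_number.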

WriteImpl

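As we know, each writer thread carries one WriteBatch, and the write itself is carried out by DBImpl::WriteImpl(). The WriteGroup is also built and torn down inside this function, so this post uses it as the entry point.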

Inside the function, after the option checks and the unordered_write and pipelined_write branches, we reach the code that wraps the batch in a Writer:

// DBImpl::WriteImpl()
WriteThread::Writer w(write_options, my_batch, callback, log_ref,
                      disable_memtable, batch_cnt, pre_release_callback,
                      post_memtable_callback);

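The internals of Writer were covered in the previous post. Its core members are WriteBatch*, link_older, and link_newer, so they are not repeated here. Once the Writer is constructed, it is added to the Writer linked list held by the DB object. The previous post called this list the WriteLink, and this post does the same.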

// DBImpl::WriteImpl()
write_thread_.JoinBatchGroup(&w);

JoinBatchGroup

The full source of this function is as follows:

void WriteThread::JoinBatchGroup(Writer* w) {
  TEST_SYNC_POINT_CALLBACK("WriteThread::JoinBatchGroup:Start", w);
  assert(w->batch != nullptr);

  bool linked_as_leader = LinkOne(w, &newest_writer_);

  if (linked_as_leader) {
    SetState(w, STATE_GROUP_LEADER);
  }

  TEST_SYNC_POINT_CALLBACK("WriteThread::JoinBatchGroup:Wait", w);
  TEST_SYNC_POINT_CALLBACK("WriteThread::JoinBatchGroup:Wait2", w);

  if (!linked_as_leader) {
    /**
     * Wait util:
     * 1) An existing leader pick us as the new leader when it finishes
     * 2) An existing leader pick us as its follewer and
     * 2.1) finishes the memtable writes on our behalf
     * 2.2) Or tell us to finish the memtable writes in pralallel
     * 3) (pipelined write) An existing leader pick us as its follower and
     *    finish book-keeping and WAL write for us, enqueue us as pending
     *    memtable writer, and
     * 3.1) we become memtable writer group leader, or
     * 3.2) an existing memtable writer group leader tell us to finish memtable
     *      writes in parallel.
     */
    TEST_SYNC_POINT_CALLBACK("WriteThread::JoinBatchGroup:BeganWaiting", w);
    AwaitState(w, STATE_GROUP_LEADER | STATE_MEMTABLE_WRITER_LEADER |
                      STATE_PARALLEL_MEMTABLE_WRITER | STATE_COMPLETED,
               &jbg_ctx);
    TEST_SYNC_POINT_CALLBACK("WriteThread::JoinBatchGroup:DoneWaiting", w);
  }
}

// ...
bool WriteThread::LinkOne(Writer* w, std::atomic<Writer*>* newest_writer) {
  assert(newest_writer != nullptr);
  assert(w->state == STATE_INIT);
  Writer* writers = newest_writer->load(std::memory_order_relaxed);
  while (true) {
    // If write stall in effect, and w->no_slowdown is not true,
    // block here until stall is cleared. If its true, then return
    // immediately
    if (writers == &write_stall_dummy_) {
      if (w->no_slowdown) {
        w->status = Status::Incomplete("Write stall");
        SetState(w, STATE_COMPLETED);
        return false;
      }
      // Since no_slowdown is false, wait here to be notified of the write
      // stall clearing
      {
        MutexLock lock(&stall_mu_);
        writers = newest_writer->load(std::memory_order_relaxed);
        if (writers == &write_stall_dummy_) {
          TEST_SYNC_POINT_CALLBACK("WriteThread::WriteStall::Wait", w);
          stall_cv_.Wait();
          // Load newest_writers_ again since it may have changed
          writers = newest_writer->load(std::memory_order_relaxed);
          continue;
        }
      }
    }
    w->link_older = writers;
    if (newest_writer->compare_exchange_weak(writers, w)) {
      return (writers == nullptr);
    }
  }
}

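The logic is simple. It first loads newest_writer_. If a write stall is in effect (newest_writer_ points at write_stall_dummy_), it either returns immediately or waits, depending on no_slowdown. It then inserts the current Writer into the WriteLink, which just means pointing link_older at the old newest_writer_ and installing itself as the new newest_writer_ with a compare-and-swap. If the old newest_writer_ was null, the current Writer is the first one in the list, so the function returns true to say it is the Leader. Otherwise it returns false.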
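The lock-free insertion described above can be sketched in isolation. This is a minimal sketch, not RocksDB's code: `Node` is a hypothetical stand-in for `WriteThread::Writer`, and the write-stall handling is left out entirely.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical, simplified stand-in for WriteThread::Writer.
struct Node {
  Node* link_older = nullptr;
  Node* link_newer = nullptr;
};

// Pushes n onto the list whose newest element is *newest.
// Returns true if the list was empty, i.e. n becomes the leader.
bool LinkOne(Node* n, std::atomic<Node*>* newest) {
  Node* head = newest->load(std::memory_order_relaxed);
  while (true) {
    n->link_older = head;
    // On failure, compare_exchange_weak stores the current value in head,
    // so the next iteration retries against the up-to-date list head.
    if (newest->compare_exchange_weak(head, n)) {
      return head == nullptr;
    }
  }
}
```

Pushing two nodes onto an empty list makes the first call return true and the second return false, with the second node's link_older pointing at the first.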

After insertion, a Leader sets its state to STATE_GROUP_LEADER. A non-Leader calls AwaitState() to block itself until a Leader sets its state, which wakes it up. Leaving pipelined writes aside, there are two wake-up conditions, as the comment says:

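- It is not in a WriteGroup, and the Leader of the current WriteGroup picks it as the new Leader.
- It is in a WriteGroup and its Leader wakes it, either after finishing the write on its behalf or to ask it to write the memtable in parallel.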
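The block-until-woken behavior can be sketched with a mutex and a condition variable. This is a simplified illustration under stated assumptions: the `Waiter` type, the state values, and `WakeFollowerWith` are hypothetical, and RocksDB's real AwaitState() is more elaborate than a plain condition-variable wait.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// State bits; the numeric values are illustrative only.
enum : uint8_t {
  kInit = 1,
  kGroupLeader = 2,
  kParallelMemtableWriter = 4,
  kCompleted = 8,
};

// Hypothetical stand-in for the state-related part of a Writer.
struct Waiter {
  std::mutex mu;
  std::condition_variable cv;
  uint8_t state = kInit;
};

// Blocks until w->state matches any bit in goal_mask, then returns the state.
uint8_t AwaitState(Waiter* w, uint8_t goal_mask) {
  std::unique_lock<std::mutex> lock(w->mu);
  w->cv.wait(lock, [&] { return (w->state & goal_mask) != 0; });
  return w->state;
}

// Publishes a new state and wakes the waiting thread, if there is one.
void SetState(Waiter* w, uint8_t new_state) {
  std::lock_guard<std::mutex> lock(w->mu);
  w->state = new_state;
  w->cv.notify_all();
}

// Demo: a follower thread blocks until the "leader" sets its state.
uint8_t WakeFollowerWith(uint8_t new_state) {
  Waiter w;
  uint8_t seen = 0;
  std::thread follower([&] {
    seen = AwaitState(&w, kGroupLeader | kParallelMemtableWriter | kCompleted);
  });
  SetState(&w, new_state);
  follower.join();
  return seen;
}
```

The goal mask plays the same role as the OR-ed state list passed to AwaitState() in JoinBatchGroup(): the follower does not care which of the states it is given, only that one of them arrives.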

Note that JoinBatchGroup() blocks non-Leaders, so the code that follows runs only in a Leader or in a Writer that a Leader has woken. We start with the non-Leader's view.

Non-Leader view

After JoinBatchGroup() returns, RocksDB checks the Writer's state twice: once for STATE_PARALLEL_MEMTABLE_WRITER and once for STATE_COMPLETED. We take them in turn.

STATE_PARALLEL_MEMTABLE_WRITER

First comes STATE_PARALLEL_MEMTABLE_WRITER. The source is:

// DBImpl::WriteImpl()
if (w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER) {
    // we are a non-leader in a parallel group
    if (w.ShouldWriteToMemtable()) {
      PERF_TIMER_STOP(write_pre_and_post_process_time);
      PERF_TIMER_GUARD(write_memtable_time);

      ColumnFamilyMemTablesImpl column_family_memtables(
          versions_->GetColumnFamilySet());
      w.status = WriteBatchInternal::InsertInto(
          &w, w.sequence, &column_family_memtables, &flush_scheduler_,
          &trim_history_scheduler_,
          write_options.ignore_missing_column_families, 0 /*log_number*/, this,
          true /*concurrent_memtable_writes*/, seq_per_batch_, w.batch_cnt,
          batch_per_txn_, write_options.memtable_insert_hint_per_batch);

      PERF_TIMER_START(write_pre_and_post_process_time);
    }

    if (write_thread_.CompleteParallelMemTableWriter(&w)) {
      // we're responsible for exit batch group
      // TODO(myabandeh): propagate status to write_group
      auto last_sequence = w.write_group->last_sequence;
      for (auto* tmp_w : *(w.write_group)) {
        assert(tmp_w);
        if (tmp_w->post_memtable_callback) {
          Status tmp_s =
              (*tmp_w->post_memtable_callback)(last_sequence, disable_memtable);
          // TODO: propagate the execution status of post_memtable_callback to
          // caller.
          assert(tmp_s.ok());
        }
      }
      versions_->SetLastSequence(last_sequence);
      MemTableInsertStatusCheck(w.status);
      write_thread_.ExitAsBatchGroupFollower(&w);
    }
    assert(w.state == WriteThread::STATE_COMPLETED);
    // STATE_COMPLETED conditional below handles exit
}

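STATE_PARALLEL_MEMTABLE_WRITER means that this Writer is in a WriteGroup but is not its Leader, that the Leader has already woken it, and that the group is being written in parallel. So this code handles the parallel case: the Leader wakes its Followers and asks each of them to insert its own batch into the memtable concurrently, through WriteBatchInternal::InsertInto(). The second half of the code is guarded by CompleteParallelMemTableWriter(), so let's look at its comment first.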

// Reports the completion of w's batch to the parallel group leader, and
// waits for the rest of the parallel batch to complete.  Returns true
// if this thread is the last to complete, and hence should advance
// the sequence number and then call EarlyExitParallelGroup, false if
// someone else has already taken responsibility for that.
bool CompleteParallelMemTableWriter(Writer* w);

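The key point is that it returns true if the current Writer is the last one in the parallel group to finish. So the second job of the code block above is this: if the current Writer finishes last, it is responsible for the WriteGroup's cleanup, which includes updating the global sequence number and calling ExitAsBatchGroupFollower(). ExitAsBatchGroupFollower() mainly helps hand over to the next WriteGroup. It was mentioned in the previous post and is discussed further below.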
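The "last one to finish does the cleanup" idea can be sketched with a shared atomic counter, much like the `running` field that LaunchParallelMemTableWriters() initializes later in this post. This is a hedged sketch: `Group`, `FinishOne`, and `CountLastFinishers` are hypothetical names, and the sketch omits the part where the writers that are not last wait for the group to complete.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the WriteGroup's shared completion counter.
struct Group {
  std::atomic<size_t> running{0};
};

// Returns true only for the caller that finishes last.
bool FinishOne(Group* g) {
  // fetch_sub returns the value held before the decrement, so exactly one
  // caller observes 1.
  return g->running.fetch_sub(1) == 1;
}

// Runs n workers and counts how many of them saw themselves as "last".
size_t CountLastFinishers(size_t n) {
  Group g;
  g.running.store(n);
  std::atomic<size_t> last_count{0};
  std::vector<std::thread> workers;
  for (size_t i = 0; i < n; ++i) {
    workers.emplace_back([&] {
      // Each worker would insert its own batch into the memtable here.
      if (FinishOne(&g)) {
        last_count.fetch_add(1);
      }
    });
  }
  for (auto& t : workers) {
    t.join();
  }
  return last_count.load();
}
```

However the threads interleave, exactly one of them takes the cleanup role, which is why the cleanup code can run without any further coordination.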

Of course, this branch is reachable only when parallel writes are enabled. Otherwise it is skipped.

STATE_COMPLETED

As the name suggests, STATE_COMPLETED means the Writer's work is already done, so this branch does nothing of note and simply returns:

// DBImpl::WriteImpl()  
if (w.state == WriteThread::STATE_COMPLETED) {
    if (log_used != nullptr) {
        *log_used = w.log_used;
    }
    if (seq_used != nullptr) {
        *seq_used = w.sequence;
    }
    // write is complete and leader has updated sequence
    return w.FinalStatus();
}

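Only non-Leaders can reach the two branches above. Next we continue from the Leader's view.

Leader view

JoinBatchGroup() blocks every Writer except the Leader. The Leader's state is STATE_GROUP_LEADER, so it skips both branches above and runs the code that follows. The Leader first creates an empty WriteGroup and then builds it up step by step.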

// DBImpl::WriteImpl()  
WriteThread::WriteGroup write_group;
last_batch_group_size_ =
    write_thread_.EnterAsBatchGroupLeader(&w, &write_group);

Here, EnterAsBatchGroupLeader() is the core function that builds the WriteGroup.

EnterAsBatchGroupLeader

With the option checks in the middle omitted, the source of this function is:

size_t WriteThread::EnterAsBatchGroupLeader(Writer* leader,
                                            WriteGroup* write_group) {
  assert(leader->link_older == nullptr);
  assert(leader->batch != nullptr);
  assert(write_group != nullptr);

  size_t size = WriteBatchInternal::ByteSize(leader->batch);

  // Allow the group to grow up to a maximum size, but if the
  // original write is small, limit the growth so we do not slow
  // down the small write too much.
  size_t max_size = max_write_batch_group_size_bytes;
  const uint64_t min_batch_size_bytes = max_write_batch_group_size_bytes / 8;
  if (size <= min_batch_size_bytes) {
    max_size = size + min_batch_size_bytes;
  }

  leader->write_group = write_group;
  write_group->leader = leader;
  write_group->last_writer = leader;
  write_group->size = 1;
  Writer* newest_writer = newest_writer_.load(std::memory_order_acquire);

  // This is safe regardless of any db mutex status of the caller. Previous
  // calls to ExitAsGroupLeader either didn't call CreateMissingNewerLinks
  // (they emptied the list and then we added ourself as leader) or had to
  // explicitly wake us up (the list was non-empty when we added ourself,
  // so we have already received our MarkJoined).
  CreateMissingNewerLinks(newest_writer);

  // Tricky. Iteration start (leader) is exclusive and finish
  // (newest_writer) is inclusive. Iteration goes from old to new.
  Writer* w = leader;
  while (w != newest_writer) {
    assert(w->link_newer);
    w = w->link_newer;

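    // Various if-checks (elided): if w's settings are incompatible with the
    // leader's, break. batch_size, used below, is also computed in the
    // elided part.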

    w->write_group = write_group;
    size += batch_size;
    write_group->last_writer = w;
    write_group->size++;
  }
  TEST_SYNC_POINT_CALLBACK("WriteThread::EnterAsBatchGroupLeader:End", w);
  return size;
}

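First, the function loads newest_writer_ and then calls WriteThread::CreateMissingNewerLinks(). JoinBatchGroup() only builds a singly linked list through the backward pointer link_older. CreateMissingNewerLinks() walks that list once from the tail (the newest Writer) and fills in each Writer's link_newer, turning the singly linked list into a doubly linked one. Its implementation is simple and is not covered in detail here.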
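The conversion from singly to doubly linked can be sketched as follows. This is a minimal sketch of the behavior described above, with `Node` as a hypothetical stand-in for `WriteThread::Writer`; it is not copied from the RocksDB source.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for WriteThread::Writer.
struct Node {
  Node* link_older = nullptr;
  Node* link_newer = nullptr;
};

// Walks from the newest node toward older ones and fills in link_newer.
// Stops at the oldest node, or at the first older node that is already
// linked, since everything beyond it was linked by an earlier call.
void CreateMissingNewerLinks(Node* head) {
  while (true) {
    Node* next = head->link_older;
    if (next == nullptr || next->link_newer != nullptr) {
      assert(next == nullptr || next->link_newer == head);
      break;
    }
    next->link_newer = head;
    head = next;
  }
}
```

The early stop matters because the function is called repeatedly on a growing list: each call only has to link the segment that was appended since the previous call.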

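Next comes the loop, which starts from the Leader (that is, the caller itself). If w's settings are incompatible with the Leader's, the loop breaks, because a WriteGroup must have consistent settings. If they are compatible, w joins the WriteGroup, and so on. In the end, last_writer marks the last Writer in the WriteGroup.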
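Before the loop runs, the excerpt also computes a byte limit for the group, so that a small leader batch is not delayed by a very large group. The arithmetic can be isolated as below. `GroupSizeLimit` is a hypothetical helper name, and the 1 MiB cap used in the usage note is an illustrative value rather than a statement about any particular configuration.

```cpp
#include <cassert>
#include <cstddef>

// Returns the byte limit for a group led by a batch of leader_size bytes,
// given the configured cap (max_write_batch_group_size_bytes in the excerpt).
size_t GroupSizeLimit(size_t leader_size, size_t max_group_bytes) {
  const size_t min_batch_size_bytes = max_group_bytes / 8;
  if (leader_size <= min_batch_size_bytes) {
    // Small leader: allow only limited growth on top of its own size.
    return leader_size + min_batch_size_bytes;
  }
  // Large leader: the group may grow up to the full cap.
  return max_group_bytes;
}
```

With a cap of 1 MiB (1048576 bytes), the threshold is 131072 bytes. A 1024-byte leader gets a limit of 1024 + 131072 = 132096 bytes, while a 200000-byte leader gets the full 1048576 bytes.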

Once the Leader has built the WriteGroup, it performs the write.

Parallel or not

A WriteGroup is written either in parallel or not in parallel. The decision is made as follows:

// DBImpl::WriteImpl()
    // Rules for when we can update the memtable concurrently
    // 1. supported by memtable
    // 2. Puts are not okay if inplace_update_support
    // 3. Merges are not okay
    //
    // Rules 1..2 are enforced by checking the options
    // during startup (CheckConcurrentWritesSupported), so if
    // options.allow_concurrent_memtable_write is true then they can be
    // assumed to be true.  Rule 3 is checked for each batch.  We could
    // relax rules 2 if we could prevent write batches from referring
    // more than once to a particular key.
    bool parallel = immutable_db_options_.allow_concurrent_memtable_write &&
                    write_group.size > 1;
    size_t total_count = 0;
    size_t valid_batches = 0;
    size_t total_byte_size = 0;
    size_t pre_release_callback_cnt = 0;
    for (auto* writer : write_group) {
      assert(writer);
      if (writer->CheckCallback(this)) {
        valid_batches += writer->batch_cnt;
        if (writer->ShouldWriteToMemtable()) {
          total_count += WriteBatchInternal::Count(writer->batch);
          parallel = parallel && !writer->batch->HasMerge();
        }
        total_byte_size = WriteBatchInternal::AppendedByteSize(
            total_byte_size, WriteBatchInternal::ByteSize(writer->batch));
        if (writer->pre_release_callback) {
          pre_release_callback_cnt++;
        }
      }
    }

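From the code above, there are three rules:

- allow_concurrent_memtable_write must be set (and, per the code, the group must contain more than one Writer).
- Updates must be out-of-place, not in-place (inplace_update_support must be off).
- None of the Writers' batches may contain a Merge operation.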

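Consider the second rule. With in-place updates, concurrent writes from multiple Writers cannot be supported. Without in-place updates, one key can have several versions, such as (keyX, seq1, val1), (keyX, seq2, val2), and (keyX, seq3, val3). When several Writers insert into the skiplist concurrently, entries for the same key are still guaranteed to be ordered by sequence number (RocksDB's internal key order places the larger seq first), whatever order the inserts arrive in, which keeps MVCC and transactions correct. With in-place updates, a key maps to a single node in the skiplist, and when several Writers write concurrently there is no guarantee that the Writer with the largest seq is the last one to write that node.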
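The point that ordering does not depend on insertion order can be illustrated with an ordered container. This is a single-threaded sketch that uses std::set as a stand-in for the skiplist; `Entry`, `EntryLess`, and `SeqOrderAfterInsert` are hypothetical names, and the comparator assumes the internal key order is user key ascending, then sequence number descending.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Simplified internal key: user key plus sequence number (hypothetical type).
struct Entry {
  std::string key;
  uint64_t seq;
};

// Orders by user key ascending, then by sequence number descending,
// so the newest version of a key is encountered first.
struct EntryLess {
  bool operator()(const Entry& a, const Entry& b) const {
    if (a.key != b.key) {
      return a.key < b.key;
    }
    return a.seq > b.seq;
  }
};

// Inserts the entries in the given order and returns the sequence numbers
// in the order the container iterates them.
std::vector<uint64_t> SeqOrderAfterInsert(const std::vector<Entry>& inserts) {
  std::set<Entry, EntryLess> table(inserts.begin(), inserts.end());
  std::vector<uint64_t> seqs;
  for (const Entry& e : table) {
    seqs.push_back(e.seq);
  }
  return seqs;
}
```

Inserting the versions of keyX with seq 2, 3, 1 or with seq 1, 2, 3 yields the same iteration order, which is why out-of-place inserts tolerate arbitrary interleaving between Writers. An in-place update has no such property: whichever write lands last wins.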

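Writing the WAL

If disableWAL is not set, the WriteGroup is written to the WAL next. The code splits on two_write_queues_, an option typically used with 2PC transactions, so the two branches are referred to here as 2pc and !2pc.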

// DBImpl::WriteImpl()
    if (!two_write_queues_) {
      if (status.ok() && !write_options.disableWAL) {
        assert(log_context.log_file_number_size);
        LogFileNumberSize& log_file_number_size =
            *(log_context.log_file_number_size);
        PERF_TIMER_GUARD(write_wal_time);
        io_s =
            WriteToWAL(write_group, log_context.writer, log_used,
                       log_context.need_log_sync, log_context.need_log_dir_sync,
                       last_sequence + 1, log_file_number_size);
      }
    } else {
      if (status.ok() && !write_options.disableWAL) {
        PERF_TIMER_GUARD(write_wal_time);
        // LastAllocatedSequence is increased inside WriteToWAL under
        // wal_write_mutex_ to ensure ordered events in WAL
        io_s = ConcurrentWriteToWAL(write_group, log_used, &last_sequence,
                                    seq_inc);
      } else {
        // Otherwise we inc seq number for memtable writes
        last_sequence = versions_->FetchAddLastAllocatedSequence(seq_inc);
      }
    }

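If 2pc is not enabled (two_write_queues_ is false), WriteToWAL() is called. If it is enabled, ConcurrentWriteToWAL() is called. Both receive the whole WriteGroup. How the WAL is actually written is not covered here. The next post analyzes the WAL write path in detail.

After the WAL write, the sequence number is advanced in a few ways, which we skip for now. Then the memtable write begins.

Writing the memtable

The memtable write has two branches, parallel and !parallel. The source is: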

// DBImpl::WriteImpl()
      if (!parallel) {
        // w.sequence will be set inside InsertInto
        w.status = WriteBatchInternal::InsertInto(
            write_group, current_sequence, column_family_memtables_.get(),
            &flush_scheduler_, &trim_history_scheduler_,
            write_options.ignore_missing_column_families,
            0 /*recovery_log_number*/, this, parallel, seq_per_batch_,
            batch_per_txn_);
      } else {
        write_group.last_sequence = last_sequence;
        write_thread_.LaunchParallelMemTableWriters(&write_group);
        in_parallel_group = true;

        // Each parallel follower is doing each own writes. The leader should
        // also do its own.
        if (w.ShouldWriteToMemtable()) {
          ColumnFamilyMemTablesImpl column_family_memtables(
              versions_->GetColumnFamilySet());
          assert(w.sequence == current_sequence);
          w.status = WriteBatchInternal::InsertInto(
              &w, w.sequence, &column_family_memtables, &flush_scheduler_,
              &trim_history_scheduler_,
              write_options.ignore_missing_column_families, 0 /*log_number*/,
              this, true /*concurrent_memtable_writes*/, seq_per_batch_,
              w.batch_cnt, batch_per_txn_,
              write_options.memtable_insert_hint_per_batch);
        }
      }

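Before the analysis, note that WriteBatchInternal::InsertInto() is the entry point for writing to the memtable. It has three overloads, for a WriteGroup, a WriteBatch, and a Writer respectively (xxx stands for the remaining parameters, which are elided):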

  static Status InsertInto(
      WriteThread::WriteGroup& write_group, xxx);

  // Convenience form of InsertInto when you have only one batch
  // next_seq returns the seq after last sequence number used in MemTable insert
  static Status InsertInto(
      const WriteBatch* batch, xxx);

  static Status InsertInto(
      WriteThread::Writer* writer, xxx);

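The !parallel case is straightforward. The Leader is solely responsible for writing the whole WriteGroup, so it calls the first overload of WriteBatchInternal::InsertInto() and writes the entire group by itself.

The parallel mode deserves a closer look.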

// DBImpl::WriteImpl()
write_thread_.LaunchParallelMemTableWriters(&write_group);

The call above wakes every Writer in the WriteGroup and sets its state to STATE_PARALLEL_MEMTABLE_WRITER. The source is:

void WriteThread::LaunchParallelMemTableWriters(WriteGroup* write_group) {
  assert(write_group != nullptr);
  write_group->running.store(write_group->size);
  for (auto w : *write_group) {
    SetState(w, STATE_PARALLEL_MEMTABLE_WRITER);
  }
}

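At this point the woken Writers return from JoinBatchGroup(), enter the non-Leader branch, and perform their own writes. The Leader behaves the same way: it is no longer responsible for writing the whole WriteGroup and only needs to finish its own write.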

// DBImpl::WriteImpl()
        // Each parallel follower is doing each own writes. The leader should
        // also do its own.
        if (w.ShouldWriteToMemtable()) {
          ColumnFamilyMemTablesImpl column_family_memtables(
              versions_->GetColumnFamilySet());
          assert(w.sequence == current_sequence);
          w.status = WriteBatchInternal::InsertInto(
              &w, w.sequence, &column_family_memtables, &flush_scheduler_,
              &trim_history_scheduler_,
              write_options.ignore_missing_column_families, 0 /*log_number*/,
              this, true /*concurrent_memtable_writes*/, seq_per_batch_,
              w.batch_cnt, batch_per_txn_,
              write_options.memtable_insert_hint_per_batch);
        }

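Whether it is the WriteGroup overload or the Writer overload of WriteBatchInternal::InsertInto(), the core logic is the same. It is not covered here and will be analyzed in detail in the post on memtable writes.

After the memtable write, the Leader performs some log sync work, which we skip for now.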

ExitAsBatchGroupLeader

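At this point the WriteGroup's work is done, and some cleanup begins. In parallel mode the cleanup is performed by the last Writer to finish its write. Otherwise the Leader performs it directly.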

// DBImpl::WriteImpl()
  bool should_exit_batch_group = true;
  if (in_parallel_group) {
    // CompleteParallelWorker returns true if this thread should
    // handle exit, false means somebody else did
    should_exit_batch_group = write_thread_.CompleteParallelMemTableWriter(&w);
  }
  if (should_exit_batch_group) {
    if (status.ok()) {
      for (auto* tmp_w : write_group) {
        assert(tmp_w);
        if (tmp_w->post_memtable_callback) {
          Status tmp_s =
              (*tmp_w->post_memtable_callback)(last_sequence, disable_memtable);
          // TODO: propagate the execution status of post_memtable_callback to
          // caller.
          assert(tmp_s.ok());
        }
      }
      // Note: if we are to resume after non-OK statuses we need to revisit how
      // we reacts to non-OK statuses here.
      versions_->SetLastSequence(last_sequence);
    }
    MemTableInsertStatusCheck(w.status);
    write_thread_.ExitAsBatchGroupLeader(write_group, status);
  }

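There are two cleanup tasks:

- Update last_sequence_ in the VersionSet (versions_->SetLastSequence()).
- Help form the next WriteGroup.

Taking only the non-pipelined part of ExitAsBatchGroupLeader, the source is: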

void WriteThread::ExitAsBatchGroupLeader(WriteGroup& write_group,
                                         Status& status) {
  Writer* leader = write_group.leader;
  Writer* last_writer = write_group.last_writer;
  // ...
  if (enable_pipelined_write_) {
      // ...
  } else {
    Writer* head = newest_writer_.load(std::memory_order_acquire);
    if (head != last_writer ||
        !newest_writer_.compare_exchange_strong(head, nullptr)) {
      // Either last_writer wasn't the head during the load(), or it was the
      // head during the load() but somebody else pushed onto the list before
      // we did the compare_exchange_strong (causing it to fail).  In the
      // latter case compare_exchange_strong has the effect of re-reading
      // its first param (head).  No need to retry a failing CAS, because
      // only a departing leader (which we are at the moment) can remove
      // nodes from the list.
      assert(head != last_writer);

      // After walking link_older starting from head (if not already done)
      // we will be able to traverse w->link_newer below. This function
      // can only be called from an active leader, only a leader can
      // clear newest_writer_, we didn't, and only a clear newest_writer_
      // could cause the next leader to start their work without a call
      // to MarkJoined, so we can definitely conclude that no other leader
      // work is going on here (with or without db mutex).
      CreateMissingNewerLinks(head);
      assert(last_writer->link_newer != nullptr);
      assert(last_writer->link_newer->link_older == last_writer);
      last_writer->link_newer->link_older = nullptr;

      // Next leader didn't self-identify, because newest_writer_ wasn't
      // nullptr when they enqueued (we were definitely enqueued before them
      // and are still in the list).  That means leader handoff occurs when
      // we call MarkJoined
      SetState(last_writer->link_newer, STATE_GROUP_LEADER);
    }
    // else nobody else was waiting, although there might already be a new
    // leader now

    while (last_writer != leader) {
      assert(last_writer);
      last_writer->status = status;
      // we need to read link_older before calling SetState, because as soon
      // as it is marked committed the other thread's Await may return and
      // deallocate the Writer.
      auto next = last_writer->link_older;
      SetState(last_writer, STATE_COMPLETED);

      last_writer = next;
    }
  }
}

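First, although EnterAsBatchGroupLeader() already called CreateMissingNewerLinks() once to turn the WriteLink from a singly linked list into a doubly linked one, new Writers may well have joined the WriteLink while the WriteGroup was being written, and that new segment is still singly linked. So CreateMissingNewerLinks() is called again on exit to make sure the WriteLink is doubly linked.

Next, it picks the new Leader, which is simply the Writer right after last_writer, and sets the new Leader's link_older to null, which disconnects the old WriteGroup from the new one.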

// WriteThread::ExitAsBatchGroupLeader()
last_writer->link_newer->link_older = nullptr;
SetState(last_writer->link_newer, STATE_GROUP_LEADER);

Finally, it sets the state of the other Writers in its WriteGroup (the followers) to STATE_COMPLETED, meaning their writes are done.

// WriteThread::ExitAsBatchGroupLeader()
while (last_writer != leader) {
    assert(last_writer);
    last_writer->status = status;
    // we need to read link_older before calling SetState, because as soon
    // as it is marked committed the other thread's Await may return and
    // deallocate the Writer.
    auto next = last_writer->link_older;
    SetState(last_writer, STATE_COMPLETED);

    last_writer = next;
}

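Note that once the new Leader is chosen, it is woken up, returns from JoinBatchGroup(), enters the Leader branch, and repeats the steps above to build a new WriteGroup. The cycle continues in this way.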
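The handoff can be sketched end to end on a toy list. This is a simplified, single-threaded illustration of the non-pipelined path: `Node` and `DetachGroup` are hypothetical names, and waking the next leader and completing the followers are left out.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical, simplified stand-in for WriteThread::Writer.
struct Node {
  Node* link_older = nullptr;
  Node* link_newer = nullptr;
};

// Fills in link_newer from the newest node back to the first linked node.
void CreateMissingNewerLinks(Node* head) {
  while (true) {
    Node* next = head->link_older;
    if (next == nullptr || next->link_newer != nullptr) {
      break;
    }
    next->link_newer = head;
    head = next;
  }
}

// Detaches the finished group that ends at last_writer from the list.
// Returns the next leader, or nullptr if nobody else was queued.
Node* DetachGroup(std::atomic<Node*>* newest, Node* last_writer) {
  Node* head = newest->load(std::memory_order_acquire);
  if (head == last_writer &&
      newest->compare_exchange_strong(head, nullptr)) {
    return nullptr;  // The list is now empty.
  }
  // Someone queued up behind the group. A failed CAS has reloaded head, so
  // head is the current newest node in either case.
  CreateMissingNewerLinks(head);
  Node* next_leader = last_writer->link_newer;
  next_leader->link_older = nullptr;  // Cut the old group off the list.
  return next_leader;
}
```

In a list a <- b <- c where the finished group is {a, b}, DetachGroup returns c with its link_older cleared. When c later finishes alone, the compare-and-swap empties the list and the function returns nullptr.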

This concludes the source analysis of WriteGroup. The next post will analyze writes to the WAL.
