RocksDB Source Code Study (6): Write (2) - WriteGroup


This post analyzes the WriteGroup-related parts of the RocksDB write path at the source level. Pipelined writes (pipelined_write) and 2PC are not considered. The code version used is v7.7.4.


The main purpose of the WAL is to recover the data that was in the memtable before RocksDB exited abnormally, so by default RocksDB flushes every user write to the WAL. Whenever the memtable corresponding to the current WAL is flushed to disk, a new WAL is created, i.e. each memtable corresponds to one WAL. Each WAL is eventually written to its own WAL file; these files live under options.wal_dir and are named after their log_number.
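As a rough illustration of that naming convention (this is a hand-written sketch, not RocksDB's own file-name helper), a WAL file path is essentially the zero-padded log_number plus a .log suffix under options.wal_dir:

#include <cstdint>
#include <cstdio>
#include <string>

// Illustrative sketch only: build a WAL file path from a log number,
// mimicking the "<wal_dir>/<zero-padded log_number>.log" convention
// described above. RocksDB uses its own internal helper for this.
std::string MakeWalFileName(const std::string& wal_dir, uint64_t log_number) {
  char name[32];
  std::snprintf(name, sizeof(name), "%06llu.log",
                static_cast<unsigned long long>(log_number));
  return wal_dir + "/" + name;
}

// MakeWalFileName("/data/db/wal", 12) == "/data/db/wal/000012.log"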

WriteImpl

As we know, each writing thread corresponds to one WriteBatch, and its write is carried out by DBImpl::WriteImpl(). The construction and teardown of a WriteGroup also happen inside this function, so it is the entry point of this analysis.
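For context, a user-level write like the one below is what eventually lands in DBImpl::WriteImpl(); the database path "/tmp/writegroup_demo" is just a placeholder for this example:

#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/writegroup_demo", &db);
  assert(s.ok());

  // A single Put() is internally wrapped into a WriteBatch and routed to
  // DBImpl::WriteImpl().
  s = db->Put(rocksdb::WriteOptions(), "k1", "v1");
  assert(s.ok());

  // An explicit WriteBatch issued from one thread becomes one Writer
  // inside WriteImpl().
  rocksdb::WriteBatch batch;
  batch.Put("k2", "v2");
  batch.Delete("k1");
  s = db->Write(rocksdb::WriteOptions(), &batch);
  assert(s.ok());

  delete db;
  return 0;
}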

Inside the function, skipping the option checks and the unordered_write / pipelined_write branches, we reach the code that wraps the batch into a Writer:

// DBImpl::WriteImpl()
WriteThread::Writer w(write_options, my_batch, callback, log_ref,
                      disable_memtable, batch_cnt, pre_release_callback,
                      post_memtable_callback);

The internals of Writer were discussed in the previous post; the core members are the WriteBatch*, link_older and link_newer, so I won't repeat them here. After the Writer is constructed, it is appended to the DB object's linked list of Writers, which the previous post called the WriteLink; this post keeps that name.
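For reference, here is a heavily simplified sketch of the fields we care about; the real WriteThread::Writer has many more members (status, callbacks, sync flags, sequence numbers, ...), and the state value shown is only illustrative:

#include <atomic>
#include <cstdint>

#include "rocksdb/write_batch.h"

// Heavily simplified sketch of WriteThread::Writer, limited to the fields
// discussed in this post.
struct WriterSketch {
  rocksdb::WriteBatch* batch = nullptr;            // this thread's batch
  std::atomic<uint8_t> state{1};                   // e.g. STATE_INIT
  struct WriteGroupSketch* write_group = nullptr;  // group this writer joined
  WriterSketch* link_older = nullptr;              // previously enqueued (older) writer
  WriterSketch* link_newer = nullptr;              // newer writer, filled in lazily
};

With that in mind, the Writer is handed to JoinBatchGroup():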

// DBImpl::WriteImpl()
write_thread_.JoinBatchGroup(&w);

JoinBatchGroup

The full source of this function is as follows:

void WriteThread::JoinBatchGroup(Writer* w) {
  TEST_SYNC_POINT_CALLBACK("WriteThread::JoinBatchGroup:Start", w);
  assert(w->batch != nullptr);

  bool linked_as_leader = LinkOne(w, &newest_writer_);

  if (linked_as_leader) {
    SetState(w, STATE_GROUP_LEADER);
  }

  TEST_SYNC_POINT_CALLBACK("WriteThread::JoinBatchGroup:Wait", w);
  TEST_SYNC_POINT_CALLBACK("WriteThread::JoinBatchGroup:Wait2", w);

  if (!linked_as_leader) {
    /**
     * Wait util:
     * 1) An existing leader pick us as the new leader when it finishes
     * 2) An existing leader pick us as its follewer and
     * 2.1) finishes the memtable writes on our behalf
     * 2.2) Or tell us to finish the memtable writes in pralallel
     * 3) (pipelined write) An existing leader pick us as its follower and
     *    finish book-keeping and WAL write for us, enqueue us as pending
     *    memtable writer, and
     * 3.1) we become memtable writer group leader, or
     * 3.2) an existing memtable writer group leader tell us to finish memtable
     *      writes in parallel.
     */
    TEST_SYNC_POINT_CALLBACK("WriteThread::JoinBatchGroup:BeganWaiting", w);
    AwaitState(w, STATE_GROUP_LEADER | STATE_MEMTABLE_WRITER_LEADER |
                      STATE_PARALLEL_MEMTABLE_WRITER | STATE_COMPLETED,
               &jbg_ctx);
    TEST_SYNC_POINT_CALLBACK("WriteThread::JoinBatchGroup:DoneWaiting", w);
  }
}

// ...
bool WriteThread::LinkOne(Writer* w, std::atomic<Writer*>* newest_writer) {
  assert(newest_writer != nullptr);
  assert(w->state == STATE_INIT);
  Writer* writers = newest_writer->load(std::memory_order_relaxed);
  while (true) {
    // If write stall in effect, and w->no_slowdown is not true,
    // block here until stall is cleared. If its true, then return
    // immediately
    if (writers == &write_stall_dummy_) {
      if (w->no_slowdown) {
        w->status = Status::Incomplete("Write stall");
        SetState(w, STATE_COMPLETED);
        return false;
      }
      // Since no_slowdown is false, wait here to be notified of the write
      // stall clearing
      {
        MutexLock lock(&stall_mu_);
        writers = newest_writer->load(std::memory_order_relaxed);
        if (writers == &write_stall_dummy_) {
          TEST_SYNC_POINT_CALLBACK("WriteThread::WriteStall::Wait", w);
          stall_cv_.Wait();
          // Load newest_writers_ again since it may have changed
          writers = newest_writer->load(std::memory_order_relaxed);
          continue;
        }
      }
    }
    w->link_older = writers;
    if (newest_writer->compare_exchange_weak(writers, w)) {
      return (writers == nullptr);
    }
  }
}

The logic is straightforward. It first loads newest_writer_; if a write stall is in effect, it either returns immediately or waits, depending on no_slowdown. It then inserts the current Writer into the WriteLink, which really just means pointing its link_older at the old newest_writer_ and making itself the new newest_writer_ via CAS. If the old newest_writer_ was null, the current Writer is the first one in the list, so the function returns true to indicate it is the leader; otherwise it returns false.
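The insertion is the classic lock-free "push onto an atomic list head" pattern. A minimal standalone sketch of the same idea, ignoring the write-stall branch:

#include <atomic>

struct Node {
  Node* link_older = nullptr;
};

// Minimal sketch of the LinkOne() insertion: publish ourselves as the new
// newest node with a CAS loop, and report whether the list was empty before
// (i.e. whether we became the leader).
bool PushNewest(std::atomic<Node*>* newest, Node* w) {
  Node* head = newest->load(std::memory_order_relaxed);
  while (true) {
    w->link_older = head;                        // point at the previous newest
    if (newest->compare_exchange_weak(head, w)) {
      return head == nullptr;                    // empty list => we are the leader
    }
    // CAS failed: head now holds the updated newest pointer; retry.
  }
}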

After joining the WriteLink, a leader sets its state to STATE_GROUP_LEADER. A non-leader instead blocks itself in AwaitState(), waiting for a leader to set its state (i.e. wake it up). Ignoring pipelined writes, there are two wake-up conditions, as described in the comment (a rough sketch of the underlying state/wait mechanism follows the list):

  • It was not included in the current WriteGroup and is picked by the finishing leader as the new leader.
  • It is part of the current WriteGroup and is woken up by the leader.
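The Writer states are bit flags, which is why a blocked writer can wait for "any of several states" with a single mask (see the OR-ed mask passed to AwaitState() above). The following is only a rough sketch of the idea; the real AwaitState() also spins adaptively before blocking, and the state values below are for illustration only:

#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Sketch of the state/wait mechanism. States are single bits so that a writer
// can wait for any one of several target states at once.
constexpr uint8_t kStateInit = 1;
constexpr uint8_t kStateGroupLeader = 2;
constexpr uint8_t kStateParallelMemtableWriter = 8;
constexpr uint8_t kStateCompleted = 16;

struct WaitableState {
  std::mutex mu;
  std::condition_variable cv;
  std::atomic<uint8_t> state{kStateInit};

  // Sketch of AwaitState(): block until the state matches any bit in goal_mask.
  uint8_t Await(uint8_t goal_mask) {
    std::unique_lock<std::mutex> lock(mu);
    cv.wait(lock, [&] { return (state.load() & goal_mask) != 0; });
    return state.load();
  }

  // Sketch of SetState(): the leader publishes a new state and wakes the waiter.
  void Set(uint8_t new_state) {
    {
      std::lock_guard<std::mutex> lock(mu);
      state.store(new_state);
    }
    cv.notify_one();
  }
};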

Note that JoinBatchGroup() blocks every non-leader, so the code that follows is only executed by the leader or by writers the leader has woken up. Let's look at it from the non-leader's perspective first.

The Non-Leader Perspective

After JoinBatchGroup() returns, RocksDB checks the Writer's state against two values, STATE_PARALLEL_MEMTABLE_WRITER and STATE_COMPLETED. Let's go through them one by one.

STATE_PARALLEL_MEMTABLE_WRITER

First up is STATE_PARALLEL_MEMTABLE_WRITER; the source is as follows:

// DBImpl::WriteImpl()
if (w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER) {
    // we are a non-leader in a parallel group
    if (w.ShouldWriteToMemtable()) {
      PERF_TIMER_STOP(write_pre_and_post_process_time);
      PERF_TIMER_GUARD(write_memtable_time);

      ColumnFamilyMemTablesImpl column_family_memtables(
          versions_->GetColumnFamilySet());
      w.status = WriteBatchInternal::InsertInto(
          &w, w.sequence, &column_family_memtables, &flush_scheduler_,
          &trim_history_scheduler_,
          write_options.ignore_missing_column_families, 0 /*log_number*/, this,
          true /*concurrent_memtable_writes*/, seq_per_batch_, w.batch_cnt,
          batch_per_txn_, write_options.memtable_insert_hint_per_batch);

      PERF_TIMER_START(write_pre_and_post_process_time);
    }

    if (write_thread_.CompleteParallelMemTableWriter(&w)) {
      // we're responsible for exit batch group
      // TODO(myabandeh): propagate status to write_group
      auto last_sequence = w.write_group->last_sequence;
      for (auto* tmp_w : *(w.write_group)) {
        assert(tmp_w);
        if (tmp_w->post_memtable_callback) {
          Status tmp_s =
              (*tmp_w->post_memtable_callback)(last_sequence, disable_memtable);
          // TODO: propagate the execution status of post_memtable_callback to
          // caller.
          assert(tmp_s.ok());
        }
      }
      versions_->SetLastSequence(last_sequence);
      MemTableInsertStatusCheck(w.status);
      write_thread_.ExitAsBatchGroupFollower(&w);
    }
    assert(w.state == WriteThread::STATE_COMPLETED);
    // STATE_COMPLETED conditional below handles exit
}

STATE_PARALLEL_MEMTABLE_WRITER means this Writer is in a WriteGroup but is not the leader, has already been woken up by the leader, and the write is configured as parallel. So this block means: in parallel mode, the leader wakes up its followers and asks each of them to write its own batch into the memtable concurrently; the actual insertion is implemented in WriteBatchInternal::InsertInto(). The second half of the block is guarded by a CompleteParallelMemTableWriter() check; let's look at its comment first.

// Reports the completion of w's batch to the parallel group leader, and
// waits for the rest of the parallel batch to complete.  Returns true
// if this thread is the last to complete, and hence should advance
// the sequence number and then call EarlyExitParallelGroup, false if
// someone else has already taken responsibility for that.
bool CompleteParallelMemTableWriter(Writer* w);

The key point: it returns true if the current Writer is the last one in the parallel write to finish. So the second job of the code block above is: if the current Writer finishes last, it is responsible for the WriteGroup's cleanup, which includes advancing the global sequence number and calling ExitAsBatchGroupFollower(). ExitAsBatchGroupFollower() mainly helps hand leadership over to the next WriteGroup; it was mentioned in the previous post and is covered in more detail below.
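The heart of CompleteParallelMemTableWriter() is a shared running counter on the WriteGroup: every parallel writer decrements it, and whoever brings it to zero is the last finisher. A minimal sketch of that pattern (the real function also records a failed status on the group and makes non-last writers wait for STATE_COMPLETED):

#include <atomic>
#include <cstddef>

// "Last one out turns off the lights": each parallel writer calls this once
// after finishing its memtable insert. Only the writer that decrements the
// counter to zero returns true and performs the group's cleanup.
bool CompleteOneParallelWriter(std::atomic<size_t>* running) {
  // fetch_sub returns the value held *before* the decrement.
  return running->fetch_sub(1, std::memory_order_acq_rel) == 1;
}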

Of course, this branch is only reachable when parallel writes are enabled; otherwise it is skipped entirely.

STATE_COMPLETED

As the name suggests, STATE_COMPLETED means this Writer is already done, so this branch does nothing of importance and simply returns:

// DBImpl::WriteImpl()  
if (w.state == WriteThread::STATE_COMPLETED) {
    if (log_used != nullptr) {
        *log_used = w.log_used;
    }
    if (seq_used != nullptr) {
        *seq_used = w.sequence;
    }
    // write is complete and leader has updated sequence
    return w.FinalStatus();
}

Both branches above can only be reached by non-leaders. Next, let's continue from the leader's perspective.

The Leader Perspective

JoinBatchGroup() blocks every Writer except the leader, and the leader's state is STATE_GROUP_LEADER, so it skips the two branches above and proceeds to the following code. The leader first creates an empty WriteGroup and then builds it up step by step.

// DBImpl::WriteImpl()  
WriteThread::WriteGroup write_group;
last_batch_group_size_ =
    write_thread_.EnterAsBatchGroupLeader(&w, &write_group);

Here, EnterAsBatchGroupLeader() is the core function that builds the WriteGroup.

EnterAsBatchGroupLeader

With the option checks in the middle elided, the source of this function is as follows:

size_t WriteThread::EnterAsBatchGroupLeader(Writer* leader,
                                            WriteGroup* write_group) {
  assert(leader->link_older == nullptr);
  assert(leader->batch != nullptr);
  assert(write_group != nullptr);

  size_t size = WriteBatchInternal::ByteSize(leader->batch);

  // Allow the group to grow up to a maximum size, but if the
  // original write is small, limit the growth so we do not slow
  // down the small write too much.
  size_t max_size = max_write_batch_group_size_bytes;
  const uint64_t min_batch_size_bytes = max_write_batch_group_size_bytes / 8;
  if (size <= min_batch_size_bytes) {
    max_size = size + min_batch_size_bytes;
  }

  leader->write_group = write_group;
  write_group->leader = leader;
  write_group->last_writer = leader;
  write_group->size = 1;
  Writer* newest_writer = newest_writer_.load(std::memory_order_acquire);

  // This is safe regardless of any db mutex status of the caller. Previous
  // calls to ExitAsGroupLeader either didn't call CreateMissingNewerLinks
  // (they emptied the list and then we added ourself as leader) or had to
  // explicitly wake us up (the list was non-empty when we added ourself,
  // so we have already received our MarkJoined).
  CreateMissingNewerLinks(newest_writer);

  // Tricky. Iteration start (leader) is exclusive and finish
  // (newest_writer) is inclusive. Iteration goes from old to new.
  Writer* w = leader;
  while (w != newest_writer) {
    assert(w->link_newer);
    w = w->link_newer;

    // A series of compatibility checks elided here: break if (among other
    // things) w->sync, w->no_slowdown or w->disable_wal don't match the
    // leader's settings, if w->batch is nullptr, if w's callback disallows
    // batching, or if adding batch_size = WriteBatchInternal::ByteSize(w->batch)
    // would push the group past max_size.

    w->write_group = write_group;
    size += batch_size;
    write_group->last_writer = w;
    write_group->size++;
  }
  TEST_SYNC_POINT_CALLBACK("WriteThread::EnterAsBatchGroupLeader:End", w);
  return size;
}

First, the function loads newest_writer_ and calls WriteThread::CreateMissingNewerLinks(). During JoinBatchGroup, the list is built as a singly linked list with only the backward pointer link_older; this function walks the list from its newest end and fills in each Writer's link_newer, turning the singly linked list into a doubly linked one. Its implementation is simple, so I won't dwell on it.
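For reference, the idea looks roughly like this (a sketch following the shape of the real function, using a stand-in node type):

struct WriterNode {
  WriterNode* link_older = nullptr;
  WriterNode* link_newer = nullptr;
};

// Sketch of CreateMissingNewerLinks(): starting from the newest writer, walk
// towards older writers and fill in link_newer, stopping as soon as we reach
// a node whose link_newer is already set (that part of the list was already
// doubly linked by a previous call).
void CreateMissingNewerLinksSketch(WriterNode* head) {
  while (head != nullptr) {
    WriterNode* older = head->link_older;
    if (older == nullptr || older->link_newer != nullptr) {
      break;  // reached the oldest writer, or an already-linked segment
    }
    older->link_newer = head;
    head = older;
  }
}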

Next comes the loop, which starts the traversal from the leader (itself). If w's options are incompatible with the leader's, the loop breaks, because a WriteGroup must keep consistent options (sync, no_slowdown, disable_wal, etc., as noted in the elided comment above). If they are compatible, w joins the WriteGroup, and so on; in the end, last_writer marks the last Writer in the WriteGroup.

Once the leader has built the WriteGroup, the actual writing begins.

Parallel or Not

Writing a WriteGroup splits into the parallel and !parallel cases, i.e. whether the memtable inserts are done concurrently. The decision is made as follows:

// DBImpl::WriteImpl()
    // Rules for when we can update the memtable concurrently
    // 1. supported by memtable
    // 2. Puts are not okay if inplace_update_support
    // 3. Merges are not okay
    //
    // Rules 1..2 are enforced by checking the options
    // during startup (CheckConcurrentWritesSupported), so if
    // options.allow_concurrent_memtable_write is true then they can be
    // assumed to be true.  Rule 3 is checked for each batch.  We could
    // relax rules 2 if we could prevent write batches from referring
    // more than once to a particular key.
    bool parallel = immutable_db_options_.allow_concurrent_memtable_write &&
                    write_group.size > 1;
    size_t total_count = 0;
    size_t valid_batches = 0;
    size_t total_byte_size = 0;
    size_t pre_release_callback_cnt = 0;
    for (auto* writer : write_group) {
      assert(writer);
      if (writer->CheckCallback(this)) {
        valid_batches += writer->batch_cnt;
        if (writer->ShouldWriteToMemtable()) {
          total_count += WriteBatchInternal::Count(writer->batch);
          parallel = parallel && !writer->batch->HasMerge();
        }
        total_byte_size = WriteBatchInternal::AppendedByteSize(
            total_byte_size, WriteBatchInternal::ByteSize(writer->batch));
        if (writer->pre_release_callback) {
          pre_release_callback_cnt++;
        }
      }
    }

From the code above, there are three criteria:

  • allow_concurrent_memtable_write must be set.
  • Updates must be out-of-place; inplace_update_support must not be enabled.
  • No writer's batch may contain a Merge operation.

Let's look at the second point. With in-place updates, concurrent writes by multiple writers cannot be supported. With out-of-place updates, the same key may have multiple versions, e.g. (keyX, seq1, val1), (keyX, seq2, val2), (keyX, seq3, val3); each version gets its own skiplist node whose position is determined by (key, seq), so even when multiple writers insert into the skiplist concurrently, the versions of the same key still end up in a well-defined sequence order, which keeps MVCC and transactions correct. With in-place updates, a key maps to a single skiplist node, and with multiple concurrent writers there is no guarantee that the Writer holding the largest seq is the last one to update that node.
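Whether the parallel path can be taken is therefore largely a matter of options. A hedged example of the relevant knobs (the defaults noted in the comments are the usual ones; check options.h for your version):

#include "rocksdb/options.h"

// Options relevant to the parallel-memtable-write decision discussed above.
rocksdb::Options MakeParallelWriteOptions() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Concurrent memtable writes must be allowed; the default skiplist
  // memtable supports them.
  options.allow_concurrent_memtable_write = true;   // default: true
  // In-place updates are incompatible with concurrent memtable writes.
  options.inplace_update_support = false;           // must stay false
  // This series of posts ignores pipelined writes, so keep them off.
  options.enable_pipelined_write = false;
  return options;
}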

Writing to the WAL

If disableWAL is not set, the whole WriteGroup is next written to the WAL, but this splits into two branches depending on two_write_queues_ (the option typically enabled for 2PC).

// DBImpl::WriteImpl()
    if (!two_write_queues_) {
      if (status.ok() && !write_options.disableWAL) {
        assert(log_context.log_file_number_size);
        LogFileNumberSize& log_file_number_size =
            *(log_context.log_file_number_size);
        PERF_TIMER_GUARD(write_wal_time);
        io_s =
            WriteToWAL(write_group, log_context.writer, log_used,
                       log_context.need_log_sync, log_context.need_log_dir_sync,
                       last_sequence + 1, log_file_number_size);
      }
    } else {
      if (status.ok() && !write_options.disableWAL) {
        PERF_TIMER_GUARD(write_wal_time);
        // LastAllocatedSequence is increased inside WriteToWAL under
        // wal_write_mutex_ to ensure ordered events in WAL
        io_s = ConcurrentWriteToWAL(write_group, log_used, &last_sequence,
                                    seq_inc);
      } else {
        // Otherwise we inc seq number for memtable writes
        last_sequence = versions_->FetchAddLastAllocatedSequence(seq_inc);
      }
    }

Without two_write_queues_, WriteToWAL() is called; with it, ConcurrentWriteToWAL() is called instead. Both take the whole WriteGroup. How the WAL is actually written is not covered here; the next post analyzes the WAL write path in detail.

After the WAL write, the sequence number is advanced in a few places; we skip that for now. Then the memtable write begins.

Writing to the Memtable

The memtable write is split into two branches, parallel and !parallel. The source is as follows:

// DBImpl::WriteImpl()
      if (!parallel) {
        // w.sequence will be set inside InsertInto
        w.status = WriteBatchInternal::InsertInto(
            write_group, current_sequence, column_family_memtables_.get(),
            &flush_scheduler_, &trim_history_scheduler_,
            write_options.ignore_missing_column_families,
            0 /*recovery_log_number*/, this, parallel, seq_per_batch_,
            batch_per_txn_);
      } else {
        write_group.last_sequence = last_sequence;
        write_thread_.LaunchParallelMemTableWriters(&write_group);
        in_parallel_group = true;

        // Each parallel follower is doing each own writes. The leader should
        // also do its own.
        if (w.ShouldWriteToMemtable()) {
          ColumnFamilyMemTablesImpl column_family_memtables(
              versions_->GetColumnFamilySet());
          assert(w.sequence == current_sequence);
          w.status = WriteBatchInternal::InsertInto(
              &w, w.sequence, &column_family_memtables, &flush_scheduler_,
              &trim_history_scheduler_,
              write_options.ignore_missing_column_families, 0 /*log_number*/,
              this, true /*concurrent_memtable_writes*/, seq_per_batch_,
              w.batch_cnt, batch_per_txn_,
              write_options.memtable_insert_hint_per_batch);
        }
      }

Before the analysis, note that WriteBatchInternal::InsertInto() is the entry point for writing into the memtable. It has three overloads, for a WriteGroup, a WriteBatch, and a Writer respectively:

  static Status InsertInto(
      WriteThread::WriteGroup& write_group, xxx);

  // Convenience form of InsertInto when you have only one batch
  // next_seq returns the seq after last sequence number used in MemTable insert
  static Status InsertInto(
      const WriteBatch* batch, xxx);

  static Status InsertInto(
      WriteThread::Writer* writer, xxx);

The !parallel case is straightforward: the leader is fully responsible for writing the whole WriteGroup, so it calls the first (WriteGroup) overload of WriteBatchInternal::InsertInto() and inserts the entire group by itself.

Let's focus on the parallel mode.

// DBImpl::WriteImpl()
write_thread_.LaunchParallelMemTableWriters(&write_group);

This call wakes up all the Writers in the WriteGroup and sets their state to STATE_PARALLEL_MEMTABLE_WRITER. Its source is:

void WriteThread::LaunchParallelMemTableWriters(WriteGroup* write_group) {
  assert(write_group != nullptr);
  write_group->running.store(write_group->size);
  for (auto w : *write_group) {
    SetState(w, STATE_PARALLEL_MEMTABLE_WRITER);
  }
}

At this point the woken writers return from JoinBatchGroup(), take the non-leader branch described earlier, and perform their own inserts. The leader behaves just like them: it no longer writes the whole WriteGroup and only has to complete its own insert.

// DBImpl::WriteImpl()
        // Each parallel follower is doing each own writes. The leader should
        // also do its own.
        if (w.ShouldWriteToMemtable()) {
          ColumnFamilyMemTablesImpl column_family_memtables(
              versions_->GetColumnFamilySet());
          assert(w.sequence == current_sequence);
          w.status = WriteBatchInternal::InsertInto(
              &w, w.sequence, &column_family_memtables, &flush_scheduler_,
              &trim_history_scheduler_,
              write_options.ignore_missing_column_families, 0 /*log_number*/,
              this, true /*concurrent_memtable_writes*/, seq_per_batch_,
              w.batch_cnt, batch_per_txn_,
              write_options.memtable_insert_hint_per_batch);
        }

Whether it is the WriteGroup overload or the Writer overload of WriteBatchInternal::InsertInto(), the core logic is the same; I won't go deeper here, it will be analyzed in detail in the post on memtable writes.

After the memtable write, the leader performs some log-sync related work, which we skip here.

ExitAsBatchGroupLeader

At this point the WriteGroup's work is done, and some cleanup begins. In the parallel case, it is performed by the last Writer to finish its insert; otherwise the leader performs it directly.

// DBImpl::WriteImpl()
  bool should_exit_batch_group = true;
  if (in_parallel_group) {
    // CompleteParallelWorker returns true if this thread should
    // handle exit, false means somebody else did
    should_exit_batch_group = write_thread_.CompleteParallelMemTableWriter(&w);
  }
  if (should_exit_batch_group) {
    if (status.ok()) {
      for (auto* tmp_w : write_group) {
        assert(tmp_w);
        if (tmp_w->post_memtable_callback) {
          Status tmp_s =
              (*tmp_w->post_memtable_callback)(last_sequence, disable_memtable);
          // TODO: propagate the execution status of post_memtable_callback to
          // caller.
          assert(tmp_s.ok());
        }
      }
      // Note: if we are to resume after non-OK statuses we need to revisit how
      // we reacts to non-OK statuses here.
      versions_->SetLastSequence(last_sequence);
    }
    MemTableInsertStatusCheck(w.status);
    write_thread_.ExitAsBatchGroupLeader(write_group, status);
  }

There are two cleanup tasks:

  • Update the version set's last_sequence_.
  • Help form the next WriteGroup.

Keeping only the non-pipelined part of ExitAsBatchGroupLeader(), the source is as follows:

void WriteThread::ExitAsBatchGroupLeader(WriteGroup& write_group,
                                         Status& status) {
  Writer* leader = write_group.leader;
  Writer* last_writer = write_group.last_writer;
  // ...
  if (enable_pipelined_write_) {
      // ...
  } else {
    Writer* head = newest_writer_.load(std::memory_order_acquire);
    if (head != last_writer ||
        !newest_writer_.compare_exchange_strong(head, nullptr)) {
      // Either last_writer wasn't the head during the load(), or it was the
      // head during the load() but somebody else pushed onto the list before
      // we did the compare_exchange_strong (causing it to fail).  In the
      // latter case compare_exchange_strong has the effect of re-reading
      // its first param (head).  No need to retry a failing CAS, because
      // only a departing leader (which we are at the moment) can remove
      // nodes from the list.
      assert(head != last_writer);

      // After walking link_older starting from head (if not already done)
      // we will be able to traverse w->link_newer below. This function
      // can only be called from an active leader, only a leader can
      // clear newest_writer_, we didn't, and only a clear newest_writer_
      // could cause the next leader to start their work without a call
      // to MarkJoined, so we can definitely conclude that no other leader
      // work is going on here (with or without db mutex).
      CreateMissingNewerLinks(head);
      assert(last_writer->link_newer != nullptr);
      assert(last_writer->link_newer->link_older == last_writer);
      last_writer->link_newer->link_older = nullptr;

      // Next leader didn't self-identify, because newest_writer_ wasn't
      // nullptr when they enqueued (we were definitely enqueued before them
      // and are still in the list).  That means leader handoff occurs when
      // we call MarkJoined
      SetState(last_writer->link_newer, STATE_GROUP_LEADER);
    }
    // else nobody else was waiting, although there might already be a new
    // leader now

    while (last_writer != leader) {
      assert(last_writer);
      last_writer->status = status;
      // we need to read link_older before calling SetState, because as soon
      // as it is marked committed the other thread's Await may return and
      // deallocate the Writer.
      auto next = last_writer->link_older;
      SetState(last_writer, STATE_COMPLETED);

      last_writer = next;
    }
  }
}

First, even though CreateMissingNewerLinks() was already called in EnterAsBatchGroupLeader() to turn the WriteLink into a doubly linked list, new Writers may well have joined the WriteLink while the WriteGroup was being written, and that new segment is again only singly linked. So CreateMissingNewerLinks() is called once more on exit to make sure the WriteLink is fully doubly linked.

Next, it picks the new leader, which is simply the Writer right after last_writer, and sets the new leader's link_older to null, cutting the new group off from the old one.

// WriteThread::ExitAsBatchGroupLeader()
last_writer->link_newer->link_older = nullptr;
SetState(last_writer->link_newer, STATE_GROUP_LEADER);

Finally, every Writer in the current WriteGroup has its state set to STATE_COMPLETED, meaning its write is done.

// WriteThread::ExitAsBatchGroupLeader()
while (last_writer != leader) {
    assert(last_writer);
    last_writer->status = status;
    // we need to read link_older before calling SetState, because as soon
    // as it is marked committed the other thread's Await may return and
    // deallocate the Writer.
    auto next = last_writer->link_older;
    SetState(last_writer, STATE_COMPLETED);

    last_writer = next;
}

Note that once the new leader is chosen, it is woken up, returns from JoinBatchGroup(), takes the leader branch described above, builds the next WriteGroup, and the cycle repeats.


That concludes the source analysis related to WriteGroup. The next post analyzes writing to the WAL.
