RocksDB Source Code Study (7): Write (3) - WAL

_CLAY_


Article link: https://blog.csdn.net/weixin_46322986/article/details/128104437


This post analyzes, at the source-code level, how the WAL is created and written in RocksDB. It does not cover 2PC, and the code version used is v7.7.4.

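The main job of the WAL is to recover the data that was in the memtable when RocksDB exits abnormally, so by default every user write is also written to the WAL. Whenever the memtable that the current WAL belongs to is flushed to disk, a new WAL is created, i.e. one memtable corresponds to one WAL. In fact, flushing a memtable into an sstable is done in the background through an immutable memtable, so as soon as a memtable is converted into an immutable memtable, a new memtable and its matching WAL are created.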

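Every WAL is eventually written to its own WAL file. All WAL files are stored in options.wal_dir, and to preserve ordering their names follow an increasing number (log_number). Before a WriteGroup is written to the memtable, it is first written to the WAL. The write path differs between 2PC and !2PC; only !2PC is considered here.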

WAL structure

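A WAL consists of variable-length records laid out in blocks of kBlockSize (32 KiB). If a record does not fit in the space left in the current block, it is split into several records, and the type field tells how the pieces fit together.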

/**
 * File is broken down into variable sized records. The format of each record
 * is described below.
 *       +-----+-------------+--+----+----------+------+-- ... ----+
 * File  | r0  |        r1   |P | r2 |    r3    |  r4  |           |
 *       +-----+-------------+--+----+----------+------+-- ... ----+
 *       <--- kBlockSize ------>|<-- kBlockSize ------>|
 *  rn = variable size records
 *  P = Padding

If the next record does not fit into the space left in a block, the leftover space is padded with \0.

/**
 * Data is written out in kBlockSize chunks. If next record does not fit
 * into the space left, the leftover space will be padded with \0.

The record format is as follows:

/**
 * Recyclable record format:
 *
 * +---------+-----------+-----------+----------------+--- ... ---+
 * |CRC (4B) | Size (2B) | Type (1B) | Log number (4B)| Payload   |
 * +---------+-----------+-----------+----------------+--- ... ---+
 *
 * CRC = 32bit hash computed over the record type and payload using CRC
 * Size = Length of the payload data
 * Type = Type of record
 *        (kZeroType, kFullType, kFirstType, kLastType, kMiddleType )
 *        The type is used to group a bunch of records together to represent
 *        blocks that are larger than kBlockSize
 * Payload = Byte stream as long as specified by the payload size
 * Log number = 32bit log file number, so that we can distinguish between
 * records written by the most recent log writer vs a previous one.
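To make the block layout concrete, here is a minimal sketch of how a payload could be cut into fragments. It assumes the 11-byte recyclable header shown above and pads a block only when the space left cannot hold a header. `PlanFragments`, `Fragment`, and `FragmentType` are hypothetical names for this illustration, not RocksDB's API.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Values taken from the format described above; names are local to this sketch.
constexpr std::size_t kBlockSize = 32768;  // 32 KiB
constexpr std::size_t kHeaderSize = 11;    // CRC 4 + size 2 + type 1 + log number 4

enum class FragmentType { kFull, kFirst, kMiddle, kLast };

struct Fragment {
  FragmentType type;
  std::size_t payload_len;     // payload bytes carried by this fragment
  std::size_t padding_before;  // zero bytes written first to finish a block
};

// Plans the fragments for a payload of `payload_len` bytes when the writer is
// `block_offset` bytes into the current block (block_offset <= kBlockSize).
std::vector<Fragment> PlanFragments(std::size_t payload_len,
                                    std::size_t block_offset) {
  std::vector<Fragment> plan;
  std::size_t left = payload_len;
  bool begin = true;
  do {
    std::size_t padding = 0;
    const std::size_t leftover = kBlockSize - block_offset;
    if (leftover < kHeaderSize) {
      // Not even a header fits: pad the tail and start a new block.
      padding = leftover;
      block_offset = 0;
    }
    const std::size_t avail = kBlockSize - block_offset - kHeaderSize;
    const std::size_t len = std::min(left, avail);
    const bool end = (len == left);
    FragmentType type;
    if (begin && end) {
      type = FragmentType::kFull;
    } else if (begin) {
      type = FragmentType::kFirst;
    } else if (end) {
      type = FragmentType::kLast;
    } else {
      type = FragmentType::kMiddle;
    }
    plan.push_back(Fragment{type, len, padding});
    block_offset += kHeaderSize + len;
    left -= len;
    begin = false;
  } while (left > 0);
  return plan;
}
```

With these assumptions, a 40000-byte payload written at the start of a block becomes a kFirst fragment of 32757 bytes followed by a kLast fragment of 7243 bytes.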

One record holds one WriteBatch in its payload. The content is WriteBatch::rep_, a string with the following format:

// WriteBatch::rep_ :=
//    sequence: fixed64
//    count: fixed32
//    data: record[count]
// record :=
//    kTypeValue varstring varstring
//    kTypeDeletion varstring
//    kTypeSingleDeletion varstring
//    kTypeRangeDeletion varstring varstring
//    kTypeMerge varstring varstring
//    kTypeColumnFamilyValue varint32 varstring varstring
//    kTypeColumnFamilyDeletion varint32 varstring
//    kTypeColumnFamilySingleDeletion varint32 varstring
//    kTypeColumnFamilyRangeDeletion varint32 varstring varstring
//    kTypeColumnFamilyMerge varint32 varstring varstring
//    kTypeBeginPrepareXID
//    kTypeEndPrepareXID varstring
//    kTypeCommitXID varstring
//    kTypeCommitXIDAndTimestamp varstring varstring
//    kTypeRollbackXID varstring
//    kTypeBeginPersistedPrepareXID
//    kTypeBeginUnprepareXID
//    kTypeWideColumnEntity varstring varstring
//    kTypeColumnFamilyWideColumnEntity varint32 varstring varstring
//    kTypeNoop
// varstring :=
//    len: varint32
//    data: uint8[len]

As you can see, a WriteBatch carries a single seq num, which shows that the WAL works in units of WriteBatch rather than in units of the individual operations inside a WriteBatch.
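The rep_ layout can be illustrated by building one by hand. The sketch below assumes little-endian fixed-width integers, the usual 7-bits-per-byte varint32, and a tag value of 0x1 for kTypeValue; all helper names (`PutFixed`, `GetFixed`, `PutVarint32`, `PutVarstring`, `MakeSinglePutBatch`) are made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

constexpr std::size_t kBatchHeaderSize = 12;  // 8-byte sequence + 4-byte count
constexpr char kTypeValueTag = 0x1;           // assumed tag byte for kTypeValue

// Appends the low `n` bytes of `v` in little-endian order.
void PutFixed(std::string* dst, std::uint64_t v, int n) {
  for (int i = 0; i < n; ++i) {
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
  }
}

// Reads `n` little-endian bytes starting at `pos`.
std::uint64_t GetFixed(const std::string& src, std::size_t pos, int n) {
  std::uint64_t v = 0;
  for (int i = 0; i < n; ++i) {
    v |= static_cast<std::uint64_t>(static_cast<unsigned char>(src[pos + i]))
         << (8 * i);
  }
  return v;
}

// varint32: 7 bits per byte, high bit set when more bytes follow.
void PutVarint32(std::string* dst, std::uint32_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// varstring := len: varint32, data: uint8[len]
void PutVarstring(std::string* dst, const std::string& s) {
  PutVarint32(dst, static_cast<std::uint32_t>(s.size()));
  dst->append(s);
}

// Builds a rep_ holding a single Put: header, tag, key, value.
std::string MakeSinglePutBatch(std::uint64_t sequence, const std::string& key,
                               const std::string& value) {
  std::string rep;
  PutFixed(&rep, sequence, 8);  // sequence: fixed64
  PutFixed(&rep, 1, 4);         // count: fixed32
  rep.push_back(kTypeValueTag);
  PutVarstring(&rep, key);
  PutVarstring(&rep, value);
  return rep;
}
```

Reading the header back is then `GetFixed(rep, 0, 8)` for the sequence and `GetFixed(rep, 8, 4)` for the count; everything from offset kBatchHeaderSize onward is the list of operations.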

WAL creation

RocksDB creates a WAL in two cases:

A WAL is created when a new DB is opened.
A WAL is created after the memtable of a CF has been flushed.

In the first case, the entry point for creating the WAL is wrapped inside DB::Open(). The core source is:

Status DBImpl::Open(const DBOptions& db_options, const std::string& dbname,
                    const std::vector<ColumnFamilyDescriptor>& column_families,
                    std::vector<ColumnFamilyHandle*>* handles, DB** dbptr,
                    const bool seq_per_batch, const bool batch_per_txn) {
//......................................................................
  s = impl->Recover(column_families, false, false, false, &recovered_seq,
                    &recovery_ctx);
  if (s.ok()) {
    uint64_t new_log_number = impl->versions_->NewFileNumber();
//.............................................
    s = impl->CreateWAL(new_log_number, 0 /*recycle_log_number*/,
                        preallocate_block_size, &new_log);
  }
//................................................
}

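In the second case, when a CF's memtable needs to be flushed, DBImpl::Flush() calls its own FlushMemTable(), and the new WAL is created during the memtable flush. When a CF flush is triggered, the in-memory memtable is marked as an immutable memtable and converted into an sstable in the background, while a new memtable is created. At that point the WAL holds the requests of the old memtable. To keep the data isolated and to keep any single WAL from growing too large, each WAL file is bound to one memtable only, so switching memtables creates a new WAL file to receive new requests.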

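The call chain is shown below and ends at DBImpl::SwitchMemtable(). That function does something simple: it creates a new memtable and turns the old one into an immutable memtable. It also wraps the entry point for creating the WAL.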

Status DBImpl::Flush(const FlushOptions& flush_options,
                     ColumnFamilyHandle* column_family) {
    ...
    // Mainly just flushes the memtable
    s = FlushMemTable(cfh->cfd(), flush_options, FlushReason::kManualFlush);
    ...
}

Status DBImpl::FlushMemTable(ColumnFamilyData* cfd,
                             const FlushOptions& flush_options,
                             FlushReason flush_reason, bool writes_stopped) {
    ...
    // Switch the memtable
    s = SwitchMemtable(cfd, &context);
    ...
}

Status DBImpl::SwitchMemtable(ColumnFamilyData* cfd, WriteContext* context) {
//..................................................
  if (creating_new_log) {
    // TODO: Write buffer size passed in should be max of all CF's instead
    // of mutable_cf_options.write_buffer_size.
    io_s = CreateWAL(new_log_number, recycle_log_number, preallocate_block_size,
                     &new_log);
    if (s.ok()) {
      s = io_s;
    }
  }
//...............................................
  return s;
}

In both cases CreateWAL() is called with a new_log_number, and this value becomes the numeric prefix of the WAL file name. The main source of CreateWAL() is:

IOStatus DBImpl::CreateWAL(uint64_t log_file_num, uint64_t recycle_log_number,
                           size_t preallocate_block_size,
                           log::Writer** new_log) {
  IOStatus io_s;
  std::unique_ptr<FSWritableFile> lfile;

  DBOptions db_options =
      BuildDBOptions(immutable_db_options_, mutable_db_options_);
  FileOptions opt_file_options =
      fs_->OptimizeForLogWrite(file_options_, db_options);
  std::string wal_dir = immutable_db_options_.GetWalDir();
  std::string log_fname = LogFileName(wal_dir, log_file_num);

  if (recycle_log_number) {
    ROCKS_LOG_INFO(immutable_db_options_.info_log,
                   "reusing log %" PRIu64 " from recycle list\n",
                   recycle_log_number);
    std::string old_log_fname = LogFileName(wal_dir, recycle_log_number);
    TEST_SYNC_POINT("DBImpl::CreateWAL:BeforeReuseWritableFile1");
    TEST_SYNC_POINT("DBImpl::CreateWAL:BeforeReuseWritableFile2");
    io_s = fs_->ReuseWritableFile(log_fname, old_log_fname, opt_file_options,
                                  &lfile, /*dbg=*/nullptr);
  } else {
    io_s = NewWritableFile(fs_.get(), log_fname, &lfile, opt_file_options);
  }
  // ...
}

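As shown, it builds a file name from wal_dir and log_file_num and then checks whether recycle_log_number is nonzero. If it is zero, NewWritableFile() creates a brand-new WAL file. Otherwise ReuseWritableFile() reuses the old WAL file, which is renamed to the new log_fname.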

Looking back at how log_file_num is generated: it is simply an incrementing counter implemented by NewFileNumber(). WAL files are therefore usually named along the lines of 000001.log.

// Allocate and return a new file number
uint64_t NewFileNumber() { return next_file_number_.fetch_add(1); }
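As a rough illustration of how a number turns into a WAL path, the sketch below zero-pads the number to six digits and appends a .log suffix, which matches the file names RocksDB leaves in wal_dir as far as I can tell. `WalFileNameSketch` is a hypothetical stand-in for LogFileName(), not its actual implementation.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Builds "<wal_dir>/<number padded to 6 digits>.log".
// Numbers wider than 6 digits are printed in full, without truncation.
std::string WalFileNameSketch(const std::string& wal_dir,
                              std::uint64_t number) {
  char buf[40];
  std::snprintf(buf, sizeof(buf), "/%06" PRIu64 ".log", number);
  return wal_dir + buf;
}
```

For example, `WalFileNameSketch("/tmp/db", 7)` yields `/tmp/db/000007.log`.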

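WAL write

Now for the main topic of this post: writing to the WAL.

As analyzed in the previous post, in DBImpl::WriteImpl() the Leader of the WriteGroup writes the whole WriteGroup to the WAL before writing the memtable. The source is: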

// DBImpl::WriteImpl()
    if (!two_write_queues_) {
      if (status.ok() && !write_options.disableWAL) {
        assert(log_context.log_file_number_size);
        LogFileNumberSize& log_file_number_size =
            *(log_context.log_file_number_size);
        PERF_TIMER_GUARD(write_wal_time);
        io_s =
            WriteToWAL(write_group, log_context.writer, log_used,
                       log_context.need_log_sync, log_context.need_log_dir_sync,
                       last_sequence + 1, log_file_number_size);
      }
    } else {
      if (status.ok() && !write_options.disableWAL) {
        PERF_TIMER_GUARD(write_wal_time);
        // LastAllocatedSequence is increased inside WriteToWAL under
        // wal_write_mutex_ to ensure ordered events in WAL
        io_s = ConcurrentWriteToWAL(write_group, log_used, &last_sequence,
                                    seq_inc);
      } else {
        // Otherwise we inc seq number for memtable writes
        last_sequence = versions_->FetchAddLastAllocatedSequence(seq_inc);
      }
    }

Only the !2PC path is considered here, in which RocksDB calls DBImpl::WriteToWAL():

WriteToWAL(write_group, log_context.writer, log_used,
           log_context.need_log_sync, log_context.need_log_dir_sync,
           last_sequence + 1, log_file_number_size);

Here write_group is the Leader's WriteGroup. But what is last_sequence + 1? Looking back through the code, last_sequence is the last seq num of the current VersionSet:

// DBImpl::WriteImpl()
    if (!two_write_queues_) {
      // Assign it after ::PreprocessWrite since the sequence might advance
      // inside it by WriteRecoverableState
      last_sequence = versions_->LastSequence();
    }

So last_sequence + 1 assigns a new seq num to this WriteGroup. WriteToWAL() has four main steps: MergeBatch, SetSequence, WriteToWAL (the WriteBatch overload), and Sync.

IOStatus DBImpl::WriteToWAL(const WriteThread::WriteGroup& write_group,
                            log::Writer* log_writer, uint64_t* log_used,
                            bool need_log_sync, bool need_log_dir_sync,
                            SequenceNumber sequence,
                            LogFileNumberSize& log_file_number_size) {
  // ...  
  WriteBatch* merged_batch;
  io_s = status_to_io_status(MergeBatch(write_group, &tmp_batch_, &merged_batch,
                                        &write_with_wal, &to_be_cached_state));
  // ...
  WriteBatchInternal::SetSequence(merged_batch, sequence);
  // ...
  io_s = WriteToWAL(*merged_batch, log_writer, log_used, &log_size,
                    write_group.leader->rate_limiter_priority,
                    log_file_number_size);
  // ...
  if (io_s.ok() && need_log_sync) {
    for (auto& log : logs_) {
      io_s = log.writer->file()->Sync(immutable_db_options_.use_fsync);
      if (!io_s.ok()) {
        break;
      }
    }
    // ...
  }
  // ...
}

MergeBatch

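Before writing to the WAL, RocksDB merges the WriteGroup into a single WriteBatch named merged_batch. Note that this merge is not the merge of an LSM tree: it only concatenates and does not deduplicate. The full source is: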

Status DBImpl::MergeBatch(const WriteThread::WriteGroup& write_group,
                          WriteBatch* tmp_batch, WriteBatch** merged_batch,
                          size_t* write_with_wal,
                          WriteBatch** to_be_cached_state) {
  assert(write_with_wal != nullptr);
  assert(tmp_batch != nullptr);
  assert(*to_be_cached_state == nullptr);
  *write_with_wal = 0;
  auto* leader = write_group.leader;
  assert(!leader->disable_wal);  // Same holds for all in the batch group
  if (write_group.size == 1 && !leader->CallbackFailed() &&
      leader->batch->GetWalTerminationPoint().is_cleared()) {
    // we simply write the first WriteBatch to WAL if the group only
    // contains one batch, that batch should be written to the WAL,
    // and the batch is not wanting to be truncated
    *merged_batch = leader->batch;
    if (WriteBatchInternal::IsLatestPersistentState(*merged_batch)) {
      *to_be_cached_state = *merged_batch;
    }
    *write_with_wal = 1;
  } else {
    // WAL needs all of the batches flattened into a single batch.
    // We could avoid copying here with an iov-like AddRecord
    // interface
    *merged_batch = tmp_batch;
    for (auto writer : write_group) {
      if (!writer->CallbackFailed()) {
        Status s = WriteBatchInternal::Append(*merged_batch, writer->batch,
                                              /*WAL_only*/ true);
        if (!s.ok()) {
          tmp_batch->Clear();
          return s;
        }
        if (WriteBatchInternal::IsLatestPersistentState(writer->batch)) {
          // We only need to cache the last of such write batch
          *to_be_cached_state = writer->batch;
        }
        (*write_with_wal)++;
      }
    }
  }
  // return merged_batch;
  return Status::OK();
}

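The logic is simple. If the group holds a single batch, that batch is used directly as merged_batch. Otherwise RocksDB walks the WriteGroup and appends every WriteBatch whose callback did not fail into merged_batch.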
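At the byte level, appending one batch to another can be pictured as adding the counts and concatenating everything after the 12-byte header. The sketch below works on raw rep_ strings under that assumption; `EmptyBatch`, `GetCount`, `SetCount`, `AppendBatch`, and `MergeGroup` are hypothetical helpers, and the real WriteBatchInternal::Append() handles more cases (such as the WAL termination point).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::size_t kHeader = 12;  // sequence (8 bytes) + count (4 bytes)

// An empty rep_: zero sequence, zero count, no operations.
std::string EmptyBatch() { return std::string(kHeader, '\0'); }

// Reads the little-endian count stored at bytes 8..11.
std::uint32_t GetCount(const std::string& rep) {
  std::uint32_t n = 0;
  for (int i = 0; i < 4; ++i) {
    n |= static_cast<std::uint32_t>(static_cast<unsigned char>(rep[8 + i]))
         << (8 * i);
  }
  return n;
}

// Overwrites the count field in place.
void SetCount(std::string* rep, std::uint32_t n) {
  for (int i = 0; i < 4; ++i) {
    (*rep)[8 + i] = static_cast<char>((n >> (8 * i)) & 0xff);
  }
}

// Appends the operations of `src` to `dst`; both must hold a full header.
void AppendBatch(std::string* dst, const std::string& src) {
  SetCount(dst, GetCount(*dst) + GetCount(src));
  dst->append(src, kHeader, std::string::npos);  // skip src's own header
}

// Flattens a group of batches into one, keeping their order.
std::string MergeGroup(const std::vector<std::string>& group) {
  std::string merged = EmptyBatch();
  for (const std::string& rep : group) {
    AppendBatch(&merged, rep);
  }
  return merged;
}
```

Nothing is deduplicated here: two writes to the same key simply appear one after the other in the merged batch, in group order.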

SetSequence

Up to this point no seq num has been assigned to the write. Once merged_batch is built, RocksDB assigns it a seq num, which is last_sequence + 1.

// DBImpl::WriteToWAL
  // sequence == last_sequence + 1
  WriteBatchInternal::SetSequence(merged_batch, sequence);

The source of SetSequence() is:

void WriteBatchInternal::SetSequence(WriteBatch* b, SequenceNumber seq) {
  EncodeFixed64(&b->rep_[0], seq);
}

As described earlier, the first field of WriteBatch::rep_ is the seq num; this function sets it to the newest seq num.
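The effect of that single EncodeFixed64() call can be shown on a raw string. This sketch assumes the fixed64 is stored little-endian in the first 8 bytes of rep_; `SetSequenceSketch` and `GetSequenceSketch` are hypothetical names for illustration.

```cpp
#include <cstdint>
#include <string>

// Overwrites the first 8 bytes of `rep` with `seq`, least significant byte
// first. `rep` must already hold at least 8 bytes.
void SetSequenceSketch(std::string* rep, std::uint64_t seq) {
  for (int i = 0; i < 8; ++i) {
    (*rep)[i] = static_cast<char>((seq >> (8 * i)) & 0xff);
  }
}

// Reads the sequence back from the first 8 bytes.
std::uint64_t GetSequenceSketch(const std::string& rep) {
  std::uint64_t seq = 0;
  for (int i = 0; i < 8; ++i) {
    seq |= static_cast<std::uint64_t>(static_cast<unsigned char>(rep[i]))
           << (8 * i);
  }
  return seq;
}
```

Only the header changes: the count and the operations that follow are left untouched, which is why the sequence can be assigned after the batch has been merged.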

WriteToWAL (for a WriteBatch)

WriteToWAL() has two overloads, one taking a WriteGroup and one taking a WriteBatch:

IOStatus WriteToWAL(const WriteBatch& merged_batch, log::Writer* log_writer,
                    uint64_t* log_used, uint64_t* log_size,
                    Env::IOPriority rate_limiter_priority,
                    LogFileNumberSize& log_file_number_size);

IOStatus WriteToWAL(const WriteThread::WriteGroup& write_group,
                    log::Writer* log_writer, uint64_t* log_used,
                    bool need_log_sync, bool need_log_dir_sync,
                    SequenceNumber sequence,
                    LogFileNumberSize& log_file_number_size);

The WAL write enters through the second overload, which takes the WriteGroup, but it ends up calling the first overload with merged_batch:

// DBImpl::WriteToWAL
  io_s = WriteToWAL(*merged_batch, log_writer, log_used, &log_size,
                    write_group.leader->rate_limiter_priority,
                    log_file_number_size);

The full source of that overload is:

IOStatus DBImpl::WriteToWAL(const WriteBatch& merged_batch,
                            log::Writer* log_writer, uint64_t* log_used,
                            uint64_t* log_size,
                            Env::IOPriority rate_limiter_priority,
                            LogFileNumberSize& log_file_number_size) {
  assert(log_size != nullptr);

  Slice log_entry = WriteBatchInternal::Contents(&merged_batch);
  TEST_SYNC_POINT_CALLBACK("DBImpl::WriteToWAL:log_entry", &log_entry);
  auto s = merged_batch.VerifyChecksum();
  if (!s.ok()) {
    return status_to_io_status(std::move(s));
  }
  *log_size = log_entry.size();
  // When two_write_queues_ WriteToWAL has to be protected from concurrent calls
  // from the two queues anyway and log_write_mutex_ is already held. Otherwise
  // if manual_wal_flush_ is enabled we need to protect log_writer->AddRecord
  // from possible concurrent calls via the FlushWAL by the application.
  const bool needs_locking = manual_wal_flush_ && !two_write_queues_;
  // Due to performance concerns of missed branch prediction penalize the new
  // manual_wal_flush_ feature (by UNLIKELY) instead of the more common case
  // when we do not need any locking.
  if (UNLIKELY(needs_locking)) {
    log_write_mutex_.Lock();
  }
  IOStatus io_s = log_writer->AddRecord(log_entry, rate_limiter_priority);

  if (UNLIKELY(needs_locking)) {
    log_write_mutex_.Unlock();
  }
  if (log_used != nullptr) {
    *log_used = logfile_number_;
  }
  total_log_size_ += log_entry.size();
  log_file_number_size.AddSize(*log_size);
  log_empty_ = false;
  return io_s;
}

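First, it wraps merged_batch into log_entry, which becomes a record in the WAL. Then it calls Writer::AddRecord() to write that record into the WAL and ultimately into the WAL file. The call chain of the whole write is AddRecord --> Flush --> WriteBuffered --> PosixWritableFile::Append --> PosixWrite --> write. The earlier stages mostly format and buffer bytes; only the final Linux write system call actually hands the data to the WAL file.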
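The tail of that chain boils down to "collect bytes in a user-space buffer, then loop over write(2) until everything is handed to the kernel". Below is a minimal POSIX sketch of that idea; `BufferedFdWriter` is a hypothetical class for illustration, far simpler than RocksDB's WritableFileWriter.

```cpp
#include <unistd.h>

#include <cerrno>
#include <cstddef>
#include <string>

// Buffers appended bytes in memory and passes them to write(2) on Flush().
class BufferedFdWriter {
 public:
  explicit BufferedFdWriter(int fd) : fd_(fd) {}

  // Only touches the in-memory buffer; no system call happens here.
  void Append(const std::string& data) { buf_.append(data); }

  // Writes the whole buffer. write(2) may accept fewer bytes than requested
  // or be interrupted by a signal, so the call is retried in a loop.
  bool Flush() {
    const char* p = buf_.data();
    std::size_t left = buf_.size();
    while (left > 0) {
      const ssize_t n = ::write(fd_, p, left);
      if (n < 0) {
        if (errno == EINTR) continue;  // interrupted, try again
        return false;
      }
      p += n;
      left -= static_cast<std::size_t>(n);
    }
    buf_.clear();
    return true;
  }

  std::size_t buffered() const { return buf_.size(); }

 private:
  int fd_;
  std::string buf_;
};
```

A successful Flush() only means the kernel has accepted the bytes; it says nothing yet about whether they have reached the disk.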

Sync

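First, some background on the Linux I/O mechanism.

Linux keeps a buffer cache (page cache) in the kernel, and most disk I/O goes through it. When we write to a file, the kernel usually copies the data into a buffer first. If the buffer is not yet full, it is not queued for output; only when it fills up, or when the kernel needs to reuse it for other data, is it placed on the output queue, and the actual I/O happens once it reaches the head of the queue. This is called delayed write. Delayed write reduces the number of disk reads and writes, but it also slows down how quickly file contents are updated on disk. If the system fails, updates that were still pending may be lost.

To deal with this, Linux provides three functions that keep the actual file system consistent with the contents of the buffer cache:

sync: queues all modified block buffers for writing and then returns; POSIX does not require it to wait for the actual disk writes to finish.
fsync: applies only to the single file referred to by the file descriptor fd, and returns only after the disk writes have finished.
fdatasync: similar to fsync, but it only affects the data portion of the file. Besides the data, fsync also synchronizes the file's attributes.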
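A minimal sketch of choosing between the two per-file calls is shown below. It assumes a Linux/POSIX system where fdatasync() is available; `SyncFd` and `WriteDurably` are hypothetical helpers, and the single write() call is only adequate for small payloads.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <string>

// Returns true if the kernel reports that the file was synchronized.
// use_fsync == true also flushes metadata; false flushes data only.
bool SyncFd(int fd, bool use_fsync) {
  return (use_fsync ? ::fsync(fd) : ::fdatasync(fd)) == 0;
}

// Writes `data` to `path` and syncs before closing.
bool WriteDurably(const std::string& path, const std::string& data,
                  bool use_fsync) {
  const int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return false;
  const ssize_t n = ::write(fd, data.data(), data.size());
  const bool ok =
      n == static_cast<ssize_t>(data.size()) && SyncFd(fd, use_fsync);
  ::close(fd);
  return ok;
}
```

The trade-off is the usual one: syncing after every write narrows the window in which a crash can lose data, at the cost of waiting for the storage device each time.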

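Therefore, the WriteToWAL() in the previous step does not necessarily get the data onto disk: a failure in between can lose whatever is still sitting in the write cache. To make sure the written content reaches disk, RocksDB performs a sync after WriteToWAL(), and need_log_sync decides whether it runs: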

// DBImpl::WriteToWAL
  if (io_s.ok() && need_log_sync) {
    // ...
    for (auto& log : logs_) {
      io_s = log.writer->file()->Sync(immutable_db_options_.use_fsync);
      if (!io_s.ok()) {
        break;
      }
    }
    // ...
  }

log.writer->file()->Sync() goes on to call WritableFileWriter::SyncInternal(). Its core source is:

IOStatus WritableFileWriter::SyncInternal(bool use_fsync) {
  IOStatus s;
  // ...
  if (use_fsync) {
    s = writable_file_->Fsync(io_options, nullptr);
  } else {
    s = writable_file_->Sync(io_options, nullptr);
  }
  // ...
  return s;
}

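Tracing further, Fsync() ends up calling fsync() and Sync() ends up calling fdatasync() (or fcntl(F_FULLFSYNC) when HAVE_FULLFSYNC is defined), as the PosixMmapFile implementation below shows: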

IOStatus PosixMmapFile::Sync(const IOOptions& /*opts*/,
                             IODebugContext* /*dbg*/) {
#ifdef HAVE_FULLFSYNC
  if (::fcntl(fd_, F_FULLFSYNC) < 0) {
    return IOError("while fcntl(F_FULLSYNC) mmapped file", filename_, errno);
  }
#else   // HAVE_FULLFSYNC
  if (fdatasync(fd_) < 0) {
    return IOError("While fdatasync mmapped file", filename_, errno);
  }
#endif  // HAVE_FULLFSYNC

  return Msync();
}

/**
 * Flush data as well as metadata to stable storage.
 */
IOStatus PosixMmapFile::Fsync(const IOOptions& /*opts*/,
                              IODebugContext* /*dbg*/) {
#ifdef HAVE_FULLFSYNC
  if (::fcntl(fd_, F_FULLFSYNC) < 0) {
    return IOError("While fcntl(F_FULLSYNC) on mmaped file", filename_, errno);
  }
#else   // HAVE_FULLFSYNC
  if (fsync(fd_) < 0) {
    return IOError("While fsync mmaped file", filename_, errno);
  }
#endif  // HAVE_FULLFSYNC

  return Msync();
}