目录
db_bench 写数据流程分析
db_bench 是 RocksDB 为我们提供的数据库性能测试工具,通过初始化各类命令行选项,可对数据库进行全方位的读写性能测试。在使用这个工具进行自定义测试的时候,面对众多的可设置选项该如何抉择呢?下面就从 db_bench
测试工具的源代码开始分析,看看这些纷繁复杂的选项是怎样一步步影响 RocksDB
写流程的。
一、初始化阶段
位于 tools/db_bench_tool.cc 路径下的 db_bench_tool(int, char **)
函数,完成对主要参数和环境的初始化以后,创建基准测试对象 Benchmark
。
ROCKSDB_NAMESPACE::Benchmark benchmark;
benchmark.Run();
接着,Run
函数会完成另一部分初始化工作,包含测试键值对条目、键值大小和是否开启写前日志等参数的设置。
通过匹配传入的测试方法名称,设置并调用相应函数。
...
} else if (name == "fillseq") {
fresh_db = true;
method = &Benchmark::WriteSeq;
} else if (name == "fillbatch") {
fresh_db = true;
entries_per_batch_ = 1000;
method = &Benchmark::WriteSeq;
} else if (name == "fillrandom") {
fresh_db = true;
method = &Benchmark::WriteRandom;
} else if
...
这些写函数最终都调用了相同函数 DoWrite
void WriteSeq(ThreadState* thread) {
DoWrite(thread, SEQUENTIAL);
}
void WriteRandom(ThreadState* thread) {
DoWrite(thread, RANDOM);
}
void WriteUniqueRandom(ThreadState* thread) {
DoWrite(thread, UNIQUE_RANDOM);
}
接下来分析 DoWrite
函数的实现
Benchmark::DoWrite()
代码第 5037 - 5043 行,是当列族的数目 > 1时,进行 Put
操作。
} else if (FLAGS_num_column_families <= 1) {
batch.Put(key, val);
} else {
// We use same rand_num as seed for key and column family so that we
// can deterministically find the cfh corresponding to a particular
// key while reading the key.
batch.Put(db_with_cfh->GetCfh(rand_num), key,
val);
}
使用 rand_num 绑定 column families 和 key 。
使用 rand_num 绑定 key 的方法
Benchmark::GenerateKeyFromInt()
根据给定的规范和随机数生成 key。
GenerateKeyFromInt(rand_num, FLAGS_num, &key);
如果 keys_per_prefix_ 为正,则额外的尾随字节将被截断或用“0”填充:
// ----------------------------
// | prefix 00000 | key 00000 |
// ----------------------------
如果keys_per_prefix_ 为0,则 key 只是随机数的二进制表示,后跟尾随’0’:
// ----------------------------
// | key 00000 |
// ----------------------------
使用 rand_num 绑定 Cfh 的方法
根据当前随机数获取列族 ColumnFamilyHandle ColumnFamilyHandle* GetCfh(int64_t rand_num)
。使用相同的随机数 rand_num 作为列族和键的线索,这样就可以确定地找到与键对应的 cfh,同时读取键。
二、写流程分析
写流程仍然是从 Benchmark::DoWrite 第 5041-5042 行代码开始继续向下分析:
} else if (FLAGS_num_column_families <= 1) {
batch.Put(key, val); // 列族数目不大于1的情况
} else { // 含有多个列族
// We use same rand_num as seed for key and column family so that we
// can deterministically find the cfh corresponding to a particular
// key while reading the key.
batch.Put(db_with_cfh->GetCfh(rand_num), key,
val);
}
这一行调用了 WriteBatch batch
(声明在 DoWrite 函数内,第4793行)的 Put 函数,函数定义如下:
Status WriteBatch::Put(ColumnFamilyHandle* column_family, const Slice& key,
const Slice& value) {
return WriteBatchInternal::Put(this, GetColumnFamilyID(column_family), key,
value);
}
WriteBatchInternel 的编码过程
在函数内传入了 this 指针,并通过 GetColumnFamilyID 函数得到 column families 的 ID,实际上是让 ColumnFamilyHandle 变量返回 GetID 函数的结果。继续调用 WriteBatchInternal 的 Put 函数。
Status WriteBatchInternal::Put(WriteBatch* b, uint32_t column_family_id,
const Slice& key, const Slice& value) {
...
LocalSavePoint save(b);
WriteBatchInternal::SetCount(b, WriteBatchInternal::Count(b) + 1);
// 操作总数 +1
if (column_family_id == 0) {
b->rep_.push_back(static_cast<char>(kTypeValue));
// 如果 column_family_id = 0,加入 value 的类型。
// kTypeValue(unsigned char) = 0x1
} else {
b->rep_.push_back(static_cast<char>(kTypeColumnFamilyValue));
// 先加入 kTypeColumnFamilyValue = 0x5 WAL only.
PutVarint32(&b->rep_, column_family_id); // 然后加入 column family ID
}
PutLengthPrefixedSlice(&b->rep_, key); // 放入 key 大小和数据
PutLengthPrefixedSlice(&b->rep_, value); // 放入 value 大小和数据
...
return save.commit();
}
WriteBatch 其成员变量包含一个 string 类型的 rep_。
每一个 WriteBatch 都是以一个固定长度的头部开始,然后后面接着许多连续的记录,
固定头部共12字节,其中前8字节为WriteBatch的序列号,对应rep_[0]到rep_[7],每次处理Batch中的记录时才会更新,后四字节为当前Batch中的记录数,对应rep_[8]到rep_[11];通过头部我们就可以知道每一个batch的序号了,并能够知道每一个batch中的记录数,将batch写入到文件中后就可以方便的依次获取每一条记录了。
后面的记录结构为:
插入数据时:type(kTypeValue、kTypeDeletion),Key_size,Key,Value_size,Value
删除数据时:type(kTypeValue、kTypeDeletion),Key_size,Key ————————————————
版权声明:本文为CSDN博主「禾夕」的原创文章,遵循CC 4.0 BY-SA版权协议,转载请附上原文出处链接及本声明。
原文链接:https://blog.csdn.net/u012658346/article/details/45341885
可以看到这个字符串首先有8字节的序列号和4字节的记录数作为头,所以这个类定义了
static const size_t KHeader=12
作为这个字符串的最小长度。在头之后,紧接着就是一条一条的记录。
对于插入的记录,由kTypeValue+key长度+key+value长度+value组成
对于删除记录,由kTypeDelete+key长度+key组成
这类定义了写和删除的操作实现就是调用WriteBatchInternal这个类里面的方法
————————————————
版权声明:本文为CSDN博主「UncleYau」的原创文章,遵循CC 4.0 BY-SA版权协议,转载请附上原文出处链接及本声明。
原文链接:https://blog.csdn.net/u014361034/article/details/61928521
WriteBatch 的写入过程
经过上节的分析可知,具体的键值对存储在 WriteBatch 类中,WriteBatchInternal 则提供操作 WriteBatch 的方法。
参考 https://blog.csdn.net/u014361034/article/details/61928521
virtual Status Write(const WriteOptions& options, WriteBatch* updates) = 0; // db 的纯虚函数
Status DBImpl::Write(const WriteOptions& write_options, WriteBatch* my_batch) {
return WriteImpl(write_options, my_batch, nullptr, nullptr);
}
Status DBImpl::WriteImpl(const WriteOptions& write_options,
WriteBatch* my_batch, WriteCallback* callback,
uint64_t* log_used, uint64_t log_ref,
bool disable_memtable, uint64_t* seq_used,
size_t batch_cnt,
PreReleaseCallback* pre_release_callback)
{
...
// 建立写操作 :my_batch 是要写的数据
WriteThread::Writer w(write_options, my_batch, callback, log_ref,
disable_memtable, batch_cnt, pre_release_callback);
...
// 建立批量任务组 : 将写操作 w 加入批量组
write_thread_.JoinBatchGroup(&w);
...
// 如果写操作 w 的状态是并行写 memtable 的 writer
if (w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER) {
// we are a non-leader in a parallel group 在组内不是 leader
if (w.ShouldWriteToMemtable()) {
PERF_TIMER_STOP(write_pre_and_post_process_time);
PERF_TIMER_GUARD(write_memtable_time);
ColumnFamilyMemTablesImpl column_family_memtables(
versions_->GetColumnFamilySet());// 从 versions_ 获取当前 Cf 的 MemTable
// 开始将 WriteBatch 写入到 MemTable
w.status = WriteBatchInternal::InsertInto(
&w, w.sequence, &column_family_memtables, &flush_scheduler_,
&trim_history_scheduler_,
write_options.ignore_missing_column_families, 0 /*log_number*/, this,
true /*concurrent_memtable_writes*/, seq_per_batch_, w.batch_cnt,
batch_per_txn_, write_options.memtable_insert_hint_per_batch);
PERF_TIMER_START(write_pre_and_post_process_time);
}
if (write_thread_.CompleteParallelMemTableWriter(&w)) {
// we're responsible for exit batch group
// TODO(myabandeh): propagate status to write_group
auto last_sequence = w.write_group->last_sequence;
versions_->SetLastSequence(last_sequence); // 将当前组的序列设置为全局可见序列
MemTableInsertStatusCheck(w.status);
write_thread_.ExitAsBatchGroupFollower(&w); // 并作为本组的 leader 退出
}
}
...
// 如果写操作 w 的状态为组内 leader,把写操作 w 作为 leader 加入到 WriteGroup 中
last_batch_group_size_ =
write_thread_.EnterAsBatchGroupLeader(&w, &write_group);
... // 省略一些操作 log 的步骤
// 开始将 WriteGroup 内的 Writer 写入到 WAL
if (status.ok()) {
if (!two_write_queues_) {
if (status.ok() && !write_options.disableWAL) {
PERF_TIMER_GUARD(write_wal_time);
io_s = WriteToWAL(write_group, log_writer, log_used, need_log_sync,
need_log_dir_sync, last_sequence + 1);
}
} else {
if (status.ok() && !write_options.disableWAL) {
PERF_TIMER_GUARD(write_wal_time);
// LastAllocatedSequence is increased inside WriteToWAL under
// wal_write_mutex_ to ensure ordered events in WAL
io_s = ConcurrentWriteToWAL(write_group, log_used, &last_sequence,
seq_inc);
} else {
// Otherwise we inc seq number for memtable writes
last_sequence = versions_->FetchAddLastAllocatedSequence(seq_inc);
}
}
// 开始将 WriteBatch 写入到 MemTable
if (status.ok()) {
PERF_TIMER_GUARD(write_memtable_time);
if (!parallel) {
// w.sequence will be set inside InsertInto
// 如果不是并行写:将以 write_group 的形式调用 InsertInto 函数
w.status = WriteBatchInternal::InsertInto(
write_group, current_sequence, column_family_memtables_.get(),
&flush_scheduler_, &trim_history_scheduler_,
write_options.ignore_missing_column_families,
0 /*recovery_log_number*/, this, parallel, seq_per_batch_,
batch_per_txn_);
} else {
write_group.last_sequence = last_sequence;
write_thread_.LaunchParallelMemTableWriters(&write_group);
in_parallel_group = true;
// Each parallel follower is doing each own writes. The leader should
// also do its own.
// 每个组员会自己调用 InsertInto 操作,leader 调用自己的 InsertInto
if (w.ShouldWriteToMemtable()) {
ColumnFamilyMemTablesImpl column_family_memtables(
versions_->GetColumnFamilySet());
assert(w.sequence == current_sequence);
w.status = WriteBatchInternal::InsertInto(
&w, w.sequence, &column_family_memtables, &flush_scheduler_,
&trim_history_scheduler_,
write_options.ignore_missing_column_families, 0 /*log_number*/,
this, true /*concurrent_memtable_writes*/, seq_per_batch_,
w.batch_cnt, batch_per_txn_,
write_options.memtable_insert_hint_per_batch);
}
}
if (seq_used != nullptr) {
*seq_used = w.sequence;
}
}
...
}
WriteBatch 写入 MemTable
由上可知,WriteBatchInternal::InsertInto 函数有三个实现。
// 非 parallel 写
static Status InsertInto(
WriteThread::WriteGroup& write_group, SequenceNumber sequence,
ColumnFamilyMemTables* memtables, FlushScheduler* flush_scheduler,
TrimHistoryScheduler* trim_history_scheduler,
bool ignore_missing_column_families = false, uint64_t log_number = 0,
DB* db = nullptr, bool concurrent_memtable_writes = false,
bool seq_per_batch = false, bool batch_per_txn = true);
static Status InsertInto(
const WriteBatch* batch, ColumnFamilyMemTables* memtables,
FlushScheduler* flush_scheduler,
TrimHistoryScheduler* trim_history_scheduler,
bool ignore_missing_column_families = false, uint64_t log_number = 0,
DB* db = nullptr, bool concurrent_memtable_writes = false,
SequenceNumber* next_seq = nullptr, bool* has_valid_writes = nullptr,
bool seq_per_batch = false, bool batch_per_txn = true);
// parallel 写; 每个成员(包括 leader )负责自己的写入
static Status InsertInto(WriteThread::Writer* writer, SequenceNumber sequence,
ColumnFamilyMemTables* memtables,
FlushScheduler* flush_scheduler,
TrimHistoryScheduler* trim_history_scheduler,
bool ignore_missing_column_families = false,
uint64_t log_number = 0, DB* db = nullptr,
bool concurrent_memtable_writes = false,
bool seq_per_batch = false, size_t batch_cnt = 0,
bool batch_per_txn = true,
bool hint_per_batch = false);
以 write_group 为组地进行插入:
Status WriteBatchInternal::InsertInto(
WriteThread::WriteGroup& write_group, SequenceNumber sequence,
ColumnFamilyMemTables* memtables, FlushScheduler* flush_scheduler,
TrimHistoryScheduler* trim_history_scheduler,
bool ignore_missing_column_families, uint64_t recovery_log_number, DB* db,
bool concurrent_memtable_writes, bool seq_per_batch, bool batch_per_txn) {
MemTableInserter inserter(
sequence, memtables, flush_scheduler, trim_history_scheduler,
ignore_missing_column_families, recovery_log_number, db,
concurrent_memtable_writes, nullptr /* prot_info */,
nullptr /*has_valid_writes*/, seq_per_batch, batch_per_txn);
for (auto w : write_group) {
if (w->CallbackFailed()) {
continue;
}
w->sequence = inserter.sequence();
if (!w->ShouldWriteToMemtable()) {
// In seq_per_batch_ mode this advances the seq by one.
inserter.MaybeAdvanceSeq(true);
continue;
}
SetSequence(w->batch, inserter.sequence());
inserter.set_log_number_ref(w->log_ref);
inserter.set_prot_info(w->batch->prot_info_.get());
w->status = w->batch->Iterate(&inserter); // 插入操作关键部分
if (!w->status.ok()) {
return w->status;
}
assert(!seq_per_batch || w->batch_cnt != 0);
assert(!seq_per_batch || inserter.sequence() - w->sequence == w->batch_cnt);
}
return Status::OK();
}
WriteBatch 的 Iterate 函数实现如下
Status WriteBatch::Iterate(Handler* handler) const {
if (rep_.size() < WriteBatchInternal::kHeader) {
return Status::Corruption("malformed WriteBatch (too small)");
}
return WriteBatchInternal::Iterate(this, handler, WriteBatchInternal::kHeader,
rep_.size());
}
WriteBatchInternal::Iterate 这个方法做的事情就是将 WriteBatch 中的内容移除头 12 个字节后,一条条取记录。然后根据类型调用 handler(MemTableInserter)的方法处理;
Status WriteBatchInternal::Iterate(const WriteBatch* wb,
WriteBatch::Handler* handler, size_t begin,
size_t end) {
if (begin > wb->rep_.size() || end > wb->rep_.size() || end < begin) {
return Status::Corruption("Invalid start/end bounds for Iterate");
}
assert(begin <= end);
Slice input(wb->rep_.data() + begin, static_cast<size_t>(end - begin));
bool whole_batch =
(begin == WriteBatchInternal::kHeader) && (end == wb->rep_.size());
Slice key, value, blob, xid;
// Sometimes a sub-batch starts with a Noop. We want to exclude such Noops as
// the batch boundary symbols otherwise we would mis-count the number of
// batches. We do that by checking whether the accumulated batch is empty
// before seeing the next Noop.
bool empty_batch = true;
uint32_t found = 0;
Status s;
char tag = 0;
uint32_t column_family = 0; // default
bool last_was_try_again = false;
bool handler_continue = true;
while (((s.ok() && !input.empty()) || UNLIKELY(s.IsTryAgain()))) {
handler_continue = handler->Continue();
if (!handler_continue) {
break;
}
if (LIKELY(!s.IsTryAgain())) {
last_was_try_again = false;
tag = 0;
column_family = 0; // default
s = ReadRecordFromWriteBatch(&input, &tag, &column_family, &key, &value,
&blob, &xid); // 获取写入的 column_family ID
if (!s.ok()) {
return s;
}
} else {
assert(s.IsTryAgain());
assert(!last_was_try_again); // to detect infinite loop bugs
if (UNLIKELY(last_was_try_again)) {
return Status::Corruption(
"two consecutive TryAgain in WriteBatch handler; this is either a "
"software bug or data corruption.");
}
last_was_try_again = true;
s = Status::OK();
}
switch (tag) {
case kTypeColumnFamilyValue:
case kTypeValue:
assert(wb->content_flags_.load(std::memory_order_relaxed) &
(ContentFlags::DEFERRED | ContentFlags::HAS_PUT));
s = handler->PutCF(column_family, key, value); // 调用 handler PutCF 方法插入
if (LIKELY(s.ok())) {
empty_batch = false;
found++;
}
break;
case kTypeColumnFamilyDeletion:
case kTypeDeletion:
assert(wb->content_flags_.load(std::memory_order_relaxed) &
(ContentFlags::DEFERRED | ContentFlags::HAS_DELETE));
s = handler->DeleteCF(column_family, key);
if (LIKELY(s.ok())) {
empty_batch = false;
found++;
}
break;
case kTypeColumnFamilySingleDeletion:
case kTypeSingleDeletion:
assert(wb->content_flags_.load(std::memory_order_relaxed) &
(ContentFlags::DEFERRED | ContentFlags::HAS_SINGLE_DELETE));
s = handler->SingleDeleteCF(column_family, key);
if (LIKELY(s.ok())) {
empty_batch = false;
found++;
}
break;
case kTypeColumnFamilyRangeDeletion:
case kTypeRangeDeletion:
assert(wb->content_flags_.load(std::memory_order_relaxed) &
(ContentFlags::DEFERRED | ContentFlags::HAS_DELETE_RANGE));
s = handler->DeleteRangeCF(column_family, key, value);
if (LIKELY(s.ok())) {
empty_batch = false;
found++;
}
break;
case kTypeColumnFamilyMerge:
case kTypeMerge:
assert(wb->content_flags_.load(std::memory_order_relaxed) &
(ContentFlags::DEFERRED | ContentFlags::HAS_MERGE));
s = handler->MergeCF(column_family, key, value);
if (LIKELY(s.ok())) {
empty_batch = false;
found++;
}
break;
case kTypeColumnFamilyBlobIndex:
case kTypeBlobIndex:
assert(wb->content_flags_.load(std::memory_order_relaxed) &
(ContentFlags::DEFERRED | ContentFlags::HAS_BLOB_INDEX));
s = handler->PutBlobIndexCF(column_family, key, value);
if (LIKELY(s.ok())) {
found++;
}
break;
case kTypeLogData:
handler->LogData(blob);
// A batch might have nothing but LogData. It is still a batch.
empty_batch = false;
break;
case kTypeBeginPrepareXID:
assert(wb->content_flags_.load(std::memory_order_relaxed) &
(ContentFlags::DEFERRED | ContentFlags::HAS_BEGIN_PREPARE));
s = handler->MarkBeginPrepare();
assert(s.ok());
empty_batch = false;
if (!handler->WriteAfterCommit()) {
s = Status::NotSupported(
"WriteCommitted txn tag when write_after_commit_ is disabled (in "
"WritePrepared/WriteUnprepared mode). If it is not due to "
"corruption, the WAL must be emptied before changing the "
"WritePolicy.");
}
if (handler->WriteBeforePrepare()) {
s = Status::NotSupported(
"WriteCommitted txn tag when write_before_prepare_ is enabled "
"(in WriteUnprepared mode). If it is not due to corruption, the "
"WAL must be emptied before changing the WritePolicy.");
}
break;
case kTypeBeginPersistedPrepareXID:
assert(wb->content_flags_.load(std::memory_order_relaxed) &
(ContentFlags::DEFERRED | ContentFlags::HAS_BEGIN_PREPARE));
s = handler->MarkBeginPrepare();
assert(s.ok());
empty_batch = false;
if (handler->WriteAfterCommit()) {
s = Status::NotSupported(
"WritePrepared/WriteUnprepared txn tag when write_after_commit_ "
"is enabled (in default WriteCommitted mode). If it is not due "
"to corruption, the WAL must be emptied before changing the "
"WritePolicy.");
}
break;
case kTypeBeginUnprepareXID:
assert(wb->content_flags_.load(std::memory_order_relaxed) &
(ContentFlags::DEFERRED | ContentFlags::HAS_BEGIN_UNPREPARE));
s = handler->MarkBeginPrepare(true /* unprepared */);
assert(s.ok());
empty_batch = false;
if (handler->WriteAfterCommit()) {
s = Status::NotSupported(
"WriteUnprepared txn tag when write_after_commit_ is enabled (in "
"default WriteCommitted mode). If it is not due to corruption, "
"the WAL must be emptied before changing the WritePolicy.");
}
if (!handler->WriteBeforePrepare()) {
s = Status::NotSupported(
"WriteUnprepared txn tag when write_before_prepare_ is disabled "
"(in WriteCommitted/WritePrepared mode). If it is not due to "
"corruption, the WAL must be emptied before changing the "
"WritePolicy.");
}
break;
case kTypeEndPrepareXID:
assert(wb->content_flags_.load(std::memory_order_relaxed) &
(ContentFlags::DEFERRED | ContentFlags::HAS_END_PREPARE));
s = handler->MarkEndPrepare(xid);
assert(s.ok());
empty_batch = true;
break;
case kTypeCommitXID:
assert(wb->content_flags_.load(std::memory_order_relaxed) &
(ContentFlags::DEFERRED | ContentFlags::HAS_COMMIT));
s = handler->MarkCommit(xid);
assert(s.ok());
empty_batch = true;
break;
case kTypeRollbackXID:
assert(wb->content_flags_.load(std::memory_order_relaxed) &
(ContentFlags::DEFERRED | ContentFlags::HAS_ROLLBACK));
s = handler->MarkRollback(xid);
assert(s.ok());
empty_batch = true;
break;
case kTypeNoop:
s = handler->MarkNoop(empty_batch);
assert(s.ok());
empty_batch = true;
break;
default:
return Status::Corruption("unknown WriteBatch tag");
}
}
if (!s.ok()) {
return s;
}
if (handler_continue && whole_batch &&
found != WriteBatchInternal::Count(wb)) {
return Status::Corruption("WriteBatch has wrong count");
} else {
return Status::OK();
}
}
MemTableInserter 的 PutCF 实现如下
Status PutCF(uint32_t column_family_id, const Slice& key,
const Slice& value) override {
const auto* kv_prot_info = NextProtectionInfo();
if (kv_prot_info != nullptr) {
// Memtable needs seqno, doesn't need CF ID
auto mem_kv_prot_info =
kv_prot_info->StripC(column_family_id).ProtectS(sequence_);
return PutCFImpl(column_family_id, key, value, kTypeValue,
&mem_kv_prot_info);
}
return PutCFImpl(column_family_id, key, value, kTypeValue,
nullptr /* kv_prot_info */);
}
MemTableInserter 的 PutCFImpl 函数,根据 id 获取 column family 的 memtable,然后调用 memtable 的 Add 函数插入数据。
Status PutCFImpl(uint32_t column_family_id, const Slice& key,
const Slice& value, ValueType value_type,
const ProtectionInfoKVOS64* kv_prot_info) {
// optimize for non-recovery mode
if (UNLIKELY(write_after_commit_ && rebuilding_trx_ != nullptr)) {
// TODO(ajkr): propagate `ProtectionInfoKVOS64`.
return WriteBatchInternal::Put(rebuilding_trx_, column_family_id, key,
value);
// else insert the values to the memtable right away
}
// 在 MemTable 中找传入的 column family
Status ret_status;
if (UNLIKELY(!SeekToColumnFamily(column_family_id, &ret_status))) {
if (ret_status.ok() && rebuilding_trx_ != nullptr) {
assert(!write_after_commit_);
// The CF is probably flushed and hence no need for insert but we still
// need to keep track of the keys for upcoming rollback/commit.
// TODO(ajkr): propagate `ProtectionInfoKVOS64`.
ret_status = WriteBatchInternal::Put(rebuilding_trx_, column_family_id,
key, value);
if (ret_status.ok()) {
MaybeAdvanceSeq(IsDuplicateKeySeq(column_family_id, key));
}
} else if (ret_status.ok()) {
MaybeAdvanceSeq(false /* batch_boundary */);
}
return ret_status;
}
assert(ret_status.ok());
// 获取当前 (ColumnFamilyData) current_的 MemTable
// ColumnFamilyMemTables *const MemTableInserter::cf_mems_
MemTable* mem = cf_mems_->GetMemTable();
auto* moptions = mem->GetImmutableMemTableOptions();
// inplace_update_support is inconsistent with snapshots, and therefore with
// any kind of transactions including the ones that use seq_per_batch
assert(!seq_per_batch_ || !moptions->inplace_update_support);
if (!moptions->inplace_update_support) {
ret_status =
mem->Add(sequence_, value_type, key, value, kv_prot_info,
concurrent_memtable_writes_, get_post_process_info(mem),
hint_per_batch_ ? &GetHintMap()[mem] : nullptr);
} else if (moptions->inplace_callback == nullptr) {
assert(!concurrent_memtable_writes_);
ret_status = mem->Update(sequence_, key, value, kv_prot_info);
} else {
assert(!concurrent_memtable_writes_);
ret_status = mem->UpdateCallback(sequence_, key, value, kv_prot_info);
if (ret_status.IsNotFound()) {
// key not found in memtable. Do sst get, update, add
SnapshotImpl read_from_snapshot;
read_from_snapshot.number_ = sequence_;
ReadOptions ropts;
// it's going to be overwritten for sure, so no point caching data block
// containing the old version
ropts.fill_cache = false;
ropts.snapshot = &read_from_snapshot;
std::string prev_value;
std::string merged_value;
auto cf_handle = cf_mems_->GetColumnFamilyHandle();
Status get_status = Status::NotSupported();
if (db_ != nullptr && recovering_log_number_ == 0) {
if (cf_handle == nullptr) {
cf_handle = db_->DefaultColumnFamily();
}
get_status = db_->Get(ropts, cf_handle, key, &prev_value);
}
// Intentionally overwrites the `NotFound` in `ret_status`.
if (!get_status.ok() && !get_status.IsNotFound()) {
ret_status = get_status;
} else {
ret_status = Status::OK();
}
if (ret_status.ok()) {
UpdateStatus update_status;
char* prev_buffer = const_cast<char*>(prev_value.c_str());
uint32_t prev_size = static_cast<uint32_t>(prev_value.size());
if (get_status.ok()) {
update_status = moptions->inplace_callback(prev_buffer, &prev_size,
value, &merged_value);
} else {
update_status = moptions->inplace_callback(
nullptr /* existing_value */, nullptr /* existing_value_size */,
value, &merged_value);
}
if (update_status == UpdateStatus::UPDATED_INPLACE) {
assert(get_status.ok());
if (kv_prot_info != nullptr) {
ProtectionInfoKVOS64 updated_kv_prot_info(*kv_prot_info);
updated_kv_prot_info.UpdateV(value,
Slice(prev_buffer, prev_size));
// prev_value is updated in-place with final value.
ret_status = mem->Add(sequence_, value_type, key,
Slice(prev_buffer, prev_size),
&updated_kv_prot_info);
} else {
ret_status = mem->Add(sequence_, value_type, key,
Slice(prev_buffer, prev_size),
nullptr /* kv_prot_info */);
}
if (ret_status.ok()) {
RecordTick(moptions->statistics, NUMBER_KEYS_WRITTEN);
}
} else if (update_status == UpdateStatus::UPDATED) {
if (kv_prot_info != nullptr) {
ProtectionInfoKVOS64 updated_kv_prot_info(*kv_prot_info);
updated_kv_prot_info.UpdateV(value, merged_value);
// merged_value contains the final value.
ret_status = mem->Add(sequence_, value_type, key,
Slice(merged_value), &updated_kv_prot_info);
} else {
// merged_value contains the final value.
ret_status =
mem->Add(sequence_, value_type, key, Slice(merged_value),
nullptr /* kv_prot_info */);
}
if (ret_status.ok()) {
RecordTick(moptions->statistics, NUMBER_KEYS_WRITTEN);
}
}
}
}
}
if (UNLIKELY(ret_status.IsTryAgain())) {
assert(seq_per_batch_);
const bool kBatchBoundary = true;
MaybeAdvanceSeq(kBatchBoundary);
} else if (ret_status.ok()) {
MaybeAdvanceSeq();
CheckMemtableFull();
}
// optimize for non-recovery mode
// If `ret_status` is `TryAgain` then the next (successful) try will add
// the key to the rebuilding transaction object. If `ret_status` is
// another non-OK `Status`, then the `rebuilding_trx_` will be thrown
// away. So we only need to add to it when `ret_status.ok()`.
if (UNLIKELY(ret_status.ok() && rebuilding_trx_ != nullptr)) {
assert(!write_after_commit_);
// TODO(ajkr): propagate `ProtectionInfoKVOS64`.
ret_status = WriteBatchInternal::Put(rebuilding_trx_, column_family_id,
key, value);
}
return ret_status;
}
MemTable 的 Add 函数,主要向 MemTable 中添加 KV 记录,加入的同时进行编码操作。
Status MemTable::Add(SequenceNumber s, ValueType type,
const Slice& key, /* user key */
const Slice& value,
const ProtectionInfoKVOS64* kv_prot_info,
bool allow_concurrent,
MemTablePostProcessInfo* post_process_info, void** hint) {
// Format of an entry is concatenation of:
// key_size : varint32 of internal_key.size()
// key bytes : char[internal_key.size()]
// value_size : varint32 of value.size()
// value bytes : char[value.size()]
uint32_t key_size = static_cast<uint32_t>(key.size());
uint32_t val_size = static_cast<uint32_t>(value.size());
uint32_t internal_key_size = key_size + 8;
const uint32_t encoded_len = VarintLength(internal_key_size) +
internal_key_size + VarintLength(val_size) +
val_size;
char* buf = nullptr;
std::unique_ptr<MemTableRep>& table =
type == kTypeRangeDeletion ? range_del_table_ : table_;
KeyHandle handle = table->Allocate(encoded_len, &buf);
char* p = EncodeVarint32(buf, internal_key_size);
memcpy(p, key.data(), key_size);
Slice key_slice(p, key_size);
p += key_size;
uint64_t packed = PackSequenceAndType(s, type);
EncodeFixed64(p, packed);
p += 8;
p = EncodeVarint32(p, val_size);
memcpy(p, value.data(), val_size);
assert((unsigned)(p + val_size - buf) == (unsigned)encoded_len);
if (kv_prot_info != nullptr) {
Slice encoded(buf, encoded_len);
TEST_SYNC_POINT_CALLBACK("MemTable::Add:Encoded", &encoded);
Status status = VerifyEncodedEntry(encoded, *kv_prot_info);
if (!status.ok()) {
return status;
}
}
size_t ts_sz = GetInternalKeyComparator().user_comparator()->timestamp_size();
Slice key_without_ts = StripTimestampFromUserKey(key, ts_sz);
if (!allow_concurrent) {
// Extract prefix for insert with hint.
if (insert_with_hint_prefix_extractor_ != nullptr &&
insert_with_hint_prefix_extractor_->InDomain(key_slice)) {
Slice prefix = insert_with_hint_prefix_extractor_->Transform(key_slice);
bool res = table->InsertKeyWithHint(handle, &insert_hints_[prefix]);
if (UNLIKELY(!res)) {
return Status::TryAgain("key+seq exists");
}
} else {
bool res = table->InsertKey(handle);
if (UNLIKELY(!res)) {
return Status::TryAgain("key+seq exists");
}
}
// this is a bit ugly, but is the way to avoid locked instructions
// when incrementing an atomic
num_entries_.store(num_entries_.load(std::memory_order_relaxed) + 1,
std::memory_order_relaxed);
data_size_.store(data_size_.load(std::memory_order_relaxed) + encoded_len,
std::memory_order_relaxed);
if (type == kTypeDeletion) {
num_deletes_.store(num_deletes_.load(std::memory_order_relaxed) + 1,
std::memory_order_relaxed);
}
if (bloom_filter_ && prefix_extractor_ &&
prefix_extractor_->InDomain(key_without_ts)) {
bloom_filter_->Add(prefix_extractor_->Transform(key_without_ts));
}
if (bloom_filter_ && moptions_.memtable_whole_key_filtering) {
bloom_filter_->Add(key_without_ts);
}
// The first sequence number inserted into the memtable
assert(first_seqno_ == 0 || s >= first_seqno_);
if (first_seqno_ == 0) {
first_seqno_.store(s, std::memory_order_relaxed);
if (earliest_seqno_ == kMaxSequenceNumber) {
earliest_seqno_.store(GetFirstSequenceNumber(),
std::memory_order_relaxed);
}
assert(first_seqno_.load() >= earliest_seqno_.load());
}
assert(post_process_info == nullptr);
UpdateFlushState();
} else {
bool res = (hint == nullptr)
? table->InsertKeyConcurrently(handle)
: table->InsertKeyWithHintConcurrently(handle, hint);
if (UNLIKELY(!res)) {
return Status::TryAgain("key+seq exists");
}
assert(post_process_info != nullptr);
post_process_info->num_entries++;
post_process_info->data_size += encoded_len;
if (type == kTypeDeletion) {
post_process_info->num_deletes++;
}
if (bloom_filter_ && prefix_extractor_ &&
prefix_extractor_->InDomain(key_without_ts)) {
bloom_filter_->AddConcurrently(
prefix_extractor_->Transform(key_without_ts));
}
if (bloom_filter_ && moptions_.memtable_whole_key_filtering) {
bloom_filter_->AddConcurrently(key_without_ts);
}
// atomically update first_seqno_ and earliest_seqno_.
uint64_t cur_seq_num = first_seqno_.load(std::memory_order_relaxed);
while ((cur_seq_num == 0 || s < cur_seq_num) &&
!first_seqno_.compare_exchange_weak(cur_seq_num, s)) {
}
uint64_t cur_earliest_seqno =
earliest_seqno_.load(std::memory_order_relaxed);
while (
(cur_earliest_seqno == kMaxSequenceNumber || s < cur_earliest_seqno) &&
!first_seqno_.compare_exchange_weak(cur_earliest_seqno, s)) {
}
}
if (type == kTypeRangeDeletion) {
is_range_del_table_empty_.store(false, std::memory_order_relaxed);
}
UpdateOldestKeyTime();
return Status::OK();
}
下一篇将继续分析 MemTable 如何持久化,写入设备。