RocksDB: block-based SST files in detail (file format and encoding)

SST File Format

<beginning_of_file>
[data block 1]
[data block 2]
...
[data block N]
[meta block 1: filter block]                  (see section: "filter" Meta Block)
[meta block 2: index block]
[meta block 3: compression dictionary block]  (see section: "compression dictionary" Meta Block)
[meta block 4: range deletion block]          (see section: "range deletion" Meta Block)
[meta block 5: stats block]                   (see section: "properties" Meta Block)
...
[meta block K: future extended block]  (we may add more meta blocks in the future)
[metaindex block]
[Footer]                               (fixed size; starts at file_size - sizeof(Footer))
<end_of_file>

Data Block

Data Block Format

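A data block stores the key-value pairs themselves. Its layout is shown below, where:

  • entry: one key-value record
  • restart_point: offset of an entry at which prefix compression starts over (the entry stores its full key)
  • restart_count: number of restart points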
<beginning_of_datablock>
[entry 1]
[entry 2]
...
[entry N]
[restart_point 1]
[restart_point 2]
...
[restart_point M]
[restart_count]
<end_of_datablock>

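Each entry has the following layout:

  • sharedKeyLength: number of leading key bytes shared with the previous entry's key; always 0 for an entry at a restart point
  • unsharedKeyLength: number of key bytes not shared with the previous entry's key
  • valueLength: length of the value
  • unsharedKeyContent: the key bytes that are not shared with the previous entry's key
  • valueContent: the value bytes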
<beginning_of_entry>
[sharedKeyLength]
[unsharedKeyLength]
[valueLength]
[unsharedKeyContent]
[valueContent]
<end_of_entry>

Block Entry Encode

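inline void BlockBuilder::AddWithLastKeyImpl(const Slice& key,       // key to add to the block
                                             const Slice& value,     // value to add to the block
                                             const Slice& last_key,  // key of the previous entry
                                             const Slice* const delta_value,  // only used with value delta encoding (that path is omitted here)
                                             size_t buffer_size) {   // bytes already written to buffer_
  size_t shared = 0;  // number of bytes shared with the previous key
  if (counter_ >= block_restart_interval_) {
    // block_restart_interval_ entries have been written since the last restart point:
    // start a new restart_point and record the current buffer size as its offset
    restarts_.push_back(static_cast<uint32_t>(buffer_size));
    estimate_ += sizeof(uint32_t);
    counter_ = 0;
  } else if (use_delta_encoding_) {
    // Length of the prefix shared with the previous key
    shared = key.difference_offset(last_key);
  }
  // Length of the part not shared with the previous key
  const size_t non_shared = key.size() - shared;
  // Write the shared size, non-shared size and value size as three varint32 values
  PutVarint32Varint32Varint32(&buffer_, static_cast<uint32_t>(shared),
                                static_cast<uint32_t>(non_shared),
                                static_cast<uint32_t>(value.size()));
  // Write the non-shared part of the key
  buffer_.append(key.data() + shared, non_shared);
  // Write the value
  buffer_.append(value.data(), value.size());
  // One more entry since the last restart point
  counter_++;
}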
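
The entry layout above can be reproduced outside RocksDB with a few lines of standard C++. The following is a minimal sketch, not the library's code: `PutVarint32`, `SharedPrefixLength` and `AppendEntry` are hypothetical helpers written for this illustration, keys are plain strings rather than internal keys, and the caller decides when an entry is a restart point.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Append v as a varint32: 7 payload bits per byte, and the high bit is set
// on every byte except the last one.
inline void PutVarint32(std::string* dst, uint32_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Length of the common prefix of a and b.
inline size_t SharedPrefixLength(const std::string& a, const std::string& b) {
  const size_t n = a.size() < b.size() ? a.size() : b.size();
  size_t i = 0;
  while (i < n && a[i] == b[i]) ++i;
  return i;
}

// Hypothetical helper: append one prefix-compressed entry to buf as
// [shared][non_shared][value_size][unshared key bytes][value bytes].
inline void AppendEntry(std::string* buf, const std::string& last_key,
                        const std::string& key, const std::string& value,
                        bool is_restart) {
  // A restart point stores the full key, so nothing is shared.
  const size_t shared = is_restart ? 0 : SharedPrefixLength(last_key, key);
  assert(shared <= key.size());
  const size_t non_shared = key.size() - shared;
  PutVarint32(buf, static_cast<uint32_t>(shared));
  PutVarint32(buf, static_cast<uint32_t>(non_shared));
  PutVarint32(buf, static_cast<uint32_t>(value.size()));
  buf->append(key, shared, non_shared);
  buf->append(value);
}
```

Adding "apple" as a restart point and then "apply" stores the second key as a single byte: 4 bytes are shared with "apple" and only "y" is written.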

Data Block Encode

Slice BlockBuilder::Finish() {
  // Append the restart_point offsets
  for (size_t i = 0; i < restarts_.size(); i++) {
    PutFixed32(&buffer_, restarts_[i]);
  }
  // Number of restart points
  uint32_t num_restarts = static_cast<uint32_t>(restarts_.size());
  // Pack the index type and the number of restart points into one 32-bit footer
  // (the code that selects index_type is omitted in this excerpt)
  uint32_t block_footer = PackIndexTypeAndNumRestarts(index_type, num_restarts);
  PutFixed32(&buffer_, block_footer);
  finished_ = true;
  return Slice(buffer_);
}
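
To see how a reader finds the restart array again, here is a small sketch of the block trailer. It assumes the simple case in which the last 4 bytes hold only the restart count (no index type bits set); `AppendRestartTrailer` and `ReadRestarts` are hypothetical names, and the little-endian fixed32 helpers are written out by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Append v as 4 little-endian bytes.
inline void PutFixed32(std::string* dst, uint32_t v) {
  for (int i = 0; i < 4; ++i) {
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xFF));
  }
}

// Read 4 little-endian bytes starting at p.
inline uint32_t DecodeFixed32(const char* p) {
  uint32_t v = 0;
  for (int i = 0; i < 4; ++i) {
    v |= static_cast<uint32_t>(static_cast<unsigned char>(p[i])) << (8 * i);
  }
  return v;
}

// Hypothetical helper: append the restart offsets followed by their count.
inline void AppendRestartTrailer(std::string* block,
                                 const std::vector<uint32_t>& restarts) {
  for (uint32_t offset : restarts) PutFixed32(block, offset);
  PutFixed32(block, static_cast<uint32_t>(restarts.size()));
}

// Hypothetical helper: recover the restart offsets by reading from the end.
inline std::vector<uint32_t> ReadRestarts(const std::string& block) {
  assert(block.size() >= 4);
  const uint32_t n = DecodeFixed32(block.data() + block.size() - 4);
  const size_t array_bytes = 4 * static_cast<size_t>(n);
  assert(block.size() >= 4 + array_bytes);
  const char* base = block.data() + block.size() - 4 - array_bytes;
  std::vector<uint32_t> restarts;
  for (uint32_t i = 0; i < n; ++i) {
    restarts.push_back(DecodeFixed32(base + 4 * i));
  }
  return restarts;
}
```

Because the count sits at a fixed position at the end of the block, a reader can locate the restart array without scanning the entries, then binary-search the restart points for a key.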

Index Block

Index Block Format

<beginning_of_indexblock>
[data_block_handle 1]
[data_block_handle 2]
...
[data_block_handle N]
<end_of_indexblock>
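  • The index block is used to quickly find the data block whose key range may contain a target key. A file has one index block, and each data block has one entry in it. The entry's key is >= the last key of that data block and < the first key of the next data block; the entry's value is a BlockHandle pointing to the data block.
  • Why is the index key not simply the largest key of the data block it points to? Mainly to save space. Suppose the largest key in the block is "acknowledge" and the smallest key in the next block is "apple". Using the largest key costs len("acknowledge") = 11 bytes. Using a separator such as "ad" ("acknowledge" < "ad" < "apple") costs only 2 bytes, and lookups return the same block.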
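
The "acknowledge" / "apple" example can be checked with a short sketch of a bytewise shortest-separator routine. This follows the general idea (bump the first differing byte and truncate) and works on plain strings; `ShortestSeparator` is a hypothetical name, and RocksDB itself applies the idea to internal keys through the comparator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Return a short key sep with start <= sep < limit (bytewise order),
// assuming start < limit. Falls back to start when no shorter key is found.
inline std::string ShortestSeparator(const std::string& start,
                                     const std::string& limit) {
  const size_t min_len = std::min(start.size(), limit.size());
  size_t diff = 0;
  while (diff < min_len && start[diff] == limit[diff]) ++diff;
  if (diff >= min_len) {
    // One key is a prefix of the other: nothing to shorten.
    return start;
  }
  const unsigned char b = static_cast<unsigned char>(start[diff]);
  // Only shorten when bumping the differing byte keeps the result below limit.
  if (b < 0xFF && b + 1 < static_cast<unsigned char>(limit[diff])) {
    std::string sep = start.substr(0, diff + 1);
    sep[diff] = static_cast<char>(b + 1);
    assert(sep < limit);
    return sep;
  }
  return start;
}
```

For "acknowledge" and "apple" the common prefix is "a", the differing byte 'c' is bumped to 'd', and the result is "ad". For "abc" and "abd" no shorter key fits between them, so the sketch keeps "abc".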

Index Block Encode

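virtual void AddIndexEntry(std::string* last_key_in_current_block,
                             const Slice* first_key_in_next_block,
                             const BlockHandle& block_handle) override {
  // first_key_in_next_block is nullptr for the last data block of the file
  if (first_key_in_next_block != nullptr) {
    // From the largest key of the current block and the smallest key of the next block,
    // compute a key that is as short as possible but still lies between the two.
    // The result is written back into *last_key_in_current_block.
    FindShortestInternalKeySeparator(*comparator_->user_comparator(),
                                           last_key_in_current_block,
                                           *first_key_in_next_block);
  }
  auto sep = Slice(*last_key_in_current_block);
  // Build the index entry value: the block handle, optionally with the block's first key
  IndexValue entry(block_handle, current_block_first_internal_key_);
  std::string encoded_entry;
  std::string delta_encoded_entry;
  entry.EncodeTo(&encoded_entry, include_first_key_, nullptr);
  // The delta-encoded variant of the value is omitted in this excerpt, so the slice is empty
  const Slice delta_encoded_entry_slice(delta_encoded_entry);
  index_block_builder_.Add(sep, encoded_entry, &delta_encoded_entry_slice);
}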

Filter Block

Filter Block Format

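  • filter_entry: the bitmap of one actual filter
  • filter_entry_offset: offset of each filter_entry
  • filter_entry_offset_array_offset: offset of the first filter_entry_offset
  • base: the final byte of the block; in the LevelDB-derived layout it stores lg(base), where one filter covers each base-byte range of file offsets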
<beginning_of_filterblock>
[filter_entry 1]
[filter_entry 2]
...
[filter_entry N]
[filter_entry_offset 1]
[filter_entry_offset 2]
...
[filter_entry_offset N]
[filter_entry_offset_array_offset]
[base]
<end_of_filterblock>

Filter Block Encode

To be added later.

MetaIndex Block

MetaIndex Format

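The metaindex block stores key-value pairs: the key is the name of a meta block and the value is that block's handle. A reader uses it to locate each meta block quickly when opening a file.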

<beginning_of_metaindexblock>
[key + filter block handle]
[key + index block handle]
[key + compression dictionary block handle]
[key + range deletion block handle]
[key + stats block handle]
<end_of_metaindexblock>

MetaIndex Encode

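Take the filter block as an example. When the filter block is written to disk, its filter_block_handle is added to the metaindex block. The key is a fixed prefix (full or partitioned filter) followed by the filter policy's compatibility name.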

void BlockBasedTableBuilder::WriteFilterBlock(
    MetaIndexBuilder* meta_index_builder) {
  // Write the filter block to disk
  ...
  // Build the metaindex key
  std::string key;
  key = is_partitioned_filter ? BlockBasedTable::kPartitionedFilterBlockPrefix
                                : BlockBasedTable::kFullFilterBlockPrefix;
  key.append(rep_->table_options.filter_policy->CompatibilityName());
  // Record filter_block_handle in meta_index_builder
  meta_index_builder->Add(key, filter_block_handle);
}
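
Conceptually the metaindex is a sorted name-to-handle table. The sketch below models that with `std::map`; `MetaIndexSketch`, `Handle` and the key string used in the usage note are all hypothetical and are not RocksDB's classes or real block names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Where a block lives in the file.
struct Handle {
  uint64_t offset;
  uint64_t size;
};

// Hypothetical in-memory model of a metaindex: block name -> block handle.
// std::map keeps the names sorted, which is the order a sorted block needs.
class MetaIndexSketch {
 public:
  void Add(const std::string& name, const Handle& handle) {
    entries_[name] = handle;
  }

  // Returns true and fills *out when a block with this name was registered.
  bool Find(const std::string& name, Handle* out) const {
    assert(out != nullptr);
    auto it = entries_.find(name);
    if (it == entries_.end()) return false;
    *out = it->second;
    return true;
  }

 private:
  std::map<std::string, Handle> entries_;
};
```

A writer would call `Add("example.filter", handle)` once per meta block as it is flushed, and a reader would call `Find` with the same name to learn where to read that block from.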

Footer

Footer Format

<beginning_of_footer>
[checksum type] // 1B
[metaindex_handle] // <= 20B
[index_handle] // <= 20B
[padding] // metaindex_handle and index_handle share a 40B region; unused bytes are zero-filled
[footer version] // 4B
[table_magic_number] // 8B
<end_of_footer>

Footer size

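For format_version > 0 the footer is 53 bytes long, so a reader simply loads the last 53 bytes of the file. The size follows from these constants: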

const uint32_t kMaxVarint64Length = 10;
// BlockHandle::kMaxEncodedLength: two varint64 values (offset and size)
static constexpr uint32_t kMaxEncodedLength = 2 * kMaxVarint64Length; // 2*10=20
constexpr uint32_t kMagicNumberLengthByte = 8;
// Footer::kNewVersionsEncodedLength
static constexpr uint32_t kNewVersionsEncodedLength =
      1 + 2 * BlockHandle::kMaxEncodedLength + 4 + kMagicNumberLengthByte; // 1+2*20+4+8=53
// Footer::kMaxEncodedLength
static constexpr uint32_t kMaxEncodedLength = kNewVersionsEncodedLength;

Footer encode

void FooterBuilder::Build(uint64_t magic_number, uint32_t format_version,
                          uint64_t footer_offset, ChecksumType checksum_type,
                          const BlockHandle& metaindex_handle,
                          const BlockHandle& index_handle) {
  char* part2;
  char* part3;
  if (format_version > 0) {
    slice_ = Slice(data_.data(), Footer::kNewVersionsEncodedLength);
    char* cur = data_.data();
    *(cur++) = checksum_type; // 1B checksum_type
    part2 = cur;
    cur += kFooterPart2Size; // skip the 40B handle region, filled in below
    part3 = cur;
    EncodeFixed32(cur, format_version); // 4B format_version
    cur += 4;
    EncodeFixed64(cur, magic_number); // 8B magic_number
  }
  // The legacy branch for format_version == 0 is omitted in this excerpt
  {
    char* cur = part2;
    cur = metaindex_handle.EncodeTo(cur); // <= 20B metaindex_handle
    cur = index_handle.EncodeTo(cur); // <= 20B index_handle
    // Zero pad remainder
    std::fill(cur, part3, char{0}); // zero-fill whatever is left of the 40B region
  }
}
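
Because every field after the handle region sits at a fixed offset, the footer can be parsed from the tail of the file without decoding the handles first. The sketch below illustrates this for the 53-byte layout described above; `BuildFooter`, `ParseFooterTail` and `FooterFields` are hypothetical names, the handles are passed in as already-encoded bytes, and the magic number in the usage note is a made-up placeholder rather than RocksDB's real constant.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

constexpr size_t kHandlesRegion = 40;  // two handles of at most 20 bytes each
constexpr size_t kFooterSize = 1 + kHandlesRegion + 4 + 8;
static_assert(kFooterSize == 53, "1 + 40 + 4 + 8 must be 53");

struct FooterFields {
  uint8_t checksum_type;
  uint32_t format_version;
  uint64_t magic;
};

// Hypothetical helper: lay out the footer with little-endian fixed fields.
inline std::string BuildFooter(uint8_t checksum_type,
                               const std::string& encoded_handles,
                               uint32_t format_version, uint64_t magic) {
  assert(encoded_handles.size() <= kHandlesRegion);
  std::string out;
  out.push_back(static_cast<char>(checksum_type));
  out.append(encoded_handles);
  // Zero-fill the unused part of the handle region.
  out.append(kHandlesRegion - encoded_handles.size(), '\0');
  for (int i = 0; i < 4; ++i) {
    out.push_back(static_cast<char>((format_version >> (8 * i)) & 0xFF));
  }
  for (int i = 0; i < 8; ++i) {
    out.push_back(static_cast<char>((magic >> (8 * i)) & 0xFF));
  }
  return out;
}

// Hypothetical helper: read the fixed-position fields from the file tail.
inline FooterFields ParseFooterTail(const std::string& file) {
  assert(file.size() >= kFooterSize);
  const unsigned char* p =
      reinterpret_cast<const unsigned char*>(file.data()) + file.size() -
      kFooterSize;
  FooterFields f{};
  f.checksum_type = p[0];
  for (int i = 0; i < 4; ++i) {
    f.format_version |= static_cast<uint32_t>(p[1 + kHandlesRegion + i]) << (8 * i);
  }
  for (int i = 0; i < 8; ++i) {
    f.magic |= static_cast<uint64_t>(p[1 + kHandlesRegion + 4 + i]) << (8 * i);
  }
  return f;
}
```

Appending `BuildFooter(1, handles, 5, 0x0123456789abcdefULL)` to any payload and calling `ParseFooterTail` on the result returns the same three values, whatever the payload length is.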

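Now look at how a BlockHandle is encoded. It is written with PutVarint64Varint64, so each of the two fields takes between 1 and 10 bytes and the encoded handle is usually much shorter than the 20-byte maximum. The footer still reserves the full 40 bytes for the two handles so that its size stays fixed; the unused bytes are zero padding.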

class BlockHandle {
  uint64_t offset_;
  uint64_t size_;
};
void BlockHandle::EncodeTo(std::string* dst) const {
  // Sanity check that all fields have been set
  PutVarint64Varint64(dst, offset_, size_);
}
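
The size claim is easy to verify with a standalone varint64 sketch. The helpers below (`PutVarint64`, `GetVarint64`, `EncodeHandle`, `DecodeHandle` and the `Handle` struct) are written for this illustration only and are assumptions about the general varint scheme, not copies of RocksDB's functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

struct Handle {
  uint64_t offset;
  uint64_t size;
};

// Append v as a varint64: 7 payload bits per byte, high bit marks "more".
inline void PutVarint64(std::string* dst, uint64_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Decode one varint64 from src starting at *pos; advances *pos on success.
inline bool GetVarint64(const std::string& src, size_t* pos, uint64_t* out) {
  uint64_t result = 0;
  for (int shift = 0; shift <= 63 && *pos < src.size(); shift += 7) {
    const uint64_t byte = static_cast<unsigned char>(src[(*pos)++]);
    result |= (byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) {
      *out = result;
      return true;
    }
  }
  return false;  // truncated or longer than 10 bytes
}

// Hypothetical helper: offset followed by size, both as varint64.
inline std::string EncodeHandle(const Handle& h) {
  std::string dst;
  PutVarint64(&dst, h.offset);
  PutVarint64(&dst, h.size);
  assert(dst.size() <= 20);
  return dst;
}

// Hypothetical helper: inverse of EncodeHandle.
inline bool DecodeHandle(const std::string& src, Handle* h) {
  size_t pos = 0;
  return GetVarint64(src, &pos, &h->offset) && GetVarint64(src, &pos, &h->size);
}
```

A handle with offset 4096 and size 300 encodes to 4 bytes, while a handle with both fields at the maximum 64-bit value needs the full 20 bytes.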

SST build table

Status BuildTable(
    const std::string& dbname, VersionSet* versions, ...)
{
  iter->SeekToFirst();
  if (iter->Valid() || !range_del_agg->IsEmpty()) {
    TableBuilder* builder;
    {
      IOStatus io_s = NewWritableFile(fs, fname, &file, file_options);
      builder = NewTableBuilder(tboptions, file_writer.get());
    }
    CompactionIterator c_iter(
        iter, tboptions.internal_comparator.user_comparator(), &merge,...);
    c_iter.SeekToFirst();
    for (; c_iter.Valid(); c_iter.Next()) {
      const Slice& key = c_iter.key();
      const Slice& value = c_iter.value();
      // Add one key-value pair to the current data block. When Add finds that the
      // block is full, it flushes the block and records its block_handle in the index block.
      builder->Add(key, value);
    }
    // Flush everything that is not a data block
    s = builder->Finish();
    EventHelpers::LogAndNotifyTableFileCreationFinished(...);
  }
}
Status BlockBasedTableBuilder::Finish() {
  Rep* r = rep_;
  // Record the block_handle of the last data block in the index block
  r->index_builder->AddIndexEntry(
          &r->last_key, nullptr /* no next data block */, r->pending_handle);
  BlockHandle metaindex_block_handle, index_block_handle;
  MetaIndexBuilder meta_index_builder;
  // Write the filter block
  WriteFilterBlock(&meta_index_builder);
  // Write the index block and return its handle in index_block_handle
  WriteIndexBlock(&meta_index_builder, &index_block_handle);
  // Write the remaining meta blocks
  WriteCompressionDictBlock(&meta_index_builder);
  WriteRangeDelBlock(&meta_index_builder);
  WritePropertiesBlock(&meta_index_builder);
  if (ok()) {
    // flush the meta index block
    // Write the metaindex block, which holds the handles of all meta blocks,
    // and return its handle in metaindex_block_handle
    WriteRawBlock(meta_index_builder.Finish(), kNoCompression,
                  &metaindex_block_handle, BlockType::kMetaIndex);
  }
  if (ok()) {
    // Write the footer
    WriteFooter(metaindex_block_handle, index_block_handle);
  }
  // Status handling and the return statement are omitted in this excerpt
}

References

https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format
https://www.jianshu.com/p/d6ce3593a69e
