SST File Format
<beginning_of_file>
[data block 1]
[data block 2]
...
[data block N]
[meta block 1: filter block] (see section: "filter" Meta Block)
[meta block 2: index block]
[meta block 3: compression dictionary block] (see section: "compression dictionary" Meta Block)
[meta block 4: range deletion block] (see section: "range deletion" Meta Block)
[meta block 5: stats block] (see section: "properties" Meta Block)
...
[meta block K: future extended block] (we may add more meta blocks in the future)
[metaindex block]
[Footer] (fixed size; starts at file_size - sizeof(Footer))
<end_of_file>
Data Block
Data Block Format
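A data block stores key-value (KV) entries. Its layout is shown below, where:
- entry: one KV record
- restart_point: the offset of an entry at which prefix compression restarts (that entry stores its full key)
- restart_count: the number of restart points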
<beginning_of_datablock>
[entry 1]
[entry 2]
...
[entry N]
[restart_point 1]
[restart_point 2]
...
[restart_point M]
[restart_count]
<end_of_datablock>
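Each entry has the following format:
- sharedKeyLength: length of the key prefix shared with the previous entry's key; it is always 0 at a restart point
- unsharedKeyLength: length of the part of the key that is not shared with the previous entry's key
- valueLength: length of the value
- unsharedKeyContent: the non-shared bytes of the key
- valueContent: the value bytes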
<beginning_of_entry>
[sharedKeyLength]
[unsharedKeyLength]
[valueLength]
[unsharedKeyContent]
[valueContent]
<end_of_entry>
Block Entry Encode
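inline void BlockBuilder::AddWithLastKeyImpl(
    const Slice& key,                // key to add to the block
    const Slice& value,              // value to add to the block
    const Slice& last_key,           // key of the previous entry
    const Slice* const delta_value,  // only used with value delta encoding; unused in this excerpt
    size_t buffer_size) {            // number of bytes already written to buffer_
  size_t shared = 0;  // number of bytes shared with the previous key
  if (counter_ >= block_restart_interval_) {
    // block_restart_interval_ entries have been written since the last
    // restart point: start a new one and record the current buffer size
    // as its offset.
    restarts_.push_back(static_cast<uint32_t>(buffer_size));
    estimate_ += sizeof(uint32_t);
    counter_ = 0;
  } else if (use_delta_encoding_) {
    // Length of the prefix shared with the previous key.
    shared = key.difference_offset(last_key);
  }
  // Length of the part not shared with the previous key.
  const size_t non_shared = key.size() - shared;
  // Write the shared size, non-shared size and value size as varints.
  PutVarint32Varint32Varint32(&buffer_, static_cast<uint32_t>(shared),
                              static_cast<uint32_t>(non_shared),
                              static_cast<uint32_t>(value.size()));
  // Write the non-shared part of the key.
  buffer_.append(key.data() + shared, non_shared);
  // Write the value.
  buffer_.append(value.data(), value.size());
  counter_++;  // entries written since the last restart point
}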
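The prefix compression above can be tried outside RocksDB with a minimal sketch. `TinyBlockBuilder` and its members are hypothetical names for illustration, not RocksDB classes, and the sketch assumes keys arrive in sorted order and that delta encoding is always enabled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Append v as a base-128 varint: 7 payload bits per byte, low bits first,
// with the high bit set on every byte except the last.
inline void PutVarint32(std::string* dst, uint32_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Hypothetical, simplified stand-in for BlockBuilder (illustration only).
class TinyBlockBuilder {
 public:
  explicit TinyBlockBuilder(int restart_interval)
      : restart_interval_(restart_interval) {
    assert(restart_interval >= 1);
    restarts_.push_back(0);  // the first entry is always a restart point
  }

  // Keys are assumed to be added in sorted order.
  void Add(const std::string& key, const std::string& value) {
    size_t shared = 0;
    if (counter_ >= restart_interval_) {
      // New restart point: remember its offset and store the full key.
      restarts_.push_back(static_cast<uint32_t>(buffer_.size()));
      counter_ = 0;
    } else {
      // Length of the common prefix with the previous key.
      const size_t limit = std::min(last_key_.size(), key.size());
      while (shared < limit && last_key_[shared] == key[shared]) ++shared;
    }
    const size_t non_shared = key.size() - shared;
    PutVarint32(&buffer_, static_cast<uint32_t>(shared));
    PutVarint32(&buffer_, static_cast<uint32_t>(non_shared));
    PutVarint32(&buffer_, static_cast<uint32_t>(value.size()));
    buffer_.append(key, shared, non_shared);  // non-shared key bytes
    buffer_.append(value);                    // value bytes
    last_key_ = key;
    ++counter_;
  }

  const std::string& buffer() const { return buffer_; }
  const std::vector<uint32_t>& restarts() const { return restarts_; }

 private:
  const int restart_interval_;
  int counter_ = 0;
  std::string buffer_;
  std::string last_key_;
  std::vector<uint32_t> restarts_;
};
```

With a restart interval of 2, adding "apple", "apply" and "banana" stores "apply" as shared=4 plus the single byte "y", and "banana" begins a new restart point with its full key.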
Data Block Encode
Slice BlockBuilder::Finish() {
  // Append the restart point array.
  for (size_t i = 0; i < restarts_.size(); i++) {
    PutFixed32(&buffer_, restarts_[i]);
  }
  // Number of restart points.
  uint32_t num_restarts = static_cast<uint32_t>(restarts_.size());
  // index_type is chosen by code omitted from this excerpt. It is packed
  // into a single 32-bit footer together with the restart point count.
  uint32_t block_footer = PackIndexTypeAndNumRestarts(index_type, num_restarts);
  PutFixed32(&buffer_, block_footer);
  finished_ = true;
  return Slice(buffer_);
}
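The reading side follows directly from this layout: the last 4 bytes give the restart count, and the restart array sits immediately before them. The sketch below is a simplified illustration with hypothetical helper names. It assumes little-endian fixed32 values and treats the footer word as a plain count, ignoring the index type bits that PackIndexTypeAndNumRestarts folds into the same word.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Append a 32-bit value in little-endian byte order.
inline void PutFixed32(std::string* dst, uint32_t v) {
  for (int i = 0; i < 4; ++i) {
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xFF));
  }
}

// Read a little-endian 32-bit value starting at byte offset pos.
inline uint32_t DecodeFixed32(const std::string& s, size_t pos) {
  uint32_t v = 0;
  for (int i = 0; i < 4; ++i) {
    v |= static_cast<uint32_t>(static_cast<unsigned char>(s[pos + i])) << (8 * i);
  }
  return v;
}

// Simplified Finish(): entries, then restart offsets, then the restart count.
inline std::string FinishBlock(std::string entries,
                               const std::vector<uint32_t>& restarts) {
  for (uint32_t r : restarts) PutFixed32(&entries, r);
  PutFixed32(&entries, static_cast<uint32_t>(restarts.size()));
  return entries;
}

// Recover the restart offsets from a finished block.
inline std::vector<uint32_t> ReadRestarts(const std::string& block) {
  assert(block.size() >= 4);
  const uint32_t n = DecodeFixed32(block, block.size() - 4);
  assert(block.size() >= 4 + 4 * static_cast<size_t>(n));
  const size_t array_start = block.size() - 4 - 4 * static_cast<size_t>(n);
  std::vector<uint32_t> out;
  for (uint32_t i = 0; i < n; ++i) {
    out.push_back(DecodeFixed32(block, array_start + 4 * i));
  }
  return out;
}
```

Because the restart offsets are fixed-width, a reader can binary-search them and only decode entries linearly within one restart interval.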
Index Block
Index Block Format
<beginning_of_indexblock>
[data_block_handle 1]
[data_block_handle 2]
...
[data_block_handle N]
<end_of_indexblock>
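- The index block is used to quickly locate the data block that may contain a target key. A file has one index block, with one entry per data block. The key of each entry is >= the last key of its data block and < the first key of the next data block; the value is a BlockHandle pointing to that data block.
- Why is the index key not simply the largest key of the data block it indexes? Mainly to save space. Suppose the largest key of a block is "acknowledge" and the smallest key of the next block is "apple". Using the block's largest key as the index key costs len("acknowledge") = 11 bytes. Using a separator instead, the key can be "ad" ("acknowledge" < "ad" < "apple"), which is only 2 bytes long and gives the same lookup result.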
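The separator idea can be illustrated with a minimal sketch over plain byte strings. `ShortestSeparator` is a hypothetical helper, not the RocksDB function: it assumes bytewise ordering and ignores the sequence number and type suffix that real internal keys carry.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

// Returns a key k with start <= k < limit under bytewise ordering that is
// as short as this simple rule allows; falls back to start unchanged.
inline std::string ShortestSeparator(const std::string& start,
                                     const std::string& limit) {
  const size_t min_len = std::min(start.size(), limit.size());
  size_t diff = 0;
  while (diff < min_len && start[diff] == limit[diff]) ++diff;
  if (diff >= min_len) {
    return start;  // one key is a prefix of the other: nothing to shorten
  }
  const uint8_t byte = static_cast<uint8_t>(start[diff]);
  if (byte < 0xFF && byte + 1 < static_cast<uint8_t>(limit[diff])) {
    // Keep the common prefix, bump the first differing byte, drop the rest.
    std::string sep = start.substr(0, diff + 1);
    sep[diff] = static_cast<char>(byte + 1);
    return sep;
  }
  return start;
}
```

For the keys in the example above, the common prefix is "a", the first differing byte 'c' is bumped to 'd', and the result is "ad".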
Index Block Encode
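virtual void AddIndexEntry(std::string* last_key_in_current_block,
                           const Slice* first_key_in_next_block,
                           const BlockHandle& block_handle) override {
  // first_key_in_next_block is nullptr for the last data block of the file.
  if (first_key_in_next_block != nullptr) {
    // From the largest key of the current block and the smallest key of the
    // next block, derive the shortest key that lies between them. The result
    // is written back into *last_key_in_current_block.
    FindShortestInternalKeySeparator(*comparator_->user_comparator(),
                                     last_key_in_current_block,
                                     *first_key_in_next_block);
  }
  auto sep = Slice(*last_key_in_current_block);
  // Build the index entry: the block handle plus, optionally, the first
  // internal key of the block.
  IndexValue entry(block_handle, current_block_first_internal_key_);
  std::string encoded_entry;
  std::string delta_encoded_entry;
  entry.EncodeTo(&encoded_entry, include_first_key_, nullptr);
  // The delta-encoded form is filled in by code omitted from this excerpt.
  const Slice delta_encoded_entry_slice(delta_encoded_entry);
  index_block_builder_.Add(sep, encoded_entry, &delta_encoded_entry_slice);
}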
Filter Block
Filter Block Format
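- filter_entry: the bitmap data of one filter
- filter_entry_offset: the offset of each filter_entry
- filter_entry_offset_array_offset: the offset of the first filter_entry_offset
- base: one byte holding lg(base); filter i covers the data blocks whose file offset falls in [i*base, (i+1)*base - 1]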
<beginning_of_filterblock>
[filter_entry 1]
[filter_entry 2]
...
[filter_entry N]
[filter_entry_offset 1]
[filter_entry_offset 2]
...
[filter_entry_offset N]
[filter_entry_offset_array_offset]
[base]
<end_of_filterblock>
Filter Block Encode
To be added.
MetaIndex Block
MetaIndex Format
The metaindex block is stored as KV pairs: the key is the name of a meta block and the value is that block's handle. It is used to locate meta blocks quickly when reading the file.
<beginning_of_metaindexblock>
[key + filter block handle]
[key + index block handle]
[key + compression dictionary block handle]
[key + range deletion block handle]
[key + stats block handle]
<end_of_metaindexblock>
MetaIndex Encode
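Taking the filter as an example: when the filter block is written to disk, its filter_block_handle is added to the metaindex block. The key is a prefix (kFullFilterBlockPrefix or kPartitionedFilterBlockPrefix) followed by the filter policy's CompatibilityName().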
void BlockBasedTableBuilder::WriteFilterBlock(
    MetaIndexBuilder* meta_index_builder) {
  // Write the filter block to disk.
  ...
  // Build the metaindex key.
  std::string key;
  key = is_partitioned_filter ? BlockBasedTable::kPartitionedFilterBlockPrefix
                              : BlockBasedTable::kFullFilterBlockPrefix;
  key.append(rep_->table_options.filter_policy->CompatibilityName());
  // Add filter_block_handle to meta_index_builder.
  meta_index_builder->Add(key, filter_block_handle);
}
Footer
Footer Format
<beginning_of_footer>
[checksum_type] // 1B
[metaindex_handle] // <= 20B
[index_handle] // <= 20B
[padding] // metaindex_handle and index_handle together occupy 40B; the remainder is zero-filled
[footer version] // 4B
[table_magic_number] // 8B
<end_of_footer>
Footer size
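For format_version > 0 the footer is 53 bytes, so a reader fetches it by reading the last 53 bytes of the file. (The legacy format_version 0 footer has no checksum type or version field and is 2*20 + 8 = 48 bytes.) See the code: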
const uint32_t kMaxVarint64Length = 10;
// In class BlockHandle:
static constexpr uint32_t kMaxEncodedLength = 2 * kMaxVarint64Length; // 2*10=20
constexpr uint32_t kMagicNumberLengthByte = 8;
// In class Footer:
static constexpr uint32_t kNewVersionsEncodedLength =
    1 + 2 * BlockHandle::kMaxEncodedLength + 4 + kMagicNumberLengthByte; // 1+2*20+4+8=53
static constexpr uint32_t kMaxEncodedLength = kNewVersionsEncodedLength;
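The arithmetic can be checked at compile time. The sketch below restates the constants under hypothetical names (they are not the RocksDB identifiers) and derives where the footer starts for a given file size, matching "starts at file_size - sizeof(Footer)" in the file layout.

```cpp
#include <cassert>
#include <cstdint>

// A 64-bit value needs at most ceil(64 / 7) = 10 varint bytes.
constexpr uint32_t kMaxVarint64Length = 10;
// A handle is two varint64 values: offset and size.
constexpr uint32_t kHandleMaxLength = 2 * kMaxVarint64Length;
constexpr uint32_t kChecksumTypeLength = 1;
constexpr uint32_t kFormatVersionLength = 4;
constexpr uint32_t kMagicNumberLength = 8;

constexpr uint32_t kNewFooterLength = kChecksumTypeLength +
                                      2 * kHandleMaxLength +
                                      kFormatVersionLength +
                                      kMagicNumberLength;
static_assert(kNewFooterLength == 53, "1 + 2*20 + 4 + 8 must equal 53");

// Byte offset at which the footer begins in a file of the given size.
inline uint64_t FooterOffset(uint64_t file_size) {
  assert(file_size >= kNewFooterLength);
  return file_size - kNewFooterLength;
}
```

For a 1000-byte file, the footer would start at offset 947.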
Footer encode
void FooterBuilder::Build(uint64_t magic_number, uint32_t format_version,
                          uint64_t footer_offset, ChecksumType checksum_type,
                          const BlockHandle& metaindex_handle,
                          const BlockHandle& index_handle) {
  char* part2;
  char* part3;
  if (format_version > 0) {
    slice_ = Slice(data_.data(), Footer::kNewVersionsEncodedLength);
    char* cur = data_.data();
    *(cur++) = checksum_type;            // 1B checksum_type
    part2 = cur;
    cur += kFooterPart2Size;             // skip the 40B reserved for the two handles
    part3 = cur;
    EncodeFixed32(cur, format_version);  // 4B format_version
    cur += 4;
    EncodeFixed64(cur, magic_number);    // 8B magic_number
  }
  // The format_version == 0 (legacy footer) branch, which also sets part2
  // and part3, is omitted from this excerpt.
  {
    char* cur = part2;
    cur = metaindex_handle.EncodeTo(cur);  // <= 20B metaindex_handle
    cur = index_handle.EncodeTo(cur);      // <= 20B index_handle
    // Zero pad remainder
    std::fill(cur, part3, char{0});        // zero-fill whatever is left of the 40B
  }
}
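Now look at how BlockHandle is encoded. It uses PutVarint64Varint64, so offset_ and size_ each take between 1 and 10 bytes, and an encoded handle is at most 20 bytes and usually far fewer. The footer still reserves 40 bytes for the two handles so that its total size stays fixed; the unused bytes are the zero padding.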
class BlockHandle {
  uint64_t offset_;
  uint64_t size_;
};
void BlockHandle::EncodeTo(std::string* dst) const {
  // Sanity check that all fields have been set
  assert(offset_ != ~uint64_t{0});
  assert(size_ != ~uint64_t{0});
  PutVarint64Varint64(dst, offset_, size_);
}
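To make the size range concrete, here is a minimal varint64 sketch. `Handle`, `EncodeHandle` and `DecodeHandle` are hypothetical names for illustration; the encoding assumed is the usual base-128 varint with 7 payload bits per byte.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append v as a varint: 7 payload bits per byte, low bits first, with the
// high bit set on every byte except the last.
inline void PutVarint64(std::string* dst, uint64_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Decode one varint64 starting at *pos and advance *pos past it.
// Returns false if the input is truncated or longer than 10 bytes.
inline bool GetVarint64(const std::string& src, size_t* pos, uint64_t* out) {
  uint64_t result = 0;
  for (uint32_t shift = 0; shift <= 63 && *pos < src.size(); shift += 7) {
    const uint64_t byte = static_cast<unsigned char>(src[(*pos)++]);
    result |= (byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) {
      *out = result;
      return true;
    }
  }
  return false;
}

// Hypothetical stand-in for BlockHandle.
struct Handle {
  uint64_t offset;
  uint64_t size;
};

inline std::string EncodeHandle(const Handle& h) {
  std::string dst;
  PutVarint64(&dst, h.offset);
  PutVarint64(&dst, h.size);
  return dst;
}

inline bool DecodeHandle(const std::string& src, Handle* h) {
  size_t pos = 0;
  return GetVarint64(src, &pos, &h->offset) && GetVarint64(src, &pos, &h->size);
}
```

A handle with offset 300 and size 4096 encodes to 4 bytes, while the worst case of two maximal 64-bit values takes the full 20 bytes.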
SST build table
Status BuildTable(
    const std::string& dbname, VersionSet* versions, ...)
{
  iter->SeekToFirst();
  if (iter->Valid() || !range_del_agg->IsEmpty()) {
    TableBuilder* builder;
    {
      IOStatus io_s = NewWritableFile(fs, fname, &file, file_options);
      builder = NewTableBuilder(tboptions, file_writer.get());
    }
    CompactionIterator c_iter(
        iter, tboptions.internal_comparator.user_comparator(), &merge,...);
    c_iter.SeekToFirst();
    for (; c_iter.Valid(); c_iter.Next()) {
      const Slice& key = c_iter.key();
      const Slice& value = c_iter.value();
      // Add one KV to the current data block. When Add() finds that the
      // block is full, it records the block's block_handle in the index block.
      builder->Add(key, value);
    }
    // Write all non-data blocks to disk.
    s = builder->Finish();
    EventHelpers::LogAndNotifyTableFileCreationFinished(...);
  }
}
Status BlockBasedTableBuilder::Finish() {
  Rep* r = rep_;
  // Record the block_handle of the last data block in the index block.
  r->index_builder->AddIndexEntry(
      &r->last_key, nullptr /* no next data block */, r->pending_handle);
  BlockHandle metaindex_block_handle, index_block_handle;
  MetaIndexBuilder meta_index_builder;
  // Write the filter block.
  WriteFilterBlock(&meta_index_builder);
  // Write the index block and return its handle in index_block_handle.
  WriteIndexBlock(&meta_index_builder, &index_block_handle);
  // Write the remaining meta blocks.
  WriteCompressionDictBlock(&meta_index_builder);
  WriteRangeDelBlock(&meta_index_builder);
  WritePropertiesBlock(&meta_index_builder);
  if (ok()) {
    // flush the meta index block
    // Write the metaindex block, which lists all meta blocks, and return
    // its handle in metaindex_block_handle.
    WriteRawBlock(meta_index_builder.Finish(), kNoCompression,
                  &metaindex_block_handle, BlockType::kMetaIndex);
  }
  if (ok()) {
    // Write the footer.
    WriteFooter(metaindex_block_handle, index_block_handle);
  }
}
References
https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format
https://www.jianshu.com/p/d6ce3593a69e