

First, let's agree on one thing: if you are a software engineer deciding which database to use for a new application, you need a good understanding of the underlying storage engine to reason about how the database actually behaves.


We are going to talk about storage engines that are used in both traditional relational databases and NoSQL databases.


We will cover the two most popular families of storage engines: log-structured storage engines and page-oriented storage engines.


The core job of a storage engine is to store and retrieve data, and to build indexes that speed up reads.


We will start with log-structured storage engines and how they handle these operations.


Hash indexes


Let's assume we have a key-value data set, our data storage is a file, and we only append to it (no in-place updates).


We keep an in-memory hash map in which each key is mapped to the offset of the first byte of its value in the log file, so a read knows where to seek. When we add a new key-value record, we first append it to the end of the log file (our storage) and then update the hash map.
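The append-and-index idea can be sketched in a few lines of Python. This is a minimal illustration, not how any particular engine is implemented: the class name `HashIndexLog`, the `key,value\n` record layout, and the choice to store the value length next to the offset are all assumptions made for the example.

```python
import os


class HashIndexLog:
    """Append-only key-value log with an in-memory hash index (illustrative sketch)."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> (offset of first value byte, value length)
        open(path, "ab").close()  # create the log file if it does not exist

    def put(self, key, value):
        key_bytes = key.encode("utf-8")
        value_bytes = value.encode("utf-8")
        # Hypothetical record layout: key, comma, value, newline.
        record = key_bytes + b"," + value_bytes + b"\n"
        record_start = os.path.getsize(self.path)  # appends always land at the end
        with open(self.path, "ab") as log:
            log.write(record)
        # Point the index at the first byte of the value, as described above.
        value_offset = record_start + len(key_bytes) + 1
        self.index[key] = (value_offset, len(value_bytes))

    def get(self, key):
        entry = self.index.get(key)
        if entry is None:
            return None
        value_offset, value_length = entry
        with open(self.path, "rb") as log:
            log.seek(value_offset)
            return log.read(value_length).decode("utf-8")
```

Writing the same key twice appends two records to the file, but the hash map only ever points at the latest one, so reads cost one seek regardless of how many old versions exist.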


This approach is very efficient if all the keys fit in the available memory (RAM), since the hash map is kept completely in memory.


So if your application has a lot of writes but not too many distinct keys (that is, a large number of writes per key), this approach will suit you. There is a storage engine that works this way: Bitcask.


Critical problem!


As mentioned, there will be a lot of writes to the log file: a single key may receive more than a million writes per hour. Because there are no in-place updates, every write is appended to the end of the file, so the log accumulates many stale duplicate records and we can run out of disk space.

A good solution is to break the log into segments of a certain size by closing a segment file when it reaches this size, and making subsequent writes to a new segment file.


We can then perform compaction on these segments: throw away duplicate keys in the log and keep only the most recent update for each key.


Figure: Compaction of a key-value update log, retaining only the most recent value for each key
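As a rough sketch of compaction, assume a segment is modeled as a Python list of `(key, value)` records in write order (a real segment would be a file). The function name `compact` and the sample keys are made up for the example.

```python
def compact(records):
    """Keep only the most recent value for each key.

    `records` is a list of (key, value) pairs in the order they were written.
    """
    latest = {}
    for key, value in records:
        # Remove any older entry first so the key moves to the position
        # of its most recent write in the output.
        latest.pop(key, None)
        latest[key] = value
    return list(latest.items())


segment = [("mew", 1078), ("purr", 2103), ("purr", 2104), ("mew", 1079)]
compacted = compact(segment)
```

Here four records shrink to two, one per distinct key, which is why compaction pays off most when a few keys are overwritten many times.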

Each segment now has its own in-memory hash table, mapping keys to file offsets.


Compaction often makes segments much smaller, because a segment may contain many duplicates, so we can also merge several segments together at the same time as we compact them.


The merging and compaction of frozen segments can be done in a background thread, and while it is going on, we can still continue to serve read and write requests as normal, using the old segment files.


After the merging process is complete, we switch read requests to using the new merged segment instead of the old segments — and then the old segment files can simply be deleted.
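Merging and compacting several frozen segments into one can be sketched as follows. The sketch assumes each segment is an in-memory list of `(key, value)` records and that the segments are passed oldest first; `merge_and_compact` is a name chosen for illustration.

```python
def merge_and_compact(segments):
    """Merge frozen segments (ordered oldest to newest) into one compacted segment."""
    latest = {}
    for segment in segments:        # oldest segment first
        for key, value in segment:  # records in write order
            latest[key] = value     # later writes overwrite earlier ones
    return list(latest.items())


old_segments = [
    [("a", 1), ("b", 2)],  # older segment
    [("a", 3), ("c", 4)],  # newer segment
]
merged_segment = merge_and_compact(old_segments)
```

Because the function only reads the old segments and builds a new one, the old segments stay usable for reads until the new segment is complete and the switch is made.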



In order to find the value for a key, we first check the most recent segment’s hash map; if the key is not present we check the second-most-recent segment, and so on.
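The lookup order can be sketched like this, assuming each segment's hash map is a plain dict from key to file offset and the list of maps is ordered newest first. The function name and the returned `(segment_position, offset)` pair are assumptions for the example.

```python
def find_offset(key, segment_indexes):
    """Search per-segment hash maps from newest to oldest.

    `segment_indexes` is a list of dicts (key -> file offset), newest first.
    Returns (segment_position, offset) for the first hit, or None.
    """
    for position, index in enumerate(segment_indexes):
        if key in index:
            return position, index[key]
    return None


indexes = [
    {"purr": 0, "mew": 64},    # most recent segment
    {"mew": 0, "yawn": 128},   # older segment
]
```

With these maps, `find_offset("mew", indexes)` stops at the most recent segment and never sees the stale entry in the older one. Keeping the number of segments small through merging keeps this loop short.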




However, the hash index approach has two limitations:

1- The hash table must fit in memory: if you have so many keys that they do not fit in the available memory, this approach will not work well.


2- Range queries are not efficient: you cannot easily scan over all keys between red0 and red99. You would have to look up each key individually in the hash maps.


To get over these limitations we will talk about another approach for indexing.


SSTables and LSM-Trees


As mentioned before, our segments are append-only, so the order of key-value pairs in the file does not matter.


We can make a simple change: require that each segment is sorted by key. We call this format a Sorted String Table (SSTable). Segments can then be merged into a new sorted segment using an approach like the merge step of the merge sort algorithm.

Figure: Merging several SSTable segments
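A merge of sorted segments can be sketched with Python's standard `heapq.merge`, which performs the same kind of multi-way merge as merge sort. The sketch assumes each segment is a key-sorted list of `(key, value)` pairs with at most one entry per key, and that segments are passed oldest first; `merge_sstables` is a hypothetical name.

```python
import heapq


def merge_sstables(segments):
    """Merge key-sorted segments (ordered oldest to newest) into one sorted segment.

    When the same key appears in several segments, the value from the
    newest segment wins.
    """
    # Tag each record with the negated segment age so that, for equal keys,
    # the record from the newest segment sorts first.
    tagged = [
        [(key, -age, value) for key, value in segment]
        for age, segment in enumerate(segments)
    ]
    merged = []
    for key, _, value in heapq.merge(*tagged, key=lambda record: record[:2]):
        if not merged or merged[-1][0] != key:
            merged.append((key, value))  # first record seen for a key is the newest
    return merged


older = [("handbag", 8786), ("handful", 40308)]
newer = [("handbag", 9999), ("handiwork", 16324)]
merged_table = merge_sstables([older, newer])
```

The merge reads each input sequentially and writes its output in key order, so it works even when the segments are much larger than the available memory.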

This solves the second limitation, range queries: because the keys in each SSTable segment are sorted, you can scan a range of keys efficiently.


We can also avoid the memory limitation of the hash maps, because we no longer have to keep every key in memory. We store only a small number of keys with their offsets (a sparse index). Because the segment is sorted, if the key you are looking for is not in the index, you still know which two indexed keys it lies between, and you only need to scan that range of the file.

Figure: An SSTable with an in-memory index
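A sparse index lookup can be sketched with the standard `bisect` module. In this illustration an SSTable is a key-sorted Python list and list positions stand in for byte offsets in a file; the function names and the `every` parameter are assumptions for the example.

```python
import bisect


def build_sparse_index(sstable, every=2):
    """Index every `every`-th key of a key-sorted list of (key, value) pairs."""
    return [(sstable[i][0], i) for i in range(0, len(sstable), every)]


def sparse_lookup(sstable, sparse_index, key):
    """Find `key` by jumping to the nearest indexed key and scanning forward."""
    index_keys = [indexed_key for indexed_key, _ in sparse_index]
    slot = bisect.bisect_right(index_keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first key in the table
    start = sparse_index[slot][1]
    if slot + 1 < len(sparse_index):
        end = sparse_index[slot + 1][1]
    else:
        end = len(sstable)
    for candidate, value in sstable[start:end]:
        if candidate == key:
            return value
        if candidate > key:
            break  # sorted order: the key cannot appear later
    return None


table = [("handbag", 1), ("handcuffs", 2), ("handful", 3),
         ("handicap", 4), ("handiwork", 5)]
sparse = build_sparse_index(table, every=2)
```

With `every=2` only three of the five keys are held in the index, yet any key can be found by scanning at most two records. A larger gap trades a longer scan for a smaller index.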

As we mentioned, the log structure is append-only, which means we write to segments sequentially. So how do we keep these segments sorted by key?


Memtable

We can now make our storage engine work as follows:

• When a write comes in, add it to an in-memory balanced tree data structure (for example, a red-black tree). This in-memory tree is sometimes called a memtable.
• When the memtable gets bigger than some threshold, typically a few megabytes, write it out to disk as an SSTable file. This can be done efficiently because the tree already maintains the key-value pairs sorted by key. The new SSTable file becomes the most recent segment of the database. While the SSTable is being written out to disk, writes can continue to a new memtable instance.
• In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.
• From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.

This indexing structure is what we call a Log-Structured Merge-Tree (LSM-tree).
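The write, flush, read, and compaction steps can be put together in a toy sketch. Several simplifications are assumptions of this example: a dict that is sorted at flush time stands in for the balanced tree, SSTables are in-memory sorted lists rather than files, the threshold counts keys rather than bytes, and deletes are not handled. The class name `TinyLSM` is made up.

```python
import bisect


class TinyLSM:
    """Toy LSM-tree: a memtable plus a list of sorted segments (illustrative only)."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}   # stand-in for a balanced tree
        self.sstables = []   # newest first; each is a key-sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as the newest sorted segment."""
        if self.memtable:
            self.sstables.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in self.sstables:  # newest segment first
            # (key,) sorts just before (key, value), so this finds the entry.
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

    def compact(self):
        """Merge all segments into one, keeping the newest value per key."""
        merged = {}
        for table in reversed(self.sstables):  # oldest first
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []
```

For example, after `put("a", 1)`, `put("b", 2)`, and `put("a", 3)` on a store with `memtable_limit=2`, the first two writes sit in a flushed segment and the third sits in the memtable, so `get("a")` returns the newer value without touching the segment.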


LSM-based engines are now the default in popular NoSQL databases including Apache Cassandra, Elasticsearch (Lucene), Google Bigtable, Apache HBase, and InfluxDB. Widely adopted embedded data stores such as LevelDB and RocksDB are also LSM-based.

That concludes log-structured storage engines. Let's move on to page-oriented storage engines, which are represented by B-trees.




B-Trees

The B-tree is the most widely used indexing structure in most databases, especially relational databases (PostgreSQL, MySQL, SQL Server, and Oracle).


B-trees break the database down into fixed-size pages, traditionally 4 KB in size (sometimes bigger), and read or write one page at a time.


Each page can be identified using an address or location, which allows one page to refer to another, similar to a pointer but on disk instead of in memory. The keys within each page are sorted.


The internal nodes consist of pointers to other nodes, and the leaf nodes contain the actual references to the data rows in the database.
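A lookup that follows page references from the root down to a leaf can be sketched as follows. The `Page` class, the dict of pages keyed by page id (standing in for on-disk addresses), and the `row-...` strings are assumptions made for this example; inserts and page splits are left out.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Page:
    keys: list                                     # sorted keys in this page
    children: list = field(default_factory=list)   # child page ids; empty for a leaf
    values: list = field(default_factory=list)     # row references; leaf pages only


def btree_lookup(pages, root_id, key):
    """Follow page references from the root to a leaf and return the row reference."""
    page = pages[root_id]
    while page.children:
        # children[i] covers keys below keys[i]; the last child covers the rest.
        child = bisect.bisect_right(page.keys, key)
        page = pages[page.children[child]]
    i = bisect.bisect_left(page.keys, key)
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None


pages = {
    0: Page(keys=[200], children=[1, 2]),                      # root (internal page)
    1: Page(keys=[100, 150], values=["row-100", "row-150"]),   # leaf
    2: Page(keys=[200, 250], values=["row-200", "row-250"]),   # leaf
}
```

Each iteration of the loop corresponds to reading one page, so the number of page reads for a lookup equals the depth of the tree.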



The number of references to child pages in one page of the B-tree is called the branching factor.


A B-tree with n keys always has a depth of O(log n).


Most databases can fit into a B-tree that is three or four levels deep, so you don't need to follow many page references to find the page you are looking for. (A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB.)
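The 256 TB figure can be reproduced with a short calculation, assuming that 4 KB means 4096 bytes, that 1 TB means 10^12 bytes, and that capacity is estimated as the branching factor raised to the number of levels, multiplied by the page size. The function name is made up for the example.

```python
def btree_capacity_bytes(branching_factor, levels, page_size_bytes):
    """Estimate capacity as (branching factor ^ levels) pages of the given size."""
    return branching_factor ** levels * page_size_bytes


capacity = btree_capacity_bytes(branching_factor=500, levels=4, page_size_bytes=4096)
capacity_tb = capacity / 10**12  # 256.0
```

Each extra level multiplies the capacity by the branching factor, which is why a handful of levels is enough for very large data sets.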


That's all. I hope this article helps you choose your storage engine. Good luck!





Translated from: https://medium.com/@mohamedveron23/guide-to-database-storage-engines-2b188bd3e9e3




