
SPDK “Reduce” Block Compression Algorithm {#reduce}

References

1. Community: https://www.codenong.com/cs106732883/ ("How to get both speed and compression ratio? Optimizing compression algorithms in build and deployment")
2. Official documentation
3. Source code

Overview

The SPDK “reduce” block compression scheme is based on using SSDs for storing compressed blocks of storage and persistent memory for metadata. This metadata includes mappings of logical blocks requested by a user to the compressed blocks on SSD. The scheme described in this document is generic and not tied to any specific block device framework such as the SPDK block device (bdev) framework. This algorithm will be implemented in a library called “libreduce”. Higher-level software modules can be built on top of this library to create and present block devices in a specific block device framework. For SPDK, a bdev_reduce module will serve as a wrapper around the libreduce library, to present the compressed block devices as an SPDK bdev.


This scheme only describes how compressed blocks are stored on an SSD and the metadata for tracking those compressed blocks. It relies on the higher-level software module to perform the compression algorithm itself. For SPDK, the bdev_reduce module will utilize the DPDK compressdev framework to perform compression and decompression on behalf of the libreduce library.


(Note that in some cases, blocks of storage may not be compressible, or cannot be compressed enough to realize savings from the compression. In these cases, the data may be stored uncompressed on disk. The phrase “compressed blocks of storage” includes these uncompressed blocks.)


A compressed block device is a logical entity built on top of a similarly-sized backing storage device. The backing storage device must be thin-provisioned to realize any savings from compression for reasons described later in this document. This algorithm has no direct knowledge of the implementation of the backing storage device, except that it will always use the lowest-numbered blocks available on the backing storage device. This will ensure that when this algorithm is used on a thin-provisioned backing storage device, blocks will not be allocated until they are actually needed.


The backing storage device must be sized for the worst case scenario, where no data can be compressed. In this case, the size of the backing storage device would be the same as the compressed block device. Since this algorithm ensures atomicity by never overwriting data in place, some additional backing storage is required to temporarily store data for writes in progress before the associated metadata is updated.


Storage from the backing storage device will be allocated, read, and written to in 4KB units for best NVMe performance. These 4KB units are called “backing IO units”. They are indexed from 0 to N-1 with the indices called “backing IO unit indices”. At start, the full set of indices represent the “free backing IO unit list”.


A compressed block device compresses and decompresses data in units of chunks, where a chunk is a multiple of at least two 4KB backing IO units. The number of backing IO units per chunk determines the chunk size and is specified when the compressed block device is created. A chunk consumes a number of 4KB backing IO units between 1 and the number of 4KB units in the chunk. For example, a 16KB chunk consumes 1, 2, 3 or 4 backing IO units. The number of backing IO units depends on how much the chunk was able to be compressed. The blocks on disk associated with a chunk are stored in a “chunk map” in persistent memory. Each chunk map consists of N 64-bit values, where N is the maximum number of backing IO units in the chunk. Each 64-bit value corresponds to a backing IO unit index. A special value (for example, 2^64-1) is used for backing IO units not needed due to compression. The number of chunk maps allocated is equal to the size of the compressed block device divided by its chunk size, plus some number of extra chunk maps. These extra chunk maps are used to ensure atomicity on writes and will be explained later in this document. At start, all of the chunk maps represent the “free chunk map list”.
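
The chunk-map layout described above can be sketched in a few lines of Python. This is an illustrative model, not the libreduce C implementation, and all names here are hypothetical:

```python
UNUSED = 2**64 - 1  # sentinel: backing IO unit slot not needed due to compression

CHUNK_SIZE = 16 * 1024
BACKING_IO_UNIT_SIZE = 4 * 1024
UNITS_PER_CHUNK = CHUNK_SIZE // BACKING_IO_UNIT_SIZE  # N = 4

def new_chunk_map():
    # N 64-bit entries, where N is the maximum backing IO units per chunk
    return [UNUSED] * UNITS_PER_CHUNK

def total_chunk_maps(volume_size, chunk_size, spares=1):
    # one chunk map per chunk, plus spares used to keep overwrites atomic
    return volume_size // chunk_size + spares

# a 16KB chunk that compressed into backing IO units 0 and 1
cm = new_chunk_map()
cm[0], cm[1] = 0, 1
```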


Finally, the logical view of the compressed block device is represented by the “logical map”. The logical map is a mapping of chunk offsets into the compressed block device to the corresponding chunk map. Each entry in the logical map is a 64-bit value, denoting the associated chunk map. A special value (UINT64_MAX) is used if there is no associated chunk map. The mapping is determined by dividing the byte offset by the chunk size to get an index, which is used as an array index into the array of chunk map entries. At start, all entries in the logical map have no associated chunk map. Note that while access to the backing storage device is in 4KB units, the logical view may allow 4KB or 512B unit access and should perform similarly.
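
The logical-map lookup reduces to integer division. A minimal sketch (hypothetical names, not the libreduce API):

```python
UINT64_MAX = 2**64 - 1  # special value: no chunk map associated yet

def logical_map_index(byte_offset, chunk_size):
    # the byte offset divided by the chunk size is the array index
    return byte_offset // chunk_size

# 64KB compressed volume with 16KB chunks -> 4 logical map entries
logical_map = [UINT64_MAX] * 4
idx = logical_map_index(32 * 1024, 16 * 1024)
```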


Example

To illustrate this algorithm, we will use a real example at a very small scale.

The size of the compressed block device is 64KB, with a chunk size of 16KB. This will realize the following:

  • “Backing storage” will consist of an 80KB thin-provisioned logical volume. This corresponds to the 64KB size of the compressed block device, plus an extra 16KB to handle additional write operations under a worst-case compression scenario.
  • “Free backing IO unit list” will consist of indices 0 through 19 (inclusive). These represent the 20 4KB IO units in the backing storage.
  • A “chunk map” will be 32 bytes in size. This corresponds to 4 backing IO units per chunk (16KB / 4KB), and 8B (64b) per backing IO unit index.
  • 5 chunk maps will be allocated in 160B of persistent memory. This corresponds to 4 chunk maps for the 4 chunks in the compressed block device (64KB / 16KB), plus an extra chunk map for use when overwriting an existing chunk.
  • “Free chunk map list” will consist of indices 0 through 4 (inclusive). These represent the 5 allocated chunk maps.
  • The “logical map” will be allocated in 32B of persistent memory. This corresponds to 4 entries for the 4 chunks in the compressed block device and 8B (64b) per entry.

In these examples, the value “X” will represent the special value (2^64-1) described above.
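
The sizing figures in the list above can be checked with simple arithmetic (a sketch; the constant names are hypothetical):

```python
KB = 1024
VOLUME, CHUNK, UNIT = 64 * KB, 16 * KB, 4 * KB

backing_size = VOLUME + CHUNK            # worst case plus one in-flight chunk
n_backing_units = backing_size // UNIT   # free backing IO unit list 0..19
units_per_chunk = CHUNK // UNIT
chunk_map_size = units_per_chunk * 8     # 8B (64b) per backing IO unit index
n_chunk_maps = VOLUME // CHUNK + 1       # one spare for overwrites
chunk_maps_size = n_chunk_maps * chunk_map_size
logical_map_size = (VOLUME // CHUNK) * 8
```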

Initial Creation

                  +--------------------+
  Backing Device  |                    |
                  +--------------------+

  Free Backing IO Unit List  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19

             +------------+------------+------------+------------+------------+
  Chunk Maps |            |            |            |            |            |
             +------------+------------+------------+------------+------------+

  Free Chunk Map List  0, 1, 2, 3, 4

              +---+---+---+---+
  Logical Map | X | X | X | X |
              +---+---+---+---+

Write 16KB at Offset 32KB

  • Find the corresponding index into the logical map. Offset 32KB divided by the chunk size (16KB) is 2.
  • Entry 2 in the logical map is “X”. This means no part of this 16KB has been written to yet.
  • Allocate a 16KB buffer in memory
  • Compress the incoming 16KB of data into this allocated buffer
  • Assume this data compresses to 6KB. This requires 2 4KB backing IO units.
  • Allocate 2 blocks (0 and 1) from the free backing IO unit list. Always use the lowest numbered entries in the free backing IO unit list - this ensures that unnecessary backing storage is not allocated in the thin-provisioned logical volume holding the backing storage.
  • Write the 6KB of data to backing IO units 0 and 1.
  • Allocate a chunk map (0) from the free chunk map list.
  • Write (0, 1, X, X) to the chunk map. This represents that only 2 backing IO units were used to store the 16KB of data.
  • Write the chunk map index to entry 2 in the logical map.
                  +--------------------+
  Backing Device  |01                  |
                  +--------------------+

  Free Backing IO Unit List  2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19

             +------------+------------+------------+------------+------------+
  Chunk Maps | 0 1 X X    |            |            |            |            |
             +------------+------------+------------+------------+------------+

  Free Chunk Map List  1, 2, 3, 4

              +---+---+---+---+
  Logical Map | X | X | 0 | X |
              +---+---+---+---+

Write 4KB at Offset 8KB

  • Find the corresponding index into the logical map. Offset 8KB divided by the chunk size is 0.
  • Entry 0 in the logical map is “X”. This means no part of this 16KB has been written to yet.
  • The write is not for the entire 16KB chunk, so we must allocate a 16KB chunk-sized buffer for source data.
  • Copy the incoming 4KB data to offset 8KB of this 16KB buffer. Zero the rest of the 16KB buffer.
  • Allocate a 16KB destination buffer.
  • Compress the 16KB source data buffer into the 16KB destination buffer
  • Assume this data compresses to 3KB. This requires 1 4KB backing IO unit.
  • Allocate 1 block (2) from the free backing IO unit list.
  • Write the 3KB of data to block 2.
  • Allocate a chunk map (1) from the free chunk map list.
  • Write (2, X, X, X) to the chunk map.
  • Write the chunk map index to entry 0 in the logical map.
                  +--------------------+
  Backing Device  |012                 |
                  +--------------------+

  Free Backing IO Unit List  3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19

             +------------+------------+------------+------------+------------+
  Chunk Maps | 0 1 X X    | 2 X X X    |            |            |            |
             +------------+------------+------------+------------+------------+

  Free Chunk Map List  2, 3, 4

              +---+---+---+---+
  Logical Map | 1 | X | 0 | X |
              +---+---+---+---+

Read 16KB at Offset 16KB

  • Offset 16KB maps to index 1 in the logical map.
  • Entry 1 in the logical map is “X”. This means no part of this 16KB has been written to yet.
  • Since no data has been written to this chunk, return all 0’s to satisfy the read I/O.
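
The read path for an unwritten chunk can be sketched as follows (hypothetical names; the decompression branch is omitted):

```python
UINT64_MAX = 2**64 - 1
CHUNK_SIZE = 16 * 1024

def read_chunk(idx, logical_map):
    if logical_map[idx] == UINT64_MAX:
        # no chunk map: the chunk was never written, so the read returns zeroes
        return bytes(CHUNK_SIZE)
    # otherwise: read the chunk map's backing IO units and decompress (omitted)
    raise NotImplementedError

data = read_chunk(1, [1, UINT64_MAX, 0, UINT64_MAX])
```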

Write 4KB at Offset 4KB

  • Offset 4KB maps to index 0 in the logical map.
  • Entry 0 in the logical map is “1”. Since we are not overwriting the entire chunk, we must do a read-modify-write.
  • Chunk map 1 only specifies one backing IO unit (2). Allocate a 16KB buffer and read block 2 into it. This will be called the compressed data buffer. Note that 16KB is allocated instead of 4KB so that we can reuse this buffer to hold the compressed data that will be written later back to disk.
  • Allocate a 16KB buffer for the uncompressed data for this chunk. Decompress the data from the compressed data buffer into this buffer.
  • Copy the incoming 4KB of data to offset 4KB of the uncompressed data buffer.
  • Compress the 16KB uncompressed data buffer into the compressed data buffer.
  • Assume this data compresses to 5KB. This requires 2 4KB backing IO units.
  • Allocate blocks 3 and 4 from the free backing IO unit list.
  • Write the 5KB of data to blocks 3 and 4.
  • Allocate chunk map 2 from the free chunk map list.
  • Write (3, 4, X, X) to chunk map 2. Note that at this point, the chunk map is not referenced by the logical map. If there was a power fail at this point, the previous data for this chunk would still be fully valid.
  • Write chunk map 2 to entry 0 in the logical map.
  • Free chunk map 1 back to the free chunk map list.
  • Free backing IO unit 2 back to the free backing IO unit list.
                  +--------------------+
  Backing Device  |01 34               |
                  +--------------------+

  Free Backing IO Unit List  2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19

             +------------+------------+------------+------------+------------+
  Chunk Maps | 0 1 X X    |            | 3 4 X X    |            |            |
             +------------+------------+------------+------------+------------+

  Free Chunk Map List  1, 3, 4

              +---+---+---+---+
  Logical Map | 2 | X | 0 | X |
              +---+---+---+---+
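
The ordering that makes this overwrite crash-safe — write the new chunk map first, then update the logical map, and only then free the old resources — can be sketched as follows (an illustrative Python model with hypothetical names, continuing the example state):

```python
UNUSED = 2**64 - 1

# state after the two earlier writes
free_units = list(range(3, 20))
free_chunk_maps = [2, 3, 4]
chunk_maps = [[0, 1, UNUSED, UNUSED], [2, UNUSED, UNUSED, UNUSED],
              None, None, None]
logical_map = [1, UNUSED, 0, UNUSED]

def overwrite_chunk(idx, new_units):
    old_cm = logical_map[idx]
    cm = free_chunk_maps.pop(0)
    chunk_maps[cm] = new_units + [UNUSED] * (4 - len(new_units))
    # a power failure before the next line leaves the old chunk fully valid
    logical_map[idx] = cm            # single 64-bit update: the atomic commit
    # only after the commit is it safe to recycle the old resources
    for u in chunk_maps[old_cm]:
        if u != UNUSED:
            free_units.insert(0, u)
    free_chunk_maps.append(old_cm)
    chunk_maps[old_cm] = None

new_units = [free_units.pop(0), free_units.pop(0)]   # blocks 3 and 4
overwrite_chunk(0, new_units)
```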

Operations that span across multiple chunks

Operations that span a chunk boundary are logically split into multiple operations, each of which is associated with a single chunk.

Example: 20KB write at offset 4KB

In this case, the write operation is split into a 12KB write at offset 4KB (affecting only chunk 0 in the logical map) and an 8KB write at offset 16KB (affecting only chunk 1 in the logical map). Each write is processed independently using the algorithm described above. Completion of the 20KB write does not occur until both operations have completed.
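
The split into per-chunk sub-operations can be sketched as a small helper (hypothetical; not the libreduce API):

```python
def split_by_chunk(offset, length, chunk_size=16 * 1024):
    """Split an I/O into (offset, length) pieces, each within one chunk."""
    ops, end = [], offset + length
    while offset < end:
        boundary = (offset // chunk_size + 1) * chunk_size  # next chunk edge
        piece = min(end, boundary) - offset
        ops.append((offset, piece))
        offset += piece
    return ops

# the 20KB write at offset 4KB from the example
parts = split_by_chunk(4 * 1024, 20 * 1024)
```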

Unmap Operations

Unmap operations on an entire chunk are achieved by removing the chunk map entry (if any) from the logical map. The chunk map is returned to the free chunk map list, and any backing IO units associated with the chunk map are returned to the free backing IO unit list.

Unmap operations that affect only part of a chunk can be treated as writing zeroes to that region of the chunk. If the entire chunk is unmapped via several operations, it can be detected via the uncompressed data equaling all zeroes. When this occurs, the chunk map entry may be removed from the logical map.

After an entire chunk has been unmapped, subsequent reads to the chunk will return all zeroes. This is similar to the “Read 16KB at offset 16KB” example above.
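
A whole-chunk unmap, as described above, can be sketched as follows (hypothetical names, continuing the example state):

```python
UNUSED = 2**64 - 1

def unmap_chunk(idx, logical_map, chunk_maps, free_chunk_maps, free_units):
    cm = logical_map[idx]
    if cm == UNUSED:
        return                       # nothing mapped: nothing to do
    logical_map[idx] = UNUSED        # subsequent reads now return zeroes
    for u in chunk_maps[cm]:
        if u != UNUSED:
            free_units.append(u)     # return backing IO units to the free list
    chunk_maps[cm] = None
    free_chunk_maps.append(cm)       # return the chunk map to the free list

logical_map = [2, UNUSED, 0, UNUSED]
chunk_maps = [[0, 1, UNUSED, UNUSED], None, [3, 4, UNUSED, UNUSED], None, None]
free_chunk_maps, free_units = [1, 3, 4], [2] + list(range(5, 20))
unmap_chunk(0, logical_map, chunk_maps, free_chunk_maps, free_units)
```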

Write Zeroes Operations

Write zeroes operations are handled similarly to unmap operations. If a write zeroes operation covers an entire chunk, we can remove the chunk’s entry in the logical map completely. Then subsequent reads to that chunk will return all zeroes.

Restart

An application using libreduce will periodically exit and need to be restarted. When the application restarts, it will reload compressed volumes so they can be used again from the same state as when the application exited.

When the compressed volume is reloaded, the free chunk map list and free backing IO unit list are reconstructed by walking the logical map. The logical map will only point to valid chunk maps, and the valid chunk maps will only point to valid backing IO units. Any chunk maps and backing IO units not referenced go into their respective free lists.

This ensures that if a system crashes in the middle of a write operation - i.e. during or after a chunk map is updated, but before it is written to the logical map - that everything related to that in-progress write will be ignored after the compressed volume is restarted.
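
The restart reconstruction — walking the logical map and treating everything unreferenced as free — can be sketched as (hypothetical names):

```python
UNUSED = 2**64 - 1

def rebuild_free_lists(logical_map, chunk_maps, n_backing_units):
    used_cms = {cm for cm in logical_map if cm != UNUSED}
    used_units = set()
    for cm in used_cms:
        used_units.update(u for u in chunk_maps[cm] if u != UNUSED)
    # anything not reachable from the logical map is free, including
    # leftovers from a write that was interrupted before its commit
    free_cms = [i for i in range(len(chunk_maps)) if i not in used_cms]
    free_units = [i for i in range(n_backing_units) if i not in used_units]
    return free_cms, free_units

logical_map = [2, UNUSED, 0, UNUSED]
chunk_maps = [[0, 1, UNUSED, UNUSED], None, [3, 4, UNUSED, UNUSED], None, None]
free_cms, free_units = rebuild_free_lists(logical_map, chunk_maps, 20)
```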

Overlapping operations on same chunk

Implementations must take care to handle overlapping operations on the same chunk. For example, operation 1 writes some data to chunk A, and while this is in progress, operation 2 also writes some data to chunk A. In this case, operation 2 should not start until operation 1 has completed. Further optimizations are outside the scope of this document.
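
One simple way to enforce this ordering is a per-chunk queue: an operation starts immediately if the chunk is idle, otherwise it waits for the in-flight operation to complete. A minimal sketch (hypothetical class; callbacks stand in for asynchronous I/O):

```python
from collections import defaultdict, deque

class ChunkSerializer:
    """Run at most one operation per chunk at a time; queue the rest."""
    def __init__(self):
        self.pending = defaultdict(deque)

    def submit(self, chunk_idx, op):
        q = self.pending[chunk_idx]
        q.append(op)
        if len(q) == 1:
            op()          # chunk idle: start immediately

    def complete(self, chunk_idx):
        q = self.pending[chunk_idx]
        q.popleft()
        if q:
            q[0]()        # start the next queued operation

s, order = ChunkSerializer(), []
s.submit(0, lambda: order.append("op1"))
s.submit(0, lambda: order.append("op2"))  # queued behind op1
s.complete(0)                             # op1 done -> op2 starts
```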

Thin provisioned backing storage

Backing storage must be thin provisioned to realize any savings from compression. This algorithm will always use (and reuse) the backing IO units available closest to offset 0 on the backing device. This ensures that even though the backing storage device may have been sized similarly to the compressed volume, storage for the backing storage device will not actually be allocated until the backing IO units are actually needed.
