How SSDs Work (Part 3, Reposted and Translated)

TeQuL

Original article: http://codecapsule.com/2014/02/12/coding-for-ssds-part-6-a-summary-what-every-programmer-should-know-about-solid-state-drives/

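Original source: Code Capsule. The site offers a Chinese translation, but it could not be opened as of 2020-03-18, hence this translation.

Zhihu was also consulted.

The series has six parts. This chapter (Part 3) covers the Flash Translation Layer (FTL), which translates logical addresses (from host programs) into physical addresses (locations on the SSD), and is therefore the most important. The full series: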

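1  Introduction

2  Benchmarking

3  Pages, blocks, and the FTL

4  Advanced functionalities and internal parallelism

5  Access patterns and system optimizations

6  Summary. This part summarizes the SSD material chapter by chapter; Part 1 is free-form discussion, while this part is more of an overview.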

This part covers:

How an SSD writes at the page and block level;

The basic concepts of write amplification and wear leveling;

The Flash Translation Layer (FTL) and its two basic functions: logical block mapping and garbage collection;

Hybrid log-block mapping.

[Figure: ssd-architecture.jpg]

[Figure: ssd-presentation-03.jpg]

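Basic operations

3.1 Read, write, erase

Because of how NAND flash is organized, a single cell cannot be read or written on its own.

Reads are page-granular

Reading even a single byte reads a whole page.

Writes are page-granular

Writing even a single byte writes a whole page.

Pages cannot be overwritten

When a page is modified, its data is first read into an internal register and updated there. The updated version is not put back into the original page but written to a "free" page, and the original page is marked "stale".

Erasing turns stale pages into free pages

Erases are block-granular

The host can only issue reads and writes, not erases. Erasing is done automatically by the SSD's garbage collection process.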
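The rules above can be sketched as a toy model. This is only an illustration under simplifying assumptions: the names `PageState` and `Block` are made up for this sketch, updates stay inside the same block (a real controller may place them anywhere), and `erase` ignores the fact that any still-valid pages would have to be copied elsewhere first.

```python
from enum import Enum


class PageState(Enum):
    FREE = "free"
    VALID = "valid"
    STALE = "stale"


class Block:
    """A toy NAND block: pages are written individually, erased together."""

    def __init__(self, pages_per_block=4):
        self.states = [PageState.FREE] * pages_per_block
        self.data = [None] * pages_per_block

    def write_page(self, index, payload):
        # A page can only be programmed while it is free (no overwrite).
        if self.states[index] is not PageState.FREE:
            raise ValueError("page is not free; it cannot be overwritten")
        self.states[index] = PageState.VALID
        self.data[index] = payload

    def update_page(self, old_index, payload):
        # An update goes to the next free page; the old copy becomes stale.
        new_index = self.states.index(PageState.FREE)  # ValueError if full
        self.write_page(new_index, payload)
        self.states[old_index] = PageState.STALE
        return new_index

    def erase(self):
        # Erase works on the whole block and turns every page back to free.
        self.states = [PageState.FREE] * len(self.states)
        self.data = [None] * len(self.data)


block = Block()
block.write_page(0, "x")
new_index = block.update_page(0, "x'")  # lands in page 1, page 0 turns stale
```

Calling `block.write_page(0, "y")` at this point raises `ValueError`, which is the "pages cannot be overwritten" rule; only `block.erase()` makes page 0 writable again.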

3.2 Example

[Figure: ssd-writing-data.jpg]

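3.3 Write amplification

Because writes are aligned on the page size, any write operation that is not both aligned on the page size and a multiple of the page size will require more data to be written than necessary, a concept called write amplification.

Writing data in an unaligned way causes the pages to be read into cache before being modified and written back to the drive, which is slower than directly writing pages to the disk. This operation is known as read-modify-write, and should be avoided whenever possible.

Flash must be erased before it can be rewritten, and an erase works at a much coarser granularity than a write, so these operations end up moving (or rewriting) user data and metadata more than once. To rewrite data, the drive has to read some already-used portion of flash, update it, and write it to a new location; if that location was used before, it must be erased first. Because flash works this way, the portion of flash that gets erased and rewritten is much larger than the new data actually requires. This multiplying effect increases the number of writes performed, which shortens the SSD's life, that is, the time for which it can operate reliably. The extra writes also consume flash bandwidth, which mainly degrades random write performance.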
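The page-alignment part of this effect is simple arithmetic. The sketch below assumes a 16 KB page and counts only whole pages touched by a single write; it ignores the additional amplification caused by garbage collection and metadata updates. The function names are illustrative.

```python
PAGE_SIZE = 16 * 1024  # assumed page size in bytes


def pages_touched(offset, length, page_size=PAGE_SIZE):
    """Number of pages a write of `length` bytes at byte `offset` touches."""
    if length <= 0:
        return 0
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return last - first + 1


def write_amplification(offset, length, page_size=PAGE_SIZE):
    """Bytes physically written divided by bytes requested."""
    return pages_touched(offset, length, page_size) * page_size / length


aligned = write_amplification(0, 16 * 1024)            # 1.0: exactly one page
small = write_amplification(0, 4 * 1024)               # 4.0: 4 KB costs a full page
unaligned = write_amplification(8 * 1024, 16 * 1024)   # 2.0: straddles two pages
```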

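Never write less than a full page

This avoids write amplification and read-modify-write. By default, write in units of 16 KB, a common SSD page size.

Align writes, and write multiple pages at once

Buffer small writes

To maximize throughput, keep small writes in a RAM buffer and, when the buffer is full, write it out in one go.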
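A minimal sketch of buffering small writes, at the application level. `WriteBuffer` and its `flush` callback are hypothetical names for this illustration; a real program would also need to flush the remainder on shutdown and decide what durability guarantees it needs.

```python
class WriteBuffer:
    """Collects small writes in RAM and flushes them one full page at a time."""

    def __init__(self, flush, page_size=16 * 1024):
        self.flush = flush          # callable that receives one full page
        self.page_size = page_size
        self.pending = bytearray()

    def write(self, data):
        self.pending += data
        # Emit as many complete pages as the buffer currently holds.
        while len(self.pending) >= self.page_size:
            self.flush(bytes(self.pending[:self.page_size]))
            del self.pending[:self.page_size]


flushed = []
buf = WriteBuffer(flushed.append, page_size=16)  # tiny page size for the demo
for _ in range(5):
    buf.write(b"abcdefgh")  # five 8-byte writes
```

Five 8-byte writes end up as two page-sized flushes, with 8 bytes still waiting in the buffer.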

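3.4 Wear leveling

As covered in Section 1.1, NAND flash cells have a limited lifespan because they survive only a limited number of P/E (program/erase) cycles.

The SSD controller needs to spread this wear evenly to extend the drive's life, ideally so that all blocks reach their P/E limit at about the same time.

The controller therefore has to balance the number of writes across blocks, which may mean moving data from one block to another. Block management must trade off wear leveling against keeping write amplification low. Garbage collection is one of the mechanisms involved.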
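One very simple way to picture wear leveling is to always write to the least-worn free block. This is an assumed policy for illustration only; actual controllers use proprietary and more elaborate schemes, and `pick_block_for_write` is a made-up name.

```python
def pick_block_for_write(erase_counts, free_blocks):
    """Return the free block with the fewest program/erase cycles so far."""
    return min(free_blocks, key=lambda block_id: erase_counts[block_id])


erase_counts = {0: 120, 1: 35, 2: 980, 3: 36}  # P/E cycles per block id
free_blocks = {0, 2, 3}                        # block 1 is in use
chosen = pick_block_for_write(erase_counts, free_blocks)  # block 3
```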

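Flash Translation Layer (FTL)

4.1 Why an FTL is needed

Logical block addresses (LBAs) suit HDDs, because the sectors they address can be overwritten in place (as noted above, an SSD cannot overwrite a page).

To stay compatible with the LBA-based interface used by HDDs and to hide the internals of flash, an FTL performs the translation. The FTL is part of the SSD controller.

The FTL has two main functions: logical block mapping and garbage collection.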

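4.2 Logical block mapping

The logical block mapping translates logical block addresses (LBAs) from the host space into physical block addresses (PBAs) in the physical NAND-flash memory space.

This requires an LBA-to-PBA table. The table is kept in the SSD's RAM for speed and is also persisted in flash so that it survives power loss. At power-up, the SSD reads the persisted copy and rebuilds the table in RAM.

Page-level mapping: fast and flexible, but it needs a lot of RAM, which is costly.

Block-level mapping: needs far less RAM, but it increases write amplification.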
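A rough calculation shows why page-level mapping is RAM-hungry. All figures below are assumptions chosen for illustration (a 256 GiB drive, 16 KiB pages, 256 pages per block, 4-byte table entries); real drives differ.

```python
def table_size_bytes(capacity, unit_size, entry_size=4):
    """Size of a mapping table with one entry per mapped unit."""
    return (capacity // unit_size) * entry_size


CAPACITY = 256 * 1024**3      # assumed 256 GiB drive
PAGE_SIZE = 16 * 1024         # assumed 16 KiB page
PAGES_PER_BLOCK = 256         # assumed block of 256 pages (4 MiB)

page_level = table_size_bytes(CAPACITY, PAGE_SIZE)                     # 64 MiB
block_level = table_size_bytes(CAPACITY, PAGE_SIZE * PAGES_PER_BLOCK)  # 256 KiB
```

Under these assumptions the page-level table is 256 times larger than the block-level one, which is exactly the pages-per-block ratio.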

Many hybrid mapping schemes were created to balance the two. The most common is log-block mapping, which works much like a log-structured file system.

Incoming write operations are written sequentially to log blocks. When a log block is full, it is merged with the data block associated with the same logical block number (LBN) into a free block. Only a few log blocks need to be maintained, so they can be maintained at page granularity. Data blocks, by contrast, are maintained at block granularity.

An example of a hybrid log-block FTL (write and read operations)

[Figure: ssd-hybrid-ftl.jpg]

In this example a block has only 4 pages. Four write operations are performed, each one page in size.

Logical page numbers (LPNs) 5 and 9 both resolve to logical block number LBN=1, which is associated with log block #1000 (PBN=1000).

Initially, all the physical page offsets are null at the entry where LBN=1 is in the log-block mapping table, and the log block #1000 is entirely empty as well.

The first write, b’ at LPN=5, is resolved to LBN=1 by the log-block mapping table, which is associated with PBN=1000 (log block #1000). The page b’ is therefore written at the physical offset 0 in block #1000.

The metadata for the mapping (data about the data) now needs to be updated, and for this, the physical offset associated with the logical offset of 1 (arbitrary value for this example) is updated from null to 0.

The write operations go on and the mapping metadata is updated accordingly. When the log block #1000 is entirely filled, it is merged with the data block associated to the same logical block, which is block #3000 in this case. This information can be retrieved from the data-block mapping table, which maps logical block numbers to physical block numbers. The data resulting from the merge operation is written to a free block, #9000 in this case.

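The merge described here presumably works as follows: going through log block #1000 in offset order, each page is matched to its logical page offset via the log-block mapping table and written to the corresponding position of the new data block (where the log block holds data for a position, it replaces the data block's data at that position). The newly written data block #9000 then takes the place of #3000 in the data-block mapping table. In other words, the valid data block for logical block 1 is now #9000.

This is the same idea as the log-structured merge-tree.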

An important thing to notice here is that the four write operations were concentrated on only two LPNs. The log-block approach made it possible to hide the b’ and d’ operations during the merge and to use the more up-to-date b” and d” versions directly, which yields better write amplification.

In short: of the four writes, the last two superseded the first two, so this approach reduced write amplification.

Finally, if a read command is requesting a page that was recently updated and for which the merge step on the blocks has not occurred yet, then the page will be in a log block. Otherwise, the page will be found in a data block. This is why read requests need to check both the log-block mapping table and the data-block mapping table, as shown in Figure 5.

In short: this approach reduces the number of writes, but to return the right data (for instance, data that was modified but not yet merged into a data block), a read has to consult both the log-block mapping table and the data-block mapping table.
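The write, merge, and read paths walked through above can be put together in a small sketch. It is a toy under stated assumptions, not how any particular controller works: `HybridFTL` is a hypothetical class, each logical block gets at most one log block, the LBN and logical offset are derived with `divmod` (the original figure uses arbitrary values), and erased blocks are simply returned to the free list. Because of the `divmod` assumption the demo writes to LPN 5 and 7, which both fall in LBN=1, rather than 5 and 9.

```python
PAGES_PER_BLOCK = 4


class HybridFTL:
    """Toy hybrid log-block FTL: one log block per logical block."""

    def __init__(self, free_blocks):
        self.flash = {}            # PBN -> list of page contents
        self.free_blocks = list(free_blocks)
        self.data_map = {}         # LBN -> PBN of the data block
        self.log_map = {}          # LBN -> (log PBN, {logical offset: physical offset})

    def _allocate(self):
        pbn = self.free_blocks.pop(0)
        self.flash[pbn] = [None] * PAGES_PER_BLOCK
        return pbn

    def write(self, lpn, payload):
        lbn, offset = divmod(lpn, PAGES_PER_BLOCK)
        if lbn not in self.log_map:
            self.log_map[lbn] = (self._allocate(), {})
        log_pbn, offsets = self.log_map[lbn]
        # Log blocks are filled sequentially, whatever the logical offset is.
        next_free = self.flash[log_pbn].index(None)
        self.flash[log_pbn][next_free] = payload
        offsets[offset] = next_free        # the newest copy wins
        if next_free == PAGES_PER_BLOCK - 1:
            self._merge(lbn)

    def _merge(self, lbn):
        log_pbn, offsets = self.log_map.pop(lbn)
        old_pbn = self.data_map.get(lbn)
        new_pbn = self._allocate()
        for offset in range(PAGES_PER_BLOCK):
            if offset in offsets:          # latest version lives in the log block
                page = self.flash[log_pbn][offsets[offset]]
            elif old_pbn is not None:      # untouched page comes from the data block
                page = self.flash[old_pbn][offset]
            else:
                page = None
            self.flash[new_pbn][offset] = page
        self.data_map[lbn] = new_pbn
        # The old log and data blocks hold only stale pages now: "erase" them.
        for pbn in (log_pbn, old_pbn):
            if pbn is not None:
                del self.flash[pbn]
                self.free_blocks.append(pbn)

    def read(self, lpn):
        lbn, offset = divmod(lpn, PAGES_PER_BLOCK)
        # Check the log-block table first, then fall back to the data block.
        if lbn in self.log_map and offset in self.log_map[lbn][1]:
            log_pbn, offsets = self.log_map[lbn]
            return self.flash[log_pbn][offsets[offset]]
        if lbn in self.data_map:
            return self.flash[self.data_map[lbn]][offset]
        return None


ftl = HybridFTL(free_blocks=[1000, 9000])
ftl.flash[3000] = ["a", "b", "c", "d"]   # existing data block for LBN=1
ftl.data_map[1] = 3000

ftl.write(5, "b'")
ftl.write(7, "d'")
before_merge = ftl.read(5)     # served from log block #1000
ftl.write(5, "b''")
ftl.write(7, "d''")            # fills the log block and triggers the merge
```

After the fourth write, block #9000 holds `a, b'', c, d''` and is the data block for LBN=1; the intermediate versions `b'` and `d'` were never copied.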

(less important)

The log-block FTL allows for optimizations, the most notable being the switch-merge, sometimes referred to as “swap-merge”. Let’s imagine that all the addresses in a logical block were written at once. This would mean that all the new data for those addresses would be written to the same log block. Since this log block contains the data for a whole logical block, it would be useless to merge this log block with a data block into a free block, because the resulting free block would contain exactly the same data as the log block. It would be faster to only update the metadata in the data block mapping table, and switch the data block in the data block mapping table for the log block. This is a switch-merge.

This introduces the switch-merge, a performance optimization. Roughly: a regular merge copies pages from the log block and the data block into a free block, whereas a switch-merge copies nothing, because a log block that already holds a whole logical block simply becomes the new data block through a metadata update.
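As a hedged illustration, the condition for a switch-merge can be expressed as a check on the log block's offset table. The function name and the table layout (logical offset mapped to physical offset) are assumptions of this sketch, which also assumes that a block-granular data block must store logical offset i at physical offset i.

```python
def can_switch_merge(offsets, pages_per_block):
    """True when the log block already is a complete, in-order logical block."""
    return all(offsets.get(i) == i for i in range(pages_per_block))


sequential = {0: 0, 1: 1, 2: 2, 3: 3}   # whole logical block written in order
scattered = {1: 2, 3: 3}                # two pages rewritten twice
switch_ok = can_switch_merge(sequential, 4)     # True: only metadata changes
needs_full = not can_switch_merge(scattered, 4) # True: a regular merge is needed
```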

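4.3 State of the industry

The specific FTL strategy has a large effect on performance, and manufacturers treat it as a trade secret.

4.4 Garbage collection

As noted earlier, an SSD cannot overwrite pages in place. Garbage collection erases blocks that hold stale pages so that those pages become free again.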

Because of the high latency required by the erase command compared to the write command — which are respectively 1500-3500 μs and 250-1500 μs as described in Section 1 — this extra erase step incurs a delay which makes the writes slower. Therefore, some controllers implement a background garbage collection process, also called idle garbage collection, which takes advantage of idle time and runs regularly in the background to reclaim stale pages and ensure that future foreground operations will have enough free pages available to achieve the highest performance [1]. Other implementations use a parallel garbage collection approach, which performs garbage collection operations in parallel with write operations from the host [13].

In short: erasing is slow, so a background process does this work during idle time.
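One pass of garbage collection might look like the sketch below. The greedy choice of victim (the block with the most stale pages) is an assumption for illustration, not something the article specifies, and `collect` is a hypothetical helper that expects the target block to be entirely free.

```python
def collect(blocks, free_block):
    """Greedy GC: reclaim the block with the most stale pages.

    `blocks` maps a block id to a list of (state, payload) pages, where
    state is "valid", "stale" or "free". Returns (victim id, pages copied).
    """
    victim = max(blocks, key=lambda b: sum(s == "stale" for s, _ in blocks[b]))
    survivors = [page for page in blocks[victim] if page[0] == "valid"]
    # Valid pages must be copied out before the whole block is erased.
    target = blocks[free_block]
    target[:len(survivors)] = survivors
    blocks[victim] = [("free", None)] * len(blocks[victim])
    return victim, len(survivors)


blocks = {
    "A": [("stale", 1), ("valid", 2), ("stale", 3), ("stale", 4)],
    "B": [("valid", 5), ("valid", 6), ("stale", 7), ("valid", 8)],
    "C": [("free", None)] * 4,
}
victim, copied = collect(blocks, free_block="C")
```

Here block A is reclaimed at the cost of copying a single valid page; every copied page is an extra write, which is how garbage collection contributes to write amplification.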

It is not uncommon to encounter workloads in which the writes are so heavy that the garbage collection needs to be run on-the-fly, at the same time as commands from the host. In that case, the garbage collection process supposed to run in background could be interfering with the foreground commands [1]. The TRIM command and over-provisioning are two great ways to reduce this effect, and are covered in more details in Sections 6.1 and 6.2.

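In short: sometimes the write load is so heavy that garbage collection has to run on the fly, at the same time as host commands, which interferes with them. The TRIM command and over-provisioning are two effective ways to mitigate this, and are covered in Sections 6.1 and 6.2.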

(Less important points:)

A less important reason for blocks to be moved is the read disturb. Reading can change the state of nearby cells, thus blocks need to be moved around after a certain number of reads have been reached [14].

Read disturb: after a certain number of reads, a block needs to be moved.

The rate at which data is changing is an important factor. Some data changes rarely, and is called cold or static data, while some other data is updated frequently, which is called hot or dynamic data. If a page stores partly cold and partly hot data, then the cold data will be copied along with the hot data during garbage collection for wear leveling, increasing write amplification due to the presence of cold data. This can be avoided by splitting cold data from hot data, simply by storing them in separate pages. The drawback is then that the pages containing the cold data are less frequently erased, and therefore the blocks storing cold and hot data have to be swapped regularly to ensure wear leveling.

Since the hotness of data is defined at the application level, the FTL has no way of knowing how much of cold and hot data is contained within a single page. A way to improve performance in SSDs is to split cold and hot data as much as possible into separate pages, which will make the job of the garbage collector easier [8].

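Hot and cold data: some data changes often and some rarely. The two should be stored in separate pages, which avoids moving cold data unnecessarily.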
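The cost of mixing hot and cold data can be illustrated with a small count. This sketch makes simplifying assumptions: it works at the level of whole pages grouped into blocks (since erasure happens per block), treats every hot page as updated once, and assumes every block containing a stale page is then reclaimed. The function name and page labels are made up.

```python
def pages_copied_by_gc(blocks, updated):
    """Count the valid pages GC must copy to reclaim every block with stale pages."""
    copied = 0
    for block in blocks:
        stale = [page for page in block if page in updated]
        if stale:
            copied += len(block) - len(stale)
    return copied


hot = {"h1", "h2", "h3", "h4"}                           # pages that get rewritten
mixed = [["h1", "c1", "h2", "c2"], ["h3", "c3", "h4", "c4"]]
separated = [["h1", "h2", "h3", "h4"], ["c1", "c2", "c3", "c4"]]

mixed_cost = pages_copied_by_gc(mixed, hot)          # 4 cold pages dragged along
separated_cost = pages_copied_by_gc(separated, hot)  # 0: the hot block is all stale
```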
