【存储引擎】日志结构化文件系统 Log-Structured File System

概述

《The Design and Implementation of a Log-Structured File System》是 1992 年发表的一篇论文。在笔者任职公司开发的分布式存储产品中,论文提出的 LFS 日志结构化文件系统是整个加速引擎层的思想基础,论文中重点阐述的 cost-benefit 算法也已经引入产品使用。这篇博文对论文的核心章节进行了翻译、整理,并对关键内容结合个人理解增加了评注,希望对大家阅读这篇论文有帮助。

论文摘要

This paper presents a new technique for disk storage management called a log-structured file system. A log-structured file system writes all modifications to disk sequentially in a log-like structure, thereby speeding up both file writing and crash recovery. The log is the only structure on disk; it contains indexing information so that files can be read back from the log efficiently. In order to maintain large free areas on disk for fast writing, we divide the log into segments and use a segment cleaner to compress the live information from heavily fragmented segments. We present a series of simulations that demonstrate the efficiency of a simple cleaning policy based on cost and benefit. We have implemented a prototype log-structured file system called Sprite LFS; it outperforms current Unix file systems by an order of magnitude for small-file writes while matching or exceeding Unix performance for reads and large writes. Even when the overhead for cleaning is included, Sprite LFS can use 70% of the disk bandwidth for writing, whereas Unix file systems typically can use only 5-10%.

本文提出了一种叫做日志结构化文件系统的磁盘存储管理新技术。日志结构化文件系统以类似日志的形式把所有的修改顺序写入磁盘,从而加快文件写入和崩溃恢复。日志是磁盘上唯一的结构,它包含了索引信息,这样就能从日志中高效地读取文件。为了在磁盘上维护大块的空闲空间以实现快速写入,我们把日志分割成多个 segment,并且使用 segment cleaner 来压紧高度碎片化的 segment 中的有效信息(// 评注:把碎片化的 segment 中仍有效的信息搬移/重新聚合到新的 segment 中)。我们展示了一系列的仿真结果,证明了一个简单的基于成本和收益的清理策略的高效性。我们实现了一个日志结构化文件系统的原型,叫做 Sprite LFS;它在小文件写入上的性能超过了当前的 Unix 文件系统一个数量级,同时在读和大文件写上的性能和 Unix 文件系统相当或者更好。即使在包含清理开销的情况下,Sprite LFS 也可以把磁盘 70% 的带宽用于写入,而 Unix 文件系统通常只能使用 5%~10%。

论文第 3 章 Log-structured file systems

The fundamental idea of a log-structured file system is to improve write performance by buffering a sequence of file system changes in the file cache and then writing all the changes to disk sequentially in a single disk write operation. The information written to disk in the write operation includes file data blocks, attributes, index blocks, directories, and almost all the other information used to manage the file system. For workloads that contain many small files, a log-structured file system converts the many small synchronous random writes of traditional file systems into large asynchronous sequential transfers that can utilize nearly 100% of the raw disk bandwidth.

日志结构化文件系统的基本思想是先在 file cache 中缓存一系列文件系统的变更,然后在一次磁盘写入操作中把所有这些变更顺序地写到磁盘中。在写入操作中被写入磁盘的信息有文件数据块、属性、索引块、目录和几乎所有其他用于管理文件系统的信息。对于包含很多小文件的工作负载,日志结构化文件系统把传统文件系统中很多小型的同步的随机写转换为大型的异步的顺序写入,通过这样的转换,可以利用接近 100% 的原始磁盘带宽。

Although the basic idea of a log-structured file system is simple, there are two key issues that must be resolved to achieve the potential benefits of the logging approach. The first issue is how to retrieve information from the log; this is the subject of Section 3.1 below. The second issue is how to manage the free space on disk so that large extents of free space are always available for writing new data. This is a much more difficult issue; it is the topic of Sections 3.2-3.6. Table 1 contains a summary of the on-disk data structures used by Sprite LFS to solve the above problems; the data structures are discussed in detail in later sections of the paper.

尽管日志结构化文件系统的基本思想很简单,但是为了得到日志方法的潜在收益,有两个关键的问题必须要解决。第一个问题是怎么从日志中检索信息(在 3.1 节中介绍)。第二个问题是怎么管理磁盘上的剩余空间,才能始终有大段的剩余空间可以写入新的数据(这个问题的难度更大,在 3.2~3.6 节中介绍)。(// 评注:这两个问题是文章的核心内容)表 1 总结了 Sprite LFS 为了解决这两个问题在磁盘中所使用的数据结构;数据结构的细节会在文章后面的章节中讨论。

| 数据结构 | 用途 | 位置 | 章节 |
| --- | --- | --- | --- |
| Inode | 定位文件的 block,保存保护位、修改时间等 | Log | 3.1 |
| Inode map | 定位 inode 在日志中的位置,保存最后一次访问的时间和版本号 | Log | 3.1 |
| Indirect block | 定位大文件的 block | Log | 3.1 |
| Segment summary | 识别 segment 的内容(文件号和每个 block 的偏移) | Log | 3.2 |
| Segment usage table | 统计 segment 中的有效字节数,存储 segment 最后一次写数据的时间 | Log | 3.6 |
| Superblock | 保存静态配置信息,比如 segment 的数量和大小 | 磁盘固定位置 | 无 |
| Checkpoint region | 定位 inode map 和 segment usage table 的 block,标识日志中的最后一个 checkpoint | 磁盘固定位置 | 4.1 |
| Directory change log | 记录目录操作以维持 inode 计数的一致性 | Log | 4.2 |

表格介绍了 Sprite LFS 中每个数据结构的作用。表格也说明了数据结构是存储在日志中还是存储在磁盘中的固定位置,以及在这篇文章中对数据结构进行详细讨论的位置。Inode、indirect block 以及 superblock 和 Unix FFS 中同名的数据结构是相似的。注意,Sprite LFS 既不包含位图,也不包含空闲链表。

3.1. File location and reading

// 评注:讲的是上文提到的 "how to retrieve information from the log",即存储在日志结构化文件系统中的数据的索引方式

Although the term ''log-structured'' might suggest that sequential scans are required to retrieve information from the log, this is not the case in Sprite LFS. Our goal was to match or exceed the read performance of Unix FFS. To accomplish this goal, Sprite LFS outputs index structures in the log to permit random-access retrievals. The basic structures used by Sprite LFS are identical to those used in Unix FFS: for each file there exists a data structure called an inode, which contains the file's attributes (type, owner, permissions, etc.) plus the disk addresses of the first ten blocks of the file; for files larger than ten blocks, the inode also contains the disk addresses of one or more indirect blocks, each of which contains the addresses of more data or indirect blocks. Once a file's inode has been found, the number of disk I/Os required to read the file is identical in Sprite LFS and Unix FFS.

Sprite LFS 中并不是通过顺序扫描的方式来从日志中检索信息。作者的目标是要达到或者超过 Unix FFS 的读性能。为了达成这个目标,Sprite LFS 通过在日志中生成索引结构来支持随机访问检索。Sprite LFS 所使用的基本结构和 Unix FFS 所使用的完全相同:每个文件都有一个对应的 inode,inode 中包含了文件的属性(类型、所有者、权限等等)和文件的前 10 个 block 在磁盘中的地址;对于超过 10 个 block 的文件,inode 还会包含一个或多个 indirect block 的地址,每个 indirect block 又会包含更多的 data block 或者 indirect block 的地址。一旦文件的 inode 被找到,读取文件所需要的磁盘 IO 次数在 Sprite LFS 和 Unix FFS 里是相同的。

In Unix FFS each inode is at a fixed location on disk; given the identifying number for a file, a simple calculation yields the disk address of the file's inode. In contrast, Sprite LFS doesn't place inodes at fixed positions; they are written to the log. Sprite LFS uses a data structure called an inode map to maintain the current location of each inode. Given the identifying number for a file, the inode map must be indexed to determine the disk address of the inode. The inode map is divided into blocks that are written to the log; a fixed checkpoint region on each disk identifies the locations of all the inode map blocks. Fortunately, inode maps are compact enough to keep the active portions cached in main memory: inode map lookups rarely require disk accesses.

在 Unix FFS 中,每个 inode 在磁盘上的位置都是固定的;给出一个文件的识别号,通过一个简单的计算就可以生成文件的 inode 在磁盘中的地址。与之相对的,Sprite LFS 没有把 inode 放在固定的位置上,而是把 inode 写入 log(// 评注:inode 相当于数据的元数据)。Sprite LFS 通过一个叫做 inode map 的数据结构来维护每个 inode 当前的位置。给出一个文件的识别号,必须先通过索引 inode map 来确定 inode 在磁盘中的地址;inode map 被拆分写入 log 的 block 中;每个磁盘上固定的 checkpoint 区域又标识了写有 inode map 的 block 的位置。幸运的是,inode map 非常紧凑,可以在内存中缓存活跃的部分,所以查找 inode map 很少需要访问磁盘。

// 评注:核心的索引关系是:checkpoint(磁盘上固定的区域) -> inode map(in the log,活跃部分缓存在内存中) -> inode(also in the log) -> file(在 file 过大情况下,inode 支持多级索引)
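下面用一段简化的 Python 代码示意这条索引链的查找顺序。代码中的结构和字段名(Inode、inode_map、log 等)都是笔者为演示而假设的,并不是 Sprite LFS 的真实数据结构:

```python
from collections import namedtuple

# 索引顺序:checkpoint(磁盘固定位置)-> inode map -> inode -> data block
Inode = namedtuple("Inode", "direct indirect")  # direct: 前 10 块的地址; indirect: 间接块地址

def read_file_block(log, inode_map, ino, block_no):
    """log 用字典模拟日志(磁盘地址 -> 内容);inode map 的活跃部分通常缓存在内存中。"""
    inode = log[inode_map[ino]]                 # 先经 inode map 找到 inode 在日志中的位置
    if block_no < 10:
        return log[inode.direct[block_no]]      # 小文件:inode 直接索引前 10 块
    indirect = log[inode.indirect]              # 大文件:再经 indirect block 间接索引
    return log[indirect[block_no - 10]]

# 用法示意
log = {100: "data-block-0", 200: Inode(direct=[100], indirect=None)}
inode_map = {1: 200}
assert read_file_block(log, inode_map, ino=1, block_no=0) == "data-block-0"
```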

Figure 1 shows the disk layouts that would occur in Sprite LFS and Unix FFS after creating two new files in different directories. Although the two layouts have the same logical structure, the log-structured file system produces a much more compact arrangement. As a result, the write performance of Sprite LFS is much better than Unix FFS, while its read performance is just as good.

图 1 显示了在不同目录中创建两个新文件后,Sprite LFS 和 Unix FFS 中对应的磁盘布局。尽管这两种布局具有相同的逻辑结构,但是日志结构化文件系统对应的排布方式更加紧凑。因此 Sprite LFS 能够在具有好的读性能的同时,拥有比 Unix FFS 好得多的写性能。

Figure 1 —— A comparison between Sprite LFS and Unix FFS

This example shows the modified disk blocks written by Sprite LFS and Unix FFS when creating two single-block files named dir1/file1 and dir2/file2. Each system must write new data blocks and inodes for file1 and file2, plus new data blocks and inodes for the containing directories. Unix FFS requires ten non-sequential writes for the new information (the inodes for the new files are each written twice to ease recovery from crashes), while Sprite LFS performs the operations in a single large write. The same number of disk accesses will be required to read the files in the two systems. Sprite LFS also writes out new inode map blocks to record the new inode locations.

这个示例显示了在创建两个名为 dir1/file1 和 dir2/file2 的单 block 文件时,Sprite LFS 和 Unix FFS 所修改的 block。每个系统都必须为 file1 和 file2 写新的 data block 和 inode,再加上用于保存目录的新的 data block 和 inode。Unix FFS 需要为新信息进行 10 次非顺序写入(新文件的 inode 都被写了两次以简化崩溃后恢复的过程),而 Sprite LFS 只需要进行一次大型写入。两个系统读取这些文件所需要的磁盘访问次数是相同的。Sprite LFS 还会写出新的 inode map block 来记录新的 inode 的位置。

3.2. Free space management: segments

// 评注:提出 segment 概念

The most difficult design issue for log-structured file systems is the management of free space. The goal is to maintain large free extents for writing new data. Initially all the free space is in a single extent on disk, but by the time the log reaches the end of the disk the free space will have been fragmented into many small extents corresponding to the files that were deleted or overwritten.

在日志结构化文件系统的设计中,最难的问题是空闲空间管理。空闲空间管理的目标是要保持大片的空闲空间用于写入新的数据。在一开始,磁盘上的空闲空间是一片单一的连续的区域,但是当日志到达磁盘的末尾时,由于文件的删除或者覆盖写,空闲空间已经被分割成了很多的小块(碎片化)。

From this point on, the file system has two choices: threading and copying. These are illustrated in Figure 2. The first alternative is to leave the live data in place and thread the log through the free extents. Unfortunately, threading will cause the free space to become severely fragmented, so that large contiguous writes won't be possible and a log-structured file system will be no faster than traditional file systems. The second alternative is to copy live data out of the log in order to leave large free extents for writing. For this paper we will assume that the live data is written back in a compacted form at the head of the log; it could also be moved to another log-structured file system to form a hierarchy of logs, or it could be moved to some totally different file system or archive. The disadvantage of copying is its cost, particularly for long-lived files; in the simplest case where the log works circularly across the disk and live data is copied back into the log, all of the long-lived files will have to be copied in every pass of the log across the disk.

从这儿开始,文件系统的设计有两个选择:threading 和 copying(图 2 对这两种选择进行了阐述)。threading 方法是在原地保留有效数据,让日志穿过空闲区段串联起来。不幸的是,threading 方法会导致空闲空间变得严重碎片化,以至于不能再进行大的连续写入,日志结构化文件系统的速度相比于传统的文件系统也不再有优势。copying 方法是把有效数据从 log 中拷贝出来,这样可以留出大片的空闲空间用于写入。在本文中,我们假设有效数据以压紧的方式写回日志的开头;有效数据也可以写到其他的日志结构化文件系统,组成层次化的日志结构,或者写到其他完全不同的文件系统或归档系统。copying 方法的缺点是它的成本,尤其是对于有效期长的文件;在最简单的情况下,日志在磁盘上循环前进,有效数据不断地被拷贝回日志,这样日志每在磁盘上扫过一轮,所有有效期长的文件都必须被复制一次。

Figure 2 —— Possible free space management solutions for log-structured file systems.

In a log-structured file system, free space for the log can be generated either by copying the old blocks or by threading the log around the old blocks. The left side of the figure shows the threaded log approach where the log skips over the active blocks and overwrites blocks of files that have been deleted or overwritten. Pointers between the blocks of the log are maintained so that the log can be followed during crash recovery. The right side of the figure shows the copying scheme where log space is generated by reading the section of disk after the end of the log and rewriting the active blocks of that section along with the new data into the newly generated space.

在日志结构化文件系统中,既可以通过拷贝旧的 block,也可以让日志绕过旧的 block 进行串联(threading),来为日志生成可用空间。图的左边展示了 threaded log 方法,日志跳过有效的 block,覆盖写那些文件已经被删除或者已经被覆盖的 block。这种方式要维护日志 block 之间的指针,以便崩溃恢复期间可以跟踪日志。图的右边展示了 copying 方法,读取日志末尾之后的磁盘 section,把 section 中有效的 block 和新数据一起重写到新生成的空间中。

Sprite LFS uses a combination of threading and copying. The disk is divided into large fixed-size extents called segments. Any given segment is always written sequentially from its beginning to its end, and all live data must be copied out of a segment before the segment can be rewritten. However, the log is threaded on a segment-by-segment basis; if the system can collect long-lived data together into segments, those segments can be skipped over so that the data doesn't have to be copied repeatedly. The segment size is chosen large enough that the transfer time to read or write a whole segment is much greater than the cost of a seek to the beginning of the segment. This allows whole-segment operations to run at nearly the full bandwidth of the disk, regardless of the order in which segments are accessed. Sprite LFS currently uses segment sizes of either 512 kilobytes or one megabyte.

Sprite LFS 结合了 threading 和 copying 方法。把磁盘划分成一个一个大的区段(区段的大小固定),称为 segment。每一个 segment 都是从头开始顺序写入,一直到结尾,并且在 segment 可以重新写入数据之前,必须要先把有效数据都从 segment 里拷贝出来。但是,日志是以 segment 为单位进行串联(threading)的;如果系统可以把有效期长的数据汇聚在一起,写入特定的 segment,那在 copying 的过程中,就可以跳过这些 segment,避免长期有效的数据被重复复制。(// 评注:下文引出 cost-benefit 算法的基础,或者说需要有 cost-benefit 算法的前提条件)segment 的大小被设置得足够大,这样读写整个 segment 花费的传输时间就比跳转到 segment 开头的寻道时间大得多。这样就几乎可以以磁盘的最大带宽对 segment 进行读写,而无论访问 segment 的顺序如何(跳转到 segment 开头的时间可以忽略不计)。Sprite LFS 当前使用的 segment 大小是 512KB 或者 1MB。
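可以用一个粗略的估算来体会"传输时间远大于寻道时间"这一设计点。下面代码中的寻道时间和带宽都是笔者假设的量级(接近论文年代的磁盘),不是论文给出的数值:

```python
# segment 越大,寻道开销被摊薄得越充分,有效带宽越接近磁盘全带宽
SEEK_MS = 17.0        # 一次寻道 + 旋转延迟(假设值,毫秒)
BANDWIDTH_MBPS = 2.0  # 顺序传输带宽(假设值,MB/s)

def effective_bandwidth(segment_kb):
    transfer_ms = segment_kb / 1024.0 / BANDWIDTH_MBPS * 1000.0
    return BANDWIDTH_MBPS * transfer_ms / (SEEK_MS + transfer_ms)

for size_kb in (4, 64, 512, 1024):
    print(f"segment = {size_kb:4d} KB, 有效带宽 ≈ {effective_bandwidth(size_kb):.2f} MB/s")
```

按这组假设值,4KB 的写入只能用到一成左右的带宽,而 512KB 或 1MB 的整段读写已经能用到九成以上,这也解释了 Sprite LFS 对 segment 大小的选择。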

3.3. Segment cleaning mechanism

// 评注:介绍了 segment 的回收过程

The process of copying live data out of a segment is called segment cleaning. In Sprite LFS it is a simple three-step process: read a number of segments into memory, identify the live data, and write the live data back to a smaller number of clean segments. After this operation is complete, the segments that were read are marked as clean, and they can be used for new data or for additional cleaning.

// 评注:回收 segment 的完整过程,除了提到的三个步骤外,还有重要的一步是更新 3.1 节提到的索引关系,更简洁的说法,就是还需要更新元数据

把有效数据拷贝出 segment 的过程称为 segment cleaning。在 Sprite LFS 里 segment cleaning 是简单的三个步骤:

  1. 把一些 segment 读到内存里;
  2. 识别出 segment 中的有效数据;
  3. 把有效数据写回更少数量的干净的 segment;

在操作完成之后,一开始读取数据的那些 segment 就被标记为干净的,可以用于写入新数据,或者用于接收后续 cleaning 过程写出的有效数据。
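下面是这个三步过程的一个极简 Python 示意(Segment 结构为笔者假设)。真实实现在第 3 步写回时,还必须同步更新 inode 等元数据,使其指向 block 的新位置:

```python
class Segment:
    def __init__(self, capacity=4):
        self.blocks, self.capacity = [], capacity

def clean(dirty_segments, is_live):
    blocks = [b for seg in dirty_segments for b in seg.blocks]  # 1. 读入内存
    live = [b for b in blocks if is_live(b)]                    # 2. 识别有效数据
    out, cur = [], Segment()                                    # 3. 写回更少的干净 segment
    for b in live:
        if len(cur.blocks) >= cur.capacity:
            out.append(cur)
            cur = Segment()
        cur.blocks.append(b)
    out.append(cur)
    for seg in dirty_segments:
        seg.blocks.clear()  # 原 segment 标记为干净,可写新数据或用于后续清理
    return out
```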

As part of segment cleaning it must be possible to identify which blocks of each segment are live, so that they can be written out again. It must also be possible to identify the file to which each block belongs and the position of the block within the file; this information is needed in order to update the file's inode to point to the new location of the block. Sprite LFS solves both of these problems by writing a segment summary block as part of each segment. The summary block identifies each piece of information that is written in the segment; for example, for each file data block the summary block contains the file number and block number for the block. Segments can contain multiple segment summary blocks when more than one log write is needed to fill the segment. (Partial-segment writes occur when the number of dirty blocks buffered in the file cache is insufficient to fill a segment.) Segment summary blocks impose little overhead during writing, and they are useful during crash recovery (see Section 4) as well as during cleaning.

// 评注:在回收 segment 的过程中,为了能够判断 block 的有效性和更新元数据,需要知道 segment 中 block 所对应的用户数据(和 3.1 节所描述索引关系正好相反),所用方法,即本段提到的 summary block,本质上 summary block 是反向的元数据

作为 segment cleaning 过程的一部分,必须要能识别每个 segment 中哪些 block 是有效的,这样它们(// 评注:有效的 block)才能被重新写出。也必须要能够识别每个 block 属于哪个文件,以及它们在文件中的位置;这是为了能够更新文件的 inode 以指向新的 block 的位置(// 评注:即更新元数据)。Sprite LFS 解决这两个问题的方式是:在每个 segment 中写入一个 segment summary block(作为 segment 的一部分)。summary block 标识了写入 segment 的每条信息;比如,对于每个文件数据块,summary block 记录了文件号和 block 的块号。当需要多次日志写入才能填满 segment 时(当 file cache 中缓存的 dirty block 不足以填满整个 segment,就会出现只写了 segment 的一部分的情况),segment 就会包含多个 segment summary block。写入 segment summary block 的开销很小,但是 segment summary block 在 crash recovery(第 4 章)和 cleaning 的过程中却很有用。

Sprite LFS also uses the segment summary information to distinguish live blocks from those that have been overwritten or deleted. Once a block's identity is known, its liveness can be determined by checking the file's inode or indirect block to see if the appropriate block pointer still refers to this block. If it does, then the block is live; if it doesn't, then the block is dead. Sprite LFS optimizes this check slightly by keeping a version number in the inode map entry for each file; the version number is incremented whenever the file is deleted or truncated to length zero. The version number combined with the inode number form a unique identifier (uid) for the contents of the file. The segment summary block records this uid for each block in the segment; if the uid of a block does not match the uid currently stored in the inode map when the segment is cleaned, the block can be discarded immediately without examining the file's inode.

// 评注:本段解释的是在回收 segment 的过程中判断 block 有效性的方法,清楚一点说,判断 block 有效性的方法是先通过 summary block 反向找到 block 的用户数据标识,再根据用户数据标识正向地索引到用户数据在存储系统中的最新位置,如果最新位置即当前正在回收的 segment 中的这个 block,说明 block 仍然有效,反之,说明已经发生过覆盖写或者删除,block 无效

Sprite LFS 还利用 summary block 中的信息来判断 block 的有效性(区分有效的 block 和已经被覆盖写或者被删除的 block)。一旦知道了 block 的标识,通过检查文件的 inode 或者 indirect block,查看对应位置的块指针是否仍然指向这个 block,就可以判断 block 的有效性;如果仍然指向这个 block,那么 block 就是有效的;如果不再指向这个 block,那 block 就不再有效。Sprite LFS 通过在 inode map entry 中为每个文件保留一个版本号的方式,稍微优化了这种有效性检查;每当文件被删除或者长度被截断为零,版本号就会递增。版本号和 inode 号一起组成了文件内容的唯一标识符 uid。segment summary block 为 segment 中的每个 block 都记录了 uid;当 segment 被清理的时候,如果 block 的 uid 和当前存储在 inode map 中的 uid 不同,block 可以被立即丢弃而不需要检查文件的 inode。
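下面用一段简化的代码示意这种基于 uid(inode 号 + 版本号)的有效性检查。其中 SummaryEntry、ImapEntry 等结构和字段名都是笔者为演示而假设的:

```python
from collections import namedtuple

SummaryEntry = namedtuple("SummaryEntry", "ino version file_block_no disk_addr")
ImapEntry = namedtuple("ImapEntry", "version inode_addr")

def block_is_live(entry, inode_map, read_block_map):
    imap = inode_map.get(entry.ino)
    if imap is None or imap.version != entry.version:
        return False  # uid 不匹配:文件已被删除或截断为零,无需再读 inode
    # uid 匹配时,仍需检查 inode / indirect block 中对应位置的指针是否指向该 block
    block_map = read_block_map(imap.inode_addr)  # 返回 文件内块号 -> 磁盘地址 的映射
    return block_map.get(entry.file_block_no) == entry.disk_addr
```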

This approach to cleaning means that there is no free-block list or bitmap in Sprite. In addition to saving memory and disk space, the elimination of these data structures also simplifies crash recovery. If these data structures existed, additional code would be needed to log changes to the structures and restore consistency after crashes.

这种清理 segment 的方法意味着 Sprite LFS 中没有空闲 block 的链表或者位图。这样除了节省内存空间和磁盘空间,也能简化崩溃恢复的过程。因为如果存在链表/位图这些数据结构,就需要额外的代码来记录这些结构的变化,并在崩溃后重新恢复一致性。

3.4. Segment cleaning policies

// 评注:提出写入成本和计算公式

Given the basic mechanism described above, four policy issues must be addressed:

  1. When should the segment cleaner execute? Some possible choices are for it to run continuously in background at low priority, or only at night, or only when disk space is nearly exhausted.
  2. How many segments should it clean at a time? Segment cleaning offers an opportunity to reorganize data on disk; the more segments cleaned at once, the more opportunities to rearrange.
  3. Which segments should be cleaned? An obvious choice is the ones that are most fragmented, but this turns out not to be the best choice.
  4. How should the live blocks be grouped when they are written out? One possibility is to try to enhance the locality of future reads, for example by grouping files in the same directory together into a single output segment. Another possibility is to sort the blocks by the time they were last modified and group blocks of similar age together into new segments; we call this approach age sort.

// 评注:提出了四个方面的问题 —— GC 的启动时间、GC 的强度、GC 如何选择 segment(贪婪算法 or 后文提出的 cost-benefit 算法)、GC 搬移过程中如何组织数据(提出了年龄排序)

对于上述的 segment cleaning 机制,还有四个策略问题必须解决:

  1. segment cleaner 什么时候应该执行?可能的选择有以低优先级在后台连续地运行,或者只在夜间运行,或者只在磁盘空间几乎要用完的时候运行。
  2. 应该同时清理几个 segment?segment cleaning 提供了一个对磁盘上的数据进行重新组织的机会;同时清理的 segment 的个数更多,重新排列数据的机会也就越多。
  3. 应该清理哪些 segment?一个显而易见的选择是清理那些碎片化最严重的 segment(// 评注:即贪婪算法,每次选择垃圾最多的 segment 进行清理),但是事实证明这不是最好的选择。
  4. 当有效的 block 从 segment 中被写出的时候,该如何分组?一种可能的选择是,尝试增强重组之后读取数据的局部性,比如,把相同目录的文件写到一个 segment 中。另一种可能的选择是,按照最后被修改的时间对 block 进行排序,并且把年龄相近的 block 聚合到同一个新的 segment 中;我们把这种方式称为年龄排序 age sort(// 评注:对搬移的有效数据按照年龄排序是后文提出的 cost-benefit 算法能够有效的基础)。

In our work so far we have not methodically addressed the first two of the above policies. Sprite LFS starts cleaning segments when the number of clean segments drops below a threshold value (typically a few tens of segments). It cleans a few tens of segments at a time until the number of clean segments surpasses another threshold value (typically 50~100 clean segments). The overall performance of Sprite LFS does not seem to be very sensitive to the exact choice of the threshold values. In contrast, the third and fourth policy decisions are critically important: in our experience they are the primary factors that determine the performance of a log-structured file system. The remainder of Section 3 discusses our analysis of which segments to clean and how to group the live data.

到目前为止,我们还没有系统地解决上面提到的策略问题中的前两个(segment cleaner 什么时候执行以及同时清理几个 segment)。Sprite LFS 的做法是,当干净的 segment 的数量降到阈值之下(通常是几十个 segment),Sprite LFS 开始清理 segment。Sprite LFS 一次会清理几十个 segment,直到干净的 segment 的数量超过另一个阈值(通常是 50~100 个干净的 segment)。阈值具体的取值对于 Sprite LFS 总体的性能似乎没有什么影响(Sprite LFS 总体的性能对于阈值的变化不敏感)。相比之下,对上面提到的第三个和第四个策略问题(清理哪些 segment 以及如何对从 segment 写出的有效 block 进行分组)所做的选择就非常重要:根据我们的经验,对这两个策略问题所做的选择是日志结构化文件系统性能表现的主要影响因素。第 3 章的剩余部分讨论了我们对选择哪些 segment 进行清理以及如何组织有效数据所做的分析。

We use a term called write cost to compare cleaning policies. The write cost is the average amount of time the disk is busy per byte of new data written, including all the cleaning overheads. The write cost is expressed as a multiple of the time that would be required if there were no cleaning overhead and the data could be written at its full bandwidth with no seek time or rotational latency. A write cost of 1.0 is perfect: it would mean that new data could be written at the full disk bandwidth and there is no cleaning overhead. A write cost of 10 means that only one-tenth of the disk's maximum bandwidth is actually used for writing new data; the rest of the disk time is spent in seeks, rotational latency, or cleaning.

// 评注:提出 write cost 的概念,在忽略了寻道时间和旋转延迟之后,本质上 write cost 是针对 segment 清理过程带来的读写放大的量化方法

我们用一个叫做 write cost 写入成本的术语来比较各种清理策略。write cost 是磁盘为每个新写入字节保持忙碌的平均时间(包括所有用于清理无效数据的开销)。write cost 表示的是写入新数据实际花费的时间相对于最理想情况下所花费时间的倍数,最理想的情况即没有清理无效数据的开销,没有寻道时间和旋转延迟,以磁盘全带宽写入新数据。在最理想的情况下,write cost 是 1.0。write cost 为 10 表示只有十分之一的磁盘最大带宽被真正用于写入新数据;磁盘剩余的带宽/时间都被花费在磁盘寻道、旋转延迟和无效数据清理上。

For a log-structured file system with large segments, seek and rotational latency are negligible both for writing and for cleaning, so the write cost is the total number of bytes moved to and from the disk divided by the number of those bytes that represent new data. This cost is determined by the utilization (the fraction of data still live) in the segments that are cleaned. In the steady state, the cleaner must generate one clean segment for every segment of new data written. To do this, it reads N segments in their entirety and writes out N*u segments of live data (where u is the utilization of the segments and 0 ≤ u < 1). This creates N*(1-u) segments of contiguous free space for new data. Thus,

对于具有大 segment 的日志结构化文件系统,寻道时间和旋转延迟相对于写入新数据和清理无效数据的时间是可以忽略不计的,因此 write cost 是磁盘读写字节的总数除以其中属于新写入的字节的个数。这就意味着,写入成本由被清理的 segment 的利用率(仍然有效的数据部分)决定(// 评注:导向贪婪算法)。在稳定状态下(// 评注:指的是存储系统中数据占用的总容量保持不变),cleaner 要为每一个新写入的 segment 生成一个干净的 segment(// 评注:稳定状态下,新写入的数据的量和清理 segment 所释放出的空间大小相等)。为了做到这一点,cleaner 先读取 N 个 segment 的全部数据,再写出 N*u 的有效数据(其中 u 是 segment 的利用率,0 ≤ u < 1)。这为新数据创建了 N*(1-u) 个 segment 的连续空闲空间。因此,

$$\text{write cost} = \frac{\text{读写磁盘的总字节数}}{\text{新数据的字节数}} = \frac{N + Nu + N(1-u)}{N(1-u)} = \frac{2}{1-u} \qquad (1)$$

In the above formula we made the conservative assumption that a segment must be read in its entirety to recover the live blocks; in practice it may be faster to read just the live blocks, particularly if the utilization is very low (we haven't tried this in Sprite LFS). If a segment to be cleaned has no live blocks(u = 0) then it need not be read at all and the write cost is 1.0.

在上面的公式中,我们保守地假设必须要读取整个 segment 来恢复有效数据;实际上,如果只读取有效的 block 可能更快,尤其是当使用率非常低的时候(我们还没有在 Sprite LFS 中尝试过这一点)。如果被清理的 segment 没有有效数据(u = 0),那这个 segment 就根本不需要读取,写入成本 write cost 是 1.0(读写字节总数等于新写字节数)。// 评注:按照公式 (1) 的保守假设,就算完全没有有效数据,也要先读一遍整个 segment,write cost 最小是 2.0;而正文指出 u = 0 的 segment 可以跳过读取,此时 write cost 是 1.0
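按照公式 (1) 可以很容易地算出几组参考值(u = 0 的 segment 按正文所述可以跳过读取,write cost 记为 1.0):

```python
def write_cost(u):
    """公式 (1):保守假设整段读取时,write cost = 2 / (1 - u)。"""
    assert 0.0 <= u < 1.0
    return 1.0 if u == 0.0 else 2.0 / (1.0 - u)

for u in (0.0, 0.2, 0.5, 0.8, 0.9):
    print(f"u = {u:.1f} -> write cost = {write_cost(u):.1f}")
```

u = 0.5 时 write cost 是 4,u = 0.8 时是 10,正好对应下文"利用率必须小于 0.8 才能优于当前 FFS、小于 0.5 才能优于改进型 FFS"的结论。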

Figure 3 graphs the write cost as a function ofu. For reference, Unix FFS on small file workloads utilizes at most 5%~10% of the disk bandwidth, for a write cost of 10~20 (see[11] and Figure 8 in Section 5.1 for specific measurements). With logging, delayed writes, and disk request sorting this can probably be improved to about 25% of the bandwidth[24] or a write cost of 4. Figure 3 suggests that the segments cleaned must have a utilization of less than .8 in order for a log-structured file system to outperform the current Unix FFS; the utilization must be less than .5 to outperform an improved Unix FFS.

图 3 展示了 write cost 和 u 的函数关系。作为参考,Unix FFS 在小文件负载下最多利用磁盘带宽的 5%~10%,写入成本是 10~20(具体的测量值参见 [11] 和第 5.1 节中的图 8)。通过日志记录、延迟写入和磁盘请求排序,可能可以把磁盘带宽利用率提高到 25% 左右,把写入成本 write cost 降低到 4。图 3 表明被清理的 segment 的利用率必须小于 0.8,日志结构化文件系统(// 评注:图中陡峭上升的曲线)的性能才能优于当前的 Unix FFS(// 评注:长条虚线);如果要优于改进型的 Unix FFS(// 评注:短条虚线),被清理的 segment 的利用率必须小于 0.5。

Figure 3 —— Write cost as a function of u for small files

In a log-structured file system, the write cost depends strongly on the utilization of the segments that are cleaned. The more live data in segments cleaned the more disk bandwidth that is needed for cleaning and not available for writing new data. The figure also shows two reference points: "FFS today", which represents Unix FFS today, and "FFS improved", which is our estimate of the best performance possible in an improved Unix FFS. Write cost for Unix FFS is not sensitive to the amount of disk space in use.

在日志结构化文件系统中,写入成本在很大程度上取决于被清理的段的利用率。被清理的 segment 中的有效数据越多,就需要和写入新数据争抢更多的磁盘带宽用于清理。(// 评注:被清理的 segment 中的有效数据越多,清理过程导致的读写放大越严重)图中也展示了两个参考点:"FFS today" 表示当前的 Unix FFS 的性能,“FFS improved”表示我们估计的改进之后的 Unix FFS 可能达到的最佳性能。Unix FFS 的写入成本对于使用的磁盘空间量不敏感。

It is important to note that the utilization discussed above is not the overall fraction of the disk containing live data; it is just the fraction of live blocks in segments that are cleaned. Variations in file usage will cause some segments to be less utilized than others, and the cleaner can choose the least utilized segments to clean; these will have lower utilization than the overall average for the disk.

注意,上面讨论的利用率,并不是磁盘包含有效数据的总比率;只是被清理的 segment 中有效 block 的比例。文件使用的变化会导致部分 segment 的利用率比其他的 segment 低,cleaner 可以选择利用率最低的 segment 来清理;这些 segment 的利用率会比磁盘总体平均利用率更低。// 评注:因为不是所有 segment 的利用率完全相等,才有可能做到低于按磁盘总体使用率计算出的写入成本(类比于政治,整个存储系统中的 segment 不是铁板一块,对不同的个体采取不同的手段,拉拢、分化、斗争(保留、重组、回收))

Even so, the performance of a log-structured file system can be improved by reducing the overall utilization of the disk space. With less of the disk in use, the segments that are cleaned will have fewer live blocks, resulting in a lower write cost. Log-structured file systems provide a cost-performance tradeoff: if disk space is underutilized, higher performance can be achieved but at a high cost per usable byte; if disk capacity utilization is increased, storage costs are reduced but so is performance. Such a tradeoff between performance and space utilization is not unique to log-structured file systems. For example, Unix FFS only allows 90% of the disk space to be occupied by files. The remaining 10% is kept free to allow the space allocation algorithm to operate efficiently.

即使是这样,也还是可以通过降低磁盘空间的总体利用率来提高日志结构化文件系统的性能。随着磁盘使用率的降低,被清理的 segment 包含的有效 block 将会更少,也就会让 write cost 更低。日志结构化文件系统提供了一种在成本和性能之间的权衡:如果磁盘空间利用率低,可以实现更高的性能(// 评注:write cost 小),但是每个可用字节的成本也会更高(// 评注:同样大小的磁盘空间只写入了少量的数据,或者同样大小的数据要用更大的磁盘空间来存储);如果磁盘利用率上升,存储成本会降低,但是性能也会跟着降低(// 评注:按照 write cost 的计算公式,write cost 会变大)。这种在性能和空间利用率之间的权衡并不是日志结构化文件系统独有的。比如,Unix FFS 只允许文件占用 90% 的磁盘空间。剩下的 10% 保持空闲,让空间分配算法的运行更高效。

The key to achieving high performance at low cost in a log-structured file system is to force the disk into a bimodal segment distribution where most of the segments are nearly full, a few are empty or nearly empty, and the cleaner can almost always work with the empty segments. This allows a high overall disk capacity utilization yet provides a low write cost. The following section describes how we achieve such a bimodal distribution in Sprite LFS.

日志结构化文件系统以低成本实现高性能的关键,是让磁盘达到两极化的 segment 分布,即大部分的 segment 几乎全满(都是有效数据),少数 segment 全空或者几乎全空,cleaner 几乎总是可以处理那些(几乎)没有有效数据的空 segment。这样,就可以实现磁盘空间的高利用率,但同时又有低的写入成本。接下来的章节,描述的就是我们怎么样在 Sprite LFS 中实现这种 segment 的两极化分布。

3.5. Simulation results

// 评注:提出考虑冷热的 CB 算法

We built a simple file system simulator so that we could analyze different cleaning policies under controlled conditions. The simulator's model does not reflect actual file system usage patterns (its model is much harsher than reality), but it helped us to understand the effects of random access patterns and locality, both of which can be exploited to reduce the cost of cleaning. The simulator models a file system as a fixed number of 4-kbyte files, with the number chosen to produce a particular overall disk capacity utilization. At each step, the simulator overwrites one of the files with new data, using one of two pseudo-random access patterns:

  • Uniform: Each file has equal likelihood of being selected in each step.
  • Hot-and-cold: Files are divided into two groups. One group contains 10% of the files; it is called hot because its files are selected 90% of the time. The other group is called cold; it contains 90% of the files but they are selected only 10% of the time. Within groups each file is equally likely to be selected. This access pattern models a simple form of locality.

我们构建了一个简单的文件系统模拟器,以便在受控条件下分析不同的清理策略。模拟器的模型并不反映实际文件系统的使用模式(它的模型比实际情况要严苛很多),但是,模拟器能够帮助我们理解随机访问模式和局部性带来的影响,而这两点都可以用来降低垃圾回收(清理无效数据、释放空闲空间)的成本。模拟器把文件系统建模为固定个数的 4KB 文件,通过选择文件的具体数量可以构造特定的磁盘总体使用率。在每个步骤中,模拟器都会使用两种伪随机访问模式中的一种,从所有 4KB 文件中选出一个,用新数据覆盖这个文件。// 评注:要注意的是,segment 的大小是大于 4KB 的,也就是说,每次覆盖写的是 segment 的一部分

  • 均匀访问模式:在每个步骤中,每个文件被选中的可能性相等。
  • 冷热访问模式:文件被分为两组。一个组包含 10% 的文件;作为热组,在 90% 的时间内会选中热组中的文件。另一组作为冷组;包含 90% 的文件,但只有 10% 的时间会选择冷组中的文件。在冷/热组的内部,每个文件被选中的可能性相同。这种访问模式对简单形式的局部性进行了建模。
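下面是这种冷热访问模式的一个极简生成器示意(参数名为笔者假设),每一步返回被覆盖写的文件编号:

```python
import random

def pick_file(num_files, hot_fraction=0.1, hot_traffic=0.9):
    """10% 的文件(热组)承接 90% 的覆盖写;组内等概率。"""
    hot_count = int(num_files * hot_fraction)
    if random.random() < hot_traffic:
        return random.randrange(hot_count)                      # 热组
    return hot_count + random.randrange(num_files - hot_count)  # 冷组
```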

In this approach the overall disk capacity utilization is constant and no read traffic is modeled. The simulator runs until all clean segments are exhausted, then simulates the actions of a cleaner until a threshold number of clean segments is available again. In each run the simulator was allowed to run until the write cost stabilized and all cold-start variance had been removed.

在这种方法里,磁盘的总体使用率是恒定的(// 评注:即上文提到的"构造特定的磁盘总体使用率"),并且没有对读流量进行建模。模拟器先持续写入数据直到干净的 segment 被用完,然后再模拟 cleaner 的工作清理无效数据(// 评注:整个测试过程是,一开始关闭 GC,一直做 4K 随机写直到耗尽所有可写的 segment,再启动 GC),直到干净的 segment 的数量再次达到一个阈值(// 评注:上文提到了通常是 50~100 个干净的 segment)。在每一次运行中,允许模拟器一直运行直到 write cost 稳定下来并且所有冷启动差异都已经被消除。// 评注:Sprite LFS 使用 segment 的大小是 512KB 或者 1MB,但是在仿真中,每次都只覆盖写一个 4KB 大小的文件,当干净的 segment 被写完准备启动回收时,每个 segment 的使用率不同,对应的 write cost 也不同

Figure 4 superimposes the results from two sets of simulations onto the curves of Figure 3. In the "LFS uniform" simulations the uniform access pattern was used. The cleaner used a simple greedy policy where it always chose the least-utilized segments to clean. When writing out live data the cleaner did not attempt to re-organize the data: live blocks were written out in the same order that they appeared in the segments being cleaned (for a uniform access pattern there is no reason to expect any improvement from re-organization).

图 4 把两组模拟结果叠加到图 3 的曲线上。粗实线 "LFS uniform" 对应的是使用均匀访问模式的模拟结果。cleaner 使用简单的贪婪策略,总是选择利用率最低的 segment 进行清理。当写出有效数据时,cleaner 没有尝试重新组织数据:有效的 block 保持它们在被清理的 segment 中的顺序,写到新的 segment 中(对于均匀访问模式,没有任何理由期望重新组织数据会有任何改进)。

Figure 4 —— Initial simulation results.

The curves labeled ''FFS today'' and ''FFS improved'' are reproduced from Figure 3 for comparison. The curve labeled ''No variance'' shows the write cost that would occur if all segments always had exactly the same utilization. The ''LFS uniform'' curve represents a log-structured file system with uniform access pattern and a greedy cleaning policy: the cleaner chooses the least-utilized segments. The ''LFS hot-and-cold'' curve represents a log-structured file system with locality of file access. It uses a greedy cleaning policy and the cleaner also sorts the live data by age before writing it out again. The x-axis is overall disk capacity utilization, which is not necessarily the same as the utilization of the segments being cleaned.

// 评注:实验变量有均匀访问/冷热访问、对写出的有效数据按照年龄排序/不按照年龄排序,一共三条曲线(图 7 中还会有使用 cost-benefit 算法的第四种场景):

  1. “No variance”细实线,所有的 segment 都具有相同的使用率的理想情况,作为参照
  2. “LFS uniform”粗实线,均匀访问 + 贪婪算法
  3. “LFS hot-and-cold”点划线,冷热访问 + 贪婪算法 + 写出的有效数据按年龄排序

标有“FFS today”和“FFS improved”的横线是从图 3 中复制的,用来进行比较。标有“No variance”的曲线表示的是所有的 segment 都具有相同的使用率时的写入成本(// 评注:此时磁盘总体使用率即 segment 的使用率,存储系统整体的写入成本可以直接用公式 write cost = 2 / (1 - u) 进行计算,曲线和图 3 中的“Log-structured”曲线完全一致,这是一种理想情况,但是,因为没有使用率比磁盘总体使用率更低的 segment,GC 只能实实在在按照磁盘总体使用率计算得到的写入成本回收 segment,所以对 GC 来说也是最不好的情况)。“LFS uniform”曲线表示的是具有均匀访问模式和贪婪清理策略(cleaner 选择具有最低使用率的 segment 进行清理)的日志结构化文件系统的写入成本。“LFS hot-and-cold”曲线表示的是具有冷热访问模式(文件局部性访问)和贪婪清理策略的日志结构化文件系统的写入成本。“LFS hot-and-cold”曲线所对应的 cleaner 还会在写出有效数据之前先按照 age 对有效数据进行排序。X 轴是磁盘的总体使用率,和被清理的 segment 的使用率不一定相同。// 评注:贪婪算法每次都选择使用率最低的 segment 进行回收,所清理的具体的 segment 的使用率肯定小于等于磁盘空间的总体使用率,按照写入成本的计算公式,使用率越低,写入成本越小,和图中的“LFS uniform”曲线、“LFS hot-and-cold”曲线在 y 轴上低于“No variance”曲线相符合

Even with uniform random access patterns, the variance in segment utilization allows a substantially lower write cost than would be predicted from the overall disk capacity utilization and formula (1). For example, at 75% overall disk capacity utilization, the segments cleaned have an average utilization of only 55%. At overall disk capacity utilizations under 20% the write cost drops below 2.0; this means that some of the cleaned segments have no live blocks at all and hence don't need to be read in.

即使是均匀的随机访问模式(// 评注:图 4“LFS uniform”曲线),因为 segment 利用率的差异,也可能达到远低于通过磁盘总体利用率和公式 (1) 预测出来的 write cost(// 评注:图 4“No variance”曲线;实际上就是"均匀的随机访问模式"没有那么随机和均匀)。比如,磁盘容量的总体利用率是 75%(// 评注:按照 write cost = 2 / (1 - u) 的计算结果为 8),但是清理的 segment 的平均利用率仅有 55%(// 评注:按照 write cost = 2 / (1 - u) 进行计算,对应的 write cost 为 4.44,和图 4“LFS uniform”曲线对应)。当磁盘容量的总体利用率低于 20%,write cost 将降到 2.0 以下;这意味着一些被清理的 segment 根本没有有效的 block,也因此不需要读取 segment 的数据。// 评注:公式 (1) write cost = 2 / (1 - u) 计算出的 write cost 大于等于 2,一旦存储系统总体的 write cost 小于 2,就说明有部分 segment 的使用率 u = 0,也即 segment 中没有有效数据,对应的 write cost 是 1;另一方面,在工程实现中,当磁盘总体利用率低于 20%,就可以不启动 GC

The "LFS hot-and-cold" curve shows the write cost when there is locality in the access patterns, as described above. The cleaning policy for this curve was the same as for "LFS uniform" except that the live blocks were sorted by age before writing them out again. This means that long-lived (cold) data tends to be segregated in different segments from short-lived (hot) data; we thought that this approach would lead to the desired bimodal distribution of segment utilizations.

如上所述,“LFS hot-and-cold”曲线展示了在访问模式存在局部性情况下的写入成本。除了在把有效的 block 写出之前按照 age 进行排序,“LFS hot-and-cold”曲线对应的清理策略和“LFS uniform”曲线对应的清理策略是一样的。而通过在写出有效数据的过程中按照 age 对 block 进行排序,可以让有效期长的数据和有效期短的数据分在不同的 segment 里(// 评注:数据块的 age 大即长时间未被覆盖写,即有效期更长,在将来被改动的可能性也更小,数据块的 age 小则反之);我们认为这种方式可以实现所期望的 segment 利用率的两极化分布。// 评注:即 3.4 节结尾提到的"日志结构化文件系统以低成本实现高性能的关键,是让磁盘达到两极化的 segment 分布……这样,就可以实现磁盘空间的高利用率,但同时又有低的写入成本。"

Figure 4 shows the surprising result that locality and "better" grouping result in worse performance than a system with no locality! We tried varying the degree of locality (e.g. 95% of accesses to 5% of data) and found that performance got worse and worse as the locality increased. Figure 5 shows the reason for this non-intuitive result. Under the greedy policy, a segment doesn't get cleaned until it becomes the least utilized of all segments. Thus every segment's utilization eventually drops to the cleaning threshold, including the cold segments. Unfortunately, the utilization drops very slowly in cold segments, so these segments tend to linger just above the cleaning point for a very long time. Figure 5 shows that many more segments are clustered around the cleaning point in the simulations with locality than in the simulations without locality. The overall result is that cold segments tend to tie up large numbers of free blocks for long periods of time.

图 4 展示了一个令人惊讶的结果:相对于没有局部性的系统,局部性和"更好的"分组反而导致了更差的性能表现!(// 评注:图 4 点划线“LFS hot-and-cold”比粗实线“LFS uniform”更高)我们尝试改变局部性的程度(比如,针对 5% 的数据进行 95% 的访问),发现随着局部性的提升,系统的性能表现越来越差。图 5 展示了这种反直觉的结果的原因。在贪婪算法下,一个 segment 只有在变成所有 segment 中利用率最低的那一个之后才会被清理。因此,每个 segment 的利用率最终都会降到清理点,包括冷的 segment(// 评注:在仿真过程中,开启 GC 之后仍会继续覆盖写,这样有效期长的 segment 中会有越来越多的数据块失效,最终利用率也会降到清理点)。不幸的是,冷的 segment 的利用率下降得非常慢,以至于这些 segment 往往在清理点以上停留很长时间。图 5 展示了和没有局部性相比,在有局部性的仿真中,更多的 segment 聚集在清理点周围(// 评注:即更多 segment 的使用率在清理点左右)。总体的结果是,冷的 segment 会长时间捆绑大量的空闲 block。

Figure 5 —— Segment utilization distributions with greedy cleaner.

These figures show distributions of segment utilizations of the disk during the simulation. The distribution is computed by measuring the utilizations of all segments on the disk at the points during the simulation when segment cleaning was initiated. The distribution shows the utilizations of the segments available to the cleaning algorithm. Each of the distributions corresponds to an overall disk capacity utilization of 75%. The ''Uniform'' curve corresponds to ''LFS uniform'' in Figure 4 and ''Hot-and-cold'' corresponds to ''LFS hot-and-cold'' in Figure 4. Locality causes the distribution to be more skewed towards the utilization at which cleaning occurs; as a result, segments are cleaned at a higher average utilization.

这些曲线展示了在仿真过程中磁盘上 segment 利用率的分布。分布是在仿真过程中、segment 清理被启动的时间点上,测量磁盘上所有 segment 的利用率计算得到的(// 评注:更准确地说,图 5 采样的是完成预热之后的稳态阶段)。分布曲线显示了可供清理算法选择的 segment 的利用率。每条分布曲线都对应 75% 的磁盘总体使用率。“Uniform”曲线(// 评注:粗实线)对应图 4 中的“LFS uniform”,“Hot-and-cold”曲线(// 评注:细实线)对应图 4 中的“LFS hot-and-cold”。局部性导致分布更加偏向清理发生时的利用率(// 评注:图中两条曲线的"尖峰"即 segment 的清理点/回收点,“Hot-and-cold”曲线的"尖峰"比“Uniform”曲线的更"尖"更"高");因此,segment 以更高的平均利用率被清理(// 评注:从横轴看,往左,利用率低,有效数据少,垃圾多,写入成本(回收成本)低,而“Hot-and-cold”曲线的"尖峰"比“Uniform”曲线的更靠右)。

// 评注:贪婪算法下,被选中回收的 segment 的利用率最低也有 0.5 多,也就是说,在贪婪算法下,当 segment 的利用率降到 0.5 多即成为利用率最低的那一个,直接被选中回收,没有机会把利用率降到更低

After studying these figures we realized that hot and cold segments must be treated differently by the cleaner. Free space in a cold segment is more valuable than free space in a hot segment because once a cold segment has been cleaned it will take a long time before it re-accumulates the unusable free space. Said another way, once the system reclaims the free blocks from a segment with cold data it will get to "keep" them a long time before the cold data becomes fragmented and "takes them back again." In contrast, it is less beneficial to clean a hot segment because the data will likely die quickly and the free space will rapidly re-accumulate; the system might as well delay the cleaning a while and let more of the blocks die in the current segment. The value of a segment's free space is based on the stability of the data in the segment. Unfortunately, the stability cannot be predicted without knowing future access patterns. Using an assumption that the older the data in a segment the longer it is likely to remain unchanged, the stability can be estimated by the age of data.

在研究了这些数字之后,我们意识到 cleaner 必须对冷和热的 segment 进行不同的处理。冷的 segment 中的空闲空间比热的 segment 中的空闲空间更有价值,因为一旦冷的 segment 被清理,会需要很长的时间才能再累积起那么多不可用的空闲空间。换句话说,一旦系统从包含冷数据的 segment 中释放出空闲的 block,在冷数据重新变得碎片化、重新捆绑住空闲的 block 之前,系统可以更长时间地"持有"这些 block。相反的,清理一个热的 segment 收益更少,因为其中的数据可能会很快失效,空闲空间会迅速重新累积;系统不如延迟一段时间再清理,让当前的 segment 中更多的 block 失效。segment 中的空闲空间的价值基于 segment 中数据的稳定性(// 评注:因为发生了删除或者覆盖写,copy 过程中写出的有效 block 会再次失效,写入这些有效 block 的 segment 也会同样变得碎片化,这些碎片化的 segment 捆绑住无效的已经处于空闲状态的 block 直到 cleaner 清理这些 segment,而 block 中数据的冷热程度决定了 block 失效的快慢,也就决定了 copy 这些 block 所付出的成本有意义的时间长度)。不幸的是,在不知道未来的访问模式的情况下,没有办法对稳定性进行预测。但是如果假设 segment 中的数据越老,就越可能保持不变,那么就可以通过数据的年龄来估计稳定性。

To test this theory we simulated a new policy for selecting segments to clean. The policy rates each segment according to the benefit of cleaning the segment and the cost of cleaning the segment and chooses the segments with the highest ratio of benefit to cost. The benefit has two components: the amount of free space that will be reclaimed and the amount of time the space is likely to stay free. The amount of free space is just 1 − 𝑢, where 𝑢 is the utilization of the segment. We used the most recent modified time of any block in the segment (i.e., the age of the youngest block) as an estimate of how long the space is likely to stay free. The benefit of cleaning is the space-time product formed by multiplying these two components. The cost of cleaning the segment is 1 + 𝑢 (one unit of cost to read the segment, 𝑢 to write back the live data). Combining all these factors, we get:

为了验证这个理论,我们对一种选择要清理的 segment 的新策略进行了仿真。这种策略根据清理 segment 的收益和成本对每一个 segment 进行评估,选择收益成本比最高的 segment 进行清理。清理 segment 的收益由两个部分组成:可以被释放的空闲空间的大小和空间可能保持空闲的时间长度。空闲空间的大小是 1 - u,u 是 segment 的使用率。我们使用 segment 中所有 block 最近被修改的时间(即最年轻的 block 的 age)作为空间可能保持空闲的时间长度的估计。清理的收益就是这两者相乘得到的"时空乘积"。清理 segment 的成本是 1 + u(读取整个 segment 花费 1 个单位的成本,写回其中的有效数据再花费 u)。结合所有这些因素,我们得到了:

$$\frac{\text{benefit}}{\text{cost}} = \frac{\text{释放的空闲空间} \times \text{数据的年龄}}{\text{成本}} = \frac{(1-u) \times \text{age}}{1+u}$$

We call this policy the cost-benefit policy; it allows cold segments to be cleaned at a much higher utilization than hot segments.

我们把这种选择策略称为 cost-benefit 算法;这种策略允许冷的 segment 在比热的 segment 高得多的使用率下就被清理。// 评注:贪婪算法会优先选择使用率低的 segment 进行回收,但是被选择的 segment 可能热度很高;cost-benefit 算法则让使用率较高但足够冷的 segment 有机会先于使用率较低但很热的 segment 被回收
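下面是 cost-benefit 选段策略的一个极简示意(Seg 结构为笔者假设),每轮选出 benefit/cost 比值最高的若干 segment 进行清理:

```python
from collections import namedtuple

Seg = namedtuple("Seg", "live_bytes size youngest_mtime")

def choose_victims(segments, now, count):
    def benefit_cost(seg):
        u = seg.live_bytes / seg.size       # segment 利用率
        age = now - seg.youngest_mtime      # 最年轻 block 的年龄,估计空间能保持空闲多久
        return (1.0 - u) * age / (1.0 + u)  # benefit / cost
    return sorted(segments, key=benefit_cost, reverse=True)[:count]
```

同样利用率下,越冷(age 越大)的 segment 比值越高;这正是"冷 segment 在更高的利用率下就被回收"的来源。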

We re-ran the simulations under the hot-and-cold access pattern with the cost-benefit policy and age-sorting on the live data. As can be seen from Figure 6, the cost-benefit policy produced the bimodal distribution of segments that we had hoped for. The cleaning policy cleans cold segments at about 75% utilization but waits until hot segments reach a utilization of about 15% before cleaning them. Since 90% of the writes are to hot files, most of the segments cleaned are hot. Figure 7 shows that the cost-benefit policy reduces the write cost by as much as 50% over the greedy policy, and a log-structured file system out-performs the best possible Unix FFS even at relatively high disk capacity utilizations. We simulated a number of other degrees and kinds of locality and found that the cost-benefit policy gets even better as locality increases.

我们在冷热访问模式下,基于 cost-benefit 算法和在写出有效数据的过程中对数据按照年龄排序,重新进行了仿真。正如在图 6 中所能看到的那样,cost-benefit 策略产生了我们所期望的 segment 的两极化分布。这种清理策略会在冷 segment 的使用率还有大约 75% 的时候就清理它们,但是对于热的 segment,会一直等到它们的使用率下降到大约 15% 才清理。因为 90% 的写入都是针对热文件,所以被清理的大部分 segment 都是热的。图 7 展示了 cost-benefit 算法相对于贪婪算法能减少多达 50% 的写入成本,并且即使在磁盘使用率相对较高的情况下,日志结构化文件系统也比可能做到的最好的 Unix FFS 表现更加出色。我们对许多其他程度和类型的局部性进行了仿真,发现随着局部性的增加,cost-benefit 的表现会变得更好。

Figure 6 —— Segment utilization distribution with cost-benefit policy.

This figure shows the distribution of segment utilizations from the simulation of a hot-and-cold access pattern with 75% overall disk capacity utilization. The "LFS Cost-Benefit" curve shows the segment distribution occurring when the cost-benefit policy is used to select segments to clean and live blocks grouped by age before being re-written. Because of this bimodal segment distribution, most of the segments cleaned had utilizations around 15%. For comparison, the distribution produced by the greedy method selection policy is shown by the "LFS Greedy" curve reproduced from Figure 5.

// 评注:同图 5,绘制曲线的采样同样是在完成了预热之后的稳态阶段

// 评注:同图 5, 从横轴看,往左,利用率低,有效数据少,垃圾多,写入成本(回收成本)低

// 评注:双峰曲线的两个尖峰是 segment 的回收点,左侧尖峰为热数据区(segment 使用率低,(1 - u) / (1 + u) 的值大,能够被选中回收),右侧尖峰为冷数据区((1 - u) / (1 + u) 值小,能被选中回收是因为 age 大,为了能够腾出空闲空间,即使是冷数据段,即使回收成本高,也不得不回收)

这张图显示了在冷热访问模式、磁盘总体利用率为 75% 的仿真中 segment 利用率的分布。“LFS Cost-Benefit”曲线显示了,当使用 cost-benefit 策略选择要清理的 segment、并且有效 block 在重写之前按照 age 进行分组时,segment 利用率的分布。由于这种两极化的 segment 分布,大部分被清理的 segment 的利用率大约为 15%。为了进行比较,使用贪婪算法选择策略产生的分布由“LFS Greedy”曲线表示(复制自图 5)。

Figure 7 —— Write cost, including cost-benefit policy.

This graph compares the write cost of the greedy policy with that of the cost-benefit policy for the hot-and-cold access pattern. The cost-benefit policy is substantially better than the greedy policy, particularly for disk capacity utilizations above 60%.

这张图比较了冷热访问模式下贪婪算法和 cost-benefit 算法的写入成本。cost-benefit 算法比贪婪算法要好得多,尤其是磁盘使用率高于 60% 的时候。

The simulation experiments convinced us to implement the cost-benefit approach in Sprite LFS. As will be seen in Section 5.2, the behavior of actual file systems in Sprite LFS is even better than predicted in Figure 7.

仿真实验说服了我们在 Sprite LFS 中实现 cost-benefit 方法。如 5.2 节所示,Sprite LFS 中实际文件系统的表现,甚至比图 7 所预测的还要好。

3.6. Segment usage table

// 评注:描述了支持 cost-benefit 算法的数据结构:segment usage table 段使用表

In order to support the cost-benefit cleaning policy, Sprite LFS maintains a data structure called the segment usage table. For each segment, the table records the number of live bytes in the segment and the most recent modified time of any block in the segment. These two values are used by the segment cleaner when choosing segments to clean. The values are initially set when the segment is written, and the count of live bytes is decremented when files are deleted or blocks are overwritten. If the count falls to zero then the segment can be reused without cleaning. The blocks of the segment usage table are written to the log, and the addresses of the blocks are stored in the checkpoint regions (see Section 4 for details).

为了支持 cost-benefit 清理策略,Sprite LFS 维护了一个称为 segment usage table 段使用表的数据结构。对于每一个 segment,这个表记录了 segment 中的有效字节数,和 segment 中所有 block 最近被修改的时间。segment cleaner 在选择要清理的 segment 的时候会用到这两个值。这两个值在 segment 被写入时初始化,当文件被删除或者 block 被覆盖写时,有效字节数会递减。当有效字节数下降到 0,segment 可以直接重用而不需要清理。存储 segment usage table 的 block 被写入日志中,这些 block 的地址又被存储在 checkpoint 区域中(详见第 4 章)。
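segment usage table 可以想象成下面这样的一个结构(字段与接口均为笔者假设的示意):

```python
class SegmentUsageTable:
    def __init__(self, num_segments):
        self.live_bytes = [0] * num_segments    # 每个 segment 的有效字节数
        self.last_write = [0.0] * num_segments  # 每个 segment 最近一次写数据的时间

    def on_write(self, seg, nbytes, now):
        self.live_bytes[seg] += nbytes
        self.last_write[seg] = now

    def on_invalidate(self, seg, nbytes):
        self.live_bytes[seg] -= nbytes          # 文件删除 / block 被覆盖写时递减

    def reusable(self, seg):
        return self.live_bytes[seg] == 0        # 降到 0 的 segment 可直接复用,无需清理
```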

In order to sort live blocks by age, the segment summary information records the age of the youngest block written to the segment. At present Sprite LFS does not keep modified times for each block in a file; it keeps a single modified time for the entire file. This estimate will be incorrect for files that are not modified in their entirety. We plan to modify the segment summary information to include modified times for each block.

为了按年龄对有效 block 进行排序,segment 的摘要信息记录了写入 segment 的最年轻的 block 的年龄。目前 Sprite LFS 没有为文件的每一个 block 都保留修改时间,而是只为整个文件保留了一个修改时间。对于没有被整体修改的文件,这个估计是不准确的。我们计划修改 segment 的摘要信息,为每一个 block 记录修改时间。

论文第 4 章 Crash recovery

When a system crash occurs, the last few operations performed on the disk may have left it in an inconsistent state (for example, a new file may have been written without writing the directory containing the file); during reboot the operating system must review these operations in order to correct any inconsistencies. In traditional Unix file systems without logs, the system cannot determine where the last changes were made, so it must scan all of the metadata structures on disk to restore consistency. The cost of these scans is already high (tens of minutes in typical configurations), and it is getting higher as storage systems expand.

当系统崩溃时,对磁盘执行的最后几个操作可能会导致磁盘处于不一致的状态(比如,已经写入了一个新文件,但是还没有写入包含这个文件的目录);在重启操作系统期间,为了纠正可能的不一致性,必须检查崩溃时的这些操作。在传统的没有日志文件的 Unix 文件系统中,系统无法确定最后几次更改的位置,所以必须扫描磁盘上所有的元数据结构来恢复一致性。这些扫描的成本已经很高了(在典型的配置中需要数十分钟),并且随着存储系统的扩展,成本会越来越高。

In a log-structured file system the locations of the last disk operations are easy to determine: they are at the end of the log. Thus it should be possible to recover very quickly after crashes. This benefit of logs is well known and has been used to advantage both in database systems[13] and in other file systems[2, 3, 14]. Like many other logging systems, Sprite LFS uses a two-pronged approach to recovery: checkpoints, which define consistent states of the file system, and roll-forward, which is used to recover information written since the last checkpoint.

在日志结构化文件系统中,最后几个磁盘操作的位置很容易就能确定:它们位于日志的末尾。因此,在崩溃后非常快地恢复应该是有可能做到的。使用日志的这个好处众所周知,并且已经在数据库[13]和其他文件系统中 [2, 3, 14] 被利用起来了。像许多其他日志系统一样,Sprite LFS 使用双管齐下的方法来恢复:checkpoints 检查点,定义了文件系统的一致性状态,以及 roll-forward 前滚,用于恢复最后一个 checkpoint 之后的信息。

4.1. Checkpoints

A checkpoint is a position in the log at which all of the file system structures are consistent and complete. Sprite LFS uses a two-phase process to create a checkpoint. First, it writes out all modified information to the log, including file data blocks, indirect blocks, inodes, and blocks of the inode map and segment usage table. Second, it writes a checkpoint region to a special fixed position on disk. The checkpoint region contains the addresses of all the blocks in the inode map and segment usage table, plus the current time and a pointer to the last segment written.

checkpoint 是日志中的一个位置,在这个位置,文件系统的所有结构都是一致且完整的。Sprite LFS 使用两阶段过程来创建 checkpoint。首先,把所有修改过的信息写到日志中,包括文件的数据块、indirect block、inode,以及 inode map 和 segment usage table 对应的 block。然后,把 checkpoint region 写到磁盘上一个特殊的固定位置。checkpoint region 包含 inode map 和 segment usage table 所使用的所有 block 的地址,再加上当前的时间和指向最后写入的 segment 的指针。

During reboot, Sprite LFS reads the checkpoint region and uses that information to initialize its main-memory data structures. In order to handle a crash during a checkpoint operation there are actually two checkpoint regions, and checkpoint operations alternate between them. The checkpoint time is in the last block of the checkpoint region, so if the checkpoint fails the time will not be updated. During reboot, the system reads both checkpoint regions and uses the one with the most recent time.

在重启期间,Sprite LFS 读取 checkpoint region,并且使用 checkpoint region 所包含的信息初始化内存数据结构。为了处理在对 checkpoint 进行操作期间的崩溃,实际上有两个 checkpoint region,并且针对 checkpoint 的操作在这两个 checkpoint region 之间交替进行。checkpoint 的时间是在 checkpoint region 的最后一个 block 中,所以,如果针对 checkpoint 的操作失败,checkpoint 的时间不会被更新。在重启期间,系统会读取两个 checkpoint region,使用其中具有最近时间的那一个。
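双 checkpoint region 交替写入的做法可以用下面的小例子示意(region 的布局纯属笔者假设,关键点是时间戳写在最后、写入中途崩溃时不会生效):

```python
import struct

def write_checkpoint(regions, which, body, now):
    # 时间戳放在 region 的最后;写入中途崩溃,时间戳就不会被更新
    regions[which] = body + struct.pack("<d", now)

def load_latest(regions):
    def ts(r):
        return struct.unpack("<d", r[-8:])[0] if len(r) >= 8 else float("-inf")
    return max(regions, key=ts)  # 重启时读取两个 region,用时间更近的那一个

regions = [b"", b""]
write_checkpoint(regions, 0, b"checkpoint-A", now=1.0)
write_checkpoint(regions, 1, b"checkpoint-B", now=2.0)
assert load_latest(regions).startswith(b"checkpoint-B")
```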

Sprite LFS performs checkpoints at periodic intervals as well as when the file system is unmounted or the system is shut down. A long interval between checkpoints reduces the overhead of writing the checkpoints but increases the time needed to roll forward during recovery; a short checkpoint interval improves recovery time but increases the cost of normal operation. Sprite LFS currently uses a checkpoint interval of thirty seconds, which is probably much too short. An alternative to periodic checkpointing is to perform checkpoints after a given amount of new data has been written to the log; this would set a limit on recovery time while reducing the checkpoint overhead when the file system is not operating at maximum throughput.

Sprite LFS 会定期地生成 checkpoint,也会在卸载文件系统或者关闭系统时生成 checkpoint。checkpoint 之间的间隔长,可以减少生成 checkpoint 的开销,但是会增加恢复时前滚所需要的时间;checkpoint 间隔短可以缩短恢复时间,但是会增加正常运行的开销。Sprite LFS 当前使用的 checkpoint 间隔是 30 秒,这可能太短了。定期生成 checkpoint 的一种替代方案是,在日志中写入了一定数量的新数据后再创建 checkpoint;这样既能给恢复时间设置上限,又能在文件系统没有以最大吞吐量运行时减少生成 checkpoint 的开销。

4.2. Roll-forward

// 评注:讨论了前滚恢复过程,包括如何扫描日志段来恢复最后一次检查点之后写入的数据。

In principle it would be safe to restart after crashes by simply reading the latest checkpoint region and discarding any data in the log after that checkpoint. This would result in instantaneous recovery but any data written since the last checkpoint would be lost. In order to recover as much information as possible, Sprite LFS scans through the log segments that were written after the last checkpoint. This operation is called roll-forward.

原则上,崩溃后只需要简单地读取最后一个 checkpoint region,丢弃日志中在这个 checkpoint 之后的数据,就可以安全地重启。这样可以做到即时恢复,但是所有在最后一个 checkpoint 之后写入的数据都会丢失。为了恢复尽可能多的数据,Sprite LFS 会扫描在最后一个 checkpoint 之后写入的日志 segment。这个操作叫 roll-forward 前滚。

During roll-forward Sprite LFS uses the information in segment summary blocks to recover recently-written file data. When a summary block indicates the presence of a new inode, Sprite LFS updates the inode map it read from the checkpoint, so that the inode map refers to the new copy of the inode. This automatically incorporates the file's new data blocks into the recovered file system. If data blocks are discovered for a file without a new copy of the file's inode, then the roll-forward code assumes that the new version of the file on disk is incomplete and it ignores the new data blocks.

在前滚期间,Sprite LFS 使用 segment summary block 中的信息来恢复最近写入的文件数据。当一个 summary block 提示存在一个新的 inode,Sprite LFS 就会更新从 checkpoint 读取到的 inode map,让 inode map 指向 inode 的新副本。这会自动地把文件的新数据块并入恢复后的文件系统。如果发现了某个文件的新数据块,但是没有对应的新的 inode 副本,前滚代码就会认为磁盘上这个文件的新版本是不完整的,并忽略这些新数据块。
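下面是前滚过程中依据 summary block 更新 inode map 的极简示意(记录结构为笔者假设):

```python
from collections import namedtuple

SummaryRec = namedtuple("SummaryRec", "kind ino addr")  # kind: "inode" 或 "data"

def roll_forward(segments_after_ckpt, inode_map):
    for summary in segments_after_ckpt:  # 每个 segment 的 summary 记录列表
        for rec in summary:
            if rec.kind == "inode":
                # inode map 指向新的 inode 副本,该文件的新数据块随之并入文件系统;
                # 只有数据块而没有新 inode 的文件被视为不完整,其数据块被忽略
                inode_map[rec.ino] = rec.addr
    return inode_map
```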

The roll-forward code also adjusts the utilizations in the segment usage table read from the checkpoint. The utilizations of the segments written since the checkpoint will be zero; they must be adjusted to reflect the live data left after roll-forward. The utilizations of older segments will also have to be adjusted to reflect file deletions and overwrites (both of these can be identified by the presence of new inodes in the log).

前滚代码还会调整从 checkpoint 读取出来的 segment usage table 中的使用率。自 checkpoint 之后写入的 segment 的使用率是零;必须对它们进行调整,以反映前滚之后留下来的有效数据。更老的 segment 的使用率也必须要调整,来反映文件删除和覆盖写(这两点都可以通过日志中是否存在新的 inode 来判断)。

The final issue in roll-forward is how to restore consistency between directory entries and inodes. Each inode contains a count of the number of directory entries referring to that inode; when the count drops to zero the file is deleted. Unfortunately, it is possible for a crash to occur when an inode has been written to the log with a new reference count while the block containing the corresponding directory entry has not yet been written, or vice versa.

前滚的最后一个问题,是如何恢复目录条目和 inode 之间的一致性。每一个 inode 都包含一个引用该 inode 的目录条目的计数;当引用计数减少到零,文件就被删除了。不幸的是,崩溃可能发生在带有新引用计数的 inode 已经写入日志、而包含对应目录条目的 block 还没有写入的时候,反之亦然。

To restore consistency between directories and inodes, Sprite LFS outputs a special record in the log for each directory change. The record includes an operation code (create, link, rename, or unlink), the location of the directory entry (i-number for the directory and the position within the directory), the contents of the directory entry (name and i-number), and the new reference count for the inode named in the entry. These records are collectively called the directory operation log; Sprite LFS guarantees that each directory operation log entry appears in the log before the corresponding directory block or inode.

为了恢复目录和 inode 之间的一致性,Sprite LFS 为每次目录变更在日志中输出一条特殊的记录。记录的内容包括操作码(create、link、rename 或者 unlink),目录条目的位置(目录的 i-number 以及条目在目录中的位置),目录条目的内容(名字和 i-number),以及条目中所命名的 inode 的新引用计数。这些记录统称为目录操作日志;Sprite LFS 保证每个目录操作日志条目都先于相关的 directory block 或者 inode 出现在日志中。

During roll-forward, the directory operation log is used to ensure consistency between directory entries and inodes: if a log entry appears but the inode and directory block were not both written, roll-forward updates the directory and/or inode to complete the operation. Roll-forward operations can cause entries to be added to or removed from directories and reference counts on inodes to be updated. The recovery program appends the changed directories, inodes, inode map, and segment usage table blocks to the log and writes a new checkpoint region to include them. The only operation that can't be completed is the creation of a new file for which the inode is never written; in this case the directory entry will be removed. In addition to its other functions, the directory log made it easy to provide an atomic rename operation.

在前滚期间,目录操作日志用于保证目录条目和 inode 之间的一致性:如果一个日志条目出现了,但是 inode 和目录块没有全部写入,前滚会更新目录和/或 inode 来完成这个操作。前滚操作可能会在目录中添加或者删除条目,并且更新 inode 上的引用计数。恢复程序把变更过的目录、inode、inode map 以及 segment usage table 的 block 追加到日志中,并且写入一个新的 checkpoint region 来包含它们。唯一无法完成的操作是创建一个 inode 从未被写入的新文件;在这种场景下,目录条目将会被删除。除了这些功能之外,目录操作日志还让提供原子的重命名操作变得容易。

The interaction between the directory operation log and checkpoints introduced additional synchronization issues into Sprite LFS. In particular, each checkpoint must represent a state where the directory operation log is consistent with the inode and directory blocks in the log. This required additional synchronization to prevent directory modifications while checkpoints are being written.

目录操作日志和 checkpoint 之间的交互给 Sprite LFS 引入了额外的同步问题。具体地说,每一个 checkpoint 都必须代表目录操作日志与日志中的 inode 和目录 block 相一致的状态。这需要额外的同步,来避免正在写入 checkpoint 的时候目录被修改。
