The one parameter that is not improving by leaps and bounds is disk seek time (except for solid-state disks, which have no seek time).


The combination of these factors means that a performance bottleneck is arising in many file systems. Research done at Berkeley attempted to alleviate this problem by designing a completely new kind of file system, LFS (the Log-structured File System).

To make matters worse, in most file systems, writes are done in very small chunks. Small writes are highly inefficient, since a 50-μsec disk write is often preceded by a 10-msec seek and a 4-msec rotational delay. With these parameters, disk efficiency drops to a fraction of 1%.
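The "fraction of 1%" figure follows directly from the three numbers quoted above. A minimal sketch of the arithmetic, where the function name `disk_efficiency` is only an illustrative choice and efficiency is taken to mean the share of the total access time spent transferring data:

```python
def disk_efficiency(transfer_s, seek_s, rotation_s):
    """Fraction of the total access time that is spent transferring data."""
    return transfer_s / (seek_s + rotation_s + transfer_s)


# Figures from the text: 50-usec write, 10-msec seek, 4-msec rotational delay.
eff = disk_efficiency(transfer_s=50e-6, seek_s=10e-3, rotation_s=4e-3)
print(f"{eff:.2%}")  # roughly 0.36%
```

In other words, for every small write the disk spends more than 99% of its time positioning the head rather than moving data.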


To see where all the small writes come from, consider creating a new file on a UNIX system. To write this file, the i-node for the directory, the directory block, the i-node for the file, and the file itself must all be written. While these writes can be delayed, doing so exposes the file system to serious consistency problems if a crash occurs before the writes are done. For this reason, the i-node writes are generally done immediately.


From this reasoning, the LFS designers decided to reimplement the UNIX file system in such a way as to achieve the full bandwidth of the disk, even in the face of a workload consisting in large part of small random writes. The basic idea is to structure the entire disk as a great big log.


Periodically, and when there is a special need for it, all the pending writes being buffered in memory are collected into a single segment and written to the disk as a single contiguous segment at the end of the log. A single segment may thus contain i-nodes, directory blocks, and data blocks, all mixed together. At the start of each segment is a segment summary, telling what can be found in the segment. If the average segment can be made to be about 1 MB, almost the full bandwidth of the disk can be utilized.
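The buffering scheme can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the actual LFS code: the "disk" is a Python list, and the names `Segment`, `LogWriter`, `write`, and `flush` are hypothetical. Each summary entry records what kind of block it describes, which file it belongs to, and where it sits inside the segment.

```python
from dataclasses import dataclass, field

SEGMENT_SIZE = 1 << 20  # about 1 MB, the target segment size from the text


@dataclass
class Segment:
    # Summary entries: (kind, i-number, offset in payload, length).
    summary: list = field(default_factory=list)
    payload: bytearray = field(default_factory=bytearray)


class LogWriter:
    def __init__(self):
        self.log = []           # stands in for the on-disk log (assumption)
        self.pending = []       # buffered writes: (kind, i-number, data)
        self.pending_bytes = 0

    def write(self, kind, inum, data):
        """Buffer a write; flush once a full segment has accumulated."""
        self.pending.append((kind, inum, data))
        self.pending_bytes += len(data)
        if self.pending_bytes >= SEGMENT_SIZE:
            self.flush()

    def flush(self):
        """Pack all pending writes into one segment at the end of the log."""
        if not self.pending:
            return None
        seg = Segment()
        for kind, inum, data in self.pending:
            seg.summary.append((kind, inum, len(seg.payload), len(data)))
            seg.payload += data
        self.log.append(seg)    # one contiguous append instead of many seeks
        self.pending.clear()
        self.pending_bytes = 0
        return len(self.log) - 1


w = LogWriter()
w.write("inode", 7, b"i" * 128)
w.write("dir", 2, b"d" * 512)
w.write("data", 7, b"x" * 4096)
seg_no = w.flush()              # the "special need" case: an explicit flush
```

After the flush, the single segment holds an i-node, a directory block, and a data block mixed together, and its summary says where each one starts.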


In this design, i-nodes still exist and even have the same structure as in UNIX, but they are now scattered all over the log, instead of being at a fixed position on the disk. Nevertheless, once an i-node is located, locating the blocks is done in the usual way. Of course, finding an i-node is now much harder, since its address cannot simply be calculated from its i-number, as in UNIX. To make it possible to find i-nodes, an i-node map, indexed by i-number, is maintained. Entry i in this map points to i-node i on the disk. The map is kept on disk, but it is also cached, so the most heavily used parts will be in memory most of the time.
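A rough sketch of the lookup path may help. It assumes a toy layout in which the log is a list of segments, each segment is a list of slots, and an address is a `(segment, slot)` pair; the names `InodeMap` and `read_file` are hypothetical and chosen only for this example.

```python
class InodeMap:
    """Maps an i-number to the log address (segment, slot) of its i-node."""

    def __init__(self):
        self.on_disk = {}  # stands in for the copy of the map kept on disk
        self.cache = {}    # in-memory cache of recently used entries

    def update(self, inum, addr):
        self.on_disk[inum] = addr
        self.cache[inum] = addr

    def lookup(self, inum):
        if inum not in self.cache:          # cache miss: consult the disk copy
            self.cache[inum] = self.on_disk[inum]
        return self.cache[inum]


def read_file(log, inode_map, inum):
    """Find the i-node through the map, then follow its block addresses."""
    seg_no, slot = inode_map.lookup(inum)
    inode = log[seg_no][slot]
    return b"".join(log[s][k] for s, k in inode["blocks"])


# File 7 was written in segment 0 and later extended in segment 1, so a
# newer copy of its i-node now lives in segment 1.
log = [
    [b"hel", {"inum": 7, "blocks": [(0, 0)]}],
    [b"lo", {"inum": 7, "blocks": [(0, 0), (1, 0)]}],
]
imap = InodeMap()
imap.update(7, (0, 1))   # first version of the i-node
imap.update(7, (1, 1))   # the map now points at the newer copy
data = read_file(log, imap, 7)
```

The older i-node in segment 0 is still physically present in the log, but nothing points to it any more.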


To summarize what we have said so far, all writes are initially buffered in memory, and periodically all the buffered writes are written to the disk in a single segment, at the end of the log. Opening a file now consists of using the map to locate the i-node for the file. Once the i-node has been located, the addresses of the blocks can be found from it. All of the blocks will themselves be in segments, somewhere in the log.

If disks were infinitely large, the above description would be the entire story. However, real disks are finite, so eventually the log will occupy the entire disk, at which time no new segments can be written to the log. Fortunately, many existing segments may have blocks that are no longer needed. For example, if a file is overwritten, its i-node will now point to the new blocks, but the old ones will still be occupying space in previously written segments.



To deal with this problem, LFS has a cleaner thread that spends its time scanning the log circularly to compact it. It starts out by reading the summary of the first segment in the log to see which i-nodes and files are there. It then checks the current i-node map to see if the i-nodes are still current and file blocks are still in use. If not, that information is discarded. The i-nodes and blocks that are still in use go into memory to be written out in the next segment. The original segment is then marked as free, so that the log can use it for new data. In this manner, the cleaner moves along the log, removing old segments from the back and putting any live data into memory for rewriting in the next segment. Consequently, the disk is a big circular buffer, with the writer thread adding new segments to the front and the cleaner thread removing old ones from the back.

The bookkeeping here is nontrivial, since when a file block is written back to a new segment, the i-node of the file (somewhere in the log) must be located, updated, and put into memory to be written out in the next segment. The i-node map must then be updated to point to the new copy. Nevertheless, it is possible to do the administration, and the performance results show that all this complexity is worthwhile. Measurements given in the papers cited above show that LFS outperforms UNIX by an order of magnitude on small writes, while having a performance that is as good as or better than UNIX for reads and large writes.
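The cleaning pass and its bookkeeping can be sketched as follows. This is a simplified model rather than the real LFS cleaner: the log is a growing Python list instead of a circular buffer, a freed segment is represented by `None`, each segment's own entries double as its summary, and all names (`fetch_inode`, `clean_segment`) are hypothetical.

```python
def fetch_inode(log, inode_map, inum):
    """Return the current i-node for inum, or None if the file is gone."""
    addr = inode_map.get(inum)
    if addr is None:
        return None
    seg_no, slot = addr
    return log[seg_no][slot]


def clean_segment(log, inode_map, seg_no):
    """Copy the live contents of one segment to a new segment, then free it."""
    live = []
    for slot, entry in enumerate(log[seg_no]):
        addr = (seg_no, slot)
        if entry["kind"] == "inode":
            is_live = inode_map.get(entry["inum"]) == addr
        else:  # data block: live only if its file's i-node still points here
            inode = fetch_inode(log, inode_map, entry["inum"])
            is_live = inode is not None and inode["blocks"].get(entry["blkno"]) == addr
        if is_live:
            live.append(entry)
    log[seg_no] = None  # mark the original segment as free

    new_no, new_seg = len(log), []
    inodes = {e["inum"]: e for e in live if e["kind"] == "inode"}
    for e in live:
        if e["kind"] != "data":
            continue
        if e["inum"] not in inodes:
            # The i-node lives in another segment: locate it and copy it so
            # the updated version can be written out with the moved block.
            old = fetch_inode(log, inode_map, e["inum"])
            inodes[e["inum"]] = {"kind": "inode", "inum": old["inum"],
                                 "blocks": dict(old["blocks"])}
        inodes[e["inum"]]["blocks"][e["blkno"]] = (new_no, len(new_seg))
        new_seg.append(e)
    for inum, inode in inodes.items():
        inode_map[inum] = (new_no, len(new_seg))  # map points to the new copy
        new_seg.append(inode)
    log.append(new_seg)
    return new_no


def data(inum, blkno, payload):
    return {"kind": "data", "inum": inum, "blkno": blkno, "payload": payload}


def inode(inum, blocks):
    return {"kind": "inode", "inum": inum, "blocks": blocks}


log = [
    [data(1, 0, b"old"), inode(1, {0: (0, 0)}),     # file 1: overwritten later
     data(2, 0, b"keep"), inode(2, {0: (0, 2)}),    # file 2: still live
     data(3, 0, b"far")],                           # file 3: i-node in segment 1
    [data(1, 0, b"new"), inode(1, {0: (1, 0)}), inode(3, {0: (0, 4)})],
]
inode_map = {1: (1, 1), 2: (0, 3), 3: (1, 2)}
new_no = clean_segment(log, inode_map, 0)
```

In this example the stale block and stale i-node of file 1 are dropped, while files 2 and 3 end up in the new segment with their i-nodes rewritten and the map updated. File 3 shows the awkward case described above: its data block moves, so its i-node, which lived in a different segment, has to be found, updated, and rewritten as well.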



