The state of the page in 2024

https://lwn.net/Articles/973565/ By Jonathan Corbet May 15, 2024

Folios were merged into the kernel in version 5.16, and folio-related changes have continued to land ever since.

The advent of the folio structure to describe groups of pages has been one of the most fundamental transformations within the kernel in recent years. Since the folio transition affects many subsystems, it is fitting that the subject was covered at the beginning of the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit in a joint session of the storage, filesystem, and memory-management tracks. Matthew Wilcox used the session to review the work that has been done in this area and to discuss what comes next.

The first step of this transition, he began, was moving much of the information traditionally stored in the kernel’s page structure into folios instead, then converting users of struct page to use the new structure. The initial goal was to provide a type-safe representation for compound pages, but the scope has expanded greatly since then. That has led to a bit of ambiguity: what, exactly, is a folio in current kernels? For now, a folio is still defined as “the first page of a compound page”.
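
The "first page of a compound page" definition, and the type safety it is meant to provide, can be illustrated with a toy model. This is only a sketch: the `Page` and `Folio` classes and the `make_compound()` helper are hypothetical, and `page_folio()` and `compound_head` merely borrow names from the kernel without reproducing its implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class Page:
    """One base page; a tail page points back at the head of its compound page."""
    pfn: int
    order: int = 0                          # only meaningful on a head page
    compound_head: Optional["Page"] = None  # set only on tail pages


class Folio:
    """Wrapper that can only be constructed around a head (or single) page."""

    def __init__(self, head: Page):
        if head.compound_head is not None:
            raise TypeError("a folio must wrap the first page of a compound page")
        self.head = head

    def nr_pages(self) -> int:
        return 1 << self.head.order


def make_compound(first_pfn: int, order: int) -> List[Page]:
    """Hypothetical helper: build a compound page of 2**order base pages."""
    head = Page(pfn=first_pfn, order=order)
    tails = [Page(pfn=first_pfn + i, compound_head=head)
             for i in range(1, 1 << order)]
    return [head] + tails


def page_folio(page: Page) -> Folio:
    """Map any page, head or tail, to the folio that contains it."""
    return Folio(page.compound_head or page)
```

Code that accepts a `Folio` never has to ask whether it was handed a tail page, which is the ambiguity that bare `struct page` pointers leave open.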

By the end of the next phase, the plan is for struct page to shrink down to a single, eight-byte memory descriptor, the bottom few bits of which describe what type of page is being described. The descriptor itself will be specific to the page type; slab pages will have different descriptors than anonymous folios or pages full of page-table entries, for example.
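
An eight-byte descriptor whose low bits carry a type can be read as a tagged pointer. The sketch below assumes a 4-bit tag and made-up type numbers; the article does not give the actual encoding, so `TYPE_BITS`, `MEMDESC_TYPES`, and both function names are hypothetical.

```python
TYPE_BITS = 4                      # assumption: descriptors are 16-byte aligned
TYPE_MASK = (1 << TYPE_BITS) - 1
WORD_MASK = (1 << 64) - 1          # the descriptor must fit in eight bytes

# Hypothetical type numbers, chosen only for illustration.
MEMDESC_TYPES = {"folio": 1, "slab": 2, "page_table": 3}
TYPE_NAMES = {value: name for name, value in MEMDESC_TYPES.items()}


def pack_memdesc(desc_addr: int, type_name: str) -> int:
    """Combine an aligned descriptor address with a type tag in the low bits."""
    if desc_addr & ~WORD_MASK:
        raise ValueError("address does not fit in 64 bits")
    if desc_addr & TYPE_MASK:
        raise ValueError("address is not aligned; low bits would be lost")
    return desc_addr | MEMDESC_TYPES[type_name]


def unpack_memdesc(word: int):
    """Split a descriptor word back into (address, type name)."""
    return word & ~TYPE_MASK, TYPE_NAMES[word & TYPE_MASK]
```

The alignment check is the point of the design: the tag costs no extra space because an aligned pointer never uses its low bits.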

Among other motivations, a key objective behind the move to descriptors is reducing the size of the memory map — the large array of page structures describing every physical page in the system. Currently, the memory-map overhead is, at 1.6% of the memory it describes, too high. On systems where virtualization is used, the memory map is also found in guests, doubling the memory consumed by the memory-map. By moving to descriptors, that overhead can be reduced to 0.2% of memory, which can save multiple gigabytes of memory on larger systems.
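
The percentages follow from simple arithmetic. The sketch below assumes 4 KiB base pages and a 64-byte `struct page`, the usual figures on 64-bit x86 but not stated in the article. It counts only the memory-map array itself, not the separately allocated per-type descriptors.

```python
PAGE_SIZE = 4096          # assumption: 4 KiB base pages
STRUCT_PAGE_SIZE = 64     # assumption: typical size on 64-bit systems
MEMDESC_SIZE = 8          # the eight-byte descriptor described above

GIB = 1 << 30
TIB = 1 << 40


def memmap_overhead(per_page_bytes: int, page_size: int = PAGE_SIZE) -> float:
    """Fraction of described memory consumed by the memory map."""
    return per_page_bytes / page_size


def memmap_bytes(ram_bytes: int, per_page_bytes: int,
                 page_size: int = PAGE_SIZE) -> int:
    """Total size of the memory map for a given amount of RAM."""
    return (ram_bytes // page_size) * per_page_bytes


if __name__ == "__main__":
    print(f"current: {memmap_overhead(STRUCT_PAGE_SIZE):.2%}")
    print(f"future:  {memmap_overhead(MEMDESC_SIZE):.2%}")
    saved = (memmap_bytes(TIB, STRUCT_PAGE_SIZE)
             - memmap_bytes(TIB, MEMDESC_SIZE))
    print(f"memory-map saving on a 1 TiB machine: {saved // GIB} GiB")
```

Under these assumptions, 64/4096 rounds to the 1.6% quoted in the talk and 8/4096 rounds to 0.2%.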

Getting there, though, requires moving more information into the folio structure. Along the way, concepts like the pin count for a page can be clarified, cleaning up some longstanding problems in the memory-management subsystem. This move will, naturally, increase the size of the folio structure, to a point where it will be larger than struct page. The advantage, though, is that only one folio structure is needed for all of the base pages that make up the folio. For two-page folios, the total memory use is about the same; for folios of four pages or more, the usage is reduced. If the kernel is caching the contents of a 1GB file, it currently needs 16MB of page structures. If that caching is done entirely with base pages, that overhead will increase to 23MB in the future. But, if four-page folios are used instead, it drops to 9MB total.
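
The amortization argument can be sketched with a small cost model. With 4 KiB pages and a 64-byte `struct page` (both assumptions), caching 1 GiB costs 16 MiB of page structures today. For the future layout, the model charges eight bytes per base page plus one folio structure per folio. The 112-byte `FOLIO_STRUCT_SIZE` is hypothetical, chosen only so that two-page folios break even as the talk describes; the real structure sizes are not given here, so the model is not expected to reproduce every figure quoted above.

```python
PAGE_SIZE = 4096          # assumption: 4 KiB base pages
STRUCT_PAGE_SIZE = 64     # assumption: today's struct page
MEMDESC_SIZE = 8          # per-page descriptor in the planned layout
FOLIO_STRUCT_SIZE = 112   # hypothetical: makes two-page folios break even

MIB = 1 << 20
GIB = 1 << 30


def current_cost(cached_bytes: int) -> int:
    """Page-structure bytes needed today: one struct page per base page."""
    return (cached_bytes // PAGE_SIZE) * STRUCT_PAGE_SIZE


def future_cost(cached_bytes: int, pages_per_folio: int,
                folio_struct_size: int = FOLIO_STRUCT_SIZE) -> int:
    """Planned layout: one descriptor per page plus one structure per folio."""
    pages = cached_bytes // PAGE_SIZE
    folios = pages // pages_per_folio
    return pages * MEMDESC_SIZE + folios * folio_struct_size


if __name__ == "__main__":
    print(f"today: {current_cost(GIB) / MIB:.0f} MiB")
    for n in (1, 2, 4, 16):
        print(f"{n:2d}-page folios: {future_cost(GIB, n) / MIB:.1f} MiB")
```

Whatever the final sizes turn out to be, the shape is the same: single-page folios cost more than today, and the per-folio structure is amortized away as folios grow.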

Some types of descriptors, including those for slab pages and page-table entries, have already been introduced. The page-table descriptors are quite a bit smaller than folios, since there are a number of fields that are not needed. For example, these pages cannot be mapped into user space, so there is no need for a mapping count.

Wilcox put up a plot showing how many times struct page and struct folio are mentioned in the kernel since 2021. On the order of 30% of the page mentions have gone away over that time. He emphasized that the end goal is not to get rid of struct page entirely; it will always have its uses. Pages are, for example, the granularity with which memory is mapped into user space.

Since last year’s update, quite a lot of work has happened within the memory-management subsystem. Many kernel subsystems have been converted to folios. There is also now a reliable way to determine whether a folio is part of hugetlbfs, the absence of which turned out to be a bit of a surprising problem. The adoption of large anonymous folios has been a welcome improvement.

The virtual filesystem layer has also seen a lot of folio-related work. The sendpage() callback has been removed in favor of a better API. The fs-verity subsystem now supports large folios. The conversion of the buffer cache is proceeding, but has run into a surprise: Wilcox had proceeded with the assumption that buffer heads are always attached to folios, but it turns out that the ext4 filesystem allocates slab memory and attaches that instead. That usage isn’t wrong, Wilcox said, but he is “about to make it wrong” and does not want to introduce bugs in the process.

Avoiding problems will require leaving some information in struct page that might have otherwise come out. In general, he said, he would not have taken this direction with buffer heads had he known where it would lead, but he does not want to back it out now. All is well for now, he said; the ext4 code is careful not to call any functions on non-folio-backed buffer heads that might bring the system down. But there is nothing preventing that from happening in the future, and that is a bit frightening.

The virtual filesystem layer is now allocating and using large folios through the entire write path; this has led to a large performance improvement. Wilcox has also added an internal function, folio_end_read(), that he seemed rather proud of. It sets the up-to-date bit, clears the lock bit, checks for others waiting on the folio, and serves as a memory barrier — all with a single instruction on x86 systems. Various other helpers have been added and callbacks updated. There is also a new writeback iterator that replaces the old callback-based interface; among other things, this helps to recover some of the performance that was taken away by Spectre mitigations.
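
One way to get all of that from a single operation is an XOR on the flags word. When a read completes, the lock bit is known to be set and the up-to-date bit is known to be clear, so XORing with both bits clears one and sets the other, and the result shows whether anyone is waiting. The sketch below models only the bit logic. The bit positions and the `end_read()` name are illustrative assumptions, and plain Python cannot express the atomicity or the memory barrier that the real instruction provides.

```python
# Illustrative bit positions; the real layout is defined by the kernel.
PG_LOCKED = 1 << 0
PG_UPTODATE = 1 << 3
PG_WAITERS = 1 << 7


def end_read(flags: int, success: bool):
    """Finish a read: unlock, optionally mark up to date, report waiters.

    Returns (new_flags, has_waiters). A single XOR does both bit changes
    because the starting state of each bit is known in advance.
    """
    if not (flags & PG_LOCKED):
        raise ValueError("folio must be locked while a read is in flight")
    if flags & PG_UPTODATE:
        raise ValueError("folio should not be up to date before the read ends")

    mask = PG_LOCKED | (PG_UPTODATE if success else 0)
    new_flags = flags ^ mask
    return new_flags, bool(new_flags & PG_WAITERS)


if __name__ == "__main__":
    flags, wake = end_read(PG_LOCKED | PG_WAITERS, success=True)
    print(f"flags={flags:#04x} wake_waiters={wake}")
```

A caller would wake the waiters only when the second return value is true, so the common case with no waiters does no extra work.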

With regard to individual filesystems, many have been converted to folios over the last year. Filesystems as a whole are being moved away from the writepage() API; it was seen as harmful, so no folio version was created. The bcachefs filesystem can now handle large folios — something that almost no other filesystems can do. The old NTFS filesystem was removed rather than being converted. The “netfs” layer has been created to support network filesystems. Wilcox put up a chart showing the status of many filesystems, showing that a lot of work remained to be done for most. “XFS is green”, he told the assembled developers, “your filesystem could be green too”.

[Image]

The next step for folios is to move the mapping and index fields out of struct page. These fields could create trouble in the filesystems that do not yet support large folios, which is almost all of them. Rather than risk introducing bugs when those filesystems are converted, it is better to get those fields out of the way now. A number of page flags are also being moved; flags like PageDirty and PageReferenced refer to the folio as a whole rather than to individual pages within it, and thus should be kept there. There are plans to replace the write_begin() and write_end() address-space operations, which still use bare pages.

Beyond that, there is still the task of converting a lot of filesystems, many of which are “pseudo-maintained” at best. The hugetlbfs subsystem needs to be modernized. The shmem and tmpfs in-memory filesystems should be enhanced to use intermediate-size large folios. There is also a desire to eliminate all higher-order memory allocations that do not use compound pages, and thus cannot be immediately changed over to folios; the crypto layer has a lot of those allocations.

Then, there is the “phyr” concept. A phyr is meant to refer to a physical range of pages, and is “what needs to happen to the block layer”. That will allow block I/O operations to work directly on physical pages, eliminating the need for the memory map to cover all of physical memory.

It seems that there will be a need for a khugepaged kernel thread that will collapse mid-size folios into larger ones. Other types of memory need to have special-purpose memory descriptors created for them. Then there is the KMSAN kernel-memory sanitizer, which hasn’t really even been thought about. KMSAN adds its own special bits to struct page, a usage that will need to be rethought for the folio-based future.

An important task is adding large-folio support to more filesystems. In the conversions that Wilcox has done, he has avoided adding that support except in the case of XFS. It is not an easy job and needs expertise in the specific filesystem type. But, as the overhead for single-page folios grows, the need to use larger folios will grow with it. Large folios also help to reduce the size of the memory-management subsystem’s LRU list, making reclaim more efficient.

Ted Ts’o asked how important this conversion is for little-used filesystems; does VFAT need to be converted? Wilcox answered that it should be done for any filesystem where somebody cares about performance. Dave Chinner added that any filesystem that works on an NVMe solid-state device will need large folios to perform well. Wilcox closed by saying that switching to large folios makes compiling the kernel 5% faster, and is also needed to support other desired features, so the developers in the room should want to do the conversion sooner rather than later.