One of the many important tasks that the kernel's memory-management subsystem must handle is keeping track of how pages of memory are mapped into the address spaces of the processes running on the system. As long as mappings to a given page exist, that page must be kept in place. As it turns out, tracking these mappings is harder than it seems it should be, and the move to folios within the memory-management subsystem is adding some complexities of its own. As a follow-up to the "mapcount madness" session that he ran at the 2024 Linux Storage, Filesystem, Memory-Management, and BPF summit, David Hildenbrand has posted a patch series intended to improve the handling of mapping counts for folios — but exact accounting remains elusive in some situations.
In theory, tracking mappings should be relatively straightforward: when a mapping to a page is added, increment that page's mapping count to match. The removal of a mapping should lead to an associated decrement of the mapping count. But huge pages and large folios complicate this scenario; they have their own mapping counts that are, essentially, the sum of the mapping counts for the pages they contain. It is often important to know whether a folio as a whole has mappings, so the separate count is useful, but it brings some complexities.
For example, one question that the kernel often asks is: how many processes have mapped a given page or folio? There are a number of operations that can be optimized if it is known that a mapping is exclusive — that the page or folio is mapped by a single process. The handling of copy-on-write pages is also hard to execute correctly if the exclusivity of a given mapping is not known; failures on this front have led to some unpleasant bugs in the past. For a single page, the exclusivity question is easily enough answered: if the mapping count is one, the mapping is exclusive, otherwise it is shared. That rule no longer applies if mapping counts are maintained at the folio level, though, since the folio-level count will almost never be just one.
The current scheme also has performance-related problems that folios could help to address. Mapping a traditional PMD-size huge page is equivalent to mapping 512 base pages; currently, if the entire huge page is mapped, the mapping count of each of those base pages must be incremented individually. That takes time the kernel developers would rather not spend; it would be much faster to maintain a single mapping count at the folio level. This optimization would make the exclusivity question even harder to answer, though, especially in the presence of partially mapped folios (where only some of a folio's pages are mapped into an address space).
Thus, it is not surprising that kernel developers have spent years trying to figure out how best to manage these mapping counts as memory-management complexity increases.
Make room!
To better track mapping counts at the folio level, Hildenbrand first needed a bit more space in the folio structure for some additional information. struct folio is a bit of a complicated and confusing beast. As a way of facilitating the transition to folio use throughout the kernel, this structure is overlaid on top of struct page, which describes a single page. But folios often need to track more information than can be fit into the tightly packed page structure; this is especially true for large folios that contain many pages.
But, since a large folio does contain many pages — and physically contiguous pages at that — there are some tricks that can be employed. There is no real need to maintain a full page structure for every page within a folio, since they are managed as a unit; indeed, eliminating the management of all of those page structures is one of the objectives of the folio transition. But those page structures exist, laid out contiguously in the system's memory map. So a large folio does not have just one page structure's worth of memory at its disposal; it has the page structures for all of the component pages. The page structures for the "tail pages" — those after the first one — can thus be carefully put to use holding this additional information.
If one looks at the definition of struct folio, it quickly becomes clear that it is larger than a single page structure. After the initial fields that overlay the page structure for the head page, one will find this:
    union {
        struct {
            unsigned long _flags_1;
            unsigned long _head_1;
            atomic_t _large_mapcount;
            atomic_t _entire_mapcount;
            atomic_t _nr_pages_mapped;
            atomic_t _pincount;
    #ifdef CONFIG_64BIT
            unsigned int _folio_nr_pages;
    #endif
    /* private: the union with struct page is transitional */
        };
        struct page __page_1;
    };
This piece of the folio structure precisely overlays the page structure of the first tail page, assuming such a page exists. It contains information intended to help maintain the mapping count in current kernels, and other relevant fields. There is also a __page_2 component (not shown) that mainly holds information used by the hugetlbfs subsystem. As a result, the folio structure is actually the length of three page structures, though most of it is only valid for large (at least four pages) folios.
As sprawling as this seems, it still lacks the space Hildenbrand needed to better track mapping counts. To be able to handle order-1 (two-page) folios, he needed that space to fit within the __page_1 union shown above. So the first six patches of the series are dedicated to shuffling fields around in the folio structure, adding a __page_3 union in the process. The __page_1 union gains some complexity, but the core of the work is in these new fields:
    mm_id_mapcount_t _mm_id_mapcount[2];
    union {
        mm_id_t _mm_id[2];
        unsigned long _mm_ids;
    };
They will be used to keep better track of the mapping for the folio to which they belong. Describing how that is done requires a bit more background, though.
One, two, or many
So how does all of this work help to improve the tracking of the mapping counts for large folios that may be shared between multiple processes and which can be partially mapped in any one of them? The starting point is the mm_struct structure that represents a process's address space. Any time a folio is mapped, that mapping will belong to a specific process, and thus a specific mm_struct structure. So the question of whether a folio is exclusively mapped comes down to whether all of its mappings belong to the same mm_struct. It is a simple matter of tracking which mm_struct structures hold mappings to the folio.
Of course, there could be thousands of those structures containing such mappings; consider that almost every process in the system will have the C library mapped, for example. Tracking all of those mappings without consuming a lot of time and memory would not be an easy task. But it is not really important to track every mapping to something like the C library; the purpose here is to stay on top of the folios that are exclusively mapped, and thus don't have all those mappings.
The _mm_id array that was added to page 1 of the folio structure is intended to serve this purpose; it can track up to two mm_struct structures that have mappings to the folio. The most straightforward way to do that would be to just store pointers to those mm_struct structures, but space in the folio structure is still at a premium. So, instead, a shorter "mm ID" is assigned to each mm_struct, using the kernel's ID allocator subsystem.
When a folio is first created, both _mm_id entries are set to MM_ID_DUMMY, indicating that they are unused. When the time comes to add a mapping, the kernel will search _mm_id for the appropriate mm ID, then increment the associated _mm_id_mapcount entry to record the new mapping(s). So, for example, if eight pages within a folio are mapped into the address space, the count will be incremented by eight to match. If the mm ID does not have an entry in _mm_id, the kernel will look for an MM_ID_DUMMY entry to use for this mm_struct, then start tracking the mappings there.
The kernel is now maintaining multiple mapping counts for this folio. The _large_mapcount field of the folio structure continues to count all of the mappings to the folio from any address space, as it does in current kernels. But there is also the _mm_id_mapcount count for each mm_struct tracking the number of mappings associated with that specific structure. The question of whether the folio is mapped exclusively is now easy to answer: if one of the _mm_id_mapcount counters is equal to _large_mapcount, then all of the mappings belong to the associated mm_struct and the kernel knows that the mapping is exclusive. Otherwise, the mapping is shared.
The ability to track two mm_struct structures handles the most common case of short-term shared mappings — when a process calls clone() to create a new child process. That new process will use the second _mm_id slot for the mapping that is now shared between the parent and the child. If, as often happens, the child calls execve() to run a new program, the shared mapping will be torn down, the child's _mm_id slot will be released, and the kernel will know that the folio is, again, mapped exclusively.
There is just one tiny gap in this mechanism, though: what happens when a third process comes along and maps the folio? There will be no _mm_id slot available for it, so its mapping(s) cannot be tracked. Should this happen, the kernel will set a special bit in the folio structure indicating that it no longer has a handle on where all the mappings to the folio come from, and will treat it as being shared. This could result in the kernel mistakenly concluding that a folio is mapped shared when it is mapped exclusively; the consequence will be worse performance, but no lack of correctness. If enough processes unmap the folio, there could come a time when _large_mapcount again aligns with one of the _mm_id_mapcount counts, and the kernel will once again know that the folio is mapped exclusively.
Per-page mapcounts and more
The result of all this work is that the kernel has a better handle on whether any given folio is mapped exclusively or shared, though it may still occasionally conclude that a folio is shared when it is not. But that was not the only objective of this work; Hildenbrand also would like to do away with the overhead of maintaining the per-page mapping counts in large folios. The final part of the patch series is an implementation of that goal; at the end, the per-page counts are no longer used or maintained.
The most significant consequence of dropping the per-page mapping counts appears to be making some of the memory-management statistics provided by the kernel (the various resident-set sizes, for example) a bit fuzzier. Hildenbrand suggests that this imprecision should not be a problem, but he also acknowledges that it will take time to see what the implications really are. To avoid surprises during that time, there is a new configuration parameter, CONFIG_NO_PAGE_MAPCOUNT, that controls whether these changes are effective. This work is considered experimental enough that, at this point, Hildenbrand does not want to have it enabled by default in production kernels.
There will be a desire to do that at some point, though; dropping the per-page map counts can make a clone() call up to 20% faster for some workloads, according to performance results included in the patch cover letter.
Meanwhile, this work enables another optimization in how some transparent huge pages are used after a process forks. In current kernels, if a huge page (folio) is mapped at the base-page level ("PTE-mapped"), it will not be reused after the fork. As the use of transparent huge pages grows, and especially of the multi-size huge pages that must be PTE-mapped, reusing those huge pages will become increasingly important. Now, with the per-mm_struct mapping counts, the kernel can tell when a process has exclusive access to a huge page and can continue using it as such. This reuse yields significant improvements in some benchmark results.
The use of large folios is expected to grow in the future; they are a more efficient way to manage much of the memory that any given process uses. So it is important to optimize that case as much as possible. Hildenbrand's patch set makes some steps in that direction while addressing a thorny problem that has resisted solution for years. These changes are currently in the linux-next repository, so there is a reasonable possibility that they could land in the mainline during the 6.15 merge window. If so, the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit, which will be concurrent with that merge window, may be the last to feature a "mapcount madness" session.