What's next for the SLUB allocator

https://lwn.net/Articles/974138/ By Jonathan Corbet May 20, 2024

There are two fundamental levels of memory allocator in the Linux kernel: the page allocator, which allocates memory in units of pages, and the slab allocator, which allocates arbitrarily-sized chunks that are usually (but not necessarily) smaller than a page. The slab allocator is the one that stands behind commonly used kernel functions like kmalloc(). At the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit, slab maintainer Vlastimil Babka provided an update on recent changes at the slab level and discussed the changes that are yet to come.

Once upon a time, the kernel contained three slab-allocator implementations. That number had dropped to two in the 6.4 release, when the SLOB allocator (aimed at low-memory systems) was removed. At the 2023 summit, Babka began, the decision had been made to remove SLAB (one of the two general-purpose allocators), leaving only SLUB in the kernel. That removal happened in 6.8. Kernel developers now have greater freedom to improve SLUB without worrying about breaking the others. He thought that nobody was unhappy about this removal, he said, until he saw the recent report from the Embedded Open Source Summit, which contained some complaints. Even there, though, the primary complaint seemed to be that the removal had happened too quickly — even though he thought it had taken too long. Nobody seems to be clamoring to have SLAB back, anyway.


Last year, some concerns had been expressed that SLUB was slower than SLAB for some workloads. But now, nobody is working on addressing any remaining problems. David Rientjes said that Google is still working on transitioning to SLUB; in the process it has turned up that using SLUB resolves some jitter problems that had been observed with SLAB, so folks there are happy with the change.
Babka said that he has been working on reducing the overhead created by the accounting of kernel memory allocations in control groups; this cost shows up in microbenchmarks, and “Linus is unhappy” about it. There are some improvements that are ready to go into 6.10, but there is more work to do. Another area of slab development is heap-spraying defense; these patches are a bit of a problem for him. He can review them as memory-management changes, but he lacks the expertise to judge the security aspect.
Work is being done on object caching with prefilling. This feature would maintain a per-CPU array of objects that users could opt into; they would be able to prefill (preallocate) the objects prior to allocation so that they are ready to go when needed. That would be useful for objects allocated in critical sections, for example. The initial intended user is the maple tree data structure, which is currently bulk-allocating a worst-case number of objects before entering critical sections, then returning the unused objects afterward. The object cache would eliminate that back-and-forth while ensuring that objects could be allocated when needed.
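To make the prefilling idea concrete, here is a minimal userspace sketch. It is not the proposed kernel interface: `struct obj_cache`, `cache_prefill()`, `cache_alloc()` and the fixed capacity are all hypothetical names chosen for illustration, and `malloc()` stands in for the slab allocator. The point it shows is the split between a step that may allocate (and fail) before the critical section and a step that only pops from an array inside it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHE_CAPACITY 16

/* Hypothetical userspace model of a prefilled object cache. */
struct obj_cache {
    size_t obj_size;
    size_t count;                 /* objects currently held */
    void *objs[CACHE_CAPACITY];
};

void cache_init(struct obj_cache *c, size_t obj_size)
{
    assert(obj_size > 0);
    c->obj_size = obj_size;
    c->count = 0;
}

/* May allocate, and so may fail: call it before the critical section.
 * Returns 0 once at least `want` objects are held, -1 otherwise. */
int cache_prefill(struct obj_cache *c, size_t want)
{
    if (want > CACHE_CAPACITY)
        return -1;
    while (c->count < want) {
        void *p = malloc(c->obj_size);
        if (!p)
            return -1;
        c->objs[c->count++] = p;
    }
    return 0;
}

/* Never calls the underlying allocator, so it cannot block or fail for
 * lack of memory as long as the cache was prefilled with enough objects. */
void *cache_alloc(struct obj_cache *c)
{
    return c->count ? c->objs[--c->count] : NULL;
}

/* Release whatever is left when the cache itself is torn down. */
void cache_destroy(struct obj_cache *c)
{
    while (c->count)
        free(c->objs[--c->count]);
}
```

In this model, objects that were prefilled but not used simply stay in the array for the next round, which is the back-and-forth that the bulk-allocate-then-return pattern cannot avoid.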
Michal Hocko pointed out that the real problem that is driving this feature is the combination of GFP_ATOMIC allocations with the __GFP_NOFAIL flag; that combination is difficult for the kernel to satisfy if memory is tight. The allocator currently emits a warning when it sees that combination; avoidance of it on the part of developers would be appreciated, he said. The prefilled object cache is one way of doing that. In the future, some sort of reservation mechanism may be added for such situations as well.
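Why that flag combination is hard to satisfy can be shown with a toy model. The flag names and values below are hypothetical stand-ins, not the kernel's real GFP bits; the sketch only assumes that an atomic request may not block and that a no-fail request can be honored only by waiting until memory becomes available.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for illustration; not the kernel's real values. */
#define MODEL_MAY_BLOCK 0x1u  /* caller can sleep while memory is reclaimed */
#define MODEL_NOFAIL    0x2u  /* caller cannot handle a failed allocation */

#define MODEL_ATOMIC    0x0u             /* may not block */
#define MODEL_KERNEL    MODEL_MAY_BLOCK  /* ordinary sleeping context */

/* A no-fail request can only be honored by waiting for memory to be
 * freed, so it is contradictory when the caller also forbids blocking. */
bool request_is_contradictory(unsigned int flags)
{
    return (flags & MODEL_NOFAIL) && !(flags & MODEL_MAY_BLOCK);
}

/* Debug-build guard a caller could place in front of an allocation. */
void check_request(unsigned int flags)
{
    assert(!request_is_contradictory(flags));
}
```

A prefilled cache sidesteps the contradiction: the part that must not fail happens earlier, in a context that is allowed to block.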
Another problem exposed by the maple tree has to do with its practice of freeing objects with kfree_rcu() — an approach taken often in kernel code. The problem is that memory freed in this way is not immediately made available for other uses; it must wait for an RCU grace period to pass first. That can lead to an overflow of the per-CPU arrays used by kfree_rcu(), causing flushing and, perhaps, a quick refill starting the cycle all over again. To complicate the issue on Android, RCU callbacks are only run on some CPUs, which isn’t useful for processing the per-CPU arrays on the CPUs that don’t run them.
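The overflow behavior described above can be illustrated with a small userspace model. Everything here is an assumption made for the sketch: `struct pending_frees`, its tiny capacity, and the idea that an overflow is handled by waiting out a grace period on the spot. The real slow path in the kernel is more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PENDING_CAPACITY 4

/* Hypothetical model of one CPU's array of objects awaiting a grace period. */
struct pending_frees {
    void *objs[PENDING_CAPACITY];
    size_t count;
    size_t overflows;   /* how often a free arrived with the array full */
};

/* Model of the grace period ending: only now is the memory really released. */
void grace_period_elapsed(struct pending_frees *p)
{
    while (p->count)
        free(p->objs[--p->count]);
}

/* Model of a deferred free: the object is parked, not returned to the heap. */
void deferred_free(struct pending_frees *p, void *obj)
{
    assert(p != NULL);
    if (p->count == PENDING_CAPACITY) {
        /* Overflow: this model flushes immediately, as if a grace
         * period had just passed, then starts filling again. */
        p->overflows++;
        grace_period_elapsed(p);
    }
    p->objs[p->count++] = obj;
}
```

A workload that frees quickly keeps hitting the overflow branch, which is the flush-and-refill cycle the paragraph describes.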
The plan is to create a kfree_rcu() variant that puts objects in an array and sets them aside to be freed as a whole. Once that has happened, the entire array can be put back into the pool and made available to all CPUs. This array is to be called a “sheaf”; it will be stored in a per-node “barn”. One potential problem is that it may become necessary to allocate a new sheaf while freeing objects; allocations in the freeing path need to be avoided whenever possible. The group talked about alternatives for a while without coming to any conclusions.
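A rough sketch of the sheaf and barn arrangement follows, under stated assumptions: the struct layouts, capacities and function names are invented for illustration, locking is omitted, and the grace period is represented only by the moment the caller chooses to invoke `barn_put_full()`. The session did not settle on a design, so this is one possible reading rather than the planned implementation.

```c
#include <assert.h>
#include <stddef.h>

#define SHEAF_CAPACITY 4
#define BARN_CAPACITY  8

/* Hypothetical model: a sheaf is a fixed-size array of object pointers. */
struct sheaf {
    void *objs[SHEAF_CAPACITY];
    size_t count;
};

/* Hypothetical model: a per-node barn holds sheaves any CPU may take. */
struct barn {
    struct sheaf *full[BARN_CAPACITY];
    size_t nr_full;
};

/* Add a freed object to a sheaf. Returns 1 when the sheaf has just become
 * full and should be set aside to wait for a grace period, 0 if there is
 * still room, and -1 if the sheaf was already full. */
int sheaf_add(struct sheaf *s, void *obj)
{
    assert(s != NULL);
    if (s->count == SHEAF_CAPACITY)
        return -1;
    s->objs[s->count++] = obj;
    return s->count == SHEAF_CAPACITY;
}

/* Called once the grace period for a full sheaf has passed: the whole
 * sheaf goes into the barn in one step. Returns -1 if the barn is full. */
int barn_put_full(struct barn *b, struct sheaf *s)
{
    if (b->nr_full == BARN_CAPACITY)
        return -1;
    b->full[b->nr_full++] = s;
    return 0;
}

/* Any CPU on the node can pick up a full sheaf and allocate from it. */
struct sheaf *barn_get_full(struct barn *b)
{
    return b->nr_full ? b->full[--b->nr_full] : NULL;
}

/* Allocation from a sheaf is a simple pop. */
void *sheaf_take(struct sheaf *s)
{
    return s->count ? s->objs[--s->count] : NULL;
}
```

The case where `sheaf_add()` returns -1 is where the concern raised in the session shows up: the freeing path would need a fresh, empty sheaf, and obtaining one may itself require an allocation.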
Meanwhile, Babka is not satisfied with removing just SLOB and SLAB; next on the target list is the special allocator used by the BPF subsystem. This allocator is intended to succeed in any calling context, including in non-maskable interrupts (NMIs). BPF maintainer Alexei Starovoitov is evidently in favor of this removal if SLUB is able to handle the same use cases. The BPF allocator currently adds an llist_node structure to allocated objects, making them larger; switching to SLUB would eliminate that overhead. It would also serve to make SLUB NMI-safe and remove the need to maintain yet another allocator.
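The size overhead from an embedded list node is easy to see in a small model. The structs below are illustrative only: `model_llist_node` is a stand-in single-pointer node, the 64-byte payload is an arbitrary choice, and placing the node in front of the payload is an assumption of this sketch rather than a statement about how the BPF allocator lays out its objects.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a single-pointer list node. */
struct model_llist_node {
    struct model_llist_node *next;
};

/* What the caller asked for: 64 bytes of payload (arbitrary example). */
struct payload {
    char data[64];
};

/* What gets allocated when a list node rides along with every object. */
struct tagged_object {
    struct model_llist_node node;
    struct payload payload;
};

static_assert(sizeof(struct tagged_object) > sizeof(struct payload),
              "the embedded node makes every object larger");

/* Extra bytes carried by each object because of the embedded node. */
size_t per_object_overhead(void)
{
    return sizeof(struct tagged_object) - sizeof(struct payload);
}
```

In an allocator that rounds requests up to fixed size classes, even one extra pointer can push an object into the next class, so the real cost may exceed the value this function returns.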
Babka would also like to integrate the objpool allocator, which was added to the 6.7 kernel without any consultation with the memory-management developers at all. Finally, as the session ran out of time, Babka mentioned the possibility of eventually integrating the mempool subsystem (which is another way of preallocating objects). The SLUB allocator could set aside objects for all of the mempools in the system, reducing the overhead as a whole. That, though, looks like a topic for discussion at the 2025 summit.
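For readers unfamiliar with the general idea behind a memory pool, here is a minimal userspace model of one that keeps a private reserve. It does not use the kernel's mempool API: the names, the reserve size of two, and the `simulate_oom` knob are hypothetical, and `malloc()` stands in for the underlying allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_MIN 2

/* Hypothetical model of a pool that keeps POOL_MIN objects in reserve. */
struct model_pool {
    void *reserve[POOL_MIN];
    size_t nr;           /* objects currently in the reserve */
    size_t obj_size;
    int simulate_oom;    /* test knob: make the normal allocator "fail" */
};

/* Fill the reserve up front. Returns 0 on success, -1 on failure. */
int pool_init(struct model_pool *p, size_t obj_size)
{
    assert(obj_size > 0);
    p->nr = 0;
    p->obj_size = obj_size;
    p->simulate_oom = 0;
    while (p->nr < POOL_MIN) {
        void *obj = malloc(obj_size);
        if (!obj)
            return -1;
        p->reserve[p->nr++] = obj;
    }
    return 0;
}

/* Try the normal allocator first; dip into the reserve only if it fails. */
void *pool_alloc(struct model_pool *p)
{
    void *obj = p->simulate_oom ? NULL : malloc(p->obj_size);
    if (!obj && p->nr)
        obj = p->reserve[--p->nr];
    return obj;
}

/* Refill the reserve before giving memory back to the normal allocator. */
void pool_free(struct model_pool *p, void *obj)
{
    if (p->nr < POOL_MIN)
        p->reserve[p->nr++] = obj;
    else
        free(obj);
}

/* Release the reserve when the pool is torn down. */
void pool_destroy(struct model_pool *p)
{
    while (p->nr)
        free(p->reserve[--p->nr]);
}
```

Each pool in this model pins its own reserve whether or not it is ever needed; having the slab allocator hold the reserved objects for all pools is the consolidation Babka floated.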
