Support for the TSO memory model on Arm CPUs

Original article: https://lwn.net/Articles/970907/ (by Jonathan Corbet, April 26, 2024)

At the CPU level, a memory model describes, among other things, the amount of freedom the processor has to reorder memory operations. If low-level code does not take the memory model into account, unpleasant surprises are likely to follow. Naturally, different CPUs offer different memory models, complicating the portability of certain types of concurrent software. To make life easier, some Arm CPUs offer the ability to emulate the x86 memory model, but efforts to make that feature available in the kernel are running into opposition.
CPU designers will do everything they can to improve performance. With regard to memory accesses, “everything” can include caching operations, executing them out of order, combining multiple operations into one, and more. These optimizations do not affect a single CPU running in isolation, but they can cause memory operations to be visible to other CPUs in a surprising order. Unwary software running elsewhere in the system may see memory operations in an order different from what might be expected from reading the code; this article describes one simple scenario for how things can go wrong, and this series on lockless algorithms shows in detail some of the techniques that can be used to avoid problems related to memory ordering.
The x86 architecture implements a model that is known as “total store ordering” (TSO), which guarantees that writes (stores) will be seen by all CPUs in the order they were executed. Reads, too, will not be reordered, but the ordering of reads and writes relative to each other is not guaranteed. Code written for a TSO architecture can, in many cases, omit the use of expensive barrier instructions that would otherwise be needed to force a specific ordering of operations.
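The one reordering that TSO still permits, a later load passing an earlier store to a different address, is usually illustrated with the "store buffering" litmus test. The following is a minimal sketch in C11 with POSIX threads; the function and variable names are made up for this illustration. Each thread stores to one variable and then loads the other. Without a full barrier, both loads may return 0, even on x86; with a sequentially consistent fence between the store and the load, that outcome is forbidden.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Shared state for one round of the store-buffering litmus test. */
static atomic_int x, y;
static int r1, r2;              /* written by the threads, read after join */
static int use_full_barrier;    /* set before the threads are created */

static void *thread_a(void *arg)
{
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    if (use_full_barrier)
        atomic_thread_fence(memory_order_seq_cst);  /* orders store before load */
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *thread_b(void *arg)
{
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    if (use_full_barrier)
        atomic_thread_fence(memory_order_seq_cst);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

/*
 * Run the test `rounds` times and count how often both threads read 0.
 * With full_barrier != 0 the C11 memory model forbids that outcome,
 * so the count is always 0. Without it, the count may be nonzero.
 */
int count_both_zero(int rounds, int full_barrier)
{
    int seen = 0;

    use_full_barrier = full_barrier;
    for (int i = 0; i < rounds; i++) {
        pthread_t a, b;
        int rc;

        atomic_store(&x, 0);
        atomic_store(&y, 0);
        rc = pthread_create(&a, NULL, thread_a, NULL);
        assert(rc == 0);
        rc = pthread_create(&b, NULL, thread_b, NULL);
        assert(rc == 0);
        (void)rc;
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)
            seen++;
    }
    return seen;
}
```

Calling `count_both_zero(n, 0)` can return a nonzero count on both x86 and Arm hardware, although how often the reordering shows up depends on timing; `count_both_zero(n, 1)` always returns 0.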
The Arm memory model, instead, is weaker, giving the CPU more freedom to move operations around. The benefits from this design are a simpler implementation and the possibility for better performance in situations where ordering guarantees are not needed (which is most of the time). The downsides are that concurrent code can require a bit more care to write correctly, and code written for a stricter memory model (such as TSO) will have (possibly subtle) bugs when run on an Arm CPU.
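A common pattern that depends on this difference is "message passing": one thread writes some data and then sets a flag, and another thread waits for the flag before reading the data. On TSO hardware the two stores become visible in program order and the two loads are not reordered, so the pattern works without barrier instructions; on a weakly ordered Arm CPU either pair may be reordered unless the code asks for ordering. The sketch below (names are illustrative only) shows the portable C11 form, using a release store and an acquire load. Note that C code needs these annotations on any architecture, since the compiler is also free to reorder plain accesses.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;         /* plain data, published via the flag */
static atomic_int ready;    /* 0 = not yet published, 1 = published */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;
    /* Release: the payload write cannot become visible after the flag. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    int *out = arg;

    /* Acquire: the payload read below cannot be performed before this load. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;   /* spin until the producer publishes */
    *out = payload;
    return NULL;
}

/* Runs one producer/consumer pair and returns the value the consumer saw. */
int run_message_passing(void)
{
    pthread_t p, c;
    int seen = -1;
    int rc;

    payload = 0;
    atomic_store(&ready, 0);
    rc = pthread_create(&c, NULL, consumer, &seen);
    assert(rc == 0);
    rc = pthread_create(&p, NULL, producer, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

If both accesses to `ready` used `memory_order_relaxed` instead, the read of `payload` would be a data race, and on a weakly ordered CPU the consumer could observe the flag set while still reading the old payload.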
The weaker Arm model is rarely a problem, but it seems there is one situation where problems arise: emulating an x86 processor. If an x86 emulator does not also emulate the TSO memory model, then concurrent code will likely fail, but emulating TSO, which requires inserting memory barriers, creates a significant performance penalty. It seems that there is one type of concurrent x86 code — games — that some users of Arm CPUs would like to be able to run; those users, strangely, dislike the prospect of facing the orc hordes in the absence of either performance or correctness.
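To see where the penalty comes from, consider how an emulator might translate guest memory accesses when the host has no hardware TSO mode. The sketch below is an assumption for illustration, not how any particular emulator works: every guest load becomes an acquire load and every guest store becomes a release store. That keeps load-load, load-store, and store-store pairs in order while still allowing a later load to pass an earlier store, which roughly approximates TSO. The helper names `guest_load32()` and `guest_store32()` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helpers an emulator could emit for every 32-bit guest
 * memory access when the host CPU is weakly ordered.
 */

/* Guest store: earlier guest accesses may not be reordered after it. */
void guest_store32(_Atomic uint32_t *addr, uint32_t value)
{
    assert(addr != NULL);
    atomic_store_explicit(addr, value, memory_order_release);
}

/* Guest load: later guest accesses may not be reordered before it. */
uint32_t guest_load32(_Atomic uint32_t *addr)
{
    assert(addr != NULL);
    return atomic_load_explicit(addr, memory_order_acquire);
}
```

On AArch64, compilers typically turn these into load-acquire and store-release instructions rather than ordinary loads and stores, so every emulated access pays for ordering that most of them do not need. A hardware TSO mode lets the emulator emit plain loads and stores instead.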

TSO on Arm

As it happens, some Arm CPU vendors understand this problem and have, as Hector Martin described in this patch series, implemented TSO memory models in their processors. Some NVIDIA and Fujitsu CPUs run with TSO at all times; Apple’s CPUs provide it as an optional feature that can be enabled at run time. Martin’s purpose is to make this capability visible to, and controllable by, user space.
The series starts by adding a couple of new prctl() operations. PR_GET_MEM_MODEL will return the current memory model implemented by the CPU; that value can be either PR_SET_MEM_MODEL_DEFAULT or PR_SET_MEM_MODEL_TSO. The PR_SET_MEM_MODEL operation will attempt to enable the requested memory model, with the return code indicating whether it was successful; it is allowed to select a stricter memory model than requested. For the always-TSO CPUs, requesting TSO will obviously succeed. For Apple CPUs, requesting TSO will result in the proper CPU bits being set. Asking for TSO on a CPU that does not support it will, as expected, fail.
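From user space, the proposed interface might be used roughly as follows. This is a sketch under stated assumptions: the operations are not in mainline kernels, the numeric values below are not defined by any mainline header and are placeholders based on the out-of-tree series, and the wrapper names are invented for this example. On a kernel without the patches, both calls simply fail.

```c
#include <sys/prctl.h>

/*
 * ASSUMPTION: these constants come from the proposed (unmerged) series.
 * The numeric values are placeholders for illustration; check the
 * headers of the kernel you actually run before relying on them.
 */
#ifndef PR_SET_MEM_MODEL
#define PR_GET_MEM_MODEL          0x6d4d444c
#define PR_SET_MEM_MODEL          0x4d4d444c
#define PR_SET_MEM_MODEL_DEFAULT  0
#define PR_SET_MEM_MODEL_TSO      1
#endif

/*
 * Ask the kernel for the TSO memory model.
 * Returns 1 if the request succeeded, 0 if the kernel or CPU refused
 * (for example, on an unpatched kernel or a CPU without TSO support).
 */
int try_enable_tso(void)
{
    if (prctl(PR_SET_MEM_MODEL, (unsigned long)PR_SET_MEM_MODEL_TSO,
              0UL, 0UL, 0UL) != 0)
        return 0;
    return 1;
}

/*
 * Query the current memory model.
 * Returns PR_SET_MEM_MODEL_DEFAULT or PR_SET_MEM_MODEL_TSO on success,
 * or -1 (with errno set) if the operation is not supported.
 */
int current_mem_model(void)
{
    return prctl(PR_GET_MEM_MODEL, 0UL, 0UL, 0UL, 0UL);
}
```

An emulator would presumably call something like `try_enable_tso()` once at startup and fall back to emitting explicit barriers if it returns 0.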
Martin notes that the code is not new: “This series has been brewing in the downstream Asahi Linux tree for a while now, and ships to thousands of users”. Interestingly, Zayd Qumsieh had posted a similar patch set one day earlier, but that version only implemented the feature for Linux running in virtual machines on Apple CPUs.
Unfortunately for people looking forward to faster games on Apple CPUs, neither patch set is popular with the maintainers of the Arm architecture code in the kernel. Will Deacon expressed his “strong objection”, saying that this feature would result in a fragmentation of user-space code. Developers, he said, would just enable the TSO bit if it appears to make problems go away, resulting in code that will fail, possibly in subtle ways, on other Arm CPUs. Catalin Marinas, too, indicated that he would block patches making this sort of implementation-defined feature available.
Martin responded that fragmentation is unlikely to be a problem, and pointed to the different page sizes supported by some processors (including Apple’s) as an example of how these incompatibilities can be dealt with. He said that, so far, nobody has tried to use the TSO feature for anything that is not an emulator, so abuse in other software seems unlikely. Keeping it out, he said, will not improve the situation:

There’s a pragmatic argument here: since we need this, and it absolutely will continue to ship downstream if rejected, it doesn’t make much difference for fragmentation risk does it? The vast majority of Linux-on-Mac users are likely to continue running downstream kernels for the foreseeable future anyway to get newer features and hardware support faster than they can be upstreamed. So not allowing this upstream doesn’t really change the landscape vis-a-vis being able to abuse this or not, it just makes our life harder by forcing us to carry more patches forever.

Deacon, though, insisted that, once a feature like this is merged, it will find uses in other software “and we’ll be stuck supporting it”.
If this patch is not acceptable, it is time to think about alternatives. One is to, as Martin described, just keep it out-of-tree and ship it on the distributions that actually run on that hardware. A long history of addition by distributions can, at times, eventually ease a patch’s way past reluctant maintainers. Another might be to just enable TSO unconditionally on Apple CPUs, but that comes with an overall performance penalty — about 9%, according to Martin. Another possibility was mentioned by Marc Zyngier, who suggested that virtual machines could be started with TSO enabled, making it available to applications running within while keeping the kernel out of the picture entirely.
This seems like the kind of discussion that does not go away quickly. One of the many ways in which Linux has stood out over the years is in its ability to allow users to make full use of their hardware; refusing to support a useful hardware feature runs counter to that history. The concerns about potential abuse of this feature are also based in long experience, though. This is a case where the development community needs to repeat another part of its long history by finding a solution that makes the needed functionality available in a supportable way.
