Understanding Atomics and Memory Ordering
Atomics and Memory Ordering always feel like an unapproachable topic. In the sea of poor explanations, I wish to add another by describing how I reason about all of this mess. This is only my understanding, so if you need a better/more formal explanation, I recommend reading through the memory model for your given programming language. In this case, it would be the C11 Memory Model described at cppreference.com.
Shared Memory
Software and hardware are getting closer to the limits of performance when it comes to single-threaded execution of code. In order to continue scaling compute performance, a popular solution is to introduce multiple single-threaded execution units - or multi-threading. This form of computation manifests itself at different abstraction levels, from multiple cores in a CPU to multiple CPUs in a machine and even multiple machines across a network. This post will be focusing more on cores in a CPU, referring to them as “threads”.
For some workloads, the tasks can be divided cleanly and split off to the threads for execution. Such tasks are known as embarrassingly parallel and need not communicate with each other. This is the ideal that multithreaded algorithms should strive for since it takes advantage of all the existing optimizations available for single-threaded execution. However, this isn't always possible, and it's sometimes necessary for tasks to communicate and coordinate with each other, which is why we need to share memory between threads.
Communication is hard when your code is running in a preemptive scheduling setting. Such an environment means that, at any point, your code can be interrupted in order for other code to run. In applications, the operating system kernel can decide to switch from running your program to run another. In the kernel, hardware can switch from running kernel code to running interrupt handler code. Switching tasks around like this is known as concurrency and in order to synchronize/communicate, we need a way to exclude that concurrency for a small time frame or we risk operating with incomplete/partial data.
Atomics
Fortunately, CPUs supply software with special instructions to operate on shared memory which can't be interrupted. These are known as atomic memory operations and fit into three categories: Loads, Stores, and ReadModifyWrites (RMW). The first two are self-explanatory. RMW is also pretty descriptive: it allows you to load data from memory, operate on the data, and store the result back into memory - all atomically. You may know RMW operations as atomic increment, swap, or compare and swap.
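The three categories can be sketched with C11's `<stdatomic.h>`, which exposes them directly (the function names below are from the standard library; `demo_atomics` is just a hypothetical wrapper for illustration):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* The three categories of atomic memory operations in C11. */
int demo_atomics(void) {
    atomic_int counter = 0;

    atomic_store(&counter, 5);            /* Store */
    int seen = atomic_load(&counter);     /* Load: reads 5 */

    /* ReadModifyWrite (RMW): each call loads, modifies, and stores
       back as one indivisible operation, returning the old value. */
    int old = atomic_fetch_add(&counter, 1);   /* increment: returns 5 */
    int prev = atomic_exchange(&counter, 42);  /* swap: returns 6 */

    /* compare-and-swap: succeeds only if counter still holds `expected` */
    int expected = 42;
    bool ok = atomic_compare_exchange_strong(&counter, &expected, 7);

    return seen + old + prev + (ok ? 1 : 0) + atomic_load(&counter);
}
```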
To do something "atomically" means that it must happen (or be observed to happen) in its entirety or not at all. This implies that it cannot be interrupted. When something is "atomic", tearing (i.e. partial completion) of the operation cannot be observed. Atomic operations allow us to write code that can work with shared memory in a way that's safe against concurrent interruption.
Another thing about atomics is that they're the only sound (i.e. correctly defined) way to interact with shared memory when there's at least one writer and possibly multiple readers/writers to the shared memory. Trying to do so without atomics is considered a data race which is undefined behavior (UB). UB is the act of relying on an assumption outside of your target program model (in our case, the C11 memory model). Doing so is unreliable as the compiler or CPU is allowed to do anything outside of its model.
Data races, and the UB they imply, aren't just a theoretical issue. One of the single-threaded optimizations I mentioned earlier involves either the CPU or the compiler caching memory reads and writes. If you don't use atomic operations, the operation itself could be elided and replaced with its cached result, which could break the logic of your code fairly easily:
# should be an atomic_load(), but it's a data race
while (not load(bool)):
    continue
# a potential single-threaded optimization
cached = load(bool)
while (not cached): # possibly infinite loop!
    continue
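For contrast, here is a minimal C11 sketch of the corrected loop, assuming POSIX threads are available (`wait_for_flag` and `setter` are hypothetical names for illustration). Spinning on an `atomic_bool` forces a real load on every iteration, so the compiler may not hoist it out of the loop:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool ready = false;

static void *setter(void *arg) {
    (void)arg;
    atomic_store(&ready, true);   /* visible to the spinning thread */
    return NULL;
}

int wait_for_flag(void) {
    pthread_t t;
    pthread_create(&t, NULL, setter, NULL);
    /* atomic_load may not be elided/cached: each iteration re-reads memory */
    while (!atomic_load(&ready)) {
        /* spin */
    }
    pthread_join(t, NULL);
    return 1;
}
```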
Reordering
Atomics solve communication only for atomically accessed memory, but not all memory being communicated can be accessed atomically. CPUs generally expose atomic operations for memory that's at most a few bytes large. Trying to do any other sort of general purpose memory communication means we need a way to make this memory available to threads by other means.
Making memory available to other threads is actually trickier than it sounds. Let's check out this code example:
data = None
has_data = False
# Thread 1
write(&data, "hello")
atomic_store(&has_data, True)
# Thread 2
if atomic_load(&has_data):
    d = read(&data)
    assert(d == "hello")
At first glance, this looks like it would work. Even if each thread were preempted between each instruction (line of code here), it seems the assert() should always succeed. Based on my wording, you've probably caught on that this assert() can actually fail! The reason for this is due to another single-threaded optimization called reordering.
Hardware (CPU) or software (the compiler as well) can decide to move around (i.e. “reorder”) your code and instructions any way they please as long as the end result is the same as the source code’s intent. This sort of “instruction scheduling” freedom allows for a variety of optimizations to take place.
One example of reordering is via speculative execution. This is when the CPU starts executing code that hasn't been reached yet, in the opportunistic chance that the results can be ready when that code is eventually reached. This is an amazing single-threaded throughput optimization, but it means that the atomic_store() can be started before the write(), or the read() can be started before the atomic_load(); both of which could make the assert() fail.
Another example of reordering is by CPU caches. CPUs don't read/write directly to shared memory since that's relatively slow. Instead, each CPU core has its own fast-access, local memory called cache. Most memory operations are performed on a CPU's cache and eventually flushed to / refreshed from other caches in a process called Cache Coherency. In our example, the atomic_store() could be flushed from cache to shared memory before the write() is (e.g. if flushing is done LIFO), or the atomic_load() could be refreshed in cache before the read() is; both of which could cause the assert() to fail.
Even the compiler can reorder instructions, but only those without relationships called dependencies. One instruction (line of code) is said to "depend" on a previous instruction if it uses the result of the previous one or if the previous one has a side effect. The compiler is free to reorder instructions that don't share a dependency, but not across one. This means a = 5; b = 10; can be reordered to b = 10; a = 5;, which keeps the same semantics (achieving the same thing) since "a" and "b" don't share a dependency with each other. If it were instead a = 5; b = a + 1; then "a" can't be moved after "b" since "b" has a dependency on "a" and moving it wouldn't make logical sense. In our example, atomic_store() doesn't have a dependency on write(), so it can be moved around, which can make the assert() fail.
At this point it should be clear that instruction reordering is a thing and, when interacting with shared memory, you have to be aware of it. The problem is that atomic operations on their own don't prevent reordering. We need an additional concept for atomics to do this. In C11, atomic operations take in another parameter called "memory ordering" which helps solve this problem.
In our previous code example, there were two main issues: one of reordering and one of visibility. Memory orderings solve them by preventing code from being reordered around atomic operations and ensures that certain data or operations become visible or get conceptually "flushed/reloaded from cache". Lets see what this looks like.
Release and Acquire
We'll introduce two types of memory orderings for now: Acquire and Release. Release goes on atomic stores and ensures that all memory operations declared before it actually happen before it. Acquire goes on atomic loads and ensures all memory operations declared after actually happen after it. This solves the reordering problem.
We then declare one more constraint: All memory operations before a given Release can be observed to happen-before a matching Acquire. You could think of it as changes from the Release becoming visible in a git push manner to the Acquire which does a sort of git pull. This solves the visibility problem.
Let's add these to our code example:
data = None
has_data = False
# Thread 1
write(&data, "hello")
atomic_store(&has_data, True, Release)
# Thread 2
if atomic_load(&has_data, Acquire):
    d = read(&data)
    assert(d == "hello")
Note that Release and Acquire don't do any sort of "waiting" or "blocking" for the data to become ready. They aren't replacements for known synchronization primitives. Instead, they ensure that if our atomic_load() sees has_data as True, then it's also guaranteed to see write(&data, "hello") thanks to the matching Acquire and Release barriers, so our assert should never fail.
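The handoff above maps directly onto C11's explicit orderings. A minimal sketch, assuming POSIX threads and spinning on the flag so the result is deterministic (`producer`/`consume` are hypothetical names):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static int data_value;                /* plain, non-atomic memory */
static atomic_bool has_data = false;

static void *producer(void *arg) {
    (void)arg;
    data_value = 42;  /* the write(&data, ...) from the example */
    /* Release: publishes the plain write above along with the flag */
    atomic_store_explicit(&has_data, true, memory_order_release);
    return NULL;
}

int consume(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    /* Acquire: once true is observed, the write to data_value is too */
    while (!atomic_load_explicit(&has_data, memory_order_acquire)) {}
    int d = data_value;   /* guaranteed to see 42 */
    pthread_join(t, NULL);
    return d;
}
```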
For ReadModifyWrite (RMW) atomic instructions, they can also take in a memory ordering called AcqRel. Given RMW operations conceptually do both an atomic load and an atomic store, AcqRel makes both operations Acquire and Release respectively. This is useful when you want an atomic operation which both 1) makes memory available to other threads via Release and 2) sees memory made available by other threads via Acquire.
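One common shape for this is a "last one out" pattern. A hedged sketch, assuming POSIX threads (the names `worker`/`run_workers` are invented for illustration): each worker publishes its slot with a plain write, then does an AcqRel increment. The increment Releases that worker's write and Acquires the slots of workers that incremented earlier, so whichever worker brings the counter to N can safely read every slot:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

enum { N = 4 };
static int slots[N];               /* plain memory, one slot per worker */
static atomic_int finished = 0;
static atomic_int total = -1;      /* written by whichever worker is last */

static void *worker(void *arg) {
    int id = (int)(long)arg;
    slots[id] = id + 1;            /* plain write, published below */
    /* AcqRel RMW: Release our slot write, Acquire earlier workers' */
    int prev = atomic_fetch_add_explicit(&finished, 1, memory_order_acq_rel);
    if (prev == N - 1) {           /* last worker: all slots now visible */
        int sum = 0;
        for (int i = 0; i < N; i++) sum += slots[i];
        atomic_store(&total, sum);
    }
    return NULL;
}

int run_workers(void) {
    pthread_t t[N];
    for (long i = 0; i < N; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < N; i++) pthread_join(t[i], NULL);
    return atomic_load(&total);    /* 1 + 2 + 3 + 4 = 10 */
}
```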
Fences and Variables
You'll notice that I've been saying "matching Acquire/Release". For our examples, the matching comes from the load and store using the same "atomic variable" (&has_data). Releases and Acquires on different atomic variables don't synchronize with each other; it has to be the same atomic variable.
There's an exception to the rule which manifests itself as fences. Fences are a way to establish memory orderings of normal and atomic memory operations without necessarily associating with one given memory op.
Fences are a bit tricky for me as I have a hard time describing them, but they essentially create the happens-before relationship to surround atomics in a way that corresponds to the memory ordering being used:
* A fence(Release) creates a happens-before relationship with another fence(Acquire).
* A fence(Release) makes subsequent non-Release atomic stores into Release if they have a matching Acquire atomic load or a matching fence(Acquire).
* A fence(Acquire) makes previous non-Acquire atomic loads into Acquire if they have a matching Release atomic store or a matching fence(Release).
Here's an example of how we could substitute the per-operation memory orderings with fences:
data = None
has_data = False
# Thread 1
write(&data, "hello")
fence(Release)
atomic_store(&has_data, True)
# Thread 2
if atomic_load(&has_data):
    fence(Acquire)
    d = read(&data)
    assert(d == "hello")
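In C11, standalone fences are spelled `atomic_thread_fence`. The same handoff with fences might look like the following sketch, assuming POSIX threads and a spin loop for determinism (`publisher`/`receive` are hypothetical names). The relaxed atomic store/load carry no ordering themselves; the fences supply it:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static int payload;                 /* plain, non-atomic memory */
static atomic_bool flag = false;

static void *publisher(void *arg) {
    (void)arg;
    payload = 7;
    atomic_thread_fence(memory_order_release);   /* fence(Release) */
    atomic_store_explicit(&flag, true, memory_order_relaxed);
    return NULL;
}

int receive(void) {
    pthread_t t;
    pthread_create(&t, NULL, publisher, NULL);
    while (!atomic_load_explicit(&flag, memory_order_relaxed)) {}
    atomic_thread_fence(memory_order_acquire);   /* fence(Acquire) */
    int p = payload;    /* guaranteed to see 7 */
    pthread_join(t, NULL);
    return p;
}
```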
Case Study: Mutex
You may have also noticed that this section is called "Release and Acquire" instead of "Acquire and Release". This is done intentionally, as putting Acquire first often misconstrues the happens-before relationship. Instead of thinking about lock(Acquire) and unlock(Release), it should be thought of as unlock(Release) making critical section changes available to lock(Acquire):
mutex = Mutex()
data = None
# Thread 1 (assume locked)
    data = "hello"
    fence(Release)
    mutex.unlock()
# Thread 2 (assume unlocked)
    mutex.lock()
    fence(Acquire)
    assert(data == "hello")
The Release ordering for a mutex only serves to "Release" the changes to the next mutex locker, who "Acquires" the changes previously released by the last mutex unlocker. This canonically backwards framing better demonstrates the happens-before relationship between Release and Acquire than just saying "lock() acquires and unlock() releases".
What we have created here is called a Partial Ordering. It's an ordering between two sets of (memory) operations. The reason it's "partial" is because it orders between sets instead of the individual operations themselves: The operations before a Release don't need to be observed happening in the order they were described for an Acquire, they just need to be observed to have happened at all.
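This unlock-releases-to-the-next-locker idea is exactly how a minimal spinlock works. A hedged sketch in C11, assuming POSIX threads (`spin_lock`/`spin_unlock` are invented names, not a real library API): lock() is an Acquire RMW, unlock() a Release store, and each unlock publishes the critical-section writes to whichever lock() acquires the mutex next:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool locked = false;
static int shared_counter = 0;     /* plain memory, protected by the lock */

static void spin_lock(void) {
    /* Acquire RMW: loop until we flip false -> true */
    while (atomic_exchange_explicit(&locked, true, memory_order_acquire)) {}
}

static void spin_unlock(void) {
    /* Release store: publishes critical-section writes to the next locker */
    atomic_store_explicit(&locked, false, memory_order_release);
}

static void *incrementer(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        spin_lock();
        shared_counter++;          /* safe: ordered by Acquire/Release pairs */
        spin_unlock();
    }
    return NULL;
}

int run_spinlock_demo(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, incrementer, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    return shared_counter;         /* 4 threads * 1000 = 4000 */
}
```

(A production lock would also back off or park the thread instead of spinning; this only illustrates the orderings.)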
Sequential Consistency
There are cases when you need certain atomic operations to be observed in a given order relative to each other. What we need now is a Total Ordering. This ensures there's some defined ordering between the operations themselves rather than between sets of operations, and is what the SeqCst memory ordering is used for.
Let's see another code example:
head = 0
tail = 0
buf = [...]
# Thread 1
steal():
    h = atomic_load(&head)
    t = atomic_load(&tail)
    if t > h:
        item = buf[h]
        if atomic_cas(&head, h, h + 1):
            return item
    return None
# Thread 2
pop():
    t = tail
    atomic_store(&tail, t - 1)
    h = atomic_load(&head)
    if t > h + 1:
        return buf[t - 1]
    if t == h + 1 and atomic_cas(&head, h, t):
        return buf[t - 1]
    atomic_store(&tail, t)
    return None
This is code taken from the implementation of a LIFO Deque by Chase and Lev. What it does isn't necessarily important, but it serves as a nice example of when SeqCst is actually needed.
For pop(), we want to ensure that the store to tail is observed to happen before the load of head. If not, pop() may not see the items removed by steal(). Let's try to apply Acquire and Release to pop():
atomic_store(&tail, t - 1, Release)
h = atomic_load(&head, Acquire)
This doesn't exactly do what we want: Release prevents operations before the store() from being reordered after it, and Acquire prevents operations after the load() from being reordered before it. There's no guarantee that the store() and load() themselves won't be reordered relative to each other.
        other memory operations
    ^       |            
    |       X     store release----
    |                             |
    ----load acquire    X         |
                        |         v
        other memory operations
In order to ensure that the atomic store() and load() stay in their declared order we either need an Acquire barrier on the store(), which we can semantically achieve using an RMW operation with AcqRel (atomic_swap(&tail, t - 1, AcqRel)), or we need SeqCst.
atomic_store(&tail, t - 1, SeqCst)
h = atomic_load(&head, SeqCst)
SeqCst does two things here: It acts as a Release for stores / Acquire for loads as before, but it also ensures a total-ordering between all SeqCst operations. The total-ordering ensures that the store will be seen before the load for other totally-ordered operations. Because total-ordering only applies to other SeqCst ops, we need to apply SeqCst to everything that relies on the total-ordering. This includes the atomic load/cas in pop() as well as the atomic loads/cas in steal(). The total-ordering property also extends to fence(SeqCst) so we can use those to achieve the same reordering effects:
steal():
  t = atomic_load(&tail)
  fence(SeqCst)
  h = atomic_load(&head)
  ...
pop():
  atomic_store(&tail, t - 1)
  fence(SeqCst)
  h = atomic_load(&head)
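The store-then-load shape is the classic case that needs SeqCst, and it can be reduced to the textbook two-flag litmus test. A hedged C11 sketch, assuming POSIX threads (`store_load_demo` is an invented name): each thread stores its own flag, then loads the other's. The SeqCst total order guarantees at least one thread observes the other's store, so both threads cannot read 0; with only Release/Acquire (or Relaxed), the store and load could reorder and both could read 0:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int x = 0, y = 0;
static int r1, r2;

static void *thread_a(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_seq_cst);
    r1 = atomic_load_explicit(&y, memory_order_seq_cst);
    return NULL;
}

static void *thread_b(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_seq_cst);
    r2 = atomic_load_explicit(&x, memory_order_seq_cst);
    return NULL;
}

int store_load_demo(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* SeqCst forbids r1 == 0 && r2 == 0, so this is always >= 1 */
    return r1 + r2;
}
```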
To be clear, SeqCst shouldn't be used to somehow gain Acquire on stores or Release on loads. That leads to incorrect usage: store(SeqCst); load(Acquire) doesn't ensure that the store won't be reordered after the load(), since the load() isn't part of the total-ordering (it isn't SeqCst itself).
It should instead be used to enforce a total-ordering between multiple atomic variables, combined with partial ordering (Acquire/Release as before) to achieve the desired effect. To emphasize again: the total ordering only applies to other SeqCst atomic operations, or to surrounding ops in relation to a fence(SeqCst). See this issue for more warnings.
Weak orderings
In most cases you probably don't need total-ordering on operations for multiple atomic variables. Having the requirement for SeqCst is pretty rare. In practice, SeqCst is unfortunately often overused and a problematic sign that the programmer wasn't sure what memory ordering to use... Anyway, when you don't want total-ordering over different atomic variables and don't need partial ordering, you should reach for the Relaxed memory ordering (also known as Monotonic under LLVM).
All this does is ensure a total-order between all atomic operations to the same atomic variable. In other words, other memory operations not on the same memory location can be reordered around it. So store(X); load(Y) can be reordered around each other but store(Y); load(Y) can't.
All other memory orderings (Acquire/Release/AcqRel/SeqCst) inherit the Relaxed property of "single variable total-ordering" and are known to be "stronger" than it. Relaxed is useful for things like counters or generic single-atomic data that you just read, update, and check out. You cannot use this to synchronize other normal or atomic memory operations.
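An event counter is the canonical Relaxed use case. A short C11 sketch, assuming POSIX threads (`run_counter_demo` is an invented name): the increments must be atomic so no updates are lost, but they order nothing else around them:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_long events = 0;

static void *count_events(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        /* Relaxed RMW: atomic (no lost updates), but no synchronization */
        atomic_fetch_add_explicit(&events, 1, memory_order_relaxed);
    }
    return NULL;
}

long run_counter_demo(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, count_events, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    /* Every increment is counted: 4 threads * 10000 = 40000 */
    return atomic_load_explicit(&events, memory_order_relaxed);
}
```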
There are even cases where you don't need the total-ordering on the same atomic variable itself and just want to perform some memory operation atomically (i.e. to be free of data races). For this, you would use LLVM's Unordered memory ordering. The need for this ordering is even rarer than the need for SeqCst. Unordered isn't even present in the C11 memory model (which only gets as "weak" as Relaxed).
Hardware Quirks
On modern CPU instruction set architectures (ISA), normal memory operations are atomic by default. The upside is that you don't pay a price for Relaxed/Unordered memory orderings or atomic loads/stores vs normal operations. The downside is that data-races don't exist for the ISA so it's harder to know if you have one or not. Fortunately there are tools which can instrument your memory accesses to detect data races like LLVM's ThreadSanitizer (TSAN).
Certain CPU ISAs are known to have Total-Store-Ordering (TSO). This includes things like x86 and SPARC. Here, on top of normal memory operations being atomic, they also get partial ordering for free. This means loads are Acquire by default and stores are Release by default. As before, you get the benefit of Release/Acquire operations having no overhead (besides inhibiting compiler optimizations), but it also has its downsides. In this case, it lets you be pretty loose with orderings, so Relaxed code that should be Release/Acquire will work there but break on other architectures, making it easy to write code with incorrect memory orderings.
The "other" architectures mentioned are called Weakly-Ordered ISAs. This includes things like ARM, AARCH64, POWERPC, RISCV, MIPS, etc. Here, loads and stores are still atomic by default, but they're only Relaxed and you pay a price for Acquire/Release. This means that getting an ordering wrong gives you a higher chance of observing incorrect behavior. The weaker default orderings theoretically allow more reordering opportunities for the CPU, but this doesn't appear to matter in practice given how much better modern x86 CPUs are at cross-core communication in the general case.
When it comes to Sequential Consistency however, there aren't really any platforms where you get it for free. fence(SeqCst) in particular is generally the most costly, since it often requires a full barrier to implement, which prevents all forms of reordering. On x86, it's achieved with mfence, although it can be done more cheaply using lock-prefixed instructions if you're not synchronizing write-combined memory. SeqCst loads/stores often require either promotion to RMW ops or Acquire/Release barriers to keep their semantics. This may be why SeqCst operations are rumored to be "slow" (they really aren't).
Conclusion
Working with atomic operations requires reasoning about memory very differently than you would normally. You have to take into account both concurrency for the validity of your atomic algorithm, and reordering/visibility for the validity of your algorithm's memory access. It's no wonder that it's considered a challenging topic to tackle.
Hopefully you have Acquired some of this Released information in a way which gives you more visibility into how all of this stuff works. There's more to discuss with atomics than presented here such as how to build correct atomic data structures, handling concurrent memory reclamation, and reducing synchronization. These are all interesting in their own right, but should be saved for another time.