Memory Ordering at Compile Time

Reposted from: Memory Ordering at Compile Time

This article cleared up a lot of my confusion; an absolutely great read! Key points:

1. The compiler may reorder code while optimizing, but optimization follows one principle: the program must still behave correctly when run single-threaded.

2. When writing single-threaded programs, you generally don't notice memory reordering, precisely because of that compiler principle. When writing multithreaded programs, you usually don't need to pay special attention either, because when you use mutexes or semaphores, the compiler suppresses memory reordering around those calls. Only when writing lock-free multithreaded code do you need to be especially careful.

3. The memory ordering a CPU guarantees at run time differs across architectures: x86 CPUs guarantee strong ordering, while Alpha is weakly ordered. To hide this platform inconsistency, C++11 introduced std::atomic operations, which constrain both compiler optimization and the CPU's memory ordering. For example, the famous double-checked singleton pattern is unsafe across platforms, but safe on x86, because x86 itself guarantees strong ordering and needs no extra fence instructions (a portable C++11 sketch follows this list).

4. A function that contains a barrier also acts as a compiler barrier to its callers. Even a function that contains no barrier acts as one: the compiler cannot reorder memory operations across the call, because the function may modify memory used afterwards and the compiler has no way to detect this.
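
Here is a minimal sketch of the double-checked locking pattern from point 3, written portably with C++11 atomics (my addition, not from the original article; the names Singleton, getInstance, s_instance and s_mutex are illustrative). The acquire load pairs with the release store, so it is safe on weakly ordered CPUs too, not just on x86:

#include <atomic>
#include <mutex>

class Singleton
{
public:
    static Singleton* getInstance()
    {
        // Fast path: the acquire load pairs with the release store below.
        Singleton* p = s_instance.load(std::memory_order_acquire);
        if (p == nullptr)
        {
            std::lock_guard<std::mutex> lock(s_mutex);
            p = s_instance.load(std::memory_order_relaxed);  // re-check under the lock
            if (p == nullptr)
            {
                p = new Singleton();
                // Release store: the construction above becomes visible
                // before any other thread can observe a non-null pointer.
                s_instance.store(p, std::memory_order_release);
            }
        }
        return p;
    }

private:
    static std::atomic<Singleton*> s_instance;
    static std::mutex s_mutex;
};

std::atomic<Singleton*> Singleton::s_instance(nullptr);
std::mutex Singleton::s_mutex;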


Between the time you type in some C/C++ source code and the time it executes on a CPU, the memory interactions of that code may be reordered according to certain rules. Changes to memory ordering are made both by the compiler (at compile time) and by the processor (at run time), all in the name of making your code run faster.

The cardinal rule of memory reordering, which is universally followed by compiler developers and CPU vendors, could be phrased as follows:

Thou shalt not modify the behavior of a single-threaded program.

As a result of this rule, memory reordering goes largely unnoticed by programmers writing single-threaded code. It often goes unnoticed in multithreaded programming, too, since mutexes, semaphores and events are all designed to prevent memory reordering around their call sites. It’s only when lock-free techniques are used — when memory is shared between threads without any kind of mutual exclusion — that the cat is finally out of the bag, and the effects of memory reordering can be plainly observed.

Mind you, it is possible to write lock-free code for multicore platforms without the hassles of memory reordering. As I mentioned in my introduction to lock-free programming, one can take advantage of sequentially consistent types, such as volatile variables in Java or C++11 atomics — possibly at the price of a little performance. I won’t go into detail about those here. In this post, I’ll focus on the impact of the compiler on memory ordering for regular, non-sequentially-consistent types.

Compiler Instruction Reordering

As you know, the job of a compiler is to convert human-readable source code into machine-readable code for the CPU. During this conversion, the compiler is free to take many liberties.

One such liberty is the reordering of instructions — again, only in cases where single-threaded program behavior does not change. Such instruction reordering typically happens only when compiler optimizations are enabled. Consider the following function:

int A, B;
 
void foo()
{
    A = B + 1;
    B = 0;
}
If we compile this function using GCC 4.6.1 without compiler optimization, it generates the following machine code, which we can view as an assembly listing using the -S option. The memory store to global variable B occurs right after the store to A, just as it does in the original source code.

$ gcc -S -masm=intel foo.c
$ cat foo.s
        ...
        mov     eax, DWORD PTR _B
        add     eax, 1
        mov     DWORD PTR _A, eax
        mov     DWORD PTR _B, 0
        ...

Compare that to the resulting assembly listing when optimizations are enabled using -O2:

$ gcc -O2 -S -masm=intel foo.c
$ cat foo.s
        ...
        mov     eax, DWORD PTR _B
        mov     DWORD PTR _B, 0
        add     eax, 1
        mov     DWORD PTR _A, eax
        ...
This time, the compiler has chosen to exercise its liberties, and reordered the store to B before the store to A. And why shouldn’t it? The cardinal rule of memory ordering is not broken. A single-threaded program would never know the difference.

On the other hand, such compiler reorderings can cause problems when writing lock-free code. Here’s a commonly-cited example, where a shared flag is used to indicate that some other shared data has been published:

int Value;
int IsPublished = 0;
 
void sendValue(int x)
{
    Value = x;
    IsPublished = 1;
}

Imagine what would happen if the compiler reordered the store to IsPublished before the store to Value. Even on a single-processor system, we’d have a problem: a thread could very well be pre-empted by the operating system between the two stores, leaving other threads to believe that Value has been updated when in fact, it hasn’t.

Of course, the compiler might not reorder those operations, and the resulting machine code would work fine as a lock-free operation on any multicore CPU having a strong memory model, such as an x86/64 — or in a single-processor environment, any type of CPU at all. If that’s the case, we should consider ourselves lucky. Needless to say, it’s much better practice to recognize the possibility of memory reordering for shared variables, and to ensure that the correct ordering is enforced.

Explicit Compiler Barriers

The minimalist approach to preventing compiler reordering is by using a special directive known as a compiler barrier. I’ve already demonstrated compiler barriers in a previous post. The following is a full compiler barrier in GCC. In Microsoft Visual C++, _ReadWriteBarrier serves the same purpose.

int A, B;

void foo()
{
    A = B + 1;
    asm volatile("" ::: "memory");
    B = 0;
}
With this change, we can leave optimizations enabled, and the memory store instructions will remain in the desired order.

$ gcc -O2 -S -masm=intel foo.c
$ cat foo.s
        ...
        mov     eax, DWORD PTR _B
        add     eax, 1
        mov     DWORD PTR _A, eax
        mov     DWORD PTR _B, 0
        ...
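
For comparison, here is a minimal sketch of the same barrier under Microsoft Visual C++, using the _ReadWriteBarrier intrinsic mentioned above (my addition; foo_msvc is just an illustrative name):

#include <intrin.h>

int A, B;

void foo_msvc()
{
    A = B + 1;
    _ReadWriteBarrier();    // compiler barrier only; no CPU fence instruction is emitted
    B = 0;
}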
Similarly, if we want to guarantee our sendValue example works correctly, and we only care about single-processor systems, then at an absolute minimum, we must introduce compiler barriers here as well. Not only does the sending operation require a compiler barrier, to prevent the reordering of stores, but the receiving side needs one between the loads as well.

#define COMPILER_BARRIER() asm volatile("" ::: "memory")
 
int Value;
int IsPublished = 0;
 
void sendValue(int x)
{
    Value = x;
    COMPILER_BARRIER();          // prevent reordering of stores
    IsPublished = 1;
}
 
int tryRecvValue()
{
    if (IsPublished)
    {
        COMPILER_BARRIER();      // prevent reordering of loads
        return Value;
    }
    return -1;  // or some other value to mean not yet received
}
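
To make the intended usage concrete, here is a hedged two-thread sketch built on the functions above (my addition; std::thread and the polling loop are assumptions, and as explained next, this is only sufficient on a single-processor system):

#include <cstdio>
#include <thread>

int main()
{
    std::thread sender([] { sendValue(42); });
    std::thread receiver([]
    {
        int v;
        while ((v = tryRecvValue()) == -1) {}   // spin until IsPublished is set
        std::printf("received %d\n", v);
    });
    sender.join();
    receiver.join();
    return 0;
}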

As I mentioned, compiler barriers are sufficient to prevent memory reordering on a single-processor system. But it’s 2012, and these days, multicore computing is the norm. If we want to ensure our interactions happen in the desired order in a multiprocessor environment, and on any CPU architecture, then a compiler barrier is not enough. We need either to issue a CPU fence instruction, or perform any operation which acts as a memory barrier at runtime. I’ll write more about those in the next post, Memory Barriers Are Like Source Control Operations.

The Linux kernel exposes several CPU fence instructions through preprocessor macros such as smp_rmb, and those macros are reduced to simple compiler barriers when compiling for a single-processor system.
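
A simplified sketch of that idea (my addition, not the actual kernel source; the choice of lfence for the SMP case is illustrative):

#ifdef CONFIG_SMP
#define smp_rmb() asm volatile("lfence" ::: "memory")   // real CPU read fence
#else
#define smp_rmb() asm volatile("" ::: "memory")         // compiler barrier only
#endif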

Implied Compiler Barriers

There are other ways to prevent compiler reordering. Indeed, the CPU fence instructions I just mentioned act as compiler barriers, too. Here’s an example CPU fence instruction for PowerPC, defined as a macro in GCC:

#define RELEASE_FENCE() asm volatile("lwsync" ::: "memory")
Anywhere we place RELEASE_FENCE throughout our code, it will prevent certain kinds of processor reordering in addition to compiler reordering. For example, it can be used to make our sendValue function safe in a multiprocessor environment.

void sendValue(int x)
{
    Value = x;
    RELEASE_FENCE();
    IsPublished = 1;
}
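For a complete handoff, the receiving side needs a matching acquire fence as well; here is a hedged sketch (my addition, not from the article; on PowerPC, lwsync can also serve as the acquire fence, and ACQUIRE_FENCE is an assumed name):

#define ACQUIRE_FENCE() asm volatile("lwsync" ::: "memory")

int tryRecvValue()
{
    if (IsPublished)
    {
        ACQUIRE_FENCE();    // prevent the load of Value from moving before the flag check
        return Value;
    }
    return -1;
}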
In the new C++11 (formerly known as C++0x) atomic library standard, every non-relaxed atomic operation acts as a compiler barrier as well.

#include <atomic>

int Value;
std::atomic<int> IsPublished(0);
 
void sendValue(int x)
{
    Value = x;
    // <-- reordering is prevented here!
    IsPublished.store(1, std::memory_order_release);
}
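The matching receive side (my addition; the article shows only the sending side at this point) uses an acquire load, which pairs with the release store above and prevents the read of Value from being reordered before it:

int tryRecvValue()
{
    if (IsPublished.load(std::memory_order_acquire))
        return Value;
    return -1;  // not yet published
}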
And as you might expect, every function containing a compiler barrier must act as a compiler barrier itself, even when the function is inlined. (However, Microsoft’s documentation suggests that may not have been the case in earlier versions of the Visual C++ compiler. Tsk, tsk!)

void doSomeStuff(Foo* foo)
{
    foo->bar = 5;
    sendValue(123);       // prevents reordering of neighboring assignments
    foo->bar2 = foo->bar;
}

In fact, the majority of function calls act as compiler barriers, whether they contain their own compiler barrier or not. This excludes inline functions, functions declared with the pure attribute, and cases where link-time code generation is used. Other than those cases, a call to an external function is even stronger than a compiler barrier, since the compiler has no idea what the function’s side effects will be. It must forget any assumptions it made about memory that is potentially visible to that function.

When you think about it, this makes perfect sense. In the above code snippet, suppose our implementation of sendValue exists in an external library. How does the compiler know that sendValue doesn’t depend on the value of foo->bar? How does it know sendValue will not modify foo->bar in memory? It doesn’t. Therefore, to obey the cardinal rule of memory ordering, it must not reorder any memory operations around the external call to sendValue. Similarly, it must load a fresh value for foo->bar from memory after the call completes, rather than assuming it still equals 5, even with optimization enabled.

$ gcc -O2 -S -masm=intel dosomestuff.c
$ cat dosomestuff.s
        ...
        mov	ebx, DWORD PTR [esp+32]
        mov	DWORD PTR [ebx], 5            // Store 5 to foo->bar
        mov	DWORD PTR [esp], 123
        call	sendValue                     // Call sendValue
        mov	eax, DWORD PTR [ebx]          // Load fresh value from foo->bar
        mov	DWORD PTR [ebx+4], eax
        ...

As you can see, there are many instances where compiler instruction reordering is prohibited, and even cases where the compiler must reload certain values from memory. I believe these hidden rules form a big part of the reason why people have long been saying that volatile data types in C are not usually necessary in correctly-written multithreaded code.

Out-Of-Thin-Air Stores

Think instruction reordering makes lock-free programming tricky? Before C++11 was standardized, there was technically no rule preventing the compiler from getting up to even worse tricks. In particular, compilers were free to introduce stores to shared memory in cases where there previously was none. Here’s a very simplified example, inspired by the examples provided in multiple articles by Hans Boehm.

int A, B;
 
void foo()
{
    if (A)
        B++;
}

Though it’s rather unlikely in practice, nothing prevents a compiler from promoting B to a register before checking A, resulting in machine code equivalent to the following:

void foo()
{
    register int r = B;    // Promote B to a register before checking A.
    if (A)
        r++;
    B = r;          // Surprise! A new memory store where there previously was none.
}

Once again, the cardinal rule of memory ordering is still followed. A single-threaded application would be none the wiser. But in a multithreaded environment, we now have a function which can wipe out any changes made concurrently to B in other threads — even when A is 0. The original code didn’t do that. This type of obscure, technical non-impossibility is part of the reason why people have been saying that C++ doesn’t support threads, despite the fact that we’ve been happily writing multithreaded and lock-free code in C/C++ for decades.

I don’t know anyone who ever fell victim to such “out-of-thin-air” stores in practice. Maybe it’s just because for the type of lock-free code we tend to write, there aren’t a whole lot of optimization opportunities fitting this pattern. I suppose if I ever caught this type of compiler transformation happening, I would search for a way to wrestle the compiler into submission. If it’s happened to you, let me know in the comments.

In any case, the new C++11 standard explicitly prohibits such behavior from the compiler in cases where it would introduce a data race. The wording can be found in and around section 1.10.22 of the most recent C++11 working draft:

Compiler transformations that introduce assignments to a potentially shared memory location that would not be modified by the abstract machine are generally precluded by this standard.
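
One portable way to rule out such invented stores (my addition, not from the article) is to make B a C++11 atomic; the compiler may not introduce writes to an atomic object that the abstract machine would not perform:

#include <atomic>

std::atomic<int> B(0);
int A;

void foo()
{
    if (A)
        B.fetch_add(1, std::memory_order_relaxed);   // no store to B occurs when A is 0
}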

Why Compiler Reordering?

As I mentioned at the start, the compiler modifies the order of memory interactions for the same reason that the processor does it — performance optimization. Such optimizations are a direct consequence of modern CPU complexity.

I may be going out on a limb, since I wasn’t compiling code at the time, but I somehow doubt that compilers did a whole lot of instruction reordering in the early ’80s, when CPUs had only a few hundred thousand transistors at most. I don’t think there would have been much point. But since then, Moore’s Law has provided CPU designers with about 10000 times the number of transistors to play with, and those transistors have been spent on tricks such as pipelining, memory prefetching, ILP and more recently, multicore. As a result of some of those features, we’ve seen architectures where the order of instructions in a program can make a significant difference in performance.

The first Intel Pentium released in 1993, with its so-called U and V-pipes, was the first processor where I really remember people talking about pipelining and the significance of instruction ordering. More recently, though, when I step through x86 disassembly in Visual Studio, I’m actually surprised how little instruction reordering there is. On the other hand, out of the times I’ve stepped through SPU disassembly on PlayStation 3, I’ve found that the compiler really went to town. These are just anecdotal experiences; they may not reflect the experience of others, and certainly should not influence the way we enforce memory ordering in our lock-free code.

