Beautiful Proof - Data-Race-Free Implies Sequential Consistency
刘未鹏 (Liu Weipeng, pongba)
The Louvre of C++ (C++的罗浮宫, http://blog.csdn.net/pongba)
Disclaimer: To follow the text below, you should first understand the fundamentals of multithreading memory models. I personally recommend starting with the related C++0x memory model proposals if you’re a C++ programmer, or with the Java memory model if you’re a Java programmer. That being said, the principles behind the proof apply to all the so-called data-race-free models.
If you happen to have been slogging through all the glorious proposals for the C++0x memory model and found yourself tripped up by the proof in question, then you’ll find this post useful. Otherwise, I suggest you at least read this paper before going on.
I do assume you appreciate that multithreading is becoming, if not has already become, the next big event and a general skill in the programming field; there are flame wars about it all over the place. And since C++ is one of my favorite programming languages, and, unfortunately, one that does not yet include a multithreading memory model of any sort (well, not until C++0x is published), I’ve kept track of all the progress that has been made on it. It turns out that multithreading is an extremely tricky yet interesting business.
OK, enough said, let’s get down to business. Just one last thing, though: I’m writing this post only to explain the proof informally, so this is not going to be an “Introduction to the C++0x memory model” post (although I do plan to write one of those, as long as there’s enough time); instead, I’ll go without any further preamble. For those of you who aren’t familiar with the upcoming, brand-new C++0x memory model and do consider yourselves C++ programmers (or, you know, programmers in general), I strongly suggest you start now.
Anyway, the original paper is here. Let’s take a look at its conclusion first:
If you stick to simple synchronization rules, that is, using only lock and unlock to avoid data-races, then a consistent execution of your program will exhibit behavior that you can reason about according to one of its corresponding sequentially consistent executions.
Well, put simply, this states that you can regard your program as being executed by sequentially interleaving the actions of all the threads, provided that you use only lock and unlock to avoid data-races.
To state it even more loosely, it actually means that data-race-free implies Sequential Consistency, the most intuitive consistency model.
Actually, this is one of the most fundamentally important conclusions about this mainstream style of multithreading memory model. It has two very nice implications. First, it says that we only need to make our program data-race-free to get sequential consistency; second, it gives the implementations (compilers, VMs, and hardware) great latitude in optimizing the code, mostly via instruction reordering, of course.
For example, let’s consider the code snippet below:
int x, y; // globals

Thread 1          Thread 2
y = 1;            x = 1;
r1 = x;           r2 = y;
If the code is executed sequentially consistently, r1 and r2 can never both be 0. If we observe that r1 is 0, then it must be the case that “r1 = x” precedes “x = 1”; together with the other two relations (“y = 1” precedes “r1 = x”, and “x = 1” precedes “r2 = y”), we can conclude that “y = 1” precedes “r2 = y” (since the relation precedes is transitive under the sequential consistency interpretation), making r2 observe the value 1. The same reasoning applies the other way around.
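If you’d like to see this litmus test in real code, here is a minimal sketch using the atomics proposed for C++0x, where operations on std::atomic default to sequentially consistent ordering (the names std::atomic and std::thread are taken from the working drafts, so treat this as an illustration). The assertion can never fire:

#include <atomic>
#include <thread>
#include <cassert>

std::atomic<int> x(0), y(0); // default operations are sequentially consistent
int r1, r2;

void thread1() {
    y.store(1);
    r1 = x.load();
}

void thread2() {
    x.store(1);
    r2 = y.load();
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
    // Under sequential consistency, at least one of the stores must precede
    // the other thread's load, so r1 and r2 can never both be 0.
    assert(r1 == 1 || r2 == 1);
}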
Unfortunately, however, sequential consistency is too strict in that it prevents many useful optimizations, such as instruction reordering. In a multithreaded program, the parts actually shared by two or more threads are usually rare; but since present CPUs (and compilers) don’t even know that threads exist, achieving sequential consistency would mean compiling/executing the whole program sequentially consistently, not just the inter-thread parts. So relying on the hardware alone is obviously not an option. This leaves us with the only choice: manually specifying which parts of a program are inter-thread. And that’s where all those synchronization instructions kick in.
In the example above, most of today’s compilers could reorder the two actions in each thread, making “r1 == r2 == 0” a perfectly legitimate outcome; and even if your compiler didn’t do it, your CPU would. But if we sprinkle some special crumbs (i.e. synchronization primitives) over our code, we’ll get sequential consistency easily without forcing the whole implementation to obey sequential-consistency rules:
int x, y;
mutex mx, my;

Thread 1          Thread 2
Lock(my);         Lock(mx);
y = 1;            x = 1;
Unlock(my);       Unlock(mx);
Lock(mx);         Lock(my);
r1 = x;           r2 = y;
Unlock(mx);       Unlock(my);
Here, we simply used lock/unlock primitives to avoid the data races that could cause the infamous undefined behavior. Since lock/unlock primitives enforce happens-before relations and, more importantly, act as one-way memory fences, this gets us sequential consistency in addition to data-race-freedom. The reasoning should be pretty simple, so I’ll just leave it to you.
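For concreteness, here is a sketch of that example written against the threading facilities proposed for C++0x (std::thread, std::mutex, std::lock_guard; again working-draft names, not final API):

#include <mutex>
#include <thread>
#include <cassert>

int x = 0, y = 0; // plain, non-atomic globals
int r1, r2;
std::mutex mx, my;

void thread1() {
    { std::lock_guard<std::mutex> g(my); y = 1; }  // Lock(my); y = 1; Unlock(my);
    { std::lock_guard<std::mutex> g(mx); r1 = x; } // Lock(mx); r1 = x; Unlock(mx);
}

void thread2() {
    { std::lock_guard<std::mutex> g(mx); x = 1; }
    { std::lock_guard<std::mutex> g(my); r2 = y; }
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
    // Every access to x and y is protected by its mutex, so the program is
    // data-race-free; by the theorem it behaves like one of its sequentially
    // consistent interleavings, and in none of those can r1 and r2 both be 0.
    assert(r1 == 1 || r2 == 1);
}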
So what’s the big deal? I mean, didn’t people always do that? The big deal is this: we use lock/unlock to avoid all the data races, but as a result we don’t just avoid the data races, we also get the whole sequential consistency property. That’s a really nice guarantee. It means we don’t have to worry about all the gory optimizations that might change the behavior of our code; we just have to make our code data-race-free, and sequential consistency comes as a side effect. Maybe more importantly, the implementations still get to keep their unawareness of threads, so most of the optimizations still happen underneath, but you just don’t have to worry about them anymore.
The principle of a data-race-free memory model is basically this: the compilers and CPUs know nothing about threads; they compile code and execute instructions as if there were just one thread, doing all those crazy optimizations that are only legitimate in a single-threaded context. The programmers, however, put special instructions in their code to prevent the crazy optimizations in certain areas, which, supposedly, are the places where multiple threads communicate with each other. This way we not only preserve the correctness (sequential consistency) of our code but also allow most of the useful optimizations. This has proved to be the most practical trade-off, at least until some super-intelligent compiler or CPU shows up that can efficiently recognize all the inter-thread communication code for us and respect it while doing all the optimizations.
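To make that division of labor concrete, here’s a minimal sketch of the usual flag-based hand-off, with a single std::atomic flag playing the role of the “special instruction” at the communication point (again using the proposed C++0x names):

#include <atomic>
#include <thread>
#include <cassert>

int payload = 0;                // ordinary data; the implementation may optimize
                                // accesses to it as aggressively as it likes
std::atomic<bool> ready(false); // the inter-thread communication point

void producer() {
    payload = 42;      // plain store, freely reorderable among other plain code...
    ready.store(true); // ...but not past this synchronization store
}

void consumer() {
    while (!ready.load()) {} // spin until the flag is set
    assert(payload == 42);   // the store to payload happens-before this load
}

int main() {
    std::thread t(producer), u(consumer);
    t.join();
    u.join();
}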
Anyway, I’ll just get down to the proof of “data-race-free implies SC” before you get all bored ;)
Let’s consider a consistent execution of a program on a given input. Recall that, informally, “consistent” means this particular execution respects all the happens-before relations that are specified in the code using lock and unlock primitives, although it might not respect the is-sequenced-before relations, which is pretty much what instruction reordering is all about.
So, let’s call this particular execution P’, and apparently it’s not exactly sequentially consistent, because presumably there’d be lots of instructions that could and would be reordered. However, what’s important to us is whether it exhibits the exact behavior of one of its sequentially consistent counterparts. For convenience, let’s just call one of the latter P.
But what is this P anyway? How do we get it? Remember we said a consistent execution should respect all the happens-before relations in the original program? So all the happens-before relations in P’ must be preserved in P, which doesn’t contradict the sequential consistency of P. In addition, P will have all the is-sequenced-before relations, which make the actions of each thread appear in the order in which they appear in their own thread. But that’s still not enough; to make P a sequentially consistent execution, all the actions should appear in a total order. Recall that two actions from different threads that aren’t ordered by happens-before could be executed in either order, so why don’t we just order them arbitrarily? This way we keep each thread’s order and still get a total order, as sketched below.
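In other words, P is just a linear extension of the partial order formed by the union of is-sequenced-before and happens-before, and building one is an ordinary topological sort. Here is a hypothetical sketch (actions reduced to integer ids, the edge list assumed to be given):

#include <queue>
#include <utility>
#include <vector>

// Kahn's algorithm: produce one total order consistent with a partial order.
// The edges are the union of the is-sequenced-before and happens-before
// pairs; actions left unordered by both relations come out in an arbitrary
// but fixed relative order, which is exactly the freedom the construction
// of P exploits.
std::vector<int> linearize(int num_actions,
                           const std::vector<std::pair<int, int>>& edges) {
    std::vector<std::vector<int>> succ(num_actions);
    std::vector<int> indegree(num_actions, 0);
    for (const auto& e : edges) {
        succ[e.first].push_back(e.second);
        ++indegree[e.second];
    }
    std::queue<int> ready;
    for (int a = 0; a < num_actions; ++a)
        if (indegree[a] == 0) ready.push(a);
    std::vector<int> total_order;
    while (!ready.empty()) {
        int a = ready.front();
        ready.pop();
        total_order.push_back(a);
        for (int b : succ[a])
            if (--indegree[b] == 0) ready.push(b);
    }
    return total_order; // a valid P whenever the combined relation is acyclic
}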
Now, we’re going to prove that we can use P to reason about the exact behavior of P’, which simply means that, despite all the optimizations that have been applied to our code, we can still regard it as being executed sequentially consistently.
Apparently, from the standpoint of main memory, there are just two kinds of actions performed by a CPU: stores and loads. So an execution of a program can be seen as consisting of only stores and loads, and when we reason about the behavior of a program on a given input, we are actually predicting the value seen by each load.
So let’s consider a load L in P’, and assume the value it sees was generated by a store Svisible. To prove that we can use P to reason about the behavior of P’, we just have to show that Svisible is the last store to the same location that precedes L in P.
Clearly, if Svisible and L are in the same thread in P’, Svisible must be sequenced before L, because otherwise it would either imply a violation of C++03 5p4 (which basically says there shall be no conflicting accesses to the same scalar between two adjacent sequence points) or conflict with the intra-thread semantic constraints (which basically say that each thread should at least appear to execute sequentially, letting each load see the value written by the preceding store).
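For reference, the canonical single-threaded violation of that clause looks like this:

int i = 0;
i = i++; // undefined behavior in C++03: i is modified twice between
         // two adjacent sequence points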
But what if Svisible and L reside in different threads? Then Svisible must happen-before L because of the no-data-race rule: two conflicting accesses in different threads must be ordered by happens-before, or they’d constitute a data race; and L happening-before Svisible is ruled out, since L sees the value Svisible wrote.
So it turns out Svisible either is-sequenced-before L or happens-before L. And here comes the really brilliant part. We already know that P’ respects all the happens-before relations in the original program, so if Svisible happens-before L in P’, it must happen-before L in P. But what if Svisible is-sequenced-before L? Recall that P is sequentially consistent; it respects all the is-sequenced-before relations in the original program. So if Svisible is-sequenced-before L, it must still be sequenced before L in P. Hence, in either case, Svisible comes before L in P.

To reach the final conclusion that we can use P to reason about the behavior of P’, it remains to be shown that no other store to the same location comes sandwiched between Svisible and L in P. That way, when we decide in P that L sees the value written by Svisible, it is actually the case that L sees the value written by Svisible in P’, too, which is what “reasoning about the behavior of P’ using P” is all about.
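If symbols help, here’s the bookkeeping so far (writing $<_P$ for the total order of P, $\xrightarrow{sb}$ for is-sequenced-before, and $\xrightarrow{hb}$ for happens-before; the notation is mine, not the paper’s):

$$\left( S_{visible} \xrightarrow{sb} L \;\lor\; S_{visible} \xrightarrow{hb} L \right) \;\Longrightarrow\; S_{visible} <_P L$$

and what remains to be shown is

$$\neg\exists\, S_{middle} \text{ to the same location}: \quad S_{visible} <_P S_{middle} <_P L.$$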
Let’s use the powerful reductio ad absurdum and assume there is another store Smiddle sitting between Svisible and L in P. Since P is clearly data-race-free (P and P’ are both consistent executions of the original data-race-free program), Smiddle, Svisible, and L must be fully ordered (by is-sequenced-before or happens-before). But Svisible and L are already ordered, so the only possibilities left are:
1. Smiddle is-sequenced-before/happens-before Svisible
2. L is-sequenced-before/happens-before Smiddle
3. Svisible is-sequenced-before/happens-before Smiddle, which in turn is-sequenced-before/happens-before L
The first two cases are impossible because they contradict the ordering in P. Here’s the logic: if Smiddle is-sequenced-before/happens-before Svisible, then it comes before Svisible in P, too (because is-sequenced-before and happens-before are exactly the two kinds of relations that P must respect), contradicting our assumption that Smiddle sits between Svisible and L. The same reasoning applies to the second case.
To show that the last case isn’t possible either, recall that P’ is a consistent execution (which means every load sees the last preceding store to the same location). So if Smiddle really came between Svisible and L in P (in the sense shown by the third case above), then, since P’ has to respect the happens-before relations (and the is-sequenced-before relations in this case), Smiddle would come between them in P’, too, blocking the line of sight from L to Svisible and making L see the value written by Smiddle, not Svisible. That would, in turn, make P’ an inconsistent execution.
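In the same notation as before (with $\to$ standing for “is-sequenced-before or happens-before”), the three cases die as follows:

$$\text{(1)}\; S_{middle} \to S_{visible} \Rightarrow S_{middle} <_P S_{visible}, \;\text{contradicting}\; S_{visible} <_P S_{middle}$$

$$\text{(2)}\; L \to S_{middle} \Rightarrow L <_P S_{middle}, \;\text{contradicting}\; S_{middle} <_P L$$

$$\text{(3)}\; S_{visible} \to S_{middle} \to L \;\text{holds in}\; P' \;\text{too, so}\; L \;\text{would see}\; S_{middle}, \;\text{contradicting the consistency of}\; P'$$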
So it turns out this imaginary Smiddle can appear neither here nor there. It must not exist, which concludes the proof.