Ordering: Why and How?
Our intuitions: Each thread proceeds in program order. Reading a location should return the latest value written (by any thread).
What does "latest" mean exactly? (1) Within a thread, it can be defined by program order; (2) across threads, it is unpredictable without a consistency guarantee, so a Memory Consistency Model is needed.
Single-threaded execution model (C++03): the program behaves as if it were executed exactly as you wrote it. Optimizations cannot be observed.
Optimizations introduce caches and buffers, which lead to memory consistency problems.
Optimizations also add out-of-order pipelining and conditional branching, so memory accesses may be performed out of order.
With multiple cores, this becomes unsafe.
Visibility depends on how writes propagate through the caches.
What about two threads? Optimizations become observable. Memory accesses are interleaved.
The CPU is within its rights to reorder the statements within both Thread P0() and Thread P1(), even on relatively strongly ordered systems such as x86.
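When a given thread observes memory accesses from a different thread, those memory accesses can be (almost) arbitrarily jumbled around. Forms of reordering:
(1) Accesses to different addresses can be reordered in any combination.
(2) Base-plus-offset accesses (array form) can also be reordered.
(3) Accesses to device addresses can also be reordered.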
Base Guarantee
Optimizations may break "naive" concurrent algorithms.
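On a single CPU, accesses to different addresses give a consistent final result no matter how they are optimized.
On a single CPU, accesses to the same address give a consistent final result no matter how they are optimized.
So when should memory barriers be used?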
Memory Order Semantics
The intuitive model is sequential consistency.
The result of any execution is the same as if: (1) the operations of all threads are executed in some sequential order; (2) the operations of each thread appear in this sequence in the order specified by their program.
It requires hardware support.
It guarantees global visibility.
The processor issues accesses one at a time and stalls until each completes. Processor utilization is low (17%-42%) even with caching.
The TSO (Total Store Order) model.
In multithreaded execution, program order is distributed (one order per thread), while memory order is global.
Happens-before semantics.
Inter-thread happens-before requires synchronization.
Synchronize (the easy way)
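Synchronization is used to avoid data races.
Synchronization operations are used to guarantee happens-before.
Mutually exclusive execution of critical code blocks: lock() on the same mutex object.
The std::atomic<> template specializations provide atomic operations for integral, pointer, and floating-point types (floating-point arithmetic since C++20).
Combined with CAS (compare-and-swap) atomic operations, they make it convenient to implement lock-free data structures.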
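As a minimal sketch of mutual exclusion, the block below has several threads increment a plain (non-atomic) counter while holding the same std::mutex. The helper name locked_count and its parameters are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: `threads` workers each add 1 to a shared counter
// `iterations` times, always while holding the same mutex.
long locked_count(int threads, int iterations) {
    assert(threads >= 0 && iterations >= 0);
    long counter = 0;  // plain shared data, protected only by `m`
    std::mutex m;      // every worker locks this same object
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, &m, iterations] {
            for (int i = 0; i < iterations; ++i) {
                // Locks m here; unlocks automatically at the end of the scope.
                std::lock_guard<std::mutex> guard(m);
                ++counter;  // critical section
            }
        });
    }
    // Thread completion synchronizes with join(), so the final value is visible.
    for (auto& th : pool) th.join();
    return counter;
}
```

Calling locked_count(4, 10000) should return 40000, because each unlock() synchronizes with the next lock() on the same mutex and no increment is lost.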
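To illustrate how CAS supports lock-free structures, here is a sketch of a push-only stack in the style of a Treiber stack. The names Node, PushStack, and concurrent_pushes are hypothetical; a concurrent pop is deliberately left out because it needs extra care (the ABA problem and memory reclamation).

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct Node {
    int value;
    Node* next;
};

// Hypothetical push-only lock-free stack.
class PushStack {
public:
    void push(int v) {
        Node* n = new Node{v, head_.load(std::memory_order_relaxed)};
        // CAS loop: if head_ changed, n->next is reloaded with the
        // current head and the exchange is retried.
        while (!head_.compare_exchange_weak(n->next, n,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        }
    }

    // Not thread-safe: call only after all pushing threads have finished.
    // Frees every node and returns how many there were.
    int drain() {
        int count = 0;
        Node* n = head_.exchange(nullptr, std::memory_order_acquire);
        while (n != nullptr) {
            Node* next = n->next;
            delete n;
            n = next;
            ++count;
        }
        return count;
    }

private:
    std::atomic<Node*> head_{nullptr};
};

// Pushes from several threads at once, then counts the nodes.
int concurrent_pushes(int threads, int per_thread) {
    assert(threads >= 0 && per_thread >= 0);
    PushStack s;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&s, per_thread] {
            for (int i = 0; i < per_thread; ++i) s.push(i);
        });
    }
    for (auto& th : pool) th.join();
    return s.drain();
}
```

If no push is lost, concurrent_pushes(4, 1000) returns 4000.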
C++ Memory Model (Programming Level)
C++11 provides memory orders that correspond to the memory consistency models.
(1) Relaxed Memory Order
Only atomicity is guaranteed.
Memory operations performed by the same thread on the same memory location are not reordered with respect to the modification order.
Each memory location has a total modification order (however, this order cannot be observed directly).
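A shared counter is a common case where relaxed ordering is enough, since only the atomicity of each increment matters. The following is a small sketch; the function name relaxed_count is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: counts increments from several threads using
// only relaxed atomic read-modify-write operations.
long relaxed_count(int threads, int iterations) {
    assert(threads >= 0 && iterations >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                // Atomic increment; no ordering with other memory is requested.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    // join() provides the synchronization needed to read the final value.
    for (auto& th : pool) th.join();
    return counter.load(std::memory_order_relaxed);
}
```

Every fetch_add takes its place in the counter's single modification order, so relaxed_count(4, 10000) returns 40000 even though no increment is ordered against other memory accesses.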
(2) Release Memory Order
Used for store operations.
Not the same as a write barrier.
Meaningless when used on its own.
(3) Acquire Memory Order
Used for load operations.
Not the same as a read barrier.
Meaningless when used on its own.
(4) Release-Acquire Model
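Release and acquire are used as a pair.
One thread uses release, the other thread uses acquire.
Also used for read-modify-write (RMW) operations (memory_order_acq_rel).
Synchronization happens through release and acquire operations on the same variable.
Note: wait for the required condition in a while loop; otherwise a single load only observes whatever state happens to exist at that moment.
The ordering is transitive.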
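The pairing on one variable and the waiting loop can be sketched with a message-passing example. The names message_passing, payload, and ready are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: passes a plain int from a producer thread to a
// consumer thread through a release store and an acquire load.
int message_passing() {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // the variable both sides synchronize on
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // (1) write the data
        ready.store(true, std::memory_order_release);  // (2) publish it
    });
    std::thread consumer([&] {
        // Loop until the released value is actually observed; a single
        // load could simply run before the store.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // (1) happens-before this read
    });

    producer.join();
    consumer.join();
    assert(seen == 42);
    return seen;
}
```

Once the acquire load reads the value written by the release store, everything the producer wrote before the store is visible to the consumer, so message_passing() returns 42.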
(5) Consume Memory Order
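A restricted form of the acquire model that applies under specific conditions.
Subsequent operations must depend on the consumed variable.
Release and consume are used as a pair.
Pointer dereferencing and array indexing both count as dependencies.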
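A sketch of release-consume through a published pointer follows; Config and publish_and_consume are hypothetical names. Note that, as far as I know, mainstream compilers currently implement memory_order_consume as acquire, and its use has been discouraged since C++17, so treat this as an illustration of the dependency idea rather than a recommendation.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Config {
    int value;
};

// Hypothetical helper: the producer publishes a pointer with release,
// the consumer loads it with consume and dereferences it.
int publish_and_consume() {
    std::atomic<Config*> slot{nullptr};
    int seen = -1;

    std::thread producer([&] {
        Config* c = new Config{7};                  // initialize the object
        slot.store(c, std::memory_order_release);   // then publish the pointer
    });
    std::thread consumer([&] {
        Config* c = nullptr;
        // Wait until a non-null pointer has been published.
        while ((c = slot.load(std::memory_order_consume)) == nullptr) {
            std::this_thread::yield();
        }
        // The dereference carries a dependency on the loaded pointer,
        // so it is ordered after the producer's initialization.
        seen = c->value;
    });

    producer.join();
    consumer.join();
    delete slot.load(std::memory_order_relaxed);
    assert(seen == 7);
    return seen;
}
```

Only operations that depend on the loaded pointer are ordered; an unrelated variable written by the producer would not be covered by consume.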
(6) Sequentially Consistent Memory Order
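The strong consistency model.
Without the SC model, Dekker's algorithm fails: the reads of f1 and f2 interleave and both return the old value. Both the stores and the loads must use memory_order_seq_cst.
Because of caching, stale values can be read. Different CPUs may see the values of a and b in two different sequences, 10 and 01.
With the strong consistency model, different CPUs see only one sequence of the values of a and b.
Without the strong consistency model, the (0,0) outcome can occur.
S1 and S2 interleave.
A plain counter only needs atomic operations; it does not need synchronization or ordering.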
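The Dekker-style store/load pattern described in this section can be sketched as a small litmus test. The function name count_both_zero and the result variables r1 and r2 are hypothetical; f1 and f2 follow the naming used above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the store-then-load pattern `rounds` times
// and returns how often both threads read 0.
int count_both_zero(int rounds) {
    int both_zero = 0;
    for (int r = 0; r < rounds; ++r) {
        std::atomic<int> f1{0}, f2{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            f1.store(1, std::memory_order_seq_cst);   // raise own flag
            r1 = f2.load(std::memory_order_seq_cst);  // check the other flag
        });
        std::thread t2([&] {
            f2.store(1, std::memory_order_seq_cst);
            r2 = f1.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    // All four operations take part in one total order, so at least
    // one thread must see the other's store.
    assert(both_zero == 0);
    return both_zero;
}
```

With memory_order_seq_cst on every store and load, the (0,0) outcome cannot occur. If the operations were weakened to release/acquire or relaxed, the language would permit (0,0), although a given run might not show it.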
(7) Atomic Thread Fence
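A thread fence is used together with atomic operations on a variable. Fences are generally slower than the memory barriers associated with an atomic operation.
With a release fence, the subsequent store(a) only needs relaxed ordering; the combination acts like store(a, release).
With an acquire fence, the preceding load(a) only needs relaxed ordering; the combination acts like load(a, acquire).
Release and acquire fences are used as a pair.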
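The fence pairing can be sketched with message passing in which the atomic operations themselves are relaxed. The names fence_message_passing and payload are hypothetical; a is the flag variable used in the notes above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: passes a plain int between threads using
// relaxed atomic accesses ordered by standalone fences.
int fence_message_passing() {
    int payload = 0;             // plain, non-atomic data
    std::atomic<bool> a{false};  // flag accessed with relaxed ordering only
    int seen = -1;

    std::thread producer([&] {
        payload = 42;
        // Release fence BEFORE the relaxed store.
        std::atomic_thread_fence(std::memory_order_release);
        a.store(true, std::memory_order_relaxed);
    });
    std::thread consumer([&] {
        while (!a.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        // Acquire fence AFTER the relaxed load that saw the flag.
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = payload;
    });

    producer.join();
    consumer.join();
    assert(seen == 42);
    return seen;
}
```

The position of each fence matters: the release fence must come before the store and the acquire fence after the load, otherwise the two fences do not synchronize and the read of payload would be a data race.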
References
- https://en.cppreference.com/w/cpp/atomic/memory_order
- https://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/
- https://www.kernel.org/doc/Documentation/memory-barriers.txt
- Memory barriers in C
- C++ Memory Model, Valentin Ziegler
- Memory Consistency, CMU 15-418/15-618, Fall 2016
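- GTHub, "High-Concurrency Programming: Consistency Problems in Multiprocessor Programming (Part 1)" (高并发编程--多处理器编程中的一致性问题(上))
- 郑传军, "Summary of volatile and Memory Barriers" (volatile与内存屏障总结)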