A Simple Look at C++11 Memory Ordering

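To run faster, modern computers apply many optimizations automatically. These optimizations are guaranteed not to break the original logic in a single-threaded program, but once multiple threads are involved that guarantee no longer holds. The causes fall into three main areas:

  • Compiler optimizations
  • CPU out-of-order execution
  • CPU cache inconsistency

Put simply, a memory model is a contract: developers use it to synchronize data and avoid race conditions, and the system (compiler, operating system and processor) guarantees that execution conforms to the contract.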

C++11 defines six memory-order options that can be applied to operations on atomic types: memory_order_relaxed, memory_order_consume, memory_order_acquire, memory_order_release, memory_order_acq_rel and memory_order_seq_cst.
They represent three memory models: sequentially consistent ordering (memory_order_seq_cst), acquire-release ordering (memory_order_consume, memory_order_acquire, memory_order_release, memory_order_acq_rel) and relaxed ordering (memory_order_relaxed).

Store operations accept these memory orders: memory_order_relaxed, memory_order_release, memory_order_seq_cst.
Load operations accept these memory orders: memory_order_relaxed, memory_order_consume, memory_order_acquire, memory_order_seq_cst.
Read-modify-write operations accept these memory orders: memory_order_relaxed, memory_order_consume, memory_order_acquire, memory_order_release, memory_order_acq_rel, memory_order_seq_cst.
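As a small illustration of which ordering goes with which kind of operation, the sketch below performs one store, one load and one read-modify-write on a single atomic. The function name demo_valid_orders is made up for this example; it runs in a single thread, so it only shows the API shape, not any cross-thread effect.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: one store, one load and one read-modify-write,
// each with an ordering that is valid for that kind of operation.
int demo_valid_orders()
{
    std::atomic<int> counter{0};

    // Store: relaxed, release or seq_cst.
    counter.store(1, std::memory_order_release);

    // Load: relaxed, consume, acquire or seq_cst.
    int seen = counter.load(std::memory_order_acquire);
    assert(seen == 1);

    // Read-modify-write: any of the six orderings.
    int previous = counter.fetch_add(2, std::memory_order_acq_rel);
    assert(previous == 1);

    // Omitting the ordering argument means memory_order_seq_cst.
    return counter.load();
}
```

Calling demo_valid_orders() returns the final counter value, 1 + 2.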

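memory_order_consume is not recommended. Personally, I do not recommend using non-default orderings at all: they are hard to reason about completely, hard to test, and do not necessarily bring a noticeable performance gain.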

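Sequential consistency

Sequential consistency is the default. Under it, the program's behavior looks like one definite sequence from any point of view: every thread sees the operations of the other threads in the same order, which matches ordinary intuition.
Here is an example of sequential consistency. (In the later examples the global variables x, y, z and their initial values are the same as here, and each function runs in its own thread.)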

std::atomic<bool> x{false}, y{false};
std::atomic<int> z{0};
// thread 1
void write_x()
{
	x = true; // 1
}
// thread 2
void write_y()
{
	y = true; // 2
}
// thread 3
void read_x_then_y()
{
	while(!x);
	if(y) ++z; // 3
}
// thread 4
void read_y_then_x()
{
	while(!y);
	if(x) ++z; // 4
}

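Because there is one definite sequence, all possible cases are fairly easy to enumerate. What is certain is that 1 comes before 3 and 2 comes before 4. The possible sequences are 2413, 2143, 2134, 1243, 1234 and 1324, which give z the values 1, 2, 2, 2, 2 and 1 respectively.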
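The enumeration above says z ends up as 1 or 2, never 0. A rough way to try this out is the harness below, which wraps the four functions in lambdas and uses local atomics so that it can be called repeatedly. The name run_seq_cst_once is an assumption made for this sketch, not part of the original example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical harness around the four-thread example; every atomic
// operation uses the default memory_order_seq_cst.
int run_seq_cst_once()
{
    std::atomic<bool> x{false}, y{false};
    std::atomic<int> z{0};

    std::thread t1([&] { x = true; });                  // 1
    std::thread t2([&] { y = true; });                  // 2
    std::thread t3([&] { while (!x) {} if (y) ++z; });  // 3
    std::thread t4([&] { while (!y) {} if (x) ++z; });  // 4

    t1.join(); t2.join(); t3.join(); t4.join();

    int result = z.load();
    // At least one of the two readers must see both flags set.
    assert(result == 1 || result == 2);
    return result;
}
```

Running it in a loop should never trip the assertion; which of the two values appears more often depends on scheduling.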

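Relaxed ordering

Under relaxed ordering, operations on the same variable within one thread still obey happens-before, but other threads do not necessarily see them that way.
C++ Concurrency in Action has a vivid example that illustrates relaxed ordering: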

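To understand how relaxed ordering works, imagine each variable as a man in a separate room holding a notepad. On the notepad is a list of values. You can phone him and ask him to give you a value, or tell him to write down a new one. If you tell him to write down a new value, he writes it at the bottom of the list. If you ask him for a value, he reads you one from the list.
The first time you talk to him, if you ask for a value, he may give you any value from the list as it currently stands (values other people told him to write down).
If you then ask him for another value, he may give you the same one again or one from farther down the list; he will never give you one from farther up.
If you tell him to write down a value and afterwards ask him for one, he gives you either the value you just told him or one below it on the list (values other people told him to write down).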

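The only requirement of relaxed ordering is that accesses to the same atomic variable within a single thread cannot be reordered. When several atomic variables are used with relaxed ordering, each has its own modification order, but those orders are unrelated to one another.

An example of relaxed ordering: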

void write_x_then_y()
{
	x.store(true, std::memory_order_relaxed); // 1
	y.store(true, std::memory_order_relaxed); // 2
}
void read_y_then_x()
{
	while(!y.load(std::memory_order_relaxed)); // 3
	if(x.load(std::memory_order_relaxed)) // 4
		++z;
}

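From thread 1's point of view, x and then y are set to true; from thread 2's point of view, x may still be false when y is already true, so z may be 0 or 1.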
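The one guarantee that relaxed ordering does give, that a thread never walks back up the list of a single variable, can be sketched as follows. The function reads_never_go_backwards is a name invented for this illustration: a writer stores 1, 2, 3, ... with relaxed ordering and a reader checks that the values it loads never decrease.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical check: a reader polling one relaxed atomic never
// observes the value moving backwards in its modification order.
bool reads_never_go_backwards(int last_value)
{
    assert(last_value > 0);
    std::atomic<int> value{0};
    bool monotonic = true;   // written only by the reader thread

    std::thread writer([&] {
        for (int i = 1; i <= last_value; ++i)
            value.store(i, std::memory_order_relaxed);
    });
    std::thread reader([&] {
        int previous = 0;
        while (previous < last_value) {
            int current = value.load(std::memory_order_relaxed);
            if (current < previous) monotonic = false;
            previous = current;
        }
    });

    writer.join();
    reader.join();   // join() makes the reader's result visible here
    return monotonic;
}
```

The reader may skip values or see the same value many times, just as the man with the notepad may repeat himself or jump down the list, but the function should always return true.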

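Acquire-release ordering

Acquire-release ordering is a strengthened form of relaxed ordering: there is still no single total order of operations, but synchronization is introduced. Here is how cppreference describes it: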

memory_order_release
A store operation with this memory order performs the release operation on the affected memory location: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable.
In short: reads and writes cannot be reordered after the release store.

memory_order_acquire
A load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load. All writes in other threads that release the same atomic variable are visible in the current thread.
In short: reads and writes cannot be reordered before the acquire load.

memory_order_acq_rel
A read-modify-write operation with this memory order is both an acquire operation and a release operation. No memory reads or writes in the current thread can be reordered before the load, nor after the store. All writes in other threads that release the same atomic variable are visible before the modification and the modification is visible in other threads that acquire the same atomic variable.

Change the relaxed-ordering example above to:

void write_x_then_y()
{
	x.store(true, std::memory_order_relaxed); // 1
	y.store(true, std::memory_order_release); // 2
}
void read_y_then_x()
{
	while(!y.load(std::memory_order_acquire)); // 3
	if(x.load(std::memory_order_relaxed)) // 4
		++z;
}

2 synchronizes with 3, 1 is sequenced before 2, and 3 is sequenced before 4, so 1 happens before 4 and z is 1.
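The same release/acquire pair also publishes ordinary non-atomic data, which is the most common use of this ordering. A minimal sketch, assuming a plain int named payload and a flag named ready (both names are made up here):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical example: `payload` is a plain int published through
// the atomic flag `ready`.
int pass_message()
{
    int payload = 0;                  // non-atomic data
    std::atomic<bool> ready{false};
    int received = -1;

    std::thread producer([&] {
        payload = 42;                                      // A
        ready.store(true, std::memory_order_release);      // B
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {}  // C
        received = payload;                                // D
    });

    producer.join();
    consumer.join();

    // B synchronizes with C, so A happens before D.
    assert(received == 42);
    return received;
}
```

If both operations on ready were relaxed instead, the read at D would be a data race on payload.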

Now modify the example from the sequential consistency section:

void write_x()
{
	x.store(true, std::memory_order_release); // 1
}
void write_y()
{
	y.store(true, std::memory_order_release); // 3
}
void read_x_then_y()
{
	while(!x.load(std::memory_order_acquire)); // 2
	if(y.load(std::memory_order_acquire)) // 5
		++z;
}
void read_y_then_x()
{
	while(!y.load(std::memory_order_acquire)); // 4
	if(x.load(std::memory_order_acquire)) // 6
		++z;
}

1 synchronizes with 2, but at 5 y is not necessarily true; 3 synchronizes with 4, but at 6 x is not necessarily true. z may be 0, 1 or 2.
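memory_order_acq_rel, quoted earlier in this section, is not used in the examples so far. One situation where it fits is a countdown: each worker writes its own result and then decrements a shared counter, and whichever worker brings the counter to zero reads everybody's results. The sketch below assumes hypothetical names (sum_by_last_worker, slots, remaining) and is only one possible illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical example: each worker fills its own slot, then decrements
// `remaining`; whichever worker drops it to zero sums all slots.
int sum_by_last_worker(int workers)
{
    assert(workers > 0);
    std::vector<int> slots(workers, 0);    // non-atomic, one slot per worker
    std::atomic<int> remaining{workers};
    int total = -1;                        // written only by the last worker

    std::vector<std::thread> threads;
    for (int i = 0; i < workers; ++i) {
        threads.emplace_back([&, i] {
            slots[i] = i + 1;
            // release half: publishes this worker's slot
            // acquire half: sees slots published by earlier decrements
            if (remaining.fetch_sub(1, std::memory_order_acq_rel) == 1) {
                int sum = 0;
                for (int v : slots) sum += v;
                total = sum;
            }
        });
    }
    for (auto& t : threads) t.join();
    return total;
}
```

With four workers the slots hold 1, 2, 3 and 4, so sum_by_last_worker(4) returns 10 regardless of which worker finishes last.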

fence

extern "C" void atomic_thread_fence(std::memory_order order) noexcept;
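atomic_thread_fence establishes a memory synchronization ordering of non-atomic and relaxed atomic accesses, as instructed by the order argument.
atomic_thread_fence accepts any of the six memory orders: memory_order_relaxed has no effect, memory_order_consume is not considered here, memory_order_acquire gives an acquire fence, memory_order_release gives a release fence, and memory_order_acq_rel and memory_order_seq_cst give a full fence.
Establishing synchronization requires at least one atomic object.

The following is from cppreference: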
Fence-atomic synchronization
A release fence F in thread A synchronizes-with atomic acquire operation Y in thread B, if

  • there exists an atomic store X (with any memory order)
  • Y reads the value written by X (or the value would be written by release sequence headed by X if X were a release operation)
  • F is sequenced-before X in thread A

In this case, all non-atomic and relaxed atomic stores that are sequenced-before F in thread A will happen-before all non-atomic and relaxed atomic loads from the same locations made in thread B after Y.

In short, a release fence followed by a store of any memory order to an atomic variable synchronizes with an acquire load of the same atomic variable in another thread.

Atomic-fence synchronization
An atomic release operation X in thread A synchronizes-with an acquire fence F in thread B, if

  • there exists an atomic read Y (with any memory order)
  • Y reads the value written by X (or by the release sequence headed by X)
  • Y is sequenced-before F in thread B

In this case, all non-atomic and relaxed atomic stores that are sequenced-before X in thread A will happen-before all non-atomic and relaxed atomic loads from the same locations made in thread B after F.

In short, a release store to an atomic variable synchronizes with a load of any memory order from the same atomic variable followed by an acquire fence in another thread.

Fence-fence synchronization
A release fence FA in thread A synchronizes-with an acquire fence FB in thread B, if

  • There exists an atomic object M
  • There exists an atomic write X (with any memory order) that modifies M in thread A
  • FA is sequenced-before X in thread A
  • There exists an atomic read Y (with any memory order) in thread B
  • Y reads the value written by X (or the value would be written by release sequence headed by X if X were a release operation)
  • Y is sequenced-before FB in thread B

In this case, all non-atomic and relaxed atomic stores that are sequenced-before FA in thread A will happen-before all non-atomic and relaxed atomic loads from the same locations made in thread B after FB

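In short, a release fence followed by a store of any memory order to an atomic variable synchronizes with a load of any memory order from the same atomic variable followed by an acquire fence in another thread.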
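The fence-fence case can be sketched with relaxed operations on the flag and the ordering supplied entirely by the two fences. The labels FA, X, Y and FB in the comments follow the cppreference text above; the function and variable names are assumptions made for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical example of fence-fence synchronization.
int pass_message_with_fences()
{
    int payload = 0;                  // non-atomic data
    std::atomic<bool> ready{false};   // the atomic object M
    int received = -1;

    std::thread producer([&] {
        payload = 42;
        std::atomic_thread_fence(std::memory_order_release);  // FA
        ready.store(true, std::memory_order_relaxed);         // X
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_relaxed)) {}     // Y
        std::atomic_thread_fence(std::memory_order_acquire);  // FB
        received = payload;
    });

    producer.join();
    consumer.join();

    // FA synchronizes with FB, so the write to payload is visible.
    assert(received == 42);
    return received;
}
```

Replacing FA and X with a single release store gives the atomic-fence case, and replacing Y and FB with a single acquire load gives the fence-atomic case.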

atomic_thread_fence imposes stronger synchronization constraints than an atomic store operation with the same std::memory_order. While an atomic store-release operation prevents all preceding reads and writes from moving past the store-release, an atomic_thread_fence with memory_order_release ordering prevents all preceding reads and writes from moving past all subsequent stores.

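Reordering

A release fence prevents memory operations before the fence from being reordered after any store that follows the fence, that is, it blocks LoadStore and StoreStore reordering.
An acquire fence prevents memory operations after the fence from being reordered before any load that precedes the fence, that is, it blocks LoadLoad and LoadStore reordering.
A full fence is the combination of a release fence and an acquire fence, so it blocks LoadLoad, LoadStore and StoreStore reordering.

The reordering meant here should cover both compiler reordering and CPU reordering. According to cppreference, however: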

On x86 (including x86-64), atomic_thread_fence functions issue no CPU instructions and only affect compile-time code motion, except for std::atomic_thread_fence(std::memory_order_seq_cst), which issues the full memory fence instruction MFENCE.

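Acquire and release operations on atomic variables are no stronger than atomic_thread_fence, so they cannot stop CPU reordering either. Fortunately, x86 (including x86-64) only performs StoreLoad reordering and none of the other three kinds.

In addition, according to "X86/GCC memory fence的一些见解", memory_order_relaxed on x86 acts like a compiler fence: it makes the compiler flush all memory variables cached in registers back to memory and reload them from memory afterwards. That article also provides an example that demonstrates StoreLoad reordering: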

#include <iostream>
#include <atomic>
#include <chrono>
#include <thread>

#define cpufence asm volatile("mfence" ::: "memory")

alignas(64) volatile int cntg = 0;
alignas(64) volatile int cnt1 = 0;
alignas(64) volatile int cnt2 = 0;
std::atomic<int> x;
std::atomic<int> y;
alignas(64) volatile int r1 = 0;
alignas(64) volatile int r2 = 0;

void fun1()
{
	while(true)
	{
		while(cntg == cnt1);
		cpufence;
		x.store(1, std::memory_order_release);
		// std::atomic_thread_fence(std::memory_order_seq_cst);
		r1 = y.load(std::memory_order_acquire);
		cpufence;
		++cnt1;
		cpufence;
	}
}
void fun2()
{
	while(true)
	{
		while(cntg == cnt2);
		cpufence;
		y.store(1, std::memory_order_release);
		// std::atomic_thread_fence(std::memory_order_seq_cst);
		r2 = x.load(std::memory_order_acquire);
		cpufence;
		++cnt2;
		cpufence;
	}
}
int main()
{
	std::thread thr1(fun1);
	std::thread thr2(fun2);
	int detected = 0;
	while(true)
	{
		x = 0;
		y = 0;
		++cntg;
		cpufence;
		while(cnt1 != cntg || cnt2 != cntg);
		if(r1 == 0 && r2 == 0)
		{
			++detected;
			std::cout << "bad, cntg: " << cntg << " detected: " << detected << std::endl;
			std::this_thread::sleep_for(std::chrono::milliseconds(50));
		}
	}
	return 0;
}

Compile with -O2. Unless std::atomic_thread_fence(std::memory_order_seq_cst); or MFENCE is placed between the store and the load, StoreLoad reordering occurs.
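For comparison, here is a simplified sketch of the same store-then-load pattern with every operation made memory_order_seq_cst. It recreates the two threads each round instead of synchronizing them with counters, and the name count_both_zero is invented for this illustration. With sequentially consistent operations, both threads reading 0 is not an allowed outcome.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical, simplified version of the test above: counts the rounds
// in which both threads read 0 after storing 1 to the other variable.
int count_both_zero(int rounds)
{
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread a([&] {
            x.store(1, std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_seq_cst);
        });
        std::thread b([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        a.join();
        b.join();

        assert(r1 != -1 && r2 != -1);   // both threads have run
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

Switching the stores to memory_order_release and the loads to memory_order_acquire makes a non-zero count permissible, although with threads created per round the window is narrow and the effect may be hard to observe; the counter-based test above is much better at provoking it.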

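Open question

By the definition of memory_order_relaxed, after thread A writes, thread B is not guaranteed to read the latest value. How, then, do multiple threads running fetch_xxx operations concurrently still end up with a consistent result? "C/C++11 mappings to processors" describes the instructions used to implement C/C++11 atomic operations on different processors, but it does not cover the fetch_xxx functions.
Later it dawned on me that, as stated at the beginning of this article, "a memory model is a contract", and a platform is free to choose a stronger implementation; on x86, for example, memory_order_relaxed operations include memory synchronization.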
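One further angle on this question, offered as a note rather than as the author's conclusion: the C++ standard requires an atomic read-modify-write operation to read the last value in the modification order of its variable, whatever memory order is passed. A relaxed fetch_add therefore never loses an update; relaxed only means that it orders nothing else around it. A minimal sketch, with the hypothetical name count_with_relaxed:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical example: several threads increment one counter with
// relaxed fetch_add; no increment is lost.
int count_with_relaxed(int workers, int increments)
{
    assert(workers > 0 && increments >= 0);
    std::atomic<int> counter{0};

    std::vector<std::thread> threads;
    for (int i = 0; i < workers; ++i) {
        threads.emplace_back([&] {
            for (int n = 0; n < increments; ++n)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& t : threads) t.join();

    // join() synchronizes with the end of each thread, so this load
    // sees every increment.
    return counter.load(std::memory_order_relaxed);
}
```

count_with_relaxed(4, 10000) returns 40000 on any conforming implementation, not only on x86.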

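References

C++ Concurrency in Action, 5.3 Synchronizing operations and enforcing ordering
C++ 内存模型 (The C++ memory model)
关于std::atomic_thread_fence (On std::atomic_thread_fence)
memory_order
atomic_thread_fence
X86/GCC memory fence的一些见解 (Some observations on X86/GCC memory fences)
为什么在 x86 架构下只有 StoreLoad 屏障是有效指令? (Why is the StoreLoad barrier the only effective barrier instruction on x86?)
C 表达式中的汇编指令 (Assembly instructions in C expressions)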


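January 18, 2024
I recently read some more articles on memory models and gained a somewhat deeper understanding.
Because of the CPU cache coherence protocol, all CPUs see reads and writes to the different variables in a consistent order. To speed up execution, however, mechanisms such as the store buffer and the invalidate queue were introduced, which turn some originally synchronous operations into asynchronous ones; at the code level this looks like operations being reordered. To hide these complicated hardware mechanisms from software developers, memory models were abstracted out, such as TSO (Total Store Order), PSO (Partial Store Order) and RMO (Relaxed Memory Order), and different memory models permit different degrees of CPU reordering. Compilers may also reorder operations to optimize for speed. These reorderings can affect the result of a concurrent program; to prevent them, CPU-level barriers and compiler barriers can be used. The C++11 memory orders are one more layer of abstraction over these barriers.
References:
缓存一致性协议的工作方式 (How cache coherence protocols work)
内存屏障的来历 (The origin of memory barriers)
内存一致性模型-PSO (Memory consistency models: PSO)
内存一致性模型-TSO (Memory consistency models: TSO)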
