Can a volatile Variable Establish a Memory Barrier or a Concurrency Lock in VC?

wzsy

Tags: optimization, cache, locking, signal, exception, windows

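I used to think that declaring a variable volatile made the compiler insert a memory barrier, so that a volatile variable alone could do the job of a spin lock (which is normally implemented with InterlockedExchange()). To check this, I wrote the following test code: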

 
    #include <windows.h>
    #include <tchar.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <crtdbg.h>

    typedef struct _ThreadParam
    {
        UINT nRound;    // number of increments this thread performs
    } ThreadParam;

    volatile UINT g_nCnt;   // shared counter, deliberately left unprotected

    DWORD TestConcurrentAccess(UINT nRnd1, UINT nRnd2);
    DWORD WINAPI ThreadTest(LPVOID pParam);

    int _tmain()
    {
        UINT nRnd1, nRnd2, nTestRnd;
        const UINT RANGE_MIN = 10;
        const UINT RANGE_MAX = 100;

        srand((UINT)time(NULL));

        nTestRnd = 0;
        while (TRUE)
        {
            g_nCnt = 0;
            // Random round counts in [RANGE_MIN, RANGE_MIN + RANGE_MAX]
            nRnd1 = (UINT)(((double)rand() / (double)RAND_MAX) * RANGE_MAX) + RANGE_MIN;
            nRnd2 = (UINT)(((double)rand() / (double)RAND_MAX) * RANGE_MAX) + RANGE_MIN;

            nTestRnd++;
            TestConcurrentAccess(nRnd1, nRnd2);

            _tprintf(_T("%u. round 1: %u, round 2: %u\n"), nTestRnd, nRnd1, nRnd2);
            _tprintf(_T("global counter: %u\n"), g_nCnt);
            // Stop as soon as an increment has been lost
            if (nRnd1 + nRnd2 != g_nCnt)
                break;
        }

        return 0;
    }

    DWORD TestConcurrentAccess(UINT nRnd1, UINT nRnd2)
    {
        DWORD dwTid;
        HANDLE hThd, hThd2;

        ThreadParam p1, p2;

        p1.nRound = nRnd1;
        p2.nRound = nRnd2;

        // Run two threads that increment g_nCnt concurrently
        hThd = CreateThread(NULL, 0, ThreadTest, &p1, 0, &dwTid);
        hThd2 = CreateThread(NULL, 0, ThreadTest, &p2, 0, &dwTid);

        WaitForSingleObject(hThd, INFINITE);
        WaitForSingleObject(hThd2, INFINITE);

        // Release the thread handles so the endless loop does not leak them
        CloseHandle(hThd);
        CloseHandle(hThd2);

        return 0;
    }

    DWORD WINAPI ThreadTest(LPVOID pParam)
    {
        _ASSERT(pParam != NULL);

        ThreadParam* pThdParam = (ThreadParam*)pParam;

        for (UINT i = 0; i < pThdParam->nRound; i++)
        {
            Sleep(5);
            g_nCnt++;   // the operation under test: volatile, but not atomic
        }

        return 0;
    }

Build and run environment:

VC8
Compiler options: /O2 /EHsc /MD
Windows: MP kernel with PAE (ntkrpamp.exe); CPU: Core 2 Duo

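g_nCnt is declared volatile, and every operation on a volatile variable acts on actual memory. I therefore assumed that g_nCnt++ in the concurrent ThreadTest() threads was atomic, i.e. that volatile gave ThreadTest() safe [write-write concurrency] (this turned out to be wrong). My expectation for the code above was that nRnd1 + nRnd2 != g_nCnt would never hold and the program would keep running without exiting.

The actual result was the opposite. On every run, the condition nRnd1 + nRnd2 != g_nCnt was met within 100 rounds of TestConcurrentAccess() (nTestRnd < 100), and the program ended.

Only after asking 江写生, an expert on MSDN, and disassembling and debugging the program myself did I understand what volatile actually does.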

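What volatile does

volatile has two effects:

1) Every read/write of a volatile variable must be performed with a memory read/write instruction. The compiler may not assume that the value is unchanged between two accesses and optimize by keeping it in a register.

This has nothing to do with non-cached data. The optimization that volatile turns on or off is not a CPU cache property (it is not what #pragma section("section", nocache) controls); it is whether the compiler may reuse a value already held in a register. Cache coherency is handled by the CPU hardware, and the programmer does not need to worry about it.
2) Reads/writes of volatile variables must follow the order of the statements strictly. The compiler may not assume that a volatile access has no extra side effects and reorder it, pipeline-style, during optimization. (VC9 does not strictly honor this: with /O2 it sometimes still reorders volatile accesses relative to non-volatile ones.)

volatile by itself provides no concurrency control.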

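The sample code in the MSDN reference for volatile (C++) contains no concurrency control. It works only because one thread writes while the other reads, i.e. [read-write concurrency].

In that sample, Sentinel = false takes a single instruction. When thread ThreadFunc1() reads, thread ThreadFunc2() is either before that instruction or after it, so nothing can go wrong.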
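The one-writer, one-reader pattern described here can be sketched in portable C++11. This is only an illustration under assumptions: the names g_ready, g_payload and RunFlagHandOff are hypothetical and do not come from the MSDN sample, and std::atomic<bool> is swapped in for the volatile flag because ISO C++ gives volatile no inter-thread guarantees.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names; std::atomic<bool> stands in for the volatile flag.
static std::atomic<bool> g_ready(false);
static int g_payload = 0;   // ordinary data published through the flag

// One thread writes the flag, the other only reads it:
// the [read-write concurrency] case.
int RunFlagHandOff()
{
    g_ready.store(false);
    g_payload = 0;
    int seen = -1;

    std::thread reader([&seen] {
        // Spin until the writer's single store to the flag becomes visible.
        while (!g_ready.load(std::memory_order_acquire))
            std::this_thread::yield();
        seen = g_payload;   // ordered after the acquire load above
    });

    std::thread writer([] {
        g_payload = 42;                                   // write the data first
        g_ready.store(true, std::memory_order_release);   // then publish the flag
    });

    writer.join();
    reader.join();
    return seen;   // read after join(), so there is no race on 'seen'
}
```

The flag is written with a single store, so the reader observes either the old or the new value, which is the property the paragraph above relies on. Nothing here protects a variable that both threads modify.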

My code with volatile UINT g_nCnt; and g_nCnt++; is a different case. Debugging it in OllyDbg shows how it is compiled:

Debug Compile Option: /Od /EHsc /MDd

Release Compile Option: /O2 /EHsc /MD

 
    __asm int 3;
    g_nCnt++;
    __asm nop;

The Debug build compiles this to:

    CC              INT3
    A1 88714100     MOV EAX,DWORD PTR DS:[417188]
    83C0 01         ADD EAX,1
    A3 88714100     MOV DWORD PTR DS:[417188],EAX
    90              NOP

The Release build compiles it to:

    CC                    INT3
    8305 78334000 01      ADD DWORD PTR DS:[403378],1
    90                    NOP

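This is a complex ADD instruction (opcode 8305) that reads memory, adds, and writes the result back. On a UP machine it appears to be fine, but on an MP machine it no longer guarantees correct concurrent behavior; there it needs a LOCK prefix to lock the bus. In other words, ADD DWORD PTR DS:[403378],1 is not an atomic instruction on my Core 2 Duo, and that is why the test code at the beginning failed.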
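The failure can be traced by hand. The sketch below is a deterministic, single-threaded model, not real concurrency: it replays one interleaving of the read, add and write steps of two increments. The names Cpu, SimulateLostUpdate and SimulateSerializedUpdate are hypothetical.

```cpp
// Each "CPU" is modeled as one private register; 'memory' is the shared word.
struct Cpu { unsigned reg; };

// Two increments whose read-modify-write steps interleave.
unsigned SimulateLostUpdate()
{
    unsigned memory = 0;    // stands in for g_nCnt
    Cpu a = {0}, b = {0};

    a.reg = memory;         // CPU A reads 0
    b.reg = memory;         // CPU B reads 0, because A has not written yet
    a.reg += 1;             // A computes 1
    b.reg += 1;             // B computes 1
    memory = a.reg;         // A writes 1
    memory = b.reg;         // B writes 1 and overwrites A's update
    return memory;          // 1: one increment was lost
}

// The same two increments with each read-modify-write kept indivisible,
// which is the effect a locked instruction provides.
unsigned SimulateSerializedUpdate()
{
    unsigned memory = 0;
    Cpu a = {0}, b = {0};

    a.reg = memory; a.reg += 1; memory = a.reg;   // A completes first
    b.reg = memory; b.reg += 1; memory = b.reg;   // B sees A's result
    return memory;          // 2: both increments are counted
}
```

In the interleaved order the counter ends at 1 instead of 2, which is the kind of mismatch the nRnd1 + nRnd2 != g_nCnt check detects.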

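Write concurrency with InterlockedIncrement()

There are two sets of InterlockedXXX() functions: one is the Windows API (exported by kernel32.dll), the other is the set of _InterlockedXXX() intrinsics built into the VC compiler. They do the same thing.

In the test code, change g_nCnt++ in the concurrent thread ThreadTest() to _InterlockedIncrement((long*)&g_nCnt).

For the UINT g_nCnt variable, the concurrent increments complete correctly whether or not it is declared volatile, because _InterlockedIncrement() relies on the LOCK instruction prefix rather than on any non-cached property, and VC8's volatile does not create a non-cached section either.

With the same compiler options as above, _InterlockedIncrement() compiles as follows: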

 
    __asm int 3;
    _InterlockedIncrement((long*)&g_nCnt);
    __asm nop;

Both the Debug and the Release build compile this to:

    CC                 INT3
    B8 78334000        MOV EAX,Volatile.00403378
    B9 01000000        MOV ECX,1
    F0:0FC108          LOCK XADD DWORD PTR DS:[EAX],ECX
    90                 NOP

In practice, with _InterlockedIncrement() as above, I ran 500 rounds of TestConcurrentAccess() and nRnd1 + nRnd2 != g_nCnt never occurred.
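For readers without a VC toolchain, the same experiment can be sketched in portable C++11. This is an assumed equivalent, not the article's code: std::atomic<unsigned>::fetch_add replaces _InterlockedIncrement(), std::thread replaces CreateThread(), and the name CountWithAtomic is hypothetical.

```cpp
#include <atomic>
#include <thread>

// Hypothetical portable counterpart of the test: two threads each add
// their own number of increments to one shared counter.
unsigned CountWithAtomic(unsigned rounds1, unsigned rounds2)
{
    std::atomic<unsigned> counter(0);

    auto worker = [&counter](unsigned rounds) {
        for (unsigned i = 0; i < rounds; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);   // atomic read-modify-write
    };

    std::thread t1(worker, rounds1);
    std::thread t2(worker, rounds2);
    t1.join();
    t2.join();

    return counter.load();   // always rounds1 + rounds2
}
```

On x86, compilers typically turn fetch_add into a LOCK-prefixed instruction such as the LOCK XADD shown above, so the mechanism is the same one _InterlockedIncrement() uses.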

volatile usage examples

(1). Operations on a non-volatile variable that have no effect are optimized away

 
    // Compile with /O2
    __declspec(noinline) void TestVolatile_1()
    {
        UINT res = 0;
        volatile UINT v = 0;
        UINT nv = 0;

        // The code below is not optimized away because v is volatile: every
        // read/write of v is kept, in the same order and number as in the
        // unoptimized code
        __asm int 3;
        v++;
        v++;
        res = v;

        // The code between the two nops below is optimized away completely,
        // because res is never used and nv is not volatile
        __asm nop;
        nv++;
        nv++;
        res = nv;
        __asm nop;
    }

(2). Operations on a non-volatile variable are optimized into register operations

 
    // Compile with /O2
    __declspec(noinline) void TestVolatile_2()
    {
        volatile UINT v;
        UINT nv;

        // Every operation on v uses memory read/write instructions
        for (v = 0; v < 10; v++)
            Sleep(100);

        // The operations on nv are optimized into register operations
        for (nv = 0; nv < 10; nv++)
            Sleep(100);
    }

The LOCK# signal

Reference: Intel 64 and IA-32 Architectures Software Developer's Manuals

What follows is my translation of, and excerpts from, the Intel manuals:

From: Volume 2A.: Ch3. Instruction Set Reference: 3.2 Instruction (A-M)

LOCK -- Assert LOCK# Signal Prefix

The LOCK prefix causes the processor to assert the LOCK# signal while it executes the accompanying instruction, which turns that instruction into an atomic instruction. On an MP system, the LOCK# signal ensures that the processor has exclusive access to shared memory.

On some later Intel processors (including the Pentium 4, Xeon and P6 family), locking may occur without the LOCK# signal being asserted. See the "IA-32 Architecture Compatibility" section below.

The LOCK prefix can be prepended only to the following instructions and only to those forms of the instructions where the destination operand is a memory operand: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. If the LOCK prefix is used with one of these instructions and the source operand is a memory operand, an undefined opcode exception (#UD) may be generated. An undefined opcode exception will also be generated if the LOCK prefix is used with any instruction not in the above list. The XCHG instruction always asserts the LOCK# signal regardless of the presence or absence of the LOCK prefix.

...

IA-32 Architecture Compatibility

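Beginning with the P6 family processors, when the LOCK prefix is used and the memory area being accessed is already cached inside the processor, the LOCK# signal is generally not asserted. Instead, only the processor's cache is locked, and the processor's cache coherency mechanism ensures that the memory operation is carried out atomically. See "Effects of a Locked Operation on Internal Processor Caches" in Chapter 8 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A, for more information on locking of caches.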

From: Volume 3A.: Ch8. Multiple-Processor Management

8.1.4 Effects of a LOCK Operation on Internal Processor Caches

For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow its cache coherency mechanism to ensure that the operation is carried out atomically. This operation is called "cache locking." The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area.

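In short, a LOCK prefix does not necessarily produce a LOCK# signal, whereas XCHG, according to the instruction reference quoted above, always does. Chapter 6 of 云风's 《游戏之旅：我的编程感悟》 makes the same point. XCHG should therefore be avoided where performance matters.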
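The introduction noted that spin locks are usually built on InterlockedExchange(), which is an atomic exchange. One common way to keep the exchange off the hot path is to spin on plain loads and retry the exchange only when the lock looks free. The sketch below shows that idea in portable C++11; the SpinLock class and CountWithSpinLock are hypothetical names, and std::atomic<bool>::exchange is assumed here as a stand-in for InterlockedExchange().

```cpp
#include <atomic>
#include <thread>

class SpinLock   // hypothetical name, not a Windows or VC type
{
public:
    SpinLock() : locked_(false) {}

    void lock()
    {
        for (;;)
        {
            // Atomic exchange: returns the previous value, so 'false'
            // means this thread has just taken the lock.
            if (!locked_.exchange(true, std::memory_order_acquire))
                return;
            // Lock is held by someone else: wait with plain loads and
            // retry the exchange only once the lock looks free.
            while (locked_.load(std::memory_order_relaxed))
                std::this_thread::yield();
        }
    }

    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_;
};

// Two threads increment a plain (non-atomic) counter under the lock.
unsigned CountWithSpinLock(unsigned rounds1, unsigned rounds2)
{
    SpinLock lock;
    unsigned counter = 0;   // protected by 'lock'

    auto worker = [&lock, &counter](unsigned rounds) {
        for (unsigned i = 0; i < rounds; ++i)
        {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };

    std::thread t1(worker, rounds1);
    std::thread t2(worker, rounds2);
    t1.join();
    t2.join();
    return counter;
}
```

Whether this pays off depends on contention and on the processor, so it should be measured rather than assumed.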

MemoryBarrier()

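The Windows API on Windows Server 2003 and later provides a MemoryBarrier() macro. It sets up a memory access barrier between the instructions immediately before and after it (the processor may not reorder memory accesses across it). On IA32 with VC8 it is implemented as: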

 
    FORCEINLINE VOID MemoryBarrier (VOID)
    {
        LONG Barrier;
        __asm {
            xchg Barrier, eax
        }
    }
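What a full barrier buys can be shown with the classic store-then-load test. The sketch below is portable C++11 and rests on an assumption: std::atomic_thread_fence(std::memory_order_seq_cst) is used as a rough analogue of MemoryBarrier(), not as the Windows macro itself. The function name BothReadZero is hypothetical.

```cpp
#include <atomic>
#include <thread>

// Each thread stores to its own flag, then loads the other thread's flag.
// Returns true if, in any iteration, both threads read 0.
bool BothReadZero(int iterations)
{
    for (int i = 0; i < iterations; ++i)
    {
        std::atomic<int> x(0), y(0);
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);   // full barrier
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);   // full barrier
            r2 = x.load(std::memory_order_relaxed);
        });

        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0)
            return true;   // a load was performed ahead of the earlier store
    }
    return false;
}
```

With both fences in place the C++ memory model rules out the outcome r1 == 0 and r2 == 0, so the function returns false. Without them that outcome is allowed, because a processor may perform a load ahead of an earlier store to a different address.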