Why Is a Critical Section's LockCount Less Than -1?
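One day, at the national laboratory at Zhejiang University, Lao Fang, Xiao Cui and I were debugging a deadlock in a monitoring system. The deadlock showed up on a row of rack-mounted servers in the cabinet. Looking at it in WinDbg, we found that the LockCount of the critical section involved in the deadlock was a number less than -1!!
We reproduced the problem many times and found that LockCount was often around minus two or three hundred.
In a spirit that was not exactly rigorous science, but not entirely unscientific either, we went through the motions of looking things up. Here is what we found: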
What LockCount means
ms-help://MS.MSDNQTR.v80.en/MS.MSDN.v80/dnmag03/html/CriticalSections1203default.htm or http://msdn.microsoft.com/zh-cn/magazine/cc164040(en-us).aspx
struct RTL_CRITICAL_SECTION
{
    PRTL_CRITICAL_SECTION_DEBUG DebugInfo;
    LONG LockCount;
    LONG RecursionCount;
    HANDLE OwningThread;
    HANDLE LockSemaphore;
    ULONG_PTR SpinCount;
};
The article describes the field as follows (quoted from the English original):
LockCount: This is the most important field in a critical section. It is initialized to a value of -1; a value of 0 or greater indicates that the critical section is held or owned. When it's not equal to -1, the OwningThread field (this field is incorrectly defined in WINNT.H; it should be a DWORD instead of a HANDLE) contains the thread ID that owns this critical section. The delta between this field and the value of (RecursionCount - 1) indicates how many additional threads are waiting to acquire the critical section.
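The arithmetic in that description can be written down as a small helper. This is only a sketch of the older interpretation quoted above; the function name `legacy_waiting_threads` is made up for illustration and is not part of any Windows API.

```c
#include <stdint.h>

/* Older (pre-Windows Server 2003 SP1) reading of the fields, following the
 * MSDN description: LockCount == -1 means the critical section is free;
 * otherwise LockCount - (RecursionCount - 1) is the number of additional
 * threads waiting. Hypothetical helper, for illustration only. */
int32_t legacy_waiting_threads(int32_t lock_count, int32_t recursion_count)
{
    if (lock_count == -1) {
        return 0; /* nobody owns the section, so nobody is waiting */
    }
    return lock_count - (recursion_count - 1);
}
```

For example, a section entered once by its owner with nobody waiting has LockCount 0 and RecursionCount 1, which gives 0 waiters; LockCount 3 with RecursionCount 2 gives 2 waiters.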
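How the value of LockCount changes
Many articles online, working from how critical sections operate, summarize the two functions that change LockCount with pseudocode like this: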
_RtlTryEnterCriticalSection
if (CriticalSection->LockCount == -1)
{
    // The critical section is free
    CriticalSection->LockCount = 0;
    CriticalSection->OwningThread = TEB->ClientID;
    CriticalSection->RecursionCount = 1;
    return TRUE;
}
else
{
    if (CriticalSection->OwningThread == TEB->ClientID)
    {
        // The critical section is already held by the current thread
        CriticalSection->LockCount++;
        CriticalSection->RecursionCount++;
        return TRUE;
    }
    else
    {
        // The critical section is held by another thread
        return FALSE;
    }
}
_RtlLeaveCriticalSection
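if (--CriticalSection->RecursionCount == 0)
{
    // The owner has fully released the critical section
    CriticalSection->OwningThread = 0;
    // A result of -1 means nobody is waiting; 0 or more means waiters exist
    if (--CriticalSection->LockCount >= 0)
    {
        // Other threads are still blocked on the critical section
        _RtlpUnWaitCriticalSection(CriticalSection);
    }
}
else
{
    // Still held recursively by the owner
    --CriticalSection->LockCount;
}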
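From the above, two things can be inferred fairly clearly:
1. RecursionCount can drop below its initial value of 0 if LeaveCriticalSection is called more times than EnterCriticalSection (we have confirmed this in practice).
2. LockCount can only be greater than or equal to its initial value of -1 (at least as long as Enter and Leave calls are balanced).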
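To make the second point concrete, here is a minimal single-threaded model of the pseudocode above. It is a sketch, not the real ntdll implementation: the `ModelCs` type and the `model_*` functions are invented for illustration, thread IDs are plain integers, and the wake-up of waiters is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the legacy fields; names are hypothetical. */
typedef struct {
    int32_t  lock_count;      /* -1 when free */
    int32_t  recursion_count; /* 0 when free */
    uint32_t owning_thread;   /* 0 when free */
} ModelCs;

void model_init(ModelCs *cs)
{
    cs->lock_count = -1;
    cs->recursion_count = 0;
    cs->owning_thread = 0;
}

/* Mirrors the _RtlTryEnterCriticalSection pseudocode. */
bool model_try_enter(ModelCs *cs, uint32_t tid)
{
    if (cs->lock_count == -1) {
        cs->lock_count = 0;
        cs->owning_thread = tid;
        cs->recursion_count = 1;
        return true;
    }
    if (cs->owning_thread == tid) {
        cs->lock_count++;
        cs->recursion_count++;
        return true;
    }
    return false;
}

/* Mirrors the _RtlLeaveCriticalSection pseudocode, minus waking waiters. */
void model_leave(ModelCs *cs)
{
    if (--cs->recursion_count == 0) {
        cs->owning_thread = 0;
    }
    --cs->lock_count;
}

/* Enters `depth` times, leaves `depth` times, and returns the smallest
 * lock_count observed along the way. */
int32_t model_min_lock_count(int depth)
{
    ModelCs cs;
    model_init(&cs);
    int32_t min_seen = cs.lock_count;
    for (int i = 0; i < depth; i++) {
        model_try_enter(&cs, 1);
        if (cs.lock_count < min_seen) min_seen = cs.lock_count;
    }
    for (int i = 0; i < depth; i++) {
        model_leave(&cs);
        if (cs.lock_count < min_seen) min_seen = cs.lock_count;
    }
    return min_seen;
}
```

With balanced calls the model never goes below -1: three nested enters take lock_count through 0, 1, 2, and the three matching leaves bring it back through 1, 0, -1.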
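Once again, theory did not seem to match the facts!
We let our imaginations run wild and came up with a few possibilities:
1. EnterCriticalSection was aborted by an exception halfway through.
The odds of this are small, and even if it happened, there is no obvious reason why LockCount would end up somewhere as absurd as minus two or three hundred.
2. Memory corruption overwrote the RTL_CRITICAL_SECTION structure.
But we could not confirm any of these guesses.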
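Then, by pure chance -_-!!!, while experimenting on my own computer I also saw a LockCount less than -1! And it happened every single time!
My computer runs Windows Vista, so the natural guess was:
On certain operating system versions, the LockCount mechanism is simply different!!
This guess looked plausible, so we set out to verify it at once. The machines in the lab that showed the problem all ran Windows 2003 + SP1. We immediately tested on a Windows 2003 + SP1 system with a very simple program that creates a critical section and then calls EnterCriticalSection. Sure enough, LockCount became -2! And in multithreaded tests we did indeed see values around minus two or three hundred.
So the meaning of LockCount really does differ between Windows versions.
After that we searched the web many times for how the meaning of LockCount changed across Windows versions, but got nowhere.
Then, by another stroke of luck -_-!!!, Lao Fang found a passage in the WinDbg help documentation describing the change to LockCount. We had worn out iron shoes searching in vain, and then found it with no effort at all. Here it is in full: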
Interpreting Critical Section Fields in Windows Server 2003 SP1 and Later
In Microsoft Windows Server 2003 Service Pack 1 and later versions of Windows, the LockCount field is parsed as follows:
The lowest bit shows the lock status. If this bit is 0, the critical section is locked; if it is 1, the critical section is not locked. The next bit shows whether a thread has been woken for this lock. If this bit is 0, then a thread has been woken for this lock; if it is 1, no thread has been woken. The remaining bits are the ones-complement of the number of threads waiting for the lock.
As an example, suppose the LockCount is -22. The lowest bit can be determined in this way:
0:009> ? 0x1 & (-0n22)
Evaluate expression: 0 = 00000000
The next-lowest bit can be determined in this way:
0:009> ? (0x2 & (-0n22)) >> 1
Evaluate expression: 1 = 00000001
The ones-complement of the remaining bits can be determined in this way:
0:009> ? ((-1) - (-0n22)) >> 2
Evaluate expression: 5 = 00000005
In this example, the first bit is 0 and therefore the critical section is locked. The second bit is 1, and so no thread has been woken for this lock. The complement of the remaining bits is 5, and so there are five threads waiting for this lock.
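The same bit arithmetic can be done outside the debugger. The following is a minimal sketch that follows the WinDbg description quoted above; the `LockCountInfo` type and `decode_lock_count` function are names invented here, not anything Windows provides. It works on a plain 32-bit integer copied out of the LockCount field.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of LockCount on Windows Server 2003 SP1 and later,
 * following the WinDbg help text. Hypothetical type, for illustration. */
typedef struct {
    bool     locked;          /* bit 0 == 0 means locked */
    bool     waiter_woken;    /* bit 1 == 0 means a thread has been woken */
    uint32_t waiting_threads; /* ones-complement of the remaining bits */
} LockCountInfo;

LockCountInfo decode_lock_count(int32_t lock_count)
{
    /* Work on the unsigned bit pattern so the shift is well defined. */
    uint32_t bits = (uint32_t)lock_count;
    LockCountInfo info;
    info.locked = (bits & 0x1u) == 0;
    info.waiter_woken = (bits & 0x2u) == 0;
    info.waiting_threads = (~bits) >> 2;
    return info;
}
```

Decoding -22 gives locked, no thread woken, and 5 waiting threads, matching the WinDbg example. Decoding -2, the value we saw after a single EnterCriticalSection, gives locked with 0 waiting threads, and the initial value -1 decodes as not locked.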
With that, the mystery was finally solved!
Reposted from: http://www.cppblog.com/woaidongmao/archive/2011/01/13/138474.aspx