Mixed DLL Loading analysis
这个问题还是关于ASP.NET的. 问题背景:
客户的ASP.NET发生了hang, CPU为0,客户端得不到任何反应.问题发生的时候抓取了hang dump
打开dump后,一如既往地用!threads命令检查发现一共有20个活动的CLR thread,其中没有标记为GC的线程. 接下来用~* e !clrstack命令查看对应的CLR Callstack, 发现:
0:017> ~* e !clrstack
Thread 0
Not a managed thread.
Thread 1
Not a managed thread.
Thread 2
Not a managed thread.
Thread 3
Not a managed thread.
Thread 4
ESP EIP
Thread 5
ESP EIP
Thread 6
ESP EIP
Thread 7
Not a managed thread.
Thread 8
Not a managed thread.
Thread 9
Not a managed thread.
Thread 10
ESP EIP
0x01a9ec48 0x7c82ed54 [FRAME: GCFrame]
0x01a9f7e4 0x7c82ed54 [FRAME: ECallMethodFrame] [DEFAULT] Class System.Reflection.Assembly System.Reflection.Assembly.nLoad(Class System.Reflection.AssemblyName,String,Boolean,Class System.Security.Policy.Evidence,Boolean,Class System.Reflection.Assembly,ByRef ValueClass System.Threading.StackCrawlMark)
0x01a9f808 0x799afb84 [DEFAULT] Class System.Reflection.Assembly System.Reflection.Assembly.InternalLoad(Class System.Reflection.AssemblyName,Boolean,Class System.Security.Policy.Evidence,ByRef ValueClass System.Threading.StackCrawlMark)
0x01a9f830 0x799ea867 [DEFAULT] Class System.Reflection.Assembly System.Reflection.Assembly.Load(Class System.Reflection.AssemblyName)
0x01a9f838 0x0cdd2bef [DEFAULT] Class System.Collections.Hashtable System.Web.UI.Util.GetReferencedAssembliesHashtable(Class System.Reflection.Assembly)
0x01a9f84c 0x0c119870 [DEFAULT] [hasThis] Boolean System.Web.Compilation.PreservedAssemblyEntry.LoadDataFromFileInternal(Boolean)
0x01a9f89c 0x0c1194a2 [DEFAULT] [hasThis] Boolean System.Web.Compilation.PreservedAssemblyEntry.LoadDataFromFile(Boolean)
0x01a9f8c4 0x0c118a20 [DEFAULT] Class System.Web.Compilation.PreservedAssemblyEntry System.Web.Compilation.PreservedAssemblyEntry.GetPreservedAssemblyEntry(Class System.Web.HttpContext,String,Boolean)
0x01a9f8dc 0x0c118764 [DEFAULT] [hasThis] Class System.Web.UI.ParserCacheItem System.Web.UI.TemplateParser.GetParserCacheItemFromPreservedCompilation()
0x01a9f8f4 0x0c11740b [DEFAULT] [hasThis] Class System.Web.UI.ParserCacheItem System.Web.UI.TemplateParser.GetParserCacheItemInternal(Boolean)
0x01a9f91c 0x0c1186ae [DEFAULT] [hasThis] Class System.Web.UI.ParserCacheItem System.Web.UI.TemplateParser.GetParserCacheItemWithNewConfigPath()
0x01a9f94c 0x0c1172dc [DEFAULT] [hasThis] Class System.Web.UI.ParserCacheItem System.Web.UI.TemplateParser.GetParserCacheItem()
0x01a9f97c 0x0c115d84 [DEFAULT] Class System.Type System.Web.UI.ApplicationFileParser.GetCompiledApplicationType(String,Class System.Web.HttpContext,ByRef Class System.Web.UI.ApplicationFileParser)
0x01a9f994 0x0c115c4c [DEFAULT] [hasThis] Void System.Web.HttpApplicationFactory.CompileApplication(Class System.Web.HttpContext)
0x01a9f9a4 0x0c115b4f [DEFAULT] [hasThis] Void System.Web.HttpApplicationFactory.Init(Class System.Web.HttpContext)
0x01a9f9e4 0x0c115953 [DEFAULT] Class System.Web.IHttpHandler System.Web.HttpApplicationFactory.GetApplicationInstance(Class System.Web.HttpContext)
0x01a9fa10 0x0be76a97 [DEFAULT] [hasThis] Void System.Web.HttpRuntime.ProcessRequestInternal(Class System.Web.HttpWorkerRequest)
0x01a9fa4c 0x0be76690 [DEFAULT] Void System.Web.HttpRuntime.ProcessRequest(Class System.Web.HttpWorkerRequest)
0x01a9fa58 0x0be7320d [DEFAULT] [hasThis] I4 System.Web.Hosting.ISAPIRuntime.ProcessRequest(I,I4)
0x01a9fb20 0x79217188 [FRAME: ContextTransitionFrame]
0x01a9fc00 0x79217188 [FRAME: ComMethodFrame]
Thread 11
Not a managed thread.
Thread 12
Not a managed thread.
Thread 13
Not a managed thread.
Thread 14
ESP EIP
Thread 15
Not a managed thread.
Thread 16
ESP EIP
Thread 17
ESP EIP
0x0d1aeaf4 0x7c82ed54 [FRAME: GCFrame]
0x0d1af690 0x7c82ed54 [FRAME: ECallMethodFrame] [DEFAULT] Class System.Reflection.Assembly System.Reflection.Assembly.nLoad(Class System.Reflection.AssemblyName,String,Boolean,Class System.Security.Policy.Evidence,Boolean,Class System.Reflection.Assembly,ByRef ValueClass System.Threading.StackCrawlMark)
0x0d1af6b4 0x799afb84 [DEFAULT] Class System.Reflection.Assembly System.Reflection.Assembly.InternalLoad(Class System.Reflection.AssemblyName,Boolean,Class System.Security.Policy.Evidence,ByRef ValueClass System.Threading.StackCrawlMark)
0x0d1af6dc 0x799afa15 [DEFAULT] Class System.Reflection.Assembly System.Reflection.Assembly.InternalLoad(String,Class System.Security.Policy.Evidence,ByRef ValueClass System.Threading.StackCrawlMark)
0x0d1af6ec 0x799c2014 [DEFAULT] Class System.Reflection.Assembly System.Reflection.Assembly.Load(String)
0x0d1af6f4 0x0c11ab17 [DEFAULT] Class System.Collections.Hashtable System.Web.UI.CompilationConfiguration.LoadAssemblies(Class System.Collections.Hashtable)
0x0d1af758 0x0c11a8df [DEFAULT] [hasThis] Void System.Web.UI.TemplateParser.AppendConfigAssemblies()
0x0d1af784 0x0c119d01 [DEFAULT] [hasThis] Void System.Web.UI.TemplateParser.PrepareParse()
0x0d1af794 0x0c119b35 [DEFAULT] [hasThis] Void System.Web.UI.TemplateParser.Parse()
0x0d1af7c4 0x0c119a03 [DEFAULT] [hasThis] Class System.Web.UI.ParserCacheItem System.Web.UI.TemplateParser.GetParserCacheItemThroughCompilation()
0x0d1af7f4 0x0c117434 [DEFAULT] [hasThis] Class System.Web.UI.ParserCacheItem System.Web.UI.TemplateParser.GetParserCacheItemInternal(Boolean)
0x0d1af81c 0x0c1186ae [DEFAULT] [hasThis] Class System.Web.UI.ParserCacheItem System.Web.UI.TemplateParser.GetParserCacheItemWithNewConfigPath()
0x0d1af84c 0x0c1172dc [DEFAULT] [hasThis] Class System.Web.UI.ParserCacheItem System.Web.UI.TemplateParser.GetParserCacheItem()
0x0d1af87c 0x0c115d84 [DEFAULT] Class System.Type System.Web.UI.ApplicationFileParser.GetCompiledApplicationType(String,Class System.Web.HttpContext,ByRef Class System.Web.UI.ApplicationFileParser)
0x0d1af894 0x0c115c4c [DEFAULT] [hasThis] Void System.Web.HttpApplicationFactory.CompileApplication(Class System.Web.HttpContext)
0x0d1af8a4 0x0c115b4f [DEFAULT] [hasThis] Void System.Web.HttpApplicationFactory.Init(Class System.Web.HttpContext)
0x0d1af8e4 0x0c115953 [DEFAULT] Class System.Web.IHttpHandler System.Web.HttpApplicationFactory.GetApplicationInstance(Class System.Web.HttpContext)
0x0d1af910 0x0be76a97 [DEFAULT] [hasThis] Void System.Web.HttpRuntime.ProcessRequestInternal(Class System.Web.HttpWorkerRequest)
0x0d1af94c 0x0be76690 [DEFAULT] Void System.Web.HttpRuntime.ProcessRequest(Class System.Web.HttpWorkerRequest)
0x0d1af958 0x0be7320d [DEFAULT] [hasThis] I4 System.Web.Hosting.ISAPIRuntime.ProcessRequest(I,I4)
0x0d1afa20 0x79217188 [FRAME: ContextTransitionFrame]
0x0d1afb00 0x79217188 [FRAME: ComMethodFrame]
Thread 18
ESP EIP
0x0ec5ce80 0x7c82ed54 [FRAME: GCFrame]
0x0ec5d378 0x7c82ed54 [FRAME: GCFrame]
0x0ec5eac8 0x7c82ed54 [FRAME: GCFrame]
0x0ec5f664 0x7c82ed54 [FRAME: ECallMethodFrame] [DEFAULT] Class System.Reflection.Assembly System.Reflection.Assembly.nLoad(Class System.Reflection.AssemblyName,String,Boolean,Class System.Security.Policy.Evidence,Boolean,Class System.Reflection.Assembly,ByRef ValueClass System.Threading.StackCrawlMark)
0x0ec5f688 0x799afb84 [DEFAULT] Class System.Reflection.Assembly System.Reflection.Assembly.InternalLoad(Class System.Reflection.AssemblyName,Boolean,Class System.Security.Policy.Evidence,ByRef ValueClass System.Threading.StackCrawlMark)
0x0ec5f6b0 0x799ea867 [DEFAULT] Class System.Reflection.Assembly System.Reflection.Assembly.Load(Class System.Reflection.AssemblyName)
0x0ec5f6b8 0x0cdd2bef [DEFAULT] Class System.Collections.Hashtable System.Web.UI.Util.GetReferencedAssembliesHashtable(Class System.Reflection.Assembly)
0x0ec5f6cc 0x0c119870 [DEFAULT] [hasThis] Boolean System.Web.Compilation.PreservedAssemblyEntry.LoadDataFromFileInternal(Boolean)
0x0ec5f71c 0x0c1194a2 [DEFAULT] [hasThis] Boolean System.Web.Compilation.PreservedAssemblyEntry.LoadDataFromFile(Boolean)
0x0ec5f744 0x0c118a20 [DEFAULT] Class System.Web.Compilation.PreservedAssemblyEntry System.Web.Compilation.PreservedAssemblyEntry.GetPreservedAssemblyEntry(Class System.Web.HttpContext,String,Boolean)
0x0ec5f75c 0x0c118764 [DEFAULT] [hasThis] Class System.Web.UI.ParserCacheItem System.Web.UI.TemplateParser.GetParserCacheItemFromPreservedCompilation()
0x0ec5f774 0x0c11740b [DEFAULT] [hasThis] Class System.Web.UI.ParserCacheItem System.Web.UI.TemplateParser.GetParserCacheItemInternal(Boolean)
0x0ec5f79c 0x0c1186ae [DEFAULT] [hasThis] Class System.Web.UI.ParserCacheItem System.Web.UI.TemplateParser.GetParserCacheItemWithNewConfigPath()
0x0ec5f7cc 0x0c1172dc [DEFAULT] [hasThis] Class System.Web.UI.ParserCacheItem System.Web.UI.TemplateParser.GetParserCacheItem()
0x0ec5f7fc 0x0c115d84 [DEFAULT] Class System.Type System.Web.UI.ApplicationFileParser.GetCompiledApplicationType(String,Class System.Web.HttpContext,ByRef Class System.Web.UI.ApplicationFileParser)
0x0ec5f814 0x0c115c4c [DEFAULT] [hasThis] Void System.Web.HttpApplicationFactory.CompileApplication(Class System.Web.HttpContext)
0x0ec5f824 0x0c115b4f [DEFAULT] [hasThis] Void System.Web.HttpApplicationFactory.Init(Class System.Web.HttpContext)
0x0ec5f864 0x0c115953 [DEFAULT] Class System.Web.IHttpHandler System.Web.HttpApplicationFactory.GetApplicationInstance(Class System.Web.HttpContext)
0x0ec5f890 0x0be76a97 [DEFAULT] [hasThis] Void System.Web.HttpRuntime.ProcessRequestInternal(Class System.Web.HttpWorkerRequest)
0x0ec5f8cc 0x0be76690 [DEFAULT] Void System.Web.HttpRuntime.ProcessRequest(Class System.Web.HttpWorkerRequest)
0x0ec5f8d8 0x0be7320d [DEFAULT] [hasThis] I4 System.Web.Hosting.ISAPIRuntime.ProcessRequest(I,I4)
0x0ec5f9a0 0x79217188 [FRAME: ContextTransitionFrame]
0x0ec5fa80 0x79217188 [FRAME: ComMethodFrame]
Thread 19
ESP EIP
0x0ecaee18 0x7c82ed54 [FRAME: ECallMethodFrame] [DEFAULT] String System.Security.Util.Config._GetMachineDirectory()
0x0ecaee28 0x799a0a9f [DEFAULT] Void System.Security.Util.Config.GetFileLocales()
0x0ecaee34 0x799a07b7 [DEFAULT] [hasThis] Void System.Security.PolicyManager.InitData()
0x0ecaee50 0x799a072b [DEFAULT] Boolean System.Security.SecurityManager.InitPolicy()
0x0ecaee78 0x799a046b [DEFAULT] Class System.Security.PermissionSet System.Security.SecurityManager.ResolvePolicy(Class System.Security.Policy.Evidence,Class System.Security.PermissionSet,Class System.Security.PermissionSet,Class System.Security.PermissionSet,ByRef Class System.Security.PermissionSet,Boolean)
0x0ecaeeb8 0x799a02b1 [DEFAULT] Class System.Security.PermissionSet System.Security.SecurityManager.ResolvePolicy(Class System.Security.Policy.Evidence,Class System.Security.PermissionSet,Class System.Security.PermissionSet,Class System.Security.PermissionSet,ByRef Class System.Security.PermissionSet,ByRef I4,Boolean)
0x0ecaf180 0x791b3208 [FRAME: GCFrame]
0x0ecaf694 0x791b3208 [FRAME: ECallMethodFrame] [DEFAULT] [hasThis] Void System.AppDomain.SetupDomainSecurity(String,Class System.Security.Policy.Evidence,Class System.Security.Policy.Evidence,I)
0x0ecaf6b0 0x799cd45b [DEFAULT] Void System.AppDomain.InternalRemotelySetupRemoteDomainHelper(String,Class System.AppDomainSetup,I,SZArray Char,SZArray Char,SZArray UI1)
0x0ecaf6d8 0x799cd2c3 [DEFAULT] Void System.AppDomain.InternalRemotelySetupRemoteDomain(I4,I4,String,Class System.AppDomainSetup,I,SZArray Char,SZArray Char,SZArray UI1)
0x0ecaf6e4 0x799ccfda [FRAME: ContextTransitionFrame]
0x0ecaf738 0x799ccfda [DEFAULT] Void System.AppDomain.RemotelySetupRemoteDomain(Class System.AppDomain,String,Class System.AppDomainSetup,Class System.Security.Policy.Evidence,Class System.Security.Policy.Evidence,I)
0x0ecaf774 0x799c5a9a [DEFAULT] Class System.AppDomain System.AppDomain.CreateDomain(String,Class System.Security.Policy.Evidence,Class System.AppDomainSetup)
0x0ecaf790 0x0be70507 [DEFAULT] [hasThis] Object System.Web.Hosting.AppDomainFactory.Create(String,String,String,String,String,I4)
0x0ecaf96c 0x79217188 [FRAME: ComMethodFrame]
Thread 20
Not a managed thread.
Thread 21
Not a managed thread.
Thread 22
…
Thread 30
ESP EIP
Thread 31
Not a managed thread.
Thread 32
ESP EIP
Thread 33
ESP EIP
Thread 34
ESP EIP
Thread 35
Not a managed thread.
Thread 36
Not a managed thread.
Thread 37
…
Thread 50
ESP EIP
Thread 51
Not a managed thread.
…
Thread 64
Not a managed thread.
受到篇幅限制,中间省略了部分Not a managed thread和空callstack的结果.从上面的输出来看,虽然CLR thread有20个,但大多数都是idle的.发生blocking,看得到callstack的CLR thread只有四个,其中三个等在Assembly.Load上面.似乎在加载Assembly的时候出问题了.
在前面看到callstack停留在SqlCommand.Execute的时候,很容易理解对应的thread在等待SQL操作,大多数情况是在等待SQL服务器的执行结果,但是也看到等待GC的例子。
但这里很难理解什么直接的理由可以导致Assembly.Load发生堵塞.由于最终堵塞线程的必然是API调用,比如EnterCriticalSection, WaitForSingleObject,所以检查对应的unmanaged callstack是进一步弄清楚问题的下一步,切换到第一个活动的CLR 线程检查:
0:010> kb100
ChildEBP RetAddr Args to Child
01a9db1c 7c822124 7c83970f 00000154 00000000 ntdll!KiFastSystemCallRet
01a9db20 7c83970f 00000154 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
01a9db5c 7c839620 00000000 00000004 00000001 ntdll!RtlpWaitOnCriticalSection+0x19c
01a9db7c 7c832ad0 7c889d94 00000000 00000000 ntdll!RtlEnterCriticalSection+0xa8
01a9dbb0 7c833ea9 00000001 00000000 01a9dbec ntdll!LdrLockLoaderLock+0xe4
01a9de20 77e41c90 1c68d7b8 01a9de6c 01a9de4c ntdll!LdrLoadDll+0xc9
01a9de88 791e0610 368ec3b8 00000000 00000009 kernel32!LoadLibraryExW+0x1b2
01a9dea4 791ea056 368ec3b8 00000000 00000009 mscorsvr!WszLoadLibraryEx+0x5f
01a9decc 791e7e58 368ec50c 36922588 00129000 mscorsvr!CorMap::BaseAddress+0x8b
01a9dee8 791e7ec4 36922378 368ec50c 00000000 mscorsvr!PEFile::Setup+0x45
01a9df04 791ea014 368ec50c 01a9e018 00000000 mscorsvr!PEFile::Create+0x38
01a9df1c 791ea117 368ec50c 368ec50c 01a9e018 mscorsvr!PEFile::CreateImageFile+0x3a
01a9df90 791ea151 00000001 00000000 3691f3b8 mscorsvr!PEFile::VerifyModule+0x76
01a9dfd4 791e88f6 01a9e248 00000000 26000000 mscorsvr!PEFile::Create+0x11a
01a9e01c 791e94d8 01a9e248 00000000 26000000 mscorsvr!SystemDomain::LoadFile+0x184
01a9e46c 791e8084 1d2312d8 3685d0f0 01a9f56c mscorsvr!AssemblySpec::GetAssemblyFromFusion+0x630
01a9e6f4 791e7f4b 00000000 01a9e7ac 00000001 mscorsvr!AssemblySpec::LowLevelLoadManifestFile+0x17d
01a9e714 791e2c95 01a9f53c 01a9e7a4 01a9e7ac mscorsvr!AppDomain::BindAssemblySpec+0x50
01a9ec24 791eb38b 01a9eca0 01a9eca8 01a9f7bc mscorsvr!AssemblySpec::LoadAssembly+0x98
01a9f7a4 01bab9ac 00000000 01a9f830 00000000 mscorsvr!AssemblyNative::Load+0x407
WARNING: Frame IP not in any known module. Following frames may be wrong.
01a9f820 799ea867 01a9f830 00000000 00000001 0x1bab9ac
01a9f824 01a9f830 00000000 00000001 0cdd2bef mscorlib_79990000+0x5a867
01a9f828 00000000 00000001 0cdd2bef 01a9f894 0x1a9f830
从callstack上看到,CLR最后调用到LoadLibrary API上,在试图获取DLL Loader Lock的时候堵塞了。
DLL Loader Lock是OS维护的一个锁,用来确保系统加载DLL时候的顺序和同步。该锁是通过CriticalSection来实现的。在一个DLL的DLLMain里面不允许调用LoadLibrary加载其它的DLL,也跟该锁相关。详细的信息可以搜索网上的资料。
这里最后堵塞在CriticalSection上,根据第二章的介绍,!cs命令可以派上用途了。看看谁占用了这个CriticalSection没有释放:
0:010> !cs 7c889d94
-----------------------------------------
Critical section = 0x7c889d94 (ntdll!LdrpLoaderLock+0x0)
DebugInfo = 0x7c889dc0
LOCKED
LockCount = 0x12
WaiterWoken = No
OwningThread = 0x00000948
RecursionCount = 0x2
LockSemaphore = 0x154
SpinCount = 0x00000000
0:010> ~~[0x00000948]
18 Id: cc4.948 Suspend: 1 Teb: 7ffa3000 Unfrozen
Start: mscorsvr!ThreadpoolMgr::intermediateThreadProc (791fda8b)
Priority: 0 Priority class: 32 Affinity: 3
于是切换到线程18检查:
0:010> ~18s
eax=449db1b4 ebx=00000000 ecx=791b4ef3 edx=00000001 esi=793ef47c edi=00000588
eip=7c82ed54 esp=0ec5c384 ebp=0ec5c3c0 iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
ntdll!KiFastSystemCallRet:
7c82ed54 c3 ret
0:018> kb100
ChildEBP RetAddr Args to Child
0ec5c380 7c822124 7c83970f 00000588 00000000 ntdll!KiFastSystemCallRet
0ec5c384 7c83970f 00000588 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
0ec5c3c0 7c839620 00000000 00000004 0ec5c65c ntdll!RtlpWaitOnCriticalSection+0x19c
0ec5c3e0 791b2f0c 793ef47c 0ec5c618 791e78ab ntdll!RtlEnterCriticalSection+0xa8
0ec5c3ec 791e78ab 793ef47c 791e7c89 0c067cb8 mscorsvr!EE_EnterCriticalSection+0xc
0ec5c3f4 791e7c89 0c067cb8 00000000 00000000 mscorsvr!CorMap::Enter+0x13
0ec5c618 791e7d32 0000009c 00000003 0ec5c65c mscorsvr!CorMap::OpenFile+0x73
0ec5c648 791e88f6 0ec5c8bc 00000000 26000000 mscorsvr!PEFile::Create+0xc3
0ec5c690 791e94d8 0ec5c8bc 00000000 26000000 mscorsvr!SystemDomain::LoadFile+0x184
0ec5cae0 791e8084 1d2130b8 368f3390 0ec5cf2c mscorsvr!AssemblySpec::GetAssemblyFromFusion+0x630
0ec5cd68 791e7f4b 00000000 0ec5cee0 00000001 mscorsvr!AssemblySpec::LowLevelLoadManifestFile+0x17d
0ec5cd88 79208071 0ec5cefc 0ec5cef8 0ec5cee0 mscorsvr!AppDomain::BindAssemblySpec+0x50
0ec5d19c 79207d79 00000003 00000024 00000000 mscorsvr!Assembly::ComputeBindingDependenciesClosure+0x25a
0ec5d1c4 791e31ca 369291b8 369210e0 0ec5d324 mscorsvr!BaseDomain::CreateShareableAssemblyNoLock+0x7e
0ec5d340 79258d85 36928d80 369210e0 0ec5d3cc mscorsvr!BaseDomain::LoadAssembly+0xacd
0ec5d5e0 79019ad7 00000000 00000001 00000000 mscorsvr!ExecuteDLL+0x281
0ec5d5fc 7901a789 44940000 00000001 00000000 mscoree!CorDllMainWorker+0x6c
0ec5d638 7c82257a 44940000 00000000 00000000 mscoree!_CorDllMain+0x106
0ec5d658 7c8358fb 449594e2 44940000 00000001 ntdll!LdrpCallInitRoutine+0x14
0ec5d760 7c835bcb 00000000 00000000 00000000 ntdll!LdrpRunInitializeRoutines+0x367
0ec5d9f4 7c833ee5 00000000 1c626830 0ec5dcbc ntdll!LdrpLoadDll+0x3cd
0ec5dc70 77e41c90 1c626830 0ec5dcbc 0ec5dc9c ntdll!LdrLoadDll+0x198
0ec5dcd8 791e0610 36821d50 00000000 00000008 kernel32!LoadLibraryExW+0x1b2
0ec5dcf4 791ea056 36821d50 00000000 00000008 mscorsvr!WszLoadLibraryEx+0x5f
0ec5dd1c 791e7e58 36821ed4 368f7740 368f7530 mscorsvr!CorMap::BaseAddress+0x8b
0ec5dd38 7922c8ba 368f7530 36821ed4 00000000 mscorsvr!PEFile::Setup+0x45
0ec5ddac 791ea151 00000000 36922378 369210e0 mscorsvr!PEFile::VerifyModule+0x195
0ec5ddf0 791e88f6 0ec5e0c8 00000000 26000000 mscorsvr!PEFile::Create+0x11a
0ec5de38 791e94d8 0ec5e0c8 00000000 26000000 mscorsvr!SystemDomain::LoadFile+0x184
0ec5e2ec 791e8084 1d2130b8 368cc008 0ec5f3ec mscorsvr!AssemblySpec::GetAssemblyFromFusion+0x630
0ec5e574 791e7f4b 00000000 0ec5e62c 00000001 mscorsvr!AssemblySpec::LowLevelLoadManifestFile+0x17d
0ec5e594 791e2c95 0ec5f3bc 0ec5e624 0ec5e62c mscorsvr!AppDomain::BindAssemblySpec+0x50
0ec5eaa4 791eb38b 0ec5eb20 0ec5eb28 0ec5f63c mscorsvr!AssemblySpec::LoadAssembly+0x98
0ec5f624 01bab9ac 00000000 0ec5f6b0 00000000 mscorsvr!AssemblyNative::Load+0x407
WARNING: Frame IP not in any known module. Following frames may be wrong.
0ec5f6a0 799ea867 0ec5f6b0 00000000 00000001 0x1bab9ac
0ec5f6a4 0ec5f6b0 00000000 00000001 0cdd2bef mscorlib_79990000+0x5a867
0ec5f6a8 00000000 00000001 0cdd2bef 0ec5f714 0xec5f6b0
这个callstack好长。理清楚逻辑,Assembly.Load调用被分发到CLR Runtime,CLR Runtime调用API LoadLibaray加载DLL,在加载结束以前,系统会用DLL_PROCESS_ATTACH作为dwReason调用当前进程里面所有的DLL的Entry Point (DLLMain). 所以mscoree!_CorDllMain被调用。Mscoree!_CorDllMain再次调入CLR Runtime做一些DLL/Assembly相关的操作,操作的过程中再次等待到某一个CriticalSectoin上
那再次检查对应的CriticalSection被谁占用:
0:018> !cs 793ef47c
-----------------------------------------
Critical section = 0x793ef47c (mscorsvr!CorMap::m_pCorMapCrst+0x0)
DebugInfo = 0x000ddfd0
LOCKED
LockCount = 0x1
WaiterWoken = No
OwningThread = 0x00000c6c
RecursionCount = 0x1
LockSemaphore = 0x588
SpinCount = 0x00000000
0:018> ~~[0x00000c6c]
17 Id: cc4.c6c Suspend: 1 Teb: 7ffa4000 Unfrozen
Start: mscorsvr!ThreadpoolMgr::intermediateThreadProc (791fda8b)
Priority: 0 Priority class: 32 Affinity: 3
Thread 17?! 原来thread 10在等thread 17, 而thread 17跟thread 18发生deadlock在互相等待。下一步需要弄清的就是thread 17根thread 18的CriticalSection到底有什么依赖关系。
前面已经看到thread 17在等待loader lock. 在thread 18的callstack上,LoadLibrary的调用还没有退出,所以loader lock没有释放,thread 17等待thread 18的原因很清楚。那thread 18为何需要等待thread 17呢?检查thread 18里EnterCriticalSection的caller, 是mscorsvr!CorMap::Enter。从名字上看,该CriticalSection应该是用来同步CorMap资源的,而CorMap应该是CLR的一个内部实现类。回头看thread 17, thread 17中的LoadLibrary的caller是mscorsvr!CorMap::BaseAddress!原来CorMap的相关函数在thread 17中的调用也没有结束,所以thread 18应该是在等待thread 17没有释放的CorMap锁
原因看明白了,解决方法呢?整理一下思路,这里无论是managed callstack还是unmanaged callstack,所有的参与函数都是CLR Runtime和OS API,中间没有涉及到任何客户自己的函数。说明问题在很大程度上跟客户的具体应用没有关系,很有可能是CLR或者OS的bug
关于CLR跟loader lock的冲突,最臭名昭著的就是前面介绍过的mixed DLL loading:
Mixed DLL Loading Problem
为了看看是不是足够幸运地遇上这个问题,接下来尝试去找到引发问题的DLL的名字,看看是不是用managed C++写的。检查LoadLibaray 的第一个参数可以看到:
0:018> du 36821d50
36821d50 "c:/windows/microsoft.net/framewo"
36821d90 "rk/v1.1.4322/temporary asp.net f"
36821dd0 "iles/root/c01b9164/6d04ba4/assem"
36821e10 "bly/dl2/ce86c938/007079bd_9d34c6"
36821e50 "01/Company.product.usage"
36821e90 "name.Class1.dll"
0:018> du 368ec3b8
368ec3b8 "c:/windows/microsoft.net/framewo"
368ec3f8 "rk/v1.1.4322/temporary asp.net f"
368ec438 "iles/root/6882f6e1/6ac01cc1/asse"
368ec478 "mbly/dl2/a3ad3432/5b25229b_ffa0c"
368ec4b8 "601/Company.Class2.dll"
上面的输出提供了两个信息:
1. DLL的具体名字
2. 从DLL的路径可以看到,该DLL是从ASP.NET临时文件架加载的。所以该DLL应该是被部署到了ASP.NET的bin目录下
从dump中用lmf命令找出这两个DLL,用reflector检查后发现下面的信息:
1. 这两个Assembly是StrongName的
2. 这两个Assembly是用managed C++开发的,因为里面包含了下面的类型信息,这些类型信息是MC++编译器生成的
[StructLayout(LayoutKind.Explicit, Size=520, Pack=1), DebugInfoInPDB]
internal struct $ArrayType$0x47914e9e
{
}
所以,这里不幸遭遇了mixed dll loading的问题。在CLR 1上面,该问题是没有完美解决方案的。所以最显而易见的方法是升级到CLR 2.0. 但是,为了挑战一下自己,看看能不能在现有的基础上尽量做点什么。
首先,从callstack上看到,问题根源在于两个thread互相等待。其实,只要DLL Loader早一点释放,或者CorMap lock早一点释放,问题都不会发生。也就是说,问题并非想重现就可以找到恰当的时机重现的。客户的程序肯定是执行了相同的code path很多次,终于有一次遇上了死锁的条件。换句话说,要避免这个问题,只需要尽可能地减少code path执行的频率就可以了。导致问题的code path用来加载DLL,为何需要频繁地加载Assembly呢?由于Assembly不能单独卸载,必须跟appDomain一起卸载,所以接下来可以用!dumpdomain看看appDomain跟Assenbly的情况:
0:018> !dumpdomain -stat
Domain Num Assemblies Size Assemblies Name
0x793f16a0 1 2,138,112 System Domain
0x793f2b70 328 288,662,016 Shared Domain
0x000cf478 2 2,490,368 DefaultDomain
0x001334b0 77 20,086,784 /LM/W3SVC/262227220/Root-1-127973823709142110
…
0x1d2312d8 27 16,484,864 /LM/W3SVC/1064537246/Root-36-127973891453478142
0x368594b8 11 7,123,456 /LM/W3SVC/182767242/Root-37-127973891487384175
0x36915188 0 0 Domain39
Total 41 Domains
哇,这么多appDomain。再去掉-stat参数看看每个appDomain记载的信息,发现上面两个Assembly在超过30个appDomain中都有加载。由此看来,这两个MC++ Assembly中包含的应该是一些公用的组件,由于在多个应用中都会用到,所以分别部署到了各个Virtual Directory的bin目录下。有超过30个Virtual Directory被分配到了同一个app pool, 每一个Virtual Directory对应一个appDomain, 同一个w3wp.exe进程中频繁的appDomian加载和卸载触发了这个问题。所以可以采取下面的方法来尽可能地避免问题的发生:
1. 检查是否有防毒程度频繁扫描web.config文件导致appDomain的重启。尽量避免对Virtual Directory的修改来避免appDomain的重启
2. 把多个Virtual Directory合并,以便共同引用同一套MC++ Assembly,减少MC++ Assembly的加载
3. 拆分Virtual Directory到多个不同的app pool.由于Loader lock根CorMap lock使用CriticalSection,是进程级别的锁,所以分割到多个进程可以减少触发问题的频率
4. 既然这两个Assembly是Strong Named的,应该放到GAC中,而不是bin目录下,详细参考:
ASP.NET: Strong named assemblies should not be stored in the bin directory
http://blogs.msdn.com/tess/archive/2006/04/13/575361.aspx