故障现象
Win7x64系统,多见于刚展开Windows桌面。鼠标一直转圈。Windows内新建进程没反应,但是可浏览文件夹,ctrl+alt+del可呼出winlogon桌面。但是任务管理器点击没反应,win+r可键入notepad,cmd等,但是均无反应。直到5-15分钟后自然恢复。
难以复现,但每天总有出现。
初步怀疑
故障时触发ScrollLock蓝屏,分析dmp,发现许多新进程的头号线程都卡在nt!KiStartUserThread。
THREAD fffffa80084de060 Cid 0a38.09c8 Teb: 000007fffffdd000 Win32Thread: 0000000000000000 INITIALIZED
Not impersonating
Owning Process fffffa8006553b00 Image: cmd.exe
Attached Process N/A Image: N/A
Wait Start TickCount 0 Ticks: 11123 (0:00:02:53.796)
Context Switch Count 0 IdealProcessor: 1
UserTime 00:00:00.000
KernelTime 00:00:00.000
Win32 Start Address 0x000000004a1990b4
Stack Init fffff88003219c70 Current fffff88003219960
Base fffff8800321a000 Limit fffff88003214000 Call 0000000000000000
Priority 8 BasePriority 8 PriorityDecrement 0 IoPriority 2 PagePriority 5
Child-SP RetAddr Call Site
fffff880`032199a0 fffff800`054eee30 nt!KiStartUserThread
fffff880`03219ae0 00000000`77573840 nt!KiStartUserThreadReturn (TrapFrame @ fffff880`03219ae0)
00000000`001bffb8 00000000`00000000 0x77573840
explorer作为父进程,观测其某些线程,发现确实执行的是win+r相关的功能。
fffffa800807ab48 NotificationEvent
fffff880032c4ab8 NotificationEvent
Child-SP | RetAddr | Call Site |
---|---|---|
fffff880`032c4510 | fffff800`0548df12 | nt!KiSwapContext+0x7a |
fffff880`032c4650 | fffff800`054a0e22 | nt!KiCommitThreadWait+0x1d2 |
fffff880`032c46e0 | fffff800`057154b4 | nt!KeWaitForMultipleObjects+0x272 |
fffff880`032c49a0 | fffff880`00dae85a | nt!FsRtlCancellableWaitForMultipleObjects+0xac |
fffff880`032c4a00 | fffff880`052ff55d | FLTMGR!FltSendMessage+0x4ea |
fffff880`032c4b30 | fffff880`053174c3 | 360FsFlt+0xd55d |
fffff880`032c4c10 | fffff880`0530b984 | 360FsFlt+0x254c3 |
fffff880`032c4d10 | fffff800`057308b6 | 360FsFlt+0x19984 |
fffff880`032c4dc0 | fffff800`0593277f | nt!PspInsertThread+0x61a |
fffff880`032c4f40 | fffff800`054f6bd3 | nt!NtCreateUserProcess+0x94f |
fffff880`032c5a70 | 00000000`7758a35a | nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff880`032c5ae0) |
00000000`04e6cad8 | 00000000`771ee3d5 | ntdll!NtCreateUserProcess+0xa |
00000000`04e6cae0 | 00000000`771e03dc | kernel32!CreateProcessInternalW+0x12e7 |
00000000`04e6d5b0 | 000007fe`fe3fb943 | kernel32!CreateProcessW+0x6c |
00000000`04e6d620 | 000007fe`fe3fb6ab | SHELL32!_SHCreateProcess+0x32b |
00000000`04e6d7f0 | 000007fe`fe3fb547 | SHELL32!CExecuteApplication::_CreateProcess+0x15f |
00000000`04e6d890 | 000007fe`fe3fb463 | SHELL32!CExecuteApplication::_TryCreateProcess+0x126 |
00000000`04e6d8f0 | 000007fe`fe3fa62e | SHELL32!CExecuteApplication::_DoApplication+0x198 |
00000000`04e6d950 | 000007fe`fe4124e4 | SHELL32!CExecuteApplication::Execute+0x3e |
00000000`04e6d980 | 000007fe`fe412393 | SHELL32!CExecuteAssociation::_DoCommand+0xb0 |
00000000`04e6d9d0 | 000007fe`fe41375c | SHELL32!CExecuteAssociation::Execute+0xbf |
00000000`04e6da30 | 000007fe`fe413963 | SHELL32!CRegDataDrivenCommand::_Invoke+0x10d |
00000000`04e6dac0 | 000007fe`fe41389b | SHELL32!CRegistryVerbsContextMenu::_Execute+0x77 |
00000000`04e6db20 | 000007fe`fe4135ee | SHELL32!CRegistryVerbsContextMenu::InvokeCommand+0x102 |
00000000`04e6de10 | 000007fe`fe41346e | SHELL32!HDXA_LetHandlerProcessCommandEx+0x144 |
00000000`04e6df20 | 000007fe`fe4fa1bc | SHELL32!CDefFolderMenu::InvokeCommand+0x254 |
00000000`04e6e290 | 000007fe`fe4fa0b0 | SHELL32!CShellExecute::_InvokeInProcExec+0xff |
00000000`04e6e380 | 000007fe`fe4fa043 | SHELL32!CShellExecute::_InvokeCtxMenu+0x48 |
00000000`04e6e3c0 | 000007fe`fe3fcd36 | SHELL32!CShellExecute::_DoExecute+0xb4 |
00000000`04e6e400 | 000007fe`fe5957f1 | SHELL32!CShellExecute::ExecuteNormal+0x132 |
00000000`04e6e430 | 000007fe`fe4ab561 | SHELL32!ShellExecuteExW+0x129 |
00000000`04e6e690 | 000007fe`fe4ab389 | SHELL32!ShellExecCmdLine+0x1a7 |
00000000`04e6e9b0 | 000007fe`fe4ab18a | SHELL32!CRunDlg::OKPushed+0x29e |
00000000`04e6ef00 | 00000000`7742e53b | SHELL32!RunDlgProc+0x1cf |
00000000`04e6efb0 | 00000000`00000001 | USER32!GetCapture+0x40b |
初步怀疑是CPU调度问题,使得新进程的头号线程一直没能得到CPU时间片。但是其实不对。新线程并未完成初始化,故而自然没有权限执行。问题可能在于父进程并未能顺利执行下去,而是被360FsFlt+0x19984
卡住。经过试验,如果注册了PsSetCreateProcessNotifyRoutine,并且在回调函数里Sleep,那么父进程的这个线程就会被卡住,而且子进程的头号线程也是卡在KiStartUserThread
。
双机调试
开启双击调试环境,在Windows刚启动时尽早attach kernel。给nt!PsSetCreateProcessNotifyRoutine下断点。发现360FsFlt.sys果然注册了这个回调。
接下来观测卡住explorer线程的那两个NotificationEvent是由谁来解锁的。
那么反汇编360FsFlt+0xd55d
附近的代码,可知360FsFlt+0xd558
这个地址时,调用了FLTMGR!FltSendMessage
。于是给explorer.exe(进程地址fffffa80`ca24eb00)的360FsFlt+0xd558
位置下断点
bp /p fffffa80\
ca24eb00 360FsFlt+0xd558`
在win+r处输入notepad,即可触发断点。此时再给该线程fffffa80c8217b50设置断点nt!KeWaitForMultipleObjects。
bp /t fffffa80c8217b50 nt!KeWaitForMultipleObjects
这里有个坑,我们关注的调用栈是FLTMGR!FltSendMessage+0x4ea
,第一个断点不在这个调用栈下。
继续给nt!KeWaitForMultipleObjects设置断点
bp /t fffffa80c8217b50 nt!KeWaitForMultipleObjects
查看nt!KeWaitForMultipleObjects函数原型,
NTSTATUS
KeWaitForMultipleObjects (
ULONG Count,
PVOID Object[],
WaitType,
KWAIT_REASON WaitReason,
KPROCESSOR_MODE WaitMode,
BOOLEAN Alertable,
PLARGE_INTEGER Timeout,
PKWAIT_BLOCK WaitBlockArray
);
一般地,Windows x64位的c、c++函数前四个入参使用寄存器传参,从左到右为,rcx,rdx,r8,r9。那么r
命令查看寄存器,
rax=ffffffffffffee81 rbx=00000000004f1a63 rcx=0000000000000002
rdx=fffff88013af9a70 rsi=0000000000000000 rdi=fffff88013af9a98
rip=fffff80001c91d60 rsp=fffff88013af9998 rbp=0000000000000002
r8=0000000000000001 r9=0000000000000000 r10=fffff80001c52000
r11=fffff88013af9ab8 r12=0000000000000000 r13=fffff88013af9a70
r14=0000000000000001 r15=0000000000000000
可知Count=2,WaitType=WaitAny,
dp rdx
可得2个object为fffffa80`c9ce5558和fffff880`13af9ab8。
接下来关注是由谁来解锁这两个object。
先查看这两个object是什么,输入dt nt!_DISPATCHER_HEADER fffff880
13af9ab8`可知它的Type=0,即NotificationEvent。具体原理可参阅《深入解析Windows操作系统》的系统机制的内核分发器章节。另一个object亦如此。
2: kd> dt nt!_DISPATCHER_HEADER fffff880`13af9ab8
+0x000 Type : 0 ''
+0x001 TimerControlFlags : 0 ''
+0x001 Absolute : 0y0
+0x001 Coalescable : 0y0
+0x001 KeepShifting : 0y0
+0x001 EncodedTolerableDelay : 0y00000 (0)
+0x001 Abandoned : 0 ''
+0x001 Signalling : 0 ''
+0x002 ThreadControlFlags : 0x6 ''
+0x002 CpuThrottled : 0y0
+0x002 CycleProfiling : 0y1
+0x002 CounterProfiling : 0y1
+0x002 Reserved : 0y00000 (0)
+0x002 Hand : 0x6 ''
+0x002 Size : 0x6 ''
+0x003 TimerMiscFlags : 0 ''
+0x003 Index : 0y000000 (0)
+0x003 Inserted : 0y0
+0x003 Expired : 0y0
+0x003 DebugActive : 0 ''
+0x003 ActiveDR7 : 0y0
+0x003 Instrumented : 0y0
+0x003 Reserved2 : 0y0000
+0x003 UmsScheduled : 0y0
+0x003 UmsPrimary : 0y0
+0x003 DpcActive : 0 ''
+0x000 Lock : 0n393216
+0x004 SignalState : 0n0
+0x008 WaitListHead : _LIST_ENTRY [ 0xfffff880`13af9ac0 - 0xfffff880`13af9ac0 ]
解锁object,即置其为有信号,即将其中的SingalState赋值为1。为此,可用写断点观测何时置1。
SignalState的地址为结构体头部地址+4,即fffff880`13af9abc,长度为4
输入ba w 4 fffff880
13af9abc`。另一object亦然。
bd其它断点,不然target机运行速度比较慢。
g一段时间,发现果然被触发置信号,而且调用栈是
THREAD fffffa80c9b32b50 Cid 07dc.0f00 Teb: 000000007ef49000 Win32Thread: fffff900c21f8850 RUNNING on processor 2
IRP List:
fffffa80c7a063a0: (0006,0118) Flags: 00060000 Mdl: 00000000
Not impersonating
DeviceMap fffff8a001219d40
Owning Process fffffa80ca6e9640 Image: 360tray.exe
Attached Process N/A Image: N/A
Wait Start TickCount 5179698 Ticks: 0
Context Switch Count 5652 IdealProcessor: 2 LargeStack
UserTime 00:00:01.500
KernelTime 00:00:02.375
Win32 Start Address 0x0000000074582231
Stack Init fffff88012cdbc70 Current fffff88012cdb950
Base fffff88012cdc000 Limit fffff88012cd4000 Call 0000000000000000
Priority 11 BasePriority 8 PriorityDecrement 48 IoPriority 2 PagePriority 5
Child-SP RetAddr : Args to Child : Call Site
fffff880`12cdb630 fffff880`010a4454 : 00000000`00000001 fffff880`00000000 fffffa80`c9b32b00 00000000`00000000 : nt!KeSetEvent+0x64
fffff880`12cdb6a0 fffff880`010c7c22 : fffff880`12cdb870 fffff880`13af9aa0 fffff8a0`0000000c fffff880`12cdb7c8 : fltmgr!FltpFilterReply+0x144
fffff880`12cdb710 fffff880`010cbb62 : fffffa80`c7a063a0 00000000`09f6e910 00000000`00000000 fffff800`0218b7d0 : fltmgr!FltpMsgDeviceControl+0xa2
fffff880`12cdb770 fffff880`010a26a1 : fffffa80`c819ae40 00000000`00000000 fffffa80`c9ce59f0 00000000`00000001 : fltmgr!FltpMsgDispatch+0xd2
fffff880`12cdb7f0 fffff800`01f4d1fa : 00000000`00000002 00000000`00000000 fffffa80`c9ce59f0 fffffa80`c7a063a0 : fltmgr! ?? ::FNODOBFM::`string'+0x24f0
fffff880`12cdb850 fffff800`0210a8b1 : fffffa80`c9ce59f0 fffffa80`c9ce59f0 fffffa80`c9ce59f0 fffff880`01e81180 : nt!IopSynchronousServiceTail+0xfa
fffff880`12cdb8c0 fffff800`01f9b3c6 : fffffa80`ca6e9600 00000000`00000000 00000000`00000001 00000000`00000000 : nt!IopXxxControlFile+0xc51
fffff880`12cdba00 fffff800`01cf3f53 : fffffa80`ca6e9640 00000000`00000001 fffffa80`c9b32b50 fffff800`01f4a6d4 : nt!NtDeviceIoControlFile+0x56
fffff880`12cdba70 00000000`74932e09 : 00000000`74932944 00000000`74a9774f 00000000`00000023 00000000`00000246 : nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff880`12cdbae0)
00000000`0582ed78 00000000`74932944 : 00000000`74a9774f 00000000`00000023 00000000`00000246 00000000`09f6ac68 : 0x74932e09
00000000`0582ed80 00000000`74a9774f : 00000000`00000023 00000000`00000246 00000000`09f6ac68 00000000`0582edd0 : 0x74932944
00000000`0582ed88 00000000`00000023 : 00000000`00000246 00000000`09f6ac68 00000000`0582edd0 00000000`00088023 : 0x74a9774f
00000000`0582ed90 00000000`00000246 : 00000000`09f6ac68 00000000`0582edd0 00000000`00088023 00000000`09f6e910 : 0x23
00000000`0582ed98 00000000`09f6ac68 : 00000000`0582edd0 00000000`00088023 00000000`09f6e910 00000000`0000001c : 0x246
00000000`0582eda0 00000000`0582edd0 : 00000000`00088023 00000000`09f6e910 00000000`0000001c 00000000`00000000 : 0x9f6ac68
00000000`0582eda8 00000000`00088023 : 00000000`09f6e910 00000000`0000001c 00000000`00000000 00000000`00000000 : 0x582edd0
00000000`0582edb0 00000000`09f6e910 : 00000000`0000001c 00000000`00000000 00000000`00000000 00000000`09f6e898 : 0x88023
00000000`0582edb8 00000000`0000001c : 00000000`00000000 00000000`00000000 00000000`09f6e898 00000000`09f6c470 : 0x9f6e910
00000000`0582edc0 00000000`00000000 : 00000000`00000000 00000000`09f6e898 00000000`09f6c470 00000000`09f6e898 : 0x1c
因此,可以确认原先explorer线程要由360tray.exe来解锁。而且explorer线程事先也是由于调用360FsFlt.sys的函数被block的。
基本可以断定360天擎出了什么毛病,它给加锁了,但是360tray又没有及时解锁。
dmp分析
接下来尝试分析360tray的那个线程,为什么没有执行解锁。
首先要找到具体是360tray的哪个线程。先在双机调试的环境里,找到360tray的线程,用.thread /w fffffa80c9b32b50
查看用户态代码:
*** Stack trace for last set context - .thread/.cxr resets it
# ChildEBP RetAddr Args to Child
00 09f6e858 72bb16e4 00000814 00000000 00000000 ntdll_77390000!ZwDeviceIoControlFile+0x15 (FPO: [10,0,0])
01 09f6e8bc 72bb23fd 00000814 00088023 09f6e910 fltlib!FilterpDeviceIoControl+0xdc (FPO: [Non-Fpo])
02 09f6e8e4 720d253d 00000814 09f6e910 0000001c fltlib!FilterReplyMessage+0x21 (FPO: [Non-Fpo])
WARNING: Stack unwind information not available. Following frames may be wrong.
03 09f6e8f8 720d319d 006dbb50 09f6e910 00000000 qutmload+0x253d
04 09f6e930 74504a59 006dbb50 00000853 00000000 qutmload!NotifyDriverClientResult+0xaa
05 09f6e950 74524cf0 00000853 00000000 00000008 appd!GetExtInfo+0x2d7e
06 09f6e9c4 0057005c 006e0069 006f0064 00730077 appd!GetExtInfo+0x23015
07 09f6e9c8 006e0069 006f0064 00730077 0053005c 0x57005c
08 09f6e9cc 006f0064 00730077 0053005c 00730079 0x6e0069
09 09f6e9d0 00730077 0053005c 00730079 0061006e 0x6f0064
0a 09f6e9d4 0053005c 00730079 0061006e 00690074 0x730077
0b 09f6e9d8 00730079 0061006e 00690074 00650076 0x53005c
0c 09f6e9dc 0061006e 00690074 00650076 0074005c 0x730079
0d 09f6e9e0 00690074 00650076 0074005c 00730061 0x61006e
0e 09f6e9e4 00650076 0074005c 00730061 0065006b 0x690074
0f 09f6e9e8 0074005c 00730061 0065006b 0067006e 0x650076
10 09f6e9ec 00730061 0065006b 0067006e 0065002e 0x74005c
11 09f6e9f0 0065006b 0067006e 0065002e 00650078 0x730061
12 09f6e9f4 0067006e 0065002e 00650078 00000000 0x65006b
13 09f6e9f8 0065002e 00650078 00000000 00650074 0x67006e
14 09f6e9fc 00650078 00000000 00650074 0047005c 0x65002e
15 09f6ea00 00000000 00650074 0047005c 006f006f 0x650078
这里确实难住了,不知为何r esp
看不了寄存器。RetAddr的00640000~00740000都是Private地址,不是来自dll的代码地址。
0`00640000 0`00740000 0`00100000 UserRange VAD fffffa80ca41e1a0 ReadWrite 100 Private
暂时也没法在dmp里找到这个线程。
最后,360天擎自己承认是个bug,升级到5500以上可解。