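Blue screen errors are something of a Windows trademark. Faced with one, we usually have no choice but to reboot. The causes of a blue screen crash are many and varied, but from a maintenance point of view a crash with a blue screen is still better than a sudden hang without one: a blue screen means Windows detected the error and recorded it, so we may be able to find the likely cause from the dump file.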
 
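To track down the cause of a blue screen, you first need a machine with Internet access, because the tool used below has to connect to Microsoft's debug symbol server in order to interpret the error.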
 
The tool is Debugging Tools for Windows, which can be downloaded from the address below.
 
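After installing it, run windbg. The debug environment has to be set up first.
First create an empty directory named websymbols on drive C.
In the menu bar choose File/Symbol File Path...,
and enter the symbol path there (in the sample log further down it appears as SRV*C:\Websymbol*http://msdl.microsoft.com/download/symbols; the directory in the path should be the one you created).
Next choose File/Image File Path and enter the same value as above.
Finally, save the workspace.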
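
The symbol path entered in the steps above follows the pattern SRV*<local cache>*<symbol server>. As a rough illustration only, the string can be assembled like this; the function name `build_symbol_path` is made up for this sketch, and the cache directory and server URL are the ones mentioned in this post.

```python
# Minimal sketch: assemble a WinDbg symbol-server path string.
# build_symbol_path is a hypothetical helper, not part of any debugger API.

def build_symbol_path(cache_dir, server="http://msdl.microsoft.com/download/symbols"):
    """Return a path of the form SRV*<cache_dir>*<server>."""
    return "SRV*{}*{}".format(cache_dir, server)


# The directory created in the first step is used as the local cache.
print(build_symbol_path(r"C:\websymbols"))
```

The printed string is what would go into both the Symbol File Path and Image File Path dialogs.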
 
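The next step is to load the dump file. XP generally uses a minidump because the file is small, usually only a few dozen KB; if all of memory were dumped each time, the disk would fill up after only a few crashes.
Dump files are saved in C:\windows\minidump.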
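
When the minidump folder holds several files, the most recent one is usually the one worth opening. The following is a small sketch, assuming the dumps carry a .dmp extension as in the example file name used in this post; `newest_minidump` is a name invented here for illustration.

```python
# Minimal sketch: pick the most recently modified .dmp file in a folder.
# newest_minidump is a hypothetical helper name.
from pathlib import Path


def newest_minidump(folder):
    """Return the newest .dmp file in folder as a Path, or None if there is none."""
    dumps = [
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".dmp"
    ]
    return max(dumps, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    minidump_dir = Path(r"C:\windows\minidump")
    # Only look when the folder exists (it will not on a non-Windows machine).
    if minidump_dir.is_dir():
        print(newest_minidump(minidump_dir))
```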
Below is a sample of what windbg prints after loading one:
 
Microsoft (R) Windows Debugger Version 6.9.0003.113 X86
Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [C:\Documents and Settings\ddd\桌面\Mini081308-01.dmp]
Mini Kernel Dump File: Only registers and stack trace are available
Symbol search path is: SRV*C:\Websymbol*http://msdl.microsoft.com/download/symbols
Executable search path is: SRV*C:\Websymbol*http://msdl.microsoft.com/download/symbols
Windows XP Kernel Version 2600 (Service Pack 2) UP Free x86 compatible
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 2600.xpsp_sp2_gdr.070227-2254
Kernel base = 0x804d8000 PsLoadedModuleList = 0x805543a0
Debug session time: Wed Aug 13 12:27:47.088 2008 (GMT+8)
System Uptime: 0 days 3:36:07.478
Loading Kernel Symbols
.............................................................................................................................
Loading User Symbols
Mini Kernel Dump does not contain unloaded driver list
Unable to load image nv4_disp.dll, Win32 error 0n2
*** WARNING: Unable to verify timestamp for nv4_disp.dll
*** ERROR: Module load completed but symbols could not be loaded for nv4_disp.dll
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************
Use !analyze -v to get detailed debugging information.
BugCheck EA, {887e3da8, 89376dc8, 89362ea0, 1}
Probably caused by : nv4_disp.dll ( nv4_disp+1cca0 )
Followup: MachineOwner
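
The BugCheck line in the summary above packs the stop code and its four arguments into one line. If you keep such logs around, a short script can pull those values out. This is only a sketch written against the line format shown in this log; `parse_bugcheck` is a hypothetical helper, and other debugger versions may format the line differently.

```python
# Minimal sketch: extract the stop code and arguments from a "BugCheck" summary line.
# parse_bugcheck is a hypothetical helper based on the line format seen in this log.
import re

BUGCHECK_RE = re.compile(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}")


def parse_bugcheck(line):
    """Return (code, [args]) as integers, or None if the line does not match."""
    m = BUGCHECK_RE.search(line)
    if m is None:
        return None
    code = int(m.group(1), 16)
    args = [int(a.strip(), 16) for a in m.group(2).split(",")]
    return code, args


code, args = parse_bugcheck("BugCheck EA, {887e3da8, 89376dc8, 89362ea0, 1}")
print(hex(code), [hex(a) for a in args])
```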
This summary is essentially what the blue screen itself shows. Clicking the blue !analyze -v link gives more detailed information:
******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************
THREAD_STUCK_IN_DEVICE_DRIVER (ea)
The device driver is spinning in an infinite loop, most likely waiting for
hardware to become idle. This usually indicates problem with the hardware
itself or with the device driver programming the hardware incorrectly.
If the kernel debugger is connected and running when watchdog detects a
timeout condition then DbgBreakPoint() will be called instead of KeBugCheckEx()
and detailed message including bugcheck arguments will be printed to the
debugger. This way we can identify an offending thread, set breakpoints in it,
and hit go to return to the spinning code to debug it further. Because
KeBugCheckEx() is not called the .bugcheck directive will not return bugcheck
information in this case. The arguments are already printed out to the kernel
debugger. You can also retrieve them from a global variable via
"dd watchdog!g_WdBugCheckData l5" (use dq on NT64).
On MP machines (OS builds <= 3790) it is possible to hit a timeout when the spinning thread is
interrupted by hardware interrupt and ISR or DPC routine is running at the time
of the bugcheck (this is because the timeout's work item can be delivered and
handled on the second CPU and the same time). If this is the case you will have
to look deeper at the offending thread's stack (e.g. using dds) to determine
spinning code which caused the timeout to occur.
Arguments:
Arg1: 887e3da8, Pointer to a stuck thread object.  Do .thread then kb on it to find
 the hung location.
Arg2: 89376dc8, Pointer to a DEFERRED_WATCHDOG object.
Arg3: 89362ea0, Pointer to offending driver name.
Arg4: 00000001, Number of times this error occurred.  If a debugger is attached,
 this error is not always fatal -- see DESCRIPTION below.  On the
 blue screen, this will always equal 1.
Debugging Details:
------------------

FAULTING_THREAD:  887e3da8
DEFAULT_BUCKET_ID:  GRAPHICS_DRIVER_FAULT
CUSTOMER_CRASH_COUNT:  1
BUGCHECK_STR:  0xEA
PROCESS_NAME:  devenv.exe
LAST_CONTROL_TRANSFER:  from b6dd83b0 to bf9f1ca0
STACK_TEXT: 
WARNING: Stack unwind information not available. Following frames may be wrong.
b4939478 b6dd83b0 fffffbbc b6db9000 00000011 nv4_disp+0x1cca0
b493947c fffffbbc b6db9000 00000011 00000d15 0xb6dd83b0
b4939480 b6db9000 00000011 00000d15 b49395dc 0xfffffbbc
b4939484 00000000 00000d15 b49395dc 00000024 0xb6db9000

STACK_COMMAND:  .thread 0xffffffff887e3da8 ; kb
FOLLOWUP_IP:
nv4_disp+1cca0
bf9f1ca0 8b4c2410        mov     ecx,dword ptr [esp+10h]
SYMBOL_STACK_INDEX:  0
SYMBOL_NAME:  nv4_disp+1cca0
FOLLOWUP_NAME:  MachineOwner
MODULE_NAME: nv4_disp
IMAGE_NAME:  nv4_disp.dll
DEBUG_FLR_IMAGE_TIMESTAMP:  4410c8d4
FAILURE_BUCKET_ID:  0xEA_IMAGE_nv4_disp.dll_DATE_2006_03_10
BUCKET_ID:  0xEA_IMAGE_nv4_disp.dll_DATE_2006_03_10
Followup: MachineOwner
---------
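We were fairly lucky: windbg tells us directly that this is probably a hardware problem, and combined with nv4_disp.dll that points to the graphics card. On opening the case, the graphics card turned out to be covered in a thick layer of dust; after cleaning it, the problem was resolved.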
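
For reference, the "NAME:  value" lines in the detailed output above are easy to collect with a script, and the DEBUG_FLR_IMAGE_TIMESTAMP value can be read as a hexadecimal count of seconds since 1970-01-01 UTC, which agrees with the DATE_2006_03_10 part of the bucket ID. The sketch below makes that assumption explicit; `parse_fields` and `image_build_time` are names invented for this example, and the sample text is copied from the log above.

```python
# Minimal sketch: collect "NAME:  value" fields from !analyze -v output and
# decode the image timestamp. Helper names are hypothetical.
import re
from datetime import datetime, timezone

# Upper-case field name, colon, whitespace, then a non-empty value.
FIELD_RE = re.compile(r"^([A-Z_]+):\s+(\S.*)$")


def parse_fields(text):
    """Return a dict of field name -> value for lines that carry a value."""
    fields = {}
    for line in text.splitlines():
        m = FIELD_RE.match(line.strip())
        if m:
            fields[m.group(1)] = m.group(2).strip()
    return fields


def image_build_time(hex_timestamp):
    """Assumption: the value is hex seconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(int(hex_timestamp, 16), tz=timezone.utc)


sample = """\
DEFAULT_BUCKET_ID:  GRAPHICS_DRIVER_FAULT
BUGCHECK_STR:  0xEA
PROCESS_NAME:  devenv.exe
MODULE_NAME: nv4_disp
IMAGE_NAME:  nv4_disp.dll
DEBUG_FLR_IMAGE_TIMESTAMP:  4410c8d4
"""
fields = parse_fields(sample)
built = image_build_time(fields["DEBUG_FLR_IMAGE_TIMESTAMP"])
print(fields["IMAGE_NAME"], built.date().isoformat())
```

An old driver build date like this one is also a hint that updating the driver is worth trying before opening the case.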
 
That is the general troubleshooting process. I am also keeping it here as reference material for future use.