Analysis of an Application Failure on a Windows 2003 System
                                                           2009-02-06
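Background
At 4:00 a.m. on February 5 the on-call phone rang: the company's production statistics system was unreachable. I hurried to a user terminal to check. The server 10.1.106.193 answered PING, but access through http://10.1.106.193:8081 failed. In the machine room I found that none of the DOS windows for the three server-side programs were open. Running the three batch files started the server-side programs, and a retest showed the system working again. Suspecting that a system fault had stopped the application services, I took the cautious route: I restarted the server first, then started the application services, and confirmed that the application worked.
What caused these services to stop? Was it human error, someone forgetting to run the batch files, or a system fault?
I decided to track down the root cause.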

1   Server log analysis
1.1   The “Security” log

                        
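Figure 1, green box: log entries recorded during system startup after the server was restarted manually to resolve the fault, from 4:29:44 to 4:29:50, 6 seconds in total. The “System Event” entry on the line below the green box is the shutdown record logged by the system.
Red box: from 2:45:50 to 2:45:57, 7 seconds in total. These entries match those of a system restart. Did the system also restart on its own?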

1.2  The “System” log between 4 and 5 a.m.
 
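In Figure 2, the entries from 4:23 to 4:31 reflect the system restart process. Note the time intervals.
From 4:23:53 to 4:24:00 in Figure 2, 7 seconds in total, are the entries logged while the system processed the logon after the user name and password were entered.
The details of the 4:23:53 ××× warning in Figure 2 are shown in Figure 3. They match the prompt I saw the first time I logged on to the server to investigate, which shows that the fault had already occurred before I logged on.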
 

The details of the 4:24:00 error are shown in Figure 4.

 
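Figure 4
At 4:26:36 the event log service stopped and logging ceased.
At 4:29:39 the event log service started and logging resumed.
From 4:26:36 to 4:29:39, 183 seconds in total, the system was restarting with the event log service shut down, so nothing was logged during this interval.
Summary, based on the log entries from shortly after 4 a.m.: from the moment the on-call engineer issued the restart until the system was back up took 7 + 183 + 81 = 271 seconds (logon, the logging gap, and startup after logging resumed).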
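The interval arithmetic above can be double-checked with a short Python sketch. The timestamps are the ones read from the System log in this article (the 81-second phase is the 4:29:39 to 4:31:00 interval); the phase names and the helper `seconds_between` are illustrative only.

```python
from datetime import datetime

def seconds_between(start: str, end: str) -> int:
    """Return whole seconds between two same-day H:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

# Phases of the manual restart, as (start, end) pairs from the System log.
phases = {
    "logon": ("4:23:53", "4:24:00"),
    "event log service stopped": ("4:26:36", "4:29:39"),
    "logging resumed until startup complete": ("4:29:39", "4:31:00"),
}

durations = {name: seconds_between(a, b) for name, (a, b) in phases.items()}
total = sum(durations.values())

for name, secs in durations.items():
    print(f"{name}: {secs} s")
print(f"total: {total} s")
```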
 
1.3    The “System” log between 2 and 3 a.m.
 

 
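From 4:29:39 to 4:31:00, 81 seconds in total, are the entries from when logging resumed during the manual restart until startup completed; 4:31:00 is the moment the system recorded a successful startup.
In Figure 5, from 2:45:04 to 2:47:00, the event sources are almost identical to those logged when the on-call engineer restarted the server manually, yet nobody restarted the server at that hour.
The difference from the 4 a.m. entries lies in the time and content of the eventlog-source entries: there is no record of the event log service shutting down.
The red error entry at 2:45:46, detailed in Figure 6, reports that the system shut down unexpectedly at 2:42:51. If this had been an ordinary restart, the log would contain a record of the event log service stopping. Only when a server loses power suddenly, or restarts in some similarly abrupt way, is that record missing. The entry closest to 2:42:51 is at 2:45:04, an interval of 133 seconds (2 minutes 13 seconds). The gap between these 133 seconds (restart after power loss) and the 271 seconds of the manual restart (services shut down normally before the reboot) also fits this explanation.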
 
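From the analysis of the failed server's logs:
Summary: the server most likely restarted after a sudden power loss, or after a system fault that behaves like one.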
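The reasoning in this section (a boot with no preceding "event log service stopped" record points to an abrupt restart) can be expressed as a small Python sketch. It assumes the standard Windows EventLog source IDs 6005 (service started), 6006 (service stopped) and 6008 (previous shutdown was unexpected); the article does not quote event IDs, so mapping them onto the timestamps below is an assumption made for illustration.

```python
from typing import Iterable, List, Tuple

# Standard Windows "EventLog" source IDs (assumed to apply to this server's log).
EVENTLOG_STARTED = 6005     # logged at boot
EVENTLOG_STOPPED = 6006     # logged during a clean shutdown
UNEXPECTED_SHUTDOWN = 6008  # logged after a crash or power loss (informational here)

def classify_boots(events: Iterable[Tuple[str, int]]) -> List[Tuple[str, str]]:
    """Label each boot as 'clean' or 'unexpected'.

    events: (timestamp, event_id) pairs in chronological order.
    A boot is 'clean' only if a 6006 record was seen since the previous boot.
    The first boot in a truncated log may therefore be mislabeled.
    """
    results = []
    stop_seen = False
    for stamp, event_id in events:
        if event_id == EVENTLOG_STOPPED:
            stop_seen = True
        elif event_id == EVENTLOG_STARTED:
            results.append((stamp, "clean" if stop_seen else "unexpected"))
            stop_seen = False
    return results

# Timestamps from the article; the event IDs attached to them are assumed.
sample = [
    ("02:45:04", EVENTLOG_STARTED),
    ("02:45:46", UNEXPECTED_SHUTDOWN),
    ("04:26:36", EVENTLOG_STOPPED),
    ("04:29:39", EVENTLOG_STARTED),
]
boots = classify_boots(sample)
print(boots)
```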
 
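2     Checking the network records
If the server really did lose power and restart, the switch port it is connected to should show Down and Up records at the corresponding time. Pay attention to the timestamps in the switch log.
I first checked the log on the network switch, which requires knowing exactly which switch port the server is connected to.
Quickly locating the server's port: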
center-1#show arp | in 106.193
Internet  10.1.106.193            5   0017.0857.7280  ARPA   Vlan2  
center-1#show mac- | in 0857.7280
   2    0017.0857.7280   dynamic ip                    GigabitEthernet2/6   
center-1#show cdp n g2/6 de
-------------------------
Device ID: Ghsw-A101-04-03
Entry address(es):
  IP address: 10.1.107.11
Platform: cisco WS-C3560G-24TS,  Capabilities: Router Switch IGMP
Interface: GigabitEthernet2/6,  Port ID (outgoing port): GigabitEthernet0/25
Holdtime : 125 sec
Version :
Cisco IOS Software, C3560 Software (C3560-IPSERVICES-M), Version 12.2(25)SEE4, RELEASE SOFTWARE (fc1)
Copyright (c) 1986-2007 by Cisco Systems, Inc.
Compiled Mon 16-Jul-07 00:28 by myl
advertisement version: 2
Protocol Hello:  OUI=0x00000C, Protocol ID=0x0112; payload len=27, value=00000000FFFFFFFF010221FF000000000000001E4993BE00FF0000
VTP Management Domain: 'bjgh'
Native VLAN: 1
Duplex: full
center-1# 10.1.107.11
Trying 10.1.107.11 ... Open

User Access Verification
Password:
Ghsw-A101-04-03>en
Password:
Ghsw-A101-04-03#show mac- | 0857.7280
                            ^
% Invalid input detected at '^' marker.
Ghsw-A101-04-03#show mac- | in 0857.7280
   2    0017.0857.7280    DYNAMIC     Gi0/3
Ghsw-A101-04-03#show log | in  0/3
.Feb  4 09:55:13: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to down
.Feb  4 09:55:16: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to up
.Feb  4 09:55:17: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/3, changed state to up
.Feb  5 02:57:46: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/3, changed state to down
.Feb  5 02:57:47: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to down
.Feb  5 02:57:50: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to up
.Feb  5 02:57:51: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/3, changed state to up
.Feb  5 02:59:12: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/3, changed state to down
.Feb  5 02:59:13: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to down
.Feb  5 02:59:16: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to up
.Feb  5 02:59:17: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/3, changed state to up
.Feb  5 04:41:46: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/3, changed state to down
.Feb  5 04:41:47: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to down
.Feb  5 04:41:50: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to up
.Feb  5 04:41:51: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/3, changed state to up
.Feb  5 04:43:12: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/3, changed state to down
.Feb  5 04:43:13: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to down
.Feb  5 04:43:15: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to up
.Feb  5 04:43:16: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/3, changed state to up
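When the switch log is long, the link-state changes for one port can be pulled out programmatically. The following is a minimal Python sketch that assumes the log has been saved as text lines in the format shown above; it reads only the %LINK-3-UPDOWN messages, and the function name `link_events` is hypothetical.

```python
import re

# Matches lines such as:
# .Feb  5 02:57:47: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to down
LINK_RE = re.compile(
    r"^\.?(?P<month>\w{3})\s+(?P<day>\d+)\s+(?P<time>\d{2}:\d{2}:\d{2}): "
    r"%LINK-3-UPDOWN: Interface (?P<intf>\S+), changed state to (?P<state>up|down)$"
)

def link_events(lines, interface):
    """Return (timestamp, state) tuples for physical link changes on one interface."""
    events = []
    for line in lines:
        match = LINK_RE.match(line.strip())
        if match and match.group("intf") == interface:
            stamp = f"{match.group('month')} {match.group('day')} {match.group('time')}"
            events.append((stamp, match.group("state")))
    return events

sample_log = [
    ".Feb  5 02:57:47: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to down",
    ".Feb  5 02:57:50: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to up",
    ".Feb  5 02:59:13: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to down",
    ".Feb  5 02:59:16: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to up",
]
events = link_events(sample_log, "GigabitEthernet0/3")
for stamp, state in events:
    print(stamp, state)
```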
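At first glance the times did not match. Checking the switch clock showed it was inaccurate: about 14 minutes ahead of the server's clock.
Subtracting 14 minutes from the switch log times makes them line up with the server's system log.
Time of the server's power loss according to the switch: Feb  5 02:57:46, minus 14 minutes, which is 2:43:46. Time recorded by the server: 2:42:51. (When a server restarts abruptly, as in a power loss, the time it records is an estimate slightly earlier than the actual time; here the two differ by about one minute.) So the time reported in the server log agrees with the time the switch port went DOWN.
The entries from Feb  5 04:41:46 to Feb  5 04:43:16 (04:27:46 to 04:29:16 after subtracting 14 minutes) are the port UP/DOWN records from the manual restart after the fault was discovered.
Summary: at about 2:43 a.m. the server 106.193 restarted abruptly, as if it had lost power.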
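The clock correction can be verified with a few lines of Python. This sketch assumes the switch clock runs 14 minutes ahead of the server (the offset observed here, itself only approximate) and that the log year is 2009; `to_server_time` is an illustrative helper, not part of any tool.

```python
from datetime import datetime, timedelta

# Approximate offset observed in this incident: switch clock ahead of the server.
SWITCH_AHEAD = timedelta(minutes=14)

def to_server_time(switch_stamp: str, year: int = 2009) -> datetime:
    """Convert a switch log timestamp such as 'Feb  5 02:57:46' to server time."""
    normalized = " ".join(switch_stamp.split())  # collapse the padded day field
    parsed = datetime.strptime(f"{year} {normalized}", "%Y %b %d %H:%M:%S")
    return parsed - SWITCH_AHEAD

port_down = to_server_time("Feb  5 02:57:46")
server_estimate = datetime(2009, 2, 5, 2, 42, 51)  # "unexpected shutdown" time in the System log
gap_seconds = (port_down - server_estimate).total_seconds()

print(port_down.strftime("%H:%M:%S"))
print(gap_seconds)
```

A gap of under a minute is well within the uncertainty of the 14-minute offset, so the two logs can be treated as describing the same event.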
 

What caused the server to restart as if it had suddenly lost power?
 

3    Analysis of the DMP file
The system's MEMORY.DMP file is timestamped 2:45.
The analysis of the DMP file follows:
Microsoft (R) Windows Debugger Version 6.10.0003.233 X86
Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [C:\Documents and Settings\zhou.j\My Documents\106.193\MEMORY.DMP]
Kernel Summary Dump File: Only kernel address space is available
Symbol search path is: *** Invalid ***
****************************************************************************
* Symbol loading may be unreliable without a symbol search path.           *
* Use .symfix to have the debugger choose a symbol path.                   *
* After setting your symbol path, use .reload to refresh symbol locations. *
****************************************************************************
Executable search path is:
*********************************************************************
* Symbols can not be loaded because symbol path is not initialized. *
*                                                                   *
* The Symbol Path can be set by:                                    *
*   using the _NT_SYMBOL_PATH environment variable.                 *
*   using the -y <symbol_path> argument when starting the debugger. *
*   using .sympath and .sympath+                                    *
*********************************************************************
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for ntkrnlmp.exe -
Windows Server 2003 Kernel Version 3790 (Service Pack 2) MP (4 procs) Free x86 compatible
Product: Server, suite: TerminalServer SingleUserTS
Built by: 3790.srv03_sp2_gdr.080813-1204
Machine Name:
Kernel base = 0x80800000 PsLoadedModuleList = 0x808af9c8
Debug session time: Thu Feb  5 02:43:02.584 2009 (GMT+8)
System Uptime: 0 days 17:01:59.693
*********************************************************************
* Symbols can not be loaded because symbol path is not initialized. *
*                                                                   *
* The Symbol Path can be set by:                                    *
*   using the _NT_SYMBOL_PATH environment variable.                 *
*   using the -y <symbol_path> argument when starting the debugger. *
*   using .sympath and .sympath+                                    *
*********************************************************************
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for ntkrnlmp.exe -
Loading Kernel Symbols
...............................................................
......................................................
Loading User Symbols
PEB is paged out (Peb.Ldr = 7ffdb00c).  Type ".hh dbgerr001" for details
Loading unloaded module list
....
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for storport.sys -
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************
Use !analyze -v to get detailed debugging information.
BugCheck D1, {28d92f20, d0000002, 0, f734e12e}
***** Kernel symbols are WRONG. Please fix symbols to do analysis.
Page c5a39 not present in the dump file. Type ".hh dbgerr004" for details
Page c0f8d not present in the dump file. Type ".hh dbgerr004" for details
*** ERROR: Module load completed but symbols could not be loaded for HpCISSs2.sys
*************************************************************************
***                                                                   ***
***                                                                   ***
***    Your debugger is not using the correct symbols                 ***
***                                                                   ***
***    In order for this command to work properly, your symbol path   ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: nt!_KPRCB                                     ***
***                                                                   ***
*************************************************************************
*************************************************************************
***                                                                   ***
***                                                                   ***
***    Your debugger is not using the correct symbols                 ***
***                                                                   ***
***    In order for this command to work properly, your symbol path   ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: nt!KPRCB                                      ***
***                                                                   ***
*************************************************************************
*************************************************************************
***                                                                   ***
***                                                                   ***
***    Your debugger is not using the correct symbols                 ***
***                                                                   ***
***    In order for this command to work properly, your symbol path   ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: nt!_KPRCB                                     ***
***                                                                   ***
*************************************************************************
*************************************************************************
***                                                                   ***
***                                                                   ***
***    Your debugger is not using the correct symbols                 ***
***                                                                   ***
***    In order for this command to work properly, your symbol path   ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: nt!KPRCB                                      ***
***                                                                   ***
*************************************************************************
*************************************************************************
***                                                                   ***
***                                                                   ***
***    Your debugger is not using the correct symbols                 ***
***                                                                   ***
***    In order for this command to work properly, your symbol path   ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: nt!_KPRCB                                     ***
***                                                                   ***
*************************************************************************
*************************************************************************
***                                                                   ***
***                                                                   ***
***    Your debugger is not using the correct symbols                 ***
***                                                                   ***
***    In order for this command to work properly, your symbol path   ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: nt!_KPRCB                                     ***
***                                                                   ***
*************************************************************************
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for halmacpi.dll -
*************************************************************************
***                                                                   ***
***                                                                   ***
***    Your debugger is not using the correct symbols                 ***
***                                                                   ***
***    In order for this command to work properly, your symbol path   ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: nt!_KPRCB                                     ***
***                                                                   ***
*************************************************************************
*************************************************************************
***                                                                   ***
***                                                                   ***
***    Your debugger is not using the correct symbols                 ***
***                                                                   ***
***    In order for this command to work properly, your symbol path   ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: nt!_KPRCB                                     ***
***                                                                   ***
*************************************************************************
*********************************************************************
* Symbols can not be loaded because symbol path is not initialized. *
*                                                                   *
* The Symbol Path can be set by:                                    *
*   using the _NT_SYMBOL_PATH environment variable.                 *
*   using the -y <symbol_path> argument when starting the debugger. *
*   using .sympath and .sympath+                                    *
*********************************************************************
*********************************************************************
* Symbols can not be loaded because symbol path is not initialized. *
*                                                                   *
* The Symbol Path can be set by:                                    *
*   using the _NT_SYMBOL_PATH environment variable.                 *
*   using the -y <symbol_path> argument when starting the debugger. *
*   using .sympath and .sympath+                                    *
*********************************************************************
Summary: Probably caused by : storport.sys ( storport!StorPortGetPhysicalAddress+2a )
Followup: MachineOwner
---------
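The key line in the debugger output is `BugCheck D1, {28d92f20, d0000002, 0, f734e12e}`. As a reading aid, the Python sketch below splits that line into the stop code and its four arguments and labels them with the documented parameter meanings of bug check 0xD1 (DRIVER_IRQL_NOT_LESS_OR_EQUAL). The parser `parse_bugcheck` is a hypothetical helper, and the labels are a summary from Microsoft's bug check reference, not output of the debugger.

```python
import re

# Parameter meanings for bug check 0xD1 (DRIVER_IRQL_NOT_LESS_OR_EQUAL).
PARAM_NAMES_D1 = (
    "memory address referenced",
    "IRQL at time of reference",
    "access type (0 = read, 1 = write)",
    "address of the code that referenced the memory",
)

def parse_bugcheck(line: str):
    """Split a WinDbg 'BugCheck XX, {a, b, c, d}' line into (code, [params])."""
    match = re.match(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}", line)
    if not match:
        raise ValueError(f"not a BugCheck line: {line!r}")
    code = int(match.group(1), 16)
    params = [int(field.strip(), 16) for field in match.group(2).split(",")]
    return code, params

code, params = parse_bugcheck("BugCheck D1, {28d92f20, d0000002, 0, f734e12e}")
print(f"stop code: 0x{code:08X}")
for name, value in zip(PARAM_NAMES_D1, params):
    print(f"{name}: 0x{value:x}")
```

The stop code printed here, 0x000000D1, is what to search for in vendor advisories once the faulting module (storport.sys in this dump) is known.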

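4    Looking up storport.sys
The HP technical support website has the following article.
Last updated: 2008-4-17 15:08:14
 
Article ID: 43167
Article title: HP ProLiant servers may encounter a blue screen under heavy input/output load
Article keywords: c01420773
Article path: http://www.icare.hp.com.cn/techcenter_staticarticle/43167/43167.html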
 
 
An HP ProLiant server that is under heavy input/output load for an extended period, and that is configured with one of the HP Smart Array SCSI or SAS/SATA controllers and the software listed in the “Scope” section below, may encounter a blue screen showing Stop 0x000000D1.
This happens because, after the Microsoft KB932755 update to the Storport driver is installed, the I/O Control (IOCTL) calls issued by the HP Insight Management Storage Agents can fail.
Any HP ProLiant server configured with one of the following HP Smart Array SCSI or SAS/SATA controllers and the following software:
Affected HP Smart Array SCSI controllers:
Smart Array 6400/6402/6404 EM controller
Smart Array 641/642 controller
Smart Array 6i controller
Smart Array 5312 controller
Smart Array 5300/5302/5304 controller
Smart Array 532 controller
Smart Array 5i controller
Affected HP Smart Array SAS/SATA controllers:
Smart Array P800 controller
Smart Array P600 controller
Smart Array E500 controller
Smart Array P400/400i controller
Smart Array E200/200i controller
Affected software configuration:
Any edition of Microsoft Windows Server 2003 (x86 or x64).

HP ProLiant Smart Array 5x/6x Controller Driver (HPCISSS.SYS) version 5.18.0.64 (or earlier), or HP ProLiant Smart Array SAS/SATA Controller Driver (HPCISSS2.SYS) version 5.10.0.32 or 5.10.0.64 (or earlier).

Microsoft Storport Driver for Windows Server 2003 as delivered by Microsoft KB932755, version 5.2.3790.2880 (for SP1) or 5.2.3790.4021 (for SP2).

HP Insight Management Storage Agents (any version).
The blue screen problem is corrected in the following updated versions:
For ProLiant servers running 64-bit editions of Windows Server 2003:
(HPCISSS.SYS) HP ProLiant Smart Array 5x and 6x Controller Driver for Windows Server 2003 x64 Editions, version 6.4.0.64 (or later)
(HPCISSS2.SYS) HP ProLiant Smart Array SAS/SATA Controller Driver for Windows Server 2003 x64 Editions, version 6.2.0.64 (or later)
For ProLiant servers running 32-bit editions of Windows Server 2003:
(HPCISSS2.SYS) HP ProLiant Smart Array SAS/SATA Controller Driver for Windows Server 2003, version 6.2.0.32 (or later)
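Checking an installed driver against these minimum versions is a dotted-version comparison, which is easy to get wrong by eye. The Python sketch below compares versions field by field; the fixed versions in the table are the ones quoted in the HP article, while the function names and the installed version used in the example are hypothetical (the article does not state which version this server was running).

```python
def parse_version(text: str) -> tuple:
    """Turn '6.2.0.32' into (6, 2, 0, 32) so fields compare numerically."""
    return tuple(int(part) for part in text.split("."))

# Minimum fixed versions quoted in the HP article, keyed by (driver file, architecture).
FIXED_VERSIONS = {
    ("HPCISSS.SYS", "x64"): "6.4.0.64",
    ("HPCISSS2.SYS", "x64"): "6.2.0.64",
    ("HPCISSS2.SYS", "x86"): "6.2.0.32",
}

def has_fix(driver: str, arch: str, installed: str) -> bool:
    """Return True if the installed driver is at or above the fixed version."""
    key = (driver.upper(), arch)
    if key not in FIXED_VERSIONS:
        raise KeyError(f"no fixed version listed for {key}")
    return parse_version(installed) >= parse_version(FIXED_VERSIONS[key])

# Example with a hypothetical installed version taken from the affected range.
print(has_fix("HpCISSs2.sys", "x86", "5.10.0.32"))
print(has_fix("HpCISSs2.sys", "x86", "6.2.0.32"))
```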
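To find the updated driver:
1. Go to www.hp.com
2. Select “Software and Driver Downloads”.
3. Enter the ProLiant server model (for example, “DL380 G5”).
4. On the “Product Search Results” page (if shown), select the specific server model.
5. Select the appropriate Windows Server 2003 edition.
6. Select Driver - Storage Controller.
7. Download the latest version of the appropriate driver.
Until the correct version of HPCISSS.SYS or HPCISSS2.SYS is installed, the blue screen can be avoided by removing the KB932755 update to the Storport driver through “Control Panel -> Add or Remove Programs” (see Figure 1 below).
Figure 1. Removing KB932755 in “Add or Remove Programs”.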
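Following HP's advice above, I found the array controller driver for this HP server on the HP website, as shown below.

I downloaded and installed this driver and kept monitoring the server. Since the driver was installed, the server has not restarted on its own again.
Summary: ProLiant servers running certain versions of the Microsoft Storport storage port driver (STORPORT.SYS) may blue-screen. Download the corresponding HP driver (make sure it matches your array controller hardware) and install it.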

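  Final conclusion
On February 5 the server 106.193 blue-screened and restarted. After the restart the X application services had to be started manually, so that day's reports were unavailable. Cause of the blue screen: certain ProLiant servers running Microsoft Windows Server 2003 SP2 with the Microsoft Storport storage port driver (STORPORT.SYS) may blue-screen. Installing the array controller driver released by HP resolved the problem.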