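A while ago, a customer's S7A host crashed several times and produced system dump files. The following walks through how the crash command was used to find the cause.
# pwd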
/
# hostname
s7a01
# cd /var/adm/ras
# ls -l    (list the directory to find the dump file name)
total 395133
-rw-rw-r-- 1 root system 4226 Apr 02 2003 BosMenus.log
-rw-r--r-- 1 root system 2 Jan 07 2000 SRCSemID
-rw------- 1 root system 8192 May 20 13:35 bootlog
-rw-r--r-- 1 root system 8388 Apr 02 2003 bosinst.data
-rw-rw-r-- 1 root system 16384 Apr 02 2003 bosinstlog
--w------- 1 root system 2 May 16 15:47 bounds
-rw-r--r-- 1 bin bin 197206 Jan 01 1970 codepoint.cat
-rw--w--w- 1 root system 16384 May 20 15:52 conslog
--w------- 1 root system 21 May 16 15:47 copyfilename
-rw-r--r-- 1 root system 57078 Apr 02 2003 devinst.log
-rw-r--r-- 1 root system 83319 May 20 14:00 diag_log
-rw------- 1 root system 8192 May 16 15:49 dumpsymplog
-rw-r--r-- 1 root system 151552 May 20 15:52 errlog
-rw-r--r-- 1 root system 151552 Apr 22 2004 errlog0422.log
-r--r--r-- 1 bin bin 103968 Jan 07 2000 errtmplt
-rw-r--r-- 1 root system 7949 Apr 02 2003 image.data
-rw-r--r-- 1 root system 8192 May 20 13:21 nimlog
-rw-rw-rw- 1 root system 1334264 Jan 20 2000 trcfile
-rw------- 1 root system 200136704 May 16 15:47 vmcore.0
# crash vmcore.0    (open the vmcore.0 dump file)
Using /unix as the default namelist file.
2 dump routines failed. The following were recorded:
0x0141cbe8 <.[kbddd_chrp:DATA]+9a8> failed with rc=14
0x01422764 <.[msedd_chrp:DATA]+664> failed with rc=14
> stat    (show the system state at the time of the crash)
sysname: AIX
nodename: s7a01
release: 3
version: 4
machine: 000AAD014C00
time of crash: Tue May 16 15:05:18 TAIST 2006
age of system: 22 hr., 51 min.
xmalloc debug: disabled
abend code: 300    (check the abend code; this value is the key clue)
csa: 0x2ff3b400
exception struct:
dar: 0x00000000
dsisr: 0x00000000:
srv: 0x00000000
dar2: 0x00000000
dsirr: 0x00000000: (errno) "Error 0"
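The abend code printed by stat is a hexadecimal value that corresponds to a PowerPC interrupt vector offset, which is why 300 is such a useful clue. Below is a minimal sketch of that lookup, assuming the standard PowerPC vector assignments. The table is partial, and `describe_abend` is a hypothetical helper, not part of crash.

```python
# Illustrative, partial table: PowerPC interrupt vector offsets that AIX
# reports as abend codes. `describe_abend` is a made-up helper name.
ABEND_CODES = {
    0x200: "machine check",
    0x300: "data storage interrupt (DSI)",
    0x400: "instruction storage interrupt (ISI)",
    0x500: "external interrupt",
    0x600: "alignment interrupt",
    0x700: "program interrupt",
    0x800: "floating-point unavailable",
}


def describe_abend(code_text):
    """Translate the hex string printed by `stat` into a short description."""
    code = int(code_text, 16)
    return ABEND_CODES.get(code, "unknown abend code")


print(describe_abend("300"))
```

A data storage interrupt means the failing instruction touched an address it could not access, so the next step is to find which code was running.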
> trace -m
Skipping first MST
MST STACK TRACE:
0x2ff3b400 (excpt=00000004:0a000000:00000000:00000004:00000106) (intpri=11)
IAR: .compare_and_swap+2c (0000a4ec): stw r9,0x0(r4)
LR: .[aiopin:untie_knot]+a8 (0143d7a8)
2ff3a2e0: .[aio.ext:qlioreq]+b0 (014376ec)
2ff3a340: .[aio.ext:listio]+128 (01438f5c)
2ff3b3c0: .sys_call_ret+0 (00003a6c)
0001113a: lasttocentry+fead9 (00348001)
0452-771: Cannot read return address at address 0x01892c0b.
> le 0000a4ec
No loader entry found for module address 0x0000a4ec
No loader entry found for module named '0000a4ec'
> le 0143d7a8
LoadList entry at 0x04ea7980
Module *start:0x00000000_0143bef0 Module filesize:0x00000000_0000228c
Module *end:0x00000000_0143e17c
*data:0x00000000_0143dbe8 data length:0x00000000_00000594
Use-count:0x0001 load_count:0x0000 *file:0x00000000
flags:0x00000262 TEXT DATAINTEXT DATA DATAEXISTS
*exp:0x04ed8000 *lex:0x00000000 *deferred:0x00000000 expsize:0x6e6c732f
Name: /usr/lib/drivers/aiopin
ndepend:0x0001 maxdepend:0x0001
*depend[00]:0x05039280
*le_next: 04ea7680
> le 014376ec
LoadList entry at 0x04ea7680
Module *start:0x00000000_014348c0 Module filesize:0x00000000_00007624
Module *end:0x00000000_0143bee4
*data:0x00000000_0143a4c0 data length:0x00000000_00001a24
Use-count:0x0003 load_count:0x0001 *file:0x00000000
flags:0x00000272 TEXT KERNELEX DATAINTEXT DATA DATAEXISTS
*exp:0x051e3000 *lex:0x00000000 *deferred:0x00000000 expsize:0x6c696263
Name: /etc/drivers/aio.ext
ndepend:0x0002 maxdepend:0x0002
*depend[00]:0x04ea7980
*depend[01]:0x05039280
*le_next: 04edb700
> le 01438f5c
LoadList entry at 0x04ea7680
Module *start:0x00000000_014348c0 Module filesize:0x00000000_00007624
Module *end:0x00000000_0143bee4
*data:0x00000000_0143a4c0 data length:0x00000000_00001a24
Use-count:0x0003 load_count:0x0001 *file:0x00000000
flags:0x00000272 TEXT KERNELEX DATAINTEXT DATA DATAEXISTS
*exp:0x051e3000 *lex:0x00000000 *deferred:0x00000000 expsize:0x6c696263
Name: /etc/drivers/aio.ext
ndepend:0x0002 maxdepend:0x0002
*depend[00]:0x04ea7980
*depend[01]:0x05039280
*le_next: 04edb700
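The lookups show that the crash is related to the module /usr/lib/drivers/aiopin: the link register (LR) address 0143d7a8 from the stack trace falls inside it.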
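What the le lookups do is a range check: an address belongs to a module when it lies between that module's start and end. The sketch below repeats that check by hand, using the start and end values copied from the le output above. `MODULES` and `find_module` are hypothetical names used only for illustration.

```python
# Module ranges copied from the `le` output: (start, end, name).
MODULES = [
    (0x0143BEF0, 0x0143E17C, "/usr/lib/drivers/aiopin"),
    (0x014348C0, 0x0143BEE4, "/etc/drivers/aio.ext"),
]


def find_module(addr):
    """Return the name of the listed module whose [start, end) range holds addr."""
    for start, end, name in MODULES:
        if start <= addr < end:
            return name
    return None  # no listed module contains this address


# Addresses taken from the `trace -m` stack.
STACK = [
    ("IAR", 0x0000A4EC),
    ("LR", 0x0143D7A8),
    ("qlioreq", 0x014376EC),
    ("listio", 0x01438F5C),
]

for label, addr in STACK:
    print(f"{label:8s} {addr:#010x} -> {find_module(addr)}")
```

The IAR address matches neither module, consistent with the "No loader entry found" reply, while the LR address lands in aiopin and the two deeper frames land in aio.ext.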
> errpt    (show the error log entries recorded at the time of the crash)
LAST ERRORS READ BY ERRDEMON (MOST RECENT LAST):
Tue May 16 15:05:18 TAIST: DSI_PROC data storage interrupt : processor
Resource Name: SYSVMM
0a000000 00000000 00000004 00000086
LAST 3 ERRORS READ BY ERRDEMON (MOST RECENT FIRST):
> od vmmerrlog 9 rpco
> proc - 0
SLT ST PID PPID PGRP UID EUID TCNT NAME
0 a 0 0 0 0 0 1 swapper
FLAGS: swapped_in no_swap fixed_pri kproc
Links: *child:0xe20030c0 *siblings:0x00000000 *uinfo:0x50004020(0x0038)
*ganchor:0x00000000 *pgrpl:0x00000000 *ttyl:0x00000000
Dispatch Fields: pevent:0x00000000 *synch:0xffffffff
lock:0x00000000 lock_d:0x00000000
Thread Fields: *threadlist:0xe6000000 threadcount:1
active:1 suspended:0 local:0 terminating:0
Scheduler Fields: fixed pri: 16 repage:0x00000000 scount:0 sched_pri:0
*sched_next:0x00000000 *sched_back:0x00000000 cpticks:3087
msgcnt:0 majfltsec:0
Misc: adspace:0x0003c00f kstackseg:0x00000000 xstat:0x0000
*p_ipc:0x00000000 *p_dblist:0x00000000 *p_dbnext:0x00000000
Signal Information:
pending:hi 0x00000000,lo 0x00000000
sigcatch:hi 0x00000000,lo 0x00000000 sigignore:hi 0xffffffff,lo 0xfff7ffff
Statistics: size:0x00000000(pages) audit:0x00000000
accounting page frames:0 page space blocks:0
Number of virtual pages in use :0
pctcpu:0 minflt:1987 majflt:7
> thread - 0
SLT ST TID PID CPUID POLICY PRI CPU EVENT PROCNAME
0 s 3 0 unbound FIFO 10 78 swapper
t_flags: wakeonsig kthread
Links: *procp:0xe2000000 *uthreadp:0x2ff3b400 *userp:0x2ff3b6e0
*prevthread:0xe6000000 *nextthread:0xe6000000, *stackp:0x00000000
*wchan1(real):0x00000000 *wchan2(VMM):0x00000000 *swchan:0x00000000
wchan1sid:0x00000000 wchan1offset:0x00000000
pevent:0x00000000 wevent:0x00000001 *slist:0x00000000
Dispatch Fields: *prior:0xe6000000 *next:0xe6000000
polevel:0x0000000a ticks:0x0c0f *synch:0xffffffff result:0x00000000
*eventlst:0x00000000 *wchan(hashed):0x00000000 suspend:0x0001
thread waiting for: event(s)
Scheduler Fields: cpuid:0xffffffff scpuid:0xffffffff pri: 16 policy:FIFO
affinity:0x0001 affinity_ts:0x3b6e31e cpu:0x0078 run_queue:34a900
lpri: 0 wpri:127 time:0x00 sav_pri:0x10
Misc: lockcount:0x00000000 ulock:0x00000000 *graphics:0x00000000
dispct:0x00031718 fpuct:0x00000001 boosted:0x0000
userdata:0x00000000
fsflags: 00000000 adsp_flags: 0000
Signal Information: cursig:0x00 *scp:0x00000000
pending:hi 0x00000000,lo 0x00000000 sigmask:hi 0x00000000,lo 0x00000000
> q
# lslpp -w /usr/lib/drivers/aiopin    (find the fileset that owns this file)
File Fileset Type
----------------------------------------------------------------------------
/usr/lib/drivers/aiopin bos.rte.aio File
# lslpp -ah bos.rte.aio    (show the fileset history; the installed level is 4.3.3.1)
Fileset Level Action Status Date Time
----------------------------------------------------------------------------
Path: /usr/lib/objrepos
bos.rte.aio
4.3.3.0 COMMIT COMPLETE 01/01/70 08:29:52
4.3.3.1 COMMIT COMPLETE 01/07/00 09:57:11
4.3.3.1 APPLY COMPLETE 01/07/00 09:55:52
Path: /etc/objrepos
bos.rte.aio
4.3.3.0 COMMIT COMPLETE 01/01/70 08:29:52
4.3.3.1 COMMIT COMPLETE 01/07/00 09:57:11
4.3.3.1 APPLY COMPLETE 01/07/00 09:55:53
So the crash is related to the bos.rte.aio fileset. A search of the IBM website turned up the following APAR:
IY05599: AIO CRASH IN COMPARE_AND_SWAP 00/01/14 PTF PECHANGE
APAR status
Closed as program error.
Error description
When the parameter passed to the compare_and_swap() expected
to be a pointer to an integer, but the code passed an integer.
I/O on this address (small integer) caused the system crashed
with DSI.
Local fix
Problem summary
***************************************************************
*USERS AFFECTED: *
* All users with the following filesets at these levels *
* bos.rte.aio 4.3.3.1.
***************************************************************
*PROBLEM DESCRIPTION: *
* When the parameter passed to the compare_and_swap()
* expected to be a pointer to an integer, but the code
* passed an integer. I/O on this address (small
* integer) caused the system crashed with DSI.
***************************************************************
*RECOMMENDATION: *
* Apply apar IY05599
***************************************************************
Problem conclusion
Corrected the parameter passed to compare_and_swap calls.
Temporary fix
Comments
APAR information
APAR number IY05599
Reported component name AIX 4.3.0
Reported component ID 5765C3403
Reported release 430
Status CLOSED PER
PE YesPE
HIPER NoHIPER
Submitted date 1999-11-02
Closed date 1999-11-08
Last modified date 2000-10-17
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name AIX 4.3.0
Fixed component ID 5765C3403
Applicable component levels
R430 PSY U467596 UP99/12/21 I 1000
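It is now clear that this machine needs the corresponding fix (APAR IY05599) applied to resolve the crashes for good.
Reposted from: https://blog.51cto.com/liujia/561242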
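Whether a host is exposed comes down to comparing the installed level of bos.rte.aio with the level the APAR lists as affected (4.3.3.1). The rough sketch below parses history rows in the `lslpp -ah` layout shown earlier. The function names are hypothetical, and the sample rows are copied from that listing.

```python
AFFECTED_LEVEL = (4, 3, 3, 1)  # "bos.rte.aio 4.3.3.1" from the APAR text

# History rows copied from the `lslpp -ah bos.rte.aio` output.
SAMPLE = """\
4.3.3.0 COMMIT COMPLETE 01/01/70 08:29:52
4.3.3.1 COMMIT COMPLETE 01/07/00 09:57:11
4.3.3.1 APPLY COMPLETE 01/07/00 09:55:52
"""


def parse_level(text):
    """Turn '4.3.3.1' into the tuple (4, 3, 3, 1) so levels compare numerically."""
    return tuple(int(part) for part in text.split("."))


def installed_level(history):
    """Return the highest level among rows whose status column is COMPLETE."""
    levels = []
    for line in history.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "COMPLETE":
            levels.append(parse_level(fields[0]))
    return max(levels) if levels else None


level = installed_level(SAMPLE)
print("installed:", ".".join(str(n) for n in level))
print("listed as affected by IY05599:", level == AFFECTED_LEVEL)
```

The same comparison can be repeated after the fix is installed to confirm that the fileset has moved past the affected level.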