In a RAC environment, the alert log on one of the nodes contained the following message: waiting for instances to leave.
First, how the problem was discovered. Developers reported that the RAC environment was responding slowly. At the time we happened to be on one of the RAC nodes checking the RMAN logs, and noticed that RMAN was taking a very long time to connect to the database, with no response for long stretches. Meanwhile the other node had become almost inaccessible: a plain ssh login took more than a minute, let alone running any commands.
Checking the alert log on the node that was still accessible turned up the "waiting for instances to leave" message mentioned above:
Mon Jul 27 15:52:14 2009
Waiting for instances to leave:
1
The cause of this problem is fairly involved, so to describe it clearly we need to start from the beginning.
It began when hardware maintenance found that one of the RAC server nodes was down and could not be reached remotely. A check from the database side showed node 1 down and node 2 running normally. Node 2's alert log showed that it had performed some recovery operations and successfully taken over all database operations:
Fri Jul 24 16:25:09 2009
Reconfiguration started (old inc 4, new inc 6)
List of nodes:
1
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Fri Jul 24 16:25:09 2009
LMS 0: 6 GCS shadows cancelled, 0 closed
Fri Jul 24 16:25:09 2009
LMS 1: 14 GCS shadows cancelled, 1 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Fri Jul 24 16:25:11 2009
Instance recovery: looking for dead threads
Fri Jul 24 16:25:11 2009
Beginning instance recovery of 1 threads
Fri Jul 24 16:25:12 2009
LMS 0: 378032 GCS shadows traversed, 0 replayed
Fri Jul 24 16:25:12 2009
LMS 1: 381728 GCS shadows traversed, 0 replayed
Fri Jul 24 16:25:12 2009
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Fri Jul 24 16:25:14 2009
parallel recovery started with 7 processes
Fri Jul 24 16:25:15 2009
Started redo scan
Fri Jul 24 16:25:15 2009
Completed redo scan
3990 redo blocks read, 986 data blocks need recovery
Fri Jul 24 16:25:15 2009
Started redo application at
Thread 1: logseq 1678, block 572689
Fri Jul 24 16:25:15 2009
Recovery of Online Redo Log: Thread 1 Group 1 Seq 1678 Reading mem 0
Mem# 0: /dev/vx/rdsk/datavg/newtrade_redo1_1_1_1g
Mem# 1: /dev/vx/rdsk/datavg/newtrade_redo1_1_2_1g
Fri Jul 24 16:25:15 2009
Completed redo application
Fri Jul 24 16:25:16 2009
Completed instance recovery at
Thread 1: logseq 1678, block 576679, scn 1112404710247
862 data blocks read, 1014 data blocks written, 3990 redo blocks read
Switch log for thread 1 to sequence 1679
The hardware engineers then powered node 1 back on at the site. During boot it apparently reported a memory-related error (I was not on site; this is secondhand), but a subsequent check found no error recorded by the system.
Oracle's alert log showed no errors either, so we suspected the sudden node crash was hardware-related. The system's dmesg output likewise gave no clue as to why the node had gone down.
Since no error could be found, after node 1 came back up, instance 1 of the database was started on it as well.
But it was not long before we hit the problem described at the beginning. The detailed messages in node 2's alert log were as follows:
Mon Jul 27 15:20:42 2009
MMNL absent for 1202 secs; Foregrounds taking over
Mon Jul 27 15:37:40 2009
MMNL absent for 1599 secs; Foregrounds taking over
Mon Jul 27 15:48:19 2009
IPC Send timeout detected. Sender: ospid 2284
Receiver: inst 1 binc 27438 ospid 3739
Mon Jul 27 15:49:24 2009
IPC Send timeout detected. Sender: ospid 5165
Receiver: inst 1 binc 27438 ospid 3739
Mon Jul 27 15:49:31 2009
IPC Send timeout detected. Sender: ospid 5179
Receiver: inst 1 binc 27438 ospid 3739
……
Mon Jul 27 15:49:58 2009
IPC Send timeout detected. Sender: ospid 23757
Receiver: inst 1 binc 27438 ospid 3739
Mon Jul 27 15:49:59 2009
IPC Send timeout to 0.2 inc 12 for msg type 32 from opid 46
Mon Jul 27 15:49:59 2009
Communications reconfiguration: instance_number 1
Mon Jul 27 15:50:15 2009
IPC Send timeout to 0.2 inc 12 for msg type 32 from opid 43
……
Mon Jul 27 16:07:35 2009
Waiting for instances to leave:
1
Mon Jul 27 16:07:55 2009
Waiting for instances to leave:
1
Mon Jul 27 16:08:15 2009
Waiting for instances to leave:
1
It starts with IPC Send timeout detected errors, followed by the "Waiting for instances to leave" messages. Judging from these messages and node 1's response times, the problem most likely lay on node 1: it was probably so busy that its response times stretched out, and node 2, which has to communicate with node 1 directly, hit timeouts during that communication.
To resolve the situation as quickly as possible, we ran srvctl on node 2 to shut down the instance on node 1:
bash-3.00$ srvctl stop instance -d newtrade -i newtrade1
After some waiting, instance 1 on node 1 was finally shut down, and node 1's responsiveness returned to normal. Checking node 1's alert log revealed a large number of errors (excerpt):
Mon Jul 27 15:05:00 2009
GES: Potential blocker (pid=3815) on resource CI-0000000A-00000002;
enqueue info in file /data/oracle/admin/newtrade/bdump/newtrade1_lmd0_3704.trc and DIAG trace file
……
Mon Jul 27 15:14:59 2009
kkjcre1p: unable to spawn jobq slave process
Mon Jul 27 15:14:59 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_cjq0_3879.trc:
Mon Jul 27 15:15:33 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:24:44 2009
GES: Potential blocker (pid=3815) on resource CI-0000000A-00000002;
enqueue info in file /data/oracle/admin/newtrade/bdump/newtrade1_lmd0_3704.trc and DIAG trace file
Mon Jul 27 15:25:10 2009
GES: Potential blocker (pid=3815) on resource CI-00000001-00000002;
enqueue info in file /data/oracle/admin/newtrade/bdump/newtrade1_lmd0_3704.trc and DIAG trace file
……
Mon Jul 27 15:33:07 2009
Process startup failed, error stack:
Mon Jul 27 15:33:07 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_psp0_3671.trc:
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspawn3
……
Mon Jul 27 15:57:04 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:57:32 2009
kkjcre1p: unable to spawn jobq slave process
Mon Jul 27 15:57:32 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_cjq0_3879.trc:
Mon Jul 27 15:57:45 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:58:33 2009
WARNING: inbound connection timed out (ORA-3136)
……
Mon Jul 27 16:05:55 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 16:06:22 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_arc0_4240.trc:
ORA-00600: internal error code, arguments: [2103], [1], [0], [1], [900], [], [], []
Mon Jul 27 16:06:24 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_arc1_4267.trc:
ORA-00600: internal error code, arguments: [2103], [1], [0], [1], [900], [], [], []
Mon Jul 27 16:06:54 2009
USER: terminating instance due to error 481
……
Mon Jul 27 16:06:54 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_ckpt_3815.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:54 2009
Errors in file /data/oracle/admin/newtrade/udump/newtrade1_ora_16645.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:54 2009
Errors in file /data/oracle/admin/newtrade/udump/newtrade1_ora_16926.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:54 2009
System state dump is made for local instance
……
Mon Jul 27 16:06:56 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_mman_3757.trc:
ORA-00481: LMON process terminated with error
System State dumped to trace file /data/oracle/admin/newtrade/bdump/newtrade1_diag_3644.trc
Mon Jul 27 16:07:05 2009
Dump system state for local instance only
Mon Jul 27 16:07:05 2009
Trace dumping is performing id=[cdmp_20090727160705]
Mon Jul 27 16:07:08 2009
Shutting down instance (abort)
License high water mark = 104
Mon Jul 27 16:07:09 2009
Instance terminated by USER, pid = 26184
Mon Jul 27 16:07:13 2009
Instance terminated by USER, pid = 17258
The log shows not only routine errors but ORA-600 errors as well; clearly there was quite a bit to sort out.
Although the messages are numerous, most of them are repeats. Boiled down, there are only a few distinct errors:
First, ORA-3136 appeared. A known bug in 10.2.0.3 already produced this error occasionally, but the volume here exceeded the total accumulated over several years, which is clearly abnormal.
Next, several lightweight background processes failed and reported errors while being spawned, including the Q001, PZ98, job slave and CJQ processes.
The PSP process hit ORA-27300 while spawning processes.
The LMS processes hit IPC Send timeout detected errors.
The archiver processes hit ORA-600 [2103].
Finally, ORA-00481 appeared: the LMON process terminated, taking all the background processes down with it.
The initial ORA-3136 errors are easy to understand: the error is connection-timeout related, and with the system responding extremely slowly because of the underlying fault, a flood of timeouts is no surprise.
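For reference, ORA-3136 is governed by Oracle Net's inbound connect timeout settings. A minimal sketch of where these limits live in 10.2 follows; the values are illustrative, not taken from this system:

```
# sqlnet.ora on the database server (illustrative value)
SQLNET.INBOUND_CONNECT_TIMEOUT = 120

# listener.ora (illustrative value; conventionally set slightly
# lower than the sqlnet.ora limit)
INBOUND_CONNECT_TIMEOUT_LISTENER = 110
```

Raising these limits would only hide the symptom in this case: the timeouts were caused by node 1's slowness, not by the limits themselves.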
Next, look at the trace file for the CJQ process:
bash-3.00$ more /data/oracle/admin/newtrade/bdump/newtrade1_cjq0_3879.trc
/data/oracle/admin/newtrade/bdump/newtrade1_cjq0_3879.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /data/oracle/product/10.2/database
System name: SunOS
Node name: newtrade1
Release: 5.10
Version: Generic_127111-06
Machine: sun4u
Instance name: newtrade1
Redo thread mounted by this instance: 1
Oracle process number: 15
Unix process pid: 3879, image: oracle@newtrade1 (CJQ0)
*** 2009-07-27 15:12:03.901
*** SERVICE NAME:(SYS$BACKGROUND) 2009-07-27 15:12:03.517
*** SESSION ID:(540.1) 2009-07-27 15:12:03.412
Waited for process J000 to be spawned for 60 seconds
……
*** 2009-07-27 15:12:55.540
Waited for process J000 to be spawned for 111 seconds
*** 2009-07-27 15:14:59.846
*** 2009-07-27 15:53:28.354
Waited for process J000 to be spawned for 60 seconds
……
*** 2009-07-27 15:54:18.914
Dumping diagnostic information for J000:
OS pid = 6781
loadavg : 1.32 2.43 2.14
swap info: free_mem = 162.41M rsv = 21278.20M
           alloc = 17453.59M avail = 22820.93 swap_free = 26645.53M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 S oracle 6781 1 0 40 20 ? 15905 ? 15:53:17 ? 0:00 ora_j000_newtrade1
6781: ora_j000_newtrade1
000000010405b53c skgmidrealm (ffffffff7fffe4a0, 106527230, 0, ffffffff7fffe4c8, 3, 0) + fc
000000010405af38 skgmlocate (ffffffff7fffe4a0, 106527230, ffffffff7fffe4c8, 29815c08, 0, 0) + 158
0000000104056db4 skgmattach (ffffffff7fffe4a0, 106527230, ffffffff7fffe4c8, 29815c08, 1065278c0, ffffffff7fffe4c8) + 114
00000001011012f8 ksmlsge (0, 106400, 0, 106527000, 106537000, 106400) + b8
00000001025153cc opirip (ffffffffffffffff, 1056db000, 105400, ffffffff7ffff018, 10652a000, 10652a720) + 10c
000000010033f774 opidrv (106537460, 0, 32, 106534198, 32, 0) + 4b4
0000000100339c10 sou2o (ffffffff7ffff4a8, 32, 4, ffffffff7ffff4d0, 105e0f000, 105e0f) + 50
00000001002fbfcc opimai_real (3, ffffffff7ffff5a8, 0, 0, 247d6fc, 14800) + 10c
00000001002fbdf8 main (1, ffffffff7ffff6b8, 0, ffffffff7ffff5b0, ffffffff7ffff6c0, ffffffff79d00140) + 98
00000001002fbd1c _start (0, 0, 0, 0, 0, 0) + 17c
*** 2009-07-27 15:56:44.622
*** 2009-07-27 15:57:18.714
Killing process (ospid 6781): (reason=x4 error=0)
... and the process is still alive after kill!
*** 2009-07-27 15:57:32.677
*** 2009-07-27 16:06:55.038
KCL: caught error 481 during cr lock op
*** 2009-07-27 16:06:55.162
ORA-00604: error occurred at recursive SQL level 1
ORA-00481: LMON process terminated with error
From these messages, the errors came from long waits while trying to spawn new processes. That, too, is a consequence of the slow system response rather than the cause of the problem.
Next, check the trace corresponding to the ORA-600 [2103] error:
bash-3.00$ more /data/oracle/admin/newtrade/bdump/newtrade1_arc0_4240.trc
*** SERVICE NAME:(SYS$BACKGROUND) 2009-07-27 11:03:13.336
*** SESSION ID:(533.1) 2009-07-27 11:03:13.336
Redo shipping client performing standby login
*** 2009-07-27 11:03:13.468 65194 kcrr.c
Logged on to standby successfully
Client logon and security negotiation successful!
*** 2009-07-27 16:06:22.450
TIMEOUT ON CONTROL FILE ENQUEUE
mode=S, type=0, wait=1, eqt=900
*** 2009-07-27 16:06:22.456
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [2103], [1], [0], [1], [900], [], [], []
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedmp()+744         CALL     ksedst()             000000840 ?
                                                   FFFFFFFF7FFFB44C ?
                                                   000000000 ?
                                                   FFFFFFFF7FFF7F40 ?
                                                   FFFFFFFF7FFF6CA8 ?
                                                   FFFFFFFF7FFF76A8 ?
This ORA-600 is clearly caused by TIMEOUT ON CONTROL FILE ENQUEUE: the archiver needed to access the control file, but because other processes were already accessing it, the archiver queued up; after waiting in the queue for more than 900 seconds it raised ORA-600 [2103]. Again, this is not the root cause of the problem, just another error triggered by the overloaded system.
Then look at the trace for the PSP process:
bash-3.00$ more /data/oracle/admin/newtrade/bdump/newtrade1_psp0_3671.trc
*** 2009-07-27 14:50:50.097
Dump diagnostics for process PZ99 pid 11183 which did not start after 120 seconds:
(spawn_time:xD42371 now:xD5FF06 diff:x1DB95)
*** 2009-07-27 14:50:50.121
Dumping diagnostic information for PZ99:
OS pid = 11183
loadavg : 0.98 1.52 2.10
swap info: free_mem = 250.75M rsv = 21332.33M
           alloc = 17360.75M avail = 22774.20 swap_free = 26745.77M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 S oracle 11183 1 0 39 20 ? 15905 ? 14:48:23 ? 0:00 ora_pz99_newtrade1
11183: ora_pz99_newtrade1
0000000000000000 ???????? (0, 0, 0, 0, 0, 0)
*** 2009-07-27 14:51:46.295
Killing process (ospid 11183): did not start after 120 seconds
... and the process is still alive after kill!
Killed process 11183 is still alive after 62 seconds!
*** 2009-07-27 14:54:48.727
Killing process (ospid 17027): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 14:57:57.161
Dump diagnostics for process PZ98 pid 20623 which did not start after 120 seconds:
(spawn_time:xDAA953 now:xDC83AF diff:x1DA5C)
*** 2009-07-27 14:57:57.205
Dumping diagnostic information for PZ98:
OS pid = 20623
loadavg : 1.32 1.07 1.64
swap info: free_mem = 197.38M rsv = 21395.36M
           alloc = 17448.21M avail = 22713.61 swap_free = 26660.76M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 S oracle 20623 1 0 40 20 ? 15849 ? 14:55:34 ? 0:00 ora_pz98_newtrade1
pstack: cannot examine 20623: process is traced
*** 2009-07-27 15:00:42.098
Killing process (ospid 20623): did not start after 120 seconds
... and the process is still alive after kill!
Killed process 20623 is still alive after 168 seconds!
……
*** 2009-07-27 15:56:14.520
Dump diagnostics for process J000 pid 6781 which did not start after 120 seconds:
(spawn_time:x11009AD now:x111E13E diff:x1D791)
*** 2009-07-27 15:56:14.538
Dumping diagnostic information for J000:
OS pid = 6781
loadavg : 1.02 1.95 2.00
swap info: free_mem = 154.73M rsv = 21285.11M
           alloc = 17466.89M avail = 22809.92 swap_free = 26628.14M
skgpgpstack: read() for cmd /bin/ps -elf | /bin/egrep 'PID | 6781' | /bin/grep -v grep timed out after 60 seconds
pstack: cannot examine 6781: no such process or core file
*** 2009-07-27 15:59:35.070
*** 2009-07-27 16:05:24.760
Killing process (ospid 14395): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 16:06:55.478
error 481 detected in background process
ORA-00481: LMON process terminated with error
These messages point to an operating-system resource shortage causing fork failures when spawning new processes. The memory figures in the dumps show only about 154MB of free memory. It really did look like a memory problem.
Check node 1's memory configuration:
root@newtrade1 # /usr/sbin/prtconf | grep "Memory size"
Memory size: 16384 Megabytes
Only 16GB? It should be 32GB. Check node 2:
root@newtrade2 # /usr/sbin/prtconf | grep "Memory size"
Memory size: 32768 Megabytes
That explains it: the SGA of this database is still around 20GB:
bash-3.00$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.3.0 - Production on Mon Jul 27 19:19:49 2009
Copyright (c) 1982, 2006, Oracle. All Rights Reserved.
Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
SQL> show sga
Total System Global Area 2.0972E+10 bytes
Fixed Size                  2165832 bytes
Variable Size            7157380024 bytes
Database Buffers         1.3808E+10 bytes
Redo Buffers                4325376 bytes
The problem is now obvious. The memory fault on node 1 caused the node to crash. On the subsequent boot the memory error meant that 16GB of memory was not brought online, leaving the system with only 16GB, while the database starts with an SGA of roughly 20GB. Oracle could still start, but only by spilling over into swap, that is, onto disk. After a while, as user load grew, constant paging drove the system's response times up sharply, until node 1 was completely paralyzed; and node 2, which needs to communicate with node 1, was left waiting on it.
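The arithmetic behind the failure can be made concrete. The sketch below is purely illustrative: the `sga_headroom` helper and the 2GB OS reserve are assumptions for the example, not anything from the system; only the SGA and memory figures come from the case. A ~20.97GB SGA fits on the healthy 32GB node but not on the degraded 16GB node:

```python
# Rough arithmetic behind the failure (figures from the case;
# the helper and the OS reserve are illustrative assumptions).
GB = 1024 ** 3

def sga_headroom(physical_gb, sga_bytes, os_reserve_gb=2):
    """Bytes of physical memory left after the SGA and a rough
    OS reserve; negative means the SGA must spill into swap."""
    return physical_gb * GB - sga_bytes - os_reserve_gb * GB

sga = 20972 * 10**6               # ~2.0972E+10 bytes, from "show sga"
healthy = sga_headroom(32, sga)   # node with all 32GB visible
degraded = sga_headroom(16, sga)  # node after losing 16GB of memory

print(healthy > 0, degraded > 0)  # True False
```

With negative headroom, part of the SGA lives in swap, so normal buffer-cache activity turns into disk paging, which matches the observed collapse in response time.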
In fact, as early as the first instance startup after the memory was halved, Oracle had issued a warning:
WARNING: Not enough physical memory for SHM_SHARE_MMU segment of size 0x00000004e200c000 [flag=0x4000]
Had we spotted this message earlier, we might have pinpointed the problem sooner.
With the cause identified, fixing it is straightforward. Repairing the hardware solves the database problem at the root. If the system load is modest, the database can run in single-instance mode for a while. And if the hardware repair will take a long time, an instance-level SGA can be configured on node 1: lowering the SGA and PGA settings to fit node 1's currently available memory also avoids the failure.
From instance 2, set an instance-specific SGA_TARGET for instance 1:
SQL> alter system set sga_target = 9663676416 scope = spfile sid = 'newtrade1';
System altered.
Then restart instance 1, and the problem is resolved:
$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.3.0 - Production on Tue Jul 28 17:55:41 2009
Copyright (c) 1982, 2006, Oracle. All Rights Reserved.
Connected to an idle instance.
SQL> startup
ORACLE instance started.
Database mounted.
Database opened.
Once the hardware problem is fixed, simply RESET instance 1's SGA_TARGET parameter to return to the normal configuration.
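The override can be removed with a sid-qualified ALTER SYSTEM RESET, so that after the next restart instance 1 again inherits the cluster-wide setting. A sketch using the names from this case (verify against your own spfile before running):

```sql
-- Remove the newtrade1-specific override set earlier; after the
-- next restart, instance 1 falls back to the cluster-wide SGA_TARGET.
alter system reset sga_target scope = spfile sid = 'newtrade1';
```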
Following this diagnosis, the hardware engineers examined the server on site and found a failed backplane, which had disabled the two CPUs and 16GB of memory attached to it. Replacing the hardware resolved the problem for good.
This article is excerpted from "A RAC Environment Diagnostic Case" in Part 2 of 《Oracle DBA手记》 (Oracle DBA Notes), by Yang Tingkun (杨廷琨).
Image source: http://www.allsoft.co.kr/bbs/board.php?bo_table=study91_2&wr_id=30