RAC环境出现Waiting for instances to leave信息(二)

RAC环境的其中一个节点上的alert文件中,出现了这个信息:Waiting for instances to leave

这一篇描述问题的解决过程。

RAC环境出现Waiting for instances to leave信息(一):http://yangtingkun.itpub.net/post/468/493501

 

 

在节点1alert文件中发现了大量的错误信息,不过虽然错误信息很多,但是大部分都是重复的,主要的错误信息只有几种:

首先出现了ORA-3136错误,本来10.2.0.3上就存在这个bug,会出现ORA-3136错误,但是当前出现这个错误的数量,已经超过了这个错误几年来累计出现了次数;

随后是一些后台轻量级的进程出错并在这些进程的启动时报错:包括Q001进程、PZ98进程、JOB进程和cjq进程等;

PSP进程启动时出现了ORA-27300错误;

LSM进程碰到了IPC Send timeout detected

归档进程碰到了ORA-6002103的错误);

最后出现了ORA-00481错误,LMON进程中止了所有的后台进程。

最开始出现的ORA-3136错误不难理解,这个错误本来就和连接超时有关,由于存在异常导致系统响应非常缓慢,因此出现大量的超时错误不足为奇。

下面再来看看cjq进程对应的错误信息:

bash-3.00$ more /data/oracle/admin/newtrade/bdump/newtrade1_cjq0_3879.trc
/data/oracle/admin/newtrade/bdump/newtrade1_cjq0_3879.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /data/oracle/product/10.2/database
System name:    SunOS
Node name:      newtrade1
Release:        5.10
Version:        Generic_127111-06
Machine:        sun4u
Instance name: newtrade1
Redo thread mounted by this instance: 1
Oracle process number: 15
Unix process pid: 3879, image: oracle@newtrade1 (CJQ0)

*** 2009-07-27 15:12:03.901
*** SERVICE NAME:(SYS$BACKGROUND) 2009-07-27 15:12:03.517
*** SESSION ID:(540.1) 2009-07-27 15:12:03.412
Waited for process J000 to be spawned for 60 seconds
*** 2009-07-27 15:12:14.013
Waited for process J000 to be spawned for 70 seconds
*** 2009-07-27 15:12:24.155
Waited for process J000 to be spawned for 80 seconds
*** 2009-07-27 15:12:35.340
Waited for process J000 to be spawned for 91 seconds
*** 2009-07-27 15:12:45.430
Waited for process J000 to be spawned for 101 seconds
*** 2009-07-27 15:12:55.540
Waited for process J000 to be spawned for 111 seconds
*** 2009-07-27 15:14:59.846
*** 2009-07-27 15:53:28.354
Waited for process J000 to be spawned for 60 seconds
*** 2009-07-27 15:53:38.461
Waited for process J000 to be spawned for 70 seconds
*** 2009-07-27 15:53:48.556
Waited for process J000 to be spawned for 80 seconds
*** 2009-07-27 15:53:58.650
Waited for process J000 to be spawned for 90 seconds
*** 2009-07-27 15:54:08.829
Waited for process J000 to be spawned for 100 seconds
*** 2009-07-27 15:54:18.910
Waited for process J000 to initialize for 110 seconds
*** 2009-07-27 15:54:18.914
Dumping diagnostic information for J000:
OS pid = 6781
loadavg : 1.32 2.43 2.14
swap info: free_mem = 162.41M rsv = 21278.20M
           alloc = 17453.59M avail = 22820.93 swap_free = 26645.53M
 F S      UID   PID  PPID   C PRI NI     ADDR     SZ    WCHAN    STIME TTY         TIME CMD
 0 S   oracle  6781     1   0  40 20        ?  15905        ? 15:53:17 ?           0:00 ora_j000_newtrade1
6781:   ora_j000_newtrade1
 000000010405b53c skgmidrealm (ffffffff7fffe4a0, 106527230, 0, ffffffff7fffe4c8, 3, 0) + fc
 000000010405af38 skgmlocate (ffffffff7fffe4a0, 106527230, ffffffff7fffe4c8, 29815c08, 0, 0) + 158
 0000000104056db4 skgmattach (ffffffff7fffe4a0, 106527230, ffffffff7fffe4c8, 29815c08, 1065278c0, ffffffff7fffe4c8) + 114
 00000001011012f8 ksmlsge (0, 106400, 0, 106527000, 106537000, 106400) + b8
 00000001025153cc opirip (ffffffffffffffff, 1056db000, 105400, ffffffff7ffff018, 10652a000, 10652a720) + 10c
 000000010033f774 opidrv (106537460, 0, 32, 106534198, 32, 0) + 4b4
 0000000100339c10 sou2o (ffffffff7ffff4a8, 32, 4, ffffffff7ffff4d0, 105e0f000, 105e0f) + 50
 00000001002fbfcc opimai_real (3, ffffffff7ffff5a8, 0, 0, 247d6fc, 14800) + 10c
 00000001002fbdf8 main (1, ffffffff7ffff6b8, 0, ffffffff7ffff5b0, ffffffff7ffff6c0, ffffffff79d00140) + 98
 00000001002fbd1c _start (0, 0, 0, 0, 0, 0) + 17c
*** 2009-07-27 15:56:44.622
*** 2009-07-27 15:57:18.714
Killing process (ospid 6781):  (reason=x4 error=0)
... and the process is still alive after kill!
*** 2009-07-27 15:57:32.677
*** 2009-07-27 16:06:55.038
KCL: caught error 481 during cr lock op
*** 2009-07-27 16:06:55.162
ORA-00604: error occurred at recursive SQL level 1
ORA-00481: LMON process terminated with error

根据错误信息分析,错误原因是尝试spawned新的进程,出现了长时间的等待,这应该也是系统响应变慢造成的,而不是导致问题的原因。

随后检查对应的ORA-600(2103)错误:

bash-3.00$ more /data/oracle/admin/newtrade/bdump/newtrade1_arc0_4240.trc
/data/oracle/admin/newtrade/bdump/newtrade1_arc0_4240.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /data/oracle/product/10.2/database
System name:    SunOS
Node name:      newtrade1
Release:        5.10
Version:        Generic_127111-06
Machine:        sun4u
Instance name: newtrade1
Redo thread mounted by this instance: 1
Oracle process number: 22
Unix process pid: 4240, image: oracle@newtrade1 (ARC0)

*** 2009-07-27 11:03:13.336
*** SERVICE NAME:(SYS$BACKGROUND) 2009-07-27 11:03:13.336
*** SESSION ID:(533.1) 2009-07-27 11:03:13.336
Redo shipping client performing standby login
*** 2009-07-27 11:03:13.468 65194 kcrr.c
Logged on to standby successfully
Client logon and security negotiation successful!
*** 2009-07-27 16:06:22.450
TIMEOUT ON CONTROL FILE ENQUEUE
mode=S, type=0, wait=1, eqt=900
*** 2009-07-27 16:06:22.456
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [2103], [1], [0], [1], [900], [], [], []
----- Call Stack Trace -----
calling              call     entry                argument values in hex     
location             type     point                (? means dubious value)    
-------------------- -------- -------------------- ----------------------------
ksedmp()+744         CALL     ksedst()             000000840 ?
                                                   FFFFFFFF7FFFB44C ?
                                                   000000000 ?
                                                   FFFFFFFF7FFF7F40 ?
                                                   FFFFFFFF7FFF6CA8 ?
                                                   FFFFFFFF7FFF76A8 ?
kgeriv()+220         PTR_CALL 0000000000000000     000106400 ? 10652D264 ?
                                                   10652D000 ? 00010652D ?
                                                   000106400 ? 10652D264 ?
kgesiv()+112         CALL     kgeriv()             106527C18 ? 000000000 ?
                                                   000000837 ? 000000004 ?
                                                   FFFFFFFF7FFFB9A8 ?
                                                   000001430 ?
ksesic4()+96         CALL     kgesiv()             106527C18 ? 106762320 ?
                                                   000000837 ? 000000004 ?
                                                   FFFFFFFF7FFFB9A8 ?
                                                   000001420 ?
kccocx()+4528        CALL     ksesic4()            000000837 ? 000106527 ?
                                                   10652D258 ? 000106400 ?
                                                   10652D000 ? 00010652D ?
kcrribcx()+1732      CALL     kccbcx()             00000016B ? 38000F000 ?
                                                   000000001 ? 000000384 ?
                                                   000106400 ? 00038000F ?

这个ORA-600错误显然是TIMEOUT ON CONTROL FILE ENQUEUE导致的,归档进程要访问控制文件,而由于其他进程已经在访问控制文件,因此归档进程进入队列中,归档进程在队列中等待超过了900秒后,引发了这个ORA-600(2103)的错误。显然这个错误也不是导致问题的主要原因,而是系统繁忙所引发的一个错误。

接着再看看psp进程对应的信息:

bash-3.00$ more /data/oracle/admin/newtrade/bdump/newtrade1_psp0_3671.trc
/data/oracle/admin/newtrade/bdump/newtrade1_psp0_3671.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /data/oracle/product/10.2/database
System name:    SunOS
Node name:      newtrade1
Release:        5.10
Version:        Generic_127111-06
Machine:        sun4u
Instance name: newtrade1
Redo thread mounted by this instance: 1
Oracle process number: 4
Unix process pid: 3671, image: oracle@newtrade1 (PSP0)

*** SERVICE NAME:(SYS$BACKGROUND) 2009-07-27 14:50:50.097
*** SESSION ID:(553.1) 2009-07-27 14:50:50.097
*** 2009-07-27 14:50:50.097
Dump diagnostics for process PZ99 pid 11183 which did not start after 120 seconds:
  (spawn_time:xD42371 now:xD5FF06 diff:x1DB95)
*** 2009-07-27 14:50:50.121
Dumping diagnostic information for PZ99:
OS pid = 11183
loadavg : 0.98 1.52 2.10
swap info: free_mem = 250.75M rsv = 21332.33M
           alloc = 17360.75M avail = 22774.20 swap_free = 26745.77M
 F S      UID   PID  PPID   C PRI NI     ADDR     SZ    WCHAN    STIME TTY         TIME CMD
 0 S   oracle 11183     1   0  39 20        ?  15905        ? 14:48:23 ?           0:00 ora_pz99_newtrade1
11183:  ora_pz99_newtrade1
 0000000000000000 ???????? (0, 0, 0, 0, 0, 0)
*** 2009-07-27 14:51:46.295
Killing process (ospid 11183): did not start after 120 seconds
... and the process is still alive after kill!
Killed process 11183 is still alive after 62 seconds!
*** 2009-07-27 14:54:48.727
Killing process (ospid 17027): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 14:57:57.161
Dump diagnostics for process PZ98 pid 20623 which did not start after 120 seconds:
  (spawn_time:xDAA953 now:xDC83AF diff:x1DA5C)
*** 2009-07-27 14:57:57.205
Dumping diagnostic information for PZ98:
OS pid = 20623
loadavg : 1.32 1.07 1.64
swap info: free_mem = 197.38M rsv = 21395.36M
           alloc = 17448.21M avail = 22713.61 swap_free = 26660.76M
 F S      UID   PID  PPID   C PRI NI     ADDR     SZ    WCHAN    STIME TTY         TIME CMD
 0 S   oracle 20623     1   0  40 20        ?  15849        ? 14:55:34 ?           0:00 ora_pz98_newtrade1
pstack: cannot examine 20623: process is traced
*** 2009-07-27 15:00:42.098
Killing process (ospid 20623): did not start after 120 seconds
... and the process is still alive after kill!
Killed process 20623 is still alive after 168 seconds!
*** 2009-07-27 15:01:45.471
Killed process 20623 is still alive after 228 seconds!
*** 2009-07-27 15:02:45.974
Killed process 20623 is still alive after 288 seconds!
*** 2009-07-27 15:03:46.306
Killed process 20623 is still alive after 349 seconds!
*** 2009-07-27 15:04:46.853
Killed process 20623 is still alive after 409 seconds!
*** 2009-07-27 15:05:47.298
Killed process 20623 is still alive after 470 seconds!
*** 2009-07-27 15:06:47.791
Killed process 20623 is still alive after 530 seconds!
*** 2009-07-27 15:07:48.776
Killed process 20623 is still alive after 591 seconds!
*** 2009-07-27 15:14:39.249
Killing process (ospid 12827): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:19:46.554
Killing process (ospid 17820): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:20:47.222
Killed process 17820 is still alive after 60 seconds!
*** 2009-07-27 15:26:49.396
Killing process (ospid 24054): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:32:20.246
Killing process (ospid 797): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:33:07.352
Process startup failed, error stack:
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspawn3
*** 2009-07-27 15:34:01.940
Killed process 797 is still alive after 101 seconds!
*** 2009-07-27 15:37:17.894
Killing process (ospid 4398): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:38:51.234
Killed process 4398 is still alive after 93 seconds!
*** 2009-07-27 15:45:29.769
Killing process (ospid 6013): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:52:00.619
Killing process (ospid 6192): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:56:14.520
Dump diagnostics for process J000 pid 6781 which did not start after 120 seconds:
  (spawn_time:x11009AD now:x111E13E diff:x1D791)
*** 2009-07-27 15:56:14.538
Dumping diagnostic information for J000:
OS pid = 6781
loadavg : 1.02 1.95 2.00
swap info: free_mem = 154.73M rsv = 21285.11M
           alloc = 17466.89M avail = 22809.92 swap_free = 26628.14M
skgpgpstack: read() for cmd /bin/ps -elf | /bin/egrep 'PID | 6781' | /bin/grep -v grep timed out after 60 seconds
pstack: cannot examine 6781: no such process or core file
*** 2009-07-27 15:59:35.070
*** 2009-07-27 16:05:24.760
Killing process (ospid 14395): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 16:06:55.478
error 481 detected in background process
ORA-00481: LMON process terminated with error

从错误信息看,似乎是操作系统资源不足,导致fork新的进程时,出现了错误。从系统内存信息看,当前内存信息仅有154M的空闲。看来果然是内存出现了问题。

检查节点1的内存配置:

root@newtrade1 # /usr/sbin/prtconf | grep "Memory size"
Memory size: 16384 Megabytes

怎么只有16G了,而正常应该是32G才对,检查节点2

root@newtrade2 # /usr/sbin/prtconf | grep "Memory size"
Memory size: 32768 Megabytes

这就难怪了,要知道这个数据库的SGA大小还在20G左右:

bash-3.00$ sqlplus / as sysdba

SQL*Plus: Release 10.2.0.3.0 - Production on 星期一 7 27 19:19:49 2009

Copyright (c) 1982, 2006, Oracle.  All Rights Reserved.


连接到:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> show sga

Total System Global Area 2.0972E+10 bytes
Fixed Size                  2165832 bytes
Variable Size            7157380024 bytes
Database Buffers         1.3808E+10 bytes
Redo Buffers                4325376 bytes

问题已经很明显了,由于节点1上内存的故障,导致节点1出现了DOWN机。随后启动的时候出现了内存的错误,导致16G的内存没有被加载。当前系统仅仅加载了16G的内存。而数据库启动的SGA就占用了20G的内容,Oracle虽然可以启动,但是要借助SWAP区,也就是硬盘上的空间进行中转。等待一段时间后,用户负载增大,由于系统的内存频繁的分页,导致系统的响应时间迅速的增大,最终导致了节点1完全的瘫痪,而节点2由于要与节点1进行交互,因此也处于等待状态。

其实早在内存减少一半后,数据库实例的第一次启动时,Oracle就给出了告警信息:

WARNING: Not enough physical memory for SHM_SHARE_MMU segment of size 0x00000004e200c000 [flag=0x4000]

如果一早发现这个信息,可能会更迅速的确定问题。

找到问题的原因后,解决问题就显得易如反掌了:解决硬件的问题,就可以从根本上解决数据库的问题;如果系统的负载不是很大,可以让数据库处于单实例工作状态中;如果硬件问题需要很长时间才能解决,还可以在节点1上设置实例级的SGA,根据节点1当前可用的内存,降低SGAPGA的设置,这样也可以避免问题的产生。

在实例2上设置实例1的专用SGA_TARGET参数:

SQL> alter system set sga_target = 9663676416 scope = spfile sid = 'newtrade1';

系统已更改。

下面重启实例1,问题解决:

$ sqlplus / as sysdba

SQL*Plus: Release 10.2.0.3.0 - Production on 星期二 7 28 17:55:41 2009

Copyright (c) 1982, 2006, Oracle.  All Rights Reserved.

已连接到空闲例程。

SQL> startup
ORACLE
例程已经启动。

Total System Global Area 9663676416 bytes
Fixed Size                  2045872 bytes
Variable Size            1442842704 bytes
Database Buffers         8204058624 bytes
Redo Buffers               14729216 bytes
数据库装载完毕。
数据库已经打开。

等到解决硬件问题后,只需要RESET实例1SGA_TARGET参数就可以恢复正常了。

 

 

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/4227/viewspace-618381/,如需转载,请注明出处,否则将追究法律责任。

转载于:http://blog.itpub.net/4227/viewspace-618381/

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值