RAC环境出现Waiting for instances to leave信息(一)

RAC 环境的其中一个节点上的alert 文件中,出现了这个信息:Waiting for instances to leave

这一篇描述问题的发现过程。

 

首先描述一下问题的发现过程。程序员报告RAC 环境影响比较缓慢,而这时同事正好在这个RAC 环境的一个节点上检查RMAN 的日志,发现RMAN 连接数据库很慢,半天都没有响应。而这时,另一个节点几乎无法访问了,一个普通的ssh 登陆都需要1 分钟以上,更不要说执行命令了。

检查当前还可以访问的节点的alert 文件,发现了上面的Waiting for instances to leave 信息:

Mon Jul 27 15:52:14 2009
Waiting for instances to leave:
1

由于这个问题产生的原因比较复杂,为了将问题描述清楚,还需要重头说起。

首先是硬件维护任意发现RAC 服务器的其中一个节点DOWN 机,这个服务器无法通过远程登陆。

在数据库中检查,节点1 关闭,节点2 正常运行。从节点2 上的alert 文件中,可以看到,节点2 对数据库进行了一些恢复的操作,并成功接管了全部数据库的管理操作:

Fri Jul 24 16:25:09 2009
Reconfiguration started (old inc 4, new inc 6)
List of nodes:
1
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Fri Jul 24 16:25:09 2009
LMS 0: 6 GCS shadows cancelled, 0 closed
Fri Jul 24 16:25:09 2009
LMS 1: 14 GCS shadows cancelled, 1 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Fri Jul 24 16:25:11 2009
Instance recovery: looking for dead threads
Fri Jul 24 16:25:11 2009
Beginning instance recovery of 1 threads
Fri Jul 24 16:25:12 2009
LMS 0: 378032 GCS shadows traversed, 0 replayed
Fri Jul 24 16:25:12 2009
LMS 1: 381728 GCS shadows traversed, 0 replayed
Fri Jul 24 16:25:12 2009
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Fri Jul 24 16:25:14 2009
parallel recovery started with 7 processes
Fri Jul 24 16:25:15 2009
Started redo scan
Fri Jul 24 16:25:15 2009
Completed redo scan
3990 redo blocks read, 986 data blocks need recovery
Fri Jul 24 16:25:15 2009
Started redo application at
Thread 1: logseq 1678, block 572689
Fri Jul 24 16:25:15 2009
Recovery of Online Redo Log: Thread 1 Group 1 Seq 1678 Reading mem 0
Mem# 0: /dev/vx/rdsk/datavg/newtrade_redo1_1_1_1g
Mem# 1: /dev/vx/rdsk/datavg/newtrade_redo1_1_2_1g
Fri Jul 24 16:25:15 2009
Completed redo application
Fri Jul 24 16:25:16 2009
Completed instance recovery at
Thread 1: logseq 1678, block 576679, scn 1112404710247
862 data blocks read, 1014 data blocks written, 3990 redo blocks read
Switch log for thread 1 to sequence 1679

随后,硬件维护人员在现场启动了节点1 的服务器,在启动的时候似乎报了一个和内存有关的错误(我不在现场,听说的),随后检查没有发现系统的错误信息。

检查Oraclealert 信息,没有发现任何的错误。怀疑节点突然崩溃和硬件异常有关。

检查系统的dmest 信息,也没有找到系统异常DOWN 机的原因。

由于没有找到错误,在节点1 启动后,将节点1 上数据库的实例1 也启动了。

但是没过多长时间,就碰到了前面描述的错误。节点2alert 文件中详细的错误信息如下:

Mon Jul 27 15:20:42 2009
MMNL absent for 1202 secs; Foregrounds taking over
Mon Jul 27 15:37:40 2009
MMNL absent for 1599 secs; Foregrounds taking over
Mon Jul 27 15:48:19 2009
IPC Send timeout detected.Sender: ospid 2284
Receiver: inst 1 binc 27438 ospid 3739
Mon Jul 27 15:49:24 2009
IPC Send timeout detected.Sender: ospid 5165
Receiver: inst 1 binc 27438 ospid 3739
Mon Jul 27 15:49:31 2009
IPC Send timeout detected.Sender: ospid 5179
Receiver: inst 1 binc 27438 ospid 3739
.
.
.
Mon Jul 27 15:49:58 2009
IPC Send timeout detected.Sender: ospid 23757
Receiver: inst 1 binc 27438 ospid 3739
Mon Jul 27 15:49:59 2009
IPC Send timeout to 0.2 inc 12 for msg type 32 from opid 46
Mon Jul 27 15:49:59 2009
Communications reconfiguration: instance_number 1
Mon Jul 27 15:50:15 2009
IPC Send timeout to 0.2 inc 12 for msg type 32 from opid 43
Mon Jul 27 15:50:32 2009
Trace dumping is performing id=[cdmp_20090727154959]
Mon Jul 27 15:51:11 2009
Trace dumping is performing id=[cdmp_20090727154959]
Mon Jul 27 15:51:49 2009
Evicting instance 1 from cluster
Mon Jul 27 15:52:14 2009
Waiting for instances to leave:
1
Mon Jul 27 15:52:34 2009
Waiting for instances to leave:
1
Mon Jul 27 15:52:54 2009
Waiting for instances to leave:
1
.
.
.
Mon Jul 27 16:07:35 2009
Waiting for instances to leave:
1
Mon Jul 27 16:07:55 2009
Waiting for instances to leave:
1
Mon Jul 27 16:08:15 2009
Waiting for instances to leave:
1

开始是IPC Send timeout detected 错误,随后是Waiting for instances to leave 信息。根据错误信息,以及节点1 的响应时间判断,问题多半就出在节点1 上:多半是由于节点1 太忙,导致响应时间太长,而节点2 要与节点1 直接进行交互,在交互时出现了超时错误。

为了尽快的解决这个问题,在节点2 上执行svrctl 尝试关闭节点1 上的实例:

bash-3.00$ srvctl stop inst -d newtrade -i newtrade1

等待了一段时间,节点1 上的实例1 终于被关闭。而这时节点1 上的响应终于恢复正常了。检查节点1 上的alert 文件,发现了大量的错误信息:

Mon Jul 27 14:42:45 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 14:43:21 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 14:43:21 2009
WARNING: inbound connection timed out (ORA-3136)
.
.
.
Mon Jul 27 14:51:35 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 14:52:17 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 14:52:35 2009
Process PZ99 died, see its trace file
Mon Jul 27 14:52:35 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 14:52:35 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 14:53:29 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 14:54:18 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 14:54:42 2009
ksvcreate: Process(q002) creation failed
Mon Jul 27 14:55:27 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 14:55:44 2009
WARNING: inbound connection timed out (ORA-3136)
.
.
.
Mon Jul 27 15:02:27 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:02:48 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:05:00 2009
GES: Potential blocker (pid=3815) on resource CI-0000000A-00000002;
enqueue info in file /data/oracle/admin/newtrade/bdump/newtrade1_lmd0_3704.trc and DIAG trace file
Mon Jul 27 15:05:41 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:05:53 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:08:05 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:08:11 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:08:16 2009
Process PZ98 died, see its trace file
Mon Jul 27 15:08:40 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:08:41 2009
WARNING: inbound connection timed out (ORA-3136)
.
.
.
Mon Jul 27 15:10:36 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:10:55 2009
ksvcreate: Process(q001) creation failed
Mon Jul 27 15:11:50 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:12:26 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:14:05 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:14:59 2009
kkjcre1p: unable to spawn jobq slave process
Mon Jul 27 15:14:59 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_cjq0_3879.trc:

Mon Jul 27 15:15:33 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:16:12 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:24:04 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:24:44 2009
GES: Potential blocker (pid=3815) on resource CI-0000000A-00000002;
enqueue info in file /data/oracle/admin/newtrade/bdump/newtrade1_lmd0_3704.trc and DIAG trace file
Mon Jul 27 15:25:10 2009
GES: Potential blocker (pid=3815) on resource CI-00000001-00000002;
enqueue info in file /data/oracle/admin/newtrade/bdump/newtrade1_lmd0_3704.trc and DIAG trace file
Mon Jul 27 15:25:13 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:27:07 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:27:13 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:27:40 2009
ksvcreate: Process(q001) creation failed
Mon Jul 27 15:27:55 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:28:16 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:28:41 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:33:07 2009
Process startup failed, error stack:
Mon Jul 27 15:33:07 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_psp0_3671.trc:
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspawn3
Mon Jul 27 15:33:08 2009
Process q001 died, see its trace file
Mon Jul 27 15:33:46 2009
ksvcreate: Process(q001) creation failed
Mon Jul 27 15:37:54 2009
ksvcreate: Process(q001) creation failed
Mon Jul 27 15:39:41 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:39:50 2009
WARNING: inbound connection timed out (ORA-3136)
.
.
.
Mon Jul 27 15:43:30 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:45:30 2009
ksvcreate: Process(q001) creation failed
Mon Jul 27 15:45:40 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:45:40 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:46:28 2009
IPC Send timeout detected. Receiver ospid 3739
Mon Jul 27 15:46:48 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_lms1_3739.trc:
Mon Jul 27 15:47:32 2009
IPC Send timeout detected. Receiver ospid 3739
Mon Jul 27 15:47:39 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_lms1_3739.trc:
Mon Jul 27 15:48:16 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:49:27 2009
WARNING: inbound connection timed out (ORA-3136)
.
.
.
Mon Jul 27 15:50:00 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:50:34 2009
Trace dumping is performing id=[cdmp_20090727154959]
Mon Jul 27 15:52:29 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:52:30 2009
WARNING: inbound connection timed out (ORA-3136)
.
.
.
Mon Jul 27 15:57:04 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:57:32 2009
kkjcre1p: unable to spawn jobq slave process
Mon Jul 27 15:57:32 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_cjq0_3879.trc:

Mon Jul 27 15:57:45 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:58:33 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:59:08 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 16:00:00 2009
ksvcreate: Process(q001) creation failed
Mon Jul 27 16:01:23 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 16:01:23 2009
WARNING: inbound connection timed out (ORA-3136)
.
.
.
Mon Jul 27 16:05:55 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 16:06:22 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_arc0_4240.trc:
ORA-00600: internal error code, arguments: [2103], [1], [0], [1], [900], [], [], []
Mon Jul 27 16:06:24 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_arc1_4267.trc:
ORA-00600: internal error code, arguments: [2103], [1], [0], [1], [900], [], [], []
Mon Jul 27 16:06:26 2009
Trace dumping is performing id=[cdmp_20090727160625]
Mon Jul 27 16:06:30 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_arc0_4240.trc:
ORA-00600: internal error code, arguments: [2103], [1], [0], [1], [900], [], [], []
.
.
.
Mon Jul 27 16:06:30 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_arc1_4267.trc:
ORA-00600: internal error code, arguments: [2103], [1], [0], [1], [900], [], [], []
Mon Jul 27 16:06:54 2009
USER: terminating instance due to error 481
Mon Jul 27 16:06:54 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_lms1_3739.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:54 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_lms0_3714.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:54 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_ckpt_3815.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:54 2009
Errors in file /data/oracle/admin/newtrade/udump/newtrade1_ora_16645.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:54 2009
Errors in file /data/oracle/admin/newtrade/udump/newtrade1_ora_16926.trc:
ORA-00481: LMON
进程因错误而终止
Mon Jul 27 16:06:54 2009
System state dump is made for local instance
Mon Jul 27 16:06:54 2009
Errors in file /data/oracle/admin/newtrade/udump/newtrade1_ora_16834.trc:
ORA-00481: LMON
进程因错误而终止
Mon Jul 27 16:06:55 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_dbw0_3779.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:55 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_lmd0_3704.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:55 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_pmon_3633.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:55 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_cjq0_3879.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:55 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_psp0_3671.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:55 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_smon_3842.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:55 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_lck0_4026.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:55 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_lgwr_3805.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:55 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_lmon_3682.trc:
ORA-29740: evicted by member 1, group incarnation 14
Mon Jul 27 16:06:56 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_smon_3842.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:56 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_mman_3757.trc:
ORA-00481: LMON process terminated with error
System State dumped to trace file /data/oracle/admin/newtrade/bdump/newtrade1_diag_3644.trc
Mon Jul 27 16:07:05 2009
Dump system state for local instance only
Mon Jul 27 16:07:05 2009
Trace dumping is performing id=[cdmp_20090727160705]
Mon Jul 27 16:07:08 2009
Shutting down instance (abort)
License high water mark = 104
Mon Jul 27 16:07:09 2009
Instance terminated by USER, pid = 26184
Mon Jul 27 16:07:13 2009
Instance terminated by USER, pid = 17258

不但有常规的错误,还有ORA-600 错误,看来要解决的问题还真是不少。

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值