Waiting for instances to leave

In a RAC environment, the alert log on one of the nodes reported the following message: waiting for instances to leave.

During hardware maintenance it was discovered that one node of the RAC cluster had gone down and could no longer be reached remotely. A check from the database side confirmed that node 1 was down while node 2 was running normally. Node 2 performed the recovery operations and successfully took over management of the entire database:

Fri Jul 24 16:25:09

Reconfiguration started (old inc 4, new inc 4)

List of nodes:

1

Global Resource Directory frozen

* dead instance detected - domain 0 invalid = TRUE

Communication channels reestablished

Master broadcasted resource hash value bitmaps

Non-local Process blocks cleaned out

Fri Jul 24 16:25:09

LMS 0: 6 GCS shadows cancelled, 0 closed

Fri Jul 24 16:25:09

LMS 1: 14 GCS shadows cancelled, 1 closed

Set master node info

Submitted all remote-enqueue requests

Dwn-cvts replayed, VALBLKs dubious

All grantable enqueues granted

Post SMON to start 1st pass IR

Fri Jul 24 16:25:11

Instance recovery: looking for dead threads

Fri Jul .....

Beginning instance recovery of 1 threads

Fri Jul

LMS 0: 378032 GCS shadows traversed, 0 replayed

Fri Jul

LMS 1: 381728 GCS shadows traversed, 0 replayed

Fri Jul

Submitted all GCS remote-cache requests

Fix write in gcs resources

Reconfiguration complete

Fri Jul

Started redo scan

Fri Jul

Completed redo scan, 3990 redo blocks read, 986 data blocks need recovery

Fri Jul

Started redo application at Thread 1: logseq 1678, block 572689

Fri Jul

Recovery of Online Redo Log: Thread 1 Group 1 Seq 1678 Reading mem 0

Mem# 0 : /dev/vx/.../newtrade_redo1....

Mem# 1: /dev/vx/.../newtrade_redo .....

Fri Jul

Completed redo application

Fri Jul

Completed instance recovery at Thread 1: logseq 1678, block 576679, scn 111111; 862 data blocks read, 1014 data blocks written, 3990 redo blocks read

Switch log for thread 1 to sequence 1679

The node 1 server was restarted on site. During startup it reported memory-related errors, but a follow-up check found no system error messages.

A check of the Oracle alert log found no errors either, so it was suspected that the sudden crash was related to a hardware fault. The system's dmesg output likewise offered no explanation for why the machine had gone down.

Instance 1 on node 1 was restarted, but before long the error described earlier appeared again. The detailed messages in node 2's alert log were as follows:
Mon Jul

MMNL absent for 1202 secs; Foregrounds taking over

Mon Jul

MMNL absent for 1599 secs; Foregrounds taking over

Mon Jul

IPC send timeout detected. Sender:ospid 2284

Receiver:inst 1 binc 27438 ospid 3739

Mon Jul

IPC Send timeout detected. Sender:ospid 5165

Receiver:inst 1 binc 27438 ospid 3739

Mon Jul

IPC Send timeout detected. Sender:ospid 5179

Receiver:inst 1 binc 27438 ospid 3739

Mon Jul

IPC Send timeout detected. Sender:ospid 23757

Receiver:inst 1 binc 27438 ospid 3739

Mon Jul

IPC Send timeout to 0.2 inc 12 for msg type 32 from opid 43

Mon Jul

Communications reconfiguration: instance_number 1

Mon Jul

IPC Send timeout to 0.2 inc 12 for msg type 32 from opid 43

Mon Jul

Waiting for instances to leave:

The sequence begins with IPC Send timeout detected errors and ends with the waiting for instances to leave message. Judging from these errors and from node 1's response time, the problem most likely lies on node 1: node 1 may have been so busy that its responses were severely delayed, and node 2, which has to interact with node 1 directly, hit timeout errors during that interaction.
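At this point a quick sanity check from the surviving node can show which instances the cluster still considers up. A minimal sketch against the standard gv$instance view (note that if the remote instance is hung rather than down, cross-instance queries like this may themselves hang):

```sql
-- Run from node 2: list every instance the cluster can still reach,
-- with its status and startup time.
select inst_id, instance_name, host_name, status,
       to_char(startup_time, 'YYYY-MM-DD HH24:MI:SS') as started
  from gv$instance
 order by inst_id;
```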

To resolve the problem as quickly as possible, srvctl was run on node 2 in an attempt to shut down the instance on node 1:

srvctl stop instance -d newtrade -i newtrade1

After a wait, instance 1 on node 1 was finally shut down, and node 1's response time returned to normal. A check of node 1's alert log revealed a large number of errors (excerpt):

Mon Jul

GES: Potential blocker (pid=3815) on resource CI-00000000000A-0000000002;

enqueue info in file /data/oracle/admin/newtrade/bdump/newtrade1_lmd0_3704.trc and DIAG trace file

Mon Jul

kkjcre1p: unable to spawn jobq slave process

Mon Jul

Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_cj10_3879.trc:

Mon Jul

WARNING: inbound connection timed out (ORA-3136)

Mon Jul

GES: Potential blocker (pid=3815) on resource CI-0000A-00002;

enqueue info in file /data/oracle/admin/newtrade/bdump/newtrade...trc and DIAG trace file

Mon Jul

GES:Potential blocker(pid=3815) on resource CI-0001-000..2;

enqueue info in file /data/oracle/admin/newtrade/bdump/new.....trc and DIAG trace file

Mon Jul

Process startup failed, error stack:

Mon Jul

Errors in file /data/oracle/.../...trc

ORA-27300: OS system dependent operation:fork failed with status: 11

ORA-27301: OS failure message: Resource temporarily unavailable

ORA-27302: failure occurred at: skgpspawn3

Mon Jul

WARNING: inbound connection timed out (ORA-3136)

Mon Jul

kkjcre1p: unable to spawn jobq slave process

Mon Jul

Errors in file /data/.../..trc

Mon Jul

WARNING:inbound connection timed out(ORA-3136)

Mon Jul

Errors in file /data...

ORA-00600:internal error code, arguments:..

Mon Jul

USER: terminating instance due to error 481

Mon Jul

Errors in file /.../trc

ORA-00481: LMON process terminated with error

Mon Jul

Errors in file /.../trc

Mon Jul

ORA-00481:LMON process terminated with error

Mon Jul

Errors in file /.../trc

ORA-00481: LMON process terminated with error

Mon Jul

System state dump is made for local instance


From these messages we can see not only routine errors but also an ORA-600. The main categories of errors:

1. ORA-3136: this bug is known to exist in release 10.2.0.3, but the number of occurrences this time already exceeded the cumulative total of the past several years.

2. Lightweight background processes failed and reported errors at startup, including the Q001, PZ98, JOB, and CJQ processes.

3. The PSP process hit ORA-27300 errors at startup.

4. The LMS processes hit IPC Send timeout detected errors.

5. The archiver processes hit ORA-600 [2103] errors.

6. ORA-00481: the LMON process terminated all the background processes.

7. ORA-3136 is related to connection timeouts; given that the anomaly made the system respond extremely slowly, a flood of timeout errors is hardly surprising.

View the error information in the CJQ process trace file:

more /data/..../.._cjq0_...trc

Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production

With the Partitioning, Real Application Clusters,OLAP and Data Mining options

ORACLE_HOME=/data/oracle/...

System name :SunOS

Node name: newtrade1

Release:5.10

Version:Generic_...

Machine:sun4u

Instance name:newtrade1

Redo thread mounted by this instance:1

Oracle process number:15

Unix process pid:3879,image:oracle@newtrade1(CJQ0)

*** 2009..

***SERVICE NAME:(SYS$BACKGROUND)..

***SESSION ID:(540.1)...

Waited for process J000 to be spawned for 60 seconds

.....

***2009 ..

Waited for process J000 to be spawned for 111 seconds

***2009

Waited for process J000 to be spawned for 60 seconds

....

***2009-07

Dumping diagnostic information for J000;

OS pid = 6781

loadavg : 1.32 2.43 2.14

swap info: free_mem = 162.41M rsv=21278.20M alloc = 17453.59M avail = 22820.93 swap_free = 26645.53M

....

....

Killing process (ospid 6781): (reason=x4 error=0)

... and the process is still alive after kill!

KCL:caught error 481 during cr lock op

...

ORA-00604:error occurred at recursive SQL level 1

ORA-00481:LMON process terminated with error

Analysis of these errors shows that attempts to spawn new processes ran into long waits. This too is a consequence of the slow system response rather than the cause of the problem.

Check the corresponding ORA-600 [2103] error:

more /data/.../.._arc0_4240.trc

***SERVICE NAME:(SYS$BACKGROUND) 2009..

***SESSION ID:(533.1)

Redo shipping client performing standby login

***

Logged on to standby successfully

Client logon and security negotiation successful!

 

TIMEOUT ON CONTROL FILE ENQUEUE mode=S,type=O,wait=1,eqt=900

***

ksedmp:internal or fatal error

ORA-00600:internal error code,arguments:

.....

This ORA-600 error was clearly caused by the TIMEOUT ON CONTROL FILE ENQUEUE: the archiver process needed to access the control file, but because other processes were already accessing it, the archiver was queued on the enqueue; after waiting in the queue for more than 900 seconds (eqt=900), it raised ORA-600 [2103]. Again, this error is not the root cause of the problem but another symptom of the overloaded system.
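Whether control file enqueue contention of this kind is ongoing can be checked from the cumulative enqueue statistics. A minimal sketch against the standard gv$enqueue_stat view (available in 10g):

```sql
-- CF is the control file enqueue; a high wait count and cumulative
-- wait time relative to total requests indicates the contention
-- described above.
select inst_id, eq_type, total_req#, total_wait#, cum_wait_time
  from gv$enqueue_stat
 where eq_type = 'CF';
```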

View the information in the PSP process trace file:

more /data/.../newtrade1_psp0_3671.trc

***

Dump diagnostics for process PZ99 pid 11183 which did not start after 120 seconds:(spawn_time:xD... now:xD5F.. diff:x1D..)

***

Dumping diagnostic information for PZ99:

OS pid = 11183

loadavg:0.98 1.52 2.10

swap info:free_mem = 250.75M rsv=21332.33M alloc=17360.75M avail=22774.20 swap_free=26745.77M

...

Dump diagnostics for process J000 pid 6781 which did not start after 120 seconds:

skgpgpstack:read() for cmd /bin/ps -elf| /bin/egrep 'PID | 6781'| /bin/grep -v grep timed out after 60 seconds

pstack:cannot examine 6781:no such process or core file

***

***

killing process(ospid 14395):requester cancelled request and the process is still alive after kill!

***

error 481 detected in background process

ORA-00481:LMON process terminated with error

These messages suggest that the operating system ran short of resources, so forking new processes failed. The memory statistics show only about 154 MB of free memory at the time.

Check node 1's memory configuration:

/usr/sbin/prtconf | grep "Memory size"

Memory size: 16384 Megabytes

Only 16 GB; it should normally be 32 GB.

Check node 2's memory:

/usr/sbin/prtconf | grep "Memory size"

Memory size: 32768 Megabytes

And the database SGA is still around 20 GB.

A memory fault on node 1 caused the node to go down. When it was restarted, a memory error occurred and 16 GB of memory failed to come online, so the system booted with only 16 GB, while the database SGA alone takes about 20 GB. Oracle could still start, but only by spilling over into swap space on disk. After a while, as user load grew, the constant paging drove the system's response time up sharply until node 1 was completely paralyzed; node 2, which has to interact with node 1, was left waiting as well. In fact, as early as the first instance startup after the memory was halved, Oracle had already issued a warning:

WARNING: Not enough physical memory for SHM_SHARE_MMU segment of size ....

Solution:

Fixing the hardware fixes the database problem at the root. If the hardware will take a long time to repair, an instance-specific SGA can be configured on node 1 instead: lowering the SGA and PGA settings to match node 1's currently available memory also prevents the failure from recurring.

Set an instance-specific SGA_TARGET for instance 1 from instance 2:

SQL> alter system set sga_target=9663676416 scope=spfile sid='newtrade1';

Then restart instance 1.

Once the hardware problem is resolved, simply RESET instance 1's SGA_TARGET parameter to return to the original configuration.
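Assuming the instance-specific override set above, the reset could look like this sketch:

```sql
-- Remove the newtrade1-specific setting so the instance falls back
-- to the cluster-wide SGA_TARGET after the faulty memory is replaced.
SQL> alter system reset sga_target scope=spfile sid='newtrade1';
```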

