环境:
OS:SUSE 10 SP2
kernel:2.6.16.60-0.21-bigsmp
node1:orcl1 ip:192.168.10.23
node2:orcl2 ip:192.168.10.24
db:oracle 10g RAC
ORACLE_SID=ORCL
hostname:rac1 & rac2
SERVER:IBM X3850M2
SD:IBM DS4700
故障描述:
从5月25号以来数据库每天监听都会自动断两次。有时候是节点一,有时候是节点二。每次重启过后过一段时间就会自动断掉。今天早上9点左右来检查数据库的状态一切正常,下午两点过再次检查的时候,节点二的监听断掉(如下)
oracle@rac1:~> crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora.orcl.db application ONLINE ONLINE rac1
ora....l1.inst application ONLINE ONLINE rac1
ora....l2.inst application ONLINE ONLINE rac2
ora....SM1.asm application ONLINE ONLINE rac1
ora....C1.lsnr application ONLINE ONLINE rac1
ora.rac1.gsd application ONLINE ONLINE rac1
ora.rac1.ons application ONLINE ONLINE rac1
ora.rac1.vip application ONLINE ONLINE rac1
ora....SM2.asm application ONLINE ONLINE rac2
ora....C2.lsnr application ONLINE OFFLINE
ora.rac2.gsd application ONLINE ONLINE rac2
ora.rac2.ons application ONLINE ONLINE rac2
ora.rac2.vip application ONLINE ONLINE rac1
oracle@rac2:~> crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora.orcl.db application ONLINE ONLINE rac1
ora....l1.inst application ONLINE ONLINE rac1
ora....l2.inst application ONLINE ONLINE rac2
ora....SM1.asm application ONLINE ONLINE rac1
ora....C1.lsnr application ONLINE ONLINE rac1
ora.rac1.gsd application ONLINE ONLINE rac1
ora.rac1.ons application ONLINE ONLINE rac1
ora.rac1.vip application ONLINE ONLINE rac1
ora....SM2.asm application ONLINE ONLINE rac2
ora....C2.lsnr application ONLINE OFFLINE
ora.rac2.gsd application ONLINE ONLINE rac2
ora.rac2.ons application ONLINE ONLINE rac2
ora.rac2.vip application ONLINE ONLINE rac1
检查alert日志:
在节点一及节点二中并没有发现有六月一号的日志存在。都只到了5月31号就没有日志内容。
oracle@rac2:~/product/10gR2/crs/log/rac2> date
Tue Jun 1 15:07:43 CST 2010
oracle@rac1:~/product/10gR2/crs/log/rac1> tail -50 alertrac1.log
2010-04-27 14:51:21.772
[evmd(18987)]CRS-1401:EVMD started on node rac1.
2010-04-27 14:51:21.892
[crsd(19064)]CRS-1012:The OCR service started on node rac1.
2010-04-27 14:51:25.843
[crsd(19064)]CRS-1201:CRSD started on node rac1.
2010-04-27 14:51:41.560
[cssd(19614)]CRS-1605:CSSD voting file is online: /dev/raw/raw5. Details in /opt/oracle/product/10gR2/crs/log/rac1/cssd/ocssd.log.
2010-05-04 09:43:20.631
[cssd(19614)]CRS-1612:node rac2 (2) at 50% heartbeat fatal, eviction in 29.094 seconds
2010-05-04 09:43:35.629
[cssd(19614)]CRS-1611:node rac2 (2) at 75% heartbeat fatal, eviction in 14.104 seconds
2010-05-28 16:39:21.378
[cssd(6886)]CRS-1605:CSSD voting file is online: /dev/raw/raw3. Details in /opt/oracle/product/10gR2/crs/log/rac1/cssd/ocssd.log.
2010-05-28 16:39:21.409
[cssd(6886)]CRS-1605:CSSD voting file is online: /dev/raw/raw4. Details in /opt/oracle/product/10gR2/crs/log/rac1/cssd/ocssd.log.
2010-05-28 16:39:21.409
[cssd(6886)]CRS-1605:CSSD voting file is online: /dev/raw/raw5. Details in /opt/oracle/product/10gR2/crs/log/rac1/cssd/ocssd.log.
2010-05-28 16:39:21.409
[cssd(6886)]CRS-1604:CSSD voting file is offline: /dev/raw/raw5. Details in /opt/oracle/product/10gR2/crs/log/rac1/cssd/ocssd.log.
2010-05-28 16:39:41.914
[cssd(6886)]CRS-1605:CSSD voting file is online: /dev/raw/raw5. Details in /opt/oracle/product/10gR2/crs/log/rac1/cssd/ocssd.log.
[cssd(6886)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 rac2 .
2010-05-28 16:39:42.983
[crsd(6390)]CRS-1012:The OCR service started on node rac1.
2010-05-28 16:39:42.986
[evmd(6309)]CRS-1401:EVMD started on node rac1.
2010-05-28 16:39:47.065
[crsd(6390)]CRS-1201:CRSD started on node rac1.
2010-05-31 20:27:41.384
[cssd(7144)]CRS-1605:CSSD voting file is online: /dev/raw/raw3. Details in /opt/oracle/product/10gR2/crs/log/rac1/cssd/ocssd.log.
2010-05-31 20:27:41.386
[cssd(7144)]CRS-1605:CSSD voting file is online: /dev/raw/raw4. Details in /opt/oracle/product/10gR2/crs/log/rac1/cssd/ocssd.log.
2010-05-31 20:27:41.391
[cssd(7144)]CRS-1605:CSSD voting file is online: /dev/raw/raw5. Details in /opt/oracle/product/10gR2/crs/log/rac1/cssd/ocssd.log.
[cssd(7144)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 rac2 .
2010-05-31 20:27:51.189
[crsd(6463)]CRS-1012:The OCR service started on node rac1.
2010-05-31 20:27:51.192
[evmd(6476)]CRS-1401:EVMD started on node rac1.
2010-05-31 20:27:54.869
[crsd(6463)]CRS-1201:CRSD started on node rac1.
oracle@rac2:~/product/10gR2/crs/log/rac2> tail -50 alertrac2.log
2010-05-28 16:39:14.626
[cssd(6935)]CRS-1604:CSSD voting file is offline: /dev/raw/raw5. Details in /opt/oracle/product/10gR2/crs/log/rac2/cssd/ocssd.log.
[cssd(6935)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac2 .
2010-05-28 16:39:22.294
[crsd(6361)]CRS-1005:The OCR upgrade was completed. Version has changed from 169870336 to 169870336. Details in /opt/oracle/product/10gR2/crs/log/rac2/crsd/crsd.log.
2010-05-28 16:39:22.294
[crsd(6361)]CRS-1012:The OCR service started on node rac2.
2010-05-28 16:39:22.302
[evmd(6368)]CRS-1401:EVMD started on node rac2.
2010-05-28 16:39:24.563
[crsd(6361)]CRS-1201:CRSD started on node rac2.
2010-05-28 16:39:25.087
[crsd(6361)]CRS-1205:Auto-start failed for the CRS resource . Details in rac2.
2010-05-28 16:39:25.235
[crsd(6361)]CRS-1205:Auto-start failed for the CRS resource . Details in rac2.
2010-05-28 16:39:32.562
[crsd(6361)]CRS-1205:Auto-start failed for the CRS resource . Details in rac2.
2010-05-28 16:39:35.459
[crsd(6361)]CRS-1205:Auto-start failed for the CRS resource . Details in rac2.
2010-05-28 16:39:35.625
[cssd(6935)]CRS-1605:CSSD voting file is online: /dev/raw/raw5. Details in /opt/oracle/product/10gR2/crs/log/rac2/cssd/ocssd.log.
2010-05-28 16:39:38.436
[crsd(6361)]CRS-1205:Auto-start failed for the CRS resource . Details in rac2.
2010-05-28 16:39:38.464
[crsd(6361)]CRS-1205:Auto-start failed for the CRS resource . Details in rac2.
[cssd(6935)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 rac2 .
2010-05-31 20:27:25.441
[cssd(7264)]CRS-1605:CSSD voting file is online: /dev/raw/raw3. Details in /opt/oracle/product/10gR2/crs/log/rac2/cssd/ocssd.log.
2010-05-31 20:27:25.546
[cssd(7264)]CRS-1605:CSSD voting file is online: /dev/raw/raw4. Details in /opt/oracle/product/10gR2/crs/log/rac2/cssd/ocssd.log.
2010-05-31 20:27:25.566
[cssd(7264)]CRS-1605:CSSD voting file is online: /dev/raw/raw5. Details in /opt/oracle/product/10gR2/crs/log/rac2/cssd/ocssd.log.
[cssd(7264)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac2 .
2010-05-31 20:27:40.260
[crsd(6611)]CRS-1005:The OCR upgrade was completed. Version has changed from 169870336 to 169870336. Details in /opt/oracle/product/10gR2/crs/log/rac2/crsd/crsd.log.
2010-05-31 20:27:40.261
[crsd(6611)]CRS-1012:The OCR service started on node rac2.
2010-05-31 20:27:40.268
[evmd(6474)]CRS-1401:EVMD started on node rac2.
2010-05-31 20:27:42.244
[crsd(6611)]CRS-1201:CRSD started on node rac2.
2010-05-31 20:27:44.114
[crsd(6611)]CRS-1205:Auto-start failed for the CRS resource . Details in rac2.
2010-05-31 20:27:44.277
查看crs日志
oracle@rac1:~/product/10gR2/crs/log/rac1/crsd> tail crsd.log
2010-05-31 20:28:19.669: [ CRSRES][2682809248]0startRunnable: setting CLI values
2010-05-31 20:28:19.718: [ CRSRES][2691201952]0startRunnable: setting CLI values
2010-05-31 20:28:35.457: [ CRSRES][2707987360]0startRunnable: setting CLI values
2010-05-31 20:28:38.307: [ CRSRES][3065169584]0 startup = 1
2010-05-31 20:28:38.512: [ CRSRES][3065169584]0 startup = 1
2010-05-31 20:28:38.668: [ CRSRES][2699594656]0CRS-1002: Resource 'ora.rac1.LISTENER_RAC1.lsnr' is already running on member 'rac1'
2010-05-31 20:28:38.746: [ CRSRES][2707987360]0startRunnable: setting CLI values
2010-05-31 20:28:38.885: [ CRSRES][2707987360]0Attempting to start `ora.rac1.ASM1.asm` on member `rac1`
2010-05-31 20:28:46.292: [ CRSRES][2707987360]0Start of `ora.rac1.ASM1.asm` on member `rac1` succeeded.
2010-05-31 20:28:46.488: [ CRSRES][2707987360]0startRunnable: setting CLI values
2010-05-31 20:28:46.627: [ CRSRES][2707987360]0Attempting to start `ora.orcl.orcl1.inst` on member `rac1`
2010-05-31 20:29:22.143: [ CRSRES][2707987360]0Start of `ora.orcl.orcl1.inst` on member `rac1` succeeded.
2010-05-31 20:29:27.132: [ OCRUTL][2919832480]u_freem: mem passed is null
2010-06-01 11:10:37.736: [ CRSRES][2699594656]0startRunnable: setting CLI values
oracle@rac2:~/product/10gR2/crs/log/rac2/crsd> tail crsd.log
2010-05-31 20:28:35.745: [ CRSRES][2674412448]0Attempting to start `ora.rac1.LISTENER_RAC1.lsnr` on member `rac1`
2010-05-31 20:28:37.980: [ CRSRES][2674412448]0Start of `ora.rac1.LISTENER_RAC1.lsnr` on member `rac1` succeeded.
2010-06-01 11:10:36.615: [ CRSAPP][2656766880]0CheckResource error for ora.rac2.vip error code = 1
2010-06-01 11:10:36.616: [ CRSRES][2656766880]0In stateChanged, ora.rac2.vip target is ONLINE
2010-06-01 11:10:36.617: [ CRSRES][2656766880]0ora.rac2.vip on rac2 went OFFLINE unexpectedly
2010-06-01 11:10:36.617: [ CRSRES][2656766880]0StopResource: setting CLI values
2010-06-01 11:10:36.643: [ CRSRES][2656766880]0Attempting to stop `ora.rac2.vip` on member `rac2`
2010-06-01 11:10:36.915: [ CRSRES][2656766880]0Stop of `ora.rac2.vip` on member `rac2` succeeded.
2010-06-01 11:10:36.915: [ CRSRES][2656766880]0ora.rac2.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
2010-06-01 11:10:36.943: [ CRSRES][2656766880]0ora.rac2.vip failed on rac2 relocating.
2010-06-01 11:10:37.052: [ CRSRES][2656766880]0StopResource: setting CLI values
2010-06-01 11:10:37.067: [ CRSRES][2656766880]0Attempting to stop `ora.rac2.LISTENER_RAC2.lsnr` on member `rac2`
2010-06-01 11:10:37.360: [ CRSRES][2656766880]0Stop of `ora.rac2.LISTENER_RAC2.lsnr` on member `rac2` succeeded.
2010-06-01 11:10:37.507: [ CRSRES][2656766880]0Attempting to start `ora.rac2.vip` on member `rac1`
2010-06-01 11:10:47.477: [ CRSRES][2656766880]0Start of `ora.rac2.vip` on member `rac1` succeeded.
2010-06-01 11:10:47.760: [ CRSRES][2699590560]0CRS-1002: Resource 'ora.rac2.vip' is already running on member 'rac1'
查看racg里面日志
ora.rac1.vip.log里面并没有6月1号的内容。ora.rac2.vip.log里面有ping to 192.168.10.254 via eth0 failed, rc = 1 (host=rac1) ping to 192.168.10.254 via eth0 failed, rc = 1 (host=rac1)失败。
oracle@rac1:~/product/10gR2/crs/log/rac1/racg> tail -50 ora.rac1.vip.log
2010-05-30 06:23:29.834: [ RACG][3066017456] [7155][3066017456][ora.rac1.vip]: clsrcexecut: env ORACLE_CONFIG_HOME=/opt/oracle/product/10gR2/crs
2010-05-30 06:23:29.834: [ RACG][3066017456] [7155][3066017456][ora.rac1.vip]: clsrcexecut: cmd = /opt/oracle/product/10gR2/crs/bin/racgeut -e _USR_ORA_DEBUG=0 54 /opt/oracle/product/10gR2/crs/bin/racgvip check rac1
2010-05-30 06:23:29.834: [ RACG][3066017456] [7155][3066017456][ora.rac1.vip]: clsrcexecut: rc = 1, time = 6.220s
2010-05-30 06:23:29.834: [ RACG][3066017456] [7155][3066017456][ora.rac1.vip]: end for resource = ora.rac1.vip, action = check, status = 1, time = 6.250s
2010-05-31 09:22:06.114: [ RACG][3065804464] [3333][3065804464][ora.rac1.vip]: IP:192.168.10.25 is already up in the network (host=rac1)
IP:192.168.10.25 is already up in the network (host=rac1)
ping to 192.168.10.254 via eth0 failed, rc = 1 (host=rac1)
ping to 192.168.10.254 via eth0 failed, rc = 1 (host=rac1)
2010-05-31 09:22:06.114: [ RACG][3065804464] [3333][3065804464][ora.rac1.vip]: Interface eth0 checked failed (host=rac1)
Invalid parameters, or failed to bring up VIP (host=rac1)
2010-05-31 09:22:06.114: [ RACG][3065804464] [3333][3065804464][ora.rac1.vip]: clsrcexecut: env ORACLE_CONFIG_HOME=/opt/oracle/product/10gR2/crs
2010-05-31 09:22:06.114: [ RACG][3065804464] [3333][3065804464][ora.rac1.vip]: clsrcexecut: cmd = /opt/oracle/product/10gR2/crs/bin/racgeut -e _USR_ORA_DEBUG=0 54 /opt/oracle/product/10gR2/crs/bin/racgvip start rac1
2010-05-31 09:22:06.114: [ RACG][3065804464] [3333][3065804464][ora.rac1.vip]: clsrcexecut: rc = 1, time = 13.020s
2010-05-31 09:22:09.326: [ RACG][3065804464] [3333][3065804464][ora.rac1.vip]: clsrcexecut: env ORACLE_CONFIG_HOME=/opt/oracle/product/10gR2/crs
2010-05-31 09:22:09.326: [ RACG][3065804464] [3333][3065804464][ora.rac1.vip]: clsrcexecut: cmd = /opt/oracle/product/10gR2/crs/bin/racgeut -e _USR_ORA_DEBUG=0 54 /opt/oracle/product/10gR2/crs/bin/racgvip check rac1
2010-05-31 09:22:09.326: [ RACG][3065804464] [3333][3065804464][ora.rac1.vip]: clsrcexecut: rc = 1, time = 3.210s
2010-05-31 09:22:09.326: [ RACG][3065804464] [3333][3065804464][ora.rac1.vip]: end for resource = ora.rac1.vip, action = start, status = 1, time = 16.240s
2010-05-31 09:24:06.451: [ RACG][3065546416] [4295][3065546416][ora.rac1.vip]: ping to 192.168.10.254 via eth0 failed, rc = 1 (host=rac1)
ping to 192.168.10.254 via eth0 failed, rc = 1 (host=rac1)
2010-05-31 09:24:08.552: [ RACG][3065816752] [4410][3065816752][ora.rac1.vip]: clsrcstartorp: Error with malloc
2010-05-31 20:28:35.167: [ RACG][3065271984] [7652][3065271984][ora.rac1.vip]: ping to 192.168.10.254 via eth0 failed, rc = 1 (host=rac1)
ping to 192.168.10.254 via eth0 failed, rc = 1 (host=rac1)
oracle@rac1:~/product/10gR2/crs/log/rac1/racg> tail -30 ora.rac2.vip.log
2010-05-31 15:40:11.835: [ RACG][3065345712] [9437][3065345712][ora.rac2.vip]: clsrcexecut: env ORACLE_CONFIG_HOME=/opt/oracle/product/10gR2/crs
2010-05-31 15:40:11.835: [ RACG][3065345712] [9437][3065345712][ora.rac2.vip]: clsrcexecut: cmd = /opt/oracle/product/10gR2/crs/bin/racgeut -e _USR_ORA_DEBUG=0 54 /opt/oracle/product/10gR2/crs/bin/racgvip check rac2
2010-05-31 15:40:11.835: [ RACG][3065345712] [9437][3065345712][ora.rac2.vip]: clsrcexecut: rc = 1, time = 3.200s
2010-05-31 15:40:11.835: [ RACG][3065345712] [9437][3065345712][ora.rac2.vip]: end for resource = ora.rac2.vip, action = start, status = 1, time = 12.450s
2010-05-31 15:40:24.489: [ RACG][3066042032] [9611][3066042032][ora.rac2.vip]: ping to 192.168.10.254 via eth0 failed, rc = 1 (host=rac1)
ping to 192.168.10.254 via eth0 failed, rc = 1 (host=rac1)
2010-05-31 15:42:37.084: [ RACG][3065276080] [10567][3065276080][ora.rac2.vip]: ping to 192.168.10.254 via eth0 failed, rc = 1 (host=rac1)
ping to 192.168.10.254 via eth0 failed, rc = 1 (host=rac1)
Interface eth0 checked failed (host=rac1)
Invalid parameters, or failed to bring up VIP (host=rac1)
2010-05-31 15:42:37.085: [ RACG][3065276080] [10567][3065276080][ora.rac2.vip]: clsrcexecut: env ORACLE_CONFIG_HOME=/opt/oracle/product/10gR2/crs
2010-05-31 15:42:37.085: [ RACG][3065276080] [10567][3065276080][ora.rac2.vip]: clsrcexecut: cmd = /opt/oracle/product/10gR2/crs/bin/racgeut -e _USR_ORA_DEBUG=0 54 /opt/oracle/product/10gR2/crs/bin/racgvip check rac2
2010-05-31 15:42:37.085: [ RACG][3065276080] [10567][3065276080][ora.rac2.vip]: clsrcexecut: rc = 1, time = 6.200s
2010-05-31 15:42:37.085: [ RACG][3065276080] [10567][3065276080][ora.rac2.vip]: end for resource = ora.rac2.vip, action = check, status = 1, time = 6.230s
2010-05-31 15:42:37.360: [ RACG][3065542320] [10656][3065542320][ora.rac2.vip]: bounce eth0 (host=rac1)
2010-06-01 11:10:47.790: [ RACG][3065239216] [31482][3065239216][ora.rac2.vip]: ping to 192.168.10.254 via eth0 failed, rc = 1 (host=rac1)
ping to 192.168.10.254 via eth0 failed, rc = 1 (host=rac1)
通过上面日志显示节点不能ping通默认网关,分别查看两台服务器网关,节点一与节点二的默认网关都为192.168.10.254。但是不能ping通,查看其他服务器也不能ping通网关,后来确认是移动在防火墙上面做了限制不能ping。关闭节点,并修改网关为192.168.10.11,在启动服务。
oracle@rac1:~/product/10gR2/crs/log/rac1> crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora.orcl.db application ONLINE ONLINE rac1
ora....l1.inst application ONLINE ONLINE rac1
ora....l2.inst application ONLINE ONLINE rac2
ora....SM1.asm application ONLINE ONLINE rac1
ora....C1.lsnr application ONLINE ONLINE rac1
ora.rac1.gsd application ONLINE ONLINE rac1
ora.rac1.ons application ONLINE ONLINE rac1
ora.rac1.vip application ONLINE ONLINE rac1
ora....SM2.asm application ONLINE ONLINE rac2
ora....C2.lsnr application ONLINE ONLINE rac2
ora.rac2.gsd application ONLINE ONLINE rac2
ora.rac2.ons application ONLINE ONLINE rac2
ora.rac2.vip application ONLINE ONLINE rac2
能够正常启动。但还需要通过检查确认。继续观察。。。。
参考文档:http://software.digitalchina.com/fw/jsyk-show.asp?column_id=316&column_cat_id=41
疑问:1。既然不能ping通网关,那为什么两节点都能够正常启动起来。因为在做吉林RAC的时候,设置了网关但是不能ping通,节点不能启动。
2。为什么每次都只是一个节点的监听断掉,节点都不确定。
3。为什么在alertrac1.log,alertrac2.log中并没有存在记录。
后续观察情况:未出现前几天出现监听自动断掉情况(6月2日)
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/22779291/viewspace-664180/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/22779291/viewspace-664180/