ORA-03113: end-of-file on communication channel, while the other node is normal

1.1.1        Problem description

Operating system: REDHAT 5.5

Database version: 11.2.0.3 + ASM + RAC

 
  

[oracle@ctp1-db1 bin]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.2.0 Production on Fri Jul 5 23:26:54 2013
Copyright (c) 1982, 2010, Oracle.  All rights reserved.
Connected to an idle instance.
SQL> startup
ORA-03113: end-of-file on communication channel

 
  
Trace file log:
can not attach to DLM

1.1.2        Analysis

  • alert_.log from non-first node

lmon registered with NM - instance number 2 (internal mem no 1)
Tue Dec 06 06:16:15 2011
System state dump requested by (instance=2, sid=19095 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /g01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_diag_19138.trc
Tue Dec 06 06:16:15 2011
PMON (ospid: 19095): terminating the instance due to error 481
Dumping diagnostic data in directory=[cdmp_20111206061615], requested by (instance=2, sid=19095 (PMON)), summary=[abnormal instance termination].
Tue Dec 06 06:16:15 2011
ORA-1092 : opitsk aborting process

Note: ASM instance terminates shortly after "lmon registered with NM"

If ASM on the non-first node was running previously, the alert.log from the original failure will likely contain the following:

..
IPC Send timeout detected. Sender: ospid 32231 [oracle@ftdcslsedw01b (PING)]
..
ORA-29740: evicted by instance number 1, group incarnation 10

  • diag trace from non-first ASM (+ASMn_diag_.trc)

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE])

  • alert_.log from first node

LMON (ospid: 15986) detects hung instances during IMR reconfiguration
LMON (ospid: 15986) tries to kill the instance 2 in 37 seconds.
Please check instance 2's alert log and LMON trace file for more details.
..
Remote instance kill is issued with system inc 64
Remote instance kill map (size 1) : 2
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2
Reconfiguration started (old inc 64, new inc 66)
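
The signatures quoted in the excerpts above can be searched for in one pass. The following is a minimal sketch, not an official diagnostic: `scan_asm_symptoms` is a hypothetical helper name, and the pattern list contains only strings that appear in the excerpts above.

```shell
# Hypothetical helper: reads alert-log or trace text on stdin and prints
# every line that contains one of the signatures quoted in this section.
scan_asm_symptoms() {
    grep -E 'terminating the instance due to error 481|ORA-29740|Can not attach to DLM|IPC Send timeout detected'
}

# Usage (the path is a placeholder, substitute the real alert log or trace):
#   scan_asm_symptoms < /path/to/alert.log
```

An empty result only means that none of these exact strings are present; it does not rule out the problem.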


If the issue happens while running a root script (root.sh or rootupgrade.sh) as part of the Grid Infrastructure installation/upgrade process, the following symptoms will be present:

  • root script screen output

Start of resource "ora.asm" failed

CRS-2672: Attempting to start 'ora.asm' on 'racnode1'
CRS-5017: The resource action "ora.asm start" encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0
. For details refer to "(:CLSN00107:)" in "/ocw/grid/log/racnode1/agent/ohasd/oraagent_grid/oraagent_grid.log".
CRS-2674: Start of 'ora.asm' on 'racnode1' failed
..
Failed to start ASM at /ispiris-qa/app/11.2.0.3/crs/install/crsconfig_lib.pm line 1272

  • $GRID_HOME/cfgtoollogs/crsconfig/rootcrs_.log

2011-11-29 15:56:48: Executing cmd: /ispiris-qa/app/11.2.0.3/bin/crsctl start resource ora.asm -init
..
>  CRS-2672: Attempting to start 'ora.asm' on 'racnode1'
>  CRS-5017: The resource action "ora.asm start" encountered the following error:
>  ORA-03113: end-of-file on communication channel
>  Process ID: 0
>  Session ID: 0 Serial number: 0
>  . For details refer to "(:CLSN00107:)" in "/ispiris-qa/app/11.2.0.3/log/racnode1/agent/ohasd/oraagent_grid/oraagent_grid.log".
>  CRS-2674: Start of 'ora.asm' on 'racnode1' failed
>  CRS-2679: Attempting to clean 'ora.asm' on 'racnode1'
>  CRS-2681: Clean of 'ora.asm' on 'racnode1' succeeded
..
>  CRS-4000: Command Start failed, or completed with errors.
>End Command output
2011-11-29 15:59:00: Executing cmd: /ispiris-qa/app/11.2.0.3/bin/crsctl check resource ora.asm -init
2011-11-29 15:59:00: Executing cmd: /ispiris-qa/app/11.2.0.3/bin/crsctl status resource ora.asm -init
2011-11-29 15:59:01: Checking the status of ora.asm
..
2011-11-29 15:59:53: Start of resource "ora.asm" failed

1.1.3    Case1: link local IP (169.254.x.x) is being used by other adapter/network

Symptoms:

  • $GRID_HOME/log//alert.log

[/ocw/grid/bin/orarootagent.bin(4813)]CRS-5018:(:CLSN00037:) Removed unused HAIP route:  169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0

  • OS messages (optional)

Dec  6 06:11:14 racnode1 dhclient: DHCPREQUEST on usb0 to 255.255.255.255 port 67
Dec  6 06:11:14 racnode1 dhclient: DHCPACK from 169.254.95.118

  • ifconfig -a

..
usb0      Link encap:Ethernet  HWaddr E6:1F:13:AD:EE:D3
        inet addr:169.254.95.120  Bcast:169.254.95.255  Mask:255.255.255.0
..
Note: it is usb0 in this case, but it can be any other adapter that uses a link-local address.

Solution:

Link-local IPs must not be used by any other network on the cluster nodes. In this case, a USB network device obtained IP 169.254.95.120 from the DHCP server at 169.254.95.118, which disrupted HAIP routing. The solution is to blacklist the device in udev so that it is not activated automatically.
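
To see which adapters currently hold a 169.254.x.x address, the `ifconfig -a` output can be filtered. This is a rough sketch that assumes the older `inet addr:` layout shown in the excerpt above; `list_link_local` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `ifconfig -a` output on stdin and prints
# "<adapter> <address>" for every 169.254.x.x address it finds.
list_link_local() {
    awk '
        /^[^ \t]/ { iface = $1 }      # an unindented line starts a new adapter block
        /inet addr:169\.254\./ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^addr:169\.254\./) {
                    sub(/^addr:/, "", $i)
                    print iface, $i
                }
        }
    '
}

# Usage:
#   /sbin/ifconfig -a | list_link_local
```

HAIP itself normally shows up here as an alias on the private interconnect adapter, so only entries on other adapters (usb0 in this case) are suspect.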

1.1.4    Case2: firewall exists between nodes on private network (iptables etc)

No firewall is allowed on the private network (cluster_interconnect) between nodes, including software firewalls such as iptables and ipmon.
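
For iptables, a quick way to spot active filtering is to count the rules in the `iptables -L -n` listing. The sketch below is an assumption-laden illustration: `count_iptables_rules` is a hypothetical helper, it only counts rule lines, and it does not inspect chain policies or other firewall products such as ipmon.

```shell
# Hypothetical helper: reads `iptables -L -n` output on stdin and prints
# the number of rule lines, i.e. lines that are neither blank, nor a
# "Chain ..." header, nor the "target prot opt ..." column header.
count_iptables_rules() {
    awk 'NF && $1 != "Chain" && $1 != "target" { n++ } END { print n + 0 }'
}

# Usage (as root, on every node):
#   /sbin/iptables -L -n | count_iptables_rules
```

A result of 0 suggests no iptables rules are loaded; any other value means the rules should be reviewed against the private network.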

1.1.5    Case3: HAIP is up on some nodes but not on all

Symptoms:

  • alert_.log for some instances

Cluster communication is configured to use the following interface(s) for this instance
10.1.0.1

  • alert_.log for other instances

Cluster communication is configured to use the following interface(s) for this instance
169.254.201.65

Note: some instances are using HAIP while others are not, so they cannot talk to each other.
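
The comparison between instances can be scripted by pulling the interconnect address out of each alert log. This is a minimal sketch that assumes the address is the first field on the first non-blank line after the "Cluster communication is configured" message, as in the excerpts above (real alert logs may carry more text on that line or list several interfaces); `interconnect_kind` is a hypothetical helper name.

```shell
# Hypothetical helper: reads an alert log on stdin and, for each
# "Cluster communication is configured ..." message, prints the first
# field of the next non-blank line plus "HAIP" or "non-HAIP".
interconnect_kind() {
    awk '
        /Cluster communication is configured to use the following interface/ { grab = 1; next }
        grab && NF {
            kind = ($1 ~ /^169\.254\./) ? "HAIP" : "non-HAIP"
            print $1, kind
            grab = 0
        }
    '
}

# Usage (the path is a placeholder):
#   interconnect_kind < /path/to/alert.log
```

If the output is "HAIP" on some nodes and "non-HAIP" on others, the cluster matches this case.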

Solution:
The solution is to bring up HAIP on all nodes.
To find out HAIP status, execute the following on all nodes:

$GRID_HOME/bin/crsctl stat res ora.cluster_interconnect.haip -init
If it's offline, try to bring it up as root:

$GRID_HOME/bin/crsctl start res ora.cluster_interconnect.haip -init


If HAIP fails to start, refer to note 1210883.1 for known issues.
If the "up node" is not using HAIP and no outage is allowed, the workaround is to set the init.ora/spfile parameter cluster_interconnects to the private IP of each node, which allows ASM/DB to come up on the "down node". Once a maintenance window is available, the parameter must be removed so that HAIP can work.
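
The HAIP status check described above can be reduced to a single word per node. The sketch below assumes the default attribute-style output of `crsctl stat res` (NAME=, TYPE=, TARGET=, STATE= lines); `haip_state` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `crsctl stat res ... -init` output on stdin
# and prints the first word of the STATE attribute (for example ONLINE
# or OFFLINE), dropping any trailing "on <node>" text.
haip_state() {
    awk -F= '$1 == "STATE" { split($2, part, " "); print part[1] }'
}

# Usage (on every node):
#   $GRID_HOME/bin/crsctl stat res ora.cluster_interconnect.haip -init | haip_state
```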

1.1.6    Case4: HAIP is up on all nodes but some do not have route info

Symptoms:

  • alert_.log for all instances

Cluster communication is configured to use the following interface(s) for this instance
169.254.xxx.xxx

  • "netstat -rn" for some nodes (surviving nodes) missing HAIP route

netstat -rn
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
161.130.90.0     0.0.0.0         255.255.248.0   U         0 0          0 bond0
160.131.11.0     0.0.0.0         255.255.255.0   U         0 0          0 bond2
0.0.0.0      160.11.80.1     0.0.0.0         UG        0 0          0   bond0
The line for HAIP is missing, i.e:
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 bond2
Note: As HAIP route info is missing on some nodes, HAIP is not pingable; usually a newly restarted node will have HAIP route info. Be sure to test whether the interconnect can be pinged.
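
Whether a node has the HAIP route can be tested from its `netstat -rn` output. This is a minimal sketch; `has_haip_route` is a hypothetical helper name, and it simply looks for a destination that starts with 169.254.

```shell
# Hypothetical helper: reads `netstat -rn` output on stdin and returns
# exit status 0 if any route destination starts with 169.254., else 1.
has_haip_route() {
    awk '$1 ~ /^169\.254\./ { found = 1 } END { exit found ? 0 : 1 }'
}

# Usage (on each node):
#   netstat -rn | has_haip_route || echo "HAIP route is missing on $(hostname)"
```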

Solution:

The solution is to manually add HAIP route info on the nodes that are missing it:
4.1. Execute "netstat -rn" on any node that has HAIP route info and locate the following:

169.254.0.0 0.0.0.0 255.255.0.0 U  0 0  0 bond2

Note: the first field is HAIP subnet ID and will start with 169.254.xxx.xxx, the third field is HAIP subnet netmask and the last field is private network adapter name

4.2. Execute the following as root on the node that's missing HAIP route:

# route add -net <HAIP subnet ID> netmask <HAIP subnet netmask> dev <private network adapter>
e.g.
# route add -net 169.254.0.0 netmask 255.255.0.0 dev bond2


4.3. Start ora.crsd as root on the node that is partially up:

# $GRID_HOME/bin/crsctl start res ora.crsd -init


The other workaround is to restart GI on the node that's missing HAIP route with "crsctl stop crs -f" and "crsctl start crs" command as root.
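
Steps 4.1 and 4.2 can be combined by deriving the `route add` command from the routing table of a healthy node. The sketch below only prints the command so that it can be reviewed before it is run as root on the affected node; `haip_route_cmd` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `netstat -rn` output from a node that has
# the HAIP route and prints the matching `route add` command, built from
# the destination (field 1), the netmask (field 3) and the adapter name
# (last field) of the first 169.254.x.x route.
haip_route_cmd() {
    awk '$1 ~ /^169\.254\./ { print "route add -net " $1 " netmask " $3 " dev " $NF; exit }'
}

# Usage (on a node that has the HAIP route):
#   netstat -rn | haip_route_cmd
```

The printed command assumes that the private network adapter has the same name on both nodes.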

Source: "ITPUB Blog", link: http://blog.itpub.net/26276376/viewspace-766226/. If you repost this article, please credit the source; otherwise legal action will be pursued.

