Oracle 11gR2 RAC: diagnosing and resolving failures related to the HAIP feature (169.254.x.x addresses)

ASM on Non First Node (Second or Other Node) Fails to Come up With: PMON (ospid: nnnn): terminating the instance due to error 481 [ID 1383737.1]

Last modified: 2012-11-9

Type: REFERENCE

Status: PUBLISHED

Priority: 2


Applies to:

Oracle Server - Enterprise Edition - Version 11.2.0.1 and later

Information in this document applies to any platform.

Purpose

This note lists common causes of ASM startup failure with the following error on a non-first node (the second or a subsequent node):

alert_.log from non-first node

lmon registered with NM - instance number 2 (internal mem no 1)

Tue Dec 06 06:16:15 2011

System state dump requested by (instance=2, sid=19095 (PMON)), summary=[abnormal instance termination].

System State dumped to trace file /g01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_diag_19138.trc

Tue Dec 06 06:16:15 2011

PMON (ospid: 19095): terminating the instance due to error 481

Dumping diagnostic data in directory=[cdmp_20111206061615], requested by (instance=2, sid=19095 (PMON)), summary=[abnormal instance termination].

Tue Dec 06 06:16:15 2011

ORA-1092 : opitsk aborting process

Note: ASM instance terminates shortly after "lmon registered with NM"

If ASM on non-first node was running previously, likely the following will be in alert.log when it failed originally:

..

IPC Send timeout detected. Sender: ospid 32231 [oracle@ftdcslsedw01b (PING)]

..

ORA-29740: evicted by instance number 1, group incarnation 10

..

diag trace from non-first ASM (+ASMn_diag_.trc)

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE])

alert_.log from first node

LMON (ospid: 15986) detects hung instances during IMR reconfiguration

LMON (ospid: 15986) tries to kill the instance 2 in 37 seconds.

Please check instance 2's alert log and LMON trace file for more details.

..

Remote instance kill is issued with system inc 64

Remote instance kill map (size 1) : 2

LMON received an instance eviction notification from instance 1

The instance eviction reason is 0x20000000

The instance eviction map is 2

Reconfiguration started (old inc 64, new inc 66)

If the issue happens while running the root script (root.sh or rootupgrade.sh) as part of the Grid Infrastructure installation/upgrade process, the following symptoms will be present:

root script screen output

Start of resource "ora.asm" failed

CRS-2672: Attempting to start 'ora.asm' on 'racnode1'

CRS-5017: The resource action "ora.asm start" encountered the following error:

ORA-03113: end-of-file on communication channel

Process ID: 0

Session ID: 0 Serial number: 0

. For details refer to "(:CLSN00107:)" in "/ocw/grid/log/racnode1/agent/ohasd/oraagent_grid/oraagent_grid.log".

CRS-2674: Start of 'ora.asm' on 'racnode1' failed

..

Failed to start ASM at /ispiris-qa/app/11.2.0.3/crs/install/crsconfig_lib.pm line 1272

$GRID_HOME/cfgtoollogs/crsconfig/rootcrs_.log

2011-11-29 15:56:48: Executing cmd: /ispiris-qa/app/11.2.0.3/bin/crsctl start resource ora.asm -init

..

>  CRS-2672: Attempting to start 'ora.asm' on 'racnode1'

>  CRS-5017: The resource action "ora.asm start" encountered the following error:

>  ORA-03113: end-of-file on communication channel

>  Process ID: 0

>  Session ID: 0 Serial number: 0

>  . For details refer to "(:CLSN00107:)" in "/ispiris-qa/app/11.2.0.3/log/racnode1/agent/ohasd/oraagent_grid/oraagent_grid.log".

>  CRS-2674: Start of 'ora.asm' on 'racnode1' failed

>  CRS-2679: Attempting to clean 'ora.asm' on 'racnode1'

>  CRS-2681: Clean of 'ora.asm' on 'racnode1' succeeded

..

>  CRS-4000: Command Start failed, or completed with errors.

>End Command output

2011-11-29 15:59:00: Executing cmd: /ispiris-qa/app/11.2.0.3/bin/crsctl check resource ora.asm -init

2011-11-29 15:59:00: Executing cmd: /ispiris-qa/app/11.2.0.3/bin/crsctl status resource ora.asm -init

2011-11-29 15:59:01: Checking the status of ora.asm

..

2011-11-29 15:59:53: Start of resource "ora.asm" failed

Details

Case 1: a link-local IP (169.254.x.x) is being used by another adapter/network

Symptoms:

$GRID_HOME/log//alert.log

[/ocw/grid/bin/orarootagent.bin(4813)]CRS-5018:(:CLSN00037:) Removed unused HAIP route:  169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0

OS messages (optional)

Dec  6 06:11:14 racnode1 dhclient: DHCPREQUEST on usb0 to 255.255.255.255 port 67

Dec  6 06:11:14 racnode1 dhclient: DHCPACK from 169.254.95.118

ifconfig -a

..

usb0      Link encap:Ethernet  HWaddr E6:1F:13:AD:EE:D3

inet addr:169.254.95.120  Bcast:169.254.95.255  Mask:255.255.255.0

..

Note: it's usb0 in this case, but it can be any other adapter that uses a link-local address

Solution:

A link-local IP must not be used by any other network on the cluster nodes. In this case, a USB network device obtained the address 169.254.95.118 from a DHCP server, which disrupted HAIP routing; the solution is to blacklist the device in udev so it is not activated automatically.
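The exact udev blacklist rule is site-specific. As an alternative sketch on RHEL/OL-style systems (the device name usb0 and the file path are illustrative, not from the note), the adapter can simply be kept from activating automatically:

```
# /etc/sysconfig/network-scripts/ifcfg-usb0  (illustrative path/device)
DEVICE=usb0
ONBOOT=no          # do not bring the interface up at boot
NM_CONTROLLED=no   # keep NetworkManager/dhclient away from it
```

Either way, the goal is the same: no adapter other than the private interconnect may hold a 169.254.x.x address.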

Case 2: a firewall exists between nodes on the private network (iptables etc.)

No firewall is allowed on the private network (cluster_interconnect) between nodes, including software firewalls such as iptables, ipmon, etc.
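As a quick, hedged check (the helper name below is made up), the output of "iptables -L -n" can be scanned for configured rules; anything beyond chain headers and column headers indicates filtering is in place:

```shell
# Print "yes" if the iptables listing piped in contains any rules,
# "no" if all chains are empty.
has_fw_rules() {
  grep -qEv '^Chain|^target|^$' && echo yes || echo no
}

# An empty INPUT chain with an ACCEPT policy has no rule lines:
printf 'Chain INPUT (policy ACCEPT)\ntarget     prot opt source               destination\n' | has_fw_rules
# → no
```

On a live system, run `iptables -L -n | has_fw_rules` as root on every node; the answer must be "no" (and the chain policies ACCEPT) for the interconnect to work.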

Case 3: HAIP is up on some nodes but not on all

Symptoms:

alert_.log for some instances

Cluster communication is configured to use the following interface(s) for this instance

10.1.0.1

alert_.log for other instances

Cluster communication is configured to use the following interface(s) for this instance

169.254.201.65

Note: some instances are using HAIP while others are not, so they cannot talk to each other

Solution:

The solution is to bring up HAIP on all nodes.

To find out HAIP status, execute the following on all nodes:

$GRID_HOME/bin/crsctl stat res ora.cluster_interconnect.haip -init

If it's offline, try to bring it up as root:

$GRID_HOME/bin/crsctl start res ora.cluster_interconnect.haip -init

If HAIP fails to start, refer to note 1210883.1 for known issues.

If the "up node" is not using HAIP and no outage is allowed, the workaround is to set the init.ora/spfile parameter cluster_interconnects to the private IP of each node, so that ASM/DB can come up on the "down node". Once a maintenance window can be scheduled, the parameter must be removed to allow HAIP to work.
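As a sketch, using the private IP 10.1.0.1 shown in the Case 3 symptoms (the second IP and the ASM SID suffixes are illustrative assumptions), the workaround amounts to spfile/init.ora entries like:

```
# Temporary workaround only - remove once HAIP works again.
# Note the parameter name is cluster_interconnects (plural); IPs are examples.
+ASM1.cluster_interconnects='10.1.0.1'
+ASM2.cluster_interconnects='10.1.0.2'
```

The same can be set from SQL*Plus with ALTER SYSTEM ... SCOPE=SPFILE SID='+ASMn', followed by an instance restart.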

Case 4: HAIP is up on all nodes but some are missing the route info

Symptoms:

alert_.log for all instances

Cluster communication is configured to use the following interface(s) for this instance

169.254.xxx.xxx

"netstat -rn" on some nodes (the surviving nodes) is missing the HAIP route

netstat -rn

Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface

161.130.90.0     0.0.0.0         255.255.248.0   U         0 0          0 bond0

160.131.11.0     0.0.0.0         255.255.255.0   U         0 0          0 bond2

0.0.0.0      160.11.80.1     0.0.0.0         UG        0 0          0   bond0

The line for the HAIP route is missing, i.e.:

169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 bond2

Note: as the HAIP route info is missing on some nodes, the HAIP addresses are not pingable; usually a newly restarted node will have the HAIP route info

Solution:

The solution is to manually add the HAIP route info on the nodes where it is missing:

4.1. Execute "netstat -rn" on any node that has HAIP route info and locate the following:

169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 bond2

Note: the first field is the HAIP subnet ID (it starts with 169.254), the third field is the HAIP subnet netmask, and the last field is the private network adapter name
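The field extraction in step 4.1 can be scripted; the sketch below (the helper name is made up) reads "netstat -rn" output and prints the matching route command:

```shell
# Read "netstat -rn" output on stdin and print the "route add" command
# for the HAIP (169.254.x.x) route: $1=subnet ID, $3=netmask, $NF=device.
build_haip_route_cmd() {
  awk '$1 ~ /^169\.254\./ { printf "route add -net %s netmask %s dev %s\n", $1, $3, $NF }'
}

# With the sample routing-table line from step 4.1:
echo "169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 bond2" \
  | build_haip_route_cmd
# → route add -net 169.254.0.0 netmask 255.255.0.0 dev bond2
```

On the healthy node run `netstat -rn | build_haip_route_cmd`, then execute the printed command as root on the node that is missing the route (step 4.2).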

4.2. Execute the following as root on the node that's missing the HAIP route:

# route add -net <subnet_id> netmask <netmask> dev <device>, e.g.:

# route add -net 169.254.0.0 netmask 255.255.0.0 dev bond2

4.3. Start ora.crsd as root on the node that's partially up:

# $GRID_HOME/bin/crsctl start res ora.crsd -init

The other workaround is to restart GI on the node that's missing the HAIP route, with the "crsctl stop crs -f" and "crsctl start crs" commands as root.


References

NOTE:1210883.1 - 11gR2 Grid Infrastructure Redundant Interconnect and ora.cluster_interconnect.haip

NOTE:1386709.1 - The Basics of IPv4 Subnet and Oracle Clusterware

Summary:

1. Make sure the 169.254.x.x addresses are bound to the private network adapters.

2. Make sure the addresses start with 169.254.

3. Make sure there is no firewall on the private network between any of the nodes.

4. Make sure the ora.cluster_interconnect.haip resource starts successfully on all nodes.

5. After the ora.cluster_interconnect.haip resource has started on all nodes, make sure the 169.254.x.x addresses bound on each node can ping each other across nodes.
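Checks 1 and 2 above can be sketched in shell (the helper name is made up; the addresses are examples from earlier in the note); check 5 is then an ordinary ping of each remote HAIP address:

```shell
# Print "yes" when an address falls in the 169.254.0.0/16 link-local
# range that HAIP must use, "no" otherwise.
is_link_local() {
  case "$1" in
    169.254.*) echo yes ;;
    *)         echo no  ;;
  esac
}

is_link_local 169.254.201.65   # → yes
is_link_local 10.1.0.1         # → no

# Check 5 (run between nodes; the address is an example):
#   ping -c 2 169.254.201.65
```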

Note: before the ora.cluster_interconnect.haip resource starts, the cssd process checks the health of the private network to decide whether cssd can start; at that point the private-network IPs are the addresses configured at the operating-system level. After ora.cluster_interconnect.haip starts, processes such as LMON in ora.asm check the health of private-network communication to decide whether the clustered ASM (ora.asm) can start; at that point the private-network IPs are the 169.254.x.x addresses. If one or more 169.254.x.x addresses are unreachable between nodes, this is effectively a split-brain situation: ASM can only run on a subset of the nodes, the ASM instance fails to start, and neither Clusterware nor the database instances can come up.

On GI 11.2.0.2 and later, when HAIP is built on multiple NICs, the NICs should be on different subnets; if all NICs are on the same subnet, unplugging one of them may cause the node to be evicted. For details, refer to the best-practice documentation.

--end--
