Top 5 issues for Instance Eviction (Doc ID 1374110.1)

In this Document

 Purpose
 Scope
 Details
 Issue #1  The alert.log shows ORA-29740 as a reason for instance crash/eviction
 Symptoms:
 Possible causes:
 Solutions:
 Issue #2  The alert.log shows "ipc send timeout" error before the instance crashes or is evicted
 Symptoms:
 Possible causes:
 Solutions:
 Issue #3  The problem instance was hanging before it crashed or was evicted
 Symptoms:
 Possible causes:
 Solutions:
 Issue #4  The alert.log shows "Waiting for clusterware split-brain resolution" before one or more instances crash or are evicted
 Symptoms:
 Possible causes:
 Solutions:
 Issue #5  The problem instance is killed by CRS because another instance tried to evict it and could not
 Symptoms:
 Possible causes:
 Solutions:
 References

APPLIES TO:

Oracle Database - Enterprise Edition - Version 10.2.0.1 to 11.2.0.3 [Release 10.2 to 11.2]
Information in this document applies to any platform.

PURPOSE

This note gives DBAs a quick summary of the top issues that cause instance evictions.

SCOPE

DBAs

DETAILS

Issue #1  The alert.log shows ORA-29740 as a reason for instance crash/eviction

Symptoms:

An instance crashes and the alert.log shows "ORA-29740: evicted by member ..." error.

Possible causes:
An ORA-29740 error occurs when an instance evicts another instance in a RAC database.  The instance that gets evicted reports an ORA-29740 error in its alert.log.
Possible reasons include a communications error in the cluster and a failure to issue a heartbeat to the control file, among others.

Checking the lmon trace files of all instances is very important to determine the reason code.  Look for the line with "kjxgrrcfgchk: Initiating reconfig".
This will give a reason code such as "kjxgrrcfgchk: Initiating reconfig, reason 3".  Most ORA-29740 evictions are due to reason 3, which means "Communications Failure".

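As an illustration only, the reason code can be pulled from the lmon trace files with a command such as the following (the path assumes the 11g ADR default trace location, and <dbname> and <SID> are placeholders to be replaced for your environment):

# search each instance's lmon trace file for the reconfiguration reason
grep "kjxgrrcfgchk: Initiating reconfig" $ORACLE_BASE/diag/rdbms/<dbname>/<SID>/trace/<SID>_lmon_*.trc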

Document 219361.1 (Troubleshooting ORA-29740 in a RAC Environment) lists the following as the likely causes of the ORA-29740 error with reason 3:

a) Network Problems.
b) Resource Starvation (CPU, I/O, etc.)
c) Severe Contention in Database.
d) An Oracle bug.
Solutions:
1) Check the network and make sure there are no network errors such as UDP errors or IP packet loss/failures (see the example commands after this list).
2) Check the network configuration to make sure it is set up correctly on all nodes.
   For example, the MTU size must be the same on all nodes, and the switch must support an MTU size of 9000 if jumbo frames are used.
3) Check whether the server had a CPU load problem or a shortage of free memory.
4) Check whether the database was hanging or having a severe performance problem prior to the instance eviction.
5) Check the CHM (Cluster Health Monitor) output to see whether the server had a CPU or memory load problem, a network problem, or spinning lmd or lms processes. The CHM output is available only on certain platforms and versions, so please check the CHM FAQ, Document 1328466.1.
6) Set up OSWatcher by following the instructions in Document 301137.1 if it is not set up already.
   Having OSWatcher output is helpful when CHM output is not available.

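For example, on Linux the following operating system commands can be used to look for UDP errors and to verify the interconnect MTU. This is only a sketch: the exact tools vary by platform, and the interface name eth1 is just a placeholder for your private interconnect interface.

# cumulative UDP statistics; watch for growing "packet receive errors"
netstat -s | grep -A 4 -i udp

# MTU currently configured on the private interconnect interface
ip link show eth1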

 

Issue #2  The alert.log shows "ipc send timeout" error before the instance crashes or is evicted

Symptoms:

An instance is evicted and the alert.log shows many "IPC send timeout" errors.  This message normally accompanies a database performance problem.

Possible causes:
In RAC, processes such as lmon, lmd, and lms constantly talk to the corresponding processes in the other instances.  The lmd0 process is responsible for managing enqueues, while the lms processes are responsible for managing data block resources and transferring data blocks to support cache fusion.  When one or more of these processes is stuck, spinning, or extremely busy with the load, it can cause the "IPC send timeout" error.

Another cause of the "IPC send timeout" errors reported by the lmon, lms, and lmd processes is a network problem or a server resource (CPU and memory) issue.  Those processes may not get scheduled to run on the CPU, or the network packets they send can get lost.

A communication problem involving the lmon, lmd, and lms processes causes an instance eviction.  The alert.log of the evicting instance shows messages similar to:

IPC Send timeout detected. Sender: ospid 1519
Receiver: inst 8 binc 997466802 ospid 23309

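To see which processes the reported ospids belong to, they can be looked up at the operating system level while the processes still exist, for example (using the ospid values from the message above):

# on the node that reported the timeout, identify the sending process
ps -fp 1519

# on the node running instance 8, identify the receiving process
ps -fp 23309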

If an instance is evicted, the "IPC Send timeout detected" message in the alert.log is normally followed by other issues such as ORA-29740 and "Waiting for clusterware split-brain resolution".
Solutions:
The solutions here are similar to those for issue #1.

1) Check the network and make sure there are no network errors such as UDP errors or IP packet loss/failures.
2) Check the network configuration to make sure it is set up correctly on all nodes.
   For example, the MTU size must be the same on all nodes, and the switch must support an MTU size of 9000 if jumbo frames are used.
3) Check whether the server had a CPU load problem or a shortage of free memory.
4) Check whether the database was hanging or having a severe performance problem prior to the instance eviction.
5) Check the CHM (Cluster Health Monitor) output to see whether the server had a CPU or memory load problem, a network problem, or spinning lmd or lms processes (an example command is shown after this list). The CHM output is available only on certain platforms and versions, so please check the CHM FAQ, Document 1328466.1.
6) Set up OSWatcher by following the instructions in Document 301137.1 if it is not set up already.
   Having OSWatcher output is helpful when CHM output is not available.

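Where CHM is available, its data can be reviewed around the eviction time. As an illustration only (see Document 1328466.1 for the exact syntax for your version), node-level CPU, memory, and network statistics for the last 15 minutes can be dumped with:

# dump the CHM node view for all nodes for the last 15 minutes
oclumon dumpnodeview -allnodes -last "00:15:00"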

 

Issue #3  The problem instance was hanging before it crashed or was evicted

Symptoms:

The instance or database was hanging before the instance crashed or was evicted.  It could also be that the node was hanging.

Possible causes:
Processes such as lmon, lmd, and lms communicate with the corresponding processes on the other instances, so when the instance or database hangs, those processes may be waiting for a resource such as a latch, an enqueue, or a data block.  While they are waiting, they cannot respond to network pings or send any communication over the network to the remote instances.  As a result, the other instances evict the problem instance.

You may see a message similar to the following in the alert.log of the instance that is evicting another instance:
Remote instance kill is issued [112:1]: 8
or
Evicting instance 2 from cluster
Solutions:
1) Find out the reason for the database or instance hang. Getting a global system state dump and global hang analyze output is critical when troubleshooting a database or instance hang (see the sketch after this list). If a global system state dump cannot be obtained, get local system state dumps from all instances at around the same time.
2) Check the CHM (Cluster Health Monitor) output to see whether the server had a CPU or memory load problem, a network problem, or spinning lmd or lms processes. The CHM output is available only on some platforms and versions, so please check the CHM FAQ, Document 1328466.1.
3) Set up OSWatcher by following the instructions in Document 301137.1 if it is not set up already.
   Having OSWatcher output is helpful when CHM output is not available.

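A commonly used way to collect the global hang analyze output and global system state dumps from a single instance is sketched below; connect as SYSDBA, and treat the dump levels shown as typical values to be confirmed with Oracle Support for your situation.

sqlplus / as sysdba
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug -g all hanganalyze 3
SQL> oradebug -g all dump systemstate 258

(Repeating the hanganalyze after a minute or two helps show whether the hang is static.)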

 

Issue #4  The alert.log shows "Waiting for clusterware split-brain resolution" before one or more instances crash or are evicted

Symptoms:

Before one or more instances crash, the alert.log shows "Waiting for clusterware split-brain resolution".  This is often followed by "Evicting instance n from cluster", where n is the instance number that is getting evicted.

Possible causes:
The lmon process sends a network ping to the remote instances, and if the lmon processes on the remote instances do not respond, a split brain at the instance level has occurred.  Therefore, finding out why the lmon processes cannot communicate with each other is important in resolving this issue.

The common causes are:
1) The instance-level split brain is frequently caused by a network problem, so checking the network settings and connectivity is important.  However, since the clusterware (CRS) would have failed if the network were down, the network is likely not down as long as both CRS and the database use the same network.
2) The server is very busy and/or the amount of free memory is low -- heavy swapping and scanning of memory will prevent the lmon processes from getting scheduled.
3) The database or instance is hanging and the lmon process is stuck.
4) An Oracle bug.

The above causes are similar to those for issue #1 (The alert.log shows ORA-29740 as a reason for instance crash/eviction).
Solutions:
The solutions here are similar to those for issue #1.

1) Check the network and make sure there are no network errors such as UDP errors or IP packet loss/failures.
2) Check the network configuration to make sure it is set up correctly on all nodes.
   For example, the MTU size must be the same on all nodes, and the switch must support an MTU size of 9000 if jumbo frames are used.
3) Check whether the server had a CPU load problem or a shortage of free memory.
4) Check whether the database was hanging or having a severe performance problem prior to the instance eviction.
5) Check the CHM (Cluster Health Monitor) output to see whether the server had a CPU or memory load problem, a network problem, or spinning lmd or lms processes. The CHM output is available only on certain platforms and versions, so please check the CHM FAQ, Document 1328466.1.
6) Set up OSWatcher by following the instructions in Document 301137.1 if it is not set up already (an example of starting it is shown after this list).
   Having OSWatcher output is helpful when CHM output is not available.

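As an illustration only (the exact installation and startup steps are in Document 301137.1), OSWatcher is typically started on every node with a short snapshot interval and a retention window, for example:

# start OSWatcher: 30-second snapshots, keep 48 hours of data
# (the install directory shown is only a placeholder)
cd /opt/oswbb
./startOSWbb.sh 30 48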

 

Issue #5  The problem instance is killed by CRS because another instance tried to evict it and could not

Symptoms:

When an instance evicts another instance, all instances wait until the problem instance shuts itself down, but if the problem instance does not terminate for any reason,
the instance that initiated the eviction issues a member kill request.  The member kill request asks CRS to kill the problem instance.  This feature is available in 11.1 and higher.

Possible causes:
The alert.log of the instance that is asking CRS to kill the problem instance shows:
Remote instance kill is issued [112:1]: 8

For example, the above message means that a member kill request to kill instance 8 has been sent to CRS.

One common cause is that the problem instance is hanging and is not responsive.  This could be because the node has a CPU or memory problem and the processes of the problem instance are not getting scheduled to run on the CPU.

A second common cause is that severe contention in the database is preventing the problem instance from realizing that the remote instances have evicted it.

Another cause could be one or more processes surviving the "shutdown abort" when the instance tries to abort itself.  Unless all processes of the instance are killed, CRS does not consider the instance terminated and will not inform the other instances that the problem instance has aborted.  One common reason for this is that one or more processes become defunct and do not terminate.
This leads to a recycle of CRS, either through a node reboot or a rebootless restart of CRS (the node does not get rebooted, but CRS gets restarted).
In this case, the alert.log of the problem instance shows:
Instance termination failed to kill one or more processes
Instance terminated by LMON, pid = 23305
Solutions:
The solutions for this are similar to those for issue #3.

1) Find out the reason for the database or instance hang. Getting a global system state dump and global hang analyze output is critical when troubleshooting a database or instance hang. If a global system state dump cannot be obtained, get local system state dumps from all instances at around the same time.
2) Check the CHM (Cluster Health Monitor) output to see whether the server had a CPU or memory load problem, a network problem, or spinning lmd or lms processes. The CHM output is available only on some platforms and versions, so please check the CHM FAQ, Document 1328466.1.
3) Set up OSWatcher by following the instructions in Document 301137.1 if it is not set up already.
   Having OSWatcher output is helpful when CHM output is not available.

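In addition, when the problem instance failed to terminate cleanly (see the causes above), it is worth checking whether any of its processes survived the abort or became defunct. A minimal sketch, where RAC1 is only a placeholder for the problem instance name:

# list any remaining background or server processes of the problem instance
ps -ef | grep -i rac1 | grep -v grep

# look for defunct (zombie) processes on the node
ps -ef | grep defunct | grep -v grep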

 

Database - RAC/Scalability Community
To discuss this topic further with Oracle experts and industry peers, we encourage you to review, join, or start a discussion in the My Oracle Support Database - RAC/Scalability Community.
