Waiting for clusterware split-brain resolution

APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.1 to 12.1.0.2 [Release 11.2 to 12.1]
Information in this document applies to any platform.

PURPOSE

The purpose of this document is to explain when a "Waiting for clusterware split-brain resolution" alert log message precedes an instance crash or eviction.

TROUBLESHOOTING STEPS

Background

Before one or more instances crash, the alert.log shows "Waiting for clusterware split-brain resolution".  This is often followed by "Evicting instance n from cluster", where n is the number of the instance being evicted.  The lmon process sends a network ping to the remote instances; if the lmon processes on the remote instances do not respond, a split brain has occurred at the instance level.  Finding out why the lmon processes cannot communicate with each other is therefore key to resolving this issue.
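
To confirm that an alert.log contains this message sequence, the two strings quoted above can be searched for directly. The following is only a minimal sketch: the function name find_split_brain_msgs is hypothetical, and the path to the alert log must be supplied by the caller.

```shell
# Hypothetical helper: list split-brain and eviction messages in an alert log.
# Usage: find_split_brain_msgs /path/to/alert_<SID>.log
find_split_brain_msgs() {
    # -n prints line numbers so the surrounding entries can be inspected later
    grep -n -E 'Waiting for clusterware split-brain resolution|Evicting instance [0-9]+ from cluster' "$1"
}
```

The line numbers in the output make it easier to go back to the alert.log and read the entries written just before the eviction.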

The common causes are:
1) A network problem. This is the most frequent cause of an instance-level split brain, so checking the network settings and connectivity is important.  However, if CRS and the database use the same network, the network is unlikely to be completely down, because the clusterware (CRS) would have failed as well.
2) The server is very busy and/or free memory is low -- heavy swapping and memory scanning will prevent the lmon processes from getting scheduled.
3) The database or instance is hanging and the lmon process is stuck.
4) An Oracle bug.
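
For cause 2, swapping can be spotted in saved vmstat output. The sketch below assumes Linux-style vmstat output, where the header row names the si (swap-in) and so (swap-out) columns; the function name swap_activity is hypothetical, and other platforms use different column layouts.

```shell
# Reads Linux `vmstat <interval> <count>` output on stdin and prints the
# input lines on which memory was swapped in (si) or out (so).
swap_activity() {
    awk '
        # The header row names the columns; remember where si and so are.
        $1 == "r" && $2 == "b" {
            for (i = 1; i <= NF; i++) {
                if ($i == "si") si = i
                if ($i == "so") so = i
            }
            next
        }
        # Data rows start with a number; report those with swap activity.
        si && so && $1 ~ /^[0-9]+$/ && ($si > 0 || $so > 0) {
            printf "line %d: si=%s so=%s\n", NR, $si, $so
        }'
}
```

For example, `vmstat 5 12 | swap_activity` would watch one minute of samples. No output means no swap activity was seen in that window.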

Troubleshooting Instructions

1) Check the network and make sure there are no network errors such as UDP errors, IP packet loss, or failure errors.
2) Check the network configuration to make sure that it is set up correctly on all nodes.
   For example, the MTU size must be the same on all nodes, and the switch must support an MTU size of 9000 if jumbo frames are used.
3) Check if the server had a CPU load problem or a free memory shortage.
4) Check if the database was hanging or having a severe performance problem prior to the instance eviction.
5) Check the CHM (Cluster Health Monitor) output to see if the server had a CPU or memory load problem, a network problem, or spinning lmd or lms processes. The CHM output is available only on certain platforms and versions, so please check the CHM FAQ, Document 1328466.1.
6) Set up OSWatcher by following the instructions in Document 301137.1 if OSWatcher is not set up already.
   OSWatcher output is helpful when CHM output is not available.
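
The MTU comparison in step 2 can be scripted once the MTU of the interconnect interface has been gathered from every node. This is a sketch only: the function name mtu_consistent and the "node mtu" input format are assumptions made for illustration, not part of any Oracle tool.

```shell
# Reads "node mtu" pairs on stdin, prints each distinct MTU with the nodes
# that use it, and returns non-zero unless exactly one distinct MTU is seen.
mtu_consistent() {
    awk '
        { nodes[$2] = nodes[$2] " " $1 }
        END {
            n = 0
            for (m in nodes) { n++; printf "MTU %s:%s\n", m, nodes[m] }
            exit (n == 1 ? 0 : 1)
        }'
}
```

On Linux, the MTU of an interface can be read from /sys/class/net/<interface>/mtu on each node and fed to the function one "node mtu" pair per line. A non-zero return status means that the nodes disagree and the configuration should be corrected.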

 

Diagnostic Collection

If TFA is installed, simply run the following command:

$GI_HOME/tfa/bin/tfactl diagcollect -from "MMM/dd/yyyy hh:mm:ss" -to "MMM/dd/yyyy hh:mm:ss"

Format example: "Jul/1/2014 21:00:00"
Specify the "from time" to be 4 hours before and the "to time" to be 4 hours after the time of error.
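
The 4-hour window can be computed instead of worked out by hand. The sketch below assumes GNU date (for the -d option); the function name tfa_window is hypothetical. It prints the day zero-padded to match the "dd" pattern, so adjust the format string if your TFA version expects the day without a leading zero.

```shell
# Sketch: print -from/-to arguments covering 4 hours either side of an error.
# Assumes GNU date; $1 is the error time in any format `date -d` accepts.
tfa_window() {
    err_epoch=$(date -d "$1" +%s) || return 1
    span=$((4 * 3600))   # 4 hours before and after, as recommended above
    from=$(LC_ALL=C date -d "@$((err_epoch - span))" '+%b/%d/%Y %H:%M:%S')
    to=$(LC_ALL=C date -d "@$((err_epoch + span))" '+%b/%d/%Y %H:%M:%S')
    printf '%s "%s" %s "%s"\n' -from "$from" -to "$to"
}
```

Calling `tfa_window '2014-07-01 21:00:00'` prints the two arguments, which can then be appended to the tfactl diagcollect command.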

 

If TFA is not installed:

Database logs & trace files:

cd $(orabase)/diag/rdbms
tar cf - $(find . -name '*.trc' -exec egrep -l '<date_time_search_string>' {} \; | grep -v bucket) | gzip >  /tmp/database_trace_files.tar.gz

ASM logs & trace files:

cd $(orabase)/diag/asm/+asm/
tar cf - $(find . -name '*.trc' -exec egrep -l '<date_time_search_string>' {} \; | grep -v bucket) | gzip >  /tmp/asm_trace_files.tar.gz

Clusterware logs:

<GI home>/bin/diagcollection.sh --collect --crs --crshome <GI home>

OS logs:

/var/adm/messages* (Solaris), /var/log/messages* (Linux), 'errpt -a' output (AIX), or the Windows System Event Viewer log (saved as a .TXT file)
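
When the same collection script runs on several Unix platforms, the log source can be chosen from the `uname -s` value. This is a sketch: the function name os_log_source is hypothetical, and the pairing of each path with a platform reflects the usual defaults (Linux, Solaris, AIX) rather than anything stated in this note.

```shell
# Sketch: map a `uname -s` value to the OS log source listed above.
# The platform-to-path pairing is an assumption based on common defaults.
os_log_source() {
    case "$1" in
        Linux) echo '/var/log/messages*' ;;
        SunOS) echo '/var/adm/messages*' ;;
        AIX)   echo 'errpt -a' ;;
        *)     echo 'unknown platform: collect the system log manually' >&2
               return 1 ;;
    esac
}
```

For example, `os_log_source "$(uname -s)"` prints the path or command to use on the current host.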

 

Source: ITPUB blog, link: http://blog.itpub.net/20747382/viewspace-2131514/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
