Advanced Troubleshooting of CSS Heartbeat Failures

This is an internal Oracle article about RAC heartbeat failures, shared here for reference.


  

Applies to:

Oracle Server - Enterprise Edition - Version: 10.2 to 10.2
Information in this document applies to any platform.

Purpose

Internal Note to Explain How to Troubleshoot CSS Heartbeat Failures (Network).

Scope and Application

For Oracle Support and Development

Internal Only: Advanced Troubleshooting of CSS Heartbeat Failures (Network)

The telltale sign of a CSS network heartbeat issue is the following type of message in the CSS log:

[CSSD]2007-09-06 18:38:45.861 [18] >WARNING: clssnmPollingThread: node scadfrai11 (2) at 50% heartbeat fatal, eviction in 14.275 seconds

The first thing to do is find out whether the missed checkins ARE the problem or are a result of the node going down for other reasons. Check the messages file to see the exact time the node went down and compare it to the time of the missed checkins.

- If the reboot time in the messages file is earlier than the missed checkin time, then the node eviction was likely not due to these missed checkins.

- If the reboot time in the messages file is later than the missed checkin time, then the node eviction was likely a result of the missed checkins.
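
The comparison above can be sketched in Python. This is only an illustration, assuming both times have already been extracted and parsed into datetime values. The function name and the two reboot times are hypothetical; only the 18:38:45.861 timestamp comes from the log excerpt.

```python
from datetime import datetime

CSS_TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"  # timestamp layout used in the CSS log excerpt


def eviction_likely_due_to_missed_checkins(reboot_time, missed_checkin_time):
    """Apply the rule from the text: a reboot that happens after the
    missed checkin points at the missed checkins as the likely cause."""
    return reboot_time > missed_checkin_time


# Time of the missed checkin warning, taken from the CSS log excerpt.
missed = datetime.strptime("2007-09-06 18:38:45.861", CSS_TS_FORMAT)

# Hypothetical reboot times, as they might be read from the messages file.
reboot_after = datetime.strptime("2007-09-06 18:39:02.000", CSS_TS_FORMAT)
reboot_before = datetime.strptime("2007-09-06 18:38:40.000", CSS_TS_FORMAT)

caused_by_checkins = eviction_likely_due_to_missed_checkins(reboot_after, missed)
caused_by_other = eviction_likely_due_to_missed_checkins(reboot_before, missed)
print(caused_by_checkins, caused_by_other)
```

Note that messages file timestamps usually omit the year and sub-second precision, so parsing them needs a separate format string and some care around clock differences between nodes.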

If you determine that CSS actually did reboot the node due to missed checkins, then we need to find out why. There are 3 threads in CSS that deal with CSS heartbeats:

clssnmSendingThread - This thread periodically wakes up and sends the appropriate packets (based on the join state) to other nodes so that other members know we are still alive. For the heartbeat, it sends "status" packets every second to the other nodes to tell them that this node is alive. This can be seen at CSS trace level 2 or higher. Example:

[CSSD]2007-09-06 18:37:07.857 [19] >TRACE: clssnmSendingThread: sending status msg to all nodes

Notice that the thread # for the clssnmSendingThread is 19 in this example. This is important to know so that you don't end up troubleshooting the wrong thread.

With CSS trace level 4 + CLSC tracing you can see more important details, such as:

[CSSD]2007-09-06 18:38:31.584 [19] >TRACE: clssnmsendmsg: sending msg type 3 to node 1
[ CSSD]2007-09-06 18:38:31.584 [19] >TRACE: clscsendx: (b8aaf0) sending msg (f395c8) type 3, size 68, srqh (eed3d0)
[ CSSD]2007-09-06 18:38:31.587 [19] >TRACE: clscsendx: (b8aaf0) sent msg type 3, size 68, rc 0 srqh (eed3d0)

If you also turn on CLSC NS tracing you can see the Oracle Net tracing for this. You can see something like:

(19) [000013 06-SEP-2007 18:38:31:586] nsdo: cid=1, opcode=67, *bl=68, *what=1, uflgs=0x2, cflgs=0x3
(19) [000013 06-SEP-2007 18:38:31:586] snsbitts_ts: acquired the bit
(19) [000013 06-SEP-2007 18:38:31:586] nsdo: rank=64, nsctxrnk=0
(19) [000013 06-SEP-2007 18:38:31:586] nsdo: nsctx: state=8, flg=0x400d, mvd=0
(19) [000013 06-SEP-2007 18:38:31:586] nsdo: 68 bytes to transport
(19) [000013 06-SEP-2007 18:38:31:586] nttmwr: entry
(19) [000013 06-SEP-2007 18:38:31:587] nttmwr: socket 26 had bytes written=68
(19) [000013 06-SEP-2007 18:38:31:587] nttmwr: exit

And of course OS Watcher is good for looking at things underneath CSS.

The code path for sending a heartbeat message is clssnmSendingThread-->clssnmsendmsg-->clscsendx

clssnmClusterListener thread - This thread listens for incoming packets from other nodes by calling clscselect and dispatches them for appropriate handling. When the clssnmClusterListener receives a packet, it calls clssnmProcessPkt to process the packet. If it is a status packet, then clssnmProcessPkt calls clssnmHandleStatus. In clssnmHandleStatus we put the packet info in memory so that the polling thread can see it and decide what to do with it.

At CSS trace level 3 or higher you can see the clssnmClusterListener thread waking up and going through its loop. You can see messages such as:

[ CSSD]2007-09-06 18:38:31.299 [13] >TRACE: clssnmClusterListener: Entering select blocking

Notice that the thread # for the clssnmClusterListener thread is 13 in this example. This is important to know so that you don't end up troubleshooting the wrong thread.
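
Since thread numbers differ from one CSS process to the next, it can help to build the thread-name-to-number mapping directly from the log. The following is a minimal sketch, assuming log lines shaped like the excerpts in this note; the regular expression and the helper name are illustrative, not part of any Oracle tool.

```python
import re

# Matches "[CSSD]<timestamp> [<thread #>] >LEVEL: <heartbeat thread name>".
THREAD_RE = re.compile(
    r"\[\s*CSSD\].*?\[(\d+)\]\s*>\w+:\s*"
    r"(clssnm(?:SendingThread|ClusterListener|PollingThread))\b"
)


def heartbeat_thread_ids(lines):
    """Return {thread name: sorted list of thread numbers seen in the log}."""
    seen = {}
    for line in lines:
        match = THREAD_RE.search(line)
        if match:
            seen.setdefault(match.group(2), set()).add(int(match.group(1)))
    return {name: sorted(ids) for name, ids in seen.items()}


# Sample lines copied from the excerpts in this note.
SAMPLE = [
    "[CSSD]2007-09-06 18:37:07.857 [19] >TRACE: clssnmSendingThread: sending status msg to all nodes",
    "[ CSSD]2007-09-06 18:38:31.299 [13] >TRACE: clssnmClusterListener: Entering select blocking",
    "[ CSSD]2007-09-06 18:38:31.584 [19] >TRACE: clscsendx: (b8aaf0) sending msg (f395c8) type 3, size 68, srqh (eed3d0)",
    "[CSSD]2007-09-06 18:38:45.861 [18] >WARNING: clssnmPollingThread: node scadfrai11 (2) at 50% heartbeat fatal, eviction in 14.275 seconds",
]

thread_ids = heartbeat_thread_ids(SAMPLE)
print(thread_ids)
```

Lines logged by lower-level functions such as clscsendx are skipped on purpose: they carry a thread number but not the thread name, so they can only be attributed once the mapping is known.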

With trace level 4 + CLSC tracing you can see more important details, such as:

[CSSD]2007-09-06 18:38:30.070 [13] >TRACE: clsc_receive: (f16470) prepare to receive
[ CSSD]2007-09-06 18:38:30.074 [13] >TRACE: clsc_receive: (f16470) compl recv, rc 0, nsret 0, flgs 0x0000, size 68
[ CSSD]2007-09-06 18:38:30.074 [13] >TRACE: clsc_receive: (f16470, 1) recv msg type 3, size 68, msgsz 68
[ CSSD]2007-09-06 18:38:30.074 [13] >TRACE: clssnmHandleStatus: src[2] dest[0] dom[0] seq[0] sync[53]
[ CSSD]2007-09-06 18:38:31.299 [13] >TRACE: clssnmClusterListener: Entering select blocking

If you also turn on CLSC NS tracing you can see the Oracle Net tracing for this. You can see something like:

(13) [000014 06-SEP-2007 18:38:30:071] nsdo: cid=2, opcode=68, *bl=131136, *what=0, uflgs=0x0, cflgs=0x3
(13) [000014 06-SEP-2007 18:38:30:071] snsbitts_ts: acquired the bit
(13) [000014 06-SEP-2007 18:38:30:071] nsdo: rank=64, nsctxrnk=0
(13) [000014 06-SEP-2007 18:38:30:071] nsdo: nsctx: state=8, flg=0x400c, mvd=0
(13) [000014 06-SEP-2007 18:38:30:072] nsdo: reading from transport...
(13) [000014 06-SEP-2007 18:38:30:072] nttmrd: entry
(13) [000014 06-SEP-2007 18:38:30:073] nttmrd: socket 95 had bytes read=68
(13) [000014 06-SEP-2007 18:38:30:073] nttmrd: exit

And of course OS Watcher is good for looking at things underneath CSS.

The code path for receiving a heartbeat message is clssnmClusterListener-->clscselect
The code path to process a received heartbeat message is clssnmClusterListener-->clssnmProcessPkt-->clssnmHandleStatus

clssnmPollingThread - This thread periodically wakes up and scans to see who is active and has been checking in regularly by reading from the node db in memory. If the time since the last packet recorded in memory reaches 1/2 of misscount, you will see this famous message:

[CSSD]2007-09-06 18:38:45.861 [18] >WARNING: clssnmPollingThread: node scadfrai11 (2) at 50% heartbeat fatal, eviction in 14.275 seconds

There is not much more useful trace info you can turn on for this thread. It just reads the last packet time in memory and takes action if we reach misscount (sends out poison packets, etc.).
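
The warning itself already carries the key numbers: which node is missing checkins, how far along it is, and how long until eviction. As a rough sketch, the fields can be pulled out and the projected eviction time computed as follows. The regular expression and function name are assumptions made for illustration, based only on the message format shown above.

```python
import re
from datetime import datetime, timedelta

# Field layout inferred from the warning line shown in this note.
WARNING_RE = re.compile(
    r"\[\s*CSSD\]\s*(\d{4}-\d{2}-\d{2})\s*(\d{2}:\d{2}:\d{2}\.\d{3})\s*\[(\d+)\]\s*"
    r">WARNING:\s*clssnmPollingThread:\s*node\s*(\S+)\s*\((\d+)\)\s*at\s*(\d+)%\s*"
    r"heartbeat fatal,\s*eviction in\s*([\d.]+)\s*seconds"
)


def parse_heartbeat_warning(line):
    """Return the fields of a missed-checkin warning, or None if no match."""
    match = WARNING_RE.search(line)
    if match is None:
        return None
    date, time_of_day, thread, node, node_num, percent, remaining = match.groups()
    logged_at = datetime.strptime(f"{date} {time_of_day}", "%Y-%m-%d %H:%M:%S.%f")
    return {
        "logged_at": logged_at,
        "thread": int(thread),
        "node": node,
        "node_number": int(node_num),
        "percent": int(percent),
        "seconds_to_eviction": float(remaining),
        # Time at which eviction happens if no heartbeat arrives before then.
        "projected_eviction": logged_at + timedelta(seconds=float(remaining)),
    }


line = ("[CSSD]2007-09-06 18:38:45.861 [18] >WARNING: clssnmPollingThread: "
        "node scadfrai11 (2) at 50% heartbeat fatal, eviction in 14.275 seconds")
warning = parse_heartbeat_warning(line)
print(warning["node"], warning["projected_eviction"])
```

The projected eviction time is a convenient value to compare against the reboot time recorded in the messages file.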

Things to Look For:

- Is the clssnmSendingThread sending a status message to every other node every second?

If the answer is no, see if the clssnmSendingThread is stuck (stack traces of all threads of CSS) or if there is a resource problem or network error in OS Watcher (netstat, vmstat, etc.).

- Is the clssnmClusterListener receiving a status message from every other node every second?

Again, if the answer is no, see if the clssnmClusterListener is stuck (stack traces of all threads of CSS) or if there is a resource problem or network error in OS Watcher (netstat, vmstat, etc.).

As long as you can confirm that the sending thread from node x is sending every second and the listening thread from node y is receiving every second, you should not see any missed checkins.
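
Checking the one-second cadence by eye is tedious in a large trace file. The sketch below shows one possible way to flag gaps, assuming trace lines shaped like the excerpts in this note. The helper name and the 1.5 second tolerance are arbitrary choices, and the second and third sample lines are hypothetical, modeled on the first, which comes from the excerpt.

```python
import re
from datetime import datetime

TS_RE = re.compile(r"\[\s*CSSD\]\s*(\d{4}-\d{2}-\d{2})\s*(\d{2}:\d{2}:\d{2}\.\d{3})")


def heartbeat_gaps(lines, marker, max_gap_seconds=1.5):
    """Return (previous, current, gap_seconds) for every pair of consecutive
    lines containing `marker` that are more than max_gap_seconds apart."""
    gaps = []
    previous = None
    for line in lines:
        if marker not in line:
            continue
        match = TS_RE.search(line)
        if match is None:
            continue
        current = datetime.strptime(
            f"{match.group(1)} {match.group(2)}", "%Y-%m-%d %H:%M:%S.%f"
        )
        if previous is not None:
            gap = (current - previous).total_seconds()
            if gap > max_gap_seconds:
                gaps.append((previous, current, gap))
        previous = current
    return gaps


SEND_MARKER = "clssnmSendingThread: sending status msg to all nodes"
sample = [
    "[CSSD]2007-09-06 18:37:07.857 [19] >TRACE: " + SEND_MARKER,
    "[CSSD]2007-09-06 18:37:08.858 [19] >TRACE: " + SEND_MARKER,  # hypothetical
    "[CSSD]2007-09-06 18:37:12.860 [19] >TRACE: " + SEND_MARKER,  # hypothetical gap
]

gaps = heartbeat_gaps(sample, SEND_MARKER)
for start, end, seconds in gaps:
    print(f"no send between {start} and {end} ({seconds:.3f}s)")
```

The same helper can be pointed at the receiving side by passing a marker such as "clssnmHandleStatus: src[2]" to check the status packets arriving from node 2.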

For Extremely Detailed Tracing of CSS Heartbeats (Be Careful Suggesting This to Customers):

10.2.0.2 (Bundle Patch 2) and above:

<CRS_HOME>/bin/crsctl debug log css CSSD:4 (default is 1)
(No subsequent reboot needed if you do it this way)

Prior versions:

<CRS_HOME>/bin/crsctl set css trace 4 (default is 1)

And set the following in <CRS_HOME>/bin/ocssd:

CLSC_TRACE_LVL=5
export CLSC_TRACE_LVL
CLSC_NSTRACE_LVL=12
export CLSC_NSTRACE_LVL

Then restart CRS.

This would provide A LOT of detail about the underlying Oracle network layers under CSS but would obviously generate trace info very fast in CRS_HOME/log/<hostname>/cssd, so the data would have to be captured quickly before the CSS log wraps. Either that or have a job save off and compress all the logs every 10-15 min.
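
A save-and-compress job of the kind mentioned above could look roughly like the following. This is a sketch under stated assumptions: the function name, the "ocssd*" file pattern, and the archive location are illustrative choices, and scheduling every 10-15 minutes is left to cron or a similar tool. The demo at the end runs against a throwaway directory rather than a real CRS home.

```python
import gzip
import shutil
import tempfile
import time
from pathlib import Path


def archive_css_logs(log_dir, archive_dir, pattern="ocssd*"):
    """Copy every file matching `pattern` in log_dir into archive_dir as a
    gzip file with a timestamp suffix. The source files are left untouched."""
    log_dir, archive_dir = Path(log_dir), Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    saved = []
    for src in sorted(log_dir.glob(pattern)):
        if not src.is_file():
            continue
        dst = archive_dir / f"{src.name}.{stamp}.gz"
        with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        saved.append(dst)
    return saved


# Demo with a temporary directory standing in for the cssd log directory.
with tempfile.TemporaryDirectory() as tmp:
    cssd_dir = Path(tmp) / "cssd"
    cssd_dir.mkdir()
    (cssd_dir / "ocssd.log").write_bytes(b"sample trace line\n")
    saved = archive_css_logs(cssd_dir, Path(tmp) / "archive")
    saved_names = [p.name for p in saved]
    with gzip.open(saved[0], "rb") as fh:
        restored = fh.read()

print(saved_names)
```

Copying rather than moving keeps CSS writing to its own files undisturbed; the archive directory should sit on a filesystem with enough free space for the trace volume generated at these levels.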