Troubleshooting: Cluster Fails to Start - Diagnosing ocssd.bin Startup Failures

jianbo ye

Last modified 2023-05-24 17:32:28


Tags: database, oracle

First published 2023-05-24 17:18:51

Article link: https://blog.csdn.net/yejb7456/article/details/130851482


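To bootstrap, ocssd needs to read the GPnP profile, which it uses to:

1. Locate the OCR and the voting disks and read the voting disk headers. As soon as enough voting disks are found, the rest of the stack is started.

2. Read the network information, including the interfaces and subnets of the public and private networks, and make sure the network is configured correctly and reachable.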

Case 1: GPnP profile is accessible - gpnpd needs to be fully up to serve the profile

If the GPnP profile cannot be read, messages like the following appear in ocssd.log:

2010-02-03 22:26:17.057: [ GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1100] GIPC gipcretConnectionRefused (29) gipcConnect(ipc-ipc://GPNPD_rac1)

2010-02-03 22:26:17.057: [ GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1101] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url "ipc://GPNPD_rac1"

2010-02-03 22:26:17.057: [ GPnP][3852126240]clsgpnp_getProfileEx: [at clsgpnp.c:546] Result: (13) CLSGPNP_NO_DAEMON. Can't get GPnP service profile from local GPnP daemon

2010-02-03 22:26:17.057: [ default][3852126240]Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).

2010-02-03 22:26:17.057: [ CSSD][3852126240]clsgpnp_getProfile failed, rc(13)

The solution is to ensure gpnpd is up and running properly.
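
As a rough way to spot this signature in a large log, the sketch below counts the lines that mention CLSGPNP_NO_DAEMON. The function name `gpnpd_down_in_log` and its return convention are made up for illustration; only the error string comes from the log excerpt above.

```shell
#!/bin/sh
# Sketch only: gpnpd_down_in_log is a hypothetical helper, not an Oracle tool.

# gpnpd_down_in_log FILE
#   Prints how many lines in FILE mention CLSGPNP_NO_DAEMON.
#   Returns 0 if at least one such line exists, 1 otherwise
#   (including when FILE cannot be read).
gpnpd_down_in_log() {
    # grep -c exits non-zero when nothing matches; fall back to 0 in that case.
    count=$(grep -c 'CLSGPNP_NO_DAEMON' "$1" 2>/dev/null) || count=${count:-0}
    echo "$count"
    [ "$count" -gt 0 ]
}
```

It would be called with the path to ocssd.log on the affected node, for example `gpnpd_down_in_log /path/to/ocssd.log`. A non-zero count only says that ocssd could not reach gpnpd at some point; whether gpnpd is running now still has to be checked separately.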

Case 2: Voting disk is accessible

1: The voting disk cannot be read

In 11gR2, ocssd.bin discovers voting disks using the settings in the GPnP profile. If not enough voting disks can be identified, ocssd.bin aborts itself:

2010-02-03 22:37:22.212: [ CSSD][2330355744]clssnmReadDiscoveryProfile: voting file discovery string(/share/storage/di*)

2010-02-03 22:37:22.227: [ CSSD][1145538880]clssnmvDiskVerify: Successful discovery of 0 disks

2010-02-03 22:37:22.227: [ CSSD][1145538880]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery

2010-02-03 22:37:22.227: [ CSSD][1145538880]clssnmvFindInitialConfigs: No voting files found

2010-02-03 22:37:22.228: [ CSSD][1145538880]###################################

2010-02-03 22:37:22.228: [ CSSD][1145538880]clssscExit: CSSD signal 11 in thread clssnmvDDiscThread
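
To pull the two interesting values out of such a log (the discovery string and the number of disks found), something like the following sketch may help. The function name `voting_disk_summary` and the output format are assumptions for illustration; the patterns are taken from the log lines shown above.

```shell
#!/bin/sh
# Sketch only: voting_disk_summary is a hypothetical helper.

# voting_disk_summary FILE
#   Prints the last voting file discovery string and the last
#   "Successful discovery of N disks" count found in FILE.
voting_disk_summary() {
    # Text between the parentheses after "voting file discovery string".
    disc=$(sed -n 's/.*voting file discovery string(\([^)]*\)).*/\1/p' "$1" | tail -n 1)
    # The number N in "Successful discovery of N disks".
    found=$(sed -n 's/.*Successful discovery of \([0-9][0-9]*\) disks.*/\1/p' "$1" | tail -n 1)
    echo "discovery_string=${disc:-unknown} disks_found=${found:-unknown}"
}
```

When the count is 0, a natural next step is to list the discovery string as the grid user (for the excerpt above, `ls -l /share/storage/di*`) to see whether the path matches anything and whether the files are readable.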

2: Waiting for a voting file reconfiguration to complete

ocssd.bin may fail to come up, logging the following error, if all nodes failed while a voting file change was in progress:

2010-05-02 03:11:19.033: [ CSSD][1197668093]clssnmCompleteInitVFDiscovery: Detected voting file add in progress for CIN 0:1134513465:0, waiting for configuration to complete 0:1134513098:0

The solution is to start ocssd.bin in exclusive mode, following note 1364971.1.

3: A voting disk stored on non-ASM storage needs the correct owner and group
If the voting disk is located on a non-ASM device, ownership and permissions should be:

-rw-r----- 1 ogrid oinstall 21004288 Feb 4 09:13 votedisk1
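
A small check along these lines can compare a file against that expectation. This is a sketch: `check_votedisk_perms` is a hypothetical helper, and the owner and group are passed in as arguments because they differ between installations (`ogrid` and `oinstall` are simply the values from the listing above).

```shell
#!/bin/sh
# Sketch only: check_votedisk_perms is a hypothetical helper.

# check_votedisk_perms FILE OWNER GROUP
#   Compares FILE's mode, owner and group (as reported by ls -ld)
#   against "-rw-r----- OWNER GROUP". Returns 0 on match, 1 otherwise.
check_votedisk_perms() {
    # Keep only the first 10 characters of the mode so that a trailing
    # "." or "+" (SELinux context / ACL marker) does not cause a mismatch.
    actual=$(ls -ld "$1" 2>/dev/null | awk '{print substr($1, 1, 10), $3, $4}')
    expected="-rw-r----- $2 $3"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 is '$actual', expected '$expected'"
        return 1
    fi
}
```

For the example above, the call would be `check_votedisk_perms votedisk1 ogrid oinstall`.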

Case 3: Network is functional and name resolution is working

1: ocssd.bin cannot bind to any network

If ocssd.bin can't bind to any network, ocssd.log will likely have messages like the following:

2010-02-03 23:26:25.804: [GIPCXCPT][1206540320]gipcmodGipcPassInitializeNetwork: failed to find any interfaces in clsinet, ret gipcretFail (1)
2010-02-03 23:26:25.804: [GIPCGMOD][1206540320]gipcmodGipcPassInitializeNetwork: EXCEPTION[ ret gipcretFail (1) ] failed to determine host from clsinet, using default
..
2010-02-03 23:26:25.810: [    CSSD][1206540320]clsssclsnrsetup: gipcEndpoint failed, rc 39
2010-02-03 23:26:25.811: [    CSSD][1206540320]clssnmOpenGIPCEndp: failed to listen on gipc addr gipc://rac1:nm_eotcs- ret 39
2010-02-03 23:26:25.811: [    CSSD][1206540320]clssscmain: failed to open gipc endp

2: The private network has connectivity problems
If there is a connectivity issue on the private network (including multicast being off), ocssd.log will likely have messages like the following:

2010-09-20 11:52:54.014: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 180441784, wrtcnt, 453, LATS 328297844, lastSeqNo 452, uniqueness 1284979488, timestamp 1284979973/329344894
2010-09-20 11:52:54.016: [    CSSD][1078421824]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
.. >>>> after a long delay
2010-09-20 12:02:39.578: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 180441784, wrtcnt, 1037, LATS 328883434, lastSeqNo 1036, uniqueness 1284979488, timestamp 1284980558/329930254
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssgmExecuteClientRequest: MAINT recvd from proc 2 (0xe1ad870)
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssgmShutDown: Received abortive shutdown request from client.
2010-09-20 12:02:39.895: [    CSSD][1107286336]###################################
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssscExit: CSSD aborting from thread GMClientListener
2010-09-20 12:02:39.895: [    CSSD][1107286336]###################################

To validate the network, please refer to note 1054902.1.
If CSSD could not start after a network change, also check whether the network interface name matches the cluster_interconnect definition in the GPnP profile ("gpnptool get").
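
To see at a glance which nodes are affected, the "has a disk HB, but no network HB" messages can be tallied per node. The sketch below does that; `nodes_missing_network_hb` is a hypothetical name, and the pattern is based only on the message format in the excerpt above.

```shell
#!/bin/sh
# Sketch only: nodes_missing_network_hb is a hypothetical helper.

# nodes_missing_network_hb FILE
#   Prints one "COUNT NODE_NAME" line per node that ocssd reported as
#   having a disk heartbeat but no network heartbeat.
nodes_missing_network_hb() {
    grep 'has a disk HB, but no network HB' "$1" |
        # Keep only the node name, which follows "node <number>, ".
        sed 's/.*clssnmvDHBValidateNCopy: node [0-9]*, \([^,]*\), has a disk HB.*/\1/' |
        sort | uniq -c | awk '{print $1, $2}'
}
```

A node that keeps appearing here is writing to the voting disk but is not reachable over the interconnect, which points at the private network rather than at storage.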

In 11.2.0.1, ocssd.bin may bind to the public network if the private network is unavailable.

Check:

Case 4: Vendor clusterware is up (if using vendor clusterware)

Grid Infrastructure provides full clusterware functionality and doesn't need vendor clusterware to be installed. However, if you happen to run Grid Infrastructure on top of vendor clusterware in your environment, the vendor clusterware needs to come up fully before CRS can be started. To verify, as the grid user:

$GRID_HOME/bin/lsnodes -n

racnode1 1

racnode1 0

If vendor clusterware is not fully up, ocssd.log will likely have messages similar to the following:

2010-08-30 18:28:13.207: [ CSSD][36]clssnm_skgxninit: skgxncin failed, will retry

2010-08-30 18:28:14.207: [ CSSD][36]clssnm_skgxnmon: skgxn init failed

2010-08-30 18:28:14.208: [ CSSD][36]###################################

2010-08-30 18:28:14.208: [ CSSD][36]clssscExit: CSSD signal 11 in thread skgxnmon

Before the clusterware is installed, execute the command below as grid user:

$INSTALL_SOURCE/install/lsnodes -v

One known issue on HP-UX: note 2130230.1 - Grid infrastructure startup fails due to vendor Clusterware did not start (HP-UX Service guard)

Case 5: Command "crsctl" being executed from the wrong GRID_HOME

The "crsctl" command must be executed from the correct GRID_HOME to start the stack; otherwise a message similar to the following is reported:

2012-11-14 10:21:44.014: [    CSSD][1086675264]ASSERT clssnm1.c 3248
2012-11-14 10:21:44.014: [    CSSD][1086675264](:CSSNM00056:)clssnmvStartDiscovery: Terminating because of the release version(11.2.0.2.0) of this node being lesser than the active version(11.2.0.3.0) that the cluster is at
2012-11-14 10:21:44.014: [    CSSD][1086675264]###################################
2012-11-14 10:21:44.014: [    CSSD][1086675264]clssscExit: CSSD aborting from thread clssnmvDDiscThread
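
The two versions in that message can be extracted for a quick comparison. The following is a sketch under the assumption that the message keeps the wording shown above; `css_version_mismatch` and its output format are invented for illustration.

```shell
#!/bin/sh
# Sketch only: css_version_mismatch is a hypothetical helper.

# css_version_mismatch FILE
#   Prints "node_release=X cluster_active=Y" for the last CSSD message in
#   FILE that reports this node's release version as lower than the
#   cluster's active version. Prints nothing if no such message exists.
css_version_mismatch() {
    sed -n 's/.*release version(\([0-9.]*\)) of this node being lesser than the active version(\([0-9.]*\)).*/node_release=\1 cluster_active=\2/p' "$1" |
        tail -n 1
}
```

If this prints a pair such as `node_release=11.2.0.2.0 cluster_active=11.2.0.3.0`, it is worth checking which crsctl binary is first in the PATH (for example with `which crsctl`) and running it from the GRID_HOME that matches the cluster's active version.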

Case 6: The process cannot set its priority

Symptom:

### Two-node 11.2.0.4 RAC: after a server reboot, node 1 cannot start the cluster services, while node 2 runs normally.
# 1. Errors in CSSD.log on node 1
Heartbeats and so on are all normal. With the healthy node 2 shut down, node 1 still cannot start the cluster services.
CSSD.log shows an error about being unable to set the priority:
2022-09-16 22:20:12.918: [ CSSD][3851802432]clssscGetParameterOLR: OLR fetch for parameter priority (15) failed with rc 21
2022-09-16 22:20:12.918: [ CSSD][3851802432]clssscSetPrivEnv: Setting priority to 4
2022-09-16 22:20:12.924: [ CSSD][3851802432]clssscSetPrivEnv: unable to set priority to 4
2022-09-16 22:20:12.924: [ CSSD][3851802432]SLOS: cat=-2, opn=scls_set_priority_realtime, dep=1, loc=setsched unable to escalate to real time

Solution:

find /etc/systemd/system.conf /etc/systemd/system /usr/lib/systemd -type f | xargs grep -e CPUAccounting -e CPUWeight -e StartupCPUWeight -e CPUShares -e StartupCPUShares -e CPUQuota

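If the search finds services with CPUAccounting enabled, compare them with the services running on node 2 and disable the ones that differ.

# In this case the search turned up the Qingteng Cloud security agent (security protection software):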

tinagent.service
tinaxxxx.service
tinaxxxx.service
tinaxxxx.service
tinaxxxx.service
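
The search and the comparison with node 2 can be sketched as two small helpers. Both names (`units_with_cpu_accounting`, `units_only_on_node1`) are hypothetical, and the sketch assumes each node's service list has been saved to a plain text file with one unit name per line; how those lists are collected is left open.

```shell
#!/bin/sh
# Sketch only: both helper names are hypothetical.

# units_with_cpu_accounting PATH...
#   Prints the base names of files under the given paths that contain any
#   of the CPU accounting settings searched for above.
units_with_cpu_accounting() {
    find "$@" -type f -exec grep -l \
        -e CPUAccounting -e CPUWeight -e StartupCPUWeight \
        -e CPUShares -e StartupCPUShares -e CPUQuota {} + 2>/dev/null |
        sed 's|.*/||' | sort -u
}

# units_only_on_node1 NODE1_LIST NODE2_LIST
#   Prints the unit names listed in NODE1_LIST but absent from NODE2_LIST.
#   Each file holds one unit name per line.
units_only_on_node1() {
    # -x: whole-line match, -F: fixed strings, -v: keep non-matching lines.
    sort -u "$1" | grep -vxF -f "$2"
}
```

Run against the same paths as the find command above, the first helper gives the candidate list for node 1; feeding that list and node 2's service list into the second helper narrows it down to the units that exist only on the failing node.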

# 3. Restart the cluster services
Stop and disable the services that have CPUAccounting enabled, then restart the cluster; it starts normally.