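Oracle Cluster Synchronization Services Daemon (OCSSD) is the background process that provides cluster synchronization. In 10g it is started through the chain init -> init.cssd -> cssd/oprocd/cssdmonitor. It manages cluster synchronization, cluster membership, and group membership. It relies on two basic heartbeat mechanisms, the network heartbeat and the disk heartbeat, to verify that the nodes can communicate normally, which prevents split-brain and the inconsistent, unsynchronized writes it would cause. Every second each node sends a heartbeat over the private network, and every second it also performs one disk heartbeat against a voting disk. If either heartbeat stays abnormal beyond the allowed time, the cluster votes to evict the failed node.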
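To make the two-heartbeat rule concrete, here is a minimal conceptual sketch. It is not how ocssd is implemented; the function name `heartbeat_verdict` and its arguments are invented for illustration. The two thresholds correspond to what CSS calls misscount (network) and disktimeout (disk), and the caller supplies them.

```shell
# Conceptual sketch only: the real eviction logic lives inside ocssd.
# heartbeat_verdict is a hypothetical helper, not an Oracle tool.
#
# Usage: heartbeat_verdict NET_AGE DISK_AGE MISSCOUNT DISKTIMEOUT
#   NET_AGE / DISK_AGE : seconds since the last good heartbeat of each kind
#   MISSCOUNT          : allowed seconds without a network heartbeat
#   DISKTIMEOUT        : allowed seconds without a disk heartbeat
heartbeat_verdict() {
    net_age=$1; disk_age=$2; misscount=$3; disktimeout=$4
    if [ "$net_age" -ge "$misscount" ]; then
        # Either heartbeat failing on its own is enough to trigger eviction.
        echo "evict: network heartbeat missed"
    elif [ "$disk_age" -ge "$disktimeout" ]; then
        echo "evict: disk heartbeat missed"
    else
        echo "ok"
    fi
}
```

For example, `heartbeat_verdict 75 1 60 200` reports a missed network heartbeat; 60 and 200 are example values only, not readings from this cluster. On a running cluster the configured network threshold can be read with `crsctl get css misscount`.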
-- Environment:
[root@trsen01 network-scripts]# uname -a
Linux trsen01.com 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
[oracle@trsen01 ~]$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.5.0 - Production on Wed Dec 3 15:31:15 2014
Copyright (c) 1982, 2010, Oracle. All Rights Reserved.
Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
SQL> select * from v$version;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bi
PL/SQL Release 10.2.0.5.0 - Production
CORE 10.2.0.5.0 Production
TNS for Linux: Version 10.2.0.5.0 - Production
NLSRTL Version 10.2.0.5.0 - Production
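Scenario: the machine room UPS failed and power was cut abnormally. Afterwards none of the cluster nodes would start: one node had an inconsistent OCR state, and the other had a damaged motherboard.
----------------- disk heartbeat problem -----------------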
[root@trsen01 bin]#./crsctl start crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM
[root@trsen01 log]# ps -ef | grep d.b
root 15417 16372 0 15:54 pts/0 00:00:00 grep d.b
[root@trsen01 log]#ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
[root@trsen01 log]#
[root@trsen01 log]#crsctl query css votedisk
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [Invalid argument] [2]
Here I deliberately skipped the crsd and cssd logs and went straight to the OS log.
[root@trsen01 log]# cat messages.1 | grep dependencies. | more
Nov 28 01:07:12 trsen01 logger: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.13715.
Nov 28 01:07:12 trsen01 logger: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.13606.
Nov 28 01:07:12 trsen01 logger: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.13761.
Nov 28 01:08:12 trsen01 logger: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.13715.
Nov 28 01:08:12 trsen01 logger: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.13606.
…….
…..
[root@trsen01 log]# cat /tmp/crsctl.13715
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
Initial diagnosis: either the OCR file is corrupted or the disk holding it has failed.
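The lookup from syslog to diagnostics file can be scripted. The sketch below assumes the log lines have exactly the "Diagnostics in /tmp/crsctl.NNNN." shape shown above; `crs_diag_files` is a hypothetical helper name.

```shell
# crs_diag_files LOGFILE
# Print each distinct diagnostics path named in the
# "waiting on dependencies" lines of a syslog file.
crs_diag_files() {
    grep 'waiting on dependencies' "$1" \
        | sed -n 's|.*Diagnostics in \(/tmp/crsctl\.[0-9][0-9]*\).*|\1|p' \
        | sort -u
}
```

Running `crs_diag_files /var/log/messages.1` lists each file once, so repeated log lines do not have to be read by eye; every listed file can then be inspected with `cat`.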
[root@trsen01 log]# ls -trl /etc/oracle/ocr.loc
-rw-r--r-- 1 root oinstall 45 Sep 29 2011 /etc/oracle/ocr.loc
[root@trsen01 log]# more /etc/oracle/ocr.loc
ocrconfig_loc=/dev/raw/raw1
local_only=FALSE
[root@trsen01 log]# ls -l /dev/raw/raw*
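Surprisingly, no raw devices were listed at all. Rebooting the host resolved the problem. My initial suspicion is that the back-end storage and the host both lost power, and that during startup the state of the storage disks was out of sync with the state of the host. So the same symptom does not always have the same cause: this looked like OCR corruption, but it was not.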
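The check that exposed the real cause (is the device named in ocr.loc actually there?) can be done in one step. This is a minimal sketch that assumes the OCR sits on a raw character device, as it does in this environment; `ocr_device_from_loc` and `check_ocr_device` are hypothetical helper names.

```shell
# ocr_device_from_loc FILE
# Print the value of the first ocrconfig_loc= entry in an ocr.loc-style file.
ocr_device_from_loc() {
    sed -n 's/^ocrconfig_loc=//p' "$1" | head -n 1
}

# check_ocr_device FILE
# Return 0 if the configured OCR location exists as a character device,
# 1 if it does not, 2 if the file has no ocrconfig_loc entry.
check_ocr_device() {
    dev=$(ocr_device_from_loc "$1")
    if [ -z "$dev" ]; then
        echo "no ocrconfig_loc entry in $1"
        return 2
    elif [ -c "$dev" ]; then
        echo "OK: $dev is a character device"
        return 0
    else
        echo "MISSING: $dev is not present as a character device"
        return 1
    fi
}
```

Running `check_ocr_device /etc/oracle/ocr.loc` before suspecting OCR corruption separates "the device is gone" from "the device is there but its contents are bad". Where the OCR is stored on a block device or a cluster file system, the `-c` test would need to be changed.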
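----------------- network heartbeat problem -----------------
This node's motherboard had been damaged. After the repair CRS still would not start, with the same errors as before.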
[root@trsen01 bin]#./crsctl start crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM
[root@trsen01 log]# ps -ef | grep d.b
root 15417 16372 0 15:54 pts/0 00:00:00 grep d.b
[oracle@trsen02 admin]$ ocrcheck
Status of Oracle Cluster Registry is as follows :
Version : 2
Total space (kbytes) : 102184
Used space (kbytes) : 4364
Available space (kbytes) : 97820
ID : 1138124715
Device/File Name : /dev/raw/raw1
Device/File integrity check succeeded
Device/File not configured
Cluster registry integrity check succeeded
[oracle@trsen02 admin]$ ls -trl /etc/oracle/ocr.loc
-rw-r--r-- 1 root oinstall 45 Sep 29 2011 /etc/oracle/ocr.loc
[oracle@trsen02 admin]$ more /etc/oracle/ocr.loc
ocrconfig_loc=/dev/raw/raw1
local_only=FALSE
[oracle@trsen02 admin]$ ls -ltr /dev/raw/raw*
crw-rw---- 1 root oinstall 162, 1 Dec 3 09:18 /dev/raw/raw1
crw-rw---- 1 oracle dba 162, 2 Dec 3 16:41 /dev/raw/raw2
Nothing wrong here, so on to the logs.
[root@trsen02 log]# more /var/log/messages
Nov 28 01:08:12 trsen01 logger: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.16063.
[root@trsen02 log]# more /tmp/crsctl.16063
Failed 3 to bind listening endpoint: (ADDRESS=(PROTOCOL=tcp)(HOST=trsen02-priv
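The error is different this time. Checking the network configuration showed that the private-network interface was missing, so I promptly configured it.
In general, when CSS fails to start, the cause is more often the OCR file or its disk than the network.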
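A bind failure on the private hostname suggests first checking whether that name is defined on the node at all. The sketch below assumes the private name is resolved through /etc/hosts, which is common for RAC interconnects but is not confirmed by the output above; `host_in_hosts_file` is a hypothetical helper name.

```shell
# host_in_hosts_file NAME [FILE]
# Succeed (exit 0) if NAME appears as a hostname or alias on a
# non-comment line of FILE (default /etc/hosts); otherwise exit 1.
host_in_hosts_file() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v n="$name" '
        { sub(/#.*/, "") }    # drop comments; awk re-splits the fields
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit found ? 0 : 1 }
    ' "$file"
}
```

Pass the full private hostname from the error message as NAME. If the name is present, the next step is to confirm that an interface actually carries that address, for example by reading the output of `ifconfig -a`.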