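Conclusions:
1. The dependencies between RAC processes keep evolving across Oracle versions, 11.2.0.3 included, so it is critical to understand how the RAC processes depend on one another.
2. CRS-1714: Unable to discover any voting files is only a symptom and does not necessarily mean the voting disk is damaged. Analyze the corresponding logs to find the actual cause.
3. If the configuration file profile.xml (OLR) used by a RAC node's GPNPD process is damaged, the affected node may have to be rebuilt.
4. Before deleting or adding a RAC node, read the official documentation carefully, because it distinguishes many different cases.
5. Most importantly, if you get stuck while analyzing the logs or have never seen a similar problem, search MOS by keyword, for example GPNP PROFILE in this case.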
Analysis:
1. On a 2-node 11.2.0.4 RAC running on Red Hat 6.4, the CRSD process did not start. The cluster alert log shows that the voting disk cannot be found:
2015-09-16 16:53:36.138
[cssd(25059)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/grid/11.2.0.4/log/jingfa1/cssd/ocssd.log
2015-09-16 16:53:51.176
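A minimal sketch for pulling the CRS-1714 entries out of the alert log, so you can see how often the message repeats and which ocssd.log it points to. The function names are made up for illustration, and the alert log path in the usage note is an assumption based on the paths in this case.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of any Oracle tooling.

# Print how many lines of the given alert log contain CRS-1714.
count_crs1714() {
    # grep -c prints 0 but exits non-zero when nothing matches, so mask the status
    grep -c 'CRS-1714' "$1" || true
}

# Print the distinct detail-log paths that the CRS-1714 messages refer to.
crs1714_detail_logs() {
    # Capture the absolute path that follows the last " in " on each CRS-1714 line
    sed -n 's#.*CRS-1714.* in \(/[^ ]*\).*#\1#p' "$1" | sort -u
}
```

Usage, assuming the alert log lives at /u01/grid/11.2.0.4/log/jingfa1/alertjingfa1.log: `count_crs1714 /u01/grid/11.2.0.4/log/jingfa1/alertjingfa1.log`, then open the file printed by `crs1714_detail_logs` for the real error.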
2. Run the following command to stop all Oracle-related processes on both nodes
/u01/grid/11.2.0.4/bin/crsctl stop crs
3. Confirm that all Oracle processes on both nodes are stopped
ps -ef|grep d.bin
root 1077 24425 0 09:00 pts/1 00:00:00 grep d.bin
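The check above still prints the grep command itself, which is easy to misread as a running daemon. A small sketch that filters it out and turns the check into a yes/no answer; the function names are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; they read `ps -ef` output on stdin.

# Print clusterware daemon lines (anything matching d.bin), excluding grep itself.
list_dbin_procs() {
    grep 'd\.bin' | grep -v 'grep' || true
}

# Succeed only when no clusterware daemon is left in the process list.
stack_is_down() {
    [ -z "$(list_dbin_procs)" ]
}
```

Usage on each node: `ps -ef | stack_is_down && echo "clusterware stack is down"`.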
4. On node 1, start CRS in exclusive mode
/u01/grid/11.2.0.4/bin/crsctl start crs -excl -nocrs
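5. On node 1, check whether the ASM processes have started
6. On node 1, check whether the cluster processes have started in exclusive mode
7. On node 1, check whether the OCR is working properly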
/u01/grid/11.2.0.4/bin/ocrcheck
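A sketch for turning the ocrcheck result into a pass/fail status. It assumes that a healthy run prints the line "Cluster registry integrity check succeeded"; verify that wording against your own ocrcheck output before relying on it. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper; reads ocrcheck output on stdin.
# Assumption: a healthy OCR makes ocrcheck print the line matched below.
ocr_looks_healthy() {
    grep -q 'Cluster registry integrity check succeeded'
}
```

Usage: `/u01/grid/11.2.0.4/bin/ocrcheck | ocr_looks_healthy && echo "OCR OK"`.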
8. If the OCR is not working properly and a backup exists, restore the OCR from the backup
/u01/grid/11.2.0.4/bin/ocrconfig -showbackup
/u01/grid/11.2.0.4/bin/ocrconfig -restore <OCR backup file>
9. On node 1, as the grid user, check whether the disk group holding the OCR and voting disk exists. It does.
1* select disk_number,path from v$asm_disk
SQL> /
DISK_NUMBER PATH
----------- --------------------------------------------------
0 /dev/ocr_vote
0 /dev/data
SQL>
SQL>
SQL> show parameter disk_
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
asm_diskgroups string DATA
asm_diskstring string /dev/*
SQL> select name,sector_size,block_size,allocation_unit_size/1024/1024 as au_mb from v$asm_diskgroup;
NAME SECTOR_SIZE BLOCK_SIZE AU_MB
------------------------------ ----------- ---------- ----------
DATA 512 4096 2
OCRVOTE 512 4096 2
10. On node 1, check whether the voting disk is working. It is indeed not found.
/u01/grid/11.2.0.4/bin/crsctl query css votedisk
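A sketch for checking this result in a script rather than by eye. It assumes each voting file is listed on its own line with a state column such as ONLINE; the function name is illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper; reads `crsctl query css votedisk` output on stdin.
# Assumption: each usable voting file appears on a line whose state column is ONLINE.
count_online_votedisks() {
    # Require whitespace on both sides so only the state column matches;
    # grep -c prints 0 but exits non-zero on no match, so mask the status
    grep -c '[[:space:]]ONLINE[[:space:]]' || true
}
```

Usage: `/u01/grid/11.2.0.4/bin/crsctl query css votedisk | count_online_votedisks`. A result of 0 corresponds to the situation in this step.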
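11. The asm_diskgroups value from step 9 shows that only the DATA disk group is mounted at startup, not OCRVOTE. I therefore changed the parameter so that the ASM instance mounts both OCRVOTE and DATA at startup, expecting that the voting disk disk group would then be mounted automatically when ASM starts.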
alter system set asm_diskgroups=data,ocrvote sid='*';
show parameter disk_
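12. Stop the CRS cluster processes on node 1
/u01/grid/11.2.0.4/bin/crsctl stop crs
13. Restart the cluster processes on both nodes and check whether crsd comes up. The problem persists: the voting disk still cannot be found.
/u01/grid/11.2.0.4/bin/crsctl start crs
14. Stop the cluster processes on both nodes, then start them in exclusive mode on node 1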
/u01/grid/11.2.0.4/bin/crsctl stop crs
/u01/grid/11.2.0.4/bin/crsctl start crs -excl -nocrs
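15. On node 1, replace the voting disk directly onto the ocrvote disk group to repair it
/u01/grid/11.2.0.4/bin/crsctl replace votedisk +ocrvote
16. On node 1, check whether the voting disk is now healthy
/u01/grid/11.2.0.4/bin/crsctl query css votedisk
17. Stop the cluster processes on node 1, then restart the cluster processes on both nodes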
/u01/grid/11.2.0.4/bin/crsctl stop crs
/u01/grid/11.2.0.4/bin/crsctl start crs
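18. On both nodes, confirm that the voting disk works. (As far as I observed, the command below returns output only once the CRSD process is up and is empty otherwise, and CRSD is the last of the cluster processes to start.) Node 1 is now healthy, but CRSD still does not start on node 2.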
/u01/grid/11.2.0.4/bin/crsctl query css votedisk
19. Check the grid user's trace file on node 2. The cluster GUID recorded in the voting disk does not match the one in the GPnP profile, which is why node 2 cannot discover the voting disk.
2015-09-16 17:58:51.847: [ CSSD][1851041536]clssnmvDiskVerify: discovered a potential voting file
2015-09-16 17:58:51.847: [ SKGFD][1851041536]Handle 0x7fd95808f980 from lib :UFS:: for disk :/dev/ocr_vote:
---Here CSSD finds that the cluster GUID stored in the voting disk differs from the cluster GUID in the GPnP profile
2015-09-16 17:58:51.965: [ CSSD][1851041536]clssnmvDiskCreate: Cluster guid 0acef774f25dcfb0bf3d0c7b3db02abe found in voting disk /dev/ocr_vote does not match with the cluster guid 7d8026436ade6fe0ff597a0f6df497e1 obtained from the GPnP profile
--The voting disk is then removed
2015-09-16 17:58:51.965: [ CSSD][1851041536]clssnmvDiskDestroy: removing the voting disk /dev/ocr_vote
2015-09-16 17:58:51.965: [ SKGFD][1851041536]Lib :UFS:: closing handle 0x7fd95808f980 for disk :/dev/ocr_vote:
--No voting disk is found
2015-09-16 17:58:51.965: [ CSSD][1851041536]clssnmvDiskVerify: Successful discovery of 0 disks
2015-09-16 17:58:51.965: [ CSSD][1851041536]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2015-09-16 17:58:51.965: [ CSSD][1851041536]clssnmvFindInitialConfigs: No voting files found
2015-09-16 17:58:51.965: [ CSSD][1851041536](:CSSNM00070:)clssnmCompleteInitVFDiscovery: Voting file not found. Retrying discovery in 15 seconds
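The GUID mismatch is easy to miss in a long ocssd.log. A sketch that extracts both GUIDs from every mismatch message; the function name is hypothetical, and the pattern is written against the message text quoted in this step.

```shell
#!/usr/bin/env bash
# Hypothetical helper; reads ocssd.log content on stdin.
# Prints "<guid in voting disk> <guid in GPnP profile>" for each mismatch message.
guid_mismatches() {
    # Join lines first, because the message may be wrapped across two lines
    tr '\n' ' ' |
        grep -o 'Cluster guid [0-9a-f]\{32\} found in voting disk [^ ]* does not match with the *cluster guid [0-9a-f]\{32\}' |
        sed 's/^Cluster guid \([0-9a-f]*\) .*cluster guid \([0-9a-f]*\)$/\1 \2/'
}
```

Usage: `guid_mismatches < /u01/grid/11.2.0.4/log/jingfa2/cssd/ocssd.log | sort -u` (the jingfa2 log path is an assumption modeled on the node 1 path). Joining the whole log into one line is fine for a sketch; for very large logs, filter with grep first.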
21. On node 2, let's see what the GPnP process looks like
[grid@jingfa2 jingfa2]$ ps -ef|grep -i gpnp
grid 5238 32255 0 10:02 pts/1 00:00:00 grep -i gpnp
grid 18060 1 0 09:45 ? 00:00:01 /u01/grid/11.2.0.4/bin/gpnpd.bin
22. On node 2, find where the GPnP profile file lives
[grid@jingfa2 gpnpd]$ locate gpnp|grep -i --color profile
/u01/grid/11.2.0.4/gpnp/profiles
/u01/grid/11.2.0.4/gpnp/jingfa2/profiles
/u01/grid/11.2.0.4/gpnp/jingfa2/profiles/peer
/u01/grid/11.2.0.4/gpnp/jingfa2/profiles/peer/pending.xml
/u01/grid/11.2.0.4/gpnp/jingfa2/profiles/peer/profile.old
/u01/grid/11.2.0.4/gpnp/jingfa2/profiles/peer/profile.xml --probably this file
/u01/grid/11.2.0.4/gpnp/jingfa2/profiles/peer/profile_orig.xml
/u01/grid/11.2.0.4/gpnp/profiles/peer
/u01/grid/11.2.0.4/gpnp/profiles/peer/profile.xml
/u01/grid/11.2.0.4/gpnp/profiles/peer/profile_orig.xml
23. Inspect the GPnP profile on node 2. The GUID 7d8026436ade6fe0ff597a0f6df497e1 appears in /u01/grid/11.2.0.4/gpnp/jingfa2/profiles/peer/profile.xml, which confirms that this is the file.
I also compared it with the same file on node 1, where 0acef774f25dcfb0bf3d0c7b3db02abe is present, so I tried updating the GUID by hand, replacing 7d8026436ade6fe0ff597a0f6df497e1 with 0acef774f25dcfb0bf3d0c7b3db02abe.
[grid@jingfa2 gpnpd]$ more /u01/grid/11.2.0.4/gpnp/jingfa2/profiles/peer/profile.xml|grep -i --color 7d8026436ade6fe0ff597a0f6df497e1
<?xml version="1.0" encoding="UTF-8"?><gpnp:GPnP-Profile Version="1.0" xmlns="http://www.grid-pnp.org/2005/11/gpnp-profile" xmlns:gpnp="http://www.grid-pnp.org/2005/11/gpnp-profile"
xmlns:orcl="http://www.oracle.com/gpnp/2005/11/gpnp-profile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.grid-pnp.org/2005/11/gpnp-profile
gpnp-profile.xsd" ProfileSequence="7" ClusterUId="7d8026436ade6fe0ff597a0f6df497e1" ClusterName="jingfa-scan" PALocation=""><gpnp:Network-Profile><gpnp:HostNetwork id="gen"
HostName="*"><gpnp:Network id="net1" IP="192.168.0.0" Adapter="eth0" Use="public"/><gpnp:Network id="net2" IP="10.0.0.0" Adapter="eth1"
Use="cluster_interconnect"/></gpnp:HostNetwork></gpnp:Network-Profile><orcl:CSS-Profile id="css" DiscoveryString="+asm"
LeaseDuration="400"/><orcl:ASM-Profile id="asm" DiscoveryString="/dev/ocr*"
SPFile="+OCRVOTE/jingfa-scan/asmparameterfile/registry.253.849167179"/><ds:Signature
xmlns:ds="http://www.w3.org/2000/09/xmldsig#"><ds:SignedInfo><ds:CanonicalizationMethod Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"/>
<ds:SignatureMethod Algorithm="http://www.w3.org/2000/09/xmldsig#rsa-sha1"/><ds:Reference URI=""><ds:Transforms>
<ds:Transform Algorithm="http://www.w3.org/2000/09/xmldsig#enveloped-signature"/><ds:Transform Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#">
<InclusiveNamespaces xmlns="http://www.w3.org/2001/10/xml-exc-c14n#" PrefixList="gpnp orcl xsi"/></ds:Transform></ds:Transforms>
<ds:DigestMethod Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/><ds:DigestValue>cPtosOiD17nSId/92MTAPaQ+dLU=</ds:DigestValue></ds:Reference>
</ds:SignedInfo><ds:SignatureValue>Ca56sx6DgsCSxrRqPz2ReOzhkf9eYiqVYuj2XLadwuBURX2PL+nYD7LhLFFj27EpuSIx0SfGVhOPm/i016ws7tWATeSKBJDVyTAELgBEYPsMumW4vKm7rVXs
SbVJolycA3pFHtGqZ7FZjzSXxdj5Xq4LlBLGVWR3gYKnqxuRGv0=</ds:SignatureValue>
</ds:Signature></gpnp:GPnP-Profile>
[grid@jingfa2 gpnpd]$
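Rather than reading the whole XML, the ClusterUId attribute can be pulled out on its own and compared between nodes. A minimal sketch; the function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper; prints the ClusterUId attribute of a GPnP profile file.
# Usage: cluster_uid /path/to/profile.xml
cluster_uid() {
    # Keep only the attribute, take the first hit, then strip name and quotes
    grep -o 'ClusterUId="[0-9a-f]*"' "$1" | head -n 1 | sed 's/.*="\(.*\)"/\1/'
}
```

Usage on node 2: `cluster_uid /u01/grid/11.2.0.4/gpnp/jingfa2/profiles/peer/profile.xml`. Run the same against the profile on node 1 and compare the two values.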
24. Back up the file on node 2 before changing it
cp /u01/grid/11.2.0.4/gpnp/jingfa2/profiles/peer/profile.xml /u01/grid/11.2.0.4/gpnp/jingfa2/profiles/peer/profile.xml.20150917bak
vi /u01/grid/11.2.0.4/gpnp/jingfa2/profiles/peer/profile.xml
:%s/7d8026436ade6fe0ff597a0f6df497e1/0acef774f25dcfb0bf3d0c7b3db02abe/g
Save and exit.
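25. Restart the cluster processes on node 2. The cluster processes on node 1 restarted as well, and strangely the change from step 24 had been reverted to the original value. I forced the edit again and restarted the cluster processes on node 2.
After repeated attempts it is clear that the gpnp process restores this file, so editing it by hand has no effect. This is consistent with the profile being signed (the ds:Signature element in the file): a hand-edited copy no longer matches its signature.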
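A small sketch for confirming the revert without opening the file: compare the live profile byte for byte with the backup taken before the edit. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper; succeeds when the live profile is identical to the pre-edit backup,
# i.e. the manual change has been undone.
# Usage: profile_reverted <backup file> <live profile.xml>
profile_reverted() {
    # cmp -s is silent and returns 0 only when both files have the same content
    cmp -s "$1" "$2"
}
```

Usage after restarting the stack: `profile_reverted /u01/grid/11.2.0.4/gpnp/jingfa2/profiles/peer/profile.xml.20150917bak /u01/grid/11.2.0.4/gpnp/jingfa2/profiles/peer/profile.xml && echo "edit was reverted"`.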
26. Since the approach above does not work, try another one: check how the agent processes differ between the two nodes
[root@jingfa1 ~]# ps -ef|grep agent|grep grid|grep -v grep
grid 3647 1 0 09:44 ? 00:00:10 /u01/grid/11.2.0.4/bin/oraagent.bin
root 3660 1 0 09:4