A couple of days ago, node 2 of one of our clusters went down.
Checking the cluster alert log:
2017-05-12 15:35:41.738
[cssd(743)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/proddb-2/cssd/ocssd.log
2017-05-12 15:35:56.746
[cssd(743)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/proddb-2/cssd/ocssd.log
2017-05-12 15:36:11.754
[cssd(743)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/proddb-2/cssd/ocssd.log
2017-05-12 15:36:26.761
[cssd(743)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/proddb-2/cssd/ocssd.log
Checking ocssd.log:
OCSSD LOG
------------------
Filename=ocssd.log
2017-05-05 13:46:56.088: [ CSSD][704857856]clssscGetParameterProfile: buffer passed for parameter VF discovery (2) is too short, required 23, passed 20
2017-05-05 13:46:56.088: [ CSSD][704857856]clssnmReadDiscoveryProfile: voting file discovery string(/u02/oradata/grid/vote)
2017-05-05 13:46:56.088: [ CSSD][704857856]clssnmvDDiscThread: using discovery string /u02/oradata/grid/vote for initial discovery
2017-05-05 13:46:56.088: [ SKGFD][704857856]Discovery with str:/u02/oradata/grid/vote:
2017-05-05 13:46:56.088: [ SKGFD][704857856]UFS discovery with :/u02/oradata/grid/vote:
2017-05-05 13:46:56.089: [ SKGFD][704857856]Fetching UFS disk :/u02/oradata/grid/vote:
2017-05-05 13:46:56.089: [ SKGFD][704857856]OSS discovery with :/u02/oradata/grid/vote:
2017-05-05 13:46:56.089: [ CSSD][704857856]clssnmvDiskVerify: Successful discovery of 0 disks
2017-05-05 13:46:56.089: [ CSSD][704857856]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2017-05-05 13:46:56.089: [ CSSD][704857856]clssnmvFindInitialConfigs: No voting files found
2017-05-05 13:46:56.089: [ CSSD][704857856](:CSSNM00070:)clssnmCompleteInitVFDiscovery: Voting file not found. Retrying discovery in 15 seconds
2017-05-05 13:47:12.576: [ CSSD][4018611968]clsu_load_ENV_levels: Module = CSSD, LogLevel = 2, TraceLevel = 0
2017-05-05 13:47:12.576: [ CSSD][4018611968]clsu_load_ENV_levels: Module = GIPCNM, LogLevel = 2, TraceLevel = 0
2017-05-05 13:47:12.576: [ CSSD][4018611968]clsu_load_ENV_levels: Module = GIPCGM, LogLevel = 2, TraceLevel = 0
2017-05-05 13:47:12.576: [ CSSD][4018611968]clsu_load_ENV_levels: Module = GIPCCM, LogLevel = 2, TraceLevel = 0
2017-05-05 13:47:12.576: [ CSSD][4018611968]clsu_load_ENV_levels: Module = CLSF, LogLevel = 0, TraceLevel = 0
2017-05-05 13:47:12.576: [ CSSD][4018611968]clsu_load_ENV_levels: Module = SKGFD, LogLevel = 0, TraceLevel = 0
2017-05-05 13:47:12.576: [ CSSD][4018611968]clsu_load_ENV_levels: Module = GPNP, LogLevel = 1, TraceLevel = 0
2017-05-05 13:47:12.576: [ CSSD][4018611968]clsu_load_ENV_levels: Module = OLR, LogLevel = 0, TraceLevel = 0
[ CSSD][4018611968]clsugetconf : Configuration type [4].
2017-05-05 13:47:12.576: [ CSSD][4018611968]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1493963232
2017-05-05 13:47:12.576: [ CSSD][4018611968]clssscmain: Environment is production
2017-05-05 13:47:12.576: [ CSSD][4018611968]clssscmain: Core file size limit extended
2017-05-05 13:47:12.579: [ CSSD][4018611968]clssscmain: GIPCHA down 0
2017-05-05 13:47:12.580: [ CSSD][4018611968]clssscGetParameterOLR: OLR fetch for parameter logsize (8) failed with rc 21
2017-05-05 13:47:12.580: [ CSSD][4018611968]clssscExtendLimits: The current soft limit for file descriptors is 65536, hard limit is 65536
2017-05-05 13:47:12.580: [ CSSD][4018611968]clssscExtendLimits: The current soft limit for locked memory is 4294967295, hard limit is 4294967295
2017-05-05 13:47:12.580: [ CSSD][4018611968]clssscGetParameterOLR: OLR fetch for parameter priority (15) failed with rc 21
2017-05-05 13:47:12.580: [ CSSD][4018611968]clssscSetPrivEnv: Setting priority to 4
2017-05-05 13:47:12.586: [ CSSD][4018611968]clssscSetPrivEnv: Can't access local IPMI device--no device configured or driver missing/incompatible. IPMI support may be available with static IP configuration.
2017-05-05 13:47:12.586: [ CSSD][4018611968]clssscmain: Running as user grid
2017-05-05 13:47:12.587: [ CSSD][4018611968]clssscmain: RT queue setting is at default value
2017-05-05 13:47:12.587: [ CSSD][4018611968]clssscGetParameterOLR: OLR fetch for parameter auth rep (9) failed with rc 21
2017-05-05 13:47:12.587: [ CSSD][4018611968]clssscGetParameterOLR: OLR fetch for parameter diagwait (14) failed with rc 21
2017-05-05 13:47:12.590: [ CSSD][4018611968]clssnmInitNMInfoMin: Initializing first-reconfig to (0)
[ clsdmt][4009936640]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=proddb-2DBG_CSSD))
2017-05-05 13:47:12.590: [ clsdmt][4009936640]PID for the Process [17269], connkey 4
2017-05-05 13:47:12.590: [ CSSD][4018611968]clssscmain: initgminfo done
2017-05-05 13:47:12.590: [ CSSD][4003038976]clssgmclientlsnr: Spawned
2017-05-05 13:47:12.590: [ CSSD][4003038976]clssgmEvtInformation: reqtype (13) cmProc ((nil)) client ((nil))
2017-05-05 13:47:12.590: [ CSSD][4003038976]clssgmEvtInformation: reqtype (13) req (0x7fe9e4000920)
2017-05-05 13:47:12.590: [ CSSD][4003038976]clssnmQueueNotification: type (13) 0x7fe9e4000920
2017-05-05 13:47:12.591: [ CSSD][4003038976]clssgmclientlsnr: listening on clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_proddb-2_)(GIPCID=00000000-00000000-17269))
2017-05-05 13:47:12.591: [ GPNP][4018611968]clsgpnp_Init: [at clsgpnp0.c:585] '/u01/app/11.2.0/grid' in effect as GPnP home base.
2017-05-05 13:47:12.591: [ GPNP][4018611968]clsgpnp_Init: [at clsgpnp0.c:619] GPnP pid=17269, GPNP comp tracelevel=1, depcomp tracelevel=0, tlsrc:ORA_DAEMON_LOGGING_LEVELS, apitl:0, complog:1, tstenv:0, devenv:0, envopt:0, flags=3
2017-05-05 13:47:12.613: [ GPNP][4018611968]clsgpnpkwf_initwfloc: [at clsgpnpkwf.c:399] Using FS Wallet Location : /u01/app/11.2.0/grid/gpnp/proddb-2/wallets/peer/
[ CLWAL][4018611968]clsw_Initialize: OLR initlevel [70000]
2017-05-05 13:47:12.628: [ GPNP][4018611968]clsgpnp_profileCallUrlInt: [at clsgpnp.c:2104] get-profile call to url "ipc://GPNPD_proddb-2" disco "" [f=0 claimed- host: cname: seq: auth:]
2017-05-05 13:47:12.634: [ GPNP][4018611968]clsgpnp_profileCallUrlInt: [at clsgpnp.c:2234] Result: (0) CLSGPNP_OK. Successful get-profile CALL to remote "ipc://GPNPD_proddb-2" disco ""
2017-05-05 13:47:12.635: [ CSSD][4018611968]clssscGetParameterProfile: profile fetch failed for parameter ocrid (4) with return code 5
2017-05-05 13:47:12.635: [ CSSD][4018611968]clssscmain: OCRID is 0
2017-05-05 13:47:12.635: [ CSSD][4018611968]clssscmain: Cluster GUID is 208b625386b2df39bfb02751ce50ee56
2017-05-05 13:47:12.635: [ CSSD][4018611968]clssnmNotifyReq: type (12)
2017-05-05 13:47:12.635: [ CSSD][4018611968]clssscmain: last used node number 2
2017-05-05 13:47:12.635: [ CSSD][4018611968]clssscGetParameterProfile: buffer passed for parameter VF discovery (2) is too short, required 23, passed 20
2017-05-05 13:47:12.635: [ CSSD][4018611968]clssnmReadDiscoveryProfile: voting file discovery string(/u02/oradata/grid/vote)
2017-05-05 13:47:12.635: [ CSSD][4018611968]clssnkInit: NK generic layer initializing.
2017-05-05 13:47:12.637: [ SKGFD][4000347904]NOTE: No asm libraries found in the system
2017-05-05 13:47:12.637: [ CLSF][4000347904]Allocated CLSF context
2017-05-05 13:47:12.637: [ CSSD][4000347904]clssnmvDDiscThread: using discovery string /u02/oradata/grid/vote for initial discovery
2017-05-05 13:47:12.637: [ SKGFD][4000347904]Discovery with str:/u02/oradata/grid/vote:
2017-05-05 13:47:12.637: [ SKGFD][4000347904]UFS discovery with :/u02/oradata/grid/vote:
2017-05-05 13:47:12.638: [ SKGFD][4000347904]Fetching UFS disk :/u02/oradata/grid/vote:
2017-05-05 13:47:12.638: [ SKGFD][4000347904]OSS discovery with :/u02/oradata/grid/vote:
2017-05-05 13:47:12.638: [ CSSD][4000347904]clssnmvDiskVerify: Successful discovery of 0 disks
2017-05-05 13:47:12.638: [ CSSD][4000347904]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2017-05-05 13:47:12.638: [ CSSD][4000347904]clssnmvFindInitialConfigs: No voting files found
2017-05-05 13:47:12.639: [ CSSD][4000347904](:CSSNM00070:)clssnmCompleteInitVFDiscovery: Voting file not found. Retrying discovery in 15 seconds
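The two facts that matter in this long log are which discovery string CSSD used and how many disks it verified. A minimal sketch for pulling those out of an ocssd.log excerpt follows; the function name `summarize_discovery` is hypothetical, and the patterns are based only on the message texts shown above, so other Clusterware versions may word them differently.

```python
import re

# Patterns taken from the log messages quoted above (assumed stable for 11.2.0.3).
DISCOVERY = re.compile(r"voting file discovery string\(([^)]*)\)")
VERIFIED = re.compile(r"Successful discovery of (\d+) disks")

def summarize_discovery(text):
    """Return (unique discovery strings, disk counts per discovery attempt)."""
    strings = sorted(set(DISCOVERY.findall(text)))
    counts = [int(n) for n in VERIFIED.findall(text)]
    return strings, counts

# Two entries copied from the log above.
sample = (
    "2017-05-05 13:47:12.635: [ CSSD][4018611968]clssnmReadDiscoveryProfile: "
    "voting file discovery string(/u02/oradata/grid/vote)\n"
    "2017-05-05 13:47:12.638: [ CSSD][4000347904]clssnmvDiskVerify: "
    "Successful discovery of 0 disks\n"
)
print(summarize_discovery(sample))
```

A discovery string that matches the expected path combined with a count of 0 suggests the path itself is right and that something else, such as access to the file, is preventing CSSD from using it.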
The voting file /u02/oradata/grid/vote exists and its permissions look normal; nobody has changed the permissions on this file.
node 1:
[root@PRODDB-1 ~]# /u01/app/11.2.0/grid/bin/crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   483976892bf34f7ebfdecd4d03533205 (/u02/oradata/grid/vote) []
Located 1 voting disk(s).
[root@PRODDB-1 ~]# ls -l /u02/oradata/grid/
total 23600
-rw-r----- 1 oracle oinstall 272756736 May 11 12:37 ocr
-rw-r----- 1 oracle oinstall  21004800 May 11 14:27 vote
node 2:
[root@PRODDB-2 ~]# /u01/app/11.2.0/grid/bin/crsctl query css votedisk
Unable to communicate with the Cluster Synchronization Services daemon.
[root@PRODDB-2 ~]# ls -l /u02/oradata/grid/
total 23600
-rw-r----- 1 oracle oinstall 272756736 May 11 12:37 ocr
-rw-r----- 1 oracle oinstall  21004800 May 11 14:27 vote
At this point I was baffled, so I opened an SR with Oracle and traced the ocssd process:
1. crsctl start crs
2. Get the ocssd.bin PID:
   ps -ef|grep ocssd.bin
3. Execute the following command:
   strace -fto /tmp/ocssd_strace.log -p <PID>
   Wait 10 minutes, then cancel the command.
Checking the trace log:
STRACE OUTPUT
-------------------------
Filename=ocssd_strace.zip
22372 14:54:05 write(4, "2017-05-11 14:54:05.449: [ GP"..., 188) = 188
22594 14:54:05 stat("/u02/oradata/grid/vote", <unfinished ...>
22372 14:54:05 write(4, "2017-05-11 14:54:05.450: [ CS"..., 603) = 603
22372 14:54:05 futex(0xf2ec94, FUTEX_WAIT_PRIVATE, 899, NULL <unfinished ...>
22594 14:54:05 <... stat resumed> {st_mode=S_IFREG|0640, st_size=21004800, ...}) = 0
22594 14:54:05 stat("/u02/oradata/grid/vote", {st_mode=S_IFREG|0640, st_size=21004800, ...}) = 0
22594 14:54:05 access("/u02/oradata/grid/vote", R_OK|W_OK) = -1 EACCES (Permission denied) <=====
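A ten-minute strace capture is large, so it helps to filter it down to the failed system calls that mention the voting file. The following is a rough sketch, assuming the `strace -f -t` line layout shown above (PID, time, call, result); the helper name `failed_calls` is made up for illustration, and calls split across `<unfinished ...>` / `<... resumed>` lines are deliberately ignored.

```python
import re

# Matches a completed, failed call such as:
#   22594 14:54:05 access("/path", R_OK|W_OK) = -1 EACCES (Permission denied)
FAILED_CALL = re.compile(
    r"^(?P<pid>\d+)\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<syscall>\w+)\((?P<args>.*)\)\s+=\s+-1\s+(?P<errno>E[A-Z0-9]+)"
)

def failed_calls(lines, path):
    """Yield (pid, time, syscall, errno) for failed calls whose arguments quote `path`."""
    for line in lines:
        m = FAILED_CALL.match(line.strip())
        if m and f'"{path}"' in m.group("args"):
            yield m.group("pid"), m.group("time"), m.group("syscall"), m.group("errno")

# Two lines copied from the trace above.
sample = [
    '22594 14:54:05 stat("/u02/oradata/grid/vote", {st_mode=S_IFREG|0640, st_size=21004800, ...}) = 0',
    '22594 14:54:05 access("/u02/oradata/grid/vote", R_OK|W_OK) = -1 EACCES (Permission denied)',
]
for hit in failed_calls(sample, "/u02/oradata/grid/vote"):
    print(hit)
```

On the excerpt above this leaves only the `access(..., R_OK|W_OK)` call: `stat` succeeds, so the file is visible, but the process is refused combined read and write access.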
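Searching online for this error turned up an article describing the same error, except that there the permission problem was on ASM disk devices, whereas our company uses Veritas Storage Foundation: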
http://www.askmaclean.com/archives/discover-your-missed-asm-disks.html
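So I tried changing the permissions on /u02/oradata/grid/vote from 640 to 644. The log still showed that the voting file could not be discovered. After I changed them to 664, CRS started and everything was back to normal.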
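One reading that fits these results: ocssd.log reports "Running as user grid", while `ls -l` shows the file owned by oracle:oinstall. If grid is a member of oinstall but is not the owner (an assumption, since the post does not show `id grid`), then only the group permission bits apply to the daemon. The sketch below models that simplified POSIX rule; it ignores root, ACLs and mount options, and the numeric IDs are hypothetical.

```python
import stat

def can_read_write(mode, file_uid, file_gid, proc_uid, proc_gids):
    """Simplified model of access(path, R_OK|W_OK).

    Exactly one permission class is consulted: owner, else group, else other.
    """
    if proc_uid == file_uid:
        r, w = stat.S_IRUSR, stat.S_IWUSR
    elif file_gid in proc_gids:
        r, w = stat.S_IRGRP, stat.S_IWGRP
    else:
        r, w = stat.S_IROTH, stat.S_IWOTH
    return bool(mode & r) and bool(mode & w)

# Hypothetical numeric IDs, for illustration only.
ORACLE_UID, GRID_UID, OINSTALL_GID = 501, 502, 500

for mode in (0o640, 0o644, 0o664):
    ok = can_read_write(mode, ORACLE_UID, OINSTALL_GID, GRID_UID, {OINSTALL_GID})
    print(oct(mode), ok)
# 0o640 False
# 0o644 False
# 0o664 True
```

Under that assumption, 644 only adds read permission for others and leaves the group without write permission, which would explain why discovery kept failing until the mode became 664.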
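What still puzzles me is that nobody changed this file's permissions. If they were never changed, how did the cluster start when it was first created? And if they were changed (a manual change can definitely be ruled out), how did that happen?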