A test Oracle 10.2.0.2 RAC database on Solaris started reporting ORA-29701: unable to connect to Cluster Manager.
Checking the alert log:
Mon Sep 26 22:00:08 2011
Errors in file /oracle/oracle/admin/CR/bdump/c1_j000_16760.trc:
ORA-29701: unable to connect to Cluster Manager
Mon Sep 26 22:00:08 2011
GATHER_STATS_JOB encountered errors. Check the trace file.
Excerpts from the trace file:
Dump file /oracle/oracle/admin/CR/bdump/c1_j000_16760.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /oracle/oracle/product/10.2.0.1
System name: SunOS
Node name: sunvs-a
Release: 5.10
Version: Generic_139555-08
Machine: sun4u
Instance name: C1
Redo thread mounted by this instance: 1
Oracle process number: 62
Unix process pid: 16760, image: oracle@sunvs-a (J000)
*** ACTION NAME:(GATHER_STATS_JOB) 2011-09-26 22:00:08.201
*** MODULE NAME:(DBMS_SCHEDULER) 2011-09-26 22:00:08.201
*** SERVICE NAME:(SYS$USERS) 2011-09-26 22:00:08.201
*** SESSION ID:(76.2468) 2011-09-26 22:00:08.201
clsc_connect: (106648790) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_sunvs-a_crs))
2011-09-26 22:00:08.201: [ CSSCLNT]clsssInitNative: connect failed, rc 9
kgxgncin: CLSS init failed with status 3
kjfmsgr: unable to connect to NM for reg in shared group
ORA-29701: unable to connect to Cluster Manager
This RAC database has two nodes. Node 1 also threw this error when the application ran SQL statements, while node 2 was running normally. Here is the diagnostic output from both nodes:
Node 1:
oracle@sunvs-a@/oracle/oracle $ crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM
oracle@sunvs-a@/oracle/oracle $ cd /var/tmp/.oracle
oracle@sunvs-a@/var/tmp/.oracle $ ls -l
total 0
oracle@sunvs-a@/oracle/oracle $ crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
Node 2:
oracle@sunvs-b@/oracle/oracle $ crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
oracle@sunvs-b@/oracle/oracle $ cd /var/tmp/.oracle
oracle@sunvs-b@/var/tmp/.oracle $ ls -l
total 0
srwxrwxrwx 1 oracle oinstall 0 Feb 18 2011 s#12384.1
srwxrwxrwx 1 oracle oinstall 0 Feb 18 2011 sAsunvs-b_crs_evm
srwxrwxrwx 1 root root 0 Feb 18 2011 sCRSD_UI_SOCKET
srwxrwxrwx 1 oracle oinstall 0 Feb 18 2011 sCsunvs-b_crs_evm
srwxrwxrwx 1 oracle oinstall 0 Feb 18 2011 sOCSSD_LL_sunvs-b_crs
srwxrwxrwx 1 oracle oinstall 0 Feb 18 2011 sOracle_CSS_LclLstnr_crs_0
srwxrwxrwx 1 oracle oinstall 0 Feb 18 2011 sSYSTEM.evm.acceptor.auth
srwxrwxrwx 1 root root 0 Feb 18 2011 sora_crsqs
srwxrwxrwx 1 root root 0 Feb 18 2011 sprocr_local_conn_0_PROC
srwxrwxrwx 1 root root 0 Feb 18 2011 ssunvs-bDBG_CRSD
srwxrwxrwx 1 oracle oinstall 0 Feb 18 2011 ssunvs-bDBG_CSSD
srwxrwxrwx 1 oracle oinstall 0 Feb 18 2011 ssunvs-bDBG_EVMD
oracle@sunvs-b@/oracle/oracle $ crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora....s-a.gsd application ONLINE OFFLINE
ora....s-a.ons application ONLINE OFFLINE
ora....s-a.vip application ONLINE ONLINE sunvs-a
ora....s-b.gsd application ONLINE ONLINE sunvs-b
ora....s-b.ons application ONLINE ONLINE sunvs-b
ora....s-b.vip application ONLINE ONLINE sunvs-b
Notice that on the problem node both crsctl check crs and crs_stat produce errors, and, most tellingly, the /var/tmp/.oracle directory is empty. That is not normal.
We then checked the cron jobs:
sunvs-a/#crontab -l
#ident "@(#)root 1.21 04/03/23 SMI"
#
# The root crontab should be used to perform accounting data collection.
#
#
10 3 * * * /usr/sbin/logadm
15 3 * * 0 /usr/lib/fs/nfs/nfsfind
30 3 * * * [ -x /usr/lib/gss/gsscred_clean ] && /usr/lib/gss/gsscred_clean
#10 3 * * * /usr/lib/krb5/kprop_script ___slave_kdcs___
# Start of lines added by SUNWscr
20 4 * * 0 /usr/cluster/lib/sc/newcleventlog /var/cluster/logs/eventlog
20 4 * * 0 /usr/cluster/lib/sc/newcleventlog /var/cluster/logs/DS
20 4 * * 0 /usr/cluster/lib/sc/newcleventlog /var/cluster/logs/commandlog
0 0,4,8,12,16,20 * * * su - oracle /oracle/oracle/shell/Archive_clear.sh
# End of lines added by SUNWscr
0 2 * * * find /var/tmp -ctime +3 |xargs rm -f
It turned out that some developer had added, without authorization and running as root no less, a job that periodically deletes files under /var/tmp. A job like this is truly frightening. We commented it out immediately.
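If a periodic cleanup of /var/tmp really is wanted, it can at least be written to skip the socket directory. The sketch below is only an assumption about the job's intent (purging files older than three days); the clean_tmp helper and its parameters are hypothetical, and it relies on find's -path/-prune, which GNU find and POSIX.1-2008 provide but Solaris 10's /usr/bin/find may not:

```shell
# Hypothetical safer variant of the /var/tmp cleanup job: -prune stops
# find from descending into the .oracle socket directory, so the
# CSS/CRS/EVM IPC files survive.  clean_tmp and its "age" parameter
# are illustrative, not part of the original cron entry.
clean_tmp() {
    base="$1"          # directory to clean, e.g. /var/tmp
    age="${2:-+3}"     # find -ctime argument; default: older than 3 days
    # -exec ... {} + replaces the original unquoted `| xargs rm -f`
    # pipeline, which additionally breaks on filenames with whitespace.
    find "$base" -path "$base/.oracle" -prune -o \
         -type f -ctime "$age" -exec rm -f {} +
}

# Example cron entry (commented out; do not enable blindly):
# 0 2 * * * /root/bin/clean_tmp.sh /var/tmp
```

Even so, on a cluster node the safer default is to not let any automated job touch /var/tmp at all.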
The hidden directory /var/tmp/.oracle holds special socket files used for communicating with Oracle processes (the TNS listener; the CSS, CRS, and EVM daemons; ASM instances; and so on). They are created when those processes start, and they must never be deleted by hand.
Recall that on UNIX, if a process holds a file open, deleting the file (e.g. with rm) does not release its resources: the process can keep using it, although the file becomes invisible at the filesystem level. A process that then tries to open the deleted file again by name gets an error: No such file or directory. A process that tries to write to that name instead causes a brand-new file of the same name to be created, which is obviously not the file that was deleted. So if an Oracle process opened a socket file under /var/tmp/.oracle at startup, the file was later deleted, and the process then tries to reopen it, ORA-29701 is raised.
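This behaviour is easy to demonstrate from a shell. A minimal sketch, using a throwaway mktemp file rather than a real Oracle socket:

```shell
# Demonstrates the deleted-but-still-open semantics described above.
# Uses an ordinary temp file, not a socket; all paths are throwaway.
f=$(mktemp)
echo "hello" > "$f"

exec 3< "$f"    # hold an open descriptor on the file
rm -f "$f"      # unlink the name; the inode lives on while fd 3 is open

cat <&3         # the existing descriptor still reads the data: hello
exec 3<&-       # close fd 3; the inode is now truly released

# Re-opening by name fails, just like an Oracle process trying to
# reopen its deleted socket file:
cat "$f" 2>/dev/null || echo "No such file or directory"
```

Running it prints "hello" from the surviving descriptor, then "No such file or directory" for the re-open by name.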
The only fix is to restart the relevant processes (instance, listener, CRS) so that these special files are recreated; in a RAC environment that means restarting the entire CRS stack.
Since node 2 was running normally, we decided to reboot only node 1.
First, stop the database instance. Here shutdown immediate fails with the same error:
oracle@sunvs-a@/oracle/oracle $ export ORACLE_SID=C1
oracle@sunvs-a@/oracle/oracle $ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.2.0 - Production on Tue Sep 27 13:24:38 2011
Copyright (c) 1982, 2005, Oracle. All Rights Reserved.
Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
SQL> shutdown immediate;
ORA-29701: unable to connect to Cluster Manager
So we used shutdown abort instead.
Next, stop the other related processes, such as the listener and any application processes.
Then reboot the OS:
#sync
#sync
#reboot
After the machine came back up:
# $ORA_CRS_HOME/bin/crsctl start crs
#su - oracle
$lsnrctl start
A final check showed the system had returned to normal:
oracle@sunvs-a@/oracle/oracle $ crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
The lesson from this incident:
Whether in a test or a production environment, deletion is always a dangerous operation. Unless you are absolutely sure of what you are removing, it can have very serious consequences.
Source: ITPUB blog, http://blog.itpub.net/20750200/viewspace-708560/. Please credit the source when reposting.