Environment:
A two-node 11.2.0.4 RAC on RHEL 6 Update 5
[root@rac2 ~]# uname -a
Linux rac2 2.6.32-431.el6.x86_64 #1 SMP Sun Nov 10 22:19:54 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@rac2 ~]# uname -r
2.6.32-431.el6.x86_64
[oracle@rac2 ~]$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.188.18 rac1
192.168.188.19 rac2
192.168.188.20 rac3
192.168.188.118 rac1-vip
192.168.188.119 rac2-vip
192.168.188.120 rac3-vip
192.168.182.18 rac1-priv
192.168.182.19 rac2-priv
192.168.182.20 rac3-priv
192.168.188.105 scan
[oracle@rac2 ~]$
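While running dbca to add the instance on the third node, I hit the errors below, and the third DB instance could not be added.
An excerpt of the errors from /u01/app/11.2.0/grid/log/rac3/agent/crsd/oraagent_oracle/oraagent_oracle.log: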
2015-09-10 01:38:21.978: [ora.orcl.db][3571566336]{1:28142:484} [start] crsHome = /u01/app/11.2.0/grid
2015-09-10 01:38:21.978: [ora.orcl.db][3571566336]{1:28142:484} [start] oracleHome = /u02/app/oracle/product/11.2.0/dbhome_1
2015-09-10 01:38:21.978: [ora.orcl.db][3571566336]{1:28142:484} [start] command = '/u01/app/11.2.0/grid/bin/setasmgidwrap oracle_binary_path=/u02/app/oracle/product/11.2.0/dbhome_1/bin/oracle'
2015-09-10 01:38:21.979: [ora.orcl.db][3571566336]{1:28142:484} [start] start dependency = hard(ora.DATA.dg) weak(type:ora.listener.type,global:type:ora.scan_listener.type,uniform:ora.ons,global:ora.gns,ora.FRA.dg) pullup(ora.DATA.dg)
2015-09-10 01:38:21.979: [ora.orcl.db][3571566336]{1:28142:484} [start] ASM disk group dependency found
2015-09-10 01:38:21.979: [ora.orcl.db][3571566336]{1:28142:484} [start] Utils:execCmd action = 1 flags = 6 ohome = /u01/app/11.2.0/grid cmdname = setasmgidwrap.
2015-09-10 01:38:23.937: [ AGFW][3567363840]{1:28142:484} Agent received the message: RESOURCE_MODIFY_ATTR[ora.orcl.db 3 1] ID 4355:671
2015-09-10 01:38:50.992: [ora.orcl.db][3571566336]{1:28142:484} [start] execCmd ret = 0
2015-09-10 01:38:50.992: [ USRTHRD][3571566336]{1:28142:484} InstConnection::initMutex AttachLock 00ae3210 DetachLock 00ae3228
2015-09-10 01:38:50.994: [ora.orcl.db][3571566336]{1:28142:484} [start] clsnInstConnection::makeConnectStr UsrOraEnv m_oracleHome /u02/app/oracle/product/11.2.0/dbhome_1 Crshome /u01/app/11.2.0/grid
2015-09-10 01:38:50.994: [ora.orcl.db][3571566336]{1:28142:484} [start] makeConnectStr = (DESCRIPTION=(ADDRESS=(PROTOCOL=beq)(PROGRAM=/u02/app/oracle/product/11.2.0/dbhome_1/bin/oracle)(ARGV0=oracleorcl3)(ENVS='ORACLE_HOME=/u02/app/oracle/product/11.2.0/dbhome_1,ORACLE_SID=orcl3,LD_LIBRARY_PATH=')(ARGS='(DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))'))(CONNECT_DATA=(SID=orcl3)))
2015-09-10 01:38:51.223: [ora.orcl.db][3571566336]{1:28142:484} [start] Container:start oracle home /u02/app/oracle/product/11.2.0/dbhome_1
2015-09-10 01:38:51.224: [ora.orcl.db][3571566336]{1:28142:484} [start] InstConnection::connectInt: server not attached
2015-09-10 01:38:52.996: [ora.orcl.db][3571566336]{1:28142:484} [start] ORA-12547: TNS:lost contact
2015-09-10 01:38:53.030: [ora.orcl.db][3571566336]{1:28142:484} [start] InstConnection::connectInt (1) Exception OCIException
2015-09-10 01:38:53.032: [ora.orcl.db][3571566336]{1:28142:484} [start] InstConnection:connect:excp OCIException OCI error 12547
2015-09-10 01:38:53.033: [ora.orcl.db][3571566336]{1:28142:484} [start] InstConnection::connectInt: server not attached
2015-09-10 01:38:53.712: [ora.orcl.db][3571566336]{1:28142:484} [start] ORA-12547: TNS:lost contact
2015-09-10 01:38:53.713: [ora.orcl.db][3571566336]{1:28142:484} [start] InstConnection::connectInt (1) Exception OCIException
2015-09-10 01:38:53.713: [ora.orcl.db][3571566336]{1:28142:484} [start] InstAgent::start: 1 errcode 12547
2015-09-10 01:38:53.713: [ora.orcl.db][3571566336]{1:28142:484} [start] ConnectionPool::resetConnection s_statusOfConnectionMap 00ae9760
2015-09-10 01:38:53.713: [ora.orcl.db][3571566336]{1:28142:484} [start] ConnectionPool::resetConnection sid orcl3 status 2
2015-09-10 01:38:53.713: [ora.orcl.db][3571566336]{1:28142:484} [start] Gimh::check OH /u02/app/oracle/product/11.2.0/dbhome_1 SID orcl3
2015-09-10 01:38:53.754: [ora.orcl.db][3571566336]{1:28142:484} [start] GIMH: GIM-00104: Health check failed to connect to instance.
GIM-00090: OS-dependent operation:open failed with status: 2
GIM-00091: OS failure message: No such file or directory
GIM-00092: OS failure occurred at: sskgmsmr_7
2015-09-10 01:38:53.754: [ora.orcl.db][3571566336]{1:28142:484} [start] (:CLSN00007:)DbAgent::check failed gimh state 0
2015-09-10 01:38:53.763: [ora.orcl.db][3571566336]{1:28142:484} [start] clsnDbAgent:checkCbk clsagfw_res_status ret 5
2015-09-10 01:38:53.763: [ora.orcl.db][3571566336]{1:28142:484} [start] ConnectionPool::stopConnection
2015-09-10 01:38:53.763: [ora.orcl.db][3571566336]{1:28142:484} [start] ConnectionPool::removeConnection connection count 0
2015-09-10 01:38:53.763: [ora.orcl.db][3571566336]{1:28142:484} [start] ConnectionPool::removeConnection freed 0
2015-09-10 01:38:53.763: [ora.orcl.db][3571566336]{1:28142:484} [start] ConnectionPool::stopConnection sid orcl3 status 1
2015-09-10 01:38:53.763: [ora.orcl.db][3571566336]{1:28142:484} [start] InstAgent::check 1 prev clsagfw_res_status 0 current clsagfw_res_status 5
2015-09-10 01:38:53.764: [ora.orcl.db][3571566336]{1:28142:484} [start] InstAgent::start not logged on check state details Abnormal Termination
2015-09-10 01:38:53.764: [ora.orcl.db][3571566336]{1:28142:484} [start] InstAgent::start: ORA-1012 or Lost Contact try cleanOracleIpc and start force
2015-09-10 01:38:53.764: [ USRTHRD][3571566336]{1:28142:484} InstConnection:~InstConnection: this b00070c0
2015-09-10 01:38:53.766: [ora.orcl.db][3571566336]{1:28142:484} [start] InstAgent::start call sysresv
2015-09-10 01:38:53.766: [ora.orcl.db][3571566336]{1:28142:484} [start] Container:start scls_clean_oracle_ipc Container orcl3 dbHome /u02/app/oracle/product/11.2.0/dbhome_1
I searched MOS for the errors above but found nothing useful.
So I changed tack and logged in with sqlplus / as sysdba to see what error that would give:
[oracle@rac3 oracle]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Thu Sep 10 12:09:13 2015
Copyright (c) 1982, 2013, Oracle. All rights reserved.
ERROR:
ORA-12547: TNS:lost contact
Enter user-name:
ERROR:
ORA-12547: TNS:lost contact
Enter user-name:
ERROR:
ORA-12547: TNS:lost contact
SP2-0157: unable to CONNECT to ORACLE after 3 attempts, exiting SQL*Plus
[oracle@rac3 oracle]$
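Following the hint in the MOS note SYSDBA Connections Fail With ORA-12547 Error (Doc ID 782276.1),
I found many .trc files under $ORACLE_HOME/rdbms/log. An excerpt of one of them:
---- You may be wondering why I did not look under bdump. At this point the instance has not been created yet, so there is no bdump directory.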
[oracle@rac3 log]$ more orcl3_ora_14292.trc
Dump file /u02/app/oracle/product/11.2.0/dbhome_1/rdbms/log/orcl3_ora_14292.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
ORACLE_HOME = /u02/app/oracle/product/11.2.0/dbhome_1
System name: Linux
Node name: rac3
Release: 2.6.32-431.el6.x86_64
Version: #1 SMP Sun Nov 10 22:19:54 EST 2013
Machine: x86_64
Instance name: orcl3
Redo thread mounted by this instance: 0 <none>
Oracle process number: 0
Unix process pid: 14292, image: oracle@rac3
*** 2015-09-10 11:32:38.641
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=3, mask=0x0)
----- Error Stack Dump -----
ORA-00600: internal error code, arguments: [spstp: ORACLE_HOME uid does not match euid], [500], [1200], [], [], [], [], [], [], [], [], []
----- SQL Statement (None) -----
Current SQL information unavailable - no SGA.
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
skdstdst()+41 call kgdsdst() 000000000 ? 000000000 ?
7FFFB8AFF650 ? 7FFFB8AFF728 ?
7FFFB8B041D0 ? 000000002 ?
ksedst1()+103 call skdstdst() 000000000 ? 000000000 ?
7FFFB8AFF650 ? 7FFFB8AFF728 ?
7FFFB8B041D0 ? 000000002 ?
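This turned up the key error:
[spstp: ORACLE_HOME uid does not match euid], [500], [1200], [], [], [], [], [], [], [], [], []
Searching MOS for it led to the note ORA-600 [spstp: ORACLE_HOME uid does not match euid] When Changing Permissions On $ORACLE_HOME/bin/oracle (Doc ID 747456.1),
which says that in this error 500 is the uid and 1200 is the euid. Going by the wording of the message, 500 is the uid that owns ORACLE_HOME and 1200 is the effective uid of the process being started.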
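To pull the two numbers out of a trace file without reading them by eye, something like the following could be used. This is only a sketch: the function name parse_spstp is my own, and the pattern assumes the argument layout seen in the trace above (ORACLE_HOME owner uid first, euid second).

```python
import re

# Matches the first two ORA-00600 arguments after the spstp message,
# in the layout seen in the trace excerpt above (an assumption).
SPSTP_RE = re.compile(
    r"\[spstp: ORACLE_HOME uid does not match euid\],\s*\[(\d+)\],\s*\[(\d+)\]"
)


def parse_spstp(line):
    """Return (home_owner_uid, process_euid), or None if the line does not match."""
    m = SPSTP_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


trace_line = (
    "ORA-00600: internal error code, arguments: "
    "[spstp: ORACLE_HOME uid does not match euid], [500], [1200], "
    "[], [], [], [], [], [], [], [], []"
)
print(parse_spstp(trace_line))  # (500, 1200)
```

The two values can then be compared against the output of id for the oracle and grid users, which is the next step here.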
So I checked the id information of the oracle and grid users on this node:
[oracle@rac3 oracle]$ id oracle
uid=1200(oracle) gid=1000(oinstall) groups=1000(oinstall),1200(dba),1201(oper),1300(asmdba)
[oracle@rac3 oracle]$ id grid
uid=1100(grid) gid=1000(oinstall) groups=1000(oinstall),1200(dba),1100(asmadmin),1301(asmoper),1300(asmdba)
[oracle@rac3 oracle]$
There is no 500 anywhere in that output. So where does the 500 come from? I went on to check the owner of ORACLE_DB_HOME and found the problem:
[oracle@rac3 ~]$ pwd
/home/oracle
[oracle@rac3 ~]$ cd /u02/app/oracle/product/11.2.0/
[oracle@rac3 11.2.0]$ ls -lrt
total 4
drwxrwxr-x 74 500 oinstall 4096 Sep 10 01:12 dbhome_1
[oracle@rac3 11.2.0]$ cd ..
[oracle@rac3 product]$ ls -lrt
total 4
drwxrwxr-x 3 500 oinstall 4096 Sep 9 21:46 11.2.0
[oracle@rac3 product]$ cd ..
[oracle@rac3 oracle]$ ls -lrt
total 12
drwxrwxr-x 3 500 oinstall 4096 Sep 9 21:36 product ---------> the owner of product here is 500; this pinpoints the problem
drwxr-xr-x 3 oracle oinstall 4096 Sep 10 01:37 cfgtoollogs
drwxr-xr-x 3 oracle oinstall 4096 Sep 10 11:31 admin
[oracle@rac3 oracle]$ pwd
/u02/app/oracle
[oracle@rac3 oracle]$
After changing the owner to oracle, adding the node went through without problems.
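Before changing ownership it can help to list exactly which paths still carry the wrong uid. Here is a minimal sketch, assuming a helper of my own called find_wrong_owner; it only reports paths and changes nothing.

```python
import os


def find_wrong_owner(root, expected_uid):
    """Yield (path, uid) for root and every entry under it not owned by expected_uid.

    Uses lstat so symbolic links are reported by their own owner and not followed.
    """
    st = os.lstat(root)
    if st.st_uid != expected_uid:
        yield root, st.st_uid
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            uid = os.lstat(path).st_uid
            if uid != expected_uid:
                yield path, uid
```

In this case it would be called as find_wrong_owner("/u02/app/oracle/product", 1200), and any path still reported with uid 500 would be a leftover from the old uid. Some files in an Oracle home may legitimately be owned by root, so the list is worth reviewing before changing anything.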
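To sum up: /u02/app/oracle/product showed 500 as its owner because the oracle user on rac3 originally had uid 500, while on the other two nodes the oracle user's uid is 1200. As everyone knows, RAC does not work when the uid differs between nodes. So the uid on rac3 was changed, but files created earlier kept the old numeric uid, and adding the node was started without fixing the owner of /u02/app/oracle/product. The rest is the story above.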
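A check like this before adding a node would have caught the mismatch early. The sketch below compares the output of id oracle collected from each node. The helper names are my own, and how the output is gathered (for example over ssh) is left out.

```python
import re
from collections import Counter


def parse_id(output):
    """Parse one line of `id <user>` output into (uid, gid)."""
    m = re.match(r"uid=(\d+)(?:\([^)]*\))?\s+gid=(\d+)", output.strip())
    if m is None:
        raise ValueError("unrecognized id output: %r" % output)
    return int(m.group(1)), int(m.group(2))


def inconsistent_nodes(id_by_node):
    """Return {node: (uid, gid)} for nodes that differ from the most common pair."""
    parsed = {node: parse_id(out) for node, out in id_by_node.items()}
    reference = Counter(parsed.values()).most_common(1)[0][0]
    return {node: ids for node, ids in parsed.items() if ids != reference}
```

Given the id output from rac1, rac2 and rac3 as a dictionary keyed by node name, an empty result means the uid and gid agree everywhere. With rac3 still at its original uid of 500, rac3 would be the only node reported.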