Doc ID: 844272.1 Troubleshooting when srvctl can't start a RAC instance, but sqlplus can start it

Subject: Troubleshooting when srvctl can't start a RAC instance, but sqlplus can start it
Doc ID: 844272.1 Type: TROUBLESHOOTING
Modified Date: 30-JUL-2009 Status: PUBLISHED

In this Document
Purpose
Last Review Date
Instructions for the Reader
Troubleshooting Details
I. Introduction
II. Troubleshooting
References


Applies to:

Oracle Server - Enterprise Edition - Version: 10.1.0.2 to 11.1.0.7
Information in this document applies to any platform.

Purpose

This note is intended to help DBAs troubleshoot instance startup problems on RAC systems when srvctl can't be used to start the instance(s), but sqlplus works. It is mainly written for RAC on Unix-based systems, but can be used to a certain extent for RAC on Windows systems.

Last Review Date

June 17, 2009

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

I. Introduction

srvctl is the client utility used to start/stop RAC instances and databases. The srvctl command forwards the start request to the crsd daemon, which performs the start/stop of the instances/database on the different nodes. The result of the start/stop request is then returned to the srvctl client. The instance start job can be broken down into the following parts:

a. Initiate the instance startup

Start dependent resources like ons/listeners/asm in case they are not started yet. ASM, for example, is a required resource when it is used by the rdbms instance, so the clusterware must know ASM as started before an instance can be started via the srvctl command. It is not necessary to prestart the dependent resources manually: CRS knows those dependencies and will autostart those resources first. If a dependent resource like the ASM instance can't be started, no further attempt is made to start the rdbms instance.

Prepare the init.ora to use under $CRS_HOME/racg/tmp/ora...inst.ora
(retrieved from the srvctl config -database -d -a command).

b. Start the racgimon to monitor the instance

Start the racgimon process that will monitor the instance and collect statistics like service metrics.

c. Start the instance

This is done via a spawned sqlplus session. Hence, when an instance can't be started via sqlplus, it can't be started via srvctl either.

d. racgimon connects to the newly started instance

The racgimon instance monitor connects to the new instance. On success, racgimon sends this information to evmd and ons to tell the whole cluster and its users that the instance is started. racgimon then continues to collect service metrics and monitor the instance.

e. Stop the instance in case the instance startup failed somewhere

In case of instance startup failure, an attempt is made to stop the instance and racgimon to clear leftover resources. The main log to check for the reason of the instance failure is in the $ORACLE_HOME/log//racg directory (the $ORACLE_HOME of the rdbms or asm instance) under the name 'imon_.log'. When the rdbms is version 10.1, the $ORACLE_HOME/racg/dump directory is used instead.

II. Troubleshooting

Check 1. Check that the srvctl start command is received by the crsd daemon

When the 'srvctl start' command is executed, the crsd log should show a line reporting that it attempts to start the instance, e.g.

2009-03-04 11:42:14.889: [ CRSRES][35737] Attempting to start `ora.ORCL.ORCL1.inst` on member `machine1`

The crsd log is $CRS_HOME/log//crsd/crsd.log in 10gR2 and later, and is under $CRS_HOME/crs/log in 10gR1. If such a line appears, skip this section and go to Check 2. If no such line appears, follow the troubleshooting checks below.
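Scanning crsd.log for the start attempt can be scripted. The following is a minimal sketch, not an Oracle tool: the function name and the regular expression are assumptions derived only from the single sample line shown above, and real crsd.log lines may differ between versions.

```python
import re

# Pattern inferred from the one sample line in this note. The quoting
# characters around the names are matched loosely (backtick or apostrophe).
START_ATTEMPT = re.compile(
    r"Attempting to start [`'](?P<resource>[^`']+)[`'] "
    r"on member [`'](?P<member>[^`']+)[`']"
)


def find_start_attempts(log_text, resource_name):
    """Return the members on which crsd logged a start attempt for resource_name."""
    members = []
    for line in log_text.splitlines():
        match = START_ATTEMPT.search(line)
        if match and match.group("resource") == resource_name:
            members.append(match.group("member"))
    return members


sample = (
    "2009-03-04 11:42:14.889: [ CRSRES][35737] Attempting to start "
    "`ora.ORCL.ORCL1.inst` on member `machine1`"
)
print(find_start_attempts(sample, "ora.ORCL.ORCL1.inst"))
```

An empty result for the instance resource right after a 'srvctl start' suggests the request never reached crsd, which is the case handled by the checks that follow.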

a. The srvctl may be malfunctioning

In case srvctl doesn't report CRS-* errors such as

srvctl start database -d V120
PRKP-1001 : Error starting instance V1201 on node lnx10gr2n1
CRS-0233: Resource or relatives are currently involved with another operation.
PRKP-1001 : Error starting instance V1202 on node lnx10gr2n2
CRS-0233: Resource or relatives are currently involved with another operation

but only reports PRKP-* errors, then most likely the srvctl java code is malfunctioning.

There are two srvctl commands (one in the ORACLE_HOME and one in the ORA_CRS_HOME). It is best to check whether both give the same error, and to trace srvctl by setting the environment variable SRVM_TRACE and checking the trace for errors.

The srvctl version needs to match the version of the database to be started (e.g. see Note 279429.1).
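The distinction drawn in this subsection (CRS-* errors present versus PRKP-* errors only) can be checked mechanically on captured srvctl output. This is an illustrative sketch: the function name and the returned labels are hypothetical, and the message format is assumed from the sample output above.

```python
import re

# Matches message codes such as PRKP-1001, PRKO-2015 or CRS-0233.
ERROR_CODE = re.compile(r"\b(PRK[A-Z]|CRS)-(\d+)")


def classify_srvctl_output(output):
    """Label srvctl output by the families of error codes it contains."""
    prefixes = {match.group(1) for match in ERROR_CODE.finditer(output)}
    if "CRS" in prefixes:
        # crsd answered with an error, so the request reached the clusterware.
        return "crs-errors-present"
    if prefixes:
        # Only PRK* codes: per this note, suspect srvctl itself.
        return "prk-only"
    return "no-errors"


sample = """PRKP-1001 : Error starting instance V1201 on node lnx10gr2n1
CRS-0233: Resource or relatives are currently involved with another operation."""
print(classify_srvctl_output(sample))
```

A "prk-only" result is the case where tracing srvctl with SRVM_TRACE is the natural next step.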

b. srvctl appears to function correctly but crsd doesn't show startup attempts

srvctl shows CRS-* errors or hangs because crsd doesn't respond to its command requests.

=> Check whether other resources that use a different ORACLE_HOME, e.g. the CRS_HOME, can be started/stopped (e.g. the nodeapps).

If nodeapps can't be stopped/started, the rdbms startup problem is part of a more general resource startup problem. The cause can be that the crsd.bin daemon hangs or is malfunctioning. The crsd.log in $CRS_HOME/log//crsd/crsd.log needs to be reviewed.

Also check whether the clusterware is responding, e.g. crsctl check crs should report

crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy

on all nodes. Restarting the crsd.bin (as mentioned in Note 726925.1) can be a troubleshooting step.
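When the check has to be repeated on several nodes, the 'crsctl check crs' output can be verified by a small script. The sketch below assumes the three-line output format shown above. The function name is hypothetical, and capturing the command output (for example with subprocess on each node) is left out.

```python
EXPECTED_COMPONENTS = ("CSS", "CRS", "EVM")


def unhealthy_components(check_output):
    """Return the components that did not report 'appears healthy'."""
    healthy = set()
    for line in check_output.splitlines():
        words = line.split()
        if len(words) == 3 and words[1:] == ["appears", "healthy"]:
            healthy.add(words[0])
    # Anything expected but not seen as healthy is reported.
    return [name for name in EXPECTED_COMPONENTS if name not in healthy]


sample = """CSS appears healthy
CRS appears healthy
EVM appears healthy"""
print(unhealthy_components(sample))
```

An empty list means all three daemons answered as healthy. Any name in the list points to the daemon to investigate first.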

=> Check whether other resources that use the same ORACLE_HOME can be started/stopped (e.g. listeners)

If nodeapps can be started/stopped, the crsd daemon appears to work and accepts start/stop commands. However, it is possible that none of the resources from one ORACLE_HOME can be started. Most likely, there is

1. a racgwrap problem (see Note 740319.1)

2. a permission problem in that ORACLE_HOME

The startup of the instance is done as the user mentioned via the command:

crs_getperm ora...inst

In case the scripts used to start the instance can't write in the $ORACLE_HOME/log//racg directory, then the clusterware will not be able to start the instance and it will reach the status 'UNKNOWN' (i.e. it can't be stopped either) (see Note 741212.1)

Check that all files with the database name in their name are owned by the oracle user, e.g.

cd $ORACLE_HOME

find . -name '*ORCL*'
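The find command above only lists the files, so ownership still has to be read off by hand. As a rough sketch, the same walk can report only the entries whose owner differs from an expected uid. The function name is hypothetical, and the uid of the oracle user has to be supplied by the caller.

```python
import fnmatch
import os


def files_not_owned_by(root_dir, pattern, expected_uid):
    """List paths under root_dir whose name matches pattern and whose owner differs."""
    mismatched = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for name in dirnames + filenames:
            if fnmatch.fnmatchcase(name, pattern):
                path = os.path.join(dirpath, name)
                # lstat inspects the entry itself and does not follow symlinks.
                if os.lstat(path).st_uid != expected_uid:
                    mismatched.append(path)
    return sorted(mismatched)


# Example call (values are placeholders, not taken from this note):
#   files_not_owned_by(os.environ["ORACLE_HOME"], "*ORCL*", oracle_uid)
```

The uid can be looked up with pwd.getpwnam() from the Python standard library on Unix systems.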

Check 2. Check that the status of the instance in the cluster is OFFLINE before starting it

If the instance resource reaches the state 'UNKNOWN' (viewable via crs_stat -t), the instance can't be started without manual intervention:

crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora....01.inst application ONLINE UNKNOWN lnx10gr2n1
ora....02.inst application OFFLINE OFFLINE
ora.V120.db application OFFLINE OFFLINE

$ srvctl start database -d fsqatr
PRKP-1001 : Error starting instance fsqatr1 on node adc17
CRS-1028: Dependency analysis failed because of:
CRS-0223: Resource 'ora.fsqatr.fsqatr1.inst' has placement error.

srvctl status database -d V120
PRKO-2015 : Error in checking condition of instance on node: lnx10gr2n1
Instance V1202 is not running on node lnx10gr2n2

To move the instance resource from UNKNOWN to OFFLINE, stop it with either:

srvctl stop instance -d V120 -i V1201
or
crs_stop -f ora.fsqatr.fsqatr1.inst

Once the instance is in state OFFLINE, a restart via srvctl can be attempted.

Always check that no leftover instance processes are running (via 'ps -ef | grep '). Leftover processes can block the semaphore and shared memory segment used by the failed instance startup and prevent further instance startups.
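Spotting resources in the UNKNOWN state can be automated by parsing the 'crs_stat -t' table. This is a minimal sketch that assumes the whitespace-separated column layout of the sample output above (Name, Type, Target, State, Host). The function name is hypothetical, and 'crs_stat -t' abbreviates long resource names, so the names returned may be truncated.

```python
def resources_in_state(crs_stat_output, state="UNKNOWN"):
    """Return (name, host) pairs for resources whose State column equals state."""
    found = []
    for line in crs_stat_output.splitlines():
        fields = line.split()
        # Data rows start with "ora." and carry name, type, target, state.
        if len(fields) >= 4 and fields[0].startswith("ora."):
            if fields[3] == state:
                # The Host column is empty for offline resources.
                host = fields[4] if len(fields) > 4 else None
                found.append((fields[0], host))
    return found


sample = """Name Type Target State Host
------------------------------------------------------------
ora....01.inst application ONLINE UNKNOWN lnx10gr2n1
ora....02.inst application OFFLINE OFFLINE
ora.V120.db application OFFLINE OFFLINE"""
print(resources_in_state(sample))
```

Any resource returned here needs to be stopped before srvctl can start it again.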

Check 3. Check that the instance can be started via srvctl when it was prestarted via sqlplus

When the instance is started via sqlplus, it should be detected as started by the clusterware. If the clusterware does not detect it as started, i.e. the instance was started via sqlplus and 'srvctl start instance -d -i ' does not let the clusterware detect the instance as started, then the problem is most likely that the racgimon process can't connect to the instance. The same problem can prevent the instance from being started via srvctl when the instance is down.

racgimon uses net8 to connect to the instance, so net8 configuration problems can prevent the connection, e.g.
=> sqlnet.inbound_connect_timeout and Note 402437.1

=> permission problems and Note 391453.1

=> sqlnet.ora misconfiguration and Note 387526.1

=> left over debugging info and Note 605540.1

The racgimon logging in the $ORACLE_HOME/log/

Check 4. Check that the database is correctly configured in the clusterware

a. The database configuration can be seen via 'srvctl config database -d -a ', e.g.

srvctl config database -d V120 -a
lnx10gr2n1 V1201 /home/oracle/oracle/product/10.2.0/db_1
lnx10gr2n2 V1202 /home/oracle/oracle/product/10.2.0/db_1
DB_NAME: null
ORACLE_HOME: /home/oracle/oracle/product/10.2.0/db_1
SPFILE: /ocfs/admin/V120/pfile/spfileV120.ora
DOMAIN: null
DB_ROLE: null
START_OPTIONS: null
POLICY: AUTOMATIC
ENABLE FLAG: DB ENABLED

The START_OPTIONS can be incorrectly set (see Note 311321.1), as can the ORACLE_HOME and SPFILE.
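To compare these settings with what sqlplus uses, it can help to load the 'srvctl config database' output into a dictionary. The sketch below assumes the "KEY: value" layout of the sample output above and Unix-style paths (no colon in the per-instance lines). The function name is hypothetical.

```python
def parse_srvctl_config(output):
    """Parse 'KEY: value' lines into a dict; the literal 'null' becomes None."""
    settings = {}
    for line in output.splitlines():
        key, separator, value = line.partition(":")
        if not separator:
            # Per-instance lines (node, instance, home) have no colon.
            continue
        value = value.strip()
        settings[key.strip()] = None if value == "null" else value
    return settings


sample = """lnx10gr2n1 V1201 /home/oracle/oracle/product/10.2.0/db_1
ORACLE_HOME: /home/oracle/oracle/product/10.2.0/db_1
SPFILE: /ocfs/admin/V120/pfile/spfileV120.ora
START_OPTIONS: null
ENABLE FLAG: DB ENABLED"""
config = parse_srvctl_config(sample)
print(config["SPFILE"], config["START_OPTIONS"])
```

The resulting ORACLE_HOME and SPFILE values can then be compared with the environment of the session in which sqlplus starts the instance successfully.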

The spfile/pfile used by sqlplus when starting the instance manually is in $ORACLE_HOME/dbs (see Note 162491.1) and is thus potentially not the same as the spfile defined in the CRS.

The clusterware uses the spfile that is copied to the $CRS_HOME/racg/tmp/.ora file before starting the instance with it. Check whether the instance can be started with that file, e.g.

sqlplus /nolog
connect / as sysdba
startup pfile=/oracle/crs/racg/tmp/ora.orcl.orcl1.inst.ora

If sqlplus can't start the instance with that clusterware copy, check for differences between it and the spfile/pfile used by sqlplus, and correct them in the file shown by the 'srvctl config database -d -a' command. Use Note 137483.1 to correct it.

The log $ORACLE_HOME/log//racg/imon_.log contains the error reported by the sqlplus startup attempt (see Note 732683.1 and Note 360575.1).
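Looking for differences between the two parameter files can be done with a simple comparison. The sketch below assumes both files are available as plain-text pfiles with one name=value pair per line (a binary spfile would first have to be exported to a pfile). It does not handle '#' characters inside quoted values, and the function names are hypothetical.

```python
def parse_pfile(text):
    """Parse name=value lines of a text pfile, ignoring '#' comments."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Parameter names are case-insensitive, so normalise them.
        params[name.strip().lower()] = value.strip()
    return params


def pfile_differences(clusterware_text, sqlplus_text):
    """Map each differing parameter to (clusterware value, sqlplus value)."""
    first = parse_pfile(clusterware_text)
    second = parse_pfile(sqlplus_text)
    return {
        name: (first.get(name), second.get(name))
        for name in sorted(set(first) | set(second))
        if first.get(name) != second.get(name)
    }


clusterware_copy = "*.sga_target=2G\n*.processes=150\n"
sqlplus_copy = "*.sga_target=4G  # larger SGA\n*.processes=150\n*.cluster_database=true\n"
print(pfile_differences(clusterware_copy, sqlplus_copy))
```

A value of None on one side means the parameter is set in only one of the two files. The parameter values in the example are made up for illustration.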

b. Database resources should be viewable

The 'crs_stat' command should show one *.db resource plus one *.inst resource for each defined instance. If they are not visible, it is best to recreate the database configuration in the clusterware (e.g. see Note 455226.1).

Check 5. Check the system settings of the root user compared to the oracle user

Since the sqlplus used to start the instance is launched by the crsd.bin daemon process, the OS limits of the crsd daemon are inherited by that sqlplus session. Since crsd.bin is started as root, the OS limits applicable to the root user are used instead of the ones set for the oracle user.

If OS limits are set on root but not on the oracle user, it is possible that the instance can't be started via srvctl (see Note 367442.1 and Note 753516.1) but can be started with sqlplus as the oracle user.
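One way to compare the two sets of limits is to dump them once as root and once as the oracle user, then diff the results. The sketch below uses Python's standard resource module (Unix only). The selection of limits is an assumption made for illustration, since this note does not say which limit is involved, and the function names are hypothetical.

```python
import resource

# Limits chosen for illustration only; adjust to the limits under suspicion.
LIMIT_NAMES = ("RLIMIT_NOFILE", "RLIMIT_STACK", "RLIMIT_DATA", "RLIMIT_MEMLOCK")


def current_limits():
    """Return {limit name: (soft, hard)} for the current process."""
    limits = {}
    for name in LIMIT_NAMES:
        constant = getattr(resource, name, None)
        if constant is not None:  # not every platform defines every limit
            limits[name] = resource.getrlimit(constant)
    return limits


def differing_limits(root_limits, oracle_limits):
    """Map each limit that differs to (root value, oracle value)."""
    return {
        name: (root_limits.get(name), oracle_limits.get(name))
        for name in sorted(set(root_limits) | set(oracle_limits))
        if root_limits.get(name) != oracle_limits.get(name)
    }


print(current_limits())
```

Note that this shows the limits of the shell that runs the script. The limits actually inherited by the running crsd.bin process may differ from those of a fresh root login.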

To change the instance state from UNKNOWN to OFFLINE, the resource needs to be stopped via either 'srvctl stop instance' or 'crs_stop -f', as described in Check 2.

References

Note 162491.1 - Startup of an Oracle 9i, 10g, 11g Instance using SPFile or Init.ora Parameter File
Note 279429.1 - PRKR-1007, PRKO-2005 using the 10g srvctl on a 9i database
Note 311321.1 - Srvctl Cannot Start database
Note 360575.1 - CRS-0215: Srvctl Cant Start Instance But Sqlplus Can
Note 367442.1 - 'srvctl' Unable to Start Large SGA Instance ORA-27102
Note 387526.1 - Getting PRKP-1001, CRS-1005, CRS-0223 When Trying to Startup Instance Using Srvctl
Note 391453.1 - SRVCTL does not work when RACGIMON process cannot connect to the DB
Note 402437.1 - SRVCTL: PRKP-1001 : CRS-0215: Could not start resource, racgimon killed by NS
Note 455226.1 - Database / Instances Not Starting With the srvctl Command, Errors PRKP-1001 & CRS-0212
Note 605540.1 - Can not Start Instance, Get ORA-3113 via sqlplus and PRKP-1001 CRS-215 via srvctl
Note 726925.1 - srvctl start instance fails with PRKP-1001; srvctl trace shows error connecting to CRSD
Note 732683.1 - Cannot start instance using srvctl but sqlplus can
Note 740319.1 - CRS-215 Srvctl unable to start ASM, Listener, RDBMS Resources
Note 741212.1 - Cannot start instance using srvctl, no info in imon logs
Note 753516.1 - The difference between using srvctl vs using sqlplus for start/stop one or more database nodes

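Summary:

1: srvctl is a client program that sends command requests to the crsd daemon

2: When a srvctl command is issued, it is recorded in the crsd log, $CRS_HOME/log//crsd/crsd.log

3: The racgimon process monitors the state of the instance; its log is:

$ORACLE_HOME/log//racg/imon.log

4: Check that the cluster components are in a healthy state

#crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy

5: Before starting a cluster resource, make sure the resource is OFFLINE; otherwise leftover processes may block the semaphore and shared memory segment

6: When srvctl cannot start the nodeapps resources and a resource is in the UNKNOWN state, you can try crs_stop and crs_start; it is recommended to run these two commands under the guidance of Oracle Support

#crs_stop -f nodeapp

#crs_start nodeapp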


From "ITPUB Blog", link: http://blog.itpub.net/9225895/viewspace-1027275/. If you repost this, please credit the source; otherwise legal liability will be pursued.
