Handling a root.sh failure on the second node when installing a RAC cluster

Michael_A


CLSRSC-507: The root script cannot proceed on this node <node-n> (Doc ID 1919825.1)

Purpose

Details

Case 1: root script didn't succeed on first node

Case 2: root script completed on first node but other nodes fail to obtain the status due to ocrdump issue

Case 2.1 ocrdump fails due to error AMDU-00201 and AMDU-00200

Solution:

Case 2.2 ocrdump fails: AMDU-00210 AMDU-00205 AMDU-00201 AMDU-00407 asmlib error asm_close asm_open

Solution:

Case 2.3 ocrdump fails as amdu core dumped

Solution:

Case 2.4 same disk name points to different storage on different nodes

Solution:

Case 2.5 same storage sub-system is shared by different clusters and same diskgroup name exists in more than one cluster

Case 2.6 root user is seeing the same physical disks multiple times because of different paths

Case 2.7 ocrdump fails with AMDU-00210 on Windows environment

Case 3: root script completed on first node but other nodes fail to obtain the status as ocrdump wasn't executed

References

Applies to:

Oracle Database - Enterprise Edition - Version 12.1.0.2 and later
Information in this document applies to any platform.

Purpose

The note lists known issues regarding the following error:

CLSRSC-507: The root script cannot proceed on this node <non-first_node> because either the first-node operations have not completed on node <first_node> or there was an error in obtaining the status of the first-node operations.

Details

Case 1: root script didn't succeed on first node

The Grid Infrastructure root script (root.sh or rootupgrade.sh) must complete successfully on node1 (the first node) before it can be run on other nodes. The first node is the one on which runInstaller/config.sh was run. This requirement is new in 12.1.0.2.

If this is the case, complete root script on node1 before running it on other nodes.

Case 2: root script completed on first node but other nodes fail to obtain the status due to ocrdump issue

In this case, it's confirmed that root script finished on node1:

<NEW_GI_HOME>/cfgtoollogs/crsconfig/rootcrs_<node>_<timestamp>.log

2014-08-22 10:23:10: Invoking "/opt/ogrid/12.1.0.2/bin/cluutil -exec -ocrsetval -key SYSTEM.rootcrs.checkpoints.firstnode -value SUCCESS"
2014-08-22 10:23:10: trace file=/opt/oracle/crsdata/inari/crsconfig/cluutil0.log
2014-08-22 10:23:10: Executing cmd: /opt/ogrid/12.1.0.2/bin/cluutil -exec -ocrsetval -key SYSTEM.rootcrs.checkpoints.firstnode -value SUCCESS
2014-08-22 10:23:10: Succeeded in writing the key pair (SYSTEM.rootcrs.checkpoints.firstnode:SUCCESS) to OCR
2014-08-22 10:23:10: Executing cmd: /opt/ogrid/12.1.0.2/bin/clsecho -p has -f clsrsc -m 325
2014-08-22 10:23:10: Command output:
> CLSRSC-325: Configure Oracle Grid Infrastructure for a Cluster ... succeeded
>End Command output
2014-08-22 10:23:10: CLSRSC-325: Configure Oracle Grid Infrastructure for a Cluster ... succeeded
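
The check above can be scripted. The following is a minimal sketch, not part of the original note: the function name first_node_checkpoint_ok is hypothetical, and it assumes the rootcrs log on the first node contains the "Succeeded in writing the key pair" line exactly as shown in the excerpt.

```shell
# Hypothetical helper: exits 0 when the given rootcrs log shows that the
# first-node checkpoint was written to OCR with the value SUCCESS.
first_node_checkpoint_ok() {
    log_file=$1
    # -F matches the text literally, -q suppresses output (exit status only).
    grep -Fq 'Succeeded in writing the key pair (SYSTEM.rootcrs.checkpoints.firstnode:SUCCESS) to OCR' "$log_file"
}
```

It could be called with the path of the first node's rootcrs log, for example: first_node_checkpoint_ok "$log" && echo "first node completed".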

And the root script fails on other nodes because ocrdump failed:

<NEW_GI_HOME>/cfgtoollogs/crsconfig/rootcrs_<node>_<timestamp>.log

2014-09-04 13:45:34: ASM_DISKS=ORCL:OCR01,ORCL:OCR02,ORCL:OCR03
....
2014-09-04 13:46:04: Check the existence of global ckpt 'checkpoints.firstnode'
2014-09-04 13:46:04: setting ORAASM_UPGRADE to 1
2014-09-04 13:46:04: Invoking "/product/app/12.1.0.2/grid/bin/cluutil -exec -keyexists -key checkpoints.firstnode"
2014-09-04 13:46:04: trace file=/product/app/grid/crsdata/sipr0-db04/crsconfig/cluutil8.log
2014-09-04 13:46:04: Running as user grid: /product/app/12.1.0.2/grid/bin/cluutil -exec -keyexists -key checkpoints.firstnode
2014-09-04 13:46:04: s_run_as_user2: Running /bin/su grid -c ' echo CLSRSC_START; /product/app/12.1.0.2/grid/bin/cluutil -exec -keyexists -key checkpoints.firstnode '
2014-09-04 13:46:05: Removing file /tmp/fileRiu5NI
2014-09-04 13:46:05: Successfully removed file: /tmp/fileRiu5NI
2014-09-04 13:46:05: pipe exit code: 256
2014-09-04 13:46:05: /bin/su exited with rc=1

2014-09-04 13:46:05: oracle.ops.mgmt.rawdevice.OCRException: PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]

2014-09-04 13:46:05: Cannot get OCR key with CLUUTIL, try using OCRDUMP.
2014-09-04 13:46:05: Check OCR key using ocrdump
2014-09-04 13:46:22: ocrdump output: PROT-302: Failed to initialize ocrdump

2014-09-04 13:46:22: The key pair with keyname: SYSTEM.rootcrs.checkpoints.firstnode does not exist in OCR.
2014-09-04 13:46:22: Checking a remote host sipr0-db03 for reachability...

Case 2.1 ocrdump fails due to error AMDU-00201 and AMDU-00200

<ADR_HOME>/crs/<node>/crs/trace/ocrdump_<pid>.trc

2014-09-04 13:46:14.044274 : OCRASM: proprasmo: ASM instance is down. Proceed to open the file in dirty mode.

CLWAL: clsw_Initialize: Error [32] from procr_init_ext
CLWAL: clsw_Initialize: Error [PROCL-32: Oracle High Availability Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]] from procr_init_ext
2014-09-04 13:46:14.050831 : GPNP: clsgpnpkww_initclswcx: [at clsgpnpkww.c:351] Result: (56) CLSGPNP_OCR_INIT. (:GPNP01201:)Failed to init CLSW-OLR context. CLSW Error (3): CLSW-3: Error in the cluster registry (OCR) layer. [32] [PROCL-32: Oracle High Availability Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]]
2014-09-04 13:46:14.093544 : OCRASM: proprasmo: Error [13] in opening the GPNP profile. Try to get offline profile
2014-09-04 13:46:16.210708 : OCRRAW: kgfo_kge2slos error stack at kgfolclcpi1: AMDU-00200: Unable to read [32768] bytes from Disk N0050 at offset [140737488355328]
AMDU-00201: Disk N0050: '/dev/sdg'
AMDU-00200: Unable to read [32768] bytes from Disk N0049 at offset [140737488355328]
AMDU-00201: Disk N0049: '/dev/sdf'
AMDU-00200: Unable to read [32768] bytes from Disk N0048 at offset [140737488355328]
AMDU-00201: Disk N0048: '/dev/sde'
AMDU-00200: Unable to read [32768] bytes from Disk N0035 at offset [140737488355328]
AMDU-00201: Disk N0035: '/dev/sdaw'
AMDU-00200: Unable to read [32768] bytes from Disk N0024 at offset [140737488355328]
AMDU-00201: Disk N0024: '/dev/sdaq'
....

2014-09-04 13:46:16.212934 : OCRASM: proprasmo: Failed to open file in dirty mode

2014-09-04 13:46:16.212964 : OCRASM: proprasmo: dgname is [OCRVOTE] : discoverystring []
2014-09-04 13:46:16.212990 : OCRASM: proprasmo: Error in open/create file in dg [OCRVOTE]
OCRASM: SLOS : SLOS: cat=8, opn=kgfolclcpi1, dep=200, loc=kgfokge

2014-09-04 13:46:16.213075 : OCRASM: ASM Error Stack :

....
2014-09-04 13:46:22.690905 : OCRASM: proprasmo: kgfoCheckMount returned [7]
2014-09-04 13:46:22.690933 : OCRASM: proprasmo: The ASM instance is down
2014-09-04 13:46:22.692150 : OCRRAW: proprioo: Failed to open [+OCRVOTE/sipr0-dbhv1/OCRFILE/registry.255.857389203]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2014-09-04 13:46:22.692204 : OCRRAW: proprioo: No OCR/OLR devices are usable
2014-09-04 13:46:22.692239 : OCRRAW: proprinit: Could not open raw device
2014-09-04 13:46:22.692561 : default: a_init:7!: Backend init unsuccessful : [26]
2014-09-04 13:46:22.692777 : OCRDUMP: Failed to initailized OCR context. Error [PROC-26: Error while accessing the physical storage
] [26].
2014-09-04 13:46:22.692822 : OCRDUMP: Failed to initialize ocrdump stage 2
2014-09-04 13:46:22.692864 : OCRDUMP: Exiting [status=failed]...
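
When the trace is long, it can help to pull out just the disks that AMDU names. The sketch below is an illustration only: the function name amdu_reported_disks is hypothetical, and it assumes AMDU-00201 lines have the form shown in the trace above (Disk <id>: '<path>').

```shell
# Hypothetical helper: print the unique disk paths named on AMDU-00201
# lines of an ocrdump trace file.
amdu_reported_disks() {
    trace_file=$1
    # Keep only the quoted path from lines such as:
    #   AMDU-00201: Disk N0050: '/dev/sdg'
    sed -n "s/.*AMDU-00201: Disk [^:]*: '\([^']*\)'.*/\1/p" "$trace_file" | sort -u
}
```

The resulting list can then be checked on the failing node, for example for ownership, permissions and readability of each device.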

Solution:

The solution is to apply patch 18456643, then re-run root script.

Case 2.2 ocrdump fails: AMDU-00210 AMDU-00205 AMDU-00201 AMDU-00407 asmlib error asm_close asm_open

<ADR_HOME>/crs/<node>/crs/trace/ocrdump_<pid>.trc

OCRASM: proprasmo: ASM instance is down. Proceed to open the file in dirty mode.

2014-09-09 13:52:04.131609 : OCRRAW: kgfo_kge2slos error stack at kgfolclcpi1: AMDU-00210: No disks found in diskgroup CRSGRP
AMDU-00210: No disks found in diskgroup CRSGRP
AMDU-00205: Disk N0033 open failed during deep discovery.
AMDU-00201: Disk N0033: 'ORCL:REDOA'
AMDU-00407: asmlib error!! function = [asm_close], error = [0], mesg = [Invalid argument]
AMDU-00407: asmlib error!! function = [asm_open], error = [0], mesg = [Operation not permitted]
....

2014-09-09 13:52:04.131691 : OCRRAW: kgfoOpenDirty: dg=CRSGRP diskstring= filename=/opt/oracle/crsdata/drcsvr713/output/tmp_amdu_ocr_CRSGRP_09_09_2014_13_52_04

....

2014-09-09 13:52:04.131756 : OCRRAW: Category: 8

2014-09-09 13:52:04.131767 : OCRRAW: DepInfo: 210

....
OCRRAW: proprioo: No OCR/OLR devices are usable
OCRRAW: proprinit: Could not open raw device
default: a_init:7!: Backend init unsuccessful : [26]
OCRDUMP: Failed to initailized OCR context. Error [PROC-26: Error while accessing the physical storage] [26].
OCRDUMP: Failed to initialize ocrdump stage 2
OCRDUMP: Exiting [status=failed]...

Solution:

The cause is that ASMLIB is in use but not properly configured. This can be confirmed from the output of the following commands, run on all nodes:

/etc/init.d/oracleasm listdisks
/etc/init.d/oracleasm scandisks
/etc/init.d/oracleasm listdisks
/etc/init.d/oracleasm listdisks | xargs /etc/init.d/oracleasm querydisk -d
/etc/init.d/oracleasm status
/usr/sbin/oracleasm configure
ls -l /dev/oracleasm/disks/*
rpm -qa | grep oracleasm
uname -a

It's recommended to use AFD (ASM Filter Driver) instead of ASMLIB, but if ASMLIB must be used, fix the misconfiguration, then re-run root script.
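
One way to spot an ASMLIB misconfiguration is to compare the disk labels each node reports. The sketch below assumes the output of "/etc/init.d/oracleasm listdisks" from two nodes has already been saved to two text files, one label per line; the function name asm_disk_list_diff and the output wording are hypothetical.

```shell
# Hypothetical helper: compare two saved "oracleasm listdisks" outputs and
# print the labels that appear on only one of the two nodes.
asm_disk_list_diff() {
    node_a_list=$1
    node_b_list=$2
    sorted_a=$(mktemp)
    sorted_b=$(mktemp)
    # comm requires sorted input.
    sort -u "$node_a_list" > "$sorted_a"
    sort -u "$node_b_list" > "$sorted_b"
    # -23 keeps lines unique to the first file, -13 lines unique to the second.
    comm -23 "$sorted_a" "$sorted_b" | sed 's/^/only on first node: /'
    comm -13 "$sorted_a" "$sorted_b" | sed 's/^/only on second node: /'
    rm -f "$sorted_a" "$sorted_b"
}
```

No output means both nodes list the same labels. That alone does not prove the configuration is correct, so the remaining commands above should still be reviewed.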

Case 2.3 ocrdump fails as amdu core dumped

<ADR_HOME>/crs/<node>/crs/trace/ocrdump_<pid>.trc

2014-08-27 14:34:33.077433 : OCRRAW: kgfo_kge2slos error stack at kgfolclcpi1: AMDU-00210: No disks found in diskgroup QUORUM
AMDU-00210: No disks found in diskgroup QUORUM
....
2014-08-27 14:34:39.262032 : OCRASM: proprasmo: kgfoCheckMount returned [7]
2014-08-27 14:34:39.262041 : OCRASM: proprasmo: The ASM instance is down
2014-08-27 14:34:39.262521 : OCRRAW: proprioo: Failed to open [+QUORUM/wrac-cl-tor/OCRFILE/registry.255.856261165]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2014-08-27 14:34:39.262540 : OCRRAW: proprioo: No OCR/OLR devices are usable
2014-08-27 14:34:39.262552 : OCRRAW: proprinit: Could not open raw device
2014-08-27 14:34:39.262668 : default: a_init:7!: Backend init unsuccessful : [26]
2014-08-27 14:34:39.262743 : OCRDUMP: Failed to initailized OCR context. Error [PROC-26: Error while accessing the physical storage
] [26].
2014-08-27 14:34:39.262760 : OCRDUMP: Failed to initialize ocrdump stage 2

amdu command core dumps:

$ amdu -diskstring 'ORCL:*'
amdu_2014_09_09_14_35_43/
amdu: ossdebug.c:1136: ossdebug_init_diag: Assertion `0' failed.
Aborted (core dumped)

Solution:

At the time of this writing, the issue is still being worked on in bug 19592048; engage Oracle Support for further help.

Case 2.4 same disk name points to different storage on different nodes

<ADR_HOME>/crs/<node>/crs/trace/ocrdump_<pid>.trc

2014-09-10 13:12:53.429460 : OCRASM: proprasmo: Error [13] in opening the GPNP profile. Try to get offline profile
2014-09-10 13:12:53.435300 : OCRRAW: kgfo_kge2slos error stack at kgfolclcpi1: AMDU-00210: No disks found in diskgroup DATA01
AMDU-00210: No disks found in diskgroup DATA01

amdu command output on node1

Disk Path: /dev/asm-data001
Unique Disk ID:
Disk Label:
Physical Sector Size: 512 bytes
Disk Size: 409600 megabytes
Group Name: DATA01
Disk Name: DATA01_0000
Failure Group Name: DATA01_0000

amdu command output on node2

Disk Path: /dev/asm-data001
Unique Disk ID:
Disk Label:
Physical Sector Size: 512 bytes
Disk Size: 409600 megabytes
** NOT A VALID ASM DISK HEADER. BAD VALUE IN FIELD blksize_kfdhdb **

Solution:

The solution is to engage SysAdmin to fix the disk setup issue.

If using ASMLIB with multipath devices, verify that the ORACLEASM_SCANORDER and ORACLEASM_SCANEXCLUDE options in /etc/sysconfig/oracleasm are set properly on all nodes.
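
To find such disks quickly when many are reported, the amdu output can be scanned for the invalid-header marker. This is a minimal sketch under the assumption that the amdu output was saved to a text file and uses the "Disk Path:" and "NOT A VALID ASM DISK HEADER" lines shown above; the function name invalid_asm_headers is hypothetical.

```shell
# Hypothetical helper: print the path of every disk whose report section
# contains the "NOT A VALID ASM DISK HEADER" marker.
invalid_asm_headers() {
    report_file=$1
    awk '
        # Remember the most recent disk path (leading indentation is allowed).
        /Disk Path:/ { path = $0; sub(/^.*Disk Path:[ \t]*/, "", path) }
        # Report it when its section is flagged as an invalid header.
        /NOT A VALID ASM DISK HEADER/ { print path }
    ' "$report_file"
}
```

Running it against the saved output from each node and comparing the results shows which paths are valid ASM disks on one node but not on another.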

Case 2.5 same storage sub-system is shared by different clusters and same diskgroup name exists in more than one cluster

<ADR_HOME>/crs/<node>/crs/trace/ocrdump_<pid>.trc

2015-07-17 16:57:00.532160 : OCRRAW: AMDU-00211: Inconsistent disks in diskgroup OCR

Solution:

The issue was investigated in bug 21469989. The cause is that multiple clusters use the same diskgroup name and see the same shared disks. The workaround is to change the diskgroup name for the new cluster.

For example, cluster1 and cluster2 both see the same physical disks /dev/mappers/disk1-10; disk1-5 are allocated to cluster1 and disk6-10 are allocated to cluster2, but both clusters try to use the same diskgroup name dgsys.

Ref: BUG 21469989 - CLSRSC-507 ROOT.SH FAILING ON NODE 2 WHEN CHECKING GLOBAL CHECKPOINT

Case 2.6 root user is seeing the same physical disks multiple times because of different paths

<ADR_HOME>/crs/<node>/crs/trace/ocrdump_<pid>.trc

2015-07-17 16:57:00.532160 : OCRRAW: AMDU-00211: Inconsistent disks in diskgroup OCR

Solution:

The solution is to ensure that the disk string is set correctly and that the root user sees each physical disk only once.

Ref: BUG 21164225 - OCRDUMP FAILS WITH AMDU-211 ONLY ON NORMAL REDUNDANCY
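
Duplicate paths can be detected once each device path is paired with a unique disk identifier. The sketch below is an assumption-heavy illustration: it expects a two-column text file prepared by the administrator (device path, then an identifier such as a WWID obtained with whatever tool the platform provides), which is not something the original note describes. The function name duplicate_disk_paths is hypothetical.

```shell
# Hypothetical helper: given lines of "<device-path> <unique-disk-id>",
# print every identifier that is reachable through more than one path.
duplicate_disk_paths() {
    mapping_file=$1
    awk '
        NF >= 2 {
            count[$2]++
            # Collect the paths for this identifier in input order.
            paths[$2] = (paths[$2] == "" ? $1 : paths[$2] " " $1)
        }
        END {
            for (id in count)
                if (count[id] > 1)
                    print id ": " paths[id]
        }
    ' "$mapping_file" | sort
}
```

Any identifier printed is visible through several paths, which suggests the disk string should be narrowed so that only one of those paths is discovered.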

Case 2.7 ocrdump fails with AMDU-00210 on Windows environment

<ADR_HOME>/crs/<node>/crs/trace/ocrdump_<pid>.trc

KGF:kgfo.c@954: kgfo_kge2slos error stack at kgfolclcpi1: AMDU-00210: No disks found in diskgroup CRS
AMDU-00210: No disks found in diskgroup CRS
...
KGF:kgfo.c@1122: kgfoSaveError: ignoring existing error:
ORA-29701: unable to connect to Cluster Synchronization Service
AMDU-00210: No disks found in diskgroup CRS
AMDU-00210: No disks found in diskgroup CRS
...
-------------------------------------------------------------------------------
Trace Bucket Dump End: default trace bucket
OCRRAW: -- trace dump end --

OCRASM: proprasmo: kgfoCheckMount returned [7]
OCRASM: proprasmo: The ASM instance is down
OCRRAW: proprioo: Failed to open [+CRS/XXX/OCRFILE/registry.255.920322319]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
OCRRAW: proprioo: No OCR/OLR devices are usable
OCRRAW: proprinit: Could not open raw device
default: a_init:7!: Backend init unsuccessful : [26]
OCRDUMP: Failed to initailized OCR context. Error [PROC-26: Error while accessing the physical storage
] [26].
OCRDUMP: Failed to initialize ocrdump stage 2
OCRDUMP: Exiting [status=failed]...

AMDU from command line is successful.

% amdu -diskstring '\\.\ORCLDISK*' -dump 'CRS' -nodir

Solution:

This issue is still under investigation in bug 24495889.
The workaround is to modify $GRID_HOME/crs/install/crsutils.pm as follows:

From:

sub isOcrKeyExists
{
  ...
  if (0 != $rc)
  {
    trace("Cannot get OCR key with CLUUTIL, try using OCRDUMP.");
    if (checkOcrKeyWithDump($fullKeyName))
    {
      return TRUE;
    }
    else
    {
      return FALSE; <---
    }

To:

sub isOcrKeyExists
{
  ...
  if (0 != $rc)
  {
    trace("Cannot get OCR key with CLUUTIL, try using OCRDUMP.");
    if (checkOcrKeyWithDump($fullKeyName))
    {
      return TRUE;
    }
    else
    {
      return TRUE; <---
    }

Case 3: root script completed on first node but other nodes fail to obtain the status as ocrdump wasn't executed

In this case, it's confirmed that root script finished on node1:

<NEW_GI_HOME>/cfgtoollogs/crsconfig/rootcrs_<node>_<timestamp>.log

CLSRSC-325: Configure Oracle Grid Infrastructure for a Cluster ... succeeded

And root script fails on other nodes as ocrdump wasn't executed:

2014-08-28 17:53:55: Check the existence of global ckpt 'checkpoints.firstnode'
2014-08-28 17:53:55: setting ORAASM_UPGRADE to 1
2014-08-28 17:53:55: Invoking "/opt/12.1.0.2/grid/bin/cluutil -exec -keyexists -key checkpoints.firstnode"
2014-08-28 17:53:55: trace file=/opt/oracle/crsdata/racnode2/crsconfig/cluutil3.log
2014-08-28 17:53:55: Running as user oracle: /opt/12.1.0.2/grid/bin/cluutil -exec -keyexists -key checkpoints.firstnode
2014-08-28 17:53:55: s_run_as_user2: Running /bin/su oracle -c ' echo CLSRSC_START; /opt/12.1.0.2/grid/bin/cluutil -exec -keyexists -key checkpoints.firstnode '
2014-08-28 17:53:56: Removing file /tmp/fileZCubj2
2014-08-28 17:53:56: Successfully removed file: /tmp/fileZCubj2
2014-08-28 17:53:56: pipe exit code: 0 ====>>>> cluutil failed with PROC-32 but exit code 0
2014-08-28 17:53:56: /bin/su successfully executed

2014-08-28 17:53:56: oracle.ops.mgmt.rawdevice.OCRException: PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]

2014-08-28 17:53:56: Checking a remote host dblab01 for reachability...
....

2014-08-28 17:53:57: CLSRSC-507: The root script cannot proceed on this node dblab02 because either the first-node operations have not completed on node dblab01 or there was an error in obtaining the status of the first-node operations.

The cluutil trace <ORACLE_BASE>/crsdata/racnode2/crsconfig/cluutil3.log confirms that it failed:

[main] [ 2014-08-29 17:40:46.750 EDT ] [OCR.<init>:278] ocr Error code = 32
[main] [ 2014-08-29 17:40:46.750 EDT ] [ClusterExecUtil.executeCmd:168] Exception caught: PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]
[main] [ 2014-08-29 17:40:46.750 EDT ] [ClusterUtil.main:236] ClusterUtil.execute rc: 1

The issue was investigated in bug 19570598:

BUG 19570598 - ROOT.SH FAILS ON NODE2 WHILE CHECKING GLOBAL FIRST NODE CHECKPOINT
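
To tell Case 2 and Case 3 apart on a failing node, the rootcrs log can be checked for the ocrdump fallback message. This is a rough sketch, not part of the original note: the function name classify_clsrsc_507 and its output strings are hypothetical, and it relies only on the "CLSRSC-507" and "Check OCR key using ocrdump" messages appearing in the log as shown in the excerpts above.

```shell
# Hypothetical helper: give a first hint about which case a failing
# non-first node's rootcrs log matches.
classify_clsrsc_507() {
    log_file=$1
    if ! grep -Fq 'CLSRSC-507' "$log_file"; then
        echo 'no CLSRSC-507 in log'
    elif grep -Fq 'Check OCR key using ocrdump' "$log_file"; then
        # ocrdump ran; if the first node succeeded, review the ocrdump trace (Case 2).
        echo 'ocrdump was attempted'
    else
        # cluutil failed but the ocrdump fallback never ran (Case 3).
        echo 'ocrdump was not executed'
    fi
}
```

The result is only a starting point. The first node's log and the ocrdump or cluutil traces still need to be reviewed to confirm the exact case.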