SUSPEND导致RAC环境的ORA-29702错误（一）

最新推荐文章于 2023-06-13 11:59:04 发布

ckawt40802

最新推荐文章于 2023-06-13 11:59:04 发布

阅读量156

点赞数

文章标签：数据库网络

在测试RAC环境的SUSPEND命令时，出现了这个错误。

测试RAC环境下，SUSPEND对多个实例的影响时，在后台日志中出现了这个错误。

简单描述一下测试环境，3节点的Oracle Rac 10203 for Solaris sparc64，其中实例启动，一个实例关闭，将数据库置于SUSPEND状态，然后通过SRVCTL启动远端关闭的实例，这时SRVCTL命令会被阻塞，登陆远端站点发现数据库还没有被MOUNT，而实例1由于上述的操作被自动SHUTDOWN ABORT，并自动重起。在后台日志中可以发现ORA-29702错误。

下面模拟一下这个现象：

bash-2.03$ srvctl status db -d testrac
Instance testrac1 is running on node racnode1
Instance testrac2 is running on node racnode2
Instance testrac3 is running on node racnode3
bash-2.03$ srvctl stop inst -d testrac -i testrac3
bash-2.03$ srvctl status db -d testrac
Instance testrac1 is running on node racnode1
Instance testrac2 is running on node racnode2
Instance testrac3 is not running on node racnode3
bash-2.03$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on 星期六 2月 21 01:01:17 2009

连接到:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> alter session set nls_date_format = 'yyyy-mm-dd hh24:mi:ss';

会话已更改。

SQL> select instance_name, startup_time from v$instance;

INSTANCE_NAME STARTUP_TIME
---------------- -------------------
testrac1 2009-02-21 00:54:20

SQL> select instance_name, status, database_status from gv$instance;

INSTANCE_NAME    STATUS       DATABASE_STATUS
---------------- ------------ -----------------
testrac1         OPEN         ACTIVE
testrac2         OPEN         ACTIVE

SQL> alter system suspend;

系统已更改。

SQL> select instance_name, status, database_status from gv$instance;

INSTANCE_NAME    STATUS       DATABASE_STATUS
---------------- ------------ -----------------
testrac1         OPEN         SUSPENDED
testrac2         OPEN         SUSPENDED

检查实例2的状态：

SQL> select instance_name from v$instance;

INSTANCE_NAME
----------------
testrac2

SQL> select instance_name, status, database_status from v$instance;

INSTANCE_NAME STATUS DATABASE_STATUS
---------------- ------------ -----------------
testrac2 OPEN SUSPENDED

下面在另一个终端使用srvctl启动实例3：

bash-2.03$ srvctl start inst -d testrac -i testrac3

这个操作处于被阻塞的状态。

现在通过sqlplus来连接实例3：

$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on 星期三 4月 22 16:32:08 2009

连接到:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> select instance_name, status, database_status from v$instance;

INSTANCE_NAME STATUS DATABASE_STATUS
---------------- ------------ -----------------
testrac3 STARTED ACTIVE

SQL> select instance_name, status, database_status from gv$instance;

可以看到数据库的状态并不是OPEN，而是STARTED，说明数据库还没有被MOUNT。而查询GV$INSTANCE信息的操作被阻塞。

又等待一段时间，发现SRVCTL命令结束，但是实例3还处于等待状态中。

在实例1上执行RESUME操作，并检查当前实例状态：

SQL> alter system resume;

系统已更改。

SQL> select instance_name, status, database_status from gv$instance;

INSTANCE_NAME    STATUS       DATABASE_STATUS
---------------- ------------ -----------------
testrac1         OPEN         ACTIVE
testrac3         OPEN         ACTIVE

实例3上被阻塞的查询得到结果，但是查询GV$INSTANCE却只有3的信息：

INSTANCE_NAME STATUS DATABASE_STATUS
---------------- ------------ -----------------
testrac3 OPEN ACTIVE

SQL> select instance_name from v$instance;

INSTANCE_NAME
----------------
testrac3

SQL> select instance_name, status, database_status from gv$instance;

INSTANCE_NAME    STATUS       DATABASE_STATUS
---------------- ------------ -----------------
testrac3         OPEN         ACTIVE
testrac2         OPEN         ACTIVE
testrac1         OPEN         ACTIVE

随后3个节点都启动，且恢复正常的状态。

SQL> select instance_name, startup_time from v$instance;

INSTANCE_NAME STARTUP_TIME
---------------- -------------------
testrac1 2009-02-21 00:54:20

跟踪这次测试的结果实例1并没有重启，而实例2则发生了重启，而第一次发现问题的时候是在实例2上执行alter system resume操作，看来没有执行这个操作的实例会发生重启。

下面是实例2上的操作：

SQL> select instance_name, status, database_status from v$instance;

INSTANCE_NAME STATUS DATABASE_STATUS
---------------- ------------ -----------------
testrac2 OPEN ACTIVE

SQL> select instance_name, status, database_status from v$instance;
select instance_name, status, database_status from v$instance
*
第 1 行出现错误:
ORA-03135: 连接失去联系

可以看到实例2在刚执行完alter system resume后还可以执行查询，但是随后就断开连接。

检查实例2对应的alert文件：

Resuming database after alter system resume command
Sat Feb 21 01:17:44 2009
Error: KGXGN polling error (15)
Sat Feb 21 01:17:44 2009
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmon_16311.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON: terminating instance due to error 29702
Sat Feb 21 01:17:44 2009
System state dump is made for local instance
System State dumped to trace file /data/oracle/admin/testrac/bdump/testrac2_diag_16307.trc
Sat Feb 21 01:17:46 2009
Shutting down instance (abort)
License high water mark = 6
Sat Feb 21 01:17:46 2009
Trace dumping is performing id=[cdmp_20090221011744]
Sat Feb 21 01:17:48 2009
Instance terminated by LMON, pid = 16311
Sat Feb 21 01:17:51 2009
Instance terminated by USER, pid = 21837
Sat Feb 21 01:17:55 2009
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 ce1 10.0.0.0 configured from OCR for use as a cluster interconnect
Interface type 1 ce0 172.25.0.0 configured from OCR for use as a public interface
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.3.0.
System parameters with non-default values:
processes                = 300
__shared_pool_size       = 352321536
__large_pool_size        = 16777216
__java_pool_size         = 16777216
__streams_pool_size      = 50331648
streams_pool_size        = 50331648
spfile                   = +DISK/testrac/spfiletestrac.ora
backup_tape_io_slaves    = TRUE
sga_target               = 2147483648
control_files            = +DISK/testrac/controlfile/current.256.618591133, +DISK/testrac/controlfile/current.257.618591137
db_file_name_convert     = /data1/oradata/testrac/g, +DISK/testrac/datafile/g, /data/oradata/testrac/l, +DISK/testrac/datafile/l, /data1/oradata/testrac/n, +DISK/testrac/datafile/n, /data1/oradata/testrac/s, +DISK/testrac/datafile/s, /data1/oradata/testrac/u, +DISK/testrac/datafile/u, /export/home/oracle/oradata/testrac/t, +DISK/testrac/tempfile/t, /export/home/oracle/oradata/testrac/zhejiang01.dbf, +DISK/testrac/datafile/zhejiang01.dbf, /export/home/oracle/oradata/testrac/zhejiang02.dbf, +DISK/testrac/datafile/zhejiang02.dbf, /expor
log_file_name_convert    = /export/home/oracle/oradata/testrac, +DISK/testrac/onlinelog
db_block_size            = 8192
__db_cache_size          = 1694498816
compatible               = 10.2.0.1.0
log_archive_config       =
log_archive_dest_1       = LOCATION=/data1/archivelog VALID_FOR=(ALL_LOGFILES,ALL_ROLES) DB_UNIQUE_NAME=testrac
log_archive_dest_2       =
log_archive_dest_state_2 = ENABLE
standby_archive_dest     = /data1/archivelog
log_archive_local_first = TRUE
log_archive_format       = %t_%s_%r.dbf
fal_client               = TESTRAC
fal_server               = TESTRAC_STANDBY
db_file_multiblock_read_count= 16
cluster_database         = TRUE
cluster_database_instances= 3
db_create_file_dest      = +DISK
db_recovery_file_dest    =
db_recovery_file_dest_size= 5368709120
standby_file_management = AUTO
thread                   = 2
instance_number          = 2
undo_management          = AUTO
undo_tablespace          = UNDOTBS2
remote_login_passwordfile= EXCLUSIVE
db_domain                =
service_names            = TESTRAC, testrac
dispatchers              = (PROTOCOL=TCP) (SERVICE=testracXDB)
local_listener           = (ADDRESS = (PROTOCOL = TCP)(HOST = 172.25.198.225)(PORT = 1521))
remote_listener          = LISTENERS_TESTRAC
job_queue_processes      = 10
cursor_sharing           = EXACT
background_dump_dest     = /data/oracle/admin/testrac/bdump
user_dump_dest           = /data/oracle/admin/testrac/udump
core_dump_dest           = /data/oracle/admin/testrac/cdump
audit_file_dest          = /data/oracle/admin/testrac/adump
session_max_open_files   = 20
db_name                  = testrac
open_cursors             = 5000
pga_aggregate_target     = 536870912
Cluster communication is configured to use the following interface(s) for this instance
10.0.0.2
Sat Feb 21 01:17:58 2009
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
PMON started with pid=2, OS id=22188
DIAG started with pid=3, OS id=22190
PSP0 started with pid=4, OS id=22192
LMON started with pid=5, OS id=22194
Sat Feb 21 01:17:59 2009
WARNING: Failed to set buffer limit on IPC interconnect socket
Oracle requires that the SocketReceive buffer size be tunable upto 1MB
Please make sure the kernel parameterwhich limits SO_RCVBUF value set by
applications is atleast 1MB
LMD0 started with pid=6, OS id=22206
LMS0 started with pid=7, OS id=22227
LMS1 started with pid=8, OS id=22231
MMAN started with pid=9, OS id=22235
DBW0 started with pid=10, OS id=22237
LGWR started with pid=11, OS id=22239
CKPT started with pid=12, OS id=22246
SMON started with pid=13, OS id=22251
RECO started with pid=14, OS id=22253
CJQ0 started with pid=15, OS id=22259
MMON started with pid=16, OS id=22273
Sat Feb 21 01:18:01 2009
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
MMNL started with pid=17, OS id=22280
Sat Feb 21 01:18:01 2009
starting up 1 shared server(s) ...
Sat Feb 21 01:18:01 2009
lmon registered with NM - instance id 2 (internal mem no 1)
Sat Feb 21 01:18:01 2009
Reconfiguration started (old inc 0, new inc 12)
List of nodes:
0 1 2
Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
Communication channels reestablished
* domain 0 valid = 1 according to instance 0
Sat Feb 21 01:18:02 2009
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Sat Feb 21 01:18:02 2009
LMS 0: 0 GCS shadows cancelled, 0 closed
Sat Feb 21 01:18:02 2009
LMS 1: 0 GCS shadows cancelled, 0 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Sat Feb 21 01:18:02 2009
LMS 0: 0 GCS shadows traversed, 0 replayed
Sat Feb 21 01:18:02 2009
LMS 1: 0 GCS shadows traversed, 0 replayed
Sat Feb 21 01:18:02 2009
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
LCK0 started with pid=20, OS id=22315
Sat Feb 21 01:18:03 2009
ALTER DATABASE   MOUNT
Sat Feb 21 01:18:03 2009
Starting background process ASMB
ASMB started with pid=22, OS id=22347
Starting background process RBAL
RBAL started with pid=23, OS id=22351
Sat Feb 21 01:18:07 2009
SUCCESS: diskgroup DISK was mounted
Sat Feb 21 01:18:11 2009
Setting recovery target incarnation to 2
Sat Feb 21 01:18:11 2009
Successful mount of redo thread 2, with mount id 57044521
Sat Feb 21 01:18:11 2009
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Completed: ALTER DATABASE   MOUNT
Sat Feb 21 01:18:13 2009
ALTER DATABASE OPEN
Picked broadcast on commit scheme to generate SCNs
Sat Feb 21 01:18:13 2009
LGWR: STARTING ARCH PROCESSES
ARC0 started with pid=27, OS id=22623
Sat Feb 21 01:18:14 2009
ARC0: Archival started
ARC1: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC1 started with pid=28, OS id=22625
Sat Feb 21 01:18:14 2009
Thread 2 opened at log sequence 842
Current log# 4 seq# 842 mem# 0: +DISK/testrac/onlinelog/group_4.270.633485455
Successful open of redo thread 2
Sat Feb 21 01:18:14 2009
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Sat Feb 21 01:18:14 2009
SMON: enabling cache recovery
Sat Feb 21 01:18:14 2009
ARC0: Becoming the 'no FAL' ARCH
ARC0: Becoming the 'no SRL' ARCH
Sat Feb 21 01:18:14 2009
ARC1: Becoming the heartbeat ARCH
Sat Feb 21 01:18:15 2009
Successfully onlined Undo Tablespace 4.
Sat Feb 21 01:18:15 2009
SMON: enabling tx recovery
Sat Feb 21 01:18:15 2009
Database Characterset is ZHS16GBK
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
QMNC started with pid=31, OS id=22830
Sat Feb 21 01:18:25 2009
Completed: ALTER DATABASE OPEN

从alert文件中可以清晰的看到，Oracle执行resume命令后由于出现的ORA-29702错误对当前实例执行了SHUTDOWN ABORT并重启的操作。根据metalink中的描述，ORA-29702只是一个超时的错误，怀疑是SRVCTL命令发出时间超时，导致CLUSTER无法判断数据库是否处于CLUSTER一致性的状态，因此在恢复操作后，重启了另外的实例。

来自 “ ITPUB博客 ” ，链接：http://blog.itpub.net/4227/viewspace-591813/，如需转载，请注明出处，否则将追究法律责任。

转载于:http://blog.itpub.net/4227/viewspace-591813/

ckawt40802

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
SUSPEND导致RAC环境的ORA-29702错误（一）

在测试RAC环境的SUSPEND命令时，出现了这个错误。测试RAC环境下，SUSPEND对多个实例的影响时，在后台日志中出现了这个错误。简单描述一下测试环境，3节点的Oracle Rac 10203 for ...
复制链接

扫一扫