Original title: Caution: Oracle 11g ADG RAC is not highly available in this scenario
Editor's digest: In a two-node 11gR2 (11.2.0.3) RAC ADG environment, node 1 (the redo apply node) crashed abnormally due to a hardware fault, and at the same time all application connections on instance 2, which had been OPEN, were dropped as well.
Today, in a two-node 11gR2 (11.2.0.3) RAC ADG environment, node 1 (the redo apply node) crashed abnormally because of a hardware problem, and at the same time the applications on instance 2 were all disconnected (both instances had been OPEN). The ADG was serving some read-only workload, yet the database alert log on node 2 showed no obvious sign of the instance being closed manually; it had switched to the MOUNT state automatically. Isn't RAC supposed to be highly available? Why does the death of one node drag the other node down with it?
If you check the state of instance 2 at this point, it is actually MOUNT. Not everyone knows that the database has an ALTER DATABASE CLOSE command; once an instance has been manually closed, it cannot be opened again within the same instance lifetime and must be restarted (PDBs excepted). And, as noted above, the alert log of instance 2 showed no trace of a close being issued. An excerpt follows.
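As a quick illustration of that close behavior, here is a hypothetical SQL*Plus session (not from the incident itself; the error shown is what 11g typically raises when you try to re-open a database that was closed in the same instance lifetime):

```sql
SQL> alter database close;
Database altered.
-- The instance drops from OPEN to MOUNT.

SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-16196: database has been previously opened and closed
-- The instance must be restarted before the database can be opened again.
```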
Alert log excerpt from DG instance 2:
2020-09-03 00:10:39.601000 +08:00
ORA-01555 caused by SQL statement below (SQL ID: 4snkhx5vxrmv2, Query Duration=7340 sec, SCN: 0x0f46.c04d2402):
select....
2020-09-04 11:35:29.504000 +08:00
Archived Log entry 105312 added for thread 2 sequence 200520 ID 0x1fcb56a7 dest 1:
2020-09-08 14:16:17.954000 +08:00
Reconfiguration started (old inc 22, new inc 24)
List of instances:
 2 (myinst: 2)
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 LMS 0: 23 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 2: 34 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 1: 19 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
2020-09-08 14:16:19.005000 +08:00
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
2020-09-08 14:16:22.014000 +08:00
ARC1: Becoming the active heartbeat ARCH
ARC1: Becoming the active heartbeat ARCH
2020-09-08 14:16:23.328000 +08:00
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
2020-09-08 14:16:24.368000 +08:00
Reconfiguration complete
Recovery session aborted due to instance crash
Close the database due to aborted recovery session
SMON: disabling tx recovery
2020-09-08 14:16:54.955000 +08:00
Stopping background process MMNL
Stopping background process MMON
2020-09-08 14:17:26.530000 +08:00
Background process MMON not dead after 30 seconds
Killing background process MMON
2020-09-08 14:18:04.907000 +08:00
Starting background process MMON
MMON started with pid=27, OS id=18743
Starting background process MMNL
MMNL started with pid=1865, OS id=18745
CLOSE: killing server sessions.
2020-09-08 14:18:07.003000 +08:00
Active process 3858 user 'grid' program 'oracle@anbob2'
Active process 14847 user 'grid' program 'oracle@anbob2'
Active process 3435 user 'grid' program 'oracle@anbob2'
Active process 25029 user 'grid' program 'oracle@anbob2'
Active process 9789 user 'grid' program 'oracle@anbob2'
Active process 23815 user 'grid' program 'oracle@anbob2'
...
Active process 24285 user 'itmuser' program 'oracle@anbob2 (TNS V1-V3)'
Active process 10045 user 'grid' program 'oracle@anbob2'
Active process 24229 user 'grid' program 'oracle@anbob2'
CLOSE: all sessions shutdown successfully.
SMON: disabling tx recovery
SMON: disabling cache recovery
2020-09-08 14:19:06.638000 +08:00
2020-09-08 14:26:56.608000 +08:00
alter database recover managed standby database using current logfile disconnect from session
Attempt to start background Managed Standby Recovery process (tbcsc2)
MRP0 started with pid=73, OS id=9746
MRP0: Background Managed Standby Recovery process started (tbcsc2)
2020-09-08 14:27:01.712000 +08:00
started logmerger process
Managed Standby Recovery starting Real Time Apply
2020-09-08 14:27:08.171000 +08:00
Parallel Media Recovery started with 32 slaves
Waiting for all non-current ORLs to be archived...
All non-current ORLs have been archived.
Recovery of Online Redo Log: Thread 2 Group 41 Seq 201120 Reading mem 0
  Mem# 0: /dev/yyc_oravg04/ryyc_redo41
Media Recovery Log /yyc1_arch/arch_1_288357_920590168.arc
Completed: alter database recover managed standby database using current logfile disconnect from session
2020-09-08 14:27:29.221000 +08:00
-- Because the archived logs from host 1 were missing, redo could not be applied, so we cancelled redo apply and aborted the database
alter database recover managed standby database cancel
2020-09-08 14:31:57.640000 +08:00
WARNING: inbound connection timed out (ORA-3136)
2020-09-08 14:33:30.958000 +08:00
Shutting down instance (abort)
In short: when the instance that is applying redo terminates abnormally, all other instances that are OPEN READ ONLY will be closed. In a RAC ADG environment, if the apply instance crashes in the middle of applying redo, it leaves cache fusion locks behind on the surviving instances, so queries there could see inconsistent data. The database therefore has to be closed and reopened to restore a consistent state between the buffer caches and the data files. If DG Broker is configured (and the version is above 11.2.0.2), this happens automatically; without DG Broker, simply open the database manually on a surviving node and then issue the redo apply command to continue applying logs there.
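The manual steps above can be sketched as the following SQL*Plus session on the surviving standby instance (a sketch under the assumptions of this incident, not a transcript from it):

```sql
-- On the surviving standby instance, now sitting in MOUNT after the automatic close:
SQL> alter database open;
-- A physical standby opens READ ONLY, restoring the Active Data Guard service.

-- Then restart real-time redo apply on this node:
SQL> alter database recover managed standby database
  2  using current logfile disconnect from session;
```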
Here is the relevant MOS explanation:
Symptoms
In an Active Data Guard RAC standby, if the redo apply instance crashes, all other instances of that standby that were open Read Only will be closed and returned to the MOUNT state. This disconnects all readers of the Active Data Guard standby.
Cause
In an Active Data Guard RAC standby, if the redo apply instance crashes in the middle of media recovery, it leaves the RAC cache fusion locks on the surviving instances and the data files on disk in an in-flux state. In such a situation, queries on the surviving instances can potentially see inconsistent data. To resolve this in-flux state, the entire standby database is closed. Upon opening the first instance after such a close, the buffer caches and datafiles are made consistent again.
But from 12.1, when the apply instance crashes in the middle of applying changes, one of the remaining open instances will be automatically posted to perform "ADG instance recovery". We can confirm that ADG instance recovery ran by checking the alert log for messages like "Beginning ADG Instance Recovery" and "Completed ADG Instance Recovery". If DG Broker is enabled, the Broker will start the MRP on any of the surviving instances.
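To verify which instance (if any) is now running redo apply after such a recovery, a query like the following can be used (a sketch; `GV$MANAGED_STANDBY` is the RAC-wide view of the managed recovery processes):

```sql
-- Check whether redo apply (MRP0) is running, and on which instance:
SQL> select inst_id, process, status, thread#, sequence#
  2    from gv$managed_standby
  3   where process like 'MRP%';
-- An MRP0 row with STATUS = 'APPLYING_LOG' indicates redo apply is active on that instance.
```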