log_archive_dest_n参数中reopen、max_failure、net_timeout三个属性的组合效果测试

最新推荐文章于 2023-11-16 11:13:13 发布

cuiqiong6888

最新推荐文章于 2023-11-16 11:13:13 发布

阅读量1.3k

点赞数

文章标签：数据库

log_archive_dest_n参数中有三个属性都是用来指导在目标归档路径故障的情况下，归档过程应如何继续下去，这三个属性是：

reopen：能用于DG或者非DG的环境下，在非DG环境下当出现归档目录满或者因权限问题无法正常写入时，在DG环境下当prmy与stdby间的网络中断时，用来控制隔多长时间(以秒为单位)进行下一次的尝试

max_failure：能用于DG或者非DG的环境下，当归档路径的写入失败次数超过指定次数时，就不会再次尝试继续归档到这个路径下了

net_timeout：仅能用在保护模式为maximum availability的DG环境下，正常情况下prmy产生的redo通过LNS进程传输给Stdby的RFS进程，RFS进程接收到redo后回给prmy的LNS一个acknowledge，LNS进一步通知LGWR commit可以继续；当prmy与stdby间的网络中断时，LGWR如果在net_timeout定义的时间内无法收到LNS的通知，也能让commit继续下去。

在实际使用过程中往往需要对这三个参数进行组合应用才能达到应有的效果，且在DG和非DG环境下略有差别。

>>>>>>>>> 第一部分：非DG环境下reopen+max_failures三个参数的组合效果测试 <<<<<<<<<
///
//第一组测试：仅一个archive destination，设置了reopen=10 max_failure=3
///
###archive dest当前设置
col dest_name format a30
col destination format a50
set linesize 150
select dest_name,destination,status,binding,reopen_secs,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';
DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- -----------
LOG_ARCHIVE_DEST_4 /oradata06/arch VALID OPTIONAL 10 3

修改/oradata06/arch目录权限为root.system
ls -ld /oradata06/arch

chown root.system /oradata06/arch

ls -ld /oradata06/arch
drwxr-xr-x 2 root system 96 Sep 10 09:26 /oradata06/arch

###switch logfile
alter system switch logfile;

select dest_name,status,binding,reopen_secs,max_failure,fail_date,failure_count from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';
DEST_NAME STATUS BINDING REOPEN_SECS MAX_FAILURE FAIL_DATE FAILURE_COUNT
------------------------------ --------- --------- ----------- ----------- ----------------- -------------
LOG_ARCHIVE_DEST_4 ERROR OPTIONAL 10 3 20150910 09:28:01 1

###连续执行三次switch logfile，观察failure_count增长情况
alter system switch logfile; --09:31:36 执行
alter system switch logfile; --09:31:36 执行
alter system switch logfile; --09:32:32 执行

***但是直到09:33:48 failure_count 才从1增长为2，并没有按照reopen=10的设置每隔10秒钟探测一次归档目录是否可用
select dest_name,status,binding,reopen_secs,max_failure,fail_date,failure_count from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';
DEST_NAME STATUS BINDING REOPEN_SECS MAX_FAILURE FAIL_DATE FAILURE_COUNT
------------------------------ --------- --------- ----------- ----------- ----------------- -------------
LOG_ARCHIVE_DEST_4 ERROR OPTIONAL 10 3 20150910 09:33:48 2

###再次执行switch logfile，但是直到09:38:48 failure_count才从2增长为3
alter system switch logfile; --09:36:44

***但是直到09:33:48 failure_count 才从2增长为3，也远超过了reopen=10的设置
SQL> select dest_name,status,binding,reopen_secs,max_failure,fail_date,failure_count from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';

DEST_NAME STATUS BINDING REOPEN_SECS MAX_FAILURE FAIL_DATE FAILURE_COUNT
------------------------------ --------- --------- ----------- ----------- ----------------- -------------
LOG_ARCHIVE_DEST_4 ERROR OPTIONAL 10 3 20150910 09:38:48 3

从上面的记录的fail_date可以看出每次探测距离上一次失败的间隔是5分钟

###因为LOG_ARCHIVE_DEST_4是唯一个可用的归档目录，所以尽管Failure_count达到了max_failure定义的上限3，ARCH进程依然不断尝试该路径下的归档是否可用
alter system switch logfile; --09:51:50再次执行switch logfile

***alert.log里于09:53:49报出无法归档的错误
Thu Sep 10 09:53:49 2015
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance tstdb1_stdby1 - Archival Error
ORA-16014: log 3 sequence# 305 not archived, no available destinations
ORA-00312: online log 3 thread 1: '/oradata06/teststdby/testaaaaa/redo03a.log'
ORA-00312: online log 3 thread 1: '/oradata06/teststdby/testaaaaa/redo03b.log'

***但v$archive_dest里的failure_count还是停留在3，因为其已经到了max_failure定义的最大值，所以不再更新
select dest_name,status,binding,reopen_secs,max_failure,fail_date,failure_count from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';

DEST_NAME STATUS BINDING REOPEN_SECS MAX_FAILURE FAIL_DATE FAILURE_COUNT
------------------------------ --------- --------- ----------- ----------- ----------------- -------------
LOG_ARCHIVE_DEST_4 ERROR OPTIONAL 10 3 20150910 09:38:48 3

###直到我们重新将failure_count置为0，FAILURE_COUNT才会继续计数
SQL> alter system set log_archive_dest_4='location=/oradata06/arch reopen=10 max_failure=3' scope=both; <---重新设置log_archive_Dest_n就能将failure_count复位成0

System altered.

SQL> select dest_name,status,binding,reopen_secs,max_failure,fail_date,failure_count from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';

DEST_NAME STATUS BINDING REOPEN_SECS MAX_FAILURE FAIL_DATE FAILURE_COUNT
------------------------------ --------- --------- ----------- ----------- ----------------- -------------
LOG_ARCHIVE_DEST_4 VALID OPTIONAL 10 3 0

alter system switch logfile;
alter system switch logfile;
alter system switch logfile;

SQL> select dest_name,status,binding,reopen_secs,max_failure,fail_date,failure_count from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';

DEST_NAME STATUS BINDING REOPEN_SECS MAX_FAILURE FAIL_DATE FAILURE_COUNT
-------------------------------------------------- --------- --------- ----------- ----------- ----------------- -------------
LOG_ARCHIVE_DEST_4 ERROR OPTIONAL 10 3 20150910 10:07:07 3

reopen time并没有向文档上描述的那样精确的控制两次探测archive dest间的时间间隔，对于mandatory的archive dest或者仅有的一个optional的archive dest，达到failure_count后，依然会被ARCH进程探测

///
//第二组测试：有两个archive destination，其中一个是mandatory，一个是optional，两个destination均设置reopen=20 max_failure=5；
//故意制造optional的destination无法写入的场景
///
alter system set log_archive_dest_4='location=/oradata06/arch mandatory reopen=20 max_failure=5' scope=both;
alter system set log_archive_dest_5='location=/oradata06/arch1 optional reopen=20 max_failure=5' scope=both;

alter system switch logfile;

select dest_name,destination,status,binding,reopen_secs,failure_count,fail_date,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS FAILURE_COUNT FAIL_DATE MAX_FAILURE
-------------------- ------------------------------ --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4 /oradata06/arch VALID MANDATORY 10 0 3
LOG_ARCHIVE_DEST_5 /oradata06/arch1 VALID OPTIONAL 10 0 3

SQL> show parameter log_archive_min_succeed_dest

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
log_archive_min_succeed_dest integer 1

###修改/oradata06/arch1目录权限为root.system
root@qcp570717a:/oradata06/arch1>chown root.system /oradata06/arch1
root@qcp570717a:/oradata06/arch1>ls -ld /oradata06/arch1
drwxr-xr-x 2 root system 96 Sep 10 10:56 /oradata06/arch1

###连续的执行switch logfile，边执行边观察v$archive_dest视图的变化情况
***session 1：以每秒钟1次的频率执行switch logfile
alter system switch logfile;
。。。。。 <----执行n遍

***session 2：观察v$archive_dest的刷新情况
select dest_name,destination,status,binding,reopen_secs,failure_count,fail_date,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';
DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4 /oradata06/arch VALID MANDATORY 20 0 5
LOG_ARCHIVE_DEST_5 /oradata06/arch1 ERROR OPTIONAL 20 1 20150910 12:59:09 5

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4 /oradata06/arch VALID MANDATORY 20 0 5
LOG_ARCHIVE_DEST_5 /oradata06/arch1 ERROR OPTIONAL 20 2 20150910 12:59:29 5

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4 /oradata06/arch VALID MANDATORY 20 0 5
LOG_ARCHIVE_DEST_5 /oradata06/arch1 ERROR OPTIONAL 20 3 20150910 12:59:51 5

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4 /oradata06/arch VALID MANDATORY 20 0 5
LOG_ARCHIVE_DEST_5 /oradata06/arch1 ERROR OPTIONAL 20 4 20150910 13:00:08 5

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4 /oradata06/arch VALID MANDATORY 20 0 5
LOG_ARCHIVE_DEST_5 /oradata06/arch1 ERROR OPTIONAL 20 5 20150910 13:00:28 5

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4 /oradata06/arch VALID MANDATORY 20 0 5
LOG_ARCHIVE_DEST_5 /oradata06/arch1 ERROR OPTIONAL 20 5 20150910 13:00:28 5

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4 /oradata06/arch VALID MANDATORY 20 0 5
LOG_ARCHIVE_DEST_5 /oradata06/arch1 ERROR OPTIONAL 20 5 20150910 13:00:28 5

以上输出清楚的看出，对于optional的archive dest的确能按照reopen和max_failure参数的指定值，尽管我们以每秒1次的频率执行"alter system switch logfile"，但ARCH进程基本以每20秒1次的频率探测LOG_ARCHIVE_DEST_5是否可用，直到failure_count达到max_failure所指定的5次，便停止探测。
如果此时LOG_ARCHIVE_DEST_5对应的/oradata06/arch1目录写权限恢复，必须人工重设log_archive_dest_5参数使得ARCH进程重新开始对归档目录的探测: alter system set log_archive_dest_5='location=/oradata06/arch1 optional reopen=20 max_failure=5' scope=both;
当ARCH发现log_archive_dest_5对应的/oradata06/arch1目录能够正常写入了，后面的归档就能放到该目录下了

///
第三组测试：有两个archive destination，其中一个是mandatory，一个是optional，两个destination均设置reopen=20 max_failure=5；
故意制造mandatory的destination无法写入的场景
///
alter system set log_archive_dest_4='location=/oradata06/arch mandatory reopen=20 max_failure=5' scope=both;
alter system set log_archive_dest_5='location=/oradata06/arch1 optional reopen=20 max_failure=5' scope=both;

alter system switch logfile;

ls -ld /oradata06/arch
drwxr-xr-x 2 oracle oinstall 8192 Sep 10 13:43 /oradata06/arch

chown root.system /oradata06/arch

ls -ld /oradata06/arch
drwxr-xr-x 2 root system 8192 Sep 10 13:43 /oradata06/arch

##每10秒钟执行1次switch logfile，共执行10次
alter system switch logfile;
。。。 <---共执行10次

###观察v$archive_dest中关于LOG_ARCHIVE_DEST_4的fail_date按照5分钟1次的频率进行更新，并不受reopen=20的约束
select dest_name,destination,status,binding,reopen_secs,failure_count,fail_date,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';
DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4 /oradata06/arch ERROR MANDATORY 20 3 20150910 13:50:59 5
LOG_ARCHIVE_DEST_5 /oradata06/arch1 VALID OPTIONAL 20 0 5

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4 /oradata06/arch ERROR MANDATORY 20 3 20150910 13:50:59 5
LOG_ARCHIVE_DEST_5 /oradata06/arch1 VALID OPTIONAL 20 0 5

。。。。省略了重复的输出

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4 /oradata06/arch ERROR MANDATORY 20 4 20150910 13:55:59 5 <---这一次更新距离上一次是5分钟
LOG_ARCHIVE_DEST_5 /oradata06/arch1 VALID OPTIONAL 20 0 5

通过上面三组测试，得出如下结论：
对于mandatory的archive dest，或者仅有的一个optional的archive dest，reopen与max_failure参数并没有起到文档中所描述的作用，而是按照每5分钟一次的间隔永久的尝试进行归档；
reopen与max_failure的作用对于optional的archive dest，完全遵循文档上的解释

>>>>>>>>> 第二部分：DG环境下net_timeout+reopen+max_failures三个参数的组合效果测试 <<<<<<<<<
测试的DG处于maximum availability保护模式下，测试中制造网络中断来模拟prmy与stdby间无法通信的场景
///
//第一组测试：net_timeout + reopen的效果测试(在net_timeout时间内网络没有恢复正常)
///
***prmy设置reopen、net_timeout
SQL> select db_unique_name,database_role,protection_mode,protection_level,open_mode,SWITCHOVER_STATUS from v$database;

DB_UNIQUE_NAME DATABASE_ROLE PROTECTION_MODE PROTECTION_LEVEL OPEN_MODE SWITCHOVER_STATUS
--------------- ---------------- -------------------- -------------------- -------------------- --------------------
tstdb1_stdby1 PRIMARY MAXIMUM AVAILABILITY MAXIMUM AVAILABILITY READ WRITE TO STANDBY

alter system set log_archive_dest_2='service=tstdb1 SYNC affirm reopen=10 net_timeout=40 valid_for=(online_logfiles,primary_roles) db_unique_name=tstdb1' scope=both;

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------- ---------------------------------------- -------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1 USE_DB_RECOVERY_FILE_DEST VALID OPTIONAL 300 0 0 0
LOG_ARCHIVE_DEST_2 tstdb1 VALID OPTIONAL 10 40 0 0

***故意制造prmy与stdby网络中断，中断后立即在prmy上执行update
select sysdate from dual; <---网络中断时间，同时也可以近似认为是下面发起commit的时间
SYSDATE
-----------------
20150910 16:54:56

update scott.t0908_1 set username='FFF' where username='EEE';

SQL> commit; <--- Hang住

Commit complete.

SQL> select sysdate from dual; <---commit 完成时间
SYSDATE
-----------------
20150910 16:55:37

可以看出Hang住的时间大约是41秒，符合net_timeout=40的设置

***这时prmy侧的nss进程尚未重启
ps -ef|grep nss
root 4457152 3801892 0 16:55:25 pts/8 0:00 grep nss
oracle 4326204 1 0 16:46:57 - 0:00 ora_nss2_tstdb1_stdby1

###观察lgwr的.trc文件能清晰的看出网络中断后Redo Tranport Service的应对过程：
*** 2015-09-10 16:55:02.486 <---网络中断后LGWR首次长时间未收到NSS的响应
LGWR found NSS2 alive.. waiting for msg

*** 2015-09-10 16:55:12.507
LGWR found NSS2 alive.. waiting for msg <---10秒后，LGWR仍未收到NSS的响应，这个10秒就是由reopen参数定义的

*** 2015-09-10 16:55:22.528
LGWR found NSS2 alive.. waiting for msg <---又过了10秒，LGWR仍未收到NSS的响应

*** 2015-09-10 16:55:32.000 <---又过了10秒，LGWR仍未收到NSS的响应，这时总的超时时间已经达到了net_timeout所定义的40秒
Current time <09/10/2015 16:55:32> exceeds begin time <09/10/2015 16:54:52> by <40> seconds. Return timedout error
*** 2015-09-10 16:55:32.000 4320 krsh.c
ORA-16198: LGWR received timedout error from KSR
*** 2015-09-10 16:55:32.000 4320 krsh.c
LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
*** 2015-09-10 16:55:32.000 4320 krsh.c
LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned <----放弃reconnect
ERROR: Receive timedout!
*** 2015-09-10 16:55:32.000 4320 krsh.c
Error 16198 for archive log file 1 to 'tstdb1'
LGWR had received a timeout previously. return timeout again
Error 16198 detaching RFS from standby instance at host 'tstdb1' <---中断stdby上的RFS与prmy上的NSS之间的连接
*** 2015-09-10 16:55:32.001 6904 krsu.c
Making upidhs request to NSS2 (ocis 0x11169e420). Begin time is <09/10/2015 16:54:52> and NET_TIMEOUT <40> seconds <---网络中断的精确时间是16:54:52
NetServer pid:4326204

*** 2015-09-10 16:55:36.001
*** 2015-09-10 16:55:36.001 8830 krsu.c
Cleaning up NSS2 [pid 4326204] because it failed to enter 'alive after detach' state
*** 2015-09-10 16:55:36.001 978 krsw.c
NSS2 [pid 4326204] receiving termination signal.. <---NSS进程被kill
.... killed successfully

*** 2015-09-10 16:55:37.001
.. pmon posted for async lns cleanup
*** 2015-09-10 16:55:37.001 7080 krsu.c
upidhs done status 0
krsl_reap_io: ignoring detach error 16198
*** 2015-09-10 16:55:37.001 2932 krsi.c <----LOG_ARCHIVE_DEST_2对应的destination被置为了failed
krsi_dst_fail: dest:2 err:16198 force:0 blast:1

*** 2015-09-10 16:55:37.015
Warning: log write elapsed time 44541ms, size 0KB <---44s才完成写入

*** 2015-09-10 16:55:37.151
*** 2015-09-10 16:55:37.151 2682 krsl.c
No standby database destinations have been configured
as being archived by the LGWR process
This instance will operate at a reduced protection mode until
network connectivity to the standby databases is restored and
all archivelog gaps have been resolved.

###从v$archive_dest可以看出LOG_ARCHIVE_DEST_2被置为fail的时间是16:55:37，与上面lgwr日志反映出来的时间一致
col dest_name format a25
col destination format a40
set linesize 180
select dest_name,destination,status,binding,reopen_secs,net_timeout,failure_count,fail_date,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1 USE_DB_RECOVERY_FILE_DEST VALID OPTIONAL 300 0 0 0
LOG_ARCHIVE_DEST_2 tstdb1 ERROR OPTIONAL 10 40 1 20150910 16:55:37 0

***路由恢复
SQL> SQL> select sysdate from dual; <---17:00:57路由恢复

SYSDATE
-----------------
20150910 17:00:57

***路由恢复后等待了近两分钟LOG_ARCHIVE_DEST_2 并没有变为VALID，需要有Transaction执行来触发一下LGWR
SQL> update scott.t0908_1 set username='GGG' where username='FFF';

1 row updated.

SQL> commit;

Commit complete.

***随即LGWR进程日志有了输出：
*** 2015-09-10 17:02:52.224
NSS2 is not running anymore.
Waiting for subscriber count on LGWR-NSS2 channel to go to zero
Subscriber count went to zero - time now is <09/10/2015 17:02:52>
*** 2015-09-10 17:02:52.224 7723 krsu.c
Starting NSS2 ...
Waiting for NSS2 [pid 5112496] to initialize itself <---重启NSS进程

*** 2015-09-10 17:02:55.301
*** 2015-09-10 17:02:55.301 7890 krsu.c
Netserver NSS2 [pid 5112496] for mode SYNC has been initialized
Performing a channel reset to ignore previous responses
Successfully started NSS2 [pid 5112496] for dest tstdb1 mode SYNC ocis=0x11169e420 <----LOG_ARCHIVE_DEST_2对应的destination重新恢复成VALID状态

*** 2015-09-10 17:02:55.511
*** 2015-09-10 17:02:55.511 3561 krsu.c
upiahm connect done status is 0

*** 2015-09-10 17:02:58.681
*** 2015-09-10 17:02:58.681 7119 krsu.c
Making upinblc request to NSS2 (ocis 0x11169e420). Begin time is <09/10/2015 17:02:58> and NET_TIMEOUT is <40> seconds
NetServer pid:5112496

###v$archive_dest里的failure_count、fail_date清零，以下过程可以看出先清fail_date，再清failure_count
select dest_name,destination,status,binding,reopen_secs,net_timeout,failure_count,fail_date,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';
DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1 USE_DB_RECOVERY_FILE_DEST VALID OPTIONAL 300 0 0 0
LOG_ARCHIVE_DEST_2 tstdb1 ERROR OPTIONAL 10 40 1 20150910 16:55:37 0

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1 USE_DB_RECOVERY_FILE_DEST VALID OPTIONAL 300 0 0 0
LOG_ARCHIVE_DEST_2 tstdb1 VALID OPTIONAL 10 40 1 0

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1 USE_DB_RECOVERY_FILE_DEST VALID OPTIONAL 300 0 0 0
LOG_ARCHIVE_DEST_2 tstdb1 VALID OPTIONAL 10 40 1 0

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1 USE_DB_RECOVERY_FILE_DEST VALID OPTIONAL 300 0 0 0
LOG_ARCHIVE_DEST_2 tstdb1 VALID OPTIONAL 10 40 1 0

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1 USE_DB_RECOVERY_FILE_DEST VALID OPTIONAL 300 0 0 0
LOG_ARCHIVE_DEST_2 tstdb1 VALID OPTIONAL 10 40 1 0

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1 USE_DB_RECOVERY_FILE_DEST VALID OPTIONAL 300 0 0 0
LOG_ARCHIVE_DEST_2 tstdb1 VALID OPTIONAL 10 40 1 0

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1 USE_DB_RECOVERY_FILE_DEST VALID OPTIONAL 300 0 0 0
LOG_ARCHIVE_DEST_2 tstdb1 VALID OPTIONAL 10 40 0 0

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1 USE_DB_RECOVERY_FILE_DEST VALID OPTIONAL 300 0 0 0
LOG_ARCHIVE_DEST_2 tstdb1 VALID OPTIONAL 10 40 0 0

###重新启动后的NSS
oracle@qcp570717a:/home/oracle>ps -ef|grep 5112496
oracle 3998006 2883966 0 17:41:21 pts/7 0:00 grep 5112496
oracle 5112496 1 0 17:02:52 - 0:00 ora_nss2_tstdb1_stdby1 <---重新启动后的NSS

///
//第二组测试：net_timeout + reopen的效果测试(在net_timeout时间内网络恢复正常)
///
***prmy设置reopen、net_timeout的参数同上
SQL> select db_unique_name,database_role,protection_mode,protection_level,open_mode,SWITCHOVER_STATUS from v$database;

DB_UNIQUE_NAME DATABASE_ROLE PROTECTION_MODE PROTECTION_LEVEL OPEN_MODE SWITCHOVER_STATUS
--------------- ---------------- -------------------- -------------------- -------------------- --------------------
tstdb1_stdby1 PRIMARY MAXIMUM AVAILABILITY MAXIMUM AVAILABILITY READ WRITE TO STANDBY

alter system set log_archive_dest_2='service=tstdb1 SYNC affirm reopen=10 net_timeout=40 valid_for=(online_logfiles,primary_roles) db_unique_name=tstdb1' scope=both;

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------- ---------------------------------------- -------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1 USE_DB_RECOVERY_FILE_DEST VALID OPTIONAL 300 0 0 0
LOG_ARCHIVE_DEST_2 tstdb1 VALID OPTIONAL 10 40 0 0

***故意制造prmy与stdby网络中断，中断后立即在prmy上执行update
select sysdate from dual; <---网络中断时间，同时也可以近似认为是下面发起commit的时间

SYSDATE
-----------------
20150913 05:13:49

update scott.t0908_1 set username='HHH' where username='GGG';

SQL> commit; <--- Hang住之后，在net_timeout定义的40秒时间内，在05:14:08恢复prmy、stdby间的网络通信

Commit complete.

SQL> select sysdate from dual; <---commit 完成时间

SYSDATE
-----------------
20150913 05:14:10

可以看出Hang住的时间大约是21秒，中断时长 < net_timeout

###这时lgwr的.trc文件能看出经历了两次10秒钟的判断后，远程log write恢复正常，期间没有发生NSS进程重启等动作：
*** 2015-09-13 05:13:58.368
LGWR found NSS2 alive.. waiting for msg

*** 2015-09-13 05:14:08.389
LGWR found NSS2 alive.. waiting for msg

*** 2015-09-13 05:14:10.667
Warning: log write elapsed time 22304ms, size 0KB

***nss进程还是保持上一次的重启时间
ps -ef|grep nss
oracle 14155968 4981746 0 05:16:17 pts/4 0:00 grep nss
oracle 5964580 1 0 05:10:18 - 0:00 ora_nss2_tstdb1_stdby1

###因为在net_timeout时间内网络恢复，所以LOG_ARCHIVE_DEST_2的状态一直是VALID，与上面lgwr日志反映出来的时间一致
col dest_name format a25
col destination format a40
set linesize 180
select dest_name,destination,status,binding,reopen_secs,net_timeout,failure_count,fail_date,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';
DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1 USE_DB_RECOVERY_FILE_DEST VALID OPTIONAL 300 0 0 0
LOG_ARCHIVE_DEST_2 tstdb1 VALID OPTIONAL 10 40 0 0

///
//第三组测试：net_timeout + reopen + max_failure的效果测试(未在net_timeout时间内网络恢复正常)
///
在原来的基础上加上了max_failure=3
alter system set log_archive_dest_2='service=tstdb1 SYNC affirm reopen=10 net_timeout=40 max_failure=3 valid_for=(online_logfiles,primary_roles) db_unique_name=tstdb1' scope=both;

col dest_name format a25
col destination format a40
set linesize 180
select dest_name,destination,status,binding,reopen_secs,net_timeout,failure_count,fail_date,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------- ---------------------------------------- -------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1 USE_DB_RECOVERY_FILE_DEST VALID OPTIONAL 300 0 0 0
LOG_ARCHIVE_DEST_2 tstdb1 VALID OPTIONAL 10 40 0 3

select sysdate from dual; <---网络中断时间，同时也可以近似认为是下面发起commit的时间
SYSDATE
-----------------
20150913 05:47:29

update scott.t0908_1 set username='III' where username='HHH';

SQL> commit; <--- Hang住

Commit complete.

SQL> select sysdate from dual; <---commit 完成时间
SYSDATE
-----------------
20150913 05:48:12

###观察lgwr的.trc文件能看出网络中断后Redo Tranport Service的应对过程：经历了两次40s的等待，NSS进程经历被kill->重启->kill的过程
LGWR found NSS2 alive.. waiting for msg

*** 2015-09-13 05:47:47.267
LGWR found NSS2 alive.. waiting for msg

*** 2015-09-13 05:47:57.288
LGWR found NSS2 alive.. waiting for msg

*** 2015-09-13 05:48:07.008
Current time <09/13/2015 05:48:07> exceeds begin time <09/13/2015 05:47:27> by <40> seconds. Return timedout error
*** 2015-09-13 05:48:07.008 4320 krsh.c
ORA-16198: LGWR received timedout error from KSR
*** 2015-09-13 05:48:07.008 4320 krsh.c
LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
*** 2015-09-13 05:48:07.008 4320 krsh.c
LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
ERROR: Receive timedout!
*** 2015-09-13 05:48:07.009 4320 krsh.c
Error 16198 for archive log file 6 to 'tstdb1'
LGWR had received a timeout previously. return timeout again
Error 16198 detaching RFS from standby instance at host 'tstdb1'
*** 2015-09-13 05:48:07.009 6904 krsu.c
Making upidhs request to NSS2 (ocis 0x11169e420). Begin time is <09/13/2015 05:47:27> and NET_TIMEOUT <40> seconds <--第一次timeout
NetServer pid:5964580

*** 2015-09-13 05:48:11.009
*** 2015-09-13 05:48:11.009 8830 krsu.c
Cleaning up NSS2 [pid 5964580] because it failed to enter 'alive after detach' state
*** 2015-09-13 05:48:11.009 978 krsw.c
NSS2 [pid 5964580] receiving termination signal..
.... killed successfully

*** 2015-09-13 05:48:12.009
.. pmon posted for async lns cleanup
*** 2015-09-13 05:48:12.010 7080 krsu.c
upidhs done status 0
krsl_reap_io: ignoring detach error 16198
*** 2015-09-13 05:48:12.010 2932 krsi.c
krsi_dst_fail: dest:2 err:16198 force:0 blast:1

*** 2015-09-13 05:48:12.014
Warning: log write elapsed time 44779ms, size 0KB
NSS2 is not running anymore.
Waiting for subscriber count on LGWR-NSS2 channel to go to zero
Subscriber count went to zero - time now is <09/13/2015 05:48:12>
*** 2015-09-13 05:48:12.045 7723 krsu.c
Starting NSS2 ...
Waiting for NSS2 [pid 2883910] to initialize itself <---NSS被重启

*** 2015-09-13 05:48:15.130
*** 2015-09-13 05:48:15.130 7890 krsu.c
Netserver NSS2 [pid 2883910] for mode SYNC has been initialized
Performing a channel reset to ignore previous responses
Successfully started NSS2 [pid 2883910] for dest tstdb1 mode SYNC ocis=0x11169e420

*** 2015-09-13 05:48:25.160
LGWR found NSS2 alive.. waiting for msg

*** 2015-09-13 05:48:35.181
LGWR found NSS2 alive.. waiting for msg

*** 2015-09-13 05:48:45.202
LGWR found NSS2 alive.. waiting for msg

*** 2015-09-13 05:48:52.000
Current time <09/13/2015 05:48:52> exceeds begin time <09/13/2015 05:48:12> by <40> seconds. Return timedout error
*** 2015-09-13 05:48:52.000 4320 krsh.c
ORA-16198: LGWR received timedout error from KSR
ERROR: timedout
*** 2015-09-13 05:48:52.000 6904 krsu.c
Making upidhs request to NSS2 (ocis 0x11169e420). Begin time is <09/13/2015 05:48:12> and NET_TIMEOUT <40> seconds <--第二次timeout
NetServer pid:2883910

*** 2015-09-13 05:48:56.000
*** 2015-09-13 05:48:56.000 8830 krsu.c
Cleaning up NSS2 [pid 2883910] because it failed to enter 'alive after detach' state
*** 2015-09-13 05:48:56.000 978 krsw.c
NSS2 [pid 2883910] receiving termination signal..
.... killed successfully <---NSS被第二次kill

*** 2015-09-13 05:48:57.001
.. pmon posted for async lns cleanup
*** 2015-09-13 05:48:57.001 7080 krsu.c
upidhs done status 0
*** 2015-09-13 05:48:57.001 915 krsu.c
Error 16198 connecting to destination LOG_ARCHIVE_DEST_2 standby host 'tstdb1'
*** 2015-09-13 05:48:57.001 6904 krsu.c
No LNS exists (ocis 0x11169e420) that requires a disconnect
Error 16198 attaching to destination LOG_ARCHIVE_DEST_2 standby host 'tstdb1'
*** 2015-09-13 05:48:57.001 4320 krsh.c
Error 16198 for archive log file 2 to 'tstdb1'
*** 2015-09-13 05:48:57.001 6904 krsu.c
No LNS exists (ocis 0x11169e420) that requires a disconnect
*** 2015-09-13 05:48:57.001 2932 krsi.c
krsi_dst_fail: dest:2 err:16198 force:0 blast:1

###从v$archive_dest可以看出LOG_ARCHIVE_DEST_2被置为fail的时间是05:48:57，与上面lgwr日志反映出来的时间一致，failure_count=1
col dest_name format a25
col destination format a40
set linesize 180
select dest_name,destination,status,binding,reopen_secs,net_timeout,failure_count,fail_date,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';
DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1 USE_DB_RECOVERY_FILE_DEST VALID OPTIONAL 300 0 0 0
LOG_ARCHIVE_DEST_2 tstdb1 ERROR OPTIONAL 10 40 1 20150913 05:48:57 3

***再执行若干次switch logfile
alter system switch logfile;
alter system switch logfile;
alter system switch logfile;

同时观察LGWR进程新的输出，并没有重连的信息：
*** 2015-09-13 05:57:42.850
*** 2015-09-13 05:57:42.850 2682 krsl.c
No standby database destinations have been configured
as being archived by the LGWR process
This instance will operate at a reduced protection mode until
network connectivity to the standby databases is restored and
all archivelog gaps have been resolved.

*** 2015-09-13 05:59:16.776
*** 2015-09-13 05:59:16.776 2682 krsl.c
No standby database destinations have been configured
as being archived by the LGWR process
This instance will operate at a reduced protection mode until
network connectivity to the standby databases is restored and
all archivelog gaps have been resolved.

*** 2015-09-13 05:59:31.318
*** 2015-09-13 05:59:31.318 2682 krsl.c
No standby database destinations have been configured
as being archived by the LGWR process
This instance will operate at a reduced protection mode until
network connectivity to the standby databases is restored and
all archivelog gaps have been resolved.

***v$archive_dest里的failure_count始终保持1，没有继续往上涨
col dest_name format a25
col destination format a40
set linesize 180
select dest_name,destination,status,binding,reopen_secs,net_timeout,failure_count,fail_date,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';
DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1 USE_DB_RECOVERY_FILE_DEST VALID OPTIONAL 300 0 0 0
LOG_ARCHIVE_DEST_2 tstdb1 ERROR OPTIONAL 10 40 1 20150913 05:48:57 3

***06:02:40网络恢复，但是LGWR日志却到了06:03:33才开始恢复Redo Transport Service
NSS2 is not running anymore.

*** 2015-09-13 06:03:33.062
Waiting for subscriber count on LGWR-NSS2 channel to go to zero
Subscriber count went to zero - time now is <09/13/2015 06:03:33>
*** 2015-09-13 06:03:33.062 7723 krsu.c
Starting NSS2 ...
Waiting for NSS2 [pid 5374506] to initialize itself

*** 2015-09-13 06:03:36.143
*** 2015-09-13 06:03:36.142 7890 krsu.c
Netserver NSS2 [pid 5374506] for mode SYNC has been initialized
Performing a channel reset to ignore previous responses
Successfully started NSS2 [pid 5374506] for dest tstdb1 mode SYNC ocis=0x11169e420

*** 2015-09-13 06:03:36.323
*** 2015-09-13 06:03:36.323 3561 krsu.c
upiahm connect done status is 0

*** 2015-09-13 06:03:42.508
*** 2015-09-13 06:03:42.508 7119 krsu.c
Making upinblc request to NSS2 (ocis 0x11169e420). Begin time is <09/13/2015 06:03:42> and NET_TIMEOUT is <40> seconds
NetServer pid:5374506

***failure_count重置为0
col dest_name format a25
col destination format a40
set linesize 180
select dest_name,destination,status,binding,reopen_secs,net_timeout,failure_count,fail_date,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';
DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------- ---------------------------------------- -------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1 USE_DB_RECOVERY_FILE_DEST VALID OPTIONAL 300 0 0 0
LOG_ARCHIVE_DEST_2 tstdb1 VALID OPTIONAL 10 40 0 3

至此发现max_failure=3的设置并没有实际意义，因为failure_count=1之后，只要网络不恢复，NSS进程就不会去连第2、3次，而当网络恢复后，failure_count就置0了，因此failure_count永远不会达到3

仍然是上面的例子我们把max_failure设置为1，效果就不一样了
alter system set log_archive_dest_2='service=tstdb1 SYNC affirm reopen=10 net_timeout=40 max_failure=1 valid_for=(online_logfiles,primary_roles) db_unique_name=tstdb1' scope=both;

DEST_NAME DESTINATION STATUS BINDING REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE MAX_FAILURE
------------------------- ---------------------------------------- -------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1 USE_DB_RECOVERY_FILE_DEST VALID OPTIONAL 300 0 0 0
LOG_ARCHIVE_DEST_2 tstdb1 VALID OPTIONAL 10 40 0 1

这里我们只通过LGWR的日志来解读网络中断->网络恢复->redo transport service恢复的过程
<<< 网络中断后的第一次等待>>>
*** 2015-09-13 06:11:12.040
LGWR found NSS2 alive.. waiting for msg

*** 2015-09-13 06:11:22.061
LGWR found NSS2 alive.. waiting for msg

*** 2015-09-13 06:11:32.082
LGWR found NSS2 alive.. waiting for msg

*** 2015-09-13 06:11:42.002
Current time <09/13/2015 06:11:42> exceeds begin time <09/13/2015 06:11:02> by <40> seconds. Return timedout error
*** 2015-09-13 06:11:42.003 4320 krsh.c
ORA-16198: LGWR received timedout error from KSR
*** 2015-09-13 06:11:42.003 4320 krsh.c
LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
*** 2015-09-13 06:11:42.003 4320 krsh.c
LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
ERROR: Receive timedout!
*** 2015-09-13 06:11:42.003 4320 krsh.c
Error 16198 for archive log file 1 to 'tstdb1'
LGWR had received a timeout previously. return timeout again
Error 16198 detaching RFS from standby instance at host 'tstdb1'
*** 2015-09-13 06:11:42.003 6904 krsu.c
Making upidhs request to NSS2 (ocis 0x11169e420). Begin time is <09/13/2015 06:11:02> and NET_TIMEOUT <40> seconds
NetServer pid:5374506

<<< NSS 第一次被kill >>>
*** 2015-09-13 06:11:46.004
*** 2015-09-13 06:11:46.004 8830 krsu.c
Cleaning up NSS2 [pid 5374506] because it failed to enter 'alive after detach' state
*** 2015-09-13 06:11:46.004 978 krsw.c
NSS2 [pid 5374506] receiving termination signal..
.... killed successfully

*** 2015-09-13 06:11:47.004
.. pmon posted for async lns cleanup
*** 2015-09-13 06:11:47.004 7080 krsu.c
upidhs done status 0
krsl_reap_io: ignoring detach error 16198
*** 2015-09-13 06:11:47.004 2932 krsi.c
krsi_dst_fail: dest:2 err:16198 force:0 blast:1

<<< NSS 重启 >>>
*** 2015-09-13 06:11:47.010
Warning: log write elapsed time 44986ms, size 17KB
NSS2 is not running anymore.
Waiting for subscriber count on LGWR-NSS2 channel to go to zero
Subscriber count went to zero - time now is <09/13/2015 06:11:47>
*** 2015-09-13 06:11:47.040 7723 krsu.c
Starting NSS2 ...
Waiting for NSS2 [pid 2884044] to initialize itself

*** 2015-09-13 06:11:50.118
*** 2015-09-13 06:11:50.118 7890 krsu.c
Netserver NSS2 [pid 2884044] for mode SYNC has been initialized
Performing a channel reset to ignore previous responses
Successfully started NSS2 [pid 2884044] for dest tstdb1 mode SYNC ocis=0x11169e420

<<< 网络中断后的第二次等待开始：LGWR等NSS >>>
*** 2015-09-13 06:12:00.130
LGWR found NSS2 alive.. waiting for msg

*** 2015-09-13 06:12:10.146
LGWR found NSS2 alive.. waiting for msg

*** 2015-09-13 06:12:20.167
LGWR found NSS2 alive.. waiting for msg

*** 2015-09-13 06:12:27.004
Current time <09/13/2015 06:12:27> exceeds begin time <09/13/2015 06:11:47> by <40> seconds. Return timedout error
*** 2015-09-13 06:12:27.005 4320 krsh.c
ORA-16198: LGWR received timedout error from KSR
ERROR: timedout
*** 2015-09-13 06:12:27.005 6904 krsu.c
Making upidhs request to NSS2 (ocis 0x11169e420). Begin time is <09/13/2015 06:11:47> and NET_TIMEOUT <40> seconds
NetServer pid:2884044

<<< NSS 第二次被kill >>>
*** 2015-09-13 06:12:31.005
*** 2015-09-13 06:12:31.005 8830 krsu.c
Cleaning up NSS2 [pid 2884044] because it failed to enter 'alive after detach' state
*** 2015-09-13 06:12:31.005 978 krsw.c
NSS2 [pid 2884044] receiving termination signal..
.... killed successfully

<<< 远端归档目录被置为ERROR >>>
*** 2015-09-13 06:12:32.005
.. pmon posted for async lns cleanup
*** 2015-09-13 06:12:32.005 7080 krsu.c
upidhs done status 0
*** 2015-09-13 06:12:32.005 915 krsu.c
Error 16198 connecting to destination LOG_ARCHIVE_DEST_2 standby host 'tstdb1'
*** 2015-09-13 06:12:32.005 6904 krsu.c
No LNS exists (ocis 0x11169e420) that requires a disconnect
Error 16198 attaching to destination LOG_ARCHIVE_DEST_2 standby host 'tstdb1'
*** 2015-09-13 06:12:32.006 4320 krsh.c
Error 16198 for archive log file 4 to 'tstdb1'
*** 2015-09-13 06:12:32.006 6904 krsu.c
No LNS exists (ocis 0x11169e420) that requires a disconnect
*** 2015-09-13 06:12:32.006 2932 krsi.c
krsi_dst_fail: dest:2 err:16198 force:0 blast:1

<<< 网络恢复后在prmy上switch logfile，但redo Transport一直未恢复>>>
*** 2015-09-13 06:13:01.616
*** 2015-09-13 06:13:01.616 2682 krsl.c
No standby database destinations have been configured
as being archived by the LGWR process
This instance will operate at a reduced protection mode until
network connectivity to the standby databases is restored and
all archivelog gaps have been resolved.

*** 2015-09-13 06:14:12.220
*** 2015-09-13 06:14:12.220 2682 krsl.c
No standby database destinations have been configured
as being archived by the LGWR process
This instance will operate at a reduced protection mode until
network connectivity to the standby databases is restored and
all archivelog gaps have been resolved.

*** 2015-09-13 06:16:35.440
*** 2015-09-13 06:16:35.440 2682 krsl.c
No standby database destinations have been configured
as being archived by the LGWR process
This instance will operate at a reduced protection mode until
network connectivity to the standby databases is restored and
all archivelog gaps have been resolved.

<<< 直到手工重新设置了LOG_ARCHIVE_DEST_2: "alter system set log_archive_dest_2='service=tstdb1 SYNC affirm reopen=10 net_timeout=40 max_failure=3 valid_for=(online_logfiles,primary_roles) db_unique_name=tstdb1' scope=both;">>>
*** 2015-09-13 06:17:37.716
NSS2 is not running anymore.
Waiting for subscriber count on LGWR-NSS2 channel to go to zero
Subscriber count went to zero - time now is <09/13/2015 06:17:37>
*** 2015-09-13 06:17:37.716 7723 krsu.c
Starting NSS2 ...

*** 2015-09-13 06:17:37.769
Waiting for NSS2 [pid 17105102] to initialize itself

*** 2015-09-13 06:17:40.782
*** 2015-09-13 06:17:40.782 7890 krsu.c
Netserver NSS2 [pid 17105102] for mode SYNC has been initialized
Performing a channel reset to ignore previous responses
Successfully started NSS2 [pid 17105102] for dest tstdb1 mode SYNC ocis=0x11169e420
*** 2015-09-13 06:17:41.046 3561 krsu.c
upiahm connect done status is 0

由此看出在DG环境中网络中断引起的本地Redo无法传输到远程的情况下，max_failure参数如果设定为>1并没有实际意义，因为只要网络没有恢复，LGWR就不会去进行第2次的连接尝试，等到网络恢复failure_count又会被复位成0。因此只有在max_failure=1的情况下，当出现一次网络中断后，远程归档路径才会被禁用，即便之后网络恢复正常了，也需要人工的重置log_archive_dest_n来恢复Redo向远端的传输

来自 “ ITPUB博客 ” ，链接：http://blog.itpub.net/53956/viewspace-1798959/，如需转载，请注明出处，否则将追究法律责任。

转载于:http://blog.itpub.net/53956/viewspace-1798959/

cuiqiong6888

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
log_archive_dest_n参数中reopen、max_failure、net_timeout三个属性的组合效果测试

log_archive_dest_n参数中有三个属性都是用来指导在目标归档路径故障的情况下，归档过程应如何继续下去，这三个属性是：reopen：能用于DG或者非DG的环境下，在非DG环境下当出现归档目录满或者因权限问...
复制链接

扫一扫