1. Restart the primary and observe the watcher (dmwatcher) state changes
[monitor] 2024-07-27 14:58:50: dwmon_tcp_recv failed, close port, vio:0, mid:1722055233, errno:104, code:-6007
[monitor] 2024-07-27 14:58:50: dwmon tcp port vio(0) close, inst_name:SSPUDB1, ip:192.168.1.20, port:15238, n_fixed:1.
[SSPUDB2] 2024-07-27 14:58:50.970 [INFO] dmwatcher P0000018005 T0000000000000018009 没有收到远程守护进程(SSPUDB1)消息,原状态为(OPEN),距进程为ERROR状态
[SSPUDB2] 2024-07-27 14:58:51.021 [INFO] dmwatcher P0000018005 T0000000000000018009 Instance: 守护进程状态(ERROR) 实例状态(OK) 实例名(SSPUDB1) 模式(PRIMARY) 实例状态(OPEN) 归档状态(VALID) POCNT(5) FLSN(42270) CLSN(42270) SLSN(42270) SSLSN(42270)
[monitor] 2024-07-27 14:59:00: <RECEIVE TIMEOUT SSPUDB1>
[monitor] 2024-07-27 14:59:00: 接收守护进程(SSPUDB1)消息超时
WTIME WSTATUS INST_OK INAME ISTATUS IMODE RSTAT N_OPEN FLSN CLSN
2024-07-27 14:58:50 ERROR OK SSPUDB1 OPEN PRIMARY VALID 5 42270 42270
[monitor] 2024-07-27 14:59:00: </RECEIVE TIMEOUT SSPUDB1>
[monitor] 2024-07-27 14:59:00: [!!! 实例SSPUDB1[PRIMARY, OPEN, ISTAT_SAME:TRUE]故障,实例SSPUDB2[STANDBY, OPEN, ISTAT_SAME:TRUE]符合自动接管条件 !!!]
[monitor] 2024-07-27 14:59:00: 检测到PRIMARY实例故障,开始对组(GRP1)执行自动接管
[monitor] 2024-07-27 14:59:00: <AUTO TAKEOVER SSPUDB2>
[monitor] 2024-07-27 14:59:00: 通知组(GRP1)当前活动的守护进程设置MID
[monitor] 2024-07-27 14:59:00: Begin to wait site(SSPUDB2) complete...
[monitor] 2024-07-27 14:59:00: [!!! dwmon_tcp_msg_send, get master tcp port failed, send cmd msg(cmd:5, name_sendto:GRP1) to dmwatcher() !!!]
[monitor] 2024-07-27 14:59:01: Wait site(SSPUDB2) finished, code=0!
[monitor] 2024-07-27 14:59:01: 通知组(GRP1)当前活动的守护进程设置MID成功
[monitor] 2024-07-27 14:59:01: 开始使用实例SSPUDB2接管
[monitor] 2024-07-27 14:59:01: 通知守护进程SSPUDB2切换TAKEOVER状态
[monitor] 2024-07-27 14:59:01: 守护进程(SSPUDB2)状态切换 [OPEN-->TAKEOVER]
[monitor] 2024-07-27 14:59:02: 切换守护进程SSPUDB2为TAKEOVER状态成功
[monitor] 2024-07-27 14:59:02: 实例SSPUDB2开始执行SP_SET_GLOBAL_DW_STATUS(0, 7)语句
[monitor] 2024-07-27 14:59:02: 实例SSPUDB2执行SP_SET_GLOBAL_DW_STATUS(0, 7)语句成功
[monitor] 2024-07-27 14:59:02: 实例SSPUDB2开始执行SP_APPLY_KEEP_PKG()语句
[monitor] 2024-07-27 14:59:02: 实例SSPUDB2执行SP_APPLY_KEEP_PKG()语句成功
[monitor] 2024-07-27 14:59:02: 实例SSPUDB2开始执行ALTER DATABASE MOUNT语句
[monitor] 2024-07-27 14:59:02: Begin to wait site(SSPUDB2) complete...
[monitor] 2024-07-27 14:59:02: Wait site(SSPUDB2) finished, code=0!
[monitor] 2024-07-27 14:59:02: 实例SSPUDB2执行ALTER DATABASE MOUNT语句成功
[monitor] 2024-07-27 14:59:02: 实例SSPUDB2开始执行ALTER DATABASE PRIMARY语句
[monitor] 2024-07-27 14:59:02: 实例SSPUDB2执行ALTER DATABASE PRIMARY语句成功
[monitor] 2024-07-27 14:59:02: 通知实例SSPUDB2修改所有归档状态无效
[monitor] 2024-07-27 14:59:02: 修改所有实例归档为无效状态成功
[monitor] 2024-07-27 14:59:02: 实例SSPUDB2开始执行ALTER DATABASE OPEN FORCE语句
[monitor] 2024-07-27 14:59:02: ohis_inst_info_copy_low, inst(SSPUDB2) apply info changed, old info[p_db_magic:1684231299, n_apply_ep:1], new info to set[p_db_magic:1771613664, n_apply_ep:0]!
[monitor] 2024-07-27 14:59:02: 实例SSPUDB2执行ALTER DATABASE OPEN FORCE语句成功
[monitor] 2024-07-27 14:59:02: 实例SSPUDB2开始执行SP_SET_GLOBAL_DW_STATUS(7, 0)语句
[monitor] 2024-07-27 14:59:02: 实例SSPUDB2执行SP_SET_GLOBAL_DW_STATUS(7, 0)语句成功
[monitor] 2024-07-27 14:59:02: 通知守护进程SSPUDB2切换OPEN状态
[monitor] 2024-07-27 14:59:02: 守护进程(SSPUDB2)状态切换 [TAKEOVER-->OPEN]
[monitor] 2024-07-27 14:59:03: 切换守护进程SSPUDB2为OPEN状态成功
[monitor] 2024-07-27 14:59:03: 通知组(GRP1)的守护进程执行清理操作
[monitor] 2024-07-27 14:59:03: Notify instance(SSPUDB2) to clear monitor info and wait complete!
[monitor] 2024-07-27 14:59:03: Begin to wait site(SSPUDB2) complete...
[monitor] 2024-07-27 14:59:03: dwmon_cmd_msg_send_low failed, get tcp_port failed(inst_name:, ip:192.168.1.20, port:15238)!
[monitor] 2024-07-27 14:59:03: [!!! dwmon_tcp_msg_send to master tcp_port failed, code:-6010, (inst_name:, ip:192.168.1.20, port:15238, vio:0) !!!]
[monitor] 2024-07-27 14:59:04: 清理守护进程(SSPUDB2)请求成功
[monitor] 2024-07-27 14:59:04: Wait site(SSPUDB2) finished, code=0!
[monitor] 2024-07-27 14:59:04: 使用实例SSPUDB2接管成功
[monitor] 2024-07-27 14:59:04: </AUTO TAKEOVER SSPUDB2>
[monitor] 2024-07-27 14:59:04: 组(GRP1)使用实例SSPUDB2自动接管成功
[monitor] 2024-07-27 14:59:16: dmmonitor create link to dmwatcher success, mid:1722055233, dmwatcher ip:192.168.1.20, dmwatcher port:15238, vio:3, inst_name:SSPUDB1
[monitor] 2024-07-27 14:59:16: <MON CHECK SSPUDB1>
[monitor] 2024-07-27 14:59:16: 守护进程(SSPUDB1)状态切换 [NONE-->STARTUP]
[monitor] 2024-07-27 14:59:16: </MON CHECK SSPUDB1>
[monitor] 2024-07-27 14:59:16: [!!! 组(GRP1)中存在多个PRIMARY&OPEN实例,不符合自动接管条件 !!!]
[monitor] 2024-07-27 14:59:16: ohis_inst_info_copy_low, inst(SSPUDB1) apply info changed, old info[p_db_magic:1684231299, n_apply_ep:0], new info to set[p_db_magic:0, n_apply_ep:0]!
[SSPUDB2] 2024-07-27 14:59:17.573 [INFO] dmwatcher P0000018005 T0000000000000035510 Instance: 守护进程状态(ERROR) 实例状态(OK) 实例名(SSPUDB1) 模式(PRIMARY) 实例状态(OPEN) 归档状态(INVALID) POCNT(5) FLSN(42270) CLSN(42270) SLSN(42270) SSLSN(42270)
[SSPUDB1] 2024-07-27 14:59:17.587 [INFO] dmwatcher P0000001318 T0000000000000001453 接收到远程守护进程广播消息,实例状态为:
[SSPUDB1] 2024-07-27 14:59:17.587 [INFO] dmwatcher P0000001318 T0000000000000001453 Instance: 守护进程状态(OPEN) 实例状态(OK) 实例名(SSPUDB2) 模式(PRIMARY) 实例状态(OPEN) 归档状态(INVALID) POCNT(6) FLSN(42639) CLSN(42639) SLSN(42639) SSLSN(42639)
[SSPUDB2] 2024-07-27 14:59:25.577 [INFO] dmwatcher P0000018005 T0000000000000035510 远程实例的模式、状态或者归档状态发生变化,原状态是:
[monitor] 2024-07-27 14:59:26: ohis_inst_info_copy_low, inst(SSPUDB1) apply info changed, old info[p_db_magic:0, n_apply_ep:0], new info to set[p_db_magic:1684231299, n_apply_ep:0]!
[SSPUDB1] 2024-07-27 14:59:26.594 [INFO] dmwatcher P0000001318 T0000000000000002081 服务器端(SSPUDB1)公钥发生变化,广播新值给监视器
[SSPUDB2] 2024-07-27 14:59:26.579 [INFO] dmwatcher P0000018005 T0000000000000035510 远程实例的模式、状态或者归档状态发生变化,原状态是:
[monitor] 2024-07-27 14:59:26: ohis_inst_info_copy_low, inst(SSPUDB1) apply info changed, old info[p_db_magic:1684231299, n_apply_ep:0], new info to set[p_db_magic:1684231299, n_apply_ep:1]!
[SSPUDB1] 2024-07-27 14:59:26.750 [INFO] dmwatcher P0000001318 T0000000000000001371 设置GRP1守护进程为STARTUP(SUB:STARTUP)状态
[SSPUDB2] 2024-07-27 14:59:26.732 [INFO] dmwatcher P0000018005 T0000000000000035510 远程实例的模式、状态或者归档状态发生变化,原状态是:
[SSPUDB2] 2024-07-27 14:59:26.833 [INFO] dmwatcher P0000018005 T0000000000000035510 远程实例的模式、状态或者归档状态发生变化,新状态是:
[monitor] 2024-07-27 14:59:26: <MON CHECK SSPUDB1>
[monitor] 2024-07-27 14:59:26: 守护进程(SSPUDB1)状态切换 [STARTUP-->UNIFY EP]
WTIME WSTATUS INST_OK INAME ISTATUS IMODE RSTAT N_OPEN FLSN CLSN
2024-07-27 14:59:26 UNIFY EP OK SSPUDB1 MOUNT STANDBY INVALID 5 42270 42270
[monitor] 2024-07-27 14:59:26: </MON CHECK SSPUDB1>
[SSPUDB1] 2024-07-27 14:59:26.955 [INFO] dmwatcher P0000001318 T0000000000000001371 设置GRP1守护进程为UNIFY EP(SUB:STARTUP)状态
[SSPUDB2] 2024-07-27 14:59:26.934 [INFO] dmwatcher P0000018005 T0000000000000035510 远程实例的模式、状态或者归档状态发生变化,原状态是:
[SSPUDB2] 2024-07-27 14:59:27.035 [INFO] dmwatcher P0000018005 T0000000000000035510 远程实例的模式、状态或者归档状态发生变化,新状态是:
[monitor] 2024-07-27 14:59:27: <MON CHECK SSPUDB1>
[monitor] 2024-07-27 14:59:27: 守护进程(SSPUDB1)状态切换 [UNIFY EP-->STARTUP]
WTIME WSTATUS INST_OK INAME ISTATUS IMODE RSTAT N_OPEN FLSN CLSN
2024-07-27 14:59:27 STARTUP OK SSPUDB1 OPEN STANDBY INVALID 5 42270 42270
[monitor] 2024-07-27 14:59:27: </MON CHECK SSPUDB1>
[SSPUDB1] 2024-07-27 14:59:27.164 [INFO] dmwatcher P0000001318 T0000000000000001371 设置GRP1守护进程为STARTUP(SUB:STARTUP)状态
[SSPUDB2] 2024-07-27 14:59:27.136 [INFO] dmwatcher P0000018005 T0000000000000035510 远程实例的模式、状态或者归档状态发生变化,原状态是:
[SSPUDB2] 2024-07-27 14:59:27.237 [INFO] dmwatcher P0000018005 T0000000000000035510 远程实例的模式、状态或者归档状态发生变化,新状态是:
[monitor] 2024-07-27 14:59:27: <MON CHECK SSPUDB1>
[monitor] 2024-07-27 14:59:27: 守护进程(SSPUDB1)状态切换 [STARTUP-->OPEN]
WTIME WSTATUS INST_OK INAME ISTATUS IMODE RSTAT N_OPEN FLSN CLSN
2024-07-27 14:59:27 OPEN OK SSPUDB1 OPEN STANDBY INVALID 5 42270 42270
[monitor] 2024-07-27 14:59:27: </MON CHECK SSPUDB1>
[SSPUDB1] 2024-07-27 14:59:27.354 [INFO] dmwatcher P0000001318 T0000000000000001371 设置GRP1守护进程为OPEN(SUB:STARTUP)状态
[SSPUDB2] 2024-07-27 14:59:27.338 [INFO] dmwatcher P0000018005 T0000000000000035510 远程实例的模式、状态或者归档状态发生变化,原状态是:
[SSPUDB2] 2024-07-27 14:59:27.439 [INFO] dmwatcher P0000018005 T0000000000000035510 远程实例的模式、状态或者归档状态发生变化,新状态是:
[SSPUDB2] 2024-07-27 14:59:27.540 [INFO] dmwatcher P0000018005 T0000000000000035510 远程实例的模式、状态或者归档状态发生变化,原状态是:
[SSPUDB2] 2024-07-27 14:59:27.641 [INFO] dmwatcher P0000018005 T0000000000000035510 远程实例的模式、状态或者归档状态发生变化,新状态是:
[SSPUDB1] 2024-07-27 14:59:28.785 [INFO] dmwatcher P0000001318 T0000000000000001453 远程实例的模式、状态或者归档状态发生变化,原状态是:
[SSPUDB1] 2024-07-27 14:59:28.785 [INFO] dmwatcher P0000001318 T0000000000000001453 Instance: 守护进程状态(OPEN) 实例状态(OK) 实例名(SSPUDB2) 模式(PRIMARY) 实例状态(OPEN) 归档状态(INVALID) POCNT(6) FLSN(42642) CLSN(42643) SLSN(42643) SSLSN(42643)
[SSPUDB1] 2024-07-27 14:59:28.840 [INFO] dmwatcher P0000001318 T0000000000000001453 远程实例的模式、状态或者归档状态发生变化,新状态是:
[monitor] 2024-07-27 14:59:29: <MON CHECK SSPUDB2>
[monitor] 2024-07-27 14:59:29: 守护进程(SSPUDB2)状态切换 [OPEN-->RECOVERY]
WTIME WSTATUS INST_OK INAME ISTATUS IMODE RSTAT N_OPEN FLSN CLSN
2024-07-27 14:59:29 RECOVERY OK SSPUDB2 OPEN PRIMARY VALID 6 42643 42643
[monitor] 2024-07-27 14:59:29: </MON CHECK SSPUDB2>
[SSPUDB2] 2024-07-27 14:59:29.929 [INFO] dmwatcher P0000018005 T0000000000000018009 检测到实例(SSPUDB1)可恢复,执行恢复流程
[SSPUDB2] 2024-07-27 14:59:29.929 [INFO] dmwatcher P0000018005 T0000000000000018009 开始向实例(SSPUDB1)发送归档日志
[monitor] 2024-07-27 14:59:30: ohis_inst_info_copy_low, inst(SSPUDB1) apply info changed, old info[p_db_magic:1684231299, n_apply_ep:1], new info to set[p_db_magic:1771613664, n_apply_ep:1]!
[SSPUDB2] 2024-07-27 14:59:30.951 [INFO] dmwatcher P0000018005 T0000000000000018009 检测到实例(SSPUDB1)发送归档成功,设置为当前恢复实例
[SSPUDB2] 2024-07-27 14:59:30.951 [INFO] dmwatcher P0000018005 T0000000000000018009 向实例(SSPUDB1)发送归档日志成功,实例(SSPUDB2)转入suspend状态
[SSPUDB2] 2024-07-27 14:59:31.185 [INFO] dmwatcher P0000018005 T0000000000000018009 发送归档完毕,设置实例(SSPUDB1)归档有效
[SSPUDB2] 2024-07-27 14:59:31.941 [INFO] dmwatcher P0000018005 T0000000000000018009 不存在可恢复备库
[monitor] 2024-07-27 14:59:32: <MON CHECK SSPUDB2>
[monitor] 2024-07-27 14:59:32: 守护进程(SSPUDB2)状态切换 [RECOVERY-->OPEN]
WTIME WSTATUS INST_OK INAME ISTATUS IMODE RSTAT N_OPEN FLSN CLSN
2024-07-27 14:59:32 OPEN OK SSPUDB2 OPEN PRIMARY VALID 6 42644 42644
[monitor] 2024-07-27 14:59:32: </MON CHECK SSPUDB2>
[SSPUDB2] 2024-07-27 14:59:32.191 [INFO] dmwatcher P0000018005 T0000000000000018009 设置GRP1守护进程为OPEN(SUB:STARTUP)状态
[SSPUDB1] 2024-07-27 14:59:32.207 [INFO] dmwatcher P0000001318 T0000000000000001453 远程实例的模式、状态或者归档状态发生变化,原状态是:
[SSPUDB1] 2024-07-27 14:59:32.258 [INFO] dmwatcher P0000001318 T0000000000000001453 远程实例的模式、状态或者归档状态发生变化,新状态是:
[monitor] 2024-07-27 14:59:36:
GROUP OGUID MON_CONFIRM MODE MPP_FLAG
GRP1 453331 TRUE AUTO FALSE
<<DATABASE GLOBAL INFO:>>
DW_IP MAL_DW_PORT WTIME WTYPE WCTLSTAT WSTATUS INAME INST_OK N_EP N_OK ISTATUS IMODE DSC_STATUS RTYPE RSTAT
192.168.1.21 25238 2024-07-27 14:59:35 GLOBAL VALID OPEN SSPUDB2 OK 1 1 OPEN PRIMARY DSC_OPEN REALTIME VALID
EP INFO:
INST_IP INST_PORT INST_OK INAME ISTATUS IMODE DSC_SEQNO DSC_CTL_NODE RTYPE RSTAT FSEQ FLSN CSEQ CLSN DW_STAT_FLAG
192.168.1.21 25236 OK SSPUDB2 OPEN PRIMARY 0 0 REALTIME VALID 8884 42644 8885 42645 NONE
<<DATABASE GLOBAL INFO:>>
DW_IP MAL_DW_PORT WTIME WTYPE WCTLSTAT WSTATUS INAME INST_OK N_EP N_OK ISTATUS IMODE DSC_STATUS RTYPE RSTAT
192.168.1.20 15238 2024-07-27 14:59:35 GLOBAL VALID OPEN SSPUDB1 OK 1 1 OPEN STANDBY DSC_OPEN REALTIME VALID
EP INFO:
INST_IP INST_PORT INST_OK INAME ISTATUS IMODE DSC_SEQNO DSC_CTL_NODE RTYPE RSTAT FSEQ FLSN CSEQ CLSN DW_STAT_FLAG
192.168.1.20 15236 OK SSPUDB1 OPEN STANDBY 0 0 REALTIME VALID 8872 42644 8872 42644 NONE
DATABASE(SSPUDB1) APPLY INFO FROM (SSPUDB2), REDOS_PARALLEL_NUM (1):
DSC_SEQNO[0], (RSEQ, SSEQ, KSEQ)[8884, 8884, 8885], (RLSN, SLSN, KLSN)[42644, 42644, 42645], N_TSK[0], TSK_MEM_USE[512]
REDO_LSN_ARR: (42644)
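The watcher state changes are easy to lose among the other monitor messages. The following is a minimal Python sketch, not part of the original test, that pulls only the `守护进程(...)状态切换 [A-->B]` lines out of a monitor log. The function name `watcher_transitions` and the regular expression are assumptions based solely on the line format shown above.

```python
import re

# Matches monitor lines such as:
#   [monitor] 2024-07-27 14:59:01: 守护进程(SSPUDB2)状态切换 [OPEN-->TAKEOVER]
# The pattern is inferred from the log excerpt above, not from any DM specification.
TRANSITION = re.compile(
    r"^\[monitor\] (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): "
    r"守护进程[(（](?P<inst>\w+)[)）]状态切换 \[(?P<old>.+?)-->(?P<new>.+?)\]"
)


def watcher_transitions(lines):
    """Return (timestamp, instance, old_state, new_state) for each transition line."""
    result = []
    for line in lines:
        match = TRANSITION.match(line.strip())
        if match:
            result.append(
                (match["ts"], match["inst"], match["old"], match["new"])
            )
    return result


if __name__ == "__main__":
    sample = [
        "[monitor] 2024-07-27 14:59:01: 守护进程(SSPUDB2)状态切换 [OPEN-->TAKEOVER]",
        "[monitor] 2024-07-27 14:59:02: 实例SSPUDB2开始执行ALTER DATABASE MOUNT语句",
        "[monitor] 2024-07-27 14:59:26: 守护进程(SSPUDB1)状态切换 [STARTUP-->UNIFY EP]",
    ]
    for row in watcher_transitions(sample):
        print(row)
```

States that contain a space, such as `UNIFY EP`, are kept whole because the pattern captures everything between `[` and `-->` and between `-->` and `]`.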
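2. Summary of the state transitions in the log
The monitor (dmmonitor) stops receiving messages from the SSPUDB1 watcher and concludes that the primary instance has failed.
It then starts an automatic takeover of group GRP1 using instance SSPUDB2.
Notify the active watchers in group GRP1 to set the MID
Begin the takeover with instance SSPUDB2
Notify watcher SSPUDB2 to switch to the TAKEOVER state
Watcher (SSPUDB2) state change [OPEN-->TAKEOVER]
Instance SSPUDB2 executes SP_SET_GLOBAL_DW_STATUS(0, 7)
Instance SSPUDB2 executes SP_APPLY_KEEP_PKG()
Instance SSPUDB2 executes ALTER DATABASE MOUNT
Instance SSPUDB2 executes ALTER DATABASE PRIMARY
The archive status of all instances is set to invalid
Instance SSPUDB2 executes ALTER DATABASE OPEN FORCE
Watcher (SSPUDB2) state change [TAKEOVER-->OPEN]
Notify the watchers in group GRP1 to perform cleanup
After SSPUDB1 starts again, it automatically becomes STANDBY;
Watcher (SSPUDB2) state change [OPEN-->RECOVERY]
Instance SSPUDB1 is detected as recoverable and the recovery procedure starts
Archive logs start being sent to instance SSPUDB1
The archive send to instance SSPUDB1 succeeds and SSPUDB1 is set as the current recovery instance
Archive logs are sent to SSPUDB1 successfully and instance SSPUDB2 enters the suspend state
Archive sending completes and the archive of instance SSPUDB1 is set to valid
Watcher (SSPUDB2) state change [RECOVERY-->OPEN]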
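To put a number on how long the takeover took, one option is to measure the time between the `<AUTO TAKEOVER ...>` and `</AUTO TAKEOVER ...>` markers that the monitor writes around it. The sketch below is only an illustration: `takeover_seconds` is a hypothetical helper, and the marker format is assumed from the log excerpt in section 1.

```python
import re
from datetime import datetime

# Opening and closing markers as they appear in the monitor log, e.g.
#   [monitor] 2024-07-27 14:59:00: <AUTO TAKEOVER SSPUDB2>
#   [monitor] 2024-07-27 14:59:04: </AUTO TAKEOVER SSPUDB2>
MARKER = re.compile(
    r"^\[monitor\] (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): <(/?)AUTO TAKEOVER (\w+)>"
)


def takeover_seconds(lines):
    """Return {instance: seconds between the opening and closing marker}."""
    started = {}
    durations = {}
    for line in lines:
        match = MARKER.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        is_closing = match.group(2) == "/"
        instance = match.group(3)
        if not is_closing:
            started[instance] = stamp
        elif instance in started:
            durations[instance] = (stamp - started.pop(instance)).total_seconds()
    return durations


if __name__ == "__main__":
    sample = [
        "[monitor] 2024-07-27 14:59:00: <AUTO TAKEOVER SSPUDB2>",
        "[monitor] 2024-07-27 14:59:01: 开始使用实例SSPUDB2接管",
        "[monitor] 2024-07-27 14:59:04: </AUTO TAKEOVER SSPUDB2>",
    ]
    print(takeover_seconds(sample))
```

Monitor timestamps have one-second resolution, so the result is only accurate to about a second. It also excludes the time the monitor spent waiting before it declared the primary failed.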
3. Database status check
--SSPUDB2 status check
SQL> conn sspudb/sspudb123456@192.168.1.21:25236
服务器[192.168.1.21:25236]:处于主库打开状态
登录使用时间 : 3.506(ms)
SQL>
SQL> select role$,status$ from v$database;
行号 ROLE$ STATUS$
---------- ----------- -----------
1 1 4
已用时间: 1.021(毫秒). 执行号:800.
--SSPUDB1 status check
SQL> conn sspudb/sspudb123456@192.168.1.20:15236
服务器[192.168.1.20:15236]:处于备库打开状态
登录使用时间 : 4.205(ms)
SQL> select name,status$,role$ from v$database;
行号 NAME STATUS$ ROLE$
---------- ------ ----------- -----------
1 sspudb 4 1
已用时间: 2.483(毫秒). 执行号:900.
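The numeric ROLE$ and STATUS$ values are hard to read at a glance. Below is a small Python sketch that turns them into labels. The code-to-label mapping is my understanding of the V$DATABASE column descriptions in the DM documentation (ROLE$: 0 normal, 1 primary, 2 standby; STATUS$: 3 MOUNT, 4 OPEN, 5 SUSPEND, and so on) and should be treated as an assumption to verify against your DM version. `describe_database` is a hypothetical helper name.

```python
# Assumed meaning of V$DATABASE.ROLE$ values; verify against your DM version.
ROLE_LABELS = {0: "NORMAL", 1: "PRIMARY", 2: "STANDBY"}

# Assumed meaning of V$DATABASE.STATUS$ values; verify against your DM version.
STATUS_LABELS = {
    1: "STARTUP",
    2: "STARTUP (REDO COMPLETE)",
    3: "MOUNT",
    4: "OPEN",
    5: "SUSPEND",
    6: "SHUTDOWN",
}


def describe_database(role, status):
    """Return a readable 'ROLE / STATUS' string for the two numeric codes."""
    role_text = ROLE_LABELS.get(role, "UNKNOWN({})".format(role))
    status_text = STATUS_LABELS.get(status, "UNKNOWN({})".format(status))
    return "{} / {}".format(role_text, status_text)


if __name__ == "__main__":
    print(describe_database(1, 4))
    print(describe_database(2, 4))
```

Under this mapping an open primary reports ROLE$ = 1 and STATUS$ = 4, while an open standby would be expected to report ROLE$ = 2 and STATUS$ = 4.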
4. Verify data written on the primary
SQL> create table sspu_tab1 (id int,name varchar(20));
操作已执行
已用时间: 9.315(毫秒). 执行号:901.
SQL> insert into sspu_tab1 values(2,'xsq1'),(1,'xsq2');
影响行数 2
已用时间: 0.863(毫秒). 执行号:902.
SQL> commit;
操作已执行
已用时间: 1.881(毫秒). 执行号:903.
SQL> select * from sspu_tab1;
行号 ID NAME
---------- ----------- ----
1 2 xsq1
2 1 xsq2
已用时间: 0.857(毫秒). 执行号:904.
--Check on the standby
SQL> conn sspudb/sspudb123456@192.168.1.20:15236
服务器[192.168.1.20:15236]:处于备库打开状态
登录使用时间 : 3.484(ms)
SQL> select * from sspu_tab1;
行号 ID NAME
---------- ----------- ----
1 2 xsq1
2 1 xsq2
已用时间: 4.890(毫秒). 执行号:100.
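For a table with more than a couple of rows, comparing the two result sets by eye stops being practical. A minimal sketch of an order-insensitive comparison follows. It assumes the rows have already been fetched from each side as sequences of tuples by whatever client you use, and `same_rows` is a hypothetical helper, not part of any DM tool.

```python
from collections import Counter


def same_rows(primary_rows, standby_rows):
    """Return True if both sides hold the same rows, ignoring row order.

    Counter keeps duplicate rows distinct, so a row that appears twice on
    one side and once on the other is reported as a mismatch.
    """
    return Counter(map(tuple, primary_rows)) == Counter(map(tuple, standby_rows))


if __name__ == "__main__":
    primary = [(2, "xsq1"), (1, "xsq2")]
    standby = [(1, "xsq2"), (2, "xsq1")]
    print(same_rows(primary, standby))
```

The comparison ignores ordering on purpose, because a SELECT without ORDER BY does not guarantee the same row order on both instances.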