验证实时归档+高性能模式的场景下,自动切换模式在各场景下的故障机制
-
一、测试环境说明
1、测试环境信息:
主库MAL | 备库MAL | 确认监视器MAL | 归档类型 | 模式 |
10.10.0.45 | 10.10.0.46 | 10.10.0.47 | 实时归档 | 高性能模式 |
2、测试场景:
故障切换 | 确认监视器 | 测试场景 | 验证点 |
自动切换 | 存在且正常 | 持续存在事务运行过程中,备库发生故障 | 1、事务是否中断 2、主库实例状态变化情况 |
自动切换 | 不存在或异常 | 持续存在事务运行过程中,备库发生故障 | 1、事务是否中断 2、主库实例状态变化情况 |
自动切换 | 存在且正常 | 持续存在事务运行过程中,主库发生故障 | 1、事务是否中断 2、备库实例状态变化情况 |
自动切换 | 不事务模拟:存在或异常 | 持续存在事务运行过程中,主库发生故障 | 1、事务是否中断 2、备库实例状态变化情况 |
3、事务模拟:
create table aaa.t1(id int,create_time TIMESTAMP) ;
begin
for i in 1..10000000 loop
insert into aaa.t1 values(i,now());
sleep(1);
commit;
end loop;
end;
-
二、测试过程
2.1监视器正常,备库发生异常
1)主机执行事务;
2)关闭备机
3)查看主机的日志
2024-09-20 16:15:19.806 [INFO] dmwatcher P0000001941 T0000000000000002133 没有收到远程守护进程(DW_B01)消息,原状态为(OPEN),距上次接收消息时间间隔(31)s, 设置远程守护进程为ERROR状态 2024-09-20 16:15:19.806 [INFO] dmwatcher P0000001941 T0000000000000002133 Local instance: 守护进程状态(OPEN) 实例状态(OK) 实例名(DW_A01) 模式(PRIMARY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(12) FLSN(122860646) CLSN(122860657) SLSN(122860657) SSLSN(122860657) 2024-09-20 16:15:19.806 [INFO] dmwatcher P0000001941 T0000000000000002133 Instance: 守护进程状态(ERROR) 实例状态(OK) 实例名(DW_B01) 模式(STANDBY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(12) FLSN(122860616) CLSN(122860616) SLSN(122860616) SSLSN(122860616) 2024-09-20 16:15:19.856 [WARNING] dmwatcher P0000001941 T0000000000000002133 Tcp port from dmwatcher vio(4) close, ip:10.10.0.46, port:5436, mid(-1), is confirm(FALSE), from name:DW_B01 2024-09-20 16:15:19.856 [INFO] dmwatcher P0000001941 T0000000000000002133 dw2_tcp_port_close_low, reset report_err to TRUE, ip:10.10.0.46, port:5436! 2024-09-20 16:15:19.856 [INFO] dmwatcher P0000001941 T0000000000000320569 dw2_group_tcp_recv_thread, receive tcp msg failed, close tcp port, vio:0, mid:-1, from_flag:0, from_name:DW_B01, dw_closed:1, ip:10.10.0.46, port:5436, errno:0, error:Failure occurs in data_recv_inet_once, code(0) len(8128, 0), code:-6007 2024-09-20 16:15:19.856 [INFO] dmwatcher P0000001941 T0000000000000320569 dw2_tcp_port_close_low, reset report_err to TRUE, ip:10.10.0.46, port:5436! 2024-09-20 16:15:19.857 [INFO] dmwatcher P0000001941 T0000000000000320568 dw2_tcp_port_close_low, reset report_err to TRUE, ip:10.10.0.46, port:5436! 2024-09-20 16:15:22.940 [ERROR] dmwatcher P0000001941 T0000000000000002135 Can't connect to DM server on '10.10.0.46' port(5436) errno(113) 2024-09-20 16:15:56.000 [INFO] dmwatcher P0000001941 T0000000000000002133 switch sub_state to sub_stat_start! 2024-09-20 16:15:56.000 [INFO] dmwatcher P0000001941 T0000000000000004877 本地实例的模式、状态或者归档状态发生变化,原状态是: 2024-09-20 16:15:56.000 [INFO] dmwatcher P0000001941 T0000000000000002133 设置GDW1守护进程为STARTUP(SUB:STARTUP)状态 2024-09-20 16:15:56.000 [INFO] dmwatcher P0000001941 T0000000000000004877 Instance: 守护进程状态(STARTUP) 实例状态(OK) 实例名(DW_A01) 模式(PRIMARY) 实例状态(OPEN) 归档状态(VALID) POCNT(12) FLSN(122860646) CLSN(122860657) SLSN(122860657) SSLSN(122860657) 2024-09-20 16:15:56.100 [INFO] dmwatcher P0000001941 T0000000000000004877 本地实例的模式、状态或者归档状态发生变化,新状态是: 2024-09-20 16:15:56.101 [INFO] dmwatcher P0000001941 T0000000000000004877 Instance: 守护进程状态(STARTUP) 实例状态(OK) 实例名(DW_A01) 模式(PRIMARY) 实例状态(SUSPEND) 归档状态(VALID) POCNT(12) FLSN(122860646) CLSN(122860657) SLSN(122860657) SSLSN(122860657) 2024-09-20 16:15:56.151 [INFO] dmwatcher P0000001941 T0000000000000002133 需要通过确认监视器确认是否可以执行故障处理,切换为CONFIRM状态 2024-09-20 16:15:56.151 [INFO] dmwatcher P0000001941 T0000000000000002133 switch sub_state to sub_stat_start! 2024-09-20 16:15:56.151 [INFO] dmwatcher P0000001941 T0000000000000002133 设置GDW1守护进程为CONFIRM(SUB:STARTUP)状态 2024-09-20 16:15:57.000 [INFO] dmwatcher P0000001941 T0000000000000002133 Local dmwatcher is in CONFIRM status, but has no confirm monitor or the monitor confirmed it cannot switch to FAILOVER! 2024-09-20 16:16:26.233 [INFO] dmwatcher P0000001941 T0000000000000002137 dw2_send_port_set from dmmonitor vio(8) set, mid(194439562), from name:dmmonitor, ip:::ffff:10.10.0.45, mon_confirm:FALSE 2024-09-20 16:17:04.015 [INFO] dmwatcher P0000001941 T0000000000000002133 Tcp msg send over (30)s, close tcp port(from_flag: 1, vio: 7) 2024-09-20 16:17:04.016 [WARNING] dmwatcher P0000001941 T0000000000000002133 Tcp port from dmmonitor vio(7) close, ip:::ffff:10.10.0.47, port:53934, mid(83782884), is confirm(TRUE), from name:dmmonitor 2024-09-20 16:17:04.016 [INFO] dmwatcher P0000001941 T0000000000000002133 dw2_tcp_port_close_low, reset report_err to TRUE, ip:, port:0! 2024-09-20 16:17:04.016 [INFO] dmwatcher P0000001941 T0000000000000323068 dw2_group_tcp_recv_thread, receive tcp msg failed, close tcp port, vio:0, mid:-1, from_flag:1, from_name:dmmonitor, dw_closed:1, ip:::ffff:10.10.0.47, port:53934, errno:0, error:Failure occurs in data_recv_inet_once, code(0) len(8128, 0), code:-6007 2024-09-20 16:17:04.016 [INFO] dmwatcher P0000001941 T0000000000000323068 dw2_tcp_port_close_low, reset report_err to TRUE, ip:, port:0! 2024-09-20 16:17:04.016 [INFO] dmwatcher P0000001941 T0000000000000323067 dw2_group_send_tcp_msg failed, from flag:1, from name:dmmonitor, close tcp port 2024-09-20 16:17:04.016 [INFO] dmwatcher P0000001941 T0000000000000323067 dw2_tcp_port_close_low, reset report_err to TRUE, ip:, port:0! 2024-09-20 16:17:04.016 [INFO] dmwatcher P0000001941 T0000000000000323067 dw2_group_info_send, send to port(from flag:1, from name:dmmonitor, ip:::ffff:10.10.0.47) over 5s, code:-6007! 2024-09-20 16:17:04.016 [INFO] dmwatcher P0000001941 T0000000000000323067 dw2_group_tcp_send_thread, code:-6007, from flag:1, from name:dmmonitor, ip:::ffff:10.10.0.47 2024-09-20 16:17:04.016 [INFO] dmwatcher P0000001941 T0000000000000323067 dw2_tcp_port_close_low, reset report_err to TRUE, ip:, port:0! 2024-09-20 16:20:56.879 [INFO] dmwatcher P0000001941 T0000000000000002137 dw2_send_port_set from dmmonitor vio(7) set, mid(83782884), from name:dmmonitor, ip:::ffff:10.10.0.47, mon_confirm:TRUE 2024-09-20 16:20:57.223 [INFO] dmwatcher P0000001941 T0000000000000002133 switch sub_state to pre_set_dw_stat! 2024-09-20 16:20:57.224 [INFO] dmwatcher P0000001941 T0000000000000002133 设置GDW1守护进程为FAILOVER(SUB:STARTUP)状态 2024-09-20 16:20:58.079 [INFO] dmwatcher P0000001941 T0000000000000002133 dw2_notify_set_dw_stat, dseq = 1726029919, from_dw_stat: NONE, to_dw_stat: DW_FAILOVER 2024-09-20 16:20:58.079 [INFO] dmwatcher P0000001941 T0000000000000002133 Send tcp msg to local ep DW_A01, hpc_seqno:0, code:0 2024-09-20 16:20:58.079 [INFO] dmwatcher P0000001941 T0000000000000002133 设置GDW1守护进程子状态为WAIT_SET_DW_STAT状态 2024-09-20 16:20:58.080 [INFO] dmwatcher P0000001941 T0000000000000002133 dw2_group_get_curr_ep_retcode, ep(DW_A01) cmd_ret:cmd=217, dseq=1726029919, code=100 2024-09-20 16:20:58.080 [INFO] dmwatcher P0000001941 T0000000000000002133 dw2_group_get_curr_ep_retcode, ep(DW_A01) cmd_ret:cmd=217, dseq=1726029919, code=0 2024-09-20 16:20:58.080 [INFO] dmwatcher P0000001941 T0000000000000002133 dw2_clear_ep_cmd_info_low, clear ep(DW_A01) cmd info, and reset curr_ep to NULL. 2024-09-20 16:20:58.080 [INFO] dmwatcher P0000001941 T0000000000000002133 notify ep(DW_A01) set dw_stat to DW_FAILOVER success! 2024-09-20 16:20:59.080 [INFO] dmwatcher P0000001941 T0000000000000002133 设置GDW1守护进程子状态为WAIT_TO_SUSPEND状态 2024-09-20 16:20:59.080 [INFO] dmwatcher P0000001941 T0000000000000002133 dw2_notify_chg_arch_status, dseq = 1726029920, rstat = 1 2024-09-20 16:20:59.080 [INFO] dmwatcher P0000001941 T0000000000000002133 Send tcp msg to local ep DW_A01, hpc_seqno:0, code:0 2024-09-20 16:20:59.080 [INFO] dmwatcher P0000001941 T0000000000000002133 设置GDW1守护进程子状态为WAIT_SET_ARCH状态 2024-09-20 16:20:59.081 [INFO] dmwatcher P0000001941 T0000000000000004877 实例(DW_B01)归档状态发生变化:VALID --> INVALID 2024-09-20 16:20:59.081 [INFO] dmwatcher P0000001941 T0000000000000002133 dw2_group_get_curr_ep_retcode, ep(DW_A01) cmd_ret:cmd=214, dseq=1726029920, code=0 2024-09-20 16:20:59.081 [INFO] dmwatcher P0000001941 T0000000000000002133 dw2_clear_ep_cmd_info_low, clear ep(DW_A01) cmd info, and reset curr_ep to NULL. 2024-09-20 16:20:59.081 [INFO] dmwatcher P0000001941 T0000000000000002133 dw2_notify_sql_exec, dseq = 1726029921, sql: ALTER DATABASE OPEN FORCE 2024-09-20 16:20:59.081 [INFO] dmwatcher P0000001941 T0000000000000002133 Send tcp msg to local ep DW_A01, hpc_seqno:0, code:0 2024-09-20 16:20:59.081 [INFO] dmwatcher P0000001941 T0000000000000002133 设置GDW1守护进程子状态为WAIT_TO_OPEN状态 2024-09-20 16:20:59.082 [INFO] dmwatcher P0000001941 T0000000000000002133 dw2_group_get_curr_ep_retcode, ep(DW_A01) cmd_ret:cmd=1, dseq=1726029921, code=100 2024-09-20 16:20:59.082 [INFO] dmwatcher P0000001941 T0000000000000002133 dw2_group_get_curr_ep_retcode, ep(DW_A01) cmd_ret:cmd=1, dseq=1726029921, code=0 2024-09-20 16:20:59.082 [INFO] dmwatcher P0000001941 T0000000000000002133 dw2_clear_ep_cmd_info_low, clear ep(DW_A01) cmd info, and reset curr_ep to NULL. 2024-09-20 16:20:59.082 [INFO] dmwatcher P0000001941 T0000000000000002133 本地守护进程为FAILOVER状态,本地实例为PRIMARY & OPEN,故障处理完成. 2024-09-20 16:20:59.082 [INFO] dmwatcher P0000001941 T0000000000000002133 设置GDW1守护进程子状态为SUB_STATE_CLEAR状态 2024-09-20 16:21:00.082 [INFO] dmwatcher P0000001941 T0000000000000002133 dw2_notify_set_dw_stat, dseq = 1726029922, from_dw_stat: DW_FAILOVER, to_dw_stat: NONE 2024-09-20 16:21:00.082 [INFO] dmwatcher P0000001941 T0000000000000002133 Send tcp msg to local ep DW_A01, hpc_seqno:0, code:0 2024-09-20 16:21:00.082 [INFO] dmwatcher P0000001941 T0000000000000002133 设置GDW1守护进程子状态为WAIT_CLEAR状态 2024-09-20 16:21:00.083 [INFO] dmwatcher P0000001941 T0000000000000002133 dw2_group_get_curr_ep_retcode, ep(DW_A01) cmd_ret:cmd=217, dseq=1726029922, code=0 2024-09-20 16:21:00.083 [INFO] dmwatcher P0000001941 T0000000000000002133 dw2_clear_ep_cmd_info_low, clear ep(DW_A01) cmd info, and reset curr_ep to NULL. 2024-09-20 16:21:00.083 [INFO] dmwatcher P0000001941 T0000000000000002133 notify ep(DW_A01) set dw_stat to NONE success! 2024-09-20 16:21:00.083 [INFO] dmwatcher P0000001941 T0000000000000002133 dw2_clear_ep_cmd_info_low, clear ep(DW_A01) cmd info. 2024-09-20 16:21:00.083 [INFO] dmwatcher P0000001941 T0000000000000002133 Clear all ep g_dw_status finished, Failover finished! 2024-09-20 16:21:00.083 [INFO] dmwatcher P0000001941 T0000000000000002133 switch sub_state to sub_stat_start! 2024-09-20 16:21:00.083 [INFO] dmwatcher P0000001941 T0000000000000002133 设置GDW1守护进程为OPEN(SUB:STARTUP)状态 |
整个过程大致如下:
①备库发生异常,主库检测不到备库报错
②将主库实例状态修改为suspend避免主从不一致
③主库守护进程状态修改为MON CONFIRM
④提交确认监视器处理,主库守护进程修改为FAILOVER
⑤监视器将备库归档状态从valid修改为invalid
⑥恢复主库实例状态为primary&open
⑦恢复主库守护进程open
小结:
运行中的事务会中断,主库先进行suspend后,等待监视器将从库的归档置为invalid后主库实例恢复为primary&open。
2.1.1主库修复入集群现象
重新启动备库的过程中,主库也会产生一次suspend的切换,但是并没有出现事务写入中断的现象。
2.2监视器异常,备库发生异常
1)关闭监视器服务器
2)主机执行事务;
3)关闭备机
实例日志日式归档发送失败,实例suspend:
守护进程日志提示需要确认监视器处理,但没有确认监视器,因此实例处于suspend状态,而事务也停止了。
小结:
运行中的事务会中断,主库suspend后因没有确认监视器干涉处理,长期挂起。
2.2.1恢复场景1:启动监视器,不启动备机
结论:守护进程干涉将备机的归档状态修改为invalid后,主库primary&open
2.2.2恢复场景2:不启动监视器,启动备机
结论:主库守护进程检测到备库状态正常,实例自动恢复为primary&open状态
2.3监视器正常,主库发生异常
- 主机执行事务;
- 下电主库服务器
守护进程通知监视器主节点异常,监视器将主库arch修改为invalid后,切换备库为primary&open状态:
确认监视器详细列出了备机接管的工作机制
2.3.1主库修复加入集群现象
原主节点开机恢复后重新加入集群,实例切换为standby&open状态:
新主库在恢复过程中suspend来发送历史归档至恢复的实例,并设置恢复实例归档有效后,重新切换成Primary&open状态:
事务在这个过程中暂停了写入,新主库从suspend恢复open后继续写入:
2.4监视器异常,主库发生异常
- 主机执行事务;
- 下电确认监视器
- 下电主库服务器
从非确认监视器查看获取不到primary库信息,也获取不到确认监视器,备库状态始终为standby
守护进程日志反馈找不到确认监视器干涉
2.4.1监视器恢复集群现象
在未进行主库修复前,仅启动确认监视器后,因原DW_B01是备库,不满足自动接管条件,集群无法提供业务支持:
2.4.2仅启动主库,不启动确认监视器
在关闭确认监视器的情况下,启动主库后,守护进程之间检测恢复正常,集群可正常提供对外访问:
三、总结:
场景 | 现象 | 结论 |
1 | 监视器正常,备库异常 | 主库自动切换为suspend后,监视器介入将备库归档置为invalid后主库恢复open,集群可正常对外提供服务 |
2 | 监视器异常,备库异常 | 因无监视器干涉,主库自动切换并一直处于suspend状态,集群无法对外提供服务 |
3 | 监视器正常,主库异常 | 备库自动切换为主库,集群可正常对外提供服务 |
4 | 监视器异常,主库异常 | 备库一直处于standby状态,集群无法对外提供服务 |