达梦数据库DW集群自动切换模式下各场景的处理机制

验证实时归档+高性能模式的场景下,自动切换模式在各场景下的故障机制

  • 一、测试环境说明

1、测试环境信息:

主库MAL

备库MAL

确认监视器MAL

归档类型

模式

10.10.0.45

10.10.0.46

10.10.0.47

实时归档

高性能模式

2、测试场景:

故障切换

确认监视器

测试场景

验证点

自动切换

存在且正常

持续存在事务运行过程中,备库发生故障

1、事务是否中断

2、主库实例状态变化情况

自动切换

不存在或异常

持续存在事务运行过程中,备库发生故障

1、事务是否中断

2、主库实例状态变化情况

自动切换

存在且正常

持续存在事务运行过程中,主库发生故障

1、事务是否中断

2、备库实例状态变化情况

自动切换

不事务模拟:存在或异常

持续存在事务运行过程中,主库发生故障

1、事务是否中断

2、备库实例状态变化情况

3、事务模拟:

create table aaa.t1(id int,create_time TIMESTAMP) ;

begin

for i in 1..10000000 loop 

insert into aaa.t1 values(i,now());

sleep(1);

commit;

end loop;

end;

  • 二、测试过程

2.1监视器正常,备库发生异常

1)主机执行事务;

2)关闭备机

3)查看主机的日志

2024-09-20 16:15:19.806 [INFO] dmwatcher P0000001941 T0000000000000002133  没有收到远程守护进程(DW_B01)消息,原状态为(OPEN),距上次接收消息时间间隔(31)s, 设置远程守护进程为ERROR状态

2024-09-20 16:15:19.806 [INFO] dmwatcher P0000001941 T0000000000000002133  Local instance: 守护进程状态(OPEN) 实例状态(OK) 实例名(DW_A01) 模式(PRIMARY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(12) FLSN(122860646) CLSN(122860657) SLSN(122860657) SSLSN(122860657)

2024-09-20 16:15:19.806 [INFO] dmwatcher P0000001941 T0000000000000002133  Instance: 守护进程状态(ERROR) 实例状态(OK) 实例名(DW_B01) 模式(STANDBY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(12) FLSN(122860616) CLSN(122860616) SLSN(122860616) SSLSN(122860616)

2024-09-20 16:15:19.856 [WARNING] dmwatcher P0000001941 T0000000000000002133  Tcp port from dmwatcher vio(4) close, ip:10.10.0.46, port:5436, mid(-1), is confirm(FALSE), from name:DW_B01

2024-09-20 16:15:19.856 [INFO] dmwatcher P0000001941 T0000000000000002133  dw2_tcp_port_close_low, reset report_err to TRUE, ip:10.10.0.46, port:5436!

2024-09-20 16:15:19.856 [INFO] dmwatcher P0000001941 T0000000000000320569  dw2_group_tcp_recv_thread, receive tcp msg failed, close tcp port, vio:0, mid:-1, from_flag:0, from_name:DW_B01, dw_closed:1, ip:10.10.0.46, port:5436, errno:0, error:Failure occurs in data_recv_inet_once, code(0) len(8128, 0), code:-6007

2024-09-20 16:15:19.856 [INFO] dmwatcher P0000001941 T0000000000000320569  dw2_tcp_port_close_low, reset report_err to TRUE, ip:10.10.0.46, port:5436!

2024-09-20 16:15:19.857 [INFO] dmwatcher P0000001941 T0000000000000320568  dw2_tcp_port_close_low, reset report_err to TRUE, ip:10.10.0.46, port:5436!

2024-09-20 16:15:22.940 [ERROR] dmwatcher P0000001941 T0000000000000002135  Can't connect to DM server on '10.10.0.46' port(5436) errno(113)

2024-09-20 16:15:56.000 [INFO] dmwatcher P0000001941 T0000000000000002133  switch sub_state to sub_stat_start!

2024-09-20 16:15:56.000 [INFO] dmwatcher P0000001941 T0000000000000004877  本地实例的模式、状态或者归档状态发生变化,原状态是:

2024-09-20 16:15:56.000 [INFO] dmwatcher P0000001941 T0000000000000002133  设置GDW1守护进程为STARTUP(SUB:STARTUP)状态

2024-09-20 16:15:56.000 [INFO] dmwatcher P0000001941 T0000000000000004877  Instance: 守护进程状态(STARTUP) 实例状态(OK) 实例名(DW_A01) 模式(PRIMARY) 实例状态(OPEN) 归档状态(VALID) POCNT(12) FLSN(122860646) CLSN(122860657) SLSN(122860657) SSLSN(122860657)

2024-09-20 16:15:56.100 [INFO] dmwatcher P0000001941 T0000000000000004877  本地实例的模式、状态或者归档状态发生变化,新状态是:

2024-09-20 16:15:56.101 [INFO] dmwatcher P0000001941 T0000000000000004877  Instance: 守护进程状态(STARTUP) 实例状态(OK) 实例名(DW_A01) 模式(PRIMARY) 实例状态(SUSPEND) 归档状态(VALID) POCNT(12) FLSN(122860646) CLSN(122860657) SLSN(122860657) SSLSN(122860657)

2024-09-20 16:15:56.151 [INFO] dmwatcher P0000001941 T0000000000000002133  需要通过确认监视器确认是否可以执行故障处理,切换为CONFIRM状态

2024-09-20 16:15:56.151 [INFO] dmwatcher P0000001941 T0000000000000002133  switch sub_state to sub_stat_start!

2024-09-20 16:15:56.151 [INFO] dmwatcher P0000001941 T0000000000000002133  设置GDW1守护进程为CONFIRM(SUB:STARTUP)状态

2024-09-20 16:15:57.000 [INFO] dmwatcher P0000001941 T0000000000000002133  Local dmwatcher is in CONFIRM status, but has no confirm monitor or the monitor confirmed it cannot switch to FAILOVER!

2024-09-20 16:16:26.233 [INFO] dmwatcher P0000001941 T0000000000000002137  dw2_send_port_set from dmmonitor vio(8) set, mid(194439562), from name:dmmonitor, ip:::ffff:10.10.0.45, mon_confirm:FALSE

2024-09-20 16:17:04.015 [INFO] dmwatcher P0000001941 T0000000000000002133  Tcp msg send over (30)s, close tcp port(from_flag: 1, vio: 7)

2024-09-20 16:17:04.016 [WARNING] dmwatcher P0000001941 T0000000000000002133  Tcp port from dmmonitor vio(7) close, ip:::ffff:10.10.0.47, port:53934, mid(83782884), is confirm(TRUE), from name:dmmonitor

2024-09-20 16:17:04.016 [INFO] dmwatcher P0000001941 T0000000000000002133  dw2_tcp_port_close_low, reset report_err to TRUE, ip:, port:0!

2024-09-20 16:17:04.016 [INFO] dmwatcher P0000001941 T0000000000000323068  dw2_group_tcp_recv_thread, receive tcp msg failed, close tcp port, vio:0, mid:-1, from_flag:1, from_name:dmmonitor, dw_closed:1, ip:::ffff:10.10.0.47, port:53934, errno:0, error:Failure occurs in data_recv_inet_once, code(0) len(8128, 0), code:-6007

2024-09-20 16:17:04.016 [INFO] dmwatcher P0000001941 T0000000000000323068  dw2_tcp_port_close_low, reset report_err to TRUE, ip:, port:0!

2024-09-20 16:17:04.016 [INFO] dmwatcher P0000001941 T0000000000000323067  dw2_group_send_tcp_msg failed, from flag:1, from name:dmmonitor, close tcp port

2024-09-20 16:17:04.016 [INFO] dmwatcher P0000001941 T0000000000000323067  dw2_tcp_port_close_low, reset report_err to TRUE, ip:, port:0!

2024-09-20 16:17:04.016 [INFO] dmwatcher P0000001941 T0000000000000323067  dw2_group_info_send, send to port(from flag:1, from name:dmmonitor, ip:::ffff:10.10.0.47) over 5s, code:-6007!

2024-09-20 16:17:04.016 [INFO] dmwatcher P0000001941 T0000000000000323067  dw2_group_tcp_send_thread, code:-6007, from flag:1, from name:dmmonitor, ip:::ffff:10.10.0.47

2024-09-20 16:17:04.016 [INFO] dmwatcher P0000001941 T0000000000000323067  dw2_tcp_port_close_low, reset report_err to TRUE, ip:, port:0!

2024-09-20 16:20:56.879 [INFO] dmwatcher P0000001941 T0000000000000002137  dw2_send_port_set from dmmonitor vio(7) set, mid(83782884), from name:dmmonitor, ip:::ffff:10.10.0.47, mon_confirm:TRUE

2024-09-20 16:20:57.223 [INFO] dmwatcher P0000001941 T0000000000000002133  switch sub_state to pre_set_dw_stat!

2024-09-20 16:20:57.224 [INFO] dmwatcher P0000001941 T0000000000000002133  设置GDW1守护进程为FAILOVER(SUB:STARTUP)状态

2024-09-20 16:20:58.079 [INFO] dmwatcher P0000001941 T0000000000000002133  dw2_notify_set_dw_stat, dseq = 1726029919, from_dw_stat: NONE, to_dw_stat: DW_FAILOVER

2024-09-20 16:20:58.079 [INFO] dmwatcher P0000001941 T0000000000000002133  Send tcp msg to local ep DW_A01, hpc_seqno:0, code:0

2024-09-20 16:20:58.079 [INFO] dmwatcher P0000001941 T0000000000000002133  设置GDW1守护进程子状态为WAIT_SET_DW_STAT状态

2024-09-20 16:20:58.080 [INFO] dmwatcher P0000001941 T0000000000000002133  dw2_group_get_curr_ep_retcode, ep(DW_A01) cmd_ret:cmd=217, dseq=1726029919, code=100

2024-09-20 16:20:58.080 [INFO] dmwatcher P0000001941 T0000000000000002133  dw2_group_get_curr_ep_retcode, ep(DW_A01) cmd_ret:cmd=217, dseq=1726029919, code=0

2024-09-20 16:20:58.080 [INFO] dmwatcher P0000001941 T0000000000000002133  dw2_clear_ep_cmd_info_low, clear ep(DW_A01) cmd info, and reset curr_ep to NULL.

2024-09-20 16:20:58.080 [INFO] dmwatcher P0000001941 T0000000000000002133  notify ep(DW_A01) set dw_stat to DW_FAILOVER success!

2024-09-20 16:20:59.080 [INFO] dmwatcher P0000001941 T0000000000000002133  设置GDW1守护进程子状态为WAIT_TO_SUSPEND状态

2024-09-20 16:20:59.080 [INFO] dmwatcher P0000001941 T0000000000000002133  dw2_notify_chg_arch_status, dseq = 1726029920, rstat = 1

2024-09-20 16:20:59.080 [INFO] dmwatcher P0000001941 T0000000000000002133  Send tcp msg to local ep DW_A01, hpc_seqno:0, code:0

2024-09-20 16:20:59.080 [INFO] dmwatcher P0000001941 T0000000000000002133  设置GDW1守护进程子状态为WAIT_SET_ARCH状态

2024-09-20 16:20:59.081 [INFO] dmwatcher P0000001941 T0000000000000004877  实例(DW_B01)归档状态发生变化:VALID --> INVALID

2024-09-20 16:20:59.081 [INFO] dmwatcher P0000001941 T0000000000000002133  dw2_group_get_curr_ep_retcode, ep(DW_A01) cmd_ret:cmd=214, dseq=1726029920, code=0

2024-09-20 16:20:59.081 [INFO] dmwatcher P0000001941 T0000000000000002133  dw2_clear_ep_cmd_info_low, clear ep(DW_A01) cmd info, and reset curr_ep to NULL.

2024-09-20 16:20:59.081 [INFO] dmwatcher P0000001941 T0000000000000002133  dw2_notify_sql_exec, dseq = 1726029921, sql: ALTER DATABASE OPEN FORCE

2024-09-20 16:20:59.081 [INFO] dmwatcher P0000001941 T0000000000000002133  Send tcp msg to local ep DW_A01, hpc_seqno:0, code:0

2024-09-20 16:20:59.081 [INFO] dmwatcher P0000001941 T0000000000000002133  设置GDW1守护进程子状态为WAIT_TO_OPEN状态

2024-09-20 16:20:59.082 [INFO] dmwatcher P0000001941 T0000000000000002133  dw2_group_get_curr_ep_retcode, ep(DW_A01) cmd_ret:cmd=1, dseq=1726029921, code=100

2024-09-20 16:20:59.082 [INFO] dmwatcher P0000001941 T0000000000000002133  dw2_group_get_curr_ep_retcode, ep(DW_A01) cmd_ret:cmd=1, dseq=1726029921, code=0

2024-09-20 16:20:59.082 [INFO] dmwatcher P0000001941 T0000000000000002133  dw2_clear_ep_cmd_info_low, clear ep(DW_A01) cmd info, and reset curr_ep to NULL.

2024-09-20 16:20:59.082 [INFO] dmwatcher P0000001941 T0000000000000002133  本地守护进程为FAILOVER状态,本地实例为PRIMARY & OPEN,故障处理完成.

2024-09-20 16:20:59.082 [INFO] dmwatcher P0000001941 T0000000000000002133  设置GDW1守护进程子状态为SUB_STATE_CLEAR状态

2024-09-20 16:21:00.082 [INFO] dmwatcher P0000001941 T0000000000000002133  dw2_notify_set_dw_stat, dseq = 1726029922, from_dw_stat: DW_FAILOVER, to_dw_stat: NONE

2024-09-20 16:21:00.082 [INFO] dmwatcher P0000001941 T0000000000000002133  Send tcp msg to local ep DW_A01, hpc_seqno:0, code:0

2024-09-20 16:21:00.082 [INFO] dmwatcher P0000001941 T0000000000000002133  设置GDW1守护进程子状态为WAIT_CLEAR状态

2024-09-20 16:21:00.083 [INFO] dmwatcher P0000001941 T0000000000000002133  dw2_group_get_curr_ep_retcode, ep(DW_A01) cmd_ret:cmd=217, dseq=1726029922, code=0

2024-09-20 16:21:00.083 [INFO] dmwatcher P0000001941 T0000000000000002133  dw2_clear_ep_cmd_info_low, clear ep(DW_A01) cmd info, and reset curr_ep to NULL.

2024-09-20 16:21:00.083 [INFO] dmwatcher P0000001941 T0000000000000002133  notify ep(DW_A01) set dw_stat to NONE success!

2024-09-20 16:21:00.083 [INFO] dmwatcher P0000001941 T0000000000000002133  dw2_clear_ep_cmd_info_low, clear ep(DW_A01) cmd info.

2024-09-20 16:21:00.083 [INFO] dmwatcher P0000001941 T0000000000000002133  Clear all ep g_dw_status finished, Failover finished!

2024-09-20 16:21:00.083 [INFO] dmwatcher P0000001941 T0000000000000002133  switch sub_state to sub_stat_start!

2024-09-20 16:21:00.083 [INFO] dmwatcher P0000001941 T0000000000000002133  设置GDW1守护进程为OPEN(SUB:STARTUP)状态

整个过程大致如下:

①备库发生异常,主库检测不到备库报错

②将主库实例状态修改为suspend避免主从不一致

③主库守护进程状态修改为MON CONFIRM

④提交确认监视器处理,主库守护进程修改为FAILOVER

⑤监视器将备库归档状态从valid修改为invalid

⑥恢复主库实例状态为primary&open

⑦恢复主库守护进程open

小结:

运行中的事务会中断,主库先进行suspend后,等待监视器将从库的归档置为invalid后主库实例恢复为primary&open。

2.1.1主库修复入集群现象

重新启动备库的过程中,主库也会产生一次suspend的切换,但是并没有出现事务写入中断的现象。

2.2监视器异常,备库发生异常

1)关闭监视器服务器

2)主机执行事务;

3)关闭备机

实例日志日式归档发送失败,实例suspend:

守护进程日志提示需要确认监视器处理,但没有确认监视器,因此实例处于suspend状态,而事务也停止了。

小结:

运行中的事务会中断,主库suspend后因没有确认监视器干涉处理,长期挂起。

2.2.1恢复场景1:启动监视器,不启动备机

结论:守护进程干涉将备机的归档状态修改为invalid后,主库primary&open

2.2.2恢复场景2:不启动监视器,启动备机

结论:主库守护进程检测到备库状态正常,实例自动恢复为primary&open状态

2.3监视器正常,主库发生异常

  1. 主机执行事务;
  2. 下电主库服务器

守护进程通知监视器主节点异常,监视器将主库arch修改为invalid后,切换备库为primary&open状态:

确认监视器详细列出了备机接管的工作机制

2.3.1主库修复加入集群现象

原主节点开机恢复后重新加入集群,实例切换为standby&open状态:

新主库在恢复过程中suspend来发送历史归档至恢复的实例,并设置恢复实例归档有效后,重新切换成Primary&open状态:

事务在这个过程中暂停了写入,新主库从suspend恢复open后继续写入:

2.4监视器异常,主库发生异常

  1. 主机执行事务;
  2. 下电确认监视器
  3. 下电主库服务器

从非确认监视器查看获取不到primary库信息,也获取不到确认监视器,备库状态始终为standby

守护进程日志反馈找不到确认监视器干涉

2.4.1监视器恢复集群现象

在未进行主库修复前,仅启动确认监视器后,因原DW_B01是备库,不满足自动接管条件,集群无法提供业务支持:

2.4.2仅启动主库,不启动确认监视器

在关闭确认监视器的情况下,启动主库后,守护进程之间检测恢复正常,集群可正常提供对外访问:

三、总结:

场景现象结论
1监视器正常,备库异常主库自动切换为suspend后,监视器介入将备库归档置为invalid后主库恢复open,集群可正常对外提供服务
2监视器异常,备库异常因无监视器干涉,主库自动切换并一直处于suspend状态,集群无法对外提供服务
3监视器正常,主库异常备库自动切换为主库,集群可正常对外提供服务
4监视器异常,主库异常备库一直处于standby状态,集群无法对外提供服务





达梦数据库 - 新一代大型通用关系型数据库 | 达梦在线服务平台

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值