达梦8 通过日志解释数据守护系统的关停顺序

关闭守护系统时,必须按照一定的顺序来关闭守护进程和数据库实例。特别是自动切换模式,如果退出守护进程或主备库的顺序不正确,可能会引起主备切换,甚至造成守护进程组分裂。

官方推荐通过在监视器执行stop group命令关闭守护系统:

有时监视器里操作被认为不太方便,比如要login输入复杂的SYSDBA口令,而通过disql可以通过本地认证免密登录数据库。下面官方也推荐了disql操作方式:

如上可见,为防止发生自动接管,首先需要先关停确认监视器。然后再关闭守护进程。关停守护进程没有严格的顺序。之所以先关闭备库的watcher,是考虑到操作方便。因为确认监视器往往部署在备机。之所以最后关闭主库的watcher,同样是操作方便。因为随后停数据库是有严格顺序的,先停主,再停备。

为什么要先停主库,能不能先停备库?先停备库会导致主库由于归档发送失败导致短暂进入suspend状态,备库再次重启后需要进入一个恢复阶段。大多数情况系统会完美地处理上述问题,但不能排除极少数情况下发生异常。

下面通过日志观察备库关闭再启动,系统做了什么。

#关停备库的守护进程
[dmdba@node2 bin]$ ./DmWatcherServiceTEST2 stop
Stopping DmWatcherServiceTEST2:                            [ OK ]

#关停备库。之所以先关守护进程,是防止备库自动被守护进程重新拉起
[dmdba@node2 bin]$ ./DmServiceTEST2 stop
Stopping DmServiceTEST2:                                   [ OK ]

#查看备库归档日志的LSN号
[dmdba@node2 bin]$ ./dmrachk arch_path=/home/dmdba/dmarch
rachk V8
the database pmnt_magic: 1859674653
the database db_magic: 1291794688
/*******************************************************************/
archive file /home/dmdba/dmarch/STANDBY_ARCHIVE_0x25335316_EP0_2024-06-23_01-43-11.log itemize.
version        : 0x7007
status         : INACTIVE
enc_flag       : FALSE
n_rpkgs        : 22
pmnt_magic     : 1859674653
db_magic       : 1291794688
src_db_magic   : 624120598
arch_lsn       : 1724574716
arch_seq       : 1430505
clsn           : 1724574737
next_seq       : 1430526
g_next_seq     : 1430526
file len       : 15360
file free      : 15360
dsc seqno      : 0
dsc node       : 1
create time    : 2024-06-23 01:43:11.914690 time_zone:+08:00
close time     : 2024-06-23 01:43:12.061566 time_zone:+08:00
llog first time     : 1900-01-01 00:00:00.000000 time_zone:+08:00
llog last time      : 1900-01-01 00:00:00.000000 time_zone:+08:00
/*******************************************************************/
/*******************************************************************/
archive file /home/dmdba/dmarch/STANDBY_ARCHIVE_0x25335316_EP0_2024-06-23_01-43-12.log itemize.
version        : 0x7007
status         : INACTIVE
enc_flag       : FALSE
n_rpkgs        : 32
pmnt_magic     : 1859674653
db_magic       : 1291794688
src_db_magic   : 624120598
arch_lsn       : 1724574738
arch_seq       : 1430527
clsn           : 1724574775
next_seq       : 1430558
g_next_seq     : 1430558
file len       : 21504
file free      : 21504
dsc seqno      : 0
dsc node       : 1
create time    : 2024-06-23 01:43:12.562146 time_zone:+08:00
close time     : 2024-06-23 01:44:49.847657 time_zone:+08:00
llog first time     : 1900-01-01 00:00:00.000000 time_zone:+08:00
llog last time      : 1900-01-01 00:00:00.000000 time_zone:+08:00
/*******************************************************************/

The SUMMARY(seqno[0]):
total files: 2
okey files: 2
fail file: 0
repeat file: 0

the rachk tool running cost 0.869 ms, code:0

#启动守护进程,备库自动拉起
[dmdba@node2 bin]$ ./DmWatcherServiceTEST2 start
Starting DmWatcherServiceTEST2:                            [ OK ]

下面查看主库日志

日志内容比较多,节选几个关键部分

向备库发送实时归档(lsn:1724574777)失败,节点切换为SUSPEND状态,守护进程状态修改为DW_FAILOVER状态。

2024-06-23 01:44:47.931 [ERROR] database P0000001104 T0000000000000001518  [mal recv for arch] mal receive from site(TEST2) success, but site(TEST2) apply failed, begin lsn:1724574777, end lsn:1724574777, code:-518
2024-06-23 01:44:47.931 [ERROR] database P0000001104 T0000000000000001518  send realtime archive to instance[TEST2] failed, code = -518, begin_lsn = 1724574777, end_lsn = 1724574777!
2024-06-23 01:44:47.931 [INFO] database P0000001104 T0000000000000001518  Send archive log to remote instance failed, switch all ep to SUSPEND status success!
2024-06-23 01:44:49.934 [INFO] database P0000001104 T0000000000000001554  utsk_cmd_add, cmd info: cmd=217, dseq=1719099861, name_in=, begin_lsn=-1!
2024-06-23 01:44:49.934 [INFO] database P0000001104 T0000000000000001554  utsk_set_global_dw_stat, begin, msg_dseq:1719099861
2024-06-23 01:44:49.934 [INFO] database P0000001104 T0000000000000001554  set g_dw_stat from NONE to DW_FAILOVER success, g_dw_recover_stop is 0
2024-06-23 01:44:49.934 [INFO] database P0000001104 T0000000000000001554  utsk_set_global_dw_stat, finished, msg_dseq:1719099861, set code:0

归档状态修改为INVALID状态,保证数据库open后不再传输实时归档。

数据库恢复为open,业务数据可以继续写入RLOG

守护进程状态由FAILOVER恢复为NONE

2024-06-23 01:44:50.936 [INFO] database P0000001104 T0000000000000001558  Change TEST2 arch status from VALID to INVALID
2024-06-23 01:44:50.936 [INFO] database P0000001104 T0000000000000001554  utsk_cmd_add, received sql exec cmd:1, dseq:1719099863, sql:ALTER DATABASE OPEN FORCE
2024-06-23 01:44:50.936 [INFO] database P0000001104 T0000000000000001554  utsk_cmd_add, cmd info: cmd=1, dseq=1719099863, name_in=, begin_lsn=-1!
2024-06-23 01:44:50.936 [INFO] database P0000001104 T0000000000000001554  pha_altdb_open_for_dw alter database open start...
2024-06-23 01:44:50.936 [INFO] database P0000001104 T0000000000000001554  pha_altdb_open_for_dw, altdb set changing end!
2024-06-23 01:44:50.936 [INFO] database P0000001104 T0000000000000001554  altdb open rlog_flush_notify_ex start!
2024-06-23 01:44:50.936 [INFO] database P0000001104 T0000000000000001554  altdb open rlog_flush_notify_ex end!
2024-06-23 01:44:50.936 [INFO] database P0000001104 T0000000000000001554  pha_altdb_open_for_dw alter database open success!
2024-06-23 01:44:50.936 [INFO] database P0000001104 T0000000000000001559  total 0 active crash trx, pseg_crash_trx_rollbacksys_only(0) begin ...
2024-06-23 01:44:50.937 [INFO] database P0000001104 T0000000000000001559  pseg_crash_trx_rollback end, total 0 active crash trx, include 0 empty_trxs, 0 empty_pages which only need to delete mgr recs.
2024-06-23 01:44:50.937 [INFO] database P0000001104 T0000000000000001559  pseg_crash_trx_rollback end
2024-06-23 01:44:50.937 [INFO] database P0000001104 T0000000000000001559  hpc_clear_active_trx adjust n_crash_active_trx from 0 to 0.
2024-06-23 01:44:51.938 [INFO] database P0000001104 T0000000000000001554  utsk_cmd_add, cmd info: cmd=217, dseq=1719099864, name_in=, begin_lsn=-1!
2024-06-23 01:44:51.938 [INFO] database P0000001104 T0000000000000001554  utsk_set_global_dw_stat, begin, msg_dseq:1719099864
2024-06-23 01:44:51.938 [INFO] database P0000001104 T0000000000000001554  set g_dw_stat from DW_FAILOVER to NONE success, g_dw_recover_stop is 0
2024-06-23 01:44:51.938 [INFO] database P0000001104 T0000000000000001554  utsk_set_global_dw_stat, finished, msg_dseq:1719099864, set code:0

备库再次启动后,守护进程状态设置为DW_RECOVERY,准备异步传输归档

...
#连接备库失败
2024-06-23 01:47:46.577 [WARNING] database P0000001104 T0000000000000001507  MAL receive site(0) lost connect to site(1), ctl_handle(0), data_handle(0), dsc_handle(0)
2024-06-23 01:47:46.577 [WARNING] database P0000001104 T0000000000000001507  site(0) ctl_link mal_site_letter_recv from site(1) failed, socket handle = 0, mal sys status is 0, try to get mal_port again
2024-06-23 01:47:46.577 [WARNING] database P0000001104 T0000000000000001506  mal_site_tsk_check site(0) connect lost to site(1), socket handle = 0, mal sys status = 0, try get port again
2024-06-23 01:47:46.578 [INFO] database P0000001104 T0000000000000001506  send  CMD_MAL_LINK_CHECK(350): (mal_id:21784098, stmt_id:71964, mppexec_id:0, pln_op_id:65535, org_site :0, src_site:0, dest_site:1, build_time:-1)
2024-06-23 01:47:46.578 [INFO] database P0000001104 T0000000000000001506  send  CMD_MAL_LINK_CHECK(350): (mal_id:21784098, stmt_id:71966, mppexec_id:0, pln_op_id:65535, org_site :0, src_site:0, dest_site:1, build_time:4467)
#重新建立到备库的连接
2024-06-23 01:47:52.515 [INFO] database P0000001104 T0000000000000001514  mal_site_ctl_link_create startup from mal_site(0) to mal_site(1)!
2024-06-23 01:47:52.515 [INFO] database P0000001104 T0000000000000001514  mal_site_magic_gen site_magic[39959], src_site:0, dst_site:1
2024-06-23 01:47:52.515 [INFO] database P0000001104 T0000000000000001514  site[0] mal_site_ctl_port_set to site[1, IP: 192.168.100.101, port_num: 5800], socket handle = 52, site_magic = 39959
2024-06-23 01:47:52.515 [INFO] database P0000001104 T0000000000000001507  mal_site_port_get site_magic:39959, src_site:0, dst_site:1
2024-06-23 01:47:52.516 [INFO] database P0000001104 T0000000000000001506  mal_site_port_get site_magic:39959, src_site:0, dst_site:1
2024-06-23 01:47:52.523 [INFO] database P0000001104 T0000000000000001513  site[0] mal_site_data_port_set from site[1, IP: 192.168.100.101, port_num: 5800], socket handle = 53, site_magic = 39959
2024-06-23 01:47:52.523 [INFO] database P0000001104 T0000000000000001505  mal_site_port_get site_magic:39959, src_site:0, dst_site:1
2024-06-23 01:47:52.523 [INFO] database P0000001104 T0000000000000001508  mal_site_port_get site_magic:39959, src_site:0, dst_site:1
2024-06-23 01:47:53.293 [INFO] database P0000001104 T0000000000000001554  utsk_cmd_add, cmd info: cmd=217, dseq=1719099865, name_in=, begin_lsn=-1!
2024-06-23 01:47:53.293 [INFO] database P0000001104 T0000000000000001554  utsk_set_global_dw_stat, begin, msg_dseq:1719099865
#守护进程状态设置为DW_RECOVERY
2024-06-23 01:47:53.293 [INFO] database P0000001104 T0000000000000001554  set g_dw_stat from NONE to DW_RECOVERY success, g_dw_recover_stop is 0
#准备异步发送归档
2024-06-23 01:47:53.293 [INFO] database P0000001104 T0000000000000001554  rarch_set_async_send_info, reset total_rpkg and total_time_used.
2024-06-23 01:47:53.293 [INFO] database P0000001104 T0000000000000001554  utsk_set_global_dw_stat, reset all async send info success.
2024-06-23 01:47:53.293 [INFO] database P0000001104 T0000000000000001554  utsk_set_global_dw_stat, finished, msg_dseq:1719099865, set code:0
2024-06-23 01:47:54.347 [INFO] database P0000001104 T0000000000000001554  utsk_cmd_add, cmd info: cmd=210, dseq=1719099866, name_in=TEST2, begin_lsn=0!
2024-06-23 01:47:54.347 [INFO] database P0000001104 T0000000000000001557  utsk_cmd_exec, cmd:210, sys_status:OPEN, dseq:1719099866

首先获取备库上次应用成功的最后一个日志包信息,与前面dmarchk分析相同。然后开始传日志。

#备库(mal_site(1))1724574775最末重演日志序列号1430558,lsn1724574775
#与dmrachk分析备库最后一个归档的next_seq和clsn相同
2024-06-23 01:47:54.356 [INFO] database P0000001104 T0000000000000001557  rarch_sync_process from mal_site(1): n_apply_ep=1, apply_seq_arr[0]=1430558, apply_lsn_arr[0]=1724574775
2024-06-23 01:47:54.357 [INFO] database P0000001104 T0000000000000001557  rarch_ep_init begin, dsc_seqno:0, begin_pkg_seqno:1430558, rfil:/u01/dm/dmarch/test/ARCHIVE_LOCAL1_0x25335316_EP0_2024-06-23_00-43-33.log!
2024-06-23 01:47:54.373 [INFO] database P0000001104 T0000000000000001557  rarch_ep_init, ep->prev_term_id change from 0 to 0.
2024-06-23 01:47:54.373 [INFO] database P0000001104 T0000000000000001557  rarch_ep_init, code:0, dsc_seqno:0, begin_pkg_seqno:1430558, load ep pkg info[db_magic:0x25335316, dsc_seqno:0, pkg_seqno:1430558, min_lsn:1724574775, max_lsn:1724574775], prev_term_id:0!
2024-06-23 01:47:54.373 [INFO] database P0000001104 T0000000000000001557  rarch_sync_process, begin to send arch, db_magic:0x25335316.                      2024-06-23 01:47:54.376 [INFO] database P0000001104 T0000000000000001557  Local arch send type to TEST2 change from 0 to 1, clear recent arch send info!
2024-06-23 01:47:54.533 [INFO] database P0000001104 T0000000000000001557  rarch_sync_process, ep_seqno[0]'s rlog_pkg send over!                             2024-06-23 01:47:54.533 [INFO] database P0000001104 T0000000000000001557  rarch_sync_process, all rlog_pkg send over, rarch_apply_wait!
2024-06-23 01:47:54.547 [INFO] database P0000001104 T0000000000000001557  rarch_sync_process, all rlog_pkg send over, rarch_apply_wait success, site_seq:1, reach_until_time:0!
2024-06-23 01:47:54.547 [INFO] database P0000001104 T0000000000000001557  rarch_local_arch_send to TEST2 finished, code=0, dseq=1719099866, max_arch_lsn=1724574841!
...

为保证可以追齐,传输一段日志后将主库设置为SUSPEND,禁止业务数据继续RLOG写入。

2024-06-23 01:47:54.927 [INFO] database P0000001104 T0000000000000001566  rarch_local_arch_send to TEST2 finished, code=0, dseq=1719099871, max_arch_lsn=1724574842!
2024-06-23 01:47:54.927 [INFO] database P0000001104 T0000000000000001554  utsk_cmd_add, received sql exec cmd:1, dseq:1719099872, sql:ALTER DATABASE SUSPEND
2024-06-23 01:47:54.927 [INFO] database P0000001104 T0000000000000001554  utsk_cmd_add, cmd info: cmd=1, dseq=1719099872, name_in=, begin_lsn=-1!
2024-06-23 01:47:54.927 [INFO] database P0000001104 T0000000000000001554  utsk_dw_sql, before enqueue, len:0, tsks:0
2024-06-23 01:47:54.927 [INFO] database P0000001104 T0000000000000001554  utsk_dw_sql, after enqueue, len:1, tsks:1
2024-06-23 01:47:54.927 [INFO] database P0000001104 T0000000000000001554  utsk_dw_sql:ALTER DATABASE SUSPEND put to task sys, sys_status:OPEN
2024-06-23 01:47:54.927 [INFO] database P0000001104 T0000000000000001568  utsk_cmd_exec, cmd:1, sys_status:OPEN, dseq:1719099872
2024-06-23 01:47:55.028 [INFO] database P0000001104 T0000000000000001568  alter database suspend success!
2024-06-23 01:47:55.028 [INFO] database P0000001104 T0000000000000001568  utsk_process_dw_sql_exec, prepare sql:ALTER DATABASE SUSPEND, code:0, dseq:1719099872

继续传输直至归档追齐

2024-06-23 01:47:55.029 [INFO] database P0000001104 T0000000000000001554  utsk_cmd_add, cmd info: cmd=210, dseq=1719099873, name_in=TEST2, begin_lsn=0!
2024-06-23 01:47:55.029 [INFO] database P0000001104 T0000000000000001569  utsk_cmd_exec, cmd:210, sys_status:SUSPEND, dseq:1719099873
2024-06-23 01:47:55.036 [INFO] database P0000001104 T0000000000000001569  rarch_sync_process from mal_site(1): n_apply_ep=1, apply_seq_arr[0]=1430623, apply_lsn_arr[0]=1724574842
2024-06-23 01:47:55.037 [INFO] database P0000001104 T0000000000000001569  rarch_ep_init begin, dsc_seqno:0, begin_pkg_seqno:1430623, rfil:/u01/dm/dmarch/test/ARCHIVE_LOCAL1_0x25335316_EP0_2024-06-23_00-43-33.log!
2024-06-23 01:47:55.054 [INFO] database P0000001104 T0000000000000001569  rarch_ep_init, ep->prev_term_id change from 0 to 0.
2024-06-23 01:47:55.054 [INFO] database P0000001104 T0000000000000001569  rarch_ep_init, code:0, dsc_seqno:0, begin_pkg_seqno:1430623, load ep pkg info[db_magic:0x25335316, dsc_seqno:0, pkg_seqno:1430623, min_lsn:1724574842, max_lsn:1724574842], prev_term_id:0!
2024-06-23 01:47:55.054 [INFO] database P0000001104 T0000000000000001569  rarch_sync_process, begin to send arch, db_magic:0x25335316.
2024-06-23 01:47:55.056 [INFO] database P0000001104 T0000000000000001569  rarch_sync_process, ep_seqno[0]'s rlog_pkg send over!
2024-06-23 01:47:55.056 [INFO] database P0000001104 T0000000000000001569  rarch_sync_process, all rlog_pkg send over, rarch_apply_wait!
2024-06-23 01:47:55.060 [INFO] database P0000001104 T0000000000000001569  rarch_sync_process, all rlog_pkg send over, rarch_apply_wait success, site_seq:1, reach_until_time:0!
2024-06-23 01:47:55.060 [INFO] database P0000001104 T0000000000000001569  rarch_local_arch_send to TEST2 finished, code=0, dseq=1719099873, max_arch_lsn=1724574842!
...

日志追齐后将ARCHIVE状态恢复为VALID,保证open后保持到备库实时归档

2024-06-23 01:47:55.061 [INFO] database P0000001104 T0000000000000001554  utsk_cmd_add, cmd info: cmd=100, dseq=1719099874, name_in=, begin_lsn=-1!
2024-06-23 01:47:55.061 [INFO] database P0000001104 T0000000000000001567  utsk_cmd_exec, cmd:100, sys_status:SUSPEND, dseq:1719099874
2024-06-23 01:47:55.061 [INFO] database P0000001104 T0000000000000001567  Change TEST2 arch status from INVALID to VALID
2024-06-23 01:47:55.064 [INFO] database P0000001104 T0000000000000001567  Notify dest ep(1) set dw_stat_flag from DW_RECOVERY to NONE success.
2024-06-23 01:47:55.064 [INFO] database P0000001104 T0000000000000001567  utsk_process_mdw_chg_march, dseq:1719099874, code:0, op_flag:0, n_name:1, arch_status:VALID!
2024-06-23 01:47:55.064 [INFO] database P0000001104 T0000000000000001567  utsk_cmd_exec, utsk_process_mdw_chg_march success, sys_status:SUSPEND!
2024-06-23 01:47:55.064 [INFO] database P0000001104 T0000000000000001554  utsk_cmd_add, received sql exec cmd:1, dseq:1719099875, sql:ALTER DATABASE OPEN FORCE

数据库open完毕,守护进程状态由DW_RECOVERY切换为NONE

2024-06-23 01:47:55.064 [INFO] database P0000001104 T0000000000000001554  pha_altdb_open_for_dw alter database open start...                                2024-06-23 01:47:55.065 [INFO] database P0000001104 T0000000000000001554  pha_altdb_open_for_dw, altdb set changing end!                                    2024-06-23 01:47:55.065 [INFO] database P0000001104 T0000000000000001554  altdb open rlog_flush_notify_ex start!                                            2024-06-23 01:47:55.065 [INFO] database P0000001104 T0000000000000001554  altdb open rlog_flush_notify_ex end!                                              2024-06-23 01:47:55.065 [INFO] database P0000001104 T0000000000000001554  pha_altdb_open_for_dw alter database open success!                                2024-06-23 01:47:55.065 [INFO] database P0000001104 T0000000000000001570  total 0 active crash trx, pseg_crash_trx_rollbacksys_only(0) begin ...            2024-06-23 01:47:55.065 [INFO] database P0000001104 T0000000000000001570  pseg_crash_trx_rollback end, total 0 active crash trx, include 0 empty_trxs, 0 empty_pages which only need to delete mgr recs.                                                                                                                2024-06-23 01:47:55.065 [INFO] database P0000001104 T0000000000000001570  pseg_crash_trx_rollback end                                                       2024-06-23 01:47:55.065 [INFO] database P0000001104 T0000000000000001570  hpc_clear_active_trx adjust n_crash_active_trx from 0 to 0.                       2024-06-23 01:47:56.067 [INFO] database P0000001104 T0000000000000001554  utsk_cmd_add, cmd info: cmd=217, dseq=1719099876, name_in=, begin_lsn=-1!         2024-06-23 01:47:56.067 [INFO] database P0000001104 T0000000000000001554  utsk_set_global_dw_stat, begin, msg_dseq:1719099876                               2024-06-23 01:47:56.067 [INFO] database P0000001104 T0000000000000001554  set g_dw_stat from DW_RECOVERY to NONE success, g_dw_recover_stop is 0            2024-06-23 01:47:56.070 [INFO] database P0000001104 T0000000000000001554  Notify dest ep(TEST2) set dw_stat_flag from DW_RECOVERY to NONE success.          2024-06-23 01:47:56.070 [INFO] database P0000001104 T0000000000000001554  utsk_set_global_dw_stat, finished, msg_dseq:1719099876, set code:0                2024-06-23 01:47:57.588 [INFO] database P0000001104 T0000000000000001518  Local arch send type to TEST2 change from 1 to 0, clear recent arch send info!

总结一下先停备库时系统将做如下处理:

1、主库向备库发送实时归档失败

2、主库状态变为SUSPEND,备库归档状态变INVALID

这时查询v$arch_status视图ARCH_STATUS字段等于INVALID。 

3、主库open,继续对外提供服务

备库再次启动后

1、守护系统状态设置为DW_RECOVERY

2、通知主库同步历史归档

这时主库查询v$arch_send_info视图FOR_RECOVERY字段等于Y。

查询v$recover_status视图有记录。

3、通知主库切换为SUSPEND状态

4、同步剩余历史归档

同步完毕时查询v$recover_status视图RECOVER_PERCENT字段等于100%

5、通知主库设置备库归档为VALID状态

这时查询v$arch_status视图ARCH_STATUS字段等于VALID。

6、通知主库切换为 Open 状态

7、守护进程状态由DW_RECOVERY切换为NONE

这时查询v$recover_status视图无记录,查询v$arch_send_info视图FOR_RECOVERY字段等于N。

  • 4
    点赞
  • 6
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值