Kingbase数据库V8R6版本高可用集群切机测试

一.测试环境:
1.架构:X86
2.操作系统:centos7
3.数据库版本:V008R006C005B0054
4.服务器物理ip:192.168.81.181、192.168.81.182
5.集群漂移vip:192.168.81.185
二.集群配置文件主要参数值:
1.conninfo='host=192.168.81.182 user=system dbname=esrep port=54322 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'
2.reconnect_attempts=10
3.reconnect_interval=6
4.failover='automatic'
5.recovery='automatic'
6.trusted_servers='192.168.81.2'
7.virtual_ip='192.168.81.185/24'

三.测试过程:
1.查看当前集群状态:
[k86@192-168-81-182 etc]$ repmgr cluster show
ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                 
----+---------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------------------------------------------         ------------------------------------------------
1  | node182 | standby |   running | node181  | default  | 100      | 4        | host=192.168.81.182 user=system dbname=esrep port=54322 connect_timeout=10 keepalives=1              keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2  | node181 | primary |  running |          | default  | 100      | 4        | host=192.168.81.181 user=system dbname=esrep port=54322 connect_timeout=10 keepalives=1                    keepalives_idle=10 keepalives_interval=1 keepalives_count=3

2.手动停止当前集群的主库:
   [k86@192-168-81-181 ~]$ sys_ctl -D /home/k86/cluster/cluster/fcj/kingbase/data -m fast stop
3.查看该节点(192.168.81.181)集群日志(hamgr.log):
638 [2022-11-15 18:03:25] [WARNING] unable to ping "host=192.168.81.181 user=system dbname=esrep port=54322 connect_timeout=10 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3"
639 [2022-11-15 18:03:25] [DETAIL] PQping() returned "PQPING_REJECT"
640 [2022-11-15 18:03:25] [WARNING] connection to node "node181" (ID: 1) lost
641 [2022-11-15 18:03:25] [DETAIL]
642 FATAL:  terminating connection due to administrator command
643 server closed the connection unexpectedly
644         This probably means the server terminated abnormally
645         before or while processing the request.
646
647 [2022-11-15 18:03:25] [INFO] attempting to reconnect to node "node181" (ID: 1)
648 [2022-11-15 18:03:25] [ERROR] connection to database failed
649 [2022-11-15 18:03:25] [DETAIL]
650 FATAL:  the database system is shutting down
651
652 [2022-11-15 18:03:25] [DETAIL] attempted to connect using:
653   user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3     fallback_application_name=repmgr options=-csearch_path=
654 [2022-11-15 18:03:25] [WARNING] reconnection to node "node181" (ID: 1) failed
655 [2022-11-15 18:03:25] [WARNING] unable to connect to local node
656 [2022-11-15 18:03:25] [INFO] sleeping 6 seconds until next reconnection attempt
657 [2022-11-15 18:03:31] [INFO] checking state of node 1, 1 of 10 attempts
658 [2022-11-15 18:03:31] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
659 [2022-11-15 18:03:31] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
660 [2022-11-15 18:03:31] [INFO] sleeping 6 seconds until next reconnection attempt
661 [2022-11-15 18:03:37] [INFO] checking state of node 1, 2 of 10 attempts
662 [2022-11-15 18:03:37] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
663 [2022-11-15 18:03:37] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
664 [2022-11-15 18:03:37] [INFO] sleeping 6 seconds until next reconnection attempt
665 [2022-11-15 18:03:43] [INFO] checking state of node 1, 3 of 10 attempts
666 [2022-11-15 18:03:43] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
667 [2022-11-15 18:03:43] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
668 [2022-11-15 18:03:43] [INFO] sleeping 6 seconds until next reconnection attempt
669 [2022-11-15 18:03:49] [INFO] checking state of node 1, 4 of 10 attempts
670 [2022-11-15 18:03:49] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
671 [2022-11-15 18:03:49] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
672 [2022-11-15 18:03:49] [INFO] sleeping 6 seconds until next reconnection attempt
673 [2022-11-15 18:03:55] [INFO] checking state of node 1, 5 of 10 attempts
674 [2022-11-15 18:03:55] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
675 [2022-11-15 18:03:55] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
676 [2022-11-15 18:03:55] [INFO] sleeping 6 seconds until next reconnection attempt
677 [2022-11-15 18:04:01] [INFO] checking state of node 1, 6 of 10 attempts
678 [2022-11-15 18:04:01] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
679 [2022-11-15 18:04:01] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
680 [2022-11-15 18:04:01] [INFO] sleeping 6 seconds until next reconnection attempt
681 [2022-11-15 18:04:07] [INFO] checking state of node 1, 7 of 10 attempts
682 [2022-11-15 18:04:07] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
683 [2022-11-15 18:04:07] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
684 [2022-11-15 18:04:07] [INFO] sleeping 6 seconds until next reconnection attempt
685 [2022-11-15 18:04:13] [INFO] checking state of node 1, 8 of 10 attempts
686 [2022-11-15 18:04:13] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
687 [2022-11-15 18:04:13] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
688 [2022-11-15 18:04:13] [INFO] sleeping 6 seconds until next reconnection attempt
689 [2022-11-15 18:04:19] [INFO] checking state of node 1, 9 of 10 attempts
690 [2022-11-15 18:04:19] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
691 [2022-11-15 18:04:19] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
692 [2022-11-15 18:04:19] [INFO] sleeping 6 seconds until next reconnection attempt
693 [2022-11-15 18:04:25] [INFO] checking state of node 1, 10 of 10 attempts
694 [2022-11-15 18:04:25] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
695 [2022-11-15 18:04:25] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
696 [2022-11-15 18:04:25] [WARNING] unable to reconnect to node 1 after 10 attempts
697 [2022-11-15 18:04:25] [NOTICE] unable to connect to local node, falling back to degraded monitoring
698 [2022-11-15 18:04:25] [WARNING] unable to ping "host=192.168.81.181 user=system dbname=esrep port=54322 connect_timeout=10 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3"
699 [2022-11-15 18:04:25] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
700 [2022-11-15 18:04:25] [ERROR] unable to determine if server is in recovery
701 [2022-11-15 18:04:25] [DETAIL] query text is:
702 SELECT pg_catalog.pg_is_in_recovery()
703 [2022-11-15 18:04:25] [WARNING] unable to determine node recovery status
704 [2022-11-15 18:04:27] [WARNING] unable to ping "host=192.168.81.181 user=system dbname=esrep port=54322 connect_timeout=10 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3"
705 [2022-11-15 18:04:27] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
706 [2022-11-15 18:04:27] [WARNING] connection to node "node181" (ID: 1) lost
707 [2022-11-15 18:04:27] [DETAIL]
708 connection pointer is NULL
709
710 [2022-11-15 18:04:27] [INFO] attempting to reconnect to node "node181" (ID: 1)
711 [2022-11-15 18:04:27] [ERROR] connection to database failed
712 [2022-11-15 18:04:27] [DETAIL]
713 could not connect to server: Connection refused
714         Is the server running on host "192.168.81.181" and accepting
715         TCP/IP connections on port 54322?
716
717 [2022-11-15 18:04:27] [DETAIL] attempted to connect using:
718   user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3     fallback_application_name=repmgr options=-csearch_path=
719 [2022-11-15 18:04:27] [WARNING] reconnection to node "node181" (ID: 1) failed
720 [2022-11-15 18:04:27] [WARNING] unable to ping "host=192.168.81.181 user=system dbname=esrep port=54322 connect_timeout=10 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3"
721 [2022-11-15 18:04:27] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
722 [2022-11-15 18:04:27] [ERROR] unable to determine if server is in recovery
723 [2022-11-15 18:04:27] [DETAIL] query text is:
724 SELECT pg_catalog.pg_is_in_recovery()
725 [2022-11-15 18:04:27] [WARNING] unable to determine node recovery status
726 [2022-11-15 18:04:29] [WARNING] unable to ping "host=192.168.81.181 user=system dbname=esrep port=54322 connect_timeout=10 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3"
727 [2022-11-15 18:04:29] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
728 [2022-11-15 18:04:29] [WARNING] connection to node "node181" (ID: 1) lost
729 [2022-11-15 18:04:29] [DETAIL]
730 connection pointer is NULL
731
732 [2022-11-15 18:04:29] [INFO] attempting to reconnect to node "node181" (ID: 1)
733 [2022-11-15 18:04:29] [ERROR] connection to database failed
734 [2022-11-15 18:04:29] [DETAIL]
735 could not connect to server: Connection refused
736         Is the server running on host "192.168.81.181" and accepting
737         TCP/IP connections on port 54322?
738
739 [2022-11-15 18:04:29] [DETAIL] attempted to connect using:
740   user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3     fallback_application_name=repmgr options=-csearch_path=
741 [2022-11-15 18:04:29] [WARNING] reconnection to node "node181" (ID: 1) failed
742 [2022-11-15 18:04:29] [WARNING] unable to ping "host=192.168.81.181 user=system dbname=esrep port=54322 connect_timeout=10 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3"
743 [2022-11-15 18:04:29] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
744 [2022-11-15 18:04:29] [ERROR] unable to determine if server is in recovery
745 [2022-11-15 18:04:29] [DETAIL] query text is:
746 SELECT pg_catalog.pg_is_in_recovery()
747 [2022-11-15 18:04:29] [WARNING] unable to determine node recovery status
748 [2022-11-15 18:04:35] [INFO] attempting to reconnect to node "node181" (ID: 1)
749 [2022-11-15 18:04:45] [INFO] attempting to reconnect to node "node181" (ID: 1)
750 [2022-11-15 18:04:55] [INFO] attempting to reconnect to node "node181" (ID: 1)
751 [2022-11-15 18:05:05] [INFO] attempting to reconnect to node "node181" (ID: 1)
752 [2022-11-15 18:05:15] [INFO] attempting to reconnect to node "node181" (ID: 1)
753 [2022-11-15 18:05:15] [INFO] reconnected to node "node181" (ID: 1)
754 INFO:  set_repmgrd_pid(): provided pidfile is /home/k86/cluster/cluster/fcj/kingbase/etc/hamgrd.pid
755 [2022-11-15 18:05:15] [NOTICE] reconnected to node after 50 seconds, node is now a standby, switching to standby monitoring
756 [2022-11-15 18:05:15] [INFO] searching for primary node
757 [2022-11-15 18:05:15] [INFO] checking if node 2 is primary
758 [2022-11-15 18:05:15] [INFO] current primary node is 2
759 [2022-11-15 18:05:15] [NOTICE] former primary has been restored as standby after 50 seconds, updating node record and resuming monitoring
760 [2022-11-15 18:05:16] [NOTICE] try to change wal catched_up state to 1
761 [2022-11-15 18:05:16] [INFO] primary flush lsn is 0/D001A00, local flush lsn is 0/D001A00
762 [2022-11-15 18:05:18] [NOTICE] try to change streaming_sync state to TRUE
763 [2022-11-15 18:10:16] [INFO] node "node181" (ID: 1) monitoring upstream node "node182" (ID: 2) in normal state
764 [2022-11-15 18:15:17] [INFO] node "node181" (ID: 1) monitoring upstream node "node182" (ID: 2) in normal state
765 [2022-11-15 18:20:19] [INFO] node "node181" (ID: 1) monitoring upstream node "node182" (ID: 2) in normal state
766 [2022-11-15 18:25:21] [INFO] node "node181" (ID: 1) monitoring upstream node "node182" (ID: 2) in normal state
767 [2022-11-15 18:30:22] [INFO] node "node181" (ID: 1) monitoring upstream node "node182" (ID: 2) in normal state
768 [2022-11-15 18:35:24] [INFO] node "node181" (ID: 1) monitoring upstream node "node182" (ID: 2) in normal state
769 [2022-11-15 18:40:25] [INFO] node "node181" (ID: 1) monitoring upstream node "node182" (ID: 2) in normal state

从第657行开始主库的repmgrd进程通过conninfo的配置信息去连接本机数据库,进行10次尝试(该尝试连接次数由reconnect_attempts参数值决定)且每次尝试间隔6s(该尝试连接次数由reconnect_interval参数值决定)后由PQping() 返回 "PQPING_NO_RESPONSE",表示通过conninfo给出的连接信息无法连接到目标服务端。随后。通过conninfo配置连接数据库去执行SELECT pg_catalog.pg_is_in_recovery()也无法判定该节点是不是处于恢复状态。
4.查看备节点(192.168.81.182)集群日志(hamgr.log):
447 [2022-11-15 18:03:26] [WARNING] unable to ping "host=192.168.81.181 user=system dbname=esrep port=54322 connect_timeout=10 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3"
448 [2022-11-15 18:03:26] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
449 [2022-11-15 18:03:26] [WARNING] unable to connect to upstream node "node181" (ID: 1)
450 [2022-11-15 18:03:26] [INFO] checking state of node "node181" (ID: 1), 1 of 10 attempts
451 [2022-11-15 18:03:26] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
452 [2022-11-15 18:03:26] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
453 [2022-11-15 18:03:26] [INFO] sleeping up to 6 seconds until next reconnection attempt
454 [2022-11-15 18:03:32] [INFO] checking state of node "node181" (ID: 1), 2 of 10 attempts
455 [2022-11-15 18:03:32] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
456 [2022-11-15 18:03:32] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
457 [2022-11-15 18:03:32] [INFO] sleeping up to 6 seconds until next reconnection attempt
458 [2022-11-15 18:03:38] [INFO] checking state of node "node181" (ID: 1), 3 of 10 attempts
459 [2022-11-15 18:03:38] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
460 [2022-11-15 18:03:38] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
461 [2022-11-15 18:03:38] [INFO] sleeping up to 6 seconds until next reconnection attempt
462 [2022-11-15 18:03:44] [INFO] checking state of node "node181" (ID: 1), 4 of 10 attempts
463 [2022-11-15 18:03:44] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
464 [2022-11-15 18:03:44] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
465 [2022-11-15 18:03:44] [INFO] sleeping up to 6 seconds until next reconnection attempt
466 [2022-11-15 18:03:50] [INFO] checking state of node "node181" (ID: 1), 5 of 10 attempts
467 [2022-11-15 18:03:50] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
468 [2022-11-15 18:03:50] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
469 [2022-11-15 18:03:50] [INFO] sleeping up to 6 seconds until next reconnection attempt
470 [2022-11-15 18:03:56] [INFO] checking state of node "node181" (ID: 1), 6 of 10 attempts
471 [2022-11-15 18:03:56] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
472 [2022-11-15 18:03:56] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
473 [2022-11-15 18:03:56] [INFO] sleeping up to 6 seconds until next reconnection attempt
474 [2022-11-15 18:04:02] [INFO] checking state of node "node181" (ID: 1), 7 of 10 attempts
475 [2022-11-15 18:04:02] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
476 [2022-11-15 18:04:02] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
477 [2022-11-15 18:04:02] [INFO] sleeping up to 6 seconds until next reconnection attempt
478 [2022-11-15 18:04:08] [INFO] checking state of node "node181" (ID: 1), 8 of 10 attempts
479 [2022-11-15 18:04:08] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
480 [2022-11-15 18:04:08] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
481 [2022-11-15 18:04:08] [INFO] sleeping up to 6 seconds until next reconnection attempt
482 [2022-11-15 18:04:14] [INFO] checking state of node "node181" (ID: 1), 9 of 10 attempts
483 [2022-11-15 18:04:14] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
484 [2022-11-15 18:04:14] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
485 [2022-11-15 18:04:14] [INFO] sleeping up to 6 seconds until next reconnection attempt
486 [2022-11-15 18:04:20] [INFO] checking state of node "node181" (ID: 1), 10 of 10 attempts
487 [2022-11-15 18:04:20] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.81.181 port=54322 keepalives=1 keepalives_idl    e=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
488 [2022-11-15 18:04:20] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
489 [2022-11-15 18:04:20] [WARNING] unable to reconnect to node "node181" (ID: 1) after 10 attempts
490 [2022-11-15 18:04:20] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
491 [2022-11-15 18:04:20] [WARNING] wal receiver not running
492 [2022-11-15 18:04:20] [NOTICE] WAL receiver disconnected on all sibling nodes
493 [2022-11-15 18:04:20] [INFO] WAL receiver disconnected on all 0 sibling nodes
494 [2022-11-15 18:04:20] [INFO] 0 active sibling nodes registered
495 [2022-11-15 18:04:20] [INFO] 2 total nodes registered
496 [2022-11-15 18:04:20] [INFO] primary node  "node181" (ID: 1) and this node have the same location ("default")
497 [2022-11-15 18:04:20] [INFO] no other sibling nodes - we win by default
498 [2022-11-15 18:04:20] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms
499 [2022-11-15 18:04:20] [NOTICE] this node is the only available candidate and will now promote itself
500 [2022-11-15 18:04:20] [INFO] try to ping the trusted_servers "192.168.81.2" before execute promote_command
501 [2022-11-15 18:04:22] [NOTICE] PING 192.168.81.2 (192.168.81.2) 56(84) bytes of data.
502
503 --- 192.168.81.2 ping statistics ---
504 2 packets transmitted, 2 received, 0% packet loss, time 1001ms
505 rtt min/avg/max/mdev = 0.309/0.399/0.490/0.092 ms
506
507 [2022-11-15 18:04:22] [NOTICE] successfully ping one or more of the trusted_servers "192.168.81.2"
508 [2022-11-15 18:04:23] [NOTICE] try to stop old primary db (host: "192.168.81.181")
509 [2022-11-15 18:04:24] [NOTICE] PING 192.168.81.185 (192.168.81.185) 56(84) bytes of data.
510
511 --- 192.168.81.185 ping statistics ---
512 2 packets transmitted, 2 received, 0% packet loss, time 1002ms
513 rtt min/avg/max/mdev = 0.817/0.825/0.834/0.029 ms
514
515 [2022-11-15 18:04:24] [WARNING] the virtual ip is already on other host, try to release it on old primary node (host: "192.168.81.181")
516 [2022-11-15 18:04:25] [INFO] ES connection to host "192.168.81.181" succeeded, ready to release vip on it
517 [2022-11-15 18:04:25] [NOTICE] old primary node (host: "192.168.81.181") release the virtual ip 192.168.81.185/24 success
518 [2022-11-15 18:04:25] [NOTICE] will acquire the virtual ip again
519 [2022-11-15 18:04:26] [NOTICE] PING 192.168.81.185 (192.168.81.185) 56(84) bytes of data.
520
521 --- 192.168.81.185 ping statistics ---
522 2 packets transmitted, 0 received, +1 errors, 100% packet loss, time 1000ms
523
524
525 [2022-11-15 18:04:26] [WARNING] ping host"192.168.81.185" failed
526 [2022-11-15 18:04:26] [DETAIL] average RTT value is not greater than zero
527 [2022-11-15 18:04:26] [INFO] loadvip result: 1, arping result: 1
528 [2022-11-15 18:04:26] [NOTICE] new primary node (ID: 2) acquire the virtual ip 192.168.81.185/24 success
529 [2022-11-15 18:04:26] [INFO] promote_command is:
530   "/home/k86/cluster/cluster/fcj/kingbase/bin/repmgr  standby promote -f /home/k86/cluster/cluster/fcj/kingbase/etc/repmgr.conf"
531 NOTICE: promoting standby to primary
532 DETAIL: promoting server "node182" (ID: 2) using sys_promote()
533 NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
534 INFO: SET synchronous TO "async" on primary host
535 NOTICE: STANDBY PROMOTE successful
536 DETAIL: server "node182" (ID: 2) was successfully promoted to primary
537 [2022-11-15 18:04:27] [INFO] sleeping 6 seconds until next reconnection attempt
538 [2022-11-15 18:04:33] [INFO] checking state of node 2, 1 of 10 attempts
539 [2022-11-15 18:04:33] [NOTICE] node 2 has recovered, reconnecting
540 [2022-11-15 18:04:33] [INFO] connection to node 2 succeeded
541 [2022-11-15 18:04:33] [INFO] original connection is still available
542 [2022-11-15 18:04:33] [INFO] switching to primary monitoring mode
543 [2022-11-15 18:04:33] [NOTICE] monitoring cluster primary "node182" (ID: 2)
544 [2022-11-15 18:04:33] [INFO] create a thread 0x7fc307114700 to check the cluster status
545 [2022-11-15 18:04:33] [INFO] child node: 1; attached: no
546 [2022-11-15 18:04:33] [INFO] check node status again, try 1 / 10 times
547 [2022-11-15 18:04:34] [INFO] node (ID: 1): no server running
548 [2022-11-15 18:04:34] [INFO] [thread 0x7fc307114700] the cluster has no other running primary node, exit
549 [2022-11-15 18:04:35] [INFO] child node: 1; attached: no
550 [2022-11-15 18:04:35] [INFO] check node status again, try 2 / 10 times
551 [2022-11-15 18:04:37] [INFO] child node: 1; attached: no
552 [2022-11-15 18:04:37] [INFO] check node status again, try 3 / 10 times
553 [2022-11-15 18:04:39] [INFO] child node: 1; attached: no
554 [2022-11-15 18:04:39] [INFO] check node status again, try 4 / 10 times
555 [2022-11-15 18:04:42] [INFO] child node: 1; attached: no
556 [2022-11-15 18:04:42] [INFO] check node status again, try 5 / 10 times
557 [2022-11-15 18:04:44] [INFO] child node: 1; attached: no
558 [2022-11-15 18:04:44] [INFO] check node status again, try 6 / 10 times
559 [2022-11-15 18:04:46] [INFO] child node: 1; attached: no
560 [2022-11-15 18:04:46] [INFO] check node status again, try 7 / 10 times
561 [2022-11-15 18:04:48] [INFO] child node: 1; attached: no
562 [2022-11-15 18:04:48] [INFO] check node status again, try 8 / 10 times
563 [2022-11-15 18:04:50] [INFO] child node: 1; attached: no
564 [2022-11-15 18:04:50] [INFO] check node status again, try 9 / 10 times
565 [2022-11-15 18:04:52] [INFO] child node: 1; attached: no
566 [2022-11-15 18:04:52] [INFO] check node status again, try 10 / 10 times
567 [2022-11-15 18:04:54] [INFO] child node: 1; attached: no
568 [2022-11-15 18:04:54] [INFO] found node down, recovery will be triggered after recovery delay time 20s
569 [2022-11-15 18:04:56] [INFO] child node: 1; attached: no
570 [2022-11-15 18:04:58] [INFO] child node: 1; attached: no
571 [2022-11-15 18:05:00] [INFO] child node: 1; attached: no
572 [2022-11-15 18:05:02] [INFO] child node: 1; attached: no
573 [2022-11-15 18:05:04] [INFO] child node: 1; attached: no
574 [2022-11-15 18:05:06] [INFO] child node: 1; attached: no
575 [2022-11-15 18:05:08] [INFO] child node: 1; attached: no
576 [2022-11-15 18:05:10] [INFO] child node: 1; attached: no
577 [2022-11-15 18:05:12] [INFO] child node: 1; attached: no
578 [2022-11-15 18:05:14] [INFO] child node: 1; attached: no
579 [2022-11-15 18:05:14] [INFO] recovery delay time reached. can do recovery now.
580 [2022-11-15 18:05:14] [INFO] [thread pid:31795] do_nodes_recovery thread begin. The pthread_t tid is 0x7fc307114700
581 [2022-11-15 18:05:14] [NOTICE] [thread pid:31795] node (ID: 1; host: "192.168.81.181") is not attached, ready to auto-recovery
582 WARNING: following issues were detected
583   - unable to connect to node "node181" (ID: 1)
584
585 HINT: execute with --verbose option to see connection error messages
586 [2022-11-15 18:05:14] [NOTICE] [thread pid:31795] Now, the primary host ip: 192.168.81.182
587 [2022-11-15 18:05:14] [INFO] [thread pid:31795] ES connection to host "192.168.81.181" succeeded, ready to do auto-recovery
588 [2022-11-15 18:05:15] [INFO] unlink file /tmp/.s.KINGBASE.54322.lock
589 [2022-11-15 18:05:15] [NOTICE] executing repmgr command "/home/k86/cluster/cluster/fcj/kingbase/bin/repmgr --dbname="host=192.168.81.182 dbname=esrep     user=system port=54322" node rejoin --force-rewind --upstream-node-id 2"
590 NOTICE: rejoin target is node "node182" (ID: 2)
591 NOTICE: executing sys_rewind
592 DETAIL: sys_rewind command is "/home/k86/cluster/cluster/fcj/kingbase/bin/sys_rewind -D '/home/k86/cluster/cluster/fcj/kingbase/data' --source-server=    'host=192.168.81.182 user=system dbname=esrep port=54322 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'"
593 sys_rewind: servers diverged at WAL location 0/D0000A0 on timeline 3
594 sys_rewind: no rewind required
595 sys_rewind: status check: restore all the backup temp file, Done!
596 NOTICE: 0 files copied to /home/k86/cluster/cluster/fcj/kingbase/data
597 INFO: creating replication slot as user "system"
598 NOTICE: setting node 1's upstream to node 2
599 WARNING: unable to ping "host=192.168.81.181 user=system dbname=esrep port=54322 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interva    l=1 keepalives_count=3"
600 DETAIL: PQping() returned "PQPING_NO_RESPONSE"
601 NOTICE: begin to start server at 2022-11-15 18:05:15.287856
602 NOTICE: starting server using "/home/k86/cluster/cluster/fcj/kingbase/bin/sys_ctl  -w -D '/home/k86/cluster/cluster/fcj/kingbase/data' start"
603 NOTICE: start server finish at 2022-11-15 18:05:15.400976
604 NOTICE: NODE REJOIN successful
605 DETAIL: node 1 is now attached to node 2
606 [2022-11-15 18:05:15] [NOTICE] kbha: node (ID: 1) rejoin success.
607
608 [2022-11-15 18:05:15] [NOTICE] [thread pid:31795] node "node181" (ID: 1) auto-recovery success
609 [2022-11-15 18:05:15] [INFO] [thread pid:31795] Is standby node "node181" (ID: 1) ready for connection?
610 [2022-11-15 18:05:15] [INFO] [thread pid:31795] the standby node "node181" (ID: 1) connected ... OK
611 [2022-11-15 18:05:15] [INFO] [thread pid:31795] do_nodes_recovery thread ends. The pthread_t tid is 0x7fc307114700
612 [2022-11-15 18:05:16] [NOTICE] new standby "node181" (ID: 1) has connected
613 [2022-11-15 18:05:16] [INFO] SET synchronous TO "quorum" on primary host
614 [2022-11-15 18:05:16] [INFO] thread tid:0x7fc307114700 is not running
615 [2022-11-15 18:05:16] [INFO] the recovery thread was exited, reset tid
616 [2022-11-15 18:05:16] [NOTICE] Some nodes reconnect, all standby nodes are OK now
[2022-11-15 18:09:35] [INFO] monitoring primary node "node182" (ID: 2) in normal state
[2022-11-15 18:14:36] [INFO] monitoring primary node "node182" (ID: 2) in normal state
[2022-11-15 18:19:38] [INFO] monitoring primary node "node182" (ID: 2) in normal state
[2022-11-15 18:24:40] [INFO] monitoring primary node "node182" (ID: 2) in normal state
[2022-11-15 18:29:41] [INFO] monitoring primary node "node182" (ID: 2) in normal state
[2022-11-15 18:34:43] [INFO] monitoring primary node "node182" (ID: 2) in normal state
[2022-11-15 18:39:44] [INFO] monitoring primary node "node182" (ID: 2) in normal state
[2022-11-15 18:44:46] [INFO] monitoring primary node "node182" (ID: 2) in normal state
[2022-11-15 18:49:48] [INFO] monitoring primary node "node182" (ID: 2) in normal state
[2022-11-15 18:54:49] [INFO] monitoring primary node "node182" (ID: 2) in normal state
[2022-11-15 18:59:51] [INFO] monitoring primary node "node182" (ID: 2) in normal state
根据原备库的集群日志信息可以看出:与原主库几乎相同的时间(第450行),备库的repmgrd进程也通过原主库的conninfo配置信息去连接源端的主库,进行10次尝试(该尝试连接次数由reconnect_attempts参数值决定)且每次尝试间隔6s(该尝试连接次数由reconnect_interval参数值决定)后也由PQping() 返回 "PQPING_NO_RESPONSE"。随后,检查集群中有几个点节点(第495行),确定该备库节点是唯一的候选备库后进行备库提升为主库的操作(第499行)。将备库提升为新主库之前会先尝试ping网关的操作(第500行),成功ping通网关后再去ping漂移vip(第509行),此时,如果能够ping通说明vip还在原主库,需将其del后再add到本地网卡(第511行-第528行)。获取到vip后,在当前备库执行repmgr standby promote的操作,实际使用的是sys_promote函数操作提升备库。等待长达60s时间后(即promote_check_timeout的参数值),备库提升为主库成功(第533行)。接下来重要的一步是repmgrd进程将集群同步方式设定为async(即异步)模式(第534行)。promote操作成功后,间隔reconnect_interval时间后reconnect新主库确定该节点状态,还是10次尝试每次间隔6s,如果能成功连接到节点,则尝试连接操作终止。此时,创建一个线程(thread 0x7fc307114700)去检查集群状态,发现node:1节点是未连接状态(child node: 1; attached: no),尝试十次每次间隔2s后仍未发现node:1节点仍未连接,则认为此节点状态为down,将在20s延时后触发恢复操作。达到20s的恢复延时,又创建出线程(pid:31795)去执行do_nodes_recovery操作,线程的tid为0x7fc307114700,此线程检查到node (ID: 1; host: "192.168.81.181")处于not attached状态,所以进行auto-recovery操作。恢复操作首先会删除/tmp/.s.KINGBASE.54322.lock文件,然后执行repmgr  node rejoin操作,此操作中制定了--force-rewind参数,故会执行sys_rewind命令。上述操作完成后,repmgrd进程会创建新的流复制复制槽(第597行),然后通过node:1的conninfo配置去连接rejoin后的节点,当PQping() 返回 "PQPING_NO_RESPONSE"后,执行sys_ctl start命令去启动node:1节点的数据库。同时在原主库的hamgr.log日志中也能看到相关日志(原主库hamgr.log日志的第759行)。node "node181" (ID: 1) auto-recovery success,同时确定该节点可以连接,结束do_nodes_recovery thread的线程(tid为0x7fc307114700),repmgrd进程将同步状态设置为quorum,此时原主库的hamgr.log将流复制同步模式改变为sync模式(第762行)。
5.查看集群状态:
[k86@192-168-81-182 etc]$ repmgr cluster show
ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                 
----+---------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------
1  | node181 | standby |   running | node182  | default  | 100      | 4        | host=192.168.81.181 user=system dbname=esrep port=54322 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2  | node182 | primary | * running |          | default  | 100      | 4        | host=192.168.81.182 user=system dbname=esrep port=54322 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
至此,集群切换日志分析完成。
附:安装集群的时候recovery参数默认值为standby,这样在手动停止主库,备库完成提升成为新主库后,旧主库是无法以备库身份自动加入集群的(即无repmgr node rejoin操作),需手动执行才可。

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值