pg主备切换

韶博雅

已于 2023-10-17 09:19:32 修改

阅读量322

点赞数

文章标签：数据库

于 2023-10-15 20:18:13 首次发布

本文链接：https://blog.csdn.net/weixin_39735909/article/details/133845824

版权

1. switchover正常主从切换

1.1 正常切换

官网：https://repmgr.org/docs/current/repmgr-standby-switchover.html
repmgr -f /postgresql/pg13/repmgr.conf cluster show
repmgr -f /postgresql/pg13/repmgr.conf standby switchover --siblings-follow --dry-run --force-rewind
repmgr -f /postgresql/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind
repmgr -f /postgresql/pg13/repmgr.conf cluster show
注意：切换操作只能在standby 库上执行

-- 可以debug打印详细的切换过程
repmgr -f /postgresql/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind --log-level DEBUG --verbose


案例1：
[pgsql@pg1 ~]$ repmgr -f /postgresql/pg13/repmgr.conf cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                            
----+---------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------------------------
 1  | pg1     | primary | * running |          | default  | 100      | 1        | host=172.72.10.2 user=repmgr password=repmgr dbname=repmgr connect_timeout=2 
 2  | pg2     | standby |   running | pg1      | default  | 100      | 1        | host=172.72.10.3  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 3  | pg3     | standby |   running | pg1      | default  | 100      | 1        | host=172.72.10.5  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 6  | wintess | witness | * running | pg1      | default  | 0        | n/a      | host=172.72.10.7  user=repmgr password=repmgr dbname=repmgr connect_timeout=2



[pgsql@pg2 ~]$ repmgr -f /postgresql/pg13/repmgr.conf standby switchover --siblings-follow --dry-run --force-rewind
NOTICE: checking switchover on node "pg2" (ID: 2) in --dry-run mode
INFO: prerequisites for using pg_rewind are met
INFO: SSH connection to host "172.72.10.2" succeeded
INFO: able to execute "repmgr" on remote host "172.72.10.2"
INFO: all sibling nodes are reachable via SSH
INFO: 3 walsenders required, 10 available
INFO: demotion candidate is able to make replication connection to promotion candidate
INFO: 0 pending archive files
INFO: replication lag on this standby is 0 seconds
NOTICE: local node "pg2" (ID: 2) would be promoted to primary; current primary "pg1" (ID: 1) would be demoted to standby
INFO: following shutdown command would be run on node "pg1":
  "/postgresql/pg13/bin/pg_ctl  -D '/postgresql/pgdata' -W -m fast stop"
INFO: parameter "shutdown_check_timeout" is set to 60 seconds
INFO: prerequisites for executing STANDBY SWITCHOVER are met



[pgsql@pg2 ~]$ repmgr -f /postgresql/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind
NOTICE: executing switchover on node "pg2" (ID: 2)
NOTICE: local node "pg2" (ID: 2) will be promoted to primary; current primary "pg1" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "pg1" (ID: 1)
NOTICE: issuing CHECKPOINT on node "pg1" (ID: 1) 
DETAIL: executing server command "/postgresql/pg13/bin/pg_ctl  -D '/postgresql/pgdata' -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/F000028
NOTICE: promoting standby to primary
DETAIL: promoting server "pg2" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "pg2" (ID: 2) was successfully promoted to primary
NOTICE: issuing CHECKPOINT on node "pg2" (ID: 2) 
ERROR: unable to execute CHECKPOINT
INFO: local node 1 can attach to rejoin target node 2
DETAIL: local node's recovery point: 0/F000028; rejoin target node's fork point: 0/F0000A0
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=172.72.10.2 user=repmgr password=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/postgresql/pg13/bin/pg_ctl  -w -D '/postgresql/pgdata' start"
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
NOTICE: node  "pg2" (ID: 2) promoted to primary, node "pg1" (ID: 1) demoted to standby
NOTICE: executing STANDBY FOLLOW on 2 of 2 siblings
INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
NOTICE: switchover was successful
DETAIL: node "pg2" is now primary and node "pg1" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully



[pgsql@pg1 ~]$ repmgr -f /postgresql/pg13/repmgr.conf cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                            
----+---------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------------------------
 1  | pg1     | standby |   running | pg2      | default  | 100      | 1        | host=172.72.10.2 user=repmgr password=repmgr dbname=repmgr connect_timeout=2 
 2  | pg2     | primary | * running |          | default  | 100      | 2        | host=172.72.10.3  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 3  | pg3     | standby |   running | pg2      | default  | 100      | 2        | host=172.72.10.5  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 6  | wintess | witness | * running | pg2      | default  | 0        | n/a      | host=172.72.10.7  user=repmgr password=repmgr dbname=repmgr connect_timeout=2



repmgr=# \x 
Expanded display is on.
repmgr=# select * from nodes;
-[ RECORD 1 ]----+------------------------------------------------------------------------------
node_id          | 2
upstream_node_id | 
active           | t
node_name        | pg2
type             | primary
location         | default
priority         | 100
conninfo         | host=172.72.10.3  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
repluser         | repmgr
slot_name        | 
config_file      | /postgresql/pg13/repmgr.conf
-[ RECORD 2 ]----+------------------------------------------------------------------------------
node_id          | 1
upstream_node_id | 2
active           | t
node_name        | pg1
type             | standby
location         | default
priority         | 100
conninfo         | host=172.72.10.2 user=repmgr password=repmgr dbname=repmgr connect_timeout=2
repluser         | repmgr
slot_name        | 
config_file      | /postgresql/pg13/repmgr.conf
-[ RECORD 3 ]----+------------------------------------------------------------------------------
node_id          | 3
upstream_node_id | 2
active           | t
node_name        | pg3
type             | standby
location         | default
priority         | 100
conninfo         | host=172.72.10.5  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
repluser         | repmgr
slot_name        | 
config_file      | /postgresql/pg13/repmgr.conf
-[ RECORD 4 ]----+------------------------------------------------------------------------------
node_id          | 6
upstream_node_id | 2
active           | t
node_name        | wintess
type             | witness
location         | default
priority         | 0
conninfo         | host=172.72.10.7  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
repluser         | repmgr
slot_name        | 
config_file      | /postgresql/pg13/repmgr.conf

1.2 切换故障

以下所有操作，均在备库执行
PS:若切换失败，则需要重新克隆从库
1)  启动主库
pg_ctl start

2) 重新克隆从库
repmgr -h 172.72.10.2 -U repmgr -d repmgr -f /postgresql/pg13/repmgr.conf standby clone --force 
pg_ctl  start

3) 重新注册
repmgr -f /postgresql/pg13/repmgr.conf standby register --force
repmgr -f  /postgresql/pg13/repmgr.conf  cluster show


4)  把新备库切换为主库
repmgr -f  /postgresql/pg13/repmgr.conf cluster show
repmgr -f  /postgresql/pg13/repmgr.conf standby switchover --siblings-follow --dry-run --force-rewind
-- 可以debug打印详细的切换过程
repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind --log-level DEBUG --verbose
repmgr -f  /postgresql/pg13/repmgr.conf cluster show

2 failover异常手工主从切换

2.1 主库出现故障，然后直接将从库提升为主库

1）主库宕机
pg_ctl -m fast stop

2）查询集群状态
[pgsql@pg1 ~]$ repmgr -f /postgresql/pg13/repmgr.conf cluster show
 ID | Name    | Role    | Status        | Upstream | Location | Priority | Timeline | Connection string                                                            
----+---------+---------+---------------+----------+----------+----------+----------+-------------------------------------------------------------------------------
 1  | pg1     | standby |   running     | ? pg2    | default  | 100      | 2        | host=172.72.10.2 user=repmgr password=repmgr dbname=repmgr connect_timeout=2 
 2  | pg2     | primary | ? unreachable | ?        | default  | 100      |          | host=172.72.10.3  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 3  | pg3     | standby |   running     | ? pg2    | default  | 100      | 2        | host=172.72.10.5  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 6  | wintess | witness | * running     | ? pg2    | default  | 0        | n/a      | host=172.72.10.7  user=repmgr password=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "pg1" (ID: 1)'s upstream node "pg2" (ID: 2)
  - unable to determine if node "pg1" (ID: 1) is attached to its upstream node "pg2" (ID: 2)
  - unable to connect to node "pg2" (ID: 2)
  - node "pg2" (ID: 2) is registered as an active primary but is unreachable
  - unable to connect to node "pg3" (ID: 3)'s upstream node "pg2" (ID: 2)
  - unable to determine if node "pg3" (ID: 3) is attached to its upstream node "pg2" (ID: 2)
  - unable to connect to node "wintess" (ID: 6)'s upstream node "pg2" (ID: 2)

HINT: execute with --verbose option to see connection error messages


3） pg3 提升为主库（pg3执行）
repmgr  -f  /postgresql/pg13/repmgr.conf --siblings-follow standby promote


4）查看集群状态
[pgsql@pg1 ~]$ repmgr -f /postgresql/pg13/repmgr.conf cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                            
----+---------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------------------------
 1  | pg1     | standby |   running | pg3      | default  | 100      | 2        | host=172.72.10.2 user=repmgr password=repmgr dbname=repmgr connect_timeout=2 
 2  | pg2     | primary | - failed  | ?        | default  | 100      |          | host=172.72.10.3  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 3  | pg3     | primary | * running |          | default  | 100      | 3        | host=172.72.10.5  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 6  | wintess | witness | * running | pg3      | default  | 0        | n/a      | host=172.72.10.7  user=repmgr password=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "pg2" (ID: 2)

HINT: execute with --verbose option to see connection error messages
PS:pg2损坏，pg3升为新主


5） 因pg2损坏，现需修复，作为从库加入集群
pg_ctl stop
repmgr -h 172.72.10.5 -U repmgr -d repmgr -f /postgresql/pg13/repmgr.conf standby clone --force

repmgr -f /postgresql/pg13/repmgr.conf node rejoin -d 'host=172.72.10.5  user=repmgr dbname=repmgr connect_timeout=2' --force-rewind --dry-run --verbose
repmgr -f /postgresql/pg13/repmgr.conf node rejoin -d 'host=172.72.10.5  user=repmgr dbname=repmgr connect_timeout=2' --force-rewind --verbose

pg_ctl status
pg_ctl: server is running (PID: 4356)
/postgresql/pg13/bin/postgres "-D" "/postgresql/pgdata"

6） 查看集群状态
[pgsql@pg1 ~]$ repmgr -f /postgresql/pg13/repmgr.conf cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                            
----+---------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------------------------
 1  | pg1     | primary | * running |          | default  | 100      | 4        | host=172.72.10.2 user=repmgr password=repmgr dbname=repmgr connect_timeout=2 
 2  | pg2     | standby |   running | pg1      | default  | 100      | 4        | host=172.72.10.3  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 3  | pg3     | standby |   running | pg1      | default  | 100      | 4        | host=172.72.10.5  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 6  | wintess | witness | * running | pg1      | default  | 0        | n/a      | host=172.72.10.7  user=repmgr password=repmgr dbname=repmgr connect_timeout=2

2.2 配置自动failover

1、在所有节点都配置
echo "shared_preload_libraries='repmgr' " >> /postgresql/pgdata/postgresql.conf
pg_ctl stop
pg_ctl start



2、在所有节点都配置文件` /postgresql/pg13/repmgr.conf`，增加内容包括：
cat >>  /postgresql/pg13/repmgr.conf << "EOF"

monitoring_history=yes
monitor_interval_secs=5
failover=automatic
reconnect_attempts=6
reconnect_interval=5
promote_command='repmgr standby promote -f  /postgresql/pg13/repmgr.conf --log-to-file'
follow_command='repmgr standby follow -f  /postgresql/pg13/repmgr.conf --log-to-file --upstream-node-id=%n'
log_level=INFO
log_status_interval=10
log_file=' /postgresql/pg13/repmgr.log'
EOF



cat >> /etc/logrotate.conf <<"EOF"
/postgresql/pg13/repmgr.log {
missingok
compress
rotate 30
daily
dateext
create 0600 pgsql pgsql
}
EOF


3、在所有节点都启动repmgrd进程

-- 启动
repmgrd -f /postgresql/pg13/repmgr.conf --pid-file /tmp/repmgrd.pid --daemonize

-- 建议加到开机自动启动:/etc/rc.local
echo "repmgrd -f /postgresql/pg13/repmgr.conf --pid-file /tmp/repmgrd.pid --daemonize" >> /etc/rc.local
chmod +x /etc/rc.d/rc.local

-- 停止
kill -9 `cat /tmp/repmgrd.pid`

3 .自动切换测试

1 查看集群状态
[pgsql@pg1 ~]$ repmgrd -f /postgresql/pg13/repmgr.conf --pid-file /tmp/repmgrd.pid --daemonize
[2023-10-15 23:12:58] [NOTICE] redirecting logging output to "/postgresql/pg13/repmgr.log"

[pgsql@pg1 ~]$ repmgr -f /postgresql/pg13/repmgr.conf cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                            
----+---------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------------------------
 1  | pg1     | primary | * running |          | default  | 100      | 4        | host=172.72.10.2 user=repmgr password=repmgr dbname=repmgr connect_timeout=2 
 2  | pg2     | standby |   running | pg1      | default  | 100      | 4        | host=172.72.10.3  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 3  | pg3     | standby |   running | pg1      | default  | 100      | 4        | host=172.72.10.5  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 6  | wintess | witness | * running | pg1      | default  | 0        | n/a      | host=172.72.10.7  user=repmgr password=repmgr dbname=repmgr connect_timeout=2



2  停止主库
pg_ctl stop

3  查看集群状态
[pgsql@pg3 ~]$ repmgrd -f /postgresql/pg13/repmgr.conf --pid-file /tmp/repmgrd.pid --daemonize
[2023-10-15 23:12:58] [NOTICE] redirecting logging output to "/postgresql/pg13/repmgr.log"

[pgsql@pg3 ~]$ repmgr -f /postgresql/pg13/repmgr.conf cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                            
----+---------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------------------------
 1  | pg1     | primary | - failed  | ?        | default  | 100      |          | host=172.72.10.2 user=repmgr password=repmgr dbname=repmgr connect_timeout=2 
 2  | pg2     | primary | * running |          | default  | 100      | 5        | host=172.72.10.3  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 3  | pg3     | standby |   running | pg2      | default  | 100      | 4        | host=172.72.10.5  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 6  | wintess | witness | * running | pg2      | default  | 0        | n/a      | host=172.72.10.7  user=repmgr password=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "pg1" (ID: 1)

HINT: execute with --verbose option to see connection error messages


4 修复原主库
repmgr -f /postgresql/pg13/repmgr.conf node  rejoin -d 'host=172.72.10.3 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind --verbose