pg高可用之repmgr(二)

   承接上篇文章的PostgrSQL + repmgr的搭建部署,本篇主要讲述PostgreSQL + repmgr高可用集群集群是如何实现switchover,以下是具体场景及其操作步骤。

一、手动提升standby为primary

该命令常见的使用场景为,若没有配置自动failover,primary节点宕机后,手动将指定standby节点并提升为primary主服务器节点。

1、查看正常情况下集群状态

[postgres@172-16-104-55 ~]$ repmgr -f /pg_data/pgsql11/repmgr.conf cluster show
 ID | Name          | Role    | Status    | Upstream      | Location   | Priority | Timeline | Connection string                                             
----+---------------+---------+-----------+---------------+------------+----------+----------+----------------------------------------------------------------
 1  | 172-16-104-7  | primary | * running |               | location1  | 100      | 13       | host=172-16-104-7 user=repmgr dbname=repmgr connect_timeout=2 
 2  | 172-16-104-55 | standby |   running | 172-16-104-7  | location1  | 99       | 13       | host=172-16-104-55 user=repmgr dbname=repmgr connect_timeout=2
 3  | 172-16-104-56 | standby |   running | 172-16-104-7  | location1  | 100      | 12       | host=172-16-104-56 user=repmgr dbname=repmgr connect_timeout=2

2、手动关闭primary节点,模拟primary故障宕机

[postgres@172-16-104-7 ~]$ sudo systemctl  stop postgres11

3、手动在需要提升为primary节点的standby节点提升为primary节点

1) 查看当前集群信息,可以看到旧primary节点“172-16-104-7 ”已经处于不可达的状态

[postgres@172-16-104-55 ~]$ repmgr -f /pg_data/pgsql11/repmgr.conf cluster show
 ID | Name          | Role    | Status        | Upstream       | Location   | Priority | Timeline | Connection string                                             
----+---------------+---------+---------------+----------------+------------+----------+----------+----------------------------------------------------------------
 1  | 172-16-104-7  | primary | ? unreachable | ?              | location1  | 100      |          | host=172-16-104-7 user=repmgr dbname=repmgr connect_timeout=2 
 2  | 172-16-104-55 | standby |   running     | ? 172-16-104-7 | location1  | 99       | 13       | host=172-16-104-55 user=repmgr dbname=repmgr connect_timeout=2
 3  | 172-16-104-56 | standby |   running     | ? 172-16-104-7 | location1  | 100      | 12       | host=172-16-104-56 user=repmgr dbname=repmgr connect_timeout=2
 4  | 172-16-104-57 | witness | * running     | 172-16-104-56  | location1  | 0        | n/a      | host=172.16.104.57 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "172-16-104-7" (ID: 1)
  - node "172-16-104-7" (ID: 1) is registered as an active primary but is unreachable
  - unable to connect to node "172-16-104-55" (ID: 2)'s upstream node "172-16-104-7" (ID: 1)
  - unable to determine if node "172-16-104-55" (ID: 2) is attached to its upstream node "172-16-104-7" (ID: 1)
  - unable to connect to node "172-16-104-56" (ID: 3)'s upstream node "172-16-104-7" (ID: 1)
  - unable to determine if node "172-16-104-56" (ID: 3) is attached to its upstream node "172-16-104-7" (ID: 1)

HINT: execute with --verbose option to see connection error messages

2)手动提升“172-16-104-56”为primary节点

[postgres@172-16-104-56 extension]$ repmgr -f /pg_data/pgsql11/repmgr.conf  standby promote
WARNING: 1 sibling nodes found, but option "--siblings-follow" not specified
DETAIL: these nodes will remain attached to the current primary:
  172-16-104-55 (node ID: 2)
NOTICE: promoting standby to primary
DETAIL: promoting server "172-16-104-56" (ID: 3) using "/usr/local/pgsql11/bin/pg_ctl  -w -D '/pg_data/pgsql11/data' promote"
waiting for server to promote.... done
server promoted
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "172-16-104-56" (ID: 3) was successfully promoted to primary

3)查看当前集群状态,可以看到172-16-104-56已经被提升为新的primary

[postgres@172-16-104-55 ~]$ repmgr -f /pg_data/pgsql11/repmgr.conf cluster show
 ID | Name          | Role    | Status               | Upstream       | Location   | Priority | Timeline | Connection string                                             
----+---------------+---------+----------------------+----------------+------------+----------+----------+----------------------------------------------------------------
 1  | 172-16-104-7  | primary | ? unreachable        | ?              | location1  | 100      |          | host=172-16-104-7 user=repmgr dbname=repmgr connect_timeout=2 
 2  | 172-16-104-55 | standby |   running            | ? 172-16-104-7 | location1  | 99       | 13       | host=172-16-104-55 user=repmgr dbname=repmgr connect_timeout=2
 3  | 172-16-104-56 | standby | ! running as primary |                | location1  | 100      | 14       | host=172-16-104-56 user=repmgr dbname=repmgr connect_timeout=2

二、集群其他standby重新指向新primary

1、集群中其他standby节点默认还是连接在已经宕机的旧primary节点,需要我们手动将这些standby节点指向新的primary节点

[postgres@172-16-104-55 ~]$ repmgr -f /pg_data/pgsql11/repmgr.conf  standby follow
NOTICE: attempting to find and follow current primary
INFO: local node 2 can attach to follow target node 3
DETAIL: local node's recovery point: 0/35000098; follow target node's fork point: 0/35000098
INFO: creating replication slot as user "repmgr"
NOTICE: setting node 2's upstream to node 3
NOTICE: restarting server using "sudo systemctl restart postgres11"
WARNING: node "172-16-104-55" attached in state "catchup"
NOTICE: STANDBY FOLLOW successful
DETAIL: standby attached to upstream node "172-16-104-56" (ID: 3)

2、查看集群状态信息,可以发现,目前standby节点“172-16-104-55 ”已经正常开始从“172-16-104-57”开始进行流复制同步

[postgres@172-16-104-55 ~]$ repmgr -f /pg_data/pgsql11/repmgr.conf  cluster show
 ID | Name          | Role    | Status    | Upstream       | Location  | Priority | Timeline | Connection string                                             
----+---------------+---------+-----------+----------------+-----------+----------+----------+----------------------------------------------------------------
 1  | 172-16-104-7  | primary | - failed  | ?              | location1 | 100      |          | host=172-16-104-7 user=repmgr dbname=repmgr connect_timeout=2 
 2  | 172-16-104-55 | standby |   running | 172-16-104-56  | location1 | 99       | 17       | host=172-16-104-55 user=repmgr dbname=repmgr connect_timeout=2
 3  | 172-16-104-56 | primary | * running |                | location1 | 100      | 18       | host=172-16-104-56 user=repmgr dbname=repmgr connect_timeout=2
 4  | 172-16-104-57 | witness | * running | ? 172-16-104-7 | location1 | 0        | n/a      | host=172.16.104.57 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "172-16-104-7" (ID: 1)
  - unable to connect to node "172-16-104-57" (ID: 4)'s upstream node "172-16-104-7" (ID: 1)
  

三、将宕机primary降级为standby节点

当primary节点宕机后,后续业务稳定,我们是需要将宕机的primary主服务器节点重新加入集群并作为standby备用服务器节点进行服务

1、清理宕机节点数据目录

[postgres@172-16-104-7 ~]$ rm -rf /pg_data/pgsql11/data/*

2、重新通过repmgr克隆备服务器standby节点

[postgres@172-16-104-7 ~]$ repmgr -h 172-16-104-56 -U repmgr -d repmgr -f /pg_data/pgsql11/repmgr.conf standby clone
NOTICE: destination directory "/pg_data/pgsql11/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=172-16-104-56 user=repmgr dbname=repmgr
DETAIL: current installation size is 87 MB
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: checking and correcting permissions on existing directory "/pg_data/pgsql11/data"
INFO: creating replication slot as user "repmgr"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
  /usr/local/pgsql11/bin/pg_basebackup -l "repmgr base backup"  -D /pg_data/pgsql11/data -h 172-16-104-56 -p 5432 -U repmgr -X stream -S repmgr_slot_1 
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: sudo systemctl start postgres11
HINT: after starting the server, you need to re-register this standby with "repmgr standby register --force" to update the existing node record

3、启动数据库

[postgres@172-16-104-7 ~]$ sudo systemctl start postgres11

4、将宕机恢复后的旧primary节点以standby的身份重新加入集群

[postgres@172-16-104-7 ~]$ repmgr -f /pg_data/pgsql11/repmgr.conf standby register --force
INFO: connecting to local node "172-16-104-7" (ID: 1)
INFO: connecting to primary database
INFO: standby registration complete
NOTICE: standby node "172-16-104-7" (ID: 1) successfully registered

四、手动switchover

1、查看switchover前的集群状态信息

[postgres@172-16-104-55 ~]$ repmgr -f /pg_data/pgsql11/repmgr.conf cluster show
 ID | Name          | Role    | Status    | Upstream      | Location  | Priority | Timeline | Connection string                                             
----+---------------+---------+-----------+---------------+-----------+----------+----------+----------------------------------------------------------------
 1  | 172-16-104-7  | standby |   running | 172-16-104-56 | location1 | 100      | 14       | host=172-16-104-7 user=repmgr dbname=repmgr connect_timeout=2 
 2  | 172-16-104-55 | standby |   running | 172-16-104-56 | location1 | 99       | 14       | host=172-16-104-55 user=repmgr dbname=repmgr connect_timeout=2
 3  | 172-16-104-56 | primary | * running |               | location1 | 100      | 14       | host=172-16-104-56 user=repmgr dbname=repmgr connect_timeout=2

2、手动在需要提升为primary节点服务器上执行switchover命令

[postgres@172-16-104-7 ~]$ repmgr -f /pg_data/pgsql11/repmgr.conf standby switchover
NOTICE: executing switchover on node "172-16-104-7" (ID: 1)
WARNING: 2 sibling nodes found, but option "--siblings-follow" not specified
DETAIL: these nodes will remain attached to the current primary:
  172-16-104-55 (node ID: 2)
  172-16-104-57 (node ID: 4, witness server)
NOTICE: local node "172-16-104-7" (ID: 1) will be promoted to primary; current primary "172-16-104-56" (ID: 3) will be demoted to standby
NOTICE: stopping current primary node "172-16-104-56" (ID: 3)
NOTICE: issuing CHECKPOINT on node "172-16-104-56" (ID: 3) 
DETAIL: executing server command "sudo systemctl stop postgres11"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/3A000028
NOTICE: promoting standby to primary
DETAIL: promoting server "172-16-104-7" (ID: 1) using "/usr/local/pgsql11/bin/pg_ctl  -w -D '/pg_data/pgsql11/data' promote"
waiting for server to promote.... done
server promoted
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "172-16-104-7" (ID: 1) was successfully promoted to primary
INFO: local node 3 can attach to rejoin target node 1
DETAIL: local node's recovery point: 0/3A000028; rejoin target node's fork point: 0/3A000098
INFO: creating replication slot as user "repmgr"
NOTICE: setting node 3's upstream to node 1
WARNING: unable to ping "host=172-16-104-56 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "sudo systemctl start postgres11"
WARNING: node "172-16-104-56" attached in state "startup"
NOTICE: replication slot "repmgr_slot_1" deleted on node 3
WARNING: 1 inactive replication slots detected
DETAIL: inactive replication slots:
  - repmgr_slot_2 (physical)
HINT: these replication slots may need to be removed manually
NOTICE: NODE REJOIN successful
DETAIL: node 3 is now attached to node 1
NOTICE: node  "172-16-104-7" (ID: 1) promoted to primary, node "172-16-104-56" (ID: 3) demoted to standby
NOTICE: switchover was successful
DETAIL: node "172-16-104-7" is now primary and node "172-16-104-56" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully

3、手动将其他standby节点指向新的primary节点

[postgres@172-16-104-55 ~]$ repmgr -f /pg_data/pgsql11/repmgr.conf  standby follow
NOTICE: attempting to find and follow current primary
INFO: timelines are same, this server is not ahead
DETAIL: local node lsn is 0/3A01E548, follow target lsn is 0/3A01E548
INFO: creating replication slot as user "repmgr"
NOTICE: setting node 2's upstream to node 1
NOTICE: restarting server using "sudo systemctl restart postgres11"
NOTICE: replication slot "repmgr_slot_2" deleted on node 3
NOTICE: STANDBY FOLLOW successful
DETAIL: standby attached to upstream node "172-16-104-7" (ID: 1)

4、查看switchover变更后的集群信息

[postgres@172-16-104-55 ~]$ repmgr -f /pg_data/pgsql11/repmgr.conf cluster show
 ID | Name          | Role    | Status    | Upstream      | Location  | Priority | Timeline | Connection string                                             
----+---------------+---------+-----------+---------------+-----------+----------+----------+----------------------------------------------------------------
 1  | 172-16-104-7  | primary | * running |               | location1 | 100      | 15       | host=172-16-104-7 user=repmgr dbname=repmgr connect_timeout=2 
 2  | 172-16-104-55 | standby |   running | 172-16-104-7  | default   | 99       | 15       | host=172-16-104-55 user=repmgr dbname=repmgr connect_timeout=2
 3  | 172-16-104-56 | standby |   running | 172-16-104-7  | default   | 100      | 14       | host=172-16-104-56 user=repmgr dbname=repmgr connect_timeout=2
  • 2
    点赞
  • 5
    收藏
    觉得还不错? 一键收藏
  • 3
    评论
Patroni是一种开源的工具,用于管理PostgreSQL集群的可用性。它是一个容器化的解决方案,可以实现自动化的集群管理和故障转移。以下是使用Patroni实现PG数据库可用的步骤: 1. 安装Patroni 可以使用pip命令安装Patroni: ``` pip install patroni ``` 2. 配置Patroni Patroni的配置文件是YAML格式的,可以根据需要进行修改。以下是一个简单的示例: ``` scope: postgres namespace: /db/ name: pg-cluster restapi: listen: 0.0.0.0:8008 connect_address: $NODE1_IP:8008 etcd: host: $ETCD_IP:2379 bootstrap: dcs: ttl: 30 loop_wait: 10 retry_timeout: 10 maximum_lag_on_failover: 1048576 postgresql: use_pg_rewind: true parameters: max_wal_senders: 10 wal_keep_segments: 10 pg_hba: - host replication replicator 0.0.0.0/0 md5 - host all all 0.0.0.0/0 md5 synchronous_mode: off synchronous_commit: off archive_mode: off archive_command: false recovery_conf: restore_command: cp /var/lib/postgresql/backup/%f %p recovery_target_timeline: latest pgpass: /tmp/pgpass pgpassfile_mode: 600 bin_dir: /usr/lib/postgresql/9.6/bin pg_ctl: /usr/lib/postgresql/9.6/bin/pg_ctl use_slots: true create_replica_methods: - basebackup - pg_rewind ``` 在这个示例中,我们使用etcd作为DCS(分布式协调服务)来管理集群状态。我们还配置了一些PostgreSQL参数,如max_wal_senders和wal_keep_segments。这些参数都可以根据需要进行修改。 3. 启动Patroni 可以使用以下命令启动Patroni: ``` patroni postgres.yml ``` 这将启动一个PostgreSQL集群,并将其注册到etcd中。您可以使用以下命令检查集群状态: ``` curl http://$NODE1_IP:8008/patroni ``` 这将返回一个JSON格式的响应,其中包含有关集群状态的信息。 4. 测试故障转移 为了测试故障转移,您可以杀死主节点上的PostgreSQL进程。Patroni将检测到主节点已经下线,并自动将一个从节点提升为新的主节点。 您可以使用以下命令检查新主节点的状态: ``` curl http://$NODE2_IP:8008/patroni ``` 这将返回有关新主节点的信息。 总的来说,使用Patroni实现PostgreSQL集群的可用性相对简单。它可以自动管理故障转移,并提供一些其他有用的功能,如DCS和可插拔的备份存储后端。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 3
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值