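Managing a PostgreSQL High-Availability Cluster with PCS

Article link: https://blog.csdn.net/weixin_45035103/article/details/147395671

This article walks through a detailed setup for managing a PostgreSQL high-availability cluster with Pacemaker/Corosync (PCS). The setup relies on streaming replication and automatic failover to keep the database service available.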

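1. Environment Preparation

Node plan
| Node name | IP address | Role |
|-------------|--------------|--------------------|
| node1 | 192.168.1.101| Primary + cluster node |
| node2 | 192.168.1.102| Standby + cluster node |
| virtual_ip | 192.168.1.100| Virtual IP (VIP) |

Prerequisites
• Time is synchronized across all nodes (using NTP).

• Hostnames resolve on every node (via /etc/hosts or DNS).

• SELinux and the firewall are disabled, or the required ports are allowed.

• The same PostgreSQL version is installed on all nodes.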
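
The hostname prerequisite is easy to check before going further. The following is a minimal sketch, assuming the node names from the node plan are meant to be listed in /etc/hosts; `missing_hosts` is a hypothetical helper written for this example, not part of any package.

```shell
#!/bin/sh
# Hypothetical helper: print each expected node name that has no entry
# in the given hosts file.
missing_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        # Strip comments, then look for the name as a whole field after the IP column.
        if ! sed 's/#.*//' "$hosts_file" | awk -v n="$name" '
                { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                END { exit !found }'; then
            echo "$name"
        fi
    done
}

# Report node names from the node plan that are missing on this machine.
missing_hosts /etc/hosts node1 node2
```

An empty result means both names are present. Run it on every node, since each node needs to resolve the other.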

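2. Installing and Configuring Pacemaker/Corosync

Install the PCS packages

# Run on all nodes
yum install -y pacemaker corosync pcs psmisc
systemctl enable pcsd --now

Configure cluster authentication

# Set the password of the cluster administration user (on all nodes)
echo "hacluster" | passwd hacluster --stdin

# Authenticate the nodes to each other (run on any one node)
# This is the pcs 0.9 syntax (RHEL/CentOS 7); on pcs 0.10 and later use:
#   pcs host auth node1 node2 -u hacluster -p hacluster
pcs cluster auth node1 node2 -u hacluster -p hacluster

Create the cluster

# Initialize the cluster on node1
# This is the pcs 0.10+ syntax; on pcs 0.9 use:
#   pcs cluster setup --name pg_cluster node1 node2
pcs cluster setup pg_cluster node1 node2 --force

# Start the cluster and enable it at boot
pcs cluster start --all
pcs cluster enable --all

Disable fencing and quorum enforcement (test environments only)

pcs property set stonith-enabled=false      # Disable STONITH (test environments only)
pcs property set no-quorum-policy=ignore    # Keep resources running when quorum is lost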

3. Configuring PostgreSQL Streaming Replication

Primary node configuration (node1)

Enable streaming replication:
Edit postgresql.conf:

wal_level = replica
max_wal_senders = 5
listen_addresses = '*'
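
To confirm that the three settings actually landed in the file, a small check can help. This is a sketch under assumptions: `conf_value` is a hypothetical helper, and the path is assumed to be the /var/lib/pgsql/data directory that this article uses as the data directory.

```shell
#!/bin/sh
# Hypothetical helper: print the last uncommented value assigned to a
# parameter in a postgresql.conf-style file (surrounding quotes stripped).
conf_value() {
    sed -n -E "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*'?([^'#[:space:]]*)'?.*/\1/p" "$1" | tail -n 1
}

# Assumed path: the data directory used in this article.
CONF="/var/lib/pgsql/data/postgresql.conf"
if [ -r "$CONF" ]; then
    for key in wal_level max_wal_senders listen_addresses; do
        echo "$key = $(conf_value "$CONF" "$key")"
    done
fi
```

This only reads the file. Once the server is running, `SHOW wal_level;` in psql reports the value that is really in effect.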

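Configure the replication user:
The rep_user role must exist on the primary with the REPLICATION attribute. Then add the following to pg_hba.conf: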

host replication rep_user 192.168.1.0/24 md5
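
The pg_hba.conf rule only admits rep_user; it does not create the role. A minimal sketch for creating it follows, assuming the password is the rep_password value that appears in the resource definition later in this article. `rep_role_sql` is a hypothetical helper that only builds the statement.

```shell
#!/bin/sh
# Assumption: both names mirror the ones used elsewhere in this article.
REP_USER="rep_user"
REP_PASSWORD="rep_password"

# Hypothetical helper: build the CREATE ROLE statement for a
# replication-capable login role.
rep_role_sql() {
    printf "CREATE ROLE %s WITH REPLICATION LOGIN PASSWORD '%s';\n" "$1" "$2"
}

# Print the statement. To apply it on node1, pipe it into psql, for example:
#   rep_role_sql "$REP_USER" "$REP_PASSWORD" | psql -U postgres -d postgres
rep_role_sql "$REP_USER" "$REP_PASSWORD"
```

Printing first and piping second keeps the password out of the shell history of the psql command itself; choose a stronger password outside of a test setup.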

Restart PostgreSQL:
```
systemctl restart postgresql
```

Standby node configuration (node2)

Initialize the standby (the target data directory must be empty):

pg_basebackup -h node1 -U rep_user -D /var/lib/pgsql/data -P -R

Start PostgreSQL:
```
systemctl start postgresql
```
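
Before handing PostgreSQL over to Pacemaker, it is worth confirming that node2 really came up as a standby. The sketch below assumes a local `postgres` superuser that can connect without a password prompt; `role_from_recovery_flag` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: translate the t/f output of pg_is_in_recovery()
# into a role name.
role_from_recovery_flag() {
    case $1 in
        t) echo standby ;;
        f) echo primary ;;
        *) echo unknown ;;
    esac
}

# Query the local instance only when psql is available.
# -A -t print the bare value; -w prevents a password prompt.
if command -v psql >/dev/null 2>&1; then
    flag=$(psql -U postgres -w -Atc "SELECT pg_is_in_recovery();" 2>/dev/null)
    echo "local role: $(role_from_recovery_flag "$flag")"
fi
```

On node2 this should report standby, and on node1 primary. A result of unknown usually means the server is not running or the connection was refused.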

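4. Configuring Pacemaker Resources

Install the resource agents

# Install the package that provides the PostgreSQL resource agent
yum install -y resource-agents

Create the cluster resources

# Add the virtual IP resource
pcs resource create pg_vip ocf:heartbeat:IPaddr2 ip=192.168.1.100 cidr_netmask=24 op monitor interval=30s

# Add the PostgreSQL resource as a promotable (master/slave) clone.
# rep_mode=sync needs a promotable clone, plus master_ip and repuser.
# "promotable" is the pcs 0.10+ keyword; pcs 0.9 uses "--master" instead.
pcs resource create pgsql ocf:heartbeat:pgsql \
  pgctl="/usr/bin/pg_ctl" \
  psql="/usr/bin/psql" \
  pgdata="/var/lib/pgsql/data" \
  rep_mode="sync" \
  node_list="node1 node2" \
  master_ip="192.168.1.100" \
  repuser="rep_user" \
  primary_conninfo_opt="password=rep_password" \
  op start timeout=60s \
  op stop timeout=60s \
  op promote timeout=30s \
  op demote timeout=120s \
  op monitor interval=15s timeout=10s role=Master \
  op monitor interval=16s timeout=10s role=Slave \
  promotable notify=true

# Keep the VIP on the node where PostgreSQL is promoted, and start it only after promotion.
# pgsql-clone is the clone ID that pcs 0.10+ generates; pcs 0.9 generates pgsql-master.
pcs constraint colocation add pg_vip with master pgsql-clone INFINITY
pcs constraint order promote pgsql-clone then start pg_vip symmetrical=false score=INFINITY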

Verify the resource configuration

pcs status          # Show the cluster status
pcs resource show   # Show resource details (on pcs 0.10+: pcs resource config)

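5. Failover Testing

Simulate a primary node failure

# Stop PostgreSQL on node1
# (if the instance is started by the pgsql resource agent rather than systemd,
#  put the node into standby instead: pcs node standby node1)
systemctl stop postgresql

# Watch Pacemaker fail over to node2; the full status shows where each resource runs,
# whereas the "Current DC" line only names the cluster's designated controller
pcs status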
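
Watching the status by hand works, but a failover test is easier to repeat with a small polling loop. The sketch below assumes pgsql runs as a promotable (master/slave) resource, so that `pcs status` prints a "Masters:" line (older Pacemaker) or a "Promoted:" line (newer Pacemaker). Both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `pcs status` text on stdin and print the node(s)
# listed on the "Masters:" or "Promoted:" line.
promoted_nodes() {
    sed -n -E 's/.*(Masters|Promoted): \[ (.*) \].*/\2/p'
}

# Poll every 5 seconds, for up to 60 seconds, until the target node
# is reported as promoted.
wait_for_promotion() {
    target=$1
    tries=0
    while [ "$tries" -lt 12 ]; do
        if pcs status 2>/dev/null | promoted_nodes | grep -qw "$target"; then
            echo "$target is promoted"
            return 0
        fi
        tries=$((tries + 1))
        sleep 5
    done
    echo "timed out waiting for $target" >&2
    return 1
}

# On a cluster node, after stopping the primary:
#   wait_for_promotion node2
```

The 60-second budget is an arbitrary choice for this sketch; it should comfortably exceed the monitor interval plus the promote timeout configured on the resource.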

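Manual switchover

# Move the primary to node2. If pgsql is a promotable clone, pass the clone ID and
# --master instead (pcs 0.9/0.10), for example: pcs resource move pgsql-clone node2 --master
pcs resource move pgsql node2
pcs resource clear pgsql       # Remove the location constraint created by the move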

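6. Day-to-Day Management Commands

| Command | Description |
|---------|-------------|
| `pcs cluster status` | Show the cluster status |
| `pcs resource disable pgsql` | Temporarily disable the resource |
| `pcs node standby node1` | Put the node into standby |
| `pcs cluster stop --all` | Stop the whole cluster |
| `crm_mon -Afr` | Monitor the cluster in real time |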

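7. Troubleshooting Common Problems

Split-brain
• Symptom: communication between the nodes is lost and each node considers itself the primary.

• Fix: restart the cluster as shown below. A restart does not reconcile data written on both sides, and with STONITH disabled nothing prevents a recurrence, so enable fencing in production.

pcs cluster stop --all          # Stop all nodes
pcs cluster start --all         # Start them again

Resource fails to start
• Symptom: pcs status shows the resource as FAILED.

• Fix:

pcs resource cleanup pgsql      # Clear the resource's failure state
journalctl -xe                  # Inspect the detailed logs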

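VIP does not move
• Symptom: the virtual IP has not migrated to the new primary node.

• Fix:

arping -c 3 -U -I eth0 192.168.1.100  # Send unsolicited ARP to refresh neighbors' caches
pcs resource restart pg_vip           # Restart the VIP resource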
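
When diagnosing a stuck VIP, the first question is which node currently holds the address. The following sketch checks the local node; `holds_ip` is a hypothetical helper that parses the one-line-per-address output of `ip -o -4 addr show`.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given IPv4 address appears in
# `ip -o -4 addr show` output supplied on stdin. The address with its
# prefix length is the fourth field of each line.
holds_ip() {
    awk -v ip="$1" '
        { split($4, a, "/"); if (a[1] == ip) found = 1 }
        END { exit !found }'
}

VIP="192.168.1.100"   # VIP from the node plan
if command -v ip >/dev/null 2>&1; then
    if ip -o -4 addr show | holds_ip "$VIP"; then
        echo "this node holds $VIP"
    else
        echo "this node does not hold $VIP"
    fi
fi
```

If both nodes report that they hold the address, the cluster is likely in the split-brain state described above rather than suffering from a stale ARP cache.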

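8. Performance Tuning Suggestions

Adjust the replication mode:

pcs resource update pgsql rep_mode=async  # Asynchronous replication: faster commits, but recent transactions can be lost on failover

Monitor replication lag (run on the primary):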

SELECT pg_current_wal_lsn(), replay_lsn FROM pg_stat_replication;
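
Two raw LSNs are hard to compare by eye. A possible refinement is to let PostgreSQL compute the distance in bytes with pg_wal_lsn_diff (PostgreSQL 10 and later) and warn above a threshold. This is a sketch under assumptions: the 16 MiB threshold and the passwordless `postgres` login are illustrative choices, and `lag_exceeds` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: succeed when a replay lag (bytes) exceeds a threshold.
lag_exceeds() {
    [ -n "$1" ] && [ "$1" -gt "$2" ]
}

# Assumed threshold of 16 MiB (one default-sized WAL segment); tune to your workload.
THRESHOLD=$((16 * 1024 * 1024))

# pg_wal_lsn_diff returns the byte distance between two LSNs.
SQL="SELECT application_name, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)::bigint FROM pg_stat_replication;"

# Run on the primary. -A -t give unaligned rows separated by "|".
if command -v psql >/dev/null 2>&1; then
    psql -U postgres -w -At -c "$SQL" 2>/dev/null |
    while IFS='|' read -r name lag; do
        if lag_exceeds "$lag" "$THRESHOLD"; then
            echo "WARNING: $name is $lag bytes behind"
        fi
    done
fi
```

No output means every connected standby is within the threshold, or that no standby is connected at all, so pair this with a check that pg_stat_replication is not empty.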

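Enable WAL compression:
Set wal_compression = on in postgresql.conf.

With this setup, you get a PostgreSQL high-availability cluster built on Pacemaker/Corosync in which the database service recovers automatically when a node fails. Adjust the resource parameters and monitoring strategy to your workload. (Generated by Tencent Yuanbao AI.)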