项目场景:
配置centos虚拟机基于postgresql15的repmgr主从自动故障切换,出现的问题及解决方案。
环境说明
hostname | ip | 状态 |
---|---|---|
node1 | 192.168.1.141 | 主 |
node2 | 192.168.1.142 | 从 |
1.虚拟机都操作:
安装postgres15
# 获取所需依赖包
wget https://download-ib01.fedoraproject.org/pub/epel/7/x86_64/Packages/l/libzstd-1.5.2-1.el7.x86_64.rpm
wget https://download-ib01.fedoraproject.org/pub/epel/7/x86_64/Packages/l/llvm5.0-devel-5.0.1-7.el7.x86_64.rpm
wget https://download-ib01.fedoraproject.org/pub/epel/7/x86_64/Packages/l/llvm5.0-5.0.1-7.el7.x86_64.rpm
wget https://download-ib01.fedoraproject.org/pub/epel/7/x86_64/Packages/l/llvm5.0-libs-5.0.1-7.el7.x86_64.rpm
yum install -y ./libzstd-1.5.2-1.el7.x86_64.rpm
yum install -y centos-release-scl-rh llvm5*
yum install -y epel-release
# 安装版本库的RPM
sudo yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
# 安装 PostgreSQL
sudo yum install -y postgresql15-server postgresql15-devel
# 初始化DB
sudo /usr/pgsql-15/bin/postgresql-15-setup initdb
# 开机启动|启动|重启|状态|停止 命令
sudo systemctl enable postgresql-15
sudo systemctl start postgresql-15
sudo systemctl restart postgresql-15
sudo systemctl status postgresql-15
sudo systemctl stop postgresql-15
安装完成后在 /var/lib/pgsql/15 下生成文件夹
安装repmgr
根据pg版本安装对应版本
curl https://dl.2ndquadrant.com/default/release/get/15/rpm | bash
yum install -y repmgr15
修改postgres用户密码
最好两个服务器密码一样,别忘了。
echo "自己输密码"|passwd --stdin postgres
创建流复制用户
// 创建流复制用户replicator
create user replica replication login connection limit 5 password 'replica';
// 创建repmgr数据库
create database repmgr;
// 创建repmgr用户
create user repmgr with password 'repmgr' superuser login;
alter database repmgr owner to repmgr;
修改pg_hba.conf配置文件
# TYPE DATABASE USER ADDRESS METHOD
# "local" is for Unix domain socket connections only
local all all peer
# IPv4 local connections:
host all all 127.0.0.1/32 scram-sha-256
host all all 192.168.1.0/24 md5
# IPv6 local connections:
host all all ::1/128 scram-sha-256
# Allow replication connections from localhost, by a user with the
# replication privilege.
local replication all peer
host replication all 127.0.0.1/32 scram-sha-256
host replication all ::1/128 scram-sha-256
host all all 0.0.0.0/0 scram-sha-256
host replication replica 192.168.1.142/32 trust
host replication replica 192.168.1.141/32 trust
# all
host replication repmgr 192.168.1.141/32 trust
host replication repmgr 192.168.1.142/32 trust
修改postgresql.conf配置文件
# basic
listen_addresses = '*' # 监听所有ip
port = 5432 # 端口
max_connections = 1000 # 最大连接数
superuser_reserved_connections = 10 # 给超级用户预留的连接数
shared_buffers = 1GB # 共享内存,一般设置为内存的1/4
work_mem = 16MB # 设置在写入临时磁盘文件之前查询操作(例如排序或哈希表)可使用的最大内存容量
maintenance_work_mem = 256MB # 在维护性操作(例如VACUUM、CREATE INDEX和ALTER TABLE ADD FOREIGN KEY)中使用的 最大的内存量
timezone = 'Asia/Shanghai' # 系统时区
hot_standby = on # 打开热备
shared_preload_libraries = 'repmgr'
# optimizer
default_statistics_target = 500 # 默认100,ANALYZE在pg_statistic中存储的信息量,增大该值,会增加ANALYZE的时间,但会让解释计划更精准
# wal
max_wal_size = 1GB # 建议与shared_buffers保持一致
min_wal_size = 80MB # 建议max_wal_size/12.5
wal_log_hints = on # 控制WAL日志记录的方式,建议打开
wal_level = replica # wal日志写入级别,要使用流复制,必须使用replica或更高级别
wal_sender_timeout = 60s # 设置WAL发送者在发送WAL数据时等待主服务器响应的超时时间
# archive
archive_mode = on #
archive_command = 'copy "%p" /var/lib/pgsql/15/data/pg_archive/"%f"' # 归档存储位置,自行修改
# log 近7天轮询
log_destination = 'csvlog' # 日志格式
logging_collector = on # 日志收集器
log_directory = 'pg_log' # 日志目录 $PGDATA/pg_log
log_filename = 'postgresql.%a' # 7天日志轮询
log_file_mode = 0600 # 日志文件的权限
log_rotation_size = 0 # 日志的最大尺寸,设置为零时将禁用基于大小创建新的日志文件
log_truncate_on_rotation = on # 这个参数将导致PostgreSQL截断(覆盖而不是追加)任何已有的同名日志文件
log_min_duration_statement = 0 # 如果语句运行至少指定的时间量,将导致记录每一个这种完成的语句的持续时间
log_duration = on # 每一个完成的语句的持续时间被记录
log_lock_waits = on # 控制当一个会话为获得一个锁等到超过deadlock_timeout时,是否要产生一个日志消息
log_statement = 'mod' # 控制哪些 SQL 语句被记录。有效值是 none (off)、ddl、mod和 all(所有语句)。ddl记录所有数据定义语句,例如CREATE、ALTER和 DROP语句。mod记录所有ddl语句,外加数据修改语句例如INSERT, UPDATE、DELETE、TRUNCATE, 和COPY FROM
log_timezone = 'Asia/Shanghai' # 设置在服务器日志中写入的时间戳的时区
# sql
statement_timeout = 300000 # 语句执行超时时间 5分钟
idle_in_transaction_session_timeout = 300000 # 事务空闲超时时间 5分钟
idle_session_timeout = 1800000 # 会话空闲超时时间 30分钟
lock_timeout = 60000 # 等锁超时时间 1分钟
启动pg15
systemctl start postgresql-15
2.配置服务器ssh远程免密登录
// 每个虚拟机都生成秘钥
ssh-keygen -t rsa
// 都发到主服务器上
ssh-copy-id -i id_rsa.pub -p22 postgres@node1
// 发送给从秘钥
scp authorized_keys node2:$PWD
2.1设置postgres用户权限
如果根据网上ssh远程连接配置了秘钥,并且给与了该用户文件夹700权限,并且authorized_keys 600权限之后还是出现了需要密码验证。
2.2请查看SElinux是否关闭
临时关闭:
运行命令 setenforce 0,设置为permissive模式,不禁止违反安全策略的行为,但会记录在日志中。运行命令 setenforce 1,设置为enforcing模式,禁止违反安全策略的行为。
永久关闭:编辑 /etc/selinux/config 文件,将 SELINUX=enforcing 改为 SELINUX=disabled,保存并重启机器。运行命令 getenforce,验证SELinux状态为 Disabled。
2.3 配置完成后务必测试
使用下面的代码测试,务必保证postgres能免密登录其他虚拟机的postgres,如提示需要密码,请查看上方2.1和2.2是否操作
// node1切换postgres用户
su postgres
//ssh 其他虚拟机的postgres用户
ssh node2
3.repmgr配置:
3.1主虚拟机操作
vi /etc/repmgr/15/repmgr.conf
修改repmgr.conf配置文件
ssh_options='-q -o ConnectTimeout=10'
node_id=1
node_name='192.168.1.141'
conninfo='user=repmgr connect_timeout=2 dbname=repmgr host=192.168.1.141 fallback_application_name=repmgr options=-csearch_path='
data_directory='/var/lib/pgsql/15/data'
pg_bindir='/usr/pgsql-15/bin/'
failover='automatic'
promote_command='/usr/pgsql-15/bin/repmgr standby promote -f /etc/repmgr/15/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-15/bin/repmgr standby follow -f /etc/repmgr/15/repmgr.conf --log-to-file --upstream-node-id=%n'
log_file='/var/lib/pgsql/15/repmgrd.log'
primary_visibility_consensus = true
standby_disconnect_on_failover = true
repmgrd_service_start_command ='sudo /usr/bin/systemctl start repmgr15.service'
repmgrd_service_stop_command ='sudo /usr/bin/systemctl stop repmgr15.service'
passfile='/home/postgres/.pgpass'
# 高可用参数设置
location='location1' # 多数据中心时标记服务器的位置,在故障转移期间用于检查当前主节点的可见性
priority=100 # 节点优先级,选主时可能使用到(lsn > priority > node_id),0代表该节点不会被提升为主节点
monitoring_history=yes # 是否将监控数据写入“monitoring_history”表
reconnect_interval=5 # 故障转移之前,尝试重新连接的间隔(以秒为单位)
reconnect_attempts=3 # 故障转移之前,尝试重新连接的次数
monitor_interval_secs=2
use_replication_slots=true
connection_check_type=ping # ping: repmgr 使用PQPing() 方法测试连接
//注册主节点
su - postgres -c "/usr/pgsql-15/bin/repmgr -f /etc/repmgr/15/repmgr.conf primary register"
//查看状态
su - postgres -c "/usr/pgsql-15/bin/repmgr -f /etc/repmgr/15/repmgr.conf cluster show"
//启动standby节点
systemctl start postgresql-15
//启动repmgr15
systemctl start repmgr15
3.2从虚拟机操作
vi /etc/repmgr/15/repmgr.conf
ssh_options='-q -o ConnectTimeout=10'
node_id=2
node_name='192.168.1.142'
conninfo='user=repmgr connect_timeout=2 dbname=repmgr host=192.168.1.142 fallback_application_name=repmgr options=-csearch_path='
data_directory='/var/lib/pgsql/15/data'
pg_bindir='/usr/pgsql-15/bin/'
failover='automatic'
promote_command='/usr/pgsql-15/bin/repmgr standby promote -f /etc/repmgr/15/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-15/bin/repmgr standby follow -f /etc/repmgr/15/repmgr.conf --log-to-file --upstream-node-id=%n'
log_file='/var/lib/pgsql/15/repmgrd.log'
primary_visibility_consensus = true
standby_disconnect_on_failover = true
repmgrd_service_start_command ='sudo /usr/bin/systemctl start repmgr15.service'
repmgrd_service_stop_command ='sudo /usr/bin/systemctl stop repmgr15.service'
passfile='/home/postgres/.pgpass'
# 高可用参数设置
location='location1' # 多数据中心时标记服务器的位置,在故障转移期间用于检查当前主节点的可见性
priority=100 # 节点优先级,选主时可能使用到(lsn > priority > node_id),0代表该节点不会被提升为主节点
monitoring_history=yes # 是否将监控数据写入“monitoring_history”表
reconnect_interval=5 # 故障转移之前,尝试重新连接的间隔(以秒为单位)
reconnect_attempts=3 # 故障转移之前,尝试重新连接的次数
monitor_interval_secs=2
use_replication_slots=true
connection_check_type=ping # ping: repmgr 使用PQPing() 方法测试连接
//注册从节点
su - postgres -c "/usr/pgsql-15/bin/repmgr -h 192.168.1.142 -U repmgr -d repmgr -f /etc/repmgr/15/repmgr.conf standby clone --dry-run"
//从主节点验证是否能复制数据
su - postgres -c "/usr/pgsql-15/bin/repmgr -h 192.168.1.141 -U repmgr -d repmgr -f /etc/repmgr/15/repmgr.conf standby clone --dry-run"
//开始复制主data数据为从节点,复制完会生成从节点的standby.signal文件
su - postgres -c "/usr/pgsql-15/bin/repmgr -h 192.168.1.141 -U repmgr -d repmgr -f /etc/repmgr/15/repmgr.conf standby clone"
//启动standby节点
systemctl start postgresql-15
//启动repmgr15
systemctl start repmgr15
//查看状态
su - postgres -c "/usr/pgsql-15/bin/repmgr -f /etc/repmgr/15/repmgr.conf cluster show"
// 成功状态
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------------+---------+-----------+---------------+-----------+----------+----------+------------------------------------------------------------------------------------------------------------------------
1 | 192.168.1.141 | primary | * running | | location1 | 100 | 12 | user=repmgr connect_timeout=2 dbname=repmgr host=192.168.1.141 fallback_application_name=repmgr options=-csearch_path=
2 | 192.168.1.142 | standby | running | 192.168.1.141 | location1 | 100 | 12 | user=repmgr connect_timeout=2 dbname=repmgr host=192.168.1.142 fallback_application_name=repmgr options=-csearch_path=
4.主从正常运行主从切换:
正常运行时主从切换,进入到从服务器
从服务器
su - postgres -c "/usr/pgsql-15/bin/repmgr -f /etc/repmgr/15/repmgr.conf standby switchover -U repmgr --verbose"
//切换后
[root@node2 data]# su - postgres -c "/usr/pgsql-15/bin/repmgr -f /etc/repmgr/15/repmgr.conf cluster show"
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------------+---------+-----------+---------------+-----------+----------+----------+------------------------------------------------------------------------------------------------------------------------
1 | 192.168.1.141 | standby | running | 192.168.1.142 | location1 | 100 | 12 | user=repmgr connect_timeout=2 dbname=repmgr host=192.168.1.141 fallback_application_name=repmgr options=-csearch_path=
2 | 192.168.1.142 | primary | * running | | location1 | 100 | 13 | user=repmgr connect_timeout=2 dbname=repmgr host=192.168.1.142 fallback_application_name=repmgr options=-csearch_path=
5.测试主故障自动将从设置为主:
进入主服务器
关停postgresql-15
systemctl stop postgresql-15
主服务器显示
[root@node1 data]# su - postgres -c "/usr/pgsql-15/bin/repmgr -f /etc/repmgr/15/repmgr.conf cluster show"
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------------+---------+----------------------+----------+-----------+----------+----------+------------------------------------------------------------------------------------------------------------------------
1 | 192.168.1.141 | primary | * running | | location1 | 100 | 14 | user=repmgr connect_timeout=2 dbname=repmgr host=192.168.1.141 fallback_application_name=repmgr options=-csearch_path=
2 | 192.168.1.142 | standby | ! running as primary | | location1 | 100 | 15 | user=repmgr connect_timeout=2 dbname=repmgr host=192.168.1.142 fallback_application_name=repmgr options=-csearch_path=
从服务器显示
[root@node2 data]# su - postgres -c "/usr/pgsql-15/bin/repmgr -f /etc/repmgr/15/repmgr.conf cluster show"
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------------+---------+-----------+----------+-----------+----------+----------+------------------------------------------------------------------------------------------------------------------------
1 | 192.168.1.141 | primary | ! running | | location1 | 100 | 14 | user=repmgr connect_timeout=2 dbname=repmgr host=192.168.1.141 fallback_application_name=repmgr options=-csearch_path=
2 | 192.168.1.142 | primary | * running | | location1 | 100 | 15 | user=repmgr connect_timeout=2 dbname=repmgr host=192.168.1.142 fallback_application_name=repmgr options=-csearch_path=
WARNING: following issues were detected
- node "192.168.1.141" (ID: 1) is running but the repmgr node record is inactive
从服务器142 ip已变更为了primary(主)了
进入主服务器
将141注册为从服务加入到repmgr
// 测试是否成功
su - postgres -c "repmgr node rejoin -d 'host=192.168.1.142 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose --dry-run"
su - postgres -c "repmgr node rejoin -d 'host=192.168.1.142 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose"
//最后的提示信息为成功
INFO: node "192.168.1.141" (ID: 1) is pingable
INFO: node "192.168.1.141" (ID: 1) has attached to its upstream node
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
//查看状态,成功
[root@node1 data]# su - postgres -c "/usr/pgsql-15/bin/repmgr -f /etc/repmgr/15/repmgr.conf cluster show"
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------------+---------+-----------+---------------+-----------+----------+----------+------------------------------------------------------------------------------------------------------------------------
1 | 192.168.1.141 | standby | running | 192.168.1.142 | location1 | 100 | 14 | user=repmgr connect_timeout=2 dbname=repmgr host=192.168.1.141 fallback_application_name=repmgr options=-csearch_path=
2 | 192.168.1.142 | primary | * running | | location1 | 100 | 15 | user=repmgr connect_timeout=2 dbname=repmgr host=192.168.1.142 fallback_application_name=repmgr options=-csearch_path=