Linux|centos7-postgresql数据库|yum安装数据库和配置repmgr高可用集群以及repmgr的日常管理工作

一、

前言

postgresql 的yum部署其实还是有点东西的,本文就做一个小小的记录,高可用方面repmgr插件还是非常不错的,但如何部署以及部署后如何使用也是一个难点,因此,也在本文里做一个记录

环境介绍:

第一台服务器:操作系统是centos7,内核版本是3.10,IP地址是192.168.123.17  VMware虚拟机,内存4G,cpu4核,计划在此服务器上安装postgresql12.4版本以及repmgr高可用插件。该服务器是主节点

第二台服务器:操作系统是centos7,内核版本是3.10,IP地址是192.168.123.20  VMware虚拟机,内存4G,cpu4核,计划在此服务器上安装postgresql12.4版本以及repmgr高可用插件。该服务器是从节点

二、

详细的yum安装postgreslq数据库

1,配置yum源

yum安装postgresql比较简单,源配置官方源就可用了,命令如下(2台服务器都执行):

yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm

为了防止某些库安装的时候提示缺少依赖,增加阿里云的库和阿里云的epel库

wget -O /etc/yum.repos.d/epel.repo https://mirrors.aliyun.com/repo/epel-7.repo
wget -O /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo

 增加不常用的rpm包依赖仓库:

[centos-sclo-sclo]
name=CentOS-7 - SCLo sclo
baseurl=https://mirrors.tuna.tsinghua.edu.cn/centos/7/sclo/$basearch/sclo/
gpgcheck=0
enabled=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-SCLo
 
[centos-sclo-rh]
name=CentOS-7 - SCLo rh
baseurl=https://mirrors.tuna.tsinghua.edu.cn/centos/7/sclo/$basearch/rh/
gpgcheck=0
enabled=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-SCLo

2,

安装postgresql的组件和postgresql的server服务(2台服务器都执行)

组件比较多,主要是lib库(postgresql运行的依赖库),插件扩展库(postgresql的插件库),开发库(一些postgresql的特殊组件需要使用的库),连接库(开发微服务使用的库)

后面的几个看自己的需求安装不,比如postgresql12-plperl

版本号根据自己需求来,我这里是12.4, 不想安这个低版本的,修改12.4就可以了,比如全部的12.4修改为12.8

yum install postgresql12-server-12.4 postgresql12-libs-12.4 postgresql12-devel-12.4 postgresql12-contrib-12.4 postgresql12-tcl postgresql12-odbc postgresql12-12.4  postgresql12-plperl-12.4 postgresql12-pltcl-12.4 postgresql12-plpython-12.4  postgresql12-plpython3-12.4 postgresql12-llvmjit-12.4 postgresql12-devel-12.4  -y

3,

postgresql的环境变量定义和一些特殊处理(两台服务器都执行)

在/etc/profile 文件末尾添加如下变量,计划postgresql安装在/data/pgsql/data目录下

PGDATA=/data/pgsql/data
export PGDATA
PGHOME=/data/pgsql
export PGHOME
PATH=$PATH:$PGHOME/bin:$PGDATA:/usr/pgsql-12/bin/
export PATH  PGHOME PGDATA

写入完成后,激活变量:

source /etc/profile

切换用户到普通用户postgres,将该用户的环境变量文件全部注释掉,结果如下:

-bash-4.2$ whoami
postgres
-bash-4.2$ cat ~/.bash_profile 
#[ -f /etc/profile ] && source /etc/profile
#PGDATA=/var/lib/pgsql/12/data
#export PGDATA
# If you want to customize your settings,
# Use the file below. This is not overridden
# by the RPMS.
#[ -f /var/lib/pgsql/.pgsql_profile ] && source /var/lib/pgsql/.pgsql_profile

注释的原因是我们不希望使用默认的/var/lib/12目录存放数据库的数据文件,注销普通用户再次su - postgres 重新激活变量后,在普通用户下,可以看到postgresql数据库相关变量已经固定了:

[root@centos10 media]# su - postgres
Last login: Sat Mar 30 05:19:48 CST 2024 on pts/0
-bash-4.2$ echo $PGDATA
/data/pgsql/data

4,创建相关目录并赋予普通用户的权限,准备初始化工作(root用户执行,两台服务器都执行):

mkdir -p /data/pgsql/
chown -Rf postgres. /data/pgsql

5,执行数据库初始化(仍然是切换到普通用户postgres):

由于前面的准备工作比较充分,环境变量已经完全配置好了,因此,执行initdb就可以完成初始化了,/data/pgsql/data/目录以及其下的文件将在初始化的时候创建好

初始化日志如下:

[root@centos7 media]# su - postgres
Last login: Sat Mar 30 05:25:23 CST 2024 on pts/0
-bash-4.2$ initdb
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory /data/pgsql/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Asia/Shanghai
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

initdb: warning: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

    pg_ctl -D /data/pgsql/data -l logfile start

6,

在root用户下管理postgresql的启停

由于是yum安装的,因此,启停脚本是已经写好的,可以直接用,但我们数据库的数据文件存放路径修改了,因此,需要先修改一下启停脚本:

/usr/lib/systemd/system/postgresql-12.service

Environment=PGDATA=/var/lib/pgsql/12/data/
修改为
Environment=PGDATA=/data/pgsql/data/

这个启停脚本名字太长了,修改一下名字 就叫pg12吧:

mv /usr/lib/systemd/system/postgresql-12.service /usr/lib/systemd/system/pg12.service

启动postgresql数据库并查看进行确认是否启动成功:

systemctl enable pg12 && systemctl start pg12

查看postgreslq的进程:

[root@centos7 media]# ps aux |grep postgre
postgres 16161  0.2  0.4 397292 17424 ?        Ss   05:51   0:00 /usr/pgsql-12/bin/postmaster -D /data/pgsql/data/
postgres 16164  0.0  0.0 251676  2008 ?        Ss   05:51   0:00 postgres: logger   
postgres 16166  0.0  0.0 397292  2060 ?        Ss   05:51   0:00 postgres: checkpointer   
postgres 16167  0.0  0.0 397292  2288 ?        Ss   05:51   0:00 postgres: background writer   
postgres 16168  0.0  0.0 397292  2064 ?        Ss   05:51   0:00 postgres: walwriter   
postgres 16169  0.0  0.0 397844  3212 ?        Ss   05:51   0:00 postgres: autovacuum launcher   
postgres 16170  0.0  0.0 251672  2120 ?        Ss   05:51   0:00 postgres: stats collector   
postgres 16171  0.0  0.0 397844  2808 ?        Ss   05:51   0:00 postgres: logical replication launcher   
root     16173  0.0  0.0 112712   964 pts/0    S+   05:51   0:00 grep --color=auto postgre

7,

远程连接和postgres用户密码设置

修改主配置文件 /data/pgsql/data/postgresql.conf 

listen_addresses = '*'          
port = 15433                            
max_connections = 1000   
wal_level = logical

这里是设置数据库连接端口是15433 

修改安全连接文件pg_hba.conf:

末尾添加如下两行,这两行的意思是复制用户replication可以随意连接,但其它用户远程连接的时候需要密码,本地登录的时候不校验密码

host    replication     all             0.0.0.0/0                 trust
host    all             all             0.0.0.0/0           md5

8,

本地登录数据库设置密码

su - postgres -c "psql -p15433"

alter user postgres with password '123456';

此时,数据库算是完全安装完毕,连接端口15433,连接密码123456 后面就不在重复这些信息了

随意安装一个插件,看看lib库什么的是否正常:

[root@centos10 ~]# su - postgres -c "psql -p15433"
psql (12.4)
Type "help" for help.

postgres=# create extension 
adminpack           btree_gin           dblink              file_fdw            hstore_plperlu      insert_username     jsonb_plperl        jsonb_plpythonu     ltree_plpython3u    pg_buffercache      pgrowlocks          pg_visibility       plpython3u          postgres_fdw        tablefunc           unaccent
amcheck             btree_gist          dict_int            fuzzystrmatch       hstore_plpython2u   intagg              jsonb_plperlu       lo                  ltree_plpythonu     pgcrypto            pg_stat_statements  plperl              plpythonu           refint              tcn                 "uuid-ossp"
autoinc             citext              dict_xsyn           hstore              hstore_plpython3u   intarray            jsonb_plpython2u    ltree               moddatetime         pg_freespacemap     pgstattuple         plperlu             pltcl               seg                 tsm_system_rows     xml2
bloom               cube                earthdistance       hstore_plperl       hstore_plpythonu    isn                 jsonb_plpython3u    ltree_plpython2u    pageinspect         pg_prewarm          pg_trgm             plpython2u          pltclu              sslinfo             tsm_system_time     
postgres=# create extension pg_trgm ;
CREATE EXTENSION

二,

repmgr的基本安装部署和配置

repmgr的介绍:

Repmgr是2ndQuadrant(第二象限公司)开发的一款复制的开源工具套件,用于管理PostgreSQL服务器集群中的复制和故障转移。

最初,它主要是为了简化流副本的管理,后来发展成为一个完整的故障转移管理套件。它通过设置备用服务器,监视复制以及执行管理任务(如故障转移或手动切换操作)的工具,增强了PostgreSQL内置的热备份功能。

其中各组件功能如下:

No.1 Repmgrd 守护进程

它主动监视复制集群中的服务器并执行以下任务:

1)监控和记录集群复制性能;

2)通过检测主服务器故障并提升最合适的备用服务器来执行故障转移;

3)将有关群集中事件的通知提供给用户定义的脚本,该脚本可以执行诸如通过电子邮件发送警报等任务;

4)repmgrd 根据本地数据库角色不同,其功能也不同:

  • 主库:repmgrd仅监控本地数据库,负责自动恢复、同异步切换;

  • 备库:repmgrd监控本地数据库和主数据库,负责自动切换、复制槽删除。

No.2 Repmgr命令管理

用于执行管理任务的命令行工具,主要有以下方面作用:

1)设置备用服务器;

2)将备用服务器升级为主服务器;

3)切换主服务器和备用服务器;

4)显示复制群集中的服务器状态。

No.3 用户和元数据 为了有效地管理复制集群,repmgr提供专用数据库存储和管理有关repmgr集群服务的相关信息。

此模式在 部署repmgr服务时,由repmgr扩展自动创建,该扩展在初始化repmgr -administered集群(repmgr主寄存器) 的第一步中安装,包含以下对象:

  • 表:

    repmgr.events:记录感兴趣的事件;

    repmgr.nodes:复制群集中每个服务器的连接和状态信息 ;

    repmgr.monitoring_history:repmgrd写入的历史备用监视信息。

  • 视图

    repmgr.show_nodes:基于表repmgr.nodes,另外显示服务器上游节点的名称。

    repmgr.replication_status:启用repmgrd的监视时,显示每个备用数据库的当前监视状态。

    repmgr元数据模式可以存储在现有的数据库或在自己的专用数据库。

请注意,repmgr元数据模式不能驻留在不属于repmgr管理的复制集群的数据库服务器上。数据库用户必须可供repmgr访问此数据库并执行必要的更改。此用户不需要是超级用户,但是某些操作(如初始安装repmgr扩展)将需要超级用户连接(可以使用命令 行选项--superuser在需要时指定 )。

1、

计划repmgr安装方式为yum,安装命令如下:

yum install repmgr_12-5.4.1       repmgr_12-devel-5.4.1  repmgr_12-llvmjit-5.4.1  -y

2、

主数据库装载repmgr插件

修改主数据库17的主配置文件,data/pgsql/data/postgresql.conf,修改完毕后重启服务器

主要是定义数据库的端口,以及归档命令

listen_addresses = '*'
port = 15433
max_connections = 1000	
shared_buffers = 128MB	
dynamic_shared_memory_type = posix
wal_level = logical	
max_wal_size = 1GB
min_wal_size = 80MB
archive_mode = on	
archive_command = 'test ! -f /data/pgsql/arclog/%f && cp %p /data/pgsql/arclog/%f'		# 
log_destination = 'stderr'
logging_collector = on
log_directory = 'log'
log_filename = 'postgresql-%a.log'	
log_truncate_on_rotation = on		
log_rotation_age = 1d			
log_rotation_size = 0			
log_line_prefix = '%m [%p] '		
log_timezone = 'Asia/Shanghai'
datestyle = 'iso, mdy'
timezone = 'Asia/Shanghai'
lc_messages = 'en_US.UTF-8'							
lc_monetary = 'en_US.UTF-8'			
lc_numeric = 'en_US.UTF-8'			
lc_time = 'en_US.UTF-8'	
default_text_search_config = 'pg_catalog.english'
shared_preload_libraries = 'repmgr'
wal_log_hints = on					

DETAIL: pg_rewind requires "wal_log_hints" to be enabled 

根据以上配置,创建目录/data/pgsql/arclog并赋予postgres这个用户权限,最后重启服务器

mkdir -p /data/pgsql/arclog
chown -Rf postgres. /data/pgsql/arclog/
systemctl restart pg12
ntpdate ntp.aliyun.com

3、

主节点17repmgr的配置文件 /etc/repmgr/12/repmgr.conf 增加如下内容:

node_id=1
node_name='pg1'
conninfo='host=192.168.123.17 port=15433 user=repmgr dbname=repmgr password=123456'
data_directory='/data/pgsql/data'
replication_user='repmgr'
replication_type='physical'
#location='pg1'                                  
use_replication_slots=true
witness_sync_interval=15
log_level='DEBUG'
log_facility='STDERR'
log_file='/var/log/repmgr/pg2.log'
pg_bindir='/usr/pgsql-12/bin/'
ssh_options='-p 22 -q -o ConnectTimeout=10'
promote_check_timeout=60
promote_check_interval=1
node_rejoin_timeout=60
failover='automatic'
#priority=100                                                   
connection_check_type=ping
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /etc/repmgr/12/repmgr.conf --log-to-file --upstream-node-id=%n'
monitoring_history=yes
service_start_command = 'sudo /usr/bin/systemctl start pg12'
service_stop_command = 'sudo /usr/bin/systemctl stop pg12'
service_restart_command = 'sudo /usr/bin/systemctl restart pg12'
service_reload_command = 'sudo /usr/bin/systemctl reload pg12'
service_promote_command = '/usr/pgsql-12/bin/pg_ctl promote -D /data/pgsql/data'
repmgrd_service_start_command = '/usr/pgsql-12/bin/repmgrd -f /etc/repmgr/12/repmgr.conf -p /run/repmgr/repmgrd-12.pid -d --verbose'
repmgrd_service_stop_command = '/usr/bin/kill `cat /run/repmgr/repmgrd-12.pid`'

注意:ssh_options='-p 22 -q -o ConnectTimeout=10'  如果sshd服务不是默认的22,根据实际情况填写,-p必须写在前面,在后面不会生效

service_start_command 这个命令是由repmgrd这个守护进程调用的,但执行命令的是普通用户postgres,因此,需要增加sudo并配置好sudo,其它命令stop什么的类似

sudo配置在文章末尾

4、

根据以上配置文件,数据库添加repmgr用户

[root@centos7 ~]# su - postgres -c 'psql -p15433'
psql (12.4)
Type "help" for help.

postgres=# create user repmgr with password '123456';
CREATE ROLE
postgres=# create database repmgr;
CREATE DATABASE




ALTER USER repmgr WITH REPLICATION;

5、

切换普通用户,利用repmgr注册主节点

[root@centos7 ~]# su - postgres
Last login: Sat Mar 30 23:08:48 CST 2024 on pts/0
-bash-4.2$ repmgr primary register
INFO: connecting to primary database...
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
ERROR: connection to database failed
DETAIL: 
fe_sendauth: no password supplied

DETAIL: attempted to connect using:
  user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path=

失败了,原因是没有传递密码变量,执行变量命令,并重新注册,注册的时候临时给repmgr用户超级权限,注册成功

export PGPASSWORD="123456"

-bash-4.2$ repmgr primary register -S postgres
WARNING: following problems with command line parameters detected:
  --superuser ignored when executing PRIMARY REGISTER
INFO: connecting to primary database...
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
NOTICE: attempting to install extension "repmgr"
DEBUG: established superuser connection as "postgres"
NOTICE: "repmgr" extension successfully installed
NOTICE: primary node record (ID: 1) registered

可以看到,扩展插件repmgr启用了,具体路径为repmgr数据库下的名为repmgr的schema下

 命令行也可以佐证确实是安装了插件repmgr:

repmgr=# set search_path to repmgr ;
SET
repmgr=# \dx
                    List of installed extensions
  Name   | Version |   Schema   |            Description             
---------+---------+------------+------------------------------------
 plpgsql | 1.0     | pg_catalog | PL/pgSQL procedural language
 repmgr  | 5.4     | repmgr     | Replication manager for PostgreSQL
(2 rows)

如果觉得注册的不对,可以重新注册,重新注册命令如下:

-bash-4.2$ repmgr primary unregister -S postgres --force
WARNING: following problems with command line parameters detected:
  --superuser ignored when executing PRIMARY UNREGISTER
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
INFO: node "pg1" (ID: 1) was successfully unregistered

看到 successfully后,再次执行注册就可以,查看repmgr集群的状态:

-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
 ID | Name | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                       
----+------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------
 1  | pg1  | primary | * running |          | pg1      | 100      | 1        | host=192.168.123.17 port=15433 user=repmgr dbname=repmgr

6、

从节点20的repmgr的配置文件

node_id=2
node_name='pg2'
conninfo='host=192.168.123.20 port=15433 user=repmgr dbname=repmgr password=123456'
data_directory='/data/pgsql/data'
replication_user='repmgr'
replication_type='physical'
#location='pg2'
use_replication_slots=true
witness_sync_interval=15
log_level='DEBUG'
log_facility='STDERR'
log_file='/var/log/repmgr/pg2.log'
pg_bindir='/usr/pgsql-12/bin/'
ssh_options='-q -o ConnectTimeout=10'
promote_check_timeout=60
promote_check_interval=1
node_rejoin_timeout=60
failover='automatic'
#priority=120
connection_check_type=ping
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /etc/repmgr/12/repmgr.conf --log-to-file --upstream-node-id=%n'
monitoring_history=yes
service_start_command = 'sudo /usr/bin/systemctl start pg12'
service_stop_command = 'sudo /usr/bin/systemctl stop pg12'
service_restart_command = 'sudo /usr/bin/systemctl restart pg12'
service_reload_command = 'sudo /usr/bin/systemctl reload pg12'
service_promote_command = '/usr/pgsql-12/bin/pg_ctl promote -D /data/pgsql/data'
repmgrd_service_start_command = '/usr/pgsql-12/bin/repmgrd -f /etc/repmgr/12/repmgr.conf -p /run/repmgr/repmgrd-12.pid -d --verbose'
repmgrd_service_stop_command = '/usr/bin/kill `cat /run/repmgr/repmgrd-12.pid`'

service_start_command 这个命令是由repmgrd这个守护进程调用的,但执行命令的是普通用户postgres,因此,需要增加sudo并配置好sudo,其它命令stop什么的类似

sudo配置在文章末尾

7、

从节点利用流复制搭建主从集群,通过repmgr命令来执行(postgres这个普通用户执行)

repmgr standby clone -h 192.168.123.17 -p15433 -d repmgr --force  -U repmgr -c

HINT: this may take some time; consider using the -c/--fast-checkpoint option 

日志如下:

NOTICE: destination directory "/data/pgsql/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=192.168.123.17 port=15433 user=repmgr dbname=repmgr
DETAIL: current installation size is 32 MB
DEBUG: 1 node records returned by source node
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: upstream_node_id determined as 1
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
WARNING: directory "/data/pgsql/data" exists but is not empty
NOTICE: -F/--force provided - deleting existing data directory "/data/pgsql/data"
DEBUG: replication slot "repmgr_slot_2" exists but is inactive; reusing
NOTICE: starting backup (using pg_basebackup)...
INFO: executing:
  /usr/pgsql-12/bin/pg_basebackup -l "repmgr base backup"  -D /data/pgsql/data -h 192.168.123.17 -p 15433 -U repmgr -c fast -X stream -S repmgr_slot_2 
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
could not identify current directory: No such file or directory
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: /usr/bin/systemctl start pg12
HINT: after starting the server, you need to register this standby with "repmgr standby register"

命令执行完成后,就可以启动从节点了:

systemctl start pg12

查看进程可以看到从节点正常复制了:

[root@centos10 ~]# ps aux |grep postgre
root      1637  0.0  0.0 191784  2336 pts/0    S    Mar30   0:00 su - postgres
postgres  1638  0.0  0.0 115444  2064 pts/0    S+   Mar30   0:00 -bash
postgres  1744  0.3  1.1 450504 46412 ?        Ss   00:05   0:00 /usr/pgsql-12/bin/postmaster -D /data/pgsql/data/
postgres  1745  0.0  0.0 260664  2084 ?        Ss   00:05   0:00 postgres: logger   
postgres  1746  0.0  0.0 450708  3100 ?        Ss   00:05   0:00 postgres: startup   recovering 000000010000000000000013
postgres  1747  0.0  0.0 450504  2112 ?        Ss   00:05   0:00 postgres: checkpointer   
postgres  1748  0.0  0.0 450504  2120 ?        Ss   00:05   0:00 postgres: background writer   
postgres  1749  0.0  0.0 262784  2008 ?        Ss   00:05   0:00 postgres: stats collector   
postgres  1750  0.4  0.1 457520  4700 ?        Ss   00:05   0:00 postgres: walreceiver   streaming 0/130007C8
root      1752  0.0  0.0 112712   964 pts/1    S+   00:05   0:00 grep --color=auto postgre

主节点正常发送wal日志了:

[root@centos7 ~]# ps aux |grep postgres
postgres  2205  0.0  1.1 450504 46464 ?        Ss   00:01   0:00 /usr/pgsql-12/bin/postmaster -D /data/pgsql/data/
postgres  2207  0.0  0.0 260664  2084 ?        Ss   00:01   0:00 postgres: logger   
postgres  2209  0.0  0.1 450908  4180 ?        Ss   00:01   0:00 postgres: checkpointer   
postgres  2210  0.0  0.0 450764  3916 ?        Ss   00:01   0:00 postgres: background writer   
postgres  2211  0.0  0.1 450504  6296 ?        Ss   00:01   0:00 postgres: walwriter   
postgres  2212  0.0  0.0 451784  3352 ?        Ss   00:01   0:00 postgres: autovacuum launcher   
postgres  2213  0.0  0.0 262784  2196 ?        Ss   00:01   0:00 postgres: archiver   last was 000000010000000000000012.00000028.backup
postgres  2214  0.0  0.0 262940  2352 ?        Ss   00:01   0:00 postgres: stats collector   
postgres  2215  0.0  0.0 451624  2856 ?        Ss   00:01   0:00 postgres: logical replication launcher   
root      2216  0.0  0.0 191784  2336 pts/0    S    00:01   0:00 su - postgres
postgres  2217  0.0  0.0 115444  2004 pts/0    S+   00:01   0:00 -bash
postgres  2260  0.0  0.0 451812  3848 ?        Ss   00:05   0:00 postgres: walsender repmgr 192.168.123.20(38408) streaming 0/130008B0
postgres  2265  0.0  0.1 452464  7044 ?        Ss   00:07   0:00 postgres: postgres postgres 192.168.123.1(56048) idle
root      2294  0.0  0.0 112712   960 pts/1    S+   00:10   0:00 grep --color=auto postgres

8、

注册从节点

-bash-4.2$ repmgr standby register
INFO: connecting to local node "pg2" (ID: 2)
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
INFO: connecting to primary database
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID: 1)
INFO: standby registration complete
NOTICE: standby node "pg2" (ID: 2) successfully registered

9、

查看集群状态

-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
 ID | Name | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                           
----+------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------
 1  | pg1  | primary | * running |          | default  | 100      | 1        | host=192.168.123.17 port=15433 user=repmgr dbname=repmgr password=123456
 2  | pg2  | standby |   running | pg1      | default  | 100      | 1        | host=192.168.123.20 port=15433 user=repmgr dbname=repmgr password=123456



三、

自动故障转移和测试

如果需要启用自动故障转移,那么,我们需要开启repmgrd服务,每个节点都开启

开启比较简单,一条命令就可以了

systemctl start repmgr-12

服务启动完毕后,可以查看进程是否正确:

[root@centos7 ~]# ps aux |grep repm
postgres  5915  0.2  0.2 453488  9716 ?        Ss   09:28   0:00 postgres: repmgr repmgr 192.168.123.17(35544) idle
postgres  5917  0.0  0.0  84276  2160 ?        S    09:28   0:00 /usr/pgsql-12/bin/repmgrd -f /etc/repmgr/12/repmgr.conf -p /run/repmgr/repmgrd-12.pid -d --verbose
root      5925  0.0  0.0 112712   960 pts/3    S+   09:29   0:00 grep --color=auto repm

注意,这里问题来了,该服务是依赖于postgresql数据库的进程的,数据库必须是在线的才可以启停repmgrd服务

那么,很多人会疑惑,这个服务有什么用?其实很简单,就是故障转移用的,这些是关键配置,在发生宕机的情况下,这些命令会自动的启停数据库,例如,主节点宕机,那么,从节点的该服务将会在规定时间内检查主节点,如果一直找不到主,将会把自己提升为主,先执行这个命令:

promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file'

然后执行这个命令:

service_promote_command = '/usr/pgsql-12/bin/pg_ctl promote -D /data/pgsql/data'

特别注意:这些命令全部需要绝对路径,否则会报找不到命令,配置文件修改后,重启repmgrd服务就可以生效

promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /etc/repmgr/12/repmgr.conf --log-to-file --upstream-node-id=%n'
monitoring_history=yes
service_start_command = '/usr/bin/systemctl start pg12'
service_stop_command = '/usr/bin/systemctl stop pg12'
service_restart_command = '/usr/bin/systemctl restart pg12'
service_reload_command = '/usr/bin/systemctl reload pg12'
service_promote_command = '/usr/pgsql-12/bin/pg_ctl promote -D /data/pgsql/data'
repmgrd_service_start_command = '/usr/pgsql-12/bin/repmgrd -f /etc/repmgr/12/repmgr.conf -p /run/repmgr/repmgrd-12.pid -d --verbose'
repmgrd_service_stop_command = '/usr/bin/kill `cat /run/repmgr/repmgrd-12.pid`'

下面是将主节点数据库服务关闭后,从节点的repmgrd服务打印的日志/var/log/repmgr/pg2.log

[2024-03-31 09:13:29] [INFO] 2 total nodes registered
[2024-03-31 09:13:29] [INFO] primary node  "pg1" (ID: 1) and this node have the same location ("default")
[2024-03-31 09:13:29] [INFO] no other sibling nodes - we win by default
[2024-03-31 09:13:29] [DEBUG] election result: WON
[2024-03-31 09:13:29] [NOTICE] this node is the only available candidate and will now promote itself
[2024-03-31 09:13:29] [DEBUG] get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached   FROM repmgr.nodes n  WHERE n.node_id = 1
[2024-03-31 09:13:29] [INFO] promote_command is:
  "/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file"
[2024-03-31 09:13:29] [NOTICE] redirecting logging output to "/var/log/repmgr/pg2.log"

[2024-03-31 09:13:29] [DEBUG] connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
[2024-03-31 09:13:29] [DEBUG] connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
[2024-03-31 09:13:29] [DEBUG] connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
[2024-03-31 09:13:29] [NOTICE] promoting standby to primary
[2024-03-31 09:13:29] [DETAIL] promoting server "pg2" (ID: 2) using "/usr/pgsql-12/bin/pg_ctl promote -D /data/pgsql/data"
[2024-03-31 09:13:29] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2024-03-31 09:13:29] [DEBUG] setting node 2 as primary and marking existing primary as failed
[2024-03-31 09:13:29] [NOTICE] STANDBY PROMOTE successful
[2024-03-31 09:13:29] [DETAIL] server "pg2" (ID: 2) was successfully promoted to primary
[2024-03-31 09:13:29] [DEBUG] result of promote_command: 0
[2024-03-31 09:13:29] [INFO] checking state of node 2, 1 of 6 attempts

再次启动原主节点的postgresql数据库,在主节点查看集群状态,可以发现脑裂了:

-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
 ID | Name | Role    | Status               | Upstream | Location | Priority | Timeline | Connection string                                                           
----+------+---------+----------------------+----------+----------+----------+----------+------------------------------------------------------------------------------
 1  | pg1  | primary | * running            |          | default  | 100      | 1        | host=192.168.123.17 port=15433 user=repmgr dbname=repmgr password=123456
 2  | pg2  | standby | ! running as primary |          | default  | 100      | 2        | host=192.168.123.20 port=15433 user=repmgr dbname=repmgr password=123456

WARNING: following issues were detected
  - node "pg2" (ID: 2) is registered as standby but running as primary

在从节点,可以看到显示的不一样,不过仍然是脑裂:

-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
 ID | Name | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                           
----+------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------
 1  | pg1  | primary | - failed  | ?        | default  | 100      |          | host=192.168.123.17 port=15433 user=repmgr dbname=repmgr password=123456
 2  | pg2  | primary | * running |          | default  | 100      | 2        | host=192.168.123.20 port=15433 user=repmgr dbname=repmgr password=123456

WARNING: following issues were detected
  - unable to connect to node "pg1" (ID: 1)

解决方案:

首先,在从节点强制踢掉自己,数据库服务不能关闭,命令如下:

repmgr primary unregister --force

然后关闭从节点的数据库服务,重新从现主17服务器拉取:

repmgr standby clone -h 192.168.123.17 -p15433 -d repmgr --force  -U repmgr -c

同步拉取完毕后,在重新强制注册为standby节点,就恢复正常了:

-bash-4.2$ repmgr standby register --force
INFO: connecting to local node "pg2" (ID: 2)
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
INFO: connecting to primary database
DEBUG: connecting to: "user=repmgr password=1234562 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
INFO: standby registration complete
NOTICE: standby node "pg2" (ID: 2) successfully registered

再次查看集群状态,发现恢复正常了:

-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
 ID | Name | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                           
----+------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------
 1  | pg1  | primary | * running |          | default  | 100      | 1        | host=192.168.123.17 port=15433 user=repmgr dbname=repmgr password=123456
 2  | pg2  | standby |   running | pg1      | default  | 100      | 1        | host=192.168.123.20 port=15433 user=repmgr dbname=repmgr password=123456

总结:

脑裂的处理步骤是,首先,将异常的节点也就是 running as primary的节点踢掉,这个节点在哪,就在哪个节点踢掉,踢节点也就是unregister,踢完后standby clone,重新克隆的时候需要停止异常节点数据库服务,也就是原从节点的数据库服务,完事后在重新注册并再次启动原从节点的数据库就可以了;而脑裂产生的原因是原主节点数据库挂掉后,又重新启动数据库服务了,重新启动数据库后,原主认为自己是主节点,而原从节点已经提升为主节点,它也认为自己是主节点,造成集群内有两个主节点的信息了,那么,此时集群的主从复制就是混乱状态

如果脑裂很快发现按上述步骤没什么问题,但如果时间很长才发现,那么,需要根据pg_controlldata命令确定哪个是主,哪个是从,Latest checkpoint's oldestXID高的为主,反之为从,在从节点按上述操作即可

-bash-4.2$ pg_controldata 
pg_control version number:            1201
Catalog version number:               201909212
Database system identifier:           7351900556498275811
Database cluster state:               in production
pg_control last modified:             Sun 31 Mar 2024 11:41:05 AM CST
Latest checkpoint location:           0/2F001528
Latest checkpoint's REDO location:    0/2F0014F0
Latest checkpoint's REDO WAL file:    00000001000000000000002F
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:3382
Latest checkpoint's NextOID:          16454
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        479
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  3382
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint:            Sun 31 Mar 2024 11:41:05 AM CST
Fake LSN counter for unlogged rels:   0/3E8
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    logical
wal_log_hints setting:                on
max_connections setting:              1000
max_worker_processes setting:         8
max_wal_senders setting:              10
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0
Mock authentication nonce:            e76af00554364729517eece5fa0fcf6f7f0662da3a154cabb730c2ab24edd267

重新加入节点:

repmgr node rejoin -h 192.168.123.20 -p 15433 -Upostgres -d repmgr --force-rewind

四,

主备手动切换

repmgr standby switchover --force -S postgres

这个命令比较方便,在从节点执行,不过需要配置免密,并且需要standby从库有上级,比如下面的,Upstream这一栏里,表明pg1的上级是pg2,pg2是主节点,pg1是从节点

其次,普通用户postgres需要能够sudo执行  systemctl命令

sudo的配置:

切换到root用户,使用visudo命令,打开sudo的编辑界面,添加如下一行:

postgres        ALL=(ALL)       NOPASSWD: /usr/bin/systemctl

-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
 ID | Name | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                           
----+------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------
 1  | pg1  | standby |   running | pg2      | default  | 100      | 2        | host=192.168.123.17 port=15433 user=repmgr dbname=repmgr password=123456
 2  | pg2  | primary | * running |          | default  | 100      | 3        | host=192.168.123.20 port=15433 user=repmgr dbname=repmgr password=123456

那么,想快速的把这两个节点调换一个位置就是上面的switchover命令了,非常的方便,比如,现在我又想把pg2切换到从,pg1切换到主,那么,应该是在pg1也就是从节点上执行switchover命令

在切换期间,原主节点的重启什么的都是自动完成的,下面的日志详细说明了主从切换的过程,可以看到是先把从节点提升为主,然后在把原主降级为从,期间两个节点互相远程登录执行了systemctl命令以启停数据库服务

-bash-4.2$ repmgr standby switchover
DEBUG: connecting to: "user=repmgr password=shiguang32 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
NOTICE: executing switchover on node "pg1" (ID: 1)
DEBUG: connecting to: "user=repmgr password=shiguang32 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 192.168.123.20 /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf --version >/dev/null 2>&1 && echo "1" || echo "0"
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 192.168.123.20 /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf --version 2>/dev/null
DEBUG: "repmgr" version on "192.168.123.20" is 50401
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 192.168.123.20 test -f /etc/repmgr/12/repmgr.conf && echo 1 || echo 0
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 192.168.123.20 /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf node check --data-directory-config --optformat -LINFO 2>/dev/null
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 192.168.123.20 /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf node check --replication-config-owner --optformat -LINFO 2>/dev/null
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 192.168.123.20 /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf node check --remote-node-id=1 --replication-connection
DEBUG: connecting to: "user=repmgr password=shiguang32 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 192.168.123.20 /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf node check --terse -LERROR --archive-ready --optformat
DEBUG: lag is 0 
DEBUG: minimum of 1 free slots (0 for siblings) required; 10 available
NOTICE: attempting to pause repmgrd on 2 nodes
DEBUG: connecting to: "user=repmgr password=shiguang32 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=shiguang32 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
NOTICE: local node "pg1" (ID: 1) will be promoted to primary; current primary "pg2" (ID: 2) will be demoted to standby
NOTICE: stopping current primary node "pg2" (ID: 2)
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 192.168.123.20 /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf node service --action=stop --checkpoint
DEBUG: connecting to: "user=repmgr password=shiguang32 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
NOTICE: issuing CHECKPOINT on node "pg2" (ID: 2) 
DETAIL: executing server command "sudo /usr/bin/systemctl stop pg12"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_NO_RESPONSE
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 192.168.123.20 /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf node status --is-shutdown-cleanly
NOTICE: current primary has been cleanly shut down at location 0/C2000028
DEBUG: local node last receive LSN is 0/C20000A0, primary shutdown checkpoint LSN is 0/C2000028
NOTICE: promoting standby to primary
DETAIL: promoting server "pg1" (ID: 1) using "/usr/pgsql-12/bin/pg_ctl promote -D /data/pgsql/data"
waiting for server to promote.... done
server promoted
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
DEBUG: setting node 1 as primary and marking existing primary as failed
NOTICE: STANDBY PROMOTE successful
DETAIL: server "pg1" (ID: 1) was successfully promoted to primary
DEBUG: executing:
  /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf  --no-wait -d \'user=repmgr password=shiguang32 dbname=repmgr host=192.168.123.17 port=15433\' node rejoin > /tmp/node-rejoin.1712402006.log 2>&1 && echo "1" || echo "0"
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 192.168.123.20 /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf  --no-wait -d \'user=repmgr password=shiguang32 dbname=repmgr host=192.168.123.17 port=15433\' node rejoin > /tmp/node-rejoin.1712402006.log 2>&1 && echo "1" || echo "0"
NOTICE: node "pg1" (ID: 1) promoted to primary, node "pg2" (ID: 2) demoted to standby
DEBUG: connecting to: "user=repmgr password=shiguang32 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
NOTICE: switchover was successful
DETAIL: node "pg1" is now primary and node "pg2" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully

在强调一次,需要配置免密和systemctl命令加入sudo才可以使用此功能,全部都是自动化完成的

再看集群状态就十分正常了:

-bash-4.2$ repmgr service status
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
 ID | Name | Role    | Status    | Upstream | repmgrd | PID  | Paused? | Upstream last seen
----+------+---------+-----------+----------+---------+------+---------+--------------------
 1  | pg1  | primary | * running |          | running | 7631 | no      | n/a                
 2  | pg2  | standby |   running | pg1      | running | 5511 | no      | 0 second(s) ago    

追更:

需要说明的是,repmgr的主从复制关系由repmgr这个程序来管理,包括重建主从关系,那么,在使用pgbackrest这个灾备工具的时候,如果只恢复主,从节点需要重新拉取,也就是repmgr standby clone -h 192.168.123.17 -p15433 -d repmgr --force  -U repmgr -c 这个命令,重新拉取前最好是unregister

按照上面提到的步骤,重新拉取并重新注册后,发现了一个问题,在主节点查看集群状态 ,从节点状态是异常的,带了一个问号 ,状态如下所示:

-bash-4.2$ repmgr cluster info
The following command line errors were encountered:
  unknown repmgr action 'cluster info'
Try "repmgr --help" or "repmgr cluster --help" for more information.
-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr password=123456dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
 ID | Name | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                           
----+------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------
 1  | pg1  | primary | * running |          | default  | 100      | 7        | host=192.168.123.17 port=15433 user=repmgr dbname=repmgr password=123456
 2  | pg2  | standby |   running | ? pg1    | default  | 100      | 7        | host=192.168.123.20 port=15433 user=repmgr dbname=repmgr password=123456

此时即使使用repmgr rejoin什么命令的也没有用,当然了standby follow也是无效的,都是报错,报错如下(从节点执行的):


DETAIL:  SELECT pid, state    FROM pg_catalog.pg_stat_replication   WHERE application_name = 'pg2'
DEBUG: sleeping 1 second waiting for node  "pg2" (ID: 2) to connect to new primary; 60 of max 60 attempts
ERROR: NODE REJOIN failed
DETAIL: no active record for local node "pg2" found in node "pg1"'s "pg_stat_replication" table

最终解决方案为查看  postgresql.auto.conf 文件,发现从节点的这个文件里多了一些东西:

# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
restore_command = 'pgbackrest --stanza=main archive-get %f "%p"'
recovery_target_time = '2024-04-15 08:51:37+08'
primary_conninfo = 'user=repmgr password=shiguang32 host=192.168.123.17 port=15433 application_name=pg2'
primary_slot_name = 'repmgr_slot_2'

很明显,这个配置和主配置文件postgresql.conf 有冲突,因此,保留主从复制的配置,其它注释掉,如下,在重启服务器,立刻就恢复了(从节点的,别忘记了):

# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
#restore_command = 'pgbackrest --stanza=main archive-get %f "%p"'
#recovery_target_time = '2024-04-15 08:51:37+08'
primary_conninfo = 'user=repmgr password=shiguang32 host=192.168.123.17 port=15433 application_name=pg2'
primary_slot_name = 'repmgr_slot_2'

再次查看repmgr集群状态就全部正常了:

-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr password=123456dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
 ID | Name | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                           
----+------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------
 1  | pg1  | primary | * running |          | default  | 100      | 7        | host=192.168.123.17 port=15433 user=repmgr dbname=repmgr password=123456
 2  | pg2  | standby |   running | pg1      | default  | 100      | 7        | host=192.168.123.20 port=15433 user=repmgr dbname=repmgr password=123456

并且pgbackrest也是正常的了,这里就不测试了,完结撒花!!!~~~~~~

  • 11
    点赞
  • 12
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

晚风_END

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值