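Environment:
OS: CentOS 7
DB: PostgreSQL 13.8
Primary: 192.168.1.134
Standby: 192.168.1.135
Reference: https://www.cnblogs.com/hxlasky/p/16810443.html
################ Primary/Standby Deployment ######################
1. Create the streaming-replication user on the primary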
postgres=# CREATE ROLE replica login replication encrypted password 'replica';
2. On the primary, edit pg_hba.conf so that the standby's IP can connect as the replication user
Switch to root: su root
vi /opt/pg13/data/pg_hba.conf
# replication privilege.
local replication all trust
host replication all 127.0.0.1/32 trust
host replication all ::1/128 trust
# place this entry in the IPv4 section
host replication replica 192.168.1.0/24 md5 ## newly added; here the whole subnet is allowed
Or restrict it to a specific IP:
# replication privilege.
local replication all trust
host replication all 127.0.0.1/32 trust
host replication all ::1/128 trust
host replication replica 192.168.1.135/32 md5 ## a specific IP
Reload the configuration afterwards; otherwise the standby's connection is rejected
[postgres@host134 ~]$ pg_ctl -D /opt/pg13/data reload
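Before reloading, it can help to confirm that pg_hba.conf really contains the replication rule for the standby. The following is a minimal POSIX shell sketch, not part of the original procedure: the function name has_replication_rule is hypothetical, and it compares the address column as a plain string (no CIDR matching).

```shell
#!/bin/sh
# has_replication_rule FILE USER ADDR
# Succeeds if FILE has a non-comment "host..." line whose database column is
# "replication", whose user column is USER and whose address column is ADDR.
# Hypothetical helper: ADDR is compared as text, the auth method is ignored.
has_replication_rule() {
    awk -v user="$2" -v addr="$3" '
        /^[[:space:]]*#/ { next }
        $1 ~ /^host/ && $2 == "replication" && $3 == user && $4 == addr { found = 1 }
        END { exit(found ? 0 : 1) }
    ' "$1"
}
```

For example: has_replication_rule /opt/pg13/data/pg_hba.conf replica 192.168.1.135/32 && pg_ctl -D /opt/pg13/data reload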
3. Stop the standby
su - postgres
pg_ctl -D /opt/pg13/data -l /opt/pg13/log/postgres.log stop
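4. Prepare the data directory on the standby
If PostgreSQL was only just installed on the standby, do not initialize it. If a data directory already exists from an earlier initialization, move it aside (or delete it) and create an empty directory with the same name.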
su - postgres
[postgres@host135 ~]$ cd /opt/pg13
[postgres@host135 pg13]$ mv data bakdata
[postgres@host135 pg13]$ mkdir data
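Create the archive directory, using the same path as on the primary
[postgres@host135 pg13]$ mkdir -p /opt/pg13/archivelog
Ownership and permissions must be correct; if they are not, fix them as root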
[root@host135 ~]# chown -R postgres:postgres /opt/pg13
[root@host135 ~]# chmod 0700 /opt/pg13/data
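PostgreSQL refuses to start unless the data directory has mode 0700 or 0750, so it is worth checking the mode before going further. A minimal sketch, assuming GNU stat (stat -c) as shipped with CentOS 7; the function name check_pgdata_mode is hypothetical.

```shell
#!/bin/sh
# check_pgdata_mode DIR
# Succeeds if DIR has mode 700 or 750, the two modes PostgreSQL accepts for
# a data directory. Assumes GNU stat; returns 2 if stat itself fails.
check_pgdata_mode() {
    mode=$(stat -c '%a' "$1") || return 2
    case "$mode" in
        700|750) return 0 ;;
        *) echo "$1 has mode $mode, expected 700 or 750" >&2; return 1 ;;
    esac
}
```

For example: check_pgdata_mode /opt/pg13/data || echo "fix permissions as root first"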
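5. Take a base backup of the primary from the standby
[postgres@host135 pg13]$ pg_basebackup -h 192.168.1.134 -p 5432 -U replica --password -X stream -Fp --progress -D /opt/pg13/data -R
Note the -R option: it creates standby.signal and writes primary_conninfo to postgresql.auto.conf in the target directory.
The first attempt may fail like this: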
[postgres@host135 pg13]$ pg_basebackup -h 192.168.1.134 -p 5432 -U replica --password -X stream -Fp --progress -D /opt/pg13/data -R
Password:
pg_basebackup: error: FATAL: no pg_hba.conf entry for replication connection from host "192.168.1.135", user "replica", SSL off
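Cause 1:
pg_hba.conf was changed on the primary but not reloaded. Reload it:
pg_ctl -D /opt/pg13/data reload
Cause 2:
A firewall may be blocking the connection. On a test VM you can flush the iptables rules (this removes every rule, so do not do it on a production host):
sudo iptables -F
Or open port 5432 on the primary: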
firewall-cmd --permanent --zone=public --add-port=5432/tcp
firewall-cmd --state
firewall-cmd --reload
[postgres@host135 pg13]$ pg_basebackup -h 192.168.1.134 -p 5432 -U replica --password -X stream -Fp --progress -D /opt/pg13/data -R
Password:
32247/32247 kB (100%), 1/1 tablespace
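pg_basebackup also copies the primary's postgresql.conf and pg_hba.conf to the standby,
so at this point both files are identical on the two servers.
If the primary runs in archive mode, the standby needs the same archive directory.
6. The standby can now be started with pg_ctl start
At this point a standby.signal file has been generated on the standby, and the connection information for streaming replication from the primary has been configured for us under $PGDATA: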
[postgres@host135 data]$ ls -1
backup_label
backup_manifest
base
current_logfiles
global
log
pg_commit_ts
pg_dynshmem
pg_hba.conf
pg_ident.conf
pg_logical
pg_multixact
pg_notify
pg_replslot
pg_serial
pg_snapshots
pg_stat
pg_stat_tmp
pg_subtrans
pg_tblspc
pg_twophase
PG_VERSION
pg_wal
pg_xact
postgresql.auto.conf
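postgresql.conf
standby.signal
The primary's pg_hba.conf and postgresql.conf were copied into the standby's $PGDATA, so the configuration is identical on both servers. Change whatever has to differ on the standby, for example the port.
The connection information for the primary was written to postgresql.auto.conf: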
[postgres@host135 data]$ more postgresql.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
primary_conninfo = 'user=replica password=replica channel_binding=disable host=192.168.1.134 port=5432 sslmode=disable sslcompression=0 ssl_min_protocol_version=TLSv1.2 gssencmode=disable krbsrvname=postgres target_session_attrs=any'
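If the base backup was taken without -R, the same setup can be done by hand: create a standby.signal file on the standby, then edit postgresql.conf (not postgresql.auto.conf) and set the primary's connection information there.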
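The manual route described above can be sketched as a small shell function. This is an illustration under stated assumptions, not the author's script: make_standby is a hypothetical name, and single quotes inside the connection string are not escaped.

```shell
#!/bin/sh
# make_standby PGDATA CONNINFO
# Manual counterpart of the -R setup: create an empty standby.signal and
# append primary_conninfo to postgresql.conf in PGDATA.
# Sketch only: quotes inside CONNINFO are not escaped.
make_standby() {
    pgdata=$1
    conninfo=$2
    [ -d "$pgdata" ] || { echo "no such directory: $pgdata" >&2; return 1; }
    : > "$pgdata/standby.signal"
    printf "primary_conninfo = '%s'\n" "$conninfo" >> "$pgdata/postgresql.conf"
}
```

For example: make_standby /opt/pg13/data 'user=replica password=replica host=192.168.1.134 port=5432'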
7. Start the standby
pg_ctl -D /opt/pg13/data -l /opt/pg13/log/postgres.log start
If startup fails with:
2022-10-19 10:16:25 CST [32043]: [1-1] user=,db=,app=,client=LOG: redirecting log output to logging collector process
2022-10-19 10:16:25 CST [32043]: [2-1] user=,db=,app=,client=HINT: Future log output will appear in directory "/opt/pg13/log".
2022-10-19 10:57:31 CST [3551]: [1-1] user=,db=,app=,client=FATAL: data directory "/opt/pg13/data" has invalid permissions
2022-10-19 10:57:31 CST [3551]: [2-1] user=,db=,app=,client=DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
Fix:
Correct the ownership and permissions as root
[root@host135 ~]# chown -R postgres:postgres /opt/pg13
[root@host135 ~]# chmod 0700 /opt/pg13/data
8. Check the replication status on the primary
Connect to the database: psql -h localhost -U postgres -p 5432
postgres=# select * from pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time
------+----------+---------+------------------+----------------+-----------------+-------------+-------------------------------+--------------+-----------+-----------+-----------+-----------+------------+-----------+-----------+------------+---------------+------------+-------------------------------
2197 | 16403 | replica | walreceiver | 192.168.88.130 | | 34058 | 2023-06-09 19:23:29.105932+08 | | streaming | 0/7000060 | 0/7000060 | 0/7000060 | 0/7000060 | | | | 0 | async | 2023-06-09 19:24:59.403341+08
(1 row)
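The sent_lsn and replay_lsn columns can be compared to see how far the standby is behind. An LSN such as 0/7000060 is two hexadecimal numbers, the high and low 32 bits of a byte position. The sketch below does the arithmetic in POSIX shell; the function names are hypothetical, and inside the database the built-in pg_wal_lsn_diff() gives the same difference.

```shell
#!/bin/sh
# lsn_to_bytes LSN
# Convert an LSN such as 0/7000060 (high and low 32 bits, both in hex)
# to an absolute byte position.
lsn_to_bytes() {
    hi=${1%%/*}
    lo=${1#*/}
    echo $(( (0x$hi << 32) + 0x$lo ))
}

# lsn_lag_bytes SENT REPLAYED
# Number of bytes between two LSNs, e.g. sent_lsn and replay_lsn.
lsn_lag_bytes() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}
```

With the values from the row above, lsn_lag_bytes 0/7000060 0/7000060 prints 0, meaning the standby has replayed everything that was sent.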
9. Check the processes
Standby processes
[postgres@host135 data]$ ps -ef|grep postgres
postgres 3815 1 0 10:59 ? 00:00:00 /opt/pg13/bin/postgres -D /opt/pg13/data
postgres 3816 3815 0 10:59 ? 00:00:00 postgres: logger
postgres 3817 3815 0 10:59 ? 00:00:00 postgres: startup recovering 00000001000000000000001B
postgres 3818 3815 0 10:59 ? 00:00:00 postgres: checkpointer
postgres 3819 3815 0 10:59 ? 00:00:00 postgres: background writer
postgres 3820 3815 0 10:59 ? 00:00:00 postgres: stats collector
postgres 3821 3815 0 10:59 ? 00:00:00 postgres: walreceiver streaming 0/1B000148
postgres 3864 26618 0 11:00 pts/1 00:00:00 ps -ef
postgres 3865 26618 0 11:00 pts/1 00:00:00 grep --color=auto postgres
root 26617 25114 0 09:26 pts/1 00:00:00 su - postgres
postgres 26618 26617 0 09:26 pts/1 00:00:00 -bash
Primary processes
[postgres@host134 data]$ ps -ef|grep postgres
postgres 11073 1 0 Oct18 ? 00:00:00 /opt/pg13/bin/postgres -D /opt/pg13/data
postgres 11074 11073 0 Oct18 ? 00:00:00 postgres: logger
postgres 11077 11073 0 Oct18 ? 00:00:00 postgres: checkpointer
postgres 11078 11073 0 Oct18 ? 00:00:00 postgres: background writer
postgres 11079 11073 0 Oct18 ? 00:00:00 postgres: walwriter
postgres 11080 11073 0 Oct18 ? 00:00:00 postgres: autovacuum launcher
postgres 11081 11073 0 Oct18 ? 00:00:00 postgres: archiver last was 00000001000000000000001A.00000028.backup
postgres 11082 11073 0 Oct18 ? 00:00:01 postgres: stats collector
postgres 11083 11073 0 Oct18 ? 00:00:00 postgres: logical replication launcher
postgres 11294 11073 0 Oct18 ? 00:00:00 postgres: postgres postgres 192.168.1.134(40882) idle
postgres 21407 11073 0 10:59 ? 00:00:00 postgres: walsender replica 192.168.1.135(50736) streaming 0/1B000148
Cluster state on the primary
[postgres@host134 20221021]$ pg_controldata /opt/pg13/data/| grep 'Database cluster state'
Database cluster state: in production
Cluster state on the standby
[postgres@host135 bin]$ pg_controldata /opt/pg13/data/| grep 'Database cluster state'
Database cluster state: in archive recovery
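The two pg_controldata checks above can be wrapped in a small filter that turns the state line into a role name. A minimal sketch; cluster_role is a hypothetical name, and any state other than the two shown above (for example a cleanly shut down server) is reported as unknown.

```shell
#!/bin/sh
# cluster_role
# Read pg_controldata output on stdin and print "primary", "standby" or
# "unknown", based on the "Database cluster state" line.
cluster_role() {
    state=$(sed -n 's/^Database cluster state:[[:space:]]*//p')
    case "$state" in
        "in production")       echo primary ;;
        "in archive recovery") echo standby ;;
        *)                     echo unknown ;;
    esac
}
```

For example: pg_controldata /opt/pg13/data | cluster_role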
10. Verify the data
Log in to the standby
[postgres@host135 data]$ psql -h 192.168.1.135 -U postgres
Password for user postgres:
psql (13.8)
Type "help" for help.
postgres=# \c db_test;
You are now connected to database "db_test" as user "postgres".
db_test=# select * from tb_test;
id | name | createtime | modifytime
----+-------+----------------------------+----------------------------
1 | name1 | 2022-10-18 11:32:33.649901 | 2022-10-18 11:32:33.649901
2 | name2 | 2022-10-18 11:32:33.665863 | 2022-10-18 11:32:33.665863
3 | name3 | 2022-10-18 11:32:33.691182 | 2022-10-18 11:32:33.691182
4 | name4 | 2022-10-18 11:32:33.771843 | 2022-10-18 11:32:33.771843
5 | name5 | 2022-10-18 11:32:34.496502 | 2022-10-18 11:32:34.496502
(5 rows)
Write on the primary:
[postgres@host134 data]$ psql -h 192.168.1.134 -U postgres
Password for user postgres:
psql (13.8)
Type "help" for help.
postgres=# \c db_test;
You are now connected to database "db_test" as user "postgres".
db_test=# select * from tb_test;
id | name | createtime | modifytime
----+-------+----------------------------+----------------------------
1 | name1 | 2022-10-18 11:32:33.649901 | 2022-10-18 11:32:33.649901
2 | name2 | 2022-10-18 11:32:33.665863 | 2022-10-18 11:32:33.665863
3 | name3 | 2022-10-18 11:32:33.691182 | 2022-10-18 11:32:33.691182
4 | name4 | 2022-10-18 11:32:33.771843 | 2022-10-18 11:32:33.771843
5 | name5 | 2022-10-18 11:32:34.496502 | 2022-10-18 11:32:34.496502
(5 rows)
db_test=# insert into tb_test(name) values('name6');
INSERT 0 1
Query on the standby:
[postgres@host135 data]$ psql -h 192.168.1.135 -U postgres
Password for user postgres:
psql (13.8)
Type "help" for help.
postgres=# \c db_test;
You are now connected to database "db_test" as user "postgres".
db_test=# select * from tb_test;
id | name | createtime | modifytime
----+-------+----------------------------+----------------------------
1 | name1 | 2022-10-18 11:32:33.649901 | 2022-10-18 11:32:33.649901
2 | name2 | 2022-10-18 11:32:33.665863 | 2022-10-18 11:32:33.665863
3 | name3 | 2022-10-18 11:32:33.691182 | 2022-10-18 11:32:33.691182
4 | name4 | 2022-10-18 11:32:33.771843 | 2022-10-18 11:32:33.771843
5 | name5 | 2022-10-18 11:32:34.496502 | 2022-10-18 11:32:34.496502
6 | name6 | 2022-10-19 11:04:56.543939 | 2022-10-19 11:04:56.543939
(6 rows)
Try to write on the standby (rejected, because a standby is read-only)
db_test=# insert into tb_test(name) values('name7');
ERROR: cannot execute INSERT in a read-only transaction
Try to force a WAL switch on the standby (also rejected)
db_test=# select pg_switch_wal();
ERROR: recovery is in progress
HINT: WAL control functions cannot be executed during recovery.
##################### Primary/Standby Switchover ####################
1. Stop the primary to simulate a failure
Run on 192.168.1.134
## check the status
[postgres@host134 data]$ pg_ctl -D /opt/pg13/data status
pg_ctl: server is running (PID: 24009)
/opt/pg13/bin/postgres "-D" "/opt/pg13/data"
[postgres@host134 data]$ pg_controldata /opt/pg13/data/| grep 'Database cluster state'
Database cluster state: in production
## stop the database
[postgres@host134 data]$ pg_ctl -D /opt/pg13/data -l /opt/pg13/log/postgres.log stop -m fast
waiting for server to shut down.... done
server stopped
2. Promote the standby to be the new primary and start serving clients
Run on the standby, 192.168.1.135
[postgres@host135 data]$ pg_ctl promote -D /opt/pg13/data
waiting for server to promote.... done
server promoted
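Important 1: the command that turns the standby into the new primary is pg_ctl promote.
After promotion, the startup recovering and walreceiver streaming processes no longer appear in the process list,
and a postgres: walwriter process has appeared.
Important 2: $PGDATA/standby.signal is removed automatically, which tells PostgreSQL that this server is no longer a standby but a primary.
3. Remove the primary_conninfo entry on the new primary
Run on 192.168.1.135
Clear the old replication setting, primary_conninfo, from postgresql.auto.conf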
[postgres@host135 data]$ psql -h 192.168.1.135 -U postgres -p 5432
Password for user postgres:
psql (13.8)
Type "help" for help.
postgres=# show primary_conninfo;
primary_conninfo
------------------------------------------------------------
user=replica password=replica host=192.168.1.134 port=5432
(1 row)
postgres=# alter system set primary_conninfo='';
ALTER SYSTEM
Or
alter system set primary_conninfo=default; ## removes the entry from postgresql.auto.conf; if postgresql.conf also defines the parameter, that value is read after a restart
Reload the configuration
[postgres@host135 data]$ pg_ctl -D /opt/pg13/data reload
[postgres@host135 data]$ psql -h 192.168.1.135 -U postgres -p 5432
postgres=# show primary_conninfo;
primary_conninfo
------------------
(1 row)
4. Write data on the new primary
Run on 192.168.1.135
[postgres@host135 data]$ psql -h 192.168.1.135 -U hxl -d db_test -p 5432
insert into tb_test(name) values('name9');
insert into tb_test(name) values('name10');
insert into tb_test(name) values('name11');
insert into tb_test(name) values('name12');
insert into tb_test(name) values('name13');
insert into tb_test(name) values('name14');
insert into tb_test(name) values('name15');
insert into tb_test(name) values('name16');
insert into tb_test(name) values('name17');
insert into tb_test(name) values('name18');
insert into tb_test(name) values('name19');
insert into tb_test(name) values('name20');
db_test=> select * from tb_test;
id | name | createtime | modifytime
----+--------+----------------------------+----------------------------
1 | name1 | 2022-10-18 11:32:33.649901 | 2022-10-18 11:32:33.649901
2 | name2 | 2022-10-18 11:32:33.665863 | 2022-10-18 11:32:33.665863
3 | name3 | 2022-10-18 11:32:33.691182 | 2022-10-18 11:32:33.691182
4 | name4 | 2022-10-18 11:32:33.771843 | 2022-10-18 11:32:33.771843
5 | name5 | 2022-10-18 11:32:34.496502 | 2022-10-18 11:32:34.496502
6 | name6 | 2022-10-19 11:04:56.543939 | 2022-10-19 11:04:56.543939
7 | name7 | 2022-10-19 11:25:52.236651 | 2022-10-19 11:25:52.236651
8 | name8 | 2022-10-20 09:21:51.977815 | 2022-10-20 09:21:51.977815
41 | name9 | 2022-10-20 14:22:26.326255 | 2022-10-20 14:22:26.326255
42 | name10 | 2022-10-20 14:22:26.34316 | 2022-10-20 14:22:26.34316
43 | name11 | 2022-10-20 14:22:26.359988 | 2022-10-20 14:22:26.359988
44 | name12 | 2022-10-20 14:22:26.433694 | 2022-10-20 14:22:26.433694
45 | name13 | 2022-10-20 14:22:26.451945 | 2022-10-20 14:22:26.451945
46 | name14 | 2022-10-20 14:22:26.469966 | 2022-10-20 14:22:26.469966
47 | name15 | 2022-10-20 14:22:26.482091 | 2022-10-20 14:22:26.482091
48 | name16 | 2022-10-20 14:22:26.498319 | 2022-10-20 14:22:26.498319
49 | name17 | 2022-10-20 14:22:26.524554 | 2022-10-20 14:22:26.524554
50 | name18 | 2022-10-20 14:22:26.555449 | 2022-10-20 14:22:26.555449
51 | name19 | 2022-10-20 14:22:26.591774 | 2022-10-20 14:22:26.591774
52 | name20 | 2022-10-20 14:22:27.587955 | 2022-10-20 14:22:27.587955
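5. Edit pg_hba.conf on the new primary
Run on 192.168.1.135
Edit $PGDATA/pg_hba.conf on the new primary (the former standby, 192.168.1.135) and add an entry that lets the new standby (the former primary, 192.168.1.134) connect as the replica user.
vi /opt/pg13/data/pg_hba.conf
host replication replica 192.168.1.134/32 md5
If access was originally granted to the whole subnet, as below, no change is needed:
host replication replica 192.168.1.0/24 md5
Changing pg_hba.conf does not require a restart; a reload is enough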
[postgres@host135 data]$ pg_ctl -D /opt/pg13/data reload
server signaled
6. Create $PGDATA/standby.signal on the old primary
Run on 192.168.1.134
[postgres@host134 data]$ cd /opt/pg13/data
[postgres@host134 data]$ touch standby.signal
[postgres@host134 data]$ pwd
/opt/pg13/data
[postgres@host134 data]$ ll standby.signal
-rw-rw-r-- 1 postgres postgres 0 Oct 20 14:27 standby.signal
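Note: this step is critical. Without this file, the old primary comes back up as a new, independent primary as soon as it is restarted, outside the replication setup.
7. On the old primary, add the replication entry to $PGDATA/postgresql.conf
Run on 192.168.1.134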
[postgres@host134 data]$ vi postgresql.conf
Add the following entry:
primary_conninfo='user=replica password=replica host=192.168.1.135 port=5432'
8. Start the old primary as the new standby
Run on 192.168.1.134
[postgres@host134 data]$ pg_ctl -D /opt/pg13/data -l /opt/pg13/log/postgres.log start
[postgres@host134 data]$ ps -ef|grep postgres
postgres 6975 1 2 15:34 ? 00:00:00 /opt/pg13/bin/postgres -D /opt/pg13/data
postgres 6976 6975 0 15:34 ? 00:00:00 postgres: logger
postgres 6977 6975 0 15:34 ? 00:00:00 postgres: startup recovering 000000010000000000000007
postgres 6979 6975 0 15:34 ? 00:00:00 postgres: checkpointer
postgres 6980 6975 0 15:34 ? 00:00:00 postgres: background writer
postgres 6981 6975 0 15:34 ? 00:00:00 postgres: stats collector
postgres 6982 6975 0 15:34 ? 00:00:00 postgres: walreceiver idle
The walreceiver process shows idle, which means the old primary has not managed to join the cluster as a standby. Check the error log:
[postgres@host134 log]$ pwd
/opt/pg13/log
[postgres@host134 log]$ tail -2f postgresql-2022-10-21.log
2022-10-21 15:36:39 CST [6982]: [25-1] user=,db=,app=,client=LOG: primary server contains no more WAL on requested timeline 1
2022-10-21 15:36:39 CST [6977]: [28-1] user=,db=,app=,client=LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/70000A0
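Fix: the old primary's WAL has gone past the point where the new primary forked onto timeline 2, so stop it and resynchronize it with pg_rewind: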
[postgres@host134 pg13]$ pg_ctl -D /opt/pg13/data -l /opt/pg13/log/postgres.log stop -m fast
waiting for server to shut down.... done
server stopped
[postgres@host134 pg13]$ pg_rewind -D /opt/pg13/data --source-server='host=192.168.1.135 port=5432 user=postgres dbname=postgres password=postgres'
pg_rewind: servers diverged at WAL location 0/7000000 on timeline 1
pg_rewind: error: could not open file "/opt/pg13/data/pg_wal/000000010000000000000006": No such file or directory
pg_rewind: fatal: could not find previous WAL record at 0/6000410
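pg_rewind reports that WAL segment 000000010000000000000006 is missing. Copy the missing segment from the archive into pg_wal; if another segment is then reported missing, copy that one as well and retry.
[postgres@host134 20221021]$ pwd
/opt/pg13/archivelog/20221021
[postgres@host134 20221021]$ cp 000000010000000000000006 /opt/pg13/data/pg_wal/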
[postgres@host134 20221021]$ pg_rewind -D /opt/pg13/data --source-server='host=192.168.1.135 port=5432 user=postgres dbname=postgres password=postgres'
pg_rewind: servers diverged at WAL location 0/7000000 on timeline 1
pg_rewind: rewinding from last common checkpoint at 0/5000060 on timeline 1
pg_rewind: Done!
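Copying missing segments by hand, as done above, can be wrapped in a small helper so it is easy to repeat when pg_rewind asks for another segment. This is a sketch with a hypothetical function name, fetch_wal_segment. PostgreSQL 13's pg_rewind also has a -c (--restore-target-wal) option that fetches such segments through the target's restore_command; it is not used in this walkthrough.

```shell
#!/bin/sh
# fetch_wal_segment ARCHIVE_DIR WAL_DIR SEGMENT
# Copy SEGMENT from ARCHIVE_DIR into WAL_DIR unless it is already there.
# Returns 1 if the segment is not in the archive directory either.
fetch_wal_segment() {
    src=$1/$3
    dst=$2/$3
    [ -f "$dst" ] && return 0
    [ -f "$src" ] || { echo "segment $3 not found in $1" >&2; return 1; }
    cp -p "$src" "$dst"
}
```

For example: fetch_wal_segment /opt/pg13/archivelog/20221021 /opt/pg13/data/pg_wal 000000010000000000000006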
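pg_rewind also copies the new primary's postgresql.auto.conf and postgresql.conf over the local files, so primary_conninfo in postgresql.conf has to be set again; adjust other parameters as needed
9. On the old primary, edit $PGDATA/postgresql.conf
Run on 192.168.1.134
Add the entry again after pg_rewind. If pg_rewind was not needed, the entry from step 7 is already in place and this step can be skipped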
[postgres@host134 data]$ vi postgresql.conf
Add the following entry:
primary_conninfo='user=replica password=replica host=192.168.1.135 port=5432'
10. Recreate the standby.signal file
standby.signal is gone after pg_rewind and has to be created again
[postgres@host134 data]$ cd /opt/pg13/data
[postgres@host134 data]$ touch standby.signal
11. Start the new standby again
[postgres@host134 data]$ pg_ctl -D /opt/pg13/data -l /opt/pg13/log/postgres.log start
12. Verify the data
New standby
psql -h 192.168.1.134 -U hxl -d db_test -p 5432
New primary
psql -h 192.168.1.135 -U hxl -d db_test -p 5432