postgresql pg_rewind 类似oracle的flashback+基于scn的恢复

最新推荐文章于 2024-06-27 16:12:56 发布

jnrjian

最新推荐文章于 2024-06-27 16:12:56 发布

阅读量529

点赞数 1

文章标签： oracle 数据库

本文链接：https://blog.csdn.net/jnrjian/article/details/129880125

版权

PostgreSQL 在操作的过程中，如果利用物理复制的过程中，另一台从库，或者主库由于某些原因，不再与主库同步，或者主库crash 起不来了，怎么办，如果在利用现在的主库或备库，弄出一个 twins 。

其实PG 早就想到这个问题了，PG有一个独特的命令 pg_rewind 可以帮助你，再造一个你。

我们看看pg_rewind能帮我们什么

pg_rewind 的工作原理有点类似rsync，它可以无缝的读取源目录与目的目录之间不同的数据块，而重复的数据块将不再被读取。这样的方式其实对于上面的问题是一个好的解决方案，因为如果主从复制，任何一方坏了，使用PG_REWIND 可以快速将你认为的数据完全的一方的数据同步到另一方，而不用做全量复制，这样最大的好处就是节省了时间。

当然如果大概率知道checksum的（包括MYSQL的binlog checksum ）大多可以想到，怎么知道这两边的数据是否一致，必须的校验块，postgresql 如果要使用pg_rewind 功能需要你做以下的一些设置

1 full_page_writes = on

2 wal_log_hints = on

3 hot_standby = on

4 如果你在初始化数据库集群（postgresql 单机也叫数据库集群，别和真正的集群的含义搞混）做了data checksum 也是可以的。

其主要的工作原理，在目的集群中对比源，与目的端之间的不同点，就是什么时候两个服务器之间的数据开始不同步的。通过知道这些不同点开始进行

1 使用文件系统的方式进行拷贝

2 使用libpq 建立连接的方式将数据进行拷贝

在拷贝数据文件的以外还需要拷贝事务提交的文件，pg_xact 以及配置文件等等。生成backup label 文件，并且指定开始要恢复的 wal 日志点，并应用恢复点以后的日志，并且还要刷新 pg_control 文件（在设置了检查点并刷新日志之后，检查点的位置将保存在文件pg_control中）,最后执行initdb -S 将数据刷入到磁盘后，关闭。

问题1 ，PG_REWIND 怎么识别两台PG 是曾经为primary 和 standby的管理

其实就是通过 database system identifier 来鉴别，同样的主从的 database system identifier 的编码是一致的。同时也要看version 与 catalog version number 是否一致。

而关于bakcup label 其中包含了check point ，而后续的复制也是要依靠这个check point 点目的集群就可以不断应用源集群从CHECKPOINT LOCATION 之后的WAL 日志。

当然其中的原理不光如此，下面就开始做一个实验看看pg_rewind的强大的功能。

首先下面有两台PG ， 192.168.198.120 主库 192.168.198.176 从库

通过pg_basebackup 进行数据同步后，在 192.168.198.120 上在进行相关的一些建库，曾表，插入数据的事情，看看PG_REWIND 是否可以进行相关的数据同步

pg_basebackup 命令就不在讲了，默认大家都会了，不会的可以百度，或者看我之前的关于这方面的东西。

1 下面的两个服务器的数据已经是一致的，通过pg_basebackup 进行的复制

2 对176 进行promote 操作，模拟主库失效，从库接替主库，此时主库和从库之间不再有任何关系（需要修改176 的 postgresql 的监听地址，这点是基础就不再提了）

4 我们在176上进行创建一个pg_rewind库的操作，此时两个库已经数据不一致了

最后在120 上执行

pg_rewind --target-pgdata=/pgdata/data --source-server='host=192.168.198.176 port=5432 user=repl dbname=postgres password=repl' -P --debug

将数据追齐后，修改120 的配置文件，模拟失效的主库和已经提升为主库的“从库” 数据已经一致了

到这里本来就完事了，但实际上有些评论说pg_rewind 是可以做数据同步的，我是比较感兴趣的，到底 pg_rewind 可不可以做数据同步。

首先我们要确认几点，在达到共识的基础上才能继续下面的工作

1 即使是数据同步，也必须是在之前两个节点之间的关系是主从关系，或者至少是有关系，如果两个节点之间没有任何关系，则这个工作是没有办法做的。

2 数据同步是否需要 promote 操作作为前提，或者直接去修改 recovery.conf 文件

测试1 ：

进行数据同步，然后将从库关闭，将recover.conf 变为 recover.done，然后重启动从库，在关闭，变更主库的数据，使用pg_rewind进行数据同步

结果失败，通过失败可以证明，如果主从失败后，想直接通过提升当前从库的方法，在通过 pg_rewind 进行数据同步的想法可以凉快去了

测试2 同步长时间的主库已经和原来的从库（从库已经提升为主库）的数据是否可行，这里的长时间其实也是看数据量，下面的情况就是报找不到pg_wal 文件，这边可以尝试从原来的从库上拷贝缺少的pg_wal 或者开启 archive 等方式保证你的pg_wal 是充足的。

图中源主库报没有00000004000000000000003F 文件，找到原来的从库，现在的主库将这个文件拷贝到原来的主库，则pg_rewind 正常工作。

所以相关的pg_wal 文件也要留存好，最好是有archive 来做数据恢复的后盾。

总结，某些帖子中假想的想通过pg_rewind 来进行数据库之间的复制的想法是不可以的，同时pg_wal 中的日志数据应该进行archive 以防止出现找不到pg_wal 的问题。

pg_rewind

一、pg_rewind介绍

pg_rewind功能：
流复制主备数据库间数据目录同步工具。
pg_rewind优点：
仅复制产生变化的数据块和一些文件：新数据文件、配置文件、WAL segments。
基本原理：
pg_rewind检查源和目标集群的时间线历史以确定它们的分歧点，并在目标集群的pg_wal目录中找到WAL，一直到达分歧点。分歧点可以在目标时间线、源时间线或它们的共同祖先上找到。
在典型的故障转移场景中，目标集群在产生分歧后很快被关闭，这不是问题，但如果目标集群在生生分歧后运行很长时间，它的旧WAL文件可能不再存在。在这种情况下，您可以手动将它们从WAL归档目录复制到pg_wal目录，或者使用 -c 选项运行pg_rewind以自动从WAL归档中目录中检索它们。
**pg_rewind 的使用不限于故障转移，**例如，可以提升备用服务器主库，运行一些写入事务，然后重新回滚再次成为备用服务器。
运行pg_rewind后，需要完成WAL重放以使数据目录处于一致状态。当目标服务器再次启动时，它将进入归档恢复，并从分歧点之前的最后一个检查点重放源服务器中生成的所有 WAL。如果pg_rewind运行时源服务器中的某些WAL不再可用，因此无法被 pg_rewind 会话复制，则必须在目标服务器启动时使其可用。这可以通过在目标数据目录中创建 recovery.signal 文件并在 postgresql.conf 中配置合适的 restore_command 来完成。
pg_rewind 要求目标服务器在 postgresql.conf 中启用 wal_log_hints 选项或在使用 initdb 初始化集群时启用数据校验和。默认情况下，这些当前均未启用。full_page_writes 也必须设置为 on，但默认情况下启用。
使用pg_rewind命令前提条件：
postgresql.conf
wal_log_hints = on # default:off 或者initdb初始化集群时启用数据校验
full_page_writes = on # default:on

注意点：

如果pg_rewind在处理过程中失败，那么目标的数据目录很可能不在可以恢复的状态。在这种情况下，建议进行新的全新备份。
由于pg_rewind完全从源复制配置文件，可能需要在重新启动目标服务器之前更正用于恢复的配置，特别是如果目标被重新引入作为源的备用服务器。如果在倒带操作完成后重新启动服务器但未配置恢复，则目标可能会再次与主服务器分离。
如果pg_rewind找到无法直接写入的文件，它将立即失败。例如，当源服务器和目标服务器对只读 SSL 密钥和证书使用相同的文件映射时，就会发生这种情况。如果目标服务器上存在此类文件，建议在运行 pg_rewind 之前将其删除。执行倒带后，其中一些文件可能已从源复制，在这种情况下，可能需要删除复制的数据并恢复回倒前使用的链接集。

二、pg_rewind参数

-D directory --target-pgdata=directory
此选项指定与源同步的目标数据目录。在运行 pg_rewind 之前必须彻底关闭目标服务器

–source-pgdata=directory
指定要与目标同步的源服务器数据目录的文件系统路径。此选项要求完全关闭源服务器。

–source-server=connstr
指定一个 libpq 连接字符串以连接到源 PostgreSQL 服务器以与目标同步。连接必须是普通（非复制）连接，角色有足够的权限来执行源服务器上pg_rewind使用的功能或超级用户角色。此选项要求源服务器正在运行并接受连接。

-n --dry-run
除了实际修改目标目录之外，做任何事情。

-N --no-sync
默认情况下，pg_rewind将等待所有文件安全写入磁盘。此选项会导致 pg_rewind 无需等待就返回，这更快，但意味着随后的操作系统崩溃可能会使同步数据目录损坏。通常，此选项对测试很有用，但不应用于生产安装。

-P --progress
启用进度报告。启用此选项将在从源集群复制数据时提供大致进度报告。

-c --restore-target-wal
如果这些文件在 pg_wal 目录中不再可用，请使用目标集群配置中定义的 restore_command 从 WAL 归档中检索 WAL 文件。

–debug
打印对开发人员调试 pg_rewind 最有用的详细调试输出。

–no-ensure-shutdown
pg_rewind 要求目标服务器在倒带之前彻底关闭。默认情况下，如果目标服务器没有完全关闭，pg_rewind 会以单用户模式启动目标服务器，先完成崩溃恢复，然后停止它。
通过传递这个选项，如果服务器没有完全关闭，pg_rewind 会跳过这个并立即出错。在这种情况下，用户应该自己处理这种情况。

三、执行pg_rewind所需权限

当使用在线集群作为源执行 pg_rewind 时，可以使用源集群上具有足够权限来执行 pg_rewind 的角色来代替超级用户。下面是如何创建这样一个角色，这里命名为 rewind_user：

CREATE USER rewind_user LOGIN;
GRANT EXECUTE ON function pg_catalog.pg_ls_dir(text, boolean, boolean) TO rewind_user;
GRANT EXECUTE ON function pg_catalog.pg_stat_file(text, boolean) TO rewind_user;
GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text) TO rewind_user;
GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, boolean) TO rewind_user;

当在完成主备切换的老主库上执行 pg_rewind 时，需要蝗前在新主库上执行 CHECKPOINT 使其控制文件反映最新的时间线信息，pg_rewind 使用它来检查目标集群是否可以使用指定的源簇重绕。

四、pg_rewind如何工作

基本思路是将所文件系统级别的更改从源集群复制到目标集群。

以源集簇的历史时间线从目标集簇分叉出来的点之前的最后一个检查点为起点，扫描目标集簇的 WAL 日志。对于每一个 WAL 记录，读取每一个被动过的数据块。这会得到在目标集簇中从源集簇被分支出去以后所有被更改过的数据块列表。
使用直接的文件系统访问（–source-pgdata）或者 SQL （–source-server），把所有那些更改过的块从源集簇拷贝到目标集簇。
把所有其他诸如pg_xact和配置文件（除了关系文件之外所有的东西）从源集簇拷贝到目标集簇。与基础备份类似，在从源集簇拷贝的数据中，目录pg_dynshmem/、pg_notify/、pg_replslot/、pg_serial/、pg_snapshots/、pg_stat_tmp/以及pg_subtrans/的内容会被忽略。任何以pgsql_tmp开始的文件或目录都会被忽略，backup_label、tablespace_map、pg_internal.init、postmaster.opts以及postmaster.pid也是这样。
从源集簇应用 WAL，从失效处创建的检查点开始（严格来说，pg_rewind并不应用 WAL，它只是创建一个备份标签文件，该文件让PostgreSQL从那个检查点开始向前重放所有 WAL）。

五、测试

主库	192.168.150.161	pg01
备库	192.168.150.163	pg03

创建主库：
initdb -D /pgdata/12/data -U postgres
alter user postgres with password ‘123456’;

pg_hba.conf添加：
host replication all 192.168.150.163/32 trust
host replication all 192.168.150.161/32 trust

postgresql.conf
wal_level = replica
synchronous_commit = on
full_page_writes = on
wal_log_hints = on
min_wal_size = 800MB
archive_mode = on
archive_command = ‘test ! -f /pgdata/12/archive/%f && cp %p /pgdata/12/archive/%f’
max_wal_senders = 10
hot_standby = on
primary_conninfo = ‘user=postgres passfile=’’/home/postgres/.pgpass’’ host=192.168.150.163 port=1921 sslmode=disable sslcompression=0 gssencmode=disable krbsrvname=postgres target_session_attrs=any’

启动主库：
pg_ctl -D /pgdata/12/data start

备库：
pg_basebackup -h 192.168.150.161 -p 1921 -U postgres -F p -P -X stream -R -D /pgdata/12/data -l backup20211027_02

-F —format 指定备份输出格式 p plain 表示把主库中各个数据文件、配置文件、目录结构都完全一样的地写到备份目录中； t tar 表示把备库文件打包到一个tar文件中
-P --progress 输出备份进度条
-X --xlog-method= 取值为f fetch 时与参数-x 含义一样。取s stream时表示备份开始后，启动另一个流复制进程接收WAL日志，此方式需要主库max_wal_senders参数大于或等于2。
-R --write-recovery-conf 是否生成recovery.conf文件 12版本中是standby.signal文件
-l 表示备份标识，类似select pg_start_backup(‘lable’)
-U 表示连接用户

postgresql.auto.conf自动生成下列信息，不需手功配置
primary_conninfo = ‘user=postgres passfile=’’/home/postgres/.pgpass’’ host=192.168.150.161 port=1921 sslmode=disable sslcompression=0 gssencmode=disable krbsrvname=postgres target_session_attrs=any’

启动备库：
pg_ctl -D /pgdata/12/data/start

postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 482735
usesysid         | 10
usename          | postgres
application_name | walreceiver
client_addr      | 192.168.150.161
client_hostname  | 
client_port      | 42286
backend_start    | 2021-10-27 21:33:59.122261+08
backend_xmin     | 
state            | streaming
sent_lsn         | 0/10000448
write_lsn        | 0/10000448
flush_lsn        | 0/10000448
replay_lsn       | 0/10000448
write_lag        | 
flush_lag        | 
replay_lag       | 
sync_priority    | 0
sync_state       | async
reply_time       | 2021-10-27 21:52:41.704105+08

主备切换：
备库执行切换命令pg_ctl promote

[postgres@pg03 data]$ pg_ctl promote -D /pgdata/12/data
waiting for server to promote....2021-10-29 09:38:32.722 CST [779689] LOG:  received promote request
2021-10-29 09:38:32.722 CST [779693] FATAL:  terminating walreceiver process due to administrator command
2021-10-29 09:38:32.722 CST [779689] LOG:  invalid record length at 0/10021400: wanted 24, got 0
2021-10-29 09:38:32.722 CST [779689] LOG:  redo done at 0/100213C8
2021-10-29 09:38:32.722 CST [779689] LOG:  last completed transaction was at log time 2021-10-29 09:34:36.921164+08
2021-10-29 09:38:32.723 CST [779689] LOG:  selected new timeline ID: 6
2021-10-29 09:38:32.761 CST [779689] LOG:  archive recovery complete
2021-10-29 09:38:32.765 CST [779688] LOG:  database system is ready to accept connections
 done
server promoted

备库已成为主库：

[postgres@pg03 data]$ psql -p 1921
psql (12.8)
Type "help" for help.

postgres=# select pg_is_in_recovery();
 pg_is_in_recovery 
-------------------
 f
(1 row)

原主库执行pg_rewind命令：
问题：

[postgres@pg01 data]$ pg_rewind -D /pgdata/12/data --source-server='host=192.168.150.163  port=1921 user=postgres password=123456' -P
pg_rewind: connected to server
pg_rewind: fatal: target server must be shut down cleanly

出现以上问题，需要先将老主为关闭：

[postgres@pg01 data]$ pg_ctl -D /pgdata/12/data stop
waiting for server to shut down....2021-10-29 09:42:30.879 CST [99487] LOG:  received fast shutdown request
2021-10-29 09:42:30.881 CST [99487] LOG:  aborting any active transactions
2021-10-29 09:42:30.881 CST [99487] LOG:  background worker "logical replication launcher" (PID 103645) exited with exit code 1
2021-10-29 09:42:30.882 CST [99489] LOG:  shutting down
2021-10-29 09:42:31.000 CST [99487] LOG:  database system is shut down
 done
server stopped

再次执行：pg_rewind命令：

[postgres@pg01 data]$ pg_rewind -D /pgdata/12/data --source-server='host=192.168.150.163  port=1921 user=postgres password=123456' -P
pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 0/10021400 on timeline 5
pg_rewind: rewinding from last common checkpoint at 0/10000650 on timeline 5
pg_rewind: reading source file list
pg_rewind: reading target file list
pg_rewind: reading WAL in target
pg_rewind: need to copy 132 MB (total source directory size is 151 MB)
135219/135219 kB (100%) copied
pg_rewind: creating backup label and updating control file
pg_rewind: syncing target data directory
pg_rewind: Done!

原主库生成表示备库文件：standby.signal
修改配置文件postgresql.auto.conf中primary_conninfo参数，指向新的主库

primary_conninfo = 'user=postgres passfile=''/home/postgres/.pgpass'' host=192.168.150.163 port=1921 sslmode=disable sslcompression=0 gssencmode=disable krbsrvname=postg
res target_session_attrs=any'

启动备库：

[postgres@pg01 data]$ pg_ctl -D /pgdata/12/data start
waiting for server to start....2021-10-29 09:45:46.563 CST [105924] LOG:  starting PostgreSQL 12.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36), 64-bit
2021-10-29 09:45:46.563 CST [105924] LOG:  listening on IPv4 address "0.0.0.0", port 1921
2021-10-29 09:45:46.563 CST [105924] LOG:  listening on IPv6 address "::", port 1921
2021-10-29 09:45:46.564 CST [105924] LOG:  listening on Unix socket "/tmp/.s.PGSQL.1921"
2021-10-29 09:45:46.573 CST [105925] LOG:  database system was interrupted while in recovery at log time 2021-10-29 09:38:32 CST
2021-10-29 09:45:46.573 CST [105925] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2021-10-29 09:45:46.579 CST [105925] LOG:  entering standby mode
2021-10-29 09:45:46.580 CST [105925] LOG:  redo starts at 0/10000618
2021-10-29 09:45:46.582 CST [105925] LOG:  consistent recovery state reached at 0/1003EC08
2021-10-29 09:45:46.582 CST [105925] LOG:  invalid record length at 0/1003EC08: wanted 24, got 0
2021-10-29 09:45:46.582 CST [105924] LOG:  database system is ready to accept read only connections
2021-10-29 09:45:46.587 CST [105929] LOG:  started streaming WAL from primary at 0/10000000 on timeline 6
 done
server started

验证是否为备库：

[postgres@pg01 data]$ psql -p 1921
psql (12.8)
Type "help" for help.

postgres@[local]:1921=#select pg_is_in_recovery();
 pg_is_in_recovery 
-------------------
 t
(1 row)

postgres 105924      1  0 09:45 ?        00:00:00 /opt/pg12/bin/postgres -D /pgdata/12/data
postgres 105925 105924  0 09:45 ?        00:00:00 postgres: startup   recovering 000000060000000000000010
postgres 105926 105924  0 09:45 ?        00:00:00 postgres: checkpointer   
postgres 105927 105924  0 09:45 ?        00:00:00 postgres: background writer   
postgres 105928 105924  0 09:45 ?        00:00:00 postgres: stats collector   
postgres 105929 105924  0 09:45 ?        00:00:00 postgres: walreceiver   streaming 0/1003F998

验证数据库是否传输，主库插入数据，查看是否同步到备库：
主库：

postgres=# select * from test;
 a | b 
---+---
 2 | 2
 3 | 3
 4 | 4
(3 rows)

postgres=# **insert into test values (5,5);**
INSERT 0 1
postgres=# select * from test;
 a | b 
---+---
 2 | 2
 3 | 3
 4 | 4
 5 | 5
(4 rows)

postgres=#

备库：

postgres@[local]:1921=#select * from test;
 a | b 
---+---
 2 | 2
 3 | 3
 4 | 4
** 5 | 5**
(4 rows)

postgres@[local]:1921=#

到此主库切换命令pg_rewind解释完成。

repmgr 出现切换后，原主库停止服务。手动启动原主库会导致数据库集群状态异常，变为双主。

解决方案
原主库（192.168.80.221）查询集群状态，发现备库1（192.168.80.222）已经成为新的主库，但是原主库（192.168.80.221）仍然以主库的模式启动，这时候集群就出现了双主的问题。检查后发现，备库2（192.168.80.236）已经连接到了新主库（192.168.80.222），此时可将原主库（192.168.80.221）重新加入集群作为备库运行。

WARNING: following issues were detected
- node "192.168.80.222" (ID: 2) is registered as standby but running as primary
[highgo@localhost HighGo5.6.5-cluster]$ pg_ctl stop -m f
等待服务器进程关闭 ..... 完成
服务器进程已经关闭
1

测试是否可以将本节点重新加入集群：

[highgo@localhost HighGo5.6.5-cluster]$ repmgr node rejoin -d 'host=192.168.80.222 dbname=highgo user=highgo' --force-rewind --config-files=postgresql.conf --verbose --dry-run
INFO: looking for configuration file in "/opt/HighGo5.6.5-cluster"
INFO: configuration file found at: "/opt/HighGo5.6.5-cluster/conf/hg_repmgr.conf"
DEBUG: set_config():
SET synchronous_commit TO 'local'
DEBUG: get_primary_node_id():
SELECT node_id FROM repmgr.nodes WHERE type = 'primary' AND active IS TRUE
DEBUG: get_node_record():
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name FROM repmgr.nodes n WHERE n.node_id = 2
DEBUG: connecting to: "user=highgo connect_timeout=2 dbname=highgo host=192.168.80.222 port=5866 fallback_application_name=repmgr"
DEBUG: set_config():
SET synchronous_commit TO 'local'
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: replication connection to the rejoin target node was successful
INFO: local and rejoin target system identifiers match
DETAIL: system identifier is 6823598896860049498
DEBUG: local timeline: 25; rejoin target timeline: 26
DEBUG: get_timeline_history():
TIMELINE_HISTORY 26
DEBUG: local tli: 25; local_xlogpos: 2/DE000028; follow_target_history->tli: 25; follow_target_history->end: 2/D2013F28
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 26 forked off current database system timeline 25 before current recovery point 2/DE000028
DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings WHERE name = 'full_page_writes' AND setting = 'off'
DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings WHERE name = 'wal_log_hints' AND setting = 'on'
INFO: prerequisites for using pg_rewind are met
DEBUG: using archive directory "/tmp/repmgr-config-archive-192.168.80.221"
INFO: temporary archive directory "/tmp/repmgr-config-archive-192.168.80.221" created
INFO: file "postgresql.conf" would be copied to "/tmp/repmgr-config-archive-192.168.80.221/postgresql.conf"
INFO: 1 files would have been copied to "/tmp/repmgr-config-archive-192.168.80.221"
INFO: temporary archive directory "/tmp/repmgr-config-archive-192.168.80.221" deleted
INFO: pg_rewind would now be executed
DETAIL: pg_rewind command is:
/opt/HighGo5.6.5-cluster/bin/pg_rewind -D '/opt/HighGo5.6.5-cluster/data' --source-server='host=192.168.80.222 user=highgo dbname=highgo port=5866 connect_timeout=2'
INFO: prerequisites for executing NODE REJOIN are met
1

正式将节点重新加入集群：

[highgo@localhost HighGo5.6.5-cluster]$ repmgr node rejoin -d 'host=192.168.80.222 dbname=highgo user=highgo' --force-rewind --config-files=postgresql.conf --verbose
INFO: looking for configuration file in "/opt/HighGo5.6.5-cluster"
INFO: configuration file found at: "/opt/HighGo5.6.5-cluster/conf/hg_repmgr.conf"
DEBUG: set_config():
SET synchronous_commit TO 'local'
DEBUG: get_primary_node_id():
SELECT node_id FROM repmgr.nodes WHERE type = 'primary' AND active IS TRUE
DEBUG: get_node_record():
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name FROM repmgr.nodes n WHERE n.node_id = 2
DEBUG: connecting to: "user=highgo connect_timeout=2 dbname=highgo host=192.168.80.222 port=5866 fallback_application_name=repmgr"
DEBUG: set_config():
SET synchronous_commit TO 'local'
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: local timeline: 25; rejoin target timeline: 26
DEBUG: get_timeline_history():
TIMELINE_HISTORY 26
DEBUG: local tli: 25; local_xlogpos: 2/DE000028; follow_target_history->tli: 25; follow_target_history->end: 2/D2013F28
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 26 forked off current database system timeline 25 before current recovery point 2/DE000028
DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings WHERE name = 'full_page_writes' AND setting = 'off'
DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings WHERE name = 'wal_log_hints' AND setting = 'on'
INFO: prerequisites for using pg_rewind are met
DEBUG: using archive directory "/tmp/repmgr-config-archive-192.168.80.221"
DEBUG: copying "postgresql.conf" to "/tmp/repmgr-config-archive-192.168.80.221/postgresql.conf"
INFO: 1 files copied to "/tmp/repmgr-config-archive-192.168.80.221"
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/opt/HighGo5.6.5-cluster/bin/pg_rewind -D '/opt/HighGo5.6.5-cluster/data' --source-server='host=192.168.80.222 user=highgo dbname=highgo port=5866 connect_timeout=2'"
DEBUG: executing:
/opt/HighGo5.6.5-cluster/bin/pg_rewind -D '/opt/HighGo5.6.5-cluster/data' --source-server='host=192.168.80.222 user=highgo dbname=highgo port=5866 connect_timeout=2'
DEBUG: result of command was 0 (13)
DEBUG: local_command(): output returned was:
servers diverged at WAL location 2/D2013F28 on timeline 25

DEBUG: using archive directory "/tmp/repmgr-config-archive-192.168.80.221"
DEBUG: copying "/tmp/repmgr-config-archive-192.168.80.221/postgresql.conf" to "/opt/HighGo5.6.5-cluster/data/postgresql.conf"
NOTICE: 1 files copied to /opt/HighGo5.6.5-cluster/data
INFO: directory "/tmp/repmgr-config-archive-192.168.80.221" deleted
INFO: deleting "recovery.done"
DEBUG: get_node_record():
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name FROM repmgr.nodes n WHERE n.node_id = 1
NOTICE: setting node 1's upstream to node 2
DEBUG: create_recovery_file(): creating "/opt/HighGo5.6.5-cluster/data/recovery.conf"...
DEBUG: recovery file is:
standby_mode = 'on'
primary_conninfo = 'user=highgo connect_timeout=2 host=192.168.80.222 port=5866 application_name=192.168.80.221'
recovery_target_timeline = 'latest'

DEBUG: is_server_available(): ping status for "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2" is PQPING_NO_RESPONSE
WARNING: unable to ping "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/opt/HighGo5.6.5-cluster/bin/pg_ctl -w -D '/opt/HighGo5.6.5-cluster/data' start"
DEBUG: executing:
/opt/HighGo5.6.5-cluster/bin/pg_ctl -w -D '/opt/HighGo5.6.5-cluster/data' start
DEBUG: result of command was 0 (13)
DEBUG: local_command(): output returned was:
等待服务器进程启动 ....2020-06-25 09:41:54.523 CST [5308] WARNING: 01000: gdb version should large than 7.10

DEBUG: update_node_record_status():
UPDATE repmgr.nodes SET type = 'standby', upstream_node_id = 2, active = TRUE WHERE node_id = 1
DEBUG: is_server_available(): ping status for "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2" is PQPING_REJECT
WARNING: unable to ping "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2"
DETAIL: PQping() returned "PQPING_REJECT"
INFO: waiting for node 1 to respond to pings; 1 of max 60 attempts
DEBUG: is_server_available(): ping status for "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2" is PQPING_REJECT
WARNING: unable to ping "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2"
DETAIL: PQping() returned "PQPING_REJECT"
DEBUG: sleeping 1 second waiting for node 1 to respond to pings; 2 of max 60 attempts
DEBUG: is_server_available(): ping status for "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2" is PQPING_REJECT
WARNING: unable to ping "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2"
DETAIL: PQping() returned "PQPING_REJECT"
DEBUG: sleeping 1 second waiting for node 1 to respond to pings; 3 of max 60 attempts
DEBUG: is_server_available(): ping status for "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2" is PQPING_OK
INFO: demoted primary is pingable
INFO: node 1 has attached to its upstream node
DEBUG: _create_event(): event is "node_rejoin" for node 1
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: _create_event():
INSERT INTO repmgr.events ( node_id, event, successful, details ) VALUES ($1, $2, $3, $4) RETURNING event_timestamp
DEBUG: _create_event(): Event timestamp is "2020-06-30 10:15:47.522104+08"
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2

重新查询集群状态，集群状态正常。

jnrjian

关注

1
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
postgresql pg_rewind 类似oracle的flashback+基于scn的恢复

流复制主备数据库间数据目录同步工具。仅复制产生变化的数据块和一些文件：新数据文件、配置文件、WAL segments。pg_rewind检查源和目标集群的时间线历史以确定它们的分歧点，并在目标集群的pg_wal目录中找到WAL，一直到达分歧点。分歧点可以在目标时间线、源时间线或它们的共同祖先上找到。在典型的故障转移场景中，目标集群在产生分歧后很快被关闭，这不是问题，但如果目标集群在生生分歧后运行很长时间，它的旧WAL文件可能不再存在。
复制链接

扫一扫