os: ubuntu 16.04
db: postgresql 9.6.8
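The situation: the test environment has two machines in a streaming-replication master/slave setup. PostgreSQL itself provides no automatic failover, so I chose the high-availability tool patroni.
While installing patroni on the master node, I found an error in the log.
Versions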
# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.2 LTS
Release: 16.04
Codename: xenial
$ psql -c "select version();"
version
----------------------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 9.6.8 on x86_64-pc-linux-gnu (Ubuntu 9.6.8-1.pgdg16.04+1), compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit
(1 row)
Master node
2019-05-09 22:56:18.912 CST,"repl","",48094,"192.168.0.88:30199",5cd43f92.bbde,3,"streaming 671/6F000000",2019-05-09 22:56:18 CST,3/0,0,ERROR,58P01,"requested WAL segment 00000001000006710000006F has already been removed",,,,,,,,,"walreceiver"
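Judging from the error message, the slave requested WAL data and the master found that the requested WAL file no longer exists. This is a typical error, and the usual fix is to restore from archived WAL files or to rebuild the slave.
But this case is a little different.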
-rw------- 1 postgres postgres 16777216 May 9 22:56 00000001000006710000006F.partial
-rw------- 1 postgres postgres 44 May 9 22:56 00000002.history
-rw------- 1 postgres postgres 16777216 May 9 23:26 00000002000006710000006F
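The file names in this listing encode the timeline and the position of each segment, which makes it easier to see that the same segment exists on two timelines. Below is a minimal Python sketch of that decoding. It assumes the default 16 MB segment size (consistent with the 16777216-byte files listed), and the helper name parse_wal_name is made up for illustration, not part of PostgreSQL or patroni.

```python
import re

# Assumption: default 16 MB WAL segments (the listed files are 16777216 bytes).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024

# 8 hex digits each for timeline, "log" (high 32 bits of the LSN) and segment,
# optionally followed by the .partial suffix.
_WAL_NAME = re.compile(r"^([0-9A-F]{8})([0-9A-F]{8})([0-9A-F]{8})(\.partial)?$")


def parse_wal_name(name):
    """Hypothetical helper: split a WAL segment file name into its parts."""
    m = _WAL_NAME.match(name)
    if m is None:
        raise ValueError("not a WAL segment file name: %r" % name)
    tli, log, seg = (int(g, 16) for g in m.group(1, 2, 3))
    start = (log << 32) + seg * WAL_SEGMENT_SIZE
    return {
        "timeline": tli,
        "start_lsn": "%X/%X" % (start >> 32, start & 0xFFFFFFFF),
        "partial": m.group(4) is not None,
    }


if __name__ == "__main__":
    for name in ("00000001000006710000006F.partial",
                 "00000002000006710000006F"):
        print(name, parse_wal_name(name))
```

Both files start at LSN 671/6F000000, the position the slave was streaming from in the error message. They differ only in the timeline field (1 versus 2).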
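Why was timeline 2 created? Because when patroni starts, it generates recovery.conf, which puts the database into recovery mode, and then promotes it to master. The promotion switched the master to timeline 2 and left the last timeline-1 segment with a .partial suffix, which is why the slave's request for 00000001000006710000006F failed. So patroni must be installed on the master node first.
I ran an experiment locally, which produced the following files: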
$ ls -l
-rw------- 1 postgres postgres 16777216 May 13 10:59 0000001B00000000000000C1.partial
-rw------- 1 postgres postgres 1177 May 13 10:59 0000001C.history
-rw------- 1 postgres postgres 16777216 May 13 11:00 0000001C00000000000000C1
$ cp 0000001B00000000000000C1.partial /tmp/0000001B00000000000000C1
$ pg_xlogdump /tmp/0000001B00000000000000C1
rmgr: XLOG len (rec/tot): 106/ 106, tx: 0, lsn: 0/C1000028, prev 0/C0010068, desc: CHECKPOINT_SHUTDOWN redo 0/C1000028; tli 27; prev tli 27; fpw true; xid 0:633; oid 40969; multi 1; offset 0; oldest xid 579 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
pg_xlogdump: FATAL: error in WAL record at 0/C1000028: invalid record length at 0/C1000098: wanted 24, got 0
$ cat 0000001C.history
27 0/C1000098 no recovery target specified
$ pg_xlogdump 0000001C00000000000000C1
rmgr: XLOG len (rec/tot): 106/ 106, tx: 0, lsn: 0/C1000028, prev 0/C0010068, desc: CHECKPOINT_SHUTDOWN redo 0/C1000028; tli 27; prev tli 27; fpw true; xid 0:633; oid 40969; multi 1; offset 0; oldest xid 579 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
rmgr: XLOG len (rec/tot): 42/ 42, tx: 0, lsn: 0/C1000098, prev 0/C1000028, desc: END_OF_RECOVERY tli 28; prev tli 27; time 2019-05-13 10:59:07.851805 CST
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/C10000C8, prev 0/C1000098, desc: RUNNING_XACTS nextXid 633 latestCompletedXid 632 oldestRunningXid 633
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/C1000100, prev 0/C10000C8, desc: RUNNING_XACTS nextXid 633 latestCompletedXid 632 oldestRunningXid 633
rmgr: XLOG len (rec/tot): 106/ 106, tx: 0, lsn: 0/C1000138, prev 0/C1000100, desc: CHECKPOINT_ONLINE redo 0/C10000C8; tli 28; prev tli 28; fpw true; xid 0:633; oid 40969; multi 1; offset 0; oldest xid 579 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 633; online
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/C10001A8, prev 0/C1000138, desc: RUNNING_XACTS nextXid 633 latestCompletedXid 632 oldestRunningXid 633
rmgr: XLOG len (rec/tot): 51/ 575, tx: 0, lsn: 0/C10001E0, prev 0/C10001A8, desc: FPI_FOR_HINT , blkref #0: rel 1663/12439/1259 blk 0 FPW
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/C1000420, prev 0/C10001E0, desc: RUNNING_XACTS nextXid 633 latestCompletedXid 632 oldestRunningXid 633
pg_xlogdump: FATAL: error in WAL record at 0/C1000420: invalid record length at 0/C1000458: wanted 24, got 0
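Note that 0/C1000098 is both the switch point recorded in the history file and the position where the .partial segment ends. On the new timeline that position holds the END_OF_RECOVERY record, so no data is lost.
References: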
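To cross-check the history file against the segment names, the sketch below parses one history line and maps the switch point to a WAL file name. It again assumes 16 MB segments. The function names are hypothetical, and the new timeline ID 28 is taken from the file name 0000001C.history rather than from the file's content.

```python
# Assumption: default 16 MB WAL segments.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
SEGMENTS_PER_LOG = 0x100000000 // WAL_SEGMENT_SIZE  # 256 segments per "log"


def parse_lsn(text):
    """Turn 'hi/lo' hex notation into a 64-bit integer position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)


def parse_history_line(line):
    """Hypothetical helper: parent timeline, switch point, free-text reason."""
    fields = line.split(None, 2)
    reason = fields[2].strip() if len(fields) > 2 else ""
    return int(fields[0]), parse_lsn(fields[1]), reason


def wal_file_name(tli, lsn):
    """Name of the segment on timeline `tli` that contains position `lsn`."""
    segno = lsn // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (tli, segno // SEGMENTS_PER_LOG,
                             segno % SEGMENTS_PER_LOG)


if __name__ == "__main__":
    parent_tli, switch_lsn, reason = parse_history_line(
        "27 0/C1000098 no recovery target specified")
    print(wal_file_name(parent_tli, switch_lsn))  # segment on the old timeline
    print(wal_file_name(28, switch_lsn))          # segment on the new timeline
    print(hex(switch_lsn % WAL_SEGMENT_SIZE))     # offset inside the segment
```

For the experiment above, the switch point falls at offset 0x98 of segment C1 on both timelines, matching where the dump of the .partial file stops and where END_OF_RECOVERY appears in 0000001C00000000000000C1.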
https://www.postgresql.org/message-id/E1ZAr88-0008Jy-Om@gemulon.postgresql.org