pg_resetwal
在实际生产环境中使用pg_resetwal时,必须备份现有环境,否则你没有犯错误的机会!!
pg_resetwal 的使用:
当wal日志丢失一部分,或者控制文件被破坏,导致数据库无法启动
通过重置wal或者重建控制文件,让数据库恢复到最后一次的一致性状态(检查点)
pg_resetwal可以清除预写日志 (WAL)并可选地重置存储在pg_control文件中的一些其他控制信息。 如果WAL或控制文件已损坏,可能就需要此功能恢复数据。当服务器由于此类损坏而无法启动时,只能将其用作最后的手段。
运行此命令后,应该可以启动服务器,但请记住,由于部分提交的事务,数据库可能包含不一致的数据。 您应该立即转储数据、运行 initdb 并恢复。 还原后,检查不一致情况并根据需要进行修复。
此实用程序只能由安装服务器的用户运行,因为它需要对数据目录的读/写访问权限。 出于安全原因,您必须在命令行中指定数据目录。 pg_resetwal不使用环境变量PGDATA。
如果 pg_resetwal 抱怨它无法确定 pg_control 的有效数据,您可以通过指定 -f(强制)选项强制它继续进行。 在这种情况下,合理的值将替代缺失的数据。 大多数字段都可以匹配,但下一个 OID、下一个事务 ID 和epoch、下一个多事务 ID 和偏移量以及 WAL 起始位置字段可能需要人工协助。 可以使用下面讨论的选项设置这些字段。 如果您无法确定所有这些字段的正确值,仍然可以使用 -f,但必须比平时更加怀疑地对待恢复的数据库:必须立即转储和恢复。 转储之前不要在数据库中执行任何数据修改操作,因为任何此类操作都可能使损坏变得更糟。
测试环境:centos7 + hgdb企业版V6(pg12)
https://www.postgresql.org/docs/12/app-pgresetwal.html
一 创建测试环境
create table t4(id int);
insert into t4 values(1);
insert into t4 values(2);
checkpoint;
insert into t4 values(3);
二 模拟wal丢失,使用pg_resetwal恢复
2.1 使wal损坏
模拟数据库异常:kill -9 数据库进程:
[highgo@hgdb121 ~]$ ps -ef |grep postg
highgo 1744 1 0 Jun05 ? 00:00:00 /opt/HighGo6.0.2-cluster/bin/postgres
...
[root@hgdb114 ~]# kill -9 1744
[root@hgdb114 ~]# ps -ef |grep postg
highgo 11022 10932 0 10:51 pts/4 00:00:00 grep --color=auto postg
删除wal文件
[highgo@hgdb121 ~]$ cd /data/highgo/data/pg_wal/
[highgo@hgdb121 pg_wal]$ ls
000000010000000000000001 000000010000000000000003 000000010000000000000005 000000010000000000000007 000000010000000000000009 00000001000000000000000B
000000010000000000000002 000000010000000000000004 000000010000000000000006 000000010000000000000008 00000001000000000000000A archive_status
[highgo@hgdb121 pg_wal]$ rm -rf *
尝试启动数据库
[highgo@hgdb121 ~]$ pg_ctl start
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2023-06-06 10:51:29.488 CST [11027] LOG: starting HighGo Database 6 Release 6.0.2-64-bit Production
2023-06-06 10:51:29.489 CST [11027] LOG: listening on IPv4 address "0.0.0.0", port 5866
2023-06-06 10:51:29.489 CST [11027] LOG: listening on IPv6 address "::", port 5866
2023-06-06 10:51:29.491 CST [11027] LOG: listening on Unix socket "/tmp/.s.PGSQL.5866"
2023-06-06 10:51:29.773 CST [11027] LOG: This is a trial edition, validate until 2024-05-18 09:14:26.
2023-06-06 10:51:29.774 CST [11027] LOG: redirecting log output to logging collector process
2023-06-06 10:51:29.774 CST [11027] HINT: Future log output will appear in directory "hgdb_log".
........ stopped waiting
pg_ctl: could not start server
如上表现说明数据库无法启动,查看WAL发现:尝试寻找WAL,但是没找到
2023-06-06 10:51:29.774 CST,,,11027,,647e9f31.2b13,2,,2023-06-06 10:51:29 CST,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destination ""csvlog"".",,,,,,,""
2023-06-06 10:51:29.778 CST,,,11029,,647e9f31.2b15,1,,2023-06-06 10:51:29 CST,,0,LOG,00000,"database system was interrupted; last known up at 2023-06-06 10:50:42 CST",,,,,,,,,""
2023-06-06 10:51:29.778 CST,,,11029,,647e9f31.2b15,2,,2023-06-06 10:51:29 CST,,0,LOG,00000,"creating missing WAL directory ""pg_wal/archive_status""",,,,,,,,,""
2023-06-06 10:51:29.794 CST,,,11029,,647e9f31.2b15,3,,2023-06-06 10:51:29 CST,,0,LOG,00000,"invalid primary checkpoint record",,,,,,,,,""
2023-06-06 10:51:29.794 CST,,,11029,,647e9f31.2b15,4,,2023-06-06 10:51:29 CST,,0,PANIC,XX000,"could not locate a valid checkpoint record",,,,,,,,,""
2023-06-06 10:51:37.836 CST,,,11027,,647e9f31.2b13,3,,2023-06-06 10:51:29 CST,,0,LOG,00000,"startup process (PID 11029) was terminated by signal 6: Aborted",,,,,,,,,""
2023-06-06 10:51:37.836 CST,,,11027,,647e9f31.2b13,4,,2023-06-06 10:51:29 CST,,0,LOG,00000,"aborting startup due to startup process failure",,,,,,,,,""
2023-06-06 10:51:37.857 CST,,,11027,,647e9f31.2b13,5,,2023-06-06 10:51:29 CST,,0,LOG,00000,"database system is shut down",,,,,,,,,""
2.2 解决问题:重置WAL
提示数据库并未干净关闭,询问是否强制打开数据库:
[highgo@hgdb121 ~]$ pg_resetwal $PGDATA
The database server was not shut down cleanly.
Resetting the write-ahead log might cause data to be lost.
If you want to proceed anyway, use -f to force reset.
使用-f参数强制打开数据库
[highgo@hgdb121 ~]$ pg_resetwal -f $PGDATA
Write-ahead log reset
启动数据库完成
pg_ctl start
此时可以看到正常启动了.
三 控制文件损坏,使用pg_resetwal修复
3.1 模拟控制文件损坏
[highgo@hgdb121 pg_wal]$ ps -ef |grep postg
highgo 11372 1 0 11:18 ? 00:00:00 /opt/HighGo6.0.2-cluster/bin/postgres
.......
[highgo@hgdb121 pg_wal]$ kill -9 11372
清空控制文件
[highgo@hgdb121 global]$ cd $PGDATA
[highgo@hgdb121 global]$ echo '' > pg_control
尝试重启数据库
[highgo@hgdb121 data]$ pg_ctl start
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2023-06-06 11:28:46.718 CST [11500] PANIC: could not read file "global/pg_control": read 1 of 312
stopped waiting
pg_ctl: could not start server
Examine the log output.
3.2 重建控制文件
pg_resetwal重建控制文件,需要参数如下:
-l, --next-wal-file=WALFILE set minimum starting location for new WAL
-m, --multixact-ids=MXID,MXID set next and oldest multitransaction ID
-O, --multixact-offset=OFFSET set next multitransaction offset
-x, --next-transaction-id=XID set next transaction ID
3.2.1 下一个wal日志的名称
-l, --next-wal-file=WALFILE set minimum starting location for new WAL
那么先找到当前数据库中最大的wal文件
[highgo@hgdb121 data]$ cd pg_wal/
[highgo@hgdb121 pg_wal]$ ll
total 32768
-rw------- 1 highgo highgo 16777216 Jun 6 11:17 00000001000000000000000B
-rw------- 1 highgo highgo 16777216 Jun 6 11:18 00000001000000000000000C
drwx------ 2 highgo highgo 43 Jun 6 11:17 archive_status
当前数据库中是00000001000000000000000C
那么下一个wal段是00000001000000000000000D,新的wal日志从这开始.
3.2.2 手动设置下一个和最旧的多事务ID
手动设置下一个和最旧的多事务ID。
-m, --multixact-ids=MXID,MXID set next and oldest multitransaction ID
mxid1:下一个多事务ID的安全值可以通过在
P
G
D
A
T
A
/
p
g
m
u
l
t
i
x
a
c
t
/
o
f
f
s
e
t
s
目录中查找数值最大的文件名,
+
1
,然后乘以
65536
(
0
x
10000
)来确定。
m
x
i
d
2
:相反,最旧的多事务
I
D
的安全值可以通过
PGDATA/pg_multixact/offsets 目录中查找数值最大的文件名,+1,然后乘以65536(0x10000)来确定。mxid2:相反,最旧的多事务ID的安全值可以通过
PGDATA/pgmultixact/offsets目录中查找数值最大的文件名,+1,然后乘以65536(0x10000)来确定。mxid2:相反,最旧的多事务ID的安全值可以通过PGDATA/pg_multixact/offsets目录中数字最小的文件名,乘以65536来确定。
由于文件名是十六进制的,因此最简单的方法是以十六进制指定选项值并附加四个零。
例如:
ls $PGDATA/pg_multixact/offsets
0100
0102
那么
-m 0x01030000 0x01000000
最大值+1后跟4个0,最小值后跟4个0
注意:假如当前值为0000,那么mxid2取mxid1的值
ls $PGDATA/pg_multixact/offsets
0000
那么:
-m 0x00010000 0x00010000
3.2.3 手动设置下一个多事务处理偏移量
-O, --multixact-offset=OFFSET set next multitransaction offset
安全值可以通过在 $PGDATA/pg_multixact/members 目录中查找数值最大的文件名,+1,然后乘以52352(0xCC80)来确定。文件名为十六进制。没有像附加零的其他选项那样的简单方法。
假如当期前值为0000, 那么使用
-O 0xCC80
3.2.4 手动设置下一个事务ID
-x, --next-transaction-id=XID set next transaction ID
安全值可以通过在$PGDATA/pg_xact目录中查找数值最大的文件名,+1,然后乘以1048576(0x100000)来确定。请注意,文件名是十六进制的。,因此最简单的方法是以十六进制指定选项值并附加五个零。
当前值为0000,则使用-x 0x100000
[highgo@hgdb121 offsets]$ ll $PGDATA/pg_xact
total 8
-rw------- 1 highgo highgo 8192 Jun 6 11:17 0000
3.2.5 执行重建控制文件
结合上面的1/2/3/4个参数,汇总:
pg_resetwal -l 00000001000000000000000D -m 0x10000,0x10000 -O 0xCC80 -x 0x100000 $PGDATA
[highgo@hgdb121 pg_wal]$ pg_resetwal -l 00000001000000000000000D -m 0x10000,0x10000 -O 0xCC80 -x 0x100000 $PGDATA
pg_resetwal: error: lock file "postmaster.pid" exists
pg_resetwal: Is a server running? If not, delete the lock file and try again.
此处提示postmaster.pid仍然存在,需要删除
[highgo@hgdb121 pg_wal]$ rm $PGDATA/postmaster.pid
删除后可以正常执行
[highgo@hgdb121 pg_wal]$ pg_resetwal -l 00000001000000000000000D -m 0x10000,0x10000 -O 0xCC80 -x 0x100000 $PGDATA
pg_resetwal: warning: pg_control exists but is broken or wrong version; ignoring it
Guessed pg_control values:
pg_control version number: 1201
Catalog version number: 201909212
Database system identifier: 7241447295356355133
Latest checkpoint's TimeLineID: 1
Latest checkpoint's full_page_writes: off
Latest checkpoint's NextXID: 0:3
Latest checkpoint's NextOID: 12000
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 3
Latest checkpoint's oldestXID's DB: 0
Latest checkpoint's oldestActiveXID: 0
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 0
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0
Values to be changed:
First log segment after reset: 00000001000000000000000D
NextMultiXactId: 65536
OldestMultiXid: 65536
OldestMulti's DB: 0
NextMultiOffset: 52352
NextXID: 1048576
OldestXID: 2296015872
OldestXID's DB: 0
If these values seem acceptable, use -f to force reset.
这些值看起来可以接受,使用-f强制reset
真正重建控制文件:
[highgo@hgdb121 pg_wal]$ pg_resetwal -l 00000001000000000000000D -m 0x10000,0x10000 -O 0xCC80 -x 0x100000 $PGDATA -f
pg_resetwal: warning: pg_control exists but is broken or wrong version; ignoring it
Write-ahead log reset
测试数据中 checkpoint 检查点后的数据,由于控制文件重置而丢失。
注意:使用pg_resrtwal后,可能会造成数据不一致,必须要将数据逻辑导出后,再导入到新建库中,这个库元数据可能也遭到破坏,坚决不能使用了.