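Case 1: the WAL file exists in pg_wal
Case 1 applies to the following scenarios:
- The WAL file exists in pg_wal but not in the archive directory
- The WAL file exists in both pg_wal and the archive directory
Problem description
Last night the primary of a Repmgr + PG14 primary/standby pair ran out of disk space because of accumulated WAL. I deleted expired WAL files on the primary and rebuilt the standby. During this morning's status check, the primary was streaming WAL to the standby normally, but its process list showed one archive failure:
postgres: archiver failed on 000000010000006F00000086
- Primary:
walsender repmgr 172.28.32.23(36122) streaming 72/1BAC3A10 (walsender is healthy)
archiver failed on 000000010000006F00000086 (archiving failed)
- Standby:
walreceiver streaming 77/9EB6A198 (walreceiver is healthy)
--Check the primary's database status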
[root@pgmaster ~]# systemctl status postgres
● postgres.service - PostgreSQL database server
Loaded: loaded (/usr/lib/systemd/system/postgres.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2023-10-12 22:04:08 CST; 13h ago
Process: 3710968 ExecStart=/server/data/pgdb/pgsql/bin/pg_ctl start -D $PGDATA (code=exited, status=0/SUCCESS)
Main PID: 3710970 (postgres)
Tasks: 53 (limit: 201967)
Memory: 19.0G
CGroup: /system.slice/postgres.service
├─ 3710970 /server/data/pgdb/pgsql/bin/postgres -D /server/data/pgdb/data
├─ 3710971 "postgres: logger " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3710992 "postgres: checkpointer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3710993 "postgres: background writer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3710994 "postgres: walwriter " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3710995 "postgres: archiver failed on 000000010000006F00000086" "" "" "" "" "" "" "" "" ""
├─ 3710996 "postgres: logical replication launcher " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3711001 "postgres: top_portal top_portal 172.28.32.18(41438) idle" "" "" "" "" "" ""
├─ 3711003 "postgres: tj_sjjh dataexchange 172.28.32.28(35406) idle" "" "" "" "" "" "" ""
├─ 3711009 "postgres: repmgr repmgr 172.28.32.22(64096) idle" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3711468 "postgres: top_portal top_portal 172.28.32.18(41720) idle" "" "" "" "" "" ""
├─ 3713807 "postgres: top_portal top_portal 172.28.32.20(44492) idle" "" "" "" "" "" ""
├─ 3723017 "postgres: walsender repmgr 172.28.32.23(36122) streaming 72/1BAC3A10" # WAL is being sent normally
--Check the standby's status
[root@pgslave ~]# systemctl status postgres
● postgres.service - PostgreSQL database server
Loaded: loaded (/usr/lib/systemd/system/postgres.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2023-10-13 00:12:19 CST; 12h ago
Process: 1931221 ExecStart=/server/data/pgdb/pgsql/bin/pg_ctl start -D $PGDATA (code=exited, status=0/SUCCESS)
Main PID: 1931223 (postgres)
Tasks: 7 (limit: 201967)
Memory: 23.2G
CGroup: /system.slice/postgres.service
├─ 1931223 /server/data/pgdb/pgsql/bin/postgres -D /server/data/pgdb/data
├─ 1931224 "postgres: logger " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 1931225 "postgres: startup recovering 00000001000000770000009E" "" "" "" "" "" "" "" "" ""
├─ 1931226 "postgres: checkpointer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 1931227 "postgres: background writer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 1931230 "postgres: walreceiver streaming 77/9EB6A198" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" # WAL is being received
└─ 1931430 "postgres: repmgr repmgr 172.28.32.23(22956) idle" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
Oct 13 00:12:17 pgslave systemd[1]: Starting PostgreSQL database server...
Oct 13 00:12:17 pgslave pg_ctl[1931221]: waiting for server to start....
Oct 13 00:12:17 pgslave pg_ctl[1931223]: 2023-10-13 00:12:17.497 CST [1931223] LOG: redirecting log output to logging collector process
Oct 13 00:12:17 pgslave pg_ctl[1931223]: 2023-10-13 00:12:17.497 CST [1931223] HINT: Future log output will appear in directory "log".
Oct 13 00:12:19 pgslave pg_ctl[1931221]: . done
Oct 13 00:12:19 pgslave pg_ctl[1931221]: server started
Oct 13 00:12:19 pgslave systemd[1]: Started PostgreSQL database server.
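Problem analysis
1. Check the database log
2. Check the archive configuration parameters
The parameters are configured correctly, and the archive directory permissions are also correct.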
postgres=# show archive_command;
archive_command
-----------------------------------------------------------
/usr/bin/lz4 -q -z %p /server/data/pgdb/pg_archive/%f.lz4
(1 row)
postgres=# show archive_mode;
archive_mode
--------------
on
(1 row)
--Check the permissions of the archive directory
[postgres@pgmaster ~]$ ls -ld /server/data/pgdb/pg_archive
drwxr-x--- 2 postgres postgres 4214784 Oct 13 13:14 /server/data/pgdb/pg_archive
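When the configuration and permissions look right but archiving still fails, one option is to run the archive command by hand as the postgres user from the data directory and read its error directly. The helper below is a minimal sketch for that: it expands the %p and %f placeholders the way the server would for one segment. The function name is made up for illustration, and it assumes the substituted values contain no characters that sed treats specially.

```shell
# Expand the %p and %f placeholders of an archive_command template.
# $1: template, $2: path of the WAL file (%p), $3: file name only (%f)
# Assumption: $2 and $3 contain no '|', '&' or backslash characters.
expand_archive_command() {
    printf '%s\n' "$1" | sed -e "s|%p|$2|g" -e "s|%f|$3|g"
}
```

Called with the archive_command shown above, pg_wal/000000010000006F00000086 and 000000010000006F00000086, it prints the exact lz4 command line, which can then be run manually to see why it exits non-zero.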
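3. Switch WAL manually
The manual switch succeeded but did not fix the problem: the status still shows the archiver stuck on the same failed WAL file.
--Force a WAL switch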
top_portal=# select pg_switch_wal();
pg_switch_wal
---------------
72/51C4CFD8
(1 row)
--Check the primary's database status
[root@pgmaster ~]# systemctl status postgres
● postgres.service - PostgreSQL database server
Loaded: loaded (/usr/lib/systemd/system/postgres.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2023-10-12 22:04:08 CST; 13h ago
Process: 3710968 ExecStart=/server/data/pgdb/pgsql/bin/pg_ctl start -D $PGDATA (code=exited, status=0/SUCCESS)
Main PID: 3710970 (postgres)
Tasks: 53 (limit: 201967)
Memory: 19.0G
CGroup: /system.slice/postgres.service
├─ 3710970 /server/data/pgdb/pgsql/bin/postgres -D /server/data/pgdb/data
├─ 3710971 "postgres: logger " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3710992 "postgres: checkpointer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3710993 "postgres: background writer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3710994 "postgres: walwriter " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3710995 "postgres: archiver failed on 000000010000006F00000086" "" "" "" "" "" "" "" "" ""
├─ 3710996 "postgres: logical replication launcher " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3711001 "postgres: top_portal top_portal 172.28.32.18(41438) idle" "" "" "" "" "" ""
├─ 3711003 "postgres: tj_sjjh dataexchange 172.28.32.28(35406) idle" "" "" "" "" "" "" ""
├─ 3711009 "postgres: repmgr repmgr 172.28.32.22(64096) idle" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3711468 "postgres: top_portal top_portal 172.28.32.18(41720) idle" "" "" "" "" "" ""
├─ 3713807 "postgres: top_portal top_portal 172.28.32.20(44492) idle" "" "" "" "" "" ""
├─ 3723017 "postgres: walsender repmgr 172.28.32.23(36122) streaming 72/1BAC3A10" # WAL is being sent normally
--Check the current WAL LSN
top_portal=# select pg_current_wal_lsn();
pg_current_wal_lsn
--------------------
72/52638F10
(1 row)
--Find the WAL file that contains the current LSN
top_portal=# select pg_walfile_name(pg_current_wal_lsn());
pg_walfile_name
--------------------------
000000010000007200000052
(1 row)
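The mapping that pg_walfile_name() performs can also be reproduced offline, which helps when comparing an LSN from a log with file names in pg_wal. This is a small sketch, not the server's implementation: the function name is hypothetical, the timeline ID has to be supplied by the caller, and it assumes the default 16 MB WAL segment size.

```shell
# Map an LSN such as 72/52638F10 to its WAL segment file name.
# $1: timeline ID (decimal), $2: LSN in the "hi/lo" hex form
# Assumes the default 16 MB (16777216 byte) WAL segment size.
lsn_to_walfile() {
    hi=${2%%/*}                      # hex digits before the slash
    lo=${2##*/}                      # hex digits after the slash
    seg=$(( 0x$lo / 16777216 ))      # segment index inside this "log" number
    printf '%08X%08X%08X\n' "$1" "$(( 0x$hi ))" "$seg"
}
```

For timeline 1 and LSN 72/52638F10 this yields 000000010000007200000052, the same name the query above returned.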
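--Check the latest checkpoint; WAL files older than its REDO WAL file are no longer needed for crash recovery (archiving, standbys, or replication slots may still need them)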
[postgres@pgmaster ~]$ pg_controldata $PGDATA
pg_control version number: 1300
Catalog version number: 202107181
Database system identifier: 7268852449124462799
Database cluster state: in production
pg_control last modified: Fri 13 Oct 2023 10:07:35 AM CST
Latest checkpoint location: 71/CDD2FF28
Latest checkpoint's REDO location: 71/CDD28F18
Latest checkpoint's REDO WAL file: 0000000100000071000000CD
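To check whether a given segment is older than the latest checkpoint's REDO WAL file, the pg_controldata output can be parsed and the names compared. The sketch below uses two hypothetical helper names; it assumes English (C locale) pg_controldata output and that both segments are on the same timeline.

```shell
# Print the "REDO WAL file" value from pg_controldata output read on stdin.
# Assumes English output, e.g.: LC_ALL=C pg_controldata "$PGDATA" | redo_wal_file
redo_wal_file() {
    sed -n "s/^Latest checkpoint's REDO WAL file:[[:space:]]*//p"
}

# Succeed if segment $1 sorts strictly before segment $2.
# WAL file names are fixed-width uppercase hex, so on one timeline
# plain string order matches WAL order.
wal_is_older() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | LC_ALL=C sort | head -n 1)" = "$1" ]
}
```

With the values from this case, wal_is_older 000000010000006F00000086 0000000100000071000000CD succeeds, so the failing segment predates the REDO point.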
--Look for the WAL file named in the error
[postgres@pgmaster pg_wal]$ ls -l 000000010000006F00000086
-rw------- 1 postgres postgres 16777216 Oct 12 21:12 000000010000006F00000086
[postgres@pgmaster pg_wal]$ find /server/data/pgdb/pg_archive -name '000000010000006F00000086*'
ls: cannot access '000000010000006F00000086': No such file or directory
[postgres@pgmaster pg_wal]$ find /server -name '000000010000006F00000086*'
-rw------- 1 postgres postgres 16777216 Oct 12 21:12 000000010000006F00000086
4. Check the files under $PGDATA/pg_wal/archive_status/
[postgres@pgmaster ~]$ cd /server/data/pgdb/data/pg_wal/archive_status/
[postgres@pgmaster archive_status]$ ls -l *.ready
ls: cannot access '*.ready': No such file or directory
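This shows there are no segments waiting to be archived.
In this directory, a .ready file marks a segment that still needs to be archived, and a .done file marks one that has already been archived.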
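On a busy system archive_status can hold a very large number of markers, and a shell glob such as *.ready may then fail with "Argument list too long". A minimal sketch that counts the markers with find instead (the function name is illustrative, and -maxdepth assumes GNU or BSD find):

```shell
# Print how many .ready and .done markers an archive_status directory holds.
# $1: path to pg_wal/archive_status
count_archive_status() {
    ready=$(( $(find "$1" -maxdepth 1 -type f -name '*.ready' | wc -l) ))
    done_=$(( $(find "$1" -maxdepth 1 -type f -name '*.done' | wc -l) ))
    printf 'ready=%d done=%d\n' "$ready" "$done_"
}
```

A ready count that keeps growing between runs is a sign that the archiver is stuck or falling behind.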
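Solution
1. Back up the WAL file that failed to archive to /home/postgres. The segment (6F/86) is older than the latest checkpoint's REDO WAL file (71/CD), so crash recovery no longer needs it. In production, if disk space allows, never rm the file; mv it to a backup location instead.
2. Force a WAL switch with select pg_switch_wal();
3. Check the primary and standby status again
--1. Back up the WAL file that failed to archive to /home/postgres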
[postgres@pgmaster pg_wal]$ mv 000000010000006F00000086 /home/postgres/000000010000006F00000086
[postgres@pgmaster pg_wal]$ ls -l /home/postgres/000000010000006F00000086
-rw------- 1 postgres postgres 16777216 Oct 12 21:12 /home/postgres/000000010000006F00000086
--2. Force a WAL switch
postgres=# select pg_switch_wal();
pg_switch_wal
---------------
73/7EF502E0
(1 row)
--3. Check the primary again: the status is now normal
[root@pgmaster data]# systemctl status postgres
● postgres.service - PostgreSQL database server
Loaded: loaded (/usr/lib/systemd/system/postgres.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2023-10-12 22:04:08 CST; 13h ago
Process: 3710968 ExecStart=/server/data/pgdb/pgsql/bin/pg_ctl start -D $PGDATA (code=exited, status=0/SUCCESS)
Main PID: 3710970 (postgres)
Tasks: 50 (limit: 201967)
Memory: 26.6G
CGroup: /system.slice/postgres.service
├─ 3710970 /server/data/pgdb/pgsql/bin/postgres -D /server/data/pgdb/data
├─ 3710971 "postgres: logger " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3710992 "postgres: checkpointer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3710993 "postgres: background writer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3710994 "postgres: walwriter " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3710995 "postgres: archiver archiving 000000010000007100000035" "" "" "" "" "" "" "" "" ""
├─ 3710996 "postgres: logical replication launcher " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3711001 "postgres: top_portal top_portal 172.28.32.18(41438) idle" "" "" "" "" "" ""
├─ 3711003 "postgres: tj_sjjh dataexchange 172.28.32.28(35406) idle" "" "" "" "" "" "" ""
├─ 3711009 "postgres: repmgr repmgr 172.28.32.22(64096) idle" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3711468 "postgres: top_portal top_portal 172.28.32.18(41720) idle" "" "" "" "" "" ""
├─ 3713807 "postgres: top_portal top_portal 172.28.32.20(44492) idle" "" "" "" "" "" ""
├─ 3723017 "postgres: walsender repmgr 172.28.32.23(36122) streaming 73/7F000BD0"
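Supplement
If $PGDATA/pg_wal/archive_status/ contains a large number of *.ready files
Possible cause: if the database lost power suddenly, the archive command may not have finished, leaving an incomplete file in the archive directory. After the database restarts, archiving of that segment can then fail. In that case, delete the incomplete file from the archive directory so that the segment can be archived again.
Also keep an eye on whether the command set in archive_command actually succeeds. If it fails, PostgreSQL retries it periodically; in the meantime existing WAL files are not recycled, and new WAL keeps consuming space under pg_wal until the disk holding pg_wal fills up and the database shuts down. Because changing wal_level or archive_mode requires a restart, you can enable both before the first start at installation time and set archive_command to a command that always succeeds, for example /bin/true. When archiving is needed later, only archive_command has to change and a reload is enough, which avoids a restart.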
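The half-written archive file described above can be avoided if the archive command never writes directly to the final name. The following is a minimal sketch of such a wrapper, not the command used in this article: the function name and argument order are assumptions, it does not fsync, and it assumes the temporary and final names live on the same file system so that the rename is atomic.

```shell
# Archive one WAL segment without exposing a half-written file.
# $1: source path (%p), $2: archive directory, $3: file name (%f)
archive_wal() {
    if [ -e "$2/$3" ]; then
        # Refuse to overwrite an existing archive file.
        echo "archive_wal: $2/$3 already exists, not overwriting" >&2
        return 1
    fi
    # Copy under a temporary name, then rename. If the copy is interrupted,
    # only the .tmp file is left behind and the next attempt overwrites it.
    cp "$1" "$2/$3.tmp" && mv "$2/$3.tmp" "$2/$3"
}
```

Wrapped in a script, it could be referenced from archive_command with %p and %f passed as arguments; a non-zero exit status tells PostgreSQL to keep the segment and retry later.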
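Case 2: the WAL file exists in neither pg_wal nor the archive directory
Case 2 applies to the following scenario:
- The WAL file exists in neither pg_wal nor the archive directory
Problem description
A developer asked me to free up archive space on a PG10 database in the test environment. Checking the database status before the cleanup, I found that archiving was failing with "archiver process failed on 000000010000000000000001". Analysis showed that the WAL file was missing from both pg_wal and the archive directory. Going through several log files, I found that archiving had been failing since 2022-12-31; it turned out that nobody had been maintaining this database.
--Archiving failure found while checking the database status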
[root@localhost log]# ps -ef | grep postgres
postgres 1099 1 0 11月14 ? 00:00:05 /usr/pgsql-11/bin/postmaster -D /dsg3/postgres/pg115_data
postgres 1103 1 0 11月14 ? 00:00:15 /usr/pgsql-10/bin/postmaster -D /dsg3/postgres/pg10_data/
postgres 1532 1099 0 11月14 ? 00:00:00 postgres: logger
postgres 1595 1103 0 11月14 ? 00:00:16 postgres: logger process
postgres 1674 1099 0 11月14 ? 00:00:00 postgres: checkpointer
postgres 1675 1099 0 11月14 ? 00:00:18 postgres: background writer
postgres 1676 1099 0 11月14 ? 00:00:18 postgres: walwriter
postgres 1677 1099 0 11月14 ? 00:00:12 postgres: autovacuum launcher
postgres 1678 1099 0 11月14 ? 00:00:39 postgres: archiver
postgres 1679 1099 0 11月14 ? 00:00:14 postgres: stats collector
postgres 1680 1099 0 11月14 ? 00:00:01 postgres: logical replication launcher
postgres 1682 1103 0 11月14 ? 00:00:00 postgres: checkpointer process
postgres 1683 1103 0 11月14 ? 00:00:19 postgres: writer process
postgres 1684 1103 0 11月14 ? 00:00:18 postgres: wal writer process
postgres 1685 1103 0 11月14 ? 00:00:13 postgres: autovacuum launcher process
postgres 1686 1103 0 11月14 ? 00:05:19 postgres: archiver process failed on 000000010000000000000001
postgres 1687 1103 0 11月14 ? 00:00:28 postgres: stats collector process
postgres 1688 1103 0 11月14 ? 00:00:01 postgres: bgworker: logical replication launcher
root 8779 8736 0 15:01 pts/0 00:00:00 su - postgres
postgres 8780 8779 0 15:01 pts/0 00:00:00 -bash
root 10057 8888 0 15:17 pts/1 00:00:00 grep --color=auto postgres
postgres 16957 1103 0 11月21 ? 00:00:00 postgres: topicis topicis 192.168.5.211(58552) idle
postgres 16958 1103 0 11月21 ? 00:00:00 postgres: topicis topicis 192.168.5.211(58555) idle
postgres 16959 1103 0 11月21 ? 00:00:00 postgres: topicis topicis 192.168.5.211(58556) idle
postgres 16960 1103 0 11月21 ? 00:00:00 postgres: topicis topicis 192.168.5.211(58558) idle
Problem analysis
--Check the archive parameters
-bash-4.2$ /usr/pgsql-10/bin/psql -p 54310
psql (10.22)
Type "help" for help.
postgres=# show archive_mode;
archive_mode
--------------
on
(1 row)
postgres=# show archive_command;
archive_command
----------------------------------------------
cp %p /dsg3/postgres/pg10_data/pg_archive/%f
(1 row)
postgres=# \q
--Check the archive directory permissions
-bash-4.2$ ls -ld /dsg3/postgres/pg10_data/pg_archive
drwxr-xr-x 2 postgres postgres 4096 12月 1 13:59 /dsg3/postgres/pg10_data/pg_archive
--Going through several log files shows that archiving has been failing since 2022-12-31
-bash-4.2$ tail -200f postgresql-2022-12-31_000000.log
cp: cannot stat 'pg_wal/000000010000000000000001': No such file or directory
2022-12-31 23:58:29.002 CST [1706] LOG: archive command failed with exit code 1
2022-12-31 23:58:29.002 CST [1706] DETAIL: The failed archive command was: cp pg_wal/000000010000000000000001 /dsg3/postgres/pg10_data/pg_archive/000000010000000000000001
cp: cannot stat 'pg_wal/000000010000000000000001': No such file or directory
2022-12-31 23:58:30.012 CST [1706] LOG: archive command failed with exit code 1
2022-12-31 23:58:30.012 CST [1706] DETAIL: The failed archive command was: cp pg_wal/000000010000000000000001 /dsg3/postgres/pg10_data/pg_archive/000000010000000000000001
2022-12-31 23:58:30.012 CST [1706] WARNING: archiving write-ahead log file "000000010000000000000001" failed too many times, will try again later
2022-12-31 23:59:00.016 CST [23391] ERROR: column "sysdate" does not exist at character 147
2022-12-31 23:59:00.016 CST [23391] STATEMENT: select code,sum(1) as sum,sum(investorcount) as invsum from LOG_SYNCNAMEINFO where logtype = '成功' and date_trunc('day',logTime)= date_trunc('day',sysdate - interval '1 day') group by code order by code
2022-12-31 23:59:00.027 CST [23391] ERROR: column "sysdate" does not exist at character 123
2022-12-31 23:59:00.027 CST [23391] STATEMENT: select * from(select * from log_entopenplatformpush t where t.logtype in('失败','异常') and (t.nexttime is null or t.nexttime<sysdate) and cast(t.excount as numeric)<cast(t.maxcount as numeric) order by t.nexttime) foo limit $1 offset 0
[root@localhost log]# tail -200f postgresql-2023-12-01_000000.log
...
cp: cannot stat 'pg_wal/000000010000000000000002': No such file or directory
2023-12-01 16:14:50.314 CST [1686] LOG: archive command failed with exit code 1
2023-12-01 16:14:50.314 CST [1686] DETAIL: The failed archive command was: cp pg_wal/000000010000000000000002 /dsg3/postgres/pg10_data/pg_archive/000000010000000000000002
2023-12-01 16:14:50.314 CST [1686] WARNING: archiving write-ahead log file "000000010000000000000002" failed too many times, will try again later
cp: cannot stat 'pg_wal/000000010000000000000002': No such file or directory
2023-12-01 16:15:50.387 CST [1686] LOG: archive command failed with exit code 1
2023-12-01 16:15:50.387 CST [1686] DETAIL: The failed archive command was: cp pg_wal/000000010000000000000002 /dsg3/postgres/pg10_data/pg_archive/000000010000000000000002
cp: cannot stat 'pg_wal/000000010000000000000002': No such file or directory
2023-12-01 16:15:51.397 CST [1686] LOG: archive command failed with exit code 1
2023-12-01 16:15:51.397 CST [1686] DETAIL: The failed archive command was: cp pg_wal/000000010000000000000002 /dsg3/postgres/pg10_data/pg_archive/000000010000000000000002
...
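Rather than reading each log file by eye, the segments the archiver gave up on can be pulled out of the logs mechanically. This sketch relies only on the WARNING text visible in the log excerpt above, which appears in English even in this localized log; the function name is made up for illustration.

```shell
# Read PostgreSQL log text on stdin and print, once each and in sorted order,
# every WAL segment that hit the "failed too many times" warning.
failed_wal_segments() {
    sed -n 's/.*archiving write-ahead log file "\([0-9A-F]\{24\}\)" failed too many times.*/\1/p' |
    LC_ALL=C sort -u
}
```

Running it over all files in the log directory, for example with cat postgresql-*.log | failed_wal_segments, gives a quick list of the segments involved.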
--Search for 000000010000000000000001: the file exists neither in pg_wal nor anywhere else on the server
[root@localhost dsg2]# ls -l /dsg3/postgres/pg10_data/pg_wal/000000010000000000000001
[root@localhost dsg2]# find /dsg2 -name 000000010000000000000001
[root@localhost dsg2]# find /dsg3 -name 000000010000000000000001
[root@localhost dsg2]# find / -name 000000010000000000000001
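Whether it was deleted or lost in some other way is unknown; either way, 000000010000000000000001 is gone.
--Check the latest checkpoint; WAL files older than its REDO WAL file are no longer needed for crash recovery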
-bash-4.2$ /usr/pgsql-10/bin/pg_controldata -D /dsg3/postgres/pg10_data/
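pg_control version number: 1002
Catalog version number: 201707211
Database system identifier: 7145756055167210409
Database cluster state: in production
pg_control last modified: Fri Dec  1 14:06:38 2023
Latest checkpoint location: 5C/CC000098
Prior checkpoint location: 5C/CB000098
Latest checkpoint's REDO location: 5C/CC000060
Latest checkpoint's REDO WAL file: 000000010000005C000000CC
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:18857199
Latest checkpoint's NextOID: 5631206
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 548
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 18857199
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint: Fri Dec  1 14:06:38 2023
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: logical
wal_log_hints setting: off
max_connections setting: 1000
max_worker_processes setting: 8
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0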
Mock authentication nonce: 7983f98bfb21a629b6495115d880af674404270a694d663e4e31603c1cb19c41
--Check the current WAL LSN
postgres=# select pg_current_wal_lsn();
pg_current_wal_lsn
--------------------
5C/CC000098
(1 row)
--Find the WAL file that contains the current LSN
postgres=# select pg_walfile_name(pg_current_wal_lsn());
pg_walfile_name
--------------------------
000000010000005C000000CC
(1 row)
--Check the files under $PGDATA/pg_wal/archive_status/
-bash-4.2$ cd /dsg3/postgres/pg10_data/pg_wal/archive_status/
-bash-4.2$ ls -l *.ready
There are many files ending in .ready. A .ready file marks a segment that still needs to be archived; a .done file marks one that has already been archived.
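The symptom in this case, an archiver asking for a segment that is not in pg_wal, is consistent with .ready markers that have outlived their WAL files. A minimal sketch to list such markers is shown below; the function name is hypothetical, and it assumes archive_status sits directly under the pg_wal directory passed in.

```shell
# List .ready markers whose WAL file no longer exists in pg_wal.
# $1: pg_wal directory (archive_status is expected directly below it)
orphan_ready_segments() {
    find "$1/archive_status" -maxdepth 1 -type f -name '*.ready' |
    while IFS= read -r marker; do
        seg=$(basename "$marker" .ready)
        # Print the segment name only when the file itself is missing.
        [ -e "$1/$seg" ] || printf '%s\n' "$seg"
    done | LC_ALL=C sort
}
```

Every name it prints is a segment the archiver will keep retrying without any chance of success, because the source file of the copy no longer exists.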
Attempted fixes
1. Disable and re-enable archiving (did not help)
Disable archiving -> restart the database -> enable archiving -> restart the database; the archiver still reports the same failure:
--Disable archiving: edit postgresql.conf and comment out the following parameters
-bash-4.2$ vi /dsg3/postgres/pg10_data/postgresql.conf
#archive_mode = on
#archive_command = 'cp %p /dsg3/postgres/pg10_data/pg_archive/%f'
--Restart the database
/usr/pgsql-10/bin/pg_ctl stop -D /dsg3/postgres/pg10_data/
/usr/pgsql-10/bin/pg_ctl start -D /dsg3/postgres/pg10_data/
--Enable archiving: edit postgresql.conf and uncomment the following parameters
-bash-4.2$ vi /dsg3/postgres/pg10_data/postgresql.conf
archive_mode = on
archive_command = 'cp %p /dsg3/postgres/pg10_data/pg_archive/%f'
--Restart the database
/usr/pgsql-10/bin/pg_ctl stop -D /dsg3/postgres/pg10_data/
/usr/pgsql-10/bin/pg_ctl start -D /dsg3/postgres/pg10_data/
--Archiving still fails when checking the database status
[root@localhost log]# ps -ef | grep postgres
postgres 1099 1 0 11月14 ? 00:00:05 /usr/pgsql-11/bin/postmaster -D /dsg3/postgres/pg115_data
postgres 1103 1 0 11月14 ? 00:00:15 /usr/pgsql-10/bin/postmaster -D /dsg3/postgres/pg10_data/
postgres 1532 1099 0 11月14 ? 00:00:00 postgres: logger
postgres 1595 1103 0 11月14 ? 00:00:16 postgres: logger process
postgres 1674 1099 0 11月14 ? 00:00:00 postgres: checkpointer
postgres 1675 1099 0 11月14 ? 00:00:18 postgres: background writer
postgres 1676 1099 0 11月14 ? 00:00:18 postgres: walwriter
postgres 1677 1099 0 11月14 ? 00:00:12 postgres: autovacuum launcher
postgres 1678 1099 0 11月14 ? 00:00:39 postgres: archiver
postgres 1679 1099 0 11月14 ? 00:00:14 postgres: stats collector
postgres 1680 1099 0 11月14 ? 00:00:01 postgres: logical replication launcher
postgres 1682 1103 0 11月14 ? 00:00:00 postgres: checkpointer process
postgres 1683 1103 0 11月14 ? 00:00:19 postgres: writer process
postgres 1684 1103 0 11月14 ? 00:00:18 postgres: wal writer process
postgres 1685 1103 0 11月14 ? 00:00:13 postgres: autovacuum launcher process
postgres 1686 1103 0 11月14 ? 00:05:19 postgres: archiver process failed on 000000010000000000000001
postgres 1687 1103 0 11月14 ? 00:00:28 postgres: stats collector process
postgres 1688 1103 0 11月14 ? 00:00:01 postgres: bgworker: logical replication launcher
root 8779 8736 0 15:01 pts/0 00:00:00 su - postgres
2. Clean up expired WAL files with pg_archivecleanup (did not help)
--List the files under pg_wal
-bash-4.2$ ls -l /dsg3/postgres/pg10_data/pg_wal/
total 394736
-rw------- 1 postgres postgres 16777216 4月 12 2023 000000010000005C000000BC
-rw------- 1 postgres postgres 16777216 4月 12 2023 000000010000005C000000BD
-rw------- 1 postgres postgres 16777216 4月 12 2023 000000010000005C000000BE
-rw------- 1 postgres postgres 16777216 4月 12 2023 000000010000005C000000BF
-rw------- 1 postgres postgres 16777216 4月 12 2023 000000010000005C000000C0
-rw------- 1 postgres postgres 16777216 4月 12 2023 000000010000005C000000C1
-rw------- 1 postgres postgres 16777216 4月 12 2023 000000010000005C000000C2
-rw------- 1 postgres postgres 16777216 4月 12 2023 000000010000005C000000C3
-rw------- 1 postgres postgres 16777216 4月 12 2023 000000010000005C000000C4
-rw------- 1 postgres postgres 16777216 4月 12 2023 000000010000005C000000C5
-rw------- 1 postgres postgres 16777216 4月 12 2023 000000010000005C000000C6
-rw------- 1 postgres postgres 16777216 4月 12 2023 000000010000005C000000C7
-rw------- 1 postgres postgres 16777216 4月 12 2023 000000010000005C000000C8
-rw------- 1 postgres postgres 16777216 4月 12 2023 000000010000005C000000C9
-rw------- 1 postgres postgres 16777216 12月 1 14:00 000000010000005C000000CA
-rw------- 1 postgres postgres 16777216 12月 1 14:04 000000010000005C000000CB
-rw------- 1 postgres postgres 16777216 12月 1 16:17 000000010000005C000000CC
-rw------- 1 postgres postgres 16777216 12月 1 16:26 000000010000005C000000CD
-rw------- 1 postgres postgres 16777216 12月 1 16:39 000000010000005C000000CE
--Check the current WAL LSN
postgres=# select pg_current_wal_lsn();
pg_current_wal_lsn
--------------------
5C/CC000098
(1 row)
--Find the WAL file that contains the current LSN
postgres=# select pg_walfile_name(pg_current_wal_lsn());
pg_walfile_name
--------------------------
000000010000005C000000CC
(1 row)
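--Remove WAL files older than the checkpoint
# Files in pg_wal older than 000000010000005C000000CC are no longer needed for crash recovery and can be removed (before PG10 the directory was called pg_xlog)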
[postgres@Server ~]$ pg_archivecleanup -d $PGDATA/pg_wal 000000010000005C000000C2
pg_archivecleanup: keep WAL file "/server/data/pgdb/data/pg_wal/000000010000005C000000C2" and later
pg_archivecleanup: removing file "/server/data/pgdb/data/pg_wal/000000010000005C000000C1"
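Even though this is a test environment, I kept some WAL files: instead of cleaning up to the current WAL file 000000010000005C000000CC, I only removed the files
older than 000000010000005C000000C2.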
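pg_archivecleanup keeps the named file and everything newer, so leaving a safety margin means passing a name some segments before the current one. The cutoff used above, C2, happens to be 10 segments before CC. The sketch below computes such a name; the function name is illustrative, the margin of 10 is only an example, and it assumes 16 MB segments (256 segments per "log" number).

```shell
# Print the name of the segment N segments before the given one.
# $1: WAL segment file name (24 hex digits), $2: N
# Assumes 16 MB segments, i.e. 256 segments per "log" number.
wal_segment_minus() {
    tli=$(printf '%s\n' "$1" | cut -c1-8)     # timeline part
    log=$(printf '%s\n' "$1" | cut -c9-16)    # high part of the segment number
    seg=$(printf '%s\n' "$1" | cut -c17-24)   # low part of the segment number
    segno=$(( 0x$log * 256 + 0x$seg - $2 ))
    [ "$segno" -ge 0 ] || return 1
    printf '%s%08X%08X\n' "$tli" "$(( segno / 256 ))" "$(( segno % 256 ))"
}
```

For example, wal_segment_minus 000000010000005C000000CC 10 prints 000000010000005C000000C2, which can then be passed as the second argument to pg_archivecleanup.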
--Force a WAL switch
-bash-4.2$ /usr/pgsql-10/bin/psql -p 54310
psql (10.22)
Type "help" for help.
postgres=# select pg_switch_wal();
pg_switch_wal
---------------
5C/D10000E8
(1 row)
--Archiving still fails when checking the database status
[root@localhost log]# ps -ef | grep postgres
postgres 1099 1 0 11月14 ? 00:00:05 /usr/pgsql-11/bin/postmaster -D /dsg3/postgres/pg115_data
postgres 1103 1 0 11月14 ? 00:00:15 /usr/pgsql-10/bin/postmaster -D /dsg3/postgres/pg10_data/
postgres 1532 1099 0 11月14 ? 00:00:00 postgres: logger
postgres 1595 1103 0 11月14 ? 00:00:16 postgres: logger process
postgres 1674 1099 0 11月14 ? 00:00:00 postgres: checkpointer
postgres 1675 1099 0 11月14 ? 00:00:18 postgres: background writer
postgres 1676 1099 0 11月14 ? 00:00:18 postgres: walwriter
postgres 1677 1099 0 11月14 ? 00:00:12 postgres: autovacuum launcher
postgres 1678 1099 0 11月14 ? 00:00:39 postgres: archiver
postgres 1679 1099 0 11月14 ? 00:00:14 postgres: stats collector
postgres 1680 1099 0 11月14 ? 00:00:01 postgres: logical replication launcher
postgres 1682 1103 0 11月14 ? 00:00:00 postgres: checkpointer process
postgres 1683 1103 0 11月14 ? 00:00:19 postgres: writer process
postgres 1684 1103 0 11月14 ? 00:00:18 postgres: wal writer process
postgres 1685 1103 0 11月14 ? 00:00:13 postgres: autovacuum launcher process
postgres 1686 1103 0 11月14 ? 00:05:19 postgres: archiver process failed on 000000010000000000000001
postgres 1687 1103 0 11月14 ? 00:00:28 postgres: stats collector process
postgres 1688 1103 0 11月14 ? 00:00:01 postgres: bgworker: logical replication launcher
root 8779 8736 0 15:01 pts/0 00:00:00 su - postgres
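3. Create an empty file under $PGDATA/pg_wal (did not help)
--Stop the database
/usr/pgsql-10/bin/pg_ctl stop -D /dsg3/postgres/pg10_data/
--Create a file with the same name as the WAL file in the error
cd /dsg3/postgres/pg10_data/pg_wal
touch 000000010000000000000001
--Start the database
/usr/pgsql-10/bin/pg_ctl start -D /dsg3/postgres/pg10_data/
--Archiving still fails when checking the database status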
[root@localhost log]# ps -ef | grep postgres
postgres 1099 1 0 11月14 ? 00:00:05 /usr/pgsql-11/bin/postmaster -D /dsg3/postgres/pg115_data
postgres 1103 1 0 11月14 ? 00:00:15 /usr/pgsql-10/bin/postmaster -D /dsg3/postgres/pg10_data/
postgres 1532 1099 0 11月14 ? 00:00:00 postgres: logger
postgres 1595 1103 0 11月14 ? 00:00:16 postgres: logger process
postgres 1674 1099 0 11月14 ? 00:00:00 postgres: checkpointer
postgres 1675 1099 0 11月14 ? 00:00:18 postgres: background writer
postgres 1676 1099 0 11月14 ? 00:00:18 postgres: walwriter
postgres 1677 1099 0 11月14 ? 00:00:12 postgres: autovacuum launcher
postgres 1678 1099 0 11月14 ? 00:00:39 postgres: archiver
postgres 1679 1099 0 11月14 ? 00:00:14 postgres: stats collector
postgres 1680 1099 0 11月14 ? 00:00:01 postgres: logical replication launcher
postgres 1682 1103 0 11月14 ? 00:00:00 postgres: checkpointer process
postgres 1683 1103 0 11月14 ? 00:00:19 postgres: writer process
postgres 1684 1103 0 11月14 ? 00:00:18 postgres: wal writer process
postgres 1685 1103 0 11月14 ? 00:00:13 postgres: autovacuum launcher process
postgres 1686 1103 0 11月14 ? 00:05:19 postgres: archiver process failed on 000000010000000000000001
postgres 1687 1103 0 11月14 ? 00:00:28 postgres: stats collector process
postgres 1688 1103 0 11月14 ? 00:00:01 postgres: bgworker: logical replication launcher
root 8779 8736 0 15:01 pts/0 00:00:00 su - postgres
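Final solution
--Stop the database
/usr/pgsql-10/bin/pg_ctl stop -D /dsg3/postgres/pg10_data/
--Back up the data directory (if disk space allows, always take this backup just in case)
cd /dsg3/postgres/
cp -r pg10_data pg10_data_bak_20231201
--Change the following archive parameters in postgresql.conf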
-bash-4.2$ vi /dsg3/postgres/pg10_data/postgresql.conf
archive_mode = on
archive_command = 'ls -l /dsg3/postgres/pg10_data/pg_archive/'
--Restart the database
/usr/pgsql-10/bin/pg_ctl stop -D /dsg3/postgres/pg10_data/
/usr/pgsql-10/bin/pg_ctl start -D /dsg3/postgres/pg10_data/
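The idea behind this step is that an archive_command which always exits 0 (here ls -l on the archive directory, with /bin/true as an equivalent) lets the archiver mark every pending segment as done without copying anything. Those segments never reach the archive, so a fresh base backup is needed afterwards if point-in-time recovery matters. Swapping the command in and out can be scripted; the sketch below edits a postgresql.conf file in place. The function name is hypothetical, it drops any trailing comment on the replaced line, and it assumes the new command contains no single quote or backslash.

```shell
# Rewrite the first archive_command line (commented out or not) in a
# postgresql.conf file; append one if none exists.
# $1: path to postgresql.conf, $2: new command (no single quotes or backslashes)
set_archive_command() {
    tmp=$(mktemp) || return 1
    awk -v cmd="$2" '
        !seen && /^[ \t]*#?[ \t]*archive_command[ \t]*=/ {
            print "archive_command = \047" cmd "\047"; seen = 1; next
        }
        { print }
        END { if (!seen) print "archive_command = \047" cmd "\047" }
    ' "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

After set_archive_command /dsg3/postgres/pg10_data/postgresql.conf /bin/true, a pg_ctl reload -D /dsg3/postgres/pg10_data/ should be enough for the new command to take effect, since archive_command does not require a restart.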
--Check the database status
-bash-4.2$ ps -ef | grep postgres
postgres 1099 1 0 11月14 ? 00:00:06 /usr/pgsql-11/bin/postmaster -D /dsg3/postgres/pg115_data
postgres 1532 1099 0 11月14 ? 00:00:00 postgres: logger
postgres 1674 1099 0 11月14 ? 00:00:00 postgres: checkpointer
postgres 1675 1099 0 11月14 ? 00:00:18 postgres: background writer
postgres 1676 1099 0 11月14 ? 00:00:18 postgres: walwriter
postgres 1677 1099 0 11月14 ? 00:00:12 postgres: autovacuum launcher
postgres 1678 1099 0 11月14 ? 00:00:39 postgres: archiver
postgres 1679 1099 0 11月14 ? 00:00:14 postgres: stats collector
postgres 1680 1099 0 11月14 ? 00:00:01 postgres: logical replication launcher
root 12967 12922 0 15:56 pts/0 00:00:00 su - postgres
postgres 12968 12967 0 15:56 pts/0 00:00:00 -bash
root 13392 13350 0 16:00 pts/1 00:00:00 su - postgres
postgres 13393 13392 0 16:00 pts/1 00:00:00 -bash
root 15935 15815 0 16:34 pts/2 00:00:00 su - postgres
postgres 15936 15935 0 16:34 pts/2 00:00:00 -bash
postgres 17190 1 3 16:49 pts/0 00:00:00 /usr/pgsql-10/bin/postgres -D /dsg3/postgres/pg10_data
postgres 17191 17190 0 16:49 ? 00:00:00 postgres: logger process
postgres 17193 17190 0 16:49 ? 00:00:00 postgres: checkpointer process
postgres 17194 17190 0 16:49 ? 00:00:00 postgres: writer process
postgres 17195 17190 0 16:49 ? 00:00:00 postgres: wal writer process
postgres 17196 17190 0 16:49 ? 00:00:00 postgres: autovacuum launcher process
postgres 17197 17190 71 16:49 ? 00:00:04 postgres: archiver process last was 000000010000000100000074
postgres 17198 17190 0 16:49 ? 00:00:00 postgres: stats collector process
postgres 17199 17190 0 16:49 ? 00:00:00 postgres: bgworker: logical replication launcher
postgres 17584 12968 0 16:49 pts/0 00:00:00 ps -ef
postgres 17585 12968 0 16:49 pts/0 00:00:00 grep --color=auto postgres
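Running ps -ef | grep postgres repeatedly shows the segment name in
"archiver process last was 000000010000000100000074" changing constantly. This is expected: the archiver is working through the pending segments, so there is no need to worry.
Wait until it stops changing.
--Check the files under $PGDATA/pg_wal/archive_status/
-bash-4.2$ cd /dsg3/postgres/pg10_data/pg_wal/archive_status/
-bash-4.2$ ls -l *.ready
-bash-4.2$ ls -l *.done
All the files that used to end in .ready now end in .done.
Note: a .ready file marks a segment that still needs to be archived, and a .done file marks one that has already been archived.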
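Instead of re-running ps until the archiver title stops changing, the wait can be automated by polling archive_status until no .ready markers remain. This is a minimal sketch with a hypothetical function name; the interval and the number of attempts are arbitrary parameters chosen by the caller.

```shell
# Poll until no .ready markers remain, or give up after a number of checks.
# $1: archive_status directory, $2: seconds between checks, $3: max checks
wait_for_archive_drain() {
    tries=0
    while [ "$tries" -lt "$3" ]; do
        pending=$(( $(find "$1" -maxdepth 1 -type f -name '*.ready' | wc -l) ))
        if [ "$pending" -eq 0 ]; then
            return 0
        fi
        echo "$pending segment(s) still pending" >&2
        tries=$(( tries + 1 ))
        sleep "$2"
    done
    return 1
}
```

For instance, wait_for_archive_drain /dsg3/postgres/pg10_data/pg_wal/archive_status 5 120 checks every 5 seconds for up to 10 minutes and returns 0 once the backlog is gone.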
--Re-enable real archiving: edit postgresql.conf and set the following archive parameters
-bash-4.2$ vi /dsg3/postgres/pg10_data/postgresql.conf
archive_mode = on
archive_command = 'cp %p /dsg3/postgres/pg10_data/pg_archive/%f'
--Restart the database
/usr/pgsql-10/bin/pg_ctl stop -D /dsg3/postgres/pg10_data/
/usr/pgsql-10/bin/pg_ctl start -D /dsg3/postgres/pg10_data/
--Check the database status
-bash-4.2$ ps -ef | grep postgres
postgres 1099 1 0 11月14 ? 00:00:06 /usr/pgsql-11/bin/postmaster -D /dsg3/postgres/pg115_data
postgres 1532 1099 0 11月14 ? 00:00:00 postgres: logger
postgres 1674 1099 0 11月14 ? 00:00:00 postgres: checkpointer
postgres 1675 1099 0 11月14 ? 00:00:18 postgres: background writer
postgres 1676 1099 0 11月14 ? 00:00:18 postgres: walwriter
postgres 1677 1099 0 11月14 ? 00:00:14 postgres: autovacuum launcher
postgres 1678 1099 0 11月14 ? 00:00:39 postgres: archiver
postgres 1679 1099 0 11月14 ? 00:00:15 postgres: stats collector
postgres 1680 1099 0 11月14 ? 00:00:01 postgres: logical replication launcher
root 9783 16354 0 17:00 pts/3 00:00:00 su - postgres
postgres 9784 9783 0 17:00 pts/3 00:00:00 -bash
root 10888 10844 0 17:14 pts/4 00:00:00 su - postgres
postgres 10889 10888 0 17:14 pts/4 00:00:00 -bash
root 12967 12922 0 15:56 pts/0 00:00:00 su - postgres
postgres 12968 12967 0 15:56 pts/0 00:00:00 -bash
root 13392 13350 0 16:00 pts/1 00:00:00 su - postgres
postgres 13393 13392 0 16:00 pts/1 00:00:00 -bash
postgres 15098 1 0 18:16 pts/4 00:00:00 /usr/pgsql-10/bin/postgres -D /dsg3/postgres/pg10_data
postgres 15099 15098 0 18:16 ? 00:00:00 postgres: logger process
postgres 15101 15098 0 18:16 ? 00:00:00 postgres: checkpointer process
postgres 15102 15098 0 18:16 ? 00:00:00 postgres: writer process
postgres 15103 15098 0 18:16 ? 00:00:00 postgres: wal writer process
postgres 15104 15098 0 18:16 ? 00:00:00 postgres: autovacuum launcher process
postgres 15105 15098 0 18:16 ? 00:00:00 postgres: archiver process last was 000000010000005C000000D1
postgres 15106 15098 0 18:16 ? 00:00:00 postgres: stats collector process
postgres 15107 15098 0 18:16 ? 00:00:00 postgres: bgworker: logical replication launcher
postgres 15182 10889 0 18:17 pts/4 00:00:00 ps -ef
postgres 15183 10889 0 18:17 pts/4 00:00:00 grep --color=auto postgres
root 15935 15815 0 16:34 pts/2 00:00:00 su - postgres
postgres 15936 15935 0 16:34 pts/2 00:00:00 -bash
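The problem was finally resolved. It was only a test database, but with 157 GB of data it still gave me a real scare. Whether in test or production, stay cautious: lost data cannot be recreated.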