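1 Background
The checkpointer is a PostgreSQL background process that flushes to disk the dirty data that must be persisted. This includes the SLRU buffer pools for CLOG and SUBTRANS, the dirty buffers in shared memory, and two-phase commit state. It also updates the control file and removes obsolete WAL files. After a crash, the server reads the most recent checkpoint record and replays WAL starting from its redo point, which restores the database to a consistent state and preserves data integrity (see the official PostgreSQL documentation).
2 Key Data Structures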
2.1 CheckPoint
This structure holds the contents of a checkpoint WAL record: the redo location, TimeLineID, nextXid, nextOid, and so on.
/*
* Body of CheckPoint XLOG records. This is declared here because we keep
* a copy of the latest one in pg_control for possible disaster recovery.
* Changing this struct requires a PG_CONTROL_VERSION bump.
*/
typedef struct CheckPoint
{
XLogRecPtr redo; /* next RecPtr available when we began to
* create CheckPoint (i.e. REDO start point) */
TimeLineID ThisTimeLineID; /* current TLI */
TimeLineID PrevTimeLineID; /* previous TLI, if this record begins a new
* timeline (equals ThisTimeLineID otherwise) */
bool fullPageWrites; /* current full_page_writes */
FullTransactionId nextXid; /* next free transaction ID */
Oid nextOid; /* next free OID */
MultiXactId nextMulti; /* next free MultiXactId */
MultiXactOffset nextMultiOffset; /* next free MultiXact offset */
TransactionId oldestXid; /* cluster-wide minimum datfrozenxid */
Oid oldestXidDB; /* database with minimum datfrozenxid */
MultiXactId oldestMulti; /* cluster-wide minimum datminmxid */
Oid oldestMultiDB; /* database with minimum datminmxid */
pg_time_t time; /* time stamp of checkpoint */
TransactionId oldestCommitTsXid; /* oldest Xid with valid commit
* timestamp */
TransactionId newestCommitTsXid; /* newest Xid with valid commit
* timestamp */
/*
* Oldest XID still running. This is only needed to initialize hot standby
* mode from an online checkpoint, so we only bother calculating this for
* online checkpoints and only when wal_level is replica. Otherwise it's
* set to InvalidTransactionId.
*/
TransactionId oldestActiveXid;
} CheckPoint;
2.2 CheckpointerShmemStruct
This structure holds the state used for communication between the checkpointer process and other processes. It lives in shared memory.
typedef struct
{
pid_t checkpointer_pid; /* PID (0 if not started) */
slock_t ckpt_lck; /* protects all the ckpt_* fields */
int ckpt_started; /* advances when checkpoint starts (a counter) */
int ckpt_done; /* advances when checkpoint done */
int ckpt_failed; /* advances when checkpoint fails */
int ckpt_flags; /* checkpoint flags, as defined in xlog.h */
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
uint32 num_backend_writes; /* counts user backend buffer writes */
uint32 num_backend_fsync; /* counts user backend fsync calls */
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
} CheckpointerShmemStruct;
static CheckpointerShmemStruct *CheckpointerShmem;
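The ckpt_started and ckpt_done counters let a process that requested a checkpoint tell whether a checkpoint that began after its request has finished. Below is a minimal single-process sketch of that handshake. It is not the PostgreSQL implementation: the CkptCounters struct and all helper names are hypothetical, and the spinlock and condition variables are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the ckpt_* counters. */
typedef struct CkptCounters
{
    int ckpt_started;   /* advances when a checkpoint starts */
    int ckpt_done;      /* advances when a checkpoint finishes */
    int ckpt_failed;    /* advances when a checkpoint fails */
} CkptCounters;

/* Checkpointer side: acknowledge that a new checkpoint has begun. */
void
checkpoint_begin(CkptCounters *c)
{
    c->ckpt_started++;
}

/* Checkpointer side: publish the end (success or failure) of the checkpoint. */
void
checkpoint_end(CkptCounters *c, bool failed)
{
    if (failed)
        c->ckpt_failed++;
    c->ckpt_done = c->ckpt_started;
}

/* Requester side, step 1: has a checkpoint started since we looked? */
bool
checkpoint_has_started(const CkptCounters *c, int old_started)
{
    return c->ckpt_started != old_started;
}

/*
 * Requester side, step 2: has the checkpoint numbered new_started finished?
 * The subtraction is done in unsigned arithmetic so that the comparison
 * still behaves sensibly if the counters ever wrap around.
 */
bool
checkpoint_has_finished(const CkptCounters *c, int new_started)
{
    return (int) ((unsigned) c->ckpt_done - (unsigned) new_started) >= 0;
}
```

A requester would remember ckpt_started before making its request, wait until checkpoint_has_started() is true, capture the new value, and then wait until checkpoint_has_finished() is true. Comparing ckpt_failed before and after tells it whether that checkpoint succeeded.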
Flag bits stored in ckpt_flags:
/* These directly affect the behavior of CreateCheckPoint and subsidiaries */
#define CHECKPOINT_IS_SHUTDOWN 0x0001 /* Checkpoint is for shutdown */
#define CHECKPOINT_END_OF_RECOVERY 0x0002 /* Like shutdown checkpoint, but
* issued at end of WAL recovery */
#define CHECKPOINT_IMMEDIATE 0x0004 /* Do it without delays */
#define CHECKPOINT_FORCE 0x0008 /* Force even if no activity */
#define CHECKPOINT_FLUSH_ALL 0x0010 /* Flush all pages, including those
* belonging to unlogged tables */
/* These are important to RequestCheckpoint */
#define CHECKPOINT_WAIT 0x0020 /* Wait for completion */
#define CHECKPOINT_REQUESTED 0x0040 /* Checkpoint request has been made */
/* These indicate the cause of a checkpoint request */
#define CHECKPOINT_CAUSE_XLOG 0x0080 /* XLOG consumption */
#define CHECKPOINT_CAUSE_TIME 0x0100 /* Elapsed time */
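Because each flag occupies its own bit, several of them can be combined into one int with bitwise OR and tested with bitwise AND. The sketch below decodes a flag word into readable names. The flag values are copied from the definitions above; the helper describe_ckpt_flags and the short names it prints are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Flag values copied from the definitions above. */
#define CHECKPOINT_IS_SHUTDOWN      0x0001
#define CHECKPOINT_END_OF_RECOVERY  0x0002
#define CHECKPOINT_IMMEDIATE        0x0004
#define CHECKPOINT_FORCE            0x0008
#define CHECKPOINT_FLUSH_ALL        0x0010
#define CHECKPOINT_WAIT             0x0020
#define CHECKPOINT_REQUESTED        0x0040
#define CHECKPOINT_CAUSE_XLOG       0x0080
#define CHECKPOINT_CAUSE_TIME       0x0100

/*
 * Hypothetical helper: write the names of all set bits into buf,
 * separated by single spaces. buflen must be at least 1.
 */
void
describe_ckpt_flags(int flags, char *buf, size_t buflen)
{
    static const struct
    {
        int         bit;
        const char *name;
    } names[] = {
        {CHECKPOINT_IS_SHUTDOWN, "shutdown"},
        {CHECKPOINT_END_OF_RECOVERY, "end-of-recovery"},
        {CHECKPOINT_IMMEDIATE, "immediate"},
        {CHECKPOINT_FORCE, "force"},
        {CHECKPOINT_FLUSH_ALL, "flush-all"},
        {CHECKPOINT_WAIT, "wait"},
        {CHECKPOINT_REQUESTED, "requested"},
        {CHECKPOINT_CAUSE_XLOG, "xlog"},
        {CHECKPOINT_CAUSE_TIME, "time"},
    };
    size_t i;

    buf[0] = '\0';
    for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
    {
        if (flags & names[i].bit)
        {
            size_t used = strlen(buf);

            /* snprintf truncates safely if the buffer is too small */
            snprintf(buf + used, buflen - used, "%s%s",
                     used > 0 ? " " : "", names[i].name);
        }
    }
}
```

For example, the flag word 0x006C decodes to "immediate force wait requested", and a purely time-driven checkpoint carries only "time".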
3 Main Flow of a Checkpoint
Reading the source code (PostgreSQL 14), the execution flow of CheckpointerMain can be summarized as follows:
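CheckpointerMain
-> CreateCheckPoint
   -> CheckPointGuts (flushes everything the checkpoint must persist, including all dirty pages)
      -> CheckPointBuffers (flushes the dirty pages in shared buffers)
         -> BufferSync (writes out all dirty pages in the buffer pool)
            -> SyncOneBuffer (handles a single buffer)
               -> FlushBuffer (performs the actual page write)
                  -> XLogFlush (flushes WAL up to the page's LSN before the page is written)
      -> ProcessSyncRequests (issues fsync so that the pages written by BufferSync really reach disk)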
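The tree shows FlushBuffer calling XLogFlush, and ProcessSyncRequests running after BufferSync. The reason is the write-ahead rule: the WAL describing a page's latest change must be durable before the page itself is written, and the written pages only count as persistent once they have been fsynced. The toy model below illustrates that ordering. Every type and function in it (ToyPage, ToyState, toy_*) is hypothetical and only mimics the shape of the real functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* toy WAL position */

typedef struct ToyPage
{
    Lsn  lsn;                   /* LSN of the last WAL record that touched the page */
    bool dirty;
} ToyPage;

typedef struct ToyState
{
    Lsn wal_flushed_upto;       /* WAL is durable up to this position */
    int pages_written;          /* pages handed to the OS so far */
    int pending_fsync;          /* written pages not yet fsynced */
} ToyState;

/* Toy counterpart of XLogFlush: make WAL durable up to 'upto'. */
void
toy_xlog_flush(ToyState *st, Lsn upto)
{
    if (upto > st->wal_flushed_upto)
        st->wal_flushed_upto = upto;
}

/* Toy counterpart of FlushBuffer: flush WAL first, then write the page. */
void
toy_flush_buffer(ToyState *st, ToyPage *page)
{
    toy_xlog_flush(st, page->lsn);
    assert(st->wal_flushed_upto >= page->lsn);  /* write-ahead rule holds */
    page->dirty = false;
    st->pages_written++;
    st->pending_fsync++;
}

/* Toy counterpart of BufferSync: write every dirty page, return the count. */
int
toy_buffer_sync(ToyState *st, ToyPage *pages, size_t npages)
{
    int written = 0;

    for (size_t i = 0; i < npages; i++)
    {
        if (pages[i].dirty)
        {
            toy_flush_buffer(st, &pages[i]);
            written++;
        }
    }
    return written;
}

/* Toy counterpart of ProcessSyncRequests: fsync everything written so far. */
void
toy_process_sync_requests(ToyState *st)
{
    st->pending_fsync = 0;
}
```

In this model a checkpoint is complete only after toy_process_sync_requests has brought pending_fsync back to zero, which mirrors why the fsync step follows the buffer writes in the tree.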
The entry function CheckpointerMain is long, so let's go through it piece by piece to see how it works:
void
CheckpointerMain(void)
{
sigjmp_buf local_sigjmp_buf;
MemoryContext checkpointer_context;
CheckpointerShmem->checkpointer_pid = MyProcPid;
/*
* Properly accept or ignore signals the postmaster might send us
*
* Note: we deliberately ignore SIGTERM, because during a standard Unix
* system shutdown cycle, init will SIGTERM all processes at once. We
* want to wait for the backends to exit, whereupon the postmaster will
* tell us it's okay to shut down (via SIGUSR2).
*/
// Register signal handlers: SIGHUP reloads the configuration file, SIGINT requests a checkpoint, SIGPIPE is ignored, SIGUSR2 requests shutdown, and so on
pqsignal(SIGHUP, SignalHandlerForConfigReload);
pqsignal(SIGINT, ReqCheckpointHandler); /* request checkpoint */
pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
/* SIGQUIT handler was already set up by InitPostmasterChild */
pqsignal(SIGALRM, SIG_IGN);
pqsignal(SIGPIPE, SIG_IGN);
pqsignal(SIGUSR1, procsignal_sigusr1_handler);
pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);
/*
* Reset some signals that are accepted by postmaster but not here
*/
pqsignal(SIGCHLD, SIG_DFL);
/*
* Initialize so that first time-driven event happens at the correct time.
*/
last_checkpoint_time = last_xlog_switch_time = (pg_time_t) time(NULL);
/*
* Create a memory context that we will do all our work in. We do this so
* that we can reset the context during error recovery and thereby avoid
* possible memory leaks. Formerly this code just ran in
* TopMemoryContext, but resetting that would be a really bad idea.
*/
// Create the memory context in which the checkpointer does its work
checkpointer_context = AllocSetContextCreate(TopMemoryContext,
"Checkpointer",
ALLOCSET_DEFAULT_SIZES);
MemoryContextSwitchTo(checkpointer_context);
/*
* If an exception is encountered, processing resumes here.
*
* You might wonder why this isn't coded as an infinite loop around a
* PG_TRY construct. The reason is that this is the bottom of the
* exception stack, and so with PG_TRY there would be no exception handler
* in force at all during the CATCH part. By leaving the outermost setjmp
* always active, we have at least some chance of recovering from an error
* during error recovery. (If we get into an infinite loop thereby, it
* will soon be stopped by overflow of elog.c's internal state stack.)
*
* Note that we use sigsetjmp(..., 1), so that the prevailing signal mask
* (to wit, BlockSig) will be restored when longjmp'ing to here. Thus,
* signals other than SIGQUIT will be blocked until we complete error
* recovery. It might seem that this policy makes the HOLD_INTERRUPTS()
* call redundant, but it is not since InterruptPending might be set
* already.
*/
// Set up the exception jump point (sigsetjmp) used for error recovery
if (sigsetjmp(local_sigjmp_buf, 1) != 0)
{
/* Since not using PG_TRY, must reset error stack by hand */
error_context_stack = NULL;
/* Prevent interrupts while cleaning up */
HOLD_INTERRUPTS();
/* Report the error to the server log */
EmitErrorReport();
/*
* These operations are really just a minimal subset of
* AbortTransaction(). We don't have very many resources to worry
* about in checkpointer, but we do have LWLocks, buffers, and temp
* files.
*/
LWLockReleaseAll();
ConditionVariableCancelSleep();
pgstat_report_wait_end();
AbortBufferIO();
UnlockBuffers();
ReleaseAuxProcessResources(false);
AtEOXact_Buffers(false);
AtEOXact_SMgr();
AtEOXact_Files(false);
AtEOXact_HashTables(false);
/* Warn any waiting backends that the checkpoint failed. */
if (ckpt_active)
{
SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
CheckpointerShmem->ckpt_failed++;
CheckpointerShmem->ckpt_done = CheckpointerShmem->ckpt_started;
SpinLockRelease(&CheckpointerShmem->ckpt_lck);
ConditionVariableBroadcast(&CheckpointerShmem->done_cv);
ckpt_active = false;
}
/*
* Now return to normal top-level context and clear ErrorContext for
* next time.
*/
MemoryContextSwitchTo(checkpointer_context);
FlushErrorState();
/* Flush any leaked data in the top-level context */
MemoryContextResetAndDeleteChildren(checkpointer_context);
/* Now we can allow interrupts again */
RESUME_INTERRUPTS();
/*
* Sleep at least 1 second after any error. A write error is likely
* to be repeated, and we don't want to be filling the error logs as
* fast as we can.
*/
pg_usleep(1000000L);
/*
* Close all open files after any error. This is helpful on Windows,
* where holding deleted files open causes various strange errors.
* It's not clear we need it elsewhere, but shouldn't hurt.
*/
smgrcloseall();
}
/* We can now handle ereport(ERROR) */
PG_exception_stack = &local_sigjmp_buf;
/*
* Unblock signals (they were blocked when the postmaster forked us)
*/
PG_SETMASK(&UnBlockSig);
/*
* Ensure all shared memory values are set correctly for the config. Doing
* this here ensures no race conditions from other concurrent updaters.
*/
UpdateSharedMemoryConfig();
/*
* Advertise our latch that backends can use to wake us up while we're
* sleeping.
*/
ProcGlobal->checkpointerLatch = &MyProc->procLatch;
/*
* Loop forever
*/
for (;;)
{
bool do_checkpoint = false;
int flags = 0;
pg_time_t now;
int elapsed_secs;
int cur_timeout;
/* Clear any already-pending wakeups */
ResetLatch(MyLatch); // reset the latch
/*
* Process any requests or signals received recently.
*/
AbsorbSyncRequests();
HandleCheckpointerInterrupts();
/*
* Detect a pending checkpoint request by checking whether the flags
* word in shared memory is nonzero. We shouldn't need to acquire the
* ckpt_lck for this.
*/
if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
{
do_checkpoint = true;
BgWriterStats.m_requested_checkpoints++;
}
/*
* Force a checkpoint if too much time has elapsed since the last one.
* Note that we count a timed checkpoint in stats only when this
* occurs without an external request, but we set the CAUSE_TIME flag
* bit even if there is also an external request.
*/
now = (pg_time_t) time(NULL);
elapsed_secs = now - last_checkpoint_time;
if (elapsed_secs >= CheckPointTimeout) // checkpoint trigger: the time since the last checkpoint has reached CheckPointTimeout
{ // checkpoint_timeout defaults to 300 s and is set in the configuration file
if (!do_checkpoint)
BgWriterStats.m_timed_checkpoints++;
do_checkpoint = true;
flags |= CHECKPOINT_CAUSE_TIME; // mark this checkpoint as caused by the timeout
}
/*
* Do a checkpoint if requested.
*/
if (do_checkpoint)
{
bool ckpt_performed = false;
bool do_restartpoint;
/*
* Check if we should perform a checkpoint or a restartpoint. As a
* side-effect, RecoveryInProgress() initializes TimeLineID if
* it's not set yet.
*/
do_restartpoint = RecoveryInProgress();
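// A restartpoint is the recovery-time counterpart of a checkpoint: recovery can itself
// fail, and a restartpoint lets it resume without replaying all of the WAL again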
/*
* Atomically fetch the request flags to figure out what kind of a
* checkpoint we should perform, and increase the started-counter
* to acknowledge that we've started a new checkpoint.
*/
SpinLockAcquire(&CheckpointerShmem->ckpt_lck); // acquire ckpt_lck
flags |= CheckpointerShmem->ckpt_flags;
CheckpointerShmem->ckpt_flags = 0;
CheckpointerShmem->ckpt_started++; // advance the ckpt_started counter
SpinLockRelease(&CheckpointerShmem->ckpt_lck);
ConditionVariableBroadcast(&CheckpointerShmem->start_cv); // wake processes waiting for the checkpoint to start
/*
* The end-of-recovery checkpoint is a real checkpoint that's
* performed while we're still in recovery.
*/
if (flags & CHECKPOINT_END_OF_RECOVERY) // recovery is finishing, so do a real checkpoint, not a restartpoint
do_restartpoint = false;
/*
* We will warn if (a) too soon since last checkpoint (whatever
* caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
* since the last checkpoint start. Note in particular that this
* implementation will not generate warnings caused by
* CheckPointTimeout < CheckPointWarning.
*/
if (!do_restartpoint &&
(flags & CHECKPOINT_CAUSE_XLOG) &&
elapsed_secs < CheckPointWarning)
ereport(LOG,
(errmsg_plural("checkpoints are occurring too frequently (%d second apart)",
"checkpoints are occurring too frequently (%d seconds apart)",
elapsed_secs,
elapsed_secs),
errhint("Consider increasing the configuration parameter \"max_wal_size\".")));
/*
* Initialize checkpointer-private variables used during
* checkpoint.
*/
ckpt_active = true; // checkpointer-private variable: a checkpoint is in progress
if (do_restartpoint)
ckpt_start_recptr = GetXLogReplayRecPtr(NULL);
else
ckpt_start_recptr = GetInsertRecPtr(); // WAL position at the start of the checkpoint
ckpt_start_time = now; // start timestamp
ckpt_cached_elapsed = 0;
/*
* Do the checkpoint.
*/
if (!do_restartpoint)
{
CreateCheckPoint(flags); // creates the checkpoint; covered in the next article
ckpt_performed = true;
}
else
ckpt_performed = CreateRestartPoint(flags);
/*
* After any checkpoint, close all smgr files. This is so we
* won't hang onto smgr references to deleted files indefinitely.
*/
smgrcloseall(); // close all smgr files after the checkpoint
/*
* Indicate checkpoint completion to any waiting backends.
*/
SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
CheckpointerShmem->ckpt_done = CheckpointerShmem->ckpt_started; // update ckpt_done so that waiting
// backends can see the checkpoint has finished
SpinLockRelease(&CheckpointerShmem->ckpt_lck);
ConditionVariableBroadcast(&CheckpointerShmem->done_cv); // wake all waiters
if (ckpt_performed) // checkpoint was performed: update last_checkpoint_time
{
/*
* Note we record the checkpoint start time not end time as
* last_checkpoint_time. This is so that time-driven
* checkpoints happen at a predictable spacing.
*/
last_checkpoint_time = now;
}
else
{ // CreateRestartPoint() returned false, so ckpt_performed is false; the likely cause
// is that no new checkpoint WAL record has arrived since the last restartpoint,
// so try again in 15 s
/*
* We were not able to perform the restartpoint (checkpoints
* throw an ERROR in case of error). Most likely because we
* have not received any new checkpoint WAL records since the
* last restartpoint. Try again in 15 s.
*/
last_checkpoint_time = now - CheckPointTimeout + 15;
}
ckpt_active = false; // no checkpoint in progress any more
}
/* Check for archive_timeout and switch xlog files if necessary. */
CheckArchiveTimeout(); // check archive_timeout and switch WAL segments if needed
/*
* Send off activity statistics to the stats collector. (The reason
* why we re-use bgwriter-related code for this is that the bgwriter
* and checkpointer used to be just one process. It's probably not
* worth the trouble to split the stats support into two independent
* stats message types.)
*/
pgstat_send_bgwriter(); // send statistics; the bgwriter message type is reused because the two used to be one process
/* Send WAL statistics to the stats collector. */
pgstat_send_wal(true); // send WAL statistics to the stats collector
/*
* If any checkpoint flags have been set, redo the loop to handle the
* checkpoint without sleeping.
*/
// ckpt_flags was reset to 0 above; if it is nonzero now, another backend has made a
// new request, so go back to the top of the loop instead of going to sleep
if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
continue;
/*
* Sleep until we are signaled or it's time for another checkpoint or
* xlog file switch.
*/
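// Sleep until a signal arrives, the checkpoint timeout expires, or a WAL segment switch is due
now = (pg_time_t) time(NULL); // refresh the timestamp
elapsed_secs = now - last_checkpoint_time;
if (elapsed_secs >= CheckPointTimeout)
continue; /* no sleep for us ... */ // already overdue: go back to the top of the loop
cur_timeout = CheckPointTimeout - elapsed_secs; // time to sleep, in seconds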
if (XLogArchiveTimeout > 0 && !RecoveryInProgress())
{
elapsed_secs = now - last_xlog_switch_time; // with XLogArchiveTimeout set: skip the sleep if a
// segment switch is overdue, otherwise shorten the sleep
if (elapsed_secs >= XLogArchiveTimeout)
continue; /* no sleep for us ... */
cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);
}
// Wait on the latch
(void) WaitLatch(MyLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, // wake when the latch is set or the timeout expires; exit if the postmaster dies
cur_timeout * 1000L /* convert to ms */ ,
WAIT_EVENT_CHECKPOINTER_MAIN);
}
}
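The tail of the loop decides how long the checkpointer may sleep. The sketch below restates that arithmetic as a pure function so the cases are easy to follow. The function names and parameters are hypothetical; the logic is a simplified reading of the loop above, with 0 standing for "do not sleep, loop again".

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the end of the main loop.
 * Returns the number of seconds to sleep, or 0 if the loop should
 * continue immediately because a checkpoint or segment switch is due.
 */
int
compute_sleep_secs(long now,
                   long last_checkpoint_time,
                   long last_xlog_switch_time,
                   int checkpoint_timeout,
                   int archive_timeout,
                   bool in_recovery)
{
    int elapsed_secs = (int) (now - last_checkpoint_time);
    int cur_timeout;

    if (elapsed_secs >= checkpoint_timeout)
        return 0;               /* a timed checkpoint is already due */
    cur_timeout = checkpoint_timeout - elapsed_secs;

    /* archive_timeout only matters outside recovery */
    if (archive_timeout > 0 && !in_recovery)
    {
        elapsed_secs = (int) (now - last_xlog_switch_time);
        if (elapsed_secs >= archive_timeout)
            return 0;           /* a segment switch is already due */
        if (archive_timeout - elapsed_secs < cur_timeout)
            cur_timeout = archive_timeout - elapsed_secs;
    }
    return cur_timeout;
}

/*
 * Hypothetical helper for the failed-restartpoint branch: back-date
 * last_checkpoint_time so that the next attempt is due in 15 seconds.
 */
long
retry_restartpoint_time(long now, int checkpoint_timeout)
{
    return now - checkpoint_timeout + 15;
}
```

With a 300 s checkpoint timeout and a 60 s archive timeout, a checkpoint 100 s ago and a segment switch 40 s ago give a 20 s sleep, because the segment switch comes due first. Back-dating last_checkpoint_time is what turns "retry in 15 s" into an ordinary timeout expiry.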
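4 Checkpoint Triggers
1 A superuser runs the CHECKPOINT command (in the PostgreSQL 14 code discussed here, other users cannot)
2 The database is shut down
3 Database recovery completes
4 The amount of WAL written reaches the threshold that triggers a checkpoint
5 The background process performs a checkpoint periodically
PostgreSQL has a dedicated background process, the checkpointer, that performs checkpoints periodically. Checkpoints triggered in the other ways are also handed to this process by signaling it. For example, when a superuser runs CHECKPOINT, PostgreSQL calls RequestCheckpoint, which records the request flags in shared memory and then uses kill to send SIGINT (signal 2) to the checkpointer so that it performs the checkpoint.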
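The request path described above can be sketched in a few lines: the requester sets flag bits in a shared word and then signals the checkpointer, whose handler only needs to wake the main loop. The following single-process model uses raise() in place of kill() and a plain global in place of shared memory. All toy_* names are hypothetical, and the flag values are copied from the definitions in section 2.2.

```c
#include <assert.h>
#include <signal.h>

/* Flag values copied from the ckpt_flags definitions in section 2.2. */
#define CHECKPOINT_IMMEDIATE  0x0004
#define CHECKPOINT_FORCE      0x0008
#define CHECKPOINT_WAIT       0x0020
#define CHECKPOINT_REQUESTED  0x0040

/* Toy stand-in for the shared ckpt_flags word (one process, no locking). */
int toy_ckpt_flags = 0;

/* Set by the signal handler; plays the role of the latch being set. */
volatile sig_atomic_t toy_wakeup = 0;

/* Toy SIGINT handler: the flags are already set, so just wake the loop. */
static void
toy_req_checkpoint_handler(int signo)
{
    (void) signo;
    toy_wakeup = 1;
}

/* Must be called before any request, or SIGINT would end the process. */
void
toy_install_handler(void)
{
    signal(SIGINT, toy_req_checkpoint_handler);
}

/* Requester side: publish the flags, then signal the checkpointer. */
void
toy_request_checkpoint(int flags)
{
    toy_ckpt_flags |= (flags | CHECKPOINT_REQUESTED);
    raise(SIGINT);  /* a real requester would kill() the checkpointer PID */
}

/*
 * Checkpointer side: one loop iteration. Returns the flags of the
 * request it picked up, or 0 if nothing was pending.
 */
int
toy_checkpointer_iteration(void)
{
    int flags = 0;

    toy_wakeup = 0;             /* corresponds to resetting the latch */
    if (toy_ckpt_flags)
    {
        flags = toy_ckpt_flags;
        toy_ckpt_flags = 0;     /* consume the request */
    }
    return flags;
}
```

Setting the flags before sending the signal matters: by the time the checkpointer wakes up, the request is already visible, so the handler itself carries no information beyond "look again".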
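Checkpoint-related parameters:
1. checkpoint_timeout:
The maximum time between automatic WAL checkpoints (default 5 minutes). Increasing this parameter can increase the time needed for crash recovery.
2. max_wal_size:
The maximum size the WAL is allowed to grow to between automatic WAL checkpoints. The default is 1 GB. Increasing this parameter can increase the time needed for crash recovery.
If both parameters are set, a checkpoint is triggered by whichever limit is reached first.
3. min_wal_size:
As long as WAL disk usage stays below this setting, old WAL files are always recycled at checkpoints for future use rather than removed. This can be used to reserve enough WAL space to handle spikes in WAL usage, for example when running large batch jobs. (The default is 80 MB.)
4. checkpoint_completion_target:
A checkpoint occurs every 5 minutes or whenever the max_wal_size threshold is reached. If all dirty pages in shared buffers were flushed to disk at once at that moment, the result would be a large I/O burst.
checkpoint_completion_target exists to avoid this.
It slows the flushing down: PostgreSQL aims to spread the writes over checkpoint_completion_target * checkpoint_timeout.
For example, with checkpoint_completion_target set to 0.5 and the default checkpoint_timeout of 5 minutes, the database throttles its writes so that the last write completes after about 2.5 minutes.
5. wal_buffers:
The amount of shared memory used for WAL data that has not yet been written to disk. The default setting of -1 selects a size equal to 1/32 (about 3%) of shared_buffers, but not less than 64 kB and not more than the size of one WAL segment, typically 16 MB.
6. checkpoint_flush_after:
Whenever more than checkpoint_flush_after bytes have been written while performing a checkpoint, PostgreSQL attempts to force the OS to flush these writes to the underlying storage. Doing so limits the amount of dirty data in the kernel page cache, which reduces the likelihood of stalls when fsync is issued at the end of the checkpoint.
This setting may have no effect on some platforms.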
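The arithmetic behind checkpoint_completion_target can be made concrete with a small sketch. It computes the intended write window and a time-only "on schedule" test: the fraction of work done, scaled by the target, must not fall behind the fraction of checkpoint_timeout that has elapsed. Both function names are hypothetical, and the check is a simplification that ignores WAL volume.

```c
#include <assert.h>
#include <stdbool.h>

/* Intended duration of the write phase, in seconds. */
double
target_write_secs(int checkpoint_timeout_secs, double completion_target)
{
    return checkpoint_timeout_secs * completion_target;
}

/*
 * Simplified, time-only pacing check (an assumption of this sketch).
 * progress:     fraction of dirty buffers already written, 0.0 to 1.0
 * elapsed_secs: seconds since the checkpoint started
 * Returns true if writing is on or ahead of schedule, so the writer may
 * pause; false means it is behind and should keep writing without delay.
 */
bool
checkpoint_on_schedule(double progress,
                       double elapsed_secs,
                       int checkpoint_timeout_secs,
                       double completion_target)
{
    double elapsed_fraction = elapsed_secs / checkpoint_timeout_secs;

    return progress * completion_target >= elapsed_fraction;
}
```

With checkpoint_timeout = 300 s and a target of 0.5, the write window is 150 s. Having written half of the buffers after 60 s is ahead of schedule, while the same progress after 90 s is behind.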