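This section walks through the source of CreateCheckPoint, the function that CheckpointerMain calls to perform a checkpoint. For background, see part 1 of this series, "postgres checkpoint源码学习-1".
Source code analysis
CreateCheckPoint does four main things:
1) Flush dirty data: dirty data pages in shared buffers and dirty CLOG pages held in the SLRU caches;
2) Write the checkpoint WAL record;
3) Update the control file;
4) Remove old WAL segment files (only those that meet the removal conditions).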
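Before any of these steps, the listing that follows classifies the request by its flags bitmask. As a reading aid, here is a minimal sketch of the two tests it applies. The helper names are hypothetical and the bit values are assumed for illustration only; the real definitions live in PostgreSQL's access/xlog.h and may differ between versions.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values assumed for illustration; check access/xlog.h for your version. */
#define CHECKPOINT_IS_SHUTDOWN      0x0001
#define CHECKPOINT_END_OF_RECOVERY  0x0002
#define CHECKPOINT_IMMEDIATE        0x0004
#define CHECKPOINT_FORCE            0x0008
#define CHECKPOINT_FLUSH_ALL        0x0010

/*
 * Hypothetical helper: mirrors how the listing sets its local "shutdown"
 * variable. An end-of-recovery checkpoint is treated like a shutdown one.
 */
static bool
is_shutdown_style(int flags)
{
    return (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) != 0;
}

/*
 * Hypothetical helper: mirrors the "skip if idle" test. Only checkpoints that
 * are neither shutdown, end-of-recovery, nor forced may be skipped when no
 * important WAL has been written since the previous checkpoint.
 */
static bool
may_skip_when_idle(int flags)
{
    return (flags & (CHECKPOINT_IS_SHUTDOWN |
                     CHECKPOINT_END_OF_RECOVERY |
                     CHECKPOINT_FORCE)) == 0;
}
```

For example, a request carrying only CHECKPOINT_IMMEDIATE is an online checkpoint that may be skipped on an idle system, while adding CHECKPOINT_FORCE makes it run regardless.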
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
// a checkpoint runs both at database shutdown and whenever a trigger condition is met
* flags is a bitwise OR of the following:
* CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown. // the database is shutting down
* CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery. // run once WAL recovery has finished
* CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP, // do not throttle the checkpoint
* ignoring checkpoint_completion_target parameter.
* CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred // forced, even when no XLOG was written in between
* since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
* CHECKPOINT_END_OF_RECOVERY).
* CHECKPOINT_FLUSH_ALL: also flush buffers of unlogged tables. // flush buffers, including those of unlogged tables
*
* Note: flags contains other bits, of interest here only for logging purposes.
* In particular note that this routine is synchronous and does not pay
* attention to CHECKPOINT_WAIT.
*
* If !shutdown then we are writing an online checkpoint. This is a very special
* kind of operation and WAL record because the checkpoint action occurs over
* a period of time yet logically occurs at just a single LSN. The logical
* position of the WAL record (redo ptr) is the same or earlier than the
* physical position. When we replay WAL we locate the checkpoint via its
* physical position then read the redo ptr and actually start replay at the
* earlier logical position. Note that we don't write *anything* to WAL at
* the logical position, so that location could be any other kind of WAL record.
* All of this mechanism allows us to continue working while we checkpoint.
* As a result, timing of actions is critical here and be careful to note that
* this function will likely take minutes to execute on a busy system.
* Annotation: the redo ptr is the redo point introduced in part 1. Replay starts at
* that earlier logical position rather than at the checkpoint record itself, which
* is what lets normal work continue while an online checkpoint is in progress.
*/
void
CreateCheckPoint(int flags)
{
bool shutdown;
CheckPoint checkPoint;
XLogRecPtr recptr;
XLogSegNo _logSegNo;
XLogCtlInsert *Insert = &XLogCtl->Insert;
uint32 freespace;
XLogRecPtr PriorRedoPtr;
XLogRecPtr curInsert;
XLogRecPtr last_important_lsn;
VirtualTransactionId *vxids;
int nvxids;
/*
* An end-of-recovery checkpoint is really a shutdown checkpoint, just
* issued at a different time.
*/
if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) // the flags decide whether this is a shutdown-style checkpoint
shutdown = true;
else
shutdown = false;
/* sanity check */ // no checkpoint may be created during recovery, except the end-of-recovery one
if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
elog(ERROR, "can't create a checkpoint during recovery");
/*
* Initialize InitXLogInsert working areas before entering the critical
* section. Normally, this is done by the first call to
* RecoveryInProgress() or LocalSetXLogInsertAllowed(), but when creating
* an end-of-recovery checkpoint, the LocalSetXLogInsertAllowed call is
* done below in a critical section, and InitXLogInsert cannot be called
* in a critical section.
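* Annotation: this sets up the in-memory working area used to assemble XLOG
* records; it must be ready before the critical section is entered.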
*/
InitXLogInsert();
/*
* Prepare to accumulate statistics.
* Annotation: the global CheckpointStats struct is zeroed and the start time recorded.
* Note: because it is possible for log_checkpoints to change while a
* checkpoint proceeds, we always accumulate stats, even if
* log_checkpoints is currently off.
*/
MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
/*
* Use a critical section to force system panic if we have trouble.
*/
START_CRIT_SECTION();
// For a shutdown-style checkpoint, set ControlFile->state and ControlFile->time and write the control file
// (ControlFileLock must be held in exclusive mode while doing so)
if (shutdown)
{
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->state = DB_SHUTDOWNING;
ControlFile->time = (pg_time_t) time(NULL);
UpdateControlFile();
LWLockRelease(ControlFileLock);
}
/*
* Let smgr prepare for checkpoint; this has to happen before we determine
* the REDO pointer. Note that smgr must not do anything that'd have to
* be undone if we decide no checkpoint is needed.
*/
// tell smgr (the storage manager) to get ready for the checkpoint
SyncPreCheckpoint();
/* Begin filling in the checkpoint WAL record */
// checkPoint is the CheckPoint struct that later becomes the payload of the checkpoint WAL record
MemSet(&checkPoint, 0, sizeof(checkPoint));
checkPoint.time = (pg_time_t) time(NULL);
/*
* For Hot Standby, derive the oldestActiveXid before we fix the redo
* pointer. This allows us to begin accumulating changes to assemble our
* starting snapshot of locks and transactions.
*/
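// For an online (non-shutdown) checkpoint with hot-standby information enabled (XLogStandbyInfoActive()),
// compute the oldest active XID before the redo pointer is fixed, so that a standby can build its
// starting snapshot of locks and transactions; otherwise store InvalidTransactionId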
if (!shutdown && XLogStandbyInfoActive())
checkPoint.oldestActiveXid = GetOldestActiveTransactionId();
else
checkPoint.oldestActiveXid = InvalidTransactionId;
/*
* Get location of last important record before acquiring insert locks (as
* GetLastImportantRecPtr() also locks WAL locks).
*/
// fetch the LSN of the last important record before taking the insert locks
last_important_lsn = GetLastImportantRecPtr();
/*
* We must block concurrent insertions while examining insert state to
* determine the checkpoint REDO pointer.
*/
// take all WAL insertion locks exclusively, then compute the current XLOG insert position as an XLogRecPtr
WALInsertLockAcquireExclusive();
curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
/*
* If this isn't a shutdown or forced checkpoint, and if there has been no
* WAL activity requiring a checkpoint, skip it. The idea here is to
* avoid inserting duplicate checkpoints when the system is idle.
*/
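// For a checkpoint that is neither shutdown, end-of-recovery, nor forced: if the system is idle and no
// important WAL has been written, release the locks, leave the critical section and return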
if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
CHECKPOINT_FORCE)) == 0)
{
if (last_important_lsn == ControlFile->checkPoint)
{
WALInsertLockRelease();
END_CRIT_SECTION();
ereport(DEBUG1,
(errmsg_internal("checkpoint skipped because system is idle")));
return;
}
}
/*
* An end-of-recovery checkpoint is created before anyone is allowed to
* write WAL. To allow us to write the checkpoint record, temporarily
* enable XLogInsertAllowed. (This also ensures ThisTimeLineID is
* initialized, which we need here and in AdvanceXLInsertBuffer.)
*/
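// The end-of-recovery checkpoint is created before anyone is allowed to write WAL, so XLogInsertAllowed
// is enabled temporarily to let the checkpoint record be written; the call also makes sure ThisTimeLineID is initialized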
if (flags & CHECKPOINT_END_OF_RECOVERY)
LocalSetXLogInsertAllowed();
checkPoint.ThisTimeLineID = ThisTimeLineID;
if (flags & CHECKPOINT_END_OF_RECOVERY)
checkPoint.PrevTimeLineID = XLogCtl->PrevTimeLineID;
else
checkPoint.PrevTimeLineID = ThisTimeLineID;
checkPoint.fullPageWrites = Insert->fullPageWrites;
Determining the redo point of the checkpoint record
/*
* Compute new REDO record ptr = location of next XLOG record.
*
* NB: this is NOT necessarily where the checkpoint record itself will be,
* since other backends may insert more XLOG records while we're off doing
* the buffer flush work. Those XLOG records are logically after the
* checkpoint, even though physically before it. Got that?
*/
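// Handle the page boundary: if the insert position sits exactly on a page boundary (no free space left),
// 1) at the start of a segment file, skip the long page header (SizeOfXLogLongPHD);
// 2) otherwise, inside a segment, skip the short page header (SizeOfXLogShortPHD)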
freespace = INSERT_FREESPACE(curInsert);
if (freespace == 0)
{
if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
curInsert += SizeOfXLogLongPHD;
else
curInsert += SizeOfXLogShortPHD;
}
checkPoint.redo = curInsert;
/*
* Here we update the shared RedoRecPtr for future XLogInsert calls; this
* must be done while holding all the insertion locks.
* Annotation: if the checkpoint later fails, RedoRecPtr is left further ahead than
* necessary; the only cost is some unneeded full-page backups, as the note below explains.
* Note: if we fail to complete the checkpoint, RedoRecPtr will be left
* pointing past where it really needs to point. This is okay; the only
* consequence is that XLogInsert might back up whole buffers that it
* didn't really need to. We can't postpone advancing RedoRecPtr because
* XLogInserts that happen while we are dumping buffers must assume that
* their buffer changes are not included in the checkpoint.
*/
RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
/*
* Now we can release the WAL insertion locks, allowing other xacts to
* proceed while we are flushing disk buffers.
*/
WALInsertLockRelease();
// XLogCtl->RedoRecPtr is the second shared copy, protected by the info_lck spinlock
/* Update the info_lck-protected copy of RedoRecPtr as well */
SpinLockAcquire(&XLogCtl->info_lck);
XLogCtl->RedoRecPtr = checkPoint.redo;
SpinLockRelease(&XLogCtl->info_lck);
/*
* If enabled, log checkpoint start. We postpone this until now so as not
* to log anything if we decided to skip the checkpoint.
*/
if (log_checkpoints)
LogCheckpointStart(flags, false);
/* Update the process title */
update_checkpoint_display(flags, false, false);
TRACE_POSTGRESQL_CHECKPOINT_START(flags);
/*
* Get the other info we need for the checkpoint record.
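* Annotation: the fields collected below are the next XID, the commit-timestamp
* range, the next OID and the MultiXact state, each read under its own lock.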
* We don't need to save oldestClogXid in the checkpoint, it only matters
* for the short period in which clog is being truncated, and if we crash
* during that we'll redo the clog truncation and fix up oldestClogXid
* there.
*/
LWLockAcquire(XidGenLock, LW_SHARED);
checkPoint.nextXid = ShmemVariableCache->nextXid;
checkPoint.oldestXid = ShmemVariableCache->oldestXid;
checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
LWLockRelease(XidGenLock);
LWLockAcquire(CommitTsLock, LW_SHARED);
checkPoint.oldestCommitTsXid = ShmemVariableCache->oldestCommitTsXid;
checkPoint.newestCommitTsXid = ShmemVariableCache->newestCommitTsXid;
LWLockRelease(CommitTsLock);
LWLockAcquire(OidGenLock, LW_SHARED);
checkPoint.nextOid = ShmemVariableCache->nextOid;
if (!shutdown)
checkPoint.nextOid += ShmemVariableCache->oidCount;
LWLockRelease(OidGenLock);
MultiXactGetCheckptMulti(shutdown,
&checkPoint.nextMulti,
&checkPoint.nextMultiOffset,
&checkPoint.oldestMulti,
&checkPoint.oldestMultiDB);
/*
* Having constructed the checkpoint record, ensure all shmem disk buffers
* and commit-log buffers are flushed to disk.
*
* This I/O could fail for various reasons. If so, we will fail to
* complete the checkpoint, but there is no reason to force a system
* panic. Accordingly, exit critical section while doing it.
*/
END_CRIT_SECTION();
Flushing dirty pages
/*
* In some cases there are groups of actions that must all occur on one
* side or the other of a checkpoint record. Before flushing the
* checkpoint record we must explicitly wait for any backend currently
* performing those groups of actions.
*
* One example is end of transaction, so we must wait for any transactions
* that are currently in commit critical sections. If an xact inserted
* its commit record into XLOG just before the REDO point, then a crash
* restart from the REDO point would not replay that record, which means
* that our flushing had better include the xact's update of pg_xact. So
* we wait till he's out of his commit critical section before proceeding.
* See notes in RecordTransactionCommit().
*
* Because we've already released the insertion locks, this test is a bit
* fuzzy: it is possible that we will wait for xacts we didn't really need
* to wait for. But the delay should be short and it seems better to make
* checkpoint take a bit longer than to hold off insertions longer than
* necessary. (In fact, the whole reason we have this issue is that xact.c
* does commit record XLOG insertion and clog update as two separate steps
* protected by different locks, but again that seems best on grounds of
* minimizing lock contention.)
*
* A transaction that has not yet set delayChkpt when we look cannot be at
* risk, since he's not inserted his commit record yet; and one that's
* already cleared it is not at risk either, since he's done fixing clog
* and we will correctly flush the update below. So we cannot miss any
* xacts we need to wait for.
*/
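vxids = GetVirtualXIDsDelayingChkpt(&nvxids); // key step: collect the backends currently between writing their commit record and updating CLOG, then sleep until they are done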
if (nvxids > 0)
{
do
{
pg_usleep(10000L); /* wait for 10 msec */
} while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
}
pfree(vxids);
CheckPointGuts(checkPoint.redo, flags); // flush dirty data (the core work, covered in the next article)
/*
* Take a snapshot of running transactions and write this to WAL. This
* allows us to reconstruct the state of running transactions during
* archive recovery, if required. Skip, if this info disabled.
*
* If we are shutting down, or Startup process is completing crash
* recovery we don't need to write running xact data.
*/
if (!shutdown && XLogStandbyInfoActive()) // online checkpoint: log a snapshot of the running transactions so that
// archive recovery can rebuild the transaction state
LogStandbySnapshot();
Writing and flushing the checkpoint record
START_CRIT_SECTION();
/*
* Now insert the checkpoint record into XLOG.
*/
XLogBeginInsert();
XLogRegisterData((char *) (&checkPoint), sizeof(checkPoint));
recptr = XLogInsert(RM_XLOG_ID,
shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
XLOG_CHECKPOINT_ONLINE);
XLogFlush(recptr); // flush WAL to disk up to the end of the checkpoint record
/*
* We mustn't write any new WAL after a shutdown checkpoint, or it will be
* overwritten at next startup. No-one should even try, this just allows
* sanity-checking. In the case of an end-of-recovery checkpoint, we want
* to just temporarily disable writing until the system has exited
* recovery.
*/
if (shutdown)
{
if (flags & CHECKPOINT_END_OF_RECOVERY)
LocalXLogInsertAllowed = -1; /* return to "check" state */ // -1: decide again later by checking the recovery state
else
LocalXLogInsertAllowed = 0; /* never again write WAL */ // 0: writing WAL is no longer allowed
}
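// For a shutdown checkpoint the record must start exactly at the redo point. If checkPoint.redo != ProcLastRecPtr,
// some other process wrote WAL concurrently, so PANIC with an error message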
/*
* We now have ProcLastRecPtr = start of actual checkpoint record, recptr
* = end of actual checkpoint record.
*/
if (shutdown && checkPoint.redo != ProcLastRecPtr)
ereport(PANIC,
(errmsg("concurrent write-ahead log activity while database system is shutting down")));
/*
* Remember the prior checkpoint's redo ptr for
* UpdateCheckPointDistanceEstimate()
*/
// saved here because ControlFile->checkPointCopy is overwritten just below; the value feeds UpdateCheckPointDistanceEstimate()
PriorRedoPtr = ControlFile->checkPointCopy.redo;
Updating the control file
/*
* Update the control file.
*/
// take ControlFileLock in exclusive mode
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
if (shutdown)
ControlFile->state = DB_SHUTDOWNED; // database state: cleanly shut down
ControlFile->checkPoint = ProcLastRecPtr;
ControlFile->checkPointCopy = checkPoint; // the CheckPoint struct filled in above
ControlFile->time = (pg_time_t) time(NULL);
/* crash recovery should always recover to the end of WAL */
ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
ControlFile->minRecoveryPointTLI = 0;
/*
* Persist unloggedLSN value. It's reset on crash recovery, so this goes
* unused on non-shutdown checkpoints, but seems useful to store it always
* for debugging purposes.
*/
SpinLockAcquire(&XLogCtl->ulsn_lck);
ControlFile->unloggedLSN = XLogCtl->unloggedLSN;
SpinLockRelease(&XLogCtl->ulsn_lck);
// write the control file to disk, then release the lock
UpdateControlFile();
LWLockRelease(ControlFileLock);
// store the checkpoint's next XID in the shared-memory XLogCtl struct
/* Update shared-memory copy of checkpoint XID/epoch */
SpinLockAcquire(&XLogCtl->info_lck);
XLogCtl->ckptFullXid = checkPoint.nextXid;
SpinLockRelease(&XLogCtl->info_lck);
/*
* We are now done with critical updates; no need for system panic if we
* have trouble while fooling with old log segments.
*/
END_CRIT_SECTION();
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
SyncPostCheckpoint();
Removing old XLOG files
/*
* Update the average distance between checkpoints if the prior checkpoint
* exists.
*/
// update the average WAL distance between checkpoints, used to estimate how much WAL is generated per cycle
if (PriorRedoPtr != InvalidXLogRecPtr)
UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
/*
* Delete old log files, those no longer needed for last checkpoint to
* prevent the disk holding the xlog from growing full.
*/
XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size); // segment number that contains the redo point
/* KeepLogSeg may move _logSegNo further back, depending on the max_slot_wal_keep_size and wal_keep_size settings */
KeepLogSeg(recptr, &_logSegNo);
if (InvalidateObsoleteReplicationSlots(_logSegNo))
{
/*
* Some slots have been invalidated; recalculate the old-segment
* horizon, starting again from RedoRecPtr.
*/
XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
KeepLogSeg(recptr, &_logSegNo);
}
_logSegNo--; // segments up to and including this number are no longer needed and can be removed
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr); // remove the obsolete XLOG segment files
/*
* Make more log segments if needed. (Do this after recycling old log
* segments, since that may supply some of the needed files.)
*/
if (!shutdown)
PreallocXlogFiles(recptr);
/*
* Truncate pg_subtrans if possible. We can throw away all data before
* the oldest XMIN of any running transaction. No future transaction will
* attempt to reference any pg_subtrans entry older than that (see Asserts
* in subtrans.c). During recovery, though, we mustn't do this because
* StartupSUBTRANS hasn't been called yet.
*/
// truncate pg_subtrans entries older than the oldest transaction still considered running
if (!RecoveryInProgress())
TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
/* Real work is done; log and update stats. */
LogCheckpointEnd(false);
/* Reset the process title */
update_checkpoint_display(flags, false, true);
TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
NBuffers,
CheckpointStats.ckpt_segs_added,
CheckpointStats.ckpt_segs_removed,
CheckpointStats.ckpt_segs_recycled);
}
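Two pieces of arithmetic in the listing above are easy to misread: the page-header adjustment applied to the redo point, and the `_logSegNo--` step that picks the last removable segment. The sketch below reproduces both in isolation. It is a simplified illustration, not PostgreSQL code: the function names are hypothetical, the constants (8 kB WAL pages, 16 MB segments, 40-byte long and 24-byte short page headers) are assumed typical values, and the KeepLogSeg adjustment for wal_keep_size and replication slots is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;   /* 64-bit WAL position, as in PostgreSQL */
typedef uint64_t XLogSegNo;    /* WAL segment number */

/* Assumed typical values; real builds take these from configuration/headers. */
#define XLOG_BLCKSZ        8192                  /* WAL page size */
#define WAL_SEGMENT_SIZE   (16 * 1024 * 1024)    /* WAL segment size */
#define SIZE_OF_LONG_PHD   40                    /* header of a segment's first page */
#define SIZE_OF_SHORT_PHD  24                    /* header of every other page */

/*
 * Hypothetical helper: given the current insert position, return the redo
 * point. If the position sits exactly on a page boundary, the next record
 * can only start after that page's header, so the header is skipped.
 */
static XLogRecPtr
redo_from_insert_pos(XLogRecPtr curInsert)
{
    if (curInsert % XLOG_BLCKSZ == 0)            /* no free space left on the page */
    {
        if (curInsert % WAL_SEGMENT_SIZE == 0)
            curInsert += SIZE_OF_LONG_PHD;       /* first page of a segment */
        else
            curInsert += SIZE_OF_SHORT_PHD;      /* any other page */
    }
    return curInsert;
}

/*
 * Hypothetical helper: the segment containing keep_from must be kept, so the
 * last removable segment is the one just before it. Assumes keep_from lies
 * in segment 1 or later (the subtraction would wrap around for segment 0).
 */
static XLogSegNo
last_removable_segment(XLogRecPtr keep_from)
{
    XLogSegNo segno = keep_from / WAL_SEGMENT_SIZE;

    return segno - 1;
}
```

With these assumptions, an insert position at the very start of segment 2 (0x2000000) yields a redo point 40 bytes later, a position on an interior page boundary moves 24 bytes, and any other position is used unchanged. A redo point inside segment 5 makes segment 4 the last candidate for removal.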