This post walks through the checkpointer's dirty-data flush flow, whose entry point is CheckPointGuts. This is the core work of a checkpoint; for background, see the earlier posts:
postgres checkpoint源码学习-1
postgres checkpoint源码学习-2
Source code analysis
/*
* Flush all data in shared memory to disk, and fsync
* This is the common code shared between regular checkpoints and
* recovery restartpoints.
*/
static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
CheckPointCLOG();
CheckPointCommitTs();
CheckPointSUBTRANS();
CheckPointMultiXact();
CheckPointPredicate();
CheckPointBuffers(flags);
/* Perform all queued up fsyncs */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
ProcessSyncRequests();
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
/* We deliberately delay 2PC checkpointing as long as possible */
CheckPointTwoPhase(checkPointRedo);
}
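For context, both kinds of checkpoints funnel into this one function; abridged, the two call sites in xlog.c look like this:
/* in CreateCheckPoint(): a normal online or shutdown checkpoint */
CheckPointGuts(checkPoint.redo, flags);

/* in CreateRestartPoint(): a restartpoint performed during recovery */
CheckPointGuts(lastCheckPoint.redo, flags);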
1 CheckPointRelationMap
/*
* CheckPointRelationMap
*
* This is called during a checkpoint. It must ensure that any relation map
* updates that were WAL-logged before the start of the checkpoint are
* securely flushed to disk and will not need to be replayed later. This
* seems unlikely to be a performance-critical issue, so we use a simple
* method: we just take and release the RelationMappingLock. This ensures
* that any already-logged map update is complete, because write_relmap_file
* will fsync the map file before the lock is released.
*/
void
CheckPointRelationMap(void)
{
LWLockAcquire(RelationMappingLock, LW_SHARED);
LWLockRelease(RelationMappingLock);
}
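The trick works because the writer fsyncs the map file while still holding the lock, so anyone who can acquire the lock afterwards, even in shared mode, knows the fsync has already completed. A conceptual sketch of that ordering (simplified; not the actual relmapper.c code, whose locking details differ slightly between versions):
/* Conceptual sketch of the writer-side ordering CheckPointRelationMap relies
 * on; the real logic is in write_relmap_file() in relmapper.c. */
static void
relmap_update_sketch(void)
{
	LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);

	/* 1. WAL-log the relation map change                     */
	/* 2. write the new pg_filenode.map contents              */
	/* 3. pg_fsync() the map file -- while still holding the lock */

	LWLockRelease(RelationMappingLock);

	/*
	 * A checkpointer that can subsequently acquire the lock, even in shared
	 * mode, therefore knows every already-logged map update has been fsynced.
	 */
}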
2 CheckPointReplicationSlots
/*
* Flush all replication slots to disk.
* This needn't actually be part of a checkpoint, but it's a convenient
* location.
*/
void
CheckPointReplicationSlots(void)
{
int i;
elog(DEBUG1, "performing replication slot checkpoint");
/*
* Prevent any slot from being created/dropped while we're active. As we
* explicitly do *not* want to block iterating over replication_slots or
* acquiring a slot we cannot take the control lock - but that's OK,
* because holding ReplicationSlotAllocationLock is strictly stronger, and
* enough to guarantee that nobody can change the in_use bits on us.
*/
LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
for (i = 0; i < max_replication_slots; i++)
{
ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
char path[MAXPGPATH];
if (!s->in_use)
continue;
/* save the slot to disk, locking is handled in SaveSlotToPath() */
sprintf(path, "pg_replslot/%s", NameStr(s->data.name));
SaveSlotToPath(s, path, LOG);
}
LWLockRelease(ReplicationSlotAllocationLock);
}
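SaveSlotToPath persists each slot's state with the classic crash-safe sequence: write a temporary file, fsync it, rename it over the permanent file, then fsync the containing directory. A standalone sketch of that pattern using plain POSIX calls (file names and the helper name here are illustrative, not PostgreSQL's):
/* Standalone sketch of the durable-save pattern SaveSlotToPath() follows. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void
durable_save(const char *dir, const char *name, const void *buf, size_t len)
{
	char		tmp_path[1024];
	char		final_path[1024];
	int			fd;

	snprintf(tmp_path, sizeof(tmp_path), "%s/%s.tmp", dir, name);
	snprintf(final_path, sizeof(final_path), "%s/%s", dir, name);

	/* 1. write the new contents into a temporary file and fsync it */
	fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0 || write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
	{
		perror("write temp file");
		exit(1);
	}
	close(fd);

	/* 2. atomically replace the permanent file */
	if (rename(tmp_path, final_path) != 0)
	{
		perror("rename");
		exit(1);
	}

	/* 3. fsync the containing directory so the rename itself is durable */
	fd = open(dir, O_RDONLY);
	if (fd < 0 || fsync(fd) != 0)
	{
		perror("fsync directory");
		exit(1);
	}
	close(fd);
}

int
main(void)
{
	const char *payload = "slot state";

	durable_save(".", "state", payload, strlen(payload));
	return 0;
}
Without step 3, a crash right after the rename could lose the directory entry update; this is also why CheckPointTwoPhase, further below, ends with fsync_fname(TWOPHASE_DIR, true).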
3 CheckPointSnapBuild
/*
* Remove all serialized snapshots that are not required anymore because no
* slot can need them. This doesn't actually have to run during a checkpoint,
* but it's a convenient point to schedule this.
*
* NB: We run this during checkpoints even if logical decoding is disabled so
* we cleanup old slots at some point after it got disabled.
*/
void
CheckPointSnapBuild(void)
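The body (elided above) computes the oldest LSN any replication slot might still need and unlinks serialized snapshot files in pg_logical/snapshots that fall below it. An abridged sketch, with error handling and logging stripped:
void
CheckPointSnapBuild(void)		/* abridged sketch */
{
	XLogRecPtr	cutoff;
	XLogRecPtr	redo;
	DIR		   *snap_dir;
	struct dirent *snap_de;
	char		path[MAXPGPATH];

	/* nothing newer than what a slot (or the redo pointer) needs may go */
	redo = GetRedoRecPtr();
	cutoff = ReplicationSlotsComputeLogicalRestartLSN();
	if (redo < cutoff)
		cutoff = redo;

	snap_dir = AllocateDir("pg_logical/snapshots");
	while ((snap_de = ReadDir(snap_dir, "pg_logical/snapshots")) != NULL)
	{
		uint32		hi;
		uint32		lo;
		XLogRecPtr	lsn;

		if (strcmp(snap_de->d_name, ".") == 0 ||
			strcmp(snap_de->d_name, "..") == 0)
			continue;

		/* the file name encodes the LSN the snapshot was serialized at */
		if (sscanf(snap_de->d_name, "%X-%X.snap", &hi, &lo) != 2)
			continue;
		lsn = ((uint64) hi) << 32 | lo;

		if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
		{
			snprintf(path, sizeof(path), "pg_logical/snapshots/%s",
					 snap_de->d_name);
			unlink(path);		/* no slot can need this snapshot anymore */
		}
	}
	FreeDir(snap_dir);
}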
4 CheckPointLogicalRewriteHeap
/* ---
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
* 1) Remove all mappings not needed anymore based on the logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
* ---
*/
void
CheckPointLogicalRewriteHeap(void)
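The two tasks map onto a single directory scan: mapping files in pg_logical/mappings older than the logical restart LSN are unlinked, and the rest are fsynced so that replay after the checkpoint only has to worry about mapping data written after the checkpoint started. An abridged sketch (file-name parsing and error handling omitted):
void
CheckPointLogicalRewriteHeap(void)	/* abridged sketch */
{
	XLogRecPtr	cutoff;
	XLogRecPtr	redo;
	DIR		   *mappings_dir;
	struct dirent *mapping_de;
	char		path[MAXPGPATH];

	/* oldest LSN any logical slot might still have to replay from */
	redo = GetRedoRecPtr();
	cutoff = ReplicationSlotsComputeLogicalRestartLSN();
	if (redo < cutoff)
		cutoff = redo;

	mappings_dir = AllocateDir("pg_logical/mappings");
	while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
	{
		XLogRecPtr	lsn;

		/* ... the mapping file name embeds the LSN it was written at ... */
		lsn = /* parsed from mapping_de->d_name */ InvalidXLogRecPtr;

		snprintf(path, sizeof(path), "pg_logical/mappings/%s",
				 mapping_de->d_name);

		if (lsn < cutoff && cutoff != InvalidXLogRecPtr)
		{
			/* task 1: mapping no longer needed, remove it */
			unlink(path);
		}
		else
		{
			/* task 2: still needed, make sure it is durably on disk */
			int			fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);

			pg_fsync(fd);
			CloseTransientFile(fd);
		}
	}
	FreeDir(mappings_dir);
}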
5 CheckPointReplicationOrigin
/* ---------------------------------------------------------------------------
* Perform a checkpoint of each replication origin's progress with respect to
* the replayed remote_lsn. Make sure that all transactions we refer to in the
* checkpoint (local_lsn) are actually on-disk. This might not yet be the case
* if the transactions were originally committed asynchronously.
*
* We store checkpoints in the following format:
* +-------+------------------------+------------------+-----+--------+
* | MAGIC | ReplicationStateOnDisk | struct Replic... | ... | CRC32C | EOF
* +-------+------------------------+------------------+-----+--------+
*
* So its just the magic, followed by the statically sized
* ReplicationStateOnDisk structs. Note that the maximum number of
* ReplicationState is determined by max_replication_slots.
* ---------------------------------------------------------------------------
*/
void
CheckPointReplicationOrigin(void)
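Two details of the body are worth calling out: it XLogFlush()es each origin's local_lsn before writing it out, which is what the first paragraph of the comment promises, and the whole file is written under a temporary name and durably renamed into place. An abridged sketch of the write loop (error handling and per-state locking omitted):
void
CheckPointReplicationOrigin(void)	/* abridged sketch */
{
	const char *tmppath = "pg_logical/replorigin_checkpoint.tmp";
	const char *path = "pg_logical/replorigin_checkpoint";
	uint32		magic = REPLICATION_STATE_MAGIC;
	pg_crc32c	crc;
	int			tmpfd;
	int			i;

	if (max_replication_slots == 0)
		return;

	INIT_CRC32C(crc);

	tmpfd = OpenTransientFile(tmppath, O_CREAT | O_EXCL | O_WRONLY | PG_BINARY);

	/* header: magic number, included in the CRC */
	write(tmpfd, &magic, sizeof(magic));
	COMP_CRC32C(crc, &magic, sizeof(magic));

	LWLockAcquire(ReplicationOriginLock, LW_EXCLUSIVE);

	for (i = 0; i < max_replication_slots; i++)
	{
		ReplicationStateOnDisk disk_state;
		ReplicationState *curstate = &replication_states[i];
		XLogRecPtr	local_lsn;

		if (curstate->roident == InvalidRepOriginId)
			continue;

		disk_state.roident = curstate->roident;
		disk_state.remote_lsn = curstate->remote_lsn;
		local_lsn = curstate->local_lsn;

		/* make sure the commit this progress refers to is really on disk */
		XLogFlush(local_lsn);

		write(tmpfd, &disk_state, sizeof(disk_state));
		COMP_CRC32C(crc, &disk_state, sizeof(disk_state));
	}

	LWLockRelease(ReplicationOriginLock);

	/* trailer: CRC32C over everything written above */
	FIN_CRC32C(crc);
	write(tmpfd, &crc, sizeof(crc));
	CloseTransientFile(tmpfd);

	/* fsync the temp file and rename it into place durably */
	durable_rename(tmppath, path, PANIC);
}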
The following APIs are the heart of this flow. Each of these subsystems keeps its buffers in an SLRU, and each checkpoint routine below boils down to SimpleLruWriteAll --> SlruInternalWritePage (a sketch of that shared path follows the call list below). For details on the SLRU buffer pool see:
postgres源码分析 Slru缓冲池的实现-1
postgres源码分析 Slru缓冲池的实现-2
CheckPointCLOG();
CheckPointCommitTs();
CheckPointSUBTRANS();
CheckPointMultiXact();
CheckPointPredicate();
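As referenced above, a sketch of that shared SLRU flush path (abridged from slru.c; the fsync/close bookkeeping after the loop is only hinted at):
void
SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)	/* abridged sketch */
{
	SlruShared	shared = ctl->shared;
	SlruWriteAllData fdata;
	int			slotno;

	fdata.num_files = 0;

	LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);

	/* write out every dirty buffer slot of this SLRU */
	for (slotno = 0; slotno < shared->num_slots; slotno++)
		SlruInternalWritePage(ctl, slotno, &fdata);

	LWLockRelease(shared->ControlLock);

	/* ... then close the segment files collected in fdata, reporting errors ... */
}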
6 CheckPointCLOG
// Flush dirty CLOG data to disk via SimpleLruWriteAll
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*/
void
CheckPointCLOG(void)
{
/*
* Write dirty CLOG pages to disk. This may result in sync requests
* queued for later handling by ProcessSyncRequests(), as part of the
* checkpoint.
*/
TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
SimpleLruWriteAll(XactCtl, true);
TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
}
7 CheckPointCommitTs
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*/
void
CheckPointCommitTs(void)
{
/*
* Write dirty CommitTs pages to disk. This may result in sync requests
* queued for later handling by ProcessSyncRequests(), as part of the
* checkpoint.
*/
SimpleLruWriteAll(CommitTsCtl, true);
}
8 CheckPointSUBTRANS
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*/
void
CheckPointSUBTRANS(void)
{
/*
* Write dirty SUBTRANS pages to disk
*
* This is not actually necessary from a correctness point of view. We do
* it merely to improve the odds that writing of dirty pages is done by
* the checkpoint process and not by backends.
*/
TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_START(true);
SimpleLruWriteAll(SubTransCtl, true);
TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_DONE(true);
}
9 CheckPointMultiXact
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*/
void
CheckPointMultiXact(void)
{
TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_START(true);
/*
* Write dirty MultiXact pages to disk. This may result in sync requests
* queued for later handling by ProcessSyncRequests(), as part of the
* checkpoint.
*/
SimpleLruWriteAll(MultiXactOffsetCtl, true);
SimpleLruWriteAll(MultiXactMemberCtl, true);
TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
}
10 CheckPointPredicate
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*
* We don't have any data that needs to survive a restart, but this is a
* convenient place to truncate the SLRU.
*/
void
CheckPointPredicate(void)
11 CheckPointBuffers <underlying implementation to be covered in a later post>
/*
* CheckPointBuffers
* Flush all dirty blocks in buffer pool to disk at checkpoint time.
*
* Note: temporary relations do not participate in checkpoints, so they don't
* need to be flushed.
*/
void
CheckPointBuffers(int flags)
{
BufferSync(flags);
}
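Pending the later post on BufferSync, here is a rough outline of what it does. BM_CHECKPOINT_NEEDED, SyncOneBuffer and CheckpointWriteDelay are real bufmgr.c identifiers; the structure is heavily simplified and this is not the real code:
/* Rough outline of BufferSync(flags). */
static void
BufferSync_outline(int flags)
{
	/*
	 * Pass 1: scan every buffer header and tag buffers that are dirty right
	 * now with BM_CHECKPOINT_NEEDED; pages dirtied after this point are not
	 * this checkpoint's responsibility.
	 */

	/*
	 * Pass 2: collect the tagged buffers and sort them by tablespace /
	 * relation / block number so the writes are mostly sequential, then
	 * interleave the per-tablespace write streams.
	 */

	/*
	 * Pass 3: write each tagged buffer via SyncOneBuffer(), which registers
	 * an fsync request for later instead of fsyncing immediately, and call
	 * CheckpointWriteDelay() between writes so the I/O is spread out
	 * according to checkpoint_completion_target.
	 */
}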
12 ProcessSyncRequests
/*
* ProcessSyncRequests() -- Process queued fsync requests.
*/
void
ProcessSyncRequests(void)
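Its body (elided here) drains the checkpointer's shared-memory request queue into a local hash table and then fsyncs each pending file through the owning subsystem's sync handler, retrying a bounded number of times before giving up. A rough sketch (abridged; the real code lives in sync.c):
void
ProcessSyncRequests(void)	/* abridged sketch */
{
	HASH_SEQ_STATUS hstat;
	PendingFsyncEntry *entry;
	char		path[MAXPGPATH];

	/* move any requests still queued in shared memory into pendingOps */
	AbsorbSyncRequests();

	hash_seq_init(&hstat, pendingOps);
	while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
	{
		/* skip requests canceled by a later "forget" request */
		if (entry->canceled)
			continue;

		/*
		 * Hand the file to its owning subsystem's fsync callback (md.c for
		 * relations, slru.c for SLRUs, ...); path is filled in for error
		 * reporting. The real code retries failures before erroring out.
		 */
		syncsw[entry->tag.handler].sync_syncfiletag(&entry->tag, path);

		CheckpointStats.ckpt_sync_rels++;
	}

	/* the real function then removes the entries it has just serviced */
}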
13 CheckPointTwoPhase
/*
* CheckPointTwoPhase -- handle 2PC component of checkpointing.
* We must fsync the state file of any GXACT that is valid or has been
* generated during redo and has a PREPARE LSN <= the checkpoint's redo
* horizon. (If the gxact isn't valid yet, has not been generated in
* redo, or has a later LSN, this checkpoint is not responsible for
* fsyncing it.)
* This is deliberately run as late as possible in the checkpoint sequence,
* because GXACTs ordinarily have short lifespans, and so it is quite
* possible that GXACTs that were valid at checkpoint start will no longer
* exist if we wait a little bit. With typical checkpoint settings this
* will be about 3 minutes for an online checkpoint, so as a result we
* expect that there will be no GXACTs that need to be copied to disk.
* If a GXACT remains valid across multiple checkpoints, it will already
* be on disk so we don't bother to repeat that write.
*/
void
CheckPointTwoPhase(XLogRecPtr redo_horizon)
{
int i;
int serialized_xacts = 0;
if (max_prepared_xacts <= 0)
return; /* nothing to do */
TRACE_POSTGRESQL_TWOPHASE_CHECKPOINT_START();
/*
* We are expecting there to be zero GXACTs that need to be copied to
* disk, so we perform all I/O while holding TwoPhaseStateLock for
* simplicity. This prevents any new xacts from preparing while this
* occurs, which shouldn't be a problem since the presence of long-lived
* prepared xacts indicates the transaction manager isn't active.
* It's also possible to move I/O out of the lock, but on every error we
* should check whether somebody committed our transaction in different
* backend. Let's leave this optimization for future, if somebody will
* spot that this place cause bottleneck.
* Note that it isn't possible for there to be a GXACT with a
* prepare_end_lsn set prior to the last checkpoint yet is marked invalid,
* because of the efforts with delayChkpt.
*/
// Acquire TwoPhaseStateLock in shared mode and walk the prepared-transaction array
LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
{
/*
* Note that we are using gxact not PGPROC so this works in recovery
* also
*/
GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
if ((gxact->valid || gxact->inredo) &&
!gxact->ondisk &&
gxact->prepare_end_lsn <= redo_horizon)
{
char *buf;
int len;
// Read the 2PC state data back from WAL, starting at prepare_start_lsn,
// then recreate the two-phase state file and write that data into it
XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, &len);
RecreateTwoPhaseFile(gxact->xid, buf, len);
gxact->ondisk = true;
gxact->prepare_start_lsn = InvalidXLogRecPtr;
gxact->prepare_end_lsn = InvalidXLogRecPtr;
pfree(buf);
serialized_xacts++;
}
}
LWLockRelease(TwoPhaseStateLock);
/*
* Flush unconditionally the parent directory to make any information
* durable on disk. Two-phase files could have been removed and those
* removals need to be made persistent as well as any files newly created
* previously since the last checkpoint.
*/
// Make the contents of TWOPHASE_DIR durable
fsync_fname(TWOPHASE_DIR, true);
TRACE_POSTGRESQL_TWOPHASE_CHECKPOINT_DONE();
if (log_checkpoints && serialized_xacts > 0)
ereport(LOG,
(errmsg_plural("%u two-phase state file was written "
"for a long-running prepared transaction",
"%u two-phase state files were written "
"for long-running prepared transactions",
serialized_xacts,
serialized_xacts)));
}
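RecreateTwoPhaseFile, called in the loop above, writes the state data read back from WAL into a file named after the xid under pg_twophase, with a CRC32C appended, and fsyncs it. An abridged sketch (error handling removed; the real function is in twophase.c):
/* Abridged sketch of RecreateTwoPhaseFile(). */
static void
RecreateTwoPhaseFile_sketch(TransactionId xid, void *content, int len)
{
	char		path[MAXPGPATH];
	pg_crc32c	statefile_crc;
	int			fd;

	/* recompute the CRC over the state data read back from WAL */
	INIT_CRC32C(statefile_crc);
	COMP_CRC32C(statefile_crc, content, len);
	FIN_CRC32C(statefile_crc);

	/* state files live in pg_twophase/ and are named after the xid */
	snprintf(path, sizeof(path), TWOPHASE_DIR "/%08X", xid);

	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);

	/* write the state data followed by its CRC, then fsync */
	write(fd, content, len);
	write(fd, &statefile_crc, sizeof(pg_crc32c));
	pg_fsync(fd);
	CloseTransientFile(fd);
}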