This post walks through the checkpointer's dirty-data flush flow, whose entry point is CheckPointGuts. This is the core work of a checkpoint; for background, see the earlier posts:
postgres checkpoint源码学习-1
postgres checkpoint源码学习-2
Source code analysis
/*
* Flush all data in shared memory to disk, and fsync
* This is the common code shared between regular checkpoints and
* recovery restartpoints.
*/
static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointRelationMap();
CheckPointReplicationSlots();
CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
CheckPointCLOG();
CheckPointCommitTs();
CheckPointSUBTRANS();
CheckPointMultiXact();
CheckPointPredicate();
CheckPointBuffers(flags);
/* Perform all queued up fsyncs */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
ProcessSyncRequests();
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
/* We deliberately delay 2PC checkpointing as long as possible */
CheckPointTwoPhase(checkPointRedo);
}
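For context, both kinds of checkpoints funnel into this one function; abridged, the two call sites in xlog.c look like this:
/* in CreateCheckPoint(): a normal online or shutdown checkpoint */
CheckPointGuts(checkPoint.redo, flags);

/* in CreateRestartPoint(): a restartpoint performed during recovery */
CheckPointGuts(lastCheckPoint.redo, flags);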
1 CheckPointRelationMap
/*
* CheckPointRelationMap
*
* This is called during a checkpoint. It must ensure that any relation map
* updates that were WAL-logged before the start of the checkpoint are
* securely flushed to disk and will not need to be replayed later. This
* seems unlikely to be a performance-critical issue, so we use a simple
* method: we just take and release the RelationMappingLock. This ensures
* that any already-logged map update is complete, because write_relmap_file
* will fsync the map file before the lock is released.
*/
void
CheckPointRelationMap(void)
{
LWLockAcquire(RelationMappingLock, LW_SHARED);
LWLockRelease(RelationMappingLock);
}
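The trick works because the writer fsyncs the map file while still holding the lock, so anyone who can acquire the lock afterwards, even in shared mode, knows the fsync has already completed. A conceptual sketch of that ordering (simplified; not the actual relmapper.c code, whose locking details differ slightly between versions):
/* Conceptual sketch of the writer-side ordering CheckPointRelationMap relies
 * on; the real logic is in write_relmap_file() in relmapper.c. */
static void
relmap_update_sketch(void)
{
	LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);

	/* 1. WAL-log the relation map change                     */
	/* 2. write the new pg_filenode.map contents              */
	/* 3. pg_fsync() the map file -- while still holding the lock */

	LWLockRelease(RelationMappingLock);

	/*
	 * A checkpointer that can subsequently acquire the lock, even in shared
	 * mode, therefore knows every already-logged map update has been fsynced.
	 */
}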
2 CheckPointReplicationSlots
/*
* Flush all replication slots to disk.
* This needn't actually be part of a checkpoint, but it's a convenient
* location.
*/
void
CheckPointReplicationSlots(void)
{
int i;
elog(DEBUG1, "performing replication slot checkpoint");
/*
* Prevent any slot from being created/dropped while we're active. As we
* explicitly do *not* want to block iterating over replication_slots or
* acquiring a slot we cannot take the control lock - but that's OK,
* because holding ReplicationSlotAllocationLock is strictly stronger, and
* enough to guarantee that nobody can change the in_use bits on us.
*/
LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
for (i = 0; i < max_replication_slots; i++)
{
ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
char path[MAXPGPATH];
if (!s->in_use)
continue;
/* save the slot to disk, locking is handled in SaveSlotToPath() */
sprintf(path, "pg_replslot/%s", NameStr(s->data.name));
SaveSlotToPath(s, path, LOG);
}
LWLockRelease(ReplicationSlotAllocationLock);
}
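SaveSlotToPath persists each slot's state with the classic crash-safe sequence: write a temporary file, fsync it, rename it over the permanent file, then fsync the containing directory. A standalone sketch of that pattern using plain POSIX calls (file names and the helper name here are illustrative, not PostgreSQL's):
/* Standalone sketch of the durable-save pattern SaveSlotToPath() follows. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void
durable_save(const char *dir, const char *name, const void *buf, size_t len)
{
	char		tmp_path[1024];
	char		final_path[1024];
	int			fd;

	snprintf(tmp_path, sizeof(tmp_path), "%s/%s.tmp", dir, name);
	snprintf(final_path, sizeof(final_path), "%s/%s", dir, name);

	/* 1. write the new contents into a temporary file and fsync it */
	fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0 || write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
	{
		perror("write temp file");
		exit(1);
	}
	close(fd);

	/* 2. atomically replace the permanent file */
	if (rename(tmp_path, final_path) != 0)
	{
		perror("rename");
		exit(1);
	}

	/* 3. fsync the containing directory so the rename itself is durable */
	fd = open(dir, O_RDONLY);
	if (fd < 0 || fsync(fd) != 0)
	{
		perror("fsync directory");
		exit(1);
	}
	close(fd);
}

int
main(void)
{
	const char *payload = "slot state";

	durable_save(".", "state", payload, strlen(payload));
	return 0;
}
Without step 3, a crash right after the rename could lose the directory entry update; this is also why CheckPointTwoPhase, further below, ends with fsync_fname(TWOPHASE_DIR, true).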
3 CheckPointSnapBuild
/*
* Remove all serialized snapshots that are not required anymore because no
* slot can need them. This doesn't actually have to run during a checkpoint,
* but it's a convenient point to schedule this.
*
* NB: We run this during checkpoints even if logical decoding is disabled so
* we cleanup old slots at some point after it got disabled.
*/
void
CheckPointSnapBuild(void)
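The body (elided above) computes the oldest LSN any replication slot might still need and unlinks serialized snapshot files in pg_logical/snapshots that fall below it. An abridged sketch, with error handling and logging stripped:
void
CheckPointSnapBuild(void)		/* abridged sketch */
{
	XLogRecPtr	cutoff;
	XLogRecPtr	redo;
	DIR		   *snap_dir;
	struct dirent *snap_de;
	char		path[MAXPGPATH];

	/* nothing newer than what a slot (or the redo pointer) needs may go */
	redo = GetRedoRecPtr();
	cutoff = ReplicationSlotsComputeLogicalRestartLSN();
	if (redo < cutoff)
		cutoff = redo;

	snap_dir = AllocateDir("pg_logical/snapshots");
	while ((snap_de = ReadDir(snap_dir, "pg_logical/snapshots")) != NULL)
	{
		uint32		hi;
		uint32		lo;
		XLogRecPtr	lsn;

		if (strcmp(snap_de->d_name, ".") == 0 ||
			strcmp(snap_de->d_name, "..") == 0)
			continue;

		/* the file name encodes the LSN the snapshot was serialized at */
		if (sscanf(snap_de->d_name, "%X-%X.snap", &hi, &lo) != 2)
			continue;
		lsn = ((uint64) hi) << 32 | lo;

		if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
		{
			snprintf(path, sizeof(path), "pg_logical/snapshots/%s",
					 snap_de->d_name);
			unlink(path);		/* no slot can need this snapshot anymore */
		}
	}
	FreeDir(snap_dir);
}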
4 CheckPointLogicalRewriteHeap
/* ---
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
* 1) Remove all mappings not needed anymore based on the logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
* ---
*/
void
CheckPointLogicalRewriteHeap(void)
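The two tasks map onto a single directory scan: mapping files in pg_logical/mappings older than the logical restart LSN are unlinked, and the rest are fsynced so that replay after the checkpoint only has to worry about mapping data written after the checkpoint started. An abridged sketch (file-name parsing and error handling omitted):
void
CheckPointLogicalRewriteHeap(void)	/* abridged sketch */
{
	XLogRecPtr	cutoff;
	XLogRecPtr	redo;
	DIR		   *mappings_dir;
	struct dirent *mapping_de;
	char		path[MAXPGPATH];

	/* oldest LSN any logical slot might still have to replay from */
	redo = GetRedoRecPtr();
	cutoff = ReplicationSlotsComputeLogicalRestartLSN();
	if (redo < cutoff)
		cutoff = redo;

	mappings_dir = AllocateDir("pg_logical/mappings");
	while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
	{
		XLogRecPtr	lsn;

		/* ... the mapping file name embeds the LSN it was written at ... */
		lsn = /* parsed from mapping_de->d_name */ InvalidXLogRecPtr;

		snprintf(path, sizeof(path), "pg_logical/mappings/%s",
				 mapping_de->d_name);

		if (lsn < cutoff && cutoff != InvalidXLogRecPtr)
		{
			/* task 1: mapping no longer needed, remove it */
			unlink(path);
		}
		else
		{
			/* task 2: still needed, make sure it is durably on disk */
			int			fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);

			pg_fsync(fd);
			CloseTransientFile(fd);
		}
	}
	FreeDir(mappings_dir);
}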
5 CheckPointReplicationOrigin
/* ---------------------------------------------------------------------------
* Perform a checkpoint of each replication origin's progress with respect to
* the replayed remote_lsn. Make sure that all transactions we refer to in the
* checkpoint (local_lsn) are actually on-disk. This might not yet be the case
* if the transactions were originally committed asynchronously.
*
* We store checkpoints in the following format:
* +-------+------------------------+------------------+-----+--------+
* | MAGIC | ReplicationStateOnDisk | struct Replic... | ... | CRC32C | EOF
* +-------+------------------------+------------------+-----+--------+
*
* So its just the magic, followed by the statically sized
* ReplicationStateOnDisk structs. Note that the maximum number of
* ReplicationState is determined by max_replication_slots.
* ---------------------------------------------------------------------------
*/
void
CheckPointReplicationOrigin(void)
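Two details of the body are worth calling out: it XLogFlush()es each origin's local_lsn before writing it out, which is what the first paragraph of the comment promises, and the whole file is written under a temporary name and durably renamed into place. An abridged sketch of the write loop (error handling and per-state locking omitted):
void
CheckPointReplicationOrigin(void)	/* abridged sketch */
{
	const char *tmppath = "pg_logical/replorigin_checkpoint.tmp";
	const char *path = "pg_logical/replorigin_checkpoint";
	uint32		magic = REPLICATION_STATE_MAGIC;
	pg_crc32c	crc;
	int			tmpfd;
	int			i;

	if (max_replication_slots == 0)
		return;

	INIT_CRC32C(crc);

	tmpfd = OpenTransientFile(tmppath, O_CREAT | O_EXCL | O_WRONLY | PG_BINARY);

	/* header: magic number, included in the CRC */
	write(tmpfd, &magic, sizeof(magic));
	COMP_CRC32C(crc, &magic, sizeof(magic));

	LWLockAcquire(ReplicationOriginLock, LW_EXCLUSIVE);

	for (i = 0; i < max_replication_slots; i++)
	{
		ReplicationStateOnDisk disk_state;
		ReplicationState *curstate = &replication_states[i];
		XLogRecPtr	local_lsn;

		if (curstate->roident == InvalidRepOriginId)
			continue;

		disk_state.roident = curstate->roident;
		disk_state.remote_lsn = curstate->remote_lsn;
		local_lsn = curstate->local_lsn;

		/* make sure the commit this progress refers to is really on disk */
		XLogFlush(local_lsn);

		write(tmpfd, &disk_state, sizeof(disk_state));
		COMP_CRC32C(crc, &disk_state, sizeof(disk_state));
	}

	LWLockRelease(ReplicationOriginLock);

	/* trailer: CRC32C over everything written above */
	FIN_CRC32C(crc);
	write(tmpfd, &crc, sizeof(crc));
	CloseTransientFile(tmpfd);

	/* fsync the temp file and rename it into place durably */
	durable_rename(tmppath, path, PANIC);
}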
The following APIs are the heart of this flow. Each of these subsystems keeps its buffers in an SLRU, and each checkpoint routine below boils down to SimpleLruWriteAll --> SlruInternalWritePage (a sketch of that shared path follows the call list below). For details on the SLRU buffer pool see:
postgres源码分析 Slru缓冲池的实现-1
postgres源码分析 Slru缓冲池的实现-2
CheckPointCLOG();
CheckPointCommitTs();
CheckPointSUBTRANS();
CheckPointMultiXact();
CheckPointPredicate();
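As referenced above, a sketch of that shared SLRU flush path (abridged from slru.c; the fsync/close bookkeeping after the loop is only hinted at):
void
SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)	/* abridged sketch */
{
	SlruShared	shared = ctl->shared;
	SlruWriteAllData fdata;
	int			slotno;

	fdata.num_files = 0;

	LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);

	/* write out every dirty buffer slot of this SLRU */
	for (slotno = 0; slotno < shared->num_slots; slotno++)
		SlruInternalWritePage(ctl, slotno, &fdata);

	LWLockRelease(shared->ControlLock);

	/* ... then close the segment files collected in fdata, reporting errors ... */
}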
6 CheckPointCLOG
// Flush dirty CLOG data to disk via SimpleLruWriteAll
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*/
void
CheckPointCLOG(void)
{
/*
* Write dirty CLOG pages to disk. This may result in sync requests
* queued for later handling by ProcessSyncRequests(), as part of the
* checkpoint.
*/
TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
SimpleLruWriteAll(XactCtl, true);
TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
}
7 CheckPointCommitTs
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*/
void
CheckPointCommitTs(void)
{
/*
* Write dirty CommitTs pages to disk. This may result in sync requests
* queued for later handling by ProcessSyncRequests(), as part of the
* checkpoint.
*/
SimpleLruWriteAll(CommitTsCtl, true);
}
8 CheckPointSUBTRANS
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*/
void
CheckPointSUBTRANS(void)
{
/*
* Write dirty SUBTRANS pages to disk
*
* This is not actually necessary from a correctness point of view. We do
* it merely to improve the odds that writing of dirty pages is done by
* the checkpoint process and not by backends.
*/
TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_START(true);
SimpleLruWriteAll(SubTransCtl, true);
TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_DONE(true);
}
9 CheckPointMultiXact
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*/
void
CheckPointMultiXact(void)
{
TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_START(true);
/*
* Write dirty MultiXact pages to disk. This may result in sync requests
* queued for later handling by ProcessSyncRequests(), as part of the
* checkpoint.
*/
SimpleLruWriteAll(MultiXactOffsetCtl, true);
SimpleLruWriteAll(MultiXactMemberCtl, true);
TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
}
10 CheckPointPredicate
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*
* We don't have any data that needs to survive a restart, but this is a
* convenient place to truncate the SLRU.
*/
void
CheckPointPredicate(void)
11 CheckPointBuffers <underlying implementation to be covered in a later post>
/*
* CheckPointBuffers
* Flush all dirty blocks in buffer pool to disk at checkpoint time.
*
* Note: temporary relations do not participate in checkpoints, so they don't
* need to be flushed.
*/
void
CheckPointBuffers(int flags)
{
BufferSync(flags);
}
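Pending the later post on BufferSync, here is a rough outline of what it does. BM_CHECKPOINT_NEEDED, SyncOneBuffer and CheckpointWriteDelay are real bufmgr.c identifiers; the structure is heavily simplified and this is not the real code:
/* Rough outline of BufferSync(flags). */
static void
BufferSync_outline(int flags)
{
	/*
	 * Pass 1: scan every buffer header and tag buffers that are dirty right
	 * now with BM_CHECKPOINT_NEEDED; pages dirtied after this point are not
	 * this checkpoint's responsibility.
	 */

	/*
	 * Pass 2: collect the tagged buffers and sort them by tablespace /
	 * relation / block number so the writes are mostly sequential, then
	 * interleave the per-tablespace write streams.
	 */

	/*
	 * Pass 3: write each tagged buffer via SyncOneBuffer(), which registers
	 * an fsync request for later instead of fsyncing immediately, and call
	 * CheckpointWriteDelay() between writes so the I/O is spread out
	 * according to checkpoint_completion_target.
	 */
}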
12 ProcessSyncRequests
/*
* ProcessSyncRequests() -- Process queued fsync requests.
*/
void
ProcessSyncRequests(void)
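Its body (elided here) drains the checkpointer's shared-memory request queue into a local hash table and then fsyncs each pending file through the owning subsystem's sync handler, retrying a bounded number of times before giving up. A rough sketch (abridged; the real code lives in sync.c):
void
ProcessSyncRequests(void)	/* abridged sketch */
{
	HASH_SEQ_STATUS hstat;
	PendingFsyncEntry *entry;
	char		path[MAXPGPATH];

	/* move any requests still queued in shared memory into pendingOps */
	AbsorbSyncRequests();

	hash_seq_init(&hstat, pendingOps);
	while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
	{
		/* skip requests canceled by a later "forget" request */
		if (entry->canceled)
			continue;

		/*
		 * Hand the file to its owning subsystem's fsync callback (md.c for
		 * relations, slru.c for SLRUs, ...); path is filled in for error
		 * reporting. The real code retries failures before erroring out.
		 */
		syncsw[entry->tag.handler].sync_syncfiletag(&entry->tag, path);

		CheckpointStats.ckpt_sync_rels++;
	}

	/* the real function then removes the entries it has just serviced */
}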
13 CheckPointTwoPhase
/*
* CheckPointTwoPhase -- handle 2PC component of checkpointing.
* We must fsync the state file of any GXACT that is valid or has been
* generated during redo and has a PREPARE LSN <= the checkpoint's redo
* horizon. (If the gxact isn't valid yet, has not been generated in
* redo, or has a later LSN, this checkpoint is not responsible for
* fsyncing it.)
* This is deliberately run as late as possible in the checkpoint sequence,
* because GXACTs ordinarily have short lifespans, and so it is quite
* possible that GXACTs that were valid at checkpoint start will no longer
* exist if we wait a little bit. With typical checkpoint settings this
* will be about 3 minutes for an online checkpoint, so as a result we
* expect that there will be no GXACTs that need to be copied to disk.
* If a GXACT remains valid across multiple checkpoints, it will already
* be on disk so we don't bother to repeat that write.
*/
void
CheckPointTwoPhase(XLogRecPtr redo_horizon)
{
int i;
int serialized_xacts = 0;
if (max_prepared_xacts <= 0)
return; /* nothing to do */
TRACE_POSTGRESQL_TWOPHASE_CHECKPOINT_START();
/*
* We are expecting there to be zero GXACTs that need to be copied to
* disk, so we perform all I/O while holding TwoPhaseStateLock for
* simplicity. This prevents any new xacts from preparing while this
* occurs, which shouldn't be a problem since the presence of long-lived
* prepared xacts indicates the transaction manager isn't active.
* It's also possible to move I/O out of the lock, but on every error we
* should check whether somebody committed our transaction in different
* backend. Let's leave this optimization for future, if somebody will
* spot that this place cause bottleneck.
* Note that it isn't possible for there to be a GXACT with a
* prepare_end_lsn set prior to the last checkpoint yet is marked invalid,
* because of the efforts with delayChkpt.
*/
// Acquire TwoPhaseStateLock in shared mode and walk the prepared-transaction array
LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
{
/*
* Note that we are using gxact not PGPROC so this works in recovery
* also
*/
GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
if ((gxact->valid || gxact->inredo) &&
!gxact->ondisk &&
gxact->prepare_end_lsn <= redo_horizon)
{
char *buf;
int len;
// Read the 2PC state data back from WAL, starting at prepare_start_lsn,
// then recreate the two-phase state file and write that data into it
XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, &len);
RecreateTwoPhaseFile(gxact->xid, buf, len);
gxact->ondisk = true;
gxact->prepare_start_lsn = InvalidXLogRecPtr;
gxact->prepare_end_lsn = InvalidXLogRecPtr;
pfree(buf);
serialized_xacts++;
}
}
LWLockRelease(TwoPhaseStateLock);
/*
* Flush unconditionally the parent directory to make any information
* durable on disk. Two-phase files could have been removed and those
* removals need to be made persistent as well as any files newly created
* previously since the last checkpoint.
*/
// Make the contents of TWOPHASE_DIR durable
fsync_fname(TWOPHASE_DIR, true);
TRACE_POSTGRESQL_TWOPHASE_CHECKPOINT_DONE();
if (log_checkpoints && serialized_xacts > 0)
ereport(LOG,
(errmsg_plural("%u two-phase state file was written "
"for a long-running prepared transaction",
"%u two-phase state files were written "
"for long-running prepared transactions",
serialized_xacts,
serialized_xacts)));
}
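RecreateTwoPhaseFile, called in the loop above, writes the state data read back from WAL into a file named after the xid under pg_twophase, with a CRC32C appended, and fsyncs it. An abridged sketch (error handling removed; the real function is in twophase.c):
/* Abridged sketch of RecreateTwoPhaseFile(). */
static void
RecreateTwoPhaseFile_sketch(TransactionId xid, void *content, int len)
{
	char		path[MAXPGPATH];
	pg_crc32c	statefile_crc;
	int			fd;

	/* recompute the CRC over the state data read back from WAL */
	INIT_CRC32C(statefile_crc);
	COMP_CRC32C(statefile_crc, content, len);
	FIN_CRC32C(statefile_crc);

	/* state files live in pg_twophase/ and are named after the xid */
	snprintf(path, sizeof(path), TWOPHASE_DIR "/%08X", xid);

	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);

	/* write the state data followed by its CRC, then fsync */
	write(fd, content, len);
	write(fd, &statefile_crc, sizeof(pg_crc32c));
	pg_fsync(fd);
	CloseTransientFile(fd);
}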