PostgreSQL source code analysis 16: studying the checkpoint source code, part 2

Serendipity_Shy

Last modified 2022-10-01 11:45:09

First published 2022-08-21 18:55:47

Article link: https://blog.csdn.net/qq_52668274/article/details/126364113

This part focuses on the CreateCheckPoint function, which is called from CheckpointerMain. For background, see the earlier article, postgres checkpoint source code study, part 1.

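Source code analysis

CreateCheckPoint has the following main jobs:
1) Flush dirty data (dirty data pages in shared buffers and dirty CLOG pages in the SLRU buffers);
2) Write the checkpoint WAL record;
3) Update the control file;
4) Remove obsolete WAL segment files (only once the removal conditions are met).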

/*
 * Perform a checkpoint --- either during shutdown, or on-the-fly
 * (a checkpoint runs both at database shutdown and whenever a trigger
 * condition is met)
 *
 * flags is a bitwise OR of the following:
 *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
 *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
 *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
 *		ignoring checkpoint_completion_target parameter.
 *	CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred
 *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
 *		CHECKPOINT_END_OF_RECOVERY).
 *	CHECKPOINT_FLUSH_ALL: also flush buffers of unlogged tables.
 *
 * Note: flags contains other bits, of interest here only for logging purposes.
 * In particular note that this routine is synchronous and does not pay
 * attention to CHECKPOINT_WAIT.
 *
 * If !shutdown then we are writing an online checkpoint. This is a very special
 * kind of operation and WAL record because the checkpoint action occurs over
 * a period of time yet logically occurs at just a single LSN. The logical
 * position of the WAL record (redo ptr) is the same or earlier than the
 * physical position. When we replay WAL we locate the checkpoint via its
 * physical position then read the redo ptr and actually start replay at the
 * earlier logical position. Note that we don't write *anything* to WAL at
 * the logical position, so that location could be any other kind of WAL record.
 * All of this mechanism allows us to continue working while we checkpoint.
 * As a result, timing of actions is critical here and be careful to note that
 * this function will likely take minutes to execute on a busy system.
 *
 * In short: an online checkpoint takes a while to complete, yet logically it
 * happens at the single LSN held in its redo pointer, which is at or before
 * the physical position of the checkpoint record. Replay finds the record by
 * its physical position, reads the redo pointer and starts from there, which
 * is what lets normal work continue while the checkpoint runs.
 */

void
CreateCheckPoint(int flags)
{
	bool		shutdown;
	CheckPoint	checkPoint;
	XLogRecPtr	recptr;
	XLogSegNo	_logSegNo;
	XLogCtlInsert *Insert = &XLogCtl->Insert;
	uint32		freespace;
	XLogRecPtr	PriorRedoPtr;
	XLogRecPtr	curInsert;
	XLogRecPtr	last_important_lsn;
	VirtualTransactionId *vxids;
	int			nvxids;

	/*
	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
	 * issued at a different time.
	 */
	if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))  // decide from the flags which kind of checkpoint this is
		shutdown = true;
	else
		shutdown = false;

	/* sanity check */ 									// only an end-of-recovery checkpoint may be created while recovery is in progress
	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
		elog(ERROR, "can't create a checkpoint during recovery");

	/*
	 * Initialize InitXLogInsert working areas before entering the critical
	 * section.  Normally, this is done by the first call to
	 * RecoveryInProgress() or LocalSetXLogInsertAllowed(), but when creating
	 * an end-of-recovery checkpoint, the LocalSetXLogInsertAllowed call is
	 * done below in a critical section, and InitXLogInsert cannot be called
	 * in a critical section.
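	 *
	 * (This sets up the in-memory working area used to assemble XLOG records;
	 * it has to be ready before the critical section is entered.)
	 */
	InitXLogInsert();
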
	/*
	 * Prepare to accumulate statistics.
	 *
	 * (The CheckpointStats struct is zeroed and the start time is recorded.)
	 *
	 * Note: because it is possible for log_checkpoints to change while a
	 * checkpoint proceeds, we always accumulate stats, even if
	 * log_checkpoints is currently off.
	 */
	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();

	/*
	 * Use a critical section to force system panic if we have trouble.
	 */
	START_CRIT_SECTION();

	// For a shutdown-type checkpoint, set ControlFile->state and ControlFile->time and write the
	// control file out (the exclusive ControlFileLock must be held)
	if (shutdown)
	{
		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
		ControlFile->state = DB_SHUTDOWNING;
		ControlFile->time = (pg_time_t) time(NULL);
		UpdateControlFile();
		LWLockRelease(ControlFileLock);
	}

	/*
	 * Let smgr prepare for checkpoint; this has to happen before we determine
	 * the REDO pointer.  Note that smgr must not do anything that'd have to
	 * be undone if we decide no checkpoint is needed.
	 */
	 // tell smgr (the storage manager) to get ready for the checkpoint
	SyncPreCheckpoint();

	/* Begin filling in the checkpoint WAL record */
	// start filling in the checkpoint WAL record
	MemSet(&checkPoint, 0, sizeof(checkPoint));
	checkPoint.time = (pg_time_t) time(NULL);

	/*
	 * For Hot Standby, derive the oldestActiveXid before we fix the redo
	 * pointer. This allows us to begin accumulating changes to assemble our
	 * starting snapshot of locks and transactions.
	 */
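	 // For a non-shutdown checkpoint taken while hot-standby information is being logged
	 // (XLogStandbyInfoActive(), that is, wal_level is replica or higher), compute the oldest
	 // active XID before the redo pointer is fixed. This lets changes be accumulated for the
	 // starting snapshot of locks and transactions; otherwise the field is set to InvalidTransactionId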
	if (!shutdown && XLogStandbyInfoActive())
		checkPoint.oldestActiveXid = GetOldestActiveTransactionId();  
	else
		checkPoint.oldestActiveXid = InvalidTransactionId;

	/*
	 * Get location of last important record before acquiring insert locks (as
	 * GetLastImportantRecPtr() also locks WAL locks).
	 */
	 // get the LSN of the last important record before taking the insert locks
	last_important_lsn = GetLastImportantRecPtr();

	/*
	 * We must block concurrent insertions while examining insert state to
	 * determine the checkpoint REDO pointer.
	 */
	 // take the WAL insert locks exclusively, then compute the current XLOG insert position (a physical address)
	WALInsertLockAcquireExclusive();
	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
	
	/*
	 * If this isn't a shutdown or forced checkpoint, and if there has been no
	 * WAL activity requiring a checkpoint, skip it.  The idea here is to
	 * avoid inserting duplicate checkpoints when the system is idle.
	 */
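	 // For a checkpoint that is not shutdown, end-of-recovery or forced: if the system is idle and no
	 // important WAL has been generated, release the locks, leave the critical section and return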
	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
				  CHECKPOINT_FORCE)) == 0)
	{
		if (last_important_lsn == ControlFile->checkPoint)
		{
			WALInsertLockRelease();
			END_CRIT_SECTION();
			ereport(DEBUG1,
					(errmsg_internal("checkpoint skipped because system is idle")));
			return;
		}
	}

	/*
	 * An end-of-recovery checkpoint is created before anyone is allowed to
	 * write WAL. To allow us to write the checkpoint record, temporarily
	 * enable XLogInsertAllowed.  (This also ensures ThisTimeLineID is
	 * initialized, which we need here and in AdvanceXLInsertBuffer.)
	 */
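	 // An end-of-recovery checkpoint is created before WAL writes are allowed. To be able to write the
	 // checkpoint record, XLogInsertAllowed is enabled temporarily; the call also initializes the
	 // timeline ID and sets up the resources needed to assemble WAL afterwards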
	if (flags & CHECKPOINT_END_OF_RECOVERY)
		LocalSetXLogInsertAllowed();

	checkPoint.ThisTimeLineID = ThisTimeLineID;
	if (flags & CHECKPOINT_END_OF_RECOVERY)
		checkPoint.PrevTimeLineID = XLogCtl->PrevTimeLineID;
	else
		checkPoint.PrevTimeLineID = ThisTimeLineID;

	checkPoint.fullPageWrites = Insert->fullPageWrites;
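
The flag handling above (treating end-of-recovery as a shutdown checkpoint, and skipping an ordinary checkpoint on an idle system) can be sketched in isolation. This is a minimal standalone model, not PostgreSQL code: the CKPT_* names and their bit values, and both helper functions, are assumptions made up for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bits; names and values are assumptions for this sketch. */
#define CKPT_IS_SHUTDOWN      0x0001
#define CKPT_END_OF_RECOVERY  0x0002
#define CKPT_IMMEDIATE        0x0004
#define CKPT_FORCE            0x0008

/* Shutdown and end-of-recovery checkpoints both count as "shutdown". */
static bool
is_shutdown_checkpoint(int flags)
{
	return (flags & (CKPT_IS_SHUTDOWN | CKPT_END_OF_RECOVERY)) != 0;
}

/*
 * An ordinary checkpoint is skipped when the last important LSN still equals
 * the LSN of the previous checkpoint record, meaning nothing that needs a
 * checkpoint has been written since. Shutdown, end-of-recovery and forced
 * checkpoints are never skipped.
 */
static bool
can_skip_checkpoint(int flags, uint64_t last_important_lsn,
					uint64_t last_checkpoint_lsn)
{
	if (flags & (CKPT_IS_SHUTDOWN | CKPT_END_OF_RECOVERY | CKPT_FORCE))
		return false;
	return last_important_lsn == last_checkpoint_lsn;
}

/* A few assertions documenting the expected behaviour. */
void
checkpoint_flags_self_check(void)
{
	assert(is_shutdown_checkpoint(CKPT_END_OF_RECOVERY));
	assert(!is_shutdown_checkpoint(CKPT_IMMEDIATE | CKPT_FORCE));
	assert(can_skip_checkpoint(CKPT_IMMEDIATE, 100, 100));
	assert(!can_skip_checkpoint(CKPT_FORCE, 100, 100));
	assert(!can_skip_checkpoint(0, 200, 100));
}
```

Note that CKPT_IMMEDIATE on its own does not prevent the skip: it only changes how fast the checkpoint runs once it has been decided that one is needed.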

Determining the checkpoint redo point

	/*
	 * Compute new REDO record ptr = location of next XLOG record.
	 *
	 * NB: this is NOT necessarily where the checkpoint record itself will be,
	 * since other backends may insert more XLOG records while we're off doing
	 * the buffer flush work.  Those XLOG records are logically after the
	 * checkpoint, even though physically before it.  Got that?
	 */
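	 // Boundary handling: if the insert position sits exactly on a page boundary (freespace == 0),
	 // skip the page header that will be written there:
	 // 1) at the start of a segment file, add SizeOfXLogLongPHD;
	 // 2) at a page boundary inside a segment, add SizeOfXLogShortPHD
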
	freespace = INSERT_FREESPACE(curInsert);
	if (freespace == 0)
	{
		if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
			curInsert += SizeOfXLogLongPHD;
		else
			curInsert += SizeOfXLogShortPHD;
	}
	checkPoint.redo = curInsert;


	/*
	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
	 * must be done while holding all the insertion locks.
	 *
	 * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
	 * pointing past where it really needs to point.  This is okay; the only
	 * consequence is that XLogInsert might back up whole buffers that it
	 * didn't really need to.  We can't postpone advancing RedoRecPtr because
	 * XLogInserts that happen while we are dumping buffers must assume that
	 * their buffer changes are not included in the checkpoint.
	 */
	RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;

	/*
	 * Now we can release the WAL insertion locks, allowing other xacts to
	 * proceed while we are flushing disk buffers.
	 */
	WALInsertLockRelease();
    
    // also update the copy of RedoRecPtr kept in the XLogCtl struct
	/* Update the info_lck-protected copy of RedoRecPtr as well */
	SpinLockAcquire(&XLogCtl->info_lck);
	XLogCtl->RedoRecPtr = checkPoint.redo;
	SpinLockRelease(&XLogCtl->info_lck);

	/*
	 * If enabled, log checkpoint start.  We postpone this until now so as not
	 * to log anything if we decided to skip the checkpoint.
	 */
	if (log_checkpoints)
		LogCheckpointStart(flags, false);

	/* Update the process title */
	update_checkpoint_display(flags, false, false);

	TRACE_POSTGRESQL_CHECKPOINT_START(flags);

	/*
	 * Get the other info we need for the checkpoint record.
	 *
	 * We don't need to save oldestClogXid in the checkpoint, it only matters
	 * for the short period in which clog is being truncated, and if we crash
	 * during that we'll redo the clog truncation and fix up oldestClogXid
	 * there.
	 */
	LWLockAcquire(XidGenLock, LW_SHARED);
	checkPoint.nextXid = ShmemVariableCache->nextXid;
	checkPoint.oldestXid = ShmemVariableCache->oldestXid;
	checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
	LWLockRelease(XidGenLock);

	LWLockAcquire(CommitTsLock, LW_SHARED);
	checkPoint.oldestCommitTsXid = ShmemVariableCache->oldestCommitTsXid;
	checkPoint.newestCommitTsXid = ShmemVariableCache->newestCommitTsXid;
	LWLockRelease(CommitTsLock);

	LWLockAcquire(OidGenLock, LW_SHARED);
	checkPoint.nextOid = ShmemVariableCache->nextOid;
	if (!shutdown)
		checkPoint.nextOid += ShmemVariableCache->oidCount;
	LWLockRelease(OidGenLock);

	MultiXactGetCheckptMulti(shutdown,
							 &checkPoint.nextMulti,
							 &checkPoint.nextMultiOffset,
							 &checkPoint.oldestMulti,
							 &checkPoint.oldestMultiDB);

	/*
	 * Having constructed the checkpoint record, ensure all shmem disk buffers
	 * and commit-log buffers are flushed to disk.
	 *
	 * This I/O could fail for various reasons.  If so, we will fail to
	 * complete the checkpoint, but there is no reason to force a system
	 * panic. Accordingly, exit critical section while doing it.
	 */
	END_CRIT_SECTION();
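
The page-boundary adjustment made when the redo point is computed at the start of this section can be tried out on its own. The following is a minimal sketch under stated assumptions: 8 kB WAL pages, 16 MB segments, and page-header sizes of 24 and 40 bytes as on a typical 64-bit build. The constants and function names are invented for the example and are not the PostgreSQL macros.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for this sketch (not taken from the article). */
#define WAL_BLCKSZ         UINT64_C(8192)
#define WAL_SEGMENT_SIZE   (UINT64_C(16) * 1024 * 1024)
#define SHORT_PAGE_HEADER  UINT64_C(24)
#define LONG_PAGE_HEADER   UINT64_C(40)

/* Bytes left on the current page; 0 when ptr is exactly on a page boundary. */
static uint64_t
insert_freespace(uint64_t ptr)
{
	uint64_t	off = ptr % WAL_BLCKSZ;

	return off == 0 ? 0 : WAL_BLCKSZ - off;
}

/*
 * If the insert position is on a page boundary, the next record cannot start
 * there: a page header comes first. The first page of a segment carries the
 * long header, every other page the short one.
 */
static uint64_t
compute_redo_ptr(uint64_t cur_insert)
{
	if (insert_freespace(cur_insert) == 0)
	{
		if (cur_insert % WAL_SEGMENT_SIZE == 0)
			cur_insert += LONG_PAGE_HEADER;
		else
			cur_insert += SHORT_PAGE_HEADER;
	}
	return cur_insert;
}

/* A few assertions documenting the expected behaviour. */
void
redo_ptr_self_check(void)
{
	/* start of a segment: skip the long header */
	assert(compute_redo_ptr(UINT64_C(0x1000000)) == UINT64_C(0x1000028));
	/* page boundary inside a segment: skip the short header */
	assert(compute_redo_ptr(UINT64_C(0x1002000)) == UINT64_C(0x1002018));
	/* middle of a page: unchanged */
	assert(compute_redo_ptr(UINT64_C(0x1002100)) == UINT64_C(0x1002100));
}
```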

Flushing dirty pages


	/*
	 * In some cases there are groups of actions that must all occur on one
	 * side or the other of a checkpoint record. Before flushing the
	 * checkpoint record we must explicitly wait for any backend currently
	 * performing those groups of actions.
	 *
	 * One example is end of transaction, so we must wait for any transactions
	 * that are currently in commit critical sections.  If an xact inserted
	 * its commit record into XLOG just before the REDO point, then a crash
	 * restart from the REDO point would not replay that record, which means
	 * that our flushing had better include the xact's update of pg_xact.  So
	 * we wait till he's out of his commit critical section before proceeding.
	 * See notes in RecordTransactionCommit().
	 *
	 * Because we've already released the insertion locks, this test is a bit
	 * fuzzy: it is possible that we will wait for xacts we didn't really need
	 * to wait for.  But the delay should be short and it seems better to make
	 * checkpoint take a bit longer than to hold off insertions longer than
	 * necessary. (In fact, the whole reason we have this issue is that xact.c
	 * does commit record XLOG insertion and clog update as two separate steps
	 * protected by different locks, but again that seems best on grounds of
	 * minimizing lock contention.)
	 *
	 * A transaction that has not yet set delayChkpt when we look cannot be at
	 * risk, since he's not inserted his commit record yet; and one that's
	 * already cleared it is not at risk either, since he's done fixing clog
	 * and we will correctly flush the update below.  So we cannot miss any
	 * xacts we need to wait for.
	 */
	vxids = GetVirtualXIDsDelayingChkpt(&nvxids); 	// key step: sleep until backends caught between writing XLOG and updating CLOG have finished
	if (nvxids > 0)
	{
		do
		{
			pg_usleep(10000L);	/* wait for 10 msec */
		} while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
	}
	pfree(vxids);

	CheckPointGuts(checkPoint.redo, flags);			   // flush dirty data (the core work, covered in the next article)

	/*
	 * Take a snapshot of running transactions and write this to WAL. This
	 * allows us to reconstruct the state of running transactions during
	 * archive recovery, if required. Skip, if this info disabled.
	 *
	 * If we are shutting down, or Startup process is completing crash
	 * recovery we don't need to write running xact data.
	 */
	if (!shutdown && XLogStandbyInfoActive())	// for a non-shutdown checkpoint, log a snapshot of running transactions
												// so that archive recovery can rebuild the transaction state
		LogStandbySnapshot();
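
The wait before CheckPointGuts follows a snapshot-then-poll pattern: the set of delaying transactions is captured once, and only that set is polled afterwards. A small standalone model of the idea follows. The Backend struct and both helpers are hypothetical stand-ins written for this sketch, and the 10 ms sleep of the real loop is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-backend shared state. */
typedef struct
{
	int			vxid;
	bool		delay_chkpt;
} Backend;

/* Collect the vxids that are delaying the checkpoint right now. */
static int
snapshot_delaying(const Backend *b, int n, int *out)
{
	int			cnt = 0;

	for (int i = 0; i < n; i++)
		if (b[i].delay_chkpt)
			out[cnt++] = b[i].vxid;
	return cnt;
}

/* True while any vxid from the snapshot is still delaying. */
static bool
have_delaying(const Backend *b, int n, const int *vxids, int nvxids)
{
	for (int i = 0; i < n; i++)
	{
		if (!b[i].delay_chkpt)
			continue;
		for (int j = 0; j < nvxids; j++)
			if (b[i].vxid == vxids[j])
				return true;
	}
	return false;
}

/* A few assertions documenting the expected behaviour. */
void
delay_wait_self_check(void)
{
	Backend		b[3] = {{1, true}, {2, false}, {3, true}};
	int			vxids[3];
	int			n = snapshot_delaying(b, 3, vxids);

	assert(n == 2 && vxids[0] == 1 && vxids[1] == 3);
	b[0].delay_chkpt = false;	/* backend 1 finishes its commit */
	b[1].delay_chkpt = true;	/* backend 2 starts after the snapshot */
	assert(have_delaying(b, 3, vxids, n));	/* still waiting for 3 */
	b[2].delay_chkpt = false;
	assert(!have_delaying(b, 3, vxids, n)); /* 2 is not waited for */
}
```

Backend 2 is ignored on purpose. As the source comment explains, a transaction that had not yet set delayChkpt when the snapshot was taken has not inserted its commit record yet, so it is not at risk.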

Writing the checkpoint record and flushing it to disk

	START_CRIT_SECTION();

	/*
	 * Now insert the checkpoint record into XLOG.
	 */
	XLogBeginInsert();
	XLogRegisterData((char *) (&checkPoint), sizeof(checkPoint));
	recptr = XLogInsert(RM_XLOG_ID,
						shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
						XLOG_CHECKPOINT_ONLINE);

	XLogFlush(recptr);								 	// flush it to disk

	/*
	 * We mustn't write any new WAL after a shutdown checkpoint, or it will be
	 * overwritten at next startup.  No-one should even try, this just allows
	 * sanity-checking.  In the case of an end-of-recovery checkpoint, we want
	 * to just temporarily disable writing until the system has exited
	 * recovery.
	 */
	if (shutdown)				
	{
		if (flags & CHECKPOINT_END_OF_RECOVERY)
			LocalXLogInsertAllowed = -1;	/* return to "check" state */   // back to the "check" state
		else
			LocalXLogInsertAllowed = 0; /* never again write WAL */         // no more WAL may be written
	}
    
    // During a shutdown checkpoint nothing else may write WAL: if checkPoint.redo != ProcLastRecPtr, PANIC with an error message
	/*
	 * We now have ProcLastRecPtr = start of actual checkpoint record, recptr
	 * = end of actual checkpoint record.
	 */
	if (shutdown && checkPoint.redo != ProcLastRecPtr)
		ereport(PANIC,
				(errmsg("concurrent write-ahead log activity while database system is shutting down")));

	/*
	 * Remember the prior checkpoint's redo ptr for
	 * UpdateCheckPointDistanceEstimate()
	 */
	 // remember the previous checkpoint's redo pointer, for UpdateCheckPointDistanceEstimate()
	PriorRedoPtr = ControlFile->checkPointCopy.redo;

Updating the control file

	/*
	 * Update the control file.
	 */
	 // take ControlFileLock in exclusive mode
	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
	if (shutdown)
		ControlFile->state = DB_SHUTDOWNED;                        // database state flag
	ControlFile->checkPoint = ProcLastRecPtr;
	ControlFile->checkPointCopy = checkPoint;					   // the CheckPoint struct filled in above
	ControlFile->time = (pg_time_t) time(NULL);
	/* crash recovery should always recover to the end of WAL */
	ControlFile->minRecoveryPoint = InvalidXLogRecPtr;         
	ControlFile->minRecoveryPointTLI = 0;

	/*
	 * Persist unloggedLSN value. It's reset on crash recovery, so this goes
	 * unused on non-shutdown checkpoints, but seems useful to store it always
	 * for debugging purposes.
	 */
	SpinLockAcquire(&XLogCtl->ulsn_lck);
	ControlFile->unloggedLSN = XLogCtl->unloggedLSN;
	SpinLockRelease(&XLogCtl->ulsn_lck);

	// write out the control file, then release the lock
	UpdateControlFile();
	LWLockRelease(ControlFileLock);
     
   // save the checkpoint's next XID into the XLogCtl struct in shared memory
	/* Update shared-memory copy of checkpoint XID/epoch */
	SpinLockAcquire(&XLogCtl->info_lck);
	XLogCtl->ckptFullXid = checkPoint.nextXid;
	SpinLockRelease(&XLogCtl->info_lck);

	/*
	 * We are now done with critical updates; no need for system panic if we
	 * have trouble while fooling with old log segments.
	 */
	END_CRIT_SECTION();

	/*
	 * Let smgr do post-checkpoint cleanup (eg, deleting old files).
	 */
	SyncPostCheckpoint();

Removing obsolete XLOG files


	/*
	 * Update the average distance between checkpoints if the prior checkpoint
	 * exists.
	 */
	 // update the average distance between checkpoints, used to estimate how fast WAL is growing
	if (PriorRedoPtr != InvalidXLogRecPtr)
		UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);

	/*
	 * Delete old log files, those no longer needed for last checkpoint to
	 * prevent the disk holding the xlog from growing full.
	 */
	 
	XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);   // segment number that contains the redo pointer
    /* adjust _logSegNo, the oldest segment that must be kept, according to max_slot_wal_keep_size and wal_keep_size */
	KeepLogSeg(recptr, &_logSegNo);
	if (InvalidateObsoleteReplicationSlots(_logSegNo))
	{
		/*
		 * Some slots have been invalidated; recalculate the old-segment
		 * horizon, starting again from RedoRecPtr.
		 */
		XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
		KeepLogSeg(recptr, &_logSegNo);
	}
	_logSegNo--;			 // segments up to and including this number are no longer needed and can be removed
	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);		// remove the obsolete XLOG segment files

	/*
	 * Make more log segments if needed.  (Do this after recycling old log
	 * segments, since that may supply some of the needed files.)
	 */
	if (!shutdown)
		PreallocXlogFiles(recptr);

	/*
	 * Truncate pg_subtrans if possible.  We can throw away all data before
	 * the oldest XMIN of any running transaction.  No future transaction will
	 * attempt to reference any pg_subtrans entry older than that (see Asserts
	 * in subtrans.c).  During recovery, though, we mustn't do this because
	 * StartupSUBTRANS hasn't been called yet.
	 */
	 // truncate subtransaction data older than the oldest transaction still considered running
	if (!RecoveryInProgress())
		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());

	/* Real work is done; log and update stats. */
	LogCheckpointEnd(false);

	/* Reset the process title */
	update_checkpoint_display(flags, false, true);

	TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
									 NBuffers,
									 CheckpointStats.ckpt_segs_added,
									 CheckpointStats.ckpt_segs_removed,
									 CheckpointStats.ckpt_segs_recycled);
}
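
To make the segment arithmetic in the last section concrete, here is a minimal sketch that maps a redo LSN to its segment number and names the newest segment file that could be removed. It assumes 16 MB segments and the usual 24-hex-digit WAL file naming, and it ignores the adjustments made by KeepLogSeg and by replication slots. The helper names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Segment number that contains the given LSN. */
static uint64_t
lsn_to_segno(uint64_t lsn, uint64_t seg_size)
{
	assert(seg_size > 0);
	return lsn / seg_size;
}

/* Format a WAL file name: timeline ID, then the segment number in two parts. */
static void
wal_file_name(char *buf, size_t len, uint32_t tli,
			  uint64_t segno, uint64_t seg_size)
{
	uint64_t	segs_per_id = UINT64_C(0x100000000) / seg_size;

	snprintf(buf, len, "%08X%08X%08X",
			 (unsigned int) tli,
			 (unsigned int) (segno / segs_per_id),
			 (unsigned int) (segno % segs_per_id));
}

/* A few assertions documenting the expected behaviour. */
void
wal_segment_self_check(void)
{
	const uint64_t seg_size = UINT64_C(16) * 1024 * 1024;
	uint64_t	redo = UINT64_C(0x5A000028);	/* LSN 0/5A000028 */
	uint64_t	segno = lsn_to_segno(redo, seg_size);
	char		name[32];

	assert(segno == 0x5A);

	/* the segment holding the redo point is kept; the one before it is not */
	wal_file_name(name, sizeof(name), 1, segno - 1, seg_size);
	assert(strcmp(name, "000000010000000000000059") == 0);
}
```

With the redo point at 0/5A000028 on timeline 1, segment 0x5A still holds WAL needed for recovery, so in this simplified model the newest file eligible for removal or recycling is 000000010000000000000059.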