PostgreSQL Source Code Analysis 10: The Checkpoint Mechanism (Part 1)

Serendipity_Shy

Last modified 2022-10-01 11:49:27

First published 2022-07-27 16:29:57

Article link: https://blog.csdn.net/qq_52668274/article/details/125988101

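1 Background

The checkpointer is a background process in PostgreSQL. A checkpoint flushes the dirty data that must be made durable: the SLRU buffer pools for CLOG and SUBTRANS, the dirty buffers in shared memory, and the state of two-phase-commit transactions. It then updates the control file and removes WAL segments that are no longer needed. After a crash, the server reads the most recent checkpoint record and replays WAL starting from that record's redo point, which brings the database back to a consistent state and protects data integrity (see the official PostgreSQL documentation).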

2 Key data structures

2.1 CheckPoint

This structure is the body of a checkpoint WAL record: the redo position, TimeLineID, nextXid, nextOid, and so on.

/*
 * Body of CheckPoint XLOG records.  This is declared here because we keep
 * a copy of the latest one in pg_control for possible disaster recovery.
 * Changing this struct requires a PG_CONTROL_VERSION bump.
 */
typedef struct CheckPoint
{
	XLogRecPtr	redo;			/* next RecPtr available when we began to
								 * create CheckPoint (i.e. REDO start point) */
	TimeLineID	ThisTimeLineID; /* current TLI */
	TimeLineID	PrevTimeLineID; /* previous TLI, if this record begins a new
								 * timeline (equals ThisTimeLineID otherwise) */
	bool		fullPageWrites; /* current full_page_writes */
	FullTransactionId nextXid;	/* next free transaction ID */
	Oid			nextOid;		/* next free OID */
	MultiXactId nextMulti;		/* next free MultiXactId */
	MultiXactOffset nextMultiOffset;	/* next free MultiXact offset */
	TransactionId oldestXid;	/* cluster-wide minimum datfrozenxid */
	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
	Oid			oldestMultiDB;	/* database with minimum datminmxid */
	pg_time_t	time;			/* time stamp of checkpoint */
	TransactionId oldestCommitTsXid;	/* oldest Xid with valid commit
										 * timestamp */
	TransactionId newestCommitTsXid;	/* newest Xid with valid commit
										 * timestamp */
	/*
	 * Oldest XID still running. This is only needed to initialize hot standby
	 * mode from an online checkpoint, so we only bother calculating this for
	 * online checkpoints and only when wal_level is replica. Otherwise it's
	 * set to InvalidTransactionId.
	 */
	TransactionId oldestActiveXid;
} CheckPoint;
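
The redo field is an XLogRecPtr, a 64-bit byte position in the WAL stream. As a small illustration of how such a position is usually displayed and how it maps to a WAL segment file, here is a minimal sketch. The typedefs are stand-ins for the PostgreSQL ones, and format_lsn and wal_file_name are hypothetical helper names, not functions from the PostgreSQL source; the 16 MB segment size is the common default and is an assumption here.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-ins for the PostgreSQL typedefs (assumed to be plain integers). */
typedef uint64_t XLogRecPtr;
typedef uint32_t TimeLineID;

/* Format an LSN in the usual "hi/lo" form: high and low 32 bits in hex. */
static void
format_lsn(XLogRecPtr ptr, char *buf, size_t len)
{
	snprintf(buf, len, "%" PRIX32 "/%" PRIX32,
			 (uint32_t) (ptr >> 32), (uint32_t) ptr);
}

/* Build the name of the WAL segment file that contains ptr. */
static void
wal_file_name(TimeLineID tli, XLogRecPtr ptr, uint64_t seg_size,
			  char *buf, size_t len)
{
	uint64_t	segno;
	uint64_t	segs_per_id;

	assert(seg_size > 0);
	segno = ptr / seg_size;							/* global segment number */
	segs_per_id = UINT64_C(0x100000000) / seg_size;	/* segments per 4 GB */

	snprintf(buf, len, "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
			 tli,
			 (uint32_t) (segno / segs_per_id),
			 (uint32_t) (segno % segs_per_id));
}
```

With a redo position of 0x1000028 on timeline 1, the sketch yields the LSN string 0/1000028 and a segment file name of 000000010000000000000001, which is where crash recovery would start reading.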

2.2 CheckpointerShmemStruct

This structure lives in shared memory and carries the communication state between the checkpointer process and the other server processes.

typedef struct
{
	pid_t		checkpointer_pid;	/* PID (0 if not started) */   

	slock_t		ckpt_lck;		/* protects all the ckpt_* fields */ 

	int			ckpt_started;	/* advances when checkpoint starts */    // counter
	int			ckpt_done;		/* advances when checkpoint done */
	int			ckpt_failed;	/* advances when checkpoint fails */

	int			ckpt_flags;		/* checkpoint flags, as defined in xlog.h */ 

	ConditionVariable start_cv; /* signaled when ckpt_started advances */
	ConditionVariable done_cv;	/* signaled when ckpt_done advances */

	uint32		num_backend_writes; /* counts user backend buffer writes */
	uint32		num_backend_fsync;	/* counts user backend fsync calls */

	int			num_requests;	/* current # of requests */
	int			max_requests;	/* allocated array size */
	CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
} CheckpointerShmemStruct;

static CheckpointerShmemStruct *CheckpointerShmem;

   Flag bits stored in ckpt_flags:

/* These directly affect the behavior of CreateCheckPoint and subsidiaries */
#define CHECKPOINT_IS_SHUTDOWN	0x0001	/* Checkpoint is for shutdown */      
#define CHECKPOINT_END_OF_RECOVERY	0x0002	/* Like shutdown checkpoint, but
											 * issued at end of WAL recovery */
#define CHECKPOINT_IMMEDIATE	0x0004	/* Do it without delays */
#define CHECKPOINT_FORCE		0x0008	/* Force even if no activity */
#define CHECKPOINT_FLUSH_ALL	0x0010	/* Flush all pages, including those
										 * belonging to unlogged tables */
/* These are important to RequestCheckpoint */
#define CHECKPOINT_WAIT			0x0020	/* Wait for completion */
#define CHECKPOINT_REQUESTED	0x0040	/* Checkpoint request has been made */
/* These indicate the cause of a checkpoint request */
#define CHECKPOINT_CAUSE_XLOG	0x0080	/* XLOG consumption */
#define CHECKPOINT_CAUSE_TIME	0x0100	/* Elapsed time */

3 Main checkpoint flow

Reading the source code (PostgreSQL 14), the call flow starting at CheckpointerMain can be summarized as follows:

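CheckpointerMain 
	-> CreateCheckPoint 
		-> CheckPointGuts (flushes all dirty data covered by the checkpoint)
			-> CheckPointBuffers (flushes the dirty pages in shared buffers)
				-> BufferSync (writes out all dirty buffers)
					-> SyncOneBuffer (processes one buffer)
						-> FlushBuffer (performs the actual write)
							-> XLogFlush (flushes WAL up to the page's LSN first)
			-> ProcessSyncRequests (issues the queued fsync calls so that the data written by BufferSync really reaches disk)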

The entry function CheckpointerMain is long, so let us go through it piece by piece to see how it works:

void
CheckpointerMain(void)
{
	sigjmp_buf	local_sigjmp_buf;
	MemoryContext checkpointer_context;

	CheckpointerShmem->checkpointer_pid = MyProcPid;
	
	/*
	 * Properly accept or ignore signals the postmaster might send us
	 *
	 * Note: we deliberately ignore SIGTERM, because during a standard Unix
	 * system shutdown cycle, init will SIGTERM all processes at once.  We
	 * want to wait for the backends to exit, whereupon the postmaster will
	 * tell us it's okay to shut down (via SIGUSR2).
	 */
	// Register signal handlers: SIGHUP reloads the configuration, SIGINT requests a checkpoint,
	// SIGUSR2 requests shutdown; SIGTERM, SIGALRM and SIGPIPE are ignored
	pqsignal(SIGHUP, SignalHandlerForConfigReload);
	pqsignal(SIGINT, ReqCheckpointHandler); /* request checkpoint */
	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
	/* SIGQUIT handler was already set up by InitPostmasterChild */
	pqsignal(SIGALRM, SIG_IGN);
	pqsignal(SIGPIPE, SIG_IGN);
	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
	pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);

	/*
	 * Reset some signals that are accepted by postmaster but not here
	 */
	pqsignal(SIGCHLD, SIG_DFL);

	/*
	 * Initialize so that first time-driven event happens at the correct time.
	 */
	last_checkpoint_time = last_xlog_switch_time = (pg_time_t) time(NULL);

	/*
	 * Create a memory context that we will do all our work in.  We do this so
	 * that we can reset the context during error recovery and thereby avoid
	 * possible memory leaks.  Formerly this code just ran in
	 * TopMemoryContext, but resetting that would be a really bad idea.
	 */
	// Create the memory context in which all checkpointer work runs
	checkpointer_context = AllocSetContextCreate(TopMemoryContext,
												 "Checkpointer",
												 ALLOCSET_DEFAULT_SIZES);
	MemoryContextSwitchTo(checkpointer_context);

	/*
	 * If an exception is encountered, processing resumes here.
	 *
	 * You might wonder why this isn't coded as an infinite loop around a
	 * PG_TRY construct.  The reason is that this is the bottom of the
	 * exception stack, and so with PG_TRY there would be no exception handler
	 * in force at all during the CATCH part.  By leaving the outermost setjmp
	 * always active, we have at least some chance of recovering from an error
	 * during error recovery.  (If we get into an infinite loop thereby, it
	 * will soon be stopped by overflow of elog.c's internal state stack.)
	 *
	 * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask
	 * (to wit, BlockSig) will be restored when longjmp'ing to here.  Thus,
	 * signals other than SIGQUIT will be blocked until we complete error
	 * recovery.  It might seem that this policy makes the HOLD_INTERRUPTS()
	 * call redundant, but it is not since InterruptPending might be set
	 * already.
	 */
	// Set up the sigsetjmp recovery point; ereport(ERROR) longjmps back to here
	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
	{
		/* Since not using PG_TRY, must reset error stack by hand */
		error_context_stack = NULL;

		/* Prevent interrupts while cleaning up */
		HOLD_INTERRUPTS();

		/* Report the error to the server log */
		EmitErrorReport();

		/*
		 * These operations are really just a minimal subset of
		 * AbortTransaction().  We don't have very many resources to worry
		 * about in checkpointer, but we do have LWLocks, buffers, and temp
		 * files.
		 */
		LWLockReleaseAll();
		ConditionVariableCancelSleep();
		pgstat_report_wait_end();
		AbortBufferIO();
		UnlockBuffers();
		ReleaseAuxProcessResources(false);
		AtEOXact_Buffers(false);
		AtEOXact_SMgr();
		AtEOXact_Files(false);
		AtEOXact_HashTables(false);

		/* Warn any waiting backends that the checkpoint failed. */
		if (ckpt_active)
		{
			SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
			CheckpointerShmem->ckpt_failed++;
			CheckpointerShmem->ckpt_done = CheckpointerShmem->ckpt_started;
			SpinLockRelease(&CheckpointerShmem->ckpt_lck);

			ConditionVariableBroadcast(&CheckpointerShmem->done_cv);

			ckpt_active = false;
		}

		/*
		 * Now return to normal top-level context and clear ErrorContext for
		 * next time.
		 */
		MemoryContextSwitchTo(checkpointer_context);
		FlushErrorState();

		/* Flush any leaked data in the top-level context */
		MemoryContextResetAndDeleteChildren(checkpointer_context);

		/* Now we can allow interrupts again */
		RESUME_INTERRUPTS();

		/*
		 * Sleep at least 1 second after any error.  A write error is likely
		 * to be repeated, and we don't want to be filling the error logs as
		 * fast as we can.
		 */
		pg_usleep(1000000L);

		/*
		 * Close all open files after any error.  This is helpful on Windows,
		 * where holding deleted files open causes various strange errors.
		 * It's not clear we need it elsewhere, but shouldn't hurt.
		 */
		smgrcloseall();
	}

	/* We can now handle ereport(ERROR) */
	PG_exception_stack = &local_sigjmp_buf;

	/*
	 * Unblock signals (they were blocked when the postmaster forked us)
	 */
	PG_SETMASK(&UnBlockSig);
	
	/*
	 * Ensure all shared memory values are set correctly for the config. Doing
	 * this here ensures no race conditions from other concurrent updaters.
	 */
	UpdateSharedMemoryConfig();
	
	/*
	 * Advertise our latch that backends can use to wake us up while we're
	 * sleeping.
	 */
	ProcGlobal->checkpointerLatch = &MyProc->procLatch;

	/*
	 * Loop forever
	 */
	for (;;)
	{
		bool		do_checkpoint = false;
		int			flags = 0;
		pg_time_t	now;
		int			elapsed_secs;
		int			cur_timeout;

		/* Clear any already-pending wakeups */
		ResetLatch(MyLatch);	// clear the latch before checking for work, so a later wakeup is not lost

		/*
		 * Process any requests or signals received recently.
		 */
		AbsorbSyncRequests();     
		HandleCheckpointerInterrupts();  

		/*
		 * Detect a pending checkpoint request by checking whether the flags
		 * word in shared memory is nonzero.  We shouldn't need to acquire the
		 * ckpt_lck for this.
		 */
		if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
		{
			do_checkpoint = true;
			BgWriterStats.m_requested_checkpoints++;
		}

		/*
		 * Force a checkpoint if too much time has elapsed since the last one.
		 * Note that we count a timed checkpoint in stats only when this
		 * occurs without an external request, but we set the CAUSE_TIME flag
		 * bit even if there is also an external request.
		 */
		now = (pg_time_t) time(NULL);
		elapsed_secs = now - last_checkpoint_time;	
		if (elapsed_secs >= CheckPointTimeout)		  // trigger 1: checkpoint_timeout seconds have elapsed
		{											  // (default 300 s, set in postgresql.conf)
			if (!do_checkpoint)
				BgWriterStats.m_timed_checkpoints++;
			do_checkpoint = true;					
			flags |= CHECKPOINT_CAUSE_TIME;			// record that this checkpoint was caused by the timeout
		}

		/*
		 * Do a checkpoint if requested.
		 */
		if (do_checkpoint)
		{
			bool		ckpt_performed = false;
			bool		do_restartpoint;

			/*
			 * Check if we should perform a checkpoint or a restartpoint. As a
			 * side-effect, RecoveryInProgress() initializes TimeLineID if
			 * it's not set yet.
			 */
			do_restartpoint = RecoveryInProgress();			

			// A restartpoint is the counterpart of a checkpoint during recovery: recovery
			// itself can crash, so it needs its own points to restart from.
			/*
			 * Atomically fetch the request flags to figure out what kind of a
			 * checkpoint we should perform, and increase the started-counter
			 * to acknowledge that we've started a new checkpoint.
			 */
			SpinLockAcquire(&CheckpointerShmem->ckpt_lck);  // take the ckpt_lck spinlock
			flags |= CheckpointerShmem->ckpt_flags;
			CheckpointerShmem->ckpt_flags = 0;				
			CheckpointerShmem->ckpt_started++;				// advance the ckpt_started counter
			SpinLockRelease(&CheckpointerShmem->ckpt_lck);

			ConditionVariableBroadcast(&CheckpointerShmem->start_cv);	// wake processes waiting for the checkpoint to start

			/*
			 * The end-of-recovery checkpoint is a real checkpoint that's
			 * performed while we're still in recovery.
			 */
			if (flags & CHECKPOINT_END_OF_RECOVERY)			// recovery is ending: do a real checkpoint, not a restartpoint
				do_restartpoint = false;

			/*
			 * We will warn if (a) too soon since last checkpoint (whatever
			 * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
			 * since the last checkpoint start.  Note in particular that this
			 * implementation will not generate warnings caused by
			 * CheckPointTimeout < CheckPointWarning.
			 */
			if (!do_restartpoint &&
				(flags & CHECKPOINT_CAUSE_XLOG) &&
				elapsed_secs < CheckPointWarning)
				ereport(LOG,
						(errmsg_plural("checkpoints are occurring too frequently (%d second apart)",
									   "checkpoints are occurring too frequently (%d seconds apart)",
									   elapsed_secs,
									   elapsed_secs),
						 errhint("Consider increasing the configuration parameter \"max_wal_size\".")));
			
			/*
			 * Initialize checkpointer-private variables used during
			 * checkpoint.
			 */
			ckpt_active = true;			// checkpointer-private variable
			if (do_restartpoint)
				ckpt_start_recptr = GetXLogReplayRecPtr(NULL);
			else
				ckpt_start_recptr = GetInsertRecPtr();		// WAL position at checkpoint start
			ckpt_start_time = now;							// start timestamp
			ckpt_cached_elapsed = 0;

			/*
			 * Do the checkpoint.
			 */
			if (!do_restartpoint)
			{
				CreateCheckPoint(flags);		// creates the checkpoint; covered in the next article
				ckpt_performed = true;
			}
			else
				ckpt_performed = CreateRestartPoint(flags);

			/*
			 * After any checkpoint, close all smgr files.  This is so we
			 * won't hang onto smgr references to deleted files indefinitely.
			 */ 
			smgrcloseall();    // close all smgr files once the checkpoint is done

			/*
			 * Indicate checkpoint completion to any waiting backends.
			 */
			SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
			CheckpointerShmem->ckpt_done = CheckpointerShmem->ckpt_started;  // mark completion for waiting backends
			SpinLockRelease(&CheckpointerShmem->ckpt_lck);

			ConditionVariableBroadcast(&CheckpointerShmem->done_cv);		// wake the waiting backends

			if (ckpt_performed)				// the checkpoint ran: update last_checkpoint_time
			{
				/*
				 * Note we record the checkpoint start time not end time as
				 * last_checkpoint_time.  This is so that time-driven
				 * checkpoints happen at a predictable spacing.
				 */
				last_checkpoint_time = now;
			}
			else
			{							// only CreateRestartPoint() can leave ckpt_performed false;
										// the usual cause is that no new checkpoint WAL record has
										// arrived since the last restartpoint, so retry in 15 s
				/*
				 * We were not able to perform the restartpoint (checkpoints
				 * throw an ERROR in case of error).  Most likely because we
				 * have not received any new checkpoint WAL records since the
				 * last restartpoint. Try again in 15 s.
				 */
				last_checkpoint_time = now - CheckPointTimeout + 15;
			}

		ckpt_active = false;	  // no checkpoint in progress any more
		}

		/* Check for archive_timeout and switch xlog files if necessary. */
		CheckArchiveTimeout();		// switch to a new WAL segment if archive_timeout has elapsed

		/*
		 * Send off activity statistics to the stats collector.  (The reason
		 * why we re-use bgwriter-related code for this is that the bgwriter
		 * and checkpointer used to be just one process.  It's probably not
		 * worth the trouble to split the stats support into two independent
		 * stats message types.)
		 */
		pgstat_send_bgwriter();		// checkpointer statistics are reported through the bgwriter stats message

		/* Send WAL statistics to the stats collector. */
		pgstat_send_wal(true);		// send WAL statistics to the stats collector

		/*
		 * If any checkpoint flags have been set, redo the loop to handle the
		 * checkpoint without sleeping.
		 */
		// ckpt_flags was reset to 0 above. If it is nonzero now, a backend has posted a new
		// request in the meantime, so go around the loop again instead of sleeping.
		if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
			continue;
		
		/*
		 * Sleep until we are signaled or it's time for another checkpoint or
		 * xlog file switch.
		 */
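		// Sleep until a signal arrives, the checkpoint timeout expires, or a WAL segment switch is due.

		now = (pg_time_t) time(NULL);						// refresh the timestamp
		elapsed_secs = now - last_checkpoint_time;			 
		if (elapsed_secs >= CheckPointTimeout)
			continue;			/* no sleep for us ... */	// already overdue: go back to the top of the loop
		cur_timeout = CheckPointTimeout - elapsed_secs;		// time left until the next timed checkpoint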
		if (XLogArchiveTimeout > 0 && !RecoveryInProgress())
		{
			elapsed_secs = now - last_xlog_switch_time;		// time since the last WAL segment switch; if archive_timeout
															// has already elapsed, loop again instead of sleeping
			if (elapsed_secs >= XLogArchiveTimeout)
				continue;		/* no sleep for us ... */
			cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);
		}
	
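		// Wait on the latch: return when the latch is set or the timeout expires,
		// and exit the process if the postmaster dies
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,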
						 cur_timeout * 1000L /* convert to ms */ ,
						 WAIT_EVENT_CHECKPOINTER_MAIN);
	}
}
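
The sleep computation at the end of the loop is easy to lose among the comments, so here is a minimal standalone sketch of just that arithmetic. The function name checkpointer_sleep_secs and its parameter list are hypothetical; the real code works on process-local and GUC variables rather than arguments.

```c
#include <assert.h>

/*
 * Return how many seconds the main loop may sleep, or 0 if it should go
 * around again immediately. Mirrors the tail of the loop shown above.
 */
static int
checkpointer_sleep_secs(long now, long last_checkpoint_time,
						long last_xlog_switch_time,
						int checkpoint_timeout, int archive_timeout,
						int in_recovery)
{
	long		elapsed = now - last_checkpoint_time;
	long		timeout;

	if (elapsed >= checkpoint_timeout)
		return 0;				/* a timed checkpoint is already due */
	timeout = checkpoint_timeout - elapsed;

	/* archive_timeout only matters outside recovery and when it is enabled */
	if (archive_timeout > 0 && !in_recovery)
	{
		elapsed = now - last_xlog_switch_time;
		if (elapsed >= archive_timeout)
			return 0;			/* a WAL segment switch is already due */
		if (archive_timeout - elapsed < timeout)
			timeout = archive_timeout - elapsed;
	}
	return (int) timeout;
}
```

With checkpoint_timeout = 300 and archive_timeout = 60, a process that checkpointed 100 s ago and switched segments 20 s ago sleeps for 40 s. The "retry in 15 s" case falls out of the same arithmetic: setting last_checkpoint_time to now - CheckPointTimeout + 15 leaves exactly 15 s until the next attempt.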

4 Checkpoint trigger conditions

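1 A superuser (no other user may) runs the CHECKPOINT command
2 The database is shut down
3 Database recovery completes
4 The amount of WAL written reaches the threshold that triggers a checkpoint
5 The background process performs a checkpoint periodically
PostgreSQL has a dedicated background process, the checkpointer, that performs checkpoints periodically. Checkpoints triggered in the other ways are also handed to this process by signaling it. For example, when a superuser runs CHECKPOINT, PostgreSQL calls RequestCheckpoint, which sends SIGINT (signal 2) to the checkpointer with kill so that the checkpointer carries out the checkpoint.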
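
The ckpt_started, ckpt_done and ckpt_failed counters from section 2.2 are what let a requester find out whether its checkpoint finished. The following single-threaded sketch models that handshake under simplifying assumptions: the real code protects the counters with the ckpt_lck spinlock and sleeps on the condition variables, both of which are left out here. CkptCounters, CkptTicket and all function names are hypothetical.

```c
#include <assert.h>

#define CHECKPOINT_WAIT			0x0020
#define CHECKPOINT_REQUESTED	0x0040

/* Simplified stand-in for the ckpt_* fields of CheckpointerShmemStruct. */
typedef struct
{
	int			ckpt_started;
	int			ckpt_done;
	int			ckpt_failed;
	int			ckpt_flags;
} CkptCounters;

/* What a requester remembers about the moment it asked. */
typedef struct
{
	int			old_started;
	int			old_failed;
	int			new_started;
	int			seen_start;
} CkptTicket;

typedef enum { CKPT_PENDING, CKPT_RUNNING, CKPT_OK, CKPT_FAILED } CkptState;

/* Requester side: post the flags and remember the current counters. */
static CkptTicket
request_checkpoint(CkptCounters *s, int flags)
{
	CkptTicket	t = {s->ckpt_started, s->ckpt_failed, 0, 0};

	s->ckpt_flags |= flags | CHECKPOINT_REQUESTED;
	return t;
}

/* Checkpointer side: fetch and clear the flags, acknowledge the start. */
static int
checkpointer_start(CkptCounters *s)
{
	int			flags = s->ckpt_flags;

	s->ckpt_flags = 0;
	s->ckpt_started++;
	return flags;
}

/* Checkpointer side: report completion, counting a failure if needed. */
static void
checkpointer_finish(CkptCounters *s, int ok)
{
	if (!ok)
		s->ckpt_failed++;
	s->ckpt_done = s->ckpt_started;
}

/* Requester side: where does my request stand right now? */
static CkptState
poll_checkpoint(const CkptCounters *s, CkptTicket *t)
{
	if (!t->seen_start)
	{
		if (s->ckpt_started == t->old_started)
			return CKPT_PENDING;	/* no new checkpoint has started yet */
		t->new_started = s->ckpt_started;
		t->seen_start = 1;
	}
	if (s->ckpt_done - t->new_started < 0)
		return CKPT_RUNNING;		/* started but not finished */
	return (s->ckpt_failed != t->old_failed) ? CKPT_FAILED : CKPT_OK;
}
```

The error path in CheckpointerMain corresponds to checkpointer_finish(s, 0): it bumps ckpt_failed and sets ckpt_done to ckpt_started, so a waiting requester sees completion and can tell from the failure counter that the checkpoint did not succeed.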

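Checkpoint-related parameters:

1. checkpoint_timeout:
The maximum time between automatic WAL checkpoints (default 5 minutes). Increasing this parameter can increase the time needed for crash recovery.

2. max_wal_size:
The maximum size the WAL is allowed to grow to between automatic WAL checkpoints. The default is 1 GB. Increasing this parameter can increase the time needed for crash recovery.

If both parameters are set, a checkpoint is triggered by whichever limit is reached first.

3. min_wal_size:
As long as WAL disk usage stays below this setting, old WAL files are always recycled for future use at a checkpoint rather than removed. This can be used to make sure enough WAL space is reserved to handle spikes in WAL usage, for example when running large batch jobs (default 80 MB).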

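4. checkpoint_completion_target:
A checkpoint happens every 5 minutes or whenever the max_wal_size threshold is reached. If all dirty pages in shared buffers were flushed to disk at once at that moment, the result would be a large burst of I/O.
checkpoint_completion_target addresses this problem.
It spreads the flushing out: PostgreSQL aims to take about checkpoint_completion_target * checkpoint_timeout to write the data.
For example, with checkpoint_completion_target set to 0.5 and the default 5-minute timeout, the database throttles its writes so that the last write completes about 2.5 minutes after the checkpoint starts. (The default is 0.9 in PostgreSQL 14 and 0.5 in earlier releases.)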

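5. wal_buffers:
The amount of shared memory used for WAL data that has not yet been written to disk. The default setting of -1 selects a size equal to 1/32 (about 3%) of shared_buffers, but not less than 64 kB and not more than the size of one WAL segment, typically 16 MB.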

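6. checkpoint_flush_after:
While a checkpoint is running, whenever more than checkpoint_flush_after bytes have been written, PostgreSQL attempts to force the OS to flush these writes to the underlying storage. This limits the amount of dirty data in the kernel page cache and so reduces the likelihood of stalls when fsync is issued at the end of the checkpoint.
This setting may have no effect on some platforms.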
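
To make the arithmetic behind some of these parameters concrete, here is a small sketch. It is my reading of how the values relate, not the server's code: checkpoint_segments approximates what CalculateCheckpointSegments in xlog.c computes, on_schedule covers only the time-based half of the pacing check, and all function names and parameter lists are hypothetical.

```c
#include <assert.h>

/*
 * Approximate number of WAL segments between WAL-triggered checkpoints:
 * max_wal_size in segments, divided by (1 + completion target), rounded
 * down, and never less than 1.
 */
static int
checkpoint_segments(int max_wal_size_mb, int wal_segment_size_mb,
					double completion_target)
{
	double		target;
	int			segs;

	assert(wal_segment_size_mb > 0);
	target = (double) (max_wal_size_mb / wal_segment_size_mb) /
		(1.0 + completion_target);
	segs = (int) target;
	return segs < 1 ? 1 : segs;
}

/* Seconds within which the checkpoint writes should be finished. */
static double
write_deadline_secs(int checkpoint_timeout, double completion_target)
{
	return checkpoint_timeout * completion_target;
}

/*
 * Time-based pacing check: fraction_written is the share of dirty buffers
 * already written (0..1). Returns 1 if the checkpoint is on or ahead of
 * schedule, which means the writer may pause before the next write.
 */
static int
on_schedule(double fraction_written, int elapsed_secs,
			int checkpoint_timeout, double completion_target)
{
	double		elapsed_fraction = (double) elapsed_secs / checkpoint_timeout;

	return fraction_written * completion_target >= elapsed_fraction;
}

/* wal_buffers = -1: shared_buffers / 32, clamped to [64 kB, one segment]. */
static long
wal_buffers_auto_kb(long shared_buffers_kb, long wal_segment_size_kb)
{
	long		kb = shared_buffers_kb / 32;

	if (kb > wal_segment_size_kb)
		kb = wal_segment_size_kb;
	if (kb < 64)
		kb = 64;
	return kb;
}
```

With max_wal_size = 1 GB, 16 MB segments and a completion target of 0.9, the sketch gives 33 segments, so under these assumptions a WAL-triggered checkpoint would start after roughly 528 MB of WAL rather than after the full 1 GB. The headroom is there because WAL from the previous checkpoint cycle must be kept while the current checkpoint is still being written.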