PostgreSQL Basic Modules: Buffer Pool Management

obvious__

Last modified: 2023-07-31 16:27:29

First published: 2020-10-29 18:59:44

Article link: https://blog.csdn.net/obvious__/article/details/109366557

References

《PostgreSQL数据库内核分析》 (PostgreSQL Database Kernel Analysis), Peng Zhiyong and Peng Yuwei, pp. 99-101

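Overview

In PostgreSQL, every operation on tables, tuples, indexes and so on goes through the buffer pool. Data moves between disk and the buffer pool one disk block at a time: smgrread reads a block from disk into the pool, and smgrwrite writes a block from the pool back to disk. A slot in memory that holds one such block is called a buffer, and the buffers together make up the buffer pool.

PostgreSQL has two kinds of buffer pool: the shared buffer pool and the local buffer pool. The shared buffer pool is where ordinary tables are processed; the local buffer pool serves temporary tables, which are visible only to the backend that created them. This article covers only the shared buffer pool.

Buffers in the pool are managed through two mechanisms: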

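pin

Before a process accesses a buffer it pins the buffer; the number of pins is kept in the buffer's refcount. A nonzero refcount means some process is using the buffer, so the buffer cannot be replaced.

lock

Locks make concurrent access to a buffer safe: a process takes an EXCLUSIVE lock to write to a buffer and a SHARE lock to read it. For example, an INSERT must lock the target buffer in EXCLUSIVE mode after obtaining it. (This locking happens in RelationGetBufferForTuple; see the insert path for details.)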
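The pin rule can be illustrated with a minimal sketch. ToyBuffer and the toy_ functions are hypothetical names made up for this example, not PostgreSQL's API, and the sketch ignores concurrency entirely:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy descriptor; PostgreSQL keeps refcount inside BufferDesc.state. */
typedef struct ToyBuffer
{
	int			refcount;		/* number of pins currently held */
} ToyBuffer;

/* Pin: the caller declares that it is using the buffer. Returns the new pin count. */
static int
toy_pin(ToyBuffer *buf)
{
	return ++buf->refcount;
}

/* Unpin: the caller is done with the buffer. Returns the new pin count. */
static int
toy_unpin(ToyBuffer *buf)
{
	assert(buf->refcount > 0);
	return --buf->refcount;
}

/* A buffer may be chosen as a replacement victim only when nobody pins it. */
static bool
toy_can_evict(const ToyBuffer *buf)
{
	return buf->refcount == 0;
}
```

A caller would pin before touching the page and unpin afterwards; between the two calls toy_can_evict reports false.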

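Initializing the Shared Buffer Pool

The shared buffer pool is initialized by InitBufferPool. Shared buffer management uses a global array, BufferDescriptors, whose elements describe the buffers in the pool (each element holds a BufferDesc), and a global pointer, BufferBlocks, that stores the start address of the pool's memory.

Here is the definition of BufferDesc: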

typedef struct BufferDesc
{
	BufferTag	tag;					/* ID of page contained in buffer */
	int			buf_id;					/* buffer's index number (from 0) */

	/* state of the tag, containing flags, refcount and usagecount */
	pg_atomic_uint32 state;

	int			wait_backend_pid;		/* backend PID of pin-count waiter */
	int			freeNext;				/* link in freelist chain */

	LWLock		content_lock;			/* to lock access to buffer contents */
} BufferDesc;

Where:

tag: identifies the physical block held in the buffer. It is defined as follows:

typedef struct buftag
{
	RelFileNode rnode;			/* tablespace OID, database OID, and relation OID of the table */
	ForkNumber	forkNum;		/* enum: which kind of file (fork) the block belongs to */
	BlockNumber blockNum;		/* block number */
} BufferTag;

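A tag uniquely identifies a physical block. Note: a physical block, not a buffer. (The tag comes up again in the buffer loading flow below.)

buf_id: the buffer's index number. buf_id uniquely identifies a buffer, and the various buffer operations all use it.

Both shared and local buffers have a buf_id, but they are numbered differently: shared buffers count up from 0 (0, 1, 2, ...), while local buffers count down from -2 (-2, -3, -4, ...).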
```
/* local buffers are numbered starting from -2 */
#define LocalBufHdrGetBlock(bufHdr) LocalBufferBlockPointers[-((bufHdr)->buf_id + 2)]
```
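state: packs flags, refcount, and usagecount into one atomic 32-bit word
- flags: flag bits, for example whether the buffer is dirty.
- refcount: the number of processes currently referencing the buffer; pin and unpin operations modify it.
- usagecount: how many times the buffer has been used recently; used by buffer replacement.
wait_backend_pid: the PID of the backend that is waiting on the buffer's pin count (the pin-count waiter).
freeNext: if the buffer is on the free list, freeNext points to the next free buffer.
content_lock: a process accessing the buffer's contents takes this lock, LW_SHARED for reads and LW_EXCLUSIVE for writes, so that concurrent access by several processes cannot leave the data inconsistent.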
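To make the idea of packing three fields into one 32-bit state word concrete, here is a small sketch. The bit layout (18 bits of refcount, 4 bits of usagecount, flags above them) and all TOY_ names are assumptions chosen for illustration; check buf_internals.h of your PostgreSQL version for the real masks.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for this sketch: bits 0-17 refcount, 18-21 usagecount, 22-31 flags. */
#define TOY_REFCOUNT_MASK		((1U << 18) - 1)
#define TOY_USAGECOUNT_SHIFT	18
#define TOY_USAGECOUNT_ONE		(1U << TOY_USAGECOUNT_SHIFT)
#define TOY_USAGECOUNT_MASK		(0xFU << TOY_USAGECOUNT_SHIFT)
#define TOY_FLAG_DIRTY			(1U << 23)	/* hypothetical flag bit */

/* Extract the pin count from the packed word. */
static uint32_t
toy_refcount(uint32_t state)
{
	return state & TOY_REFCOUNT_MASK;
}

/* Extract the usage count from the packed word. */
static uint32_t
toy_usagecount(uint32_t state)
{
	assert(TOY_USAGECOUNT_ONE == (1U << 18));	/* layout sanity check */
	return (state & TOY_USAGECOUNT_MASK) >> TOY_USAGECOUNT_SHIFT;
}

/* Test one flag bit. */
static int
toy_is_dirty(uint32_t state)
{
	return (state & TOY_FLAG_DIRTY) != 0;
}
```

Because each field lives in its own bit range, adding 1 changes only refcount and adding TOY_USAGECOUNT_ONE changes only usagecount, which is what lets a single atomic operation update one field without disturbing the others.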

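Buffer Operations

As noted above, shared buffer management has two global variables: the BufferDescriptors array and the BufferBlocks pointer. How are the two related, and how do you convert between them?

BufferDescriptors is an array of N elements, where N is the number of buffers in the pool (NBuffers; 1000 is the variable's initial value in the source, and at run time it is set from the shared_buffers configuration parameter). BufferBlocks is one contiguous region of memory of size BLCKSZ*N, so it can also be viewed as an array of N elements, each of which is one buffer.

Each BufferDesc has a member buf_id that records the descriptor's index in BufferDescriptors. That is, for a descriptor pointer bufHdr:

bufHdr == GetBufferDescriptor(bufHdr->buf_id)

So given a buf_id you can fetch the BufferDesc from BufferDescriptors and the actual buffer from BufferBlocks, using the following macros: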

/* Returns a Buffer number; all later operations work on this number */
#define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)

/* Get a BufferDesc from BufferDescriptors */
#define GetBufferDescriptor(id) (&BufferDescriptors[(id)].bufferdesc)

/* Get a buffer's page from BufferBlocks */
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))

#define BufferIsLocal(buffer)	((buffer) < 0)	/* is this a local buffer? */

#define BufferGetBlock(buffer) \
( \
	AssertMacro(BufferIsValid(buffer)), \
	BufferIsLocal(buffer) ? \
		LocalBufferBlockPointers[-(buffer) - 1] \
	: \
		(Block) (BufferBlocks + ((Size) ((buffer) - 1)) * BLCKSZ) \
)

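GetBufferDescriptor takes the array index (buf_id) directly, whereas BufferGetBlock takes the return value of BufferDescriptorGetBuffer, that is, the array index plus 1. The offset exists because the Buffer value 0 is reserved as InvalidBuffer: shared buffers are numbered 1..NBuffers and local buffers use negative numbers, which is also why BufferIsLocal only needs to test the sign.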
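The conversions between buf_id, buffer number, and block address can be sketched as follows. This is a toy model under assumed sizes (TOY_BLCKSZ, TOY_NBUFFERS) with made-up toy_ names; only the arithmetic mirrors the macros above.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BLCKSZ		8192	/* assumed block size */
#define TOY_NBUFFERS	4		/* assumed pool size */

typedef int ToyBufferNumber;	/* 0 = invalid, > 0 = shared, < 0 = local */

static char toy_blocks[TOY_NBUFFERS * TOY_BLCKSZ];	/* stands in for BufferBlocks */

/* buf_id 0..N-1 maps to buffer number 1..N; local buf_id -2 maps to -1, and so on. */
static ToyBufferNumber
toy_buffer_from_buf_id(int buf_id)
{
	return buf_id + 1;
}

/* Local buffers are recognizable by their negative buffer number. */
static int
toy_is_local(ToyBufferNumber b)
{
	return b < 0;
}

/* Address of the block behind a shared buffer number. */
static char *
toy_get_block(ToyBufferNumber b)
{
	assert(b > 0 && b <= TOY_NBUFFERS);
	return toy_blocks + (size_t) (b - 1) * TOY_BLCKSZ;
}
```

Note how the +1 in toy_buffer_from_buf_id and the -1 in toy_get_block cancel out, so the block for buf_id i always starts at offset i * TOY_BLCKSZ.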

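What InitBufferPool Does

InitBufferPool does three main things:

Initialize BufferDescriptors.
Initialize BufferBlocks.
Initialize the buffer hash table.

The buffer hash table is initialized by InitBufTable, which is called from StrategyInitialize. The role of the hash table is covered under shared buffer loading below.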
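As a rough illustration of the first two tasks, the sketch below allocates a descriptor array and a block area and links every descriptor into a free list. ToyPool, ToyDesc, and toy_init_pool are hypothetical names; the real InitBufferPool works on shared memory rather than malloc, and the hash table setup is left out here.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_BLCKSZ					8192	/* assumed block size */
#define TOY_FREENEXT_END_OF_LIST	(-1)

typedef struct ToyDesc
{
	int			buf_id;			/* index of this descriptor */
	int			freeNext;		/* next free buffer, or end-of-list marker */
} ToyDesc;

typedef struct ToyPool
{
	int			nbuffers;
	ToyDesc    *descriptors;	/* stands in for BufferDescriptors */
	char	   *blocks;			/* stands in for BufferBlocks */
	int			firstFreeBuffer;
} ToyPool;

/* Returns 0 on success, -1 if memory could not be allocated. */
static int
toy_init_pool(ToyPool *pool, int nbuffers)
{
	assert(nbuffers > 0);
	pool->descriptors = malloc(sizeof(ToyDesc) * (size_t) nbuffers);
	pool->blocks = calloc((size_t) nbuffers, TOY_BLCKSZ);
	if (pool->descriptors == NULL || pool->blocks == NULL)
	{
		free(pool->descriptors);
		free(pool->blocks);
		return -1;
	}
	pool->nbuffers = nbuffers;

	/* Initially every buffer is unused, so chain them all together. */
	for (int i = 0; i < nbuffers; i++)
	{
		pool->descriptors[i].buf_id = i;
		pool->descriptors[i].freeNext = i + 1;
	}
	pool->descriptors[nbuffers - 1].freeNext = TOY_FREENEXT_END_OF_LIST;
	pool->firstFreeBuffer = 0;
	return 0;
}
```

After initialization, descriptor i has buf_id i, and following freeNext from firstFreeBuffer visits every buffer exactly once.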

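Loading a Shared Buffer (Lookup)

To read or write a physical block, PostgreSQL first brings the block into a shared buffer and then reads or writes the data in that buffer. Bringing a physical block into a shared buffer is called loading the buffer. ReadBuffer_common is the common entry point for all buffers; it implements the read path shared by local and shared buffers. The code is as follows: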

static Buffer
ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
				  BlockNumber blockNum, ReadBufferMode mode,
				  BufferAccessStrategy strategy, bool *hit)
{
	BufferDesc *bufHdr;
	Block		bufBlock;
	bool		found;
	bool		isExtend;
	bool		isLocalBuf = SmgrIsTemp(smgr);

	*hit = false;

	/* Make sure we will have room to remember the buffer pin */
	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

	isExtend = (blockNum == P_NEW);

	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
									   smgr->smgr_rnode.node.spcNode,
									   smgr->smgr_rnode.node.dbNode,
									   smgr->smgr_rnode.node.relNode,
									   smgr->smgr_rnode.backend,
									   isExtend);

	/* Substitute proper block number if caller asked for P_NEW */
	if (isExtend)
		blockNum = smgrnblocks(smgr, forkNum);

	if (isLocalBuf)
	{
		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
		if (found)
			pgBufferUsage.local_blks_hit++;
		else
			pgBufferUsage.local_blks_read++;
	}
	else
	{
		/*
		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
		 * not currently in memory.
		 */
		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
							 strategy, &found);
		if (found)
			pgBufferUsage.shared_blks_hit++;
		else
			pgBufferUsage.shared_blks_read++;
	}

	/* At this point we do NOT hold any locks. */

	/* if it was already in the buffer pool, we're done */
	if (found)
	{
		if (!isExtend)
		{
			/* Just need to update stats before we exit */
			*hit = true;
			VacuumPageHit++;

			if (VacuumCostActive)
				VacuumCostBalance += VacuumCostPageHit;

			TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
											  smgr->smgr_rnode.node.spcNode,
											  smgr->smgr_rnode.node.dbNode,
											  smgr->smgr_rnode.node.relNode,
											  smgr->smgr_rnode.backend,
											  isExtend,
											  found);

			/*
			 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be
			 * locked on return.
			 */
			if (!isLocalBuf)
			{
				if (mode == RBM_ZERO_AND_LOCK)
					LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
								  LW_EXCLUSIVE);
				else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
					LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
			}

			return BufferDescriptorGetBuffer(bufHdr);
		}

		/*
		 * We get here only in the corner case where we are trying to extend
		 * the relation but we found a pre-existing buffer marked BM_VALID.
		 * This can happen because mdread doesn't complain about reads beyond
		 * EOF (when zero_damaged_pages is ON) and so a previous attempt to
		 * read a block beyond EOF could have left a "valid" zero-filled
		 * buffer.  Unfortunately, we have also seen this case occurring
		 * because of buggy Linux kernels that sometimes return an
		 * lseek(SEEK_END) result that doesn't account for a recent write. In
		 * that situation, the pre-existing buffer would contain valid data
		 * that we don't want to overwrite.  Since the legitimate case should
		 * always have left a zero-filled buffer, complain if not PageIsNew.
		 */
		bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
		if (!PageIsNew((Page) bufBlock))
			ereport(ERROR,
			 (errmsg("unexpected data beyond EOF in block %u of relation %s",
					 blockNum, relpath(smgr->smgr_rnode, forkNum)),
			  errhint("This has been seen to occur with buggy kernels; consider updating your system.")));

		/*
		 * We *must* do smgrextend before succeeding, else the page will not
		 * be reserved by the kernel, and the next P_NEW call will decide to
		 * return the same page.  Clear the BM_VALID bit, do the StartBufferIO
		 * call that BufferAlloc didn't, and proceed.
		 */
		if (isLocalBuf)
		{
			/* Only need to adjust flags */
			uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);

			Assert(buf_state & BM_VALID);
			buf_state &= ~BM_VALID;
			pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
		}
		else
		{
			/*
			 * Loop to handle the very small possibility that someone re-sets
			 * BM_VALID between our clearing it and StartBufferIO inspecting
			 * it.
			 */
			do
			{
				uint32		buf_state = LockBufHdr(bufHdr);

				Assert(buf_state & BM_VALID);
				buf_state &= ~BM_VALID;
				UnlockBufHdr(bufHdr, buf_state);
			} while (!StartBufferIO(bufHdr, true));
		}
	}

	/*
	 * if we have gotten to this point, we have allocated a buffer for the
	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
	 * if it's a shared buffer.
	 *
	 * Note: if smgrextend fails, we will end up with a buffer that is
	 * allocated but not marked BM_VALID.  P_NEW will still select the same
	 * block number (because the relation didn't get any longer on disk) and
	 * so future attempts to extend the relation will find the same buffer (if
	 * it's not been recycled) but come right back here to try smgrextend
	 * again.
	 */
	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */

	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);

	if (isExtend)
	{
		/* new buffers are zero-filled */
		MemSet((char *) bufBlock, 0, BLCKSZ);
		/* don't set checksum for all-zero page */
		smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);

		/*
		 * NB: we're *not* doing a ScheduleBufferTagForWriteback here;
		 * although we're essentially performing a write. At least on linux
		 * doing so defeats the 'delayed allocation' mechanism, leading to
		 * increased file fragmentation.
		 */
	}
	else
	{
		/*
		 * Read in the page, unless the caller intends to overwrite it and
		 * just wants us to allocate a buffer.
		 */
		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
			MemSet((char *) bufBlock, 0, BLCKSZ);
		else
		{
			instr_time	io_start,
						io_time;

			if (track_io_timing)
				INSTR_TIME_SET_CURRENT(io_start);

			smgrread(smgr, forkNum, blockNum, (char *) bufBlock);

			if (track_io_timing)
			{
				INSTR_TIME_SET_CURRENT(io_time);
				INSTR_TIME_SUBTRACT(io_time, io_start);
				pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
			}

			/* check for garbage data */
			if (!PageIsVerified((Page) bufBlock, blockNum))
			{
				if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
				{
					ereport(WARNING,
							(errcode(ERRCODE_DATA_CORRUPTED),
							 errmsg("invalid page in block %u of relation %s; zeroing out page",
									blockNum,
									relpath(smgr->smgr_rnode, forkNum))));
					MemSet((char *) bufBlock, 0, BLCKSZ);
				}
				else
					ereport(ERROR,
							(errcode(ERRCODE_DATA_CORRUPTED),
							 errmsg("invalid page in block %u of relation %s",
									blockNum,
									relpath(smgr->smgr_rnode, forkNum))));
			}
		}
	}

	/*
	 * In RBM_ZERO_AND_LOCK mode, grab the buffer content lock before marking
	 * the page as valid, to make sure that no other backend sees the zeroed
	 * page before the caller has had a chance to initialize it.
	 *
	 * Since no-one else can be looking at the page contents yet, there is no
	 * difference between an exclusive lock and a cleanup-strength lock. (Note
	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
	 * they assert that the buffer is already valid.)
	 */
	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
		!isLocalBuf)
	{
		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
	}

	if (isLocalBuf)
	{
		/* Only need to adjust flags */
		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);

		buf_state |= BM_VALID;
		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
	}
	else
	{
		/* Set BM_VALID, terminate IO, and wake up any waiters */
		TerminateBufferIO(bufHdr, false, BM_VALID);
	}

	VacuumPageMiss++;
	if (VacuumCostActive)
		VacuumCostBalance += VacuumCostPageMiss;

	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
									  smgr->smgr_rnode.node.spcNode,
									  smgr->smgr_rnode.node.dbNode,
									  smgr->smgr_rnode.node.relNode,
									  smgr->smgr_rnode.backend,
									  isExtend,
									  found);

	return BufferDescriptorGetBuffer(bufHdr);
}

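The code is long, but as far as loading is concerned there are only two steps:

Step 1: call BufferAlloc to obtain a buf from the shared buffer pool.

The buf may already hold the requested block, in which case ReadBuffer_common can return right away.

BufferAlloc's output parameter found reports whether the buf already holds the block.
Step 2: if the buf does not hold the block, call smgrread to read the block from disk into the buf.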
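The two steps can be sketched with a toy cache in front of a toy disk. Everything here (the toy_ names, the tiny sizes, and the trivial slot choice block % TOY_NBUFFERS in place of a real replacement policy) is an assumption made for illustration:

```c
#include <assert.h>
#include <string.h>

#define TOY_BLCKSZ		16		/* assumed, tiny on purpose */
#define TOY_NBUFFERS	2
#define TOY_NBLOCKS		4

static char toy_disk[TOY_NBLOCKS][TOY_BLCKSZ];
static char toy_cache[TOY_NBUFFERS][TOY_BLCKSZ];
static int	toy_cached_block[TOY_NBUFFERS] = {-1, -1};
static int	toy_disk_reads = 0;

/* Step 1: find a buffer for the block; *found says whether it already holds it. */
static int
toy_buffer_alloc(int block, int *found)
{
	for (int i = 0; i < TOY_NBUFFERS; i++)
	{
		if (toy_cached_block[i] == block)
		{
			*found = 1;
			return i;
		}
	}
	/* Miss: this sketch has no replacement policy, it just reuses a fixed slot. */
	int			slot = block % TOY_NBUFFERS;

	toy_cached_block[slot] = block;
	*found = 0;
	return slot;
}

/* Step 2 (the disk read) happens only on a miss. */
static int
toy_read_buffer(int block, int *hit)
{
	int			found;
	int			id;

	assert(block >= 0 && block < TOY_NBLOCKS);
	id = toy_buffer_alloc(block, &found);
	*hit = found;
	if (!found)
	{
		memcpy(toy_cache[id], toy_disk[block], TOY_BLCKSZ);
		toy_disk_reads++;
	}
	return id;
}
```

Reading the same block twice in a row touches the toy disk only once; the second call returns the same slot with hit set.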

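BufferAlloc is clearly the heart of the loading process. Before looking at its code, let's take one question into the debugger with us: if two processes need to load the same physical block at the same time, what guarantees that the block is not loaded twice?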

The Duplicate-Load Problem

To answer this question, we set up the following test:

Create a table.
```
create table t1(a int);
```
Insert one row, so that the table contains one physical block.
```
insert into t1 values(1);
```
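Restart the database, so that the physical block produced by the insert is no longer in the shared buffer pool.
Set a breakpoint in BufferAlloc.
Open two client connections to PostgreSQL and run a query in each.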
```
select * from t1;
```

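Remember the hash table initialized in InitBufferPool? This is where it comes in. The hash table acts as a dictionary of buffers, with the physical block's BufferTag as key and the buffer's buf_id as value. BufferAlloc proceeds as follows:

Build a BufferTag from the tablespace OID, database OID, relation OID and so on of the table that owns the block (see INIT_BUFFERTAG). As noted, a BufferTag uniquely identifies a physical block, so it can be used as the key for a hash table lookup. If the lookup finds a buf_id, the requested block is already in the buffer pool and BufferAlloc returns it (as a BufferDesc).
If the tag is not in the hash table, find a buffer to load the block into: use a free buffer if one exists, otherwise use the replacement mechanism to evict one.

The code of BufferAlloc is as follows: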

static BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
			BlockNumber blockNum,
			BufferAccessStrategy strategy,
			bool *foundPtr)
{
	/*** omitted ***/

	/* 
	 * see if the block is in the buffer pool already 
	 * Step 1: check whether the physical block is already in a buffer.
	 */
	LWLockAcquire(newPartitionLock, LW_SHARED);
	buf_id = BufTableLookup(&newTag, newHash);
	if (buf_id >= 0)
	{
		/* found: return it directly */
		/*** omitted ***/
		return buf;
	}
	
	/*** omitted ***/
	/* 
	 * Loop here in case we have to try another victim buffer 
	 */
	for (;;)
	{
		/*
		 * Ensure, while the spinlock's not yet held, that there's a free
		 * refcount entry.
		 */
		ReservePrivateRefCountEntry();

		/*
		 * Select a victim buffer.  The buffer is returned with its header
		 * spinlock still held!
		 * Step 2: obtain a buffer to use (free or evicted).
		 */
		buf = StrategyGetBuffer(strategy, &buf_state);

		/*** omitted ***/

		/*
		 * To change the association of a valid buffer, we'll need to have
		 * exclusive lock on both the old and new mapping partitions.
		 */
		if (oldFlags & BM_TAG_VALID)
		{
			/*
			 * Need to compute the old tag's hashcode and partition lock ID.
			 * XXX is it worth storing the hashcode in BufferDesc so we need
			 * not recompute it here?  Probably not.
			 */
			oldTag = buf->tag;
			oldHash = BufTableHashCode(&oldTag);
			oldPartitionLock = BufMappingPartitionLock(oldHash);

			/*
			 * Must lock the lower-numbered partition first to avoid
			 * deadlocks.
			 */
			if (oldPartitionLock < newPartitionLock)
			{
				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
			}
			else if (oldPartitionLock > newPartitionLock)
			{
				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
			}
			else
			{
				/* only one partition, only one lock */
				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
			}
		}
		else
		{
			/* if it wasn't valid, we need only the new partition */
			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
			/* remember we have no old-partition lock or tag */
			oldPartitionLock = NULL;
			/* this just keeps the compiler quiet about uninit variables */
			oldHash = 0;
		}

		/*
		 * Try to make a hashtable entry for the buffer under its new tag.
		 * This could fail because while we were writing someone else
		 * allocated another buffer for the same block we want to read in.
		 * Note that we have not yet removed the hashtable entry for the old
		 * tag.
		 * Step 3: insert newTag into BufTable.
		 */
		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);

		if (buf_id >= 0)
		{
			/*
			 * Got a collision. Someone has already done what we were about to
			 * do. We'll just handle this as if it were found in the buffer
			 * pool in the first place.  First, give up the buffer we were
			 * planning to use.
			 *
			 * Give up the buf we obtained.
			 */
			UnpinBuffer(buf, true);

			/* Can give up that buffer's mapping partition lock now */
			if (oldPartitionLock != NULL &&
				oldPartitionLock != newPartitionLock)
				LWLockRelease(oldPartitionLock);

			/* remaining code should match code at top of routine */

			buf = GetBufferDescriptor(buf_id);
			
			/*** pin the buf that buf_id refers to ***/
			valid = PinBuffer(buf, strategy);

			/* Can release the mapping lock as soon as we've pinned it */
			LWLockRelease(newPartitionLock);

			*foundPtr = TRUE;

			if (!valid)
			{
				/*
				 * We can only get here if (a) someone else is still reading
				 * in the page, or (b) a previous read attempt failed.  We
				 * have to wait for any active read attempt to finish, and
				 * then set up our own read attempt if the page is still not
				 * BM_VALID.  StartBufferIO does it all.
				 */
				if (StartBufferIO(buf, true))
				{
					/*
					 * If we get here, previous attempts to read the buffer
					 * must have failed ... but we shall bravely try again.
					 */
					*foundPtr = FALSE;
				}
			}

			return buf;
		}

		/*
		 * Need to lock the buffer header too in order to change its tag.
		 */
		buf_state = LockBufHdr(buf);

		/*** omitted ***/
	}

	/*** omitted ***/

	return buf;
}

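Now back to the question: if two processes need to load the same physical block at the same time, what guarantees that the block is not loaded twice? While debugging we see that, because the database was just restarted, the block cannot be in the buffer pool, so the BufTableLookup in step 1 returns -1 and both processes move on to step 2. At that point each of the two processes has obtained a buffer of its own. Next comes step 3, which calls BufTableInsert to insert the buffer into the hash table (BufferTag as key, buf_id as value). Before inserting, however, each process takes the hash table's partition lock in exclusive mode (the LWLockAcquire(..., LW_EXCLUSIVE) calls just before step 3 in the code above), so the two processes are serialized.

Process 1 then runs BufTableInsert. BufTableInsert returns a buf_id; since the hash table held no entry for this BufferTag before the insert, it returns -1. When process 2 runs BufTableInsert, process 1 has already inserted the BufferTag, so BufTableInsert returns the buf_id already mapped to that BufferTag. Process 2 then gives up the buf it got from StrategyGetBuffer (the UnpinBuffer(buf, true) call above) and switches to the buf that buf_id refers to (GetBufferDescriptor(buf_id) followed by PinBuffer(buf, strategy)).

So serializing inserts into the hash table is what prevents the same physical block from being loaded twice.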
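The insert-or-report-existing behaviour described above can be sketched with a toy table. The ToyTag fields, the linear scan, and the fixed table size are assumptions made for this example; the real BufTable is a partitioned shared hash table, and the sketch assumes the caller already holds the exclusive lock.

```c
#include <assert.h>

#define TOY_TABLE_SIZE 8

/* Hypothetical stand-in for BufferTag. */
typedef struct ToyTag
{
	unsigned	rel;
	unsigned	fork;
	unsigned	block;
} ToyTag;

typedef struct ToyEntry
{
	int			used;
	ToyTag		tag;
	int			buf_id;
} ToyEntry;

static ToyEntry toy_table[TOY_TABLE_SIZE];

static int
toy_tags_equal(const ToyTag *a, const ToyTag *b)
{
	return a->rel == b->rel && a->fork == b->fork && a->block == b->block;
}

/* Returns the buf_id mapped to tag, or -1 if the tag is absent. */
static int
toy_lookup(const ToyTag *tag)
{
	for (int i = 0; i < TOY_TABLE_SIZE; i++)
	{
		if (toy_table[i].used && toy_tags_equal(&toy_table[i].tag, tag))
			return toy_table[i].buf_id;
	}
	return -1;
}

/*
 * Insert tag -> buf_id. Returns -1 on success, or the existing buf_id if
 * the tag is already present (the caller must then give up its own buffer).
 */
static int
toy_insert(const ToyTag *tag, int buf_id)
{
	int			existing = toy_lookup(tag);

	if (existing >= 0)
		return existing;
	for (int i = 0; i < TOY_TABLE_SIZE; i++)
	{
		if (!toy_table[i].used)
		{
			toy_table[i].used = 1;
			toy_table[i].tag = *tag;
			toy_table[i].buf_id = buf_id;
			return -1;
		}
	}
	assert(0 && "toy table is full");
	return -1;
}
```

If "process 1" inserts a tag with its buffer 3 and "process 2" then tries to insert the same tag with its buffer 7, the second call returns 3, which tells process 2 to drop buffer 7 and use buffer 3 instead.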

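The Intermediate-State Problem

From the description of ReadBuffer_common above, loading takes two steps:

Step 1: call BufferAlloc to obtain a buf
Step 2: call smgrread to read the physical block into the buf

In the duplicate-load experiment, process 1 performs both steps. Process 2, running concurrently, runs into a problem: once process 1 has finished step 1, the BufferTag for the block is in the hash table and process 2 can see it. But process 1 may not have started step 2 yet, may be in the middle of it, or may have failed in it. In all of these cases the block has not been read into the buf, so process 2 cannot simply use the buf.

That is why, in the code above, valid = PinBuffer(buf, strategy) returns a valid flag. If valid is false, the block has not been loaded into the buffer yet, so BufferAlloc calls StartBufferIO, which waits for any read in progress to finish (the StartBufferIO(buf, true) call above).

If process 1's load succeeds, process 2's StartBufferIO returns false. Process 2's BufferAlloc returns the buf with the output parameter found set to true: the buf already holds the requested block and no further load is needed.
If process 1's load fails, process 2's StartBufferIO returns true. Process 2's BufferAlloc returns the buf with found set to false: the buf does not hold the requested block, and smgrread must read it into the buf.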
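The two outcomes can be modelled with a small single-threaded sketch. ToyBuf and the toy_ functions are hypothetical; in particular, the waiting that the real StartBufferIO performs is not modelled, so the caller passes in the buffer state as it looks once the other process's read attempt has ended.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy buffer state; PostgreSQL keeps these as flag bits. */
typedef struct ToyBuf
{
	bool		valid;			/* block contents have been loaded */
	bool		io_in_progress; /* some process is reading the block */
} ToyBuf;

/*
 * Toy stand-in for StartBufferIO(buf, true). Returns true if the caller
 * must perform the read itself, false if the block is already loaded.
 */
static bool
toy_start_buffer_io(ToyBuf *buf)
{
	assert(!buf->io_in_progress);	/* the earlier attempt is over by now */
	if (buf->valid)
		return false;			/* the other process loaded the block */
	buf->io_in_progress = true; /* we take over the read */
	return true;
}

/* Computes found the way the collision branch of BufferAlloc does. */
static bool
toy_found(bool valid_at_pin, ToyBuf *buf)
{
	bool		found = true;

	if (!valid_at_pin && toy_start_buffer_io(buf))
		found = false;
	return found;
}
```

With a buffer that became valid (process 1 succeeded), toy_found reports true; with one that stayed invalid (process 1 failed), it reports false and marks the buffer as being read by the caller.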

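The Buffer Acquisition Conflict Problem

Consider another question: what if two processes need to load different physical blocks but end up with the same buffer? Buffers are obtained through StrategyGetBuffer, which works as follows:

Step 1: if there is a free buffer, take one. Otherwise go to step 2.
Step 2: use the replacement mechanism to evict a buffer.

The code is as follows: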

BufferDesc *
StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
	BufferDesc *buf;
	int			bgwprocno;
	int			trycounter;
	uint32		local_buf_state;	/* to avoid repeated (de-)referencing */

	/*
	 * If given a strategy object, see whether it can select a buffer. We
	 * assume strategy objects don't need buffer_strategy_lock.
	 */
	if (strategy != NULL)
	{
		buf = GetBufferFromRing(strategy, buf_state);
		if (buf != NULL)
			return buf;
	}

	/*
	 * If asked, we need to waken the bgwriter. Since we don't want to rely on
	 * a spinlock for this we force a read from shared memory once, and then
	 * set the latch based on that value. We need to go through that length
	 * because otherwise bgprocno might be reset while/after we check because
	 * the compiler might just reread from memory.
	 *
	 * This can possibly set the latch of the wrong process if the bgwriter
	 * dies in the wrong moment. But since PGPROC->procLatch is never
	 * deallocated the worst consequence of that is that we set the latch of
	 * some arbitrary process.
	 */
	bgwprocno = INT_ACCESS_ONCE(StrategyControl->bgwprocno);
	if (bgwprocno != -1)
	{
		/* reset bgwprocno first, before setting the latch */
		StrategyControl->bgwprocno = -1;

		/*
		 * Not acquiring ProcArrayLock here which is slightly icky. It's
		 * actually fine because procLatch isn't ever freed, so we just can
		 * potentially set the wrong process' (or no process') latch.
		 */
		SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
	}

	/*
	 * We count buffer allocation requests so that the bgwriter can estimate
	 * the rate of buffer consumption.  Note that buffers recycled by a
	 * strategy object are intentionally not counted here.
	 */
	pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);

	/*
	 * First check, without acquiring the lock, whether there's buffers in the
	 * freelist. Since we otherwise don't require the spinlock in every
	 * StrategyGetBuffer() invocation, it'd be sad to acquire it here -
	 * uselessly in most cases. That obviously leaves a race where a buffer is
	 * put on the freelist but we don't see the store yet - but that's pretty
	 * harmless, it'll just get used during the next buffer acquisition.
	 *
	 * If there's buffers on the freelist, acquire the spinlock to pop one
	 * buffer of the freelist. Then check whether that buffer is usable and
	 * repeat if not.
	 *
	 * Note that the freeNext fields are considered to be protected by the
	 * buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
	 * manipulate them without holding the spinlock.
	 *
	 * Step 1: take a buffer from the free list
	 *
	 */
	if (StrategyControl->firstFreeBuffer >= 0)
	{
		while (true)
		{
			/* 
			 * Acquire the spinlock to remove element from the freelist 
			 * take the lock
			 */
			SpinLockAcquire(&StrategyControl->buffer_strategy_lock);

			if (StrategyControl->firstFreeBuffer < 0)
			{
				SpinLockRelease(&StrategyControl->buffer_strategy_lock);
				break;
			}

			buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
			Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);

			/* Unconditionally remove buffer from freelist */
			StrategyControl->firstFreeBuffer = buf->freeNext;
			buf->freeNext = FREENEXT_NOT_IN_LIST;

			/*
			 * Release the lock so someone else can access the freelist while
			 * we check out this buffer.
			 */
			SpinLockRelease(&StrategyControl->buffer_strategy_lock);

			/*
			 * If the buffer is pinned or has a nonzero usage_count, we cannot
			 * use it; discard it and retry.  (This can only happen if VACUUM
			 * put a valid buffer in the freelist and then someone else used
			 * it before we got to it.  It's probably impossible altogether as
			 * of 8.3, but we'd better check anyway.)
			 */
			local_buf_state = LockBufHdr(buf);
			if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
				&& BUF_STATE_GET_USAGECOUNT(local_buf_state) == 0)
			{
				if (strategy != NULL)
					AddBufferToRing(strategy, buf);
				*buf_state = local_buf_state;
				return buf;
			}
			UnlockBufHdr(buf, local_buf_state);

		}
	}

	/* Nothing on the freelist, so run the "clock sweep" algorithm 
	 * Step 2: use the replacement mechanism to evict a buffer
	 */
	trycounter = NBuffers;
	for (;;)
	{
		buf = GetBufferDescriptor(ClockSweepTick());

		/*
		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
		 * it; decrement the usage_count (unless pinned) and keep scanning.
		 * take the lock
		 */
		local_buf_state = LockBufHdr(buf);

		if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
		{
			if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
			{
				local_buf_state -= BUF_USAGECOUNT_ONE;

				trycounter = NBuffers;
			}
			else
			{
				/* Found a usable buffer */
				if (strategy != NULL)
					AddBufferToRing(strategy, buf);
				*buf_state = local_buf_state;
				return buf;
			}
		}
		else if (--trycounter == 0)
		{
			/*
			 * We've scanned all the buffers without making any state changes,
			 * so all the buffers are pinned (or were when we looked at them).
			 * We could hope that someone will free one eventually, but it's
			 * probably better to fail than to risk getting stuck in an
			 * infinite loop.
			 */
			UnlockBufHdr(buf, local_buf_state);
			elog(ERROR, "no unpinned buffers available");
		}
		UnlockBufHdr(buf, local_buf_state);
	}
}

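Note that both step 1 and step 2 take locks: the free list is popped under buffer_strategy_lock, and each candidate buffer is examined under its buffer header spinlock (LockBufHdr). StrategyGetBuffer returns the chosen buffer with its header spinlock still held, and the caller pins the buffer before releasing that spinlock, so two processes cannot end up with the same buffer.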

Concurrency Control

Now let's look at concurrency control in BufferAlloc. First, here is BufferAlloc reduced to its skeleton:

static BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
			BlockNumber blockNum,
			BufferAccessStrategy strategy,
			bool *foundPtr)
{
	/*..... omitted ....*/
	/* 
	 * see if the block is in the buffer pool already 
	 * Step 1: lock BufTable.
	 */
	LWLockAcquire(newPartitionLock, LW_SHARED);
	/* Step 2: check whether the physical block is already in a buffer. */
	buf_id = BufTableLookup(&newTag, newHash);
	if (buf_id >= 0)
	{
		/*.....
		 * The block is already in a buffer; return it directly.
		 * ....*/
		return buf;
	}

	/*
	 * Didn't find it in the buffer pool.  We'll have to initialize a new
	 * buffer.  Remember to unlock the mapping lock while doing the work.
	 * Step 3: unlock BufTable
	 */
	LWLockRelease(newPartitionLock);

	/* 
	 * Loop here in case we have to try another victim buffer 
	 */
	for (;;)
	{
		/*
		 * Ensure, while the spinlock's not yet held, that there's a free
		 * refcount entry.
		 */
		ReservePrivateRefCountEntry();

		/*
		 * Select a victim buffer.  The buffer is returned with its header
		 * spinlock still held!
		 * Step 4: obtain a buf from the shared buffer pool according to the strategy.
		 */
		buf = StrategyGetBuffer(strategy, &buf_state);

		Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);

		/* Must copy buffer flags while we still hold the spinlock */
		oldFlags = buf_state & BUF_FLAG_MASK;

		/* Pin the buffer and then release the buffer spinlock */
		PinBuffer_Locked(buf);

		/*
		 * Step 5: if the data in buf has not been written to disk, write it out.
		 */
		if (oldFlags & BM_DIRTY)
		{
			/*** omitted ***/
		}

		/*
		 * To change the association of a valid buffer, we'll need to have
		 * exclusive lock on both the old and new mapping partitions.
		 * Step 6: lock BufTable
		 */
		if (oldFlags & BM_TAG_VALID)
		{
			if (oldPartitionLock < newPartitionLock)
			{
				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
			}
			else if (oldPartitionLock > newPartitionLock)
			{
				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
			}
			else
			{
				/* only one partition, only one lock */
				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
			}
		}
		else
		{
			/* if it wasn't valid, we need only the new partition */
			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
			/* remember we have no old-partition lock or tag */
			oldPartitionLock = NULL;
			/* this just keeps the compiler quiet about uninit variables */
			oldHash = 0;
		}

		/*
		 * Try to make a hashtable entry for the buffer under its new tag.
		 * This could fail because while we were writing someone else
		 * allocated another buffer for the same block we want to read in.
		 * Note that we have not yet removed the hashtable entry for the old
		 * tag.
		 * Step 7: insert buf into BufTable
		 */
		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);

		if (buf_id >= 0)
		{
			/*** omitted; this is the important collision-handling code from the duplicate-load discussion ***/
			return buf;
		}

		/*
		 * Need to lock the buffer header too in order to change its tag.
		 */
		buf_state = LockBufHdr(buf);

		/*
		 * Somebody could have pinned or re-dirtied the buffer while we were
		 * doing the I/O and making the new hashtable entry.  If so, we can't
		 * recycle this buffer; we must undo everything we've done and start
		 * over with a new victim buffer.
		 * Step 8: another process may have pinned or re-dirtied buf in the meantime, in which
		 * case buf cannot be used and the whole procedure must be redone.
		 */
		oldFlags = buf_state & BUF_FLAG_MASK;
		if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
			break;

		UnlockBufHdr(buf, buf_state);
		BufTableDelete(&newTag, newHash);
		if (oldPartitionLock != NULL &&
			oldPartitionLock != newPartitionLock)
			LWLockRelease(oldPartitionLock);
		LWLockRelease(newPartitionLock);
		UnpinBuffer(buf, true);
	}

	/*
	 * Okay, it's finally safe to rename the buffer.
	 *
	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
	 * paranoia.  We also reset the usage_count since any recency of use of
	 * the old content is no longer relevant.  (The usage_count starts out at
	 * 1 so that the buffer can survive one clock-sweep pass.)
	 *
	 * Make sure BM_PERMANENT is set for buffers that must be written at every
	 * checkpoint.  Unlogged buffers only need to be written at shutdown
	 * checkpoints, except for their "init" forks, which need to be treated
	 * just like permanent relations.
	 */
	buf->tag = newTag;
	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
				   BUF_USAGECOUNT_MASK);
	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
	else
		buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;

	UnlockBufHdr(buf, buf_state);

	/* Step 9: delete the evicted block from the hash table */
	if (oldPartitionLock != NULL)
	{
		BufTableDelete(&oldTag, oldHash);
		if (oldPartitionLock != newPartitionLock)
			LWLockRelease(oldPartitionLock);
	}

	/* Step 10: all done, unlock BufTable */
	LWLockRelease(newPartitionLock);

	/*
	 * Buffer contents are currently invalid.  Try to get the io_in_progress
	 * lock.  If StartBufferIO returns false, then someone else managed to
	 * read it before we did, so there's nothing left for BufferAlloc() to do.
	 */
	if (StartBufferIO(buf, true))
		*foundPtr = FALSE;
	else
		*foundPtr = TRUE;

	return buf;
}

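Let's walk through BufferAlloc:

Step 1: lock BufTable

Why the lock is needed is discussed below.
Step 2: check whether the block to load (the block for newTag) is already cached.

If it is, return it; otherwise go to step 3.
Step 3: unlock BufTable

Why the unlock is needed is discussed below.
Step 4: obtain a buf from the shared buffer pool according to the strategy

The buf may be a free buf, or a buf used by **a block that is being evicted (oldTag)**.
Step 5: check whether the buf obtained in step 4 has been written to disk

The buf from step 4 may hold an evicted block whose data has not yet been written to disk, so step 5 checks for that and writes the data out if necessary.
Step 6: lock BufTable

Why the lock is needed is discussed below.
Step 7: insert the buf for newTag into BufTable

If the insert finds that BufTable already contains the same newTag, another process has already set up a buffer for this block; the current process gives up its own buf and returns the existing one.
Step 8: check whether the buf obtained earlier has been taken by someone else

If so, give up the buf, undo step 7 (BufTableDelete(&newTag, newHash)), and start over with a new victim.
Step 9: delete the evicted block (oldTag) from BufTable
Step 10: unlock BufTable

Why the unlock is needed is discussed below.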

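Now consider the lock and unlock operations on BufTable in this flow:

The lock in step 1: step 2 reads BufTable, so BufTable needs a read (shared) lock.
The unlock in step 3: this unlock is crucial. In principle, correctness does not require releasing the lock here. But steps 4 and 5 are expensive, step 5 in particular because it may write to disk, so for the sake of concurrency BufTable must be unlocked before steps 4 and 5.
The lock in step 6: step 7 writes newTag into BufTable, so BufTable needs a write (exclusive) lock.
The unlock in step 10: the whole procedure is finished, so the BufTable lock can be released.

The key point in this locking sequence is step 3, the unlock done for performance. It improves BufTable's concurrency, but what problem does it introduce?

In step 4 we obtain a buf. That buf may hold a block that no process is accessing at the moment (which is why StrategyGetBuffer picked it for eviction), but the block's oldTag has not yet been deleted from BufTable; that happens only in step 9. Since BufTable is not locked at this point, other processes can still find, access, and modify the block for oldTag while the current process runs steps 4 and 5. So after steps 6 and 7, before committing to the buf, the current process must check whether the block data has been modified or the buf is in use by another process. If so, it must give up the buf and look for another.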
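The recheck itself is a small predicate over the buffer header state. The sketch below restates the condition from the code above (refcount == 1 and not dirty) on a hypothetical ToyVictimState snapshot; the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of a victim's header, taken under its header lock. */
typedef struct ToyVictimState
{
	int			refcount;		/* includes the pin held by the current process */
	bool		dirty;			/* true if someone modified the page again */
} ToyVictimState;

/*
 * The victim can be reused only if the current process holds the sole pin
 * and nobody re-dirtied the page while the mapping table was unlocked.
 */
static bool
toy_victim_still_usable(const ToyVictimState *st)
{
	assert(st->refcount >= 1);	/* our own pin must still be there */
	return st->refcount == 1 && !st->dirty;
}
```

When the predicate is false, the caller would undo its table insert, unpin the buffer, and loop back to pick another victim, as the retry loop in BufferAlloc does.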

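This flow looks cumbersome. Could step 9 be moved earlier? For example:

Step 1: lock BufTable
Step 2: check whether the block to load (the block for newTag) is already cached.
Step 3 (was step 4): obtain a buf from the shared buffer pool according to the strategy
Step 4 (was step 9): delete the evicted block (oldTag) from BufTable
Step 5 (was step 3): unlock BufTable
Step 6: check whether the buf obtained in step 3 has been written to disk
Step 7: lock BufTable
Step 8: insert the buf for newTag into BufTable
~~Step 9: check whether the buf obtained earlier has been taken by someone else~~ (no longer needed)
Step 10: unlock BufTable

With this change, the buf's old entry is deleted from BufTable as soon as the buf is obtained (steps 3 and 4), so no other process can use the block afterwards. But this creates more serious problems:

Problem 1: BufTable can be unlocked only after obtaining the buf (calling StrategyGetBuffer) and deleting oldTag, which reduces BufTable's concurrency (though it may be slightly better than unlocking only at the very end).
Problem 2: if step 8 finds that the block has already been loaded by another process (BufTableInsert returns a buf_id >= 0), the deletion in step 4 was wasted. Worse than wasted: a data block has been evicted from the cache for no reason and must be reloaded the next time it is needed.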

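Summary

For BufferAlloc's situation, PostgreSQL's flow is the best choice. From a pure LRU-management point of view, though, if concurrency requirements are modest and no disk writes are involved, the modified flow is worth considering, since the logic of original step 8, checking whether the buf has been taken by another process, is fairly complex.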

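Shared Buffer Replacement Strategies

The number of buffers in the pool is fixed at initialization (it is held in the global variable NBuffers, whose initial value in the source is 1000) and does not change afterwards. Over time the buffers can therefore run out, and some buffers that have not been used recently must be replaced in order to load the requested blocks.

PostgreSQL provides two replacement strategies: the general strategy and the buffer-ring strategy. In the StrategyGetBuffer code above, the buffer-ring strategy is implemented in GetBufferFromRing, called from the if (strategy != NULL) block at the top of the function. The rest of the function implements the general strategy. The two are described in turn below.

The General Replacement Strategy

The two steps of the general strategy were introduced above; here they are in more detail.

If there is a free buffer, take one

The buffer pool maintains a FreeList, a singly linked list. Buffers on the FreeList are chained through the freeNext field of their descriptors, and the BufferStrategyControl structure records the first and last elements of the list. The reference book describes a buffer being appended to the tail of the FreeList when its refcount drops to 0; in the source version shown here, buffers are instead put back on the list by StrategyFreeBuffer, for example when a buffer is invalidated. A process that needs a free buffer takes one from the head of the list. BufferStrategyControl is defined as follows: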

typedef struct
{
	/* Spinlock: protects the values below */
	slock_t		buffer_strategy_lock;

	/*
	 * Clock sweep hand: index of next buffer to consider grabbing. Note that
	 * this isn't a concrete buffer - we only ever increase the value. So, to
	 * get an actual buffer, it needs to be used modulo NBuffers.
	 */
	pg_atomic_uint32 nextVictimBuffer;

	int			firstFreeBuffer;	/* Head of list of unused buffers */
	int			lastFreeBuffer; /* Tail of list of unused buffers */

	/*
	 * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
	 * when the list is empty)
	 */

	/*
	 * Statistics.  These counters should be wide enough that they can't
	 * overflow during a single bgwriter cycle.
	 */
	uint32		completePasses; /* Complete cycles of the clock sweep */
	pg_atomic_uint32 numBufferAllocs;	/* Buffers allocated since last reset */

	/*
	 * Bgworker process to be notified upon activity or -1 if none. See
	 * StrategyNotifyBgWriter.
	 */
	int			bgwprocno;
} BufferStrategyControl;
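
The list bookkeeping described above can be sketched with a toy pool. This is a minimal single-process sketch, not PostgreSQL code: the `Toy*` types, `toy_*` functions and the pool size are hypothetical, the spinlock is omitted, and linking a returned buffer at the head is an assumption of the sketch. The two sentinel values follow the `FREENEXT_END_OF_LIST` and `FREENEXT_NOT_IN_LIST` convention used with `freeNext`.

```c
#include <assert.h>

#define TOY_NBUFFERS 4				/* hypothetical pool size */
#define FREENEXT_END_OF_LIST (-1)	/* marks the end of the chain */
#define FREENEXT_NOT_IN_LIST (-2)	/* buffer is not on the free list */

typedef struct ToyBufferDesc
{
	int			buf_id;
	int			freeNext;		/* link in the free-list chain */
} ToyBufferDesc;

typedef struct ToyStrategyControl
{
	int			firstFreeBuffer;	/* head of the list, -1 when empty */
	int			lastFreeBuffer;		/* tail, undefined when the list is empty */
} ToyStrategyControl;

static ToyBufferDesc toy_bufs[TOY_NBUFFERS];
static ToyStrategyControl toy_ctl;

/* Chain every buffer into the list: 0 -> 1 -> ... -> N-1 -> end. */
static void
toy_freelist_init(void)
{
	for (int i = 0; i < TOY_NBUFFERS; i++)
	{
		toy_bufs[i].buf_id = i;
		toy_bufs[i].freeNext = i + 1;
	}
	toy_bufs[TOY_NBUFFERS - 1].freeNext = FREENEXT_END_OF_LIST;
	toy_ctl.firstFreeBuffer = 0;
	toy_ctl.lastFreeBuffer = TOY_NBUFFERS - 1;
}

/* Take a buffer from the head; returns -1 when no free buffer is left. */
static int
toy_freelist_pop(void)
{
	int			id = toy_ctl.firstFreeBuffer;

	if (id < 0)
		return -1;
	toy_ctl.firstFreeBuffer = toy_bufs[id].freeNext;
	toy_bufs[id].freeNext = FREENEXT_NOT_IN_LIST;
	return id;
}

/* Put a buffer back at the head; a buffer already on the list is ignored. */
static void
toy_freelist_push(int id)
{
	assert(id >= 0 && id < TOY_NBUFFERS);
	if (toy_bufs[id].freeNext != FREENEXT_NOT_IN_LIST)
		return;
	toy_bufs[id].freeNext = toy_ctl.firstFreeBuffer;
	if (toy_bufs[id].freeNext < 0)
		toy_ctl.lastFreeBuffer = id;	/* list was empty: new tail too */
	toy_ctl.firstFreeBuffer = id;
}
```

Because `lastFreeBuffer` is only meaningful while the list is non-empty, the sketch refreshes it exactly when a buffer is pushed onto an empty list.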

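Replacing a buffer with the replacement mechanism

The replacement mechanism is a simple clock-sweep algorithm. Its main flow is:

1. Initialize trycounter = NBuffers.
2. Locate the buffer indicated by the nextVictimBuffer field, whose initial value is 0.
3. Increment nextVictimBuffer by 1; once it moves past the last buffer in the pool it wraps around to 0 (in the code the counter only ever increases and is used modulo NBuffers).
4. If the buffer obtained in step 2 has refcount 0:
a. If usagecount is not 0, decrement usagecount by 1 and reset trycounter to NBuffers.
b. Otherwise take this buffer and return it.
5. If the buffer obtained in step 2 has a nonzero refcount, decrement trycounter by 1; if trycounter reaches 0, report an error.
6. Go back to step 2.

To make this clearer, here is that part of the code again (lines 125 to 167 of StrategyGetBuffer), annotated with the steps above: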

trycounter = NBuffers;		/* step 1 */
for (;;)
{
	/* steps 2 and 3 */
	buf = GetBufferDescriptor(ClockSweepTick());
	/*
	 * If the buffer is pinned or has a nonzero usage_count, we cannot use
	 * it; decrement the usage_count (unless pinned) and keep scanning.
	 */
	local_buf_state = LockBufHdr(buf);
    
	/* step 4: check whether refcount is 0 */
	if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
	{
		/* check whether usagecount is 0 */
		if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
		{
			/* usagecount is nonzero: decrement it and reset trycounter to NBuffers */
			local_buf_state -= BUF_USAGECOUNT_ONE;
			trycounter = NBuffers;
		}
		else
		{
			/* usagecount is 0: take this buffer and return it */
			if (strategy != NULL)	/* with a buffer ring strategy, add this buffer to the ring */
				AddBufferToRing(strategy, buf);
			*buf_state = local_buf_state;
			return buf;
		}
	}
	else if (--trycounter == 0)
	{
		/* step 5 */
		/*
		 * We've scanned all the buffers without making any state changes,
		 * so all the buffers are pinned (or were when we looked at them).
		 * We could hope that someone will free one eventually, but it's
		 * probably better to fail than to risk getting stuck in an
		 * infinite loop.
		 */
		UnlockBufHdr(buf, local_buf_state);
		elog(ERROR, "no unpinned buffers available");
	}
	UnlockBufHdr(buf, local_buf_state);
	/* step 6: continue the loop */
}
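
The loop above calls ClockSweepTick() to advance the clock hand. As a rough illustration of the "only ever increase, use modulo NBuffers" idea from the nextVictimBuffer comment, here is a single-threaded sketch. The `toy_*` names and the pool size are hypothetical, and the atomic operations and wraparound handling of the real function are left out.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NBUFFERS 8			/* hypothetical pool size */

static uint32_t toy_next_victim = 0;		/* only ever increases */
static uint32_t toy_complete_passes = 0;	/* full trips around the pool */

/* Return the index of the next buffer to inspect and advance the hand. */
static uint32_t
toy_clock_sweep_tick(void)
{
	uint32_t	victim = toy_next_victim++;

	/* The hand is back at slot 0 after a full trip: count one pass. */
	if (victim > 0 && victim % TOY_NBUFFERS == 0)
		toy_complete_passes++;

	return victim % TOY_NBUFFERS;
}
```

Keeping the raw counter monotonic means a caller never has to write the wrapped value back; the index is derived from it each time.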

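Core idea:
Whatever the database, the core idea of cache replacement is to evict pages that are accessed infrequently. PostgreSQL uses usage_count to represent how often a page is accessed: usage_count starts at 0 and a pin normally increments it, up to the cap BM_MAX_USAGE_COUNT (5 in the PostgreSQL source). The more frequently a page is accessed, the larger its usage_count, the longer clock-sweep takes to bring it down to 0, and the less likely the page is to be evicted.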
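
To see this effect in isolation, the sweep can be simulated on a tiny in-memory pool. This is a sketch under stated assumptions: `ToyBuf`, `toy_pin`, `toy_unpin`, `toy_clock_sweep` and both constants are hypothetical names, there is no locking, and "all buffers pinned" is reported as -1 instead of raising an error.

```c
#include <assert.h>

#define TOY_NBUFFERS 4			/* hypothetical pool size */
#define TOY_MAX_USAGE 5			/* cap on usage_count (assumed value) */

typedef struct ToyBuf
{
	int			refcount;		/* number of current pins */
	int			usage_count;	/* recent-use counter */
} ToyBuf;

static ToyBuf toy_pool[TOY_NBUFFERS];
static int	toy_hand = 0;		/* next buffer the sweep will look at */

/* Pin a buffer: one more user, and one more recent use (up to the cap). */
static void
toy_pin(int id)
{
	assert(id >= 0 && id < TOY_NBUFFERS);
	toy_pool[id].refcount++;
	if (toy_pool[id].usage_count < TOY_MAX_USAGE)
		toy_pool[id].usage_count++;
}

/* Unpin a buffer: the usage count is left untouched. */
static void
toy_unpin(int id)
{
	assert(id >= 0 && id < TOY_NBUFFERS && toy_pool[id].refcount > 0);
	toy_pool[id].refcount--;
}

/* Returns the victim's index, or -1 if every buffer stayed pinned. */
static int
toy_clock_sweep(void)
{
	int			trycounter = TOY_NBUFFERS;

	for (;;)
	{
		int			id = toy_hand;
		ToyBuf	   *buf = &toy_pool[id];

		toy_hand = (toy_hand + 1) % TOY_NBUFFERS;

		if (buf->refcount == 0)
		{
			if (buf->usage_count != 0)
			{
				/* Recently used: age it and keep scanning. */
				buf->usage_count--;
				trycounter = TOY_NBUFFERS;
			}
			else
				return id;		/* unpinned and cold: this is the victim */
		}
		else if (--trycounter == 0)
			return -1;			/* a full scan found only pinned buffers */
	}
}
```

If buffer 0 is pinned and unpinned three times and buffer 1 once, the first sweeps pick the untouched buffers 2 and 3, then buffer 1, while buffer 0 is only aged and stays in the pool.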

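Buffer ring replacement strategy

The buffer ring is an optimization of the general replacement strategy. Consider this scenario: several processes are performing routine operations on the database, and one process starts a full table scan. The scan touches a large number of physical blocks, but each block only once. Under the general strategy, the scan would fill the buffer pool with pages that are used once and push out many pages that are used repeatedly, which defeats the purpose of the buffer pool, namely reducing I/O. To handle this case, the buffer ring allocates a fixed number of buffers and tries replacement within them first; only if none of them can be replaced does it fall back to the general strategy. The ring is controlled by the BufferAccessStrategy data structure, defined as follows: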

typedef struct BufferAccessStrategyData
{
	/* Overall strategy type (which buffer ring policy is in use) */
	BufferAccessStrategyType btype;
	/* Number of elements in buffers[] array (the ring size) */
	int			ring_size;

	/*
	 * Index of the "current" slot in the ring, ie, the one most recently
	 * returned by GetBufferFromRing.
	 * (annotation: the slot of the buffer most recently added to the ring)
	 */
	int			current;

	/*
	 * True if the buffer just returned by StrategyGetBuffer had been in the
	 * ring already.
	 * (annotation: whether the buffer most recently obtained via
	 * StrategyGetBuffer was taken directly from the ring)
	 */
	bool		current_was_in_ring;

	/*
	 * Array of buffer numbers.  InvalidBuffer (that is, zero) indicates we
	 * have not yet selected a buffer for this ring slot.  For allocation
	 * simplicity this is palloc'd together with the fixed fields of the
	 * struct.
	 * (annotation: the array storing the buffer numbers of the buffers
	 * added to the ring)
	 */
	Buffer		buffers[FLEXIBLE_ARRAY_MEMBER];
}	BufferAccessStrategyData;
typedef struct BufferAccessStrategyData *BufferAccessStrategy;
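
The comment on buffers[] notes that the array is allocated together with the fixed fields. A minimal sketch of that allocation pattern with a C99 flexible array member follows. `ToyStrategy` and `toy_strategy_create` are hypothetical names, and `calloc` stands in for PostgreSQL's zeroing allocator, so every slot starts out as 0, the value InvalidBuffer has in the comment above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef int ToyBuffer;			/* 0 plays the role of InvalidBuffer */

typedef struct ToyStrategy
{
	int			ring_size;		/* number of elements in buffers[] */
	int			current;		/* index of the "current" ring slot */
	bool		current_was_in_ring;
	ToyBuffer	buffers[];		/* flexible array member, sized at runtime */
} ToyStrategy;

/* Allocate the header and ring_size slots in one zeroed block. */
static ToyStrategy *
toy_strategy_create(int ring_size)
{
	ToyStrategy *s;

	assert(ring_size > 0);
	s = calloc(1, offsetof(ToyStrategy, buffers) +
			   (size_t) ring_size * sizeof(ToyBuffer));
	if (s == NULL)
		return NULL;			/* out of memory */
	s->ring_size = ring_size;
	return s;
}
```

A single allocation keeps the ring and its control fields together, and releasing the strategy is one `free` call.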

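The buffer ring replacement strategy is implemented in GetBufferFromRing, which works in three steps:

1. Advance the strategy's current index to the next element of the buffers array (the next candidate buffer); if it was at the last element, reset current to 0 (the first element).
2. Inspect the element at current. If it holds InvalidBuffer, the ring is not yet full and this slot has no buffer recorded. In that case set the strategy's current_was_in_ring field to false and return NULL.

On seeing the NULL return value, the caller (StrategyGetBuffer) obtains a buffer through the general replacement strategy and adds it to the ring via AddBufferToRing.
3. If the element at current holds a valid buffer number, check that buffer's refcount and usagecount. If refcount is 0 and usagecount <= 1 (it has been accessed at most once, and that one access was most likely made by the current process during the full table scan), hand this buffer out for replacement. Otherwise the buffer is still in use by another process or was recently used by one, and, as in step 2, the caller falls back to the general replacement strategy.

In short: take the buffer recorded in the slot after the current one. If the slot holds a valid buffer that no process is accessing and that has been used at most once recently, return it; otherwise fall back to the general replacement strategy.

The code is as follows: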

static BufferDesc *
GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
{
	BufferDesc *buf;
	Buffer		bufnum;
	uint32		local_buf_state;	/* to avoid repeated (de-)referencing */


	/* Advance to next ring slot */
	if (++strategy->current >= strategy->ring_size)
		strategy->current = 0;

	/*
	 * If the slot hasn't been filled yet, tell the caller to allocate a new
	 * buffer with the normal allocation strategy.  He will then fill this
	 * slot by calling AddBufferToRing with the new buffer.
	 */
	bufnum = strategy->buffers[strategy->current];
	if (bufnum == InvalidBuffer)
	{
		strategy->current_was_in_ring = false;
		return NULL;
	}

	/*
	 * If the buffer is pinned we cannot use it under any circumstances.
	 *
	 * If usage_count is 0 or 1 then the buffer is fair game (we expect 1,
	 * since our own previous usage of the ring element would have left it
	 * there, but it might've been decremented by clock sweep since then). A
	 * higher usage_count indicates someone else has touched the buffer, so we
	 * shouldn't re-use it.
	 */
	buf = GetBufferDescriptor(bufnum - 1);
	local_buf_state = LockBufHdr(buf);
	if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
		&& BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
	{
		strategy->current_was_in_ring = true;
		*buf_state = local_buf_state;
		return buf;
	}
	UnlockBufHdr(buf, local_buf_state);

	/*
	 * Tell caller to allocate a new buffer with the normal allocation
	 * strategy.  He'll then replace this ring element via AddBufferToRing.
	 */
	strategy->current_was_in_ring = false;
	return NULL;
}

AddBufferToRing

AddBufferToRing was mentioned earlier; here is its implementation:

static void
AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
{
	strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
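
How GetBufferFromRing and AddBufferToRing cooperate can be traced with a small simulation. This is a sketch, not the real code path: all `Toy*` and `toy_*` names and both sizes are hypothetical, the general strategy is replaced by a counter that hands out the next untouched buffer, and setting usage_count to 1 on every fetch is a simplification of what pinning does for a ring scan.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NBUFFERS 8			/* hypothetical pool size */
#define TOY_RING_SIZE 2			/* hypothetical ring size */
#define TOY_INVALID_BUFFER 0	/* buffer numbers are 1-based, 0 means "none" */

typedef struct ToyBuf
{
	int			refcount;
	int			usage_count;
} ToyBuf;

typedef struct ToyRing
{
	int			current;
	bool		current_was_in_ring;
	int			buffers[TOY_RING_SIZE];	/* 1-based buffer numbers */
} ToyRing;

static ToyBuf toy_pool[TOY_NBUFFERS];
static int	toy_next_unused = 0;	/* stand-in for the general strategy */

/* Returns a pool index, or -1 meaning "use the general strategy". */
static int
toy_get_from_ring(ToyRing *ring)
{
	int			bufnum;

	/* Advance to the next ring slot, wrapping at the end. */
	if (++ring->current >= TOY_RING_SIZE)
		ring->current = 0;

	bufnum = ring->buffers[ring->current];
	if (bufnum != TOY_INVALID_BUFFER)
	{
		ToyBuf	   *buf = &toy_pool[bufnum - 1];

		/* Reusable only if unpinned and touched at most once. */
		if (buf->refcount == 0 && buf->usage_count <= 1)
		{
			ring->current_was_in_ring = true;
			return bufnum - 1;
		}
	}
	ring->current_was_in_ring = false;
	return -1;
}

/* Get a buffer for a scan: ring first, general strategy as fallback. */
static int
toy_get_buffer(ToyRing *ring)
{
	int			id = toy_get_from_ring(ring);

	if (id < 0)
	{
		assert(toy_next_unused < TOY_NBUFFERS);
		id = toy_next_unused++;					/* placeholder for free list / clock sweep */
		ring->buffers[ring->current] = id + 1;	/* what AddBufferToRing does */
	}
	toy_pool[id].usage_count = 1;	/* the scan touched the page once */
	return id;
}
```

With a ring of two slots, the first two requests fall through to the fallback and fill the ring, later requests keep cycling over the same two buffers, and a slot is replaced only after some other user has pushed its buffer's usage_count above 1.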