PostgreSQL Flow: Sequential Scan

Sequential Scan

Prerequisites

PostgreSQL Flow: Query

Overview

In "PostgreSQL Flow: Query" we focused on PostgreSQL's query flow and noted that the full-table (sequential) scan is implemented mainly by the function SeqNext. This article examines the flow of SeqNext in detail.

SeqNext

SeqNext is fairly short, so let's go straight to the code:

static TupleTableSlot *
SeqNext(SeqScanState *node)
{
	HeapTuple	tuple;
	HeapScanDesc scandesc;
	EState	   *estate;
	ScanDirection direction;
	TupleTableSlot *slot;

	/*
	 * get information from the estate and scan state
	 */
	scandesc = node->ss.ss_currentScanDesc;
	estate = node->ss.ps.state;
	direction = estate->es_direction;
	slot = node->ss.ss_ScanTupleSlot;
	
	/* if no scandesc exists yet, create one */
	if (scandesc == NULL)
	{
		/*
		 * We reach here if the scan is not parallel, or if we're serially
		 * executing a scan that was planned to be parallel.
		 */
		scandesc = heap_beginscan(node->ss.ss_currentRelation,
								  estate->es_snapshot,
								  0, NULL);
		node->ss.ss_currentScanDesc = scandesc;
	}

	/*
	 * get the next tuple from the table
	 * (i.e. fetch one visible tuple)
	 */
	tuple = heap_getnext(scandesc, direction);

	/*
	 * save the tuple and the buffer returned to us by the access methods in
	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
	 * tuples returned by heap_getnext() are pointers onto disk pages and were
	 * not created with palloc() and so should not be pfree()'d.  Note also
	 * that ExecStoreTuple will increment the refcount of the buffer; the
	 * refcount will not be dropped until the tuple table slot is cleared.
	 * (Cleanup work also happens here, e.g. unpinning the previous buffer page.)
	 */
	if (tuple)
		ExecStoreTuple(tuple,	/* tuple to store */
					   slot,	/* slot to store in */
					   scandesc->rs_cbuf,		/* buffer associated with this
												 * tuple */
					   false);	/* don't pfree this pointer */
	else
		ExecClearTuple(slot);

	return slot;
}

SeqNext does three things:

  1. If no scandesc exists, create and initialize one

    Since SeqNext returns only one visible tuple per call, it needs an iterator that records which tuple of which buffered page the scan has reached; scandesc is that iterator. scandesc is a structure of type HeapScanDesc, defined as follows:

    typedef struct HeapScanDescData
    {
    	/* scan parameters */
    	Relation	rs_rd;			/* heap relation descriptor */
    	Snapshot	rs_snapshot;	/* snapshot to see */
    	int			rs_nkeys;		/* number of scan keys */
    	ScanKey		rs_key;			/* array of scan key descriptors */
    	bool		rs_bitmapscan;	/* true if this is really a bitmap scan */
    	bool		rs_samplescan;	/* true if this is really a sample scan */
    	bool		rs_pageatatime; /* verify visibility page-at-a-time? */
    	bool		rs_allow_strat; /* allow or disallow use of access strategy */
    	bool		rs_allow_sync;	/* allow or disallow use of syncscan */
    	bool		rs_temp_snap;	/* unregister snapshot at scan end? */
    
    	/* state set up at initscan time */
    	BlockNumber rs_nblocks;		/* total number of blocks in rel */
    	BlockNumber rs_startblock;	/* block # to start at */
    	BlockNumber rs_numblocks;	/* max number of blocks to scan */
    	/* rs_numblocks is usually InvalidBlockNumber, meaning "scan whole rel" */
    	BufferAccessStrategy rs_strategy;	/* access strategy for reads */
    	bool		rs_syncscan;	/* report location to syncscan logic? */
    
    	/* scan current state */
    	bool		rs_inited;		/* false = scan not init'd yet */
    	HeapTupleData rs_ctup;		/* current tuple in scan, if any */
    	BlockNumber rs_cblock;		/* current block # in scan, if any */
    	Buffer		rs_cbuf;		/* current buffer in scan, if any */
    	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
    	ParallelHeapScanDesc rs_parallel;	/* parallel scan information */
    
    	/* these fields only used in page-at-a-time mode and for bitmap scans */
    	int			rs_cindex;		/* current tuple's index in vistuples */
    	int			rs_ntuples;		/* number of visible tuples on page */
    	OffsetNumber rs_vistuples[MaxHeapTuplesPerPage];	/* their offsets */
    }	HeapScanDescData;
    
    typedef struct HeapScanDescData *HeapScanDesc;
    

    HeapScanDesc has quite a few members, all of which are initialized in heap_beginscan. The members involved in a sequential scan are explained as they appear in the flow below.

  2. Call heap_getnext to fetch one visible tuple

    This is another core function of the sequential scan.

  3. Call ExecStoreTuple or ExecClearTuple to store (or clear) the returned tuple and perform some cleanup
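The three responsibilities above can be sketched with a toy iterator. This is a minimal illustration, not the real PostgreSQL structures: ToyScanDesc, ToyScanState, and toy_seq_next are hypothetical stand-ins for HeapScanDesc, SeqScanState, and SeqNext, and a tuple is reduced to an integer index.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for HeapScanDesc: a lazily created iterator that
 * remembers the current scan position between calls. */
typedef struct ToyScanDesc
{
	int		cur;		/* current index, plays the role of rs_cindex */
	int		ntuples;	/* total number of items to scan */
} ToyScanDesc;

/* Toy stand-in for SeqScanState. */
typedef struct ToyScanState
{
	ToyScanDesc *scandesc;	/* NULL until the first call, like ss_currentScanDesc */
	int		ntuples;
} ToyScanState;

/* Returns the next item index, or -1 when the scan is exhausted. */
int
toy_seq_next(ToyScanState *node)
{
	ToyScanDesc *scandesc = node->scandesc;

	if (scandesc == NULL)
	{
		/* first call: create and initialize the iterator, as SeqNext
		 * does via heap_beginscan */
		scandesc = malloc(sizeof(ToyScanDesc));
		scandesc->cur = -1;
		scandesc->ntuples = node->ntuples;
		node->scandesc = scandesc;
	}

	if (scandesc->cur + 1 >= scandesc->ntuples)
		return -1;				/* like ExecClearTuple: nothing left */
	return ++scandesc->cur;		/* like ExecStoreTuple: yield one tuple */
}
```

Because each call returns exactly one tuple, the scan position must survive between calls, which is why it lives in the descriptor rather than in a local variable.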

Sequential Scan Flow

Put simply, a sequential scan walks a table record by record, from the first record of the first block to the last record of the last block.

The whole process involves many details. We condense these details into questions, and work through the flow by answering each one:

  1. How is the starting position of the scan obtained?
  2. How are records actually fetched?
  3. How is the visibility of a record determined?
  4. While a buffered page is being scanned, how do we guarantee it is not evicted, and when may it be evicted?

How the starting position is obtained

The starting position is obviously the first record of the first block of the table. The implementation involves two members of HeapScanDesc:

  • rs_rd: the relation (table) being scanned.
  • rs_startblock: the block number of the block where the scan starts. Note that this is a physical block number; before the block can be accessed, it must be loaded into a memory page!

How records are fetched

Records are fetched in heap_getnext. Let's look at its code:

HeapTuple
heap_getnext(HeapScanDesc scan, ScanDirection direction)
{
	/* Note: no locking manipulations needed */

	HEAPDEBUG_1;				/* heap_getnext( info ) */

	if (scan->rs_pageatatime)
		heapgettup_pagemode(scan, direction,
							scan->rs_nkeys, scan->rs_key);
	else
		heapgettup(scan, direction, scan->rs_nkeys, scan->rs_key);

	if (scan->rs_ctup.t_data == NULL)
	{
		HEAPDEBUG_2;			/* heap_getnext returning EOS */
		return NULL;
	}

	/*
	 * if we get here it means we have a new current scan tuple, so point to
	 * the proper return buffer and return the tuple.
	 */
	HEAPDEBUG_3;				/* heap_getnext returning tuple */

	pgstat_count_heap_getnext(scan->rs_rd);

	return &(scan->rs_ctup);
}

One member of HeapScanDesc is involved here:

  • rs_pageatatime

    rs_pageatatime indicates whether page-at-a-time mode may be used. Page-at-a-time is a sequential-scan mode in which heapgettup_pagemode is called instead of heapgettup; the two functions are functionally identical, but heapgettup_pagemode is the lighter-weight implementation.

Next we look at the implementation of heapgettup_pagemode in detail.

heapgettup_pagemode

heapgettup_pagemode is the lowest-level function of the sequential scan; it returns one visible tuple. Its flow is as follows:

  1. Check whether the iterator has been initialized (via the rs_inited member of HeapScanDesc)

    a. Not initialized

    If the iterator is not initialized, the scan has just begun. First determine whether the table has any data blocks (the rs_nblocks member of HeapScanDesc) and how many blocks may be scanned (the rs_numblocks member). If either rs_nblocks or rs_numblocks is 0, the table has no blocks to scan, so return immediately.

    If both rs_nblocks and rs_numblocks are nonzero, obtain the first block to scan from the rs_startblock member of HeapScanDesc. Then call heapgetpage, which collects the ItemIds (the fixed-size line-pointer entries) of all visible tuples on the block and stores them in the rs_vistuples member of HeapScanDesc; subsequent tuples are fetched by walking rs_vistuples. heapgetpage is explained in detail later.

    Since the scan has just started, the tuple to fetch is obviously the first visible tuple, i.e. the one at rs_vistuples[0], so lineindex is set to 0.

    Finally rs_inited is set to true to mark the iterator as initialized.

    b. Already initialized

    Continue the scan: on the current buffered page (recorded in the rs_cblock member of HeapScanDesc), fetch the tuple after the one returned last (recorded in the rs_cindex member). Set lineindex to rs_cindex + 1.

  2. Compute the number of tuples remaining on the current page

    lines = scan->rs_ntuples;			/* rs_ntuples is set in heapgetpage */
    linesleft = lines - lineindex;
    
  3. Check the number of remaining tuples

    a. > 0

    Fetch the tuple at lineindex, update the current tuple index (set scan->rs_cindex to lineindex), and return.

    b. = 0

    Go to step 4.

  4. Advance to the next physical block

    If there are still unscanned physical blocks, move to the next one and call heapgetpage to collect its visible tuples. Reset lineindex to 0 and go back to step 3. If all physical blocks have been scanned, return directly.
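The four steps above boil down to a two-level loop: an outer loop over blocks and an inner loop over the visible tuples of the current block. A minimal standalone sketch, assuming toy types (ToyScan is a hypothetical stand-in for HeapScanDesc), forward direction only and no scan keys:

```c
#include <assert.h>

#define NBLOCKS  3			/* plays the role of rs_nblocks */
#define PER_PAGE 2			/* plays the role of rs_ntuples, same on every page */

/* Toy iterator state: which block we are on, which visible-tuple index
 * was returned last, and whether the scan has been initialized. */
typedef struct ToyScan
{
	int		inited;		/* mirrors rs_inited */
	int		cblock;		/* mirrors rs_cblock */
	int		cindex;		/* mirrors rs_cindex */
} ToyScan;

/* Returns a global tuple id, or -1 at end of scan. */
int
toy_gettup_pagemode(ToyScan *scan)
{
	int		lineindex;

	if (!scan->inited)
	{
		/* step 1a: start at block 0, first visible tuple */
		scan->cblock = 0;
		lineindex = 0;
		scan->inited = 1;
	}
	else
		lineindex = scan->cindex + 1;	/* step 1b: continue after last tuple */

	for (;;)
	{
		/* steps 2-3: any tuples left on this page? */
		if (lineindex < PER_PAGE)
		{
			scan->cindex = lineindex;
			return scan->cblock * PER_PAGE + lineindex;
		}

		/* step 4: advance to the next block, or finish */
		if (scan->cblock + 1 >= NBLOCKS)
			return -1;
		scan->cblock++;
		lineindex = 0;					/* reset lineindex, back to step 3 */
	}
}
```

Each call either returns the next visible tuple of the current page or crosses into the next block, exactly the shape of the real function below.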

Now let's look at the code of heapgettup_pagemode:

static void
heapgettup_pagemode(HeapScanDesc scan,
					ScanDirection dir,
					int nkeys,
					ScanKey key)
{
	HeapTuple	tuple = &(scan->rs_ctup);
	bool		backward = ScanDirectionIsBackward(dir);
	BlockNumber page;
	bool		finished;
	Page		dp;
	int			lines;
	int			lineindex;
	OffsetNumber lineoff;
	int			linesleft;
	ItemId		lpp;

	/*
	 * calculate next starting lineindex, given scan direction
	 */
	if (ScanDirectionIsForward(dir))
	{
		/* step 1: check whether the iterator is initialized */
		if (!scan->rs_inited)
		{
			/*
			 * return null immediately if relation is empty
			 * (check whether there are any blocks to scan)
			 */
			if (scan->rs_nblocks == 0 || scan->rs_numblocks == 0)
			{
				/* no blocks to scan; return immediately */
				Assert(!BufferIsValid(scan->rs_cbuf));
				tuple->t_data = NULL;
				return;
			}
			/* is this a parallel scan? */
			if (scan->rs_parallel != NULL)
			{
				page = heap_parallelscan_nextpage(scan);

				/* Other processes might have already finished the scan. */
				if (page == InvalidBlockNumber)
				{
					Assert(!BufferIsValid(scan->rs_cbuf));
					tuple->t_data = NULL;
					return;
				}
			}
			else /* get the first physical block */
				page = scan->rs_startblock;		/* first page */
            
			/* collect all visible tuples on this physical block */
			heapgetpage(scan, page);
			lineindex = 0;
			scan->rs_inited = true;
		}
		else
		{
			/*
			 * continue from previously returned page/tuple
			 */
			page = scan->rs_cblock;		/* current page */
			lineindex = scan->rs_cindex + 1;
		}

		/* step 2: compute the tuples remaining on the current page */
		dp = BufferGetPage(scan->rs_cbuf);
		TestForOldSnapshot(scan->rs_snapshot, scan->rs_rd, dp);
		lines = scan->rs_ntuples;
		/* page and lineindex now reference the next visible tid */

		linesleft = lines - lineindex;
	}
	else if (backward)
	{
		/* backward parallel scan not supported */
		Assert(scan->rs_parallel == NULL);

		if (!scan->rs_inited)
		{
			/*
			 * return null immediately if relation is empty
			 */
			if (scan->rs_nblocks == 0 || scan->rs_numblocks == 0)
			{
				Assert(!BufferIsValid(scan->rs_cbuf));
				tuple->t_data = NULL;
				return;
			}

			/*
			 * Disable reporting to syncscan logic in a backwards scan; it's
			 * not very likely anyone else is doing the same thing at the same
			 * time, and much more likely that we'll just bollix things for
			 * forward scanners.
			 */
			scan->rs_syncscan = false;
			/* start from last page of the scan */
			if (scan->rs_startblock > 0)
				page = scan->rs_startblock - 1;
			else
				page = scan->rs_nblocks - 1;
			heapgetpage(scan, page);
		}
		else
		{
			/* continue from previously returned page/tuple */
			page = scan->rs_cblock;		/* current page */
		}

		dp = BufferGetPage(scan->rs_cbuf);
		TestForOldSnapshot(scan->rs_snapshot, scan->rs_rd, dp);
		lines = scan->rs_ntuples;

		if (!scan->rs_inited)
		{
			lineindex = lines - 1;
			scan->rs_inited = true;
		}
		else
		{
			lineindex = scan->rs_cindex - 1;
		}
		/* page and lineindex now reference the previous visible tid */

		linesleft = lineindex + 1;
	}
	else
	{
		/*
		 * ``no movement'' scan direction: refetch prior tuple
		 */
		if (!scan->rs_inited)
		{
			Assert(!BufferIsValid(scan->rs_cbuf));
			tuple->t_data = NULL;
			return;
		}

		page = ItemPointerGetBlockNumber(&(tuple->t_self));
		if (page != scan->rs_cblock)
			heapgetpage(scan, page);

		/* Since the tuple was previously fetched, needn't lock page here */
		dp = BufferGetPage(scan->rs_cbuf);
		TestForOldSnapshot(scan->rs_snapshot, scan->rs_rd, dp);
		lineoff = ItemPointerGetOffsetNumber(&(tuple->t_self));
		lpp = PageGetItemId(dp, lineoff);
		Assert(ItemIdIsNormal(lpp));

		tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
		tuple->t_len = ItemIdGetLength(lpp);

		/* check that rs_cindex is in sync */
		Assert(scan->rs_cindex < scan->rs_ntuples);
		Assert(lineoff == scan->rs_vistuples[scan->rs_cindex]);

		return;
	}

	/*
	 * advance the scan until we find a qualifying tuple or run out of stuff
	 * to scan
	 */
	for (;;)
	{
		/* step 3: check the number of remaining tuples */
		while (linesleft > 0)
		{
			lineoff = scan->rs_vistuples[lineindex];
			lpp = PageGetItemId(dp, lineoff);
			Assert(ItemIdIsNormal(lpp));
			/*
			 * linesleft > 0: fetch the tuple at lineindex and return it.
			 * Returning a direct pointer here is in fact one of PostgreSQL's
			 * big advantages; this is explained later.
			 */
			tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
			tuple->t_len = ItemIdGetLength(lpp);
			ItemPointerSet(&(tuple->t_self), page, lineoff);

			/*
			 * if current tuple qualifies, return it.
			 */
			if (key != NULL)
			{
				bool		valid;

				HeapKeyTest(tuple, RelationGetDescr(scan->rs_rd),
							nkeys, key, valid);
				if (valid)
				{
					scan->rs_cindex = lineindex;
					return;
				}
			}
			else
			{
				scan->rs_cindex = lineindex;
				return;
			}

			/*
			 * otherwise move to the next item on the page
			 */
			--linesleft;
			if (backward)
				--lineindex;
			else
				++lineindex;
		}

		/*
		 * if we get here, it means we've exhausted the items on this page and
		 * it's time to move to the next.
		 * (no tuples left on this page)
		 * step 4: advance to the next physical block
		 */
		if (backward)
		{
			finished = (page == scan->rs_startblock) ||
				(scan->rs_numblocks != InvalidBlockNumber?--scan->rs_numblocks==0:false);
			if (page == 0)
				page = scan->rs_nblocks;
			page--;
		}
		else if (scan->rs_parallel != NULL)
		{
			page = heap_parallelscan_nextpage(scan);
			finished = (page == InvalidBlockNumber);
		}
		else
		{
			/* check whether unscanned physical blocks remain; if not, set finished */
			page++;
			if (page >= scan->rs_nblocks)
				page = 0;
			finished = (page == scan->rs_startblock) ||
				(scan->rs_numblocks != InvalidBlockNumber?--scan->rs_numblocks==0:false);

			/*
			 * Report our new scan position for synchronization purposes. We
			 * don't do that when moving backwards, however. That would just
			 * mess up any other forward-moving scanners.
			 *
			 * Note: we do this before checking for end of scan so that the
			 * final state of the position hint is back at the start of the
			 * rel.  That's not strictly necessary, but otherwise when you run
			 * the same query multiple times the starting position would shift
			 * a little bit backwards on every invocation, which is confusing.
			 * We don't guarantee any specific ordering in general, though.
			 */
			if (scan->rs_syncscan)
				ss_report_location(scan->rs_rd, page);
		}

		/*
		 * return NULL if we've exhausted all the pages
		 * (end of the scan)
		 */
		if (finished)
		{
			if (BufferIsValid(scan->rs_cbuf))
				ReleaseBuffer(scan->rs_cbuf);
			scan->rs_cbuf = InvalidBuffer;
			scan->rs_cblock = InvalidBlockNumber;
			tuple->t_data = NULL;
			scan->rs_inited = false;
			return;
		}
		
		/* collect the visible tuples of the next physical block */
		heapgetpage(scan, page);

		dp = BufferGetPage(scan->rs_cbuf);
		TestForOldSnapshot(scan->rs_snapshot, scan->rs_rd, dp);
		lines = scan->rs_ntuples;
		linesleft = lines;
		if (backward)
			lineindex = lines - 1;
		else
			lineindex = 0;  /* reset lineindex */
		/* back to step 3 */
	}
}

heapgetpage

Now we introduce the most central function of the sequential scan: heapgetpage. It collects all visible tuples of one physical block and stores their ItemIds in rs_vistuples. Its flow is:

  1. Load the physical block into a buffered page.

  2. Take a lightweight share lock on the buffered page

    A lightweight lock (LWLock) is a lock implemented by PostgreSQL itself. It is essentially a latch, used for multi-process synchronization on shared resources; it has a wait queue but no deadlock detection.

  3. Walk all tuples on the page, test each for visibility, and store the ItemIds of the visible ones in rs_vistuples.

  4. Release the lightweight share lock.
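The visibility pass (step 3) can be illustrated with a standalone sketch: one walk over the line pointers of a page, applying a visibility test and recording the offsets of visible tuples. ToyPage and toy_visible are hypothetical stand-ins; the trivial xmin-based predicate merely takes the place of HeapTupleSatisfiesVisibility, whose real logic is far more involved.

```c
#include <assert.h>

#define MAX_TUPLES 8

/* Toy page: each slot holds the xmin (inserting transaction id) of a
 * tuple; 0 marks an unused line pointer. */
typedef struct ToyPage
{
	int		ntuples;
	int		xmin[MAX_TUPLES];
} ToyPage;

/* Toy stand-in for HeapTupleSatisfiesVisibility: a tuple is visible if
 * its inserting transaction committed before our snapshot. */
static int
toy_visible(int xmin, int snapshot_xid)
{
	return xmin != 0 && xmin < snapshot_xid;
}

/* Mirrors the core loop of heapgetpage: one pass over the page that
 * fills vistuples[] with the offsets of visible tuples and returns
 * their count (rs_ntuples). */
int
toy_getpage(const ToyPage *page, int snapshot_xid, int vistuples[])
{
	int		ntup = 0;

	for (int off = 0; off < page->ntuples; off++)
		if (toy_visible(page->xmin[off], snapshot_xid))
			vistuples[ntup++] = off;
	return ntup;
}
```

Once vistuples[] is filled, the rest of the scan never re-checks visibility on this page; it simply walks the array, which is what makes page-at-a-time mode light-weight.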

Now let's look at the source code:

void
heapgetpage(HeapScanDesc scan, BlockNumber page)
{
	Buffer		buffer;
	Snapshot	snapshot;
	Page		dp;
	int			lines;
	int			ntup;
	OffsetNumber lineoff;
	ItemId		lpp;
	bool		all_visible;

	Assert(page < scan->rs_nblocks);

	/* release previous scan buffer, if any */
	if (BufferIsValid(scan->rs_cbuf))
	{
		ReleaseBuffer(scan->rs_cbuf);
		scan->rs_cbuf = InvalidBuffer;
	}

	/*
	 * Be sure to check for interrupts at least once per page.  Checks at
	 * higher code levels won't be able to stop a seqscan that encounters many
	 * pages' worth of consecutive dead tuples.
	 */
	CHECK_FOR_INTERRUPTS();

	/* 
	 * read page using selected strategy 
	 * step 1: load the physical block into a buffered page
	 */
	scan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM, page,
									   RBM_NORMAL, scan->rs_strategy);
	scan->rs_cblock = page;

	if (!scan->rs_pageatatime)
		return;

	buffer = scan->rs_cbuf;
	snapshot = scan->rs_snapshot;

	/*
	 * Prune and repair fragmentation for the whole page, if possible.
	 */
	heap_page_prune_opt(scan->rs_rd, buffer);

	/*
	 * We must hold share lock on the buffer content while examining tuple
	 * visibility.  Afterwards, however, the tuples we have found to be
	 * visible are guaranteed good as long as we hold the buffer pin.
	 * step 2: take a lightweight share lock on the buffered page
	 */
	LockBuffer(buffer, BUFFER_LOCK_SHARE);

	dp = BufferGetPage(buffer);
	TestForOldSnapshot(snapshot, scan->rs_rd, dp);
	lines = PageGetMaxOffsetNumber(dp);
	ntup = 0;

	/*
	 * If the all-visible flag indicates that all tuples on the page are
	 * visible to everyone, we can skip the per-tuple visibility tests.
	 *
	 * Note: In hot standby, a tuple that's already visible to all
	 * transactions in the master might still be invisible to a read-only
	 * transaction in the standby. We partly handle this problem by tracking
	 * the minimum xmin of visible tuples as the cut-off XID while marking a
	 * page all-visible on master and WAL log that along with the visibility
	 * map SET operation. In hot standby, we wait for (or abort) all
	 * transactions that can potentially may not see one or more tuples on the
	 * page. That's how index-only scans work fine in hot standby. A crucial
	 * difference between index-only scans and heap scans is that the
	 * index-only scan completely relies on the visibility map where as heap
	 * scan looks at the page-level PD_ALL_VISIBLE flag. We are not sure if
	 * the page-level flag can be trusted in the same way, because it might
	 * get propagated somehow without being explicitly WAL-logged, e.g. via a
	 * full page write. Until we can prove that beyond doubt, let's check each
	 * tuple for visibility the hard way.
	 * step 3: walk all tuples on the page, test their visibility, and
	 * store the ItemIds of visible tuples in rs_vistuples
	 */
	all_visible = PageIsAllVisible(dp) && !snapshot->takenDuringRecovery;

	for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
		 lineoff <= lines;
		 lineoff++, lpp++)
	{
		if (ItemIdIsNormal(lpp))
		{
			HeapTupleData loctup;
			bool		valid;

			loctup.t_tableOid = RelationGetRelid(scan->rs_rd);
			loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
			loctup.t_len = ItemIdGetLength(lpp);
			ItemPointerSet(&(loctup.t_self), page, lineoff);

			if (all_visible)
				valid = true;
			else
				valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);

			CheckForSerializableConflictOut(valid, scan->rs_rd, &loctup,
											buffer, snapshot);

			if (valid)
				scan->rs_vistuples[ntup++] = lineoff;
		}
	}
	/* step 4: release the lightweight share lock */
	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

	Assert(ntup <= MaxHeapTuplesPerPage);
	scan->rs_ntuples = ntup;
}

The Advantage of PostgreSQL's Sequential Scan

While walking through heapgettup_pagemode, it is easy to see that a visible tuple is returned as a direct pointer to its location on the buffered page. The relevant code (also marked with a comment in heapgettup_pagemode above):

tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
tuple->t_len = ItemIdGetLength(lpp);

Databases normally do not return a pointer to the tuple on the page itself; instead they copy the page and return a pointer into the copy, or copy the tuple and return the copy. The reason is concurrency: queries do not lock pages or tuples; they only latch a page (with the lightweight lock discussed above) when necessary. While a query runs, the page may be modified concurrently, and MVCC guarantees that the query still reads consistent data. A tuple fetched by a reading process may therefore be changed by a writing process before the query finishes; returning a raw page pointer would yield inconsistent data, which is why visible tuples normally have to be copied. In PostgreSQL, however, an UPDATE is a DELETE plus an INSERT (a so-called out-of-place update): once a tuple has been inserted, its contents are never modified, so the multi-process hazard above does not arise and the raw pointer can safely be used.

Of course, compared with in-place update, out-of-place update causes considerable table bloat. That topic will be covered in a dedicated article.
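The argument can be made concrete with a toy version chain (hypothetical structures, ignoring visibility and vacuum): an "update" never touches the old tuple's bytes, it only appends a new version, so a raw pointer handed out before the update still reads the old, consistent value.

```c
#include <assert.h>

#define MAX_VERSIONS 4

/* Toy heap in which update = insert of a new version; existing
 * versions are never modified in place (out-of-place update). */
typedef struct ToyHeap
{
	int		nversions;
	int		values[MAX_VERSIONS];
} ToyHeap;

/* A reader keeps a raw pointer into the heap, just as
 * heapgettup_pagemode returns a pointer into the buffered page. */
const int *
toy_read(const ToyHeap *h, int version)
{
	return &h->values[version];
}

/* "UPDATE": append a new version, leave old versions untouched.
 * Returns the index of the new version. */
int
toy_update(ToyHeap *h, int newval)
{
	h->values[h->nversions] = newval;
	return h->nversions++;
}
```

With in-place update, toy_update would overwrite values[0] and the reader's pointer would silently see the new value mid-query; here the old version stays byte-identical for as long as the pointer is held.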

Buffered Page Eviction

Of the four questions raised earlier, we have now answered the first two. Determining tuple visibility is relatively involved; see "PostgreSQL Transactions: MVCC". Now for the last question: when buffered pages may be evicted.

As noted above, a tuple is returned as a raw pointer to its location on the buffered page, and every subsequent operation on the tuple goes through this pointer. Clearly, as long as the tuple is still needed, the buffered page holding it must not be evicted. To guarantee that a page being accessed is not evicted, every buffered page in PostgreSQL carries a reference count, and a page whose reference count is nonzero cannot be evicted. Incrementing the reference count of a page is called pinning it; decrementing the count is called unpinning. Let's see how the reference counts are managed during a sequential scan.

Pages are pinned and unpinned in two functions:

  • heapgetpage
  • ExecStoreTuple

heapgetpage has been described above; here we focus on when it is called. Since heapgetpage collects all visible tuples of a page, it is naturally called on the first access to that page, specifically in two situations:

  • at the start of the scan, when the block at rs_startblock is accessed;
  • on a block switch (when one block has been fully scanned and the next one begins).

ExecStoreTuple, on the other hand, is called every time a tuple is returned.

Suppose the scan has just finished page0 and moves on to page1; page1 must now be pinned. The flow is:

  1. When heapgetpage loads a block into buffered page1 via ReadBufferExtended, ReadBufferExtended internally pins page1 for the first time, bringing its reference count to 1.

  2. After the first visible tuple of page1 is returned, ExecStoreTuple is called; its implementation:

    TupleTableSlot *
    ExecStoreTuple(HeapTuple tuple,
    			   TupleTableSlot *slot,
    			   Buffer buffer,
    			   bool shouldFree)
    {
    	/*
    	 * sanity checks
    	 */
    	Assert(tuple != NULL);
    	Assert(slot != NULL);
    	Assert(slot->tts_tupleDescriptor != NULL);
    	/* passing shouldFree=true for a tuple on a disk page is not sane */
    	Assert(BufferIsValid(buffer) ? (!shouldFree) : true);
    
    	/*
    	 * Free any old physical tuple belonging to the slot.
    	 */
    	if (slot->tts_shouldFree)
    		heap_freetuple(slot->tts_tuple);
    	if (slot->tts_shouldFreeMin)
    		heap_free_minimal_tuple(slot->tts_mintuple);
    
    	/*
    	 * Store the new tuple into the specified slot.
    	 */
    	slot->tts_isempty = false;
    	slot->tts_shouldFree = shouldFree;
    	slot->tts_shouldFreeMin = false;
    	slot->tts_tuple = tuple;
    	slot->tts_mintuple = NULL;
    
    	/* Mark extracted state invalid */
    	slot->tts_nvalid = 0;
    
    	/*
    	 * If tuple is on a disk page, keep the page pinned as long as we hold a
    	 * pointer into it.  We assume the caller already has such a pin.
    	 *
    	 * This is coded to optimize the case where the slot previously held a
    	 * tuple on the same disk page: in that case releasing and re-acquiring
    	 * the pin is a waste of cycles.  This is a common situation during
    	 * seqscans, so it's worth troubling over.
    	 */
    	if (slot->tts_buffer != buffer)
    	{
    		if (BufferIsValid(slot->tts_buffer))
    			ReleaseBuffer(slot->tts_buffer);
    		slot->tts_buffer = buffer;
    		if (BufferIsValid(buffer))
    			IncrBufferRefCount(buffer);
    	}
    
    	return slot;
    }
    

    Note the block near the end beginning with if (slot->tts_buffer != buffer). Since the scan has switched from page0 to page1, slot->tts_buffer (still page0) differs from buffer (page1), so page1's reference count is incremented a second time, going from 1 to 2.

Now suppose the scan has finished page1 and moves on to page2. The scan of page1 is complete, so page1 can be unpinned, making it evictable again. The unpin of page1 proceeds as follows:

  1. From the call timing of heapgetpage described above, whenever heapgetpage finds the current scan buffer valid, a page switch has occurred; so page1 is unpinned here for the first time, dropping its reference count to 1.

    	if (BufferIsValid(scan->rs_cbuf))
    	{
    		ReleaseBuffer(scan->rs_cbuf);	/* ReleaseBuffer unpins the page */
    		scan->rs_cbuf = InvalidBuffer;
    	}
    
  2. After the first visible tuple of page2 is returned, ExecStoreTuple is called again; this time it decrements page1's reference count a second time, dropping it from 1 to 0.

    if (BufferIsValid(slot->tts_buffer))
    	ReleaseBuffer(slot->tts_buffer);
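The pin/unpin choreography above can be simulated with plain counters. This is a toy sketch: in PostgreSQL the reference counts live in the shared buffer descriptors, and the functions below are hypothetical stand-ins for heapgetpage and ExecStoreTuple.

```c
#include <assert.h>

/* Toy buffer reference counts, indexed by buffer id. */
#define NBUFFERS 4
static int refcount[NBUFFERS];

static void pin(int buf)   { refcount[buf]++; }
static void unpin(int buf) { refcount[buf]--; }

/* Mirrors heapgetpage: release the previous scan buffer, pin the new one. */
static int scan_buf = -1;
void
toy_heapgetpage(int buf)
{
	if (scan_buf != -1)
		unpin(scan_buf);	/* first unpin of the old page */
	pin(buf);				/* first pin of the new page (ReadBufferExtended) */
	scan_buf = buf;
}

/* Mirrors ExecStoreTuple: swap the slot's pin only when the page changes. */
static int slot_buf = -1;
void
toy_store_tuple(int buf)
{
	if (slot_buf != buf)
	{
		if (slot_buf != -1)
			unpin(slot_buf);	/* second unpin of the old page */
		slot_buf = buf;
		pin(buf);				/* second pin of the new page */
	}
}
```

Running a page switch through this sketch reproduces the counts described above: page1 goes 0 → 1 → 2 while being scanned, then back 2 → 1 → 0 once the scan moves to page2, at which point page1 becomes evictable.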
    

Open Questions

  1. Under what conditions can page-at-a-time mode be used, and when can it not?