PostgreSQL Flow: Sequential Scan

Sequential Scan

Prerequisites

PostgreSQL Flow: Query

Overview

In "PostgreSQL Flow: Query" we focused on PostgreSQL's query flow and noted that the full-table (sequential) scan is implemented mainly by the function SeqNext. This article examines the flow of SeqNext in detail.

SeqNext

SeqNext is fairly short, so let's go straight to the code:

static TupleTableSlot *
SeqNext(SeqScanState *node)
{
	HeapTuple	tuple;
	HeapScanDesc scandesc;
	EState	   *estate;
	ScanDirection direction;
	TupleTableSlot *slot;

	/*
	 * get information from the estate and scan state
	 */
	scandesc = node->ss.ss_currentScanDesc;
	estate = node->ss.ps.state;
	direction = estate->es_direction;
	slot = node->ss.ss_ScanTupleSlot;
	
	/* if no scandesc exists yet, create one */
	if (scandesc == NULL)
	{
		/*
		 * We reach here if the scan is not parallel, or if we're serially
		 * executing a scan that was planned to be parallel.
		 */
		scandesc = heap_beginscan(node->ss.ss_currentRelation,
								  estate->es_snapshot,
								  0, NULL);
		node->ss.ss_currentScanDesc = scandesc;
	}

	/*
	 * get the next tuple from the table
	 * (i.e. fetch one visible tuple)
	 */
	tuple = heap_getnext(scandesc, direction);

	/*
	 * save the tuple and the buffer returned to us by the access methods in
	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
	 * tuples returned by heap_getnext() are pointers onto disk pages and were
	 * not created with palloc() and so should not be pfree()'d.  Note also
	 * that ExecStoreTuple will increment the refcount of the buffer; the
	 * refcount will not be dropped until the tuple table slot is cleared.
	 * (Cleanup work also happens here, e.g. unpinning the previous buffer page.)
	 */
	if (tuple)
		ExecStoreTuple(tuple,	/* tuple to store */
					   slot,	/* slot to store in */
					   scandesc->rs_cbuf,		/* buffer associated with this
												 * tuple */
					   false);	/* don't pfree this pointer */
	else
		ExecClearTuple(slot);

	return slot;
}

SeqNext does three things:

  1. If no scandesc exists, create and initialize one

    Since SeqNext returns only one visible tuple per call, it needs an iterator that records which tuple of which buffered page the scan has reached; scandesc is that iterator. scandesc is a structure of type HeapScanDesc, defined as follows:

    typedef struct HeapScanDescData
    {
    	/* scan parameters */
    	Relation	rs_rd;			/* heap relation descriptor */
    	Snapshot	rs_snapshot;	/* snapshot to see */
    	int			rs_nkeys;		/* number of scan keys */
    	ScanKey		rs_key;			/* array of scan key descriptors */
    	bool		rs_bitmapscan;	/* true if this is really a bitmap scan */
    	bool		rs_samplescan;	/* true if this is really a sample scan */
    	bool		rs_pageatatime; /* verify visibility page-at-a-time? */
    	bool		rs_allow_strat; /* allow or disallow use of access strategy */
    	bool		rs_allow_sync;	/* allow or disallow use of syncscan */
    	bool		rs_temp_snap;	/* unregister snapshot at scan end? */
    
    	/* state set up at initscan time */
    	BlockNumber rs_nblocks;		/* total number of blocks in rel */
    	BlockNumber rs_startblock;	/* block # to start at */
    	BlockNumber rs_numblocks;	/* max number of blocks to scan */
    	/* rs_numblocks is usually InvalidBlockNumber, meaning "scan whole rel" */
    	BufferAccessStrategy rs_strategy;	/* access strategy for reads */
    	bool		rs_syncscan;	/* report location to syncscan logic? */
    
    	/* scan current state */
    	bool		rs_inited;		/* false = scan not init'd yet */
    	HeapTupleData rs_ctup;		/* current tuple in scan, if any */
    	BlockNumber rs_cblock;		/* current block # in scan, if any */
    	Buffer		rs_cbuf;		/* current buffer in scan, if any */
    	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
    	ParallelHeapScanDesc rs_parallel;	/* parallel scan information */
    
    	/* these fields only used in page-at-a-time mode and for bitmap scans */
    	int			rs_cindex;		/* current tuple's index in vistuples */
    	int			rs_ntuples;		/* number of visible tuples on page */
    	OffsetNumber rs_vistuples[MaxHeapTuplesPerPage];	/* their offsets */
    }	HeapScanDescData;
    
    typedef struct HeapScanDescData *HeapScanDesc;
    

    HeapScanDesc has quite a few members, all of which are initialized in heap_beginscan. The members involved in a sequential scan are explained as they appear in the flow below.

  2. Call heap_getnext to fetch one visible tuple

    This is another core function of the sequential scan.

  3. Call ExecStoreTuple or ExecClearTuple to store (or clear) the returned tuple and perform some cleanup
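The three responsibilities above can be sketched with a toy iterator. This is a minimal illustration, not the real PostgreSQL structures: ToyScanDesc, ToyScanState, and toy_seq_next are hypothetical stand-ins for HeapScanDesc, SeqScanState, and SeqNext, and a tuple is reduced to an integer index.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for HeapScanDesc: a lazily created iterator that
 * remembers the current scan position between calls. */
typedef struct ToyScanDesc
{
	int		cur;		/* current index, plays the role of rs_cindex */
	int		ntuples;	/* total number of items to scan */
} ToyScanDesc;

/* Toy stand-in for SeqScanState. */
typedef struct ToyScanState
{
	ToyScanDesc *scandesc;	/* NULL until the first call, like ss_currentScanDesc */
	int		ntuples;
} ToyScanState;

/* Returns the next item index, or -1 when the scan is exhausted. */
int
toy_seq_next(ToyScanState *node)
{
	ToyScanDesc *scandesc = node->scandesc;

	if (scandesc == NULL)
	{
		/* first call: create and initialize the iterator, as SeqNext
		 * does via heap_beginscan */
		scandesc = malloc(sizeof(ToyScanDesc));
		scandesc->cur = -1;
		scandesc->ntuples = node->ntuples;
		node->scandesc = scandesc;
	}

	if (scandesc->cur + 1 >= scandesc->ntuples)
		return -1;				/* like ExecClearTuple: nothing left */
	return ++scandesc->cur;		/* like ExecStoreTuple: yield one tuple */
}
```

Because each call returns exactly one tuple, the scan position must survive between calls, which is why it lives in the descriptor rather than in a local variable.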

Sequential Scan Flow

Put simply, a sequential scan walks a table record by record, from the first record of the first block to the last record of the last block.

The whole process involves many details. We condense these details into questions, and work through the flow by answering each one:

  1. How is the starting position of the scan obtained?
  2. How are records actually fetched?
  3. How is the visibility of a record determined?
  4. While a buffered page is being scanned, how do we guarantee it is not evicted, and when may it be evicted?

How the starting position is obtained

The starting position is obviously the first record of the first block of the table. The implementation involves two members of HeapScanDesc:

  • rs_rd: the relation (table) being scanned.
  • rs_startblock: the block number of the block where the scan starts. Note that this is a physical block number; before the block can be accessed, it must be loaded into a memory page!

How records are fetched

Records are fetched in heap_getnext. Let's look at its code:

HeapTuple
heap_getnext(HeapScanDesc scan, ScanDirection direction)
{
	/* Note: no locking manipulations needed */

	HEAPDEBUG_1;				/* heap_getnext( info ) */

	if (scan->rs_pageatatime)
		heapgettup_pagemode(scan, direction,
							scan->rs_nkeys, scan->rs_key);
	else
		heapgettup(scan, direction, scan->rs_nkeys, scan->rs_key);

	if (scan->rs_ctup.t_data == NULL)
	{
		HEAPDEBUG_2;			/* heap_getnext returning EOS */
		return NULL;
	}

	/*
	 * if we get here it means we have a new current scan tuple, so point to
	 * the proper return buffer and return the tuple.
	 */
	HEAPDEBUG_3;				/* heap_getnext returning tuple */

	pgstat_count_heap_getnext(scan->rs_rd);

	return &(scan->rs_ctup);
}

One member of HeapScanDesc is involved here:

  • rs_pageatatime

    rs_pageatatime indicates whether page-at-a-time mode may be used. Page-at-a-time is a sequential-scan mode in which heapgettup_pagemode is called instead of heapgettup; the two functions are functionally identical, but heapgettup_pagemode is the lighter-weight implementation.

Next we look at the implementation of heapgettup_pagemode in detail.

heapgettup_pagemode

heapgettup_pagemode is the lowest-level function of the sequential scan; it returns one visible tuple. Its flow is as follows:

  1. Check whether the iterator has been initialized (via the rs_inited member of HeapScanDesc)

    a. Not initialized

    If the iterator is not initialized, the scan has just begun. First determine whether the table has any data blocks (the rs_nblocks member of HeapScanDesc) and how many blocks may be scanned (the rs_numblocks member). If either rs_nblocks or rs_numblocks is 0, the table has no blocks to scan, so return immediately.

    If both rs_nblocks and rs_numblocks are nonzero, obtain the first block to scan from the rs_startblock member of HeapScanDesc. Then call heapgetpage, which collects the ItemIds (the fixed-size line-pointer entries) of all visible tuples on the block and stores them in the rs_vistuples member of HeapScanDesc; subsequent tuples are fetched by walking rs_vistuples. heapgetpage is explained in detail later.

    Since the scan has just started, the tuple to fetch is obviously the first visible tuple, i.e. the one at rs_vistuples[0], so lineindex is set to 0.

    Finally rs_inited is set to true to mark the iterator as initialized.

    b. Already initialized

    Continue the scan: on the current buffered page (recorded in the rs_cblock member of HeapScanDesc), fetch the tuple after the one returned last (recorded in the rs_cindex member). Set lineindex to rs_cindex + 1.

  2. Compute the number of tuples remaining on the current page

    lines = scan->rs_ntuples;			/* rs_ntuples is set in heapgetpage */
    linesleft = lines - lineindex;
    
  3. Check the number of remaining tuples

    a. > 0

    Fetch the tuple at lineindex, update the current tuple index (set scan->rs_cindex to lineindex), and return.

    b. = 0

    Go to step 4.

  4. Advance to the next physical block

    If there are still unscanned physical blocks, move to the next one and call heapgetpage to collect its visible tuples. Reset lineindex to 0 and go back to step 3. If all physical blocks have been scanned, return directly.
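The four steps above boil down to a two-level loop: an outer loop over blocks and an inner loop over the visible tuples of the current block. A minimal standalone sketch, assuming toy types (ToyScan is a hypothetical stand-in for HeapScanDesc), forward direction only and no scan keys:

```c
#include <assert.h>

#define NBLOCKS  3			/* plays the role of rs_nblocks */
#define PER_PAGE 2			/* plays the role of rs_ntuples, same on every page */

/* Toy iterator state: which block we are on, which visible-tuple index
 * was returned last, and whether the scan has been initialized. */
typedef struct ToyScan
{
	int		inited;		/* mirrors rs_inited */
	int		cblock;		/* mirrors rs_cblock */
	int		cindex;		/* mirrors rs_cindex */
} ToyScan;

/* Returns a global tuple id, or -1 at end of scan. */
int
toy_gettup_pagemode(ToyScan *scan)
{
	int		lineindex;

	if (!scan->inited)
	{
		/* step 1a: start at block 0, first visible tuple */
		scan->cblock = 0;
		lineindex = 0;
		scan->inited = 1;
	}
	else
		lineindex = scan->cindex + 1;	/* step 1b: continue after last tuple */

	for (;;)
	{
		/* steps 2-3: any tuples left on this page? */
		if (lineindex < PER_PAGE)
		{
			scan->cindex = lineindex;
			return scan->cblock * PER_PAGE + lineindex;
		}

		/* step 4: advance to the next block, or finish */
		if (scan->cblock + 1 >= NBLOCKS)
			return -1;
		scan->cblock++;
		lineindex = 0;					/* reset lineindex, back to step 3 */
	}
}
```

Each call either returns the next visible tuple of the current page or crosses into the next block, exactly the shape of the real function below.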

Now let's look at the code of heapgettup_pagemode:

static void
heapgettup_pagemode(HeapScanDesc scan,
					ScanDirection dir,
					int nkeys,
					ScanKey key)
{
	HeapTuple	tuple = &(scan->rs_ctup);
	bool		backward = ScanDirectionIsBackward(dir);
	BlockNumber page;
	bool		finished;
	Page		dp;
	int			lines;
	int			lineindex;
	OffsetNumber lineoff;
	int			linesleft;
	ItemId		lpp;

	/*
	 * calculate next starting lineindex, given scan direction
	 */
	if (ScanDirectionIsForward(dir))
	{
		/* step 1: check whether the iterator is initialized */
		if (!scan->rs_inited)
		{
			/*
			 * return null immediately if relation is empty
			 * (check whether there are any blocks to scan)
			 */
			if (scan->rs_nblocks == 0 || scan->rs_numblocks == 0)
			{
				/* no blocks to scan; return immediately */
				Assert(!BufferIsValid(scan->rs_cbuf));
				tuple->t_data = NULL;
				return;
			}
			/* is this a parallel scan? */
			if (scan->rs_parallel != NULL)
			{
				page = heap_parallelscan_nextpage(scan);

				/* Other processes might have already finished the scan. */
				if (page == InvalidBlockNumber)
				{
					Assert(!BufferIsValid(scan->rs_cbuf));
					tuple->t_data = NULL;
					return;
				}
			}
			else /* get the first physical block */
				page = scan->rs_startblock;		/* first page */
            
			/* collect all visible tuples on this physical block */
			heapgetpage(scan, page);
			lineindex = 0;
			scan->rs_inited = true;
		}
		else
		{
			/*
			 * continue from previously returned page/tuple
			 */
			page = scan->rs_cblock;		/* current page */
			lineindex = scan->rs_cindex + 1;
		}

		/* step 2: compute the tuples remaining on the current page */
		dp = BufferGetPage(scan->rs_cbuf);
		TestForOldSnapshot(scan->rs_snapshot, scan->rs_rd, dp);
		lines = scan->rs_ntuples;
		/* page and lineindex now reference the next visible tid */

		linesleft = lines - lineindex;
	}
	else if (backward)
	{
		/* backward parallel scan not supported */
		Assert(scan->rs_parallel == NULL);

		if (!scan->rs_inited)
		{
			/*
			 * return null immediately if relation is empty
			 */
			if (scan->rs_nblocks == 0 || scan->rs_numblocks == 0)
			{
				Assert(!BufferIsValid(scan->rs_cbuf));
				tuple->t_data = NULL;
				return;
			}

			/*
			 * Disable reporting to syncscan logic in a backwards scan; it's
			 * not very likely anyone else is doing the same thing at the same
			 * time, and much more likely that we'll just bollix things for
			 * forward scanners.
			 */
			scan->rs_syncscan = false;
			/* start from last page of the scan */
			if (scan->rs_startblock > 0)
				page = scan->rs_startblock - 1;
			else
				page = scan->rs_nblocks - 1;
			heapgetpage(scan, page);
		}
		else
		{
			/* continue from previously returned page/tuple */
			page = scan->rs_cblock;		/* current page */
		}

		dp = BufferGetPage(scan->rs_cbuf);
		TestForOldSnapshot(scan->rs_snapshot, scan->rs_rd, dp);
		lines = scan->rs_ntuples;

		if (!scan->rs_inited)
		{
			lineindex = lines - 1;
			scan->rs_inited = true;
		}
		else
		{
			lineindex = scan->rs_cindex - 1;
		}
		/* page and lineindex now reference the previous visible tid */

		linesleft = lineindex + 1;
	}
	else
	{
		/*
		 * ``no movement'' scan direction: refetch prior tuple
		 */
		if (!scan->rs_inited)
		{
			Assert(!BufferIsValid(scan->rs_cbuf));
			tuple->t_data = NULL;
			return;
		}

		page = ItemPointerGetBlockNumber(&(tuple->t_self));
		if (page != scan->rs_cblock)
			heapgetpage(scan, page);

		/* Since the tuple was previously fetched, needn't lock page here */
		dp = BufferGetPage(scan->rs_cbuf);
		TestForOldSnapshot(scan->rs_snapshot, scan->rs_rd, dp);
		lineoff = ItemPointerGetOffsetNumber(&(tuple->t_self));
		lpp = PageGetItemId(dp, lineoff);
		Assert(ItemIdIsNormal(lpp));

		tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
		tuple->t_len = ItemIdGetLength(lpp);

		/* check that rs_cindex is in sync */
		Assert(scan->rs_cindex < scan->rs_ntuples);
		Assert(lineoff == scan->rs_vistuples[scan->rs_cindex]);

		return;
	}

	/*
	 * advance the scan until we find a qualifying tuple or run out of stuff
	 * to scan
	 */
	for (;;)
	{
		/* step 3: check the number of remaining tuples */
		while (linesleft > 0)
		{
			lineoff = scan->rs_vistuples[lineindex];
			lpp = PageGetItemId(dp, lineoff);
			Assert(ItemIdIsNormal(lpp));
			/*
			 * linesleft > 0: fetch the tuple at lineindex and return it.
			 * Returning a direct pointer here is in fact one of PostgreSQL's
			 * big advantages; this is explained later.
			 */
			tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
			tuple->t_len = ItemIdGetLength(lpp);
			ItemPointerSet(&(tuple->t_self), page, lineoff);

			/*
			 * if current tuple qualifies, return it.
			 */
			if (key != NULL)
			{
				bool		valid;

				HeapKeyTest(tuple, RelationGetDescr(scan->rs_rd),
							nkeys, key, valid);
				if (valid)
				{
					scan->rs_cindex = lineindex;
					return;
				}
			}
			else
			{
				scan->rs_cindex = lineindex;
				return;
			}

			/*
			 * otherwise move to the next item on the page
			 */
			--linesleft;
			if (backward)
				--lineindex;
			else
				++lineindex;
		}

		/*
		 * if we get here, it means we've exhausted the items on this page and
		 * it's time to move to the next.
		 * (no tuples left on this page)
		 * step 4: advance to the next physical block
		 */
		if (backward)
		{
			finished = (page == scan->rs_startblock) ||
				(scan->rs_numblocks != InvalidBlockNumber?--scan->rs_numblocks==0:false);
			if (page == 0)
				page = scan->rs_nblocks;
			page--;
		}
		else if (scan->rs_parallel != NULL)
		{
			page = heap_parallelscan_nextpage(scan);
			finished = (page == InvalidBlockNumber);
		}
		else
		{
			/* check whether unscanned physical blocks remain; if not, set finished */
			page++;
			if (page >= scan->rs_nblocks)
				page = 0;
			finished = (page == scan->rs_startblock) ||
				(scan->rs_numblocks != InvalidBlockNumber?--scan->rs_numblocks==0:false);

			/*
			 * Report our new scan position for synchronization purposes. We
			 * don't do that when moving backwards, however. That would just
			 * mess up any other forward-moving scanners.
			 *
			 * Note: we do this before checking for end of scan so that the
			 * final state of the position hint is back at the start of the
			 * rel.  That's not strictly necessary, but otherwise when you run
			 * the same query multiple times the starting position would shift
			 * a little bit backwards on every invocation, which is confusing.
			 * We don't guarantee any specific ordering in general, though.
			 */
			if (scan->rs_syncscan)
				ss_report_location(scan->rs_rd, page);
		}

		/*
		 * return NULL if we've exhausted all the pages
		 * (end of the scan)
		 */
		if (finished)
		{
			if (BufferIsValid(scan->rs_cbuf))
				ReleaseBuffer(scan->rs_cbuf);
			scan->rs_cbuf = InvalidBuffer;
			scan->rs_cblock = InvalidBlockNumber;
			tuple->t_data = NULL;
			scan->rs_inited = false;
			return;
		}
		
		/* collect the visible tuples of the next physical block */
		heapgetpage(scan, page);

		dp = BufferGetPage(scan->rs_cbuf);
		TestForOldSnapshot(scan->rs_snapshot, scan->rs_rd, dp);
		lines = scan->rs_ntuples;
		linesleft = lines;
		if (backward)
			lineindex = lines - 1;
		else
			lineindex = 0;  /* reset lineindex */
		/* back to step 3 */
	}
}

heapgetpage

Now we introduce the most central function of the sequential scan: heapgetpage. It collects all visible tuples of one physical block and stores their ItemIds in rs_vistuples. Its flow is:

  1. Load the physical block into a buffered page.

  2. Take a lightweight share lock on the buffered page

    A lightweight lock (LWLock) is a lock implemented by PostgreSQL itself. It is essentially a latch, used for multi-process synchronization on shared resources; it has a wait queue but no deadlock detection.

  3. Walk all tuples on the page, test each for visibility, and store the ItemIds of the visible ones in rs_vistuples.

  4. Release the lightweight share lock.
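The visibility pass (step 3) can be illustrated with a standalone sketch: one walk over the line pointers of a page, applying a visibility test and recording the offsets of visible tuples. ToyPage and toy_visible are hypothetical stand-ins; the trivial xmin-based predicate merely takes the place of HeapTupleSatisfiesVisibility, whose real logic is far more involved.

```c
#include <assert.h>

#define MAX_TUPLES 8

/* Toy page: each slot holds the xmin (inserting transaction id) of a
 * tuple; 0 marks an unused line pointer. */
typedef struct ToyPage
{
	int		ntuples;
	int		xmin[MAX_TUPLES];
} ToyPage;

/* Toy stand-in for HeapTupleSatisfiesVisibility: a tuple is visible if
 * its inserting transaction committed before our snapshot. */
static int
toy_visible(int xmin, int snapshot_xid)
{
	return xmin != 0 && xmin < snapshot_xid;
}

/* Mirrors the core loop of heapgetpage: one pass over the page that
 * fills vistuples[] with the offsets of visible tuples and returns
 * their count (rs_ntuples). */
int
toy_getpage(const ToyPage *page, int snapshot_xid, int vistuples[])
{
	int		ntup = 0;

	for (int off = 0; off < page->ntuples; off++)
		if (toy_visible(page->xmin[off], snapshot_xid))
			vistuples[ntup++] = off;
	return ntup;
}
```

Once vistuples[] is filled, the rest of the scan never re-checks visibility on this page; it simply walks the array, which is what makes page-at-a-time mode light-weight.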

Now let's look at the source code:

void
heapgetpage(HeapScanDesc scan, BlockNumber page)
{
	Buffer		buffer;
	Snapshot	snapshot;
	Page		dp;
	int			lines;
	int			ntup;
	OffsetNumber lineoff;
	ItemId		lpp;
	bool		all_visible;

	Assert(page < scan->rs_nblocks);

	/* release previous scan buffer, if any */
	if (BufferIsValid(scan->rs_cbuf))
	{
		ReleaseBuffer(scan->rs_cbuf);
		scan->rs_cbuf = InvalidBuffer;
	}

	/*
	 * Be sure to check for interrupts at least once per page.  Checks at
	 * higher code levels won't be able to stop a seqscan that encounters many
	 * pages' worth of consecutive dead tuples.
	 */
	CHECK_FOR_INTERRUPTS();

	/* 
	 * read page using selected strategy 
	 * step 1: load the physical block into a buffered page
	 */
	scan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM, page,
									   RBM_NORMAL, scan->rs_strategy);
	scan->rs_cblock = page;

	if (!scan->rs_pageatatime)
		return;

	buffer = scan->rs_cbuf;
	snapshot = scan->rs_snapshot;

	/*
	 * Prune and repair fragmentation for the whole page, if possible.
	 */
	heap_page_prune_opt(scan->rs_rd, buffer);

	/*
	 * We must hold share lock on the buffer content while examining tuple
	 * visibility.  Afterwards, however, the tuples we have found to be
	 * visible are guaranteed good as long as we hold the buffer pin.
	 * step 2: take a lightweight share lock on the buffered page
	 */
	LockBuffer(buffer, BUFFER_LOCK_SHARE);

	dp = BufferGetPage(buffer);
	TestForOldSnapshot(snapshot, scan->rs_rd, dp);
	lines = PageGetMaxOffsetNumber(dp);
	ntup = 0;

	/*
	 * If the all-visible flag indicates that all tuples on the page are
	 * visible to everyone, we can skip the per-tuple visibility tests.
	 *
	 * Note: In hot standby, a tuple that's already visible to all
	 * transactions in the master might still be invisible to a read-only
	 * transaction in the standby. We partly handle this problem by tracking
	 * the minimum xmin of visible tuples as the cut-off XID while marking a
	 * page all-visible on master and WAL log that along with the visibility
	 * map SET operation. In hot standby, we wait for (or abort) all
	 * transactions that can potentially may not see one or more tuples on the
	 * page. That's how index-only scans work fine in hot standby. A crucial
	 * difference between index-only scans and heap scans is that the
	 * index-only scan completely relies on the visibility map where as heap
	 * scan looks at the page-level PD_ALL_VISIBLE flag. We are not sure if
	 * the page-level flag can be trusted in the same way, because it might
	 * get propagated somehow without being explicitly WAL-logged, e.g. via a
	 * full page write. Until we can prove that beyond doubt, let's check each
	 * tuple for visibility the hard way.
	 * step 3: walk all tuples on the page, test their visibility, and
	 * store the ItemIds of visible tuples in rs_vistuples
	 */
	all_visible = PageIsAllVisible(dp) && !snapshot->takenDuringRecovery;

	for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
		 lineoff <= lines;
		 lineoff++, lpp++)
	{
		if (ItemIdIsNormal(lpp))
		{
			HeapTupleData loctup;
			bool		valid;

			loctup.t_tableOid = RelationGetRelid(scan->rs_rd);
			loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
			loctup.t_len = ItemIdGetLength(lpp);
			ItemPointerSet(&(loctup.t_self), page, lineoff);

			if (all_visible)
				valid = true;
			else
				valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);

			CheckForSerializableConflictOut(valid, scan->rs_rd, &loctup,
											buffer, snapshot);

			if (valid)
				scan->rs_vistuples[ntup++] = lineoff;
		}
	}
	/* step 4: release the lightweight share lock */
	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

	Assert(ntup <= MaxHeapTuplesPerPage);
	scan->rs_ntuples = ntup;
}

The Advantage of PostgreSQL's Sequential Scan

While walking through heapgettup_pagemode, it is easy to see that a visible tuple is returned as a direct pointer to its location on the buffered page. The relevant code (also marked with a comment in heapgettup_pagemode above):

tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
tuple->t_len = ItemIdGetLength(lpp);

Databases normally do not return a pointer to the tuple on the page itself; instead they copy the page and return a pointer into the copy, or copy the tuple and return the copy. The reason is concurrency: queries do not lock pages or tuples; they only latch a page (with the lightweight lock discussed above) when necessary. While a query runs, the page may be modified concurrently, and MVCC guarantees that the query still reads consistent data. A tuple fetched by a reading process may therefore be changed by a writing process before the query finishes; returning a raw page pointer would yield inconsistent data, which is why visible tuples normally have to be copied. In PostgreSQL, however, an UPDATE is a DELETE plus an INSERT (a so-called out-of-place update): once a tuple has been inserted, its contents are never modified, so the multi-process hazard above does not arise and the raw pointer can safely be used.

Of course, compared with in-place update, out-of-place update causes considerable table bloat. That topic will be covered in a dedicated article.
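The argument can be made concrete with a toy version chain (hypothetical structures, ignoring visibility and vacuum): an "update" never touches the old tuple's bytes, it only appends a new version, so a raw pointer handed out before the update still reads the old, consistent value.

```c
#include <assert.h>

#define MAX_VERSIONS 4

/* Toy heap in which update = insert of a new version; existing
 * versions are never modified in place (out-of-place update). */
typedef struct ToyHeap
{
	int		nversions;
	int		values[MAX_VERSIONS];
} ToyHeap;

/* A reader keeps a raw pointer into the heap, just as
 * heapgettup_pagemode returns a pointer into the buffered page. */
const int *
toy_read(const ToyHeap *h, int version)
{
	return &h->values[version];
}

/* "UPDATE": append a new version, leave old versions untouched.
 * Returns the index of the new version. */
int
toy_update(ToyHeap *h, int newval)
{
	h->values[h->nversions] = newval;
	return h->nversions++;
}
```

With in-place update, toy_update would overwrite values[0] and the reader's pointer would silently see the new value mid-query; here the old version stays byte-identical for as long as the pointer is held.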

Buffered Page Eviction

Of the four questions raised earlier, we have now answered the first two. Determining tuple visibility is relatively involved; see "PostgreSQL Transactions: MVCC". Now for the last question: when buffered pages may be evicted.

As noted above, a tuple is returned as a raw pointer to its location on the buffered page, and every subsequent operation on the tuple goes through this pointer. Clearly, as long as the tuple is still needed, the buffered page holding it must not be evicted. To guarantee that a page being accessed is not evicted, every buffered page in PostgreSQL carries a reference count, and a page whose reference count is nonzero cannot be evicted. Incrementing the reference count of a page is called pinning it; decrementing the count is called unpinning. Let's see how the reference counts are managed during a sequential scan.

Pages are pinned and unpinned in two functions:

  • heapgetpage
  • ExecStoreTuple

heapgetpage has been described above; here we focus on when it is called. Since heapgetpage collects all visible tuples of a page, it is naturally called on the first access to that page, specifically in two situations:

  • at the start of the scan, when the block at rs_startblock is accessed;
  • on a block switch (when one block has been fully scanned and the next one begins).

ExecStoreTuple, on the other hand, is called every time a tuple is returned.

Suppose the scan has just finished page0 and moves on to page1; page1 must now be pinned. The flow is:

  1. When heapgetpage loads a block into buffered page1 via ReadBufferExtended, ReadBufferExtended internally pins page1 for the first time, bringing its reference count to 1.

  2. After the first visible tuple of page1 is returned, ExecStoreTuple is called; its implementation:

    TupleTableSlot *
    ExecStoreTuple(HeapTuple tuple,
    			   TupleTableSlot *slot,
    			   Buffer buffer,
    			   bool shouldFree)
    {
    	/*
    	 * sanity checks
    	 */
    	Assert(tuple != NULL);
    	Assert(slot != NULL);
    	Assert(slot->tts_tupleDescriptor != NULL);
    	/* passing shouldFree=true for a tuple on a disk page is not sane */
    	Assert(BufferIsValid(buffer) ? (!shouldFree) : true);
    
    	/*
    	 * Free any old physical tuple belonging to the slot.
    	 */
    	if (slot->tts_shouldFree)
    		heap_freetuple(slot->tts_tuple);
    	if (slot->tts_shouldFreeMin)
    		heap_free_minimal_tuple(slot->tts_mintuple);
    
    	/*
    	 * Store the new tuple into the specified slot.
    	 */
    	slot->tts_isempty = false;
    	slot->tts_shouldFree = shouldFree;
    	slot->tts_shouldFreeMin = false;
    	slot->tts_tuple = tuple;
    	slot->tts_mintuple = NULL;
    
    	/* Mark extracted state invalid */
    	slot->tts_nvalid = 0;
    
    	/*
    	 * If tuple is on a disk page, keep the page pinned as long as we hold a
    	 * pointer into it.  We assume the caller already has such a pin.
    	 *
    	 * This is coded to optimize the case where the slot previously held a
    	 * tuple on the same disk page: in that case releasing and re-acquiring
    	 * the pin is a waste of cycles.  This is a common situation during
    	 * seqscans, so it's worth troubling over.
    	 */
    	if (slot->tts_buffer != buffer)
    	{
    		if (BufferIsValid(slot->tts_buffer))
    			ReleaseBuffer(slot->tts_buffer);
    		slot->tts_buffer = buffer;
    		if (BufferIsValid(buffer))
    			IncrBufferRefCount(buffer);
    	}
    
    	return slot;
    }
    

    Note the block near the end beginning with if (slot->tts_buffer != buffer). Since the scan has switched from page0 to page1, slot->tts_buffer (still page0) differs from buffer (page1), so page1's reference count is incremented a second time, going from 1 to 2.

Now suppose the scan has finished page1 and moves on to page2. The scan of page1 is complete, so page1 can be unpinned, making it evictable again. The unpin of page1 proceeds as follows:

  1. From the call timing of heapgetpage described above, whenever heapgetpage finds the current scan buffer valid, a page switch has occurred; so page1 is unpinned here for the first time, dropping its reference count to 1.

    	if (BufferIsValid(scan->rs_cbuf))
    	{
    		ReleaseBuffer(scan->rs_cbuf);	/* ReleaseBuffer unpins the page */
    		scan->rs_cbuf = InvalidBuffer;
    	}
    
  2. After the first visible tuple of page2 is returned, ExecStoreTuple is called again; this time it decrements page1's reference count a second time, dropping it from 1 to 0.

    if (BufferIsValid(slot->tts_buffer))
    	ReleaseBuffer(slot->tts_buffer);
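The pin/unpin choreography above can be simulated with plain counters. This is a toy sketch: in PostgreSQL the reference counts live in the shared buffer descriptors, and the functions below are hypothetical stand-ins for heapgetpage and ExecStoreTuple.

```c
#include <assert.h>

/* Toy buffer reference counts, indexed by buffer id. */
#define NBUFFERS 4
static int refcount[NBUFFERS];

static void pin(int buf)   { refcount[buf]++; }
static void unpin(int buf) { refcount[buf]--; }

/* Mirrors heapgetpage: release the previous scan buffer, pin the new one. */
static int scan_buf = -1;
void
toy_heapgetpage(int buf)
{
	if (scan_buf != -1)
		unpin(scan_buf);	/* first unpin of the old page */
	pin(buf);				/* first pin of the new page (ReadBufferExtended) */
	scan_buf = buf;
}

/* Mirrors ExecStoreTuple: swap the slot's pin only when the page changes. */
static int slot_buf = -1;
void
toy_store_tuple(int buf)
{
	if (slot_buf != buf)
	{
		if (slot_buf != -1)
			unpin(slot_buf);	/* second unpin of the old page */
		slot_buf = buf;
		pin(buf);				/* second pin of the new page */
	}
}
```

Running a page switch through this sketch reproduces the counts described above: page1 goes 0 → 1 → 2 while being scanned, then back 2 → 1 → 0 once the scan moves to page2, at which point page1 becomes evictable.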
    

Open Questions

  1. Under what conditions can page-at-a-time mode be used, and when can it not?