Sequential Scan
Prerequisites
Overview
In 《PostgreSQL 流程—查询》 we walked through PostgreSQL's query processing and noted that the sequential (full-table) scan is driven mainly by the function SeqNext. This article focuses on how SeqNext works.
SeqNext
SeqNext is fairly short, so let's go straight to the code:
static TupleTableSlot *
SeqNext(SeqScanState *node)
{
HeapTuple tuple;
HeapScanDesc scandesc;
EState *estate;
ScanDirection direction;
TupleTableSlot *slot;
/*
* get information from the estate and scan state
*/
scandesc = node->ss.ss_currentScanDesc;
estate = node->ss.ps.state;
direction = estate->es_direction;
slot = node->ss.ss_ScanTupleSlot;
/* if no scandesc exists yet, create one */
if (scandesc == NULL)
{
/*
* We reach here if the scan is not parallel, or if we're serially
* executing a scan that was planned to be parallel.
*/
scandesc = heap_beginscan(node->ss.ss_currentRelation,
estate->es_snapshot,
0, NULL);
node->ss.ss_currentScanDesc = scandesc;
}
/*
* get the next tuple from the table
* (fetch one visible tuple)
*/
tuple = heap_getnext(scandesc, direction);
/*
* save the tuple and the buffer returned to us by the access methods in
* our scan tuple slot and return the slot. Note: we pass 'false' because
* tuples returned by heap_getnext() are pointers onto disk pages and were
* not created with palloc() and so should not be pfree()'d. Note also
* that ExecStoreTuple will increment the refcount of the buffer; the
* refcount will not be dropped until the tuple table slot is cleared.
* (cleanup work, such as unpinning the previous buffer page, happens here)
*/
if (tuple)
ExecStoreTuple(tuple, /* tuple to store */
slot, /* slot to store in */
scandesc->rs_cbuf, /* buffer associated with this
* tuple */
false); /* don't pfree this pointer */
else
ExecClearTuple(slot);
return slot;
}
SeqNext does three things:
-
Create and initialize a scandesc if one does not exist yet
Because SeqNext returns only one visible tuple per call, it needs an iterator that records which tuple on which buffer page the scan has reached; scandesc is that iterator. scandesc is a structure of type HeapScanDesc, defined as follows:
typedef struct HeapScanDescData
{
	/* scan parameters */
	Relation	rs_rd;			/* heap relation descriptor */
	Snapshot	rs_snapshot;	/* snapshot to see */
	int			rs_nkeys;		/* number of scan keys */
	ScanKey		rs_key;			/* array of scan key descriptors */
	bool		rs_bitmapscan;	/* true if this is really a bitmap scan */
	bool		rs_samplescan;	/* true if this is really a sample scan */
	bool		rs_pageatatime; /* verify visibility page-at-a-time? */
	bool		rs_allow_strat; /* allow or disallow use of access strategy */
	bool		rs_allow_sync;	/* allow or disallow use of syncscan */
	bool		rs_temp_snap;	/* unregister snapshot at scan end? */
	/* state set up at initscan time */
	BlockNumber rs_nblocks;		/* total number of blocks in rel */
	BlockNumber rs_startblock;	/* block # to start at */
	BlockNumber rs_numblocks;	/* max number of blocks to scan */
	/* rs_numblocks is usually InvalidBlockNumber, meaning "scan whole rel" */
	BufferAccessStrategy rs_strategy;	/* access strategy for reads */
	bool		rs_syncscan;	/* report location to syncscan logic? */
	/* scan current state */
	bool		rs_inited;		/* false = scan not init'd yet */
	HeapTupleData rs_ctup;		/* current tuple in scan, if any */
	BlockNumber rs_cblock;		/* current block # in scan, if any */
	Buffer		rs_cbuf;		/* current buffer in scan, if any */
	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
	ParallelHeapScanDesc rs_parallel;	/* parallel scan information */
	/* these fields only used in page-at-a-time mode and for bitmap scans */
	int			rs_cindex;		/* current tuple's index in vistuples */
	int			rs_ntuples;		/* number of visible tuples on page */
	OffsetNumber rs_vistuples[MaxHeapTuplesPerPage];	/* their offsets */
} HeapScanDescData;

typedef struct HeapScanDescData *HeapScanDesc;
HeapScanDesc has many members, all of which are initialized in heap_beginscan. The members involved in a sequential scan are explained as the walkthrough proceeds.
-
Call heap_getnext to fetch one visible tuple
This is also one of the core functions of the sequential scan.
-
Call ExecStoreTuple or ExecClearTuple to perform some cleanup work
Sequential Scan Flow
In short, a sequential scan walks a table record by record, from the first record of the first block to the last record of the last block.
The process involves many details. We summarize them as questions, then answer each one to understand the whole flow.
- How is the starting position of the scan determined?
- How are records fetched?
- How is a record's visibility decided?
- While a buffer page is being scanned, how do we keep it from being evicted, and when may it be evicted?
How the starting position is determined
Obviously the scan starts at the first record of the first block of the table. The implementation involves two HeapScanDesc members:
- rs_rd: the relation being scanned.
- rs_startblock: the block number of the block where the scan starts. Note that this is a physical block number; the block must be loaded into a memory page before it can be accessed.
How records are fetched
Records are fetched in heap_getnext. Let's look at its code:
HeapTuple
heap_getnext(HeapScanDesc scan, ScanDirection direction)
{
/* Note: no locking manipulations needed */
HEAPDEBUG_1; /* heap_getnext( info ) */
if (scan->rs_pageatatime)
heapgettup_pagemode(scan, direction,
scan->rs_nkeys, scan->rs_key);
else
heapgettup(scan, direction, scan->rs_nkeys, scan->rs_key);
if (scan->rs_ctup.t_data == NULL)
{
HEAPDEBUG_2; /* heap_getnext returning EOS */
return NULL;
}
/*
* if we get here it means we have a new current scan tuple, so point to
* the proper return buffer and return the tuple.
*/
HEAPDEBUG_3; /* heap_getnext returning tuple */
pgstat_count_heap_getnext(scan->rs_rd);
return &(scan->rs_ctup);
}
One HeapScanDesc member matters here:
-
rs_pageatatime
rs_pageatatime indicates whether page-at-a-time mode can be used. Page-at-a-time is a sequential-scan mode in which heapgettup_pagemode is called; heapgettup_pagemode is functionally identical to heapgettup but has a lighter-weight implementation.
Next we focus on the implementation of heapgettup_pagemode.
heapgettup_pagemode
heapgettup_pagemode is the lowest-level function of the sequential scan; it returns one visible tuple. Its flow is:
-
Check whether the iterator has been initialized (via the rs_inited member of HeapScanDesc)
a. Not initialized
If not initialized, the scan has just started. First check whether the table has any blocks at all (the rs_nblocks member) and how many blocks may be scanned (the rs_numblocks member). If either rs_nblocks or rs_numblocks is 0, the table has nothing to scan and the function returns immediately.
If both are nonzero, fetch the first block to scan according to rs_startblock, then call heapgetpage to collect the ItemIds (line-pointer offsets) of all visible tuples in that block into the rs_vistuples member; subsequent tuples are obtained by walking rs_vistuples. heapgetpage is covered in detail below.
Since the scan has just started, the first visible tuple is wanted, i.e. the tuple at rs_vistuples[0], so lineindex is set to 0.
Finally rs_inited is set to true to mark the iterator as initialized.
b. Initialized
Continue the scan: fetch the next tuple of the current buffer page (the page recorded in rs_cblock, the tuple after rs_cindex). Set lineindex to rs_cindex + 1.
-
Compute the number of tuples remaining on the current page
lines = scan->rs_ntuples;	/* rs_ntuples is filled in by heapgetpage */
linesleft = lines - lineindex;
-
Check the remaining-tuple count
a. > 0
Fetch the tuple at lineindex, record the current tuple index (set scan->rs_cindex to lineindex), and return.
b. = 0
Go to step 4.
-
Advance to the next physical block
If there are unscanned physical blocks left, fetch the next one and call heapgetpage to collect its visible tuples. Reset lineindex to 0 and go back to step 3. If all physical blocks have been scanned, return.
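The forward-scan control flow above can be boiled down to a minimal sketch. This is not PostgreSQL code: `ScanState`, `Page`, `getpage` and `scan_next` are hypothetical stand-ins for HeapScanDesc, a heap page, heapgetpage and heapgettup_pagemode, and the sketch omits backward scans, parallel scans and the rs_startblock wrap-around.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified scan state mirroring the HeapScanDesc
 * fields used above. A table is modeled as an array of pages, each
 * page holding the ids of its visible tuples. */
typedef struct {
    int  nblocks;   /* rs_nblocks: total pages in the table   */
    int  cblock;    /* rs_cblock: current page being scanned  */
    int  cindex;    /* rs_cindex: current index in vistuples  */
    int  ntuples;   /* rs_ntuples: visible tuples on the page */
    bool inited;    /* rs_inited                              */
} ScanState;

typedef struct {
    int ntuples;
    int tuples[8];  /* stand-in for rs_vistuples */
} Page;

/* stand-in for heapgetpage(): "load" a page's visible tuples */
static void getpage(ScanState *s, const Page *pages, int blk)
{
    s->cblock = blk;
    s->ntuples = pages[blk].ntuples;
}

/* Forward-only sketch of heapgettup_pagemode: return the next
 * visible tuple id, or -1 when the scan is exhausted. */
int scan_next(ScanState *s, const Page *pages)
{
    int lineindex;

    /* step 1: initialize the iterator on the first call */
    if (!s->inited) {
        if (s->nblocks == 0)
            return -1;              /* empty relation */
        getpage(s, pages, 0);       /* simplified: always start at block 0 */
        lineindex = 0;
        s->inited = true;
    } else {
        lineindex = s->cindex + 1;  /* continue after the last tuple */
    }

    for (;;) {
        /* step 3: tuples left on the current page? */
        if (lineindex < s->ntuples) {
            s->cindex = lineindex;
            return pages[s->cblock].tuples[lineindex];
        }
        /* step 4: advance to the next block, or finish */
        if (s->cblock + 1 >= s->nblocks)
            return -1;              /* all pages exhausted */
        getpage(s, pages, s->cblock + 1);
        lineindex = 0;
    }
}
```

Each call resumes from the (cblock, cindex) pair, exactly the role rs_cblock and rs_cindex play in the real function.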
Now let's look at the code of heapgettup_pagemode:
static void
heapgettup_pagemode(HeapScanDesc scan,
ScanDirection dir,
int nkeys,
ScanKey key)
{
HeapTuple tuple = &(scan->rs_ctup);
bool backward = ScanDirectionIsBackward(dir);
BlockNumber page;
bool finished;
Page dp;
int lines;
int lineindex;
OffsetNumber lineoff;
int linesleft;
ItemId lpp;
/*
* calculate next starting lineindex, given scan direction
*/
if (ScanDirectionIsForward(dir))
{
/* Step 1: has the iterator been initialized? */
if (!scan->rs_inited)
{
/*
* return null immediately if relation is empty
* (is there any block to scan?)
*/
if (scan->rs_nblocks == 0 || scan->rs_numblocks == 0)
{
/* no blocks to scan; return immediately */
Assert(!BufferIsValid(scan->rs_cbuf));
tuple->t_data = NULL;
return;
}
/* is this a parallel scan? */
if (scan->rs_parallel != NULL)
{
page = heap_parallelscan_nextpage(scan);
/* Other processes might have already finished the scan. */
if (page == InvalidBlockNumber)
{
Assert(!BufferIsValid(scan->rs_cbuf));
tuple->t_data = NULL;
return;
}
}
else /* fetch the first physical block */
page = scan->rs_startblock; /* first page */
/* collect all visible tuples of this physical block */
heapgetpage(scan, page);
lineindex = 0;
scan->rs_inited = true;
}
else
{
/*
* continue from previously returned page/tuple
* (continue the scan)
*/
page = scan->rs_cblock; /* current page */
lineindex = scan->rs_cindex + 1;
}
/* Step 2: compute the tuples remaining on the current page */
dp = BufferGetPage(scan->rs_cbuf);
TestForOldSnapshot(scan->rs_snapshot, scan->rs_rd, dp);
lines = scan->rs_ntuples;
/* page and lineindex now reference the next visible tid */
linesleft = lines - lineindex;
}
else if (backward)
{
/* backward parallel scan not supported */
Assert(scan->rs_parallel == NULL);
if (!scan->rs_inited)
{
/*
* return null immediately if relation is empty
*/
if (scan->rs_nblocks == 0 || scan->rs_numblocks == 0)
{
Assert(!BufferIsValid(scan->rs_cbuf));
tuple->t_data = NULL;
return;
}
/*
* Disable reporting to syncscan logic in a backwards scan; it's
* not very likely anyone else is doing the same thing at the same
* time, and much more likely that we'll just bollix things for
* forward scanners.
*/
scan->rs_syncscan = false;
/* start from last page of the scan */
if (scan->rs_startblock > 0)
page = scan->rs_startblock - 1;
else
page = scan->rs_nblocks - 1;
heapgetpage(scan, page);
}
else
{
/* continue from previously returned page/tuple */
page = scan->rs_cblock; /* current page */
}
dp = BufferGetPage(scan->rs_cbuf);
TestForOldSnapshot(scan->rs_snapshot, scan->rs_rd, dp);
lines = scan->rs_ntuples;
if (!scan->rs_inited)
{
lineindex = lines - 1;
scan->rs_inited = true;
}
else
{
lineindex = scan->rs_cindex - 1;
}
/* page and lineindex now reference the previous visible tid */
linesleft = lineindex + 1;
}
else
{
/*
* ``no movement'' scan direction: refetch prior tuple
*/
if (!scan->rs_inited)
{
Assert(!BufferIsValid(scan->rs_cbuf));
tuple->t_data = NULL;
return;
}
page = ItemPointerGetBlockNumber(&(tuple->t_self));
if (page != scan->rs_cblock)
heapgetpage(scan, page);
/* Since the tuple was previously fetched, needn't lock page here */
dp = BufferGetPage(scan->rs_cbuf);
TestForOldSnapshot(scan->rs_snapshot, scan->rs_rd, dp);
lineoff = ItemPointerGetOffsetNumber(&(tuple->t_self));
lpp = PageGetItemId(dp, lineoff);
Assert(ItemIdIsNormal(lpp));
tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
tuple->t_len = ItemIdGetLength(lpp);
/* check that rs_cindex is in sync */
Assert(scan->rs_cindex < scan->rs_ntuples);
Assert(lineoff == scan->rs_vistuples[scan->rs_cindex]);
return;
}
/*
* advance the scan until we find a qualifying tuple or run out of stuff
* to scan
*/
for (;;)
{
/* Step 3: check the remaining-tuple count */
while (linesleft > 0)
{
lineoff = scan->rs_vistuples[lineindex];
lpp = PageGetItemId(dp, lineoff);
Assert(ItemIdIsNormal(lpp));
/*
* tuples remain: fetch the tuple at lineindex and return it.
* Returning a pointer directly is in fact a major advantage of
* PostgreSQL, explained later.
*/
tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
tuple->t_len = ItemIdGetLength(lpp);
ItemPointerSet(&(tuple->t_self), page, lineoff);
/*
* if current tuple qualifies, return it.
*/
if (key != NULL)
{
bool valid;
HeapKeyTest(tuple, RelationGetDescr(scan->rs_rd),
nkeys, key, valid);
if (valid)
{
scan->rs_cindex = lineindex;
return;
}
}
else
{
scan->rs_cindex = lineindex;
return;
}
/*
* otherwise move to the next item on the page
*/
--linesleft;
if (backward)
--lineindex;
else
++lineindex;
}
/*
* if we get here, it means we've exhausted the items on this page and
* it's time to move to the next.
* (no tuples remain on this page)
* Step 4: advance to the next physical block
*/
if (backward)
{
finished = (page == scan->rs_startblock) ||
(scan->rs_numblocks != InvalidBlockNumber?--scan->rs_numblocks==0:false);
if (page == 0)
page = scan->rs_nblocks;
page--;
}
else if (scan->rs_parallel != NULL)
{
page = heap_parallelscan_nextpage(scan);
finished = (page == InvalidBlockNumber);
}
else
{
/* check whether unscanned physical blocks remain; if not, set finished */
page++;
if (page >= scan->rs_nblocks)
page = 0;
finished = (page == scan->rs_startblock) ||
(scan->rs_numblocks != InvalidBlockNumber?--scan->rs_numblocks==0:false);
/*
* Report our new scan position for synchronization purposes. We
* don't do that when moving backwards, however. That would just
* mess up any other forward-moving scanners.
*
* Note: we do this before checking for end of scan so that the
* final state of the position hint is back at the start of the
* rel. That's not strictly necessary, but otherwise when you run
* the same query multiple times the starting position would shift
* a little bit backwards on every invocation, which is confusing.
* We don't guarantee any specific ordering in general, though.
*/
if (scan->rs_syncscan)
ss_report_location(scan->rs_rd, page);
}
/*
* return NULL if we've exhausted all the pages
* (end of scan)
*/
if (finished)
{
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
scan->rs_cbuf = InvalidBuffer;
scan->rs_cblock = InvalidBlockNumber;
tuple->t_data = NULL;
scan->rs_inited = false;
return;
}
/* otherwise collect the visible tuples of the next physical block */
heapgetpage(scan, page);
dp = BufferGetPage(scan->rs_cbuf);
TestForOldSnapshot(scan->rs_snapshot, scan->rs_rd, dp);
lines = scan->rs_ntuples;
linesleft = lines;
if (backward)
lineindex = lines - 1;
else
lineindex = 0; /* reset lineindex */
/* back to Step 3 */
}
}
heapgetpage
Now for the most central function of the sequential scan, heapgetpage. It collects all visible tuples of one physical block and stores their ItemIds in rs_vistuples. Its flow is:
-
Load the physical block into a buffer page.
-
Take a lightweight share lock on the buffer page.
A lightweight lock (LWLock) is PostgreSQL's own lock implementation; it is essentially a latch, used to synchronize multiple processes around shared resources. It has a wait queue but no deadlock detection.
-
Walk all tuples on the page, test each tuple's visibility, and store the ItemIds of the visible tuples in rs_vistuples.
-
Release the lightweight share lock.
Let's look at the source code:
void
heapgetpage(HeapScanDesc scan, BlockNumber page)
{
Buffer buffer;
Snapshot snapshot;
Page dp;
int lines;
int ntup;
OffsetNumber lineoff;
ItemId lpp;
bool all_visible;
Assert(page < scan->rs_nblocks);
/* release previous scan buffer, if any */
if (BufferIsValid(scan->rs_cbuf))
{
ReleaseBuffer(scan->rs_cbuf);
scan->rs_cbuf = InvalidBuffer;
}
/*
* Be sure to check for interrupts at least once per page. Checks at
* higher code levels won't be able to stop a seqscan that encounters many
* pages' worth of consecutive dead tuples.
*/
CHECK_FOR_INTERRUPTS();
/*
* read page using selected strategy
* Step 1: load the physical block into a buffer page
*/
scan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM, page,
RBM_NORMAL, scan->rs_strategy);
scan->rs_cblock = page;
if (!scan->rs_pageatatime)
return;
buffer = scan->rs_cbuf;
snapshot = scan->rs_snapshot;
/*
* Prune and repair fragmentation for the whole page, if possible.
*/
heap_page_prune_opt(scan->rs_rd, buffer);
/*
* We must hold share lock on the buffer content while examining tuple
* visibility. Afterwards, however, the tuples we have found to be
* visible are guaranteed good as long as we hold the buffer pin.
* Step 2: take a lightweight share lock on the buffer page
*/
LockBuffer(buffer, BUFFER_LOCK_SHARE);
dp = BufferGetPage(buffer);
TestForOldSnapshot(snapshot, scan->rs_rd, dp);
lines = PageGetMaxOffsetNumber(dp);
ntup = 0;
/*
* If the all-visible flag indicates that all tuples on the page are
* visible to everyone, we can skip the per-tuple visibility tests.
*
* Note: In hot standby, a tuple that's already visible to all
* transactions in the master might still be invisible to a read-only
* transaction in the standby. We partly handle this problem by tracking
* the minimum xmin of visible tuples as the cut-off XID while marking a
* page all-visible on master and WAL log that along with the visibility
* map SET operation. In hot standby, we wait for (or abort) all
* transactions that can potentially may not see one or more tuples on the
* page. That's how index-only scans work fine in hot standby. A crucial
* difference between index-only scans and heap scans is that the
* index-only scan completely relies on the visibility map where as heap
* scan looks at the page-level PD_ALL_VISIBLE flag. We are not sure if
* the page-level flag can be trusted in the same way, because it might
* get propagated somehow without being explicitly WAL-logged, e.g. via a
* full page write. Until we can prove that beyond doubt, let's check each
* tuple for visibility the hard way.
* Step 3: walk all tuples on the page, test each tuple's visibility,
* and store the ItemIds of the visible tuples in rs_vistuples
*/
all_visible = PageIsAllVisible(dp) && !snapshot->takenDuringRecovery;
for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
lineoff <= lines;
lineoff++, lpp++)
{
if (ItemIdIsNormal(lpp))
{
HeapTupleData loctup;
bool valid;
loctup.t_tableOid = RelationGetRelid(scan->rs_rd);
loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
loctup.t_len = ItemIdGetLength(lpp);
ItemPointerSet(&(loctup.t_self), page, lineoff);
if (all_visible)
valid = true;
else
valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
CheckForSerializableConflictOut(valid, scan->rs_rd, &loctup,
buffer, snapshot);
if (valid)
scan->rs_vistuples[ntup++] = lineoff;
}
}
/* Step 4: release the lightweight share lock */
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
Assert(ntup <= MaxHeapTuplesPerPage);
scan->rs_ntuples = ntup;
}
Why PostgreSQL's sequential scan has an edge
Looking back at the heapgettup_pagemode flow, it is easy to see that a visible tuple is returned as a pointer directly into its buffer page. The relevant code (it appears in the heapgettup_pagemode listing above) is:
tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
tuple->t_len = ItemIdGetLength(lpp);
Most databases do not return a pointer into the page itself; they copy the page and return a pointer into the copy, or copy the individual tuple. The reason: to maximize concurrency, a query takes no lock on pages or tuples, only a latch (the lightweight lock discussed above) when strictly necessary. While a query runs, the page may be modified concurrently, with MVCC guaranteeing that the query still sees correct data. A tuple obtained by the query process could therefore be changed by a writing process before the query finishes; returning a raw page pointer would expose inconsistent data, so visible tuples normally must be copied. In PostgreSQL, however, an update is a delete plus an insert (an "out-of-place update"), meaning that once a tuple is inserted its contents are never modified in place. The multi-process hazard above therefore does not exist, and pointers can be used safely.
Of course, compared with in-place updates, out-of-place updates cause considerable table bloat. That topic is covered in a separate document.
Buffer page eviction
Of the four questions raised earlier we have answered the first two; tuple visibility is relatively involved and is covered in 《PostgreSQL 事务—MVCC》. Now for the last question: buffer page eviction.
As noted above, a tuple is returned as a pointer into its buffer page, and every later operation on the tuple goes through that pointer. Clearly, as long as the tuple is still needed, its buffer page must not be evicted. To guarantee this, every buffer page in PostgreSQL carries a reference count, and a page whose reference count is nonzero cannot be evicted. Incrementing the count is called pinning the page; decrementing it is called unpinning. Let's see how the reference count is managed during a sequential scan.
Pages are pinned and unpinned in two functions:
- heapgetpage
- ExecStoreTuple
heapgetpage has already been described; here we stress when it is called. Since it collects all visible tuples of a page, it naturally runs on the first access to a page, specifically in two cases:
- when the scan starts and the block at rs_startblock is accessed;
- on a block switch (one block is finished and the next begins).
ExecStoreTuple, by contrast, is called every time a tuple is returned.
Suppose the scan of page0 has just finished and page1 is next; page1 must then be pinned, as follows:
-
When heapgetpage loads a block into buffer page1 via ReadBufferExtended, ReadBufferExtended pins page1 for the first time, bringing its reference count to 1.
-
When the first visible tuple of page1 is returned, ExecStoreTuple is called; its implementation is:
TupleTableSlot *
ExecStoreTuple(HeapTuple tuple,
			   TupleTableSlot *slot,
			   Buffer buffer,
			   bool shouldFree)
{
	/*
	 * sanity checks
	 */
	Assert(tuple != NULL);
	Assert(slot != NULL);
	Assert(slot->tts_tupleDescriptor != NULL);
	/* passing shouldFree=true for a tuple on a disk page is not sane */
	Assert(BufferIsValid(buffer) ? (!shouldFree) : true);

	/*
	 * Free any old physical tuple belonging to the slot.
	 */
	if (slot->tts_shouldFree)
		heap_freetuple(slot->tts_tuple);
	if (slot->tts_shouldFreeMin)
		heap_free_minimal_tuple(slot->tts_mintuple);

	/*
	 * Store the new tuple into the specified slot.
	 */
	slot->tts_isempty = false;
	slot->tts_shouldFree = shouldFree;
	slot->tts_shouldFreeMin = false;
	slot->tts_tuple = tuple;
	slot->tts_mintuple = NULL;

	/* Mark extracted state invalid */
	slot->tts_nvalid = 0;

	/*
	 * If tuple is on a disk page, keep the page pinned as long as we hold a
	 * pointer into it. We assume the caller already has such a pin.
	 *
	 * This is coded to optimize the case where the slot previously held a
	 * tuple on the same disk page: in that case releasing and re-acquiring
	 * the pin is a waste of cycles. This is a common situation during
	 * seqscans, so it's worth troubling over.
	 */
	if (slot->tts_buffer != buffer)
	{
		if (BufferIsValid(slot->tts_buffer))
			ReleaseBuffer(slot->tts_buffer);
		slot->tts_buffer = buffer;
		if (BufferIsValid(buffer))
			IncrBufferRefCount(buffer);
	}

	return slot;
}
Note the final `if (slot->tts_buffer != buffer)` block. Because the scan has switched from page0 to page1, slot->tts_buffer and buffer differ (slot->tts_buffer refers to page0, buffer to page1), so page1's reference count is incremented a second time, going from 1 to 2.
Now suppose the scan of page1 has finished and page2 is next. The scan no longer needs page1, so page1 can be unpinned and becomes evictable. The unpin happens in two steps:
-
From the description of when heapgetpage is called, it follows that whenever heapgetpage runs with a non-empty current buffer, a page switch has occurred, so page1 is unpinned for the first time, dropping its reference count to 1:
if (BufferIsValid(scan->rs_cbuf))
{
	ReleaseBuffer(scan->rs_cbuf);	/* ReleaseBuffer performs the unpin */
	scan->rs_cbuf = InvalidBuffer;
}
-
When the first visible tuple of page2 is returned, ExecStoreTuple is called again, decrementing page1's reference count a second time, from 1 to 0:
if (BufferIsValid(slot->tts_buffer))
	ReleaseBuffer(slot->tts_buffer);
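The pin/unpin bookkeeping described above can be simulated with a tiny model. The helper names `model_heapgetpage` and `model_store_tuple` are hypothetical; in the real code the increments and decrements happen inside ReadBufferExtended, IncrBufferRefCount and ReleaseBuffer.

```c
/* Hypothetical model: refcnt[i] is the pin count of buffer page i,
 * and a page index of -1 means "no page held". */
static int refcnt[3];

/* heapgetpage side: unpin the scan's previous page, pin the new one */
void model_heapgetpage(int *cur_page, int new_page)
{
    if (*cur_page >= 0)
        refcnt[*cur_page]--;        /* ReleaseBuffer(scan->rs_cbuf) */
    refcnt[new_page]++;             /* pin taken inside ReadBufferExtended */
    *cur_page = new_page;
}

/* ExecStoreTuple side: the slot drops its old page, pins the new one */
void model_store_tuple(int *slot_page, int new_page)
{
    if (*slot_page != new_page)     /* same-page case: do nothing */
    {
        if (*slot_page >= 0)
            refcnt[*slot_page]--;   /* ReleaseBuffer(slot->tts_buffer) */
        refcnt[new_page]++;         /* IncrBufferRefCount(buffer) */
        *slot_page = new_page;
    }
}
```

Replaying the scenario from the text (load page0, return a tuple, switch to page1, return a tuple) shows page0's count go 1 → 2 → 1 → 0, at which point page0 is evictable.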
Open questions
- Under what conditions can page-at-a-time mode be used, and when can it not?