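PostgreSQL B+ Tree Index: Page Deletion
Prerequisites
"PostgreSQL Blink-tree ReadMe: Translation"
Overview
This article covers the last part of the PostgreSQL B*-tree index: deleting index pages. The topic is already touched on in "PostgreSQL Blink-tree ReadMe: Translation", but the README is long, so this article walks through the page deletion logic again and analyzes, at the source-code level, how PostgreSQL implements it.
In PostgreSQL, VACUUM physically removes index tuples from a B*-tree. When every tuple on an index page has been removed, that is, when the page has been emptied, VACUUM deletes the page from the B*-tree. The idea behind PostgreSQL's page deletion comes from Lanin and Shasha's paper "A Symmetric Concurrent B-Tree Algorithm". The core of that paper is how to merge pages (combining two pages into one when they hold little enough data), but PostgreSQL's B*-tree does not support page merging. The reason is explained in "PostgreSQL Blink-tree ReadMe: Translation" and is not repeated here.
Page deletion in PostgreSQL is implemented by the function _bt_pagedel, whose call stack is: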
ExecVacuum > vacuum > vacuum_rel > lazy_vacuum_rel > lazy_scan_heap > lazy_cleanup_index > index_vacuum_cleanup > btvacuumcleanup > btvacuumscan > btvacuumpage > _bt_pagedel
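(Set a breakpoint in _bt_pagedel, then run VACUUM from a client tool to hit the breakpoint and inspect the call stack.)
Our story starts at _bt_pagedel. The idea behind it is simple and has two stages. Taking the deletion of page2 as the example:
- Stage 1: remove the downlink to page2 from page6 and mark page2 as half-dead, as shown in Figure 2. This stage is implemented by _bt_mark_page_halfdead.
- Stage 2: unlink page2 from its left and right siblings and mark page2 as deleted, as shown in Figure 3. This stage is implemented by _bt_unlink_halfdead_page.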
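The two stages can be pictured as a small state machine on a page: live, then half-dead (no downlink in the parent), then deleted (no longer reachable through sibling links). The sketch below is a toy model under that assumption only. `ToyPage`, `toy_mark_halfdead` and `toy_unlink_halfdead` are hypothetical names, and the model leaves out locking, WAL and everything else the real functions do.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_PAGE (-1)

typedef enum { PAGE_LIVE, PAGE_HALF_DEAD, PAGE_DELETED } ToyState;

typedef struct {
    int prev;           /* left sibling index, or NO_PAGE */
    int next;           /* right sibling index, or NO_PAGE */
    bool has_downlink;  /* does the parent still point at this page? */
    ToyState state;
} ToyPage;

/* Stage 1: drop the parent's downlink and flag the page half-dead. */
static bool toy_mark_halfdead(ToyPage *pages, int target)
{
    /* a page without a right sibling is never deleted this way */
    if (pages[target].state != PAGE_LIVE || pages[target].next == NO_PAGE)
        return false;
    pages[target].has_downlink = false;
    pages[target].state = PAGE_HALF_DEAD;
    return true;
}

/* Stage 2: splice the page out of the sibling chain and flag it deleted. */
static bool toy_unlink_halfdead(ToyPage *pages, int target)
{
    ToyPage *t = &pages[target];

    if (t->state != PAGE_HALF_DEAD)
        return false;
    if (t->prev != NO_PAGE)
        pages[t->prev].next = t->next;
    pages[t->next].prev = t->prev;
    /* the target's own side links are left untouched */
    t->state = PAGE_DELETED;
    return true;
}
```

Calling `toy_unlink_halfdead` on a page that has not gone through stage 1 fails, which mirrors the ordering of the two stages described above.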
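This simple idea hides many details. This article discusses B*-tree page deletion from two angles:
- The flow and implementation details of deletion
- Concurrency control during deletion
_bt_pagedel
_bt_pagedel is the entry point of the whole page deletion operation, so we look at its implementation first.
- Function declaration
int _bt_pagedel(Relation rel, Buffer buf)
Here rel describes the index relation and buf is the buffer holding the page to delete.
- Function skeleton
for (;;)
{
    //... omitted
    rightsib = opaque->btpo_next;
    _bt_relbuf(rel, buf);
    CHECK_FOR_INTERRUPTS();
    if (!rightsib_empty)
        break;
    buf = _bt_getbuf(rel, rightsib, BT_WRITE);
}
This loop is the first thing you see on entering _bt_pagedel, and it raises a question: judging by its declaration, _bt_pagedel deletes one specified page, so why is there a loop? We set this question aside for now and answer it later.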
- Finding the parent of the page to delete
As described in the overview, the first stage must break the link between the page to delete and its parent, so we first need to find out which page the parent is. To do that, we take the high key of the page to delete and descend the B*-tree searching for that key; the search stack returned by _bt_search records the path down to the page. The code is:
//Code location: nbtpage.c line:1211
if (!stack)
{
    ScanKey     itup_scankey;
    ItemId      itemid;
    IndexTuple  targetkey;
    Buffer      lbuf;
    BlockNumber leftsib;
    // Fetch the page's high key
    itemid = PageGetItemId(page, P_HIKEY);
    targetkey = CopyIndexTuple((IndexTuple) PageGetItem(page, itemid));
    leftsib = opaque->btpo_prev;
    /*
     * To avoid deadlocks, we'd better drop the leaf page lock
     * before going further.
     *
     * Unlock the page so that the search below cannot deadlock.
     */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    /*
     * Fetch the left sibling, to check that it's not marked with
     * INCOMPLETE_SPLIT flag. That would mean that the page
     * to-be-deleted doesn't have a downlink, and the page
     * deletion algorithm isn't prepared to handle that.
     */
    if (!P_LEFTMOST(opaque))
    {
        BTPageOpaque lopaque;
        Page        lpage;
        lbuf = _bt_getbuf(rel, leftsib, BT_READ);
        lpage = BufferGetPage(lbuf);
        lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
        /*
         * If the left sibling is split again by another backend,
         * after we released the lock, we know that the first
         * split must have finished, because we don't allow an
         * incompletely-split page to be split again. So we don't
         * need to walk right here.
         */
        if (lopaque->btpo_next == BufferGetBlockNumber(buf) &&
            P_INCOMPLETE_SPLIT(lopaque))
        {
            ReleaseBuffer(buf);
            _bt_relbuf(rel, lbuf);
            return ndeleted;
        }
        _bt_relbuf(rel, lbuf);
    }
    /*
     * we need an insertion scan key for the search, so build one
     *
     * Convert the high key into a scan key for the descent.
     */
    itup_scankey = _bt_mkscankey(rel, targetkey);
    /* find the leftmost leaf page containing this key */
    stack = _bt_search(rel, rel->rd_rel->relnatts, itup_scankey, false,
                       &lbuf, BT_READ, NULL);
    /* don't need a pin on the page */
    _bt_relbuf(rel, lbuf);
    /*
     * Re-lock the leaf page, and start over, to re-check that the
     * page can still be deleted.
     */
    LockBuffer(buf, BT_WRITE);
    continue;
}
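Page Deletion: Stage 1
Problem 1
In the overview we sketched the two stages of page deletion and illustrated them in the crudest possible way. Looking back at Figures 2 and 3, though, there is a problem: if the B*-tree looks like Figure 3 after page2 is deleted and we now want to insert 30, where should 30 go?
- Into page1?
Since 30 is greater than page1's high key, inserting 30 into page1 would require changing page1's high key to 30.
- Into page3?
Since 30 is smaller than page3's min key, inserting 30 into page3 would require changing the 51 in page6 to 30.
So whichever page 30 goes to, there is extra cost. The cause is that we crudely deleted page2, so page1's high key and page3's min key are no longer equal, and that equality is an invariant PostgreSQL's B*-tree implementation must maintain.
To guarantee it, PostgreSQL deliberately sets the left page's high key to the right page's min key when it splits a B*-tree page.
To solve the problem, when PostgreSQL removes page2 from the parent it does not simply shift 51 left to overwrite 21. Instead it takes the following two steps:
- Make the entry 21 in page6 point to page3, the right sibling of page2.
- Delete the index tuple 51 to the right of 21 in page6 (the tuple that used to point to page3).
After these two steps the B*-tree looks like Figure 4:
PostgreSQL has a more technical name for these two steps: merging the key space. In this way the key space of page2 (21~51) is merged into the key space of page3 (51~63).
The code for these steps is at nbtpage.c lines 1429~1443.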
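The key-space merge can be sketched on a toy parent page that stores (key, child) pairs in an array: the entry for the deleted child keeps its key but is re-pointed at the right sibling, and the entry that used to point at the right sibling is removed. `ToyParent`, `ToyDownlink` and `toy_remove_downlink` are hypothetical names for illustration; real index pages are laid out quite differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_ENTRIES 8

typedef struct { int key; int child; } ToyDownlink;

typedef struct {
    ToyDownlink items[TOY_MAX_ENTRIES];
    int nitems;
} ToyParent;

/*
 * Remove the child at position `off` from the parent while keeping that
 * child's separator key. The entry at `off` is re-pointed at the right
 * sibling, and the following entry (the right sibling's old one) is dropped.
 * Fails when `off` is the last entry, i.e. the child is the rightmost child.
 */
static bool toy_remove_downlink(ToyParent *p, int off)
{
    if (off < 0 || off + 1 >= p->nitems)
        return false;
    p->items[off].child = p->items[off + 1].child;
    memmove(&p->items[off + 1], &p->items[off + 2],
            (size_t) (p->nitems - off - 2) * sizeof(ToyDownlink));
    p->nitems--;
    return true;
}
```

With page6 modeled as the entries (k, page1), (21, page2), (51, page3), where k is a made-up key for page1, removing position 1 leaves (k, page1), (21, page3), which matches the two steps above.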
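Problem 2
Now consider another question: can page3 be deleted? The most direct way to answer is to try it and see what happens, as shown in Figure 5:
Following the procedure above, we make the entry 51 in page6 point to page4, the right sibling of page3, and then delete 63 from page7. The result is shown in Figure 6:
It is easy to see that we deleted 63 from page7 while the key in page8 that points to page7 is still 63, so the two levels are now inconsistent. If we insert 70 at this point, page8 routes 70 to page7, where we find that 70 is smaller than page7's min key, and things are in disarray again.
The cause is that deleting 63 makes page6's high key differ from page7's min key. In Figure 2 we could merge the key spaces of page2 and page3 because they share the same parent (page6). Now we want to merge the key spaces of page3 and page4, whose parents differ, so the merge would change the key spaces of page6 and page7. To merge page3 and page4 anyway, we would have to modify the bounding keys of page6 and page7 (siblings and cousins really are different things!). Modifying a parent's bounding key is relatively expensive and can trigger recursive changes (for example, the 63 in page8 would have to become 85), so PostgreSQL does not support it.
For this reason PostgreSQL does not allow deleting the rightmost child of a parent, unless it is the parent's only child, because in that case the parent is about to be deleted as well. There is one special case, page5, which is the rightmost leaf of the whole tree. To delete page5, it must be the only child of page7; to delete page7, page7 must be the only child of page8. So page5 can be deleted only when the whole tree is empty!
So before deleting a page we must decide whether it can be deleted. There are three possible outcomes:
- The page is not the rightmost child of its parent: it can be deleted.
- The page is the rightmost child of its parent but not the only child: it cannot be deleted.
- The page is the rightmost child and also the only child: the parent must be deleted along with it, so we recurse upward to decide whether the parent can be deleted.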
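The three outcomes can be written as a short recursive check on a toy tree. This is a sketch under simplifying assumptions: each node only records its parent, the position of its downlink in the parent, and how many children it has. `ToyNode` and `toy_find_branch` are hypothetical names, and the shape used in the usage note is an assumed layout, not taken from the figures.

```c
#include <assert.h>

#define NO_PAGE (-1)

typedef struct {
    int parent;         /* parent index, or NO_PAGE for the root */
    int pos_in_parent;  /* 0-based position of this page's downlink */
    int nchildren;      /* number of downlinks this page holds */
} ToyNode;

typedef enum { TOY_CANNOT_DELETE, TOY_CAN_DELETE } ToyVerdict;

/*
 * Walk up from `child`. On success, *top is the highest page of the branch
 * to delete and *top_parent is the page whose downlink to *top is removed.
 */
static ToyVerdict toy_find_branch(const ToyNode *nodes, int child,
                                  int *top, int *top_parent)
{
    int parent = nodes[child].parent;

    if (parent == NO_PAGE)
        return TOY_CANNOT_DELETE;   /* reached the root */
    if (nodes[child].pos_in_parent < nodes[parent].nchildren - 1)
    {
        /* outcome 1: not the rightmost child, deletable */
        *top = child;
        *top_parent = parent;
        return TOY_CAN_DELETE;
    }
    if (nodes[parent].nchildren > 1)
        return TOY_CANNOT_DELETE;   /* outcome 2: rightmost, not only child */
    /* outcome 3: only child, so the parent must be deletable too */
    return toy_find_branch(nodes, parent, top, top_parent);
}
```

For a tree where page8 has children page6 and page7, and page6 has the single child page3, calling `toy_find_branch` on page3 reports page6 as the top of the branch and page8 as the page that loses a downlink.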
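In short, when a page can be deleted, we may delete just that page or a whole chain of pages, and both cases can clearly be handled by the same logic. Consider this situation:
Suppose we want to delete page3 in Figure 7. page3 is the rightmost and only child of page6, so we recurse upward to check whether page6 can be deleted. page6 is not the rightmost child of page8, so it can be deleted as well, and in the end both page3 and page6 must be removed from the B*-tree. In the first stage, what we do is remove from page8 the downlink pointing to page6 (severing the parent-child link between page8 and page6) and then mark page3 as half-dead.
Note:
Whether we delete a single leaf page or a chain of pages, the first stage always marks the leaf page as half-dead.
After this step, page6 and page3 form a chain, as shown in Figure 8:
page6 and page3 in the chain still keep their links to their right siblings; those links are removed in the second stage.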
_bt_mark_page_halfdead implementation
Now let's look at the code for the first stage, which is implemented by _bt_mark_page_halfdead:
static bool
_bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
{
BlockNumber leafblkno;
BlockNumber leafrightsib;
BlockNumber target;
BlockNumber rightsib;
ItemId itemid;
Page page;
BTPageOpaque opaque;
Buffer topparent;
OffsetNumber topoff;
OffsetNumber nextoffset;
IndexTuple itup;
IndexTupleData trunctuple;
page = BufferGetPage(leafbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
Assert(!P_RIGHTMOST(opaque) && !P_ISROOT(opaque) && !P_ISDELETED(opaque) &&
!P_ISHALFDEAD(opaque) && P_ISLEAF(opaque) &&
P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page));
/*
* Save info about the leaf page.
*/
leafblkno = BufferGetBlockNumber(leafbuf);
leafrightsib = opaque->btpo_next;
/*
* Before attempting to lock the parent page, check that the right sibling
* is not in half-dead state. A half-dead right sibling would have no
* downlink in the parent, which would be highly confusing later when we
* delete the downlink that follows the current page's downlink. (I
* believe the deletion would work correctly, but it would fail the
* cross-check we make that the following downlink points to the right
* sibling of the delete page.)
*/
if (_bt_is_page_halfdead(rel, leafrightsib))
{
elog(DEBUG1, "could not delete page %u because its right sibling %u is half-dead",
leafblkno, leafrightsib);
return false;
}
/*
* We cannot delete a page that is the rightmost child of its immediate
* parent, unless it is the only child --- in which case the parent has to
* be deleted too, and the same condition applies recursively to it. We
* have to check this condition all the way up before trying to delete,
* and lock the final parent of the to-be-deleted branch.
*
* Step 1: walk up the tree.
*/
rightsib = leafrightsib;
target = leafblkno;
if (!_bt_lock_branch_parent(rel, leafblkno, stack,
&topparent, &topoff, &target, &rightsib))
return false;
/*
* Check that the parent-page index items we're about to delete/overwrite
* contain what we expect. This can fail if the index has become corrupt
* for some reason. We want to throw any error before entering the
* critical section --- otherwise it'd be a PANIC.
*
* The test on the target item is just an Assert because
* _bt_lock_branch_parent should have guaranteed it has the expected
* contents. The test on the next-child downlink is known to sometimes
* fail in the field, though.
*/
page = BufferGetPage(topparent);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
#ifdef USE_ASSERT_CHECKING
itemid = PageGetItemId(page, topoff);
itup = (IndexTuple) PageGetItem(page, itemid);
Assert(ItemPointerGetBlockNumber(&(itup->t_tid)) == target);
#endif
nextoffset = OffsetNumberNext(topoff);
itemid = PageGetItemId(page, nextoffset);
itup = (IndexTuple) PageGetItem(page, itemid);
if (ItemPointerGetBlockNumber(&(itup->t_tid)) != rightsib)
elog(ERROR, "right sibling %u of block %u is not next child %u of block %u in index \"%s\"",
rightsib, target, ItemPointerGetBlockNumber(&(itup->t_tid)),
BufferGetBlockNumber(topparent), RelationGetRelationName(rel));
/*
* Any insert which would have gone on the leaf block will now go to its
* right sibling.
*/
PredicateLockPageCombine(rel, leafblkno, leafrightsib);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
/*
* Update parent. The normal case is a tad tricky because we want to
* delete the target's downlink and the *following* key. Easiest way is
* to copy the right sibling's downlink over the target downlink, and then
* delete the following item.
*
* Step 2: remove the downlink.
*/
page = BufferGetPage(topparent);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
itemid = PageGetItemId(page, topoff);
itup = (IndexTuple) PageGetItem(page, itemid);
ItemPointerSet(&(itup->t_tid), rightsib, P_HIKEY);
nextoffset = OffsetNumberNext(topoff);
PageIndexTupleDelete(page, nextoffset);
/*
* Mark the leaf page as half-dead, and stamp it with a pointer to the
* highest internal page in the branch we're deleting. We use the tid of
* the high key to store it.
*
* Step 3: mark the leaf page BTP_HALF_DEAD.
*/
page = BufferGetPage(leafbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
opaque->btpo_flags |= BTP_HALF_DEAD;
// Step 4: make the leaf page point to the topmost page of the branch
PageIndexTupleDelete(page, P_HIKEY);
Assert(PageGetMaxOffsetNumber(page) == 0);
MemSet(&trunctuple, 0, sizeof(IndexTupleData));
trunctuple.t_info = sizeof(IndexTupleData);
if (target != leafblkno)
ItemPointerSet(&trunctuple.t_tid, target, P_HIKEY);
else
ItemPointerSetInvalid(&trunctuple.t_tid);
if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData), P_HIKEY,
false, false) == InvalidOffsetNumber)
elog(ERROR, "could not add dummy high key to half-dead page");
/*
* Must mark buffers dirty before XLogInsert
* Step 5: write XLOG.
*/
MarkBufferDirty(topparent);
MarkBufferDirty(leafbuf);
/* XLOG stuff */
if (RelationNeedsWAL(rel))
{
xl_btree_mark_page_halfdead xlrec;
XLogRecPtr recptr;
xlrec.poffset = topoff;
xlrec.leafblk = leafblkno;
if (target != leafblkno)
xlrec.topparent = target;
else
xlrec.topparent = InvalidBlockNumber;
XLogBeginInsert();
XLogRegisterBuffer(0, leafbuf, REGBUF_WILL_INIT);
XLogRegisterBuffer(1, topparent, REGBUF_STANDARD);
page = BufferGetPage(leafbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
xlrec.leftblk = opaque->btpo_prev;
xlrec.rightblk = opaque->btpo_next;
XLogRegisterData((char *) &xlrec, SizeOfBtreeMarkPageHalfDead);
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_MARK_PAGE_HALFDEAD);
page = BufferGetPage(topparent);
PageSetLSN(page, recptr);
page = BufferGetPage(leafbuf);
PageSetLSN(page, recptr);
}
END_CRIT_SECTION();
_bt_relbuf(rel, topparent);
return true;
}
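The function consists of the following steps (marked in the code comments):
- Step 1: walk up the tree
Step 1 is implemented by _bt_lock_branch_parent. As described above, we cannot delete the rightmost child of a page unless it is that page's only child, in which case we must recurse upward to check whether the parent itself can be deleted. _bt_lock_branch_parent performs this upward recursion. It stops when it reaches a page that is not the rightmost child of its parent (every page on the path can then be deleted), or a page that is the rightmost but not the only child of its parent (no page on the path can be deleted). Let's look at the function's four output parameters first: topparent, topoff, target and rightsib. Assuming the tree looks like Figure 7, their meanings and values are shown in Figure 9:
- topparent: the buffer holding the parent page that _bt_lock_branch_parent finally reaches, i.e. the parent that has more than one child.
- topoff: the offset within topparent of the index tuple holding the downlink to the lower-level page.
- target: the lower-level page that the downlink at topoff points to. This page is the head of the chain shown in Figure 8.
- rightsib: the right sibling of target.
The function is implemented as follows: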
/*
* Subroutine to find the parent of the branch we're deleting. This climbs
* up the tree until it finds a page with more than one child, i.e. a page
* that will not be totally emptied by the deletion. The chain of pages below
* it, with one downlink each, will form the branch that we need to delete.
*
* If we cannot remove the downlink from the parent, because it's the
* rightmost entry, returns false. On success, *topparent and *topoff are set
* to the buffer holding the parent, and the offset of the downlink in it.
* *topparent is write-locked, the caller is responsible for releasing it when
* done. *target is set to the topmost page in the branch to-be-deleted, i.e.
* the page whose downlink *topparent / *topoff point to, and *rightsib to its
* right sibling.
*
* "child" is the leaf page we wish to delete, and "stack" is a search stack
* leading to it (approximately). Note that we will update the stack
* entry(s) to reflect current downlink positions --- this is harmless and
* indeed saves later search effort in _bt_pagedel. The caller should
* initialize *target and *rightsib to the leaf page and its right sibling.
*
* Note: it's OK to release page locks on any internal pages between the leaf
* and *topparent, because a safe deletion can't become unsafe due to
* concurrent activity. An internal page can only acquire an entry if the
* child is split, but that cannot happen as long as we hold a lock on the
* leaf.
*/
static bool
_bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
Buffer *topparent, OffsetNumber *topoff,
BlockNumber *target, BlockNumber *rightsib)
{
BlockNumber parent;
OffsetNumber poffset,
maxoff;
Buffer pbuf;
Page page;
BTPageOpaque opaque;
BlockNumber leftsib;
/*
* Locate the downlink of "child" in the parent (updating the stack entry
* if needed)
*
* Step 1: verify and correct the parent of the current page and its downlink.
*/
ItemPointerSet(&(stack->bts_btentry.t_tid), child, P_HIKEY);
pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
if (pbuf == InvalidBuffer)
elog(ERROR, "failed to re-find parent key in index \"%s\" for deletion target page %u",
RelationGetRelationName(rel), child);
parent = stack->bts_blkno;
poffset = stack->bts_offset;
page = BufferGetPage(pbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
maxoff = PageGetMaxOffsetNumber(page);
/*
* If the target is the rightmost child of its parent, then we can't
* delete, unless it's also the only child.
*
* Step 2: check whether the current page is the rightmost child.
*/
if (poffset >= maxoff)
{
/*
* It's rightmost child...
* Step 3: check whether the current page is the only child.
*/
if (poffset == P_FIRSTDATAKEY(opaque))
{
/*
* It's only child, so safe if parent would itself be removable.
* We have to check the parent itself, and then recurse to test
* the conditions at the parent's parent.
*/
if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) ||
P_INCOMPLETE_SPLIT(opaque))
{
_bt_relbuf(rel, pbuf);
return false;
}
*target = parent;
*rightsib = opaque->btpo_next;
leftsib = opaque->btpo_prev;
// Step 4: unlock the current page
_bt_relbuf(rel, pbuf);
/*
* Like in _bt_pagedel, check that the left sibling is not marked
* with INCOMPLETE_SPLIT flag. That would mean that there is no
* downlink to the page to be deleted, and the page deletion
* algorithm isn't prepared to handle that.
*/
if (leftsib != P_NONE)
{
Buffer lbuf;
Page lpage;
BTPageOpaque lopaque;
lbuf = _bt_getbuf(rel, leftsib, BT_READ);
lpage = BufferGetPage(lbuf);
lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
/*
* If the left sibling was concurrently split, so that its
* next-pointer doesn't point to the current page anymore, the
* split that created the current page must be completed. (We
* don't allow splitting an incompletely split page again
* until the previous split has been completed)
*/
if (lopaque->btpo_next == parent &&
P_INCOMPLETE_SPLIT(lopaque))
{
_bt_relbuf(rel, lbuf);
return false;
}
_bt_relbuf(rel, lbuf);
}
/*
* Perform the same check on this internal level that
* _bt_mark_page_halfdead performed on the leaf level.
*/
if (_bt_is_page_halfdead(rel, *rightsib))
{
elog(DEBUG1, "could not delete page %u because its right sibling %u is half-dead",
parent, *rightsib);
return false;
}
// Step 5: recurse upward
return _bt_lock_branch_parent(rel, parent, stack->bts_parent,
topparent, topoff, target, rightsib);
}
else
{
/* Unsafe to delete
* Not the only child, so it cannot be deleted.
*/
_bt_relbuf(rel, pbuf);
return false;
}
}
else
{
/*
* Not rightmost child, so safe to delete
* Deletable, so return right away.
*/
*topparent = pbuf;
*topoff = poffset;
return true;
}
}
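The function consists of the following steps (marked in the code comments):
- Step 1: verify and correct the parent of the current page and its downlink
This is done by _bt_getstackbuf. Its main purpose is to cope with concurrent splits, which can make the information in the stack stale; see "PostgreSQL B+ Tree Index: Split" for details.
- Step 2: check whether the current page is the rightmost child
If it is, go to step 3; otherwise the page can be deleted and the function returns true.
- Step 3: check whether the current page is the only child
If it is, go to step 4; otherwise the page cannot be deleted and the function returns false.
- Step 4: unlock the current page
Upward recursion follows the same concurrency-control rule as downward recursion: release the current page before locking its parent, which avoids deadlock.
- Step 5: recurse upward
That completes step 1 of _bt_mark_page_halfdead. Its remaining steps are:
- Step 2: remove the downlink
Remove from topparent the downlink that points to the head of the chain. In the Figure 7 example, this removes from page8 the downlink pointing to page6.
- Step 3: mark the leaf page BTP_HALF_DEAD
- Step 4: make the leaf page point to the topmost page
Make the leaf page point to the top of the chain of internal pages, that is, store a pointer to page6 in page3. Concretely, the block number in the tid of page3's high key is set to page6's block number. (If the leaf is the only page being deleted, the tid is set to invalid instead.) This step exists for the sake of redo, which a separate article will cover.
Note
The tid of a high key carries no meaning of its own, so it can be freely repurposed.
- Step 5: write XLOG
This step will be covered in detail later.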
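Page Deletion: Stage 2
After the first stage we have a chain made of internal pages and a leaf page, as shown in Figure 8, and the leaf page page3 stores a pointer to the chain head page6. At this point every page in the chain still keeps its link to its right sibling (in practice possibly to both its left and right siblings). The job of the second stage is to unlink the pages in the chain from their left and right siblings, as shown in Figure 10:
_bt_unlink_halfdead_page implementation
The second stage proceeds as follows:
- Step 1: read the page number of the chain head from the leaf page.
In this example, read page6 from page3.
- Step 2: unlink the chain head from its left and right siblings.
In this example, unlink page6 from its right sibling (page6 has no left sibling).
- Step 3: if the current chain head is not the leaf page, fetch its child and store the child in the leaf page as the new chain head.
In this example the child of page6 is page3 itself, so the new chain head recorded in page3 is page3. Yes, you read that right: page3 now effectively points to itself, so the next round deletes page3. (In the code below this case is encoded by setting the tid of the leaf's high key to invalid rather than to the leaf's own block number.)
- Step 4: mark the chain head BTP_DELETED.
In this example, mark page6 BTP_DELETED.
- Step 5: check whether the leaf page is marked BTP_DELETED. If so, deletion is finished; otherwise go back to step 1.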
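The five steps amount to peeling the branch from the top down, one page per round, with the leaf remembering which page is the current top. The sketch below models only that bookkeeping. `ToyBranchPage`, `toy_unlink_one` and `toy_delete_branch` are hypothetical names, sibling links and locking are omitted, and `NO_PAGE` stands in for the "no remaining parent" marker.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_PAGE (-1)

typedef struct {
    int only_child;  /* single child of an internal branch page; NO_PAGE for a leaf */
    bool deleted;
} ToyBranchPage;

/*
 * One round of stage 2. *leaf_top is the pointer kept in the half-dead leaf:
 * the topmost remaining page of the branch, or NO_PAGE once only the leaf is
 * left. Returns the page unlinked in this round.
 */
static int toy_unlink_one(ToyBranchPage *pages, int leaf, int *leaf_top)
{
    int target = (*leaf_top != NO_PAGE) ? *leaf_top : leaf;  /* step 1 */

    if (target != leaf)
    {
        /* step 3: remember the next page down, or nothing if it is the leaf */
        int next = pages[target].only_child;

        *leaf_top = (next == leaf) ? NO_PAGE : next;
    }
    pages[target].deleted = true;  /* steps 2 and 4, collapsed into one flag */
    return target;
}

/* Step 5: repeat until the leaf itself is gone; returns pages deleted. */
static int toy_delete_branch(ToyBranchPage *pages, int leaf, int leaf_top)
{
    int ndeleted = 0;

    while (!pages[leaf].deleted)
    {
        toy_unlink_one(pages, leaf, &leaf_top);
        ndeleted++;
    }
    return ndeleted;
}
```

For the chain page6 -> page3, the first round removes page6 and clears the leaf's pointer, and the second round removes page3 itself.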
The code of _bt_unlink_halfdead_page is shown below.
First, _bt_pagedel contains a loop (nbtpage.c line:1291):
while (P_ISHALFDEAD(opaque))
{
/* will check for interrupts, once lock is released */
if (!_bt_unlink_halfdead_page(rel, buf, &rightsib_empty))
{
/* _bt_unlink_halfdead_page already released buffer */
return ndeleted;
}
ndeleted++;
}
This loop implements step 5, while _bt_unlink_halfdead_page implements steps 1~4. The code follows, with the corresponding steps marked in the comments:
/*
* Unlink a page in a branch of half-dead pages from its siblings.
*
* If the leaf page still has a downlink pointing to it, unlinks the highest
* parent in the to-be-deleted branch instead of the leaf page. To get rid
* of the whole branch, including the leaf page itself, iterate until the
* leaf page is deleted.
*
* Returns 'false' if the page could not be unlinked (shouldn't happen).
* If the (new) right sibling of the page is empty, *rightsib_empty is set
* to true.
*
* Must hold pin and lock on leafbuf at entry (read or write doesn't matter).
* On success exit, we'll be holding pin and write lock. On failure exit,
* we'll release both pin and lock before returning (we define it that way
* to avoid having to reacquire a lock we already released).
*/
static bool
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
{
BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
BlockNumber leafleftsib;
BlockNumber leafrightsib;
BlockNumber target;
BlockNumber leftsib;
BlockNumber rightsib;
Buffer lbuf = InvalidBuffer;
Buffer buf;
Buffer rbuf;
Buffer metabuf = InvalidBuffer;
Page metapg = NULL;
BTMetaPageData *metad = NULL;
ItemId itemid;
Page page;
BTPageOpaque opaque;
bool rightsib_is_rightmost;
int targetlevel;
ItemPointer leafhikey;
BlockNumber nextchild;
page = BufferGetPage(leafbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
Assert(P_ISLEAF(opaque) && P_ISHALFDEAD(opaque));
/*
* Remember some information about the leaf page.
*/
itemid = PageGetItemId(page, P_HIKEY);
leafhikey = &((IndexTuple) PageGetItem(page, itemid))->t_tid;
leafleftsib = opaque->btpo_prev;
leafrightsib = opaque->btpo_next;
LockBuffer(leafbuf, BUFFER_LOCK_UNLOCK);
/*
* Check here, as calling loops will have locks held, preventing
* interrupts from being processed.
*/
CHECK_FOR_INTERRUPTS();
/*
* If the leaf page still has a parent pointing to it (or a chain of
* parents), we don't unlink the leaf page yet, but the topmost remaining
* parent in the branch. Set 'target' and 'buf' to reference the page
* actually being unlinked.
*
* Step 1: read the page number of the chain head from the leaf page.
*/
if (ItemPointerIsValid(leafhikey))
{
target = ItemPointerGetBlockNumber(leafhikey);
Assert(target != leafblkno);
/* fetch the block number of the topmost parent's left sibling */
buf = _bt_getbuf(rel, target, BT_READ);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
leftsib = opaque->btpo_prev;
targetlevel = opaque->btpo.level;
/*
* To avoid deadlocks, we'd better drop the target page lock before
* going further.
*/
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
}
else
{
target = leafblkno;
buf = leafbuf;
leftsib = leafleftsib;
targetlevel = 0;
}
/*
* We have to lock the pages we need to modify in the standard order:
* moving right, then up. Else we will deadlock against other writers.
*
* So, first lock the leaf page, if it's not the target. Then find and
* write-lock the current left sibling of the target page. The sibling
* that was current a moment ago could have split, so we may have to move
* right. This search could fail if either the sibling or the target page
* was deleted by someone else meanwhile; if so, give up. (Right now,
* that should never happen, since page deletion is only done in VACUUM
* and there shouldn't be multiple VACUUMs concurrently on the same
* table.)
*/
if (target != leafblkno)
LockBuffer(leafbuf, BT_WRITE);
if (leftsib != P_NONE)
{
lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
page = BufferGetPage(lbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
while (P_ISDELETED(opaque) || opaque->btpo_next != target)
{
/* step right one page */
leftsib = opaque->btpo_next;
_bt_relbuf(rel, lbuf);
/*
* It'd be good to check for interrupts here, but it's not easy to
* do so because a lock is always held. This block isn't
* frequently reached, so hopefully the consequences of not
* checking interrupts aren't too bad.
*/
if (leftsib == P_NONE)
{
elog(LOG, "no left sibling (concurrent deletion?) of block %u in \"%s\"",
target,
RelationGetRelationName(rel));
if (target != leafblkno)
{
/* we have only a pin on target, but pin+lock on leafbuf */
ReleaseBuffer(buf);
_bt_relbuf(rel, leafbuf);
}
else
{
/* we have only a pin on leafbuf */
ReleaseBuffer(leafbuf);
}
return false;
}
lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
page = BufferGetPage(lbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
}
}
else
lbuf = InvalidBuffer;
/*
* Next write-lock the target page itself. It should be okay to take just
* a write lock not a superexclusive lock, since no scans would stop on an
* empty page.
*/
LockBuffer(buf, BT_WRITE);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
/*
* Check page is still empty etc, else abandon deletion. This is just for
* paranoia's sake; a half-dead page cannot resurrect because there can be
* only one vacuum process running at a time.
*/
if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque))
{
elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
target, RelationGetRelationName(rel));
}
if (opaque->btpo_prev != leftsib)
elog(ERROR, "left link changed unexpectedly in block %u of index \"%s\"",
target, RelationGetRelationName(rel));
if (target == leafblkno)
{
if (P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page) ||
!P_ISLEAF(opaque) || !P_ISHALFDEAD(opaque))
elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
target, RelationGetRelationName(rel));
nextchild = InvalidBlockNumber;
}
else
{
if (P_FIRSTDATAKEY(opaque) != PageGetMaxOffsetNumber(page) ||
P_ISLEAF(opaque))
elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
target, RelationGetRelationName(rel));
/* remember the next non-leaf child down in the branch. */
itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
nextchild = ItemPointerGetBlockNumber(&((IndexTuple) PageGetItem(page, itemid))->t_tid);
if (nextchild == leafblkno)
nextchild = InvalidBlockNumber;
}
/*
* And next write-lock the (current) right sibling.
*/
rightsib = opaque->btpo_next;
rbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
page = BufferGetPage(rbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (opaque->btpo_prev != target)
elog(ERROR, "right sibling's left-link doesn't match: "
"block %u links to %u instead of expected %u in index \"%s\"",
rightsib, opaque->btpo_prev, target,
RelationGetRelationName(rel));
rightsib_is_rightmost = P_RIGHTMOST(opaque);
*rightsib_empty = (P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page));
/*
* If we are deleting the next-to-last page on the target's level, then
* the rightsib is a candidate to become the new fast root. (In theory, it
* might be possible to push the fast root even further down, but the odds
* of doing so are slim, and the locking considerations daunting.)
*
* We don't support handling this in the case where the parent is becoming
* half-dead, even though it theoretically could occur.
*
* We can safely acquire a lock on the metapage here --- see comments for
* _bt_newroot().
*/
if (leftsib == P_NONE && rightsib_is_rightmost)
{
page = BufferGetPage(rbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (P_RIGHTMOST(opaque))
{
/* rightsib will be the only one left on the level */
metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
/*
* The expected case here is btm_fastlevel == targetlevel+1; if
* the fastlevel is <= targetlevel, something is wrong, and we
* choose to overwrite it to fix it.
*/
if (metad->btm_fastlevel > targetlevel + 1)
{
/* no update wanted */
_bt_relbuf(rel, metabuf);
metabuf = InvalidBuffer;
}
}
}
/*
* Here we begin doing the deletion.
*/
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
/*
* Update siblings' side-links. Note the target page's side-links will
* continue to point to the siblings. Asserts here are just rechecking
* things we already verified above.
*
* Step 2: unlink the chain head from its left and right siblings.
*/
if (BufferIsValid(lbuf))
{
page = BufferGetPage(lbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
Assert(opaque->btpo_next == target);
opaque->btpo_next = rightsib;
}
page = BufferGetPage(rbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
Assert(opaque->btpo_prev == target);
opaque->btpo_prev = leftsib;
/*
* If we deleted a parent of the targeted leaf page, instead of the leaf
* itself, update the leaf to point to the next remaining child in the
* branch.
*
* Step 3: if the current chain head is not the leaf page, fetch its child and store it in the leaf page as the new chain head.
*/
if (target != leafblkno)
{
if (nextchild == InvalidBlockNumber)
ItemPointerSetInvalid(leafhikey);
else
ItemPointerSet(leafhikey, nextchild, P_HIKEY);
}
/*
* Mark the page itself deleted. It can be recycled when all current
* transactions are gone. Storing GetTopTransactionId() would work, but
* we're in VACUUM and would not otherwise have an XID. Having already
* updated links to the target, ReadNewTransactionId() suffices as an
* upper bound. Any scan having retained a now-stale link is advertising
* in its PGXACT an xmin less than or equal to the value we read here. It
* will continue to do so, holding back RecentGlobalXmin, for the duration
* of that scan.
*
* Step 4: mark the chain head BTP_DELETED.
*/
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
opaque->btpo_flags &= ~BTP_HALF_DEAD;
opaque->btpo_flags |= BTP_DELETED;
opaque->btpo.xact = ReadNewTransactionId();
/* And update the metapage, if needed */
if (BufferIsValid(metabuf))
{
metad->btm_fastroot = rightsib;
metad->btm_fastlevel = targetlevel;
MarkBufferDirty(metabuf);
}
/* Must mark buffers dirty before XLogInsert */
MarkBufferDirty(rbuf);
MarkBufferDirty(buf);
if (BufferIsValid(lbuf))
MarkBufferDirty(lbuf);
if (target != leafblkno)
MarkBufferDirty(leafbuf);
/* XLOG stuff */
if (RelationNeedsWAL(rel))
{
xl_btree_unlink_page xlrec;
xl_btree_metadata xlmeta;
uint8 xlinfo;
XLogRecPtr recptr;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT);
if (BufferIsValid(lbuf))
XLogRegisterBuffer(1, lbuf, REGBUF_STANDARD);
XLogRegisterBuffer(2, rbuf, REGBUF_STANDARD);
if (target != leafblkno)
XLogRegisterBuffer(3, leafbuf, REGBUF_WILL_INIT);
/* information on the unlinked block */
xlrec.leftsib = leftsib;
xlrec.rightsib = rightsib;
xlrec.btpo_xact = opaque->btpo.xact;
/* information needed to recreate the leaf block (if not the target) */
xlrec.leafleftsib = leafleftsib;
xlrec.leafrightsib = leafrightsib;
xlrec.topparent = nextchild;
XLogRegisterData((char *) &xlrec, SizeOfBtreeUnlinkPage);
if (BufferIsValid(metabuf))
{
XLogRegisterBuffer(4, metabuf, REGBUF_WILL_INIT);
xlmeta.root = metad->btm_root;
xlmeta.level = metad->btm_level;
xlmeta.fastroot = metad->btm_fastroot;
xlmeta.fastlevel = metad->btm_fastlevel;
XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
}
else
xlinfo = XLOG_BTREE_UNLINK_PAGE;
recptr = XLogInsert(RM_BTREE_ID, xlinfo);
if (BufferIsValid(metabuf))
{
PageSetLSN(metapg, recptr);
}
page = BufferGetPage(rbuf);
PageSetLSN(page, recptr);
page = BufferGetPage(buf);
PageSetLSN(page, recptr);
if (BufferIsValid(lbuf))
{
page = BufferGetPage(lbuf);
PageSetLSN(page, recptr);
}
if (target != leafblkno)
{
page = BufferGetPage(leafbuf);
PageSetLSN(page, recptr);
}
}
END_CRIT_SECTION();
/* release metapage */
if (BufferIsValid(metabuf))
_bt_relbuf(rel, metabuf);
/* release siblings */
if (BufferIsValid(lbuf))
_bt_relbuf(rel, lbuf);
_bt_relbuf(rel, rbuf);
/*
* Release the target, if it was not the leaf block. The leaf is always
* kept locked.
*/
if (target != leafblkno)
_bt_relbuf(rel, buf);
return true;
}
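Open question about _bt_pagedel
Now let's settle the question left open earlier: why does _bt_pagedel need a loop? Here is the skeleton again, this time with the comments from the source: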
for (;;)
{
//...省略
rightsib = opaque->btpo_next;
_bt_relbuf(rel, buf);
/*
* Check here, as calling loops will have locks held, preventing
* interrupts from being processed.
*/
CHECK_FOR_INTERRUPTS();
/*
* The page has now been deleted. If its right sibling is completely
* empty, it's possible that the reason we haven't deleted it earlier
* is that it was the rightmost child of the parent. Now that we
* removed the downlink for this page, the right sibling might now be
* the only child of the parent, and could be removed. It would be
* picked up by the next vacuum anyway, but might as well try to
* remove it now, so loop back to process the right sibling.
*/
if (!rightsib_empty)
break;
buf = _bt_getbuf(rel, rightsib, BT_WRITE);
}
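The comment explains the reason clearly. The rightmost child of a page cannot be deleted until it is the page's only remaining child. So each time a page is deleted, _bt_pagedel checks whether its right sibling is completely empty; if it is, that sibling may have just become the only child of its parent, and the loop tries to delete it as well. Without this check and loop, the rightmost page would have to wait for a later VACUUM to pick it up.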
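The retry loop can be sketched on a toy "family" of children under one parent. The model assumes a child may go away if it is empty and either is not the rightmost remaining child, or is the only remaining child of a parent that could itself be removed. `ToyFamily`, `toy_try_delete` and `toy_pagedel` are hypothetical names, and the `parent_removable` flag stands in for the whole upward check.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NCHILD 4

typedef struct {
    bool empty[TOY_NCHILD];  /* child i holds no index tuples */
    bool live[TOY_NCHILD];   /* child i still has a downlink in the parent */
    bool parent_removable;   /* assumed: the parent itself could be deleted */
} ToyFamily;

static int toy_live_count(const ToyFamily *f)
{
    int n = 0;

    for (int i = 0; i < TOY_NCHILD; i++)
        if (f->live[i])
            n++;
    return n;
}

static bool toy_is_rightmost_live(const ToyFamily *f, int i)
{
    for (int j = i + 1; j < TOY_NCHILD; j++)
        if (f->live[j])
            return false;
    return true;
}

static bool toy_try_delete(ToyFamily *f, int i)
{
    if (!f->live[i] || !f->empty[i])
        return false;
    if (toy_is_rightmost_live(f, i) &&
        !(toy_live_count(f) == 1 && f->parent_removable))
        return false;  /* rightmost child that is not a removable only child */
    f->live[i] = false;
    return true;
}

/* Delete `start`, then keep going right while the right sibling is empty. */
static int toy_pagedel(ToyFamily *f, int start)
{
    int ndeleted = 0;

    for (int i = start;;)
    {
        if (!toy_try_delete(f, i))
            break;
        ndeleted++;
        int right = i + 1;
        if (right >= TOY_NCHILD || !f->live[right] || !f->empty[right])
            break;  /* plays the role of the rightsib_empty test */
        i = right;
    }
    return ndeleted;
}
```

With two empty children, starting at the right one deletes nothing, while starting at the left one deletes both in a single call, which is the situation the source comment describes.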
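Concurrency Control
Finally, let's walk through the concurrency control of page deletion:
- A super-exclusive lock is taken before a page is deleted (code location: nbtpage.c line:956)
So by the time _bt_pagedel is called, the page to delete is already held under a super-exclusive lock. While the lock is held no other backend can lock the page, and the lock is granted only once no other backend holds a pin on it.
Super-exclusive locks are described in detail in "PostgreSQL Blink-tree ReadMe: Translation" and "PostgreSQL Buffer ReadMe: Translation".
- Stage 1 of deletion
- Stage 1 calls _bt_lock_branch_parent to walk up the tree. During the walk, the lock on the current page is released before the next parent is write-locked, so when the walk finishes, the parent of the chain head in Figure 8 (topparent) holds a write lock. (Code location: see _bt_getstackbuf (lock) and _bt_relbuf (unlock) in _bt_lock_branch_parent.)
- After the parent-child link between the chain head and its parent is severed, the write lock on topparent is released (code location: nbtpage.c line:1504).
- Stage 2 of deletion
- Release the write lock on the leaf page. (code location: nbtpage.c line:1561)
- If the page to delete is not the leaf page, write-lock the leaf page. (code location: nbtpage.c line:1616)
- Write-lock the left sibling of the page to delete. (code location: nbtpage.c line:1617~1659)
- Write-lock the page to delete. (code location: nbtpage.c line:1666)
- Write-lock the right sibling of the page to delete. (code location: nbtpage.c line:1710)
- Unlock the left sibling of the page to delete. (code location: nbtpage.c line:1901)
- Unlock the right sibling of the page to delete. (code location: nbtpage.c line:1902)
- If the page to delete is not the leaf page, unlock the page to delete. (code location: nbtpage.c line:1909)
This sequence is a bit involved and needs some explanation. In the second stage we sever the links between the page to delete and its left and right siblings, so we must hold locks on the page and both siblings at the same time. To avoid deadlock, PostgreSQL requires that multiple pages be locked only from left to right or from bottom to top (never from right to left or from top to bottom). Steps 3~5 of the list above (left sibling, page to delete, right sibling) implement exactly this left-to-right locking.
What are steps 1 and 2 for, then? Suppose they did not exist. When the first stage ends, the leaf page is still write-locked. If the page to delete is the leaf page itself (the usual case), going straight to step 3 in the second stage would mean locking the left sibling while holding a lock on the current page, which is right-to-left locking and is not allowed. So the leaf page must be unlocked first, and if the page to delete later turns out not to be the leaf page, the leaf page is locked again. That is what steps 1 and 2 do.
- Release the lock on the deleted page (code location: called from several places; see the _bt_relbuf calls in _bt_pagedel)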
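The stage-2 locking order described above can be written down as a small planning function: lock the leaf first when it is not the target, then the left sibling, the target, and the right sibling. This is only an illustration of the ordering. `ToyLockPlan` and `toy_stage2_lock_plan` are hypothetical names, and no real locks are taken.

```c
#include <assert.h>

#define NO_PAGE (-1)
#define TOY_MAX_LOCKS 4

typedef struct {
    int pages[TOY_MAX_LOCKS];  /* page numbers in the order they get locked */
    int n;
} ToyLockPlan;

/*
 * Build the write-lock order for one round of stage 2. The leaf is assumed
 * to be unlocked on entry, so it is re-locked first when it is not the
 * target; the remaining locks go strictly left to right.
 */
static ToyLockPlan toy_stage2_lock_plan(int leaf, int target,
                                        int leftsib, int rightsib)
{
    ToyLockPlan plan = { {0}, 0 };

    if (target != leaf)
        plan.pages[plan.n++] = leaf;     /* bottom level first */
    if (leftsib != NO_PAGE)
        plan.pages[plan.n++] = leftsib;  /* then left ... */
    plan.pages[plan.n++] = target;       /* ... target ... */
    plan.pages[plan.n++] = rightsib;     /* ... and right */
    return plan;
}
```

When the target is the leaf, the plan is simply left sibling, leaf, right sibling; when the target is an internal page of the branch, the leaf comes first and the target's level follows.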