PostgreSQL B+ Tree Index: Page Deletion

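Prerequisites

PostgreSQL Blink-tree ReadMe (translation)

PostgreSQL Buffer ReadMe (translation)

PostgreSQL Vacuum: Index Deletion

PostgreSQL B+ Tree Index: Splits

Overview

This article covers the last part of the PostgreSQL B*-tree index: page deletion. The topic was already touched on in "PostgreSQL Blink-tree ReadMe (translation)", but that README covers a lot of ground, so this article walks through the page-deletion logic again and analyzes, at the source level, how PostgreSQL implements it.
In PostgreSQL, VACUUM physically removes index tuples from the B*-tree. Once every tuple on an index page has been removed, that is, once the page is empty, VACUUM deletes the page from the B*-tree. The idea behind PostgreSQL's page deletion comes from Lanin and Shasha's paper "A Symmetric Concurrent B-Tree Algorithm". The core of that paper, however, is how a B*-tree merges pages (when pages hold few enough entries, how two pages are combined into one), and PostgreSQL's B*-tree does not support page merging. The reasons are explained in "PostgreSQL Blink-tree ReadMe (translation)" and are not repeated here.

Page deletion is implemented by the function _bt_pagedel, which is reached through the following call stack: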

ExecVacuum > vacuum > vacuum_rel > lazy_vacuum_rel > lazy_scan_heap > lazy_cleanup_index > index_vacuum_cleanup > btvacuumcleanup > btvacuumscan > btvacuumpage > _bt_pagedel
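(Set a breakpoint in _bt_pagedel, then run VACUUM from a client to hit the breakpoint and inspect the call stack.)

Our story starts with _bt_pagedel. The idea behind it is simple, so let's begin with a diagram:

Figure 1
As shown in Figure 1, suppose we want to delete page2. The deletion proceeds in two phases:
  1. Phase 1:
    Remove the downlink to page2 from page6 and mark page2 as half-dead, as shown in Figure 2.

    Figure 2

    This phase is implemented by _bt_mark_page_halfdead.

  2. Phase 2:
    Unlink page2 from its left and right siblings and mark page2 as deleted, as shown in Figure 3.

    Figure 3

    This phase is implemented by _bt_unlink_halfdead_page.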
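
The two phases can be pictured as a tiny state machine. The sketch below is only an illustration under simplifying assumptions: ToyPage, PageState and the two toy_ functions are hypothetical names rather than PostgreSQL code, and locking and WAL are ignored entirely. It shows one thing: a page has to pass through the half-dead state before it can become deleted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical page states for illustration. PostgreSQL itself records
 * them as flag bits (BTP_HALF_DEAD, BTP_DELETED) in btpo_flags. */
typedef enum { PAGE_LIVE, PAGE_HALF_DEAD, PAGE_DELETED } PageState;

typedef struct {
    PageState state;
    bool has_downlink;      /* the parent still points to this page */
    bool in_sibling_chain;  /* left/right siblings still point to this page */
} ToyPage;

/* Phase 1: drop the parent's downlink and mark the page half-dead. */
bool toy_mark_halfdead(ToyPage *p)
{
    if (p->state != PAGE_LIVE)
        return false;
    p->has_downlink = false;
    p->state = PAGE_HALF_DEAD;
    return true;
}

/* Phase 2: unlink the page from its siblings and mark it deleted.
 * Only legal once phase 1 has run. */
bool toy_unlink_halfdead(ToyPage *p)
{
    if (p->state != PAGE_HALF_DEAD)
        return false;
    p->in_sibling_chain = false;
    p->state = PAGE_DELETED;
    return true;
}
```

Between the two calls the page is unreachable from its parent but still reachable through the sibling links, which is exactly the situation Figure 2 depicts.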

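Many details hide beneath this simple idea. This article discusses B*-tree page deletion from two angles:

  1. The flow and implementation details of the deletion
  2. Concurrency control for the deletion

_bt_pagedel

_bt_pagedel is the entry point of the whole page-deletion operation, so we look at it first.

  1. Function declaration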

    int _bt_pagedel(Relation rel, Buffer buf)
    

    Here rel is the index relation, and buf is the buffer holding the page to delete.

  2. Function skeleton

    for (;;)
    {
        // ... omitted
    
        rightsib = opaque->btpo_next;
    
        _bt_relbuf(rel, buf);
        
        CHECK_FOR_INTERRUPTS();
    
        if (!rightsib_empty)
            break;
    
        buf = _bt_getbuf(rel, rightsib, BT_WRITE);
    }
    

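    This loop is the first thing you see inside _bt_pagedel, and it raises a question: judging by its declaration, _bt_pagedel deletes one specific page, so why does it need a loop? We set the question aside for now and answer it later.

  3. Finding the parent of the page to delete

    As described in the overview, phase 1 has to sever the link between the page to delete and its parent, so we first need to find out which page the parent is. To do that, the code takes the high key of the page to delete and descends the B*-tree from the root searching for that key. The search returns a stack that records the parent pages along the path. The code is:

    // Location: nbtpage.c, line 1211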
    if (!stack)
    {
        ScanKey		itup_scankey;
        ItemId		itemid;
        IndexTuple	targetkey;
        Buffer		lbuf;
        BlockNumber leftsib;
    
        // Fetch the page's high key
        itemid = PageGetItemId(page, P_HIKEY);
        targetkey = CopyIndexTuple((IndexTuple) PageGetItem(page, itemid));
    
        leftsib = opaque->btpo_prev;
    
        /*
         * To avoid deadlocks, we'd better drop the leaf page lock
         * before going further.
         *
         * Unlock the page so that the search below cannot deadlock.
         */
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    
        /*
      	 * Fetch the left sibling, to check that it's not marked with
      	 * INCOMPLETE_SPLIT flag.  That would mean that the page
      	 * to-be-deleted doesn't have a downlink, and the page
      	 * deletion algorithm isn't prepared to handle that.
      	 */
        if (!P_LEFTMOST(opaque))
        {
            BTPageOpaque lopaque;
            Page		lpage;
    
            lbuf = _bt_getbuf(rel, leftsib, BT_READ);
            lpage = BufferGetPage(lbuf);
            lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
    
            /*
      		 * If the left sibling is split again by another backend,
      		 * after we released the lock, we know that the first
      		 * split must have finished, because we don't allow an
      		 * incompletely-split page to be split again.  So we don't
      		 * need to walk right here.
      		 */
            if (lopaque->btpo_next == BufferGetBlockNumber(buf) &&
                P_INCOMPLETE_SPLIT(lopaque))
            {
                ReleaseBuffer(buf);
                _bt_relbuf(rel, lbuf);
                return ndeleted;
            }
            _bt_relbuf(rel, lbuf);
        }
    
        /*
         * we need an insertion scan key for the search, so build one 
         * Convert the high key into a scan key for the descent.
         */
        itup_scankey = _bt_mkscankey(rel, targetkey);
        /* find the leftmost leaf page containing this key */
        stack = _bt_search(rel, rel->rd_rel->relnatts, itup_scankey,
                           false, &lbuf, BT_READ, NULL);
        /* don't need a pin on the page */
        _bt_relbuf(rel, lbuf);
    
        /*
      	 * Re-lock the leaf page, and start over, to re-check that the
      	 * page can still be deleted.
      	 */
        LockBuffer(buf, BT_WRITE);
        continue;
    }
    

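Phase 1 of page deletion

Question 1

The overview sketched the two phases in the crudest possible way. Looking back at Figures 2 and 3, though, a problem appears: if the B*-tree looks like Figure 3 after page2 is deleted, and we now want to insert 30, where should 30 go?

  • Into page1?

    30 is greater than page1's high key, so inserting it into page1 would require changing page1's high key to 30.

  • Into page3?

    30 is less than page3's min key, so inserting it into page3 would require changing the 51 in page6 to 30.

So inserting 30 into either page incurs extra work. The cause is that we deleted page2 so crudely that page1's high key and page3's min key are no longer equal, and that equality is an invariant PostgreSQL's B*-tree implementation must preserve.

To maintain it, when a page splits PostgreSQL deliberately sets the left page's high key to the right page's min key.

To solve the problem, when PostgreSQL removes page2 from the parent it does not simply shift 51 left over 21. Instead it takes the following steps:

  • Make the 21 entry in page6 point to page3, the right sibling of page2.

  • Delete the index tuple 51 to the right of 21 in page6 (the tuple that used to point to page3).

After these two steps the B*-tree looks like Figure 4:

Figure 4

PostgreSQL has a more technical name for these two steps: merging the key space. In this way the key space of page2 (21~51) is merged into the key space of page3 (51~63).

The code for these steps is at nbtpage.c, lines 1429~1443.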
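
The two steps above can be sketched on a plain array that stands in for the parent page. This is a minimal model, not PostgreSQL code: ToyDownlink and toy_merge_key_space are hypothetical names, and the key 0 used for page1 in the usage note is an assumption, since the figures do not show it.

```c
#include <assert.h>
#include <stddef.h>

/* One entry of a toy parent page: a separator key plus a downlink. */
typedef struct {
    int key;    /* lower bound of the child's key space */
    int child;  /* block number of the child page */
} ToyDownlink;

/*
 * Remove the child at position off from the parent by merging its key
 * space into its right sibling. Returns 0 on success, or -1 when the
 * entry is the parent's rightmost one and has no following entry.
 */
int toy_merge_key_space(ToyDownlink *items, size_t *nitems, size_t off)
{
    if (off + 1 >= *nitems)
        return -1;

    /* Step 1: the target's entry keeps its key but now points to the
     * right sibling. */
    items[off].child = items[off + 1].child;

    /* Step 2: delete the following entry by shifting the tail left. */
    for (size_t i = off + 1; i + 1 < *nitems; i++)
        items[i] = items[i + 1];
    (*nitems)--;
    return 0;
}
```

Starting from a page6 that holds {0 -> page1, 21 -> page2, 51 -> page3}, calling toy_merge_key_space on offset 1 leaves {0 -> page1, 21 -> page3}: the separator 21 survives, so page1's high key still matches the lower bound of its right neighbour.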

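Question 2

Now consider another question: can page3 be deleted? The most direct way to answer is to try it and see what happens, as in Figure 5:

Figure 5

Following the procedure above, we would make the 51 entry in page6 point to page4, the right sibling of page3, and then delete 63 from page7, ending up with Figure 6:

Figure 6

It is easy to see that we deleted 63 from page7, but the key in page8 that points to page7 is still 63, so the two levels are now inconsistent. If we insert 70, page8 routes it to page7, where we find that 70 is smaller than page7's min key, and we are in trouble again.

The cause is that deleting 63 makes page6's high key differ from page7's min key. In the earlier example (Figure 2) we could merge the key spaces of page2 and page3 because they share the same parent (page6). Now we want to merge the key spaces of page3 and page4, but their parents differ, so the merge would change the key spaces of page6 and page7. To merge page3 and page4 anyway, we would have to modify the bounding keys of page6 and page7 (siblings and cousins really are different). Modifying a parent's bounding key is expensive and can cascade upward (for example, the 63 in page8 would have to become 85), so PostgreSQL does not support it.

For this reason PostgreSQL does not delete the rightmost child of a parent, unless it is the parent's only child, because in that case the parent is about to be deleted as well. There is one special case: page5. page5 is the rightmost leaf of the whole tree. To delete page5, it must be the only child of page7, and to delete page7, page7 must be the only child of page8, so page5 could be deleted only when the whole tree is empty. In the code listed later, _bt_mark_page_halfdead asserts !P_RIGHTMOST(opaque), so in practice the rightmost page of a level is not deleted through this path at all.

So before deleting a page we must decide whether it can be deleted. There are three possible outcomes:

  1. The page is not the rightmost child of its parent: it can be deleted.
  2. The page is the rightmost child of its parent but not the only child: it cannot be deleted.
  3. The page is the rightmost child of its parent and also the only child: the parent has to be deleted along with it, so we recurse upward to check whether the parent can be deleted.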
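
The three outcomes can be expressed as a short upward climb. The following is a simplified sketch, assuming a tree stored as an array in which every node knows its parent, its position among the parent's children, and its own child count. ToyNode and toy_find_branch are hypothetical names, and the sketch leaves out locking, incomplete splits and half-dead siblings.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NO_PARENT (-1)

typedef struct {
    int parent;     /* index of the parent, TOY_NO_PARENT for the root */
    int pos;        /* 0-based position among the parent's children */
    int nchildren;  /* number of children of this node (0 for a leaf) */
} ToyNode;

/*
 * Decide whether leaf can be deleted. On success, *target is the topmost
 * node of the branch that goes away and *topparent is the node whose
 * downlink to *target must be removed.
 */
bool toy_find_branch(const ToyNode *tree, int leaf,
                     int *topparent, int *target)
{
    int child = leaf;

    for (;;) {
        int parent = tree[child].parent;

        if (parent == TOY_NO_PARENT)
            return false;               /* climbed to the root: give up */

        bool rightmost = (tree[child].pos == tree[parent].nchildren - 1);

        if (!rightmost) {               /* outcome 1: deletable */
            *topparent = parent;
            *target = child;
            return true;
        }
        if (tree[parent].nchildren > 1) /* outcome 2: not deletable */
            return false;

        child = parent;                 /* outcome 3: only child, climb */
    }
}
```

For a three-level tree whose root has two children, the first of which has a single leaf child, the function reports that deleting that leaf takes its parent with it, while the rightmost leaf under the second child is refused.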

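In short, if a page can be deleted, we may delete just that one page or a whole chain of pages, and both cases can be handled by the same logic. Consider this situation:

Figure 7

Suppose we want to delete page3 in Figure 7. page3 is the rightmost and only child of page6, so we recurse upward to check whether page6 can be deleted. page6 is not the rightmost child of page8, so it can be deleted as well, and in the end both page3 and page6 have to be removed from the B*-tree. In phase 1, what we do is remove the downlink to page6 from page8 (severing the parent-child link between page8 and page6) and then mark page3 as half-dead.

Note:

Whether we delete a single leaf page or a chain of pages, phase 1 always marks the leaf page as half-dead.

After this step, page6 and page3 form a chain, as shown in Figure 8:

Figure 8

page6 and page3 in the chain are still linked to their right siblings. Those links are removed in phase 2.

_bt_mark_page_halfdead implementation

Now let's look at the code for phase 1, which is implemented by _bt_mark_page_halfdead: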

static bool
_bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
{
	BlockNumber leafblkno;
	BlockNumber leafrightsib;
	BlockNumber target;
	BlockNumber rightsib;
	ItemId		itemid;
	Page		page;
	BTPageOpaque opaque;
	Buffer		topparent;
	OffsetNumber topoff;
	OffsetNumber nextoffset;
	IndexTuple	itup;
	IndexTupleData trunctuple;

	page = BufferGetPage(leafbuf);
	opaque = (BTPageOpaque) PageGetSpecialPointer(page);

	Assert(!P_RIGHTMOST(opaque) && !P_ISROOT(opaque) && !P_ISDELETED(opaque) &&
		   !P_ISHALFDEAD(opaque) && P_ISLEAF(opaque) &&
		   P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page));

	/*
	 * Save info about the leaf page.
	 */
	leafblkno = BufferGetBlockNumber(leafbuf);
	leafrightsib = opaque->btpo_next;

	/*
	 * Before attempting to lock the parent page, check that the right sibling
	 * is not in half-dead state.  A half-dead right sibling would have no
	 * downlink in the parent, which would be highly confusing later when we
	 * delete the downlink that follows the current page's downlink. (I
	 * believe the deletion would work correctly, but it would fail the
	 * cross-check we make that the following downlink points to the right
	 * sibling of the delete page.)
	 */
	if (_bt_is_page_halfdead(rel, leafrightsib))
	{
		elog(DEBUG1, "could not delete page %u because its right sibling %u is half-dead",
			 leafblkno, leafrightsib);
		return false;
	}

	/*
	 * We cannot delete a page that is the rightmost child of its immediate
	 * parent, unless it is the only child --- in which case the parent has to
	 * be deleted too, and the same condition applies recursively to it. We
	 * have to check this condition all the way up before trying to delete,
	 * and lock the final parent of the to-be-deleted branch.
	 *
	 * Step 1: climb upward.
	 */
	rightsib = leafrightsib;
	target = leafblkno;
	if (!_bt_lock_branch_parent(rel, leafblkno, stack,
								&topparent, &topoff, &target, &rightsib))
		return false;

	/*
	 * Check that the parent-page index items we're about to delete/overwrite
	 * contain what we expect.  This can fail if the index has become corrupt
	 * for some reason.  We want to throw any error before entering the
	 * critical section --- otherwise it'd be a PANIC.
	 *
	 * The test on the target item is just an Assert because
	 * _bt_lock_branch_parent should have guaranteed it has the expected
	 * contents.  The test on the next-child downlink is known to sometimes
	 * fail in the field, though.
	 */
	page = BufferGetPage(topparent);
	opaque = (BTPageOpaque) PageGetSpecialPointer(page);

#ifdef USE_ASSERT_CHECKING
	itemid = PageGetItemId(page, topoff);
	itup = (IndexTuple) PageGetItem(page, itemid);
	Assert(ItemPointerGetBlockNumber(&(itup->t_tid)) == target);
#endif

	nextoffset = OffsetNumberNext(topoff);
	itemid = PageGetItemId(page, nextoffset);
	itup = (IndexTuple) PageGetItem(page, itemid);
	if (ItemPointerGetBlockNumber(&(itup->t_tid)) != rightsib)
		elog(ERROR, "right sibling %u of block %u is not next child %u of block %u in index \"%s\"",
			 rightsib, target, ItemPointerGetBlockNumber(&(itup->t_tid)),
			 BufferGetBlockNumber(topparent), RelationGetRelationName(rel));

	/*
	 * Any insert which would have gone on the leaf block will now go to its
	 * right sibling.
	 */
	PredicateLockPageCombine(rel, leafblkno, leafrightsib);

	/* No ereport(ERROR) until changes are logged */
	START_CRIT_SECTION();

	/*
	 * Update parent.  The normal case is a tad tricky because we want to
	 * delete the target's downlink and the *following* key.  Easiest way is
	 * to copy the right sibling's downlink over the target downlink, and then
	 * delete the following item.
	 *
	 * Step 2: delete the downlink.
	 */
	page = BufferGetPage(topparent);
	opaque = (BTPageOpaque) PageGetSpecialPointer(page);

	itemid = PageGetItemId(page, topoff);
	itup = (IndexTuple) PageGetItem(page, itemid);
	ItemPointerSet(&(itup->t_tid), rightsib, P_HIKEY);

	nextoffset = OffsetNumberNext(topoff);
	PageIndexTupleDelete(page, nextoffset);

	/*
	 * Mark the leaf page as half-dead, and stamp it with a pointer to the
	 * highest internal page in the branch we're deleting.  We use the tid of
	 * the high key to store it.
	 *
	 * Step 3: mark the leaf page as BTP_HALF_DEAD.
	 */
	page = BufferGetPage(leafbuf);
	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
	opaque->btpo_flags |= BTP_HALF_DEAD;

	/* Step 4: make the leaf page point to the top page of the branch */
	PageIndexTupleDelete(page, P_HIKEY);
	Assert(PageGetMaxOffsetNumber(page) == 0);
	MemSet(&trunctuple, 0, sizeof(IndexTupleData));
	trunctuple.t_info = sizeof(IndexTupleData);
	if (target != leafblkno)
		ItemPointerSet(&trunctuple.t_tid, target, P_HIKEY);
	else
		ItemPointerSetInvalid(&trunctuple.t_tid);
	if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData), P_HIKEY,
					false, false) == InvalidOffsetNumber)
		elog(ERROR, "could not add dummy high key to half-dead page");

	/* 
	 * Must mark buffers dirty before XLogInsert 
	 * Step 5: write XLOG.
	 */
    
	MarkBufferDirty(topparent);
	MarkBufferDirty(leafbuf);

	/* XLOG stuff */
	if (RelationNeedsWAL(rel))
	{
		xl_btree_mark_page_halfdead xlrec;
		XLogRecPtr	recptr;

		xlrec.poffset = topoff;
		xlrec.leafblk = leafblkno;
		if (target != leafblkno)
			xlrec.topparent = target;
		else
			xlrec.topparent = InvalidBlockNumber;

		XLogBeginInsert();
		XLogRegisterBuffer(0, leafbuf, REGBUF_WILL_INIT);
		XLogRegisterBuffer(1, topparent, REGBUF_STANDARD);

		page = BufferGetPage(leafbuf);
		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
		xlrec.leftblk = opaque->btpo_prev;
		xlrec.rightblk = opaque->btpo_next;

		XLogRegisterData((char *) &xlrec, SizeOfBtreeMarkPageHalfDead);

		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_MARK_PAGE_HALFDEAD);

		page = BufferGetPage(topparent);
		PageSetLSN(page, recptr);
		page = BufferGetPage(leafbuf);
		PageSetLSN(page, recptr);
	}

	END_CRIT_SECTION();

	_bt_relbuf(rel, topparent);
	return true;
}

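The function consists of the following steps (each is marked in the code comments):

  1. Step 1: climb upward

    Step 1 is implemented by _bt_lock_branch_parent. As discussed above, we cannot delete the rightmost child of a page unless it is that page's only child, in which case we recurse upward to check whether the parent can be deleted. _bt_lock_branch_parent performs this upward recursion until it reaches a page that is not the rightmost child of its parent (every page on the recursion path can be deleted) or a page that is the rightmost but not the only child of its parent (no page on the path can be deleted). Let's first look at its four output parameters: topparent, topoff, target and rightsib. Assuming the tree looks like Figure 7, Figure 9 shows their meanings and values:

Figure 9
  • topparent: the buffer holding the parent page at which _bt_lock_branch_parent stops, that is, a parent with more than one child. It is returned write-locked.
  • topoff: the offset, within topparent, of the index tuple that holds the downlink to the page below.
  • target: the page that the downlink at topoff points to. It is the head of the chain shown in Figure 8.
  • rightsib: the right sibling of target.

The function is implemented as follows: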

/*
 * Subroutine to find the parent of the branch we're deleting.  This climbs
 * up the tree until it finds a page with more than one child, i.e. a page
 * that will not be totally emptied by the deletion.  The chain of pages below
 * it, with one downlink each, will form the branch that we need to delete.
 *
 * If we cannot remove the downlink from the parent, because it's the
 * rightmost entry, returns false.  On success, *topparent and *topoff are set
 * to the buffer holding the parent, and the offset of the downlink in it.
 * *topparent is write-locked, the caller is responsible for releasing it when
 * done.  *target is set to the topmost page in the branch to-be-deleted, i.e.
 * the page whose downlink *topparent / *topoff point to, and *rightsib to its
 * right sibling.
 *
 * "child" is the leaf page we wish to delete, and "stack" is a search stack
 * leading to it (approximately).  Note that we will update the stack
 * entry(s) to reflect current downlink positions --- this is harmless and
 * indeed saves later search effort in _bt_pagedel.  The caller should
 * initialize *target and *rightsib to the leaf page and its right sibling.
 *
 * Note: it's OK to release page locks on any internal pages between the leaf
 * and *topparent, because a safe deletion can't become unsafe due to
 * concurrent activity.  An internal page can only acquire an entry if the
 * child is split, but that cannot happen as long as we hold a lock on the
 * leaf.
 */
static bool
_bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
					   Buffer *topparent, OffsetNumber *topoff,
					   BlockNumber *target, BlockNumber *rightsib)
{
	BlockNumber parent;
	OffsetNumber poffset,
				maxoff;
	Buffer		pbuf;
	Page		page;
	BTPageOpaque opaque;
	BlockNumber leftsib;

	/*
	 * Locate the downlink of "child" in the parent (updating the stack entry
	 * if needed)
	 *
	 * Step 1: verify and correct the parent of the current page and the
	 * corresponding downlink.
	 */
	ItemPointerSet(&(stack->bts_btentry.t_tid), child, P_HIKEY);
	pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
	if (pbuf == InvalidBuffer)
		elog(ERROR, "failed to re-find parent key in index \"%s\" for deletion target page %u",
			 RelationGetRelationName(rel), child);
	parent = stack->bts_blkno;
	poffset = stack->bts_offset;

	page = BufferGetPage(pbuf);
	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
	maxoff = PageGetMaxOffsetNumber(page);

	/*
	 * If the target is the rightmost child of its parent, then we can't
	 * delete, unless it's also the only child.
	 *
	 * Step 2: check whether the current page is the rightmost child.
	 */
	if (poffset >= maxoff)
	{
		/* 
		 * It's rightmost child... 
		 * Step 3: check whether the current page is the only child.
		 */
		if (poffset == P_FIRSTDATAKEY(opaque))
		{
			/*
			 * It's only child, so safe if parent would itself be removable.
			 * We have to check the parent itself, and then recurse to test
			 * the conditions at the parent's parent.
			 */
			if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) ||
				P_INCOMPLETE_SPLIT(opaque))
			{
				_bt_relbuf(rel, pbuf);
				return false;
			}

			*target = parent;
			*rightsib = opaque->btpo_next;
			leftsib = opaque->btpo_prev;

			/* Step 4: unlock the current page */
			_bt_relbuf(rel, pbuf);

			/*
			 * Like in _bt_pagedel, check that the left sibling is not marked
			 * with INCOMPLETE_SPLIT flag.  That would mean that there is no
			 * downlink to the page to be deleted, and the page deletion
			 * algorithm isn't prepared to handle that.
			 */
			if (leftsib != P_NONE)
			{
				Buffer		lbuf;
				Page		lpage;
				BTPageOpaque lopaque;

				lbuf = _bt_getbuf(rel, leftsib, BT_READ);
				lpage = BufferGetPage(lbuf);
				lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);

				/*
				 * If the left sibling was concurrently split, so that its
				 * next-pointer doesn't point to the current page anymore, the
				 * split that created the current page must be completed. (We
				 * don't allow splitting an incompletely split page again
				 * until the previous split has been completed)
				 */
				if (lopaque->btpo_next == parent &&
					P_INCOMPLETE_SPLIT(lopaque))
				{
					_bt_relbuf(rel, lbuf);
					return false;
				}
				_bt_relbuf(rel, lbuf);
			}

			/*
			 * Perform the same check on this internal level that
			 * _bt_mark_page_halfdead performed on the leaf level.
			 */
			if (_bt_is_page_halfdead(rel, *rightsib))
			{
				elog(DEBUG1, "could not delete page %u because its right sibling %u is half-dead",
					 parent, *rightsib);
				return false;
			}
			/* Step 5: recurse upward */
			return _bt_lock_branch_parent(rel, parent, stack->bts_parent,
										topparent, topoff, target, rightsib);
		}
		else
		{
			/* Unsafe to delete 
			 * Not the only child, so it cannot be deleted.
			 */
			_bt_relbuf(rel, pbuf);
			return false;
		}
	}
	else
	{
		/* 
		 * Not rightmost child, so safe to delete 
		 * Deletable, so return right away.
		 */
		*topparent = pbuf;
		*topoff = poffset;
		return true;
	}
}

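The function consists of the following steps (each is marked in the code comments):

  • Step 1: verify and correct the parent of the current page and the corresponding downlink

    This is done by _bt_getstackbuf. Its main purpose is to cope with concurrent splits that may have made the information in the stack stale. See "PostgreSQL B+ Tree Index: Splits" for details.

  • Step 2: check whether the current page is the rightmost child

    If so, go to step 3. Otherwise the page can be deleted, so return true.

  • Step 3: check whether the current page is the only child

    If so, go to step 4. Otherwise the page cannot be deleted, so return false.

  • Step 4: unlock the current page

    Upward recursion follows the same concurrency-control logic as downward traversal: the page locked at the current level must be released before the page one level up is locked, which avoids deadlock.

  • Step 5: recurse upward

  2. Step 2: delete the downlink

    In the Figure 7 example, remove from page8 the downlink that points to page6.

  3. Step 3: mark the leaf page as BTP_HALF_DEAD

  4. Step 4: make the leaf page point to the top page

    Make the leaf page point to the top of the chain of internal pages, that is, store a pointer to page6 in page3. Concretely, the tid of page3's high key is set to the block number of page6. Phase 2 reads this pointer to find the page to unlink, and it is also needed for redo, which a separate article will cover.

    Note

    The tid of a high key carries no meaning of its own, so it is free to be reused this way.

  5. Step 5: write XLOG

    This step is described in detail later.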

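Phase 2 of page deletion

After phase 1 we have a chain of internal and leaf pages, as in Figure 8, and the leaf page page3 stores a pointer to the chain head page6. Every page in the chain is still linked to its right sibling (in general, to both its left and right siblings). The job of phase 2 is to unlink the pages in the chain from their siblings, as shown in Figure 10:

Figure 10

_bt_unlink_halfdead_page implementation

Phase 2 proceeds as follows:

  1. Step 1: read the block number of the chain head from the leaf page.

    In this example, page6 is read from page3.

  2. Step 2: unlink the chain head from its left and right siblings.

    In this example, page6 is unlinked from its right sibling (page6 has no left sibling).

  3. Step 3: if the current chain head is not the leaf page, fetch its child and record it in the leaf page as the new chain head.

    In this example the child of page6 is page3, the leaf itself.

    In that case the code does not store page3's block number in page3. As the listing shows, when nextchild equals leafblkno the pointer in the leaf is set to invalid (ItemPointerSetInvalid), and an invalid pointer tells the next iteration that the leaf itself is the page to unlink.

  4. Step 4: mark the chain head as BTP_DELETED.

    In this example, page6 is marked BTP_DELETED.

  5. Step 5: check the leaf page's flag. If it is BTP_DELETED (no longer half-dead), the deletion is finished; otherwise go back to step 1.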
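
The five steps can be traced with a small in-memory model. This is a sketch under stated assumptions: ToyBranchPage, toy_unlink_one and toy_unlink_branch are hypothetical names, sibling links are array indexes, and TOY_NONE plays the role of both P_NONE and an invalid item pointer. Locking, the metapage update and WAL are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NONE (-1)

typedef enum { TOY_LIVE, TOY_HALF_DEAD, TOY_DELETED } ToyState;

typedef struct {
    int left, right;  /* sibling links on the same level */
    int only_child;   /* internal page of the branch: its single child */
    int top;          /* leaf only: branch head, TOY_NONE = the leaf itself */
    ToyState state;
} ToyBranchPage;

/* Steps 1 to 4: unlink one page, either the branch head recorded in the
 * leaf or, when no head is recorded, the leaf itself. */
bool toy_unlink_one(ToyBranchPage *pages, int leaf)
{
    /* Step 1: find the page to unlink. */
    int target = (pages[leaf].top != TOY_NONE) ? pages[leaf].top : leaf;
    int l = pages[target].left;
    int r = pages[target].right;

    if (r == TOY_NONE)
        return false;           /* a rightmost page is never unlinked */

    /* Step 2: the siblings bypass the target; the target's own links
     * are left untouched. */
    if (l != TOY_NONE)
        pages[l].right = r;
    pages[r].left = l;

    /* Step 3: record the next page down, or "invalid" if it is the leaf. */
    if (target != leaf) {
        int next = pages[target].only_child;
        pages[leaf].top = (next == leaf) ? TOY_NONE : next;
    }

    /* Step 4: mark the target deleted. */
    pages[target].state = TOY_DELETED;
    return true;
}

/* Step 5: repeat until the leaf itself has been deleted. */
int toy_unlink_branch(ToyBranchPage *pages, int leaf)
{
    int ndeleted = 0;

    while (pages[leaf].state == TOY_HALF_DEAD) {
        if (!toy_unlink_one(pages, leaf))
            break;
        ndeleted++;
    }
    return ndeleted;
}
```

With a two-page branch like the one in Figure 8, the first iteration unlinks the internal page and resets the leaf's pointer to TOY_NONE, and the second iteration unlinks the leaf, so toy_unlink_branch reports two deleted pages.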

The code is as follows.

First, _bt_pagedel contains a loop (nbtpage.c, line 1291):

while (P_ISHALFDEAD(opaque))
{
    /* will check for interrupts, once lock is released */
    if (!_bt_unlink_halfdead_page(rel, buf, &rightsib_empty))
    {
        /* _bt_unlink_halfdead_page already released buffer */
        return ndeleted;
    }
    ndeleted++;
}

This loop implements step 5, while _bt_unlink_halfdead_page implements steps 1~4. Its code follows, with the steps marked in the comments:

/*
 * Unlink a page in a branch of half-dead pages from its siblings.
 *
 * If the leaf page still has a downlink pointing to it, unlinks the highest
 * parent in the to-be-deleted branch instead of the leaf page.  To get rid
 * of the whole branch, including the leaf page itself, iterate until the
 * leaf page is deleted.
 *
 * Returns 'false' if the page could not be unlinked (shouldn't happen).
 * If the (new) right sibling of the page is empty, *rightsib_empty is set
 * to true.
 *
 * Must hold pin and lock on leafbuf at entry (read or write doesn't matter).
 * On success exit, we'll be holding pin and write lock.  On failure exit,
 * we'll release both pin and lock before returning (we define it that way
 * to avoid having to reacquire a lock we already released).
 */
static bool
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
{
	BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
	BlockNumber leafleftsib;
	BlockNumber leafrightsib;
	BlockNumber target;
	BlockNumber leftsib;
	BlockNumber rightsib;
	Buffer		lbuf = InvalidBuffer;
	Buffer		buf;
	Buffer		rbuf;
	Buffer		metabuf = InvalidBuffer;
	Page		metapg = NULL;
	BTMetaPageData *metad = NULL;
	ItemId		itemid;
	Page		page;
	BTPageOpaque opaque;
	bool		rightsib_is_rightmost;
	int			targetlevel;
	ItemPointer leafhikey;
	BlockNumber nextchild;

	page = BufferGetPage(leafbuf);
	opaque = (BTPageOpaque) PageGetSpecialPointer(page);

	Assert(P_ISLEAF(opaque) && P_ISHALFDEAD(opaque));

	/*
	 * Remember some information about the leaf page.
	 */
	itemid = PageGetItemId(page, P_HIKEY);
	leafhikey = &((IndexTuple) PageGetItem(page, itemid))->t_tid;
	leafleftsib = opaque->btpo_prev;
	leafrightsib = opaque->btpo_next;

	LockBuffer(leafbuf, BUFFER_LOCK_UNLOCK);

	/*
	 * Check here, as calling loops will have locks held, preventing
	 * interrupts from being processed.
	 */
	CHECK_FOR_INTERRUPTS();

	/*
	 * If the leaf page still has a parent pointing to it (or a chain of
	 * parents), we don't unlink the leaf page yet, but the topmost remaining
	 * parent in the branch.  Set 'target' and 'buf' to reference the page
	 * actually being unlinked.
	 * 
	 * Step 1: read the block number of the chain head from the leaf page.
	 */
	if (ItemPointerIsValid(leafhikey))
	{
		target = ItemPointerGetBlockNumber(leafhikey);
		Assert(target != leafblkno);

		/* fetch the block number of the topmost parent's left sibling */
		buf = _bt_getbuf(rel, target, BT_READ);
		page = BufferGetPage(buf);
		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
		leftsib = opaque->btpo_prev;
		targetlevel = opaque->btpo.level;

		/*
		 * To avoid deadlocks, we'd better drop the target page lock before
		 * going further.
		 */
		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
	}
	else
	{
		target = leafblkno;

		buf = leafbuf;
		leftsib = leafleftsib;
		targetlevel = 0;
	}

	/*
	 * We have to lock the pages we need to modify in the standard order:
	 * moving right, then up.  Else we will deadlock against other writers.
	 *
	 * So, first lock the leaf page, if it's not the target.  Then find and
	 * write-lock the current left sibling of the target page.  The sibling
	 * that was current a moment ago could have split, so we may have to move
	 * right.  This search could fail if either the sibling or the target page
	 * was deleted by someone else meanwhile; if so, give up.  (Right now,
	 * that should never happen, since page deletion is only done in VACUUM
	 * and there shouldn't be multiple VACUUMs concurrently on the same
	 * table.)
	 */
	if (target != leafblkno)
		LockBuffer(leafbuf, BT_WRITE);
	if (leftsib != P_NONE)
	{
		lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
		page = BufferGetPage(lbuf);
		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
		while (P_ISDELETED(opaque) || opaque->btpo_next != target)
		{
			/* step right one page */
			leftsib = opaque->btpo_next;
			_bt_relbuf(rel, lbuf);

			/*
			 * It'd be good to check for interrupts here, but it's not easy to
			 * do so because a lock is always held. This block isn't
			 * frequently reached, so hopefully the consequences of not
			 * checking interrupts aren't too bad.
			 */

			if (leftsib == P_NONE)
			{
				elog(LOG, "no left sibling (concurrent deletion?) of block %u in \"%s\"",
					 target,
					 RelationGetRelationName(rel));
				if (target != leafblkno)
				{
					/* we have only a pin on target, but pin+lock on leafbuf */
					ReleaseBuffer(buf);
					_bt_relbuf(rel, leafbuf);
				}
				else
				{
					/* we have only a pin on leafbuf */
					ReleaseBuffer(leafbuf);
				}
				return false;
			}
			lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
			page = BufferGetPage(lbuf);
			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
		}
	}
	else
		lbuf = InvalidBuffer;

	/*
	 * Next write-lock the target page itself.  It should be okay to take just
	 * a write lock not a superexclusive lock, since no scans would stop on an
	 * empty page.
	 */
	LockBuffer(buf, BT_WRITE);
	page = BufferGetPage(buf);
	opaque = (BTPageOpaque) PageGetSpecialPointer(page);

	/*
	 * Check page is still empty etc, else abandon deletion.  This is just for
	 * paranoia's sake; a half-dead page cannot resurrect because there can be
	 * only one vacuum process running at a time.
	 */
	if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque))
	{
		elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
			 target, RelationGetRelationName(rel));
	}
	if (opaque->btpo_prev != leftsib)
		elog(ERROR, "left link changed unexpectedly in block %u of index \"%s\"",
			 target, RelationGetRelationName(rel));

	if (target == leafblkno)
	{
		if (P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page) ||
			!P_ISLEAF(opaque) || !P_ISHALFDEAD(opaque))
			elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
				 target, RelationGetRelationName(rel));
		nextchild = InvalidBlockNumber;
	}
	else
	{
		if (P_FIRSTDATAKEY(opaque) != PageGetMaxOffsetNumber(page) ||
			P_ISLEAF(opaque))
			elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
				 target, RelationGetRelationName(rel));

		/* remember the next non-leaf child down in the branch. */
		itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
		nextchild = ItemPointerGetBlockNumber(&((IndexTuple) PageGetItem(page, itemid))->t_tid);
		if (nextchild == leafblkno)
			nextchild = InvalidBlockNumber;
	}

	/*
	 * And next write-lock the (current) right sibling.
	 */
	rightsib = opaque->btpo_next;
	rbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
	page = BufferGetPage(rbuf);
	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
	if (opaque->btpo_prev != target)
		elog(ERROR, "right sibling's left-link doesn't match: "
			 "block %u links to %u instead of expected %u in index \"%s\"",
			 rightsib, opaque->btpo_prev, target,
			 RelationGetRelationName(rel));
	rightsib_is_rightmost = P_RIGHTMOST(opaque);
	*rightsib_empty = (P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page));

	/*
	 * If we are deleting the next-to-last page on the target's level, then
	 * the rightsib is a candidate to become the new fast root. (In theory, it
	 * might be possible to push the fast root even further down, but the odds
	 * of doing so are slim, and the locking considerations daunting.)
	 *
	 * We don't support handling this in the case where the parent is becoming
	 * half-dead, even though it theoretically could occur.
	 *
	 * We can safely acquire a lock on the metapage here --- see comments for
	 * _bt_newroot().
	 */
	if (leftsib == P_NONE && rightsib_is_rightmost)
	{
		page = BufferGetPage(rbuf);
		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
		if (P_RIGHTMOST(opaque))
		{
			/* rightsib will be the only one left on the level */
			metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
			metapg = BufferGetPage(metabuf);
			metad = BTPageGetMeta(metapg);

			/*
			 * The expected case here is btm_fastlevel == targetlevel+1; if
			 * the fastlevel is <= targetlevel, something is wrong, and we
			 * choose to overwrite it to fix it.
			 */
			if (metad->btm_fastlevel > targetlevel + 1)
			{
				/* no update wanted */
				_bt_relbuf(rel, metabuf);
				metabuf = InvalidBuffer;
			}
		}
	}

	/*
	 * Here we begin doing the deletion.
	 */

	/* No ereport(ERROR) until changes are logged */
	START_CRIT_SECTION();

	/*
	 * Update siblings' side-links.  Note the target page's side-links will
	 * continue to point to the siblings.  Asserts here are just rechecking
	 * things we already verified above.
	 *
	 * Step 2: unlink the chain head from its left and right siblings.
	 */
	if (BufferIsValid(lbuf))
	{
		page = BufferGetPage(lbuf);
		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
		Assert(opaque->btpo_next == target);
		opaque->btpo_next = rightsib;
	}
	page = BufferGetPage(rbuf);
	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
	Assert(opaque->btpo_prev == target);
	opaque->btpo_prev = leftsib;

	/*
	 * If we deleted a parent of the targeted leaf page, instead of the leaf
	 * itself, update the leaf to point to the next remaining child in the
	 * branch.
	 *
	 * Step 3: if the current chain head is not the leaf page, fetch its
	 * child and record it in the leaf page as the new chain head.
	 */
	if (target != leafblkno)
	{
		if (nextchild == InvalidBlockNumber)
			ItemPointerSetInvalid(leafhikey);
		else
			ItemPointerSet(leafhikey, nextchild, P_HIKEY);
	}

	/*
	 * Mark the page itself deleted.  It can be recycled when all current
	 * transactions are gone.  Storing GetTopTransactionId() would work, but
	 * we're in VACUUM and would not otherwise have an XID.  Having already
	 * updated links to the target, ReadNewTransactionId() suffices as an
	 * upper bound.  Any scan having retained a now-stale link is advertising
	 * in its PGXACT an xmin less than or equal to the value we read here.  It
	 * will continue to do so, holding back RecentGlobalXmin, for the duration
	 * of that scan.
	 *
	 * Step 4: mark the chain head as BTP_DELETED.
	 */
	page = BufferGetPage(buf);
	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
	opaque->btpo_flags &= ~BTP_HALF_DEAD;
	opaque->btpo_flags |= BTP_DELETED;
	opaque->btpo.xact = ReadNewTransactionId();

	/* And update the metapage, if needed */
	if (BufferIsValid(metabuf))
	{
		metad->btm_fastroot = rightsib;
		metad->btm_fastlevel = targetlevel;
		MarkBufferDirty(metabuf);
	}

	/* Must mark buffers dirty before XLogInsert */
	MarkBufferDirty(rbuf);
	MarkBufferDirty(buf);
	if (BufferIsValid(lbuf))
		MarkBufferDirty(lbuf);
	if (target != leafblkno)
		MarkBufferDirty(leafbuf);

	/* XLOG stuff */
	if (RelationNeedsWAL(rel))
	{
		xl_btree_unlink_page xlrec;
		xl_btree_metadata xlmeta;
		uint8		xlinfo;
		XLogRecPtr	recptr;

		XLogBeginInsert();

		XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT);
		if (BufferIsValid(lbuf))
			XLogRegisterBuffer(1, lbuf, REGBUF_STANDARD);
		XLogRegisterBuffer(2, rbuf, REGBUF_STANDARD);
		if (target != leafblkno)
			XLogRegisterBuffer(3, leafbuf, REGBUF_WILL_INIT);

		/* information on the unlinked block */
		xlrec.leftsib = leftsib;
		xlrec.rightsib = rightsib;
		xlrec.btpo_xact = opaque->btpo.xact;

		/* information needed to recreate the leaf block (if not the target) */
		xlrec.leafleftsib = leafleftsib;
		xlrec.leafrightsib = leafrightsib;
		xlrec.topparent = nextchild;

		XLogRegisterData((char *) &xlrec, SizeOfBtreeUnlinkPage);

		if (BufferIsValid(metabuf))
		{
			XLogRegisterBuffer(4, metabuf, REGBUF_WILL_INIT);

			xlmeta.root = metad->btm_root;
			xlmeta.level = metad->btm_level;
			xlmeta.fastroot = metad->btm_fastroot;
			xlmeta.fastlevel = metad->btm_fastlevel;

			XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
		}
		else
			xlinfo = XLOG_BTREE_UNLINK_PAGE;

		recptr = XLogInsert(RM_BTREE_ID, xlinfo);

		if (BufferIsValid(metabuf))
		{
			PageSetLSN(metapg, recptr);
		}
		page = BufferGetPage(rbuf);
		PageSetLSN(page, recptr);
		page = BufferGetPage(buf);
		PageSetLSN(page, recptr);
		if (BufferIsValid(lbuf))
		{
			page = BufferGetPage(lbuf);
			PageSetLSN(page, recptr);
		}
		if (target != leafblkno)
		{
			page = BufferGetPage(leafbuf);
			PageSetLSN(page, recptr);
		}
	}

	END_CRIT_SECTION();

	/* release metapage */
	if (BufferIsValid(metabuf))
		_bt_relbuf(rel, metabuf);

	/* release siblings */
	if (BufferIsValid(lbuf))
		_bt_relbuf(rel, lbuf);
	_bt_relbuf(rel, rbuf);

	/*
	 * Release the target, if it was not the leaf block.  The leaf is always
	 * kept locked.
	 */
	if (target != leafblkno)
		_bt_relbuf(rel, buf);

	return true;
}

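_bt_pagedel: the open question

Now let's settle the question left open earlier: why does _bt_pagedel need a loop? Here is the skeleton again, this time with the comments from the source: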

for (;;)
{
    // ... omitted

    rightsib = opaque->btpo_next;

    _bt_relbuf(rel, buf);

    /*
	 * Check here, as calling loops will have locks held, preventing
	 * interrupts from being processed.
	 */
    CHECK_FOR_INTERRUPTS();

    /*
	 * The page has now been deleted. If its right sibling is completely
	 * empty, it's possible that the reason we haven't deleted it earlier
	 * is that it was the rightmost child of the parent. Now that we
	 * removed the downlink for this page, the right sibling might now be
	 * the only child of the parent, and could be removed. It would be
	 * picked up by the next vacuum anyway, but might as well try to
	 * remove it now, so loop back to process the right sibling.
	 */
    if (!rightsib_empty)
        break;

    buf = _bt_getbuf(rel, rightsib, BT_WRITE);
}

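The comment explains the reason. A page's rightmost child cannot be deleted until it is that page's only child. So after deleting a page, the code checks whether the deleted page's right sibling is empty. If it is, that sibling may have been skipped earlier only because it was the rightmost child, and it may now be the only child, so the loop goes back and tries to delete it as well. Without this check and loop, the empty right sibling would stay in the tree until a later VACUUM picked it up.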

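Concurrency control

Finally, let's walk through the concurrency control of page deletion:

  1. A super-exclusive lock is taken before a page is deleted (nbtpage.c, line 956)

    So by the time _bt_pagedel is called, the page to delete is already held under a super-exclusive lock. Such a lock is an exclusive lock that is granted only once no other backend holds a pin on the page, so no other backend can be using the page, or lock it, while the lock is held.

    Super-exclusive locks are covered in detail in "PostgreSQL Blink-tree ReadMe (translation)" and "PostgreSQL Buffer ReadMe (translation)".

  2. Deletion phase 1

    • Phase 1 calls _bt_lock_branch_parent to climb upward. The leaf stays locked throughout; at each level the climb releases the write lock on the internal page it has just examined before write-locking the next page up. When the climb finishes, topparent, the parent of the chain head in Figure 8, holds a write lock. (See _bt_getstackbuf (lock) and _bt_relbuf (unlock) in _bt_lock_branch_parent.)
    • Once the downlink from topparent to the chain head has been removed, the write lock on topparent is released (nbtpage.c, line 1504).
  3. Deletion phase 2

    1. Release the write lock on the leaf page. (nbtpage.c, line 1561)
    2. If the page to unlink is not the leaf page, write-lock the leaf page. (nbtpage.c, line 1616)
    3. Write-lock the left sibling of the page to unlink. (nbtpage.c, lines 1617~1659)
    4. Write-lock the page to unlink. (nbtpage.c, line 1666)
    5. Write-lock the right sibling of the page to unlink. (nbtpage.c, line 1710)
    6. Unlock the left sibling of the page to unlink. (nbtpage.c, line 1901)
    7. Unlock the right sibling of the page to unlink. (nbtpage.c, line 1902)
    8. If the page to unlink is not the leaf page, unlock it. (nbtpage.c, line 1909)

    This flow is a little involved and needs some explanation. Phase 2 severs the links between the page being unlinked and its left and right siblings, so all three pages must be locked at the same time. To avoid deadlock, PostgreSQL requires that locks on multiple pages be acquired only from left to right or from bottom to top (never from right to left or from top to bottom). Steps 3~5 above implement the left-to-right acquisition.

    What are steps 1 and 2 for, then? Suppose they were missing. When phase 1 ends, the leaf page is still write-locked. If the page to unlink is the leaf itself (the usual case), going straight to step 3 in phase 2 would lock the left sibling while a lock is held on the current page, which is a right-to-left acquisition. That is not allowed, so the leaf has to be unlocked first. If the page to unlink later turns out not to be the leaf, the leaf is locked again. Those are steps 1 and 2.

  4. Release the lock on the page being deleted (done in several places; see the _bt_relbuf calls in _bt_pagedel)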
