PostgreSQL Vacuum: Index Deletion

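Prerequisites

PostgreSQL Vacuum: Tuple Deletion

PostgreSQL B+ Tree Index: Concurrency Control

Overview

"PostgreSQL Vacuum: Tuple Deletion" described in detail how heap tuples are deleted. One takeaway was that, until the index entries are removed, the line pointer (ItemId) of the tuple at the head of a HOT chain can only be marked LP_DEAD, which prevents it from being reused. Only after the index entries have been deleted can the line pointer be marked LP_UNUSED and made available for reuse. This article therefore explains how index entries are deleted. Index deletion happens in two scenarios:

  • The user runs the VACUUM command

  • An insert into the index

    When an index tuple is being inserted and the target page does not have enough free space, PostgreSQL first tries to reclaim space on that page.

This article focuses on the implementation details of these two scenarios.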

Lazy Vacuum

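"PostgreSQL Vacuum: Tuple Deletion" explained how Lazy Vacuum deletes heap tuples; now we can look at how it deletes index entries. VACUUM does many things, including deleting tuples, freezing transaction IDs, updating the free space map, and updating the visibility map. Here we only care about deletion. The core function of Lazy Vacuum is lazy_scan_heap. It is very long, so instead of walking through the code we only discuss the deletion-related flow and give the code location for each step:

  • Step 1: Iterate over all data blocks of the table (vacuumlazy.c line:585)

  • Step 2: For each block, call heap_page_prune to clean up expired tuples (vacuumlazy.c line:911)

    As described in "PostgreSQL Vacuum: Tuple Deletion", once heap_page_prune finishes, the block is left with items marked LP_DEAD. We call these dead tuples.

  • Step 3: Build the dead tuple array, which records the TIDs of all dead tuples in the table (vacuumlazy.c line:961)

  • Step 4: Delete the index tuples that correspond to the dead tuple array (lazy_vacuum_index() function, vacuumlazy.c line:1304)

  • Step 5: Change the line pointers of all tuples in the dead tuple array to LP_UNUSED (lazy_vacuum_heap() function, vacuumlazy.c line:1317)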
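The ordering of steps 3 to 5 is the important part, and it can be illustrated with a toy model. The sketch below is not PostgreSQL code: the arrays, the `demo_` names, and the fixed size are assumptions made purely for illustration. It only shows that a heap line pointer becomes LP_UNUSED after the index entry pointing at it is gone.

```c
#include <assert.h>

/* Line pointer states (same numeric values as PostgreSQL's itemid.h). */
enum { DEMO_LP_UNUSED = 0, DEMO_LP_NORMAL = 1, DEMO_LP_DEAD = 3 };

#define DEMO_HEAP_ITEMS 4

/*
 * Toy model (hypothetical): heap[i] is the line pointer state of heap item i,
 * and idx_entry[i] is nonzero while an index entry still points at item i.
 * Returns the number of dead items that were reclaimed.
 */
static int
demo_lazy_vacuum(int heap[DEMO_HEAP_ITEMS], int idx_entry[DEMO_HEAP_ITEMS])
{
    int dead[DEMO_HEAP_ITEMS];
    int ndead = 0;

    /* Step 3: remember every LP_DEAD heap item left behind by pruning. */
    for (int i = 0; i < DEMO_HEAP_ITEMS; i++)
        if (heap[i] == DEMO_LP_DEAD)
            dead[ndead++] = i;

    /* Step 4: delete the index entries that point at those items. */
    for (int d = 0; d < ndead; d++)
        idx_entry[dead[d]] = 0;

    /* Step 5: only now is it safe to make the line pointers reusable. */
    for (int d = 0; d < ndead; d++)
    {
        assert(idx_entry[dead[d]] == 0);
        heap[dead[d]] = DEMO_LP_UNUSED;
    }
    return ndead;
}
```

If step 5 ran before step 4, a reused line pointer could be reached through a stale index entry, which is exactly what the LP_DEAD state prevents.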

Of these five steps, step 4 deserves the most attention: it deletes index entries by calling lazy_vacuum_index. Let's look at how lazy_vacuum_index is implemented.

lazy_vacuum_index

For a B-tree index, lazy_vacuum_index eventually calls btvacuumscan, whose core code is as follows:

blkno = BTREE_METAPAGE + 1;
for (;;)
{
    /* Get the current relation length */
    if (needLock)
        LockRelationForExtension(rel, ExclusiveLock);
    num_pages = RelationGetNumberOfBlocks(rel);
    if (needLock)
        UnlockRelationForExtension(rel, ExclusiveLock);

    /* Quit if we've scanned the whole relation */
    if (blkno >= num_pages)
        break;
    /* Iterate over pages, then loop back to recheck length */
    for (; blkno < num_pages; blkno++)
    {
        btvacuumpage(&vstate, blkno, blkno);
    }
}

btvacuumpage

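The code above iterates over every block of the index and calls btvacuumpage on each one to delete its index tuples. The function has three main steps:

  • Step 1: Iterate over the index tuples on the page and decide whether each one can be deleted.

    The decision is made by calling callback, a function pointer that actually points to lazy_tid_reaped. lazy_tid_reaped checks whether the index tuple's TID is present in the dead tuple array. Because the array is sorted by TID in ascending order, the check uses binary search.

  • Step 2: Record the tuples to delete in the deletable array.

  • Step 3: Call _bt_delitems_vacuum to delete the tuples in deletable.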
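The membership test in step 1 can be sketched with the standard library's bsearch. This is a simplified illustration, not the actual lazy_tid_reaped: `DemoTid` is a hypothetical stand-in for PostgreSQL's ItemPointerData, and the function names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified TID: (heap block number, offset within the block). */
typedef struct DemoTid
{
    uint32_t block;
    uint16_t offset;
} DemoTid;

/* Order TIDs by block number first, then by offset. */
static int
demo_tid_cmp(const void *a, const void *b)
{
    const DemoTid *l = a;
    const DemoTid *r = b;

    if (l->block != r->block)
        return (l->block < r->block) ? -1 : 1;
    if (l->offset != r->offset)
        return (l->offset < r->offset) ? -1 : 1;
    return 0;
}

/* Return true if tid appears in the dead-tuple array (sorted ascending). */
static bool
demo_tid_reaped(const DemoTid *tid, const DemoTid *dead, size_t ndead)
{
    assert(dead != NULL || ndead == 0);
    return bsearch(tid, dead, ndead, sizeof(DemoTid), demo_tid_cmp) != NULL;
}
```

Because the callback runs once per index tuple in the whole index, keeping each lookup logarithmic in the number of dead tuples matters.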

btvacuumpage is implemented as follows:

static void
btvacuumpage(BTVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
{
	IndexVacuumInfo *info = vstate->info;
	IndexBulkDeleteResult *stats = vstate->stats;
	IndexBulkDeleteCallback callback = vstate->callback;
	void	   *callback_state = vstate->callback_state;
	Relation	rel = info->index;
	bool		delete_now;
	BlockNumber recurse_to;
	Buffer		buf;
	Page		page;
	BTPageOpaque opaque = NULL;

restart:
	delete_now = false;
	recurse_to = P_NONE;

	/* call vacuum_delay_point while not holding any buffer lock */
	vacuum_delay_point();

	/*
	 * We can't use _bt_getbuf() here because it always applies
	 * _bt_checkpage(), which will barf on an all-zero page. We want to
	 * recycle all-zero pages, not fail.  Also, we want to use a nondefault
	 * buffer access strategy.
	 */
	buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
							 info->strategy);
	LockBuffer(buf, BT_READ);
	page = BufferGetPage(buf);
	if (!PageIsNew(page))
	{
		_bt_checkpage(rel, buf);
		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
	}

	/*
	 * If we are recursing, the only case we want to do anything with is a
	 * live leaf page having the current vacuum cycle ID.  Any other state
	 * implies we already saw the page (eg, deleted it as being empty).
	 */
	if (blkno != orig_blkno)
	{
		if (_bt_page_recyclable(page) ||
			P_IGNORE(opaque) ||
			!P_ISLEAF(opaque) ||
			opaque->btpo_cycleid != vstate->cycleid)
		{
			_bt_relbuf(rel, buf);
			return;
		}
	}

	/* Page is valid, see what to do with it */
	if (_bt_page_recyclable(page))
	{
		/* Okay to recycle this page */
		RecordFreeIndexPage(rel, blkno);
		vstate->totFreePages++;
		stats->pages_deleted++;
	}
	else if (P_ISDELETED(opaque))
	{
		/* Already deleted, but can't recycle yet */
		stats->pages_deleted++;
	}
	else if (P_ISHALFDEAD(opaque))
	{
		/* Half-dead, try to delete */
		delete_now = true;
	}
	else if (P_ISLEAF(opaque))
	{
		OffsetNumber deletable[MaxOffsetNumber];
		int			ndeletable;
		OffsetNumber offnum,
					minoff,
					maxoff;

		/*
		 * Trade in the initial read lock for a super-exclusive write lock on
		 * this page.  We must get such a lock on every leaf page over the
		 * course of the vacuum scan, whether or not it actually contains any
		 * deletable tuples --- see nbtree/README.
		 */
		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
		LockBufferForCleanup(buf);

		// (omitted)
		/*
		 * Scan over all items to see which ones need deleted according to the
		 * callback function.
		 */
		ndeletable = 0;
		minoff = P_FIRSTDATAKEY(opaque);
		maxoff = PageGetMaxOffsetNumber(page);
		if (callback)
		{
            // Step 1: iterate over the index tuples on the page and decide whether each can be deleted.
			for (offnum = minoff;
				 offnum <= maxoff;
				 offnum = OffsetNumberNext(offnum))
			{
				IndexTuple	itup;
				ItemPointer htup;

				itup = (IndexTuple) PageGetItem(page,
												PageGetItemId(page, offnum));
				htup = &(itup->t_tid);

                // Step 2: record the tuples to delete in the deletable array.
				if (callback(htup, callback_state))
					deletable[ndeletable++] = offnum;
			}
		}

		/*
		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
		 * call per page, so as to minimize WAL traffic.
		 */
		if (ndeletable > 0)
		{
			/* Step 3: call _bt_delitems_vacuum to delete the tuples in deletable
			 */
			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
								vstate->lastBlockVacuumed);

			/*
			 * Remember highest leaf page number we've issued a
			 * XLOG_BTREE_VACUUM WAL record for.
			 */
			if (blkno > vstate->lastBlockVacuumed)
				vstate->lastBlockVacuumed = blkno;

			stats->tuples_removed += ndeletable;
			/* must recompute maxoff */
			maxoff = PageGetMaxOffsetNumber(page);
		}
		else
		{
			/*
			 * If the page has been split during this vacuum cycle, it seems
			 * worth expending a write to clear btpo_cycleid even if we don't
			 * have any deletions to do.  (If we do, _bt_delitems_vacuum takes
			 * care of this.)  This ensures we won't process the page again.
			 *
			 * We treat this like a hint-bit update because there's no need to
			 * WAL-log it.
			 */
			if (vstate->cycleid != 0 &&
				opaque->btpo_cycleid == vstate->cycleid)
			{
				opaque->btpo_cycleid = 0;
				MarkBufferDirtyHint(buf, true);
			}
		}

		/*
		 * If it's now empty, try to delete; else count the live tuples. We
		 * don't delete when recursing, though, to avoid putting entries into
		 * freePages out-of-order (doesn't seem worth any extra code to handle
		 * the case).
		 */
		if (minoff > maxoff)
			delete_now = (blkno == orig_blkno);
		else
			stats->num_index_tuples += maxoff - minoff + 1;
	}

	if (delete_now)
	{
		MemoryContext oldcontext;
		int			ndel;

		/* Run pagedel in a temp context to avoid memory leakage */
		MemoryContextReset(vstate->pagedelcontext);
		oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);

		ndel = _bt_pagedel(rel, buf);

		/* count only this page, else may double-count parent */
		if (ndel)
			stats->pages_deleted++;

		MemoryContextSwitchTo(oldcontext);
		/* pagedel released buffer, so we shouldn't */
	}
	else
		_bt_relbuf(rel, buf);

	/*
	 * This is really tail recursion, but if the compiler is too stupid to
	 * optimize it as such, we'd eat an uncomfortably large amount of stack
	 * space per recursion level (due to the deletable[] array). A failure is
	 * improbable since the number of levels isn't likely to be large ... but
	 * just in case, let's hand-optimize into a loop.
	 */
	if (recurse_to != P_NONE)
	{
		blkno = recurse_to;
		goto restart;
	}
}

_bt_delitems_vacuum

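Next, let's look at _bt_delitems_vacuum. Its core is the PageIndexMultiDelete function, which works in two steps:

  • Clean up the line pointers (ItemId) of the index tuples

    Using the deletable array, it collects the line pointers that are not being removed into the newitemids array, then copies newitemids straight into the pd_linp array of the index page.

  • Call compactify_tuples to clean up the tuple bodies

    compactify_tuples was already covered in "PostgreSQL Vacuum: Tuple Deletion", so it is not repeated here.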
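The first of these two steps can be sketched as follows. This is a minimal illustration, not the real PageIndexMultiDelete: `DemoItemId`, `DEMO_MAX_ITEMS`, and the function name are assumptions, and the compaction of tuple bodies (the second step) is not modelled at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_ITEMS 16

/* Hypothetical, simplified line pointer: where a tuple body lives on the page. */
typedef struct DemoItemId
{
    unsigned off;  /* byte offset of the tuple body */
    unsigned len;  /* byte length of the tuple body */
} DemoItemId;

/*
 * Drop the line pointers whose 1-based offsets are listed in deletable[]
 * (assumed sorted ascending). Survivors are gathered into a scratch array
 * and copied back in one go. Returns the new number of line pointers.
 */
static int
demo_multi_delete(DemoItemId *linp, int nline,
                  const int *deletable, int ndeletable)
{
    DemoItemId newitemids[DEMO_MAX_ITEMS];
    int nused = 0;
    int nextdel = 0;

    assert(nline <= DEMO_MAX_ITEMS);

    for (int offnum = 1; offnum <= nline; offnum++)
    {
        if (nextdel < ndeletable && deletable[nextdel] == offnum)
            nextdel++;                               /* being deleted: skip */
        else
            newitemids[nused++] = linp[offnum - 1];  /* survivor: keep */
    }

    memcpy(linp, newitemids, (size_t) nused * sizeof(DemoItemId));
    return nused;
}
```

Rebuilding the array in a single pass avoids shifting the remaining line pointers once per deleted item.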

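Index Insertion

Besides VACUUM, index deletion also happens during an insert into the index: if the target page does not have enough free space, PostgreSQL first tries to reclaim space on the page. The code is as follows: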

//nbtinsert.c line 605
movedright = false;
vacuumed = false;
while (PageGetFreeSpace(page) < itemsz)
{
    Buffer		rbuf;
    BlockNumber rblkno;

    /*
	 * before considering moving right, see if we can obtain enough space
	 * by erasing LP_DEAD items
	 */
    if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
    {
        _bt_vacuum_one_page(rel, buf, heapRel);

        /*
		 * remember that we vacuumed this page, because that makes the
		 * hint supplied by the caller invalid
		 */
        vacuumed = true;

        if (PageGetFreeSpace(page) >= itemsz)
            break;			/* OK, now we have enough space */
    }

    /* (the rest of the loop body, which handles moving right, is omitted here) */
}

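_bt_vacuum_one_page is responsible for reclaiming space on the page. It has two main steps:

  • Step 1: Iterate over all index tuples on the page and record those marked LP_DEAD in deletable.
  • Step 2: Call _bt_delitems_delete to delete the tuples in deletable.

_bt_vacuum_one_page is implemented as follows: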

static void
_bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
{
	OffsetNumber deletable[MaxOffsetNumber];
	int			ndeletable = 0;
	OffsetNumber offnum,
				minoff,
				maxoff;
	Page		page = BufferGetPage(buffer);
	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

	/*
	 * Scan over all items to see which ones need to be deleted according to
	 * LP_DEAD flags.
	 * Step 1: iterate over all index tuples on the page and record those marked LP_DEAD in deletable.
	 */
	minoff = P_FIRSTDATAKEY(opaque);
	maxoff = PageGetMaxOffsetNumber(page);
	for (offnum = minoff;
		 offnum <= maxoff;
		 offnum = OffsetNumberNext(offnum))
	{
		ItemId		itemId = PageGetItemId(page, offnum);

		if (ItemIdIsDead(itemId))
			deletable[ndeletable++] = offnum;
	}

	if (ndeletable > 0)
		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);

	/*
	 * Note: if we didn't find any LP_DEAD items, then the page's
	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
	 * separate write to clear it, however.  We will clear it when we split
	 * the page.
	 */
}

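_bt_delitems_delete is very similar to the _bt_delitems_vacuum function discussed earlier; both ultimately call PageIndexMultiDelete, so we will not repeat the details. The question we need to answer here is: when and how does an index tuple get marked LP_DEAD?

LP_DEAD Marking of Index Tuples

Basic Flow

Index tuples are marked LP_DEAD during index scans. Let's first review how an index scan proceeds:

  • Step 1: Take a shared lock on the index page.
  • Step 2: Scan the index page, collect all index tuples that satisfy the index condition, and store them in a local cache (the index tuple array).
  • Step 3: Unlock the index page.
  • Step 4: Iterate over the index tuple array, use each index tuple to locate the heap tuple, and check the tuple's visibility and validity.
  • Step 5: After finishing one page, move to the next page and repeat steps 1 to 4.

In step 4 we check the visibility of the heap tuple, that is, we walk the tuple's HOT chain looking for a visible version. At that point we also learn whether every tuple on the HOT chain has expired. If so, the corresponding index tuple can be deleted, and it can be marked LP_DEAD.

In the actual implementation, PostgreSQL caches the tuples to be marked LP_DEAD in the killedItems array and marks them all at once before moving to the next page (step 5). A few points about LP_DEAD marking are worth noting:

  • Taking a shared lock on the page

    Because the page was unlocked back in step 3, it has to be locked again at this point. And yes, a shared lock is enough. Setting the flag changes only 4 bytes of the line pointer (more precisely, two bits within those 4 bytes), and 4-byte reads and writes are atomic to begin with, so the markings can run concurrently.

  • Concurrent splits

    The page may split while the index tuple array is being processed. As a result, when marking tuples LP_DEAD there is no guarantee that every tuple in the index tuple array can still be found on the page (possibly none of them can). That is fine: a tuple's visibility and correctness are determined solely by the heap tuple itself, so whether the index tuple gets deleted does not affect correctness.

  • XLOG

    Since deleting the index tuple or not does not affect correctness, marking an index tuple LP_DEAD does not need to write XLOG either.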

Code Implementation

Let's look at the code behind the flow described above.

index_getnext

index_getnext is the main loop of an index scan. It has two main steps:

  • Step 1: Fetch one index tuple
  • Step 2: Use the index tuple to fetch the heap tuple

HeapTuple
index_getnext(IndexScanDesc scan, ScanDirection direction)
{
	HeapTuple	heapTuple;
	ItemPointer tid;

	for (;;)
	{
		if (scan->xs_continue_hot)
		{
			/*
			 * We are resuming scan of a HOT chain after having returned an
			 * earlier member.  Must still hold pin on current heap page.
			 */
			Assert(BufferIsValid(scan->xs_cbuf));
			Assert(ItemPointerGetBlockNumber(&scan->xs_ctup.t_self) ==
				   BufferGetBlockNumber(scan->xs_cbuf));
		}
		else
		{
			/* Time to fetch the next TID from the index */
            // Step 1: fetch one index tuple
			tid = index_getnext_tid(scan, direction);

			/* If we're out of index entries, we're done */
			if (tid == NULL)
				break;
		}

		/*
		 * Fetch the next (or only) visible heap tuple for this index entry.
		 * If we don't find anything, loop around and grab the next TID from
		 * the index.
		 * Step 2: use the index tuple to fetch the heap tuple
		 */
		heapTuple = index_fetch_heap(scan);
		if (heapTuple != NULL)
			return heapTuple;
	}

	return NULL;				/* failure exit */
}

index_fetch_heap

While fetching the heap tuple, we check whether all tuples on its HOT chain have expired. The main code of index_fetch_heap is as follows:

HeapTuple
index_fetch_heap(IndexScanDesc scan)
{
	ItemPointer tid = &scan->xs_ctup.t_self;
	bool		all_dead = false;
	bool		got_heap_tuple;

	/* Obtain share-lock on the buffer so we can examine visibility */
	LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
	got_heap_tuple = heap_hot_search_buffer(tid, scan->heapRelation,
											scan->xs_cbuf,
											scan->xs_snapshot,
											&scan->xs_ctup,
											&all_dead,
											!scan->xs_continue_hot);
	LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);

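	/* (omitted: when got_heap_tuple is true, the visible tuple &scan->xs_ctup is returned here) */

	/* No visible tuple on the HOT chain: remember whether every member was dead */
	if (!scan->xactStartedInRecovery)
		scan->kill_prior_tuple = all_dead;

	return NULL;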
}

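heap_hot_search_buffer walks the HOT chain and checks whether its tuples have expired. If all of them have, all_dead is set to true, which means the index tuple should be marked LP_DEAD.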

btgettuple

On the next call into index_getnext_tid, this index tuple is added to the killedItems array. The code is as follows:

//nbtree.c line:332
//Call chain: index_getnext_tid -> btgettuple
if (scan->kill_prior_tuple)
{
    if (so->killedItems == NULL)
        so->killedItems = (int *)
            palloc(MaxIndexTuplesPerPage * sizeof(int));
    if (so->numKilled < MaxIndexTuplesPerPage)
        so->killedItems[so->numKilled++] = so->currPos.itemIndex;
}

_bt_steppage

Before moving to the next page, the index tuples recorded in killedItems are marked LP_DEAD. The code is as follows:

//nbtree.c line:1302
//Call chain: index_getnext_tid -> btgettuple -> _bt_next -> _bt_steppage
static bool
_bt_steppage(IndexScanDesc scan, ScanDirection dir)
{
	BTScanOpaque so = (BTScanOpaque) scan->opaque;
	Relation	rel;
	Page		page;
	BTPageOpaque opaque;

	Assert(BTScanPosIsValid(so->currPos));

	/* Before leaving current page, deal with any killed items */
	if (so->numKilled > 0)
		_bt_killitems(scan);
    // (omitted)
}

_bt_killitems

Finally, let's look at the implementation of _bt_killitems:

void
_bt_killitems(IndexScanDesc scan)
{
	BTScanOpaque so = (BTScanOpaque) scan->opaque;
	Page		page;
	BTPageOpaque opaque;
	OffsetNumber minoff;
	OffsetNumber maxoff;
	int			i;
	int			numKilled = so->numKilled;
	bool		killedsomething = false;

	Assert(BTScanPosIsValid(so->currPos));

	/*
	 * Always reset the scan state, so we don't look for same items on other
	 * pages.
	 */
	so->numKilled = 0;

	if (BTScanPosIsPinned(so->currPos))
	{
		/*
		 * We have held the pin on this page since we read the index tuples,
		 * so all we need to do is lock it.  The pin will have prevented
		 * re-use of any TID on the page, so there is no need to check the
		 * LSN.
		 */
		LockBuffer(so->currPos.buf, BT_READ);

		page = BufferGetPage(so->currPos.buf);
	}
	else
	{
		Buffer		buf;

		/* Attempt to re-read the buffer, getting pin and lock. */
        // Step 1: take a shared lock on the page
		buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);

		/* It might not exist anymore; in which case we can't hint it. */
		if (!BufferIsValid(buf))
			return;

		page = BufferGetPage(buf);
		if (BufferGetLSNAtomic(buf) == so->currPos.lsn)
			so->currPos.buf = buf;
		else
		{
			/* Modified while not pinned means hinting is not safe. */
			_bt_relbuf(scan->indexRelation, buf);
			return;
		}
	}

	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
	minoff = P_FIRSTDATAKEY(opaque);
	maxoff = PageGetMaxOffsetNumber(page);

    // Step 2: look up each tuple in killedItems on the page and mark it LP_DEAD if found
	for (i = 0; i < numKilled; i++)
	{
		int			itemIndex = so->killedItems[i];
		BTScanPosItem *kitem = &so->currPos.items[itemIndex];
		OffsetNumber offnum = kitem->indexOffset;

		Assert(itemIndex >= so->currPos.firstItem &&
			   itemIndex <= so->currPos.lastItem);
		if (offnum < minoff)
			continue;			/* pure paranoia */
		while (offnum <= maxoff)
		{
			ItemId		iid = PageGetItemId(page, offnum);
			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);

			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
			{
				/* found the item */
				ItemIdMarkDead(iid);
				killedsomething = true;
				break;			/* out of inner search loop */
			}
			offnum = OffsetNumberNext(offnum);
		}
	}

	/*
	 * Since this can be redone later if needed, mark as dirty hint.
	 *
	 * Whenever we mark anything LP_DEAD, we also set the page's
	 * BTP_HAS_GARBAGE flag, which is likewise just a hint.
	 */
	if (killedsomething)
	{
		opaque->btpo_flags |= BTP_HAS_GARBAGE;
		MarkBufferDirtyHint(so->currPos.buf, true);
	}

    // Step 3: unlock
	LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
}