PostgreSQL B+树索引---查询

B+树索引—查询

预备知识

PostgreSQL B+树索引—基本结构

概述

B+树索引有三个基本的操作:查询、插入、删除(更新是由删除+插入来完成)。在插入和删除之前首先需要定位插入和删除的位置,这个定位过程与查询流程基本一致,所以在讨论插入和删除之前,我们先来B+树的查询如何实现。

index_getnext_tid

index_getnext_tid是索引遍历的最上层函数,index_getnext_tid每调用一次就返回一个符合索引条件的TID。TID为Tuple ID用于标识元组的位置,和Oracle中的rowid是一个意思。index_getnext_tid的实现如下:


/* ----------------
 * index_getnext_tid - get the next TID from a scan
 *
 * The result is the next TID satisfying the scan keys,
 * or NULL if no more matching tuples exist.
 * ----------------
 */
ItemPointer
index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
{
	bool		found;

	SCAN_CHECKS;
	CHECK_SCAN_PROCEDURE(amgettuple);

	Assert(TransactionIdIsValid(RecentGlobalXmin));

	/*
	 * The AM's amgettuple proc finds the next index entry matching the scan
	 * keys, and puts the TID into scan->xs_ctup.t_self.  It should also set
	 * scan->xs_recheck and possibly scan->xs_itup, though we pay no attention
	 * to those fields here.
	 */
	found = scan->indexRelation->rd_amroutine->amgettuple(scan, direction);

	/* Reset kill flag immediately for safety */
	scan->kill_prior_tuple = false;

	/* If we're out of index entries, we're done */
	if (!found)
	{
		/* ... but first, release any held pin on a heap page */
		if (BufferIsValid(scan->xs_cbuf))
		{
			ReleaseBuffer(scan->xs_cbuf);
			scan->xs_cbuf = InvalidBuffer;
		}
		return NULL;
	}

	pgstat_count_index_tuples(scan->indexRelation, 1);

	/* Return the TID of the tuple we found. */
	return &scan->xs_ctup.t_self;
}

上述代码最关键的是:

found = scan->indexRelation->rd_amroutine->amgettuple(scan, direction);

found表示是否有满足索引条件的item,如果有就将其中的tid返回,否则返回NULL。上面的代码实际会调用btgettuple来获取满足索引条件的item。

btgettuple

索引查询的基本流程

假定test表上有索引列a,a有如下值:

1 2 3 4 5 5 5 5 6 7 8 9 10

有如下查询语句:

select * from test where a = 5;

那么数据库是如何利用B+树索引来实现查询的呢?查询流程主要分为两个阶段:

  • 阶段一:定位

    在B+树中定位第一个5出现的位置。定位过程可以充分利用数据的有序性,通过二分法进行快速定位。

  • 阶段二:遍历

    从第一个5出现的位置起,向后遍历,直到出现第一个不为5的值。

btgettuple的代码实现

/*
 *	btgettuple() -- Get the next tuple in the scan.
 */
bool
btgettuple(IndexScanDesc scan, ScanDirection dir)
{
	BTScanOpaque so = (BTScanOpaque) scan->opaque;
	bool		res;

	/* btree indexes are never lossy */
	scan->xs_recheck = false;

	/*
	 * If we have any array keys, initialize them during first call for a
	 * scan.  We can't do this in btrescan because we don't know the scan
	 * direction at that time.
	 */
	if (so->numArrayKeys && !BTScanPosIsValid(so->currPos))
	{
		/* punt if we have any unsatisfiable array keys */
		if (so->numArrayKeys < 0)
			return false;

		_bt_start_array_keys(scan, dir);
	}

	/* This loop handles advancing to the next array elements, if any */
	do
	{
		/*
		 * If we've already initialized this scan, we can just advance it in
		 * the appropriate direction.  If we haven't done so yet, we call
		 * _bt_first() to get the first item in the scan.
		 * 阶段一:定位
		 */
		if (!BTScanPosIsValid(so->currPos))
			res = _bt_first(scan, dir);
		else
		{
			/*
			 * Check to see if we should kill the previously-fetched tuple.
			 */
			if (scan->kill_prior_tuple)
			{
				/*
				 * Yes, remember it for later. (We'll deal with all such
				 * tuples at once right before leaving the index page.)  The
				 * test for numKilled overrun is not just paranoia: if the
				 * caller reverses direction in the indexscan then the same
				 * item might get entered multiple times. It's not worth
				 * trying to optimize that, so we don't detect it, but instead
				 * just forget any excess entries.
				 */
				if (so->killedItems == NULL)
					so->killedItems = (int *)
						palloc(MaxIndexTuplesPerPage * sizeof(int));
				if (so->numKilled < MaxIndexTuplesPerPage)
					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
			}

			/*
			 * Now continue the scan.
			 * 阶段二:遍历
			 */
			res = _bt_next(scan, dir);
		}

		/* If we have a tuple, return it ... */
		if (res)
			break;
		/* ... otherwise see if we have more array keys to deal with */
	} while (so->numArrayKeys && _bt_advance_array_keys(scan, dir));

	return res;
}

在上述代码中,_bt_first为定位阶段。当if (!BTScanPosIsValid(so->currPos))时,即currPos不是一个合法的位置,这说明还没有进行定位,于是调用_bt_first来完成定位,下面我们来看看_bt_first的实现。

_bt_first

bool
_bt_first(IndexScanDesc scan, ScanDirection dir)
{
	Relation	rel = scan->indexRelation;
	BTScanOpaque so = (BTScanOpaque) scan->opaque;
	Buffer		buf;
	BTStack		stack;
	OffsetNumber offnum;
	StrategyNumber strat;
	bool		nextkey;
	bool		goback;
	ScanKey		startKeys[INDEX_MAX_KEYS];
	ScanKeyData scankeys[INDEX_MAX_KEYS];
	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
	int			keysCount = 0;
	int			i;
	StrategyNumber strat_total;
	BTScanPosItem *currItem;

	Assert(!BTScanPosIsValid(so->currPos));

	pgstat_count_index_scan(rel);

    
	/*
	 * step1:ScanKey的构建和处理,这里省略。
	 */
    ......
        
	/*
	 * Use the manufactured insertion scan key to descend the tree and
	 * position ourselves on the target leaf page.
	 * step2:在B+树中定位叶子节点。
	 */
	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
					   scan->xs_snapshot);

	/* don't need to keep the stack around... */
	_bt_freestack(stack);

	if (!BufferIsValid(buf))
	{
		/*
		 * We only get here if the index is completely empty. Lock relation
		 * because nothing finer to lock exists.
		 */
		PredicateLockRelation(rel, scan->xs_snapshot);
		return false;
	}
	else
		PredicateLockPage(rel, BufferGetBlockNumber(buf),
						  scan->xs_snapshot);

	/* initialize moreLeft/moreRight appropriately for scan direction */
	if (ScanDirectionIsForward(dir))
	{
		so->currPos.moreLeft = false;
		so->currPos.moreRight = true;
	}
	else
	{
		so->currPos.moreLeft = true;
		so->currPos.moreRight = false;
	}
	so->numKilled = 0;			/* just paranoia */
	Assert(so->markItemIndex == -1);

	/* position to the precise item on the page 
	 * step3:在叶子节点中定位遍历的起始位置。
	 */
	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);

	/*
	 * If nextkey = false, we are positioned at the first item >= scan key, or
	 * possibly at the end of a page on which all the existing items are less
	 * than the scan key and we know that everything on later pages is greater
	 * than or equal to scan key.
	 *
	 * If nextkey = true, we are positioned at the first item > scan key, or
	 * possibly at the end of a page on which all the existing items are less
	 * than or equal to the scan key and we know that everything on later
	 * pages is greater than scan key.
	 *
	 * The actually desired starting point is either this item or the prior
	 * one, or in the end-of-page case it's the first item on the next page or
	 * the last item on this page.  Adjust the starting offset if needed. (If
	 * this results in an offset before the first item or after the last one,
	 * _bt_readpage will report no items found, and then we'll step to the
	 * next page as needed.)
	 */
	if (goback)
		offnum = OffsetNumberPrev(offnum);

	/* remember which buffer we have pinned, if any */
	Assert(!BTScanPosIsValid(so->currPos));
	so->currPos.buf = buf;

	/*
	 * Now load data from the first page of the scan.
	 * step4:获取当前叶子节点中所有合法的item,存放到本地的私有内存中。
	 */
	if (!_bt_readpage(scan, dir, offnum))
	{
		/*
		 * There's no actually-matching data on this page.  Try to advance to
		 * the next page.  Return false if there's no matching data at all.
		 */
		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
		if (!_bt_steppage(scan, dir))
			return false;
	}
	else
	{
		/* Drop the lock, and maybe the pin, on the current page */
		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
	}

	/* OK, itemIndex says what to return 
	 * step5:从本地缓存中获取第一个合法item。
	 */
	currItem = &so->currPos.items[so->currPos.itemIndex];
	scan->xs_ctup.t_self = currItem->heapTid;
	if (scan->xs_want_itup)
		scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);

	return true;
}

_bt_first的代码非常的长,主要是对于ScanKey的处理比较麻烦,ScanKey是用于B+树定位和遍历的key,这个会在《PostgreSQL B+树索引—ScanKey》中详细介绍,所以这里我们省略这部分代码。上述代码主要分为5个步骤:

  • step1:ScanKey的构建和处理
  • step2:在B+树中定位叶子节点
  • step3:在叶子节点中定位遍历的起始位置
  • step4:获取当前叶子节点中所有合法的item
  • step5:从本地缓存中获取第一个合法item
定位阶段

在上述5个步骤中,step2和step3共同组成定位阶段。首先调用_bt_search找到遍历阶段的起始叶子节点,然后调用_bt_binsrch在这个节点中用二分法定位遍历阶段的起始item。_bt_binsrch也在《PostgreSQL B+树索引—ScanKey》中详细介绍。这里,我们详细介绍下_bt_search:

/*
 *	_bt_search() -- Search the tree for a particular scankey,
 *		or more precisely for the first leaf page it could be on.
 *
 * The passed scankey must be an insertion-type scankey (see nbtree/README),
 * but it can omit the rightmost column(s) of the index.
 *
 * When nextkey is false (the usual case), we are looking for the first
 * item >= scankey.  When nextkey is true, we are looking for the first
 * item strictly greater than scankey.
 *
 * Return value is a stack of parent-page pointers.  *bufP is set to the
 * address of the leaf-page buffer, which is read-locked and pinned.
 * No locks are held on the parent pages, however!
 *
 * If the snapshot parameter is not NULL, "old snapshot" checking will take
 * place during the descent through the tree.  This is not needed when
 * positioning for an insert or delete, so NULL is used for those cases.
 *
 * NOTE that the returned buffer is read-locked regardless of the access
 * parameter.  However, access = BT_WRITE will allow an empty root page
 * to be created and returned.  When access = BT_READ, an empty index
 * will result in *bufP being set to InvalidBuffer.  Also, in BT_WRITE mode,
 * any incomplete splits encountered during the search will be finished.
 */
BTStack
_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
		   Buffer *bufP, int access, Snapshot snapshot)
{
	BTStack		stack_in = NULL;

	/* Get the root page to start with 
	 * step1:获取根节点。
	 */
	*bufP = _bt_getroot(rel, access);

	/* If index is empty and access = BT_READ, no root page is created. */
	if (!BufferIsValid(*bufP))
		return (BTStack) NULL;

	/* Loop iterates once per level descended in the tree */
	for (;;)
	{
		Page		page;
		BTPageOpaque opaque;
		OffsetNumber offnum;
		ItemId		itemid;
		IndexTuple	itup;
		BlockNumber blkno;
		BlockNumber par_blkno;
		BTStack		new_stack;

		/*
		 * Race -- the page we just grabbed may have split since we read its
		 * pointer in the parent (or metapage).  If it has, we may need to
		 * move right to its new sibling.  Do that.
		 *
		 * In write-mode, allow _bt_moveright to finish any incomplete splits
		 * along the way.  Strictly speaking, we'd only need to finish an
		 * incomplete split on the leaf page we're about to insert to, not on
		 * any of the upper levels (they are taken care of in _bt_getstackbuf,
		 * if the leaf page is split and we insert to the parent page).  But
		 * this is a good opportunity to finish splits of internal pages too.
		 *
		 * step2:moveright,找到第一个high key > scankey的节点。
		 */
		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
							  (access == BT_WRITE), stack_in,
							  BT_READ, snapshot);

		/* if this is a leaf page, we're done 
		 * step3:如果当前节点是叶子节点则直接返回。
		 */
		page = BufferGetPage(*bufP);
		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
		if (P_ISLEAF(opaque))
			break;

		/*
		 * Find the appropriate item on the internal page, and get the child
		 * page that it points to.
		 *
		 * step3:如果当前节点是非叶子节点则用二分法找到合适的位置,然后获取下级节点的编号。
		 */
		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
		itemid = PageGetItemId(page, offnum);
		itup = (IndexTuple) PageGetItem(page, itemid);
		blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
		par_blkno = BufferGetBlockNumber(*bufP);

		/*
		 * We need to save the location of the index entry we chose in the
		 * parent page on a stack. In case we split the tree, we'll use the
		 * stack to work back up to the parent page.  We also save the actual
		 * downlink (TID) to uniquely identify the index entry, in case it
		 * moves right while we're working lower in the tree.  See the paper
		 * by Lehman and Yao for how this is detected and handled. (We use the
		 * child link to disambiguate duplicate keys in the index -- Lehman
		 * and Yao disallow duplicate keys.)
		 *
		 * 遍历过程中需要记录遍历的路径,用于(插入)分裂时向父亲节点插入新节点的key和节点编号
		 */
		new_stack = (BTStack) palloc(sizeof(BTStackData));
		new_stack->bts_blkno = par_blkno;
		new_stack->bts_offset = offnum;
		memcpy(&new_stack->bts_btentry, itup, sizeof(IndexTupleData));
		new_stack->bts_parent = stack_in;

		/* drop the read lock on the parent page, acquire one on the child 
		 * step4:释放当前当前节点的锁,同时给下级节点上锁
		 */
		*bufP = _bt_relandgetbuf(rel, *bufP, blkno, BT_READ);

		/* okay, all set to move down a level */
		stack_in = new_stack;
	}

	return stack_in;
}

_bt_search分为4个步骤:

  • step1:获取根节点

  • step2:moveright

    找到第一个high key > scankey的节点。这个步骤很关键,在《B+树索引—基本结构》我们提到过B*树的并发性。B*树的高并发可能导致,当我们访问一个节点时节点发生了分裂,所以需要通过_bt_moveright,找到第一个high key > scankey的节点。

  • step3:判断当前节点的节点类型

    如果是叶子节点则直接返回。如果是非叶子节点则需要用二分法找到合适的位置,然后获取下级节点的编号。

  • step4:移动至下级节点

    释放当前当前节点的锁,同时给下级节点上锁。这个动作是通过_bt_relandgetbuf来实现的,_bt_relandgetbuf的代码如下:

    Buffer
    _bt_relandgetbuf(Relation rel, Buffer obuf, BlockNumber blkno, int access)
    {
    	Buffer		buf;
    
    	Assert(blkno != P_NEW);
    	if (BufferIsValid(obuf))
    		LockBuffer(obuf, BUFFER_LOCK_UNLOCK);
    	buf = ReleaseAndReadBuffer(obuf, rel, blkno);
    	LockBuffer(buf, access);
    	_bt_checkpage(rel, buf);
    	return buf;
    }
    

    这个代码很简单,就是释放obuf的锁,然后给blkno加锁。_bt_moveright中从当前节点移动到右兄弟,也是调用_bt_relandgetbuf,先释放当前节点的锁,再给右兄弟加锁。这样就保证了同一时刻只会有一个节点持有锁,从而避免了死锁。

在整个遍历流程中,会使用一个栈来记录遍历路径。这是给插入操作使用的,插入时如果节点发生了分裂,则需要将新节点的节点编号和最小key插入到父节点中,这就需要一个栈来记录父节点。

遍历阶段

定位阶段结束后,就开始了遍历阶段,即从定位的位置开始,按照升序\降序来向前(向做)\向后(向右)遍历B+树的叶子节点,直到不满足遍历条件时终止遍历。由于我们在读取一个索引节点时,必须给这个节点加共享锁(索引item没有MVCC),为了防止共享锁持有时间太长,PostgreSQL会一次性遍历整个节点,获取符合索引条件的所有item然后缓存到本地是私有内存,然后释放节点共享锁,这就是_bt_first的step4干的事情。_bt_readpage负责从节点中获取所有满足索引条件的item,存放到so->currPos.items中。_bt_first的step5从so->currPos.items中取出第一个item。

_bt_next

进入遍历阶段后,再调用btgettuple就会直接调用_bt_next,_bt_next的代码如下:

bool
_bt_next(IndexScanDesc scan, ScanDirection dir)
{
	BTScanOpaque so = (BTScanOpaque) scan->opaque;
	BTScanPosItem *currItem;

	/*
	 * Advance to next tuple on current page; or if there's no more, try to
	 * step to the next page with data.
	 *
	 */
	if (ScanDirectionIsForward(dir))
	{
        /*
         * 遍历完一个节点后需要遍历下一个阶段,
		 * _bt_steppage中也会调用_bt_readpage来获取节点中所有合法的item。
         */
		if (++so->currPos.itemIndex > so->currPos.lastItem)
		{
			if (!_bt_steppage(scan, dir))
				return false;
		}
	}
	else
	{
		if (--so->currPos.itemIndex < so->currPos.firstItem)
		{
			if (!_bt_steppage(scan, dir))
				return false;
		}
	}

	/* OK, itemIndex says what to return */
	currItem = &so->currPos.items[so->currPos.itemIndex];
	scan->xs_ctup.t_self = currItem->heapTid;
	if (scan->xs_want_itup)
		scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);

	return true;
}
  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 8
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 8
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值