B+树索引—查询
预备知识
概述
B+树索引有三个基本的操作:查询、插入、删除(更新是由删除+插入来完成)。在插入和删除之前首先需要定位插入和删除的位置,这个定位过程与查询流程基本一致,所以在讨论插入和删除之前,我们先来B+树的查询如何实现。
index_getnext_tid
index_getnext_tid是索引遍历的最上层函数,index_getnext_tid每调用一次就返回一个符合索引条件的TID。TID为Tuple ID用于标识元组的位置,和Oracle中的rowid是一个意思。index_getnext_tid的实现如下:
/* ----------------
* index_getnext_tid - get the next TID from a scan
*
* The result is the next TID satisfying the scan keys,
* or NULL if no more matching tuples exist.
* ----------------
*/
ItemPointer
index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
{
bool found;
SCAN_CHECKS;
CHECK_SCAN_PROCEDURE(amgettuple);
Assert(TransactionIdIsValid(RecentGlobalXmin));
/*
* The AM's amgettuple proc finds the next index entry matching the scan
* keys, and puts the TID into scan->xs_ctup.t_self. It should also set
* scan->xs_recheck and possibly scan->xs_itup, though we pay no attention
* to those fields here.
*/
found = scan->indexRelation->rd_amroutine->amgettuple(scan, direction);
/* Reset kill flag immediately for safety */
scan->kill_prior_tuple = false;
/* If we're out of index entries, we're done */
if (!found)
{
/* ... but first, release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
return NULL;
}
pgstat_count_index_tuples(scan->indexRelation, 1);
/* Return the TID of the tuple we found. */
return &scan->xs_ctup.t_self;
}
上述代码最关键的是:
found = scan->indexRelation->rd_amroutine->amgettuple(scan, direction);
found表示是否有满足索引条件的item,如果有就将其中的tid返回,否则返回NULL。上面的代码实际会调用btgettuple来获取满足索引条件的item。
btgettuple
索引查询的基本流程
假定test表上有索引列a,a有如下值:
1 2 3 4 5 5 5 5 6 7 8 9 10
有如下查询语句:
select * from test where a = 5;
那么数据库是如何利用B+树索引来实现查询的呢?查询流程主要分为两个阶段:
-
阶段一:定位
在B+树中定位第一个5出现的位置。定位过程可以充分利用数据的有序性,通过二分法进行快速定位。
-
阶段二:遍历
从第一个5出现的位置起,向后遍历,直到出现第一个不为5的值。
btgettuple的代码实现
/*
* btgettuple() -- Get the next tuple in the scan.
*/
bool
btgettuple(IndexScanDesc scan, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool res;
/* btree indexes are never lossy */
scan->xs_recheck = false;
/*
* If we have any array keys, initialize them during first call for a
* scan. We can't do this in btrescan because we don't know the scan
* direction at that time.
*/
if (so->numArrayKeys && !BTScanPosIsValid(so->currPos))
{
/* punt if we have any unsatisfiable array keys */
if (so->numArrayKeys < 0)
return false;
_bt_start_array_keys(scan, dir);
}
/* This loop handles advancing to the next array elements, if any */
do
{
/*
* If we've already initialized this scan, we can just advance it in
* the appropriate direction. If we haven't done so yet, we call
* _bt_first() to get the first item in the scan.
* 阶段一:定位
*/
if (!BTScanPosIsValid(so->currPos))
res = _bt_first(scan, dir);
else
{
/*
* Check to see if we should kill the previously-fetched tuple.
*/
if (scan->kill_prior_tuple)
{
/*
* Yes, remember it for later. (We'll deal with all such
* tuples at once right before leaving the index page.) The
* test for numKilled overrun is not just paranoia: if the
* caller reverses direction in the indexscan then the same
* item might get entered multiple times. It's not worth
* trying to optimize that, so we don't detect it, but instead
* just forget any excess entries.
*/
if (so->killedItems == NULL)
so->killedItems = (int *)
palloc(MaxIndexTuplesPerPage * sizeof(int));
if (so->numKilled < MaxIndexTuplesPerPage)
so->killedItems[so->numKilled++] = so->currPos.itemIndex;
}
/*
* Now continue the scan.
* 阶段二:遍历
*/
res = _bt_next(scan, dir);
}
/* If we have a tuple, return it ... */
if (res)
break;
/* ... otherwise see if we have more array keys to deal with */
} while (so->numArrayKeys && _bt_advance_array_keys(scan, dir));
return res;
}
在上述代码中,_bt_first为定位阶段。当if (!BTScanPosIsValid(so->currPos))
时,即currPos不是一个合法的位置,这说明还没有进行定位,于是调用_bt_first来完成定位,下面我们来看看_bt_first的实现。
_bt_first
bool
_bt_first(IndexScanDesc scan, ScanDirection dir)
{
Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Buffer buf;
BTStack stack;
OffsetNumber offnum;
StrategyNumber strat;
bool nextkey;
bool goback;
ScanKey startKeys[INDEX_MAX_KEYS];
ScanKeyData scankeys[INDEX_MAX_KEYS];
ScanKeyData notnullkeys[INDEX_MAX_KEYS];
int keysCount = 0;
int i;
StrategyNumber strat_total;
BTScanPosItem *currItem;
Assert(!BTScanPosIsValid(so->currPos));
pgstat_count_index_scan(rel);
/*
* step1:ScanKey的构建和处理,这里省略。
*/
......
/*
* Use the manufactured insertion scan key to descend the tree and
* position ourselves on the target leaf page.
* step2:在B+树中定位叶子节点。
*/
stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
scan->xs_snapshot);
/* don't need to keep the stack around... */
_bt_freestack(stack);
if (!BufferIsValid(buf))
{
/*
* We only get here if the index is completely empty. Lock relation
* because nothing finer to lock exists.
*/
PredicateLockRelation(rel, scan->xs_snapshot);
return false;
}
else
PredicateLockPage(rel, BufferGetBlockNumber(buf),
scan->xs_snapshot);
/* initialize moreLeft/moreRight appropriately for scan direction */
if (ScanDirectionIsForward(dir))
{
so->currPos.moreLeft = false;
so->currPos.moreRight = true;
}
else
{
so->currPos.moreLeft = true;
so->currPos.moreRight = false;
}
so->numKilled = 0; /* just paranoia */
Assert(so->markItemIndex == -1);
/* position to the precise item on the page
* step3:在叶子节点中定位遍历的起始位置。
*/
offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
* possibly at the end of a page on which all the existing items are less
* than the scan key and we know that everything on later pages is greater
* than or equal to scan key.
*
* If nextkey = true, we are positioned at the first item > scan key, or
* possibly at the end of a page on which all the existing items are less
* than or equal to the scan key and we know that everything on later
* pages is greater than scan key.
*
* The actually desired starting point is either this item or the prior
* one, or in the end-of-page case it's the first item on the next page or
* the last item on this page. Adjust the starting offset if needed. (If
* this results in an offset before the first item or after the last one,
* _bt_readpage will report no items found, and then we'll step to the
* next page as needed.)
*/
if (goback)
offnum = OffsetNumberPrev(offnum);
/* remember which buffer we have pinned, if any */
Assert(!BTScanPosIsValid(so->currPos));
so->currPos.buf = buf;
/*
* Now load data from the first page of the scan.
* step4:获取当前叶子节点中所有合法的item,存放到本地的私有内存中。
*/
if (!_bt_readpage(scan, dir, offnum))
{
/*
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
if (!_bt_steppage(scan, dir))
return false;
}
else
{
/* Drop the lock, and maybe the pin, on the current page */
_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
}
/* OK, itemIndex says what to return
* step5:从本地缓存中获取第一个合法item。
*/
currItem = &so->currPos.items[so->currPos.itemIndex];
scan->xs_ctup.t_self = currItem->heapTid;
if (scan->xs_want_itup)
scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
return true;
}
_bt_first的代码非常的长,主要是对于ScanKey的处理比较麻烦,ScanKey是用于B+树定位和遍历的key,这个会在《PostgreSQL B+树索引—ScanKey》中详细介绍,所以这里我们省略这部分代码。上述代码主要分为5个步骤:
- step1:ScanKey的构建和处理
- step2:在B+树中定位叶子节点
- step3:在叶子节点中定位遍历的起始位置
- step4:获取当前叶子节点中所有合法的item
- step5:从本地缓存中获取第一个合法item
定位阶段
在上述5个步骤中,step2和step3共同组成定位阶段。首先调用_bt_search找到遍历阶段的起始叶子节点,然后调用_bt_binsrch在这个节点中用二分法定位遍历阶段的起始item。_bt_binsrch也在《PostgreSQL B+树索引—ScanKey》中详细介绍。这里,我们详细介绍下_bt_search:
/*
* _bt_search() -- Search the tree for a particular scankey,
* or more precisely for the first leaf page it could be on.
*
* The passed scankey must be an insertion-type scankey (see nbtree/README),
* but it can omit the rightmost column(s) of the index.
*
* When nextkey is false (the usual case), we are looking for the first
* item >= scankey. When nextkey is true, we are looking for the first
* item strictly greater than scankey.
*
* Return value is a stack of parent-page pointers. *bufP is set to the
* address of the leaf-page buffer, which is read-locked and pinned.
* No locks are held on the parent pages, however!
*
* If the snapshot parameter is not NULL, "old snapshot" checking will take
* place during the descent through the tree. This is not needed when
* positioning for an insert or delete, so NULL is used for those cases.
*
* NOTE that the returned buffer is read-locked regardless of the access
* parameter. However, access = BT_WRITE will allow an empty root page
* to be created and returned. When access = BT_READ, an empty index
* will result in *bufP being set to InvalidBuffer. Also, in BT_WRITE mode,
* any incomplete splits encountered during the search will be finished.
*/
BTStack
_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
Buffer *bufP, int access, Snapshot snapshot)
{
BTStack stack_in = NULL;
/* Get the root page to start with
* step1:获取根节点。
*/
*bufP = _bt_getroot(rel, access);
/* If index is empty and access = BT_READ, no root page is created. */
if (!BufferIsValid(*bufP))
return (BTStack) NULL;
/* Loop iterates once per level descended in the tree */
for (;;)
{
Page page;
BTPageOpaque opaque;
OffsetNumber offnum;
ItemId itemid;
IndexTuple itup;
BlockNumber blkno;
BlockNumber par_blkno;
BTStack new_stack;
/*
* Race -- the page we just grabbed may have split since we read its
* pointer in the parent (or metapage). If it has, we may need to
* move right to its new sibling. Do that.
*
* In write-mode, allow _bt_moveright to finish any incomplete splits
* along the way. Strictly speaking, we'd only need to finish an
* incomplete split on the leaf page we're about to insert to, not on
* any of the upper levels (they are taken care of in _bt_getstackbuf,
* if the leaf page is split and we insert to the parent page). But
* this is a good opportunity to finish splits of internal pages too.
*
* step2:moveright,找到第一个high key > scankey的节点。
*/
*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
(access == BT_WRITE), stack_in,
BT_READ, snapshot);
/* if this is a leaf page, we're done
* step3:如果当前节点是叶子节点则直接返回。
*/
page = BufferGetPage(*bufP);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (P_ISLEAF(opaque))
break;
/*
* Find the appropriate item on the internal page, and get the child
* page that it points to.
*
* step3:如果当前节点是非叶子节点则用二分法找到合适的位置,然后获取下级节点的编号。
*/
offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
itemid = PageGetItemId(page, offnum);
itup = (IndexTuple) PageGetItem(page, itemid);
blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
par_blkno = BufferGetBlockNumber(*bufP);
/*
* We need to save the location of the index entry we chose in the
* parent page on a stack. In case we split the tree, we'll use the
* stack to work back up to the parent page. We also save the actual
* downlink (TID) to uniquely identify the index entry, in case it
* moves right while we're working lower in the tree. See the paper
* by Lehman and Yao for how this is detected and handled. (We use the
* child link to disambiguate duplicate keys in the index -- Lehman
* and Yao disallow duplicate keys.)
*
* 遍历过程中需要记录遍历的路径,用于(插入)分裂时向父亲节点插入新节点的key和节点编号
*/
new_stack = (BTStack) palloc(sizeof(BTStackData));
new_stack->bts_blkno = par_blkno;
new_stack->bts_offset = offnum;
memcpy(&new_stack->bts_btentry, itup, sizeof(IndexTupleData));
new_stack->bts_parent = stack_in;
/* drop the read lock on the parent page, acquire one on the child
* step4:释放当前当前节点的锁,同时给下级节点上锁
*/
*bufP = _bt_relandgetbuf(rel, *bufP, blkno, BT_READ);
/* okay, all set to move down a level */
stack_in = new_stack;
}
return stack_in;
}
_bt_search分为4个步骤:
-
step1:获取根节点
-
step2:moveright
找到第一个high key > scankey的节点。这个步骤很关键,在《B+树索引—基本结构》我们提到过B*树的并发性。B*树的高并发可能导致,当我们访问一个节点时节点发生了分裂,所以需要通过_bt_moveright,找到第一个high key > scankey的节点。
-
step3:判断当前节点的节点类型
如果是叶子节点则直接返回。如果是非叶子节点则需要用二分法找到合适的位置,然后获取下级节点的编号。
-
step4:移动至下级节点
释放当前当前节点的锁,同时给下级节点上锁。这个动作是通过_bt_relandgetbuf来实现的,_bt_relandgetbuf的代码如下:
Buffer _bt_relandgetbuf(Relation rel, Buffer obuf, BlockNumber blkno, int access) { Buffer buf; Assert(blkno != P_NEW); if (BufferIsValid(obuf)) LockBuffer(obuf, BUFFER_LOCK_UNLOCK); buf = ReleaseAndReadBuffer(obuf, rel, blkno); LockBuffer(buf, access); _bt_checkpage(rel, buf); return buf; }
这个代码很简单,就是释放obuf的锁,然后给blkno加锁。_bt_moveright中从当前节点移动到右兄弟,也是调用_bt_relandgetbuf,先释放当前节点的锁,再给右兄弟加锁。这样就保证了同一时刻只会有一个节点持有锁,从而避免了死锁。
在整个遍历流程中,会使用一个栈来记录遍历路径。这是给插入操作使用的,插入时如果节点发生了分裂,则需要将新节点的节点编号和最小key插入到父节点中,这就需要一个栈来记录父节点。
遍历阶段
定位阶段结束后,就开始了遍历阶段,即从定位的位置开始,按照升序\降序来向前(向做)\向后(向右)遍历B+树的叶子节点,直到不满足遍历条件时终止遍历。由于我们在读取一个索引节点时,必须给这个节点加共享锁(索引item没有MVCC),为了防止共享锁持有时间太长,PostgreSQL会一次性遍历整个节点,获取符合索引条件的所有item然后缓存到本地是私有内存,然后释放节点共享锁,这就是_bt_first的step4干的事情。_bt_readpage负责从节点中获取所有满足索引条件的item,存放到so->currPos.items中。_bt_first的step5从so->currPos.items中取出第一个item。
_bt_next
进入遍历阶段后,再调用btgettuple就会直接调用_bt_next,_bt_next的代码如下:
bool
_bt_next(IndexScanDesc scan, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
BTScanPosItem *currItem;
/*
* Advance to next tuple on current page; or if there's no more, try to
* step to the next page with data.
*
*/
if (ScanDirectionIsForward(dir))
{
/*
* 遍历完一个节点后需要遍历下一个阶段,
* _bt_steppage中也会调用_bt_readpage来获取节点中所有合法的item。
*/
if (++so->currPos.itemIndex > so->currPos.lastItem)
{
if (!_bt_steppage(scan, dir))
return false;
}
}
else
{
if (--so->currPos.itemIndex < so->currPos.firstItem)
{
if (!_bt_steppage(scan, dir))
return false;
}
}
/* OK, itemIndex says what to return */
currItem = &so->currPos.items[so->currPos.itemIndex];
scan->xs_ctup.t_self = currItem->heapTid;
if (scan->xs_want_itup)
scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
return true;
}