PostgreSQL B+ Tree Index: Splitting

obvious__

Last modified: 2022-06-20 23:02:35


First published: 2021-09-23 12:42:02

Article link: https://blog.csdn.net/obvious__/article/details/120429726


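B+ Tree Index: Splitting

Prerequisites

"PostgreSQL B+ Tree Index: Search"

"PostgreSQL B+ Tree Index: Insertion"

Overview

We have finally reached the hardest part of the B+ tree index: node splitting. The split itself is not difficult; the difficulty lies mainly in two problems:

Concurrency control issues caused by splits.
How to guarantee that a split is atomic, that is, how to recover the index if the system crashes during a split.

This chapter covers the basic split procedure and its implementation. Split-related concurrency control and atomicity are covered in separate documents.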

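Splitting Illustrated

Before looking at the code, let us walk through PostgreSQL's split algorithm with diagrams.

Figure 1

Figure 1 shows the current state of the B+ tree. Note the following points:

The B+ tree currently has a single leaf node, block1, so block1 is also the root node.
Because block1 has no right sibling, it is a right_most node, and a right_most node has no high key (its high key would be positive infinity).
block1 is already full, so inserting one more element triggers a split.

Now we insert element 6 into block1, which causes block1 to split. The split consists of the following steps:

Determine the split point
Migrate

Allocate a new node and migrate the data at and after the split point from the original node into it.
Link

Add the new node to the linked list.
Write the parent node

Insert the new node's min key and blockno into the parent node as an index tuple.
Update the root node

This does not happen on every split; the root only needs updating when the root node itself is split.

Let us look at these steps one by one: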

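Determine the split point

The split point is usually in the middle of the node, so that space is balanced between the two nodes after the split. block1 holds 5 elements, so we choose 5/2 = 2 as the split point (array indexes start at 0), meaning that element 3 and everything after it move to the new node, as shown in Figure 2:

Figure 2

Migrate

Now we create a new node and migrate the data at and after the split point into it, as shown in Figure 3.

Figure 3

After migration, block2 becomes the right sibling of block1. block2 is now the rightmost node and therefore has no high key, whereas block1 is no longer rightmost and needs one. The high key of block1 equals the min key of block2, so the high key of block1 is 3, as shown in Figure 4:

Figure 4

Link

Add block2 to the doubly linked list, as shown in Figure 5:

Figure 5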
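
The first three steps (choose the split point, migrate, link) can be sketched on a toy node with fixed-size integer keys. This is only an illustration under simplifying assumptions, not PostgreSQL code: ToyNode, TOY_CAPACITY and toy_split are hypothetical names, real pages hold variable-length index tuples, and the insertion of the new element itself is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CAPACITY 5              /* hypothetical: keys per node, as in Figure 1 */

/* Toy node: fixed-size keys, sibling links, optional high key. */
typedef struct ToyNode
{
    int             keys[TOY_CAPACITY];
    int             nkeys;
    bool            has_high_key;   /* false for a rightmost node */
    int             high_key;
    struct ToyNode *prev;
    struct ToyNode *next;
} ToyNode;

/* Split `left` in the middle, moving the upper part into the empty node `right`. */
static void
toy_split(ToyNode *left, ToyNode *right)
{
    int         split;
    int         i;

    assert(left->nkeys >= 2);
    split = left->nkeys / 2;        /* step 1: choose the split point */

    /* step 2: migrate the keys at and after the split point */
    right->nkeys = 0;
    for (i = split; i < left->nkeys; i++)
        right->keys[right->nkeys++] = left->keys[i];
    left->nkeys = split;

    /* the right node inherits the old high key (none if left was rightmost) */
    right->has_high_key = left->has_high_key;
    right->high_key = left->high_key;

    /* the left node's high key equals the right node's min key */
    left->has_high_key = true;
    left->high_key = right->keys[0];

    /* step 3: link the new node into the doubly linked list */
    right->next = left->next;
    right->prev = left;
    if (left->next != NULL)
        left->next->prev = right;
    left->next = right;
}
```

Applied to a node holding 1, 2, 3, 4, 5, this leaves 1, 2 on the left with high key 3 and moves 3, 4, 5 to the right node, which matches Figures 2 through 5.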

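Write the parent node

After the steps above, the last step is to write the new node's min key and block number into the parent node as an index tuple, as shown in Figure 6:

Figure 6

A few points about this step:

Because block1 was the root node and had no parent, a parent node, block3, must be allocated.
block3 has no right sibling, so block3 is also a right_most node and has no high key.
block3 is a non-leaf node (also called an internal node). block3 has no left sibling either, so it is also a left_most node. Such a node should not have a minimum value, because that minimum would be negative infinity, so the first index tuple of block3 is shown as null. The PostgreSQL source code has a corresponding comment, which states this for the first item of a non-leaf page: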
```
/*
 * Create downlink item for left page (old root).  Since this will be the
 * first item in a non-leaf page, it implicitly has minus-infinity key
 * value, so we need not store any actual key in it.
 */
```
The second element of block3 is the minimum value of block2, which is also the high key of block1, namely 3.
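
To make the minus-infinity convention concrete, here is a minimal sketch of choosing a child in an internal node whose first entry carries no key. ToyDownlink and toy_choose_child are hypothetical names, and sending a key that equals a separator to the right is an arbitrary choice of this toy, not a statement about PostgreSQL's search rules.

```c
#include <assert.h>

/* Toy downlink: <min key of child, child block number>. */
typedef struct ToyDownlink
{
    int         key;        /* ignored for entries[0]: implicitly minus infinity */
    unsigned    child;
} ToyDownlink;

/*
 * Return the child block to descend into for `key`.
 * entries[0] matches every key, so its stored key is never compared.
 */
static unsigned
toy_choose_child(const ToyDownlink *entries, int n, int key)
{
    unsigned    child;
    int         i;

    assert(n >= 1);
    child = entries[0].child;
    for (i = 1; i < n; i++)
    {
        if (key < entries[i].key)
            break;
        child = entries[i].child;
    }
    return child;
}
```

For block3 in Figure 6, the entries would be <null, block1> and <3, block2>: any key below 3 descends to block1 without a key ever being stored for the first entry.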

Update the root node

Clearly, in Figure 6 the root node is no longer block1 but block3, so the root must be updated.

Figure 7

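Summary

To summarize, two aspects of the split procedure deserve closer attention:

About the high key

A node's high key equals the min key of its right sibling. Consequently, if the index tuple to be inserted equals the node's high key, it can go either into the current node or into its right sibling. We have already seen this in "B+ Tree Index: Insertion".
About index tuples in non-leaf nodes

An index tuple in a non-leaf node holds the minimum value of the child node. This is an interesting choice. Some books and papers describe an alternative: use the child's high key as the index tuple in the non-leaf node. That approach has a problem; take Figure 7 as an example. If high keys were used, the index tuple for block2 would be <5, block2>. When block2 splits, the split goes to the right, so block2's high key changes and the corresponding index tuple has to be modified. In other words, with high keys a split requires not only inserting <high key, new blockno> into the parent but also updating the key stored in <high key, original blockno>. With <min key, blockno> as the index tuple, a node's min key does not change when it splits, so the split only needs to insert <min key, new blockno> for the new node. Do not underestimate this optimization: one step fewer means one XLOG record fewer, one piece of data fewer to maintain, and one potential source of errors fewer.

Code Implementation

Now let us look at how the B+ tree split is implemented. The split code lives in the _bt_insertonpg function: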

if (PageGetFreeSpace(page) < itemsz)
{
    bool		is_root = P_ISROOT(lpageop);
    bool		is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(lpageop);
    bool		newitemonleft;
    Buffer		rbuf;

    /* Choose the split point 
     * step1: find the split point
     */
    firstright = _bt_findsplitloc(rel, page,
                                  newitemoff, itemsz,
                                  &newitemonleft);

    /* split the buffer into left and right halves 
     * step2: split
     */
    rbuf = _bt_split(rel, buf, cbuf, firstright,
                     newitemoff, itemsz, itup, newitemonleft);
    PredicateLockPageSplit(rel,
                           BufferGetBlockNumber(buf),
                           BufferGetBlockNumber(rbuf));
	/* step3: write the parent node */
    _bt_insert_parent(rel, buf, rbuf, stack, is_root, is_only);
}

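The code above consists of three steps:

step1: find the split point
step2: split

This consists of the migrate and link steps described earlier.
step3: write the parent node

Let us go through them one at a time.

Finding the split point: _bt_findsplitloc

Principles for choosing the split point

Finding the split point means deciding two things:

Where does the split start?
Should the index tuple being inserted go into the left node or the right node after the split?

There are three principles:

In the usual case, the free space of the left and right nodes after the split should be as balanced as possible.

In the earlier example (Figures 1 through 7) we assumed that index tuples have a fixed length, so a block is simply an array of index tuples and the middle of the array serves as the split point. In practice index tuples may be variable-length, for example because they contain strings, so finding the split point is not that simple.
If the node being split is the right most node, leave as much free space as possible to the new node.

This mainly targets bulk inserts into an auto-increment column. Because the column value keeps increasing, every insert lands at the end of the B+ tree. Splitting for balanced space in that case increases the number of splits and makes the B+ tree larger, which hurts query performance (more index blocks must be read).
No matter how the split is done, the resulting node must have enough room for the new index tuple.

This is easy to understand: if there is still not enough room after the split, the split was pointless.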
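
The effect of the second principle can be estimated with a small simulation. The sketch below is a toy model under stated assumptions: fixed-size tuples, strictly ascending keys so that only the rightmost page ever splits, and a hypothetical parameter left_keep_percent that says how full the left page stays after a split. toy_count_pages is not a PostgreSQL function.

```c
#include <assert.h>

/*
 * Count the leaf pages produced by n strictly ascending inserts when each
 * page holds `capacity` tuples and a split leaves left_keep_percent percent
 * of the tuples on the left page.
 */
static int
toy_count_pages(int n, int capacity, int left_keep_percent)
{
    int         pages = 1;
    int         cur = 0;        /* tuples on the current rightmost page */
    int         i;

    assert(capacity > 0);
    assert(left_keep_percent > 0 && left_keep_percent < 100);

    for (i = 0; i < n; i++)
    {
        if (cur == capacity)
        {
            /* rightmost page is full: split it */
            int         keep_left = capacity * left_keep_percent / 100;

            cur = capacity - keep_left;     /* tuples moved to the new page */
            pages++;
        }
        cur++;                  /* the new tuple lands on the rightmost page */
    }
    return pages;
}
```

In this model, 1000 ascending inserts with a capacity of 100 produce 19 pages when splits are balanced (50) and 11 pages when the left page keeps 90 percent, so the unbalanced split needs fewer splits and leaves fuller pages.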

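Basic idea of the algorithm

How can the two nodes be kept as balanced as possible while guaranteeing enough room for the new index tuple? The algorithm is simple and blunt: just try each candidate. How?

Figure 8

Suppose we want to insert element 5 into block1 in Figure 8. First assume the split point is 1, that is, all elements of block1 are migrated to the new node block2. Then 5 should clearly go into block2 as well, but since every element moved to block2, block2 has no room for 5, so this option is infeasible. Next assume the split point is 2: after the split block1 holds element 1 and block2 holds 2, 5, 10, 11, 12, so block1 and block2 differ by 4 elements. Then assume the split point is 10: block1 holds 1, 2, 5 and block2 holds 10, 11, 12, so block1 and block2 differ by 0 elements. Continue with 11, and so on. After trying every element of block1, we know the best split point.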
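
The try-every-candidate idea can be sketched for fixed-size elements as follows. This is a simplified model, not the PostgreSQL implementation: it counts elements instead of bytes, it does not cover the case where the new item goes after the last old item, and ToySplit, toy_check and toy_find_split are hypothetical names. Positions are 0-based array indexes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct ToySplit
{
    bool        found;          /* any feasible split seen? */
    int         firstright;     /* index of the first old element moved right */
    bool        newitemonleft;  /* does the new element go to the left node? */
    int         best_delta;     /* |left count - right count| of the best split */
} ToySplit;

/* Evaluate one candidate and remember it if it is the best so far. */
static void
toy_check(ToySplit *st, int n, int cap, int firstright, bool newonleft)
{
    int         left = firstright + (newonleft ? 1 : 0);
    int         right = (n - firstright) + (newonleft ? 0 : 1);
    int         delta;

    if (left > cap || right > cap)
        return;                 /* infeasible: one side would overflow */

    delta = abs(left - right);
    if (!st->found || delta < st->best_delta)
    {
        st->found = true;
        st->firstright = firstright;
        st->newitemonleft = newonleft;
        st->best_delta = delta;
    }
}

/* n old elements, node capacity cap, new element belongs before index newpos. */
static ToySplit
toy_find_split(int n, int cap, int newpos)
{
    ToySplit    st = {false, 0, false, 0};
    int         s;

    for (s = 0; s < n; s++)
    {
        if (s > newpos)
            toy_check(&st, n, cap, s, true);
        else if (s < newpos)
            toy_check(&st, n, cap, s, false);
        else
        {
            /* new element sits exactly at the split point: try both sides */
            toy_check(&st, n, cap, s, true);
            toy_check(&st, n, cap, s, false);
        }
    }
    return st;
}
```

For the Figure 8 example (old elements 1, 2, 10, 11, 12, capacity 5, new element 5 belonging before index 2), the model picks index 2 (element 10) as the split point with the new element on the left, giving 3 elements on each side.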

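This is the basic idea behind how PostgreSQL determines the split point. A few implementation details are worth noting:

PostgreSQL usually does not enumerate every element of the node, because that is expensive; it stops as soon as it finds a good-enough split point. Good-enough means that the free space of the left and right pages differs by no more than 1/16 of the page size. The relevant code and comment are shown below: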

/*
 * Finding the best possible split would require checking all the possible
 * split points, because of the high-key and left-key special cases.
 * That's probably more work than it's worth; instead, stop as soon as we
 * find a "good-enough" split, where good-enough is defined as an
 * imbalance in free space of no more than pagesize/16 (arbitrary...) This
 * should let us stop near the middle on most pages, instead of plowing to
 * the end.
 */
goodenough = leftspace / 16;

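As mentioned earlier, when a right-most node is split, the right node should keep as much free space as possible. This is controlled by a value called fillfactor. The following three fillfactor constants are defined:

#define BTREE_MIN_FILLFACTOR		10	// minimum allowed fillfactor (10%)
#define BTREE_DEFAULT_FILLFACTOR	90  // default fillfactor for leaf nodes
#define BTREE_NONLEAF_FILLFACTOR	70  // default fillfactor for non-leaf nodes

Suppose we insert 10 into Figure 8 (call it new 10) and the split point is also 10 (call it old 10). Then new 10 could go into either block1 or block2. So we first assume new 10 goes into block1, then assume it goes into block2, and compare which option is better.

Next, let us look at the code for this part.

Code Implementation

Before going through the code flow, let us look at a very important struct, FindSplitData, defined as follows: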

typedef struct
{
	/* context data for _bt_checksplitloc */
	Size			newitemsz;			/* size of new item to be inserted */
	int				fillfactor;			/* needed when splitting rightmost page */
	bool			is_leaf;			/* T if splitting a leaf page */
	bool			is_rightmost;		/* T if splitting a rightmost page */
	OffsetNumber 	newitemoff;			/* where the new item is to be inserted */
	int				leftspace;			/* space available for items on left page */
	int				rightspace;			/* space available for items on right page */
	int				olddataitemstotal;	/* space taken by old items */

	bool			have_split;			/* found a valid split? */

	/* these fields valid only if have_split is true */
	bool			newitemonleft;		/* new item on left or right of best split */
	OffsetNumber 	firstright;			/* best split point */
	int				best_delta;			/* best size delta so far */
} FindSplitData;

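Because this struct is so important, we describe each member in turn:

newitemsz

The size of the index tuple to be inserted.
fillfactor

The fill factor mentioned earlier. Note that it is only used when splitting a right-most node; in all other cases the fill factor is effectively 50%.
is_leaf

Whether the node being split is a leaf node.
is_rightmost

Whether the node being split is a right-most node. This member decides whether fillfactor is used, and is_leaf decides which fillfactor value applies.
newitemoff

The position where the index tuple is to be inserted; this was determined during the locate phase.
leftspace

The space available for items on the left node.
rightspace

The space available for items on the right node.
olddataitemstotal

The total size of the data in the node before the split (excluding the index tuple to be inserted). Note the difference between olddataitemstotal and leftspace/rightspace: olddataitemstotal measures data, whereas leftspace and rightspace measure the space available to hold data.

have_split

Whether a split point has been found. It should in fact be impossible not to find one; if that happens, an error is raised.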

/*
 * I believe it is not possible to fail to find a feasible split, but just
 * in case ...
 */
if (!state.have_split)
	elog(ERROR, "could not find a feasible split point for index \"%s\"",
		 RelationGetRelationName(rel));

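newitemonleft

Whether the index tuple to be inserted goes into the left node or the right node. If the index tuple at the split point equals the index tuple to be inserted, the new tuple could go into either node, so this member records which node it actually goes to.
firstright

The split point: firstright and all elements after it are moved to the new node.
best_delta

The absolute difference in free space between the left and right nodes. When a non-right-most node is split, this equals abs(leftfree - rightfree), where leftfree and rightfree are the free space computed in _bt_checksplitloc. The goal of the split is to make this value as small as possible.

The declaration of _bt_findsplitloc is: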

static OffsetNumber
_bt_findsplitloc(Relation rel,
				 Page page,
				 OffsetNumber newitemoff,
				 Size newitemsz,
				 bool *newitemonleft);

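Parameters:

rel

The relation descriptor of the index.
page

The page to be split.
newitemoff

The insert position of the index tuple to be inserted.
newitemsz

The size of the index tuple to be inserted.
newitemonleft

Whether the index tuple is inserted into the left node; this is an output parameter.

Return value:

The split point.

The implementation of _bt_findsplitloc is as follows: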

static OffsetNumber
_bt_findsplitloc(Relation rel,
				 Page page,
				 OffsetNumber newitemoff,
				 Size newitemsz,
				 bool *newitemonleft)
{
	BTPageOpaque opaque;
	OffsetNumber offnum;
	OffsetNumber maxoff;
	ItemId		itemid;
	FindSplitData state;
	int			leftspace,
				rightspace,
				goodenough,
				olddataitemstotal,
				olddataitemstoleft;
	bool		goodenoughfound;

	opaque = (BTPageOpaque) PageGetSpecialPointer(page);

	/* initialization of state omitted */
    
	/*
	 * Scan through the data items and calculate space usage for a split at
	 * each possible position.
	 * Visit every item of the node being split, treat each one as the split point,
	 * and compute the left/right space usage for that split to find the best split point.
	 */
	olddataitemstoleft = 0;
	goodenoughfound = false;
	maxoff = PageGetMaxOffsetNumber(page);

	for (offnum = P_FIRSTDATAKEY(opaque);
		 offnum <= maxoff;
		 offnum = OffsetNumberNext(offnum))
	{
		Size		itemsz;

		itemid = PageGetItemId(page, offnum);
		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);

		/*
		 * Will the new item go to left or right of split?
		 */
        /* offnum > newitemoff: the new item goes to the left node */
		if (offnum > newitemoff)
			_bt_checksplitloc(&state, offnum, true,
							  olddataitemstoleft, itemsz);
		/* offnum < newitemoff: the new item goes to the right node */
		else if (offnum < newitemoff)
			_bt_checksplitloc(&state, offnum, false,
							  olddataitemstoleft, itemsz);
		else
		{
			/* need to try it both ways! 
			 * offnum == newitemoff: the new item could go to either node, so try both
			 */
			_bt_checksplitloc(&state, offnum, true,
							  olddataitemstoleft, itemsz);

			_bt_checksplitloc(&state, offnum, false,
							  olddataitemstoleft, itemsz);
		}

		/* Abort scan once we find a good-enough choice */
		if (state.have_split && state.best_delta <= goodenough)
		{
			goodenoughfound = true;
			break;
		}

		olddataitemstoleft += itemsz;
	}

	/*
	 * If the new item goes as the last item, check for splitting so that all
	 * the old items go to the left page and the new item goes to the right
	 * page.
	 * 
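	 * newitemoff > maxoff means the new item goes at the very end of the node.
	 * In that case, if no good-enough split point was found, try keeping all the
	 * old data on the left node and putting only the new item on the right node.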
	 */
	if (newitemoff > maxoff && !goodenoughfound)
		_bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0);

	/*
	 * I believe it is not possible to fail to find a feasible split, but just
	 * in case ...
	 */
	if (!state.have_split)
		elog(ERROR, "could not find a feasible split point for index \"%s\"",
			 RelationGetRelationName(rel));

	*newitemonleft = state.newitemonleft;
	return state.firstright;
}

The core of this function is _bt_checksplitloc, which is declared as follows:

static void
_bt_checksplitloc(FindSplitData *state,
				  OffsetNumber firstoldonright,
				  bool newitemonleft,
				  int olddataitemstoleft,
				  Size firstoldonrightsz);

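Parameters:

state

The FindSplitData struct described above.
firstoldonright

The split point.
newitemonleft

Whether the index tuple to be inserted goes into the left node (true) or the right node (false). This is an input parameter: the caller passes in the placement to evaluate.
olddataitemstoleft

The total size of the old items that stay on the left node.
firstoldonrightsz

The size of the first old item on the right node, that is, the size of the item at the split point.

Purpose:

This function evaluates a split at firstoldonright: it computes the free space of the left and right nodes after the split and the difference between them, and records the result in state when it is the best split seen so far.

The implementation of _bt_checksplitloc is as follows: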

static void
_bt_checksplitloc(FindSplitData *state,
				  OffsetNumber firstoldonright,
				  bool newitemonleft,
				  int olddataitemstoleft,
				  Size firstoldonrightsz)
{
	int			leftfree,
				rightfree;
	Size		firstrightitemsz;
	bool		newitemisfirstonright;

	/* Is the new item going to be the first item on the right page? */
	newitemisfirstonright = (firstoldonright == state->newitemoff
							 && !newitemonleft);

	if (newitemisfirstonright)
		firstrightitemsz = state->newitemsz;
	else
		firstrightitemsz = firstoldonrightsz;

	/* Account for all the old tuples */
	leftfree = state->leftspace - olddataitemstoleft;
	rightfree = state->rightspace -
		(state->olddataitemstotal - olddataitemstoleft);

	/*
	 * The first item on the right page becomes the high key of the left page;
	 * therefore it counts against left space as well as right space.
	 */
	leftfree -= firstrightitemsz;

	/* account for the new item */
	if (newitemonleft)
		leftfree -= (int) state->newitemsz;
	else
		rightfree -= (int) state->newitemsz;

	/*
	 * If we are not on the leaf level, we will be able to discard the key
	 * data from the first item that winds up on the right page.
	 */
	if (!state->is_leaf)
		rightfree += (int) firstrightitemsz -
			(int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));

	/*
	 * If feasible split point, remember best delta.
	 * Key point 1
	 */
	if (leftfree >= 0 && rightfree >= 0)
	{
		int			delta;

		if (state->is_rightmost)
		{
			/*
			 * If splitting a rightmost page, try to put (100-fillfactor)% of
			 * free space on left page. See comments for _bt_findsplitloc.
			 * Key point 3
			 */
			delta = (state->fillfactor * leftfree)
				- ((100 - state->fillfactor) * rightfree);
		}
		else
		{
			/* Otherwise, aim for equal free space on both sides */
			delta = leftfree - rightfree;
		}

		if (delta < 0)
			delta = -delta;
        /* Key point 2 */
		if (!state->have_split || delta < state->best_delta)
		{
			state->have_split = true;
			state->newitemonleft = newitemonleft;
			state->firstright = firstoldonright;
			state->best_delta = delta;
		}
	}
}

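The code above has three key points:

Key point 1: the check if (leftfree >= 0 && rightfree >= 0)

The function computes the free space of the left and right nodes after the split, leftfree and rightfree. These values already account for the new item. So if either leftfree or rightfree is < 0, the split would leave a node that cannot hold the new item, and the option is not feasible. Only when both leftfree and rightfree are >= 0 is the option feasible, and only then is the difference between leftfree and rightfree computed to find the best split point.
Key point 2: the check if (!state->have_split || delta < state->best_delta)

state->best_delta records the space difference of the best split point found so far. If the current split point has a smaller difference than best_delta, it becomes the new best split point.
Key point 3: the if (state->is_rightmost) branch

For a right_most node, the fill factor fillfactor is taken into account when computing delta.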
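
The delta formula is easier to read with numbers plugged in. The helper below restates only the arithmetic of the two branches shown above; toy_split_delta is a hypothetical standalone function, and the byte counts in the example are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Absolute imbalance of a candidate split.
 * Rightmost page: zero when leftfree : rightfree = (100 - fillfactor) : fillfactor.
 * Other pages:    zero when both sides have equal free space.
 */
static int
toy_split_delta(int leftfree, int rightfree, bool is_rightmost, int fillfactor)
{
    int         delta;

    if (is_rightmost)
        delta = (fillfactor * leftfree) - ((100 - fillfactor) * rightfree);
    else
        delta = leftfree - rightfree;

    return (delta < 0) ? -delta : delta;
}
```

With fillfactor 90, a rightmost split that leaves 800 bytes free on the left and 7200 bytes free on the right gives 90 * 800 - 10 * 7200 = 0, the ideal value, whereas an even 4000/4000 split gives 90 * 4000 - 10 * 4000 = 320000. For a non-rightmost page the same even split scores 0.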

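Splitting: _bt_split

Next we look at the code for the split phase. The key points in this phase are:

The main split flow

To keep the cost of recovery as low as possible if the system fails during a split, the split works as follows.
- Allocate a new page, rightpage, as the right node after the split.
- Allocate a temporary page, leftpage, to hold the left node's data temporarily.
- Iterate over the original node, origpage, and migrate its items to leftpage or rightpage according to the split point.
Note that while these three steps run, the original B+ tree is not modified at all. If the system fails at this point, nothing needs to be recovered after restart; the only effect on the index is that the insert fails. These three steps are also the most time-consuming part of a node split.

After these steps, the contents of leftpage are copied into origpage and rightpage is then added to the linked list. These two operations change the contents and structure of the B+ tree, so WAL is needed to guarantee atomicity.
Migration versus copying

The words migration and copying were both used above. Migration means fetching each item from the original node and re-inserting it into the new node. Copying means transferring a node's contents to another node directly with memcpy.
About XLOG

How does a B+ tree split guarantee atomicity? This is a very important question for B+ tree indexes and is covered in a separate document.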
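
The build-aside-then-copy-back pattern described above can be sketched on a toy page. This is a minimal illustration under simplifying assumptions (integer items, a fixed-size array instead of a real page layout, no locking and no WAL); ToyPage, toy_page_add and toy_split_page are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_ITEMS 8

typedef struct ToyPage
{
    int         nitems;
    int         items[TOY_MAX_ITEMS];
} ToyPage;

/* "Migration": re-insert one item into the target page. */
static bool
toy_page_add(ToyPage *page, int item)
{
    if (page->nitems >= TOY_MAX_ITEMS)
        return false;
    page->items[page->nitems++] = item;
    return true;
}

/*
 * Split orig at index firstright. On failure orig is left untouched and
 * right is zeroed, so nothing has to be undone.
 */
static bool
toy_split_page(ToyPage *orig, ToyPage *right, int firstright)
{
    ToyPage     lefttmp;        /* temporary workspace for the new left half */
    int         i;

    memset(&lefttmp, 0, sizeof(lefttmp));
    memset(right, 0, sizeof(*right));

    /* phase 1: migrate items; orig is only read here */
    for (i = 0; i < orig->nitems; i++)
    {
        ToyPage    *dst = (i < firstright) ? &lefttmp : right;

        if (!toy_page_add(dst, orig->items[i]))
        {
            memset(right, 0, sizeof(*right));
            return false;
        }
    }

    /* phase 2: the only step that modifies orig is a plain copy */
    memcpy(orig, &lefttmp, sizeof(*orig));
    return true;
}
```

The point of the design is that all the fallible work happens in phase 1, on pages the rest of the tree cannot see yet; the visible change is a single short copy.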

/*
 *	_bt_split() -- split a page in the btree.
 *
 *		On entry, buf is the page to split, and is pinned and write-locked.
 *		firstright is the item index of the first item to be moved to the
 *		new right page.  newitemoff etc. tell us about the new item that
 *		must be inserted along with the data from the old page.
 *
 *
 *		When splitting a non-leaf page, 'cbuf' is the left-sibling of the
 *		page we're inserting the downlink for.  This function will clear the
 *		INCOMPLETE_SPLIT flag on it, and release the buffer.
 *
 *
 *		Returns the new right sibling of buf, pinned and write-locked.
 *		The pin and lock on buf are maintained.
 * 		
 */
static Buffer
_bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
		  bool newitemonleft)
{
	Buffer		rbuf;
	Page		origpage;
	Page		leftpage,
				rightpage;
	BlockNumber origpagenumber,
				rightpagenumber;
	BTPageOpaque ropaque,
				lopaque,
				oopaque;
	Buffer		sbuf = InvalidBuffer;
	Page		spage = NULL;
	BTPageOpaque sopaque = NULL;
	Size		itemsz;
	ItemId		itemid;
	IndexTuple	item;
	OffsetNumber leftoff,
				rightoff;
	OffsetNumber maxoff;
	OffsetNumber i;
	bool		isroot;
	bool		isleaf;

	/* 
	 * Acquire a new page to split into 
	 * Allocate a new page to receive the split-off data; _bt_getbuf returns it pinned and locked
	 */
	rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);

	/*
	 * origpage is the original page to be split.  leftpage is a temporary
	 * buffer that receives the left-sibling data, which will be copied back
	 * into origpage on success.  rightpage is the new page that receives the
	 * right-sibling data.  If we fail before reaching the critical section,
	 * origpage hasn't been modified and leftpage is only workspace. In
	 * principle we shouldn't need to worry about rightpage either, because it
	 * hasn't been linked into the btree page structure; but to avoid leaving
	 * possibly-confusing junk behind, we are careful to rewrite rightpage as
	 * zeroes before throwing any error.
	 * 
	 * In short: until the critical section is entered the split has not really
	 * happened, so a failure before that point needs no special handling.
	 */
	origpage 	= BufferGetPage(buf);
	leftpage 	= PageGetTempPage(origpage);
	rightpage 	= BufferGetPage(rbuf);

	origpagenumber 	= BufferGetBlockNumber(buf);
	rightpagenumber = BufferGetBlockNumber(rbuf);

	_bt_pageinit(leftpage, BufferGetPageSize(buf));
	/* 
	 * rightpage was already initialized by _bt_getbuf 
	 * leftpage is a temporary page, so it has to be initialized here;
	 * rightpage was already initialized inside _bt_getbuf
	 */

	/*
	 * Copy the original page's LSN into leftpage, which will become the
	 * updated version of the page.  We need this because XLogInsert will
	 * examine the LSN and possibly dump it in a page image.
	 * After initialization, origpage's LSN is written into leftpage, because XLogInsert
	 * later uses pd_lsn to decide whether to put a page image into the xlog (to cope with partial writes)
	 */
	PageSetLSN(leftpage, PageGetLSN(origpage));

	/* 
	 * init btree private data 
	 * BTPageOpaque mainly records a node's left and right siblings, which chain the pages of the B+ tree together
	 */
	oopaque = (BTPageOpaque) PageGetSpecialPointer(origpage);
	lopaque = (BTPageOpaque) PageGetSpecialPointer(leftpage);
	ropaque = (BTPageOpaque) PageGetSpecialPointer(rightpage);

	isroot = P_ISROOT(oopaque);
	isleaf = P_ISLEAF(oopaque);

	/* 
	 * if we're splitting this page, it won't be the root when we're done 
	 * also, clear the SPLIT_END and HAS_GARBAGE flags in both pages
	 * (BTP_SPLIT_END: rightmost page of a split group; BTP_HAS_GARBAGE: page has LP_DEAD tuples)
	 */
	lopaque->btpo_flags = oopaque->btpo_flags;
	lopaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END | BTP_HAS_GARBAGE);
	ropaque->btpo_flags = lopaque->btpo_flags;
	/* set flag in left page indicating that the right page has no downlink */
	lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;
	lopaque->btpo_prev 	= oopaque->btpo_prev;
	lopaque->btpo_next 	= rightpagenumber;
	ropaque->btpo_prev 	= origpagenumber;/* note: leftpage is temporary, so use origpagenumber */
	ropaque->btpo_next 	= oopaque->btpo_next;
	lopaque->btpo.level = ropaque->btpo.level = oopaque->btpo.level;
	/* Since we already have write-lock on both pages, ok to read cycleid */
	lopaque->btpo_cycleid = _bt_vacuum_cycleid(rel);
	ropaque->btpo_cycleid = lopaque->btpo_cycleid;

	/*
	 * Purpose: set the high key of the right node
	 * 
	 * If the page we're splitting is not the rightmost page at its level in
	 * the tree, then the first entry on the page is the high key for the
	 * page.  We need to copy that to the right half.  Otherwise (meaning the
	 * rightmost page case), all the items on the right half will be user
	 * data.
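	 * If origpage is not the rightmost node on its level, the rightpage produced by
	 * the split cannot be the rightmost node on that level either.
	 * 
	 * If rightpage is not the rightmost node, P_HIKEY holds rightpage's high key and
	 * its first real key goes at P_HIKEY+1. In that case rightpage inherits
	 * origpage's high key, so origpage's high key is copied to rightpage.
	 * 
	 * If rightpage is the rightmost node, P_HIKEY holds a real key and there is
	 * no high key to worry about.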
	 */
	rightoff = P_HIKEY;

	if (!P_RIGHTMOST(oopaque))
	{
        // not the rightmost node
		itemid 	= PageGetItemId(origpage, P_HIKEY);
		itemsz 	= ItemIdGetLength(itemid);
		item 	= (IndexTuple) PageGetItem(origpage, itemid);
		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
						false, false) == InvalidOffsetNumber)
		{
			memset(rightpage, 0, BufferGetPageSize(rbuf));
			elog(ERROR, "failed to add hikey to the right sibling"
				 " while splitting block %u of index \"%s\"",
				 origpagenumber, RelationGetRelationName(rel));
		}
		rightoff = OffsetNumberNext(rightoff);
	}

	/*
	 * Purpose: set the high key of the left node
	 *
	 * The "high key" for the new left page will be the first key that's going
	 * to go into the new right page.  This might be either the existing data
	 * item at position firstright, or the incoming tuple.
	 * In other words, the high key of leftpage (the future origpage) equals the
	 * smallest key of rightpage, which is either the newly inserted item or the
	 * item at the split position (firstright).
	 */
	leftoff = P_HIKEY;
	if (!newitemonleft && newitemoff == firstright)
	{
		/* incoming tuple will become first on right page */
		itemsz 	= newitemsz;
		item 	= newitem;
	}
	else
	{
		/* existing item at firstright will become first on right page */
		itemid = PageGetItemId(origpage, firstright);
		itemsz = ItemIdGetLength(itemid);
		item   = (IndexTuple) PageGetItem(origpage, itemid);
	}
	if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
					false, false) == InvalidOffsetNumber)
	{
		memset(rightpage, 0, BufferGetPageSize(rbuf));
		elog(ERROR, "failed to add hikey to the left sibling"
			 " while splitting block %u of index \"%s\"",
			 origpagenumber, RelationGetRelationName(rel));
	}
	leftoff = OffsetNumberNext(leftoff);

	/*
	 * Now transfer all the data items to the appropriate page.
	 *
	 * Note: we *must* insert at least the right page's items in item-number
	 * order, for the benefit of _bt_restore_page().
	 * This is the actual split: visit every item of origpage and insert each one
	 * into leftpage or rightpage as appropriate.
	 *
	 *
	 */
	maxoff = PageGetMaxOffsetNumber(origpage);

	for (i = P_FIRSTDATAKEY(oopaque); i <= maxoff; i = OffsetNumberNext(i))
	{
		itemid 	= PageGetItemId(origpage, i);
		itemsz 	= ItemIdGetLength(itemid);
		item 	= (IndexTuple) PageGetItem(origpage, itemid);

		/* does new item belong before this one? */
		if (i == newitemoff)
		{
			if (newitemonleft)
			{
				if (!_bt_pgaddtup(leftpage, newitemsz, newitem, leftoff))
				{
					memset(rightpage, 0, BufferGetPageSize(rbuf));
					elog(ERROR, "failed to add new item to the left sibling"
						 " while splitting block %u of index \"%s\"",
						 origpagenumber, RelationGetRelationName(rel));
				}
				leftoff = OffsetNumberNext(leftoff);
			}
			else
			{
				if (!_bt_pgaddtup(rightpage, newitemsz, newitem, rightoff))
				{
					memset(rightpage, 0, BufferGetPageSize(rbuf));
					elog(ERROR, "failed to add new item to the right sibling"
						 " while splitting block %u of index \"%s\"",
						 origpagenumber, RelationGetRelationName(rel));
				}
				rightoff = OffsetNumberNext(rightoff);
			}
		}
        /*
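         * Note a subtle point in the code above:
         * newitem was not inserted before the split, so it does not belong to any node yet.
         * newitemoff tells us where newitem would go in origpage if there were enough room,
         * so when i == newitemoff we know it is time to "migrate" newitem.
         *
         * After newitem has been "migrated" there is no continue, and there must not be one,
         * because the item originally at position i in origpage still has to be migrated.
         *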
         */

		/* decide which page to put it on */
		if (i < firstright)
		{
			if (!_bt_pgaddtup(leftpage, itemsz, item, leftoff))
			{
				memset(rightpage, 0, BufferGetPageSize(rbuf));
				elog(ERROR, "failed to add old item to the left sibling"
					 " while splitting block %u of index \"%s\"",
					 origpagenumber, RelationGetRelationName(rel));
			}
			leftoff = OffsetNumberNext(leftoff);
		}
		else
		{
			if (!_bt_pgaddtup(rightpage, itemsz, item, rightoff))
			{
				memset(rightpage, 0, BufferGetPageSize(rbuf));
				elog(ERROR, "failed to add old item to the right sibling"
					 " while splitting block %u of index \"%s\"",
					 origpagenumber, RelationGetRelationName(rel));
			}
			rightoff = OffsetNumberNext(rightoff);
		}
	}

	/* cope with possibility that newitem goes at the end */
	if (i <= newitemoff)
	{
		/*
		 * Can't have newitemonleft here; that would imply we were told to put
		 * *everything* on the left page, which cannot fit (if it could, we'd
		 * not be splitting the page).
		 *
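		 * This case occurs when newitem is larger than every item in origpage.
		 * newitem then goes at the right end of rightpage.
		 *
		 * The English comment above says that newitemonleft cannot be set here:
		 * newitem cannot go on leftpage, because if that were possible no split would be needed.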
		 */
		Assert(!newitemonleft);
		if (!_bt_pgaddtup(rightpage, newitemsz, newitem, rightoff))
		{
			memset(rightpage, 0, BufferGetPageSize(rbuf));
			elog(ERROR, "failed to add new item to the right sibling"
				 " while splitting block %u of index \"%s\"",
				 origpagenumber, RelationGetRelationName(rel));
		}
		rightoff = OffsetNumberNext(rightoff);
	}

	/*
	 * We have to grab the right sibling (if any) and fix the prev pointer
	 * there. We are guaranteed that this is deadlock-free since no other
	 * writer will be holding a lock on that page and trying to move left, and
	 * all readers release locks on a page before trying to fetch its
	 * neighbors.
	 *
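	 * We now need to lock the right sibling of origpage while already holding the
	 * lock on origpage, so a deadlock is conceivable. The reason: to improve
	 * concurrency, postgresql does not lock the whole index during a split, so other
	 * processes can still query the index. Suppose this statement runs during the split:
	 * select * from table where id < 10000 order by id desc;
	 * It asks for descending id order, so the scan locates the item with value 10000
	 * and then walks backward. Scanning the items of a page requires locking that page;
	 * when a page is finished, the scan "switches" to its left sibling, locks it and continues.
	 * If the "switch" locked the left sibling first and only then unlocked the current
	 * page, it could deadlock with the split. So the switch must follow the rule above:
	 * all readers release locks on a page before trying to fetch its
	 * neighbors, that is, unlock the current page first and then lock the neighbor.
	 *
	 * Without this rule a deadlock would be possible even without a split, for example
	 * when two processes run the following two statements:
	 * select * from table where id < 10000 order by id desc;
	 * select * from table where id < 10000 order by id asc;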
	 */

	if (!P_RIGHTMOST(oopaque))
	{
		sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);
		spage = BufferGetPage(sbuf);
		sopaque = (BTPageOpaque) PageGetSpecialPointer(spage);
		if (sopaque->btpo_prev != origpagenumber)
		{
			memset(rightpage, 0, BufferGetPageSize(rbuf));
			elog(ERROR, "right sibling's left-link doesn't match: "
			   "block %u links to %u instead of expected %u in index \"%s\"",
				 oopaque->btpo_next, sopaque->btpo_prev, origpagenumber,
				 RelationGetRelationName(rel));
		}

		/*
		 * Check to see if we can set the SPLIT_END flag in the right-hand
		 * split page; this can save some I/O for vacuum since it need not
		 * proceed to the right sibling.  We can set the flag if the right
		 * sibling has a different cycleid: that means it could not be part of
		 * a group of pages that were all split off from the same ancestor
		 * page.  If you're confused, imagine that page A splits to A B and
		 * then again, yielding A C B, while vacuum is in progress.  Tuples
		 * originally in A could now be in either B or C, hence vacuum must
		 * examine both pages.  But if D, our right sibling, has a different
		 * cycleid then it could not contain any tuples that were in A when
		 * the vacuum started.
		 *
		 * (This appears to be an optimization for vacuum; it is not analyzed further here.)
		 */
		if (sopaque->btpo_cycleid != ropaque->btpo_cycleid)
			ropaque->btpo_flags |= BTP_SPLIT_END;
	}

	/*
	 * Right sibling is locked, new siblings are prepared, but original page
	 * is not updated yet.
	 *
	 * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
	 * not starting the critical section till here because we haven't been
	 * scribbling on the original page yet; see comments above.
	 */
	START_CRIT_SECTION();

	/*
	 * Purpose: copy leftpage back into origpage; leftpage is freed once the copy is done
	 * 
	 * By here, the original data page has been split into two new halves, and
	 * these are correct.  The algorithm requires that the left page never
	 * move during a split, so we copy the new left page back on top of the
	 * original.  Note that this is not a waste of time, since we also require
	 * (in the page management code) that the center of a page always be
	 * clean, and the most efficient way to guarantee this is just to compact
	 * the data by reinserting it into a new left page.  (XXX the latter
	 * comment is probably obsolete; but in any case it's good to not scribble
	 * on the original page until we enter the critical section.)
	 *
	 *
	 * We need to do this before writing the WAL record, so that XLogInsert
	 * can WAL log an image of the page if necessary.
	 *
	 */
	PageRestoreTempPage(leftpage, origpage);
	/* 
	 * leftpage, lopaque must not be used below here 
	 * PageRestoreTempPage frees the temporary space behind leftpage, so leftpage
	 * and lopaque can no longer be used
	 */

	MarkBufferDirty(buf);
	MarkBufferDirty(rbuf);

    /* link rightpage into the sibling list */
	if (!P_RIGHTMOST(ropaque))
	{
		sopaque->btpo_prev = rightpagenumber;
		MarkBufferDirty(sbuf);
	}

	/*
	 * Clear INCOMPLETE_SPLIT flag on child if inserting the new item finishes
	 * a split.
	 */
	if (!isleaf)
	{
		Page		cpage = BufferGetPage(cbuf);
		BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);

		cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
		MarkBufferDirty(cbuf);
	}

	/* XLOG stuff */
	if (RelationNeedsWAL(rel))
	{
		xl_btree_split 	xlrec;
		uint8			xlinfo;
		XLogRecPtr		recptr;

		xlrec.level 		= ropaque->btpo.level;	// level of the split node in the tree
		xlrec.firstright 	= firstright;			// split position
		xlrec.newitemoff 	= newitemoff;			// insert position of newitem

		XLogBeginInsert();
        
        /* register the split information: xlrec */
		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);

        /* register buf, rbuf, sbuf, cbuf */
		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);	
		XLogRegisterBuffer(1, rbuf, REGBUF_WILL_INIT);
		/* 
		 * Log the right sibling, because we've changed its prev-pointer. 
		 */
		if (!P_RIGHTMOST(ropaque))
			XLogRegisterBuffer(2, sbuf, REGBUF_STANDARD);
		if (BufferIsValid(cbuf))
			XLogRegisterBuffer(3, cbuf, REGBUF_STANDARD);

		/*
		 * Log the new item, if it was inserted on the left page. (If it was
		 * put on the right page, we don't need to explicitly WAL log it
		 * because it's included with all the other items on the right page.)
		 * Show the new item as belonging to the left page buffer, so that it
		 * is not stored if XLogInsert decides it needs a full-page image of
		 * the left page.  We store the offset anyway, though, to support
		 * archive compression of these records.
		 */
		if (newitemonleft)
			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));

		/* Log left page */
		if (!isleaf)
		{
			/*
			 * We must also log the left page's high key, because the right
			 * page's leftmost key is suppressed on non-leaf levels.  Show it
			 * as belonging to the left page buffer, so that it is not stored
			 * if XLogInsert decides it needs a full-page image of the left
			 * page.
			 */
			itemid = PageGetItemId(origpage, P_HIKEY);
			item = (IndexTuple) PageGetItem(origpage, itemid);
			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
		}

		/*
		 * Log the contents of the right page in the format understood by
		 * _bt_restore_page(). We set lastrdata->buffer to InvalidBuffer,
		 * because we're going to recreate the whole page anyway, so it should
		 * never be stored by XLogInsert.
		 *
		 * Direct access to page is not good but faster - we should implement
		 * some new func in page API.  Note we only store the tuples
		 * themselves, knowing that they were inserted in item-number order
		 * and so the item pointers can be reconstructed.  See comments for
		 * _bt_restore_page().
		 */
		XLogRegisterBufData
        (
        	1,
			(char *) rightpage + ((PageHeader) rightpage)->pd_upper,
			((PageHeader) rightpage)->pd_special - ((PageHeader) rightpage)->pd_upper
        );

		if (isroot)
			xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L_ROOT : XLOG_BTREE_SPLIT_R_ROOT;
		else
			xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;

		recptr = XLogInsert(RM_BTREE_ID, xlinfo);

		PageSetLSN(origpage, recptr);
		PageSetLSN(rightpage, recptr);
		if (!P_RIGHTMOST(ropaque))
		{
			PageSetLSN(spage, recptr);
		}
		if (!isleaf)
		{
			PageSetLSN(BufferGetPage(cbuf), recptr);
		}
	}

	END_CRIT_SECTION();

	/* 
	 * release the old right sibling 
	 */
	if (!P_RIGHTMOST(ropaque))
		_bt_relbuf(rel, sbuf);

	/* release the child */
	if (!isleaf)
		_bt_relbuf(rel, cbuf);

	/* split's done */
	return rbuf;
}

写父节点—_bt_insert_parent

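The last step of a B+ tree split is to combine the new node's min key and block number into an index tuple and insert it into the parent node. The insertion itself is simple. The hard part is working out which node is the parent. If the node that split is the root, it has no parent, so a new node must be allocated to serve as one. What if it is not the root? Locating the insert position is itself a top-down search, so if we record the path taken through the B+ tree during that search, the path leads us directly to the parent of the current node. The structure that records this path is BTStack, defined as follows: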

typedef struct BTStackData
{
	BlockNumber bts_blkno;
	OffsetNumber bts_offset;
	IndexTupleData bts_btentry;
	struct BTStackData *bts_parent;
} BTStackData;

typedef BTStackData *BTStack;

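bts_blkno

Block number of the current node.
bts_offset

Offset of the current index tuple within that node.
bts_btentry

The current index tuple, that is, the downlink that was followed.
bts_parent

Pointer to the parent node's stack entry.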

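Now suppose we insert element 2 into the B+ tree shown in Figure 7. Once the insert position has been located, BTStack should hold the values shown in Figure 9:

Figure 9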

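So if block1 splits, BTStack tells us directly where the index tuple for the new node, <min key, new blockno>, belongs: in block3, immediately after the downlink to block1 recorded at bts_offset. (The _bt_insert_parent code later in this section passes stack->bts_offset + 1 as the insert position.)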
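To make the role of each field concrete, here is a minimal sketch of a descent stack. ToyStack and the toy_* functions are hypothetical simplifications written for this article, not PostgreSQL code; in particular, the sketch stores only the child block number where the real bts_btentry stores a whole index tuple.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for BTStackData (not the real struct). */
typedef struct ToyStack
{
	unsigned int	blkno;		/* parent page visited at this level */
	unsigned short	offset;		/* offset of the downlink that was followed */
	unsigned int	child;		/* block number the downlink points at */
	struct ToyStack *parent;	/* entry for the level above, NULL at the root */
} ToyStack;

/* Record one step of the descent: on page blkno we followed item offset. */
static ToyStack *
toy_stack_push(ToyStack *parent, unsigned int blkno,
			   unsigned short offset, unsigned int child)
{
	ToyStack   *entry = malloc(sizeof(ToyStack));

	assert(entry != NULL);
	entry->blkno = blkno;
	entry->offset = offset;
	entry->child = child;
	entry->parent = parent;
	return entry;
}

/* Position for the new right sibling's downlink: just after the old one. */
static unsigned short
toy_new_downlink_offset(const ToyStack *stack)
{
	return (unsigned short) (stack->offset + 1);
}

/* Free the whole chain, starting from the lowest recorded level. */
static void
toy_stack_free(ToyStack *stack)
{
	while (stack != NULL)
	{
		ToyStack   *up = stack->parent;

		free(stack);
		stack = up;
	}
}
```

For the tree in Figure 7, and assuming the downlink to block1 sits at offset 1 of block3, the descent would record toy_stack_push(NULL, 3, 1, 1), and toy_new_downlink_offset would then place the new downlink at offset 2.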

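But is it really that simple? Can BTStack be trusted? No. There are two situations in which BTStack becomes unreliable.

Case 1

BTStack is built during _bt_search. Recall the basic insert flow from "B+ Tree Index: Insertion": _bt_search finds the leaf node into which the index tuple will be inserted, and then the following two lines trade the shared lock on that leaf for an exclusive lock.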

/* trade in our read lock for a write lock */
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockBuffer(buf, BT_WRITE);

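As noted there, in a multi-process environment, once a node is unlocked another process can take the exclusive lock on it and perform an insert. That insert may split the node, and the split writes to the parent. This is why, after taking the exclusive lock on the leaf, the code calls _bt_moveright to find the node where the index tuple really belongs. _bt_moveright only moves right along the leaf level. The recorded parent entries are not updated along with it, so after _bt_moveright the BTStack can no longer be trusted. Figure 10 shows the situation:

Figure 10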
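The effect of moving right can be sketched with a toy model. ToyLeaf and toy_moveright are hypothetical names invented for this illustration, and integer keys stand in for index tuples; the real _bt_moveright also handles locking, incomplete splits, and deleted pages, all of which are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy leaf page: only the fields needed to move right. */
typedef struct ToyLeaf
{
	int		high_key;	/* upper bound of the page; unused if rightmost */
	bool	rightmost;	/* true if the page has no right sibling */
	int		next;		/* index of the right sibling in the array */
} ToyLeaf;

/*
 * Follow right links while the key lies beyond the page's high key.
 * A key equal to the high key stays on the current page in this sketch.
 */
static int
toy_moveright(const ToyLeaf *leaves, int blkno, int key)
{
	assert(leaves != NULL && blkno >= 0);
	while (!leaves[blkno].rightmost && key > leaves[blkno].high_key)
		blkno = leaves[blkno].next;
	return blkno;
}
```

If the descent remembered leaf 0 and a concurrent split has since given leaf 0 a high key of 3 and a right sibling, then a key of 5 ends up on the sibling. The stack entry recorded during the descent still describes the downlink to leaf 0, which is exactly the staleness described above.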

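Case 2

As described in "B+ Tree Index: Insertion", if the index tuple being inserted is equal to the current node's high key, it may go either into the current node or into its right sibling. If inserting into the current node would cause a split, it goes into the right sibling instead. Since the target node itself has changed, the BTStack is naturally unreliable.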

_bt_getstackbuf

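Since BTStack cannot be trusted, we need a mechanism that checks it and repairs it when it is stale. That mechanism is _bt_getstackbuf, and the idea behind it is simple. The stack entry passed in describes the parent: bts_blkno is the parent page, and bts_btentry carries the child's blockno (_bt_insert_parent sets it to the block that just split). _bt_getstackbuf looks on the parent page for the item whose downlink points at that child. It starts at bts_offset and scans toward higher offsets first; if nothing matches, it scans from bts_offset toward lower offsets. If the page holds no match at all, it moves right to the next page and repeats, until it finds the parent page and the item that point at the child, at which point it updates bts_blkno and bts_offset to the location found. If it reaches the right-most page without a match, it returns InvalidBuffer.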

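Note

Why scan from bts_offset toward higher offsets first and then toward lower ones? Only splits and merges can make BTStack stale, and both are relatively rare, so the item at bts_offset usually still points at the child. Even when it does not, the right item should be nearby. And because splits are more likely than merges, scanning toward higher offsets first and lower offsets second is the sensible order.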
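The search order can be sketched on a toy model before reading the real code. Everything below is a hypothetical simplification: ToyPage holds only an array of child block numbers, offsets are zero-based, and there is no locking or high key. It illustrates the "right first, then left, then next page" order, not the actual PostgreSQL implementation.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NO_PAGE   (-1)
#define TOY_MAX_ITEMS 8

/* Hypothetical toy internal page: just a list of downlinks. */
typedef struct ToyPage
{
	int		nitems;					/* number of downlinks in use */
	int		child[TOY_MAX_ITEMS];	/* child block numbers, zero-based */
	int		next;					/* right sibling index, or TOY_NO_PAGE */
} ToyPage;

/* Remembered location of a downlink: parent page and offset within it. */
typedef struct ToyLoc
{
	int		blkno;
	int		offset;
} ToyLoc;

/*
 * Re-find the downlink to 'child', starting at the remembered location.
 * Returns 1 and updates *loc if found; returns 0 (leaving *loc unchanged)
 * once the rightmost page has been searched without a match.
 */
static int
toy_refind_parent(const ToyPage *pages, ToyLoc *loc, int child)
{
	int		blkno = loc->blkno;
	int		start = loc->offset;

	assert(pages != NULL);
	for (;;)
	{
		const ToyPage *page = &pages[blkno];
		int		off;

		/* Clamp a stale starting offset into the page's current range. */
		if (start < 0)
			start = 0;
		if (start > page->nitems)
			start = page->nitems;

		/* Scan toward higher offsets first ... */
		for (off = start; off < page->nitems; off++)
		{
			if (page->child[off] == child)
			{
				loc->blkno = blkno;
				loc->offset = off;
				return 1;
			}
		}

		/* ... then toward lower offsets. */
		for (off = start - 1; off >= 0; off--)
		{
			if (page->child[off] == child)
			{
				loc->blkno = blkno;
				loc->offset = off;
				return 1;
			}
		}

		/* Not on this page: the downlink can only have moved right. */
		if (page->next == TOY_NO_PAGE)
			return 0;
		blkno = page->next;
		start = 0;
	}
}
```

With two sibling pages holding children {10, 11} and {12, 13}, a search for child 13 that starts on the first page at offset 1 checks 11, then 10, then moves right and finds 13 on the second page.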

The implementation of _bt_getstackbuf is as follows:

Buffer
_bt_getstackbuf(Relation rel, BTStack stack, int access)
{
	BlockNumber blkno;
	OffsetNumber start;

	blkno = stack->bts_blkno;
	start = stack->bts_offset;

	for (;;)
	{
		Buffer		buf;
		Page		page;
		BTPageOpaque opaque;

		buf = _bt_getbuf(rel, blkno, access);
		page = BufferGetPage(buf);
		opaque = (BTPageOpaque) PageGetSpecialPointer(page);

		if (access == BT_WRITE && P_INCOMPLETE_SPLIT(opaque))
		{
			_bt_finish_split(rel, buf, stack->bts_parent);
			continue;
		}

		if (!P_IGNORE(opaque))
		{
			OffsetNumber offnum,
						minoff,
						maxoff;
			ItemId		itemid;
			IndexTuple	item;

			minoff = P_FIRSTDATAKEY(opaque);
			maxoff = PageGetMaxOffsetNumber(page);

			/*
			 * start = InvalidOffsetNumber means "search the whole page". We
			 * need this test anyway due to possibility that page has a high
			 * key now when it didn't before.
			 */
			if (start < minoff)
				start = minoff;

			/*
			 * Need this check too, to guard against possibility that page
			 * split since we visited it originally.
			 */
			if (start > maxoff)
				start = OffsetNumberNext(maxoff);

			/*
			 * These loops will check every item on the page --- but in an
			 * order that's attuned to the probability of where it actually
			 * is.  Scan to the right first, then to the left.
			 *
			 * Scan toward higher offsets first.
			 */
			for (offnum = start;
				 offnum <= maxoff;
				 offnum = OffsetNumberNext(offnum))
			{
				itemid = PageGetItemId(page, offnum);
				item = (IndexTuple) PageGetItem(page, itemid);
				if (BTEntrySame(item, &stack->bts_btentry))
				{
					/* Return accurate pointer to where link is now */
					stack->bts_blkno = blkno;
					stack->bts_offset = offnum;
					return buf;
				}
			}

			/* Then scan toward lower offsets. */
			for (offnum = OffsetNumberPrev(start);
				 offnum >= minoff;
				 offnum = OffsetNumberPrev(offnum))
			{
				itemid = PageGetItemId(page, offnum);
				item = (IndexTuple) PageGetItem(page, itemid);
				if (BTEntrySame(item, &stack->bts_btentry))
				{
					/* Return accurate pointer to where link is now */
					stack->bts_blkno = blkno;
					stack->bts_offset = offnum;
					return buf;
				}
			}
		}

		/*
		 * The item we're looking for moved right at least one page.
		 */
		if (P_RIGHTMOST(opaque))
		{
			_bt_relbuf(rel, buf);
			return InvalidBuffer;
		}
		blkno = opaque->btpo_next;
		start = InvalidOffsetNumber;
		_bt_relbuf(rel, buf);
	}
}

The implementation of _bt_insert_parent is as follows:

static void
_bt_insert_parent(Relation rel,
				  Buffer buf,
				  Buffer rbuf,
				  BTStack stack,
				  bool is_root,
				  bool is_only)
{
	/*
	 * Here we have to do something Lehman and Yao don't talk about: deal with
	 * a root split and construction of a new root.  If our stack is empty
	 * then we have just split a node on what had been the root level when we
	 * descended the tree.  If it was still the root then we perform a
	 * new-root construction.  If it *wasn't* the root anymore, search to find
	 * the next higher level that someone constructed meanwhile, and find the
	 * right place to insert as for the normal case.
	 *
	 * If we have to search for the parent level, we do so by re-descending
	 * from the root.  This is not super-efficient, but it's rare enough not
	 * to matter.
	 *
	 * Case: the page that split is the root.
	 */
	if (is_root)
	{
		Buffer		rootbuf;

		Assert(stack == NULL);
		Assert(is_only);
		/* create a new root node and update the metapage */
		rootbuf = _bt_newroot(rel, buf, rbuf);
		/* release the split buffers */
		_bt_relbuf(rel, rootbuf);
		_bt_relbuf(rel, rbuf);
		_bt_relbuf(rel, buf);
	}
	else
	{
		/* The page that split is not the root. */
		BlockNumber bknum = BufferGetBlockNumber(buf);
		BlockNumber rbknum = BufferGetBlockNumber(rbuf);
		Page		page = BufferGetPage(buf);
		IndexTuple	new_item;
		BTStackData fakestack;
		IndexTuple	ritem;
		Buffer		pbuf;

		if (stack == NULL)
		{
			BTPageOpaque lpageop;

			elog(DEBUG2, "concurrent ROOT page split");
			lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
			/* Find the leftmost page at the next level up */
			pbuf = _bt_get_endpoint(rel, lpageop->btpo.level + 1, false,
									NULL);
			/* Set up a phony stack entry pointing there */
			stack = &fakestack;
			stack->bts_blkno = BufferGetBlockNumber(pbuf);
			stack->bts_offset = InvalidOffsetNumber;
			/* bts_btentry will be initialized below */
			stack->bts_parent = NULL;
			_bt_relbuf(rel, pbuf);
		}

		/* get high key from left page == lowest key on new right page */
		ritem = (IndexTuple) PageGetItem(page,
										 PageGetItemId(page, P_HIKEY));

		/* form an index tuple that points at the new right page */
		new_item = CopyIndexTuple(ritem);
		ItemPointerSet(&(new_item->t_tid), rbknum, P_HIKEY);

		/*
		 * Find the parent buffer and get the parent page.
		 *
		 * Oops - if we were moved right then we need to change stack item! We
		 * want to find parent pointing to where we are, right ?	- vadim
		 * 05/27/97
		 */
		ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
		pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);

		/*
		 * Now we can unlock the right child. The left child will be unlocked
		 * by _bt_insertonpg().
		 */
		_bt_relbuf(rel, rbuf);

		/* Check for error only after writing children */
		if (pbuf == InvalidBuffer)
			elog(ERROR, "failed to re-find parent key in index \"%s\" for split pages %u/%u",
				 RelationGetRelationName(rel), bknum, rbknum);

		/* Recursively update the parent */
		_bt_insertonpg(rel, pbuf, buf, stack->bts_parent,
					   new_item, stack->bts_offset + 1,
					   is_only);

		/* be tidy */
		pfree(new_item);
	}
}

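Note the if (stack == NULL) branch in the code above. How can stack be NULL for a page that is not the root? Suppose process 1 wants to insert 2 into the tree in Figure 1. It first calls _bt_search to locate the target node; because the leaf in Figure 1 is also the root, stack is NULL. Process 1 then tries to lock block1, but another process is in the middle of an insert on block1, so process 1 has to wait. When the other process unlocks block1, process 1 acquires the lock, but by then the B+ tree looks like Figure 11:

Figure 11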

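At this point 2 still belongs in block1, and inserting it splits block1. When _bt_insert_parent runs after the split, it finds that stack is empty. What then? Following the _bt_getstackbuf approach described above, we need to walk every item on Level2 and find the one whose child blockno is block1. So PostgreSQL builds a fakestack that points at the first item of the leftmost node on Level2, and then calls _bt_getstackbuf to find the real parent.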

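Addendum

Look at the statement stack->bts_offset = InvalidOffsetNumber; in the stack == NULL branch. InvalidOffsetNumber stands for "the first item on the node" here, because the code does not know whether that node is the right-most node. A right-most node has no high key, so its first key is the first item; any other node has a high key, so its first key is the second item. The if (start < minoff) check in _bt_getstackbuf above is what handles a bts_offset of InvalidOffsetNumber.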
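The clamping logic is small enough to isolate in a sketch. The toy_* names below are hypothetical; the constants mirror PostgreSQL's convention that offsets are 1-based, that InvalidOffsetNumber is 0, and that the high key, when present, occupies offset 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Offsets are 1-based; 0 plays the role of InvalidOffsetNumber. */
enum
{
	TOY_INVALID_OFFSET = 0,
	TOY_HIKEY = 1,			/* slot of the high key on a non-rightmost page */
	TOY_FIRSTKEY = 2		/* first data item when a high key is present */
};

/* First data item: offset 1 on a rightmost page, offset 2 otherwise. */
static int
toy_first_data_key(bool rightmost)
{
	return rightmost ? TOY_HIKEY : TOY_FIRSTKEY;
}

/*
 * Bring a remembered starting offset into the page's current range,
 * following the two checks at the top of the scan in _bt_getstackbuf.
 */
static int
toy_clamp_start(int start, bool rightmost, int maxoff)
{
	int		minoff = toy_first_data_key(rightmost);

	assert(maxoff >= 0);
	if (start < minoff)
		start = minoff;			/* covers TOY_INVALID_OFFSET */
	if (start > maxoff)
		start = maxoff + 1;		/* page shrank since we visited it */
	return start;
}
```

Passing TOY_INVALID_OFFSET therefore means "start at the first data item" on either kind of page, without the caller needing to know whether a high key is present.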

Another key aspect of _bt_insert_parent is concurrency control, which is covered in a separate, later article.