postgres source code analysis 45: the btree split flow, _bt_split

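Introduction to B+ trees

A B+ tree is a balanced multiway search tree with the following properties:

  1. In a B+ tree of order m, each node holds at most m-1 elements, and every node except the root holds at least ceil(m/2)-1 elements. For example, in an order-5 B+ tree each node holds at most 4 elements and every non-root node holds at least 2;
  2. Internal nodes store only index keys, not data; all data lives in the leaf nodes. This maximizes the number of keys per internal node and so keeps the tree short;
  3. Keys are kept sorted, the leaf nodes are ordered relative to one another, and every lookup follows a root-to-leaf path of the same length;
  4. Inserts and updates run in stable logarithmic time. The leaves hold every key that appears in their parent nodes, every lookup has to descend to a leaf, and insertion always proceeds bottom-up;
  5. Adjacent leaf nodes are chained together through right pointers, which helps range scans.

When an insert breaks these structural rules, the B+ tree restores balance by splitting a page. This article walks through the postgres source to see how a btree page split works in PG. The btree in PG differs in some respects from the common textbook btree; for background see these earlier posts:
    postgres source code analysis 41: creating the btree index file (1)
    Postgresql source (30): Postgresql index basics, B-linked-tree (cited)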
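The occupancy bounds in property 1 can be written down directly. The following is a minimal sketch of that arithmetic only; max_keys, min_keys and needs_split are hypothetical helper names and have nothing to do with the PostgreSQL source, where a page splits when it runs out of bytes rather than when it exceeds a key count.

```c
#include <assert.h>

/* Maximum number of keys in a node of an order-m tree: m - 1. */
static int max_keys(int m)
{
    assert(m >= 3);
    return m - 1;
}

/* Minimum number of keys in a non-root node: ceil(m / 2) - 1. */
static int min_keys(int m)
{
    assert(m >= 3);
    return (m + 1) / 2 - 1;   /* (m + 1) / 2 equals ceil(m / 2) for positive ints */
}

/* A node has to split when an insert would push it past max_keys(m). */
static int needs_split(int nkeys_after_insert, int m)
{
    return nkeys_after_insert > max_keys(m);
}
```

For m = 5 this gives the 4 and 2 quoted in the list above.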
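Key data structures

1 FindSplitData
This struct records the state of a split in progress: the free space available on the left and right pages, the maximum number of candidate split points, and the number of candidates recorded so far.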

typedef struct
{
	/* context data for _bt_recsplitloc */
	Relation	rel;			/* index relation */
	Page		origpage;		/* page undergoing split */
	IndexTuple	newitem;		/* new item (cause of page split) */
	Size		newitemsz;		/* size of newitem (includes line pointer) */
	bool		is_leaf;		/* T if splitting a leaf page */
	bool		is_rightmost;	/* T if splitting rightmost page on level */
	OffsetNumber newitemoff;	/* where the new item is to be inserted */
	int			leftspace;		/* space available for items on left page */
	int			rightspace;		/* space available for items on right page */
	int			olddataitemstotal;	/* space taken by old items */
	Size		minfirstrightsz;	/* smallest firstright size */

	/* candidate split point data */
	int			maxsplits;		/* maximum number of splits */
	int			nsplits;		/* current number of splits */
	SplitPoint *splits;			/* all candidate split points for page */
	int			interval;		/* current range of acceptable split points */
} FindSplitData;

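2 SplitPoint
This struct records the details of one candidate split point: the free space the left and right pages would have if the page were split there (this is the basis for choosing the best split point), and whether the newly inserted tuple would land on the left page.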

typedef struct
{
	/* details of free space left by split */
	int16		curdelta;		/* current leftfree/rightfree delta */
	int16		leftfree;		/* space left on left page post-split */
	int16		rightfree;		/* space left on right page post-split */

	/* split point identifying fields (returned by _bt_findsplitloc) */
	OffsetNumber firstrightoff; /* first origpage item on rightpage */
	bool		newitemonleft;	/* new item goes on left, or right? */
} SplitPoint;

Split strategy
typedef enum
{
	/* strategy for searching through materialized list of split points */
	SPLIT_DEFAULT,				/* give some weight to truncation */
	SPLIT_MANY_DUPLICATES,		/* find minimally distinguishing point */
	SPLIT_SINGLE_VALUE			/* leave left page almost full */
} FindSplitStrat;
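To make the role of leftfree and rightfree concrete, here is a minimal sketch that picks the most balanced candidate from a list. ToySplit, toy_delta and toy_best_split are hypothetical names, and the selection rule is a deliberate simplification: the real _bt_findsplitloc also weighs fill factor and applies the strategies in FindSplitStrat, none of which is modelled here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for SplitPoint (not the PostgreSQL type). */
typedef struct
{
    int leftfree;       /* free bytes on the left page after the split */
    int rightfree;      /* free bytes on the right page after the split */
    int firstrightoff;  /* first original item that goes to the right page */
} ToySplit;

/* Imbalance of a candidate: |leftfree - rightfree|. */
static int toy_delta(const ToySplit *s)
{
    return abs(s->leftfree - s->rightfree);
}

/* Return the index of the most balanced candidate; ties keep the earliest. */
static int toy_best_split(const ToySplit *splits, int nsplits)
{
    int best = 0;

    assert(nsplits > 0);
    for (int i = 1; i < nsplits; i++)
    {
        if (toy_delta(&splits[i]) < toy_delta(&splits[best]))
            best = i;
    }
    return best;
}
```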
_bt_split

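Taking the figure below as an example, the flow is as follows. The red item is the high key, and the page to be split has already been locked exclusively by the calling insert function, _bt_doinsert.
[Figure]
1 Call _bt_findsplitloc to choose the page's split point;
[Figure]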
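2 Prepare the temporary left page
1) Allocate an index page, leftpage, in local memory and initialize its PageHeader fields;
2) Copy the flags of the page being split, opage, to leftpage; clear the BTP_ROOT/BTP_SPLIT_END/BTP_HAS_GARBAGE bits and set BTP_INCOMPLETE_SPLIT (the split is not complete yet). Also copy opage's level btpo_level and left-sibling link btpo_prev into the corresponding leftpage fields;
3) Copy opage's LSN to leftpage, since XLogInsert may examine it;
4) Determine the high key for leftpage (the first item on the right page becomes the left page's high key). There are two cases for the right page's first item:
 (1) the new item goes on the right page and its insert offset equals the split point: the new item is the right page's first item, and hence the left page's high key;
 (2) otherwise: the existing item at the split point is the right page's first item, and hence the left page's high key;
5) Apply suffix truncation to the left page's high key (leaf pages only; an internal page uses the right page's first item as is)
 (1) first identify the last item on the left page, which determines how many attributes of the right page's first item must be kept in the left page's new high key;
 (2) call _bt_truncate to perform the suffix truncation;
6) Insert the high key into the left page.
[Figure]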
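The choice of the right page's first item and the left page's last item in step 2 can be sketched with plain integers standing in for index tuples. This is a toy model under stated assumptions: offsets are 0-based here (PostgreSQL uses 1-based OffsetNumbers), and toy_firstright and toy_lastleft are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model: items[] holds the keys of the page being split.
 * newitemoff is the slot the new key would occupy among the old keys;
 * firstrightoff is the split point.
 */
static int toy_firstright(const int *items, int newitem, int newitemoff,
                          int firstrightoff, bool newitemonleft)
{
    if (!newitemonleft && newitemoff == firstrightoff)
        return newitem;             /* incoming key becomes firstright */
    return items[firstrightoff];    /* existing key at the split point */
}

static int toy_lastleft(const int *items, int newitem, int newitemoff,
                        int firstrightoff, bool newitemonleft)
{
    assert(firstrightoff > 0);
    if (newitemonleft && newitemoff == firstrightoff)
        return newitem;                 /* incoming key ends the left page */
    return items[firstrightoff - 1];    /* existing key before the split point */
}
```

With keys {10, 20, 30, 40}, a split point of 2 and a new key 25 at offset 2, the new key is the right page's first item when it goes right, and the left page's last item when it goes left. On a leaf page, those two items are the inputs that decide how much of the high key survives truncation.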
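3 Acquire a buffer for the right page (a new page is obtained by extending the index file through ReadBufferExtended)

  1. Acquire the right page's buffer under a write lock and obtain the right page, rightpage;
  2. Point the temporary left page's btpo_next at the right page's block number, rightpagenumber;
  3. Point the right page's btpo_prev at the block number of the page being split, origpagenumber (the left page is eventually written back over the original page);
  4. Point the right page's btpo_next at whatever the original page's btpo_next pointed to;
  5. Obtain the vacuum cycle ID and assign it to both the left page and the right page;
  6. If the original page is not the rightmost page on its level, set rightpage's high key to the original page's high key.
[Figure]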
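The sibling-link updates in step 3, together with the later fix-up of the old right sibling's back link, amount to splicing a node into a doubly linked list. Below is a minimal sketch with block numbers as array indexes; ToyLinks, TOY_NONE and toy_link_right are hypothetical, and locking, WAL and the temporary-page indirection are all left out.

```c
#include <assert.h>

#define TOY_NONE (-1)   /* hypothetical "no sibling" marker */

typedef struct
{
    int prev;   /* block number of the left sibling, or TOY_NONE */
    int next;   /* block number of the right sibling, or TOY_NONE */
} ToyLinks;

/*
 * Link a freshly allocated right page into the sibling chain.
 * pages[] is indexed by block number; orig is the page being split,
 * right is the new page.
 */
static void toy_link_right(ToyLinks *pages, int orig, int right)
{
    int oldnext = pages[orig].next;

    assert(orig != right);
    pages[right].prev = orig;       /* right page points back at the original */
    pages[right].next = oldnext;    /* and inherits the original's right link */
    pages[orig].next = right;       /* left page now points at the right page */
    if (oldnext != TOY_NONE)
        pages[oldnext].prev = right;    /* old right sibling's back link */
}
```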

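4 Distribute and fill the data
1) Walk every index tuple on the old page and decide, from its offset relative to the split point, whether it belongs on the temporary left page or on rightpage;
2) Call _bt_pgaddtup to add it to that page at the corresponding offset;
[Figure]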
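The distribution loop in step 4 can be sketched as follows, again with integers in place of index tuples and 0-based offsets. toy_distribute is a hypothetical function; it only mirrors the shape of the loop (place the new item just before the old item at its insert offset, then route each old item by comparing its offset with the split point) and ignores high keys and page space.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Copy n old keys plus one new key into left[] and right[].
 * Returns the number of keys placed on the left; *nright gets the rest.
 * The caller must size left[] and right[] for n + 1 keys.
 */
static int toy_distribute(const int *items, int n, int newitem,
                          int newitemoff, int firstrightoff,
                          bool newitemonleft,
                          int *left, int *right, int *nright)
{
    int nl = 0;
    int nr = 0;

    assert(newitemoff >= 0 && newitemoff <= n);
    for (int i = 0; i < n; i++)
    {
        /* does the new key belong just before this old key? */
        if (i == newitemoff)
        {
            if (newitemonleft)
                left[nl++] = newitem;
            else
                right[nr++] = newitem;
        }
        /* route the old key by its offset relative to the split point */
        if (i < firstrightoff)
            left[nl++] = items[i];
        else
            right[nr++] = items[i];
    }
    /* new key sorts after every old key: it goes at the end of the right page */
    if (newitemoff == n)
    {
        assert(!newitemonleft);
        right[nr++] = newitem;
    }
    *nright = nr;
    return nl;
}
```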
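5 If opage has a right sibling, spage, take a write lock on it so that its left-sibling link can be updated, and set BTP_SPLIT_END on rightpage when spage carries a different vacuum cycle ID;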
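The flag handling in steps 2 and 5 is ordinary bit masking. The sketch below uses TOY_ prefixed constants whose names echo the BTP_ flags, but the bit values are illustrative assumptions, as are the two helper functions; the rule that the right page starts from the same masked flags as the left page is also an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag bits; values are assumptions, not taken from nbtree.h. */
#define TOY_LEAF             (1 << 0)
#define TOY_ROOT             (1 << 1)
#define TOY_SPLIT_END        (1 << 5)
#define TOY_HAS_GARBAGE      (1 << 6)
#define TOY_INCOMPLETE_SPLIT (1 << 7)

/* Flags for the temporary left page, derived from the original page. */
static uint16_t toy_left_flags(uint16_t origflags)
{
    uint16_t f = origflags;

    f &= (uint16_t) ~(TOY_ROOT | TOY_SPLIT_END | TOY_HAS_GARBAGE);
    f |= TOY_INCOMPLETE_SPLIT;      /* right page has no downlink yet */
    return f;
}

/* Flags for the right page; SPLIT_END only if the sibling's cycle ID differs. */
static uint16_t toy_right_flags(uint16_t origflags, int has_right_sibling,
                                uint16_t sibling_cycleid, uint16_t my_cycleid)
{
    uint16_t f = origflags;

    f &= (uint16_t) ~(TOY_ROOT | TOY_SPLIT_END | TOY_HAS_GARBAGE);
    if (has_right_sibling && sibling_cycleid != my_cycleid)
        f |= TOY_SPLIT_END;
    return f;
}
```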

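6 Enter the critical section and apply the changes
1) First copy the temporary left page's contents over the original page opage and free the memory held by the temporary left page;
2) Mark the buffers holding opage, rightpage and spage dirty;
3) If the page being split is not a leaf, clear the BTP_INCOMPLETE_SPLIT flag on cpage, the page in cbuf (the child whose own split this insert completes), and mark that buffer dirty;
4) Build the XLOG record for the split; the key information includes rightpage's level, the split point, and the opage/rightpage/spage/cpage data described above;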

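7 Cleanup
1) If the page is not the rightmost on its level, release the write lock and pin on sbuf;
2) If the page is not a leaf, release the write lock and pin on cbuf;
3) If the page is a leaf, free the lefthighkey memory allocated during the split;
4) Finally, return the right page's buffer.
Note that the write locks and pins on opage and rightpage are still held at this point.
[Figure]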

Overall flow diagram

_bt_insert_parent

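The flow above leaves rightpage without a link from the parent page, and none of the locks taken so far has been released. Both are handled by _bt_insert_parent, which works as follows:
[Figure]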

/*
 * _bt_insert_parent() -- Insert downlink into parent, completing split.
 *
 * On entry, buf and rbuf are the left and right split pages, which we
 * still hold write locks on.  Both locks will be released here.  We
 * release the rbuf lock once we have a write lock on the page that we
 * intend to insert a downlink to rbuf on (i.e. buf's current parent page).
 * The lock on buf is released at the same point as the lock on the parent
 * page, since buf's INCOMPLETE_SPLIT flag must be cleared by the same
 * atomic operation that completes the split by inserting a new downlink.
 *
 * stack - stack showing how we got here.  Will be NULL when splitting true
 *			root, or during concurrent root split, where we can be inefficient
 * isroot - we split the true root
 * isonly - we split a page alone on its level (might have been fast root)
 */
static void
_bt_insert_parent(Relation rel,
				  Buffer buf,
				  Buffer rbuf,
				  BTStack stack,
				  bool isroot,
				  bool isonly)
{
	/*
	 * Here we have to do something Lehman and Yao don't talk about: deal with
	 * a root split and construction of a new root.  If our stack is empty
	 * then we have just split a node on what had been the root level when we
	 * descended the tree.  If it was still the root then we perform a
	 * new-root construction.  If it *wasn't* the root anymore, search to find
	 * the next higher level that someone constructed meanwhile, and find the
	 * right place to insert as for the normal case.
	 *
	 * If we have to search for the parent level, we do so by re-descending
	 * from the root.  This is not super-efficient, but it's rare enough not
	 * to matter.
	 */
	if (isroot)
	{
		Buffer		rootbuf;

		Assert(stack == NULL);
		Assert(isonly);
		/* create a new root node and update the metapage */
		rootbuf = _bt_newroot(rel, buf, rbuf);
		/* release the split buffers */
		_bt_relbuf(rel, rootbuf);
		_bt_relbuf(rel, rbuf);
		_bt_relbuf(rel, buf);
	}
	else
	{
		BlockNumber bknum = BufferGetBlockNumber(buf);
		BlockNumber rbknum = BufferGetBlockNumber(rbuf);
		Page		page = BufferGetPage(buf);
		IndexTuple	new_item;
		BTStackData fakestack;
		IndexTuple	ritem;
		Buffer		pbuf;

		if (stack == NULL)
		{
			BTPageOpaque opaque;

			elog(DEBUG2, "concurrent ROOT page split");
			opaque = BTPageGetOpaque(page);

			/*
			 * We should never reach here when a leaf page split takes place
			 * despite the insert of newitem being able to apply the fastpath
			 * optimization.  Make sure of that with an assertion.
			 *
			 * This is more of a performance issue than a correctness issue.
			 * The fastpath won't have a descent stack.  Using a phony stack
			 * here works, but never rely on that.  The fastpath should be
			 * rejected within _bt_search_insert() when the rightmost leaf
			 * page will split, since it's faster to go through _bt_search()
			 * and get a stack in the usual way.
			 */
			Assert(!(P_ISLEAF(opaque) &&
					 BlockNumberIsValid(RelationGetTargetBlock(rel))));

			/* Find the leftmost page at the next level up */
			pbuf = _bt_get_endpoint(rel, opaque->btpo_level + 1, false, NULL);
			/* Set up a phony stack entry pointing there */
			stack = &fakestack;
			stack->bts_blkno = BufferGetBlockNumber(pbuf);
			stack->bts_offset = InvalidOffsetNumber;
			stack->bts_parent = NULL;
			_bt_relbuf(rel, pbuf);
		}

		/* get high key from left, a strict lower bound for new right page */
		ritem = (IndexTuple) PageGetItem(page,
										 PageGetItemId(page, P_HIKEY));

		/* form an index tuple that points at the new right page */
		new_item = CopyIndexTuple(ritem);
		BTreeTupleSetDownLink(new_item, rbknum);

		/*
		 * Re-find and write lock the parent of buf.
		 *
		 * It's possible that the location of buf's downlink has changed since
		 * our initial _bt_search() descent.  _bt_getstackbuf() will detect
		 * and recover from this, updating the stack, which ensures that the
		 * new downlink will be inserted at the correct offset. Even buf's
		 * parent may have changed.
		 */
		pbuf = _bt_getstackbuf(rel, stack, bknum);

		/*
		 * Unlock the right child.  The left child will be unlocked in
		 * _bt_insertonpg().
		 *
		 * Unlocking the right child must be delayed until here to ensure that
		 * no concurrent VACUUM operation can become confused.  Page deletion
		 * cannot be allowed to fail to re-find a downlink for the rbuf page.
		 * (Actually, this is just a vestige of how things used to work.  The
		 * page deletion code is expected to check for the INCOMPLETE_SPLIT
		 * flag on the left child.  It won't attempt deletion of the right
		 * child until the split is complete.  Despite all this, we opt to
		 * conservatively delay unlocking the right child until here.)
		 */
		_bt_relbuf(rel, rbuf);

		if (pbuf == InvalidBuffer)
			ereport(ERROR,
					(errcode(ERRCODE_INDEX_CORRUPTED),
					 errmsg_internal("failed to re-find parent key in index \"%s\" for split pages %u/%u",
									 RelationGetRelationName(rel), bknum, rbknum)));

		/* Recursively insert into the parent */
		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
					   new_item, MAXALIGN(IndexTupleSize(new_item)),
					   stack->bts_offset + 1, 0, isonly);

		/* be tidy */
		pfree(new_item);
	}
}

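_bt_split is itself invoked from the insert routine _bt_insertonpg, and the insert that links the new page into the parent after a split again goes through _bt_insertonpg. This recursion is what ensures that every split is carried through to completion.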
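The shape of that recursion can be shown with a toy model that tracks only how many entries sit on the page at each level along the insert path. ToyTree, TOY_CAPACITY and toy_insert are hypothetical; there are no keys, locks or buffers here, only the control flow: insert, split if full, then insert a downlink one level up, or build a new root if the split page was the root.

```c
#include <assert.h>

#define TOY_CAPACITY   4    /* hypothetical max entries per page */
#define TOY_MAXLEVELS  16

typedef struct
{
    int nlevels;                    /* current height; level 0 is the leaf level */
    int nitems[TOY_MAXLEVELS];      /* entries on the page along the insert path */
} ToyTree;

/* Insert one entry at the given level, splitting and recursing upward as needed. */
static void toy_insert(ToyTree *t, int level)
{
    assert(level >= 0 && level < t->nlevels);
    if (t->nitems[level] < TOY_CAPACITY)
    {
        t->nitems[level]++;         /* room on the page: plain insert */
        return;
    }

    /* Page is full: split CAPACITY + 1 entries, keep tracking the larger half. */
    t->nitems[level] = (TOY_CAPACITY + 2) / 2;

    if (level == t->nlevels - 1)
    {
        /* The root was split: build a new root holding two downlinks. */
        assert(t->nlevels < TOY_MAXLEVELS);
        t->nitems[t->nlevels++] = 2;
    }
    else
    {
        /* Insert the downlink for the new right page into the parent. */
        toy_insert(t, level + 1);
    }
}
```

Starting from an empty single-level tree, the fifth insert at level 0 splits the root and raises the height to 2; later leaf splits add one downlink to level 1 each.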
