PostgreSQL Blink-tree README --- Annotated Translation

Source

src/backend/access/nbtree/README

Btree Indexing

This directory contains a correct implementation of Lehman and Yao's high-concurrency B-tree management algorithm (P. Lehman and S. Yao, Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions on Database Systems, Vol 6, No. 4, December 1981, pp 650-670). We also use a simplified version of the deletion logic described in Lanin and Shasha (V. Lanin and D. Shasha, A Symmetric Concurrent B-Tree Algorithm, Proceedings of 1986 Fall Joint Computer Conference, pp 380-389). [Translator's note: the simplification is that pages are never merged.]

The basic Lehman & Yao Algorithm

Compared to a classic B-tree, L&Y adds a right-link pointer to each page, pointing to the page's right sibling. It also adds a "high key" to each page, which is an upper bound on the keys that are allowed on that page. These two additions make it possible to detect a concurrent page split, which allows the tree to be searched without holding any read locks (except to keep a single page from being modified while reading it).

When a search follows a downlink to a child page, it compares the page's high key with the search key. If the search key is greater than the high key, the page must have been split concurrently, and you must follow the right-link to find the new page containing the key range you're looking for. This might need to be repeated, if the page has been split more than once.
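As a concrete illustration, here is a minimal C sketch of that move-right loop. All helper names (read_page, page_high_key, key_cmp, page_right_link, is_rightmost, release_page) are hypothetical stand-ins for the real buffer-manager and comparison machinery; this is not PostgreSQL's actual _bt_moveright.

    typedef unsigned int BlockNumber;
    typedef struct PageData *Page;
    typedef struct IndexKey IndexKey;

    extern Page  read_page(BlockNumber blk);            /* pins and share-locks the page */
    extern void  release_page(Page p);
    extern int   is_rightmost(Page p);                  /* rightmost pages bound +infinity */
    extern const IndexKey *page_high_key(Page p);
    extern int   key_cmp(const IndexKey *a, const IndexKey *b);  /* 3-way compare */
    extern BlockNumber page_right_link(Page p);

    /* Follow a downlink, then chase right-links past any concurrent splits. */
    static BlockNumber
    land_on_correct_page(BlockNumber child, const IndexKey *search_key)
    {
        for (;;)
        {
            Page p = read_page(child);

            if (is_rightmost(p) || key_cmp(search_key, page_high_key(p)) <= 0)
            {
                release_page(p);        /* a real scan would keep the lock here */
                return child;           /* search key fits this page's key range */
            }
            /* search key > high key: the page was split; move right */
            child = page_right_link(p);
            release_page(p);
        }
    }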

Differences to the Lehman & Yao algorithm

We have made the following changes in order to incorporate the L&Y algorithm into Postgres:

The requirement that all btree keys be unique is too onerous, but the algorithm won’t work correctly without it. Fortunately, it is only necessary that keys be unique on a single tree level, because L&Y only use the assumption of key uniqueness when re-finding a key in a parent page (to determine where to insert the key for a split page). Therefore, we can use the link field to disambiguate multiple occurrences of the same user key: only one entry in the parent level will be pointing at the page we had split. (Indeed we need not look at the real “key” at all, just at the link field.) We can distinguish items at the leaf level in the same way, by examining their links to heap tuples; we’d never have two items for the same heap tuple.

Explanation

This passage is dense, so some explanation is in order.

  • On the unique-key assumption

    Section 5.2 of the L&Y paper describes the B*-tree insertion procedure, with pseudocode below Fig. 8. That procedure contains the step "if v is in A then stop ('v already exists in tree')", and it is exactly this step that requires all keys in the B*-tree to be unique. Why impose such a rule? In 《PostgreSQL B+树索引—分裂》 we walked through the B*-tree split procedure, which boils down to two large steps:

    • split one page into two
    • insert the new page's min key into the parent page (this is the "re-finding a key in a parent page" mentioned above)

    To make step 2 fast, while descending the B*-tree to find the insertion position we record the path in a stack; from that stack we can quickly locate the parent page and the position where the min key should be inserted. But as analyzed in 《PostgreSQL B+树索引—分裂》, this is not always reliable: before we insert into the parent, the parent itself may be split concurrently. When that happens we must walk right and re-locate the insertion position for the min key. Here is the problem: if the parent contains several equal keys, how do we know during that walk where the min key belongs? Inserting it in the wrong place would break the ordering of the whole B*-tree. That is why L&Y require B*-tree keys to be unique: with unique keys, the min key's insertion position follows from key comparisons alone. Since the walk only traverses the parent level, it is in fact sufficient for keys to be unique within one level rather than across the whole tree.

  • The link field

    In the actual implementation there is one naturally unique thing in both leaf-page and non-leaf-page items: the link field. In a non-leaf page the link field points to a lower-level page (it stores that page's block number), so it is also called the lower page's entry. Each item in a non-leaf page is associated with exactly one lower page, so the link field is unique. With it, when re-finding a key in a parent page we only need to examine link fields to locate where the min key goes (a minimal sketch follows below).

    For a leaf-page item, the link field is the TID of the heap tuple the item points to, which is likewise unique.
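The following self-contained toy shows the idea of re-finding the parent entry by its link field alone; the InternalItem layout and function names are invented for illustration and are not PostgreSQL's actual data structures.

    #include <stdio.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;

    /* Toy internal-page item: a separator key plus the child's block number
     * (the "link field"). Keys may repeat; block numbers never do. */
    typedef struct { int key; BlockNumber child; } InternalItem;

    /* Re-find the parent entry for a just-split child: ignore the keys
     * entirely and match on the link field, which is unique on its level. */
    static int
    refind_downlink(const InternalItem *items, int nitems, BlockNumber child)
    {
        for (int i = 0; i < nitems; i++)
            if (items[i].child == child)
                return i;
        return -1;              /* caller would move right and retry */
    }

    int main(void)
    {
        /* duplicate keys (20) are harmless: block 7 is matched by its link */
        InternalItem page[] = {{10, 3}, {20, 7}, {20, 9}};
        printf("offset %d\n", refind_downlink(page, 3, 7));
        return 0;
    }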

Lehman and Yao assume that the key range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent page. This does not work for nonunique keys (for example, if we have enough equal keys to spread across several leaf pages, there must be some equal bounding keys in the first level up). Therefore we assume Ki <= v <= Ki+1 instead. A search that finds exact equality to a bounding key in an upper tree level must descend to the left of that key to ensure it finds any equal keys in the preceding page. An insertion that sees the high key of its target page is equal to the key to be inserted has a choice whether or not to move right, since the new key could go on either page. (Currently, we try to find a page where there is room for the new key without a split.)

Note:

In the L&Y paper, the key stored in a parent page is the lower page's high key, whereas PostgreSQL's implementation stores the lower page's min key (its smallest key) in the parent page, as shown in Figure 1:

[Figure 1]

As the figure shows, both page1 and page2 contain 20, so a search for key = 20 must locate the key to the left of 20 in page3, namely 10, and descend to page1; only then can it find all the 20s. This deserves special attention because it affects how the binary search is implemented.
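A self-contained toy of that binary search follows. The function and layout are invented for illustration (PostgreSQL's real logic lives in _bt_binsrch), but it shows how descending just left of an equal bounding key finds every duplicate.

    #include <stdio.h>

    /* Given the bounding (min) keys of an internal page's children, pick the
     * downlink to descend into so that a search for `key` lands left of any
     * equal bounding key, per the Ki <= v <= Ki+1 convention. */
    static int
    choose_downlink(const int *bound, int nbounds, int key)
    {
        int lo = 0, hi = nbounds;       /* search in bound[0..nbounds) */

        while (lo < hi)                 /* find first bound >= key */
        {
            int mid = (lo + hi) / 2;
            if (bound[mid] < key)
                lo = mid + 1;
            else
                hi = mid;
        }
        /* descend just left of the first equal-or-greater bounding key */
        return lo == 0 ? 0 : lo - 1;    /* index of downlink to follow */
    }

    int main(void)
    {
        int bounds[] = {10, 20, 40};    /* min keys of three children */
        /* key 20 descends to child 0 (left of the equal bound), key 25 to 1 */
        printf("%d %d\n", choose_downlink(bounds, 3, 20),
                          choose_downlink(bounds, 3, 25));
        return 0;
    }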

Lehman and Yao don’t require read locks, but assume that in-memory copies of tree pages are unshared. Postgres shares in-memory buffers among backends. As a result, we do page-level read locking on btree pages in order to guarantee that no record is modified while we are examining it. This reduces concurrency but guarantees correct behavior. An advantage is that when trading in a read lock for a write lock, we need not re-read the page after getting the write lock. Since we’re also holding a pin on the shared buffer containing the page, we know that buffer still contains the page and is up-to-date. [Translator's note: for background on pins, buffers, and pages, see 《PostgreSQL Buffer ReadMe--翻译》.]

We support the notion of an ordered “scan” of an index as well as insertions, deletions, and simple lookups. A scan in the forward direction is no problem, we just use the right-sibling pointers that L&Y require anyway. (Thus, once we have descended the tree to the correct start point for the scan, the scan looks only at leaf pages and never at higher tree levels.) To support scans in the backward direction, we also store a “left sibling” link much like the “right sibling”. (This adds an extra step to the L&Y split algorithm: while holding the write lock on the page being split, we also lock its former right sibling to update that page’s left-link. This is safe since no writer of that page can be interested in acquiring a write lock on our page.) A backwards scan has one additional bit of complexity: after following the left-link we must account for the possibility that the left sibling page got split before we could read it. So, we have to move right until we find a page whose right-link matches the page we came from. (Actually, it’s even harder than that; see deletion discussion below.) [Translator's note: this is covered in more detail in 《PostgreSQL B+树索引—并发控制》.]

Page read locks are held only for as long as a scan is examining a page. To minimize lock/unlock traffic, an index scan always searches a leaf page to identify all the matching items at once, copying their heap tuple IDs into backend-local storage. The heap tuple IDs are then processed while not holding any page lock within the index. We do continue to hold a pin on the leaf page in some circumstances, to protect against concurrent deletions (see below). In this state the scan is effectively stopped “between” pages, either before or after the page it has pinned. This is safe in the presence of concurrent insertions and even page splits, because items are never moved across pre-existing page boundaries — so the scan cannot miss any items it should have seen, nor accidentally return the same item twice. The scan must remember the page’s right-link at the time it was scanned, since that is the page to move right to; if we move right to the current right-link then we’d re-scan any items moved by a page split. We don’t similarly remember the left-link, since it’s best to use the most up-to-date left-link when trying to move left (see detailed move-left algorithm below). [Translator's note: if we instead fetched the right-link only after releasing the lock, a concurrent split could have changed it, and we would re-scan items the split moved away.]
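A compact sketch of one leaf-page visit under these rules follows; the helper names are invented stand-ins for the real scan machinery, not PostgreSQL's actual functions.

    typedef unsigned int BlockNumber;
    typedef struct { unsigned int blk; unsigned short off; } ItemPointerData;
    typedef struct PageData *Page;

    extern Page  read_and_lock(BlockNumber blk);        /* pin + share lock */
    extern void  unlock_keep_pin(Page p);
    extern int   collect_matches(Page p, ItemPointerData *tids, int max);
    extern BlockNumber right_link(Page p);
    extern void  process_heap_tuple(ItemPointerData tid);

    static BlockNumber
    scan_leaf_page(BlockNumber blk)
    {
        ItemPointerData tids[256];          /* backend-local copy of matches */
        Page        p = read_and_lock(blk);
        int         n = collect_matches(p, tids, 256);
        BlockNumber next = right_link(p);   /* remember the right-link *now* */

        unlock_keep_pin(p);                 /* lock released, pin kept */

        for (int i = 0; i < n; i++)
            process_heap_tuple(tids[i]);    /* no index page lock held here */

        return next;                        /* move right to the remembered page */
    }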

In most cases we release our lock and pin on a page before attempting to acquire pin and lock on the page we are moving to. In a few places it is necessary to lock the next page before releasing the current one. This is safe when moving right or up, but not when moving left or down (else we’d create the possibility of deadlocks). [Translator's note: see 《PostgreSQL B+树索引—并发控制》 for the deadlock scenarios.]

Lehman and Yao fail to discuss what must happen when the root page becomes full and must be split. Our implementation is to split the root in the same way that any other page would be split, then construct a new root page holding pointers to both of the resulting pages (which now become siblings on the next level of the tree). The new root page is then installed by altering the root pointer in the meta-data page (see below). This works because the root is not treated specially in any other way — in particular, searches will move right using its link pointer if the link is set. Therefore, searches will find the data that’s been moved into the right sibling even if they read the meta-data page before it got updated. This is the same reasoning that makes a split of a non-root page safe. The locking considerations are similar too.

When an inserter recurses up the tree, splitting internal pages to insert links to pages inserted on the level below, it is possible that it will need to access a page above the level that was the root when it began its descent (or more accurately, the level that was the root when it read the meta-data page). In this case the stack it made while descending does not help for finding the correct page. When this happens, we find the correct place by re-descending the tree until we reach the level one above the level we need to insert a link to, and then moving right as necessary. (Typically this will take only two fetches, the meta-data page and the new root, but in principle there could have been more than one root split since we saw the root. We can identify the correct tree level by means of the level numbers stored in each page. The situation is rare enough that we do not need a more efficient solution.) [Translator's note: this scenario is discussed in 《PostgreSQL B+树索引—分裂》.]

Lehman and Yao assume fixed-size keys, but we must deal with variable-size keys. Therefore there is not a fixed maximum number of keys per page; we just stuff in as many as will fit. When we split a page, we try to equalize the number of bytes, not items, assigned to each of the resulting pages. Note we must include the incoming item in this calculation, otherwise it is possible to find that the incoming item doesn’t fit on the split page where it needs to go!
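To make the byte-balancing concrete, here is a self-contained toy that picks a split point by bytes with the incoming item already counted. PostgreSQL's real logic (_bt_findsplitloc) also accounts for page headers, the high key, and fillfactor; this only illustrates the principle.

    #include <stdio.h>
    #include <stdlib.h>

    /* items[] are the byte sizes of the page's items with the incoming item
     * already counted at its insertion position, exactly as the text requires.
     * Returns how many items go to the left page. */
    static int
    choose_split(const int *items, int nitems)
    {
        int total = 0, left = 0, best = 1, bestdelta = -1;

        for (int i = 0; i < nitems; i++)
            total += items[i];

        /* leave at least one item on each side */
        for (int i = 0; i < nitems - 1; i++)
        {
            left += items[i];
            int delta = abs(left - (total - left));
            if (bestdelta < 0 || delta < bestdelta)
            {
                bestdelta = delta;
                best = i + 1;
            }
        }
        return best;
    }

    int main(void)
    {
        int items[] = {400, 100, 100, 100, 500};  /* includes incoming item */
        printf("left gets %d items\n", choose_split(items, 5));  /* -> 3 */
        return 0;
    }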

The Deletion Algorithm

Before deleting a leaf item, we get a super-exclusive lock on the target page, so that no other backend has a pin on the page when the deletion starts. This is not necessary for correctness in terms of the btree index operations themselves; as explained above, index scans logically stop “between” pages and so can’t lose their place. The reason we do it is to provide an interlock between non-full VACUUM and indexscans. Since VACUUM deletes index entries before reclaiming heap tuple line pointers, the super-exclusive lock guarantees that VACUUM can’t reclaim for re-use a line pointer that an indexscanning process might be about to visit. This guarantee works only for simple indexscans that visit the heap in sync with the index scan, not for bitmap scans. We only need the guarantee when using non-MVCC snapshot rules; when using an MVCC snapshot, it doesn’t matter if the heap tuple is replaced with an unrelated tuple at the same TID, because the new tuple won’t be visible to our scan anyway. Therefore, a scan using an MVCC snapshot which has no other confounding factors will not hold the pin after the page contents are read. The current reasons for exceptions, where a pin is still needed, are if the index is not WAL-logged or if the scan is an index-only scan. If later work allows the pin to be dropped for all cases we will be able to simplify the vacuum code, since the concept of a super-exclusive lock for btree indexes will no longer be needed. [Translator's note: "deleting" here means physically removing the index tuple, not merely marking it dead; a line pointer is the ItemIdData struct.]

Explanation

This passage carries a lot of information, so some explanation is needed:

  • The super-exclusive lock

    《PostgreSQL Buffer ReadMe—翻译》 describes how the super-exclusive lock is implemented. In short, a page cannot be super-exclusively locked while any other process holds a lock or a pin on it: this lock conflicts not only with other locks but also with pins. What is it for?

    Recall the index-scan steps described earlier in this document:

    1. Take a read lock on the index page.
    2. Collect the TIDs of the qualifying items on the page, copying them into backend-private memory.
    3. Unlock the index page, keeping its pin.
    4. Walk the private list of TIDs, fetching the corresponding heap tuples.

    If we placed no restriction on VACUUM, then while we are in the last step the heap page referenced by a TID could concurrently undergo the following:

    1. VACUUM physically removes tuples from the page and reclaims the corresponding TIDs.
    2. After VACUUM, some transaction inserts a new tuple into the page, reusing a reclaimed TID.

    In that case a TID saved in the scan's private memory may well be a reclaimed-and-reused TID whose new tuple has nothing whatsoever to do with the original! If the index scan is MVCC-based this does no harm, because the new tuple is necessarily invisible to the scan. If it is not MVCC-based, the new tuple would be returned, breaking consistency (the C in ACID).

    To prevent this, we need an interlock between index scans and VACUUM guaranteeing that while a process is scanning the index, the heap tuples its index entries reference cannot be reclaimed, i.e. their pages cannot be vacuumed. How is that implemented?

    VACUUM proceeds in two large steps:

    1. Delete the index entries pointing at the dead heap tuples.

    2. Reclaim the heap tuples.

    Deleting index entries obviously requires an exclusive lock on the index page. Suppose that page is being scanned concurrently; then:

    1. The scan is at step 2: the page holds a share lock, the exclusive lock cannot be granted, VACUUM cannot proceed, and no anomaly occurs.
    2. The scan is at step 4: the page is unlocked, the exclusive lock is granted, VACUUM proceeds, and an anomaly is possible.

    So a plain exclusive lock in VACUUM's step 1 is not enough to form the interlock; we need a stronger mechanism with two requirements:

    1. In scan step 3, although the share lock is released, the pin must be kept, and only released once step 4 completes.
    2. In VACUUM step 1, the index page is locked super-exclusively, which conflicts with pins as well as with other locks.

    This way, as long as the index page's pin is not released, the super-exclusive lock cannot be granted and VACUUM cannot proceed. (A conceptual sketch of acquiring such a lock follows this list.)

  • Stopping "between" pages

    The text repeatedly says scans stop "between" pages: a scan only stops (to do visibility checks or heap fetches) after it has read an entire page, never in the middle of one. Hence even if concurrent insertions and deletions hit the page during the scan, the scan itself is unaffected.
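Here is a conceptual sketch of acquiring such a cleanup lock, in the spirit of PostgreSQL's LockBufferForCleanup(); the types and helpers are invented, not the real buffer manager's API.

    typedef struct Buffer Buffer;

    extern void lock_exclusive(Buffer *buf);
    extern void unlock(Buffer *buf);
    extern int  pin_count(Buffer *buf);     /* includes our own pin */
    extern void wait_for_unpin(Buffer *buf);

    /* Super-exclusive ("cleanup") lock: exclusive content lock plus the
     * guarantee that ours is the only pin on the buffer. */
    void
    lock_buffer_for_cleanup(Buffer *buf)
    {
        for (;;)
        {
            lock_exclusive(buf);
            if (pin_count(buf) == 1)        /* only our pin remains */
                return;                     /* cleanup lock acquired */
            unlock(buf);
            wait_for_unpin(buf);            /* sleep until someone unpins */
        }
    }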

Because a pin is not always held, and a page can be split even while someone does hold a pin on it, it is possible that an indexscan will return items that are no longer stored on the page it has a pin on, but rather somewhere to the right of that page. To ensure that VACUUM can’t prematurely remove such heap tuples, we require btbulkdelete to obtain a super-exclusive lock on every leaf page in the index, even pages that don’t contain any deletable tuples. Any scan which could yield incorrect results if the tuple at a TID matching the scan’s range and filter conditions were replaced by a different tuple while the scan is in progress must hold the pin on each index page until all index entries read from the page have been processed. This guarantees that the btbulkdelete call cannot return while any indexscan is still holding a copy of a deleted index tuple if the scan could be confused by that. Note that this requirement does not say that btbulkdelete must visit the pages in any particular order. (See also on-the-fly deletion, below.) [Translator's note: a scan not using an MVCC snapshot is the typical case that could be confused this way.]

There is no such interlocking for deletion of items in internal pages, since backends keep no lock nor pin on a page they have descended past. Hence, when a backend is ascending the tree using its stack, it must be prepared for the possibility that the item it wants is to the left of the recorded position (but it can’t have moved left out of the recorded page). Since we hold a lock on the lower page (per L&Y) until we have re-found the parent item that links to it, we can be assured that the parent item does still exist and can’t have been deleted. Also, because we are matching downlink page numbers and not data keys, we don’t have any problem with possibly misidentifying the parent item. [Translator's note: block numbers are unique; data keys need not be.]

Page Deletion

We consider deleting an entire page from the btree only when it’s become completely empty of items. (Merging partly-full pages would allow better space reuse, but it seems impractical to move existing data items left or right to make this happen — a scan moving in the opposite direction might miss the items if so.) Also, we never delete the rightmost page on a tree level (this restriction simplifies the traversal algorithms, as explained below). Page deletion always begins from an empty leaf page. An internal page can only be deleted as part of a branch leading to a leaf page, where each internal page has only one child and that child is also to be deleted.

Explanation

This passage explains an important reason why PostgreSQL does not support page merging. Merging would combine the current page with its left sibling (merge left) or right sibling (merge right). PostgreSQL indexes support scanning in both directions. Suppose process T1 is scanning to the right; recall the steps of a rightward scan:

  1. Take a share lock on the current page (call it PAGE_A).
  2. Scan the page, collecting the TIDs of all qualifying index tuples into backend-private memory.
  3. Record the current page's right sibling (call it PAGE_B).
  4. Unlock the current page.
  5. Walk the private TIDs, doing visibility checks, heap fetches, and so on.

While T1 is in step 5, a concurrent process T2 performs a leftward merge, folding PAGE_B's contents into PAGE_A. When T1 finishes step 5 and continues right, it finds PAGE_B gone; nor can it rescan PAGE_A, since there is no way to tell which of PAGE_A's items have already been visited.

Deleting a leaf page is a two-stage process. In the first stage, the page is unlinked from its parent, and marked as half-dead. The parent page must be found using the same type of search as used to find the parent during an insertion split. We lock the target and the parent pages, change the target’s downlink to point to the right sibling, and remove its old downlink. This causes the target page’s key space to effectively belong to its right sibling. (Neither the left nor right sibling pages need to change their “high key” if any; so there is no problem with possibly not having enough space to replace a high key.) At the same time, we mark the target page as half-dead, which causes any subsequent searches to ignore it and move right (or left, in a backwards scan). This leaves the tree in a similar state as during a page split: the page has no downlink pointing to it, but it’s still linked to its siblings. [Translator's note: the entry point is _bt_pagedel; the target and parent are locked bottom-up.]

Explanation

The source for this step is _bt_mark_page_halfdead, starting at line 1435 of nbtpage.c in version 9.6.10. 《PostgreSQL B+树索引—基本结构》 describes the basic B*-tree structure, shown in Figure 2. Suppose page2's index tuples have all been cleared out and page2 is to be deleted.
[Figure 2]

Page2's parent, page6, contains an item pointing to page2 (its key is 21), so the natural thought is to delete that 21 from page6. Can it be deleted directly? Suppose we simply remove page2 along with the 21 in page6; the B*-tree then becomes Figure 3.
[Figure 3]

In Figure 3, page2 is gone and in page6 the 51 has shifted left over the deleted 21. Now a question: with the structure of Figure 3, if we want to insert 30, where should it go?

  • Into page1?

    30 is greater than page1's high key; inserting 30 into page1 would require raising page1's high key to 30.

  • Into page3?

    30 is less than page3's min key; inserting 30 into page3 would require changing the 51 in page6 to 30.

So 30 fits nowhere. The root cause is that after deleting page2, page1's high key no longer equals the min key of its right sibling page3 (a principle noted in 《PostgreSQL B+树索引—基本结构》). PostgreSQL therefore does something cleverer:

  • Point page6's 21 at page2's right sibling, page3.
  • Delete the index tuple to the right of 21 in page6 (the 51 that used to point at page3).

After these steps the B*-tree looks like Figure 4:

[Figure 4]

In this indirect way page3's min key becomes 21, preserving the rule that page1's high key equals the min key of its right sibling page3. A tuple with key 30 can now be inserted into page3 without trouble.

(Note: Lanin and Shasha prefer to make the key space move left, but their argument for doing so hinges on not having left-links, which we have anyway. So we simplify the algorithm by moving key space right.)

To preserve consistency on the parent level, we cannot merge the key space of a page into its right sibling unless the right sibling is a child of the same parent — otherwise, the parent’s key space assignment changes too, meaning we’d have to make bounding-key updates in its parent, and perhaps all the way up the tree. Since we can’t possibly do that atomically, we forbid this case. That means that the rightmost child of a parent node can’t be deleted unless it’s the only remaining child, in which case we will delete the parent too (see below).

Explanation

Now suppose we need to delete page3 from the tree of Figure 1, as shown in Figure 5:
[Figure 5]

Following the earlier procedure, we would point page6's 51 at page3's right sibling page4, then delete the 63 in page7, ending up with Figure 6:
[Figure 6]

We have deleted the 63 in page7, yet the key in page8 pointing at page7 is still 63, so the levels are now inconsistent. If we insert 70 into the B+-tree, page8 routes 70 to page7, where we discover that 70 is greater than page7's min key: chaos again.

The root cause is that deleting the 63 left page6's high key unequal to page7's min key. In Figure 2 we could merge the key spaces of page2 and page3 because they share the same parent (page6). Here we want to merge page3's key space into page4's, but they have different parents, so the merge would change the key spaces of page6 and page7! To merge page3 and page4 we would have to update the bounding keys of page6 and page7 (full siblings and cousins really are different!). Updating parent bounding keys is expensive and can cascade upward (the 63 in page8 would have to become 85, for instance), so PostgreSQL does not support it.

For the same reason, PostgreSQL does not support deleting a parent's rightmost child unless it is the parent's only child, in which case the parent is about to be deleted too. There is one further special case: page5, the rightmost leaf of the whole tree. To delete page5, page5 must be page7's only child; to delete page7, page7 must be page8's only child; so page5 can only be deleted once the entire tree is empty!

In the second stage, the half-dead leaf page is unlinked from its siblings. We first lock the left sibling (if any) of the target, the target page itself, and its right sibling (there must be one) in that order. Then we update the side-links in the siblings, and mark the target page deleted.

When we’re about to delete the last remaining child of a parent page, things are slightly more complicated. In the first stage, we leave the immediate parent of the leaf page alone, and remove the downlink to the parent page instead, from the grandparent. If it’s the last child of the grandparent too, we recurse up until we find a parent with more than one child, and remove the downlink of that page. The leaf page is marked as half-dead, and the block number of the page whose downlink was removed is stashed in the half-dead leaf page. This leaves us with a chain of internal pages, with one downlink each, leading to the half-dead leaf page, and no downlink pointing to the topmost page in the chain.

While we recurse up to find the topmost parent in the chain, we keep the leaf page locked, but don’t need to hold locks on the intermediate pages between the leaf and the topmost parent — insertions into upper tree levels happen only as a result of splits of child pages, and that can’t happen as long as we’re keeping the leaf locked. The internal pages in the chain cannot acquire new children afterwards either, because the leaf page is marked as half-dead and won’t be split.

Explanation

The two paragraphs above are implemented by _bt_mark_page_halfdead, which handles the scenario of Figure 7:
[Figure 7]

We now want to delete page3. Page3 is page6's rightmost child, but since it is page6's only child, it may be deleted; after deleting page3, page6 will be deleted as well.

As described earlier, page deletion has two stages:

  • Stage 1: unlink the target page from its parent and mark it half-dead.

    Done by _bt_mark_page_halfdead.

  • Stage 2: unlink the target page from its left and right siblings.

    Done by _bt_unlink_halfdead_page.

Stage 1 performs four pieces of work:

  1. Walk upward until reaching a parent with more than one child.

    This is what the text means by "we leave the immediate parent of the leaf page alone, and remove the downlink to the parent page instead, from the grandparent. If it's the last child of the grandparent too, we recurse up until we find a parent with more than one child". In PostgreSQL this is implemented by _bt_lock_branch_parent, which has four output parameters: topparent, topoff, target, and rightsib. With the tree of Figure 7, their meanings and values are shown in Figure 8:
    [Figure 8]
    • topparent: block number of the page _bt_lock_branch_parent finally finds whose child is not an only child.

    • topoff: offset within topparent of the index tuple holding the downlink to the lower page.

    • target: the lower page topoff points at; this page is also the head of the chain of internal pages (the to-be-deleted chain discussed below).

    • rightsib: target's right sibling.

  2. Remove the downlink.

    How the downlink is removed was described earlier (see Figure 4); the code starts at line 1435 of nbtpage.c.

  3. Mark the page BTP_HALF_DEAD.

    See nbtpage.c around line 1452.

  4. Make the leaf page point at the topmost page.

    That is, make the leaf page point at the top of the chain of internal pages: page3 points at page6 (the text's "The leaf page is marked as half-dead, and the block number of the page whose downlink was removed is stashed in the half-dead leaf page.").

    Concretely, the TID of page3's high key is overwritten with page6's block number. (The high key's TID carries no meaning anyway, so it is free to be reused.) The code starts at line 1454 of nbtpage.c.

After stage 1 the tree looks like Figure 9:
[Figure 9]

At this point the parent-child link between page6 and page8 has been severed, and we have a chain of internal pages containing just page6 and page3; page3 is marked half-dead, and its high key points at page6.

Removing the downlink to the top of the to-be-deleted chain effectively transfers the key space to the right sibling for all the intermediate levels too, in one atomic operation. A concurrent search might still visit the intermediate pages, but it will move right when it reaches the half-dead page at the leaf level.

In the second stage, the topmost page in the chain is unlinked from its siblings, and the half-dead leaf page is updated to point to the next page down in the chain. This is repeated until there are no internal pages left in the chain. Finally, the half-dead leaf page itself is unlinked from its siblings. [Translator's note: the topmost page corresponds to page6 in Figure 9; the code starts at line 1787 of nbtpage.c.]

Explanation

Stage 2 simply loops over the to-be-deleted chain built in stage 1, unlinking each of its pages from that page's left and right siblings.

A deleted page cannot be reclaimed immediately, since there may be other processes waiting to reference it (ie, search processes that just left the parent, or scans moving right or left from one of the siblings). These processes must observe that the page is marked dead and recover accordingly. Searches and forward scans simply follow the right-link until they find a non-dead page — this will be where the deleted page’s key-space moved to.

Moving left in a backward scan is complicated because we must consider the possibility that the left sibling was just split (meaning we must find the rightmost page derived from the left sibling), plus the possibility that the page we were just on has now been deleted and hence isn’t in the sibling chain at all anymore. So the move-left algorithm becomes:

0. Remember the page we are on as the “original page”.

1. Follow the original page’s left-link (we’re done if this is zero).

2. If the current page is live and its right-link matches the “original page”, we are done.

3. Otherwise, move right one or more times looking for a live page whose right-link matches the “original page”. If found, we are done. (In principle we could scan all the way to the right end of the index, but in practice it seems better to give up after a small number of tries. It’s unlikely the original page’s sibling split more than a few times while we were in flight to it; if we do not find a matching link in a few tries, then most likely the original page is deleted.)

4. Return to the “original page”. If it is still live, return to step 1 (we guessed wrong about it being deleted, and should restart with its current left-link). If it is dead, move right until a non-dead page is found (there must be one, since rightmost pages are never deleted), mark that as the new “original page”, and return to step 1.

This algorithm is correct because the live page found by step 4 will have the same left keyspace boundary as the page we started from. Therefore, when we ultimately exit, it must be on a page whose right keyspace boundary matches the left boundary of where we started — which is what we need to be sure we don’t miss or re-scan any items.
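For concreteness, here is a C sketch of those steps, with invented helper names (read_page, page_is_live, left_link_of, right_link, release_page) standing in for the real buffer-manager calls; it is a simplified model, not PostgreSQL's _bt_walk_left.

    typedef unsigned int BlockNumber;
    typedef struct PageData *Page;

    extern Page  read_page(BlockNumber blk);      /* pins and share-locks */
    extern void  release_page(Page p);
    extern int   page_is_live(Page p);            /* not half-dead or dead */
    extern BlockNumber left_link_of(BlockNumber blk);
    extern BlockNumber right_link(Page p);

    #define MAX_TRIES 4     /* step 3: give up after a few right-moves */

    static BlockNumber
    move_left(BlockNumber orig)
    {
        for (;;)
        {
            BlockNumber cur = left_link_of(orig);           /* step 1 */

            if (cur == 0)
                return 0;                   /* leftmost: scan is done */

            for (int tries = 0; tries < MAX_TRIES; tries++) /* steps 2-3 */
            {
                Page p = read_page(cur);

                if (page_is_live(p) && right_link(p) == orig)
                {
                    release_page(p);        /* a real scan keeps the lock */
                    return cur;             /* found our left neighbor */
                }
                cur = right_link(p);        /* left sibling split: go right */
                release_page(p);
            }

            /* step 4: revisit the "original page" and reassess */
            {
                Page p = read_page(orig);

                while (!page_is_live(p))    /* dead: walk right to a live page */
                {
                    BlockNumber next = right_link(p);
                    release_page(p);
                    orig = next;            /* new "original page" */
                    p = read_page(next);
                }
                release_page(p);
            }
            /* loop: retry from step 1 with orig's current left-link */
        }
    }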

A deleted page can only be reclaimed once there is no scan or search that has a reference to it; until then, it must stay in place with its right-link undisturbed. We implement this by waiting until all active snapshots and registered snapshots as of the deletion are gone; which is overly strong, but is simple to implement within Postgres. When marked dead, a deleted page is labeled with the next-transaction counter value. VACUUM can reclaim the page for re-use when this transaction number is older than RecentGlobalXmin. As collateral damage, this implementation also waits for running XIDs with no snapshots and for snapshots taken until the next transaction to allocate an XID commits.

Reclaiming a page doesn’t actually change its state on disk — we simply record it in the shared-memory free space map, from which it will be handed out the next time a new page is needed for a page split. The deleted page’s contents will be overwritten by the split operation. (Note: if we find a deleted page with an extremely old transaction number, it’d be worthwhile to re-mark it with FrozenTransactionId so that a later xid wraparound can’t cause us to think the page is unreclaimable. But in more normal situations this would be a waste of a disk write.)

Because we never delete the rightmost page of any level (and in particular never delete the root), it’s impossible for the height of the tree to decrease. After massive deletions we might have a scenario in which the tree is “skinny”, with several single-page levels below the root. Operations will still be correct in this case, but we’d waste cycles descending through the single-page levels. To handle this we use an idea from Lanin and Shasha: we keep track of the “fast root” level, which is the lowest single-page level. The meta-data page keeps a pointer to this level as well as the true root. All ordinary operations initiate their searches at the fast root not the true root. When we split a page that is alone on its level or delete the next-to-last page on a level (both cases are easily detected), we have to make sure that the fast root pointer is adjusted appropriately. In the split case, we do this work as part of the atomic update for the insertion into the parent level; in the delete case as part of the atomic update for the delete (either way, the metapage has to be the last page locked in the update to avoid deadlock risks). This avoids race conditions if two such operations are executing concurrently.

VACUUM needs to do a linear scan of an index to search for deleted pages that can be reclaimed because they are older than all open transactions. For efficiency’s sake, we’d like to use the same linear scan to search for deletable tuples. Before Postgres 8.2, btbulkdelete scanned the leaf pages in index order, but it is possible to visit them in physical order instead. The tricky part of this is to avoid missing any deletable tuples in the presence of concurrent page splits: a page split could easily move some tuples from a page not yet passed over by the sequential scan to a lower-numbered page already passed over. (This wasn’t a concern for the index-order scan, because splits always split right.) To implement this, we provide a “vacuum cycle ID” mechanism that makes it possible to determine whether a page has been split since the current btbulkdelete cycle started. If btbulkdelete finds a page that has been split since it started, and has a right-link pointing to a lower page number, then it temporarily suspends its sequential scan and visits that page instead. It must continue to follow right-links and vacuum dead tuples until reaching a page that either hasn’t been split since btbulkdelete started, or is above the location of the outer sequential scan. Then it can resume the sequential scan. This ensures that all tuples are visited. It may be that some tuples are visited twice, but that has no worse effect than an inaccurate index tuple count (and we can’t guarantee an accurate count anyway in the face of concurrent activity). Note that this still works if the has-been-recently-split test has a small probability of false positives, so long as it never gives a false negative. This makes it possible to implement the test with a small counter value stored on each index page.

Explanation

  • false positive

    Here, a false positive means the page did not actually split, but the test says it did.

  • false negative

    A false negative means the page actually did split, but the test says it did not.
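A sketch of the has-been-recently-split test follows, assuming a BTPageOpaque-like struct (cf. the corresponding check in btvacuumpage); the toy type is simplified for illustration. Note the asymmetry the text requires: a cycle-ID collision can only produce a false positive (we chase a right-link needlessly), never a false negative.

    typedef unsigned int   BlockNumber;
    typedef unsigned short BTCycleId;

    typedef struct
    {
        BlockNumber btpo_next;      /* right sibling */
        BTCycleId   btpo_cycleid;   /* vacuum cycle ID of latest split */
    } ToyPageOpaque;

    /* True if this page was split during our VACUUM cycle and the right
     * half landed at a lower block number, i.e. behind the sequential
     * scan, so btbulkdelete must detour to it before continuing. */
    static int
    split_happened_behind_us(const ToyPageOpaque *opaque,
                             BTCycleId my_cycleid, BlockNumber current_block)
    {
        return opaque->btpo_cycleid == my_cycleid &&
               opaque->btpo_next < current_block;
    }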

On-the-Fly Deletion Of Index Tuples

If a process visits a heap tuple and finds that it’s dead and removable (ie, dead to all open transactions, not only that process), then we can return to the index and mark the corresponding index entry “known dead”, allowing subsequent index scans to skip visiting the heap tuple. The “known dead” marking works by setting the index item’s lp_flags state to LP_DEAD. This is currently only done in plain indexscans, not bitmap scans, because only plain scans visit the heap and index “in sync” and so there’s not a convenient way to do it for bitmap scans.

About "in sync" access

"In sync" means that a plain indexscan scans one index page, and for the qualifying items on that page immediately visits the heap to check tuple visibility, fetch rows, and so on, before scanning the next index page. Bitmap scans, to avoid random heap I/O during heap fetches, instead collect the TIDs of all qualifying index items (spread across many index pages), sort them, and only then visit the heap.

Once an index tuple has been marked LP_DEAD it can actually be removed from the index immediately; since index scans only stop “between” pages, no scan can lose its place from such a deletion. We separate the steps because we allow LP_DEAD to be set with only a share lock (it’s exactly like a hint bit for a heap tuple), but physically removing tuples requires exclusive lock. In the current code we try to remove LP_DEAD tuples when we are otherwise faced with having to split a page to do an insertion (and hence have exclusive lock on it already).

This leaves the index in a state where it has no entry for a dead tuple that still exists in the heap. This is not a problem for the current implementation of VACUUM, but it could be a problem for anything that explicitly tries to find index entries for dead tuples. (However, the same situation is created by REINDEX, since it doesn’t enter dead tuples into the index.) [Translator's note: the state arises because the index tuple is physically removed before the heap tuple.]

It’s sufficient to have an exclusive lock on the index page, not a super-exclusive lock, to do deletion of LP_DEAD items. It might seem that this breaks the interlock between VACUUM and indexscans, but that is not so: as long as an indexscanning process has a pin on the page where the index item used to be, VACUUM cannot complete its btbulkdelete scan and so cannot remove the heap tuple. This is another reason why btbulkdelete has to get a super-exclusive lock on every leaf page, not only the ones where it actually sees items to delete. So that we can handle the cases where we attempt LP_DEAD flagging for a page after we have released its pin, we remember the LSN of the index page when we read the index tuples from it; we do not attempt to flag index tuples as dead if we didn’t hold the pin the entire time and the LSN has changed.
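A sketch of that LSN guard follows, in the spirit of _bt_killitems; the types and helpers are invented simplifications, not the real API.

    typedef unsigned long long XLogRecPtr;
    typedef struct Buffer Buffer;

    extern XLogRecPtr page_lsn(Buffer *buf);
    extern void mark_items_dead(Buffer *buf);   /* sets LP_DEAD bits */

    /* Only set LP_DEAD bits if we either held the pin the whole time, or
     * the page LSN is unchanged, proving nothing (such as VACUUM removing
     * items) modified the page since we read our index tuples from it. */
    void
    maybe_kill_items(Buffer *buf, int held_pin_throughout,
                     XLogRecPtr lsn_at_read)
    {
        if (!held_pin_throughout && page_lsn(buf) != lsn_at_read)
            return;                 /* page changed under us: do nothing */
        mark_items_dead(buf);       /* safe: a share lock suffices for this */
    }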

WAL Considerations

The insertion and deletion algorithms in themselves don’t guarantee btree consistency after a crash. To provide robustness, we depend on WAL replay. A single WAL entry is effectively an atomic action — we can redo it from the log if it fails to complete.

Note

There is an important concept here: one WAL record represents one atomic action, which may consist of a single step or of several steps; the WAL record for a page split, for example, covers several steps. MySQL has a dedicated term for this concept: the mini-transaction (MTR).

Ordinary item insertions (that don’t force a page split) are of course single WAL entries, since they only affect one page. The same for leaf-item deletions (if the deletion brings the leaf page to zero items, it is now a candidate to be deleted, but that is a separate action).

An insertion that causes a page split is logged as a single WAL entry for the changes occurring on the insertion’s level — including update of the right sibling’s left-link — followed by a second WAL entry for the insertion on the parent level (which might itself be a page split, requiring an additional insertion above that, etc).

For a root split, the follow-on WAL entry is a “new root” entry rather than an “insertion” entry, but details are otherwise much the same.

Because splitting involves multiple atomic actions, it’s possible that the system crashes between splitting a page and inserting the downlink for the new half to the parent. After recovery, the downlink for the new page will be missing. The search algorithm works correctly, as the page will be found by following the right-link from its left sibling, although if a lot of downlinks in the tree are missing, performance will suffer. A more serious consequence is that if the page without a downlink gets split again, the insertion algorithm will fail to find the location in the parent level to insert the downlink.

Our approach is to create any missing downlinks on-the-fly, when searching the tree for a new insertion. It could be done during searches, too, but it seems best not to put any extra updates in what would otherwise be a read-only operation (updating is not possible in hot standby mode anyway). It would seem natural to add the missing downlinks in VACUUM, but since inserting a downlink might require splitting a page, it might fail if you run out of disk space. That would be bad during VACUUM - the reason for running VACUUM in the first place might be that you run out of disk space, and now VACUUM won’t finish because you’re out of disk space. In contrast, an insertion can require enlarging the physical file anyway.

To identify missing downlinks, when a page is split, the left page is flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT). When the downlink is inserted to the parent, the flag is cleared atomically with the insertion. The child page is kept locked until the insertion in the parent is finished and the flag in the child cleared, but can be released immediately after that, before recursing up the tree if the parent also needs to be split. This ensures that incompletely split pages should not be seen under normal circumstances; only if insertion to the parent has failed for some reason.

We flag the left page, even though it’s the right page that’s missing the downlink, because it’s more convenient to know already when following the right-link from the left page to the right page that it will need to have its downlink inserted to the parent.
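For reference, the flag and its test appear in 9.4-era src/include/access/nbtree.h roughly as follows; treat this as a sketch of the declarations rather than an exact quotation.

    /* The *left* half of a split carries this flag until the new right
     * page's downlink has been inserted into the parent. */
    #define BTP_INCOMPLETE_SPLIT (1 << 7)   /* right sibling's downlink is missing */

    #define P_INCOMPLETE_SPLIT(opaque) \
        (((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0)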

When splitting a non-root page that is alone on its level, the required metapage update (of the “fast root” link) is performed and logged as part of the insertion into the parent level. When splitting the root page, the metapage update is handled as part of the “new root” action.

Each step in page deletion is logged as a separate WAL entry: marking the leaf as half-dead and removing the downlink is one record, and unlinking a page is a second record. If vacuum is interrupted for some reason, or the system crashes, the tree is consistent for searches and insertions. The next VACUUM will find the half-dead leaf page and continue the deletion.

Before 9.4, we used to keep track of incomplete splits and page deletions during recovery and finish them immediately at end of recovery, instead of doing it lazily at the next insertion or vacuum. However, that made the recovery much more complicated, and only fixed the problem when crash recovery was performed. An incomplete split can also occur if an otherwise recoverable error, like out-of-memory or out-of-disk-space, happens while inserting the downlink to the parent.

Scans during Recovery

The btree index type can be safely used during recovery. During recovery we have at most one writer and potentially many readers. In that situation the locking requirements can be relaxed and we do not need double locking during block splits. Each WAL record makes changes to a single level of the btree using the correct locking sequence and so is safe for concurrent readers. Some readers may observe a block split in progress as they descend the tree, but they will simply move right onto the correct page.

During recovery all index scans start with ignore_killed_tuples = false and we never set kill_prior_tuple. We do this because the oldest xmin on the standby server can be older than the oldest xmin on the master server, which means tuples can be marked as killed even when they are still visible on the standby. We don’t WAL log tuple killed bits, but they can still appear in the standby because of full page writes. So we must always ignore them in standby, and that means it’s not worth setting them either.

Note that we talk about scans that are started during recovery. We go to a little trouble to allow a scan to start during recovery and end during normal running after recovery has completed. This is a key capability because it allows running applications to continue while the standby changes state into a normally running server.

The interlocking required to avoid returning incorrect results from non-MVCC scans is not required on standby nodes. That is because HeapTupleSatisfiesUpdate(), HeapTupleSatisfiesSelf(), HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only ever used during write transactions, which cannot exist on the standby. MVCC scans are already protected by definition, so HeapTupleSatisfiesMVCC() is not a problem. That leaves concern only for HeapTupleSatisfiesToast(). HeapTupleSatisfiesToast() doesn’t use MVCC semantics, though that’s because it doesn’t need to - if the main heap row is visible then the toast rows will also be visible. So as long as we follow a toast pointer from a visible (live) tuple the corresponding toast rows will also be visible, so we do not need to recheck MVCC on them. There is one minor exception, which is that the optimizer sometimes looks at the boundaries of value ranges using SnapshotDirty, which could result in returning a newer value for query statistics; this would affect the query plan in rare cases, but not the correctness. The risk window is small since the stats look at the min and max values in the index, so the scan retrieves a tid then immediately uses it to look in the heap. It is unlikely that the tid could have been deleted, vacuumed and re-inserted in the time taken to look in the heap via direct tid access. So we ignore that scan type as a problem.

Other Things That Are Handy to Know

Page zero of every btree is a meta-data page. This page stores the location of the root page — both the true root and the current effective root (“fast” root). To avoid fetching the metapage for every single index search, we cache a copy of the meta-data information in the index’s relcache entry (rd_amcache). This is a bit ticklish since using the cache implies following a root page pointer that could be stale. However, a backend following a cached pointer can sufficiently verify whether it reached the intended page; either by checking the is-root flag when it is going to the true root, or by checking that the page has no siblings when going to the fast root. At worst, this could result in descending some extra tree levels if we have a cached pointer to a fast root that is now above the real fast root. Such cases shouldn’t arise often enough to be worth optimizing; and in any case we can expect a relcache flush will discard the cached metapage before long, since a VACUUM that’s moved the fast root pointer can be expected to issue a statistics update for the index.
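For reference, the metapage layout is roughly as declared in 9.6-era src/include/access/nbtree.h (the typedefs below are minimal stand-ins for the PostgreSQL ones):

    typedef unsigned int uint32;
    typedef uint32 BlockNumber;

    typedef struct BTMetaPageData
    {
        uint32      btm_magic;      /* should contain BTREE_MAGIC */
        uint32      btm_version;    /* should contain BTREE_VERSION */
        BlockNumber btm_root;       /* current root location */
        uint32      btm_level;      /* tree level of the root page */
        BlockNumber btm_fastroot;   /* current "fast" root location */
        uint32      btm_fastlevel;  /* tree level of the "fast" root page */
    } BTMetaPageData;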

The algorithm assumes we can fit at least three items per page (a “high key” and two real data items). Therefore it’s unsafe to accept items larger than 1/3rd page size. Larger items would work sometimes, but could cause failures later on depending on what else gets put on their page.
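The 9.6-era header expresses this limit as a macro, roughly as below (treat this as a sketch, not an exact quotation). With the default 8 kB block size it works out to (8192 - MAXALIGN(24 + 3*4) - MAXALIGN(16)) / 3 = (8192 - 40 - 16) / 3 = 2712 bytes, the familiar limit from the "index row size exceeds maximum" error.

    /* A page must hold a high key plus at least two data items, each with
     * a line pointer, so reject items above roughly 1/3 of the usable page. */
    #define BTMaxItemSize(page) \
        MAXALIGN_DOWN((PageGetPageSize(page) - \
                       MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
                       MAXALIGN(sizeof(BTPageOpaqueData))) / 3)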

“ScanKey” data structures are used in two fundamentally different ways in this code, which we describe as “search” scankeys and “insertion” scankeys. A search scankey is the kind passed to btbeginscan() or btrescan() from outside the btree code. The sk_func pointers in a search scankey point to comparison functions that return boolean, such as int4lt. There might be more than one scankey entry for a given index column, or none at all. (We require the keys to appear in index column order, but the order of multiple keys for a given column is unspecified.) An insertion scankey uses the same array-of-ScanKey data structure, but the sk_func pointers point to btree comparison support functions (ie, 3-way comparators that return int4 values interpreted as <0, =0, >0). In an insertion scankey there is exactly one entry per index column. Insertion scankeys are built within the btree code (eg, by _bt_mkscankey()) and are used to locate the starting point of a scan, as well as for locating the place to insert a new index tuple. (Note: in the case of an insertion scankey built from a search scankey, there might be fewer keys than index columns, indicating that we have no constraints for the remaining index columns.) After we have located the starting point of a scan, the original search scankey is consulted as each index entry is sequentially scanned to decide whether to return the entry and whether the scan can stop (see _bt_checkkeys()).

Notes About Data Representation

The right-sibling link required by L&Y is kept in the page “opaque data” area, as is the left-sibling link, the page level, and some flags. The page level counts upwards from zero at the leaf level, to the tree depth minus 1 at the root. (Counting up from the leaves ensures that we don’t need to renumber any existing pages when splitting the root.)
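The opaque data lives in the special space of every btree page; its layout is roughly as declared in 9.6-era src/include/access/nbtree.h (the typedefs below are minimal stand-ins for the PostgreSQL ones):

    typedef unsigned int   uint32;
    typedef unsigned short uint16;
    typedef uint32 BlockNumber;
    typedef uint32 TransactionId;
    typedef uint16 BTCycleId;

    typedef struct BTPageOpaqueData
    {
        BlockNumber btpo_prev;      /* left sibling, or P_NONE */
        BlockNumber btpo_next;      /* right sibling, or P_NONE */
        union
        {
            uint32        level;    /* tree level: zero for leaf pages */
            TransactionId xact;     /* next transaction ID, if deleted */
        }           btpo;
        uint16      btpo_flags;     /* flag bits, e.g. BTP_LEAF, BTP_DELETED */
        BTCycleId   btpo_cycleid;   /* vacuum cycle ID of latest split */
    } BTPageOpaqueData;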

The Postgres disk block data format (an array of items) doesn’t fit Lehman and Yao’s alternating-keys-and-pointers notion of a disk page, so we have to play some games.

On a page that is not rightmost in its tree level, the “high key” is kept in the page’s first item, and real data items start at item 2. The link portion of the “high key” item goes unused. A page that is rightmost has no “high key”, so data items start with the first item. Putting the high key at the left, rather than the right, may seem odd, but it avoids moving the high key as we add data items.
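This convention is captured by a few macros in src/include/access/nbtree.h, roughly as follows:

    /* The high key, when present, is always item 1; real data starts at
     * item 2, except on rightmost pages, which have no high key. */
    #define P_HIKEY                 ((OffsetNumber) 1)
    #define P_FIRSTKEY              ((OffsetNumber) 2)
    #define P_FIRSTDATAKEY(opaque)  (P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)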

On a leaf page, the data items are simply links to (TIDs of) tuples in the relation being indexed, with the associated key values.

On a non-leaf page, the data items are down-links to child pages with bounding keys. The key in each data item is the lower bound for keys on that child page, so logically the key is to the left of that downlink. The high key (if present) is the upper bound for the last downlink. The first data item on each such page has no lower bound — or lower bound of minus infinity, if you prefer. The comparison routines must treat it accordingly. The actual key stored in the item is irrelevant, and need not be stored at all. This arrangement corresponds to the fact that an L&Y non-leaf page has one more pointer than key.

Notes to Operator Class Implementors

With this implementation, we require each supported combination of datatypes to supply us with a comparison procedure via pg_amproc. This procedure must take two nonnull values A and B and return an int32 < 0, 0, or > 0 if A < B, A = B, or A > B, respectively. The procedure must not return INT_MIN for “A < B”, since the value may be negated before being tested for sign. A null result is disallowed, too. See nbtcompare.c for examples.
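A self-contained example of such a 3-way comparator follows, in the style of nbtcompare.c's btint4cmp; the function name is invented. It also shows why the tempting shortcut of returning (a - b) is wrong: the subtraction can overflow, and its result may be INT_MIN, which the caller is allowed to negate.

    #include <stdio.h>
    #include <stdint.h>

    /* 3-way comparison as required of a btree support function:
     * never returns INT_MIN, never overflows. */
    static int32_t
    my_int4_cmp(int32_t a, int32_t b)
    {
        if (a > b)
            return 1;
        else if (a == b)
            return 0;
        else
            return -1;
    }

    int main(void)
    {
        /* INT32_MIN vs INT32_MAX: (a - b) would overflow; branching does not */
        printf("%d\n", my_int4_cmp(INT32_MIN, INT32_MAX));   /* prints -1 */
        return 0;
    }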

There are some basic assumptions that a btree operator family must satisfy:

An = operator must be an equivalence relation; that is, for all non-null values A, B, C of the datatype:

    A = A is true                       (reflexive law)
    if A = B, then B = A                (symmetric law)
    if A = B and B = C, then A = C      (transitive law)

A < operator must be a strong ordering relation; that is, for all non-null values A, B, C:

    A < A is false                      (irreflexive law)
    if A < B and B < C, then A < C      (transitive law)

Furthermore, the ordering is total; that is, for all non-null values A, B:

    exactly one of A < B, A = B, and B < A is true      (trichotomy law)

(The trichotomy law justifies the definition of the comparison support procedure, of course.)

The other three operators (<=, >=, <>) are defined in terms of these two in the obvious way, and must act consistently with them.

For an operator family supporting multiple datatypes, the above laws must hold when A, B, C are taken from any datatypes in the family. The transitive laws are the trickiest to ensure, as in cross-type situations they represent statements that the behaviors of two or three different operators are consistent. As an example, it would not work to put float8 and numeric into an opfamily, at least not with the current semantics that numerics are converted to float8 for comparison to a float8. Because of the limited accuracy of float8, this means there are distinct numeric values that will compare equal to the same float8 value, and thus the transitive law fails.

It should be fairly clear why a btree index requires these laws to hold within a single datatype: without them there is no ordering to arrange the keys with. Also, index searches using a key of a different datatype require comparisons to behave sanely across two datatypes. The extensions to three or more datatypes within a family are not strictly required by the btree index mechanism itself, but the planner relies on them for optimization purposes.
