Paper Reading: R-TREES. A DYNAMIC INDEX STRUCTURE, with Translation

Original post: http://blog.chinaunix.net/uid-7426920-id-2627754.html

Abstract

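In order to handle spatial data efficiently, as required in computer aided design and geo-data applications, a database system needs an index mechanism that will help it retrieve data items quickly according to their spatial locations. However, traditional indexing methods are not well suited to data objects of non-zero size located in multi-dimensional spaces. In this paper we describe a dynamic index structure called an R-tree which meets this need, and give algorithms for searching and updating it. We present the results of a series of tests which indicate that the structure performs well, and conclude that it is useful for current database systems in spatial applications.

In computer-aided design and geographic data applications, handling spatial data efficiently requires an index mechanism that lets the database retrieve data items quickly by spatial location. Traditional indexing methods, however, do not cope well with data objects of non-zero size (objects that have extent rather than being single points) in multi-dimensional space. This paper describes a dynamic index structure, the R-tree, that handles such objects, and gives algorithms for searching and updating it. A series of test results shows that the structure performs well, which leads to the conclusion that the R-tree is useful for spatial applications in current database systems.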

1 Introduction

Spatial data objects often cover areas in multi-dimensional spaces and are not well represented by point locations. For example, map objects like counties, census tracts, etc. occupy regions of non-zero size in two dimensions. A common operation on spatial data is a search for all objects in an area, for example to find all counties that have land within 20 miles of a particular point. This kind of spatial search occurs frequently in computer aided design (CAD) and geo-data applications, and therefore it is important to be able to retrieve objects efficiently according to their spatial location.

Spatial data objects usually cover regions in multi-dimensional space and cannot be described well by points alone. Map objects such as counties and census tracts, for instance, occupy regions of non-zero size in two dimensions. A common operation on spatial data is searching for all objects in an area, for example finding all counties that have land within 20 miles of a given point. Such searches are frequent in computer-aided design and geographic applications, so it matters that objects can be retrieved efficiently by spatial location.

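An index based on objects' spatial locations is desirable, but classical one-dimensional database indexing structures are not appropriate to multi-dimensional spatial searching. Structures based on exact matching of values, such as hash tables, are not useful because a range search is required. Structures using one-dimensional ordering of key values, such as B-trees and ISAM indexes, do not work because the search space is multi-dimensional.

An index built on the spatial locations of objects is desirable, but classical one-dimensional index structures are unsuited to multi-dimensional spatial searching. Structures based on exact value matching, such as hash tables, are of little use because range searches are required. Structures that rely on a one-dimensional ordering of keys, such as B-trees and ISAM indexes, also fail because the search space is multi-dimensional.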

A number of structures have been proposed for handling multi-dimensional point data, and a survey of methods can be found in [5]. Cell methods [4, 8, 16] are not good for dynamic structures because the cell boundaries must be decided in advance. Quad trees [7] and k-d trees do not take paging of secondary memory into account. K-D-B trees [13] are designed for paged memory but are useful only for point data. The use of index intervals has been suggested in [15], but this method cannot be used in multiple dimensions. Corner stitching [12] is an example of a structure for two-dimensional spatial searching suitable for data objects of non-zero size, but it assumes homogeneous primary memory and is not efficient for random searches in very large collections of data. Grid files [10] handle non-point data by mapping each object to a point in a higher-dimensional space. In this paper we describe an alternative structure called an R-tree which represents data objects by intervals in several dimensions.

Section 2 outlines the structure of an R-tree and Section 3 gives algorithms for searching, inserting, deleting, and updating. Results of R-tree index performance tests are presented in Section 4. Section 5 contains a summary of our conclusions.

A number of structures have been proposed for multi-dimensional point data, and a survey of the methods appears in [5]. Cell methods do not suit dynamic structures because the cell boundaries must be fixed in advance. Quad trees and k-d trees do not account for paging of secondary memory. K-D-B trees are designed for paged memory but work only for point data. Index intervals are suggested in [15], but that method cannot be used in multiple dimensions. Corner stitching supports two-dimensional spatial searching over objects of non-zero size, but it assumes homogeneous primary memory and is inefficient for random searches in very large data collections. Grid files handle non-point data by mapping each object to a point in a higher-dimensional space. This paper describes an alternative structure, the R-tree, which represents data objects by intervals in several dimensions.

Section 2 outlines the structure of the R-tree, and Section 3 gives the algorithms for searching, inserting, deleting, and updating. Section 4 presents the results of the R-tree performance tests, and Section 5 summarizes the conclusions.

2 R-tree Index Structure

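An R-tree is a height-balanced tree similar to a B-tree [2, 6] with index records in its leaf nodes containing pointers to data objects. Nodes correspond to disk pages if the index is disk-resident, and the structure is designed so that a spatial search requires visiting only a small number of nodes. The index is completely dynamic; inserts and deletes can be intermixed with searches and no periodic reorganization is required.

An R-tree is a height-balanced tree, similar to a B-tree, whose leaf nodes hold index records containing pointers to the corresponding data objects. If the index is disk-resident, each node corresponds to a disk page. The structure is designed so that a spatial search visits only a small number of nodes. The index is fully dynamic: inserts and deletes can be interleaved with searches, and no periodic reorganization is needed.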

A spatial database consists of a collection of tuples representing spatial objects, and each tuple has a unique identifier which can be used to retrieve it. Leaf nodes in an R-tree contain index record entries of the form (I, tuple-identifier) where tuple-identifier refers to a tuple in the database and I is an n-dimensional rectangle which is the bounding box of the spatial object indexed: I = (I₀, I₁, ..., I_n-1). Here n is the number of dimensions and I_i is a closed bounded interval [a, b] describing the extent of the object along dimension i. Alternatively I_i may have one or both endpoints equal to infinity, indicating that the object extends outward indefinitely. Non-leaf nodes contain entries of the form (I, child-pointer) where child-pointer is the address of a lower node in the R-tree and I covers all rectangles in the lower node’s entries.

A spatial database is a collection of tuples representing spatial objects, and each tuple has a unique identifier through which it can be retrieved. Leaf nodes of an R-tree contain index records of the form (I, tuple-identifier), where tuple-identifier refers to a tuple in the database and I = (I₀, I₁, ..., I_n-1) is the n-dimensional bounding rectangle of the indexed spatial object. Each I_i is the closed interval [a, b] giving the object's extent along dimension i. Non-leaf nodes contain entries of the form (I, child-pointer), where child-pointer is the address of a node one level down and I covers all the rectangles in that node's entries.
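The rectangle I described above can be sketched as a tuple of closed intervals, one per dimension. This is only an illustration: the class name `Rect` and its methods are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from math import inf
from typing import Tuple

Interval = Tuple[float, float]  # closed interval [a, b]; an endpoint may be +/-inf


@dataclass(frozen=True)
class Rect:
    """Hypothetical n-dimensional rectangle I = (I_0, ..., I_{n-1})."""
    intervals: Tuple[Interval, ...]

    def overlaps(self, other: "Rect") -> bool:
        # Two rectangles overlap when their intervals intersect in every dimension.
        return all(a1 <= b2 and a2 <= b1
                   for (a1, b1), (a2, b2) in zip(self.intervals, other.intervals))

    def union(self, other: "Rect") -> "Rect":
        # Smallest rectangle covering both inputs.
        return Rect(tuple((min(a1, a2), max(b1, b2))
                          for (a1, b1), (a2, b2) in zip(self.intervals, other.intervals)))

    def area(self) -> float:
        # Product of the interval lengths (a volume when n > 2).
        result = 1.0
        for a, b in self.intervals:
            result *= (b - a)
        return result


# An object that extends outward indefinitely along dimension 0.
unbounded = Rect(((0.0, inf), (0.0, 1.0)))
```

A non-leaf entry's rectangle is then just the repeated `union` of the rectangles in the child node.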

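Let M be the maximum number of entries that will fit in one node and let m ≤ M/2 be a parameter specifying the minimum number of entries in a node. An R-tree satisfies the following properties.

Let M be the maximum number of entries in a node, and let m ≤ M/2 specify the minimum number of entries in a node. An R-tree satisfies the following properties: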

(1) Every leaf node contains between m and M index records unless it is the root

(2) For each index record (I, tuple-identifier) in a leaf node, I is the smallest rectangle that spatially contains the n-dimensional data object represented by the indicated tuple.

(3) Every non-leaf node has between m and M children unless it is the root

(4) For each entry (I, child-pointer) in a non-leaf node, I is the smallest rectangle that spatially contains the rectangles in the child node.

(5) The root node has at least two children unless it is a leaf.

(6) All leaves appear on the same level.

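(1) Every leaf node holds between m and M index records, unless it is the root.

(2) For each index record (I, tuple-identifier) in a leaf node, I is the smallest rectangle that contains the spatial object represented by the tuple.

(3) Every non-leaf node has between m and M children, unless it is the root.

(4) For each entry (I, child-pointer) in a non-leaf node, I is the smallest rectangle that contains the rectangles in the child node.

(5) The root has at least two children, unless the root is itself a leaf.

(6) All leaves are on the same level.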

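Figures 2.1a and 2.1b show the structure of an R-tree and illustrate the containment and overlap relationships among its rectangles. The height of an R-tree containing N index records is at most ⌈log_m N⌉ - 1, because the branching factor of each node is at least m. The maximum number of nodes is therefore ⌈N/m⌉ + ⌈N/m²⌉ + ... + 1.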
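To make the two bounds concrete, here is a small sketch that evaluates them with integer arithmetic. The function names are hypothetical, and it assumes the brackets in the bounds denote ceilings and that m ≥ 2.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def max_height(n_records: int, m: int) -> int:
    """Evaluate ceil(log_m N) - 1 without floating-point logarithms."""
    h, power = 0, 1
    while power < n_records:   # find the smallest h with m**h >= N
        power *= m
        h += 1
    return max(h - 1, 0)


def max_nodes(n_records: int, m: int) -> int:
    """Evaluate ceil(N/m) + ceil(N/m^2) + ... + 1."""
    total, divisor = 0, m
    while True:
        level = ceil_div(n_records, divisor)
        total += level
        if level == 1:         # the final term is the root
            break
        divisor *= m
    return total
```

For example, with N = 1000 records and m = 10 the sum is 100 + 10 + 1 = 111 nodes, and the height bound evaluates to 2.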

3. Searching and Updating

3.1 Searching

The search algorithm descends the tree from the root in a manner similar to a B-tree. However, more than one subtree under a node visited may need to be searched; hence it is not possible to guarantee good worst-case performance. Nevertheless, with most kinds of data the update algorithms will maintain the tree in a form that allows the search algorithm to eliminate irrelevant regions of the indexed space and examine only data near the search area.

Searching an R-tree also starts at the root and resembles a B-tree search, except that several subtrees may have to be visited, so good worst-case performance cannot be guaranteed. For most kinds of data, though, the update algorithms keep the tree in a form that lets the search prune irrelevant regions and examine only data near the search area.

Algorithm Search: Given an R-tree whose root node is T, find all index records whose rectangles overlap a search rectangle S.

Given a tree rooted at T, find the index records whose rectangles overlap the search rectangle S.

In the following we denote the rectangle part of an index entry E by E.I, and the tuple-identifier or child-pointer part by E.p.

(1) [Search subtrees]

If T is not a leaf, check each entry E to determine whether E.I overlaps S. For all overlapping entries, invoke Search on the tree whose root node is pointed to by E.p.

(2) [Search leaf node]

If T is a leaf, check all entries E to determine whether E.I overlaps S. If so, E is a qualifying record.

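In what follows, E.I denotes the rectangle part of an entry E, and E.p denotes its tuple-identifier or child-pointer.

(1) [Search subtrees]

If T is not a leaf, check each entry E to see whether E.I overlaps S. For every overlapping entry, call Search on the subtree whose root E.p points to.

(2) [Search leaf node]

If T is a leaf, check every entry E to see whether E.I overlaps S. If it does, E is a qualifying record.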
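The two search steps can be sketched recursively. The `Entry` and `Node` classes below are hypothetical in-memory stand-ins (the paper's nodes are disk pages), and rectangles are assumed to be tuples of (low, high) intervals.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

Rect = Tuple[Tuple[float, float], ...]  # one (low, high) interval per dimension


def overlaps(r: Rect, s: Rect) -> bool:
    # Rectangles overlap when their intervals intersect in every dimension.
    return all(a1 <= b2 and a2 <= b1 for (a1, b1), (a2, b2) in zip(r, s))


@dataclass
class Entry:
    I: Rect   # covering rectangle
    p: Any    # tuple-identifier in a leaf, child Node otherwise


@dataclass
class Node:
    is_leaf: bool
    entries: List[Entry] = field(default_factory=list)


def search(T: Node, S: Rect) -> List[Any]:
    """Return the tuple-identifiers of all records whose rectangles overlap S."""
    results: List[Any] = []
    for E in T.entries:
        if not overlaps(E.I, S):
            continue                           # prune: nothing below E can qualify
        if T.is_leaf:
            results.append(E.p)                # step (2): qualifying record
        else:
            results.extend(search(E.p, S))     # step (1): descend into the subtree
    return results
```

Because several entries of one node may overlap S, the recursion can follow more than one branch, which is why no good worst-case bound exists.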

3.2 Insertion

Inserting index records for new data tuples is similar to insertion in a B-tree in that new index records are added to the leaves, nodes that overflow are split, and splits propagate up the tree.

Index records for new tuples are inserted much as in a B-tree: new records are added to the leaves, nodes that overflow are split, and the splits propagate up the tree.

Algorithm Insert: Insert a new index entry E into an R-tree

Algorithm Insert: insert a new index entry E into an R-tree.

(1) [Find position for new record]

Invoke ChooseLeaf to select a leaf node L in which to place E

(2) [Add record to leaf node]

If L has room for another entry, install E. Otherwise invoke SplitNode to obtain L and LL containing E and all the old entries of L.

(3) [Propagate changes upward]

Invoke AdjustTree on L, also passing LL if a split was performed.

(4) [Grow tree taller]

If node split propagation caused the root to split, create a new root whose children are the two resulting nodes.

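(1) [Find position for new record]

Call ChooseLeaf to select the leaf node L in which to place E.

(2) [Add record to leaf node]

If L has room, install E. Otherwise call SplitNode to obtain two leaf nodes L and LL that together hold E and all the old entries of L.

(3) [Propagate changes upward]

Call AdjustTree on L, passing LL as well if a split occurred.

(4) [Grow tree taller]

If split propagation caused the root to split, create a new root whose two children are the nodes resulting from that split.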

Algorithm ChooseLeaf: Select a leaf node in which to place a new index entry E.

Algorithm ChooseLeaf: select a leaf node in which to place a new index entry E.

(1) [Initialize] Set N to be the root node

(2) [Leaf check] if N is a leaf, return N

(3) [Choose subtree] If N is not a leaf, let F be the entry in N whose rectangle F.I needs least enlargement to include E.I. Resolve ties by choosing the entry with the rectangle of smallest area.

(4) [Descend until a leaf is reached] Set N to be the child node pointed to by F.p and repeat from (2).

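(1) [Initialize] Set N to the root node.

(2) [Leaf check] If N is a leaf, return N.

(3) [Choose subtree] If N is not a leaf, let F be the entry in N whose rectangle F.I needs the least enlargement to include E.I. Break ties by choosing the entry whose rectangle has the smallest area.

(4) [Descend until a leaf is reached] Set N to the child node that F.p points to, then go back to step (2).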
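A minimal sketch of ChooseLeaf follows, assuming rectangles are tuples of (low, high) intervals. The `Entry` and `Node` classes and the helper names are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

Rect = Tuple[Tuple[float, float], ...]


def area(r: Rect) -> float:
    result = 1.0
    for lo, hi in r:
        result *= (hi - lo)
    return result


def union(a: Rect, b: Rect) -> Rect:
    return tuple((min(a1, a2), max(b1, b2)) for (a1, b1), (a2, b2) in zip(a, b))


def enlargement(cover: Rect, new: Rect) -> float:
    # Extra area the covering rectangle needs in order to include `new`.
    return area(union(cover, new)) - area(cover)


@dataclass
class Entry:
    I: Rect
    p: Any    # child Node in a non-leaf node


@dataclass
class Node:
    is_leaf: bool
    entries: List[Entry] = field(default_factory=list)


def choose_leaf(root: Node, E_I: Rect) -> Node:
    N = root                                   # (1) start at the root
    while not N.is_leaf:                       # (2) stop once a leaf is reached
        # (3) least enlargement first, smallest area as the tie-breaker
        F = min(N.entries, key=lambda e: (enlargement(e.I, E_I), area(e.I)))
        N = F.p                                # (4) descend
    return N
```

Sorting on the pair (enlargement, area) expresses the tie-breaking rule of step (3) in a single `min` call.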

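Algorithm AdjustTree: Ascend from a leaf node L to the root, adjusting covering rectangles and propagating node splits as necessary.

Algorithm AdjustTree: ascend from the leaf node L to the root, adjusting covering rectangles along the way and propagating node splits upward where necessary.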

(1) [Initialize] Set N = L. If L was split previously, set NN to be the resulting second node.

(2) [Check if done] If N is the root, stop.

(3) [Adjust covering rectangle in parent entry] Let P be the parent node of N, and let E_N be N’s entry in P. Adjust E_N.I so that it tightly encloses all entry rectangles in N.

(4) [Propagate node split upward] If N has a partner NN resulting from an earlier split, create a new entry E_NN with E_NN.p pointing to NN and E_NN.I enclosing all rectangles in NN. Add E_NN to P if there is room. Otherwise, invoke SplitNode to produce P and PP containing E_NN and all P’s old entries.

(5) [Move up to next level] Set N=P and set NN=PP if a split occurred. Repeat from (2).

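(1) [Initialize]

Set N = L. If L was split, let NN be the second node produced by the split.

(2) [Check if done]

If N is the root, stop.

(3) [Adjust covering rectangle in parent entry]

Let P be the parent of N and let E_N be N's entry in P. Adjust E_N.I so that it tightly encloses all the entry rectangles in N.

(4) [Propagate node split upward]

If a split left N with a partner node NN, create a new entry E_NN whose E_NN.p points to NN and whose E_NN.I tightly encloses all the rectangles in NN. If P has room, add E_NN to it. If P is full, call SplitNode on E_NN together with all of P's old entries, producing P and PP.

(5) [Move up to next level]

Set N = P, and set NN = PP if a split occurred. Repeat from step (2).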

Node Splitting

In order to add a new entry to a full node containing M entries, it is necessary to divide the collection of M+1 entries between two nodes. The division should be done in a way that makes it as unlikely as possible that both new nodes will need to be examined on subsequent searches. Since the decision whether to visit a node depends on whether its covering rectangle overlaps the search area, the total area of the two covering rectangles after a split should be minimized. Figure 3.1 illustrates this point. The area of the covering rectangles in the “bad split” case is much larger than in the “good split” case.

The same criterion was used in procedure ChooseLeaf to decide where to insert a new index entry: at each level in the tree, the subtree chosen was the one whose covering rectangle would have to be enlarged least.

We now turn to algorithms for partitioning the set of M + 1 entries into two groups, one for each new node.

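Node splitting:

To add a new entry to a full node, the M+1 entries must be divided between two nodes. The division should make it as unlikely as possible that both new nodes have to be examined in later searches. Whether a node is visited depends on whether its covering rectangle overlaps the search area, so the total area of the two covering rectangles after the split should be minimized. Figure 3.1 illustrates this: the covering rectangles in the "bad split" case take up far more area than in the "good split" case.

The same criterion is used in ChooseLeaf, which decides at each level of the tree where a new index entry should go: the subtree chosen is the one whose covering rectangle needs the least enlargement.

We now turn to several algorithms for dividing the M+1 entries into two groups, one for each new node.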

3.5.2 A Quadratic-Cost Algorithm

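This algorithm attempts to find a small-area split, but is not guaranteed to find one with the smallest area possible. The cost is quadratic in M and linear in the number of dimensions. The algorithm picks two of the M+1 entries to be the first elements of the two new groups by choosing the pair that would waste the most area if both were put in the same group, i.e. the area of a rectangle covering both entries, minus the areas of the entries themselves, would be greatest. The remaining entries are then assigned to groups one at a time. At each step the area expansion required to add each remaining entry to each group is calculated, and the entry assigned is the one showing the greatest difference between the two groups.

This algorithm tries to find a small-area split but does not guarantee the smallest possible area. Its cost is quadratic in M and linear in the number of dimensions. It picks two of the M+1 entries as the first elements of the two new groups, choosing the pair that would waste the most area if placed in the same group. The remaining entries are then assigned one at a time: at each step the area expansion needed to add each remaining entry to each group is computed, and the entry with the greatest difference between the two groups is assigned next.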

Algorithm Quadratic Split. Divide a set of M+1 index entries into two groups.

(1) [Pick first entry for each group] Apply Algorithm PickSeeds to choose two entries to be the first elements for the groups. Assign each to a group.

(2) [Check if done] If all entries have been assigned, stop. If one group has so few entries that all the rest must be assigned to it in order for it to have the minimum number m, assign them and stop.

(3) [Select entry to assign] Invoke Algorithm PickNext to choose the next entry to assign. Add it to the group whose covering rectangle will have to be enlarged least to accommodate it. Resolve ties by adding the entry to the group with smaller area, then to the one with fewer entries, then to either. Repeat from (2).

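Quadratic split: divide a set of M+1 index entries into two groups.

(1) [Pick first entry for each group] Call PickSeeds to choose two entries as the first elements of the two groups, and assign one to each group.

(2) [Check if done] If all entries have been assigned, stop. If one group has so few entries that all the remaining ones must go to it for it to reach the minimum m, assign them to it and stop.

(3) [Select entry to assign] Call PickNext to choose the next entry to assign. Add it to the group whose covering rectangle needs the least enlargement to hold it. Break ties by choosing the group with the smaller area, then the group with fewer entries, then either one. Repeat from (2).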

Algorithm PickSeeds: Select two entries to be the first elements of the (two) groups

Algorithm PickSeeds: select two entries to be the first elements of the two groups.

(1) [Calculate inefficiency of grouping entries together] For each pair of entries E₁ and E₂, compose a rectangle J including E₁.I and E₂.I. Calculate d = area(J) - area(E₁.I) - area(E₂.I).

(2) [Choose the most wasteful pair] Choose the pair with the largest d.

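(1) [Calculate inefficiency of grouping entries together] For each pair of entries E₁ and E₂, form the rectangle J that contains E₁.I and E₂.I, and compute the wasted area d = area(J) - area(E₁.I) - area(E₂.I).

(2) [Choose the most wasteful pair] Choose the pair with the largest d.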
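The pairwise comparison in PickSeeds can be sketched as below. Rectangles are assumed to be tuples of (low, high) intervals, and the helper names are hypothetical.

```python
from itertools import combinations
from typing import Sequence, Tuple

Rect = Tuple[Tuple[float, float], ...]


def area(r: Rect) -> float:
    result = 1.0
    for lo, hi in r:
        result *= (hi - lo)
    return result


def union(a: Rect, b: Rect) -> Rect:
    return tuple((min(a1, a2), max(b1, b2)) for (a1, b1), (a2, b2) in zip(a, b))


def waste(r1: Rect, r2: Rect) -> float:
    # d = area(J) - area(E1.I) - area(E2.I), where J covers both rectangles
    return area(union(r1, r2)) - area(r1) - area(r2)


def pick_seeds(rects: Sequence[Rect]) -> Tuple[int, int]:
    """Return the indices of the most wasteful pair (needs at least two rectangles)."""
    return max(combinations(range(len(rects)), 2),
               key=lambda ij: waste(rects[ij[0]], rects[ij[1]]))
```

Examining every pair is what makes this step quadratic in the number of entries.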

Algorithm PickNext: Select one remaining entry for classification in a group

Algorithm PickNext: select one of the remaining entries to be placed in a group.

(1) [Determine cost of putting each entry in each group] For each entry E not yet in a group, calculate d₁ = the area increase required in the covering rectangle of Group 1 to include E I. Calculate d₂ similarly for Group 2.

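(2) [Find entry with greatest preference for one group] Choose any entry with the maximum difference between d₁ and d₂.

(1) [Determine cost of putting each entry in each group] For each entry E not yet in a group, compute d₁, the area increase needed for the covering rectangle of Group 1 to include E.I, and compute d₂ likewise for Group 2.

(2) [Find entry with greatest preference for one group] Choose any entry for which the difference between d₁ and d₂ is largest.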
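PickNext can be sketched as follows, taking the two groups' current covering rectangles as inputs. The helper names are hypothetical, and rectangles are assumed to be tuples of (low, high) intervals.

```python
from typing import Sequence, Tuple

Rect = Tuple[Tuple[float, float], ...]


def area(r: Rect) -> float:
    result = 1.0
    for lo, hi in r:
        result *= (hi - lo)
    return result


def union(a: Rect, b: Rect) -> Rect:
    return tuple((min(a1, a2), max(b1, b2)) for (a1, b1), (a2, b2) in zip(a, b))


def enlargement(cover: Rect, new: Rect) -> float:
    # Area increase needed for `cover` to include `new`.
    return area(union(cover, new)) - area(cover)


def pick_next(remaining: Sequence[Rect], cover1: Rect, cover2: Rect) -> int:
    """Return the index of the unassigned rectangle with the strongest group preference."""
    def preference(r: Rect) -> float:
        d1 = enlargement(cover1, r)   # cost of joining Group 1
        d2 = enlargement(cover2, r)   # cost of joining Group 2
        return abs(d1 - d2)
    return max(range(len(remaining)), key=lambda i: preference(remaining[i]))
```

An entry that sits midway between the groups has a difference near zero and is left for later, while one that clearly belongs to a group is assigned early.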

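3.5.3 A Linear-Cost Algorithm

This algorithm is linear in M and in the number of dimensions. Linear Split is identical to Quadratic Split, but uses a different version of PickSeeds. PickNext simply chooses any of the remaining entries.

Linear-cost algorithm:

The cost of this algorithm is linear in M and in the number of dimensions. Linear Split is the same as Quadratic Split except that it uses a different version of PickSeeds, and PickNext simply picks any of the remaining entries.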

Algorithm LinearPickSeeds: Select two entries to be the first elements of the groups

Algorithm LinearPickSeeds: select two entries to be the first elements of the two groups.

(1) [Find extreme rectangles along all dimensions] Along each dimension, find the entry whose rectangle has the highest low side, and the one with the lowest high side. Record the separation.

(2) [Adjust for shape of the rectangle cluster] Normalize the separations by dividing by the width of the entire set along the corresponding dimension

(3) [Select the most extreme pair] Choose the pair with the greatest normalized separation along any dimension.

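(1) [Find extreme rectangles along all dimensions] Along each dimension, find the entry whose rectangle has the highest low side and the entry whose rectangle has the lowest high side. Record the separation between them.

(2) [Adjust for shape of the rectangle cluster] Normalize each separation by dividing it by the width of the entire set along the corresponding dimension.

(3) [Select the most extreme pair] Choose the pair with the greatest normalized separation along any dimension.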
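The three steps of LinearPickSeeds can be sketched as a single pass per dimension. Rectangles are assumed to be tuples of (low, high) intervals. Skipping degenerate dimensions and the final fallback are assumptions of this sketch that the text above does not specify.

```python
from typing import Optional, Sequence, Tuple

Rect = Tuple[Tuple[float, float], ...]


def linear_pick_seeds(rects: Sequence[Rect]) -> Tuple[int, int]:
    """Return the indices of two seed rectangles (needs at least two rectangles)."""
    best: Optional[Tuple[float, int, int]] = None   # (normalized separation, i, j)
    indices = range(len(rects))
    for d in range(len(rects[0])):
        # (1) extreme rectangles along dimension d
        highest_low = max(indices, key=lambda i: rects[i][d][0])
        lowest_high = min(indices, key=lambda i: rects[i][d][1])
        # (2) normalize by the width of the whole set along d
        width = max(r[d][1] for r in rects) - min(r[d][0] for r in rects)
        if width == 0 or highest_low == lowest_high:
            continue   # assumption: skip dimensions that yield no usable pair
        separation = (rects[highest_low][d][0] - rects[lowest_high][d][1]) / width
        # (3) keep the most extreme pair seen in any dimension
        if best is None or separation > best[0]:
            best = (separation, lowest_high, highest_low)
    if best is None:
        return (0, 1)  # assumption: arbitrary fallback when every dimension is degenerate
    return best[1], best[2]
```

Each dimension is scanned a constant number of times, so the cost is linear in both the number of entries and the number of dimensions.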