SpatialHadoop 2.3 Source Code Reading (13): RTreeGridOutputFormat & RTreeGridRecordWriter & RTree [RTree Index MapReduce]

Latest recommended article published on 2017-04-04 12:24:00

flyhaifeng

Category: spatialhadoop

Article link: https://blog.csdn.net/flyhaifeng/article/details/50387592

These classes are related as follows: RTreeGridOutputFormat creates an edu.umn.cs.spatialHadoop.mapred.RTreeGridRecordWriter, which extends edu.umn.cs.spatialHadoop.core.RTreeGridRecordWriter. The mapred class simply delegates to the corresponding methods of its parent, so this article focuses on edu.umn.cs.spatialHadoop.core.RTreeGridRecordWriter. For the other two classes, see the corresponding parts of the previous articles.

1. RTreeGridRecordWriter

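1.1 Constructor

RTreeGridRecordWriter extends GridRecordWriter, which was covered in the previous article. Its constructor first calls the parent constructor and then initializes its own member variables, including whether the RTree is built in fast mode and the maximum amount of data stored at one time.

1.2 writeInternal

RTreeGridRecordWriter overrides the parent's writeInternal method, which is called from the parent's write method.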

 protected synchronized void writeInternal(int cellIndex, S shape)
      throws IOException {
    if (cellIndex < 0) {
      // This indicates a close cell command
      super.writeInternal(cellIndex, shape);
      return;
    }
    // Convert to text representation to test new file size
    text.clear();
    shape.toText(text);
    // Check if inserting this object will increase the degree of the R-tree
    // above the threshold
    int new_data_size =
        intermediateCellSize[cellIndex] + text.getLength() + NEW_LINE.length;
    int bytes_available = (int) (blockSize - 8 - new_data_size);
    if (bytes_available < maximumStorageOverhead) {
      // Check if writing this new record will take storage overhead beyond the
      // available bytes in the block
      int degree = 4096 / RTree.NodeSize;
      int rtreeStorageOverhead =
          RTree.calculateStorageOverhead(intermediateCellRecordCount[cellIndex], degree);
      if (rtreeStorageOverhead > bytes_available) {
        LOG.info("Early flushing an RTree with data "+
            intermediateCellSize[cellIndex]);
        // Writing this element will get the degree above the threshold
        // Flush current file and start a new file
        super.writeInternal(-cellIndex, null);
      }
    }
    
    super.writeInternal(cellIndex, shape);
  }

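Lines 3-6: in the reduce step of the RTree Index MapReduce job, once all data of a cell has been processed, a key-value pair whose key is -cellID is always emitted to mark that cell as finished. This branch handles that final step for a cell and simply calls the parent implementation.

Lines 13-29: decide whether the current output file has to be closed and a new one started. If the bytes left in the block after adding the new record drop below maximumStorageOverhead, the R-tree storage overhead for the records written so far is estimated; if it no longer fits, the parent method is called with -cellIndex to flush the current file. An output file is never larger than one block, so a single cellID can produce several output files.

Line 31: call the parent method to write the record.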
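The early-flush decision in lines 13-29 can be isolated into a small function. The following is a minimal sketch, not the SpatialHadoop implementation: the class name, the method name and the `indexOverhead` parameter (standing in for the value returned by `RTree.calculateStorageOverhead`) are assumptions made for illustration.

```java
final class EarlyFlushSketch {
  // The snippet above subtracts 8 bytes from the block size before comparing.
  static final int RESERVED_BYTES = 8;

  /**
   * Returns true if the current file of a cell should be closed before the
   * new record is appended. recordBytes includes the trailing newline.
   * indexOverhead is a hypothetical stand-in for the R-tree overhead estimate.
   */
  static boolean needsEarlyFlush(long blockSize, int currentBytes, int recordBytes,
      int maximumStorageOverhead, int indexOverhead) {
    int newDataSize = currentBytes + recordBytes;
    int bytesAvailable = (int) (blockSize - RESERVED_BYTES - newDataSize);
    if (bytesAvailable >= maximumStorageOverhead) {
      // Plenty of room left: skip the more expensive overhead estimate.
      return false;
    }
    // Flush only if the index would no longer fit next to the data.
    return indexOverhead > bytesAvailable;
  }

  public static void main(String[] args) {
    // 1000 - 8 - (900 + 50) = 42 bytes left in the block.
    System.out.println(needsEarlyFlush(1000, 900, 50, 100, 60)); // true
    System.out.println(needsEarlyFlush(1000, 900, 50, 100, 40)); // false
  }
}
```

The two-stage check mirrors the original code: the cheap comparison against maximumStorageOverhead filters out most records, and the exact estimate is only needed when the block is nearly full.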

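1.3 flushAllEntries

RTreeGridRecordWriter overrides the parent's getIntermediateCellStream, getFinalCellPath and flushAllEntries methods.

In the parent implementation, getIntermediateCellStream and getFinalCellPath produce the same output path.

In the overriding implementation, getIntermediateCellStream first creates a temporary file whose content is identical to the final output of the Grid Index MapReduce job, while getFinalCellPath returns the real final output path. Turning the data in the temporary file into an RTree index is the job of the overridden flushAllEntries method, described next.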

protected Path flushAllEntries(Path intermediateCellPath,
      OutputStream intermediateCellStream, Path finalCellPath) throws IOException {
    // Close stream to current intermediate file.
    intermediateCellStream.close();

    // Read all data of the written file in memory
    byte[] cellData = new byte[(int) new File(intermediateCellPath.toUri()
        .getPath()).length()];
    InputStream cellIn = new FileInputStream(intermediateCellPath.toUri()
        .getPath());
    cellIn.read(cellData);
    cellIn.close();

    // Build an RTree over the elements read from file
    RTree<S> rtree = new RTree<S>();
    rtree.setStockObject((S) stockObject.clone());
    // It should create a new stream
    DataOutputStream cellStream =
      (DataOutputStream) createFinalCellStream(finalCellPath);
    cellStream.writeLong(SpatialSite.RTreeFileMarker);
    int degree = 4096 / RTree.NodeSize;
    rtree.bulkLoadWrite(cellData, 0, cellData.length, degree, cellStream,
        fastRTree);
    cellStream.close();
    cellData = null; // To allow GC to collect it
    
    return finalCellPath;
  }

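Line 4: close the output stream of the intermediate file.

Lines 7-12: read the whole intermediate file into memory.

Lines 15-25: build an RTree index over the data of that file and write it to the final output file. The key function is bulkLoadWrite.

Line 21: this division computes the degree of the R-tree, that is, the maximum number of children per node (not the number of nodes per level). The 4096 is most likely a 4 KB block size: a comment inside bulkLoadWrite states that the degree is chosen so that a node fits in one file block, assumed to be 4K.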
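One detail in lines 7-12 is worth a note: the contract of `InputStream.read(byte[])` only promises to read up to `b.length` bytes, so a single call is not guaranteed to fill the buffer. The sketch below shows a stricter variant based on `DataInputStream.readFully`. It is an illustration under assumptions, not the code used by SpatialHadoop; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class ReadFullySketch {
  /**
   * Reads exactly length bytes from the stream and closes it.
   * readFully loops internally and throws EOFException if the stream
   * ends before the buffer is full.
   */
  static byte[] readExactly(InputStream in, int length) {
    byte[] data = new byte[length];
    try (DataInputStream din = new DataInputStream(in)) {
      din.readFully(data);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return data;
  }

  public static void main(String[] args) {
    byte[] cellData = readExactly(new ByteArrayInputStream(new byte[] {1, 2, 3}), 3);
    System.out.println(cellData.length);
  }
}
```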

2. RTree

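2.1 bulkLoadWrite

bulkLoadWrite is where the RTree index is actually built. Its parameters are: element_bytes, the data to be indexed; offset, the start position; len, the length; degree, the maximum number of children per node; dataOut, the output stream; and fast_sort, a flag that selects the fast build mode. The function is described below part by part.

Part 1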

 // Count number of elements in the given text
      int i_start = offset;
      final Text line = new Text();
      while (i_start < offset + len) {
        int i_end = skipToEOL(element_bytes, i_start);
        // Extract the line without end of line character
        line.set(element_bytes, i_start, i_end - i_start - 1);
        stockObject.fromText(line);
        elementCount++;
        i_start = i_end;
      }
      LOG.info("Bulk loading an RTree with "+elementCount+" elements");
      
      // It turns out the findBestDegree returns the best degree when the whole
      // tree is loaded to memory when processed. However, as current algorithms
      // process the tree while it's on disk, a higher degree should be selected
      // such that a node fits one file block (assumed to be 4K).
      //final int degree = findBestDegree(bytesAvailable, elementCount);
      LOG.info("Writing an RTree with degree "+degree);
      
      int height = Math.max(1, 
          (int) Math.ceil(Math.log(elementCount)/Math.log(degree)));
      int leafNodeCount = (int) Math.pow(degree, height - 1);
      if (elementCount <  2 * leafNodeCount && height > 1) {
        height--;
        leafNodeCount = (int) Math.pow(degree, height - 1);
      }
      int nodeCount = (int) ((Math.pow(degree, height) - 1) / (degree - 1));
      int nonLeafNodeCount = nodeCount - leafNodeCount;

      // Keep track of the offset of each element in the text
      final int[] offsets = new int[elementCount];
      final double[] xs = fast_sort? new double[elementCount] : null;
      final double[] ys = fast_sort? new double[elementCount] : null;
      
      i_start = offset;
      line.clear();
      for (int i = 0; i < elementCount; i++) {
        offsets[i] = i_start;
        int i_end = skipToEOL(element_bytes, i_start);
        if (xs != null) {
          // Extract the line with end of line character
          line.set(element_bytes, i_start, i_end - i_start - 1);
          stockObject.fromText(line);
          // Sample center of the shape
          xs[i] = (stockObject.getMBR().x1 + stockObject.getMBR().x2) / 2;
          ys[i] = (stockObject.getMBR().y1 + stockObject.getMBR().y2) / 2;
        }
        i_start = i_end;
      }

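Lines 1-11: count the number of records in the input data.

Lines 21-29: compute the height of the R-tree, the number of leaf nodes, the total number of nodes and the number of non-leaf nodes. The height is computed with the change-of-base formula, log_degree(n) = ln(n) / ln(degree).

Lines 32-49: offsets stores the start position of each record in the input data; when fast_sort is set, xs and ys store the center coordinates of each record's minimum bounding rectangle (MBR).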
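The arithmetic of lines 21-29 is easier to follow with concrete numbers. The sketch below repeats the same formulas in a standalone method; the class name and the array-based return value are assumptions made for illustration. For 100 elements and degree 4, the first estimate is height 4 with 64 leaves; since 100 < 2 * 64, the height is reduced to 3, giving 16 leaves and 1 + 4 + 16 = 21 nodes.

```java
final class TreeShapeSketch {
  /** Returns {height, leafNodeCount, nodeCount, nonLeafNodeCount}. */
  static int[] shape(int elementCount, int degree) {
    // Change of base: log_degree(n) = ln(n) / ln(degree)
    int height = Math.max(1,
        (int) Math.ceil(Math.log(elementCount) / Math.log(degree)));
    int leafNodeCount = (int) Math.pow(degree, height - 1);
    // Drop one level if the leaves would hold fewer than two records on average
    if (elementCount < 2 * leafNodeCount && height > 1) {
      height--;
      leafNodeCount = (int) Math.pow(degree, height - 1);
    }
    // Geometric series 1 + degree + degree^2 + ... + degree^(height-1)
    int nodeCount = (int) ((Math.pow(degree, height) - 1) / (degree - 1));
    return new int[] {height, leafNodeCount, nodeCount, nodeCount - leafNodeCount};
  }

  public static void main(String[] args) {
    // prints [3, 16, 21, 5]
    System.out.println(java.util.Arrays.toString(shape(100, 4)));
  }
}
```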

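Part 2

Inner class: SplitStruct

This class holds the information about one split.

index1 and index2 are the start index and the end index (exclusive) of the split within the input records.

direction indicates whether the split is along x or along y; it determines the sort order used by the partition function of this inner class.

offsetOfFirstElement is the start position of the split's first record on disk.

The class constants DIRECTION_X and DIRECTION_Y are the two possible values of the direction member.

The partition function first defines two sortables, sortableX and sortableY, which order records by ascending x and ascending y respectively. Depending on direction, the records between index1 and index2 are sorted by x or by y. After sorting, that range is cut into degree parts; a new SplitStruct is created for each part with its direction set to the opposite of the current one, and each new SplitStruct is added to the toBePartitioned parameter. The main code of partition is: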

final IndexedSorter sorter = new QuickSort();
          
          final IndexedSortable[] sortables = new IndexedSortable[2];
          sortables[SplitStruct.DIRECTION_X] = sortableX;
          sortables[SplitStruct.DIRECTION_Y] = sortableY;
          
          sorter.sort(sortables[direction], index1, index2);

          // Partition into maxEntries partitions (equally) and
          // create a SplitStruct for each partition
          int i1 = index1;
          for (int iSplit = 0; iSplit < degree; iSplit++) {
            int i2 = index1 + (index2 - index1) * (iSplit + 1) / degree;
            SplitStruct newSplit = new SplitStruct(i1, i2, (byte)(1 - direction));
            toBePartitioned.add(newSplit);
            i1 = i2;
          }
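The boundary formula in the loop above uses integer division, so the parts are as equal as possible without any separate remainder handling. A minimal sketch of just that arithmetic (class and method names are assumptions made for illustration):

```java
final class SplitBoundarySketch {
  /** Returns the exclusive end index of each of the degree parts of [index1, index2). */
  static int[] boundaries(int index1, int index2, int degree) {
    int[] ends = new int[degree];
    for (int iSplit = 0; iSplit < degree; iSplit++) {
      // Same expression as in partition: integer division rounds down
      ends[iSplit] = index1 + (index2 - index1) * (iSplit + 1) / degree;
    }
    return ends;
  }

  public static void main(String[] args) {
    // 10 records into 4 parts: prints [2, 5, 7, 10], i.e. sizes 2, 3, 2, 3
    System.out.println(java.util.Arrays.toString(boundaries(0, 10, 4)));
  }
}
```

Because the last boundary always evaluates to index2, no record is lost even when the range is not divisible by degree.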

Part 3

This part covers the splitting algorithm, which mainly consists of calls to the partition function described above. The code is:

 // All nodes stored in level-order traversal
      Vector<SplitStruct> nodes = new Vector<SplitStruct>();
      final Queue<SplitStruct> toBePartitioned = new LinkedList<SplitStruct>();
      toBePartitioned.add(new SplitStruct(0, elementCount, SplitStruct.DIRECTION_X));
      
      while (!toBePartitioned.isEmpty()) {
        SplitStruct split = toBePartitioned.poll();
        if (nodes.size() < nonLeafNodeCount) {
          // This is a non-leaf
          split.partition(toBePartitioned);
        }
        nodes.add(split);
      }
      
      if (nodes.size() != nodeCount) {
        throw new RuntimeException("Expected node count: "+nodeCount+". Real node count: "+nodes.size());
      }

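Lines 1-3: create the initial SplitStruct covering all records and add it to toBePartitioned.

Line 7: take the first SplitStruct off the queue.

Line 8: check whether the current number of nodes is still smaller than the required number of non-leaf nodes.

Line 10: call partition on the current SplitStruct, which splits it and appends the parts to toBePartitioned. The SplitStruct itself is then added to nodes (line 12). The loop runs until the queue is empty; once nodes holds nonLeafNodeCount entries, the remaining splits are no longer partitioned and are appended as leaves.

Summary: the algorithm starts with all data as the first level, which has a single node. That node is sorted by x and cut into degree parts, so the second level has degree nodes. The first level sorts by x, the second by y, the third by x again, and so on. When the number of non-leaf nodes reaches the required count, splitting stops and the remaining splits are added to nodes unchanged.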
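The whole breadth-first splitting procedure can be reproduced in a few lines. The sketch below is a simplified illustration, not the SpatialHadoop code: it sorts an index array with `Arrays.sort` instead of Hadoop's `QuickSort`/`IndexedSortable`, and all class, field and method names are assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

final class LevelOrderPartitionSketch {
  static final int DIRECTION_X = 0;
  static final int DIRECTION_Y = 1;

  /** A range [index1, index2) of the order array plus its sort direction. */
  static final class Split {
    final int index1, index2, direction;
    Split(int index1, int index2, int direction) {
      this.index1 = index1;
      this.index2 = index2;
      this.direction = direction;
    }
  }

  /** Reorders order[] in place and returns all nodes in level order. */
  static List<Split> build(double[] xs, double[] ys, Integer[] order,
      int degree, int nonLeafNodeCount) {
    List<Split> nodes = new ArrayList<>();
    Queue<Split> toBePartitioned = new ArrayDeque<>();
    toBePartitioned.add(new Split(0, order.length, DIRECTION_X));
    while (!toBePartitioned.isEmpty()) {
      Split split = toBePartitioned.poll();
      if (nodes.size() < nonLeafNodeCount) {
        // Non-leaf: sort its range by x or y, then cut it into degree parts
        final double[] keys = split.direction == DIRECTION_X ? xs : ys;
        Arrays.sort(order, split.index1, split.index2,
            (Integer a, Integer b) -> Double.compare(keys[a], keys[b]));
        int i1 = split.index1;
        for (int s = 0; s < degree; s++) {
          int i2 = split.index1 + (split.index2 - split.index1) * (s + 1) / degree;
          // Children alternate the direction
          toBePartitioned.add(new Split(i1, i2, 1 - split.direction));
          i1 = i2;
        }
      }
      nodes.add(split);
    }
    return nodes;
  }

  public static void main(String[] args) {
    double[] xs = {3, 1, 2, 0};
    double[] ys = {0, 0, 0, 0};
    Integer[] order = {0, 1, 2, 3};
    List<Split> nodes = build(xs, ys, order, 2, 1);
    System.out.println(nodes.size() + " nodes, order " + Arrays.toString(order));
  }
}
```

Using a FIFO queue is what makes the resulting list a level-order traversal: every node of one level is polled before any node of the next level.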

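Part 4

For every leaf node, this part records the position in the output file at which its first record will start, and computes the minimum bounding rectangle of all records the leaf contains. The records are written to a null output stream here, only to track positions; the real write happens later.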

// Now we have our data sorted in the required order. Start building
      // the tree.
      // Store the offset of each leaf node in the tree
      FSDataOutputStream fakeOut = null;
      try {
        fakeOut = new FSDataOutputStream(new java.io.OutputStream() {
          // Null output stream
          @Override
          public void write(int b) throws IOException {
            // Do nothing
          }
          @Override
          public void write(byte[] b, int off, int len) throws IOException {
            // Do nothing
          }
          @Override
          public void write(byte[] b) throws IOException {
            // Do nothing
          }
        }, null, TreeHeaderSize + nodes.size() * NodeSize);
        for (int i_leaf = nonLeafNodeCount, i=0; i_leaf < nodes.size(); i_leaf++) {
          nodes.elementAt(i_leaf).offsetOfFirstElement = (int)fakeOut.getPos();
          if (i != nodes.elementAt(i_leaf).index1) throw new RuntimeException();
          double x1, y1, x2, y2;
          
          // Initialize MBR to first object
          int eol = skipToEOL(element_bytes, offsets[i]);
          fakeOut.write(element_bytes, offsets[i],
              eol - offsets[i]);
          line.set(element_bytes, offsets[i], eol - offsets[i] - 1);
          stockObject.fromText(line);
          Rectangle mbr = stockObject.getMBR();
          x1 = mbr.x1;
          y1 = mbr.y1;
          x2 = mbr.x2;
          y2 = mbr.y2;
          i++;
          
          while (i < nodes.elementAt(i_leaf).index2) {
            eol = skipToEOL(element_bytes, offsets[i]);
            fakeOut.write(element_bytes, offsets[i],
                eol - offsets[i]);
            line.set(element_bytes, offsets[i], eol - offsets[i] - 1);
            stockObject.fromText(line);
            mbr = stockObject.getMBR();
            if (mbr.x1 < x1) x1 = mbr.x1;
            if (mbr.y1 < y1) y1 = mbr.y1;
            if (mbr.x2 > x2) x2 = mbr.x2;
            if (mbr.y2 > y2) y2 = mbr.y2;
            i++;
          }
          nodes.elementAt(i_leaf).set(x1, y1, x2, y2);
        }
        
      } finally {
        if (fakeOut != null)
          fakeOut.close();
      }

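Lines 4-20: initialize the output stream with its write position already advanced past the R-tree header and all index nodes.

Line 21: iterate over all leaf nodes; the variable i points to the current data record.

Line 22: record the position in the output file of the first record of this SplitStruct.

Lines 27-36: fetch the first record of this SplitStruct, take its minimum bounding rectangle, and write the record to the output stream (note: this stream does not actually write to the output file; the real write is described below).

Line 37: advance the record pointer by one.

Line 39: iterate over the remaining records of this SplitStruct.

Lines 40-50: fetch each record, compute its minimum bounding rectangle, merge it with the rectangle accumulated so far, and write the record to the output stream (note: this stream does not actually write to the output file; the real write is described below).

Line 52: store the resulting minimum bounding rectangle of all records in this SplitStruct.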
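The null output stream is only used as a position counter, so the same offsets can be obtained by summing record lengths. The following sketch shows that equivalent computation under assumptions: the class, method and parameter names are hypothetical, and `startPosition` plays the role of `TreeHeaderSize + nodes.size() * NodeSize`.

```java
final class LeafOffsetSketch {
  /**
   * Returns, for each leaf, the position at which its first record starts.
   * recordLengths[i] is the byte length of record i including its newline;
   * leafEndIndex[k] is the exclusive end index (index2) of leaf k.
   */
  static long[] leafOffsets(long startPosition, int[] recordLengths, int[] leafEndIndex) {
    long[] result = new long[leafEndIndex.length];
    long pos = startPosition; // data begins after the header and all nodes
    int i = 0;
    for (int leaf = 0; leaf < leafEndIndex.length; leaf++) {
      result[leaf] = pos;
      while (i < leafEndIndex[leaf]) {
        pos += recordLengths[i]; // what fakeOut.write(...) does to getPos()
        i++;
      }
    }
    return result;
  }

  public static void main(String[] args) {
    long[] offsets = leafOffsets(100, new int[] {10, 20, 30, 40}, new int[] {2, 4});
    // prints [100, 130]
    System.out.println(java.util.Arrays.toString(offsets));
  }
}
```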

Part 5

For every non-leaf node, record the start position of the data belonging to its descendants and compute its minimum bounding rectangle.

// Calculate MBR and offsetOfFirstElement for non-leaves
      for (int i_node = nonLeafNodeCount-1; i_node >= 0; i_node--) {
        int i_first_child = i_node * degree + 1;
        nodes.elementAt(i_node).offsetOfFirstElement =
            nodes.elementAt(i_first_child).offsetOfFirstElement;
        int i_child = 0;
        Rectangle mbr;
        mbr = nodes.elementAt(i_first_child + i_child);
        double x1 = mbr.x1;
        double y1 = mbr.y1;
        double x2 = mbr.x2;
        double y2 = mbr.y2;
        i_child++;
        
        while (i_child < degree) {
          mbr = nodes.elementAt(i_first_child + i_child);
          if (mbr.x1 < x1) x1 = mbr.x1;
          if (mbr.y1 < y1) y1 = mbr.y1;
          if (mbr.x2 > x2) x2 = mbr.x2;
          if (mbr.y2 > y2) y2 = mbr.y2;
          i_child++;
        }
        nodes.elementAt(i_node).set(x1, y1, x2, y2);
      }

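Line 2: iterate over all non-leaf nodes, from the last one back to the root.

Line 3: compute the index of the first of the degree children of this node. Because nodes are stored in level order, the children of node i start at i * degree + 1; they may be leaves or non-leaf nodes.

Line 4: set this node's offsetOfFirstElement to the offsetOfFirstElement of its leftmost child.

Lines 5-22: iterate over all children of this node and compute their combined minimum bounding rectangle.

Line 23: store that minimum bounding rectangle in the non-leaf node.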
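The bottom-up pass works because the nodes form a complete degree-ary tree in level order, so the children of node i sit at indices i * degree + 1 through i * degree + degree, and iterating backwards guarantees that every child is finished before its parent. A minimal sketch with rectangles as plain arrays (the class name and the `{x1, y1, x2, y2}` layout are assumptions made for illustration):

```java
final class BottomUpMbrSketch {
  /**
   * mbrs[i] = {x1, y1, x2, y2} for node i in level order. Leaf entries must
   * already be filled; non-leaf entries are overwritten.
   */
  static void fillNonLeafMbrs(double[][] mbrs, int nonLeafNodeCount, int degree) {
    for (int iNode = nonLeafNodeCount - 1; iNode >= 0; iNode--) {
      int firstChild = iNode * degree + 1;
      // Start from the first child and expand with the remaining children
      double[] mbr = mbrs[firstChild].clone();
      for (int c = 1; c < degree; c++) {
        double[] child = mbrs[firstChild + c];
        mbr[0] = Math.min(mbr[0], child[0]);
        mbr[1] = Math.min(mbr[1], child[1]);
        mbr[2] = Math.max(mbr[2], child[2]);
        mbr[3] = Math.max(mbr[3], child[3]);
      }
      mbrs[iNode] = mbr;
    }
  }

  public static void main(String[] args) {
    double[][] mbrs = {null, {0, 0, 1, 1}, {2, -1, 3, 4}};
    fillNonLeafMbrs(mbrs, 1, 2);
    // prints [0.0, -1.0, 3.0, 4.0]
    System.out.println(java.util.Arrays.toString(mbrs[0]));
  }
}
```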

Part 6

This part actually writes all of the information to the output stream.

// Start writing the tree
      // write tree header (including size)
      // Total tree size. (== Total bytes written - 8 bytes for the size itself)
      dataOut.writeInt(TreeHeaderSize + NodeSize * nodeCount + len);
      // Tree height
      dataOut.writeInt(height);
      // Degree
      dataOut.writeInt(degree);
      dataOut.writeInt(elementCount);
      
      // write nodes
      for (SplitStruct node : nodes) {
        node.write(dataOut);
      }
      // write elements
      for (int element_i = 0; element_i < elementCount; element_i++) {
        int eol = skipToEOL(element_bytes, offsets[element_i]);
        dataOut.write(element_bytes, offsets[element_i],
            eol - offsets[element_i]);
      }

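Line 4: the total size of the tree, not counting the size field itself.

Lines 5-9: write the tree header: height, degree and number of records.

Lines 12-13: write all node entries.

Lines 16-19: write the actual data records.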
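The resulting layout is: a size field, the three header integers, the node entries, then the raw records. The sketch below writes and reads back only the size field and header with `DataOutputStream` and `DataInputStream`. The constant values (12 bytes for the header, 36 bytes per node) are assumptions chosen for illustration and are not confirmed by the snippets in this article; the class and method names are hypothetical as well.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class TreeHeaderSketch {
  static final int TREE_HEADER_SIZE = 12; // assumed: height + degree + elementCount
  static final int NODE_SIZE = 36;        // assumed: one int offset + four doubles

  /** Writes the size field followed by height, degree and elementCount. */
  static byte[] writeHeader(int height, int degree, int elementCount,
      int nodeCount, int dataLength) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(TREE_HEADER_SIZE + NODE_SIZE * nodeCount + dataLength);
      out.writeInt(height);
      out.writeInt(degree);
      out.writeInt(elementCount);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Returns {treeSize, height, degree, elementCount}. */
  static int[] readHeader(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      return new int[] {in.readInt(), in.readInt(), in.readInt(), in.readInt()};
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] header = writeHeader(3, 4, 100, 21, 1000);
    System.out.println(java.util.Arrays.toString(readHeader(header)));
  }
}
```

A reader that knows the size field can skip an entire tree without parsing it, and one that knows height and degree can compute the node count and therefore where the records begin.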
