1. GridOutputFormat
GridOutputFormat is responsible for producing the RecordWriter; it creates a GridRecordWriter, as follows:
public class GridOutputFormat<S extends Shape> extends FileOutputFormat<IntWritable, S> {
  @Override
  public RecordWriter<IntWritable, S> getRecordWriter(FileSystem ignored,
      JobConf job, String name, Progressable progress) throws IOException {
    // Get grid info
    CellInfo[] cellsInfo = SpatialSite.getCells(job);
    GridRecordWriter<S> writer = new GridRecordWriter<S>(job, name, cellsInfo);
    return writer;
  }
}
2. edu.umn.cs.spatialHadoop.mapred.GridRecordWriter
This class implements the RecordWriter interface of the MapReduce framework and extends the edu.umn.cs.spatialHadoop.core.GridRecordWriter class.
The mapred variant simply delegates every call to edu.umn.cs.spatialHadoop.core.GridRecordWriter, so the real implementation lives in the core class.
The code of edu.umn.cs.spatialHadoop.mapred.GridRecordWriter is as follows:
public class GridRecordWriter<S extends Shape>
    extends edu.umn.cs.spatialHadoop.core.GridRecordWriter<S>
    implements RecordWriter<IntWritable, S> {

  public GridRecordWriter(JobConf job, String name, CellInfo[] cells) throws IOException {
    super(null, job, name, cells);
  }

  @Override
  public void write(IntWritable key, S value) throws IOException {
    super.write(key.get(), value);
  }

  @Override
  public void close(Reporter reporter) throws IOException {
    super.close(reporter);
  }
}
3. edu.umn.cs.spatialHadoop.core.GridRecordWriter
3.1 The constructor
public GridRecordWriter(Path outDir, JobConf job, String prefix,
    CellInfo[] cells) throws IOException {
  if (job != null) {
    this.sindex = job.get("sindex", "heap");
    this.pack = PackedIndexes.contains(sindex);
    this.expand = ExpandedIndexes.contains(sindex);
  }
  this.prefix = prefix;
  this.fileSystem = outDir == null ?
      FileOutputFormat.getOutputPath(job).getFileSystem(job) :
      outDir.getFileSystem(job != null ? job : new Configuration());
  this.outDir = outDir;
  this.jobConf = job;
  if (cells != null) {
    // Make sure cellIndex maps to array index. This is necessary for calls
    // that call directly write(int, Text)
    int highest_index = 0;
    for (CellInfo cell : cells) {
      if (cell.cellId > highest_index)
        highest_index = (int) cell.cellId;
    }
    // Create a master file that contains meta information about partitions
    masterFile = fileSystem.create(getMasterFilePath());
    this.cells = new CellInfo[highest_index + 1];
    for (CellInfo cell : cells)
      this.cells[(int) cell.cellId] = cell;
    // Prepare arrays that hold cells information
    intermediateCellStreams = new OutputStream[this.cells.length];
    intermediateCellPath = new Path[this.cells.length];
    cellsMbr = new Rectangle[this.cells.length];
    // Initialize the counters for each cell
    intermediateCellRecordCount = new int[this.cells.length];
    intermediateCellSize = new int[this.cells.length];
  } else {
    intermediateCellStreams = new OutputStream[1];
    intermediateCellPath = new Path[1];
    cellsMbr = new Rectangle[1];
    intermediateCellSize = new int[1];
    intermediateCellRecordCount = new int[1];
  }
  for (int i = 0; i < cellsMbr.length; i++) {
    cellsMbr[i] = new Rectangle(Double.MAX_VALUE, Double.MAX_VALUE,
        -Double.MAX_VALUE, -Double.MAX_VALUE);
  }
  this.blockSize = fileSystem.getDefaultBlockSize(outDir);
  closingThreads = new ArrayList<Thread>();
  text = new Text();
}
- The if (job != null) block reads the sindex setting (defaulting to "heap") and derives the pack and expand flags from it; the prefix, file system, output directory, and job are then stored.
- The first loop over cells determines the highest cellId in the input CellInfo array.
- A master file holding meta information about the partitions is created in the output directory.
- Using highest_index, the cells member is sized so that each cell's cellId can be used directly as its array index.
- The arrays holding per-cell state (intermediate streams, paths, MBRs, record counts, byte sizes) are allocated.
- The else branch only applies when cells is null and is not executed on this code path.
- The final loop initializes every entry of cellsMbr to an inverted rectangle (min coordinates at Double.MAX_VALUE, max at -Double.MAX_VALUE), so that the first expand call establishes the real bounds.
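The cellId-to-array-index mapping above can be sketched in plain Java (no Hadoop dependencies; CellIndexDemo, byId, and the -1 sentinel are illustrative choices, not part of SpatialHadoop):

```java
public class CellIndexDemo {
    // Place each cellId into an array slot equal to the id itself, sizing the
    // array to highest id + 1 — the same trick the constructor uses for `cells`.
    static int[] byId(int[] cellIds) {
        int highest = 0;
        for (int id : cellIds)
            if (id > highest) highest = id;
        int[] slots = new int[highest + 1];
        java.util.Arrays.fill(slots, -1);  // -1 marks an unused slot
        for (int id : cellIds)
            slots[id] = id;                // cellId doubles as array index
        return slots;
    }

    public static void main(String[] args) {
        int[] slots = byId(new int[]{3, 1, 7});
        // 8 slots (indices 0..7); slot 7 holds cell 7; slot 2 is unused
        System.out.println(slots.length + " " + slots[7] + " " + slots[2]); // 8 7 -1
    }
}
```

This wastes slots for sparse ids, but lets write(int, Text) address a cell in O(1) without a lookup table.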
3.2 The writeInternal function
The call chain is edu.umn.cs.spatialHadoop.mapred.GridRecordWriter.write -> edu.umn.cs.spatialHadoop.core.GridRecordWriter.write -> edu.umn.cs.spatialHadoop.core.GridRecordWriter.writeInternal, so the focus is on the implementation of writeInternal. The code is as follows:
protected synchronized void writeInternal(int cellIndex, S shape) throws IOException {
  if (cellIndex < 0) {
    // A special marker to close a cell
    closeCell(-cellIndex);
    return;
  }
  try {
    cellsMbr[cellIndex].expand(shape.getMBR());
  } catch (NullPointerException e) {
    e.printStackTrace();
  }
  // Convert shape to text
  text.clear();
  shape.toText(text);
  // Write text representation to the file
  OutputStream cellStream = getIntermediateCellStream(cellIndex);
  cellStream.write(text.getBytes(), 0, text.getLength());
  cellStream.write(NEW_LINE);
  intermediateCellSize[cellIndex] += text.getLength() + NEW_LINE.length;
  intermediateCellRecordCount[cellIndex]++;
}
- A negative cellIndex is a close marker. In the reduce phase of the grid-index MapReduce job, after all records belonging to a cell have been processed, a key-value pair with key -cellID is emitted to signal that the cell is finished; this branch performs the final processing for that cell via closeCell.
- The call to cellsMbr[cellIndex].expand(shape.getMBR()) maintains the minimum bounding rectangle of all data written to this cell so far.
- The shape is serialized into the reusable Text buffer.
- getIntermediateCellStream(cellIndex) returns the output stream for the cell's file: if the stream does not exist yet, it first resolves the output path and creates the stream; otherwise it returns the existing one.
- The text representation plus a newline is written to the stream, and the per-cell byte count and record count are updated.
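The close-marker convention can be sketched as follows (plain Java; MarkerDemo and dispatch are hypothetical names used only for illustration):

```java
public class MarkerDemo {
    // Mirror of writeInternal's dispatch: a negative index is a signal to
    // close cell (-cellIndex); a non-negative index is an ordinary write.
    static String dispatch(int cellIndex) {
        if (cellIndex < 0)
            return "close cell " + (-cellIndex);
        return "write to cell " + cellIndex;
    }

    public static void main(String[] args) {
        System.out.println(dispatch(5));   // write to cell 5
        System.out.println(dispatch(-5));  // close cell 5
    }
}
```

Encoding the close signal in the sign of the key keeps the write path a single method with no extra flag, at the cost of reserving negative indices.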
3.3 The closeCell function
The close-marker branch of writeInternal above calls closeCell; this section describes that function.
protected void closeCell(int cellIndex) throws IOException {
  CellInfo cell = cells != null ?
      cells[cellIndex] : new CellInfo(cellIndex + 1, cellsMbr[cellIndex]);
  if (expand)
    cell.expand(cellsMbr[cellIndex]);
  if (pack)
    cell = new CellInfo(cell.cellId, cell.getIntersection(cellsMbr[cellIndex]));
  closeCellBackground(intermediateCellPath[cellIndex],
      getFinalCellPath(cellIndex), intermediateCellStreams[cellIndex],
      masterFile, cell, intermediateCellRecordCount[cellIndex],
      intermediateCellSize[cellIndex]);
  cellsMbr[cellIndex] = new Rectangle(Double.MAX_VALUE, Double.MAX_VALUE,
      -Double.MAX_VALUE, -Double.MAX_VALUE);
  intermediateCellPath[cellIndex] = null;
  intermediateCellStreams[cellIndex] = null;
  intermediateCellRecordCount[cellIndex] = 0;
  intermediateCellSize[cellIndex] = 0;
}
- The CellInfo for cellIndex is looked up in cells (or, when cells is null, constructed from the accumulated MBR).
- If expand is set (determined by the index type in the constructor), the cell rectangle is expanded to also cover the MBR of the written data.
- If pack is set (also determined in the constructor), the cell is replaced by the intersection of its rectangle with that MBR.
- closeCellBackground performs the final processing on a dedicated thread, which is added to the closingThreads member: it flushes and closes the cell's output stream and updates the master file. It also removes any already-terminated threads from closingThreads and starts the newly created one.
- The per-cell state (MBR, path, stream, record count, byte size) is reset.
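The effect of the expand and pack flags on a cell's rectangle can be sketched with a minimal stand-in Rectangle (MbrDemo and Rect are illustrative names; the real class is edu.umn.cs.spatialHadoop.core.Rectangle):

```java
public class MbrDemo {
    // Minimal stand-in for the Rectangle used by GridRecordWriter;
    // (x1, y1) is the lower corner, (x2, y2) the upper corner.
    static class Rect {
        double x1, y1, x2, y2;
        Rect(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
        // expand: grow this rectangle so it also covers `o`
        void expand(Rect o) {
            x1 = Math.min(x1, o.x1); y1 = Math.min(y1, o.y1);
            x2 = Math.max(x2, o.x2); y2 = Math.max(y2, o.y2);
        }
        // pack: shrink to the overlap of this rectangle and `o`
        Rect intersection(Rect o) {
            return new Rect(Math.max(x1, o.x1), Math.max(y1, o.y1),
                            Math.min(x2, o.x2), Math.min(y2, o.y2));
        }
    }

    public static void main(String[] args) {
        Rect cell = new Rect(0, 0, 10, 10);      // the cell's grid rectangle
        Rect dataMbr = new Rect(2, 2, 12, 8);    // MBR of the written shapes
        Rect expanded = new Rect(cell.x1, cell.y1, cell.x2, cell.y2);
        expanded.expand(dataMbr);                // grows to (0,0)-(12,10)
        Rect packed = cell.intersection(dataMbr); // shrinks to (2,2)-(10,8)
        System.out.println(expanded.x2 + " " + packed.x1); // 12.0 2.0
    }
}
```

Expanded indexes guarantee every written shape is covered by its cell; packed indexes tighten the cell to the data actually stored, which helps pruning at query time.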
3.4 The close function
The call chain is edu.umn.cs.spatialHadoop.mapred.GridRecordWriter.close -> edu.umn.cs.spatialHadoop.core.GridRecordWriter.close, so the focus is on the implementation of close. The code is as follows:
public synchronized void close(Progressable progressable) throws IOException {
  // Close all output files
  for (int cellIndex = 0; cellIndex < intermediateCellStreams.length; cellIndex++) {
    if (intermediateCellStreams[cellIndex] != null) {
      closeCell(cellIndex);
    }
    // Indicate progress. Useful if closing a single cell takes a long time
    if (progressable != null)
      progressable.progress();
  }
  while (!closingThreads.isEmpty()) {
    try {
      Thread t = closingThreads.get(0);
      switch (t.getState()) {
        case NEW: t.start(); break;
        case TERMINATED: closingThreads.remove(0); break;
        default:
          // Use limited time join to indicate progress frequently
          t.join(10000);
      }
      // Indicate progress. Useful if closing a single cell takes a long time
      if (progressable != null)
        progressable.progress();
    } catch (InterruptedException e) {
      e.printStackTrace();
    }
  }
  if (masterFile != null)
    masterFile.close();
}
- The first loop closes every cell whose intermediate output stream is still open, reporting progress after each one.
- The while loop drains closingThreads: a thread in state NEW is started, a TERMINATED one is removed, and anything else is joined with a 10-second timeout so that progress can still be reported regularly.
- Finally, the master file is closed.
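The drain loop over closingThreads can be sketched in plain Java (DrainDemo and drain are hypothetical names; the join timeout is shortened from 10 s for illustration, and the progress callback is omitted):

```java
import java.util.ArrayList;
import java.util.List;

public class DrainDemo {
    // Same state machine as close(): start NEW threads, drop TERMINATED ones,
    // and bound the join on anything in between so the caller regains control.
    static int drain(List<Thread> closingThreads) throws InterruptedException {
        int finished = 0;
        while (!closingThreads.isEmpty()) {
            Thread t = closingThreads.get(0);
            switch (t.getState()) {
                case NEW:        t.start(); break;
                case TERMINATED: closingThreads.remove(0); finished++; break;
                default:         t.join(100);  // bounded wait, then re-check
            }
        }
        return finished;
    }

    public static void main(String[] args) throws InterruptedException {
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < 3; i++)
            threads.add(new Thread(() -> { /* stand-in for cell finalization */ }));
        System.out.println(drain(threads)); // 3
    }
}
```

The bounded join is the key design choice: it lets the record writer keep calling progress() while long-running cell finalizations complete, so the MapReduce framework does not kill the task as stalled.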