The edu.umn.cs.spatialHadoop.operations.FileMBR class computes the minimal bounding rectangle (MBR) of an input dataset.
Its core is the fileMBRMapReduce method, which runs the computation as a MapReduce job.
FileMBR mainly implements the map, combine, and reduce steps; each is described below.
1. FileMBRMapper is the Map class; its map method is:
public static class FileMBRMapper extends MapReduceBase implements
    Mapper<Rectangle, Text, Text, Partition> {
  // (the fields lastSplit, value, shape, and fileName are declared in the
  // class body, omitted here)
  public void map(Rectangle dummy, Text text,
      OutputCollector<Text, Partition> output, Reporter reporter)
      throws IOException {
    if (lastSplit != reporter.getInputSplit()) {
      lastSplit = reporter.getInputSplit();
      value.filename = ((FileSplit) lastSplit).getPath().getName();
      fileName = new Text(value.filename);
    }
    value.size = text.getLength() + 1; // +1 for new line
    shape.fromText(text);
    Rectangle mbr = shape.getMBR();
    if (mbr != null) {
      value.set(mbr);
      output.collect(fileName, value);
    }
  }
}
The key and value passed to map are produced by a custom RecordReader. The value, i.e. the Text parameter, is the important part: the RecordReader parses each line of raw input into one of SpatialHadoop's custom Shape types and serializes it as Text. The map method computes the MBR of each shape and emits it in the format (filename, mbr).
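Stripped of the Hadoop types, the per-record map logic can be sketched as follows. The Rect class and the `parseLine` text format here are simplified stand-ins for illustration, not SpatialHadoop's actual Shape/TextSerializable API:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Sketch of the map step: parse one input line into a shape (here just a
// rectangle), compute its MBR, and emit the pair (filename, mbr).
public class MapSketch {
    // Minimal rectangle: (x1,y1) lower-left corner, (x2,y2) upper-right.
    public static class Rect {
        public double x1, y1, x2, y2;
        public Rect(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
        // For a rectangle, the MBR is the rectangle itself.
        public Rect getMBR() { return this; }
        public String toString() { return x1 + "," + y1 + "," + x2 + "," + y2; }
    }

    // Parse one text line "x1,y1,x2,y2" into a shape (hypothetical format).
    public static Rect parseLine(String line) {
        String[] f = line.split(",");
        return new Rect(Double.parseDouble(f[0]), Double.parseDouble(f[1]),
                        Double.parseDouble(f[2]), Double.parseDouble(f[3]));
    }

    // map: emit (filename, MBR of the parsed shape).
    public static Map.Entry<String, Rect> map(String filename, String line) {
        Rect mbr = parseLine(line).getMBR();
        return new SimpleEntry<>(filename, mbr);
    }

    public static void main(String[] args) {
        Map.Entry<String, Rect> out = map("cemetery.bz2", "1.0,2.0,3.0,4.0");
        System.out.println(out.getKey() + " -> " + out.getValue());
    }
}
```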
2. Combine is the Combine class and Reduce is the Reduce class; the two do essentially the same thing. The reduce code is:
public static class Reduce extends MapReduceBase implements
    Reducer<Text, Partition, NullWritable, Partition> {
  @Override
  public void reduce(Text filename, Iterator<Partition> values,
      OutputCollector<NullWritable, Partition> output, Reporter reporter)
      throws IOException {
    if (values.hasNext()) {
      Partition partition = values.next().clone();
      while (values.hasNext()) {
        partition.expand(values.next());
      }
      partition.cellId = Math.abs(filename.hashCode());
      output.collect(NullWritable.get(), partition);
    }
  }
}
The reduce method iteratively merges shapes belonging to the same file, computing the MBR of each pair, i.e. the smallest rectangle that encloses both shapes. The merge logic is:
public void expand(final Shape s) {
  Rectangle r = s.getMBR();
  if (r.x1 < this.x1)
    this.x1 = r.x1;
  if (r.x2 > this.x2)
    this.x2 = r.x2;
  if (r.y1 < this.y1)
    this.y1 = r.y1;
  if (r.y2 > this.y2)
    this.y2 = r.y2;
}
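A quick worked example of this merge, using a minimal stand-in Rectangle rather than SpatialHadoop's own class:

```java
// Demonstrates the expand() logic above: growing a rectangle so it also
// covers a second one yields the smallest rectangle enclosing both.
public class ExpandDemo {
    public static class Rectangle {
        public double x1, y1, x2, y2;
        public Rectangle(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
        // Same comparisons as the expand() method shown above.
        public void expand(Rectangle r) {
            if (r.x1 < x1) x1 = r.x1;
            if (r.x2 > x2) x2 = r.x2;
            if (r.y1 < y1) y1 = r.y1;
            if (r.y2 > y2) y2 = r.y2;
        }
    }

    public static void main(String[] args) {
        Rectangle a = new Rectangle(0, 0, 2, 2);
        a.expand(new Rectangle(1, -1, 3, 1));
        // The result covers both inputs: (0,-1)-(3,2)
        System.out.println(a.x1 + "," + a.y1 + "," + a.x2 + "," + a.y2);
    }
}
```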
A Partition also carries the filename, data size, and record count. While MBRs are merged, the sizes and record counts are summed so they can be written out at the end:
public void expand(Partition p) {
  super.expand(p);
  // accumulate size
  this.size += p.size;
  this.recordCount += p.recordCount;
}
To summarize:
map emits the MBR of every shape;
combine and reduce iteratively merge these MBRs pairwise.
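The combine/reduce merging, including the size and record-count accumulation from the Partition.expand override above, can be sketched end to end as follows; this Partition class is a simplified stand-in, not SpatialHadoop's:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch of the reduce step: fold a stream of partitions for one file
// into a single partition holding the combined MBR, total byte size,
// and total record count.
public class ReduceSketch {
    public static class Partition {
        public double x1, y1, x2, y2;
        public long size, recordCount;
        public Partition(double x1, double y1, double x2, double y2,
                         long size, long recordCount) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
            this.size = size; this.recordCount = recordCount;
        }
        // Merge p into this: grow the MBR and accumulate the totals.
        public void expand(Partition p) {
            if (p.x1 < x1) x1 = p.x1;
            if (p.x2 > x2) x2 = p.x2;
            if (p.y1 < y1) y1 = p.y1;
            if (p.y2 > y2) y2 = p.y2;
            size += p.size;
            recordCount += p.recordCount;
        }
    }

    // reduce: take the first value, then fold the rest in.
    // (SpatialHadoop clones the first value; omitted here for brevity.)
    public static Partition reduce(Iterator<Partition> values) {
        Partition result = values.next();
        while (values.hasNext())
            result.expand(values.next());
        return result;
    }

    public static void main(String[] args) {
        List<Partition> vals = Arrays.asList(
            new Partition(0, 0, 1, 1, 100, 2),
            new Partition(-1, 0, 0.5, 3, 50, 1));
        Partition merged = reduce(vals.iterator());
        // Combined MBR (-1,0)-(1,3), size 150, 3 records
        System.out.println(merged.x1 + "," + merged.y1 + "," + merged.x2 + ","
            + merged.y2 + " size=" + merged.size + " records=" + merged.recordCount);
    }
}
```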
Next, the OutputCommitter used by FileMBR:
public static class MBROutputCommitter extends FileOutputCommitter {
  // If input is a directory, save the MBR to a _master file there
  @Override
  public void commitJob(JobContext context) throws IOException {
    try {
      super.commitJob(context);
      // Store the result back in the input file if it is a directory
      JobConf job = context.getJobConf();
      Path[] inPaths = SpatialInputFormat.getInputPaths(job);
      Path inPath = inPaths[0]; // TODO Handle multiple file input
      FileSystem inFs = inPath.getFileSystem(job);
      if (!inFs.getFileStatus(inPath).isDir())
        return;
      Path gindex_path = new Path(inPath, "_master.heap");
      // Answer has been already cached (may be by another job)
      if (inFs.exists(gindex_path))
        return;
      PrintStream gout = new PrintStream(inFs.create(gindex_path, false));
      // Read job result and concatenate everything to the master file
      Path outPath = TextOutputFormat.getOutputPath(job);
      FileSystem outFs = outPath.getFileSystem(job);
      FileStatus[] results = outFs.listStatus(outPath);
      for (FileStatus fileStatus : results) {
        if (fileStatus.getLen() > 0 && fileStatus.getPath().getName().startsWith("part-")) {
          LineReader lineReader = new LineReader(outFs.open(fileStatus.getPath()));
          Text text = new Text();
          while (lineReader.readLine(text) > 0) {
            gout.println(text);
          }
          lineReader.close();
        }
      }
      gout.close();
    } catch (RuntimeException e) {
      // This might happen if the input directory is read only
      LOG.info("Error caching the output of FileMBR");
    }
  }
}
commitJob runs after the job completes successfully. Its first statement calls the parent implementation, which produces the part-00000 file in the output directory; that file's content is identical to the _master.heap file described below. The custom commitJob therefore essentially copies the content of part-00000 into _master.heap.
FileMBR's output:
FileMBR writes a global index file into the input directory; it is generated by MBROutputCommitter and named "_master.heap". Sample content:
488954273,-176.2562062,-54.9019677,178.4890059,78.2138853,193076,59151087,cemetery.bz2
1789326676,-179.8728244,-89.9678417,179.7586087,78.6569788,1767137,618575974,sports.bz2
Each line describes one file; the fields are: cellID, x1, y1, x2, y2, recordCount, fileSize, filename.
cellID is the absolute value of the filename's hashCode;
(x1,y1)-(x2,y2) is the file's minimal bounding rectangle;
recordCount is the number of records in the file;
fileSize is the uncompressed size of the file;
filename is the file's name.
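Given that layout, one _master.heap line can be split back into its fields as sketched below; the Entry class and field names are illustrative, not part of SpatialHadoop:

```java
// Sketch of parsing one _master.heap line into its fields, assuming the
// comma-separated layout described above.
public class MasterLineParser {
    public static class Entry {
        public long cellId;
        public double x1, y1, x2, y2;
        public long recordCount, fileSize;
        public String filename;
    }

    public static Entry parse(String line) {
        String[] f = line.split(",");
        Entry e = new Entry();
        e.cellId = Long.parseLong(f[0]);        // abs(filename.hashCode())
        e.x1 = Double.parseDouble(f[1]);        // MBR lower-left x
        e.y1 = Double.parseDouble(f[2]);        // MBR lower-left y
        e.x2 = Double.parseDouble(f[3]);        // MBR upper-right x
        e.y2 = Double.parseDouble(f[4]);        // MBR upper-right y
        e.recordCount = Long.parseLong(f[5]);   // number of records
        e.fileSize = Long.parseLong(f[6]);      // uncompressed size
        e.filename = f[7];
        return e;
    }

    public static void main(String[] args) {
        Entry e = parse("488954273,-176.2562062,-54.9019677,178.4890059,"
            + "78.2138853,193076,59151087,cemetery.bz2");
        System.out.println(e.filename + " records=" + e.recordCount);
    }
}
```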