The edu.umn.cs.spatialHadoop.operations.FileMBR class computes the minimal bounding rectangle (MBR) of an input dataset.
Its core is the fileMBRMapReduce method, which runs the computation as a MapReduce job.
FileMBR mainly implements the map, combine, and reduce steps; each is described below.
1. FileMBRMapper is the Map class; its map method is:
public static class FileMBRMapper extends MapReduceBase implements
    Mapper<Rectangle, Text, Text, Partition> {
  // (the fields lastSplit, value, shape, and fileName are declared in the
  // class body, omitted here)
  public void map(Rectangle dummy, Text text,
      OutputCollector<Text, Partition> output, Reporter reporter)
      throws IOException {
    if (lastSplit != reporter.getInputSplit()) {
      lastSplit = reporter.getInputSplit();
      value.filename = ((FileSplit) lastSplit).getPath().getName();
      fileName = new Text(value.filename);
    }
    value.size = text.getLength() + 1; // +1 for new line
    shape.fromText(text);
    Rectangle mbr = shape.getMBR();
    if (mbr != null) {
      value.set(mbr);
      output.collect(fileName, value);
    }
  }
}
The key and value passed to map are produced by a custom RecordReader. The value, i.e. the Text parameter, is the important part: the RecordReader parses each line of raw input into one of SpatialHadoop's custom Shape types and serializes it as Text. The map method computes the MBR of each shape and emits it in the format (filename, mbr).
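Stripped of the Hadoop types, the per-record map logic can be sketched as follows. The Rect class and the `parseLine` text format here are simplified stand-ins for illustration, not SpatialHadoop's actual Shape/TextSerializable API:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Sketch of the map step: parse one input line into a shape (here just a
// rectangle), compute its MBR, and emit the pair (filename, mbr).
public class MapSketch {
    // Minimal rectangle: (x1,y1) lower-left corner, (x2,y2) upper-right.
    public static class Rect {
        public double x1, y1, x2, y2;
        public Rect(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
        // For a rectangle, the MBR is the rectangle itself.
        public Rect getMBR() { return this; }
        public String toString() { return x1 + "," + y1 + "," + x2 + "," + y2; }
    }

    // Parse one text line "x1,y1,x2,y2" into a shape (hypothetical format).
    public static Rect parseLine(String line) {
        String[] f = line.split(",");
        return new Rect(Double.parseDouble(f[0]), Double.parseDouble(f[1]),
                        Double.parseDouble(f[2]), Double.parseDouble(f[3]));
    }

    // map: emit (filename, MBR of the parsed shape).
    public static Map.Entry<String, Rect> map(String filename, String line) {
        Rect mbr = parseLine(line).getMBR();
        return new SimpleEntry<>(filename, mbr);
    }

    public static void main(String[] args) {
        Map.Entry<String, Rect> out = map("cemetery.bz2", "1.0,2.0,3.0,4.0");
        System.out.println(out.getKey() + " -> " + out.getValue());
    }
}
```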
2. Combine is the Combine class and Reduce is the Reduce class; the two do essentially the same thing. The reduce code is:
public static class Reduce extends MapReduceBase implements
    Reducer<Text, Partition, NullWritable, Partition> {
  @Override
  public void reduce(Text filename, Iterator<Partition> values,
      OutputCollector<NullWritable, Partition> output, Reporter reporter)
      throws IOException {
    if (values.hasNext()) {
      Partition partition = values.next().clone();
      while (values.hasNext()) {
        partition.expand(values.next());
      }
      partition.cellId = Math.abs(filename.hashCode());
      output.collect(NullWritable.get(), partition);
    }
  }
}
The reduce method iteratively merges shapes belonging to the same file, computing the MBR of each pair, i.e. the smallest rectangle that encloses both shapes. The merge logic is:
public void expand(final Shape s) {
  Rectangle r = s.getMBR();
  if (r.x1 < this.x1)
    this.x1 = r.x1;
  if (r.x2 > this.x2)
    this.x2 = r.x2;
  if (r.y1 < this.y1)
    this.y1 = r.y1;
  if (r.y2 > this.y2)
    this.y2 = r.y2;
}
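A quick worked example of this merge, using a minimal stand-in Rectangle rather than SpatialHadoop's own class:

```java
// Demonstrates the expand() logic above: growing a rectangle so it also
// covers a second one yields the smallest rectangle enclosing both.
public class ExpandDemo {
    public static class Rectangle {
        public double x1, y1, x2, y2;
        public Rectangle(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
        // Same comparisons as the expand() method shown above.
        public void expand(Rectangle r) {
            if (r.x1 < x1) x1 = r.x1;
            if (r.x2 > x2) x2 = r.x2;
            if (r.y1 < y1) y1 = r.y1;
            if (r.y2 > y2) y2 = r.y2;
        }
    }

    public static void main(String[] args) {
        Rectangle a = new Rectangle(0, 0, 2, 2);
        a.expand(new Rectangle(1, -1, 3, 1));
        // The result covers both inputs: (0,-1)-(3,2)
        System.out.println(a.x1 + "," + a.y1 + "," + a.x2 + "," + a.y2);
    }
}
```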
A Partition also carries the filename, data size, and record count. While MBRs are merged, the sizes and record counts are summed so they can be written out at the end:
public void expand(Partition p) {
  super.expand(p);
  // accumulate size
  this.size += p.size;
  this.recordCount += p.recordCount;
}
To summarize:
map emits the MBR of every shape;
combine and reduce iteratively merge these MBRs pairwise.
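The combine/reduce merging, including the size and record-count accumulation from the Partition.expand override above, can be sketched end to end as follows; this Partition class is a simplified stand-in, not SpatialHadoop's:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch of the reduce step: fold a stream of partitions for one file
// into a single partition holding the combined MBR, total byte size,
// and total record count.
public class ReduceSketch {
    public static class Partition {
        public double x1, y1, x2, y2;
        public long size, recordCount;
        public Partition(double x1, double y1, double x2, double y2,
                         long size, long recordCount) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
            this.size = size; this.recordCount = recordCount;
        }
        // Merge p into this: grow the MBR and accumulate the totals.
        public void expand(Partition p) {
            if (p.x1 < x1) x1 = p.x1;
            if (p.x2 > x2) x2 = p.x2;
            if (p.y1 < y1) y1 = p.y1;
            if (p.y2 > y2) y2 = p.y2;
            size += p.size;
            recordCount += p.recordCount;
        }
    }

    // reduce: take the first value, then fold the rest in.
    // (SpatialHadoop clones the first value; omitted here for brevity.)
    public static Partition reduce(Iterator<Partition> values) {
        Partition result = values.next();
        while (values.hasNext())
            result.expand(values.next());
        return result;
    }

    public static void main(String[] args) {
        List<Partition> vals = Arrays.asList(
            new Partition(0, 0, 1, 1, 100, 2),
            new Partition(-1, 0, 0.5, 3, 50, 1));
        Partition merged = reduce(vals.iterator());
        // Combined MBR (-1,0)-(1,3), size 150, 3 records
        System.out.println(merged.x1 + "," + merged.y1 + "," + merged.x2 + ","
            + merged.y2 + " size=" + merged.size + " records=" + merged.recordCount);
    }
}
```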
Next, the OutputCommitter used by FileMBR:
public static class MBROutputCommitter extends FileOutputCommitter {
  // If input is a directory, save the MBR to a _master file there
  @Override
  public void commitJob(JobContext context) throws IOException {
    try {
      super.commitJob(context);
      // Store the result back in the input file if it is a directory
      JobConf job = context.getJobConf();
      Path[] inPaths = SpatialInputFormat.getInputPaths(job);
      Path inPath = inPaths[0]; // TODO Handle multiple file input
      FileSystem inFs = inPath.getFileSystem(job);
      if (!inFs.getFileStatus(inPath).isDir())
        return;
      Path gindex_path = new Path(inPath, "_master.heap");
      // Answer has been already cached (may be by another job)
      if (inFs.exists(gindex_path))
        return;
      PrintStream gout = new PrintStream(inFs.create(gindex_path, false));
      // Read job result and concatenate everything to the master file
      Path outPath = TextOutputFormat.getOutputPath(job);
      FileSystem outFs = outPath.getFileSystem(job);
      FileStatus[] results = outFs.listStatus(outPath);
      for (FileStatus fileStatus : results) {
        if (fileStatus.getLen() > 0 && fileStatus.getPath().getName().startsWith("part-")) {
          LineReader lineReader = new LineReader(outFs.open(fileStatus.getPath()));
          Text text = new Text();
          while (lineReader.readLine(text) > 0) {
            gout.println(text);
          }
          lineReader.close();
        }
      }
      gout.close();
    } catch (RuntimeException e) {
      // This might happen if the input directory is read only
      LOG.info("Error caching the output of FileMBR");
    }
  }
}
commitJob runs after the job completes successfully. Its first statement calls the parent implementation, which produces the part-00000 file in the output directory; that file's content is identical to the _master.heap file described below. The custom commitJob therefore essentially copies the content of part-00000 into _master.heap.
FileMBR's output:
FileMBR writes a global index file into the input directory; it is generated by MBROutputCommitter and named "_master.heap". Sample content:
488954273,-176.2562062,-54.9019677,178.4890059,78.2138853,193076,59151087,cemetery.bz2
1789326676,-179.8728244,-89.9678417,179.7586087,78.6569788,1767137,618575974,sports.bz2
Each line describes one file; the fields are: cellID, x1, y1, x2, y2, recordCount, fileSize, filename.
cellID is the absolute value of the filename's hashCode;
(x1,y1)-(x2,y2) is the file's minimal bounding rectangle;
recordCount is the number of records in the file;
fileSize is the uncompressed size of the file;
filename is the file's name.
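Given that layout, one _master.heap line can be split back into its fields as sketched below; the Entry class and field names are illustrative, not part of SpatialHadoop:

```java
// Sketch of parsing one _master.heap line into its fields, assuming the
// comma-separated layout described above.
public class MasterLineParser {
    public static class Entry {
        public long cellId;
        public double x1, y1, x2, y2;
        public long recordCount, fileSize;
        public String filename;
    }

    public static Entry parse(String line) {
        String[] f = line.split(",");
        Entry e = new Entry();
        e.cellId = Long.parseLong(f[0]);        // abs(filename.hashCode())
        e.x1 = Double.parseDouble(f[1]);        // MBR lower-left x
        e.y1 = Double.parseDouble(f[2]);        // MBR lower-left y
        e.x2 = Double.parseDouble(f[3]);        // MBR upper-right x
        e.y2 = Double.parseDouble(f[4]);        // MBR upper-right y
        e.recordCount = Long.parseLong(f[5]);   // number of records
        e.fileSize = Long.parseLong(f[6]);      // uncompressed size
        e.filename = f[7];
        return e;
    }

    public static void main(String[] args) {
        Entry e = parse("488954273,-176.2562062,-54.9019677,178.4890059,"
            + "78.2138853,193076,59151087,cemetery.bz2");
        System.out.println(e.filename + " records=" + e.recordCount);
    }
}
```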