1. GridOutputFormat
GridOutputFormat is responsible for producing the RecordWriter; it creates a GridRecordWriter, as follows:
public class GridOutputFormat<S extends Shape> extends FileOutputFormat<IntWritable, S> {
  @Override
  public RecordWriter<IntWritable, S> getRecordWriter(FileSystem ignored,
      JobConf job, String name, Progressable progress) throws IOException {
    // Get grid info
    CellInfo[] cellsInfo = SpatialSite.getCells(job);
    GridRecordWriter<S> writer = new GridRecordWriter<S>(job, name, cellsInfo);
    return writer;
  }
}
2. edu.umn.cs.spatialHadoop.mapred.GridRecordWriter
This class implements the RecordWriter interface of the MapReduce framework and extends the edu.umn.cs.spatialHadoop.core.GridRecordWriter class.
The mapred variant simply delegates every call to edu.umn.cs.spatialHadoop.core.GridRecordWriter, so the real implementation lives in the core class.
The code of edu.umn.cs.spatialHadoop.mapred.GridRecordWriter is as follows:
public class GridRecordWriter<S extends Shape>
    extends edu.umn.cs.spatialHadoop.core.GridRecordWriter<S>
    implements RecordWriter<IntWritable, S> {

  public GridRecordWriter(JobConf job, String name, CellInfo[] cells) throws IOException {
    super(null, job, name, cells);
  }

  @Override
  public void write(IntWritable key, S value) throws IOException {
    super.write(key.get(), value);
  }

  @Override
  public void close(Reporter reporter) throws IOException {
    super.close(reporter);
  }
}
3. edu.umn.cs.spatialHadoop.core.GridRecordWriter
3.1 The constructor
public GridRecordWriter(Path outDir, JobConf job, String prefix,
    CellInfo[] cells) throws IOException {
  if (job != null) {
    this.sindex = job.get("sindex", "heap");
    this.pack = PackedIndexes.contains(sindex);
    this.expand = ExpandedIndexes.contains(sindex);
  }
  this.prefix = prefix;
  this.fileSystem = outDir == null ?
      FileOutputFormat.getOutputPath(job).getFileSystem(job) :
      outDir.getFileSystem(job != null ? job : new Configuration());
  this.outDir = outDir;
  this.jobConf = job;
  if (cells != null) {
    // Make sure cellIndex maps to array index. This is necessary for calls
    // that call directly write(int, Text)
    int highest_index = 0;
    for (CellInfo cell : cells) {
      if (cell.cellId > highest_index)
        highest_index = (int) cell.cellId;
    }
    // Create a master file that contains meta information about partitions
    masterFile = fileSystem.create(getMasterFilePath());
    this.cells = new CellInfo[highest_index + 1];
    for (CellInfo cell : cells)
      this.cells[(int) cell.cellId] = cell;
    // Prepare arrays that hold cells information
    intermediateCellStreams = new OutputStream[this.cells.length];
    intermediateCellPath = new Path[this.cells.length];
    cellsMbr = new Rectangle[this.cells.length];
    // Initialize the counters for each cell
    intermediateCellRecordCount = new int[this.cells.length];
    intermediateCellSize = new int[this.cells.length];
  } else {
    intermediateCellStreams = new OutputStream[1];
    intermediateCellPath = new Path[1];
    cellsMbr = new Rectangle[1];
    intermediateCellSize = new int[1];
    intermediateCellRecordCount = new int[1];
  }
  for (int i = 0; i < cellsMbr.length; i++) {
    cellsMbr[i] = new Rectangle(Double.MAX_VALUE, Double.MAX_VALUE,
        -Double.MAX_VALUE, -Double.MAX_VALUE);
  }
  this.blockSize = fileSystem.getDefaultBlockSize(outDir);
  closingThreads = new ArrayList<Thread>();
  text = new Text();
}
- The if (job != null) block reads the sindex setting (defaulting to "heap") and derives the pack and expand flags from it; the prefix, file system, output directory, and job are then stored.
- The first loop over cells determines the highest cellId in the input CellInfo array.
- A master file holding meta information about the partitions is created in the output directory.
- Using highest_index, the cells member is sized so that each cell's cellId can be used directly as its array index.
- The arrays holding per-cell state (intermediate streams, paths, MBRs, record counts, byte sizes) are allocated.
- The else branch only applies when cells is null and is not executed on this code path.
- The final loop initializes every entry of cellsMbr to an inverted rectangle (min coordinates at Double.MAX_VALUE, max at -Double.MAX_VALUE), so that the first expand call establishes the real bounds.
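The cellId-to-array-index mapping above can be sketched in plain Java (no Hadoop dependencies; CellIndexDemo, byId, and the -1 sentinel are illustrative choices, not part of SpatialHadoop):

```java
public class CellIndexDemo {
    // Place each cellId into an array slot equal to the id itself, sizing the
    // array to highest id + 1 — the same trick the constructor uses for `cells`.
    static int[] byId(int[] cellIds) {
        int highest = 0;
        for (int id : cellIds)
            if (id > highest) highest = id;
        int[] slots = new int[highest + 1];
        java.util.Arrays.fill(slots, -1);  // -1 marks an unused slot
        for (int id : cellIds)
            slots[id] = id;                // cellId doubles as array index
        return slots;
    }

    public static void main(String[] args) {
        int[] slots = byId(new int[]{3, 1, 7});
        // 8 slots (indices 0..7); slot 7 holds cell 7; slot 2 is unused
        System.out.println(slots.length + " " + slots[7] + " " + slots[2]); // 8 7 -1
    }
}
```

This wastes slots for sparse ids, but lets write(int, Text) address a cell in O(1) without a lookup table.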
3.2 The writeInternal function
The call chain is edu.umn.cs.spatialHadoop.mapred.GridRecordWriter.write -> edu.umn.cs.spatialHadoop.core.GridRecordWriter.write -> edu.umn.cs.spatialHadoop.core.GridRecordWriter.writeInternal, so the focus is on the implementation of writeInternal. The code is as follows:
protected synchronized void writeInternal(int cellIndex, S shape) throws IOException {
  if (cellIndex < 0) {
    // A special marker to close a cell
    closeCell(-cellIndex);
    return;
  }
  try {
    cellsMbr[cellIndex].expand(shape.getMBR());
  } catch (NullPointerException e) {
    e.printStackTrace();
  }
  // Convert shape to text
  text.clear();
  shape.toText(text);
  // Write text representation to the file
  OutputStream cellStream = getIntermediateCellStream(cellIndex);
  cellStream.write(text.getBytes(), 0, text.getLength());
  cellStream.write(NEW_LINE);
  intermediateCellSize[cellIndex] += text.getLength() + NEW_LINE.length;
  intermediateCellRecordCount[cellIndex]++;
}
- A negative cellIndex is a close marker. In the reduce phase of the grid-index MapReduce job, after all records belonging to a cell have been processed, a key-value pair with key -cellID is emitted to signal that the cell is finished; this branch performs the final processing for that cell via closeCell.
- The call to cellsMbr[cellIndex].expand(shape.getMBR()) maintains the minimum bounding rectangle of all data written to this cell so far.
- The shape is serialized into the reusable Text buffer.
- getIntermediateCellStream(cellIndex) returns the output stream for the cell's file: if the stream does not exist yet, it first resolves the output path and creates the stream; otherwise it returns the existing one.
- The text representation plus a newline is written to the stream, and the per-cell byte count and record count are updated.
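The close-marker convention can be sketched as follows (plain Java; MarkerDemo and dispatch are hypothetical names used only for illustration):

```java
public class MarkerDemo {
    // Mirror of writeInternal's dispatch: a negative index is a signal to
    // close cell (-cellIndex); a non-negative index is an ordinary write.
    static String dispatch(int cellIndex) {
        if (cellIndex < 0)
            return "close cell " + (-cellIndex);
        return "write to cell " + cellIndex;
    }

    public static void main(String[] args) {
        System.out.println(dispatch(5));   // write to cell 5
        System.out.println(dispatch(-5));  // close cell 5
    }
}
```

Encoding the close signal in the sign of the key keeps the write path a single method with no extra flag, at the cost of reserving negative indices.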
3.3 The closeCell function
The close-marker branch of writeInternal above calls closeCell; this section describes that function.
protected void closeCell(int cellIndex) throws IOException {
  CellInfo cell = cells != null ?
      cells[cellIndex] : new CellInfo(cellIndex + 1, cellsMbr[cellIndex]);
  if (expand)
    cell.expand(cellsMbr[cellIndex]);
  if (pack)
    cell = new CellInfo(cell.cellId, cell.getIntersection(cellsMbr[cellIndex]));
  closeCellBackground(intermediateCellPath[cellIndex],
      getFinalCellPath(cellIndex), intermediateCellStreams[cellIndex],
      masterFile, cell, intermediateCellRecordCount[cellIndex],
      intermediateCellSize[cellIndex]);
  cellsMbr[cellIndex] = new Rectangle(Double.MAX_VALUE, Double.MAX_VALUE,
      -Double.MAX_VALUE, -Double.MAX_VALUE);
  intermediateCellPath[cellIndex] = null;
  intermediateCellStreams[cellIndex] = null;
  intermediateCellRecordCount[cellIndex] = 0;
  intermediateCellSize[cellIndex] = 0;
}
- The CellInfo for cellIndex is looked up in cells (or, when cells is null, constructed from the accumulated MBR).
- If expand is set (determined by the index type in the constructor), the cell rectangle is expanded to also cover the MBR of the written data.
- If pack is set (also determined in the constructor), the cell is replaced by the intersection of its rectangle with that MBR.
- closeCellBackground performs the final processing on a dedicated thread, which is added to the closingThreads member: it flushes and closes the cell's output stream and updates the master file. It also removes any already-terminated threads from closingThreads and starts the newly created one.
- The per-cell state (MBR, path, stream, record count, byte size) is reset.
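The effect of the expand and pack flags on a cell's rectangle can be sketched with a minimal stand-in Rectangle (MbrDemo and Rect are illustrative names; the real class is edu.umn.cs.spatialHadoop.core.Rectangle):

```java
public class MbrDemo {
    // Minimal stand-in for the Rectangle used by GridRecordWriter;
    // (x1, y1) is the lower corner, (x2, y2) the upper corner.
    static class Rect {
        double x1, y1, x2, y2;
        Rect(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
        // expand: grow this rectangle so it also covers `o`
        void expand(Rect o) {
            x1 = Math.min(x1, o.x1); y1 = Math.min(y1, o.y1);
            x2 = Math.max(x2, o.x2); y2 = Math.max(y2, o.y2);
        }
        // pack: shrink to the overlap of this rectangle and `o`
        Rect intersection(Rect o) {
            return new Rect(Math.max(x1, o.x1), Math.max(y1, o.y1),
                            Math.min(x2, o.x2), Math.min(y2, o.y2));
        }
    }

    public static void main(String[] args) {
        Rect cell = new Rect(0, 0, 10, 10);      // the cell's grid rectangle
        Rect dataMbr = new Rect(2, 2, 12, 8);    // MBR of the written shapes
        Rect expanded = new Rect(cell.x1, cell.y1, cell.x2, cell.y2);
        expanded.expand(dataMbr);                // grows to (0,0)-(12,10)
        Rect packed = cell.intersection(dataMbr); // shrinks to (2,2)-(10,8)
        System.out.println(expanded.x2 + " " + packed.x1); // 12.0 2.0
    }
}
```

Expanded indexes guarantee every written shape is covered by its cell; packed indexes tighten the cell to the data actually stored, which helps pruning at query time.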
3.4 The close function
The call chain is edu.umn.cs.spatialHadoop.mapred.GridRecordWriter.close -> edu.umn.cs.spatialHadoop.core.GridRecordWriter.close, so the focus is on the implementation of close. The code is as follows:
public synchronized void close(Progressable progressable) throws IOException {
  // Close all output files
  for (int cellIndex = 0; cellIndex < intermediateCellStreams.length; cellIndex++) {
    if (intermediateCellStreams[cellIndex] != null) {
      closeCell(cellIndex);
    }
    // Indicate progress. Useful if closing a single cell takes a long time
    if (progressable != null)
      progressable.progress();
  }
  while (!closingThreads.isEmpty()) {
    try {
      Thread t = closingThreads.get(0);
      switch (t.getState()) {
        case NEW: t.start(); break;
        case TERMINATED: closingThreads.remove(0); break;
        default:
          // Use limited time join to indicate progress frequently
          t.join(10000);
      }
      // Indicate progress. Useful if closing a single cell takes a long time
      if (progressable != null)
        progressable.progress();
    } catch (InterruptedException e) {
      e.printStackTrace();
    }
  }
  if (masterFile != null)
    masterFile.close();
}
- The first loop closes every cell whose intermediate output stream is still open, reporting progress after each one.
- The while loop drains closingThreads: a thread in state NEW is started, a TERMINATED one is removed, and anything else is joined with a 10-second timeout so that progress can still be reported regularly.
- Finally, the master file is closed.
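The drain loop over closingThreads can be sketched in plain Java (DrainDemo and drain are hypothetical names; the join timeout is shortened from 10 s for illustration, and the progress callback is omitted):

```java
import java.util.ArrayList;
import java.util.List;

public class DrainDemo {
    // Same state machine as close(): start NEW threads, drop TERMINATED ones,
    // and bound the join on anything in between so the caller regains control.
    static int drain(List<Thread> closingThreads) throws InterruptedException {
        int finished = 0;
        while (!closingThreads.isEmpty()) {
            Thread t = closingThreads.get(0);
            switch (t.getState()) {
                case NEW:        t.start(); break;
                case TERMINATED: closingThreads.remove(0); finished++; break;
                default:         t.join(100);  // bounded wait, then re-check
            }
        }
        return finished;
    }

    public static void main(String[] args) throws InterruptedException {
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < 3; i++)
            threads.add(new Thread(() -> { /* stand-in for cell finalization */ }));
        System.out.println(drain(threads)); // 3
    }
}
```

The bounded join is the key design choice: it lets the record writer keep calling progress() while long-running cell finalizations complete, so the MapReduce framework does not kill the task as stalled.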