I need to process SDF files with Hadoop Streaming. An SDF file looks like this:
1
-OEChem-12181003042D
.....
$$$$
— that is, a sequence of multi-line records, each terminated by a line of $$$$.
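For reference, one complete record is schematically laid out like this (the data field name below is invented, not taken from a real file):

molecule name
-OEChem-12181003042D
(comment line)
(counts line, atom block, bond block)
M  END
> <SOME_DATA_FIELD>
value

$$$$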
Hadoop's default splitting, however, is purely block-based (FileInputFormat.java, getSplits):
for (FileStatus file: files) {
  Path path = file.getPath();
  FileSystem fs = path.getFileSystem(job.getConfiguration());
  long length = file.getLen();
  BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
  if ((length != 0) && isSplitable(job, path)) {
    long blockSize = file.getBlockSize();
    long splitSize = computeSplitSize(blockSize, minSize, maxSize);

    // Carve the file into splitSize-byte pieces purely by offset --
    // record boundaries are never consulted here.
    long bytesRemaining = length;
    while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
      int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
      splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
                               blkLocations[blkIndex].getHosts()));
      bytesRemaining -= splitSize;
    }

    if (bytesRemaining != 0) {
      splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
                               blkLocations[blkLocations.length-1].getHosts()));
    }
  } else if (length != 0) {
    // Unsplittable file: one split covering the whole file.
    splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
  } else {
    // Create empty hosts array for zero length files
    splits.add(new FileSplit(path, 0, length, new String[0]));
  }
}
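As the branches above show, the byte-based carving only happens when isSplitable(job, path) returns true; the else-if branch turns an unsplittable file into a single split. That suggests the crudest fix: a subclass that refuses to split, so each mapper sees whole files and a streaming script can safely accumulate lines until it hits $$$$. A minimal sketch (the class name is my own):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Crude workaround: declare SDF files unsplittable, so the "else if"
// branch of getSplits() above always yields one split per file and no
// $$$$-terminated record can be cut in half.
public class WholeFileSdfInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}

That throws away parallelism (and data locality) within large files, though, which is why fixing the record boundaries themselves is the better route.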
My first thought was to modify the getSplits method in FileInputFormat.java directly, but having only just started with Java and Hadoop, I needed other examples to learn from. I found the source of NLineInputFormat, whose getSplits produces splits of a given number of lines, which gave me a first lead. Then it occurred to me that TextInputFormat reads input line by line, so it must do some extra work when reading a split — and indeed it does: on top of the byte-based splits, it skips the partial line at the start of a split and reads one extra line past the end of the split to compensate.
LineRecordReader.java:
FileSystem fs = file.getFileSystem(job);
FSDataInputStream fileIn = fs.open(split.getPath());
boolean skipFirstLine = false;
if (codec != null) {
  // Compressed input cannot be split: read the whole stream.
  in = new LineReader(codec.createInputStream(fileIn), job);
  end = Long.MAX_VALUE;
} else {
  if (start != 0) {
    // Not the first split: back up one byte and discard the first
    // (partial) line -- the previous split reads it by running past
    // its own end.
    skipFirstLine = true;
    --start;
    fileIn.seek(start);
  }
  in = new LineReader(fileIn, job);
}
if (skipFirstLine) {  // skip first line and re-establish "start".
  start += in.readLine(new Text(), 0,
                       (int)Math.min((long)Integer.MAX_VALUE, end - start));
}
this.pos = start;
So all I had to do was rewrite TextInputFormat (together with its record reader) along the same lines.
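To make that concrete, here is a minimal sketch of such a rewrite. All names are mine, and this is not the code behind the link below; it assumes the terminator is a line containing exactly $$$$, ignores compressed input, and glosses over pathological cases such as an ordinary line that happens to end in $$$$ straddling a split boundary:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// Treats one $$$$-terminated SDF molecule as one record: key = byte
// offset of the record, value = the whole record text.
public class SdfInputFormat extends FileInputFormat<LongWritable, Text> {
  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new SdfRecordReader();
  }
}

class SdfRecordReader extends RecordReader<LongWritable, Text> {
  private static final String DELIMITER = "$$$$";

  private long start, pos, end;
  private LineReader in;
  private final LongWritable key = new LongWritable();
  private final Text value = new Text();
  private final Text line = new Text();

  @Override
  public void initialize(InputSplit genericSplit, TaskAttemptContext context)
      throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration conf = context.getConfiguration();
    start = split.getStart();
    end = start + split.getLength();
    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(conf);
    FSDataInputStream fileIn = fs.open(file);
    fileIn.seek(start);
    in = new LineReader(fileIn, conf);
    pos = start;
    // Same trick as LineRecordReader, lifted from lines to records: a
    // split that does not begin at byte 0 starts somewhere inside a
    // record, so discard lines up to and including the next $$$$ line;
    // the previous split completes that record by reading past its end.
    if (start != 0) {
      int size;
      while ((size = in.readLine(line)) > 0) {
        pos += size;
        if (DELIMITER.equals(line.toString())) {
          break;
        }
      }
    }
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    // Only begin a record whose first byte lies at or before the split
    // end; once begun, read on past the end until $$$$ is seen -- that
    // is exactly the record the next split's initialize() skipped.
    if (pos > end) {
      return false;
    }
    key.set(pos);
    StringBuilder record = new StringBuilder();
    int size;
    while ((size = in.readLine(line)) > 0) {
      pos += size;
      record.append(line.toString()).append('\n');
      if (DELIMITER.equals(line.toString())) {
        break;
      }
    }
    if (record.length() == 0) {
      return false;  // end of file
    }
    value.set(record.toString());
    return true;
  }

  @Override public LongWritable getCurrentKey() { return key; }
  @Override public Text getCurrentValue() { return value; }

  @Override
  public float getProgress() {
    return (end == start) ? 0.0f
        : Math.min(1.0f, (pos - start) / (float) (end - start));
  }

  @Override
  public void close() throws IOException {
    if (in != null) in.close();
  }
}

In a regular Java job this would be wired in with job.setInputFormatClass(SdfInputFormat.class). One caveat for streaming: as far as I can tell, Hadoop Streaming in 1.x takes its -inputformat class from the old org.apache.hadoop.mapred API, so for streaming the same idea has to be expressed against that API; I used the new org.apache.hadoop.mapreduce classes above only to match the snippets quoted in this post.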
Here is the link to the source code: source code
Environment: Eclipse 3.7 + hadoop-1.0.0 + 64-bit Gentoo.