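FileInputFormat split mechanism
The getSplits method of FileInputFormat
This example analyzes a single 1 GB file. With a block size of 128 MB, the file occupies 8 blocks (1024 MB / 128 MB = 8).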
public List<InputSplit> getSplits(JobContext job) throws IOException {
    long minSize = Math.max(this.getFormatMinSplitSize(), getMinSplitSize(job)); // defaults to 1L when not configured
    long maxSize = getMaxSplitSize(job); // defaults to 9223372036854775807L (Long.MAX_VALUE) when not configured
    List<InputSplit> splits = new ArrayList<>(); // the split list that is finally returned
    List<FileStatus> files = this.listStatus(job); // input files of the job; in this example a single 1 GB file
    for (FileStatus file : files) { // visits the 1 GB file
        Path path = file.getPath(); // file path
        long length = file.getLen(); // file size (1 GB)
        if (length != 0L) {
            BlockLocation[] blkLocations;
            if (file instanceof LocatedFileStatus) {
                blkLocations = ((LocatedFileStatus) file).getBlockLocations();
            } else {
                FileSystem fs = path.getFileSystem(job.getConfiguration());
                // Locations of the file's blocks; the 1 GB file has 8 blocks.
                blkLocations = fs.getFileBlockLocations(file, 0L, length);
            }
            if (this.isSplitable(job, path)) {
                long blockSize = file.getBlockSize(); // block size (128 MB)
                /**
                 * Split size = Math.max(minSize, Math.min(maxSize, blockSize)).
                 * With the defaults: Math.max(1L, Math.min(9223372036854775807L, 128 MB)) = 128 MB.
                 */
                long splitSize = this.computeSplitSize(blockSize, minSize, maxSize);
                long bytesRemaining; // bytes of the file not yet assigned to a split
                int blkIndex; // index of the block that contains the split's start offset
                /**
                 * Each iteration cuts one split of splitSize bytes. The loop stops as soon as
                 * the remaining size is no more than 1.1 times the split size (128 MB * 1.1).
                 * For the 1 GB file: 7/8 GB remain after the first cut, 6/8 GB after the second,
                 * and so on. The loop cuts 7 times and leaves 128 MB, because 128 / 128 = 1.0
                 * is not greater than 1.1.
                 */
                for (bytesRemaining = length; (double) bytesRemaining / (double) splitSize > 1.1D; bytesRemaining -= splitSize) {
                    blkIndex = this.getBlockIndex(blkLocations, length - bytesRemaining); // block index for this offset
                    // Record the split that was just cut.
                    splits.add(this.makeSplit(path, length - bytesRemaining, splitSize, blkLocations[blkIndex].getHosts()));
                }
                /**
                 * Whatever is left becomes the final split. Example: a 200 MB file is cut once,
                 * leaving 200 - 128 = 72 MB; 72 < 128 * 1.1, so bytesRemaining = 72 MB.
                 * For the 1 GB file the leftover 128 MB becomes the 8th split.
                 */
                if (bytesRemaining != 0L) {
                    blkIndex = this.getBlockIndex(blkLocations, length - bytesRemaining);
                    splits.add(this.makeSplit(path, length - bytesRemaining, bytesRemaining, blkLocations[blkIndex].getHosts()));
                }
            } else {
                // A file that cannot be split becomes a single split covering the whole file.
                splits.add(this.makeSplit(path, 0L, length, blkLocations[0].getHosts()));
            }
        } else {
            // An empty file still yields one split, with an empty host list.
            splits.add(this.makeSplit(path, 0L, length, new String[0]));
        }
    }
    job.getConfiguration().setLong("mapreduce.input.fileinputformat.numinputfiles", (long) files.size());
    LOG.debug("Total # of splits: " + splits.size());
    return splits;
}
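The arithmetic in the listing above can be checked without a Hadoop cluster. The following is a minimal sketch, not Hadoop code: the class name SplitSketch and the helper planSplits are hypothetical names introduced only for illustration, and block/host lookup is left out. It reuses the split-size formula and the 1.1 threshold from the listing and prints the splits for a 1 GB, a 200 MB, and a 140 MB file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; it only mirrors the arithmetic of getSplits.
class SplitSketch {
    // Same 1.1 threshold that appears in the loop condition of the listing.
    static final double SPLIT_SLOP = 1.1;

    // Formula quoted in the listing: max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns {offset, length} pairs for a splittable file of the given length.
    static List<long[]> planSplits(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        if (length == 0L) {
            // As in the listing, an empty file still produces one (empty) split.
            splits.add(new long[] {0L, 0L});
            return splits;
        }
        long bytesRemaining = length;
        // Cut full-size splits while more than 1.1 split sizes remain.
        while ((double) bytesRemaining / (double) splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        // The leftover (at most 1.1 split sizes) becomes the last split.
        if (bytesRemaining != 0L) {
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        final long MB = 1024L * 1024L;
        // Default settings assumed: minSize = 1, maxSize = Long.MAX_VALUE, block size = 128 MB.
        long splitSize = computeSplitSize(128 * MB, 1L, Long.MAX_VALUE);
        for (long sizeMb : new long[] {1024, 200, 140}) {
            List<long[]> splits = planSplits(sizeMb * MB, splitSize);
            System.out.println(sizeMb + " MB file: " + splits.size() + " split(s)");
            for (long[] s : splits) {
                System.out.println("  offset=" + (s[0] / MB) + " MB, length=" + (s[1] / MB) + " MB");
            }
        }
    }
}
```

Under these assumptions the 1 GB file yields 8 splits of 128 MB, the 200 MB file yields 2 splits (128 MB and 72 MB), and the 140 MB file stays in a single split because 140 / 128 is below 1.1. The formula also shows how the two settings move the split size: a maxSize below the block size shrinks the splits, and a minSize above the block size enlarges them.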