Source code analysis of the FileInputFormat split mechanism:
The relevant source code is as follows:
The getSplits() method of the FileInputFormat class:
/**
 * Generate the list of files and make them into FileSplits.
 */
public List<InputSplit> getSplits(JobContext job
                                  ) throws IOException {
  long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
  long maxSize = getMaxSplitSize(job); // defaults to Long.MAX_VALUE

  // generate splits
  List<InputSplit> splits = new ArrayList<InputSplit>();
  List<FileStatus> files = listStatus(job);
  for (FileStatus file: files) {
    Path path = file.getPath();
    FileSystem fs = path.getFileSystem(job.getConfiguration());
    long length = file.getLen();
    BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
    if ((length != 0) && isSplitable(job, path)) {
      long blockSize = file.getBlockSize(); // the block size of this file
      long splitSize = computeSplitSize(blockSize, minSize, maxSize); // compute splitSize

      long bytesRemaining = length;
      while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
        int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
        splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
                                 blkLocations[blkIndex].getHosts()));
        bytesRemaining -= splitSize;
      }

      if (bytesRemaining != 0) {
        splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
                   blkLocations[blkLocations.length-1].getHosts()));
      }
    } else if (length != 0) {
      splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
    } else {
      //Create empty hosts array for zero length files
      splits.add(new FileSplit(path, 0, length, new String[0]));
    }
  }

  // Save the number of input files in the job-conf
  job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());

  LOG.debug("Total # of splits: " + splits.size());
  return splits;
}
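The while loop in getSplits() is easier to follow with concrete numbers. Below is a minimal stand-alone sketch of that loop only, not Hadoop code: the class SplitLoopSketch and the method split() are hypothetical names, the 128 MB split size is just an example value, and SPLIT_SLOP is set to 1.1 on the assumption that it matches the constant in the Hadoop version quoted here. It returns {offset, length} pairs instead of FileSplit objects and covers only the non-empty, splittable branch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone class for illustration; not part of Hadoop.
class SplitLoopSketch {
    // Assumed to match FileInputFormat.SPLIT_SLOP: the last split may be up to 10% larger.
    static final double SPLIT_SLOP = 1.1;

    /** Returns {offset, length} pairs for one non-empty, splittable file. */
    static List<long[]> split(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        // Cut full-size splits while more than 1.1 * splitSize bytes remain.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        // Whatever is left (at most 1.1 * splitSize) becomes the final split.
        if (bytesRemaining != 0) {
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 130 MB / 128 MB is about 1.02, not above 1.1: one split of 130 MB.
        System.out.println(split(130 * mb, 128 * mb).size());
        // 260 MB: one 128 MB split, then the remaining 132 MB as a single final split.
        System.out.println(split(260 * mb, 128 * mb).size());
    }
}
```

The point of SPLIT_SLOP is visible in the first case: a file only slightly larger than splitSize is not cut into one full split plus a tiny tail, it stays in a single split.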
Computing splitSize mainly relies on:
protected long computeSplitSize(long blockSize, long minSize,
                                long maxSize) {
  // With the default minSize and maxSize, this returns the block size
  return Math.max(minSize, Math.min(maxSize, blockSize));
}
which is called as:
long splitSize = computeSplitSize(blockSize, minSize, maxSize);
minSize is computed as:
long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
which uses the following two methods:
protected long getFormatMinSplitSize() {
  return 1;
}
public static long getMinSplitSize(JobContext job) {
  return job.getConfiguration().getLong("mapred.min.split.size", 1L);
}
maxSize is computed as:
long maxSize = getMaxSplitSize(job); // defaults to Long.MAX_VALUE
which uses this method:
public static long getMaxSplitSize(JobContext context) {
  return context.getConfiguration().getLong("mapred.max.split.size",
                                            Long.MAX_VALUE);
}
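So when neither mapred.min.split.size nor mapred.max.split.size is set, minSize is 1 and maxSize is Long.MAX_VALUE. computeSplitSize() then evaluates to max(1, min(Long.MAX_VALUE, blockSize)), so by default splitSize equals blockSize.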
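To see how the two settings move splitSize away from the default, here is a small stand-alone sketch that copies the computeSplitSize() formula quoted above into a hypothetical class, SplitSizeSketch. The 128 MB block size and the 64 MB / 256 MB limits are example values chosen for illustration, not Hadoop defaults taken from the source.

```java
// Hypothetical stand-alone class for illustration; not part of Hadoop.
class SplitSizeSketch {
    // Same formula as the computeSplitSize() method quoted above.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb; // example block size (assumption)

        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): splitSize == blockSize.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE) / mb); // prints 128
        // maxSize below blockSize: splits shrink to maxSize.
        System.out.println(computeSplitSize(blockSize, 1L, 64 * mb) / mb); // prints 64
        // minSize above blockSize: splits grow to minSize.
        System.out.println(computeSplitSize(blockSize, 256 * mb, Long.MAX_VALUE) / mb); // prints 256
    }
}
```

In short, lowering the maximum below the block size yields smaller splits (more map tasks), while raising the minimum above the block size yields larger splits (fewer map tasks).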