Now let's look at how the job.split file is generated, starting with the source of the writeSplits() function:
int maps = writeSplits(job, submitJobDir);

private int writeSplits(org.apache.hadoop.mapreduce.JobContext job,
    Path jobSubmitDir) throws IOException,
    InterruptedException, ClassNotFoundException {
  JobConf jConf = (JobConf) job.getConfiguration();
  int maps;
  if (jConf.getUseNewMapper()) {
    maps = writeNewSplits(job, jobSubmitDir);
  } else {
    maps = writeOldSplits(jConf, jobSubmitDir);
  }
  return maps;
}
The key call is writeSplits(), which in turn invokes writeNewSplits(JobContext job, Path jobSubmitDir) (or writeOldSplits(JobConf job, Path jobSubmitDir) when the old mapper API is in use). That method obtains the configured InputFormat via reflection, then calls its getSplits() function to compute the input splits. For file-based input, the actual splitting is implemented in the FileInputFormat class; FileSplit is one implementation of InputSplit.
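FileInputFormat derives each file's split size from the configured minimum and maximum split sizes and the HDFS block size, then carves the file into ranges of that size. The following is a simplified, Hadoop-free sketch of that logic; the method names mirror FileInputFormat.computeSplitSize() and the per-file loop in getSplits(), but this is an illustration, not the real class:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Mirrors FileInputFormat.computeSplitSize: clamp the block size
    // between the configured minimum and maximum split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Simplified version of the per-file loop in getSplits():
    // carve the file into {offset, length} ranges of splitSize each.
    // Hadoop allows the last split to exceed splitSize by a slop
    // factor of 1.1 rather than emitting a tiny trailing split.
    static List<long[]> splitFile(long fileLength, long splitSize) {
        final double SPLIT_SLOP = 1.1;
        List<long[]> splits = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[]{fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits.add(new long[]{fileLength - remaining, remaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        // With default min/max, the 128 MB block size wins.
        long splitSize = computeSplitSize(128L << 20, 1L, Long.MAX_VALUE);
        // A 300 MB file becomes 3 splits: 128 MB + 128 MB + 44 MB.
        List<long[]> splits = splitFile(300L << 20, splitSize);
        System.out.println("splitSize=" + splitSize
            + " numSplits=" + splits.size());
    }
}
```

So with the defaults, the number of map tasks for a file is roughly its length divided by the block size, which is why tuning the minimum/maximum split size changes the map count.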
We can see this from the following code:
private <T extends InputSplit>
int writeNewSplits(JobContext job, Path jobSubmitDir) throws IOException,
    InterruptedException, ClassNotFoundException {
  Configuration conf = job.getConfiguration();
  InputFormat<?, ?> input =
      ReflectionUtils.newInstance(job.getInputFormatClass(), conf);
  List<InputSplit> splits = input.getSplits(job);
  T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);
  // sort the splits into order based on size, so that the biggest
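As the trailing comment indicates, after getSplits() returns, the split array is sorted largest-first so that the biggest splits are scheduled earliest. The ordering itself can be sketched without Hadoop; here plain long values stand in for InputSplit.getLength(), and the descending comparator plays the role of the split comparator used during submission:

```java
import java.util.Arrays;

public class SortSplitsSketch {
    public static void main(String[] args) {
        // Stand-ins for InputSplit.getLength() values, in bytes.
        Long[] splitLengths = {64L, 512L, 128L, 256L};
        // Sort descending so the biggest split comes first.
        Arrays.sort(splitLengths, (a, b) -> Long.compare(b, a));
        System.out.println(Arrays.toString(splitLengths));
    }
}
```

Scheduling the largest splits first helps the longest-running map tasks start early, which shortens the job's overall tail.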