The source code of Job.class:
public void submit()
       throws IOException, InterruptedException, ClassNotFoundException {
  ensureState(JobState.DEFINE);
  setUseNewAPI();
  connect();
  final JobSubmitter submitter =
      getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
  status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
    public JobStatus run() throws IOException, InterruptedException,
        ClassNotFoundException {
      return submitter.submitJobInternal(Job.this, cluster);
    }
  });
  state = JobState.RUNNING;
  LOG.info("The url to track the job: " + getTrackingURL());
}
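For context, submit() is normally reached from user driver code. A minimal driver sketch, assuming a hypothetical class name WordCountDriver and argument-supplied paths (these are not from the source above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // waitForCompletion(true) calls the submit() shown above and then
    // polls the job status; calling job.submit() directly would return
    // right after submission without blocking.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}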
The submitter.submitJobInternal call whose result is returned at the end needs to do five things:
The job submission process involves:
- Checking the input and output specifications of the job.
- Computing the InputSplits for the job.
- Setting up the requisite accounting information for the DistributedCache of the job, if necessary.
- Copying the job's jar and configuration to the map-reduce system directory on the distributed file-system.
- Submitting the job to the JobTracker and optionally monitoring its status.
In short: check the paths; compute the splits; set up the resources; copy the jar; submit to the JobTracker.
Since YARN, that last hop has changed: jobs are no longer submitted to a JobTracker (the ResourceManager has taken over that role).
Inside the corresponding JobSubmitter class there is this line, which writes out the split information:
int maps = writeSplits(job, submitJobDir);
Stepping into that method:
private int writeSplits(org.apache.hadoop.mapreduce.JobContext job,
    Path jobSubmitDir) throws IOException,
    InterruptedException, ClassNotFoundException {
  JobConf jConf = (JobConf)job.getConfiguration();
  int maps;
  if (jConf.getUseNewMapper()) {
    maps = writeNewSplits(job, jobSubmitDir);
  } else {
    maps = writeOldSplits(jConf, jobSubmitDir);
  }
  return maps;
}
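As an aside, getUseNewMapper() just reads a boolean flag that setUseNewAPI() set near the top of submit(); a minimal sketch of that check (property name taken from JobConf):

import org.apache.hadoop.conf.Configuration;

// The flag that routes writeSplits() to the new-API path; Job.setUseNewAPI()
// sets it during submit() when the new org.apache.hadoop.mapreduce API is used.
static boolean usesNewMapper(Configuration conf) {
  return conf.getBoolean("mapred.mapper.new-api", false);
}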
Step into writing the new splits, maps = writeNewSplits(job, jobSubmitDir), and we find:
private <T extends InputSplit>
int writeNewSplits(JobContext job, Path jobSubmitDir) throws IOException,
    InterruptedException, ClassNotFoundException {
  Configuration conf = job.getConfiguration();
  InputFormat<?, ?> input =
      ReflectionUtils.newInstance(job.getInputFormatClass(), conf);

  List<InputSplit> splits = input.getSplits(job);
  T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);

  // sort the splits into order based on size, so that the biggest
  // go first
  Arrays.sort(array, new SplitComparator());
  JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
      jobSubmitDir.getFileSystem(conf), array);
  return array.length;
}
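Note the sort: the biggest splits are placed first so the longest map tasks can be scheduled earliest. SplitComparator itself is a private inner class of JobSubmitter; an equivalent descending-by-length comparator, sketched here for illustration only:

import java.io.IOException;
import java.util.Comparator;
import org.apache.hadoop.mapreduce.InputSplit;

// Equivalent of SplitComparator: order splits biggest-first.
Comparator<InputSplit> bySizeDescending = new Comparator<InputSplit>() {
  public int compare(InputSplit a, InputSplit b) {
    try {
      return Long.compare(b.getLength(), a.getLength()); // descending
    } catch (IOException | InterruptedException e) {
      throw new RuntimeException("cannot determine split length", e);
    }
  }
};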
Notice that the job object carries the configuration, which is pulled out first.
The next statement uses reflection.
Reflection, an advanced Java feature, means that at runtime a program can discover all the fields and methods of any class, and can invoke any method or access any field of any object; this ability to obtain information and invoke methods dynamically is known as the Java reflection mechanism.
InputFormat<?, ?> input =
    ReflectionUtils.newInstance(job.getInputFormatClass(), conf);
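As a standalone illustration of the mechanism (generic Java, not Hadoop code; the class looked up is an arbitrary example):

import java.lang.reflect.Method;

public class ReflectionDemo {
  public static void main(String[] args) throws Exception {
    // discover the class at runtime from its name
    Class<?> clazz = Class.forName("java.util.ArrayList");
    // instantiate it without any compile-time "new"
    Object list = clazz.getDeclaredConstructor().newInstance();
    // look up and invoke a method dynamically
    Method add = clazz.getMethod("add", Object.class);
    add.invoke(list, "hello");
    System.out.println(list); // prints [hello]
  }
}

ReflectionUtils.newInstance(job.getInputFormatClass(), conf) does essentially this: it instantiates whatever class the configuration names, then hands it the Configuration if the class is Configurable.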
The job.getInputFormatClass() method here comes from Job's parent class JobContextImpl; if you never set an input format yourself, it defaults to TextInputFormat:
public Class<? extends InputFormat<?,?>> getInputFormatClass()
    throws ClassNotFoundException {
  return (Class<? extends InputFormat<?,?>>)
      conf.getClass(INPUT_FORMAT_CLASS_ATTR, TextInputFormat.class);
}
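So to use anything other than the TextInputFormat default, set it explicitly in the driver; for example, with the stock KeyValueTextInputFormat:

import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

// If this is never called, getInputFormatClass() above falls back to
// TextInputFormat.class.
job.setInputFormatClass(KeyValueTextInputFormat.class);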
Now step into the getSplits method, which produces the splits:
List<InputSplit> splits = input.getSplits(job);
TextInputFormat's parent class is in fact FileInputFormat, which, like many other format classes, is a subclass of InputFormat.
getSplits is therefore really a method of the parent class FileInputFormat, so we step into FileInputFormat:
public List<InputSplit> getSplits(JobContext job) throws IOException {
  Stopwatch sw = new Stopwatch().start();
  long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job)); // 1
  long maxSize = getMaxSplitSize(job);                                    // 2

  // generate splits
  List<InputSplit> splits = new ArrayList<InputSplit>();
  List<FileStatus> files = listStatus(job);                               // 3
  for (FileStatus file: files) {                                          // 4
    Path path = file.getPath();
    long length = file.getLen();
    if (length != 0) {
      BlockLocation[] blkLocations;
      if (file instanceof LocatedFileStatus) {
        blkLocations = ((LocatedFileStatus) file).getBlockLocations();
      } else {
        FileSystem fs = path.getFileSystem(job.getConfiguration());
        blkLocations = fs.getFileBlockLocations(file, 0, length);         // 5
      }
      if (isSplitable(job, path)) {                                       // 6
        long blockSize = file.getBlockSize();
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

        long bytesRemaining = length;
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining); // 7
          splits.add(makeSplit(path, length-bytesRemaining, splitSize,       // 8
                     blkLocations[blkIndex].getHosts(),
                     blkLocations[blkIndex].getCachedHosts()));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                     blkLocations[blkIndex].getHosts(),
                     blkLocations[blkIndex].getCachedHosts()));
        }
      } else { // not splitable
        splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
                   blkLocations[0].getCachedHosts()));
      }
    } else {
      // Create empty hosts array for zero length files
      splits.add(makeSplit(path, 0, length, new String[0]));
    }
  }
  // Save the number of input files for metrics/loadgen
  job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
  sw.stop();
  if (LOG.isDebugEnabled()) {
    LOG.debug("Total # of splits generated by getSplits: " + splits.size()
        + ", TimeTaken: " + sw.elapsedMillis());
  }
  return splits;
}
Markers 1 and 2 fetch the minimum and maximum split sizes; if not configured, they default to 1 and a very large value (Long.MAX_VALUE), respectively.
Marker 3 asks HDFS for all files under the input paths, returned as a list; the enhanced for loop at marker 4 then walks that list, computing splits file by file.
Marker 5 fetches the locations of the blocks the file is stored in.
At marker 6, if the file is splittable, the block size and the split size are obtained. The split size,
computeSplitSize(blockSize, minSize, maxSize);
is derived by comparison against the block size: by default a split is exactly one block, 128 MB. Raising minSize above 128 MB effectively enlarges the splits; lowering maxSize below 128 MB effectively shrinks them.
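For reference, computeSplitSize really is a one-liner in FileInputFormat, which is exactly why the two knobs behave this way: it clamps the block size into [minSize, maxSize]:

protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
  // defaults: minSize = 1, maxSize = Long.MAX_VALUE => returns blockSize
  return Math.max(minSize, Math.min(maxSize, blockSize));
}

The knobs can be set from the driver through FileInputFormat's helpers (on Hadoop 2.x these write mapreduce.input.fileinputformat.split.minsize / .maxsize); the sizes below are illustrative:

// either enlarge splits...
FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024); // 256 MB
// ...or shrink them (use one or the other, not both)
FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);  // 64 MB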
The while loop at marker 7 looks up the block index for the split's starting offset:
protected int getBlockIndex(BlockLocation[] blkLocations, long offset) {
  for (int i = 0; i < blkLocations.length; i++) {
    // is the offset inside this block?
    if ((blkLocations[i].getOffset() <= offset) &&
        (offset < blkLocations[i].getOffset() + blkLocations[i].getLength())) {
      return i;
    }
  }
  BlockLocation last = blkLocations[blkLocations.length - 1];
  long fileLength = last.getOffset() + last.getLength() - 1;
  throw new IllegalArgumentException("Offset " + offset +
      " is outside of file (0.." + fileLength + ")");
}
The condition (offset < blkLocations[i].getOffset() + blkLocations[i].getLength()) is the important one.
It means the split's starting offset falls inside this block, so this block's index is returned, and the block's location information can then feed the rest of the split's construction.
In the normal default case a split equals one block, so the returned block indices are consecutive, like 1, 2, 3, 4.
If a split spans two or more blocks, the first block's location is handed to the map task: the first block is read locally, and the data of the remaining blocks is pulled across the network toward the node holding the first block. The indices might then look like 1, 4, 7.
If a split is smaller than a block, the block's location goes to the first map, and subsequent splits inside the same block receive the same location information (identical hosts in the returned splits), which also helps later computation largely avoid contention. The indices might look like 1, 1, 1, 2, 2, 2.
In short, the index found is that of the block containing the current split's starting offset.
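A quick worked example: with 128 MB blocks at offsets 0, 128, 256 and 384 MB, a split starting at offset 256 MB satisfies 256 <= 256 < 256 + 128, so getBlockIndex returns 2, and that block's hosts become the split's preferred locations.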
Marker 8 creates the split:
protected FileSplit makeSplit(Path file, long start, long length,
    String[] hosts, String[] inMemoryHosts) {
  return new FileSplit(file, start, length, hosts, inMemoryHosts);
}
Its key parameters are the file path, the split's starting offset, the split length, and the location info of the split's first block (the hosts, plus the cached in-memory hosts):
splits.add(makeSplit(path, length - bytesRemaining, splitSize,
    blkLocations[blkIndex].getHosts(),
    blkLocations[blkIndex].getCachedHosts()));
So after the block index is obtained, makeSplit builds each entry of splits.
The number of entries in this split list is exactly the number of map tasks.
Take a 400 MB file with a 128 MB block size as an example. The final splits would look roughly like (offsets and lengths in MB, host lists illustrative):
[
  block1: [path, 0,   128, [1,2,3]],
  block2: [path, 128, 128, [2,3,6]],
  block3: [path, 256, 128, [6,7,9]],
  block4: [path, 384, 16,  [1,5,7]]
]
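Tracing the while loop for this file confirms the 16 MB tail (SPLIT_SLOP is 1.1 in FileInputFormat): bytesRemaining starts at 400 MB; 400/128 ≈ 3.13 > 1.1, cut at 0; 272/128 ≈ 2.13, cut at 128; 144/128 ≈ 1.125 > 1.1, cut at 256; now 16/128 = 0.125 ≤ 1.1, so the loop exits and the if (bytesRemaining != 0) branch emits the final 16 MB split at offset 384. Had the remainder been within 10% of one split size (say a 140 MB file), no cut would occur and a single slightly oversized split would be produced instead.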
Client-side summary:
It completes the job configuration.
It computes the split information on which the subsequent computation is based.
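To see this split list without actually submitting a job, you can call getSplits() directly; a minimal sketch, assuming an input path passed as args[0]:

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitInspector {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration());
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // the same call the submitter makes: one entry per future map task
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    System.out.println("map count = " + splits.size());
    for (InputSplit split : splits) {
      System.out.println(split); // FileSplit prints path:start+length
    }
  }
}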