Preface
First, it should be noted that map-side tasks come in four kinds: Job-setup Task, Job-cleanup Task, Task-cleanup Task, and Map Task.
The Job-setup Task and Job-cleanup Task are the first and last tasks launched during a job's run; they handle job initialization and finalization respectively, such as creating and deleting the job's temporary output directory. The Task-cleanup Task cleans up data already written to the temporary directory after a task fails or is killed. The last kind, the Map Task, is the one that actually processes data and writes its results to local disk.
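For intuition, here is a minimal sketch of what the setup and cleanup tasks ultimately delegate to: the job's OutputCommitter (FileOutputCommitter by default, which creates and removes the temporary directory under the job's output path). The class name CommitterSketch, the method runJobLifecycle, and the idea of calling these methods from user code are illustrative assumptions, not how the Hadoop source invokes them.

import java.io.IOException;
import org.apache.hadoop.mapred.FileOutputCommitter;
import org.apache.hadoop.mapred.JobContext;
import org.apache.hadoop.mapred.OutputCommitter;

// Illustrative sketch only: in reality these calls happen inside
// runJobSetupTask()/runJobCleanupTask(), not in user code.
public class CommitterSketch {
  static void runJobLifecycle(JobContext jobContext) throws IOException {
    OutputCommitter committer = new FileOutputCommitter();
    // Job-setup Task: create the job's temporary output directory.
    committer.setupJob(jobContext);
    // ... map (and reduce) tasks run here; a Task-cleanup Task would
    // call committer.abortTask(...) for a failed or killed attempt ...
    // Job-cleanup Task: finalize output and delete the temporary directory.
    committer.cleanupJob(jobContext);
  }
}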
This article focuses on the most important of the four: the Map Task.
Source Code Analysis
The whole lifecycle of a Map Task is divided into five phases:
Read -----> Map -----> Collect -----> Spill -----> Combine
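Of these, only the Map phase runs user code. As a reference point for the source walkthrough below, here is a minimal new-API Mapper (a standard word-count-style example, not part of MapTask itself): the Read phase feeds it key/value pairs through a RecordReader, and each context.write() hands a result to the Collect phase.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Minimal user Mapper: map() is invoked once per record produced by the
// Read phase; context.write() passes output on to the Collect phase.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}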
Now let's dive straight into the source code.
/**
 * The main execution flow of a MapTask.
 */
@Override
public void run(final JobConf job, final TaskUmbilicalProtocol umbilical)
    throws IOException, ClassNotFoundException, InterruptedException {
  this.umbilical = umbilical;

  // Start the thread that reports task progress and handles
  // communication with the parent process.
  TaskReporter reporter = new TaskReporter(getProgress(), umbilical,
      jvmContext);
  reporter.startCommunicationThread();

  // Determine whether the new MapReduce API or the old API is in use.
  boolean useNewApi = job.getUseNewMapper();
  initialize(job, getJobID(), reporter, useNewApi);

  // Check if this is one of the special task types. Map-side tasks come in
  // four kinds: Job-setup Task, Job-cleanup Task, Task-cleanup Task and
  // Map Task.
  if (jobCleanup) {
    // Run as a Job-cleanup Task.
    runJobCleanupTask(umbilical, reporter);
    return;
  }
  if (jobSetup) {
    // Run as a Job-setup Task.
    runJobSetupTask(umbilical, reporter);
    return;
  }
  if (taskCleanup) {
    // Run as a Task-cleanup Task.
    runTaskCleanupTask(umbilical, reporter);
    return;
  }

  // None of the three special types applies, so run the actual Map Task,
  // dispatching on whether the new or old API is in use. We will follow
  // the new-API path; splitMetaInfo describes the input split.
  if (useNewApi) {
    runNewMapper(job, splitMetaInfo, umbilical, reporter);
  } else {
    runOldMapper(job, splitMetaInfo, umbilical, reporter);
  }
  done(umbilical, reporter);
}
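How does job.getUseNewMapper() end up returning true? On the client side, JobConf.setUseNewMapper() records the choice (when you submit through the new org.apache.hadoop.mapreduce.Job API, this flag is set for you during submission). A minimal sketch, assuming manual configuration; the class name NewApiConfigSketch is illustrative:

import org.apache.hadoop.mapred.JobConf;

public class NewApiConfigSketch {
  public static void main(String[] args) {
    // Explicitly select the new-API mapper path, which makes
    // MapTask.run() dispatch to runNewMapper() instead of runOldMapper().
    JobConf conf = new JobConf();
    conf.setUseNewMapper(true);                 // sets "mapred.mapper.new-api"
    System.out.println(conf.getUseNewMapper()); // prints: true
  }
}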
Next, let's look at the runNewMapper(job, splitMetaInfo, umbilical, reporter) method, which constructs a series of objects to assist in running the Mapper. Its code is as follows:
private <INKEY,INVALUE,OUTKEY,OUTVALUE>
void runNewMapper(final JobConf job,
                  final TaskSplitIndex splitIndex,
                  final TaskUmbilicalProtocol umbilical,
                  TaskReporter reporter
                  ) throws IOException, ClassNotFoundException,
                           InterruptedException {
  // Make a task context so we can get the classes.
  // TaskAttemptContext extends JobContext, adding task-specific
  // information. Through taskContext we can obtain many classes related
  // to task execution, such as the user-defined Mapper class, the
  // InputFormat class, and so on.
  org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
    new org.apache.hadoop.mapreduce.TaskAttemptContext(job, getTaskID());
  // Make a mapper: instantiate the user-defined Mapper class.
  org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =
    (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)
      ReflectionUtils.newInstance(taskContext.getMapperClass(), job);
  // Make the input format: instantiate the user-specified InputFormat class.
  org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
    (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
      ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
  // Rebuild the InputSplit.
  org.apache.hadoop.mapreduce.InputSplit split = null;
  split = getSplitDetails(new Path(splitIndex.getSplitLocation()),
      splitIndex.getStartOffset());
  // Create the RecordReader from the InputFormat object; the default is
  // LineRecordReader, wrapped here by NewTrackingRecordReader.
  org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
    new NewTrackingRecordReader<INKEY,INVALUE>
      (split, inputFormat, reporter, job, taskContext);
  job.setBoolean("mapred.skip.on", isSkipping());
  // Create the RecordWriter object.
  org.apache.hadoop.mapreduce.RecordWriter output = null;
  org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context
       mapperContext = null;
  try {
    Constructor<org.apac