Preface
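The Reducer pulls many small files from the Mapper tasks. Each small file is sorted internally, but the files are not ordered relative to one another. The Reducer merges these small files with a merge-sort style merge, which produces one globally sorted file.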
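The merge described above can be sketched as a k-way merge over runs that are already sorted. This is only an illustration of the idea, not Hadoop's actual merger: the class name KWayMergeSketch, the Cursor helper, and the in-memory string lists are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration: merges internally sorted runs into one sorted list.
public class KWayMergeSketch {
    // One cursor per run: the current head element plus the remainder of the run.
    private static final class Cursor {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    public static List<String> merge(List<List<String>> sortedRuns) {
        // The heap always exposes the run whose head element is smallest.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        for (List<String> run : sortedRuns) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run.iterator()));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();
            merged.add(smallest.head);
            // Advance the run we just consumed from and put it back on the heap.
            if (smallest.rest.hasNext()) {
                smallest.head = smallest.rest.next();
                heap.add(smallest);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> runs = Arrays.asList(
            Arrays.asList("a", "d", "g"),
            Arrays.asList("b", "d", "e"),
            Arrays.asList("c", "f"));
        System.out.println(merge(runs)); // prints [a, b, c, d, d, e, f, g]
    }
}
```

Because each run is already sorted, the merge only ever compares the head elements of the runs, so no full re-sort of the data is needed.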
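A Reducer has three main phases:
1. Shuffle phase
The Reducer copies the sorted output of the Mappers to itself over the network.
2. Sort phase
- The reducer input is sorted by key (different mappers may have emitted the same key).
- The shuffle and sort phases happen at the same time: map outputs are merged while they are being fetched.
3. Reduce phase
In this phase, the reduce(Object, Iterable, Reducer.Context) method is called once for each group (a key and its values) in the sorted input. The output of the reduce task is typically written to a RecordWriter via Reducer.Context.write(Object, Object). The output of the Reducer is not re-sorted.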
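The reduce phase can be pictured with a small sketch: walk the key-sorted records, cut a new group whenever the grouping comparator says the key has changed, and call reduce once per group. This is a simplified model and not Hadoop code: the class GroupedReduceSketch, the summing reduce function, and the in-memory list of entries are assumptions made for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "one reduce call per group of the sorted input".
public class GroupedReduceSketch {
    // Stand-in for a user-defined reduce(key, values, context): sums one group's values.
    static String reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return key + "=" + sum;
    }

    // Records must already be sorted by key; the grouping comparator decides
    // which adjacent keys belong to the same reduce call.
    static List<String> run(List<Map.Entry<String, Integer>> sorted,
                            Comparator<String> grouping) {
        List<String> output = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            String groupKey = sorted.get(i).getKey();
            List<Integer> values = new ArrayList<>();
            while (i < sorted.size()
                    && grouping.compare(groupKey, sorted.get(i).getKey()) == 0) {
                values.add(sorted.get(i).getValue());
                i++;
            }
            // Output is appended in emission order; it is never re-sorted.
            output.add(reduce(groupKey, values));
        }
        return output;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>();
        sorted.add(new SimpleEntry<>("apple", 1));
        sorted.add(new SimpleEntry<>("apple", 2));
        sorted.add(new SimpleEntry<>("pear", 5));
        System.out.println(run(sorted, Comparator.naturalOrder())); // prints [apple=3, pear=5]
    }
}
```

The grouping only works because equal keys are adjacent after the sort phase, which is why the sort has to finish before the first reduce call.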
Source Code Analysis
1. Shuffle phase source code analysis
@Override
@SuppressWarnings("unchecked")
public void run(JobConf job, final TaskUmbilicalProtocol umbilical)
    throws IOException, InterruptedException, ClassNotFoundException {
  job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
  if (isMapOrReduce()) {
    copyPhase = getProgress().addPhase("copy");
    sortPhase = getProgress().addPhase("sort");
    reducePhase = getProgress().addPhase("reduce");
  }
  // Start the reporter that sends task status reports to the parent process
  TaskReporter reporter = startReporter(umbilical);
  // Check whether the new or the old MapReduce API is in use
  boolean useNewApi = job.getUseNewReducer();
  // Core step: initialize the task
  initialize(job, getJobID(), reporter, useNewApi);
  // A reduce-side task is one of four kinds: job-setup task, job-cleanup task,
  // task-cleanup task, or the actual ReduceTask
  if (jobCleanup) {
    runJobCleanupTask(umbilical, reporter);
    return;
  }
  if (jobSetup) {
    runJobSetupTask(umbilical, reporter);
    return;
  }
  if (taskCleanup) {
    runTaskCleanupTask(umbilical, reporter);
    return;
  }
  // Initialize the codec
  codec = initCodec();
  RawKeyValueIterator rIter = null;
  // The shuffle plugin to use
  ShuffleConsumerPlugin shuffleConsumerPlugin = null;
  Class combinerClass = conf.getCombinerClass();
  CombineOutputCollector combineCollector =
      (null != combinerClass) ?
      new CombineOutputCollector(reduceCombineOutputCounter, reporter, conf) : null;
  Class<? extends ShuffleConsumerPlugin> clazz =
      job.getClass(MRConfig.SHUFFLE_CONSUMER_PLUGIN, Shuffle.class, ShuffleConsumerPlugin.class);
  shuffleConsumerPlugin = ReflectionUtils.newInstance(clazz, job);
  LOG.info("Using ShuffleConsumerPlugin: " + shuffleConsumerPlugin);
  ShuffleConsumerPlugin.Context shuffleContext =
      new ShuffleConsumerPlugin.Context(getTaskID(), job, FileSystem.getLocal(job), umbilical,
          super.lDirAlloc, reporter, codec,
          combinerClass, combineCollector,
          spilledRecordsCounter, reduceCombineInputCounter,
          shuffledMapsCounter,
          reduceShuffleBytes, failedShuffleCounter,
          mergedMapOutputsCounter,
          taskStatus, copyPhase, sortPhase, this,
          mapOutputFile, localMapFiles);
  // Core step: initialize the shuffle plugin
  shuffleConsumerPlugin.init(shuffleContext);
  // Core step: run the shuffle. This pulls the map-side output over network I/O
  // and merges it
  rIter = shuffleConsumerPlugin.run();
  // free up the data structures
  mapOutputFilesOnDisk.clear();
  // sort is complete
  sortPhase.complete();
  setPhase(TaskStatus.Phase.REDUCE);
  statusUpdate(umbilical);
  Class keyClass = job.getMapOutputKeyClass();
  Class valueClass = job.getMapOutputValueClass();
  // Grouping comparator: decides which keys fall into the same reduce call
  RawComparator comparator = job.getOutputValueGroupingComparator();
  // If the task is none of the three special tasks above, it is the main ReduceTask;
  // call a different method depending on the new or old API
  if (useNewApi) {
    runNewReducer(job, umbilical, reporter, rIter, comparator,
        keyClass, valueClass);
  } else {
    runOldReducer(job, umbilical, reporter, rIter, comparator,