MapReduce Source Code Analysis: ReduceTask Flow

Latest recommended article published on 2021-12-03 11:35:21

叫我不矜持

Article link: https://blog.csdn.net/SmallCatBaby/article/details/90187673

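This article analyzes the execution flow of ReduceTask in MapReduce in detail, covering its three phases: Shuffle, Sort, and Reduce. In the Shuffle phase, the ReduceTask fetches data from the MapTasks over HTTP and, when a combiner is configured, uses it to reduce the data volume. The Sort phase merge-sorts the data so that records with the same key end up next to each other. The Reduce phase then processes the sorted data by calling the user-defined reduce method. Throughout the flow, in-memory buffering and disk operations work together to keep data processing efficient.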

Preface

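A reduce task pulls many small files from the map tasks. Each small file is sorted internally, but the files are not sorted relative to one another. The reduce task therefore merges them with a merge-sort pass, which yields a single, globally sorted file.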
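The merge described above can be sketched as a k-way merge over sorted runs. This is a minimal plain-Java illustration, assuming in-memory lists of String keys; the class and method names are hypothetical, and Hadoop's own merger is considerably more involved (it works on serialized records and can spill to disk).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration; not a Hadoop class.
class SortedRunMergeSketch {

    // One cursor per sorted run: the current head element plus the rest of the run.
    private static final class Cursor {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.head = it.next(); // caller guarantees the run is non-empty
        }
    }

    // Merge several individually sorted runs into one globally sorted list.
    static List<String> merge(List<List<String>> runs) {
        // The heap always exposes the cursor whose head element is smallest.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        for (List<String> run : runs) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run.iterator()));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();
            merged.add(smallest.head);
            if (smallest.rest.hasNext()) {
                // Advance the cursor and put it back so its next element competes.
                smallest.head = smallest.rest.next();
                heap.add(smallest);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> runs = Arrays.asList(
            Arrays.asList("apple", "cat", "dog"),
            Arrays.asList("bat", "cat"),
            Arrays.asList("ant", "zebra"));
        System.out.println(merge(runs));
    }
}
```

Because every run is already sorted, only the head of each run has to be compared at any moment, so the merge needs one pass over the data rather than a full re-sort.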

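A Reducer has three basic phases:

1. Shuffle phase
The Reducer copies the sorted output of each Mapper over the network.

2. Sort phase

The framework merge-sorts the Reducer input by key (different Mappers may have emitted the same key).
The shuffle and sort phases run at the same time: Mapper outputs are merged while they are being fetched.

3. Reduce phase
In this phase, the reduce(Object, Iterable, Reducer.Context) method is called once for each group (a key and its values) in the sorted input. The output of the reduce task is typically written to a RecordWriter via Reducer.Context.write(Object, Object). The Reducer's output is not re-sorted.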
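The Reduce phase described above can be sketched in plain Java as follows. This is a simplified illustration under stated assumptions: keys are Strings, values are Integers, the input is already sorted by key, and every name in the block (GroupedReduceSketch, ReduceFn, kv, runReduce) is hypothetical rather than part of Hadoop. The grouping comparator plays the role of the comparator obtained from job.getOutputValueGroupingComparator() in the source.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these names are Hadoop classes.
class GroupedReduceSketch {

    // Stand-in for a user-defined reduce function applied to one group.
    interface ReduceFn {
        int reduce(String key, Iterable<Integer> values);
    }

    // Convenience factory for a key/value pair.
    static Map.Entry<String, Integer> kv(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Walk the sorted input once; consecutive keys that the grouping comparator
    // treats as equal form one group, and fn is called once per group.
    static Map<String, Integer> runReduce(List<Map.Entry<String, Integer>> sorted,
                                          Comparator<String> grouping,
                                          ReduceFn fn) {
        // LinkedHashMap keeps the order in which groups were emitted (no re-sort).
        Map<String, Integer> out = new LinkedHashMap<>();
        int i = 0;
        while (i < sorted.size()) {
            String groupKey = sorted.get(i).getKey();
            List<Integer> values = new ArrayList<>();
            while (i < sorted.size()
                    && grouping.compare(groupKey, sorted.get(i).getKey()) == 0) {
                values.add(sorted.get(i).getValue());
                i++;
            }
            out.put(groupKey, fn.reduce(groupKey, values));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> sorted = Arrays.asList(
            kv("a", 1), kv("a", 2), kv("b", 5), kv("c", 1), kv("c", 1));
        Map<String, Integer> sums = runReduce(
            sorted,
            Comparator.<String>naturalOrder(),
            (key, values) -> {
                int sum = 0;
                for (int v : values) {
                    sum += v;
                }
                return sum;
            });
        System.out.println(sums); // prints {a=3, b=5, c=2}
    }
}
```

The sketch relies on the sort phase having placed equal keys next to each other; on unsorted input the same key would be reduced more than once.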

Source code walkthrough

1. Shuffle phase source code analysis
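For orientation before the listing: run() looks up the shuffle implementation class in the job configuration, falls back to a default (Shuffle.class in the listing), instantiates it reflectively, and then calls init(...) followed by run(). A minimal plain-Java sketch of that pattern follows. The interface, the default class, the configuration key, and the Map-based configuration are all hypothetical stand-ins, not Hadoop's actual types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these names are Hadoop classes.
class ShufflePluginSketch {

    // Simplified counterpart of a pluggable shuffle consumer.
    interface ConsumerPlugin {
        void init(List<String> mapOutputKeys);
        Iterator<String> run();
    }

    // Default implementation used when the configuration names no other class.
    static class DefaultPlugin implements ConsumerPlugin {
        private List<String> mapOutputKeys;

        public void init(List<String> mapOutputKeys) {
            this.mapOutputKeys = mapOutputKeys;
        }

        public Iterator<String> run() {
            // Stand-in for "fetch and merge": return the keys in sorted order.
            List<String> copy = new ArrayList<>(mapOutputKeys);
            Collections.sort(copy);
            return copy.iterator();
        }
    }

    // Made-up configuration key, used only by this sketch.
    static final String PLUGIN_KEY = "sketch.shuffle.plugin.class";

    // Resolve the plugin class from the configuration and instantiate it reflectively.
    static ConsumerPlugin load(Map<String, String> conf) {
        String className = conf.getOrDefault(PLUGIN_KEY, DefaultPlugin.class.getName());
        try {
            Class<? extends ConsumerPlugin> clazz =
                Class.forName(className).asSubclass(ConsumerPlugin.class);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load plugin " + className, e);
        }
    }

    public static void main(String[] args) {
        ConsumerPlugin plugin = load(new HashMap<>());
        plugin.init(Arrays.asList("k2", "k1", "k3"));
        Iterator<String> it = plugin.run();
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

Keeping the implementation behind an interface and a configuration key is what lets a job swap in a different shuffle without changing the task code.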

@Override
  @SuppressWarnings("unchecked")
  public void run(JobConf job, final TaskUmbilicalProtocol umbilical)
    throws IOException, InterruptedException, ClassNotFoundException {
    job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());

    if (isMapOrReduce()) {
      copyPhase = getProgress().addPhase("copy");
      sortPhase  = getProgress().addPhase("sort");
      reducePhase = getProgress().addPhase("reduce");
    }
    // Start the task reporter, which communicates status to the parent process
    TaskReporter reporter = startReporter(umbilical);
    // Check whether the new or the old MapReduce API is in use
    boolean useNewApi = job.getUseNewReducer();
    // Core step: initialize the task
    initialize(job, getJobID(), reporter, useNewApi);

    // This task can be one of four kinds: job-setup task, job-cleanup task, task-cleanup task, or an actual reduce task
    if (jobCleanup) {
      runJobCleanupTask(umbilical, reporter);
      return;
    }
    if (jobSetup) {
      runJobSetupTask(umbilical, reporter);
      return;
    }
    if (taskCleanup) {
      runTaskCleanupTask(umbilical, reporter);
      return;
    }
    
    // Initialize the codec
    codec = initCodec();
    RawKeyValueIterator rIter = null;
    // The shuffle plugin to use
    ShuffleConsumerPlugin shuffleConsumerPlugin = null;
    
    Class combinerClass = conf.getCombinerClass();
    CombineOutputCollector combineCollector = 
      (null != combinerClass) ? 
     new CombineOutputCollector(reduceCombineOutputCounter, reporter, conf) : null;
    
    Class<? extends ShuffleConsumerPlugin> clazz =
          job.getClass(MRConfig.SHUFFLE_CONSUMER_PLUGIN, Shuffle.class, ShuffleConsumerPlugin.class);
                    
    shuffleConsumerPlugin = ReflectionUtils.newInstance(clazz, job);
    LOG.info("Using ShuffleConsumerPlugin: " + shuffleConsumerPlugin);

    ShuffleConsumerPlugin.Context shuffleContext = 
      new ShuffleConsumerPlugin.Context(getTaskID(), job, FileSystem.getLocal(job), umbilical, 
                  super.lDirAlloc, reporter, codec, 
                  combinerClass, combineCollector, 
                  spilledRecordsCounter, reduceCombineInputCounter,
                  shuffledMapsCounter,
                  reduceShuffleBytes, failedShuffleCounter,
                  mergedMapOutputsCounter,
                  taskStatus, copyPhase, sortPhase, this,
                  mapOutputFile, localMapFiles);
    // Core step: initialize the shuffle plugin
    shuffleConsumerPlugin.init(shuffleContext);
    // Core step: run the shuffle, which pulls the map-side output over the network and merges it
    rIter = shuffleConsumerPlugin.run();

    // free up the data structures
    mapOutputFilesOnDisk.clear();

    // sort is complete
    sortPhase.complete();                         
    setPhase(TaskStatus.Phase.REDUCE); 
    statusUpdate(umbilical);
    Class keyClass = job.getMapOutputKeyClass();
    Class valueClass = job.getMapOutputValueClass();

    // Grouping comparator
    RawComparator comparator = job.getOutputValueGroupingComparator();
    // If the task is none of the three special kinds above, it is the main reduce task; dispatch to the new-API or old-API reducer
    if (useNewApi) {
      runNewReducer(job, umbilical, reporter, rIter, comparator, 
                    keyClass, valueClass);
    } else {
      runOldReducer(job, umbilical, reporter, rIter, comparator,
