In the previous posts we walked through the source of the map side's input and output, so the flow of a map task should by now be familiar. Here we analyze the input side of reduce; the output goes straight to HDFS, so we will not dwell on it.
A reduce-side task comes in four varieties: Job-setup Task, Job-cleanup Task, Task-cleanup Task, and the ordinary Reduce Task. This post analyzes the last one, the ordinary Reduce Task.
Reduce Task Source Code Analysis
Environment: Hadoop 2.7.2, source browsed in IntelliJ IDEA.
Let's go straight to the run method of ReduceTask:
public void run(JobConf job, final TaskUmbilicalProtocol umbilical)
throws IOException, InterruptedException, ClassNotFoundException {
job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
//skip these preliminaries for now
if (isMapOrReduce()) {
copyPhase = getProgress().addPhase("copy");
sortPhase = getProgress().addPhase("sort");
reducePhase = getProgress().addPhase("reduce");
}
// start thread that will handle communication with parent
TaskReporter reporter = startReporter(umbilical);
boolean useNewApi = job.getUseNewReducer();
initialize(job, getJobID(), reporter, useNewApi);
// check if it is a cleanupJobTask
if (jobCleanup) {
runJobCleanupTask(umbilical, reporter);
return;
}
if (jobSetup) {
runJobSetupTask(umbilical, reporter);
return;
}
if (taskCleanup) {
runTaskCleanupTask(umbilical, reporter);
return;
}
// Initialize the codec
codec = initCodec();
RawKeyValueIterator rIter = null;
//the plugin that performs the shuffle fetch
ShuffleConsumerPlugin shuffleConsumerPlugin = null;
Class combinerClass = conf.getCombinerClass();
CombineOutputCollector combineCollector =
(null != combinerClass) ?
new CombineOutputCollector(reduceCombineOutputCounter, reporter, conf) : null;
Class<? extends ShuffleConsumerPlugin> clazz =
job.getClass(MRConfig.SHUFFLE_CONSUMER_PLUGIN, Shuffle.class, ShuffleConsumerPlugin.class);
shuffleConsumerPlugin = ReflectionUtils.newInstance(clazz, job);
LOG.info("Using ShuffleConsumerPlugin: " + shuffleConsumerPlugin);
ShuffleConsumerPlugin.Context shuffleContext =
new ShuffleConsumerPlugin.Context(getTaskID(), job, FileSystem.getLocal(job), umbilical,
super.lDirAlloc, reporter, codec,
combinerClass, combineCollector,
spilledRecordsCounter, reduceCombineInputCounter,
shuffledMapsCounter,
reduceShuffleBytes, failedShuffleCounter,
mergedMapOutputsCounter,
taskStatus, copyPhase, sortPhase, this,
mapOutputFile, localMapFiles);
//Note: init() only initializes the plugin here; the actual pull of the map outputs
//happens in run() below. (The serving side of the shuffle, the ShuffleHandler,
//runs as an auxiliary service inside the NodeManager.)
shuffleConsumerPlugin.init(shuffleContext);
//run() returns an iterator holding the full data set this reduce pulled from the maps
//(this is the "real" iterator)
rIter = shuffleConsumerPlugin.run();
// free up the data structures
mapOutputFilesOnDisk.clear();
sortPhase.complete(); // sort is complete
setPhase(TaskStatus.Phase.REDUCE);
statusUpdate(umbilical);
Class keyClass = job.getMapOutputKeyClass();
Class valueClass = job.getMapOutputValueClass();
//Get the grouping comparator. (The map-phase comparator serves the quick sort,
//i.e. it is the sort comparator.) Let's step inside this method.
RawComparator comparator = job.getOutputValueGroupingComparator();
if (useNewApi) {
runNewReducer(job, umbilical, reporter, rIter, comparator,
keyClass, valueClass);
} else {
runOldReducer(job, umbilical, reporter, rIter, comparator,
keyClass, valueClass);
}
shuffleConsumerPlugin.close();
done(umbilical, reporter);
}
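As an aside, the getClass call above shows that the shuffle fetch is pluggable: Shuffle.class is merely the default bound to MRConfig.SHUFFLE_CONSUMER_PLUGIN. A sketch of wiring in a replacement (MyShufflePlugin is a hypothetical class implementing ShuffleConsumerPlugin):

JobConf job = new JobConf();
// Bind a custom ShuffleConsumerPlugin; when nothing is configured,
// Shuffle.class is used, exactly as the getClass default above shows.
job.setClass(MRConfig.SHUFFLE_CONSUMER_PLUGIN,
    MyShufflePlugin.class, ShuffleConsumerPlugin.class);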
Let's step into the grouping comparator lookup, getOutputValueGroupingComparator:
public RawComparator getOutputValueGroupingComparator() {
Class<? extends RawComparator> theClass = getClass(
//if the user configured a grouping comparator, use that one
JobContext.GROUP_COMPARATOR_CLASS, null, RawComparator.class);
if (theClass == null) {
// let's look inside and see what the default grouping comparator is
return getOutputKeyComparator();
}
//if one was configured, instantiate it directly via reflection
return ReflectionUtils.newInstance(theClass, this);
}
Step into getOutputKeyComparator to see exactly which comparator grouping falls back to:
public RawComparator getOutputKeyComparator() {
Class<? extends RawComparator> theClass = getClass(
JobContext.KEY_COMPARATOR, null, RawComparator.class);
if (theClass != null)
//if the user configured one, instantiate it via reflection
return ReflectionUtils.newInstance(theClass, this);
return
// Otherwise fall back to the WritableComparator for the map output key class,
// i.e. the sort comparator of the map phase. So if the user sets no grouping
// comparator, reduce-side grouping uses the map-side sort comparator.
WritableComparator.get(getMapOutputKeyClass().asSubclass(WritableComparable.class), this);
}
Back in run, let's continue into runNewReducer:
private <INKEY,INVALUE,OUTKEY,OUTVALUE>
void runNewReducer(JobConf job,
final TaskUmbilicalProtocol umbilical,
final TaskReporter reporter,
//the real iterator
RawKeyValueIterator rIter,
//the comparator
RawComparator<INKEY> comparator,
Class<INKEY> keyClass,
Class<INVALUE> valueClass
) throws IOException,InterruptedException,
ClassNotFoundException {
// wrap value iterator to report progress.
final RawKeyValueIterator rawIter = rIter;
// wrap the real iterator so that progress is reported as it advances
rIter = new RawKeyValueIterator() {
public void close() throws IOException {
rawIter.close();
}
public DataInputBuffer getKey() throws IOException {
return rawIter.getKey();
}
public Progress getProgress() {
return rawIter.getProgress();
}
public DataInputBuffer getValue() throws IOException {
return rawIter.getValue();
}
public boolean next() throws IOException {
boolean ret = rawIter.next();
reporter.setProgress(rawIter.getProgress().getProgress());
return ret;
}
};
// make a task context so we can get the classes
// prepare the task context
org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job,
getTaskID(), reporter);
// make a reducer
// instantiate the configured Reducer via reflection
org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE> reducer =
(org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>)
ReflectionUtils.newInstance(taskContext.getReducerClass(), job);
//we'll skip the reduce output path for now
org.apache.hadoop.mapreduce.RecordWriter<OUTKEY,OUTVALUE> trackedRW =
new NewTrackingRecordWriter<OUTKEY, OUTVALUE>(this, taskContext);
job.setBoolean("mapred.skip.on", isSkipping());
job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
//Create the reduce context, passing in the real iterator and the comparator;
//let's look at its implementation class
org.apache.hadoop.mapreduce.Reducer.Context
reducerContext = createReduceContext(reducer, job, getTaskID(),
rIter, reduceInputKeyCounter,
reduceInputValueCounter,
trackedRW,
committer,
reporter, comparator, keyClass,
valueClass);
try {
reducer.run(reducerContext);
} finally {
trackedRW.close(reducerContext);
}
}
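Before stepping into the context, it helps to recall where reducer.run(reducerContext) leads. The new-API Reducer.run is a simple loop over nextKey(); it is the driver for everything analyzed below (quoted from Reducer.java of this version, modulo formatting):

public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  try {
    while (context.nextKey()) {
      reduce(context.getCurrentKey(), context.getValues(), context);
      // If a backup store is used (mark/reset support), reset it for the next group
      Iterator<VALUEIN> iter = context.getValues().iterator();
      if (iter instanceof ReduceContext.ValueIterator) {
        ((ReduceContext.ValueIterator<VALUEIN>) iter).resetBackupStore();
      }
    }
  } finally {
    cleanup(context);
  }
}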
Next, let's see how the reducerContext is built, in its implementation class ReduceContextImpl:
public ReduceContextImpl(Configuration conf, TaskAttemptID taskid,
//the real iterator arrives here under the parameter name "input"
RawKeyValueIterator input,
Counter inputKeyCounter,
Counter inputValueCounter,
RecordWriter<KEYOUT,VALUEOUT> output,
OutputCommitter committer,
StatusReporter reporter,
//the comparator
RawComparator<KEYIN> comparator,
Class<KEYIN> keyClass,
Class<VALUEIN> valueClass
) throws InterruptedException, IOException{
super(conf, taskid, output, committer, reporter);
// From here on the real iterator is referenced as input and used throughout this class.
// The method reduce code calls most often is nextKey(); we'll look inside it shortly.
this.input = input;
this.inputKeyCounter = inputKeyCounter;
this.inputValueCounter = inputValueCounter;
//assign the comparator
this.comparator = comparator;
//Prepare the deserialization machinery: the map outputs are serialized byte
//streams, so after the reduce pulls them they must be deserialized.
this.serializationFactory = new SerializationFactory(conf);
this.keyDeserializer = serializationFactory.getDeserializer(keyClass);
this.keyDeserializer.open(buffer);
this.valueDeserializer = serializationFactory.getDeserializer(valueClass);
this.valueDeserializer.open(buffer);
//prime hasMore: does the input iterator hold a record to read? (returns boolean)
hasMore = input.next();
//keep references to the remaining configuration
this.keyClass = keyClass;
this.valueClass = valueClass;
this.conf = conf;
this.taskid = taskid;
}
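To make the keyDeserializer/buffer pairing above concrete, here is a minimal standalone round trip through the same Hadoop serialization machinery (a sketch, not ReduceTask code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.serializer.Deserializer;
import org.apache.hadoop.io.serializer.SerializationFactory;
import org.apache.hadoop.io.serializer.Serializer;

public class SerializationRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    SerializationFactory factory = new SerializationFactory(conf);

    // Serialize a Text key into a byte buffer, as the map side would
    Serializer<Text> ser = factory.getSerializer(Text.class);
    DataOutputBuffer out = new DataOutputBuffer();
    ser.open(out);
    ser.serialize(new Text("hello"));

    // Mirror keyDeserializer.open(buffer): feed the raw bytes back in
    Deserializer<Text> deser = factory.getDeserializer(Text.class);
    DataInputBuffer in = new DataInputBuffer();
    deser.open(in);
    in.reset(out.getData(), 0, out.getLength());
    // deserialize(null) allocates a fresh object; passing an existing one
    // (as nextKeyValue does with `key`) lets Writable reuse it
    Text key = deser.deserialize(null);
    System.out.println(key); // prints: hello
  }
}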
Let's step into nextKey. A quick recap first: on the reduce side, once the task has started, the data has been pulled, and the real iterator is ready, the framework invokes Reducer.run, which immediately starts the while loop over nextKey():
public boolean nextKey() throws IOException,InterruptedException {
// nextKeyIsSame ("does the next key belong to my group?") starts out false,
// and hasMore starts out true, so the first call skips this loop
while (hasMore && nextKeyIsSame) {
nextKeyValue();
}
//if there is still a record to read
if (hasMore) {
//counter bookkeeping
if (inputKeyCounter != null) {
inputKeyCounter.increment(1);
}
//Ultimately this returns nextKeyValue(); note that this nextKeyValue differs
//from the map-phase implementation, so let's step into it
return nextKeyValue();
} else {
return false;
}
}
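Note the while (hasMore && nextKeyIsSame) loop at the top: it exists because a reduce() call may return without draining its value iterator, in which case nextKey() fast-forwards past the rest of that group before positioning on the next key. A small illustration, assuming IntWritable values:

@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  Iterator<IntWritable> it = values.iterator();
  if (it.hasNext()) {
    context.write(key, it.next()); // take only the first value of the group
  }
  // The values we never touched are skipped by nextKey()'s
  // while (hasMore && nextKeyIsSame) nextKeyValue(); loop.
}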
Let's step into nextKeyValue and see how it is implemented:
public boolean nextKeyValue() throws IOException, InterruptedException {
//If the input is exhausted, clear the key/value and return false.
if (!hasMore) {
key = null;
value = null;
return false;
}
//nextKeyIsSame was defined as false, so negating it makes firstValue true on the first record
firstValue = !nextKeyIsSame;
//grab the raw key bytes
DataInputBuffer nextKey = input.getKey();
currentRawKey.set(nextKey.getData(), nextKey.getPosition(),
nextKey.getLength() - nextKey.getPosition());
buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());
//deserialize the key and assign it
key = keyDeserializer.deserialize(key);
DataInputBuffer nextVal = input.getValue();
buffer.reset(nextVal.getData(), nextVal.getPosition(), nextVal.getLength()
- nextVal.getPosition());
//deserialize the value and assign it; much like the map side, this just fills in k/v
value = valueDeserializer.deserialize(value);
currentKeyLength = nextKey.getLength() - nextKey.getPosition();
currentValueLength = nextVal.getLength() - nextVal.getPosition();
if (isMarked) {
backupStore.write(nextKey, nextVal);
}
//update hasMore: beyond the record we just read, is there another one?
hasMore = input.next();
//if there is a next record
if (hasMore) {
//pull out its key
nextKey = input.getKey();
// The grouping comparator: the first three arguments describe the record just
// read, the last three describe nextKey, the following record.
// compare(...) == 0 means the two keys are equal, so nextKeyIsSame becomes true.
// This is the look-ahead on the next record.
nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
currentRawKey.getLength(),
nextKey.getData(),
nextKey.getPosition(),
nextKey.getLength() - nextKey.getPosition()
) == 0;
} else {
nextKeyIsSame = false;
}
inputValueCounter.increment(1);
//a record was read successfully, so return true
return true;
}
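To see the flags in action, trace three sorted records (a,1) (a,2) (b,3) arriving at the reduce:

constructor:  hasMore = input.next()                          -> positioned at (a,1)
nextKey():    nextKeyValue(): key=a, value=1, look-ahead sees a -> nextKeyIsSame=true, firstValue=true
reduce(a):    next() returns 1 (firstValue consumed)
              hasNext()=true -> next(): nextKeyValue(): key=a, value=2,
                                look-ahead sees b              -> nextKeyIsSame=false
              hasNext()=false -> reduce(a, ...) returns
nextKey():    nextKeyValue(): key=b, value=3, no more records  -> hasMore=false
reduce(b):    next() returns 3 (firstValue), hasNext()=false   -> returns
nextKey():    hasMore=false -> returns false, Reducer.run ends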
So we now know that the nextKey call in the run loop does two main things: 1. it assigns our key and value; 2. it performs the look-ahead on the next record.
Once the key/value assignment succeeds, our reduce method is invoked. From Reducer.run we also know that when nextKey() returns true, getCurrentKey and getValues are called on the context. Let's look at getCurrentKey first:
//simply returns our key; no logic here
public KEYIN getCurrentKey() {
return key;
}
Now let's look at getValues:
public
Iterable<VALUEIN> getValues() throws IOException, InterruptedException {
return iterable;
}
This returns an Iterable of type ValueIterable; stepping inside, we find that it hands out a ValueIterator, so let's go into ValueIterator:
//this is the iterator that ultimately does the work
protected class ValueIterator implements ReduceContext.ValueIterator<VALUEIN> {
private boolean inReset = false;
private boolean clearMarkFlag = false;
//is there another value in the current group?
@Override
public boolean hasNext() {
try {
if (inReset && backupStore.hasNext()) {
return true;
}
} catch (Exception e) {
e.printStackTrace();
throw new RuntimeException("hasNext failed", e);
}
//Either this is the first value of the group, or the look-ahead said the next
//record has the same key. So hasNext() on the iterator returned by getValues()
//boils down to checking whether the next record still belongs to this group.
return firstValue || nextKeyIsSame;
}
//fetch the next value
@Override
public VALUEIN next() {
if (inReset) {
try {
if (backupStore.hasNext()) {
backupStore.next();
DataInputBuffer next = backupStore.nextValue();
buffer.reset(next.getData(), next.getPosition(), next.getLength()
- next.getPosition());
value = valueDeserializer.deserialize(value);
return value;
} else {
inReset = false;
backupStore.exitResetMode();
if (clearMarkFlag) {
clearMarkFlag = false;
isMarked = false;
}
}
} catch (IOException e) {
e.printStackTrace();
throw new RuntimeException("next value iterator failed", e);
}
}
// if this is the first record, we don't need to advance
if (firstValue) {
firstValue = false;
return value;
}
// if this isn't the first record and the next key is different, they
// can't advance it here.
if (!nextKeyIsSame) {
throw new NoSuchElementException("iterate past last value");
}
// otherwise, go to the next key/value pair
try {
//Advancing calls nextKeyValue, the method analyzed above, which reads from
//input, the renamed rIter, i.e. the real iterator. So the iterator we receive
//inside reduce() is this "fake" iterator: it does no I/O itself and merely
//drives the real one.
nextKeyValue();
return value;
} catch (IOException ie) {
throw new RuntimeException("next value iterator failed", ie);
} catch (InterruptedException ie) {
// this is bad, but we can't modify the exception list of java.util
throw new RuntimeException("next value iterator interrupted", ie);
}
}
......
}
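Two practical consequences of this design are worth calling out before we wrap up. First, next() always deserializes into the same value instance (value = valueDeserializer.deserialize(value)), so the object handed to reduce is reused on every iteration and must be copied if you want to keep it. Second, getValues() hands back the same underlying ValueIterator each time, so a group's values can be traversed only once. A small illustration, assuming IntWritable values:

List<IntWritable> kept = new ArrayList<>();
for (IntWritable v : values) {
  // v is the same object on every iteration; copy before storing
  kept.add(new IntWritable(v.get()));
}
// A second for-each over `values` here would see nothing:
// the single underlying ValueIterator is already exhausted.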
At this point we have covered the core source of how the reduce iterator walks the data; we won't analyze the rest of the task source. Here is a short pseudo-code summary:
protected void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  Iterator<IntWritable> it = values.iterator(); // the fake iterator (ValueIterator)
  while (it.hasNext()) {  // hasNext() --> firstValue || nextKeyIsSame
    it.next();            // next() --> nextKeyValue() --> input (the real iterator)
  }
}
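Mapped onto a real reducer, the classic word-count sum looks like this; every step of its for-each loop is driven by the machinery above:

public static class SumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) { // hasNext(): firstValue || nextKeyIsSame
      sum += v.get();              // next(): nextKeyValue() on the real iterator
    }
    result.set(sum);
    context.write(key, result);
  }
}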
The real iterator, input, can keep iterating until the data is completely consumed. Our reduce-side iterator, by contrast, can only iterate over a single group, because of nextKeyIsSame, and nextKeyIsSame is updated by nextKeyValue.

To summarize: the reducer first drives the while loop whose condition is nextKey(). nextKey() calls nextKeyValue(), which assigns the key/value and then performs the look-ahead that sets nextKeyIsSame. reduce() is then invoked with the fake iterator passed in; it can work at all because the previous step has already assigned the values and done the look-ahead. Inside the reduce method's own loop, hasNext() is certainly true for the first record, so we enter the loop; each next() call fetches the value by calling nextKeyValue(), which in turn refreshes nextKeyIsSame, and from then on hasNext() is decided by nextKeyIsSame. While the following records belong to the same group, the fetch repeats and the iteration continues. When we reach the boundary of a group, the next record's key is necessarily different, so the nextKeyValue() call that fetched the last record already set the look-ahead to false; after that value is consumed, hasNext() returns false and the iteration stops. That is why this fake iterator can only traverse one group of data.
That completes our analysis of the core source of MapReduce's Reduce Task.
(The shuffle (copy) phase and the merge phase run in parallel: once the volume of remotely copied data crosses a threshold, merge threads are triggered to combine it. In MRv1 both phases were implemented by the ReduceCopier class, roughly 2,200 lines out of about 2,900 for the whole ReduceTask; in Hadoop 2.x the equivalent work is done by the Shuffle plugin together with its Fetcher threads and MergeManagerImpl. Either way it is a large body of code, so we will not analyze the data-fetching source here.)