Hadoop Source Code Analysis: The Reduce Input Process

 public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKey()) { // is there another key group in the context?
        // If so, pass the current key and its group of values to the reduce method.
        reduce(context.getCurrentKey(), context.getValues(), context);
        // If a back up store is used, reset it
        // My current understanding: get the value iterator and check whether it is
        // ReduceContext's own ValueIterator; if so, a backup store may be in use, so reset it.
        Iterator<VALUEIN> iter = context.getValues().iterator();
        if(iter instanceof ReduceContext.ValueIterator) {
          ((ReduceContext.ValueIterator<VALUEIN>)iter).resetBackupStore();
        }
      }
    } finally {
      cleanup(context);
    }
  }
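To see what `run` is doing, it helps to model its grouping loop outside of Hadoop. The sketch below is not the real Hadoop API: it takes map output that is already sorted by key, groups consecutive entries with the same key, and performs one "reduce" (here, a sum) per group, mirroring what `context.nextKey()` / `context.getValues()` drive in the loop above.

```java
import java.util.*;

// Minimal sketch (not the real Hadoop API): group a sorted (key, value) stream
// and perform one reduce call per key group, as Reducer.run does.
public class GroupedReduceSketch {
    public static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> sorted) {
        Map<String, Integer> result = new LinkedHashMap<>();
        int i = 0;
        while (i < sorted.size()) {                      // analogous to context.nextKey()
            String key = sorted.get(i).getKey();
            int sum = 0;
            while (i < sorted.size() && sorted.get(i).getKey().equals(key)) {
                sum += sorted.get(i).getValue();         // analogous to iterating context.getValues()
                i++;
            }
            result.put(key, sum);                        // one reduce(key, values) call
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> sorted = List.of(
            Map.entry("a", 1), Map.entry("a", 2), Map.entry("b", 5));
        System.out.println(sumByKey(sorted)); // {a=3, b=5}
    }
}
```

Because the input is sorted, a single linear pass is enough to find every group boundary; this is why the framework sorts before reducing.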

The reduce task has three phases: Shuffle, Sort, and Reduce.

  1. Shuffle
    The Reducer copies the sorted output from each Mapper using HTTP across the network.
    In other words, the reducer pulls data over the network via HTTP; shuffle is essentially this data-fetching process.

  2. Sort
    The framework merge-sorts the Reducer inputs by key (since different Mappers may have output the same key).
    The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged.
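The merge half of this phase is a classic k-way merge: each map output is already sorted, so a min-heap over the heads of all runs produces one globally sorted stream. The sketch below mirrors that idea (it is an illustration of the technique, not Hadoop's actual `Merger` implementation).

```java
import java.util.*;

// Sketch of the sort phase's k-way merge: several already-sorted runs
// (one per map output) are merged into one sorted stream with a min-heap.
public class KWayMergeSketch {
    public static List<String> merge(List<List<String>> sortedRuns) {
        // Heap entries are {run index, position within run},
        // ordered by the value they currently point at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> sortedRuns.get(e[0]).get(e[1])));
        for (int r = 0; r < sortedRuns.size(); r++) {
            if (!sortedRuns.get(r).isEmpty()) heap.add(new int[]{r, 0});
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            out.add(sortedRuns.get(top[0]).get(top[1]));      // emit the smallest head
            if (top[1] + 1 < sortedRuns.get(top[0]).size()) {
                heap.add(new int[]{top[0], top[1] + 1});      // advance that run
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> runs = List.of(
            List.of("apple", "cat"), List.of("bee", "dog"), List.of("ant"));
        System.out.println(merge(runs)); // [ant, apple, bee, cat, dog]
    }
}
```

Since each fetched map output is a sorted run, new runs can join the merge as they arrive, which is why fetching and merging can overlap.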

SecondarySort
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator.
The keys will be sorted using the entire composite key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce.
The grouping comparator is specified via Job.setGroupingComparatorClass(Class); the sort order is controlled by Job.setSortComparatorClass(Class).

Precedence for choosing the grouping comparator:
1. The user-defined grouping comparator
2. The user-defined sort comparator
3. The system default sort comparator
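The effect of the two comparators can be imitated without Hadoop: sort records by the full composite key (natural key, then value), but form "reduce" groups by the natural key alone. The names below are illustrative; only the sort-by-full-key / group-by-natural-key split corresponds to Job.setSortComparatorClass / Job.setGroupingComparatorClass.

```java
import java.util.*;

// Secondary-sort sketch: sort by (key, value) but group by key only,
// so each group's value list arrives already sorted.
public class SecondarySortSketch {
    public static Map<String, List<Integer>> groupSorted(List<Map.Entry<String, Integer>> records) {
        // "Sort comparator": the entire composite key (key, then value).
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparing((Map.Entry<String, Integer> e) -> e.getKey())
                              .thenComparing(e -> e.getValue()));
        // "Grouping comparator": the natural key only.
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : sorted) {
            groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return groups; // each value list is already sorted, as in a secondary sort
    }

    public static void main(String[] args) {
        System.out.println(groupSorted(List.of(
            Map.entry("a", 3), Map.entry("b", 1), Map.entry("a", 1))));
        // {a=[1, 3], b=[1]}
    }
}
```

The key point is that no per-group sort happens at reduce time: the global sort over the composite key already put each group's values in order.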

  3. Reduce
    In this phase the reduce(Object, Iterable, Reducer.Context) method is called for each <key, (collection of values)> in the sorted inputs.
    The output of the reduce task is typically written to a RecordWriter via Reducer.Context.write(Object, Object). The output of the Reducer is not re-sorted.
  public void run(JobConf job, TaskUmbilicalProtocol umbilical) throws IOException, InterruptedException, ClassNotFoundException {
        job.setBoolean("mapreduce.job.skiprecords", this.isSkipping());
        if (this.isMapOrReduce()) { // for a map or reduce task, register the three progress phases
            this.copyPhase = this.getProgress().addPhase("copy");
            this.sortPhase = this.getProgress().addPhase("sort");
            this.reducePhase = this.getProgress().addPhase("reduce");
        }
        TaskReporter reporter = this.startReporter(umbilical);
        boolean useNewApi = job.getUseNewReducer(); // whether the new reducer API is in use
        // perform initialization
        this.initialize(job, this.getJobID(), reporter, useNewApi);

        if (this.jobCleanup) {
            this.runJobCleanupTask(umbilical, reporter);
        } else if (this.jobSetup) {
            this.runJobSetupTask(umbilical, reporter);
        } else if (this.taskCleanup) {
            this.runTaskCleanupTask(umbilical, reporter);
        } else {
            this.codec = this.initCodec(); // obtain the compression codec for the map output files
            RawKeyValueIterator rIter = null;
            ShuffleConsumerPlugin shuffleConsumerPlugin = null;
            Class combinerClass = this.conf.getCombinerClass();
            CombineOutputCollector combineCollector = null != combinerClass ? new CombineOutputCollector(this.reduceCombineOutputCounter, reporter, this.conf) : null;
            Class<? extends ShuffleConsumerPlugin> clazz = job.getClass("mapreduce.job.reduce.shuffle.consumer.plugin.class", Shuffle.class, ShuffleConsumerPlugin.class);
            shuffleConsumerPlugin = (ShuffleConsumerPlugin)ReflectionUtils.newInstance(clazz, job);
            LOG.info("Using ShuffleConsumerPlugin: " + shuffleConsumerPlugin);
            Context shuffleContext = new Context(this.getTaskID(), job, FileSystem.getLocal(job), umbilical, super.lDirAlloc, reporter, this.codec, combinerClass, combineCollector, this.spilledRecordsCounter, this.reduceCombineInputCounter, this.shuffledMapsCounter, this.reduceShuffleBytes, this.failedShuffleCounter, this.mergedMapOutputsCounter, this.taskStatus, this.copyPhase, this.sortPhase, this, this.mapOutputFile, this.localMapFiles);
            shuffleConsumerPlugin.init(shuffleContext);
            rIter = shuffleConsumerPlugin.run();

            // Everything above is the shuffle fetch. Setting the details aside for now,
            // what matters is that it returns an iterator over the merged, sorted data.

            this.mapOutputFilesOnDisk.clear();
            this.sortPhase.complete();
            this.setPhase(Phase.REDUCE);
            this.statusUpdate(umbilical);
            Class keyClass = job.getMapOutputKeyClass();
            Class valueClass = job.getMapOutputValueClass();
            RawComparator comparator = job.getOutputValueGroupingComparator();
            if (useNewApi) {
                this.runNewReducer(job, umbilical, reporter, rIter, comparator, keyClass, valueClass);
            } else {
                this.runOldReducer(job, umbilical, reporter, rIter, comparator, keyClass, valueClass);
            }

            shuffleConsumerPlugin.close();
            this.done(umbilical, reporter);
        }
    }
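Note that the shuffle implementation is itself pluggable: the `job.getClass(...)` lookup above reads the `mapreduce.job.reduce.shuffle.consumer.plugin.class` property and falls back to the built-in `Shuffle` class. A minimal configuration fragment (the value shown is the default seen in the code above, not a recommendation):

```xml
<!-- mapred-site.xml: choose the ShuffleConsumerPlugin implementation.
     Defaults to the built-in Shuffle class if unset. -->
<property>
  <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
  <value>org.apache.hadoop.mapreduce.task.reduce.Shuffle</value>
</property>
```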
