How Hadoop's GroupComparator Works (Source Code Analysis)

Goal: understand how the GroupComparator we configure influences the key / Iterable<value> pair that is handed to the reduce function.
Below is the reduce function of a job that has a GroupComparator configured. The practical effect is that our custom GroupComparator decides which values are grouped together and delivered to a single reduce() call (a sketch of such a comparator follows the code below).


    public static class DividendGrowthReducer extends Reducer<Stock, DoubleWritable, NullWritable, DividendChange> {
        private NullWritable outputKey = NullWritable.get();
        private DividendChange outputValue = new DividendChange();

        @Override
        protected void reduce(Stock key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double previousDividend = 0.0;
            for (DoubleWritable dividend : values) {
                double currentDividend = dividend.get();
                double growth = currentDividend - previousDividend;
                if (Math.abs(growth) > 0.000001) {
                    outputValue.setSymbol(key.getSymbol());
                    outputValue.setDate(key.getDate());
                    outputValue.setChange(growth);
                    context.write(outputKey, outputValue);
                    previousDividend = currentDividend;
                }
            }
        }
    }
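For reference, a matching grouping comparator might look like the sketch below. This is only an illustration: the Stock key and its getSymbol()/getDate() accessors come from the reducer above, while the class name StockSymbolGroupingComparator and the exact comparison logic are assumptions, not the original job's code.

    // A minimal sketch of a grouping comparator, assuming Stock is a WritableComparable
    // whose natural (sort) order is (symbol, date). Grouping by symbol only means every
    // dividend of one symbol arrives in a single reduce() call, while the secondary sort
    // on date still controls the order of the values.
    public static class StockSymbolGroupingComparator extends WritableComparator {

        protected StockSymbolGroupingComparator() {
            super(Stock.class, true);   // true: let the comparator create Stock instances
        }

        @Override
        @SuppressWarnings("rawtypes")
        public int compare(WritableComparable a, WritableComparable b) {
            Stock s1 = (Stock) a;
            Stock s2 = (Stock) b;
            // Keys belong to the same group as long as the symbol matches;
            // the date component is deliberately ignored here.
            return s1.getSymbol().compareTo(s2.getSymbol());
        }
    }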
First, let's trace upward to find out who calls the reduce function we wrote: the run method of the Reducer class. From the code below we can see that run invokes reduce once per key.
Note that everything passed into reduce here is an object reference.


    /**
     * Advanced application writers can use the
     * {@link #run(org.apache.hadoop.mapreduce.Reducer.Context)} method to
     * control how the reduce task works.
     */
    public void run(Context context) throws IOException, InterruptedException {
        // ...
        while (context.nextKey()) {
            reduce(context.getCurrentKey(), context.getValues(), context);
            // ...
        }
        // ...
    }
Looking back at the reduce function we wrote, this means the key changes accordingly while we iterate over the values (illustrated in the sketch below).
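To make that concrete, here is a minimal sketch, assuming the Stock key from the example above and a grouping comparator that compares the symbol only; the logging is purely illustrative and not part of the original job:

    @Override
    protected void reduce(Stock key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        for (DoubleWritable dividend : values) {
            // Each iteration deserializes the next record's key into the SAME key object,
            // so within one group key.getDate() advances from record to record while
            // key.getSymbol() stays fixed -- "fixed" meaning equal according to the
            // grouping comparator.
            System.out.println(key.getSymbol() + " " + key.getDate() + " " + dividend.get());
        }
    }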
So let's keep tracing into the next() method of the iterator obtained from context.getValues(). Here context is the ReduceContext interface (ReduceContext.java), whose implementation class is ReduceContextImpl.java.


    protected class ValueIterable implements Iterable<VALUEIN> {
        private ValueIterator iterator = new ValueIterator();
        @Override
        public Iterator<VALUEIN> iterator() {
            return iterator;
        }
    }

    /**
     * Iterate through the values for the current key, reusing the same value
     * object, which is stored in the context.
     * @return the series of values associated with the current key. All of the
     * objects returned directly and indirectly from this method are reused.
     */
    public Iterable<VALUEIN> getValues() throws IOException, InterruptedException {
        return iterable;
    }
getValues() simply returns the iterable, a ValueIterable. Following that type, it becomes clear that iterating over the Iterable inside the reduce function actually calls ValueIterator's next() method. Let's look at the implementation of next().


    @Override
    public VALUEIN next() {
        // ...
        nextKeyValue();
        return value;
        // ...
    }
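To connect this back to the reducer: the enhanced for loop over values is just syntactic sugar, so every pass through the loop body triggers one call to this next(). A rough desugaring, for illustration only:

    // What "for (DoubleWritable dividend : values)" boils down to, conceptually:
    Iterator<DoubleWritable> it = values.iterator();   // returns the shared ValueIterator
    while (it.hasNext()) {
        DoubleWritable dividend = it.next();            // internally calls nextKeyValue()
        // ... loop body ...
    }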
Continue into nextKeyValue(). Here we finally find a comparator, and this is exactly the GroupingComparator we configured.


    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        // ... (reads and deserializes the current key/value pair, then peeks at the next serialized key)
        if (hasMore) {
            nextKey = input.getKey();
            // nextKeyIsSame: does the upcoming key belong to the same group as the
            // current one, according to the configured comparator?
            nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
                                               currentRawKey.getLength(),
                                               nextKey.getData(),
                                               nextKey.getPosition(),
                                               nextKey.getLength() - nextKey.getPosition()
                                                   ) == 0;
        } else {
            nextKeyIsSame = false;
        }
        inputValueCounter.increment(1);
        return true;
    }
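This nextKeyIsSame flag is what actually draws the group boundaries. Simplified (this is a paraphrase of ReduceContextImpl, not a verbatim quote), the two places that consume it look roughly like this:

    // ValueIterator.hasNext(): the current group has another value as long as we are
    // still on its first value, or the upcoming key compared equal (== 0) under the
    // grouping comparator.
    public boolean hasNext() {
        return firstValue || nextKeyIsSame;
    }

    // ReduceContextImpl.nextKey(): called from Reducer.run() to advance to the next
    // group; it skips whatever is left of the current group and then starts a new one,
    // so reduce() is invoked exactly once per group.
    public boolean nextKey() throws IOException, InterruptedException {
        while (hasMore && nextKeyIsSame) {
            nextKeyValue();
        }
        if (hasMore) {
            return nextKeyValue();
        }
        return false;
    }

So the key/Iterable pair handed to reduce() is simply a run of consecutive records in the sorted input whose keys the grouping comparator considers equal.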
To prove that this comparator really is the GroupingComparator we configured, trace who constructs ReduceContextImpl: the run method of ReduceTask.


    @Override
    @SuppressWarnings("unchecked")
    public void run(JobConf job, final TaskUmbilicalProtocol umbilical) {
        // ...
        RawComparator comparator = job.getOutputValueGroupingComparator();
        runNewReducer(job, umbilical, reporter, rIter, comparator,
                      keyClass, valueClass);
    }
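On the driver side, this is the comparator registered via Job.setGroupingComparatorClass(); if none is set, JobConf.getOutputValueGroupingComparator() falls back to the key's sort comparator, so grouping then follows the sort order's notion of equality. A minimal driver sketch, where StockSymbolGroupingComparator comes from the sketch above and StockSymbolPartitioner / StockComparator are hypothetical placeholder classes:

    Job job = Job.getInstance(new Configuration(), "dividend-growth");
    job.setMapOutputKeyClass(Stock.class);
    job.setMapOutputValueClass(DoubleWritable.class);

    // All records of one symbol must reach the same reduce task ...
    job.setPartitionerClass(StockSymbolPartitioner.class);
    // ... and are then grouped by symbol only; this class is what ends up as
    // `comparator` in ReduceTask.run() above.
    job.setGroupingComparatorClass(StockSymbolGroupingComparator.class);
    // The sort comparator (symbol, then date) still orders the values inside each group.
    job.setSortComparatorClass(StockComparator.class);

    job.setReducerClass(DividendGrowthReducer.class);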
The code of runNewReducer is shown below as well.


    void runNewReducer(JobConf job,
                       final TaskUmbilicalProtocol umbilical,
                       final TaskReporter reporter,
                       RawKeyValueIterator rIter,
                       RawComparator<INKEY> comparator,
                       Class<INKEY> keyClass,
                       Class<INVALUE> valueClass
                       ) {
        // ...
        org.apache.hadoop.mapreduce.Reducer.Context
             reducerContext = createReduceContext(reducer, job, getTaskID(),
                                                  rIter, reduceInputKeyCounter,
                                                  reduceInputValueCounter,
                                                  trackedRW,
                                                  committer,
                                                  reporter, comparator, keyClass,
                                                  valueClass);
        // ...
    }
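createReduceContext (defined in Task.java) simply forwards that comparator into the ReduceContextImpl constructor, where it becomes the `comparator` field used by nextKeyValue() above. Roughly (paraphrased, with the unrelated parameters abbreviated):

    // Inside Task.createReduceContext(...) -- paraphrased:
    reduceContext = new ReduceContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(
            job, taskId, rIter,
            inputKeyCounter, inputValueCounter,
            output, committer, reporter,
            comparator,          // <-- job.getOutputValueGroupingComparator()
            keyClass, valueClass);

This closes the chain: setGroupingComparatorClass() -> getOutputValueGroupingComparator() -> ReduceContextImpl's comparator -> nextKeyValue() -> nextKeyIsSame -> the value groups seen by reduce().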
Well, that wraps up the source code analysis of how a custom GroupingComparator takes effect.

Originally published at the ITPUB blog: http://blog.itpub.net/30066956/viewspace-2095520/
