Big Data: Hadoop (13) MapReduce 06, Core Framework Internals, Source Code (3): MapTask & ReduceTask

大数据之负

Category: Hadoop. Tags: 1024 Programmer's Day, hadoop, big data, mapreduce

Article link: https://blog.csdn.net/m0_52968216/article/details/127388280

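Contents

1. MapTask source code walkthrough
2. ReduceTask source code

1. MapTask source code walkthrough

1.1 Example

The walkthrough uses the earlier example that computes, for each partition, the total upstream traffic, downstream traffic, and overall traffic per phone number.

Example link: Big Data: Hadoop (9) MapReduce 02, Serialization

1.2 Overall flow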

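Mapper.run()

First, setup() runs
- Called once when the task starts
- Can be overridden to do initialization, such as opening connections
Main loop
- Iterates over the input records
- Calls map()
  - Steps 1-6 of the walkthrough below happen inside map()
- Called once per input record (one line of text in this example)
Finally, cleanup() runs
- Called once after all records of the task have been processed
  - Steps 7-10 of the walkthrough below run right after run() returns, when MapTask closes the output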
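
The flow above can be sketched without any Hadoop dependency. This is a minimal model, not Hadoop's Mapper: the names MapperRunSketch, SketchMapper and traceFor are hypothetical, and the byte offset passed to map() only imitates the LongWritable key of the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class MapperRunSketch {
    // Hypothetical stand-in for a Mapper; it only records the order of calls.
    abstract static class SketchMapper {
        final List<String> trace = new ArrayList<>();

        protected void setup() { trace.add("setup"); }
        protected abstract void map(long offset, String line);
        protected void cleanup() { trace.add("cleanup"); }

        // setup() once, map() once per record, cleanup() once at the end.
        public final void run(Iterator<String> records) {
            setup();
            try {
                long offset = 0;
                while (records.hasNext()) {
                    String line = records.next();
                    map(offset, line);
                    offset += line.length() + 1; // +1 for the line terminator
                }
            } finally {
                cleanup(); // still runs if map() throws
            }
        }
    }

    static List<String> traceFor(List<String> lines) {
        SketchMapper mapper = new SketchMapper() {
            @Override
            protected void map(long offset, String line) {
                trace.add("map@" + offset);
            }
        };
        mapper.run(lines.iterator());
        return mapper.trace;
    }

    public static void main(String[] args) {
        // Two made-up input lines
        System.out.println(traceFor(Arrays.asList("first line", "second line")));
    }
}
```

With the two sample lines, the trace is setup, map@0, map@11, cleanup, which matches the order described in the list.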

1.2.1 Running map()

Class: Mapper
1. Write the data out to the ring buffer

public class FlowMapper extends Mapper<LongWritable, Text,Text, FlowBean> {

   ……

	@Override
	protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

   ……

    // 5 Write out
    context.write(outK,outV);
  }
}

Class: WrappedMapper
2.

@Override
public void write(KEYOUT key, VALUEOUT value) throws IOException,
    InterruptedException {
  mapContext.write(key, value);
}

Class: TaskInputOutputContextImpl
3.

public void write(KEYOUT key, VALUEOUT value
                ) throws IOException, InterruptedException {
	output.write(key, value);
}

Class: MapTask (inner class NewOutputCollector)
4.

@Override
public void write(K key, V value) throws IOException, InterruptedException {
  collector.collect(key, value,
                    partitioner.getPartition(key, value, partitions));
}

partitioner.getPartition(key, value, partitions));

Class: ProvincePartitioner
5. Custom partitioner

public class ProvincePartitioner extends Partitioner<Text, FlowBean> {

@Override
public int getPartition(Text text, FlowBean flowBean, int numPartitions) {

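    // text is the phone number
    String phone = text.toString();

    // take the first three digits of the phone number
    String prePhone = phone.substring(0, 3);

    // put the constant first in equals() to avoid a NullPointerException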
    int partition;

    if("136".equals(prePhone)){
        partition = 0;
    }else if ("137".equals(prePhone)){
        partition = 1;
    }else if ("138".equals(prePhone)){
        partition = 2;
    }else if ("139".equals(prePhone)){
        partition = 3;
    }else {
        partition = 4;
    }

    return partition;
 }
}

collector.collect(key, value, partitioner.getPartition(key, value, partitions));
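
The partition number returned in step 5 is checked by collect() in step 6, which rejects any value outside 0 to partitions - 1. The following is a small sketch of that interaction, assuming made-up phone numbers; PartitionCheckSketch, partitionFor and checkPartition are hypothetical names, and only the prefix rule and the error message are taken from the listings.

```java
import java.io.IOException;

class PartitionCheckSketch {
    // Same prefix rule as ProvincePartitioner.
    static int partitionFor(String phone) {
        String prefix = phone.substring(0, 3);
        switch (prefix) {
            case "136": return 0;
            case "137": return 1;
            case "138": return 2;
            case "139": return 3;
            default:    return 4;
        }
    }

    // Same range check as in collect(): the partition must lie in [0, partitions).
    static void checkPartition(String key, int partition, int partitions) throws IOException {
        if (partition < 0 || partition >= partitions) {
            throw new IOException("Illegal partition for " + key + " (" + partition + ")");
        }
    }

    public static void main(String[] args) throws IOException {
        String phone = "13912345678"; // made-up number
        int partition = partitionFor(phone);
        checkPartition(phone, partition, 5); // accepted: 5 partitions cover 0..4
        try {
            checkPartition(phone, partition, 3); // rejected: partition 3 does not exist
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch, a partitioner that can return 0 to 4 needs at least 5 partitions; with fewer, the check fails for every key that maps beyond the last partition.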

Class: MapTask (inner class MapOutputBuffer)
6.

public synchronized void collect(K key, V value, final int partition
                                 ) throws IOException {
  reporter.progress();
  if (key.getClass() != keyClass) {
    throw new IOException("Type mismatch in key from map: expected "
                          + keyClass.getName() + ", received "
                          + key.getClass().getName());
  }
  if (value.getClass() != valClass) {
    throw new IOException("Type mismatch in value from map: expected "
                          + valClass.getName() + ", received "
                          + value.getClass().getName());
  }
  if (partition < 0 || partition >= partitions) {
    throw new IOException("Illegal partition for " + key + " (" +
        partition + ")");
  }
  checkSpillException();
  bufferRemaining -= METASIZE;
  if (bufferRemaining <= 0) {
    // start spill if the thread is not running and the soft limit has been
    // reached
    spillLock.lock();
    try {
      do {
        if (!spillInProgress) {
          final int kvbidx = 4 * kvindex;
          final int kvbend = 4 * kvend;
          // serialized, unspilled bytes always lie between kvindex and
          // bufindex, crossing the equator. Note that any void space
          // created by a reset must be included in "used" bytes
          final int bUsed = distanceTo(kvbidx, bufindex);
          final boolean bufsoftlimit = bUsed >= softLimit;
          if ((kvbend + METASIZE) % kvbuffer.length !=
              equator - (equator % METASIZE)) {
            // spill finished, reclaim space
            resetSpill();
            bufferRemaining = Math.min(
                distanceTo(bufindex, kvbidx) - 2 * METASIZE,
                softLimit - bUsed) - METASIZE;
            continue;
          } else if (bufsoftlimit && kvindex != kvend) {
            // spill records, if any collected; check latter, as it may
            // be possible for metadata alignment to hit spill pcnt
            startSpill();
            final int avgRec = (int)
              (mapOutputByteCounter.getCounter() /
              mapOutputRecordCounter.getCounter());
            // leave at least half the split buffer for serialization data
            // ensure that kvindex >= bufindex
            final int distkvi = distanceTo(bufindex, kvbidx);
            final int newPos = (bufindex +
              Math.max(2 * METASIZE - 1,
                      Math.min(distkvi / 2,
                               distkvi / (METASIZE + avgRec) * METASIZE)))
              % kvbuffer.length;
            setEquator(newPos);
            bufmark = bufindex = newPos;
            final int serBound = 4 * kvend;
            // bytes remaining before the lock must be held and limits
            // checked is the minimum of three arcs: the metadata space, the
            // serialization space, and the soft limit
            bufferRemaining = Math.min(
                // metadata max
                distanceTo(bufend, newPos),
                Math.min(
                  // serialization max
                  distanceTo(newPos, serBound),
                  // soft limit
                  softLimit)) - 2 * METASIZE;
          }
        }
      } while (false);
    } finally {
      spillLock.unlock();
    }
  }

  try {
    // serialize key bytes into buffer
    int keystart = bufindex;
    keySerializer.serialize(key);
    if (bufindex < keystart) {
      // wrapped the key; must make contiguous
      bb.shiftBufferedKey();
      keystart = 0;
    }
    // serialize value bytes into buffer
    final int valstart = bufindex;
    valSerializer.serialize(value);
    // It's possible for records to have zero length, i.e. the serializer
    // will perform no writes. To ensure that the boundary conditions are
    // checked and that the kvindex invariant is maintained, perform a
    // zero-length write into the buffer. The logic monitoring this could be
    // moved into collect, but this is cleaner and inexpensive. For now, it
    // is acceptable.
    bb.write(b0, 0, 0);

    // the record must be marked after the preceding write, as the metadata
    // for this record are not yet written
    int valend = bb.markRecord();

    mapOutputRecordCounter.increment(1);
    mapOutputByteCounter.increment(
        distanceTo(keystart, valend, bufvoid));

    // write accounting info
    kvmeta.put(kvindex + PARTITION, partition);
    kvmeta.put(kvindex + KEYSTART, keystart);
    kvmeta.put(kvindex + VALSTART, valstart);
    kvmeta.put(kvindex + VALLEN, distanceTo(valstart, valend));
    // advance kvindex
    kvindex = (kvindex - NMETA + kvmeta.capacity()) % kvmeta.capacity();
  } catch (MapBufferTooSmallException e) {
    LOG.info("Record too large for in-memory buffer: " + e.getMessage());
    spillSingleRecord(key, value, partition);
    mapOutputRecordCounter.increment(1);
    return;
  }
}

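Here the collector, the class that owns collect(), is the ring buffer.

Left side stores the index (key)
index	the index entry
partition	partition of the record
keystart	start position of the record's key
valstart	start position of the record's value

Right side stores the data (value)
key	the Mapper's output key, determined by the business logic
value	the Mapper's output value, determined by the business logic
unused

keySerializer.serialize(key);
This is the serialization call; serialization is what allows the data to be transferred across nodes.
The sample input has only 22 lines, which is not enough to trigger a ring-buffer spill.
Each execution of step 6 writes one record.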
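
The two-sided layout can be illustrated with a much simplified buffer. This is a sketch under assumptions, not the MapOutputBuffer implementation: it does not wrap around or spill, the order of the four metadata slots is chosen for readability, and the capacity and spill percentage in main() are made-up values.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.charset.StandardCharsets;

class RingBufferSketch {
    static final int NMETA = 4;            // metadata ints per record
    static final int METASIZE = NMETA * 4; // metadata bytes per record

    final byte[] kvbuffer;  // one array shared by data and metadata
    final IntBuffer kvmeta; // int view over the same bytes
    final int softLimit;    // used bytes at which a spill would start
    int bufindex = 0;       // next free data byte, grows upward
    int kvindex;            // next free metadata slot (in ints), grows downward
    int records = 0;

    RingBufferSketch(int capacityBytes, double spillPercent) {
        kvbuffer = new byte[capacityBytes - capacityBytes % METASIZE];
        kvmeta = ByteBuffer.wrap(kvbuffer).asIntBuffer();
        kvindex = kvmeta.capacity() - NMETA;
        softLimit = (int) (kvbuffer.length * spillPercent);
    }

    int usedBytes() {
        return bufindex + records * METASIZE;
    }

    // Stores one record; returns true once the soft limit is reached.
    boolean collect(String key, String value, int partition) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        if (usedBytes() + k.length + v.length + METASIZE > kvbuffer.length) {
            throw new IllegalStateException("buffer full; a real collector spills first");
        }
        int keystart = bufindex;
        System.arraycopy(k, 0, kvbuffer, bufindex, k.length);
        bufindex += k.length;
        int valstart = bufindex;
        System.arraycopy(v, 0, kvbuffer, bufindex, v.length);
        bufindex += v.length;

        // Metadata goes to the opposite end of the same array.
        kvmeta.put(kvindex, partition);
        kvmeta.put(kvindex + 1, keystart);
        kvmeta.put(kvindex + 2, valstart);
        kvmeta.put(kvindex + 3, v.length);
        kvindex -= NMETA;
        records++;
        return usedBytes() >= softLimit;
    }

    public static void main(String[] args) {
        RingBufferSketch buffer = new RingBufferSketch(160, 0.8); // made-up sizes
        int written = 0;
        boolean spill = false;
        while (!spill) {
            spill = buffer.collect("136", "ab", 0);
            written++;
        }
        System.out.println("soft limit reached after " + written + " records");
    }
}
```

Each record costs its serialized bytes plus 16 bytes of metadata, which is why a small input such as the 22-line sample never reaches the spill threshold of a realistically sized buffer.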

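1.2.2 Running cleanup() and closing the output

Class: MapTask
7. mapper.run() returns once cleanup() has finished; runNewMapper then closes the output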

@SuppressWarnings("unchecked")
private <INKEY,INVALUE,OUTKEY,OUTVALUE>
void runNewMapper(final JobConf job,
                final TaskSplitIndex splitIndex,
                final TaskUmbilicalProtocol umbilical,
                TaskReporter reporter
                ) throws IOException, ClassNotFoundException,
                         InterruptedException {

	……
	
	try {
	  input.initialize(split, mapperContext);
	  mapper.run(mapperContext);
	  mapPhase.complete();
	  setPhase(TaskStatus.Phase.SORT);
	  statusUpdate(umbilical);
	  input.close();
	  input = null;
	  output.close(mapperContext);
	  output = null;
	} finally {
	  closeQuietly(input);
	  closeQuietly(output, mapperContext);
	}
}

output.close(mapperContext);

8. How the contents of the ring buffer are flushed to disk

@Override
public void close(TaskAttemptContext context
                  ) throws IOException,InterruptedException {
  try {
    collector.flush();
  } catch (ClassNotFoundException cnf) {
    throw new IOException("can't find class ", cnf);
  }
  collector.close();
  }
}

collector.flush();

9.

public void flush() throws IOException, ClassNotFoundException,
       InterruptedException {
  LOG.info("Starting flush of map output");
  if (kvbuffer == null) {
    LOG.info("kvbuffer is null. Skipping flush.");
    return;
  }
  spillLock.lock();
  try {
    while (spillInProgress) {
      reporter.progress();
      spillDone.await();
    }
    checkSpillException();

    final int kvbend = 4 * kvend;
    if ((kvbend + METASIZE) % kvbuffer.length !=
        equator - (equator % METASIZE)) {
      // spill finished
      resetSpill();
    }
    if (kvindex != kvend) {
      kvend = (kvindex + NMETA) % kvmeta.capacity();
      bufend = bufmark;
      LOG.info("Spilling map output");
      LOG.info("bufstart = " + bufstart + "; bufend = " + bufmark +
               "; bufvoid = " + bufvoid);
      LOG.info("kvstart = " + kvstart + "(" + (kvstart * 4) +
               "); kvend = " + kvend + "(" + (kvend * 4) +
               "); length = " + (distanceTo(kvend, kvstart,
                     kvmeta.capacity()) + 1) + "/" + maxRec);
      sortAndSpill();
    }
  } catch (InterruptedException e) {
    throw new IOException("Interrupted while waiting for the writer", e);
  } finally {
    spillLock.unlock();
  }
  assert !spillLock.isHeldByCurrentThread();
  // shut down spill thread and wait for it to exit. Since the preceding
  // ensures that it is finished with its work (and sortAndSpill did not
  // throw), we elect to use an interrupt instead of setting a flag.
  // Spilling simultaneously from this thread while the spill thread
  // finishes its work might be both a useful way to extend this and also
  // sufficient motivation for the latter approach.
  try {
    spillThread.interrupt();
    spillThread.join();
  } catch (InterruptedException e) {
    throw new IOException("Spill failed", e);
  }
  // release sort buffer before the merge
  kvbuffer = null;
  mergeParts();
  Path outputPath = mapOutputFile.getOutputFile();
  fileOutputByteCounter.increment(rfs.getFileStatus(outputPath).getLen());
  // If necessary, make outputs permissive enough for shuffling.
  if (!SHUFFLE_OUTPUT_PERM.equals(
      SHUFFLE_OUTPUT_PERM.applyUMask(FsPermission.getUMask(job)))) {
    Path indexPath = mapOutputFile.getOutputIndexFile();
    rfs.setPermission(outputPath, SHUFFLE_OUTPUT_PERM);
    rfs.setPermission(indexPath, SHUFFLE_OUTPUT_PERM);
  }
}

10. Sort and spill

sortAndSpill();

This method contains the quicksort logic and produces one spill file that contains multiple partitions.
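
The "one spill file, several partitions" result can be sketched as follows. This is an illustration under assumptions: it sorts record objects with List.sort rather than quicksorting the buffer metadata as sortAndSpill() does, and Rec, Spill and the in-memory "file" are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SortAndSpillSketch {
    static final class Rec {
        final int partition;
        final String key;
        final String value;

        Rec(int partition, String key, String value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    // One spill: all records in a single list, plus an index of partition -> {start, count}.
    static final class Spill {
        final List<String> lines = new ArrayList<>();
        final Map<Integer, int[]> index = new LinkedHashMap<>();
    }

    static Spill sortAndSpill(List<Rec> buffered) {
        List<Rec> sorted = new ArrayList<>(buffered);
        // Order by partition first, then by key inside each partition.
        sorted.sort(Comparator.<Rec>comparingInt(r -> r.partition).thenComparing(r -> r.key));

        Spill spill = new Spill();
        for (Rec r : sorted) {
            int[] segment = spill.index.computeIfAbsent(
                    r.partition, p -> new int[] {spill.lines.size(), 0});
            segment[1]++;
            spill.lines.add(r.key + "\t" + r.value);
        }
        return spill;
    }

    public static void main(String[] args) {
        // Made-up records: partition, phone number, value
        Spill spill = sortAndSpill(Arrays.asList(
                new Rec(1, "13700000002", "b"),
                new Rec(0, "13600000009", "a"),
                new Rec(1, "13700000001", "c")));
        spill.lines.forEach(System.out::println);
        spill.index.forEach((p, seg) ->
                System.out.println("partition " + p + ": start=" + seg[0] + ", count=" + seg[1]));
    }
}
```

Because records of one partition are contiguous after the sort, a reader only needs the index to locate the segment that belongs to it.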

2. ReduceTask source code

Class: ReduceTask
1. The three phases of Reduce: copy, sort, reduce

public void run(JobConf job, final TaskUmbilicalProtocol umbilical)
throws IOException, InterruptedException, ClassNotFoundException {
job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());

	if (isMapOrReduce()) {
	  copyPhase = getProgress().addPhase("copy");
	  sortPhase  = getProgress().addPhase("sort");
	  reducePhase = getProgress().addPhase("reduce");
	}
	// start thread that will handle communication with parent
	TaskReporter reporter = startReporter(umbilical);
	
	boolean useNewApi = job.getUseNewReducer();
	initialize(job, getJobID(), reporter, useNewApi);   
	
	   ……
	
	shuffleConsumerPlugin.init(shuffleContext);
	
	rIter = shuffleConsumerPlugin.run();
	
	// free up the data structures
	mapOutputFilesOnDisk.clear();
	
	sortPhase.complete();                         // sort is complete
	setPhase(TaskStatus.Phase.REDUCE); 
	statusUpdate(umbilical);
	Class keyClass = job.getMapOutputKeyClass();
	Class valueClass = job.getMapOutputValueClass();
	RawComparator comparator = job.getOutputValueGroupingComparator();
	
	if (useNewApi) {
	  runNewReducer(job, umbilical, reporter, rIter, comparator, 
	                keyClass, valueClass);
	} else {
	  runOldReducer(job, umbilical, reporter, rIter, comparator, 
	                keyClass, valueClass);
	}
	
	shuffleConsumerPlugin.close();
	done(umbilical, reporter);
}

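The run() in step 1 executes 5 times, once per ReduceTask, because the job has 5 reducers.

2.1 Initialization

2. Initialization method

initialize(job, getJobID(), reporter, useNewApi);

3. Initialization method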

shuffleConsumerPlugin.init(shuffleContext);

Class: Shuffle

4.

@Override
public void init(ShuffleConsumerPlugin.Context context) {
	this.context = context;
	
	this.reduceId = context.getReduceId();
	this.jobConf = context.getJobConf();
	this.umbilical = context.getUmbilical();
	this.reporter = context.getReporter();
	this.metrics = ShuffleClientMetrics.create();
	this.copyPhase = context.getCopyPhase();
	this.taskStatus = context.getStatus();
	this.reduceTask = context.getReduceTask();
	this.localMapFiles = context.getLocalMapFiles();
	
	scheduler = new ShuffleSchedulerImpl<K, V>(jobConf, taskStatus, reduceId,
	    this, copyPhase, context.getShuffledMapsCounter(),
	    context.getReduceShuffleBytes(), context.getFailedShuffleCounter());
	merger = createMergeManager(context);
 }

context.getReduceShuffleBytes(), context.getFailedShuffleCounter());

This leads to step 5: determining the number of MapTasks.

merger = createMergeManager(context);

This leads to steps 6-8, the merge logic: if both memory and disk hold data, each is merged once, followed by one final merge.

5.

public ShuffleSchedulerImpl(JobConf job, TaskStatus status,
                      TaskAttemptID reduceId,
                      ExceptionReporter reporter,
                      Progress progress,
                      Counters.Counter shuffledMapsCounter,
                      Counters.Counter reduceShuffleBytes,
                      Counters.Counter failedShuffleCounter) {
	totalMaps = job.getNumMapTasks();
	
	 ……

}

totalMaps = job.getNumMapTasks();

Determines the number of MapTasks.

2.2 Merge logic

6.

protected MergeManager<K, V> createMergeManager(
  ShuffleConsumerPlugin.Context context) {
	return new MergeManagerImpl<K, V>(reduceId, jobConf, context.getLocalFS(),
	    context.getLocalDirAllocator(), reporter, context.getCodec(),
	    context.getCombinerClass(), context.getCombineCollector(), 
	    context.getSpilledRecordsCounter(),
	    context.getReduceCombineInputCounter(),
	    context.getMergedMapOutputsCounter(), this, context.getMergePhase(),
	    context.getMapOutputFile());
}

7.

new MergeManagerImpl<K, V>(reduceId, jobConf, context.getLocalFS(),
    context.getLocalDirAllocator(), reporter, context.getCodec(),
    context.getCombinerClass(), context.getCombineCollector(), 
    context.getSpilledRecordsCounter(),
    context.getReduceCombineInputCounter(),
    context.getMergedMapOutputsCounter(), this, context.getMergePhase(),
    context.getMapOutputFile());

Class: MergeManagerImpl

8.

public MergeManagerImpl(TaskAttemptID reduceId, JobConf jobConf, 
                  FileSystem localFS,
                  LocalDirAllocator localDirAllocator,  
                  Reporter reporter,
                  CompressionCodec codec,
                  Class<? extends Reducer> combinerClass,
                  CombineOutputCollector<K,V> combineCollector,
                  Counters.Counter spilledRecordsCounter,
                  Counters.Counter reduceCombineInputCounter,
                  Counters.Counter mergedMapOutputsCounter,
                  ExceptionReporter exceptionReporter,
                  Progress mergePhase, MapOutputFile mapOutputFile) {
	……
	// Allow unit tests to fix Runtime memory
	this.memoryLimit = (long)(jobConf.getLong(
	    MRJobConfig.REDUCE_MEMORY_TOTAL_BYTES,
	    Runtime.getRuntime().maxMemory()) * maxInMemCopyUse);
	
	this.ioSortFactor = jobConf.getInt(MRJobConfig.IO_SORT_FACTOR,
	    MRJobConfig.DEFAULT_IO_SORT_FACTOR);
	……
  }
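
The two settings read in step 8 can be illustrated with a small model: a memory limit computed as a fraction of the available heap, and a merge that combines at most ioSortFactor sorted runs per pass. This is a sketch under assumptions: MergeSketch and its methods are hypothetical, the numbers in main() are made up, and taking the first runs of the list in each pass is a simplification rather than the selection policy Hadoop uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

class MergeSketch {
    // Share of the heap usable for shuffled data; fraction plays the role of maxInMemCopyUse.
    static long memoryLimit(long maxHeapBytes, float fraction) {
        return (long) (maxHeapBytes * fraction);
    }

    // k-way merge of runs that are each already sorted.
    static List<String> mergeRuns(List<List<String>> runs) {
        // Heap entries are {runIndex, positionInRun}, ordered by the element they point at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heap.add(new int[] {i, 0});
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            List<String> run = runs.get(top[0]);
            out.add(run.get(top[1]));
            if (top[1] + 1 < run.size()) {
                heap.add(new int[] {top[0], top[1] + 1});
            }
        }
        return out;
    }

    // Merges at most `factor` runs per pass until a single run remains.
    static List<String> mergeWithFactor(List<List<String>> runs, int factor) {
        if (factor < 2) {
            throw new IllegalArgumentException("factor must be at least 2");
        }
        List<List<String>> pending = new ArrayList<>(runs);
        while (pending.size() > 1) {
            int n = Math.min(factor, pending.size());
            List<String> merged = mergeRuns(new ArrayList<>(pending.subList(0, n)));
            pending.subList(0, n).clear();
            pending.add(merged);
        }
        return pending.isEmpty() ? new ArrayList<>() : pending.get(0);
    }

    public static void main(String[] args) {
        List<List<String>> runs = Arrays.asList(
                Arrays.asList("a", "d"), Arrays.asList("b", "e"), Arrays.asList("c", "f"));
        System.out.println(mergeWithFactor(runs, 2));
        System.out.println(memoryLimit(1000L, 0.5f)); // made-up heap size and fraction
    }
}
```

A smaller factor means more passes over the data, while a larger one keeps more runs open at the same time.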

2.3 Fetching data

9.

rIter = shuffleConsumerPlugin.run();

Class: Shuffle

10.

@Override
public RawKeyValueIterator run() throws IOException, InterruptedException {

	……
	
	// Start the map-completion events fetcher thread 
	final EventFetcher<K,V> eventFetcher = 
	  new EventFetcher<K,V>(reduceId, umbilical, scheduler, this,
	      maxEventsToFetch);
	eventFetcher.start();
	
	……
	
	// stop the scheduler
	scheduler.close();
	
	copyPhase.complete(); // copy is already complete
	taskStatus.setPhase(TaskStatus.Phase.SORT);
	reduceTask.statusUpdate(umbilical);
	
	// Finish the on-going merges...
	RawKeyValueIterator kvIter = null;
	try {
	  kvIter = merger.close();
	} catch (Throwable e) {
	  throw new ShuffleError("Error while doing final merge " , e);
	}
	
	// Sanity check
	synchronized (this) {
	  if (throwable != null) {
	    throw new ShuffleError("error in shuffle in " + throwingThreadName,
	                           throwable);
	  }
	}
	
	return kvIter;
}

11. Fetching starts (this thread fetches map-completion events)

eventFetcher.start();

12. Fetching ends

scheduler.close();

13. The copy phase ends

copyPhase.complete(); // copy is already complete
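
Steps 11-13 amount to: start the fetching, wait until the output of every MapTask has been copied, then mark the copy phase complete. The following is a rough model with a plain thread pool. It is not Hadoop's EventFetcher or Fetcher code; CopyPhaseSketch, copyAll and the "output-of-map-N" strings are hypothetical stand-ins for network copies.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class CopyPhaseSketch {
    // Copies one made-up "map output" per map task using a few parallel workers.
    static List<String> copyAll(int totalMaps, int fetchers) {
        Queue<Integer> pendingMaps = new ConcurrentLinkedQueue<>();
        for (int m = 0; m < totalMaps; m++) {
            pendingMaps.add(m);
        }

        List<String> fetched = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(fetchers);
        for (int f = 0; f < fetchers; f++) {
            pool.submit(() -> {
                Integer mapId;
                while ((mapId = pendingMaps.poll()) != null) {
                    fetched.add("output-of-map-" + mapId); // stand-in for a remote copy
                }
            });
        }

        pool.shutdown(); // no further work; workers drain the queue and exit
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("fetchers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for fetchers", e);
        }
        return fetched; // copy phase complete; sorting and merging would follow
    }

    public static void main(String[] args) {
        List<String> outputs = copyAll(8, 3); // made-up counts
        System.out.println("copied " + outputs.size() + " map outputs");
    }
}
```

The order in which outputs arrive is not deterministic, which is one reason a merge step has to follow the copy.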

2.4 Starting the ReduceTask's remaining logic

14.

runNewReducer(job, umbilical, reporter, rIter, comparator, 
              keyClass, valueClass);
Class: ReduceTask

15.

private <INKEY,INVALUE,OUTKEY,OUTVALUE>
  void runNewReducer(JobConf job,
                 final TaskUmbilicalProtocol umbilical,
                 final TaskReporter reporter,
                 RawKeyValueIterator rIter,
                 RawComparator<INKEY> comparator,
                 Class<INKEY> keyClass,
                 Class<INVALUE> valueClass
                 ) throws IOException,InterruptedException, 
                          ClassNotFoundException {

	……
	
	try {
	  reducer.run(reducerContext);
	} finally {
	  trackedRW.close(reducerContext);
	  }
	}

Class: Reducer
16.

public void run(Context context) throws IOException, InterruptedException {
setup(context);
	try {
	  while (context.nextKey()) {
	    reduce(context.getCurrentKey(), context.getValues(), context);
	    // If a back up store is used, reset it
	    Iterator<VALUEIN> iter = context.getValues().iterator();
	    if(iter instanceof ReduceContext.ValueIterator) {
	      ((ReduceContext.ValueIterator<VALUEIN>)iter).resetBackupStore();        
	    }
	  }
	} finally {
	  cleanup(context);
	   }
    }
}
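
The loop in step 16 calls reduce() once per distinct key, with all values of that key. The grouping can be sketched on an already sorted list of pairs. This is a simplified model: ReducerRunSketch is a hypothetical name, values are collected into a list instead of being streamed, keys are compared with equals() rather than a grouping comparator, and setup() and cleanup() are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ReducerRunSketch {
    // Walks key-sorted pairs and reduces each run of equal keys once.
    static Map<String, Long> run(List<String[]> sortedPairs) {
        Map<String, Long> out = new LinkedHashMap<>();
        int i = 0;
        while (i < sortedPairs.size()) { // plays the role of context.nextKey()
            String key = sortedPairs.get(i)[0];
            List<Long> values = new ArrayList<>();
            while (i < sortedPairs.size() && sortedPairs.get(i)[0].equals(key)) {
                values.add(Long.parseLong(sortedPairs.get(i)[1]));
                i++;
            }
            out.put(key, reduce(values));
        }
        return out;
    }

    // Example reduce logic: sum the values of one key.
    static long reduce(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.add(new String[] {"13600000001", "10"}); // made-up phone numbers and traffic
        pairs.add(new String[] {"13600000001", "5"});
        pairs.add(new String[] {"13700000002", "7"});
        System.out.println(run(pairs));
    }
}
```

The sketch relies on the input being sorted by key, which is what the sort phase before reduce provides.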
