Shuffle Process

geekLinyi

Category: hadoop  Tags: MapReduce shuffle

Article link: https://blog.csdn.net/weixin_43855370/article/details/101722357


The Reducer copies the sorted output from each Mapper using HTTP across the network.

[That is, each Reducer fetches the sorted output of every Mapper over the network via HTTP.]

Shuffle steps
1. map() function
2. Buffer (circular buffer)
3. Partition
4. Sort
5. Spill to disk
6. Merge on disk
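
To make the Partition and Sort steps above more concrete, here is a minimal sketch in plain Java. It does not use any Hadoop class: `PartitionSortSketch` and `Rec` are hypothetical names, and each record is assumed to already carry its partition number. The only point is the ordering that a spill ends up with, first by partition and then by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy illustration of the ordering applied before a spill; not Hadoop code. */
class PartitionSortSketch {

    /** A map output record whose partition number has already been assigned. */
    static final class Rec {
        final int partition;
        final String key;
        final int value;

        Rec(int partition, String key, int value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return partition + ":" + key + "=" + value;
        }
    }

    /** Returns a copy of the records ordered by partition, then by key. */
    static List<Rec> sortForSpill(List<Rec> records) {
        List<Rec> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparingInt((Rec r) -> r.partition)
                              .thenComparing(r -> r.key));
        return sorted;
    }

    public static void main(String[] args) {
        List<Rec> buffered = Arrays.asList(
                new Rec(1, "b", 1), new Rec(0, "z", 2),
                new Rec(1, "a", 3), new Rec(0, "c", 4));
        // prints [0:c=4, 0:z=2, 1:a=3, 1:b=1]
        System.out.println(sortForSpill(buffered));
    }
}
```

Because all records of one partition end up contiguous, a reducer can later be handed one contiguous slice of the file.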

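MapTask class analysis

[org.apache.hadoop.mapred.MapTask] run()
- conf.getNumReduceTasks() == 0

  [A map task is divided into two phases: the map phase and the sort phase.]

  [run() first checks whether the number of reduce tasks configured for the job is 0; if it is, there is no sort phase.]
- mapPhase = getProgress().addPhase("map", 0.667f);

  sortPhase = getProgress().addPhase("sort", 0.333f);

  [When the job has reduce tasks, the map phase accounts for 66.7% of the task's progress and the sort phase for 33.3%. If the program is interrupted before progress reaches 66.7%, the failure can be attributed to the map phase.]
- [org.apache.hadoop.mapred.Task] initialize()

  [First obtains the context, then sets the OutputFormat class, which defaults to org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.]
- [org.apache.hadoop.mapred.MapTask] runNewMapper()
  - a. Creates the user-defined Mapper object via reflection, then creates the InputFormat.
  - b. Calls getSplitDetails() to obtain the logical split (InputSplit) for this task, and creates the RecordReader.
  - c. Builds the RecordWriter used for output. When the job has reduce tasks, this is where the circular buffer is created.
  - d. Uses the objects built above to create the MapContext by calling new MapContextImpl().
  - e. Initializes the RecordReader by calling its initialize() method.
  - f. Calls mapper.run(mapperContext).
  - g. Calls mapPhase.complete() to mark the end of the map phase.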
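
The 0.667 / 0.333 phase weights described for run() can be illustrated with a small stand-alone sketch. This is not Hadoop's Progress class: `PhaseProgressSketch`, `overall` and `phaseAt` are hypothetical names, and the model simply assumes that overall progress is the weighted sum of the two phases.

```java
/** Toy model of weighted task phases; an assumption-based sketch, not Hadoop's Progress class. */
class PhaseProgressSketch {
    // Weights taken from addPhase("map", 0.667f) and addPhase("sort", 0.333f).
    static final float MAP_WEIGHT = 0.667f;
    static final float SORT_WEIGHT = 0.333f;

    /** Overall progress, given each phase's own completion in the range [0, 1]. */
    static float overall(float mapDone, float sortDone) {
        return MAP_WEIGHT * mapDone + SORT_WEIGHT * sortDone;
    }

    /** Names the phase a task was in when it stopped at the given overall progress. */
    static String phaseAt(float overallProgress) {
        return overallProgress < MAP_WEIGHT ? "map" : "sort";
    }

    public static void main(String[] args) {
        System.out.println(overall(0.5f, 0.0f)); // half of the map phase done
        System.out.println(phaseAt(0.40f));      // prints map
        System.out.println(phaseAt(0.80f));      // prints sort
    }
}
```

Under this model, a task that stops at 40% was still in the map phase, which is the diagnostic rule of thumb mentioned above.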

[org.apache.hadoop.mapred.MapTask$MapOutputBuffer] Circular buffer

[This class defines the fields that describe the circular buffer.]
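
As a rough illustration of how a spill threshold for such a buffer can be derived, the sketch below computes a soft limit from a buffer size and a spill percentage. It is plain Java with hypothetical names (`SpillThresholdSketch`, `softLimit`, `shouldSpill`); the 100 MB buffer size is an assumption (to my knowledge the default of mapreduce.task.io.sort.mb), and the real MapOutputBuffer accounting also tracks per-record metadata, which is ignored here.

```java
/** Toy calculation of a spill threshold; not the real MapOutputBuffer bookkeeping. */
class SpillThresholdSketch {

    /** Fill level in bytes at which a spill would be started. */
    static int softLimit(int bufferBytes, float spillPercent) {
        return (int) (bufferBytes * spillPercent);
    }

    /** True once the bytes in use have reached the soft limit. */
    static boolean shouldSpill(int usedBytes, int softLimitBytes) {
        return usedBytes >= softLimitBytes;
    }

    public static void main(String[] args) {
        int bufferBytes = 100 * 1024 * 1024;       // assumed 100 MB buffer
        int limit = softLimit(bufferBytes, 0.8f);  // 80% spill percentage
        System.out.println(limit / (1024 * 1024) + " MB");            // prints 80 MB
        System.out.println(shouldSpill(50 * 1024 * 1024, limit));     // prints false
        System.out.println(shouldSpill(80 * 1024 * 1024, limit));     // prints true
    }
}
```

Spilling at 80% rather than 100% leaves room for the map function to keep writing while the spill is in progress.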
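- [org.apache.hadoop.mapred.MapTask$MapOutputBuffer] init()
  - [partitions = job.getNumReduceTasks()]: the number of partitions is set to the number of reduce tasks.
  - [final float spillper = job.getFloat(JobContext.MAP_SORT_SPILL_PERCENT, (float)0.8)]: sets the spill threshold, 80% by default.
  - [kvmeta = ByteBuffer.wrap(kvbuffer).order(ByteOrder.nativeOrder()).asIntBuffer()]: wraps the kvbuffer byte array in a ByteBuffer, calls order() to set its byte order to the platform's native order, and then calls asIntBuffer() to view it as an IntBuffer.
  - setEquator(0): sets the equator to 0.
  - softLimit = (int)(kvbuffer.length * spillper): sets the buffer's spill limit, which is 80 MB for the default 100 MB buffer.
- [org.apache.hadoop.mapred.MapTask$MapOutputBuffer] collect(key, value, partition)

  [Serializes k2 and v2 into the buffer and checks whether the remaining space has run out; if it has, a spill is triggered.]
- [org.apache.hadoop.mapred.MapTask$MapOutputBuffer] flush()

  [Flushes the buffer to disk by calling sortAndSpill().]
- [org.apache.hadoop.mapred.MapTask$MapOutputBuffer] sortAndSpill()
  - a. First creates the path for the spill file.
  - b. [org.apache.hadoop.util.QuickSort] sorter.sort() performs a quicksort.

    [The sort operates on MapOutputBuffer itself, which implements the compare() and swap() methods of the IndexedSortable interface.]
    1. [compare() compares by partition first and, when the partition numbers are equal, by key.]
    2. [swap() exchanges the metadata entries of the circular buffer.]
  - c. Builds a Writer to write the data out.

    It iterates over the partitions one by one, so the resulting spill file consists of <k,v> pairs ordered first by partition and then by key.
  - d. Merge stage: flush() finally calls mergeParts().

    When numSpills == 1, there is only one spill file, and the merge is simply a rename.

    When numSpills == 0, no spill file was produced, and an empty file must still be created.

    When numSpills > 1, there are several spill files. The data belonging to the same partition is first added to segmentList, and then Merger.merge() is called to merge it; MergeQueue calls Collections.sort() to order the segments.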
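
The three merge cases above (no spill, one spill, several spills) can be sketched for a single partition as follows. This is a simplified stand-in and not Hadoop's Merger: `SpillMergeSketch`, `Cursor`, `merge` and `mergeParts` are hypothetical names, each spill is assumed to be an already sorted in-memory list of keys, and the multi-spill case uses a plain `java.util.PriorityQueue` for the k-way merge.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Toy k-way merge of sorted spill segments for one partition; not Hadoop's Merger. */
class SpillMergeSketch {

    /** Read position inside one sorted segment. */
    static final class Cursor {
        final List<String> segment;
        int pos;

        Cursor(List<String> segment) {
            this.segment = segment;
        }

        String head() {
            return segment.get(pos);
        }
    }

    /** Merges already sorted segments into one sorted list. */
    static List<String> merge(List<List<String>> segments) {
        // The queue always yields the cursor whose current key is smallest.
        PriorityQueue<Cursor> queue =
                new PriorityQueue<>((a, b) -> a.head().compareTo(b.head()));
        for (List<String> segment : segments) {
            if (!segment.isEmpty()) {
                queue.add(new Cursor(segment));
            }
        }
        List<String> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor c = queue.poll();
            out.add(c.head());
            c.pos++;
            if (c.pos < c.segment.size()) {
                queue.add(c); // re-insert with its next key
            }
        }
        return out;
    }

    /** Mirrors the three cases: no spill, a single spill, several spills. */
    static List<String> mergeParts(List<List<String>> spills) {
        if (spills.isEmpty()) {
            return new ArrayList<>();   // nothing spilled: produce empty output
        }
        if (spills.size() == 1) {
            return spills.get(0);       // single spill: reuse it as is
        }
        return merge(spills);           // several spills: k-way merge
    }

    public static void main(String[] args) {
        List<List<String>> spills = Arrays.asList(
                Arrays.asList("a", "d"), Arrays.asList("b", "c", "e"));
        System.out.println(mergeParts(spills)); // prints [a, b, c, d, e]
    }
}
```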
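
How partitions are assigned

If partitions <= 1, an anonymous Partitioner is used whose getPartition() returns partitions - 1, which is 0 when there is a single reduce task.
If partitions > 1, the partitioner class configured for the job is used; the default is HashPartitioner, i.e. org.apache.hadoop.mapreduce.lib.partition.HashPartitioner.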

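// The number of partitions equals the number of reduce tasks.
partitions = jobContext.getNumReduceTasks();
if (partitions > 1) {
    // More than one reducer: instantiate the job's partitioner class via reflection.
    partitioner = (org.apache.hadoop.mapreduce.Partitioner<K,V>)
        ReflectionUtils.newInstance(jobContext.getPartitionerClass(), job);
} else {
    // Zero or one reducer: every record goes to partition (partitions - 1).
    partitioner = new org.apache.hadoop.mapreduce.Partitioner<K,V>() {
        @Override
        public int getPartition(K key, V value, int numPartitions) {
            return partitions - 1;
        }
    };
}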
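
The same decision can be tried out without a Hadoop dependency. The sketch below is a stand-alone approximation with hypothetical names (`PartitionChoiceSketch`, `hashPartition`, `partitionFor`); the hash formula is, to my understanding, the one HashPartitioner applies, but treat it as an assumption and check it against the Hadoop version you use.

```java
/** Stand-alone approximation of the partition choice; does not use Hadoop classes. */
class PartitionChoiceSketch {

    /** Hash-based partition: masks the sign bit so the result is never negative. */
    static int hashPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Follows the branch above: hash when partitions > 1, else partitions - 1. */
    static int partitionFor(Object key, int partitions) {
        return partitions > 1 ? hashPartition(key, partitions) : partitions - 1;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("a", 1)); // prints 0 (single reducer)
        System.out.println(partitionFor("a", 4)); // prints 1 ("a".hashCode() is 97)
        // A key with a negative hash code still maps into the range [0, 3).
        System.out.println(partitionFor(Integer.valueOf(-7), 3));
    }
}
```

Masking with Integer.MAX_VALUE instead of calling Math.abs avoids the corner case where the hash code is Integer.MIN_VALUE.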
