Spark Shuffle Explained (1)

mahuacai

Categories: source code analysis, spark core source code analysis. Tags: spark, big data

Article link: https://blog.csdn.net/mahuacai/article/details/51916428

Version: 1.6.2

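Whether in Hadoop's map/reduce or in Spark's various operators, shuffle is a core step, and how efficiently it is designed largely determines how efficient the whole computation is. The difficulty is that shuffle involves heavy I/O on large data (both local temporary-file I/O and network I/O), plus potentially CPU-intensive sorting.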

In version 1.6.2, Spark has three shuffle mechanisms for large data: "sort-based shuffle", "hash-based shuffle", and "tungsten-sort shuffle".

Below is the official description:

 
  /**
   * In sort-based shuffle, incoming records are sorted according to their target partition ids, then
   * written to a single map output file. Reducers fetch contiguous regions of this file in order to
   * read their portion of the map output. In cases where the map output data is too large to fit in
   * memory, sorted subsets of the output can are spilled to disk and those on-disk files are merged
   * to produce the final output file.
   *
   * Sort-based shuffle has two different write paths for producing its map output files:
   *
   *  - Serialized sorting: used when all three of the following conditions hold:
   *    1. The shuffle dependency specifies no aggregation or output ordering.
   *    2. The shuffle serializer supports relocation of serialized values (this is currently
   *       supported by KryoSerializer and Spark SQL's custom serializers).
   *    3. The shuffle produces fewer than 16777216 output partitions.
   *  - Deserialized sorting: used to handle all other cases.
   *
   * -----------------------
   * Serialized sorting mode
   * -----------------------
   *
   * In the serialized sorting mode, incoming records are serialized as soon as they are passed to the
   * shuffle writer and are buffered in a serialized form during sorting. This write path implements
   * several optimizations:
   *
   *  - Its sort operates on serialized binary data rather than Java objects, which reduces memory
   *    consumption and GC overheads. This optimization requires the record serializer to have certain
   *    properties to allow serialized records to be re-ordered without requiring deserialization.
   *    See SPARK-4550, where this optimization was first proposed and implemented, for more details.
   *
   *  - It uses a specialized cache-efficient sorter ([[ShuffleExternalSorter]]) that sorts
   *    arrays of compressed record pointers and partition ids. By using only 8 bytes of space per
   *    record in the sorting array, this fits more of the array into cache.
   *
   *  - The spill merging procedure operates on blocks of serialized records that belong to the same
   *    partition and does not need to deserialize records during the merge.
   *
   *  - When the spill compression codec supports concatenation of compressed data, the spill merge
   *    simply concatenates the serialized and compressed spill partitions to produce the final output
   *    partition.  This allows efficient data copying methods, like NIO's `transferTo`, to be used
   *    and avoids the need to allocate decompression or copying buffers during the merge.
   *
   * For more details on these optimizations, see SPARK-7081.
   */
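The single-output-file idea in the comment above can be illustrated with a small, self-contained sketch. This is not Spark's implementation: `ToySortShuffle`, `MapOutput`, `partitionOf`, and the in-memory "file" are hypothetical names made up for illustration, and the hash partitioner is an assumption. Records are sorted by target partition id, laid out contiguously, and an offset index lets each reducer read only its own region.

```scala
// Toy model of a sort-based shuffle write. All names are hypothetical, not Spark APIs.
object ToySortShuffle {
  // Map-side result: one data "file" plus partition offsets.
  // offsets(p) until offsets(p + 1) is the region that belongs to partition p.
  final case class MapOutput(data: Vector[(String, Int)], offsets: Vector[Int])

  // A simple non-negative hash partitioner (an assumption for this sketch).
  def partitionOf(key: String, numPartitions: Int): Int = {
    val m = key.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }

  def write(records: Seq[(String, Int)], numPartitions: Int): MapOutput = {
    // Sort by target partition id only; sortBy is stable, so the
    // original order inside each partition is preserved.
    val sorted = records.sortBy { case (k, _) => partitionOf(k, numPartitions) }.toVector

    // Count records per partition, then turn counts into region boundaries.
    val counts = Array.fill(numPartitions)(0)
    sorted.foreach { case (k, _) => counts(partitionOf(k, numPartitions)) += 1 }
    val offsets = counts.scanLeft(0)(_ + _).toVector

    MapOutput(sorted, offsets)
  }

  // A reducer fetches the contiguous region for its partition.
  def read(out: MapOutput, partition: Int): Vector[(String, Int)] =
    out.data.slice(out.offsets(partition), out.offsets(partition + 1))
}
```

For example, `ToySortShuffle.read(ToySortShuffle.write(records, 3), 1)` returns only the records whose key hashes to partition 1, without scanning the other regions.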

This article reads through the shuffle-related code logic, covering both the principles of shuffle and its code-level implementation.

Job, Stage, Task, Dependency

In Spark, the RDD is the unit that operations act on. Operations fall into transformations and actions, and only an action triggers a Spark computation.
Compare rdd.map with rdd.count:

 
  /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

  /**
   * Return the number of elements in the RDD.
   */
  def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

map is a transformation: it only creates a MapPartitionsRDD object on top of the current RDD. count is an action: it calls sc.runJob to submit a Job to Spark.
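The difference between a lazy transformation and an action can be seen in a toy model. The sketch below does not use Spark: `ToyLazy`, `ToyRdd`, `fromPartitions`, and `jobsRun` are hypothetical names, and the "RDD" is nothing more than a function that produces an iterator per partition. Calling map only wraps that function, while count actually walks the iterators.

```scala
// Toy model of lazy transformations. ToyRdd is hypothetical, not Spark's RDD.
object ToyLazy {
  var jobsRun = 0 // how many "jobs" have been submitted so far

  // An "RDD" is only a recipe: a function yielding each partition's iterator.
  final class ToyRdd[T](val numPartitions: Int, compute: Int => Iterator[T]) {

    // Transformation: wraps the parent's recipe and returns immediately.
    def map[U](f: T => U): ToyRdd[U] =
      new ToyRdd[U](numPartitions, pid => compute(pid).map(f))

    // Action: evaluates every partition and sums the per-partition sizes.
    def count(): Long = {
      jobsRun += 1
      (0 until numPartitions).map { pid =>
        val it = compute(pid)
        var n = 0L
        while (it.hasNext) { n += 1L; it.next() }
        n
      }.sum
    }
  }

  def fromPartitions[T](parts: Vector[Vector[T]]): ToyRdd[T] =
    new ToyRdd[T](parts.length, pid => parts(pid).iterator)
}
```

In this model, a side effect placed inside the function passed to map is not observed until count is called, which mirrors why only actions trigger a job.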

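A Job is the set of RDD transformations plus the final action. It is the largest and most abstract unit of computation in Spark, to the point that the job does not even appear as a unit on Spark's task page. Still, from the user's point of view the job is the unit of what we want computed: every action on an RDD triggers a job, which runs the computation and returns the data we asked for.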

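A Job consists of transformations and an action over a group of RDDs. The transformations among these RDDs form a directed acyclic graph (DAG), in which each RDD is generated from one or more preceding RDDs.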

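The dependency between two RDDs is central to Spark. At the RDD level it looks like a point-to-point dependency, but RDDs are partitioned, so a transformation between two RDDs is refined into transformations between their partitions.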

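As shown in the figure above, from the job's point of view RDD_B is derived from RDD_A, RDD_D from RDD_C, and RDD_E from RDD_B and RDD_D; finally an action on RDD_E outputs the result. At the level of partitions, however, RDD_B's dependency on RDD_A differs from RDD_D's dependency on RDD_C. In technical terms, the difference is between a narrow dependency and a wide dependency.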

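A narrow dependency means that each partition of the child RDD depends only on a limited, fixed set of corresponding partitions of the parent RDD. A wide dependency means that each partition of the child RDD depends on all partitions of the parent RDD.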
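The two definitions can be made concrete by asking, for a given child partition, which parent partitions it reads. The sketch below is a toy model with hypothetical names (`ToyDeps`, `OneToOneDep`, `RangeDep`, `ShuffleDep`); it only mirrors the definitions above and is not Spark's dependency classes.

```scala
// Toy dependency model. Names are hypothetical and only mirror the definitions above.
object ToyDeps {
  sealed trait Dep {
    // Parent partitions that the given child partition reads from.
    def parents(childPartition: Int): Seq[Int]
  }

  // Narrow, one-to-one: child partition i reads only parent partition i.
  case object OneToOneDep extends Dep {
    def parents(childPartition: Int): Seq[Int] = Seq(childPartition)
  }

  // Narrow, range-based (as in a union): child partitions in the window
  // [outStart, outStart + length) each map to exactly one parent partition.
  final case class RangeDep(inStart: Int, outStart: Int, length: Int) extends Dep {
    def parents(childPartition: Int): Seq[Int] =
      if (childPartition >= outStart && childPartition < outStart + length)
        Seq(childPartition - outStart + inStart)
      else Nil
  }

  // Wide (shuffle): every child partition may read from every parent partition.
  final case class ShuffleDep(numParentPartitions: Int) extends Dep {
    def parents(childPartition: Int): Seq[Int] = 0 until numParentPartitions
  }
}
```

With four parent partitions, `ToyDeps.OneToOneDep.parents(1)` yields a single partition, while `ToyDeps.ShuffleDep(4).parents(1)` yields all four.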

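Wide dependencies are easy to grasp, but the definition of a narrow dependency is awkward, especially the two requirements "limited" and "fixed". Doesn't a wide dependency also satisfy both? Is the only difference the size of that "limited" number? In practice, yes: "limited" means the number of partitions depended on is far smaller than the full partition count. Moreover, most RDDs that Spark implements with narrow dependencies are one-to-one, so there is no need to agonize over the word "limited".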

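One more question: count is computed from all partitions of the parent RDD, so is it a wide dependency? The answer is no. A dependency describes the relationship between a parent RDD and a child RDD; count only produces output and has no child RDD, so forcing the notion of dependency onto it only causes confusion. As the implementation above shows, count merely sums the Array[U] returned by sc.runJob.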

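The narrow/wide distinction is an important feature of Spark: the two kinds of dependency are handled differently in implementation, task scheduling, and fault recovery.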

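Implementation: with narrow dependencies, transformations between RDDs can be pipelined directly; wide dependencies must be implemented with a shuffle.
Task scheduling: a narrow dependency means that a partition of the child RDD can be computed on a single node directly from a few partitions of the parent RDD (usually one). A wide dependency means that a partition of the child RDD cannot be computed until all of the parent RDD's data has been computed, and the parent's results must go through a shuffle before the next RDD can operate on them.
Fault recovery: recovering from a failure is much faster under narrow dependencies, because only the lost partition has to be recomputed. Under a wide dependency, all partitions of all ancestor RDDs may have to be recomputed. This is why, for a long lineage chain, especially one containing wide dependencies, we recommend setting a checkpoint at a suitable point to avoid an overly long recovery.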
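The recovery argument can be turned into a rough count. The sketch below walks a lineage from the final RDD back to the source and reports how many partitions would need recomputing at each level after one partition of the final RDD is lost. It is a simplified model under loud assumptions: nothing is cached or checkpointed, no shuffle output survives, and narrow dependencies are one-to-one. `ToyRecovery`, `Narrow`, and `Wide` are hypothetical names.

```scala
// Toy recovery-cost model. Names and assumptions are illustrative only.
object ToyRecovery {
  // How a child RDD reads its parent at one lineage step.
  sealed trait Step
  case object Narrow extends Step                          // one-to-one
  final case class Wide(numParentPartitions: Int) extends Step

  // `lineage` is ordered from the final RDD back toward the source.
  // Returns the number of partitions to recompute at each level,
  // starting with the single lost partition of the final RDD.
  def partitionsToRecompute(lineage: List[Step]): List[Int] =
    lineage.scanLeft(1) {
      case (needed, Narrow) => needed // same number of parent partitions
      case (_, Wide(n))     => n      // every parent partition is needed
    }
}
```

A purely narrow chain stays at one partition per level, while a single wide step fans out to all parent partitions from that point back, which is the situation a checkpoint is meant to cut short.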

Here, RDD.checkpoint can be used to set a checkpoint:

 
  /**
   * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
   * directory set with `SparkContext#setCheckpointDir` and all references to its parent
   * RDDs will be removed. This function must be called before any job has been
   * executed on this RDD. It is strongly recommended that this RDD is persisted in
   * memory, otherwise saving it on a file will require recomputation.
   */
  def checkpoint(): Unit = RDDCheckpointData.synchronized {
    // NOTE: we use a global lock here due to complexities downstream with ensuring
    // children RDD partitions point to the correct parent partitions. In the future
    // we should revisit this consideration.
    if (context.checkpointDir.isEmpty) {
      throw new SparkException("Checkpoint directory has not been set in the SparkContext")
    } else if (checkpointData.isEmpty) {
      checkpointData = Some(new ReliableRDDCheckpointData(this))
    }
  }
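The control flow of checkpoint() has two consequences for callers: the checkpoint directory must be set first, and marking an RDD twice has no further effect. The stand-alone sketch below reproduces only that control flow. `ToyCheckpoint`, `ToyContext`, and `ToyRdd` are hypothetical stand-ins, the exception type and the stored string are assumptions, and nothing is actually written to disk.

```scala
// Toy reproduction of the checkpoint() control flow. Not Spark code.
object ToyCheckpoint {
  final class ToyContext {
    var checkpointDir: Option[String] = None
  }

  final class ToyRdd(context: ToyContext) {
    var checkpointData: Option[String] = None

    // Same shape as the method above: fail without a directory,
    // otherwise mark the RDD at most once.
    def checkpoint(): Unit = synchronized {
      if (context.checkpointDir.isEmpty) {
        throw new IllegalStateException("Checkpoint directory has not been set")
      } else if (checkpointData.isEmpty) {
        checkpointData = Some(context.checkpointDir.get + "/rdd")
      }
    }
  }
}
```

In this model, calling checkpoint() before setting `checkpointDir` throws, and a second call after a successful one leaves `checkpointData` unchanged.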

With the relationships between RDDs at the Job level and between partitions at the RDD level sorted out, we now turn to the concept of a Stage.

Stage division partitions the series of RDD transformations and the action within a Job.

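First, a job is created by an action, so every job must have a ResultStage; otherwise the job would not start.
Second, if there is a wide dependency between RDDs inside the Job, Spark creates an intermediate Stage for it: a ShuffleStage, or strictly speaking a ShuffleMapStage. This stage is created for the parent RDD and is roughly equivalent to running parentRdd.map().collect() on it. The ShuffleMapStage produces the map output. Once the child RDD detects that the stage it depends on through the wide dependency has finished computing, it can start a shuffle fetch to pull the parent RDD's output and carry on with the subsequent computation.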

A Job therefore consists of one ResultStage and zero or more ShuffleMapStages.
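The rule "cut a new stage at every wide dependency" can be sketched over a small RDD graph. The model below is hypothetical (`ToyStages`, `Node`, `stages` are invented names) and assumes a tree-shaped graph in which no shuffle is shared between children, which is simpler than what a real scheduler has to handle. RDDs connected by narrow edges land in the same stage; each parent reached through a wide edge starts a new one.

```scala
// Toy stage-splitting model. Names are hypothetical; assumes a tree-shaped graph.
object ToyStages {
  // A node in the RDD graph. Each parent edge is flagged wide (true) or narrow (false).
  final case class Node(name: String, parents: List[(Node, Boolean)] = Nil)

  // RDD names in the stage ending at `last`, plus the parents reached
  // through wide edges (each of those is the end of another stage).
  private def stageOf(last: Node): (List[String], List[Node]) = {
    val (narrow, wide) = last.parents.partition { case (_, isWide) => !isWide }
    val sub = narrow.map { case (p, _) => stageOf(p) }
    (last.name :: sub.flatMap(_._1), wide.map(_._1) ::: sub.flatMap(_._2))
  }

  // Stages as lists of RDD names. The first entry plays the role of the
  // ResultStage; the remaining entries play the role of ShuffleMapStages.
  def stages(finalRdd: Node): List[List[String]] = {
    val (members, shuffleParents) = stageOf(finalRdd)
    members :: shuffleParents.flatMap(stages)
  }
}
```

Taking the A to E example and assuming, purely for illustration, that only the edge from RDD_C to RDD_D is wide, this yields one result stage containing E, B, A, and D, and one shuffle map stage containing C.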

Execution of a Job Without Shuffle

Dissecting how a shuffle-free job runs shows how Spark handles an action. We use a simple example:

sc.textFile("mahuacai").count

//def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

This example simply counts the lines in the file. The single line of code above corresponds to the following three steps:

sc.textFile("mahuacai") returns an RDD;
the count action on this RDD triggers a Job submission: sc.runJob(this, Utils.getIteratorSize _);
the Array returned by runJob is summed.
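The three steps can be mimicked without Spark. In the sketch below, `ToyCount`, its `runJob`, and `iteratorSize` are hypothetical stand-ins: partitions are plain in-memory sequences, `runJob` applies a function to each partition's iterator and returns one result per partition, and `count` sums those results.

```scala
// Toy version of the count pipeline. Not Spark code; names are hypothetical.
object ToyCount {
  // Walks the iterator and counts its elements, one result per partition.
  def iteratorSize[T](it: Iterator[T]): Long = {
    var n = 0L
    while (it.hasNext) { n += 1L; it.next() }
    n
  }

  // Stand-in for running a job: apply func to every partition,
  // returning an array with one entry per partition.
  def runJob[T, U: scala.reflect.ClassTag](
      partitions: Seq[Seq[T]],
      func: Iterator[T] => U): Array[U] =
    partitions.map(p => func(p.iterator)).toArray

  // count = per-partition sizes, then a sum over the returned array.
  def count[T](partitions: Seq[Seq[T]]): Long =
    runJob[T, Long](partitions, (it: Iterator[T]) => iteratorSize(it)).sum
}
```

For a "file" split into partitions of 2, 0, and 1 lines, `runJob` returns the three per-partition sizes and `count` returns 3.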

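The core is the second step. Below we walk through it with code fragments. The process is linear, so each step is labeled with a step number together with the class involved: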

//step1:SparkContext

 
  
 
  /**
   * Run a function on a given set of partitions in an RDD and return the results as an array.
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int]): Array[U] = {
    val results = new Array[U](partitions.size)
    runJob[T, U](rdd, func, partitions, (index, res) => results(index) = res)
    results
  }

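sc.runJob(this, Utils.getIteratorSize _) passes through a series of runJob overloads before reaching the runJob shown in step1. Compared with the original call, little has happened by this point: the set of partitions has been filled in, Utils.getIteratorSize _ has been converted to func, and so on. Simple steps like these will not be described from here on.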

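An important job of step1 is to allocate an Array and build the function object "(index, res) => results(index) = res", which it passes on to the next runJob. It then waits for that runJob to finish and returns results. In effect, this registers a callback with runJob that stores the results of the run into the Array. In the callback, index is the map index and res is …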
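The callback pattern described for step1 can be sketched in isolation. Below, the lower-level `runJob` returns nothing and reports each finished partition through a handler, while `runJobToArray` allocates the array, passes the handler, and returns the array, following the shape of the step1 code. `ToyRunJob` and the `completionOrder` parameter are hypothetical; the latter exists only to imitate tasks finishing out of order.

```scala
// Toy callback-based runJob. Not Spark code; names are hypothetical.
object ToyRunJob {
  // Lower-level variant: no return value. Each finished partition is
  // reported to resultHandler together with its partition index.
  def runJob[T, U](
      partitions: Seq[Seq[T]],
      func: Iterator[T] => U,
      completionOrder: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit =
    completionOrder.foreach(i => resultHandler(i, func(partitions(i).iterator)))

  // Array-returning wrapper shaped like step1: allocate the array,
  // pass a callback that fills it by index, then return it.
  def runJobToArray[T, U: scala.reflect.ClassTag](
      partitions: Seq[Seq[T]],
      func: Iterator[T] => U,
      completionOrder: Seq[Int]): Array[U] = {
    val results = new Array[U](partitions.size)
    runJob[T, U](partitions, func, completionOrder, (index, res) => results(index) = res)
    results
  }
}
```

Because the handler writes by index, the returned array is in partition order even when the partitions complete in a different order.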
