Spark Shuffle Process Analysis

Original post, 2015-11-19 15:21:22
Shuffle is an important phase of job execution and has a large impact on job performance; it is a core step in both Hadoop and Spark, and the two follow roughly the same principle. In Spark, the shuffle write happens in ShuffleMapTask: when a task processes a partition, its output becomes the input of the next stage. That output does not have to hit disk, but when the data is too large to fit in memory it must be spilled, so the shuffle write consists of a spill phase and a merge phase. In Spark 1.3.1 the final result is a single shuffle data file plus an index file that records the offset of each partition. If a map-side aggregator and key ordering are specified, records are sorted by partition id and key; otherwise they are sorted by partition id only. In the reduce stage, the index file together with the shuffle data file is enough to locate the data each reducer needs.
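To make the setting concrete, below is a minimal sketch of a job that produces such a shuffle (the object name, master URL and sample data are made up for illustration): the reduceByKey introduces a ShuffleDependency, so the tasks of the first stage are ShuffleMapTasks and map-side combine is enabled.
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("shuffle-demo"))
    // two map partitions: each ShuffleMapTask writes one data file plus one index file
    val counts = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 2)
      .map(w => (w, 1))
      .reduceByKey(_ + _)   // map-side combine: an Aggregator is attached to the dependency
    println(counts.collect().toSeq)
    sc.stop()
  }
}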
The previous post showed that a task starts execution in the runTask method, so let's follow the whole shuffle process from there.
  override def runTask(context: TaskContext): MapStatus = {
    // Deserialize the task binary to obtain the RDD and its shuffle dependency
    val ser = SparkEnv.get.closureSerializer.newInstance()
    val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
      ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)


    metrics = Some(context.taskMetrics)
    var writer: ShuffleWriter[Any, Any] = null
    try {
      val manager = SparkEnv.get.shuffleManager
      // Obtain a SortShuffleWriter instance from the shuffle manager
      writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
      // Intermediate data is written out here: buffered in memory first, spilled to disk when it does not fit
      writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
      return writer.stop(success = true).get
    } catch {
			.....
    }
  }
The write path uses an ExternalSorter; depending on whether map-side combine is enabled, a different kind of sorter is created. If an aggregator and keyOrdering are specified, the output within each partition is produced in sorted order.
  override def write(records: Iterator[_ <: Product2[K, V]]): Unit = {
    if (dep.mapSideCombine) {
      require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
      sorter = new ExternalSorter[K, V, C](
        dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
      sorter.insertAll(records)
    } else {
      // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
      // care whether the keys get sorted in each partition; that will be done on the reduce side
      // if the operation being run is sortByKey.
      sorter = new ExternalSorter[K, V, V](
        None, Some(dep.partitioner), None, dep.serializer)
      sorter.insertAll(records)
    }


    // Don't bother including the time to open the merged output file in the shuffle write time,
    // because it just opens a single file, so is typically too fast to measure accurately
    // (see SPARK-3570).
    val outputFile = shuffleBlockManager.getDataFile(dep.shuffleId, mapId)
    val blockId = shuffleBlockManager.consolidateId(dep.shuffleId, mapId) // block id for the merged shuffle output
    val partitionLengths = sorter.writePartitionedFile(blockId, context, outputFile)
    shuffleBlockManager.writeIndexFile(dep.shuffleId, mapId, partitionLengths) // write the index file


    mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
  }
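The index file written above is essentially a list of cumulative byte offsets, one per reduce partition. The standalone sketch below (not Spark's IndexShuffleBlockManager code; the partition lengths are made up) shows how such offsets are derived from partitionLengths and how a reducer's byte range is looked up.
object IndexFileSketch {
  def main(args: Array[String]): Unit = {
    // hypothetical lengths returned by sorter.writePartitionedFile, one per reduce partition
    val partitionLengths = Array(120L, 0L, 310L, 45L)

    // the index amounts to numPartitions + 1 running sums: 0, 120, 120, 430, 475
    val offsets = partitionLengths.scanLeft(0L)(_ + _)

    // a reducer asking for partition r reads bytes [offsets(r), offsets(r + 1)) of the data file
    def byteRange(reduceId: Int): (Long, Long) = (offsets(reduceId), offsets(reduceId + 1))

    println(byteRange(2)) // (120,430): partition 1 is empty, so partition 2 starts right after partition 0
  }
}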
Next, insertAll iterates over the records of one partition and feeds them to the sorter.
  def insertAll(records: Iterator[_ <: Product2[K, V]]): Unit = {
    // TODO: stop combining if we find that the reduction factor isn't high
    val shouldCombine = aggregator.isDefined
    // If map-side combine is enabled, values are merged as they stream into the map
    if (shouldCombine) {
      // Combine values in-memory first using our AppendOnlyMap
      val mergeValue = aggregator.get.mergeValue
      val createCombiner = aggregator.get.createCombiner
      var kv: Product2[K, V] = null
      // update function: merge with the existing value, or create a new combiner for a first-seen key
      val update = (hadValue: Boolean, oldValue: C) => {
        if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
      }
      // iterate over the partition's records
      while (records.hasNext) {
        addElementsRead()
        kv = records.next()
        map.changeValue((getPartition(kv._1), kv._1), update)
        // may spill to disk; when exactly a spill is triggered is analyzed below
        maybeSpillCollection(usingMap = true)
      }
    } else if (bypassMergeSort) {
      // SPARK-4479: Also bypass buffering if merge sort is bypassed to avoid defensive copies
      if (records.hasNext) {
        spillToPartitionFiles(records.map { kv =>
          ((getPartition(kv._1), kv._1), kv._2.asInstanceOf[C])
        })
      }
    } else {
      // Stick values into our buffer
      while (records.hasNext) {
        addElementsRead()
        val kv = records.next()
        buffer.insert((getPartition(kv._1), kv._1), kv._2.asInstanceOf[C])
        maybeSpillCollection(usingMap = false)
      }
    }
  }
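To see what the update closure does, here is a simplified, self-contained illustration of the combine path. A plain mutable.HashMap stands in for Spark's SizeTrackingAppendOnlyMap, and the aggregator mimics what reduceByKey(_ + _) would supply; the sample records and partitioner are made up.
import scala.collection.mutable

object CombineSketch {
  def main(args: Array[String]): Unit = {
    val createCombiner: Int => Int = v => v
    val mergeValue: (Int, Int) => Int = _ + _
    val numPartitions = 2
    def getPartition(key: String): Int = math.abs(key.hashCode) % numPartitions

    val map = mutable.HashMap.empty[(Int, String), Int]
    val records = Iterator(("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1))

    for ((k, v) <- records) {
      val mapKey = (getPartition(k), k)          // keys in the sorter are (partitionId, key) pairs
      map(mapKey) = map.get(mapKey) match {
        case Some(old) => mergeValue(old, v)     // hadValue == true in the update closure above
        case None      => createCombiner(v)      // hadValue == false
      }
    }
    println(map) // "a" has been combined to 3; "b" and "c" keep their single value 1
  }
}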
This method contains the spill trigger. The key condition is insufficient memory: when the current collection outgrows the memory the task has been granted, the sorter first tries to acquire more memory from the shuffle memory pool, and only spills if the grant is still not enough. So where does the data live while memory is sufficient? See the analysis further below.
  protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
    if (elementsRead > trackMemoryThreshold && elementsRead % 32 == 0 &&
        currentMemory >= myMemoryThreshold) {
      // Claim up to double our current memory from the shuffle memory pool
      val amountToRequest = 2 * currentMemory - myMemoryThreshold
      val granted = shuffleMemoryManager.tryToAcquire(amountToRequest)
      myMemoryThreshold += granted
      if (myMemoryThreshold <= currentMemory) {
        // We were granted too little memory to grow further (either tryToAcquire returned 0,
        // or we already had more memory than myMemoryThreshold); spill the current collection
        _spillCount += 1
        logSpillage(currentMemory)


        spill(collection)


        _elementsRead = 0
        // Keep track of spills, and release memory
        _memoryBytesSpilled += currentMemory
        releaseMemoryForThisThread()
        return true
      }
    }
    false
  }
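Worked through with made-up numbers, the policy above behaves like the toy model below (this is not Spark's ShuffleMemoryManager; the pool size and thresholds are invented just to show when the spill fires, and the elementsRead gating is omitted).
object SpillPolicySketch {
  def main(args: Array[String]): Unit = {
    var poolRemaining = 6L * 1024 * 1024       // hypothetical shuffle pool with 6 MB left for this task
    var myMemoryThreshold = 5L * 1024 * 1024   // memory this task has been granted so far

    // same growth rule as maybeSpill above: ask for up to double the current collection size
    def maybeSpill(currentMemory: Long): Boolean = {
      if (currentMemory >= myMemoryThreshold) {
        val amountToRequest = 2 * currentMemory - myMemoryThreshold
        val granted = math.min(amountToRequest, poolRemaining)  // the pool may grant less than asked
        poolRemaining -= granted
        myMemoryThreshold += granted
        myMemoryThreshold <= currentMemory                      // threshold still too small => spill
      } else {
        false
      }
    }

    println(maybeSpill(5L * 1024 * 1024))   // false: pool grants 5 MB, threshold grows to 10 MB
    println(maybeSpill(12L * 1024 * 1024))  // true: only 1 MB left in the pool, so we must spill
  }
}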
If memory is sufficient, the data is kept in a SizeTrackingAppendOnlyMap, an append-only key-value map implemented by Spark itself rather than a java.util collection; internally the entries are stored in a flat Array. Readers who are interested can dig into that code.
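As a rough illustration of that idea (this is not Spark's AppendOnlyMap, and it omits growth, size estimation and rehashing), an append-only map over a flat array with open addressing might look like this:
// usage: val m = new TinyAppendOnlyMap[String, Int]()
//        m.changeValue("word", (had, old) => if (had) old + 1 else 1)
class TinyAppendOnlyMap[K, V](capacity: Int = 64) {
  // keys and values interleaved in one flat array: data(2*pos) = key, data(2*pos + 1) = value
  private val data = new Array[AnyRef](2 * capacity)

  // insert-or-update, mirroring the changeValue(...) call used in insertAll; entries are never removed
  // (assumes the map never fills up, which the real class handles by growing)
  def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
    var pos = math.abs(key.hashCode) % capacity
    while (true) {
      val curKey = data(2 * pos)
      if (curKey == null) {
        // empty slot: create a fresh value
        val newValue = updateFunc(false, null.asInstanceOf[V])
        data(2 * pos) = key.asInstanceOf[AnyRef]
        data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
        return newValue
      } else if (curKey == key) {
        // existing key: merge with the old value
        val newValue = updateFunc(true, data(2 * pos + 1).asInstanceOf[V])
        data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
        return newValue
      } else {
        pos = (pos + 1) % capacity   // linear probing
      }
    }
    throw new IllegalStateException("unreachable")
  }
}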
Now let's go back to how a spill writes its file. We have seen when a spill is triggered; once the condition is met, how and where is the file written? See the analysis below.
  /**
   * Spill the current in-memory collection to disk, adding a new file to spills, and clear it.
   */
  override protected[this] def spill(collection: SizeTrackingPairCollection[(Int, K), C]): Unit = {
    if (bypassMergeSort) {
      spillToPartitionFiles(collection)
    } else {
      spillToMergeableFile(collection)
    }
  }
The following method actually writes the data out, in sorted order, in the form key1, value1, key2, value2, .... The file write itself is straightforward: a BlockObjectWriter writes the objects to the file one by one.
  /**
   * Spill our in-memory collection to a sorted file that we can merge later (normal code path).
   * We add this file into spilledFiles to find it later.
   *
   * Alternatively, if bypassMergeSort is true, we spill to separate files for each partition.
   * See spillToPartitionedFiles() for that code path.
   *
   * @param collection whichever collection we're using (map or buffer)
   */
  private def spillToMergeableFile(collection: SizeTrackingPairCollection[(Int, K), C]): Unit = {
    assert(!bypassMergeSort)


    // Because these files may be read during shuffle, their compression must be controlled by
    // spark.shuffle.compress instead of spark.shuffle.spill.compress, so we need to use
    // createTempShuffleBlock here; see SPARK-3426 for more context.
    val (blockId, file) = diskBlockManager.createTempShuffleBlock()
    curWriteMetrics = new ShuffleWriteMetrics()
    var writer = blockManager.getDiskWriter(blockId, file, ser, fileBufferSize, curWriteMetrics)
    var objectsWritten = 0   // Objects written since the last flush


    // List of batch sizes (bytes) in the order they are written to disk
    val batchSizes = new ArrayBuffer[Long]


    // How many elements we have in each partition
    val elementsPerPartition = new Array[Long](numPartitions)


    // Flush the disk writer's contents to disk, and update relevant variables.
    // The writer is closed at the end of this process, and cannot be reused.
    def flush() = {
      val w = writer
      writer = null
      w.commitAndClose()
      _diskBytesSpilled += curWriteMetrics.shuffleBytesWritten
      batchSizes.append(curWriteMetrics.shuffleBytesWritten)
      objectsWritten = 0
    }


    var success = false
    try {
      val it = collection.destructiveSortedIterator(partitionKeyComparator)
      while (it.hasNext) {
        val elem = it.next()
        val partitionId = elem._1._1
        val key = elem._1._2
        val value = elem._2
        writer.write(key)
        writer.write(value)
        elementsPerPartition(partitionId) += 1
        objectsWritten += 1


        if (objectsWritten == serializerBatchSize) {
          flush()
          curWriteMetrics = new ShuffleWriteMetrics()
          writer = blockManager.getDiskWriter(blockId, file, ser, fileBufferSize, curWriteMetrics)
        }
      }
      if (objectsWritten > 0) {
        flush()
      } else if (writer != null) {
        val w = writer
        writer = null
        w.revertPartialWritesAndClose()
      }
      success = true
    } finally {
      if (!success) {
        // This code path only happens if an exception was thrown above before we set success;
        // close our stuff and let the exception be thrown further
        if (writer != null) {
          writer.revertPartialWritesAndClose()
        }
        if (file.exists()) {
          file.delete()
        }
      }
    }
    // record the spilled file so it can be merged later
    spills.append(SpilledFile(file, blockId, batchSizes.toArray, elementsPerPartition))
  }
Finally, the spill files are merged into a single shuffle file. Spill files have names like temp_shuffle_ followed by a random id; after the merge, the temporary files are deleted. That is the whole shuffle write path; there are many more details along the way for readers who want to dig deeper.
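The merge itself boils down to a k-way merge of streams that are each already sorted by (partitionId, key). The sketch below illustrates only that idea; Spark's ExternalSorter.merge additionally re-aggregates values and reads each spill file back in serialized batches, and the in-memory "spills" here are made up.
import scala.collection.mutable

object MergeSketch {
  type Rec = ((Int, String), Int) // ((partitionId, key), value)

  def mergeSorted(spills: Seq[Iterator[Rec]]): Iterator[Rec] = {
    // order heap entries by (partitionId, key) of the head record of each spill
    val ord = Ordering.by[(Iterator[Rec], Rec), (Int, String)](_._2._1)
    // PriorityQueue is a max-heap, so reverse the ordering to pop the smallest head first
    val heap = mutable.PriorityQueue.empty[(Iterator[Rec], Rec)](ord.reverse)
    spills.foreach { it => if (it.hasNext) heap.enqueue((it, it.next())) }

    new Iterator[Rec] {
      def hasNext: Boolean = heap.nonEmpty
      def next(): Rec = {
        val (it, rec) = heap.dequeue()
        if (it.hasNext) heap.enqueue((it, it.next())) // refill from the spill we just consumed
        rec
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val spill1 = Iterator(((0, "a"), 1), ((1, "b"), 2))
    val spill2 = Iterator(((0, "c"), 3), ((1, "a"), 4))
    mergeSorted(Seq(spill1, spill2)).foreach(println)
    // ((0,a),1) ((0,c),3) ((1,a),4) ((1,b),2)
  }
}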
