[Spark12] Shuffle in Spark

Shuffle operations

        Certain operations in Spark trigger an event known as the shuffle. The shuffle is Spark's mechanism for redistributing data so that it can be grouped differently across partitions. It typically involves copying data across executors and machines, which makes the shuffle a complex and costly operation.


## Every shuffle introduces a stage boundary: a job with one shuffle runs as two stages, and a job with two shuffles runs as three.
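        A minimal sketch of the stage count (assuming a local SparkContext; the data and names are illustrative): each of the two shuffles below adds one stage boundary, so the single action runs as one job with three stages in the Spark UI.

import org.apache.spark.{SparkConf, SparkContext}

object StageCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stage-count-sketch").setMaster("local[4]"))

    val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 4)
    val counts = words.map(w => (w, 1))
                      .reduceByKey(_ + _)                 // shuffle #1 -> first stage boundary
    val byCount = counts.map { case (w, c) => (c, w) }
                        .groupByKey()                     // shuffle #2 -> second stage boundary
    byCount.collect()                                     // one action, one job, three stages
    sc.stop()
  }
}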

Background

        To understand what happens during the shuffle, consider the reduceByKey operation as an example. reduceByKey generates a new RDD in which all the values for a single key are combined into a tuple: the key and the result of running the reduce function against all values associated with that key. The challenge is that not all values for a single key necessarily sit in the same partition, or even on the same machine, yet they must all be brought together to compute the result.
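        For instance, a minimal reduceByKey sketch (illustrative data, local SparkContext): the 1s for each key start out scattered across four partitions, and reduceByKey brings them together and folds them with the reduce function.

import org.apache.spark.{SparkConf, SparkContext}

object ReduceByKeySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reduceByKey-sketch").setMaster("local[4]"))

    // ("a", 1) pairs are spread over 4 partitions; reduceByKey must pull the values
    // for each key together across partitions before it can apply (_ + _).
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)), 4)
    val sums  = pairs.reduceByKey(_ + _)
    sums.collect().foreach(println)   // e.g. (a,3), (b,1), (c,1), in no guaranteed order
    sc.stop()
  }
}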

        In Spark, data is generally not redistributed across partitions to be in the right place for a specific operation; during computation, a single task operates on a single partition. To organize all the data for a single reduceByKey reduce task to execute, Spark must perform an all-to-all operation: it reads from every partition to find all the values for all keys, and then brings the values for each key together across partitions to compute the final result per key. This is the shuffle. Although the set of elements in each partition of the newly shuffled data is deterministic, and so is the ordering of the partitions themselves, the ordering of the elements within a partition is not. If you need predictably ordered data after a shuffle, you can use mapPartitions to sort each partition (for example with .sorted), repartitionAndSortWithinPartitions to sort partitions efficiently while repartitioning, or sortBy to produce a globally ordered RDD.
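        A short sketch of those three options (illustrative data, local SparkContext):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object OrderedAfterShuffleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ordered-shuffle-sketch").setMaster("local[4]"))
    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3), ("a", 4)), 4)

    // 1) Sort each partition's elements after the shuffle.
    val sortedWithin = pairs.reduceByKey(_ + _)
                            .mapPartitions(iter => iter.toSeq.sortBy(_._1).iterator)

    // 2) Repartition and sort by key within each partition in a single shuffle.
    val repartSorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))

    // 3) Produce a globally ordered RDD.
    val globallySorted = pairs.sortBy(_._1)

    println(sortedWithin.collect().mkString(", "))
    println(repartSorted.collect().mkString(", "))
    println(globallySorted.collect().mkString(", "))
    sc.stop()
  }
}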

        Operations that can cause a shuffle include repartition operations such as repartition() and coalesce(), 'ByKey operations (except for counting) such as groupByKey() and reduceByKey(), and join operations such as cogroup() and join().
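        A quick way to check whether an operator introduced a shuffle (a sketch; data and names are illustrative) is to print the lineage with toDebugString: a shuffle boundary shows up as a ShuffledRDD or CoGroupedRDD step.

import org.apache.spark.{SparkConf, SparkContext}

object ShuffleLineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-lineage-sketch").setMaster("local[4]"))
    val left  = sc.parallelize(Seq((1, "a"), (2, "b"), (1, "c")), 4)
    val right = sc.parallelize(Seq((1, "x"), (2, "y")), 4)

    val grouped = left.groupByKey()    // 'ByKey operation -> shuffle
    val joined  = left.join(right)     // join operation   -> shuffle (cogroup underneath)
    val fewer   = left.coalesce(2)     // narrow dependency by default, no shuffle

    println(grouped.toDebugString)
    println(joined.toDebugString)
    println(fewer.toDebugString)
    sc.stop()
  }
}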


Performance Impact

        The shuffle is an expensive operation, since it involves network I/O, disk I/O, and data serialization. To organize data for the shuffle, Spark generates a set of map tasks to organize the data and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not relate directly to Spark's map and reduce operations.

        Internally, the results of individual map tasks are kept in memory until they no longer fit. They are then sorted by target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.

        Certain shuffle operations can consume significant amounts of heap memory, because they use in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and 'ByKey operations generate them on the reduce side. When the data does not fit in memory, Spark spills these tables to disk, which incurs the additional overhead of disk I/O and increased garbage collection.
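        For example, aggregateByKey keeps a partial aggregate per key in the map-side structure, so only the partial results cross the shuffle rather than every individual value. A sketch with illustrative data, computing a per-key average by shuffling (sum, count) pairs:

import org.apache.spark.{SparkConf, SparkContext}

object AggregateByKeySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("aggregateByKey-sketch").setMaster("local[4]"))
    val scores = sc.parallelize(Seq(("a", 3), ("a", 7), ("b", 5), ("a", 1), ("b", 9)), 4)

    // Map side: fold each value into a per-key (sum, count) accumulator held in the
    // in-memory structure described above. Reduce side: merge the partial accumulators.
    val sumCount = scores.aggregateByKey((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),      // seqOp, applied before the shuffle
      (a, b)   => (a._1 + b._1, a._2 + b._2)     // combOp, applied after the shuffle
    )
    val averages = sumCount.mapValues { case (sum, count) => sum.toDouble / count }
    averages.collect().foreach(println)          // per-key averages: a = 11.0/3, b = 7.0
    sc.stop()
  }
}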

        The shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and have been garbage collected, so that the shuffle files do not need to be re-created if the lineage is recomputed. Garbage collection may happen only after a long period of time if the application retains references to these RDDs or if GC does not kick in frequently, which means that long-running Spark jobs can consume a large amount of disk space. The temporary storage directory is specified by the spark.local.dir configuration parameter when configuring the SparkContext.

        Shuffle behavior can be tuned by adjusting a variety of configuration parameters; see the Shuffle Behavior section of the Spark Configuration Guide for details.
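        As a sketch of where such settings go (the path and the specific values below are illustrative, not recommendations; the keys themselves are documented in the configuration guide):

import org.apache.spark.{SparkConf, SparkContext}

object ShuffleConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shuffle-tuning-sketch")
      .setMaster("local[4]")
      .set("spark.local.dir", "/data/spark-scratch")   // where shuffle intermediate/spill files land (illustrative path)
      .set("spark.shuffle.compress", "true")           // compress map output files
      .set("spark.shuffle.file.buffer", "64k")         // per-writer in-memory buffer before writing to disk
      .set("spark.reducer.maxSizeInFlight", "96m")     // how much map output a reducer fetches at a time

    val sc = new SparkContext(conf)
    // ... job code ...
    sc.stop()
  }
}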


----------------------------------------------------------------------------------------------------------------------------------------------------

The two operators mentioned above, repartition and coalesce, are introduced below. They play an important role in merging small files and mitigating data skew.

1. The coalesce() operator:

    

coalesce(numPartitions)

Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset

That is, it reduces the number of partitions in the RDD to the given value; in practice it is often used to merge small files after a large dataset has been filtered down.

Underlying source code:
/**
 * Return a new RDD that is reduced into `numPartitions` partitions.
 *
 * This results in a narrow dependency, e.g. if you go from 1000 partitions
 * to 100 partitions, there will not be a shuffle, instead each of the 100
 * new partitions will claim 10 of the current partitions. If a larger number
 * of partitions is requested, it will stay at the current number of partitions.
 *
 * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
 * this may result in your computation taking place on fewer nodes than
 * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
 * you can pass shuffle = true. This will add a shuffle step, but means the
 * current upstream partitions will be executed in parallel (per whatever
 * the current partitioning is).
 *
 * @note With shuffle = true, you can actually coalesce to a larger number
 * of partitions. This is useful if you have a small number of partitions,
 * say 100, potentially with a few partitions being abnormally large. Calling
 * coalesce(1000, shuffle = true) will result in 1000 partitions with the
 * data distributed using a hash partitioner. The optional partition coalescer
 * passed in must be serializable.
 */
def coalesce(numPartitions: Int, shuffle: Boolean = false,
             partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
            (implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]

    // include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
      new HashPartitioner(numPartitions)),
      numPartitions,
      partitionCoalescer).values
  } else {
    new CoalescedRDD(this, numPartitions, partitionCoalescer)
  }
}
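A usage sketch of the two code paths above (numbers are illustrative): without shuffle, coalesce is a narrow dependency and several parent partitions are simply combined; with shuffle = true, the HashPartitioner path is taken, which also allows growing the partition count.

import org.apache.spark.{SparkConf, SparkContext}

object CoalesceSketch {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("coalesce-sketch").setMaster("local[8]"))
    val rdd = sc.parallelize(1 to 100000, numSlices = 1000)

    val merged  = rdd.coalesce(100)                   // narrow: each new partition claims ~10 old ones, no shuffle
    val single  = rdd.coalesce(1, shuffle = true)     // drastic coalesce: add a shuffle so upstream work stays parallel
    val widened = rdd.coalesce(2000, shuffle = true)  // growing the partition count only works with shuffle = true

    println(merged.getNumPartitions)    // 100
    println(single.getNumPartitions)    // 1
    println(widened.getNumPartitions)   // 2000
    sc.stop()
  }
}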

2. The repartition() operator:

repartition(numPartitions)

Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.

That is, it reshuffles the data in the RDD randomly to create either more or fewer partitions and balance the data across them; this always moves all of the data over the network.

Under the hood, repartition simply calls coalesce with shuffle = true.

Underlying source code:


/**
 * Return a new RDD that has exactly numPartitions partitions.
 *
 * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
 * a shuffle to redistribute data.
 *
 * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
 * which can avoid performing a shuffle.
 */
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
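A short usage sketch (the input path and numbers are illustrative): repartition is the convenient way to raise parallelism or rebalance skewed partitions at the cost of a full shuffle, while coalesce without shuffle is usually cheaper when you only want fewer partitions.

import org.apache.spark.{SparkConf, SparkContext}

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repartition-sketch").setMaster("local[8]"))

    // Hypothetical input: a directory of many small files, which textFile reads
    // into many tiny partitions (illustrative path).
    val lines = sc.textFile("/data/input/small-files/*")

    val balanced = lines.repartition(200)   // full shuffle: data is spread evenly over 200 partitions
    val merged   = lines.coalesce(20)       // narrow dependency: cheaper when only shrinking the partition count

    println(balanced.getNumPartitions)      // 200
    println(merged.getNumPartitions)        // 20, or fewer if the input had fewer partitions
    sc.stop()
  }
}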

    
