Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
Background
To understand what happens during the shuffle we can consider the example of the reduceByKey operation. The reduceByKey operation generates a new RDD where all values for a single key are combined into a tuple - the key and the result of executing a reduce function against all values associated with that key. The challenge is that not all values for a single key necessarily reside on the same partition, or even the same machine, but they must be co-located to compute the result.
In Spark, data is generally not laid out across partitions so that it already sits where a specific operation needs it. During computations, a single task operates on a single partition. To gather all the data that a single reduceByKey reduce task needs, Spark therefore has to perform an all-to-all operation: it must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key. This is called the shuffle.
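The all-to-all step can be illustrated with a small, Spark-free sketch in plain Python. It is not Spark's implementation: the names partition_for and simulate_reduce_by_key are hypothetical, and the byte-sum partitioner is only an assumption that stands in for a real hash partitioner so the output stays deterministic.

```python
from collections import defaultdict
from functools import reduce


def partition_for(key, num_partitions):
    # Hypothetical stand-in for a hash partitioner; summing the key's bytes
    # keeps the example deterministic across Python runs.
    return sum(key.encode("utf-8")) % num_partitions


def simulate_reduce_by_key(input_partitions, func, num_partitions):
    # Map side: every input partition routes each record to the output
    # partition that owns its key (the all-to-all step).
    shuffled = [defaultdict(list) for _ in range(num_partitions)]
    for partition in input_partitions:
        for key, value in partition:
            shuffled[partition_for(key, num_partitions)][key].append(value)
    # Reduce side: each output partition now holds every value for its keys,
    # so the reduce function can be applied locally.
    return [
        {key: reduce(func, values) for key, values in bucket.items()}
        for bucket in shuffled
    ]


# Values for key "a" start out on three different input partitions.
input_partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("a", 6)],
]
result = simulate_reduce_by_key(input_partitions, lambda x, y: x + y, 2)
merged = {k: v for part in result for k, v in part.items()}
print(result)
```

After the simulated shuffle, every key lives in exactly one output partition, which is what allows each reduce task to work on a single partition.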
Although the set of elements in each partition of newly shuffled data will be deterministic, and so is the ordering of partitions themselves, the ordering of these elements is not. If one desires predictably ordered data following shuffle then it’s possible to use:
– mapPartitions to sort each partition using, for example, .sorted
– repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning
– sortBy to make a globally ordered RDD
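The difference between the three options can be sketched on plain Python lists, where each inner list plays the role of one partition. This is only an illustration under simplifying assumptions: the function names are hypothetical, and the global sort below sorts everything in one place and slices it into ranges, which is not how Spark actually chooses range boundaries.

```python
def sort_each_partition(partitions):
    # Like mapPartitions with a per-partition sort: no record changes partition.
    return [sorted(p) for p in partitions]


def repartition_and_sort_within(partitions, num_partitions, partitioner):
    # Route each record to its target partition, then sort every target
    # partition by key. Order holds within a partition, not across them.
    out = [[] for _ in range(num_partitions)]
    for p in partitions:
        for key, value in p:
            out[partitioner(key) % num_partitions].append((key, value))
    return [sorted(p, key=lambda kv: kv[0]) for p in out]


def global_sort(partitions, num_partitions):
    # Like sortBy: afterwards partition i only holds keys smaller than those
    # in partition i + 1. Simplification: sort everything, then slice.
    records = sorted(r for p in partitions for r in p)
    size = -(-len(records) // num_partitions)  # ceiling division
    return [records[i * size:(i + 1) * size] for i in range(num_partitions)]


partitions = [[(3, "c"), (1, "a")], [(4, "d"), (2, "b")]]
local = sort_each_partition(partitions)
within = repartition_and_sort_within(partitions, 2, lambda k: k)
ordered = global_sort(partitions, 2)
print(local, within, ordered, sep="\n")
```

Only the last variant yields an order that holds when the partitions are read one after another; the first two guarantee order inside each partition only.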
Operations which can cause a shuffle include repartition operations like repartition and coalesce, ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.
Performance Impact
The shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates two sets of tasks: map tasks to organize the data, and reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations.
Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.
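A rough sketch of that layout, in plain Python rather than Spark internals: one map task sorts its records by target reduce partition into a single list (standing in for the single file), and each reduce task reads only its own contiguous block. The offsets list used to locate a block is an assumption added for illustration, and write_map_output and read_block are hypothetical names.

```python
def write_map_output(records, num_reducers, partitioner):
    # Sort by target reduce partition so each reducer's records are contiguous.
    # sorted() is stable, so records keep their arrival order inside a block.
    ordered = sorted(records, key=lambda kv: partitioner(kv[0]) % num_reducers)
    # offsets[r] .. offsets[r + 1] delimits the block that belongs to reducer r.
    offsets = [0] * (num_reducers + 1)
    for key, _ in ordered:
        offsets[partitioner(key) % num_reducers + 1] += 1
    for r in range(num_reducers):
        offsets[r + 1] += offsets[r]
    return ordered, offsets


def read_block(map_output, reducer_id):
    # A reduce task fetches only the slice addressed to it.
    data, offsets = map_output
    return data[offsets[reducer_id]:offsets[reducer_id + 1]]


records = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
map_output = write_map_output(records, 2, lambda k: ord(k))
block0 = read_block(map_output, 0)
block1 = read_block(map_output, 1)
print(block0, block1)
```

Keeping each reducer's records contiguous means a reduce task can fetch one slice per map output instead of scanning the whole file.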
Certain shuffle operations can consume significant amounts of heap memory since they employ in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and ByKey operations generate these on the reduce side. When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.
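The spill behaviour can be pictured with a toy aggregator in plain Python. It is a sketch under loud assumptions, not Spark's data structure: the class name SpillingAggregator is hypothetical, the memory limit is counted in keys rather than bytes, and "disk" is just a list of sorted runs held in memory.

```python
import heapq


class SpillingAggregator:
    def __init__(self, func, max_keys_in_memory):
        self.func = func
        self.max_keys = max_keys_in_memory  # assumed limit, counted in keys
        self.table = {}
        self.spills = []  # each entry is a key-sorted run standing in for a file

    def insert(self, key, value):
        if key in self.table:
            # Combining in place needs no extra memory.
            self.table[key] = self.func(self.table[key], value)
            return
        if len(self.table) >= self.max_keys:
            self._spill()
        self.table[key] = value

    def _spill(self):
        # Write the current table out as a sorted run and start a fresh one.
        self.spills.append(sorted(self.table.items()))
        self.table = {}

    def results(self):
        # Merge all sorted runs, combining values that share a key.
        runs = self.spills + [sorted(self.table.items())]
        merged = {}
        for key, value in heapq.merge(*runs, key=lambda kv: kv[0]):
            merged[key] = self.func(merged[key], value) if key in merged else value
        return merged


agg = SpillingAggregator(lambda x, y: x + y, max_keys_in_memory=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("a", 5), ("b", 6)]:
    agg.insert(key, value)
totals = agg.results()
print(len(agg.spills), totals)
```

The same key can end up in several runs, so the final merge has to combine partial results again; that extra pass over spilled data is the overhead the paragraph above describes.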
Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don’t need to be re-created if the lineage is re-computed. Garbage collection may happen only after a long period of time, if the application retains references to these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may consume a large amount of disk space. The temporary storage directory is specified by the spark.local.dir configuration parameter when configuring the Spark context.