Shuffle operations
Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
Background
To understand what happens during the shuffle, we can consider the example of the reduceByKey operation. The reduceByKey operation generates a new RDD where all values for a single key are combined into a tuple - the key and the result of executing a reduce function against all values associated with that key. The challenge is that not all values for a single key necessarily reside on the same partition, or even the same machine, but they must be co-located to compute the result.
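As a concrete illustration, here is a minimal sketch of a reduceByKey job. The object name, sample data, and local-mode configuration are invented for this example; only reduceByKey itself is the operation under discussion:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReduceByKeyExample {
  def main(args: Array[String]): Unit = {
    // Local mode purely for illustration; the app name is arbitrary.
    val conf = new SparkConf().setAppName("ReduceByKeyExample").setMaster("local[4]")
    val sc = new SparkContext(conf)

    // Word-count style pairs spread over 4 partitions; before the shuffle,
    // values for the same key may sit on different partitions (or machines).
    val pairs = sc.parallelize(
      Seq(("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1), ("a", 1)), 4)

    // reduceByKey triggers the shuffle: Spark co-locates all values for each
    // key, then combines them with the reduce function (_ + _).
    val counts = pairs.reduceByKey(_ + _)

    counts.collect().foreach(println) // e.g. (a,3), (b,2), (c,1)
    sc.stop()
  }
}
```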
In Spark, data is generally not distributed across partitions to be in the necessary place for a specific operation. During computations, a single task will operate on a single partition - thus, to organize all the data for a single reduceByKey reduce task to execute, Spark needs to perform an all-to-all operation. It must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key - this is called the shuffle.
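One way to observe this redistribution is with glom(), which exposes each partition's contents as an array. The spark-shell-style sketch below uses invented sample data, and the exact partition assignments it prints depend on the partitioner; the point is that keys are scattered before the shuffle and co-located afterwards:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local run just to print partition contents.
val sc = new SparkContext(
  new SparkConf().setAppName("GlomDemo").setMaster("local[4]"))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("b", 1)), 4)

// glom() turns each partition into an array, making the layout printable.
pairs.glom().collect().zipWithIndex.foreach { case (part, i) =>
  println(s"before, partition $i: ${part.mkString(" ")}") // keys scattered
}
pairs.reduceByKey(_ + _).glom().collect().zipWithIndex.foreach { case (part, i) =>
  println(s"after, partition $i: ${part.mkString(" ")}")  // each key on one partition
}
sc.stop()
```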