Shuffle operations
Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
Background
To understand what happens during the shuffle, we can consider the example of the reduceByKey operation. The reduceByKey operation generates a new RDD where all values for a single key are combined into a tuple - the key and the result of executing a reduce function against all values associated with that key. The challenge is that not all values for a single key necessarily reside on the same partition, or even the same machine, but they must be co-located to compute the result.
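As a concrete illustration, here is a minimal sketch of a reduceByKey job. The object name, sample data, and local-mode configuration are invented for this example; only reduceByKey itself is the operation under discussion:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReduceByKeyExample {
  def main(args: Array[String]): Unit = {
    // Local mode purely for illustration; the app name is arbitrary.
    val conf = new SparkConf().setAppName("ReduceByKeyExample").setMaster("local[4]")
    val sc = new SparkContext(conf)

    // Word-count style pairs spread over 4 partitions; before the shuffle,
    // values for the same key may sit on different partitions (or machines).
    val pairs = sc.parallelize(
      Seq(("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1), ("a", 1)), 4)

    // reduceByKey triggers the shuffle: Spark co-locates all values for each
    // key, then combines them with the reduce function (_ + _).
    val counts = pairs.reduceByKey(_ + _)

    counts.collect().foreach(println) // e.g. (a,3), (b,2), (c,1)
    sc.stop()
  }
}
```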
In Spark, data is generally not distributed across partitions to be in the necessary place for a specific operation. During computations, a single task will operate on a single partition - thus, to organize all the data for a single reduceByKey reduce task to execute, Spark needs to perform an all-to-all operation. It must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key - this is called the shuffle.
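One way to observe this redistribution is with glom(), which exposes each partition's contents as an array. The spark-shell-style sketch below uses invented sample data, and the exact partition assignments it prints depend on the partitioner; the point is that keys are scattered before the shuffle and co-located afterwards:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local run just to print partition contents.
val sc = new SparkContext(
  new SparkConf().setAppName("GlomDemo").setMaster("local[4]"))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("b", 1)), 4)

// glom() turns each partition into an array, making the layout printable.
pairs.glom().collect().zipWithIndex.foreach { case (part, i) =>
  println(s"before, partition $i: ${part.mkString(" ")}") // keys scattered
}
pairs.reduceByKey(_ + _).glom().collect().zipWithIndex.foreach { case (part, i) =>
  println(s"after, partition $i: ${part.mkString(" ")}")  // each key on one partition
}
sc.stop()
```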