The difference between spark.sql.shuffle.partitions and spark.default.parallelism


Lestat.Z.


Categories: Spark, Spark study notes. Tags: spark, sparksql

Article link: https://blog.csdn.net/yolohohohoho/article/details/87967783


When tuning the parallelism of a Spark job, two parameters come up again and again: spark.sql.shuffle.partitions and spark.default.parallelism. So what exactly is the difference between them?

First, let's look at their definitions.

Property Name: spark.sql.shuffle.partitions
Default: 200
Meaning: Configures the number of partitions to use when shuffling data for joins or aggregations.

Property Name: spark.default.parallelism
Default: For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:
- Local mode: number of cores on the local machine
- Mesos fine grained mode: 8
- Others: total number of cores on all executor nodes or 2, whichever is larger
Meaning: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
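To make the default rule for spark.default.parallelism easier to follow, here is a minimal sketch in plain Python that models only the "no parent RDD" case described above (for example, parallelize). It is not Spark's own code: the function name, the mode strings, and the parameters are all hypothetical and chosen purely for illustration.

```python
from typing import Optional


def default_parallelism(
    mode: str,
    total_executor_cores: int = 0,
    local_cores: int = 1,
    configured: Optional[int] = None,
) -> int:
    """Model the documented default of spark.default.parallelism.

    Covers only operations with no parent RDD (e.g. parallelize).
    The mode names "local" and "mesos-fine-grained" are illustrative
    labels, not values that Spark itself accepts.
    """
    # An explicit user setting always wins over the computed default.
    if configured is not None:
        return configured
    # Local mode: number of cores on the local machine.
    if mode == "local":
        return local_cores
    # Mesos fine-grained mode: fixed at 8.
    if mode == "mesos-fine-grained":
        return 8
    # Others: total executor cores or 2, whichever is larger.
    return max(total_executor_cores, 2)
```

Under this model, a cluster with 48 executor cores in total would default to 48 partitions, and a cluster with a single executor core would still default to 2.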

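Their definitions look quite similar, but in practice they behave differently: spark.default.parallelism only takes effect for RDD operations and has no effect on Spark SQL (DataFrame/Dataset) shuffles, while spark.sql.shuffle.partitions applies only to Spark SQL.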

Both values can be changed at job submission time by passing --conf, as follows:

spark-submit --conf spark.sql.shuffle.partitions=20 --conf spark.default.parallelism=20
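The same two settings can also be supplied in code when the session is built, which makes it easy to check which setting governs which kind of shuffle. The following is a rough PySpark sketch, not a definitive test: the function name partition_counts, the local[2] master, and the sample data are assumptions made for illustration. Adaptive query execution is switched off because it can coalesce shuffle partitions at runtime and would hide the configured value.

```python
def partition_counts(shuffle_partitions: int = 20, default_parallelism: int = 8) -> dict:
    """Return the partition counts after a DataFrame shuffle and an RDD shuffle."""
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    # Note: if a SparkContext is already running, context-level settings
    # such as spark.default.parallelism are not re-applied by getOrCreate().
    spark = (
        SparkSession.builder
        .master("local[2]")  # assumption: a small local run for illustration
        .appName("partition-settings-demo")
        .config("spark.sql.shuffle.partitions", str(shuffle_partitions))
        .config("spark.default.parallelism", str(default_parallelism))
        # AQE may coalesce shuffle partitions, so disable it for this check.
        .config("spark.sql.adaptive.enabled", "false")
        .getOrCreate()
    )
    try:
        # DataFrame aggregation: a Spark SQL shuffle.
        df = spark.range(1000).selectExpr("id % 10 AS k")
        sql_count = df.groupBy("k").count().rdd.getNumPartitions()

        # RDD aggregation: reduceByKey with no explicit numPartitions.
        rdd = spark.sparkContext.parallelize([(i % 10, 1) for i in range(1000)], 4)
        rdd_count = rdd.reduceByKey(lambda a, b: a + b).getNumPartitions()

        return {"sql_shuffle": sql_count, "rdd_shuffle": rdd_count}
    finally:
        spark.stop()
```

Giving the two settings different values, as the defaults here do, is what makes the result informative: the DataFrame aggregation should follow spark.sql.shuffle.partitions, and the RDD reduceByKey should follow spark.default.parallelism.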
