Resource Strategy for Launching Spark Applications

CesarChoy

Category: Spark. Tags: spark, big data

Article link: https://blog.csdn.net/weixin_42687074/article/details/105552833

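I. Background

Once a Spark project has been developed, how many resources should it be given? Well-chosen resource parameters help us make better use of the cluster.

II. Experiment

1. Code logic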

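import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{count, max}
import org.elasticsearch.spark.sql._ // provides saveToEs on DataFrame

// First pull: read the source table over JDBC (connection options omitted in the original)
val df: DataFrame = spark.read.format("jdbc")
  .load()
  // Shuffle right after the pull
  .repartition(numPartitions)

// First global aggregation: maximum Id
val max_row = df.select(max("Id")).first().getInt(0)

// Second global aggregation: row count (count returns a Long column)
val cnt_row = df.select(count("Id")).first().getLong(0)

// Clean the data with Spark SQL
df.createOrReplaceTempView("xx_yyyy")
// cleanSql stands for the cleaning query, which the original does not show
val df2 = spark.sql(cleanSql)
// Write to the external system (Elasticsearch)
df2.saveToEs("xx_yyyy")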

2. Submit parameters (template)

spark-submit \
--class com.xxx.yyy.app_main.xx_yyyy \
--master yarn \
--deploy-mode client \
--driver-memory 1g \
--num-executors 1 \
--executor-cores 2 \
--executor-memory 4g \
/opt/test_run_jar/xx_yyyy.jar test 4

3. Experimental results

No.	Rows	Driver Memory	Num Executors	Executor Cores	Executor Memory	Shuffle Partitions	Cores	Storage Memory	Tasks	Time
0	1.43M	2g	4	4	3g	12	16	7.1g	124	88s
1	1.43M	2g	4	4	4g	12	16	9.4g	124	83s
2	1.43M	2g	4	4	4g	8	16	9.4g	88	80s
3	1.43M	2g	4	1	1g	12	12	6g	124	87s
4	1.43M	2g	4	1	1g	8	6	3.5g	88	93s
5	1.43M	2g	4	1	1g	4	4	2.6g	52	100s
6	1.43M	1g	4	1	1g	4	4	2.1g	52	103s
7	1.43M	1g	4	1	2g	4	4	4.2g	52	93s
8	1.43M	1g	4	1	1g	4	4	2.1g	52	103s
9	1.43M	1g	2	2	2g	4	4	2.5g	52	93s
10	1.43M	1g	1	4	4g	4	4	2.3g	52	91s
11	1.43M	1g	1	2	4g	4	4	4.6g	52	89s
12	1.43M	1g	1	1	4g	4	4	8.8g	52	100s
13	1.43M	1g	1	1	4g	1	1	2.5g	19	143s
14	1.43M	1g	1	1	2g	1	1	1.3g	19	134s
15	1.43M	1g	1	1	1g	1	1	802M	19	135s

16	1.43M	1g	2	2	4g	8	8	4.6g	88	84s
17	1.43M	1g	2	2	2g	8	8	4.2g	52	86s
18	1.43M	1g	2	2	1g	8	8	2.1g	52	84s
19	1.43M	1g	1	4	2g	8	4	1.3g	52	89s
20	1.43M	1g	1	4	2g	10	12	3.3g	106	88s
21	1.43M	1g	1	4	2g	12	12	3.3g	124	85s
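
To make rows of the table easier to compare, here is a minimal Scala sketch that models a few of them and picks out the fastest run and the run that requests the least executor memory while still finishing within a time limit. The names `Run` and `RunComparison`, the choice of rows, and the 90 s limit are assumptions made for illustration; they are not part of the original experiment. The sketch uses the requested executors and cores, which can differ from the Cores column that was actually observed.

```scala
// Illustrative only: Run and RunComparison are hypothetical names.
final case class Run(
    id: Int,
    numExecutors: Int,
    executorCores: Int,
    executorMemoryGb: Int,
    seconds: Int
) {
  // Requested executor cores: --num-executors * --executor-cores
  def requestedCores: Int = numExecutors * executorCores

  // Requested executor heap in GB: --num-executors * --executor-memory
  def requestedMemoryGb: Int = numExecutors * executorMemoryGb
}

object RunComparison {
  // Settings and times copied from rows 2, 9, 10, 15 and 19 of the table.
  val runs: Seq[Run] = Seq(
    Run(id = 2, numExecutors = 4, executorCores = 4, executorMemoryGb = 4, seconds = 80),
    Run(id = 9, numExecutors = 2, executorCores = 2, executorMemoryGb = 2, seconds = 93),
    Run(id = 10, numExecutors = 1, executorCores = 4, executorMemoryGb = 4, seconds = 91),
    Run(id = 15, numExecutors = 1, executorCores = 1, executorMemoryGb = 1, seconds = 135),
    Run(id = 19, numExecutors = 1, executorCores = 4, executorMemoryGb = 2, seconds = 89)
  )

  // The run with the shortest elapsed time.
  def fastest: Run = runs.minBy(_.seconds)

  // Among runs finishing within limitSeconds, the one requesting the least executor memory.
  def leanestWithin(limitSeconds: Int): Option[Run] = {
    val candidates = runs.filter(_.seconds <= limitSeconds)
    if (candidates.isEmpty) None else Some(candidates.minBy(_.requestedMemoryGb))
  }

  def main(args: Array[String]): Unit = {
    println(s"fastest: case ${fastest.id} (${fastest.requestedCores} requested cores)")
    leanestWithin(90).foreach(r => println(s"leanest within 90 s: case ${r.id}"))
  }
}
```

Separating "fastest" from "leanest within a limit" is one way to read the table when the goal is a good trade-off rather than the lowest time at any cost.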

III. Conclusions

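(1) Cases 1 and 3 show that more cores give better performance.

(2) Cases 3 to 5 show that more shuffle partitions give better performance.

(3) Cases 3 to 5 and 19 to 21 show that too many shuffle partitions push the executor and core counts above the values initially specified, and memory grows accordingly, although performance still improves.

(4) Cases 5 and 6 show that more driver memory gives slightly better performance.

(5) Cases 7 and 8 and cases 16 to 18 show that more executor memory gives better performance, but only to a limited degree.

(6) Cases 8 to 10 show that, for the same number of cores, fewer executors need less memory and perform better.

(7) Cases 8 to 10 show that, with a fixed number of cores, the amount of allocated memory has limited effect.

(8) Cases 10 to 12 show that, with a single executor, more cores need less memory and perform better.

(9) Cases 13 and 14 show that a single core cannot exploit a distributed system; performance is the worst.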

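IV. Strategy

0. In these tests:

The better configurations were cases 9, 10, and 18; the best was case 19.

1. Priority when choosing core settings:

For the same total number of cores:

More cores per Executor > more Executors >> a single core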
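
The priority above can be sketched as a small enumeration: for a fixed total core count, list every way to split it into executors and cores per executor, with the most cores per executor first. `Layout` and `CoreLayouts` are hypothetical names introduced for this example, and the ordering only reflects what was observed in this experiment, not a general rule.

```scala
// Illustrative only: Layout and CoreLayouts are hypothetical names.
final case class Layout(numExecutors: Int, executorCores: Int)

object CoreLayouts {
  // All (executors, cores per executor) pairs whose product is totalCores,
  // ordered with the most cores per executor first.
  def layouts(totalCores: Int): Seq[Layout] = {
    require(totalCores > 0, "totalCores must be positive")
    (1 to totalCores)
      .filter(n => totalCores % n == 0)
      .map(n => Layout(numExecutors = n, executorCores = totalCores / n))
      .sortBy(l => -l.executorCores)
  }

  def main(args: Array[String]): Unit =
    layouts(4).foreach { l =>
      println(s"--num-executors ${l.numExecutors} --executor-cores ${l.executorCores}")
    }
}
```

For 4 total cores this yields the layouts of cases 10, 9 and 8 in that order, which matches their measured times (91s, 93s, 103s).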

2. How to size memory:

Check the Spark Web UI and set the memory slightly above each Executor's observed memory usage.
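
As a rough sketch of "slightly above observed usage", the helper below adds headroom to an observed peak and rounds up to a multiple of 512 MB. The 20% headroom, the 512 MB step, and the names `ExecutorMemory`, `suggestedMemoryMb` and `yarnContainerMb` are assumptions for illustration; the original gives no numbers. The second function reflects Spark's default executor memory overhead on YARN (10% of executor memory, at least 384 MB), which is added on top of `--executor-memory` when the container is requested.

```scala
// Illustrative only: names and the headroom factor are assumptions.
object ExecutorMemory {
  // Assumed headroom of 20%; the original only says "slightly above".
  val DefaultHeadroom: Double = 1.2

  // Observed peak usage plus headroom, rounded up to the next multiple of 512 MB.
  def suggestedMemoryMb(observedPeakMb: Int, headroom: Double = DefaultHeadroom): Int = {
    require(observedPeakMb > 0 && headroom >= 1.0, "invalid input")
    (math.ceil(observedPeakMb * headroom / 512.0) * 512).toInt
  }

  // Approximate YARN container size: executor memory plus the default overhead,
  // which is max(384 MB, 10% of executor memory).
  def yarnContainerMb(executorMemoryMb: Int): Int =
    executorMemoryMb + math.max(384, (executorMemoryMb * 0.10).toInt)

  def main(args: Array[String]): Unit = {
    val observedPeakMb = 1500 // hypothetical value read from the Executors tab
    val memoryMb = suggestedMemoryMb(observedPeakMb)
    println(s"--executor-memory ${memoryMb}m (container about ${yarnContainerMb(memoryMb)} MB)")
  }
}
```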

3. Set a moderate number of shuffle partitions, but avoid a value that triggers an increase in cores and memory.
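
One possible way to pick such a value is to tie the partition count to the requested cores. The sketch below uses two partitions per core, a ratio read off case 19 (4 cores, 8 partitions, no growth in cores); it is a heuristic assumed for illustration, not a rule stated in the original, and `ShufflePartitions` is a hypothetical name.

```scala
// Illustrative only: the 2-per-core default is an assumption taken from case 19.
object ShufflePartitions {
  // Partition count as a small multiple of the requested total cores.
  def suggested(totalCores: Int, partitionsPerCore: Int = 2): Int = {
    require(totalCores > 0 && partitionsPerCore > 0, "arguments must be positive")
    totalCores * partitionsPerCore
  }

  def main(args: Array[String]): Unit =
    println(s"numPartitions for 4 cores: ${suggested(4)}")
}
```

In the experiment, going from 8 to 10 or 12 partitions with the same submit parameters (cases 20 and 21) raised the observed cores from 4 to 12, so a value chosen this way should still be checked in the Spark Web UI.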
