Dynamic Resource Allocation for Spark Jobs

YF_raaiiid

Tags: spark, big data, hadoop

Article link: https://blog.csdn.net/qq_45749457/article/details/125613640

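Enabling dynamic resource allocation

For dynamic allocation to take effect, one of the following two configurations must be in place beforehand.

Method 1:

1. Submit the application with the following settings: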

set spark.dynamicAllocation.enabled = true
set spark.dynamicAllocation.shuffleTracking.enabled = true

Method 2:

1. Submit the application with the following settings:

spark.dynamicAllocation.enabled = true
spark.shuffle.service.enabled = true

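2. Set up the external shuffle service; the steps for YARN are described in the "on Yarn" section.

The purpose of enabling shuffle tracking or the external shuffle service is to ensure that the shuffle files written by executors are not lost when those executors are removed.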
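As an illustration of the two options above, here is a minimal sketch that turns each set of properties into `spark-submit --conf` arguments. The helper name `conf_args` and the script name `my_job.py` are hypothetical and only serve the example; the property keys are the ones listed above.

```python
# Sketch: build `spark-submit --conf` arguments for the two setups.
# `conf_args` and `my_job.py` are hypothetical names used for illustration.

# Method 1: dynamic allocation with shuffle tracking.
SHUFFLE_TRACKING = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
}

# Method 2: dynamic allocation backed by the external shuffle service.
EXTERNAL_SHUFFLE_SERVICE = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
}


def conf_args(settings):
    """Turn a {key: value} mapping into a flat list of --conf arguments."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Assemble the full command line for Method 1 and show it.
command = ["spark-submit"] + conf_args(SHUFFLE_TRACKING) + ["my_job.py"]
print(" ".join(command))
```

Swapping `SHUFFLE_TRACKING` for `EXTERNAL_SHUFFLE_SERVICE` gives the command line for Method 2, which additionally needs the shuffle service running on every node.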

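How it works

Before dynamic allocation can work, the application must first be decoupled from the data its executors produce.

What is being decoupled: the Spark application's dependence on the data files its executors have written.

Executors write two kinds of data: (1) shuffle files; (2) data persisted to cache or disk.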

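Handling shuffle file data

If an executor is removed before the shuffle completes, the shuffle files it was responsible for must be recomputed and rewritten by other executors, which is wasted work.

With the external shuffle service, a long-running process is started on every node, independent of your Spark applications and their executors. Executors then fetch shuffle files from this process instead of from one another, so shuffle files can outlive the executor that wrote them.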

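Handling data persisted to cache or disk

Besides writing shuffle files, executors also persist data to cache or disk. Once an executor is removed, the data it cached can no longer be accessed.

To address this, executors holding cached data are never removed by default. This behavior can be changed by setting spark.dynamicAllocation.cachedExecutorIdleTimeout.

When spark.shuffle.service.fetch.rdd.enabled is set to true, Spark can fetch disk-persisted RDD blocks from the external shuffle service described above. With dynamic allocation, if this feature is enabled, executors that hold only disk-persisted blocks are considered idle after spark.dynamicAllocation.executorIdleTimeout and are released accordingly.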
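The rules in this section can be summarized as a small decision function. This is a simplified model of the behavior described above, not Spark's actual implementation; the function name `idle_timeout_seconds` and its parameters are hypothetical. The defaults mirror Spark's documented defaults (60s for executorIdleTimeout, infinity for cachedExecutorIdleTimeout).

```python
import math


def idle_timeout_seconds(has_memory_cache, has_disk_cache,
                         executor_idle_timeout=60.0,
                         cached_executor_idle_timeout=math.inf,
                         fetch_rdd_from_shuffle_service=False):
    """Return how long an executor must stay idle before it may be released.

    Simplified model (an assumption, not Spark's code):
    - blocks cached in memory always use the cached-executor timeout;
    - disk-persisted blocks use the cached-executor timeout unless the
      external shuffle service can serve them
      (spark.shuffle.service.fetch.rdd.enabled = true);
    - everything else uses the plain executor idle timeout.
    """
    if has_memory_cache:
        return cached_executor_idle_timeout
    if has_disk_cache and not fetch_rdd_from_shuffle_service:
        return cached_executor_idle_timeout
    return executor_idle_timeout


# An executor holding only disk-persisted blocks, with and without the feature.
print(idle_timeout_seconds(False, True))
print(idle_timeout_seconds(False, True, fetch_rdd_from_shuffle_service=True))
```

With the default infinite cached-executor timeout, the first call reports that the executor is never released, while the second falls back to the ordinary idle timeout because the shuffle service can still serve its blocks.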

on Yarn

1. Configure YARN by editing yarn-site.xml on every node in the cluster.

Property to modify in the existing file:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>

Properties to add to the file:

<property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<property>
    <name>spark.shuffle.service.port</name>
    <value>7337</value>
</property>
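Before restarting the nodes, it can help to check that an edited yarn-site.xml really contains the properties shown above. The following is a minimal sketch using only the Python standard library; the helper names `read_properties` and `missing_shuffle_settings` are hypothetical, and the sample XML is inlined so the snippet runs without a cluster.

```python
import xml.etree.ElementTree as ET

# Inline sample mirroring the properties above; in practice, read the real file.
SAMPLE_YARN_SITE = """<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>
  <property>
    <name>spark.shuffle.service.port</name>
    <value>7337</value>
  </property>
</configuration>"""

SHUFFLE_CLASS_KEY = "yarn.nodemanager.aux-services.spark_shuffle.class"
SHUFFLE_CLASS = "org.apache.spark.network.yarn.YarnShuffleService"


def read_properties(xml_text):
    """Parse Hadoop-style <property> elements into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


def missing_shuffle_settings(props):
    """Return a list of human-readable problems; empty means the check passed."""
    problems = []
    services = (props.get("yarn.nodemanager.aux-services") or "").split(",")
    if "spark_shuffle" not in [s.strip() for s in services]:
        problems.append("spark_shuffle is not listed in yarn.nodemanager.aux-services")
    if props.get(SHUFFLE_CLASS_KEY) != SHUFFLE_CLASS:
        problems.append(SHUFFLE_CLASS_KEY + " is missing or wrong")
    return problems


props = read_properties(SAMPLE_YARN_SITE)
print(missing_shuffle_settings(props))
```

To check a real node, the inline sample could be replaced with the contents of that node's yarn-site.xml.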

2. Copy $SPARK_HOME/yarn/spark-<version>-yarn-shuffle.jar to the ${HADOOP_HOME}/share/hadoop/yarn/lib/ directory on every NodeManager, then restart every node whose configuration was changed.

3. Edit $SPARK_HOME/conf/spark-defaults.conf and add the following parameters:

# Enable the external shuffle service
spark.shuffle.service.enabled true
# Shuffle service port (default 7337); must match the value in yarn-site.xml
spark.shuffle.service.port 7337
# Enable dynamic resource allocation
spark.dynamicAllocation.enabled true
# Minimum number of executors per application
spark.dynamicAllocation.minExecutors 0
# Maximum number of executors per application
spark.dynamicAllocation.maxExecutors 3
# Request new executors once tasks have been pending for 1s
spark.dynamicAllocation.schedulerBacklogTimeout 1s
# While the backlog persists, request more executors every 5s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 5s
# Release an executor that has been idle for more than 60s
spark.dynamicAllocation.executorIdleTimeout 60s
# Initial number of executors when dynamic allocation is enabled. If --num-executors (or spark.executor.instances) is set and is larger than this value, the larger value is used instead, e.g. max(initialExecutors = 3, --num-executors = 10) gives 10.
spark.dynamicAllocation.initialExecutors 1
# Release an executor holding cached data blocks once it has been idle for more than 60s
spark.dynamicAllocation.cachedExecutorIdleTimeout 60s
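To see how these values interact, here is a rough simulation of the scale-up policy. It relies on the documented behavior that executors are requested in rounds (first after schedulerBacklogTimeout, then every sustainedSchedulerBacklogTimeout) and that the number requested doubles each round. It is a simplified model that assumes the task backlog never clears and ignores how many executors the pending tasks actually need; the function names `initial_executors` and `ramp_up` are hypothetical.

```python
def initial_executors(min_executors, initial, num_executors=0):
    """Starting executor count: the largest of minExecutors,
    initialExecutors and --num-executors (spark.executor.instances)."""
    return max(min_executors, initial, num_executors)


def ramp_up(start, max_executors, backlog_timeout, sustained_timeout, duration):
    """Return [(time_in_seconds, executor_count), ...] for a persistent backlog.

    Round 1 fires after `backlog_timeout`, later rounds every
    `sustained_timeout`; each round requests twice as many executors as the
    previous one (1, 2, 4, ...), capped at `max_executors`.
    """
    timeline = [(0.0, start)]
    current, to_add, t = start, 1, backlog_timeout
    while t <= duration and current < max_executors:
        current = min(current + to_add, max_executors)
        timeline.append((t, current))
        to_add *= 2
        t += sustained_timeout
    return timeline


# Values taken from the spark-defaults.conf settings above.
start = initial_executors(min_executors=0, initial=1)
for t, n in ramp_up(start, max_executors=3, backlog_timeout=1.0,
                    sustained_timeout=5.0, duration=30.0):
    print(f"t={t:.1f}s executors={n}")
```

Under these settings the model starts with 1 executor, grows to 2 after 1s of backlog, and reaches the maxExecutors cap of 3 at 6s.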

on K8s

The official documentation lists this under Future Work.

Future Work

There are several Spark on Kubernetes features that are currently being worked on or planned to be worked on. Those features are expected to eventually make it into future versions of the spark-kubernetes integration.

Some of these include: