Spark Job Scheduling

javartisan

Article link: https://blog.csdn.net/Dax1n/article/details/70183034

Overview

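Spark has several facilities for scheduling resources between computations. First, each Spark application runs on its own independent set of executor processes, and the cluster manager that Spark runs on schedules resources across applications. Second, within a single Spark application, multiple jobs (Spark actions) may run concurrently if they are submitted from different threads. This is common when the application serves requests over the network. Spark includes a fair scheduler to share resources within each SparkContext.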

Scheduling Across Applications

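Every Spark application running on a cluster gets its own independent set of executor processes, which run tasks and store data only for that application. If multiple users need to share the cluster, there are several ways to manage resource allocation, depending on the cluster manager.

The simplest option, supported by all cluster managers, is static partitioning of resources. With this approach, each application is given a maximum amount of resources it can use and holds on to them for its entire lifetime. This is the approach used in Spark’s standalone and YARN modes, as well as in coarse-grained Mesos mode. Depending on the cluster manager, resource allocation can be configured as follows: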

Standalone mode: By default, applications submitted to the standalone mode cluster will run in FIFO (first-in-first-out) order, and each application will try to use all available nodes. You can limit the number of nodes an application uses by setting the spark.cores.max configuration property in it, or change the default for applications that don’t set this setting through spark.deploy.defaultCores. Finally, in addition to controlling cores, each application’s spark.executor.memory setting controls its memory use.

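spark.cores.max: the maximum number of CPU cores the application may acquire across the whole cluster, not on a single node.

spark.deploy.defaultCores: the default number of cores given to applications that do not set spark.cores.max.

spark.executor.memory: the amount of memory each executor uses.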
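As a rough illustration of static partitioning in standalone mode, the two application-side properties above could be set on a SparkConf as follows. This is only a sketch: the application name and the values 8 and 4g are made-up assumptions, not recommendations.

```scala
import org.apache.spark.SparkConf

// Passing false skips loading spark.* JVM system properties,
// so the conf contains only what is set here.
val staticConf = new SparkConf(false)
  .setAppName("static-partitioning-demo") // hypothetical application name
  .set("spark.cores.max", "8")            // cap on total cores across the whole cluster
  .set("spark.executor.memory", "4g")     // memory used by each executor

println(staticConf.toDebugString)
```

spark.deploy.defaultCores is not set here because it is a cluster-wide default rather than a per-application setting.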

Mesos: To use static partitioning on Mesos, set the spark.mesos.coarse configuration property to true, and optionally set spark.cores.max to limit each application’s resource share as in the standalone mode. You should also set spark.executor.memory to control the executor memory.
YARN: The --num-executors option to the Spark YARN client controls how many executors it will allocate on the cluster (spark.executor.instances as configuration property), while --executor-memory (spark.executor.memory configuration property) and --executor-cores (spark.executor.cores configuration property) control the resources per executor. For more information, see the YARN Spark Properties.

In short, each YARN command-line option has an equivalent configuration property:

--num-executors (spark.executor.instances): the number of executors allocated on the cluster.

--executor-memory (spark.executor.memory): the memory used by each executor.

--executor-cores (spark.executor.cores): the number of cores used by each executor.
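The same YARN settings can also be expressed as configuration properties in code. The sketch below assumes arbitrary example values (10 executors, 4g, 2 cores each) and only builds the configuration; it does not submit anything to a cluster.

```scala
import org.apache.spark.SparkConf

val yarnConf = new SparkConf(false)        // false: ignore spark.* system properties
  .set("spark.executor.instances", "10")   // equivalent to --num-executors 10
  .set("spark.executor.memory", "4g")      // equivalent to --executor-memory 4g
  .set("spark.executor.cores", "2")        // equivalent to --executor-cores 2

// Total cores requested under these example values: executors * cores per executor.
val totalCores =
  yarnConf.getInt("spark.executor.instances", 0) * yarnConf.getInt("spark.executor.cores", 0)
println(s"total executor cores requested: $totalCores")
```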

Note that none of the modes currently provide memory sharing across applications. If you would like to share data this way, we recommend running a single server application that can serve multiple requests by querying the same RDDs.

Dynamic Resource Allocation

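Spark provides a mechanism to dynamically adjust the resources an application occupies based on its workload. This means that an application can give resources back to the cluster when they are no longer used and request them again later when there is demand.

This is particularly useful for resource utilization when multiple applications share the cluster.

The feature is disabled by default and is available on the coarse-grained cluster managers: standalone mode, YARN mode, and Mesos coarse-grained mode.

(For dynamic resource allocation on standalone or YARN, see the reference documentation.)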

Configuration and Setup

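Two things are required to use dynamic resource allocation. First, the application must set spark.dynamicAllocation.enabled to true. Second, an external shuffle service must be running on every worker node in the cluster, and the application must set spark.shuffle.service.enabled to true.

The purpose of the external shuffle service is to allow executors to be removed without deleting the shuffle files they have written. Those shuffle files are then managed and cleaned up by the external shuffle service.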
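On the application side, the two required switches could be set as in the following sketch. The minExecutors and maxExecutors bounds are optional and the values shown are assumptions chosen only for illustration; the external shuffle service itself still has to be started on the worker nodes separately.

```scala
import org.apache.spark.SparkConf

val dynamicConf = new SparkConf(false)                 // false: ignore spark.* system properties
  .set("spark.dynamicAllocation.enabled", "true")      // required: turn on dynamic allocation
  .set("spark.shuffle.service.enabled", "true")        // required: use the external shuffle service
  .set("spark.dynamicAllocation.minExecutors", "1")    // optional lower bound (example value)
  .set("spark.dynamicAllocation.maxExecutors", "20")   // optional upper bound (example value)

println(dynamicConf.toDebugString)
```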

How the external shuffle service is set up depends on the cluster manager:

Standalone mode: start the workers with spark.shuffle.service.enabled set to true.

YARN mode: follow the instructions below.

Configuring the External Shuffle Service

To start the Spark Shuffle Service on each NodeManager in your YARN cluster, follow these instructions:

Build Spark with the YARN profile. Skip this step if you are using a pre-packaged distribution.
Locate the spark-<version>-yarn-shuffle.jar. This should be under $SPARK_HOME/common/network-yarn/target/scala-<version> if you are building Spark yourself, and under yarn if you are using a distribution.
Add this jar to the classpath of all NodeManagers in your cluster.
In the yarn-site.xml on each node, add spark_shuffle to yarn.nodemanager.aux-services, then set yarn.nodemanager.aux-services.spark_shuffle.class to org.apache.spark.network.yarn.YarnShuffleService.
Increase NodeManager's heap size by setting YARN_HEAPSIZE (1000 by default) in etc/hadoop/yarn-env.sh to avoid garbage collection issues during shuffle.
Restart all NodeManagers in your cluster.

The following extra configuration options are available when the shuffle service is running on YARN:

Property Name	Default	Meaning
`spark.yarn.shuffle.stopOnFailure`	`false`	Whether to stop the NodeManager when there's a failure in the Spark Shuffle Service's initialization. This prevents application failures caused by running containers on NodeManagers where the Spark Shuffle Service is not running.

All other relevant configurations are optional and under the spark.dynamicAllocation.* and spark.shuffle.service.* namespaces. For more detail, see the configurations page.

Resource Allocation Policy

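Spark should release executors that are no longer in use and acquire them again when they are needed. Because there is no reliable way to predict whether an executor will be needed in the near future, Spark uses a set of heuristics to decide when to remove executors and when to request them.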

Request Policy

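A Spark application with dynamic allocation enabled requests additional executors when it has pending tasks waiting to be scheduled. This condition implies that the existing set of executors is not sufficient to run all submitted tasks at the same time.

Spark requests executors in rounds. The first request is triggered when tasks have been pending for spark.dynamicAllocation.schedulerBacklogTimeout, and further requests are triggered every spark.dynamicAllocation.sustainedSchedulerBacklogTimeout for as long as the backlog of pending tasks persists. In addition, the number of executors requested in each round grows exponentially: 1, 2, 4, 8, and so on.

The motivation for exponential growth is twofold. First, an application should request executors cautiously at the beginning, in case only a few additional executors turn out to be enough. Second, the application should be able to ramp up its resource usage quickly when many executors are actually needed.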
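The exponential ramp-up can be illustrated with a small model. This is a simplified sketch and not Spark's actual implementation: the function name requestRounds is hypothetical, and the model assumes the backlog persists until the needed number of executors has been requested.

```scala
import scala.annotation.tailrec

// Returns the number of executors requested in each round until `needed`
// executors have been requested in total. Round sizes double each time
// (1, 2, 4, 8, ...) and the last round is capped at what is still missing.
def requestRounds(needed: Int): List[Int] = {
  @tailrec
  def loop(remaining: Int, nextSize: Int, acc: List[Int]): List[Int] =
    if (remaining <= 0) acc.reverse
    else {
      val ask = math.min(nextSize, remaining)
      loop(remaining - ask, nextSize * 2, ask :: acc)
    }
  loop(needed, 1, Nil)
}

println(requestRounds(20)) // prints List(1, 2, 4, 8, 5)
```

In this model, an application that needs 20 executors reaches its target in five rounds instead of twenty, while one that needs only 3 stops after two small requests.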

Remove Policy

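The policy for removing executors is much simpler. A Spark application removes an executor when it has been idle for more than spark.dynamicAllocation.executorIdleTimeout. Note that in most cases the remove policy and the request policy are mutually exclusive: an executor should not be idle while there are still pending tasks to schedule.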
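The idle-timeout rule can be sketched as a pure function. The ExecutorState type and the executorsToRemove function are hypothetical names introduced only for this illustration; Spark tracks this state internally in its own way.

```scala
// Hypothetical snapshot of one executor: how many tasks it is running and
// the time (in milliseconds) at which it last became idle.
final case class ExecutorState(id: String, runningTasks: Int, idleSinceMs: Long)

// An executor is a removal candidate when it runs no tasks and has been
// idle for longer than the configured timeout.
def executorsToRemove(
    executors: Seq[ExecutorState],
    nowMs: Long,
    idleTimeoutMs: Long): Seq[String] =
  executors.collect {
    case e if e.runningTasks == 0 && nowMs - e.idleSinceMs > idleTimeoutMs => e.id
  }

val snapshot = Seq(
  ExecutorState("exec-1", runningTasks = 0, idleSinceMs = 0L),     // idle for 70 s
  ExecutorState("exec-2", runningTasks = 2, idleSinceMs = 0L),     // busy
  ExecutorState("exec-3", runningTasks = 0, idleSinceMs = 50000L)  // idle for 20 s
)

// With an assumed 60 s timeout, only exec-1 qualifies.
println(executorsToRemove(snapshot, nowMs = 70000L, idleTimeoutMs = 60000L)) // prints List(exec-1)
```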

Graceful Decommission of Executors

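Without dynamic allocation, an executor exits either on failure or when its application exits. In both cases, all state associated with the executor is no longer needed and can be safely discarded. With dynamic allocation, however, the application is still running when an executor is removed. If the application then needs state that was stored in or written by that executor, it has to recompute it, which is clearly undesirable. Spark therefore needs a mechanism that preserves an executor's state before the executor is removed.

This requirement is especially important for shuffles. During a shuffle, an executor first writes its map outputs to local disk and then acts as the server for those files when other executors fetch them. With dynamic allocation, an executor may be removed before the shuffle has completed, which leads to unnecessary recomputation.

The solution is to preserve shuffle files with the external shuffle service, introduced in Spark 1.2. This service is a long-running process that runs on each node independently of Spark applications and their executors. When the service is enabled, executors fetch shuffle files from the service instead of from each other, so shuffle state written by an executor remains available even after that executor has been removed.

Besides shuffle files, executors also cache data on disk or in memory. When an executor is removed, however, all of its cached data becomes inaccessible. To mitigate this, executors holding cached data are never removed by default. You can change this with spark.dynamicAllocation.cachedExecutorIdleTimeout, after which such executors are released once they have been idle for the configured time. In future releases, cached data may be preserved in off-heap storage.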
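The two idle timeouts can be set side by side, as in the sketch below. The durations are assumptions picked for illustration; the point is only that executors holding cached data can be given a much longer grace period than ordinary idle executors.

```scala
import org.apache.spark.SparkConf

val timeoutConf = new SparkConf(false)                            // false: ignore spark.* system properties
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")        // idle executors without cached data
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "1800s") // idle executors holding cached data (30 minutes)

println(timeoutConf.toDebugString)
```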

Scheduling Within an Application

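Inside a Spark application (a SparkContext instance), multiple jobs can run in parallel if they are submitted from separate threads. A "job" in this section means a Spark action together with all the tasks that must run to compute its result. Spark's scheduler is fully thread-safe and supports this use case, which allows an application to serve multiple requests.

By default, Spark's scheduler runs jobs in FIFO order. Each job is divided into stages; the first job gets priority on all available resources, then the second, then the third, and so on. If the jobs at the head of the queue do not need the whole cluster, later jobs can start right away with the resources that remain, but if the jobs at the head of the queue are large, later jobs may be delayed significantly.

Starting with Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a round-robin fashion, so that all jobs get a roughly equal share of cluster resources. Short jobs submitted while a long job is running can therefore start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best suited to multi-user settings.

To enable the fair scheduler, set spark.scheduler.mode to FAIR when configuring the SparkContext: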

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)
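To see the effect of the scheduler mode, jobs have to be submitted from more than one thread. The following sketch does this with two Futures on a two-thread pool. It assumes a local[4] master and made-up job sizes, and the names demoConf, demoSc, longJob, and shortJob are chosen only for this example.

```scala
import java.util.concurrent.Executors
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

val demoConf = new SparkConf()
  .setMaster("local[4]")                 // assumption: local mode, for demonstration only
  .setAppName("fair-scheduling-demo")    // hypothetical application name
  .set("spark.scheduler.mode", "FAIR")
val demoSc = new SparkContext(demoConf)

// Two dedicated threads, so each action below is submitted as its own concurrent job.
val threadPool = Executors.newFixedThreadPool(2)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(threadPool)

val longJob  = Future { demoSc.parallelize(1 to 1000000, 8).map(_ * 2).count() }
val shortJob = Future { demoSc.parallelize(1 to 100, 2).count() }

// Under FAIR mode the short job does not have to wait for the long one to finish.
val shortResult = Await.result(shortJob, 5.minutes)
val longResult  = Await.result(longJob, 5.minutes)
println(s"short job counted $shortResult rows, long job counted $longResult rows")

threadPool.shutdown()
demoSc.stop()
```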

Fair Scheduler Pools

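The fair scheduler also supports grouping jobs into pools and setting different scheduling options (for example, weight) for each pool. This is useful for creating a high-priority pool for important jobs, or for grouping each user's jobs into one pool so that users share resources equally regardless of how many concurrent jobs each of them runs.

Jobs that do not specify a pool go into a default pool. A job's pool can be set by adding the spark.scheduler.pool local property to the SparkContext in the thread that submits the job: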

// Assuming sc is your SparkContext variable
sc.setLocalProperty("spark.scheduler.pool", "pool1")

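After spark.scheduler.pool is set, all jobs (Spark actions) submitted from this thread go into the corresponding pool. The setting is per thread, which makes it easy for one thread to run several jobs on behalf of the same user. To clear the pool associated with a thread, call: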

sc.setLocalProperty("spark.scheduler.pool", null)
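Because the property is attached to the submitting thread, it is easy to forget to reset it. One way to avoid that is a small wrapper that sets the pool, runs some code, and restores the previous value afterwards. The helper name withPool is hypothetical and not part of Spark's API; it only uses setLocalProperty and getLocalProperty.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper: runs `body` with the calling thread assigned to `pool`,
// then restores whatever pool (possibly none, i.e. null) the thread had before.
def withPool[T](sc: SparkContext, pool: String)(body: => T): T = {
  val previous = sc.getLocalProperty("spark.scheduler.pool")
  sc.setLocalProperty("spark.scheduler.pool", pool)
  try body
  finally sc.setLocalProperty("spark.scheduler.pool", previous)
}

// Usage sketch, assuming `sc` is an existing SparkContext and "pool1" is a configured pool:
//   val n = withPool(sc, "pool1") { sc.parallelize(1 to 10).count() }
```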

Default Behavior of Pools

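By default, every pool gets an equal share of the cluster, but jobs inside each pool run in FIFO order. For example, if you create one pool per user, every user gets an equal share of the cluster, and each user's queries run in order: a later query cannot take resources away from an earlier query of the same user that has not yet released them.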

Configuring Pool Properties

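Pool properties can be specified in a configuration file. Each pool supports three properties:

schedulingMode: either FIFO or FAIR. It controls whether jobs inside the pool run one after another or share the pool's resources fairly.

weight: controls the pool's share of the cluster relative to other pools. By default, all pools have a weight of 1. If you give a pool a weight of 2, for example, it gets twice the resources of the other active pools. Setting a very high weight, such as 1000, effectively implements priority between pools: the pool with weight 1000 always gets to launch its tasks first whenever it has active jobs.

minShare: apart from the weight, each pool can be given a minimum share, expressed as the number of CPU cores the administrator wants it to have. The fair scheduler always tries to satisfy the minimum shares of all active pools before redistributing the remaining resources according to the weights. minShare is therefore another way to make sure a pool can quickly obtain a certain amount of resources (for example, 10 cores) without giving it high priority over the rest of the cluster. By default, each pool's minShare is 0.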
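The interaction between minShare and weight can be illustrated with a small model. This is a simplified sketch of the rule described above, not Spark's scheduler code: the Pool type and the runsBefore and allocate functions are hypothetical, cores are handed out one at a time, and every pool is assumed to always have pending tasks.

```scala
// Hypothetical model of a pool: its settings plus the cores it currently holds.
final case class Pool(name: String, weight: Int, minShare: Int, running: Int = 0)

// True if pool `a` should be offered the next core before pool `b`.
// Pools below their minShare go first; otherwise the pool with the lower
// running/weight ratio goes first. Ties are broken by name.
def runsBefore(a: Pool, b: Pool): Boolean = {
  val aNeedy = a.running < a.minShare
  val bNeedy = b.running < b.minShare
  if (aNeedy != bNeedy) aNeedy
  else {
    def ratio(p: Pool): Double =
      if (aNeedy) p.running.toDouble / math.max(p.minShare, 1)
      else p.running.toDouble / p.weight
    val (ra, rb) = (ratio(a), ratio(b))
    if (ra != rb) ra < rb else a.name < b.name
  }
}

// Hands out `cores` one by one, always to the pool that sorts first.
def allocate(pools: Seq[Pool], cores: Int): Map[String, Int] = {
  val result = (1 to cores).foldLeft(pools) { (current, _) =>
    val next = current.sortWith(runsBefore).head
    current.map(p => if (p.name == next.name) p.copy(running = p.running + 1) else p)
  }
  result.map(p => p.name -> p.running).toMap
}

// Example pools with made-up settings.
val pools = Seq(Pool("production", weight = 1, minShare = 2), Pool("test", weight = 2, minShare = 3))
println(allocate(pools, 5))  // minimum shares only: production -> 2, test -> 3
println(allocate(pools, 12)) // remaining cores follow the 1:2 weights: production -> 4, test -> 8
```

With 5 cores, the model only manages to satisfy the two minimum shares. With 12 cores, the split ends up at 4 and 8, matching the 1:2 weight ratio.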

Pool properties are set in an XML file modeled on the conf/fairscheduler.xml.template template. To use it, set the spark.scheduler.allocation.file property in your SparkConf:

conf.set("spark.scheduler.allocation.file", "/path/to/file")

The format of the XML file is simply one <pool> element per pool, with child elements for the individual settings. For example:

<?xml version="1.0"?>
<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="test">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
  </pool>
</allocations>