Spark 1.4.1 Official Documentation Reading Series: Job Scheduling

KJ_Chen

Article link: https://blog.csdn.net/KJ_Chen/article/details/48339591

1. Overview

Job scheduling (JS for short) is how Spark schedules resources among different concurrent applications.

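2. Scheduling Across Applications

When multiple users need to share your cluster, the available allocation policies depend on the cluster manager.

The simplest option, available on all cluster managers, is static partitioning of resources. Each application is given a maximum amount of resources it can use and holds onto them for its whole lifetime.

》standalone mode: By default, submitted applications are allocated resources in FIFO (first-in, first-out) order, and each application tries to use all available nodes. You can limit the number of nodes an application uses by setting spark.cores.max, or change the default for applications that do not set it through spark.deploy.defaultCores. You can also control each application's memory use with spark.executor.memory.

》Mesos: omitted here.

》YARN: The --num-executors option of the Spark YARN client controls how many executors are allocated on the cluster, while --executor-memory and --executor-cores control the resources of each executor.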
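As a rough illustration of static partitioning in standalone mode, the sketch below caps an application's total cores and per-executor memory through SparkConf. The master URL, the application name and the values 8 and 2g are placeholders invented for this example, and the snippet assumes spark-core is on the classpath.

```scala
import org.apache.spark.SparkConf

object StaticPartitioningSketch {
  // Builds a configuration that statically caps what one application may use.
  // The master URL, app name and numeric values are illustrative assumptions.
  def buildConf(): SparkConf =
    new SparkConf()
      .setMaster("spark://master-host:7077") // hypothetical standalone master
      .setAppName("static-partitioning-sketch")
      .set("spark.cores.max", "8")           // total cores across the cluster
      .set("spark.executor.memory", "2g")    // memory per executor
}
```

On YARN the same kind of cap would instead be expressed with the --num-executors, --executor-memory and --executor-cores options mentioned above.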

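The second option is dynamic sharing of CPU cores, which is available on Mesos. In this mode each application still has a fixed and independent memory allocation (set by spark.executor.memory), but when an application is not running tasks, other applications may run tasks on its cores. The trade-off is less predictable latency, because it may take a while for an application to gain back cores when it has work to do. To use this mode, simply use a mesos:// URL without setting spark.mesos.coarse to true.

Note: none of the modes currently provide memory sharing across applications. If you want to share data this way, we recommend running a single server application that can serve multiple requests by querying the same RDDs. In future releases, in-memory storage systems such as Tachyon will provide another way to share RDDs.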

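2.1 Dynamic Resource Allocation

Spark 1.2 introduced the ability to dynamically scale the cluster resources allocated to an application up and down based on its workload. This means an application can give resources back to the cluster when they are no longer used and request them again later when there is demand. The feature is particularly useful when multiple applications share the resources of a Spark cluster. Dynamic allocation works at the granularity of the executor and is turned on through spark.dynamicAllocation.enabled.

This feature is currently available only on YARN. A future release will extend it to standalone mode and Mesos coarse-grained mode. Note that although Spark on Mesos already has a similar notion of dynamic resource sharing in fine-grained mode, enabling dynamic allocation allows your Mesos application to take advantage of coarse-grained low-latency scheduling while sharing cluster resources efficiently.

2.1.1 Configuration and Setup (quoted from the official documentation)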

All configurations used by this feature live under the spark.dynamicAllocation.* namespace. To enable this feature, your application must set spark.dynamicAllocation.enabled to true. Other relevant configurations are described on the configurations page and in the subsequent sections in detail.

Additionally, your application must use an external shuffle service. The purpose of the service is to preserve the shuffle files written by executors so the executors can be safely removed (more detail described below). To enable this service, set spark.shuffle.service.enabled to true. In YARN, this external shuffle service is implemented in org.apache.spark.network.yarn.YarnShuffleService, which runs in each NodeManager in your cluster. To start this service, follow these steps:

1. Build Spark with the YARN profile. Skip this step if you are using a pre-packaged distribution.
2. Locate the spark-<version>-yarn-shuffle.jar. This should be under $SPARK_HOME/network/yarn/target/scala-<version> if you are building Spark yourself, and under lib if you are using a distribution.
3. Add this jar to the classpath of all NodeManagers in your cluster.
4. In the yarn-site.xml on each node, add spark_shuffle to yarn.nodemanager.aux-services, then set yarn.nodemanager.aux-services.spark_shuffle.class to org.apache.spark.network.yarn.YarnShuffleService. Additionally, set all relevant spark.shuffle.service.* configurations.
5. Restart all NodeManagers in your cluster.
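Once the shuffle service from the steps above is running on every NodeManager, the application-side settings could look like the sketch below. The application name and the executor bounds are placeholders for illustration. spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors are optional properties from the Spark configuration page that this article does not discuss. The snippet assumes spark-core is on the classpath.

```scala
import org.apache.spark.SparkConf

object DynamicAllocationSketch {
  // Application-side settings only; the NodeManager-side shuffle service
  // must already be installed and running for them to take effect.
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("dynamic-allocation-sketch") // hypothetical name
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      // Optional bounds; the values are placeholders, not recommendations.
      .set("spark.dynamicAllocation.minExecutors", "1")
      .set("spark.dynamicAllocation.maxExecutors", "20")
}
```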

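2.2 Resource Allocation Policy

At a high level, Spark should relinquish executors when they are no longer used and acquire executors when they are needed. Because there is no definitive way to predict whether an executor that is about to be removed will run a task in the near future, or whether a newly added executor will soon sit idle, we need a set of heuristics to decide when to remove executors and when to request them.

2.2.1 Request Policy

A Spark application with dynamic allocation enabled requests additional executors when it has pending tasks waiting to be scheduled. This condition necessarily implies that the existing set of executors is insufficient to simultaneously saturate all tasks that have been submitted but not yet finished.

Spark requests executors in rounds. The first request is triggered once tasks have been pending for spark.dynamicAllocation.schedulerBacklogTimeout seconds, and it is triggered again every spark.dynamicAllocation.sustainedSchedulerBacklogTimeout seconds thereafter for as long as the queue of pending tasks persists. In addition, the number of executors requested in each round grows exponentially from the previous round. For example, an application adds 1 executor in the first round and then 2, 4, 8 and so on in the following rounds.

The motivation for exponential growth is twofold. First, an application should request executors cautiously at the beginning, in case it turns out that only a few additional executors are sufficient. This echoes the justification for TCP slow start. Second, the application should be able to ramp up its resource usage in a timely manner in case it turns out that many executors are actually needed.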
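The timing and doubling described above can be made concrete with a small simulation. This is a simplified model written for this article, not Spark's allocation code: it assumes the backlog never clears, ignores any upper bound on executors, and all names in it are hypothetical.

```scala
object RequestPolicySketch {
  // Returns (timeInSeconds, executorsRequestedThisRound) for each round,
  // assuming the task backlog persists for the whole simulated period.
  def requestRounds(backlogTimeout: Int, sustainedTimeout: Int, rounds: Int): Seq[(Int, Int)] = {
    require(rounds >= 0 && rounds < 31, "rounds must fit in an Int shift")
    (0 until rounds).map { i =>
      val time = backlogTimeout + i * sustainedTimeout // first round, then periodic
      val executors = 1 << i                           // 1, 2, 4, 8, ...
      (time, executors)
    }
  }

  def main(args: Array[String]): Unit =
    requestRounds(backlogTimeout = 5, sustainedTimeout = 5, rounds = 4).foreach {
      case (t, n) => println(s"t=${t}s: request $n more executor(s)")
    }
}
```

With both timeouts set to 5 seconds, four rounds request 1, 2, 4 and 8 executors at 5, 10, 15 and 20 seconds, for 15 executors in total.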

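2.2.2 Remove Policy

An executor is removed once it has been idle for spark.dynamicAllocation.executorIdleTimeout seconds. Note that in most circumstances the removal condition and the request condition are mutually exclusive, because an executor should not be idle while there are still pending tasks to be scheduled.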
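The removal rule amounts to a simple filter over the executors' idle times. The sketch below is a toy model with hypothetical names (ExecutorState, toRemove), not Spark's internal bookkeeping.

```scala
object RemovalPolicySketch {
  // Hypothetical snapshot of one executor: its id and how long it has been idle.
  final case class ExecutorState(id: String, idleSeconds: Long)

  // Executors whose idle time has reached the timeout are candidates for removal.
  def toRemove(executors: Seq[ExecutorState], idleTimeoutSeconds: Long): Seq[String] =
    executors.filter(_.idleSeconds >= idleTimeoutSeconds).map(_.id)
}
```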

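2.3 Graceful Decommission of Executors

Before dynamic allocation, a Spark executor exits either on failure or when the associated application has also exited. In both cases, all state associated with the executor is no longer needed and can be safely discarded. With dynamic allocation, however, the application is still running when an executor is removed. If the application tries to access state stored in or written by that executor, it has to recompute that state. Spark therefore needs a mechanism to decommission an executor gracefully by preserving its state before removing it.

This requirement is especially important for shuffles. During a shuffle, an executor first writes its own map outputs to local disk and then acts as the server for those files when other executors fetch them. When some tasks run much longer than others, dynamic allocation may remove an executor before the shuffle completes, in which case the shuffle files written by that executor must be recomputed unnecessarily.

The solution for preserving shuffle files is an external shuffle service. This is a long-running process that runs on each node of the cluster independently of Spark applications and their executors. If the service is enabled, executors fetch shuffle files from the service process. This means any shuffle state can continue to be served beyond the lifetime of the executor that wrote it.

In addition to writing shuffle files, executors also cache data on disk or in memory. When an executor is removed, however, all of its cached data is lost. There is currently no solution for this.

In future releases, cached data may also be preserved through off-heap storage.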

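3. Scheduling Within an Application

Inside a given Spark application (a SparkContext instance), multiple parallel jobs can run simultaneously if they are submitted from separate threads.

By default, Spark's scheduler runs jobs in FIFO order. Each job is divided into "stages" (for example, map and reduce phases). The first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, and so on. If the job at the head of the queue does not need the whole cluster, later jobs can start running right away. If the job at the head of the queue is large, however, later jobs may be delayed significantly.

Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a round-robin fashion, so that all jobs get a roughly equal share of cluster resources. This means a short job submitted while a long job is running can start receiving resources right away, without waiting for the long job to finish. This mode is best suited to multi-user settings. To enable it, set spark.scheduler.mode to FAIR when configuring the SparkContext: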

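import org.apache.spark.{SparkConf, SparkContext}

// Switch the in-application scheduler from the default FIFO to FAIR.
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)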

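3.1 Fair Scheduler Pools

The fair scheduler also supports grouping jobs into pools and setting different scheduling options (such as weight) for each pool. For example, you can create a high-priority pool for more important jobs. You can also put each user's jobs into one pool and give users equal shares regardless of how many concurrent jobs they have, instead of giving jobs equal shares. This approach is modeled after the Hadoop Fair Scheduler.

Without any intervention, newly submitted jobs go into a default pool. A job's pool can be chosen by setting the spark.scheduler.pool "local property" on the SparkContext in the thread that submits the job, as follows: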

// Assuming sc is your SparkContext variable
sc.setLocalProperty("spark.scheduler.pool", "pool1")

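After this local property is set, all jobs submitted from this thread use this pool name. To clear the pool that a thread is associated with, simply call: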

sc.setLocalProperty("spark.scheduler.pool", null)
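Because the pool is a per-thread property, a typical pattern is to set it at the start of each submitting thread and clear it afterwards. The sketch below assumes spark-core is on the classpath and runs in local mode. The pool names, the application name and the helper sumInPool are made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PoolPerThreadSketch {
  // Runs one job in the given pool from the calling thread and returns its
  // result; the pool association is cleared again even if the job fails.
  def sumInPool(sc: SparkContext, pool: String, n: Int): Long = {
    sc.setLocalProperty("spark.scheduler.pool", pool)
    try sc.parallelize(1 to n).map(_.toLong).reduce(_ + _)
    finally sc.setLocalProperty("spark.scheduler.pool", null)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[4]")                // local mode, for illustration only
      .setAppName("pool-per-thread-sketch") // hypothetical name
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)
    try {
      // One thread per pool: a large job and a small job submitted concurrently.
      val threads = Seq("pool1" -> 1000000, "pool2" -> 1000).map { case (pool, n) =>
        new Thread(new Runnable {
          override def run(): Unit =
            println(s"$pool: sum = ${sumInPool(sc, pool, n)}")
        })
      }
      threads.foreach(_.start())
      threads.foreach(_.join())
    } finally sc.stop()
  }
}
```

Neither pool is declared in an allocation file here, so both would run with the default pool settings.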

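3.2 Default Behavior of Pools

By default, each pool gets an equal share of the cluster (and each job in the default pool also gets an equal share), but inside each pool jobs run in FIFO order. For example, if you create one pool per user, each user gets an equal share of the cluster, and each user's queries run in order instead of later queries taking resources from that user's earlier ones.

3.3 Configuring Pool Properties

The properties of a specific pool can be modified through a configuration file. Each pool supports three properties:

Scheduling mode (schedulingMode): either FIFO or FAIR.

Weight (weight): This controls the pool's share of the cluster relative to other pools. By default, all pools have a weight of 1. If you give a specific pool a weight of 2, for example, it will get twice the resources of other active pools. Setting a high weight such as 1000 also makes it possible to implement priority between pools: in essence, the weight-1000 pool will always get to launch tasks first whenever it has active jobs.

Minimum share (minShare): Apart from an overall weight, each pool can be given a minimum share, expressed as a number of CPU cores. The fair scheduler always attempts to meet the minimum shares of all active pools before redistributing extra resources according to the weights. By default, each pool's minShare is 0.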
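To see how minShare and weight could interact, the following sketch orders pools by who should be offered resources next. It is a simplified model written for this article and should not be read as Spark's actual scheduling code. The PoolState type and the rule details are assumptions: pools below their minShare go first, and the remaining pools are ranked by running tasks per unit of weight. Weights are assumed to be at least 1.

```scala
object FairOrderingSketch {
  // Hypothetical view of a pool: configured properties plus tasks currently running.
  final case class PoolState(name: String, weight: Int, minShare: Int, runningTasks: Int)

  // Returns true if pool a should be offered resources before pool b.
  def before(a: PoolState, b: PoolState): Boolean = {
    val aNeedy = a.runningTasks < a.minShare
    val bNeedy = b.runningTasks < b.minShare
    if (aNeedy != bNeedy) aNeedy // a pool below its minShare goes first
    else {
      val (ra, rb) =
        if (aNeedy) // both below minShare: compare progress toward minShare
          (a.runningTasks.toDouble / math.max(a.minShare, 1),
           b.runningTasks.toDouble / math.max(b.minShare, 1))
        else        // both satisfied: compare running tasks per unit of weight
          (a.runningTasks.toDouble / a.weight, b.runningTasks.toDouble / b.weight)
      if (ra != rb) ra < rb else a.name < b.name // name breaks ties
    }
  }

  // Pool names in the order they would be offered resources.
  def order(pools: Seq[PoolState]): Seq[String] = pools.sortWith(before).map(_.name)
}
```

In this model a pool with weight 2 tolerates twice as many running tasks as a weight-1 pool before it loses its turn, which matches the "twice the resources" intuition above.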

The pool properties can be set by creating an XML file, similar to conf/fairscheduler.xml.template, and setting a spark.scheduler.allocation.file property in your SparkConf.

conf.set("spark.scheduler.allocation.file", "/path/to/file")

The format of the XML file is simply a <pool> element for each pool, with different elements within it for the various settings. For example:

<?xml version="1.0"?>
<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="test">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
  </pool>
</allocations>

A full example is also available in conf/fairscheduler.xml.template. Note that any pools not configured in the XML file will simply get default values for all settings (scheduling mode FIFO, weight 1, and minShare 0).