Spark Job Scheduling

  • Overview
  • Scheduling Across Applications
    • Dynamic Resource Allocation
      • Configuration and Setup
      • Resource Allocation Policy
        • Request Policy
        • Remove Policy
      • Graceful Decommission of Executors
  • Scheduling Within an Application
    • Fair Scheduler Pools
    • Default Behavior of Pools
    • Configuring Pool Properties

Overview

Spark has several facilities for scheduling resources between computations. First, recall that, as described in the cluster mode overview, each Spark application (instance of SparkContext) runs an independent set of executor processes. The cluster managers that Spark runs on provide facilities for scheduling across applications. Second, within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads. This is common if your application is serving requests over the network. Spark includes a fair scheduler to schedule resources within each SparkContext.

Scheduling Across Applications

When running on a cluster, each Spark application gets an independent set of executor JVMs that only run tasks and store data for that application. If multiple users need to share your cluster, there are different options to manage allocation, depending on the cluster manager.

The simplest option, available on all cluster managers, is static partitioning of resources. With this approach, each application is given a maximum amount of resources it can use, and holds onto them for its whole duration. This is the approach used in Spark’s standalone and YARN modes, as well as the coarse-grained Mesos mode. Resource allocation can be configured as follows, based on the cluster type:

Standalone mode: By default, applications submitted to the standalone mode cluster will run in FIFO (first-in-first-out) order, and each application will try to use all available nodes. You can limit the number of nodes an application uses by setting the spark.cores.max configuration property in it, or change the default for applications that don’t set this setting through spark.deploy.defaultCores. Finally, in addition to controlling cores, each application’s spark.executor.memory setting controls its memory use.

Mesos: To use static partitioning on Mesos, set the spark.mesos.coarse configuration property to true, and optionally set spark.cores.max to limit each application’s resource share as in the standalone mode. You should also set spark.executor.memory to control the executor memory.

YARN: The --num-executors option to the Spark YARN client controls how many executors it will allocate on the cluster (spark.executor.instances as configuration property), while --executor-memory (spark.executor.memory configuration property) and --executor-cores (spark.executor.cores configuration property) control the resources per executor. For more information, see the YARN Spark Properties.
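
These static-partitioning settings are all ordinary Spark configuration properties, so they can also be set programmatically instead of (or in addition to) command-line flags. A minimal sketch; the values below are illustrative, not recommended defaults:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("static-partitioning-example")
  .set("spark.cores.max", "8")           // standalone / coarse-grained Mesos: cap on total cores
  .set("spark.executor.memory", "2g")    // memory per executor
  .set("spark.executor.instances", "4")  // YARN: equivalent of --num-executors
  .set("spark.executor.cores", "2")      // YARN: equivalent of --executor-cores

In practice these values are more commonly supplied through spark-submit options or conf/spark-defaults.conf than hard-coded in the application.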

A second option available on Mesos is dynamic sharing of CPU cores. In this mode, each Spark application still has a fixed and independent memory allocation (set by spark.executor.memory), but when the application is not running tasks on a machine, other applications may run tasks on those cores. This mode is useful when you expect large numbers of not overly active applications, such as shell sessions from separate users. However, it comes with a risk of less predictable latency, because it may take a while for an application to gain back cores on one node when it has work to do. To use this mode, simply use a mesos:// URL and set spark.mesos.coarse to false.
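
As a sketch, selecting this mode only requires a mesos:// master URL (a placeholder below) and spark.mesos.coarse set to false:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("mesos://host:5050")        // placeholder Mesos master URL
  .setAppName("mesos-dynamic-cores")
  .set("spark.mesos.coarse", "false")    // share CPU cores dynamically across applications
  .set("spark.executor.memory", "2g")    // memory allocation stays fixed per application
val sc = new SparkContext(conf)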

Note that none of the modes currently provide memory sharing across applications. If you would like to share data this way, we recommend running a single server application that can serve multiple requests by querying the same RDDs.

Dynamic Resource Allocation

Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.

This feature is disabled by default and available on all coarse-grained cluster managers, i.e. standalone mode, YARN mode, and Mesos coarse-grained mode.

Configuration and Setup

There are two requirements for using this feature. First, your application must set spark.dynamicAllocation.enabled to true. Second, you must set up an external shuffle service on each worker node in the same cluster and set spark.shuffle.service.enabled to true in your application. The purpose of the external shuffle service is to allow executors to be removed without deleting shuffle files written by them (more detail described below). The way to set up this service varies across cluster managers:
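
As a sketch, the two required settings look like this in application code; the executor bounds are optional, commonly-used companions with purely illustrative values:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("dynamic-allocation-example")
  .set("spark.dynamicAllocation.enabled", "true")    // requirement 1
  .set("spark.shuffle.service.enabled", "true")      // requirement 2: external shuffle service
  .set("spark.dynamicAllocation.minExecutors", "1")  // optional lower bound (example value)
  .set("spark.dynamicAllocation.maxExecutors", "20") // optional upper bound (example value)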

In standalone mode, simply start your workers with spark.shuffle.service.enabled set to true.

In Mesos coarse-grained mode, run $SPARK_HOME/sbin/start-mesos-shuffle-service.sh on all slave nodes with spark.shuffle.service.enabled set to true. For instance, you may do so through Marathon.

In YARN mode, follow the instructions here.

All other relevant configurations are optional and under the spark.dynamicAllocation.* and spark.shuffle.service.* namespaces. For more detail, see the configurations page.

Resource Allocation Policy

At a high level, Spark should relinquish executors when they are no longer used and acquire executors when they are needed. Since there is no definitive way to predict whether an executor that is about to be removed will run a task in the near future, or whether a new executor that is about to be added will actually be idle, we need a set of heuristics to determine when to remove and request executors.

Request Policy

A Spark application with dynamic allocation enabled requests additional executors when it has pending tasks waiting to be scheduled. This condition necessarily implies that the existing set of executors is insufficient to simultaneously saturate all tasks that have been submitted but not yet finished.

Spark requests executors in rounds. The actual request is triggered when there have been pending tasks for spark.dynamicAllocation.schedulerBacklogTimeout seconds, and then triggered again every spark.dynamicAllocation.sustainedSchedulerBacklogTimeout seconds thereafter if the queue of pending tasks persists. Additionally, the number of executors requested in each round increases exponentially from the previous round. For instance, an application will add 1 executor in the first round, and then 2, 4, 8 and so on executors in the subsequent rounds.
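
Both timeouts are ordinary configuration properties. A brief sketch, assuming conf is the SparkConf from an earlier example and using example values rather than the defaults:

// Trigger the first request after 1s of backlog, and follow-up requests every 1s thereafter.
conf.set("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
conf.set("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "1s")
// While the backlog persists, successive rounds request 1, 2, 4, 8, ... executors,
// capped by spark.dynamicAllocation.maxExecutors if it is set.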

The motivation for an exponential increase policy is twofold. First, an application should request executors cautiously in the beginning in case it turns out that only a few additional executors are sufficient. This echoes the justification for TCP slow start. Second, the application should be able to ramp up its resource usage in a timely manner in case it turns out that many executors are actually needed.

Remove Policy

The policy for removing executors is much simpler. A Spark application removes an executor when it has been idle for more than spark.dynamicAllocation.executorIdleTimeout seconds. Note that, under most circumstances, this condition is mutually exclusive with the request condition, in that an executor should not be idle if there are still pending tasks to be scheduled.
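
The idle timeout is likewise a plain configuration property; for example (assuming conf is your SparkConf, and using an example value):

// Remove an executor once it has been idle for this long.
conf.set("spark.dynamicAllocation.executorIdleTimeout", "60s")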

Graceful Decommission of Executors

Before dynamic allocation, a Spark executor exits either on failure or when the associated application has also exited. In both scenarios, all state associated with the executor is no longer needed and can be safely discarded. With dynamic allocation, however, the application is still running when an executor is explicitly removed. If the application attempts to access state stored in or written by the executor, it will have to recompute that state. Thus, Spark needs a mechanism to decommission an executor gracefully by preserving its state before removing it.

This requirement is especially important for shuffles. During a shuffle, the Spark executor first writes its own map outputs locally to disk, and then acts as the server for those files when other executors attempt to fetch them. In the event of stragglers, which are tasks that run for much longer than their peers, dynamic allocation may remove an executor before the shuffle completes, in which case the shuffle files written by that executor must be recomputed unnecessarily.

The solution for preserving shuffle files is to use an external shuffle service, also introduced in Spark 1.2. This service refers to a long-running process that runs on each node of your cluster independently of your Spark applications and their executors. If the service is enabled, Spark executors will fetch shuffle files from the service instead of from each other. This means any shuffle state written by an executor may continue to be served beyond the executor’s lifetime.

In addition to writing shuffle files, executors also cache data either on disk or in memory. When an executor is removed, however, all cached data will no longer be accessible. To mitigate this, by default executors containing cached data are never removed. You can configure this behavior with spark.dynamicAllocation.cachedExecutorIdleTimeout. In future releases, the cached data may be preserved through an off-heap storage similar in spirit to how shuffle files are preserved through the external shuffle service.
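
For example, to allow idle executors holding cached data to be reclaimed eventually (assuming conf is your SparkConf; the value is illustrative):

// Without this setting, executors with cached blocks are never removed by dynamic allocation.
conf.set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "30min")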

Scheduling Within an Application

Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
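
For instance, here is a sketch of two threads submitting independent actions against the same SparkContext (sc is assumed to exist already, and the paths are hypothetical):

// Assuming sc is your SparkContext variable
val logs = sc.textFile("hdfs:///path/to/logs")

val t1 = new Thread(new Runnable {
  def run(): Unit = {
    // Job 1: triggered by the count() action
    println("error lines: " + logs.filter(_.contains("ERROR")).count())
  }
})
val t2 = new Thread(new Runnable {
  def run(): Unit = {
    // Job 2: an independent action submitted concurrently from another thread
    logs.map(_.length).saveAsTextFile("hdfs:///path/to/lengths")
  }
})
t1.start(); t2.start()
t1.join(); t2.join()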

By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.

Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.

To enable the fair scheduler, simply set the spark.scheduler.mode property to FAIR when configuring a SparkContext:

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

Fair Scheduler Pools

The fair scheduler also supports grouping jobs into pools, and setting different scheduling options (e.g. weight) for each pool. This can be useful to create a “high-priority” pool for more important jobs, for example, or to group the jobs of each user together and give users equal shares regardless of how many concurrent jobs they have instead of giving jobs equal shares. This approach is modeled after the Hadoop Fair Scheduler.

Without any intervention, newly submitted jobs go into a default pool, but jobs’ pools can be set by adding the spark.scheduler.pool “local property” to the SparkContext in the thread that’s submitting them. This is done as follows:

// Assuming sc is your SparkContext variable
sc.setLocalProperty("spark.scheduler.pool", "pool1")

After setting this local property, all jobs submitted within this thread (by calls in this thread to RDD.save, count, collect, etc.) will use this pool name. The setting is per-thread to make it easy to have a thread run multiple jobs on behalf of the same user. If you’d like to clear the pool that a thread is associated with, simply call:

sc.setLocalProperty("spark.scheduler.pool", null)

Default Behavior of Pools

By default, each pool gets an equal share of the cluster (also equal in share to each job in the default pool), but inside each pool, jobs run in FIFO order. For example, if you create one pool per user, this means that each user will get an equal share of the cluster, and that each user’s queries will run in order instead of later queries taking resources from that user’s earlier ones.
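
A sketch of that per-user-pool pattern, assuming each user’s requests are handled on a dedicated thread (the pool naming scheme is hypothetical; pools not listed in the allocation file simply use the default settings):

// Call this on the thread that handles a given user's requests.
def runJobsForUser(sc: org.apache.spark.SparkContext, user: String): Unit = {
  sc.setLocalProperty("spark.scheduler.pool", "pool_" + user) // hypothetical per-user pool name
  // ... submit this user's jobs here ...
  sc.setLocalProperty("spark.scheduler.pool", null)           // detach the thread from the pool
}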

Configuring Pool Properties

Specific pools’ properties can also be modified through a configuration file. Each pool supports three properties:

  • schedulingMode: This can be FIFO or FAIR, to control whether jobs within the pool queue up behind each other (the default) or share the pool’s resources fairly.

  • weight: This controls the pool’s share of the cluster relative to other pools. By default, all pools have a weight of 1. If you give a specific pool a weight of 2, for example, it will get 2x more resources than other active pools. Setting a high weight such as 1000 also makes it possible to implement priority between pools; in essence, the weight-1000 pool will always get to launch tasks first whenever it has jobs active.

  • minShare: Apart from an overall weight, each pool can be given a minimum share (as a number of CPU cores) that the administrator would like it to have. The fair scheduler always attempts to meet all active pools’ minimum shares before redistributing extra resources according to the weights. The minShare property can therefore be another way to ensure that a pool can always get up to a certain number of resources (e.g. 10 cores) quickly without giving it a high priority for the rest of the cluster. By default, each pool’s minShare is 0.

The pool properties can be set by creating an XML file, similar to conf/fairscheduler.xml.template, and setting a spark.scheduler.allocation.file property in your SparkConf.

conf.set("spark.scheduler.allocation.file", "/path/to/file")

The format of the XML file is simply a <pool> element for each pool, with different elements within it for the various settings. For example:

<?xml version="1.0"?>
<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="test">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
  </pool>
</allocations>

A full example is also available in conf/fairscheduler.xml.template. Note that any pools not configured in the XML file will simply get default values for all settings (scheduling mode FIFO, weight 1, and minShare 0).

Source: http://spark.apache.org/docs/latest/job-scheduling.html


