Running Spark on YARN
Support for running on YARN (Hadoop NextGen) was added to Spark in version 0.6.0, and improved in subsequent releases.
Launching Spark on YARN
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager. The configuration contained in this directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration. If the configuration references Java system properties or environment variables not managed by YARN, they should also be set in the Spark application’s configuration (driver, executors, and the AM when running in client mode).
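As a quick sketch of this setup step (the path below is an assumption; use wherever your distribution keeps core-site.xml and yarn-site.xml):

```shell
# Point the launcher at the Hadoop client-side configs before submitting.
# /etc/hadoop/conf is an assumed path -- substitute your own.
export HADOOP_CONF_DIR=/etc/hadoop/conf
echo "Hadoop client configs: $HADOOP_CONF_DIR"
```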
There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
Unlike Spark standalone and Mesos modes, in which the master’s address is specified in the --master parameter, in YARN mode the ResourceManager’s address is picked up from the Hadoop configuration. Thus, the --master parameter is yarn.
To launch a Spark application in cluster mode:
$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]
For example:
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    lib/spark-examples*.jar \
    10
The above starts a YARN client program which starts the default Application Master. Then SparkPi will be run as a child thread of Application Master. The client will periodically poll the Application Master for status updates and display them in the console. The client will exit once your application has finished running. Refer to the “Debugging your Application” section below for how to see driver and executor logs.
To launch a Spark application in client mode, do the same, but replace cluster with client. The following shows how you can run spark-shell in client mode:
$ ./bin/spark-shell --master yarn --deploy-mode client
Adding Other JARs
In cluster mode, the driver runs on a different machine than the client, so SparkContext.addJar won’t work out of the box with files that are local to the client. To make files on the client available to SparkContext.addJar, include them with the --jars option in the launch command.

$ ./bin/spark-submit --class my.main.Class \
--master yarn \
--deploy-mode cluster \
--jars my-other-jar.jar,my-other-other-jar.jar \
my-main-jar.jar \
app_arg1 app_arg2
Preparations
Running Spark on YARN requires a binary distribution of Spark which is built with YARN support. Binary distributions can be downloaded from the downloads page of the project website. To build Spark yourself, refer to Building Spark.
To make Spark runtime jars accessible from the YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
Configuration
Debugging your Application
When log aggregation is turned on (with the yarn.log-aggregation-enable config), container logs are copied to HDFS and deleted on the local machine. These logs can be viewed from anywhere on the cluster with the yarn logs command.
yarn logs -applicationId <app ID>

will print out the contents of all log files from all containers from the given application. You can also view the container log files directly in HDFS using the HDFS shell or API. The directory where they are located can be found by looking at your YARN configs (yarn.nodemanager.remote-app-log-dir and yarn.nodemanager.remote-app-log-dir-suffix). The logs are also available on the Spark Web UI under the Executors Tab. You need to have both the Spark history server and the MapReduce history server running and configure yarn.log.server.url in yarn-site.xml properly. The log URL on the Spark history server UI will redirect you to the MapReduce history server to show the aggregated logs.
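As a sketch of the yarn-site.xml piece of that setup (the host is an assumption; 19888 is the MapReduce JobHistory server's usual web UI port):

```xml
<!-- yarn-site.xml: send container-log links to the MapReduce history server -->
<property>
  <name>yarn.log.server.url</name>
  <value>http://historyserver.example.com:19888/jobhistory/logs</value>
</property>
```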
When log aggregation isn’t turned on, logs are retained locally on each machine under YARN_APP_LOGS_DIR, which is usually configured to /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version and installation. Viewing logs for a container requires going to the host that contains them and looking in this directory. Subdirectories organize log files by application ID and container ID. The logs are also available on the Spark Web UI under the Executors Tab, and this does not require running the MapReduce history server.
To review per-container launch environment, increase yarn.nodemanager.delete.debug-delay-sec to a large value (e.g. 36000), and then access the application cache through yarn.nodemanager.local-dirs on the nodes on which containers are launched. This directory contains the launch script, JARs, and all environment variables used for launching each container. This process is useful for debugging classpath problems in particular. (Note that enabling this requires admin privileges on cluster settings and a restart of all node managers. Thus, this is not applicable to hosted clusters.)
To use a custom log4j configuration for the application master or executors, here are the options:
- upload a custom log4j.properties using spark-submit, by adding it to the --files list of files to be uploaded with the application.
- add -Dlog4j.configuration=<location of configuration file> to spark.driver.extraJavaOptions (for the driver) or spark.executor.extraJavaOptions (for executors). Note that if using a file, the file: protocol should be explicitly provided, and the file needs to exist locally on all the nodes.
- update the $SPARK_CONF_DIR/log4j.properties file and it will be automatically uploaded along with the other configurations. Note that the other two options have higher priority than this one if multiple options are specified.
Note that for the first option, both executors and the application master will share the same log4j configuration, which may cause issues when they run on the same node (e.g. trying to write to the same log file).
If you need a reference to the proper location to put log files in YARN so that YARN can properly display and aggregate them, use spark.yarn.app.container.log.dir in your log4j.properties. For example, log4j.appender.file_appender.File=${spark.yarn.app.container.log.dir}/spark.log. For streaming applications, configuring RollingFileAppender and setting the file location to YARN’s log directory will avoid disk overflow caused by large log files, and logs can be accessed using YARN’s log utility.
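A minimal log4j.properties sketch along those lines (log4j 1.x syntax; the appender name, sizes, and pattern are assumptions, not mandated values):

```properties
log4j.rootLogger=INFO, file_appender
log4j.appender.file_appender=org.apache.log4j.RollingFileAppender
# Write into YARN's container log dir so `yarn logs` can aggregate the file.
log4j.appender.file_appender.File=${spark.yarn.app.container.log.dir}/spark.log
log4j.appender.file_appender.MaxFileSize=50MB
log4j.appender.file_appender.MaxBackupIndex=5
log4j.appender.file_appender.layout=org.apache.log4j.PatternLayout
log4j.appender.file_appender.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```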
To use a custom metrics.properties for the application master and executors, update the $SPARK_CONF_DIR/metrics.properties file. It will automatically be uploaded with other configurations, so you don’t need to specify it manually with --files.