Running Spark on YARN
Support for running on YARN (Hadoop NextGen) was added to Spark in version 0.6.0, and improved in subsequent releases.
Launching Spark on YARN
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager. The configuration contained in this directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration. If the configuration references Java system properties or environment variables not managed by YARN, they should also be set in the Spark application’s configuration (driver, executors, and the AM when running in client mode).
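As a quick sketch of this setup step (the path below is an assumption; use wherever your distribution keeps core-site.xml and yarn-site.xml):

```shell
# Point the launcher at the Hadoop client-side configs before submitting.
# /etc/hadoop/conf is an assumed path -- substitute your own.
export HADOOP_CONF_DIR=/etc/hadoop/conf
echo "Hadoop client configs: $HADOOP_CONF_DIR"
```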
There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
Unlike Spark standalone and Mesos modes, in which the master’s address is specified in the --master parameter, in YARN mode the ResourceManager’s address is picked up from the Hadoop configuration. Thus, the --master parameter is yarn.
To launch a Spark application in cluster mode:
$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]
For example:
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    lib/spark-examples*.jar \
    10
The above starts a YARN client program which starts the default Application Master. Then SparkPi will be run as a child thread of Application Master. The client will periodically poll the Application Master for status updates and display them in the console. The client will exit once your application has finished running. Refer to the “Debugging your Application” section below for how to see driver and executor logs.
To launch a Spark application in client mode, do the same, but replace cluster with client. The following shows how you can run spark-shell in client mode:
$ ./bin/spark-shell --master yarn --deploy-mode client
Adding Other JARs
In cluster mode, the driver runs on a different machine than the client, so SparkContext.addJar won’t work out of the box with files that are local to the client. To make files on the client available to SparkContext.addJar, include them with the --jars option in the launch command.

$ ./bin/spark-submit --class my.main.Class \
--master yarn \
--deploy-mode cluster \
--jars my-other-jar.jar,my-other-other-jar.jar \
my-main-jar.jar \
app_arg1 app_arg2
Preparations
Running Spark on YARN requires a binary distribution of Spark which is built with YARN support. Binary distributions can be downloaded from the downloads page of the project website. To build Spark yourself, refer to Building Spark.
To make Spark runtime jars accessible from the YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
Configuration
Debugging your Application
When log aggregation is turned on (with the yarn.log-aggregation-enable config), container logs are copied to HDFS and deleted on the local machine. These logs can be viewed from anywhere on the cluster with the yarn logs command.
yarn logs -applicationId <app ID>

will print out the contents of all log files from all containers from the given application. You can also view the container log files directly in HDFS using the HDFS shell or API. The directory where they are located can be found by looking at your YARN configs (yarn.nodemanager.remote-app-log-dir and yarn.nodemanager.remote-app-log-dir-suffix). The logs are also available on the Spark Web UI under the Executors Tab. You need to have both the Spark history server and the MapReduce history server running and configure yarn.log.server.url in yarn-site.xml properly. The log URL on the Spark history server UI will redirect you to the MapReduce history server to show the aggregated logs.
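As a sketch of the yarn-site.xml piece of that setup (the host is an assumption; 19888 is the MapReduce JobHistory server's usual web UI port):

```xml
<!-- yarn-site.xml: send container-log links to the MapReduce history server -->
<property>
  <name>yarn.log.server.url</name>
  <value>http://historyserver.example.com:19888/jobhistory/logs</value>
</property>
```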
When log aggregation isn’t turned on, logs are retained locally on each machine under YARN_APP_LOGS_DIR, which is usually configured to /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version and installation. Viewing logs for a container requires going to the host that contains them and looking in this directory. Subdirectories organize log files by application ID and container ID. The logs are also available on the Spark Web UI under the Executors Tab, and this does not require running the MapReduce history server.
To review per-container launch environment, increase yarn.nodemanager.delete.debug-delay-sec to a large value (e.g. 36000), and then access the application cache through yarn.nodemanager.local-dirs on the nodes on which containers are launched. This directory contains the launch script, JARs, and all environment variables used for launching each container. This process is useful for debugging classpath problems in particular. (Note that enabling this requires admin privileges on cluster settings and a restart of all node managers. Thus, this is not applicable to hosted clusters.)
To use a custom log4j configuration for the application master or executors, here are the options:
- upload a custom log4j.properties using spark-submit, by adding it to the --files list of files to be uploaded with the application.
- add -Dlog4j.configuration=<location of configuration file> to spark.driver.extraJavaOptions (for the driver) or spark.executor.extraJavaOptions (for executors). Note that if using a file, the file: protocol should be explicitly provided, and the file needs to exist locally on all the nodes.
- update the $SPARK_CONF_DIR/log4j.properties file and it will be automatically uploaded along with the other configurations. Note that the other two options have higher priority than this one if multiple options are specified.
Note that for the first option, both executors and the application master will share the same log4j configuration, which may cause issues when they run on the same node (e.g. trying to write to the same log file).
If you need a reference to the proper location to put log files in YARN so that YARN can properly display and aggregate them, use spark.yarn.app.container.log.dir in your log4j.properties. For example, log4j.appender.file_appender.File=${spark.yarn.app.container.log.dir}/spark.log. For streaming applications, configuring RollingFileAppender and setting the file location to YARN’s log directory will avoid disk overflow caused by large log files, and logs can be accessed using YARN’s log utility.
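A minimal log4j.properties sketch along those lines (log4j 1.x syntax; the appender name, sizes, and pattern are assumptions, not mandated values):

```properties
log4j.rootLogger=INFO, file_appender
log4j.appender.file_appender=org.apache.log4j.RollingFileAppender
# Write into YARN's container log dir so `yarn logs` can aggregate the file.
log4j.appender.file_appender.File=${spark.yarn.app.container.log.dir}/spark.log
log4j.appender.file_appender.MaxFileSize=50MB
log4j.appender.file_appender.MaxBackupIndex=5
log4j.appender.file_appender.layout=org.apache.log4j.PatternLayout
log4j.appender.file_appender.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```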
To use a custom metrics.properties for the application master and executors, update the $SPARK_CONF_DIR/metrics.properties file. It will automatically be uploaded with other configurations, so you don’t need to specify it manually with --files.