[Spark] Running Spark on YARN

Contents

Running Spark on YARN

Security

Launching Spark on YARN

Adding Other JARs

Preparations

Configuration

Debugging your Application

Spark Properties

Available patterns for SHS custom executor log URL

Resource Allocation and Configuration Overview

Stage Level Scheduling Overview

Important notes

Kerberos

YARN-specific Kerberos Configuration

Troubleshooting Kerberos

Configuring the External Shuffle Service

Launching your application with Apache Oozie

Using the Spark History Server to replace the Spark Web UI

Running multiple versions of the Spark Shuffle Service

Running Spark on YARN


Support for running on YARN (Hadoop NextGen) was added to Spark in version 0.6.0, and improved in subsequent releases.

Security

Security features like authentication are not enabled by default. When deploying a cluster that is open to the internet or an untrusted network, it’s important to secure access to the cluster to prevent unauthorized applications from running on the cluster. Please see Spark Security and the specific security sections in this doc before running Spark.

Launching Spark on YARN

Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster.

These configs are used to write to HDFS and connect to the YARN ResourceManager. The configuration contained in this directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration. If the configuration references Java system properties or environment variables not managed by YARN, they should also be set in the Spark application’s configuration (driver, executors, and the AM when running in client mode).
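
For example, a minimal sketch (the path is an assumption; point it at wherever your cluster's client-side *-site.xml files live):

$ export HADOOP_CONF_DIR=/etc/hadoop/conf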

There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

Unlike other cluster managers supported by Spark in which the master’s address is specified in the --master parameter, in YARN mode the ResourceManager’s address is picked up from the Hadoop configuration. Thus, the --master parameter is yarn.

To launch a Spark application in cluster mode:

$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]

For example:

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    examples/jars/spark-examples*.jar \
    10

The above starts a YARN client program which starts the default Application Master. Then SparkPi will be run as a child thread of Application Master. The client will periodically poll the Application Master for status updates and display them in the console. The client will exit once your application has finished running. Refer to the Debugging your Application section below for how to see driver and executor logs.

To launch a Spark application in client mode, do the same, but replace cluster with client. The following shows how you can run spark-shell in client mode:

$ ./bin/spark-shell --master yarn --deploy-mode client

Adding Other JARs

In cluster mode, the driver runs on a different machine than the client, so SparkContext.addJar won’t work out of the box with files that are local to the client. To make files on the client available to SparkContext.addJar, include them with the --jars option in the launch command.

$ ./bin/spark-submit --class my.main.Class \
    --master yarn \
    --deploy-mode cluster \
    --jars my-other-jar.jar,my-other-other-jar.jar \
    my-main-jar.jar \
    app_arg1 app_arg2

Preparations

Running Spark on YARN requires a binary distribution of Spark which is built with YARN support. Binary distributions can be downloaded from the downloads page of the project website. There are two variants of Spark binary distributions you can download. One is pre-built with a certain version of Apache Hadoop; this Spark distribution contains a built-in Hadoop runtime, so we call it the with-hadoop Spark distribution. The other is pre-built with user-provided Hadoop; since this Spark distribution doesn’t contain a built-in Hadoop runtime, it’s smaller, but users have to provide a Hadoop installation separately. We call this variant the no-hadoop Spark distribution.

For the with-hadoop Spark distribution, since it already contains a built-in Hadoop runtime, by default, when a job is submitted to a Hadoop YARN cluster, it will not populate YARN’s classpath into Spark, to prevent jar conflicts. To override this behavior, you can set spark.yarn.populateHadoopClasspath=true. For the no-hadoop Spark distribution, Spark will populate YARN’s classpath by default in order to get the Hadoop runtime. For the with-hadoop Spark distribution, if your application depends on a certain library that is only available in the cluster, you can try to populate the YARN classpath by setting the property mentioned above. If you run into jar conflict issues by doing so, you will need to turn it off and include this library in your application jar.
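
For example, a minimal sketch of opting a with-hadoop distribution back into YARN’s classpath (my.main.Class and my-main-jar.jar are placeholders, as in the earlier examples):

$ ./bin/spark-submit --class my.main.Class \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.yarn.populateHadoopClasspath=true \
    my-main-jar.jar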

To build Spark yourself, refer to Building Spark.

To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
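
For example, a minimal sketch of building and publishing such an archive yourself (the HDFS path is an assumption):

$ jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
$ hdfs dfs -mkdir -p /spark/jars
$ hdfs dfs -put spark-libs.jar /spark/jars/
$ ./bin/spark-submit --master yarn \
    --deploy-mode cluster \
    --conf spark.yarn.archive=hdfs:///spark/jars/spark-libs.jar \
    --class org.apache.spark.examples.SparkPi \
    examples/jars/spark-examples*.jar \
    10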

Configuration

Most of the configs are the same for Spark on YARN as for other deployment modes. See the configuration page for more information on those. These are configs that are specific to Spark on YARN.

Debugging your Application

In YARN terminology, executors and application masters run inside “containers”. YARN has two modes for handling container logs after an application has completed. If log aggregation is turned on (with the yarn.log-aggregation-enable config), container logs are copied to HDFS and deleted on the local machine. These logs can be viewed from anywhere on the cluster with the yarn logs command.

yarn logs -applicationId <app ID>

will print out the contents of all log files from all containers from the given application. You can also view the container log files directly in HDFS using the HDFS shell or API. The directory where they are located can be found by looking at your YARN configs (yarn.nodemanager.remote-app-log-dir and yarn.nodemanager.remote-app-log-dir-suffix). The logs are also available on the Spark Web UI under the Executors Tab. You need to have both the Spark history server and the MapReduce history server running and configure yarn.log.server.url in yarn-site.xml properly. The log URL on the Spark history server UI will redirect you to the MapReduce history server to show the aggregated logs.
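
For example, assuming the default values of those two configs (the exact layout varies across Hadoop versions and installations), the aggregated logs for an application typically live under a path like:

$ hdfs dfs -ls /tmp/logs/<user>/logs/<app ID>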

When log aggregation isn’t turned on, logs are retained locally on each machine under YARN_APP_LOGS_DIR, which is usually configured to /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version and installation. Viewing logs for a container requires going to the host that contains them and looking in this directory. Subdirectories organize log files by application ID and container ID. The logs are also available on the Spark Web UI under the Executors Tab and do not require running the MapReduce history server.
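
For example, a sketch assuming an installation where logs live under $HADOOP_HOME/logs/userlogs:

$ ls $HADOOP_HOME/logs/userlogs/<app ID>/<container ID>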

To review per-container launch environment, increase yarn.nodemanager.delete.debug-delay-sec to a large value (e.g. 36000), and then access the application cache through yarn.nodemanager.local-dirs on the nodes on which containers are launched. This directory contains the launch script, JARs, and all environment variables used for launching each container. This process is useful for debugging classpath problems in particular. (Note that enabling this requires admin privileges on cluster settings and a restart of all node managers. Thus, this is not applicable to hosted clusters).
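
For example, a sketch of where a container’s launch script and localized JARs end up (substitute the actual value of yarn.nodemanager.local-dirs on your nodes):

$ ls <yarn.nodemanager.local-dirs>/usercache/<user>/appcache/<app ID>/<container ID>/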

To use a custom log4j configuration for the application master or executors, here are the options:

  • upload a custom log4j.properties using spark-submit, by adding it to the --files list of files to be uploaded with the application.
  • add -Dlog4j.configuration=<location of configuration file> to spark.driver.extraJavaOptions (for the driver) or spark.executor.extraJavaOptions (for executors). Note that if using a file, the file: protocol should be explicitly provided, and the file needs to exist locally on all the nodes. A sketch combining both options is shown after this list.
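
For example, a minimal sketch combining both options in cluster mode (my.main.Class, my-main-jar.jar, and the local path to log4j.properties are placeholders); because --files places the uploaded file in each container’s working directory, it can be referenced there by its bare name:

$ ./bin/spark-submit --class my.main.Class \
    --master yarn \
    --deploy-mode cluster \
    --files /path/to/log4j.properties \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
    my-main-jar.jar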