[Spark] Running Spark on YARN

Contents

Running Spark on YARN

Security

Launching Spark on YARN

Adding Other JARs

Preparations

Configuration

Debugging your Application

Spark Properties

Available patterns for SHS custom executor log URL

Resource Allocation and Configuration Overview

Stage Level Scheduling Overview

Important notes

Kerberos

YARN-specific Kerberos Configuration

Troubleshooting Kerberos

Configuring the External Shuffle Service

Launching your application with Apache Oozie

Using the Spark History Server to replace the Spark Web UI

Running multiple versions of the Spark Shuffle Service

Running Spark on YARN

Support for running on YARN (Hadoop NextGen) was added to Spark in version 0.6.0, and improved in subsequent releases.

Security

Security features like authentication are not enabled by default. When deploying a cluster that is open to the internet or an untrusted network, it’s important to secure access to the cluster to prevent unauthorized applications from running on the cluster. Please see Spark Security and the specific security sections in this doc before running Spark.

Launching Spark on YARN

Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager. The configuration contained in this directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration. If the configuration references Java system properties or environment variables not managed by YARN, they should also be set in the Spark application’s configuration (driver, executors, and the AM when running in client mode).
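For example (the path below is illustrative; use the directory where your cluster’s client-side *-site.xml files actually live):

$ export HADOOP_CONF_DIR=/etc/hadoop/conf   # directory containing core-site.xml, yarn-site.xml, etc.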

There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

Unlike other cluster managers supported by Spark in which the master’s address is specified in the --master parameter, in YARN mode the ResourceManager’s address is picked up from the Hadoop configuration. Thus, the --master parameter is yarn.

To launch a Spark application in cluster mode:

$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]

For example:

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    examples/jars/spark-examples*.jar \
    10

The above starts a YARN client program which starts the default Application Master. Then SparkPi will be run as a child thread of Application Master. The client will periodically poll the Application Master for status updates and display them in the console. The client will exit once your application has finished running. Refer to the Debugging your Application section below for how to see driver and executor logs.

To launch a Spark application in client mode, do the same, but replace cluster with client. The following shows how you can run spark-shell in client mode:

$ ./bin/spark-shell --master yarn --deploy-mode client

Adding Other JARs

In cluster mode, the driver runs on a different machine than the client, so SparkContext.addJar won’t work out of the box with files that are local to the client. To make files on the client available to SparkContext.addJar, include them with the --jars option in the launch command.

$ ./bin/spark-submit --class my.main.Class \
    --master yarn \
    --deploy-mode cluster \
    --jars my-other-jar.jar,my-other-other-jar.jar \
    my-main-jar.jar \
    app_arg1 app_arg2

Preparations

Running Spark on YARN requires a binary distribution of Spark which is built with YARN support. Binary distributions can be downloaded from the downloads page of the project website. There are two variants of Spark binary distributions you can download. One is pre-built with a certain version of Apache Hadoop; this Spark distribution contains a built-in Hadoop runtime, so we call it the with-hadoop Spark distribution. The other is pre-built with user-provided Hadoop; since this Spark distribution doesn’t contain a built-in Hadoop runtime, it’s smaller, but users have to provide a Hadoop installation separately. We call this variant the no-hadoop Spark distribution.

For the with-hadoop Spark distribution, since it already contains a built-in Hadoop runtime, by default, when a job is submitted to a Hadoop YARN cluster, it will not populate YARN’s classpath into Spark, to prevent jar conflicts. To override this behavior, you can set spark.yarn.populateHadoopClasspath=true. For the no-hadoop Spark distribution, Spark will populate YARN’s classpath by default in order to get the Hadoop runtime. For the with-hadoop Spark distribution, if your application depends on a certain library that is only available in the cluster, you can try to populate the YARN classpath by setting the property mentioned above. If you run into jar conflict issues by doing so, you will need to turn it off and include the library in your application jar.
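For instance, a with-hadoop distribution that needs a cluster-only library could be submitted with the property set at launch time (a sketch; whether this is safe depends on your cluster’s jars):

$ ./bin/spark-submit --conf spark.yarn.populateHadoopClasspath=true [options] <app jar> [app options]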

To build Spark yourself, refer to Building Spark.

To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
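As a sketch of one common setup (archive name and HDFS path are illustrative), you can build an uncompressed archive of the runtime jars once, stage it on HDFS, and reference it at submit time:

$ jar cv0f spark-libs.jar -C "$SPARK_HOME/jars" .
$ hadoop fs -mkdir -p /spark-archive
$ hadoop fs -put spark-libs.jar /spark-archive/
$ ./bin/spark-submit --conf spark.yarn.archive=hdfs:///spark-archive/spark-libs.jar [options] <app jar> [app options]

Staging the jars once this way avoids re-uploading them from the client on every submission.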

Configuration

Most of the configs are the same for Spark on YARN as for other deployment modes. See the configuration page for more information on those. These are configs that are specific to Spark on YARN.

Debugging your Application

In YARN terminology, executors and application masters run inside “containers”. YARN has two modes for handling container logs after an application has completed. If log aggregation is turned on (with the yarn.log-aggregation-enable config), container logs are copied to HDFS and deleted on the local machine. These logs can be viewed from anywhere on the cluster with the yarn logs command.

yarn logs -applicationId <app ID>

will print out the contents of all log files from all containers from the given application. You can also view the container log files directly in HDFS using the HDFS shell or API. The directory where they are located can be found by looking at your YARN configs (yarn.nodemanager.remote-app-log-dir and yarn.nodemanager.remote-app-log-dir-suffix). The logs are also available on the Spark Web UI under the Executors Tab. You need to have both the Spark history server and the MapReduce history server running and configure yarn.log.server.url in yarn-site.xml properly. The log URL on the Spark history server UI will redirect you to the MapReduce history server to show the aggregated logs.
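A minimal sketch of that yarn-site.xml entry (the hostname is illustrative; 19888 is the default MapReduce JobHistory web UI port):

<property>
  <name>yarn.log.server.url</name>
  <value>http://mr-history.example.com:19888/jobhistory/logs</value>
</property>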
