Spark 2.4.8: Submitting Applications

Most recent recommended article published 2023-09-23 13:14:24

大怀特

Category: bigdata. Tags: spark, big data

Original source: http://spark.apache.org/docs/2.4.8/submitting-applications.html

Submitting Applications
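Bundling Your Application's Dependencies
Launching Applications with spark-submit
Master URLs
Loading Configuration from a File
Advanced Dependency Management
More Information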

Submitting Applications

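The spark-submit script in Spark's bin directory is used to launch applications on a cluster. It can use all of Spark's supported cluster managers through a uniform interface, so you don't have to configure your application specially for each one.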

Bundling Your Application's Dependencies

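If your code depends on other projects, you need to package them alongside your application so that the code can be distributed to a Spark cluster. To do this, create an assembly jar (or "uber" jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled, since the cluster manager provides them at runtime. Once you have an assembled jar, you can call the bin/spark-submit script as shown here while passing your jar.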

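For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files, we recommend packaging them into a .zip or .egg.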
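As a rough sketch of how the --py-files argument is put together: it takes a single comma-separated list. The file names below (deps.zip, util.py, main.py) and the helper join_with_commas are hypothetical, and the command is printed rather than executed so the sketch does not need a Spark installation.

```shell
# Hypothetical names: deps.zip is a zipped package, util.py a loose helper module,
# main.py the application entry point.

# Join all arguments into one comma-separated string, as --py-files expects.
join_with_commas() {
  joined=""
  for item in "$@"; do
    if [ -z "$joined" ]; then
      joined="$item"
    else
      joined="$joined,$item"
    fi
  done
  printf '%s\n' "$joined"
}

py_files=$(join_with_commas deps.zip util.py)

# Print the command instead of running it.
echo ./bin/spark-submit --master "local[2]" --py-files "$py_files" main.py
```

With many modules, bundling them into one archive first keeps this list short, which is the approach the paragraph above recommends.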

Launching Applications with spark-submit

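Once a user application is bundled, it can be launched using the bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and it supports the different cluster managers and deploy modes that Spark supports: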

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

一些常用的使用选项:

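--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
--conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap "key=value" in quotes.
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes.
application-arguments: Arguments passed to the main method of your main class, if any.

A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. the master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process, which acts as a client to the cluster. The input and output of the application are attached to the console. This mode is therefore especially suitable for applications that involve a REPL (e.g. the Spark shell).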
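The quoting rule for --conf listed above can be checked with a small sketch. The helper count_args is hypothetical and only counts the arguments it receives; spark.executor.extraJavaOptions is used as an example of a property whose value contains a space.

```shell
# Hypothetical helper: report how many arguments the shell passed in.
count_args() { echo "$#"; }

# Quoted: the whole key=value pair, space included, stays one argument after --conf.
quoted=$(count_args --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")

# Unquoted: the shell splits at the space, so the second JVM flag
# becomes a separate argument and is no longer part of the --conf value.
unquoted=$(count_args --conf spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps)

echo "quoted: $quoted arguments, unquoted: $unquoted arguments"
```

The quoted form yields two arguments (the flag and its value), while the unquoted form yields three.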

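Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the driver and the executors. Currently, standalone mode does not support cluster mode for Python applications.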
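The standalone restriction for Python applications could be encoded in a wrapper script along these lines. This is only an illustrative sketch: pick_deploy_mode is a hypothetical helper, not part of Spark, and it assumes Python applications are recognized by a .py suffix.

```shell
# Hypothetical helper, not part of Spark.
# $1 = master URL, $2 = application file, $3 = preferred deploy mode (client or cluster)
pick_deploy_mode() {
  master=${1:-}; app=${2:-}; preferred=${3:-client}
  case "$master" in
    spark://*)
      case "$app" in
        # Python apps cannot run in cluster mode on a standalone cluster.
        *.py) echo client; return 0 ;;
      esac
      ;;
  esac
  echo "$preferred"
}

pick_deploy_mode spark://207.184.161.138:7077 examples/src/main/python/pi.py cluster
pick_deploy_mode spark://207.184.161.138:7077 /path/to/examples.jar cluster
```

The first call falls back to client, while the second keeps the requested cluster mode.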

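For Python applications, simply pass a .py file in place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.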

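A few options are specific to the cluster manager being used. For example, with a Spark standalone cluster in cluster deploy mode, you can also specify --supervise to make sure that the driver is automatically restarted if it fails with a non-zero exit code. To enumerate all such options available to spark-submit, run it with --help. Here are a few examples of common options: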

# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster (use --deploy-mode client for client mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000

# Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000

# Run on a Kubernetes cluster in cluster deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://xx.yy.zz.ww:443 \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  http://path/to/examples.jar \
  1000

Master URLs

The master URL passed to Spark can be in one of the following formats:

Master URL	Meaning
local	Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K]	Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[K,F]	Run Spark locally with K worker threads and F maxFailures (see spark.task.maxFailures for an explanation of this variable)
local[*]	Run Spark locally with as many worker threads as logical cores on your machine.
local[*,F]	Run Spark locally with as many worker threads as logical cores on your machine and F maxFailures.
spark://HOST:PORT	Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.
spark://HOST1:PORT1,HOST2:PORT2	Connect to the given Spark standalone cluster with standby masters with Zookeeper. The list must have all the master hosts in the high availability cluster set up with Zookeeper. The port must be whichever each master is configured to use, which is 7077 by default.
mesos://HOST:PORT	Connect to the given Mesos cluster. The port must be whichever one your cluster is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://… To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.
yarn	Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
k8s://HOST:PORT	Connect to a Kubernetes cluster in cluster mode. Client mode is currently unsupported and will be supported in future releases. The HOST and PORT refer to the Kubernetes API Server. It connects using TLS by default. In order to force it to use an unsecured connection, you can use k8s://http://HOST:PORT.
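
To make the default ports in the table concrete, here is a minimal sketch of a helper that builds a master URL and falls back to 7077 for standalone and 5050 for Mesos when no port is given. The function master_url and its "kind" names are hypothetical and not part of Spark.

```shell
# Hypothetical helper, not part of Spark.
# $1 = cluster kind (standalone, mesos, yarn, local), $2 = host, $3 = optional port
master_url() {
  kind=${1:-}; host=${2:-}; port=${3:-}
  case "$kind" in
    standalone) echo "spark://$host:${port:-7077}" ;;   # 7077 is the standalone default
    mesos)      echo "mesos://$host:${port:-5050}" ;;   # 5050 is the Mesos default
    yarn)       echo "yarn" ;;                          # location comes from HADOOP_CONF_DIR or YARN_CONF_DIR
    local)      echo "local[*]" ;;                      # one worker thread per logical core
    *)          echo "unknown cluster kind: $kind" >&2; return 1 ;;
  esac
}

master_url standalone 207.184.161.138
master_url mesos 207.184.161.138 5051
```

The result can then be passed to spark-submit as the value of --master.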

Loading Configuration from a File

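The spark-submit script can load default Spark configuration values from a properties file and pass them on to your application. By default, it reads options from conf/spark-defaults.conf in the Spark directory. For more detail, see the section on loading default configurations.

Loading default Spark configurations this way can obviate the need for certain flags to spark-submit. For instance, if the spark.master property is set, you can safely omit the --master flag from spark-submit. In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.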

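If you are ever unclear where configuration options are coming from, you can print out fine-grained debugging information by running spark-submit with the --verbose option.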
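
The precedence order described in this section (SparkConf first, then spark-submit flags, then the defaults file) can be mimicked with a short sketch. This is an illustration of the ordering only, not how Spark resolves settings internally; resolve_setting is a hypothetical helper, and an empty string stands for "not set".

```shell
# Hypothetical helper that mirrors the documented precedence order.
# $1 = value set on SparkConf, $2 = value from a spark-submit flag,
# $3 = value from conf/spark-defaults.conf. Pass "" for "not set".
resolve_setting() {
  from_sparkconf=${1:-}; from_flag=${2:-}; from_defaults=${3:-}
  # Take the first non-empty value, highest precedence first.
  echo "${from_sparkconf:-${from_flag:-$from_defaults}}"
}

# Only the defaults file sets spark.master: the default is used.
resolve_setting "" "" "spark://HOST:7077"

# A --master flag overrides the defaults file.
resolve_setting "" "yarn" "spark://HOST:7077"

# A value set on SparkConf overrides both.
resolve_setting "local[4]" "yarn" "spark://HOST:7077"
```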

Advanced Dependency Management

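When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. URLs supplied after --jars must be separated by commas. That list is included in the driver and executor classpaths. Directory expansion does not work with --jars.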

Spark uses the following URL schemes to allow different strategies for disseminating jars:

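file: - Absolute paths and file:/ URIs are served by the driver's HTTP file server, and every executor pulls the file from the driver HTTP server.
hdfs:, http:, https:, ftp: - These pull down files and JARs from the URI as expected.
local: - A URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and it works well for large files/JARs that are pushed to each worker or shared via NFS, GlusterFS, etc.

Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes. This can use up a significant amount of space over time and will need to be cleaned up. With YARN, cleanup is handled automatically; with Spark standalone, automatic cleanup can be configured with the spark.worker.cleanup.appDataTtl property.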
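
The scheme list above can be summarized as a small lookup. The sketch below is illustrative only: describe_jar_uri is a hypothetical helper, and the example paths are made up.

```shell
# Hypothetical helper: report how a --jars entry would be delivered,
# following the URL schemes described above.
describe_jar_uri() {
  case "${1:-}" in
    local:*)
      echo "read from each worker node's local filesystem, no network IO" ;;
    hdfs:*|http:*|https:*|ftp:*)
      echo "pulled down from the URI" ;;
    file:*|/*)
      echo "served by the driver's HTTP file server" ;;
    *)
      echo "not covered by this sketch" ;;
  esac
}

# Hypothetical example paths.
describe_jar_uri /opt/libs/helper.jar
describe_jar_uri hdfs://namenode/libs/helper.jar
describe_jar_uri local:/opt/libs/helper.jar
```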

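Users may also include any other dependencies by supplying a comma-delimited list of Maven coordinates with --packages. All transitive dependencies are handled when using this option. Additional repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the --repositories flag. (Note that credentials for password-protected repositories can be supplied in some cases in the repository URI, such as in https://user:password@host/…. Be careful when supplying credentials this way.) These options can be used with pyspark, spark-shell, and spark-submit to include Spark Packages.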
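
A minimal sketch of what a --packages and --repositories invocation might look like. The coordinates follow the groupId:artifactId:version form; the repository URL, main class com.example.Main, and jar path are hypothetical, and the command is printed rather than executed.

```shell
# Comma-delimited Maven coordinates (groupId:artifactId:version).
packages="org.apache.spark:spark-avro_2.11:2.4.8,org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.8"

# Hypothetical additional repository; avoid embedding credentials in the URL where possible.
repositories="https://repo.example.com/maven"

# Build the command as a string and print it instead of running it.
# com.example.Main and /path/to/app.jar are placeholders.
submit_cmd="./bin/spark-submit --class com.example.Main --packages $packages --repositories $repositories /path/to/app.jar"
printf '%s\n' "$submit_cmd"
```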

For Python, the equivalent --py-files option can be used to distribute .egg, .zip and .py libraries to executors.