Spark 1.6.0 Official Documentation Translation 02: spark-submit script

乖猪宝贝

http://spark.apache.org/docs/latest/

Submitting Applications

The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application specially for each one.

In short: the spark-submit script submits Spark applications to a Spark cluster.

Bundling Your Application’s Dependencies

If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime. Once you have an assembled jar you can call the bin/spark-submit script as shown here while passing your jar.

For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.

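If your own code depends on other projects, you must package those projects together with your Spark program so that it can be submitted to and run on the Spark cluster. The best approach is to build a single jar that contains both your Spark code and the code it depends on. Both SBT and Maven provide this capability.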
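
For the Python case described above, one way to produce a .zip suitable for --py-files is the standard-library zipfile module. This is only a minimal sketch, not something taken from the Spark documentation: the function name bundle_py_files, the helpers package and the deps.zip file name are hypothetical and chosen for illustration.

```python
import os
import tempfile
import zipfile


def bundle_py_files(source_dir, zip_path):
    """Pack every .py file under source_dir into zip_path, keeping relative paths."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                if name.endswith(".py"):
                    full_path = os.path.join(root, name)
                    # Store paths relative to source_dir so imports resolve from the zip root.
                    zf.write(full_path, os.path.relpath(full_path, source_dir))
    return zip_path


# Demo with a throwaway directory tree (hypothetical package named "helpers").
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src")
    os.makedirs(os.path.join(src, "helpers"))
    for rel in ("helpers/__init__.py", "helpers/stats.py"):
        with open(os.path.join(src, rel), "w") as f:
            f.write("# placeholder module\n")
    archive = bundle_py_files(src, os.path.join(tmp, "deps.zip"))
    with zipfile.ZipFile(archive) as zf:
        bundled = sorted(zf.namelist())

print(bundled)
```

The resulting archive could then be passed to spark-submit as --py-files deps.zip next to the main .py file.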

Launching Applications with spark-submit

Once a user application is bundled, it can be launched using the bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and can support different cluster managers and deploy modes that Spark supports:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Some of the commonly used options are:

--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi), i.e. the class that contains the main method.
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077).
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client) †. In cluster mode, one of the Spark worker nodes is chosen to run the driver; in client mode, the driver runs on the machine that executes spark-submit.
--conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap “key=value” in quotes.
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes. In other words, every machine in the Spark cluster must be able to access it.
application-arguments: Arguments passed to the main method of your main class, if any. Supply them after application-jar.
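
To make the ordering of these options concrete, the sketch below assembles a spark-submit argument list in Python without running it. It is an illustration under assumptions: build_spark_submit is a hypothetical helper (not a Spark API), and the spark.executor.extraJavaOptions value is only there to show a --conf value that contains spaces.

```python
import shlex


def build_spark_submit(app, main_class=None, master=None, deploy_mode=None,
                       conf=None, app_args=()):
    """Return the argument list for a spark-submit call (the command is not executed)."""
    cmd = ["./bin/spark-submit"]
    if main_class:
        cmd += ["--class", main_class]
    if master:
        cmd += ["--master", master]
    if deploy_mode:
        cmd += ["--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        # Each property becomes one key=value argument, even if the value has spaces.
        cmd += ["--conf", f"{key}={value}"]
    # The application jar (or .py file) comes after all options ...
    cmd.append(app)
    # ... and everything after it is passed to the application's main method.
    cmd += [str(arg) for arg in app_args]
    return cmd


cmd = build_spark_submit(
    "/path/to/examples.jar",
    main_class="org.apache.spark.examples.SparkPi",
    master="local[8]",
    conf={"spark.executor.extraJavaOptions": "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"},
    app_args=[100],
)

# shlex.quote adds the shell quoting needed for values that contain spaces.
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the list form to a process launcher avoids manual quoting; the printed string shows roughly what would be typed in a shell.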

† A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster. The input and output of the application is attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).

Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for Mesos clusters. Currently only YARN supports cluster mode for Python applications.

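These two paragraphs mean the following: if the machine submitting the application is one of the machines in the cluster (especially the master node), choose client deploy mode. If it is your own machine (one that is not part of the Spark cluster), cluster deploy mode is usually the better choice. The goal is to place the driver inside the cluster as far as possible, which reduces the amount of data transferred over the network.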

For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.

There are a few options available that are specific to the cluster manager that is being used. For example, with a Spark standalone cluster with cluster deploy mode, you can also specify --supervise to make sure that the driver is automatically restarted if it fails with non-zero exit code. To enumerate all such options available to spark-submit, run it with --help. Here are a few examples of common options:

# Run application locally on 8 cores (parallelism of 8)
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster (use --deploy-mode client for client mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000

# Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000

Master URLs (master configuration)

The master URL passed to Spark can be in one of the following formats:

Master URL	Meaning
`local`	Run Spark locally with one worker thread (i.e. no parallelism at all).
`local[K]`	Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
`local[*]`	Run Spark locally with as many worker threads as logical cores on your machine.
`spark://HOST:PORT`	Connect to the given Spark standalone cluster master. HOST is the IP address or host name of the master. The port must be whichever one your master is configured to use, which is 7077 by default.
`mesos://HOST:PORT`	Connect to the given Mesos cluster. The port must be whichever one your Mesos master is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use `mesos://zk://...`. To submit with `--deploy-mode cluster`, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.
`yarn`	Connect to a YARN cluster in `client` or `cluster` mode depending on the value of `--deploy-mode`. The cluster location will be found based on the `HADOOP_CONF_DIR` or `YARN_CONF_DIR` variable.
`yarn-client`	Equivalent to `yarn` with `--deploy-mode client`, which is preferred to `yarn-client`.
`yarn-cluster`	Equivalent to `yarn` with `--deploy-mode cluster`, which is preferred to `yarn-cluster`.
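
The table can be read as a small set of pattern rules. The sketch below classifies a master URL string accordingly; it is an illustrative reading of the table and not Spark's own parsing code, and describe_master is a hypothetical function name.

```python
import re


def describe_master(url):
    """Map a master URL to (cluster manager, detail) following the table above."""
    if url == "local":
        return ("local", 1)  # a single worker thread
    match = re.fullmatch(r"local\[(\*|\d+)\]", url)
    if match:
        threads = match.group(1)
        return ("local", "all logical cores" if threads == "*" else int(threads))
    if url.startswith("spark://"):
        return ("standalone", url[len("spark://"):])  # HOST:PORT of the master
    if url.startswith("mesos://"):
        return ("mesos", url[len("mesos://"):])  # HOST:PORT or zk://...
    if url == "yarn":
        return ("yarn", "from --deploy-mode")
    if url in ("yarn-client", "yarn-cluster"):
        return ("yarn", url.split("-", 1)[1])  # deploy mode encoded in the URL
    raise ValueError(f"unrecognized master URL: {url!r}")


for example in ("local", "local[8]", "local[*]",
                "spark://207.184.161.138:7077", "yarn-cluster"):
    print(example, "->", describe_master(example))
```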

Loading Configuration from a File

The spark-submit script can load default Spark configuration values from a properties file and pass them on to your application. By default it will read options from conf/spark-defaults.conf in the Spark directory. For more detail, see the section on loading default configurations.

Loading default Spark configurations this way can obviate the need for certain flags to spark-submit. For instance, if the spark.master property is set, you can safely omit the --master flag from spark-submit. In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.

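spark-submit can load configuration properties from a configuration file, accept them explicitly on the command line, or take them from values set on a SparkConf (through the set method of SparkConf in the application source). The default file is conf/spark-defaults.conf.

Loading properties from a file avoids retyping the same settings on every submission, at the cost of some flexibility (setting them on a SparkConf has the same drawback). In general, values set on a SparkConf have the highest precedence and override the corresponding values from the other sources, flags passed to spark-submit come next, and the configuration file conf/spark-defaults.conf has the lowest precedence.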
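
The precedence order can be pictured as three dictionaries merged from lowest to highest priority. This is a simplified model of the rule stated above, not how Spark implements it; effective_conf and all property values are hypothetical.

```python
def effective_conf(defaults_file, submit_flags, spark_conf):
    """Merge the three configuration sources; later updates win."""
    merged = dict(defaults_file)   # lowest precedence: conf/spark-defaults.conf
    merged.update(submit_flags)    # next: flags passed to spark-submit
    merged.update(spark_conf)      # highest: values set on SparkConf in the code
    return merged


# Hypothetical values for illustration only.
defaults_file = {
    "spark.master": "spark://207.184.161.138:7077",
    "spark.executor.memory": "2g",
}
submit_flags = {"spark.executor.memory": "20g"}
spark_conf = {"spark.master": "local[8]"}

resolved = effective_conf(defaults_file, submit_flags, spark_conf)
for key in sorted(resolved):
    print(key, "=", resolved[key])
```

Here the executor memory comes from the spark-submit flag and the master from SparkConf, while nothing from the defaults file survives because both of its keys were overridden.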

If you are ever unclear where configuration options are coming from, you can print out fine-grained debugging information by running spark-submit with the --verbose option.

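In short: when it is unclear where a configuration value comes from, add --verbose to the spark-submit command to print fine-grained debugging information.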

Advanced Dependency Management

When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. Spark uses the following URL scheme to allow different strategies for disseminating jars:

file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.
hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected
local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
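
The three cases above depend only on the URI scheme, which the sketch below extracts with urllib.parse. It merely restates the list as code; distribution_strategy is a hypothetical name, the example paths are made up, and nothing here reflects Spark internals.

```python
from urllib.parse import urlparse


def distribution_strategy(path):
    """Return how a jar would be made available to executors, per the list above."""
    scheme = urlparse(path).scheme
    if scheme in ("", "file"):
        # Absolute paths and file:/ URIs are served by the driver's HTTP file server.
        return "served by driver"
    if scheme in ("hdfs", "http", "https", "ftp"):
        # Executors pull the file from the given URI.
        return "pulled from URI"
    if scheme == "local":
        # The file is expected to exist on every worker node already.
        return "already on each worker"
    raise ValueError(f"unsupported scheme in {path!r}")


for jar in ("/opt/libs/a.jar", "file:/opt/libs/a.jar",
            "hdfs://namenode:8020/libs/b.jar", "local:/opt/libs/big.jar"):
    print(jar, "->", distribution_strategy(jar))
```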

Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes. This can use up a significant amount of space over time and will need to be cleaned up. With YARN, cleanup is handled automatically, and with Spark standalone, automatic cleanup can be configured with the spark.worker.cleanup.appDataTtl property.

Users may also include any other dependencies by supplying a comma-delimited list of maven coordinates with --packages. All transitive dependencies will be handled when using this command. Additional repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the flag --repositories. These commands can be used with pyspark, spark-shell, and spark-submit to include Spark Packages.
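
As a rough illustration of the comma-delimited form, the sketch below builds the --packages and --repositories arguments from Python lists. It assumes coordinates follow the usual Maven groupId:artifactId:version shape, which the text above does not spell out; packages_args, the coordinates and the repository URL are all hypothetical.

```python
def packages_args(coordinates, repositories=()):
    """Build --packages / --repositories arguments from lists of strings."""
    for coord in coordinates:
        parts = coord.split(":")
        # Assumed shape: groupId:artifactId:version, with no empty component.
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected groupId:artifactId:version, got {coord!r}")
    args = ["--packages", ",".join(coordinates)]
    if repositories:
        args += ["--repositories", ",".join(repositories)]
    return args


# Hypothetical coordinates and repository, for illustration only.
pkg_args = packages_args(
    ["com.example:my-lib_2.10:1.0.0", "org.example:other-lib:2.3.1"],
    repositories=["https://repo.example.com/maven"],
)
print(pkg_args)
```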

For Python, the equivalent --py-files option can be used to distribute .egg, .zip and .py libraries to executors.

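When you use spark-submit, the application jar and the jars listed after --jars are automatically distributed to the relevant nodes in the cluster. Spark accepts the following schemes for the values passed to --jars: file, hdfs, http, https, ftp and local.

file: written as an absolute path or a file:/ URI. These jars are served by the driver's HTTP file server, and each executor pulls the jar from the driver.

hdfs, http, https, ftp: give the URL of the jar on the HDFS, HTTP(S) or FTP server.

local: the jars already exist locally on each worker node, for example when they are shared via NFS. The benefit is that the files do not have to travel over the network, which avoids network IO and matters most when the files are large.

For Python, use --py-files to distribute the corresponding files to the executors.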

More Information

Once you have deployed your application, the cluster mode overview describes the components involved in distributed execution, and how to monitor and debug applications.