spark-submit

# Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \  # can be client for client mode
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

Some of the commonly used options are:

  • --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
  • --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
  • --deploy-mode: Whether to launch the driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
  • --conf: Arbitrary Spark configuration property in "key=value" format (a client-mode example using it follows this list)
  • application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes.
  • application-arguments: Arguments passed to the main method of your main class, if any
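
For example, here is a hedged sketch of combining these options in client mode; the memory, executor count, and configuration values below are placeholders, not recommendations:

# Run the same example on YARN in client mode (illustrative values)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  --conf spark.eventLog.enabled=false \
  --executor-memory 4G \
  --num-executors 10 \
  /path/to/examples.jar \
  100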


  • yarn - Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR configuration files. (The two equivalent submissions after this list illustrate the preferred form.)
  • yarn-client - Equivalent to yarn with --deploy-mode client, which is preferred to `yarn-client`
  • yarn-cluster - Equivalent to yarn with --deploy-mode cluster, which is preferred to `yarn-cluster`
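
As a quick sketch of that equivalence (reusing the example jar from above), the two submissions below target the same YARN cluster; the first is the preferred form, the second the older shorthand:

# Preferred: yarn master plus an explicit deploy mode
./bin/spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi /path/to/examples.jar 1000

# Older shorthand, kept for backwards compatibility
./bin/spark-submit --master yarn-cluster \
  --class org.apache.spark.examples.SparkPi /path/to/examples.jar 1000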

 In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.

So even if --master is omitted from your spark-submit command it can still run, and even --deploy-mode has a default value; any setting not supplied on the command line is read from conf/spark-defaults.conf.
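
To make that precedence concrete, here is a small sketch (all values illustrative): assume conf/spark-defaults.conf contains the lines below, and the job is then submitted with a --master flag. The flag overrides the file, while untouched keys such as spark.executor.memory still come from the file; anything set explicitly on a SparkConf in the application code would win over both.

# conf/spark-defaults.conf (illustrative)
spark.master            spark://host.example.com:7077
spark.executor.memory   2g

# --master overrides the defaults file, so this run goes to YARN;
# spark.executor.memory is still taken from the file
./bin/spark-submit --master yarn \
  --class org.apache.spark.examples.SparkPi /path/to/examples.jar 1000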


More advanced settings (dependency management):

When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. Spark uses the following URL schemes to allow different strategies for disseminating jars (a combined example follows the list):

  • file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.
  • hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected
  • local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
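
A hedged sketch mixing these schemes in one --jars list; the application class and jar paths are hypothetical and only show where each URI is resolved:

# Illustrative mix of URI schemes on --jars (all paths hypothetical)
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://host.example.com:7077 \
  --jars local:/opt/libs/heavy-dep.jar,hdfs:///libs/shared-dep.jar,/tmp/small-dep.jar \
  /path/to/app.jar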

Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes. This can use up a significant amount of space over time and will need to be cleaned up. With YARN, cleanup is handled automatically, and with Spark standalone, automatic cleanup can be configured with the spark.worker.cleanup.appDataTtl property.
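
For standalone mode, a minimal sketch of turning that cleanup on, assuming the properties are passed to each worker through SPARK_WORKER_OPTS in conf/spark-env.sh; the TTL (in seconds) is only an example:

# conf/spark-env.sh on each worker node (illustrative TTL of 7 days)
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=604800"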

Users may also include any other dependencies by supplying a comma-delimited list of Maven coordinates with --packages. All transitive dependencies will be handled when using this command. Additional repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the flag --repositories. These commands can be used with pyspark, spark-shell, and spark-submit to include Spark Packages.
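
A sketch of pulling a dependency by Maven coordinates at submit time; the coordinate and the extra repository URL below are purely illustrative, and any groupId:artifactId:version reachable from the configured repositories is handled the same way:

# Resolve the package and its transitive dependencies when submitting
./bin/spark-submit \
  --packages com.databricks:spark-csv_2.10:1.5.0 \
  --repositories https://repos.example.com/maven \
  --class com.example.MyApp \
  /path/to/app.jar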

For Python, the equivalent --py-files option can be used to distribute .egg, .zip and .py libraries to executors.
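
And the Python counterpart, again with hypothetical file names; the listed archives and modules are shipped to the executors and made importable:

# Ship Python dependencies alongside the main script (names hypothetical)
./bin/spark-submit \
  --master yarn \
  --py-files deps.zip,helpers.py,mypkg.egg \
  main_job.py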

More Information

Once you have deployed your application, the cluster mode overview describes the components involved in distributed execution, and how to monitor and debug applications.

