spark-submit parameters and tuning

WQ同学

Categories: interview questions, spark

Article link: https://blog.csdn.net/u012957549/article/details/87893383


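I had some free time this weekend, so I am organizing my notes on spark-submit parameters.
The natural place to start is the official documentation.
Here is the example it gives: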

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Here is the official explanation of each part:

--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
--conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown).
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
application-arguments: Arguments passed to the main method of your main class, if any

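In brief:

--class: the main class of your application
--master: the cluster to connect to, e.g. spark://host:port, mesos://host:port, yarn, or local. The older yarn-cluster and yarn-client values have been deprecated since Spark 2.0; use --master yarn together with --deploy-mode instead.
--deploy-mode: whether to submit the job in client or cluster mode (default: client). The difference is where the driver runs: on the submitting machine in client mode, inside the cluster in cluster mode.
--conf: a Spark configuration property in key=value form
application-jar: the location of the application jar, e.g. an hdfs:// path or a file:// path
application-arguments: arguments passed to the main method of the main class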
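
To make the template concrete, here is a minimal sketch that fills in each placeholder and prints the resulting command instead of running it. The class name is the one from the official example; the master URL local[2], the jar path /path/to/examples.jar, and the argument 100 are assumptions chosen for illustration, not values from the documentation.

```shell
# Placeholder values: only the class name comes from the official example.
MAIN_CLASS="org.apache.spark.examples.SparkPi"
MASTER_URL="local[2]"              # assumed: run locally with 2 threads
APP_JAR="/path/to/examples.jar"    # hypothetical jar location

# Build the command as positional parameters so that each option
# stays a separate argument.
set -- ./bin/spark-submit \
  --class "$MAIN_CLASS" \
  --master "$MASTER_URL" \
  --deploy-mode client \
  --conf "spark.executor.memory=1g" \
  "$APP_JAR" \
  100

# Dry run: print the command rather than submitting it.
cmd_line="$*"
echo "$cmd_line"
```

To actually submit, replace the final echo with "$@" and run the script from the Spark installation directory.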

Real projects of course use more than these options. Run ./spark-submit --help (from Spark's bin directory) to see every option that spark-submit supports:

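Parameter	Description
--name	Name of the application
--jars	Comma-separated list of local jars to include on the driver and executor classpaths
--packages	Comma-separated list of Maven coordinates, in the form groupId:artifactId:version, of jars to include on the driver and executor classpaths
--repositories	Comma-separated list of additional remote repositories to search for the coordinates given with --packages
--exclude-packages	Comma-separated list of groupId:artifactId entries to exclude while resolving --packages, to avoid dependency conflicts
--py-files	Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH of a Python application
--files	Comma-separated list of files to place in the working directory of each executor
--conf	Sets a Spark configuration property, e.g. --conf spark.executor.extraJavaOptions="-XX:MaxPermSize=256m"
--properties-file	Path to a file from which to load extra Spark properties; defaults to conf/spark-defaults.conf
--driver-memory	Memory for the driver; default 1G
--driver-java-options	Extra Java options to pass to the driver, e.g. -XX:PermSize=128M -XX:MaxPermSize=256M
--driver-library-path	Extra library path entries to pass to the driver
--driver-class-path	Extra classpath entries to pass to the driver
--proxy-user	User to impersonate when submitting the application
--driver-cores	Number of cores used by the driver; default 1. Applies only in cluster deploy mode
--supervise	Restart the driver if it fails. Spark standalone or Mesos only
--verbose	Print additional debug output
--total-executor-cores	Total number of cores across all executors. Spark standalone and Mesos only
--executor-memory	Memory per executor; default 1G
--executor-cores	Number of cores per executor. The default is 1 on YARN and all available cores on the worker in standalone mode. Spark on YARN and standalone only
--num-executors	Number of executors to launch; default 2. Used on YARN
--queue	YARN queue to submit the application to; defaults to the queue named default. Spark on YARN only
--archives	Comma-separated list of archives to be extracted into the working directory of each executor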
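
As a rough illustration of how the resource-related options in this table combine, the sketch below assembles a YARN submission and prints it as a dry run. The class com.example.MyApp, the application name, the jar path, and all sizes are hypothetical placeholders; only the flag names come from the table.

```shell
# Hypothetical sizing: these numbers are placeholders, not recommendations.
NUM_EXECUTORS=10
EXECUTOR_CORES=2
EXECUTOR_MEMORY=4g

# Build the command as positional parameters (one argument per word).
set -- ./bin/spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  --name my-app \
  --queue default \
  --driver-memory 2g \
  --executor-memory "$EXECUTOR_MEMORY" \
  --executor-cores "$EXECUTOR_CORES" \
  --num-executors "$NUM_EXECUTORS" \
  hdfs:///path/to/my-app.jar

# Total executor cores this submission asks YARN for.
total_cores=$(( NUM_EXECUTORS * EXECUTOR_CORES ))

# Dry run: print the command and the requested core count.
yarn_cmd_line="$*"
echo "$yarn_cmd_line"
echo "requested executor cores: $total_cores"
```

Keeping the sizes in variables makes it easy to check the total request against the queue's capacity before submitting. To submit for real, replace the first echo with "$@".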

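Note: in standalone mode each worker runs one executor per application by default, and the number of executors cannot be specified directly. It can, however, be set indirectly:

number of executors = spark.cores.max (--total-executor-cores) / spark.executor.cores (--executor-cores)

spark.cores.max is the total number of cores your Spark application requests.
spark.executor.cores is the number of cores each executor needs.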
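
The formula above can be checked with a little shell arithmetic. This is a sketch under stated assumptions: the helper name executor_count and the core counts are made up for illustration, and integer division is used on the assumption that leftover cores too few for a full executor stay unused. The real count also depends on the workers having enough free cores and memory.

```shell
# executor_count TOTAL_CORES CORES_PER_EXECUTOR
# Hypothetical helper: applies the formula with integer division.
executor_count() {
  echo $(( $1 / $2 ))
}

total_executor_cores=12   # spark.cores.max / --total-executor-cores
executor_cores=4          # spark.executor.cores / --executor-cores

num_executors=$(executor_count "$total_executor_cores" "$executor_cores")
echo "executors: $num_executors"
```

With these values the request works out to 3 executors of 4 cores each. Asking for 14 total cores with the same per-executor size would still give 3, since the remaining 2 cores cannot form another executor.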