spark-submit parameter reference:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
resolving the dependencies provided in --packages to avoid
dependency conflicts.
--repositories Comma-separated list of additional remote repositories to
search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor.
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user NAME User to impersonate when submitting the application.
--help, -h Show this help message and exit
--verbose, -v Print additional debug output
--version, Print the version of current Spark
Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
Spark standalone and YARN only:
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode,
or all available cores on the worker in standalone mode)
YARN-only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
--principal PRINCIPAL Principal to be used to login to KDC, while running on
secure HDFS.
--keytab KEYTAB The full path to the file that contains the keytab for the
principal specified above. This keytab will be copied to
the node running the Application Master via the Secure
Distributed Cache, for renewing the login tickets and the
delegation tokens periodically.
/usr/bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 2G \
  --executor-memory 4G \
  --executor-cores 4 \
  --num-executors 50 \
  --conf spark.default.parallelism=600 \
  --class "com.aisino.zxpt.analyse.yxc.Fpmx_desc" \
  --jars lib/amqp-client-5.2.0.jar,lib/commons-pool2-2.4.2.jar,lib/guava-12.0.1.jar,lib/hadooplog-core-1.0.jar,lib/hbase-client.jar,lib/hbase-common.jar,lib/hbase-protocol-1.0.0-cdh5.4.3.jar,lib/hbase-server.jar,lib/htrace-core-3.1.0-incubating.jar,lib/kafka_2.10-0.9.0.0.jar,lib/kafka-clients-0.9.0.0.jar,lib/metrics-core-2.2.0.jar,lib/mysql-connector-java-5.1.37-bin.jar,lib/protobuf-java-2.5.0.jar,lib/SequenceIDGen.jar,lib/spark-assembly-1.5.0-hadoop2.6.0.jar,lib/spark-streaming-kafka_2.10-1.6.0.jar,lib/tools.jar,lib/zkclient-0.3.jar \
  /home/inttest/zxpt/yxc.jar > /home/inttest/zxpt/txt.log
num-executors * executor-cores should stay within roughly 1/3 to 1/2 of the queue's total CPU cores.
executor-memory * num-executors is the total memory your Spark job requests (i.e. the sum of memory across all Executor processes).
executor-cores sets the number of CPU cores each executor uses (available in Spark standalone and YARN modes, per the option list above).
driver-memory (default: 1g) sets the memory for the Driver process.
Tuning advice: driver memory usually does not need to be set; around 1G is generally enough.
The one thing to watch out for: if you use the collect operator to pull all of an RDD's data back to the driver for processing, the driver memory must be large enough to hold it, or the job will fail with an OOM (out of memory) error.
executor-memory (default: 1g) sets the memory for each Executor process. Executor memory often determines a Spark job's performance and is directly related to the common JVM OOM errors.
Tuning advice: 4G to 8G per executor is usually appropriate, but this is only a reference value; the right setting depends on your team's resource queue. Check the queue's maximum memory limit: num-executors * executor-memory is the total memory your job requests (the sum across all executors), and it must not exceed the queue maximum. Furthermore, if you share the queue with others, keep the total request to roughly 1/3 to 1/2 of the queue's maximum, so your job does not grab all the queue's resources and block your colleagues' jobs from running.
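As a quick sanity check, the memory rule above reduces to a few lines of arithmetic. The queue capacity below is an assumed figure for illustration, not a real limit:

```python
# Hypothetical sizing check for the executor-memory advice above.
num_executors = 50
executor_memory_gb = 4
queue_max_memory_gb = 600  # assumed queue maximum, for illustration only

# Total memory the job requests: the sum across all executor processes.
total_request_gb = num_executors * executor_memory_gb
print(f"total requested: {total_request_gb}G")

# On a shared queue, stay within roughly 1/3 to 1/2 of the queue maximum.
shared_budget_gb = queue_max_memory_gb / 2
print(total_request_gb <= shared_budget_gb)
```

With these numbers the job requests 200G, comfortably inside a 300G half-queue budget.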
num-executors (default: 2) sets how many Executor processes the Spark job runs with. When the driver requests resources from the YARN cluster manager, YARN starts executors across the cluster's worker nodes as close to the requested number as it can. This parameter is very important: if left unset, only a small number of executors are started and the job runs very slowly.
Tuning advice: around 50 to 100 executors per Spark job is usually appropriate; too few or too many are both bad. With too few, the cluster's resources cannot be fully used; with too many, most queues cannot grant sufficient resources.
executor-cores sets the number of CPU cores per Executor process. This parameter determines each executor's capacity for running task threads in parallel: since each CPU core executes only one task thread at a time, the more cores an executor has, the faster it can finish all the tasks assigned to it.
Tuning advice: 2 to 4 CPU cores per executor is usually appropriate. Again this depends on your resource queue: check the queue's maximum CPU core limit, then, given your executor count, work out how many cores each executor can be allocated. As with memory, if you share the queue with others, keep num-executors * executor-cores to roughly 1/3 to 1/2 of the queue's total CPU cores to avoid impacting other people's jobs.
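The per-executor core count can be derived the same way; the queue total below is again an assumed figure:

```python
# Hypothetical: derive cores per executor from an assumed shared-queue budget.
queue_total_cores = 500               # assumed queue total, for illustration only
shared_cores = queue_total_cores // 3  # use at most ~1/3 when the queue is shared
num_executors = 50

cores_per_executor = shared_cores // num_executors
print(cores_per_executor)  # lands within the recommended 2-4 range
```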
spark.default.parallelism sets the default number of tasks per stage.
Tuning advice: a default of 500 to 1000 tasks per Spark job is usually appropriate. If this parameter is left unset, Spark derives the task count from the number of underlying HDFS blocks (by default, one task per HDFS block), which is usually too low. With too few tasks, all the executor settings above go to waste: no matter how many executors you have or how much memory and CPU they hold, if there are only 1 or 10 tasks, 90% of the executor processes may have no task to run at all, and their resources are simply wasted.
A good rule of thumb is 2 to 3 times num-executors * executor-cores.
e.g.: if the executors' total CPU core count is 300, then setting 1000 tasks is fine, and the Spark cluster's resources are fully used.
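This rule of thumb ties back to the example command earlier (50 executors x 4 cores, parallelism 600):

```python
# spark.default.parallelism sizing: 2-3x the total executor CPU cores.
num_executors = 50
executor_cores = 4
total_cores = num_executors * executor_cores  # concurrent task slots

low, high = 2 * total_cores, 3 * total_cores
print(f"recommended parallelism: {low}-{high}")
```

The example command's spark.default.parallelism=600 sits at the top of this 400-600 range, so every core stays busy across several waves of tasks.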