spark-submit parameter reference:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
resolving the dependencies provided in --packages to avoid
dependency conflicts.
--repositories Comma-separated list of additional remote repositories to
search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor.
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user NAME User to impersonate when submitting the application.
--help, -h Show this help message and exit
--verbose, -v Print additional debug output
--version, Print the version of current Spark
Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
Spark standalone and YARN only:
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode,
or all available cores on the worker in standalone mode)
YARN-only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
--principal PRINCIPAL Principal to be used to login to KDC, while running on
secure HDFS.
--keytab KEYTAB The full path to the file that contains the keytab for the
principal specified above. This keytab will be copied to
the node running the Application Master via the Secure
Distributed Cache, for renewing the login tickets and the
delegation tokens periodically.
/usr/bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 2G \
  --executor-memory 4G \
  --executor-cores 4 \
  --num-executors 50 \
  --conf spark.default.parallelism=600 \
  --class "com.aisino.zxpt.analyse.yxc.Fpmx_desc" \
  --jars lib/amqp-client-5.2.0.jar,lib/commons-pool2-2.4.2.jar,lib/guava-12.0.1.jar,lib/hadooplog-core-1.0.jar,lib/hbase-client.jar,lib/hbase-common.jar,lib/hbase-protocol-1.0.0-cdh5.4.3.jar,lib/hbase-server.jar,lib/htrace-core-3.1.0-incubating.jar,lib/kafka_2.10-0.9.0.0.jar,lib/kafka-clients-0.9.0.0.jar,lib/metrics-core-2.2.0.jar,lib/mysql-connector-java-5.1.37-bin.jar,lib/protobuf-java-2.5.0.jar,lib/SequenceIDGen.jar,lib/spark-assembly-1.5.0-hadoop2.6.0.jar,lib/spark-streaming-kafka_2.10-1.6.0.jar,lib/tools.jar,lib/zkclient-0.3.jar \
  /home/inttest/zxpt/yxc.jar > /home/inttest/zxpt/txt.log
num-executors * executor-cores should stay within roughly 1/3 to 1/2 of the queue's total CPU cores.
executor-memory * num-executors is the total memory your Spark job requests (i.e. the sum of memory across all Executor processes).
executor-cores sets the number of CPU cores each executor uses (available in Spark standalone and YARN modes, per the option list above).
driver-memory (default: 1g) sets the memory for the Driver process.
Tuning advice: driver memory usually does not need to be set; around 1G is generally enough.
The one thing to watch out for: if you use the collect operator to pull all of an RDD's data back to the driver for processing, the driver memory must be large enough to hold it, or the job will fail with an OOM (out of memory) error.
executor-memory (default: 1g) sets the memory for each Executor process. Executor memory often determines a Spark job's performance and is directly related to the common JVM OOM errors.
Tuning advice: 4G to 8G per executor is usually appropriate, but this is only a reference value; the right setting depends on your team's resource queue. Check the queue's maximum memory limit: num-executors * executor-memory is the total memory your job requests (the sum across all executors), and it must not exceed the queue maximum. Furthermore, if you share the queue with others, keep the total request to roughly 1/3 to 1/2 of the queue's maximum, so your job does not grab all the queue's resources and block your colleagues' jobs from running.
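As a quick sanity check, the memory rule above reduces to a few lines of arithmetic. The queue capacity below is an assumed figure for illustration, not a real limit:

```python
# Hypothetical sizing check for the executor-memory advice above.
num_executors = 50
executor_memory_gb = 4
queue_max_memory_gb = 600  # assumed queue maximum, for illustration only

# Total memory the job requests: the sum across all executor processes.
total_request_gb = num_executors * executor_memory_gb
print(f"total requested: {total_request_gb}G")

# On a shared queue, stay within roughly 1/3 to 1/2 of the queue maximum.
shared_budget_gb = queue_max_memory_gb / 2
print(total_request_gb <= shared_budget_gb)
```

With these numbers the job requests 200G, comfortably inside a 300G half-queue budget.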
num-executors (default: 2) sets how many Executor processes the Spark job runs with. When the driver requests resources from the YARN cluster manager, YARN starts executors across the cluster's worker nodes as close to the requested number as it can. This parameter is very important: if left unset, only a small number of executors are started and the job runs very slowly.
Tuning advice: around 50 to 100 executors per Spark job is usually appropriate; too few or too many are both bad. With too few, the cluster's resources cannot be fully used; with too many, most queues cannot grant sufficient resources.
executor-cores sets the number of CPU cores per Executor process. This parameter determines each executor's capacity for running task threads in parallel: since each CPU core executes only one task thread at a time, the more cores an executor has, the faster it can finish all the tasks assigned to it.
Tuning advice: 2 to 4 CPU cores per executor is usually appropriate. Again this depends on your resource queue: check the queue's maximum CPU core limit, then, given your executor count, work out how many cores each executor can be allocated. As with memory, if you share the queue with others, keep num-executors * executor-cores to roughly 1/3 to 1/2 of the queue's total CPU cores to avoid impacting other people's jobs.
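The per-executor core count can be derived the same way; the queue total below is again an assumed figure:

```python
# Hypothetical: derive cores per executor from an assumed shared-queue budget.
queue_total_cores = 500               # assumed queue total, for illustration only
shared_cores = queue_total_cores // 3  # use at most ~1/3 when the queue is shared
num_executors = 50

cores_per_executor = shared_cores // num_executors
print(cores_per_executor)  # lands within the recommended 2-4 range
```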
spark.default.parallelism sets the default number of tasks per stage.
Tuning advice: a default of 500 to 1000 tasks per Spark job is usually appropriate. If this parameter is left unset, Spark derives the task count from the number of underlying HDFS blocks (by default, one task per HDFS block), which is usually too low. With too few tasks, all the executor settings above go to waste: no matter how many executors you have or how much memory and CPU they hold, if there are only 1 or 10 tasks, 90% of the executor processes may have no task to run at all, and their resources are simply wasted.
A good rule of thumb is 2 to 3 times num-executors * executor-cores.
e.g.: if the executors' total CPU core count is 300, then setting 1000 tasks is fine, and the Spark cluster's resources are fully used.
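This rule of thumb ties back to the example command earlier (50 executors x 4 cores, parallelism 600):

```python
# spark.default.parallelism sizing: 2-3x the total executor CPU cores.
num_executors = 50
executor_cores = 4
total_cores = num_executors * executor_cores  # concurrent task slots

low, high = 2 * total_cores, 3 * total_cores
print(f"recommended parallelism: {low}-{high}")
```

The example command's spark.default.parallelism=600 sits at the top of this 400-600 range, so every core stays busy across several waves of tasks.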