Spark Configuration Parameters Explained

cxy1991xm

Article link: https://blog.csdn.net/cxy1991xm/article/details/92085049

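Spark configuration parameters can be set in several places. Taking executor memory as an example, there are three places to configure it:
(1) the --executor-memory option of spark-submit
(2) the spark.executor.memory property in spark-defaults.conf
(3) the SPARK_EXECUTOR_MEMORY variable in spark-env.sh

Precedence: spark-submit options > spark-defaults.conf settings > spark-env.sh settings > default values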
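
The precedence rule above can be sketched as a small lookup. This is only an illustration of the ordering, not how Spark itself is implemented; the function name `resolve_executor_memory` and its arguments are hypothetical, and the fallback of `1g` is the documented default for spark.executor.memory.

```python
from typing import Mapping, Optional

# Documented default for spark.executor.memory.
DEFAULT_EXECUTOR_MEMORY = "1g"


def resolve_executor_memory(
    submit_option: Optional[str] = None,
    defaults_conf: Optional[Mapping[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the effective executor memory using the precedence listed above."""
    candidates = [
        submit_option,                                        # spark-submit --executor-memory
        (defaults_conf or {}).get("spark.executor.memory"),   # spark-defaults.conf
        (env or {}).get("SPARK_EXECUTOR_MEMORY"),             # spark-env.sh
    ]
    for value in candidates:
        if value:  # the first source that provides a value wins
            return value
    return DEFAULT_EXECUTOR_MEMORY
```

For instance, calling `resolve_executor_memory(None, {"spark.executor.memory": "2g"}, {"SPARK_EXECUTOR_MEMORY": "3g"})` picks the spark-defaults.conf value, because no command-line option was given.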

I. spark-defaults.conf
1. spark.driver.extraClassPath: Extra classpath entries to prepend to the classpath of the driver.
Note: the path must exist on the machine where the driver is launched.
2. spark.driver.extraJavaOptions: A string of extra JVM options to pass to the driver.
Use this parameter to specify JVM options for the driver. Do not set -Xmx here; set the heap size with --driver-memory instead.
3. spark.driver.extraLibraryPath: Set a special library path to use when launching the driver JVM.
4. spark.executor.extraClassPath: Extra classpath entries to prepend to the classpath of executors.
Note: the path must exist on every machine where an executor is launched.
5. spark.executor.extraJavaOptions: A string of extra JVM options to pass to executors.
Use this parameter to specify JVM options for the executors. Do not set -Xmx here; set the heap size with spark.executor.memory instead.
6. spark.executor.extraLibraryPath: Set a special library path to use when launching executor JVM.
7. spark.local.dir: Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk.
8. spark.yarn.archive or spark.yarn.jars: To make Spark runtime jars accessible from the YARN side,
you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties.
If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all
jars under $SPARK_HOME/jars and upload it to the distributed cache.
For example, for spark.yarn.archive: hdfs:///spark_lib/spark_lib.zip
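9. spark.yarn.dist.files: Comma-separated list of files to be placed in the working directory of each executor.
When a job is submitted, the files named by this parameter are uploaded to the distributed storage system (for example HDFS), and the working directory of every executor (and of the driver) will contain them.
The effect is the same as for files passed with --py-files; the difference is that --py-files changes from job to job, whereas the files listed here are the same for every job.
10. spark.yarn.dist.jars: Similar to spark.yarn.dist.files, except that it lists jars.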
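
The advice in items 2 and 5 (pass extra JVM flags through extraJavaOptions, but never the heap size) can be checked before a job is submitted. Below is a minimal sketch, assuming a hypothetical helper `build_submit_args` that assembles a spark-submit argument list; only the `--driver-memory` and `--conf key=value` options of spark-submit are used.

```python
import shlex
from typing import Dict, List

JAVA_OPTION_KEYS = (
    "spark.driver.extraJavaOptions",
    "spark.executor.extraJavaOptions",
)


def build_submit_args(app: str, conf: Dict[str, str], driver_memory: str) -> List[str]:
    """Build a spark-submit argument list from a dict of Spark properties."""
    for key in JAVA_OPTION_KEYS:
        # Heap size must not be set through extraJavaOptions (items 2 and 5).
        for token in shlex.split(conf.get(key, "")):
            if token.startswith("-Xmx"):
                raise ValueError(f"{key} must not contain {token}")
    args = ["spark-submit", "--driver-memory", driver_memory]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]  # one --conf per property
    args.append(app)
    return args


if __name__ == "__main__":
    # "app.py" and the GC flag are placeholder values for illustration.
    example = build_submit_args(
        "app.py",
        {"spark.executor.extraJavaOptions": "-XX:+UseG1GC"},
        driver_memory="2g",
    )
    print(" ".join(shlex.quote(arg) for arg in example))
```

Rejecting `-Xmx` early gives a clearer error than letting the submission fail later, which is the only design choice this sketch makes.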
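----------Dynamic Allocation: allocate executors dynamically according to load----------
11. spark.dynamicAllocation.enabled: true enables dynamic allocation of executors; it also requires spark.shuffle.service.enabled to be true.
(1) Find spark-<version>-yarn-shuffle.jar under ${SPARK_HOME} (<version> is the version number, for example 2.4.3) and copy it to ${HADOOP_HOME}/share/hadoop/yarn/lib/ on every node.
(2) Modify yarn-site.xml on every node as follows (the NodeManagers must be restarted for the change to take effect):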
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
(3) Related parameters: spark.dynamicAllocation.minExecutors, spark.dynamicAllocation.maxExecutors, spark.dynamicAllocation.initialExecutors
12. spark.yarn.shuffle.stopOnFailure: Only used when spark.dynamicAllocation.enabled is true. Whether to stop the NodeManager when there's a failure in the Spark Shuffle Service's initialization. This prevents application failures caused by
running containers on NodeManagers where the Spark Shuffle Service is not running.
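13. spark.executor.memory: heap memory of each executor
14. spark.yarn.executor.memoryOverhead: non-heap (overhead) memory of each executor; named spark.executor.memoryOverhead since Spark 2.3
15. spark.executor.cores: number of cores per executor
16. spark.driver.memory: heap memory of the driver
17. spark.yarn.driver.memoryOverhead: non-heap (overhead) memory of the driver; named spark.driver.memoryOverhead since Spark 2.3
18. spark.driver.cores: number of cores for the driver
19. spark.driver.maxResultSize: maximum total size of the results the driver pulls back, for example from collect
20. spark.default.parallelism: default number of partitions for RDDs
21. spark.speculation: whether to enable speculative execution
23. spark.master: The cluster manager to connect to. If set to yarn, the cluster manager's configuration is located through HADOOP_CONF_DIR or YARN_CONF_DIR in spark-env.sh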
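
Items 13, 14, 16 and 17 together determine how much memory is requested from YARN for each container: the heap plus the overhead. The sketch below assumes the documented default for the overhead on YARN (10% of the heap, with a minimum of 384 MiB); the function names are hypothetical, and YARN may round the request up further according to its own scheduler settings.

```python
from typing import Optional

MIN_OVERHEAD_MIB = 384   # documented minimum overhead
OVERHEAD_FACTOR = 0.10   # documented default fraction of the heap


def default_overhead_mib(heap_mib: int) -> int:
    """Default memory overhead when memoryOverhead is not set explicitly."""
    return max(MIN_OVERHEAD_MIB, int(heap_mib * OVERHEAD_FACTOR))


def container_memory_mib(heap_mib: int, overhead_mib: Optional[int] = None) -> int:
    """Approximate memory requested per container: heap plus overhead."""
    if overhead_mib is None:
        overhead_mib = default_overhead_mib(heap_mib)
    return heap_mib + overhead_mib


if __name__ == "__main__":
    for heap in (1024, 4096, 8192):
        print(heap, "MiB heap ->", container_memory_mib(heap), "MiB container")
```

This is why an executor configured with a given spark.executor.memory occupies more than that amount of cluster memory, which matters when working out how many executors fit on a node.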
-----------Spark history configuration-----------
24. spark.eventLog.enabled: true means Spark event logs are recorded
25. spark.eventLog.compress: true means event logs are compressed
26. spark.eventLog.dir: location where event logs are stored, for example on HDFS
27. spark.task.maxFailures: maximum number of times a single task may fail before the job is given up
28. spark.serializer: serializer class, for example org.apache.spark.serializer.KryoSerializer
29. spark.sql.shuffle.partitions: Configures the number of partitions to use when shuffling data for joins or aggregations.
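
To tie the section together, here is a minimal sketch that renders a set of properties in the spark-defaults.conf layout (one property per line, key and value separated by whitespace) and checks the dynamic allocation rule from item 11 first. The helper `render_defaults_conf` and the example HDFS path are hypothetical; the property names are the ones listed above.

```python
from typing import Dict


def render_defaults_conf(props: Dict[str, str]) -> str:
    """Render properties as spark-defaults.conf lines after basic sanity checks."""
    if props.get("spark.dynamicAllocation.enabled") == "true":
        # Rule from item 11: dynamic allocation needs the external shuffle service.
        if props.get("spark.shuffle.service.enabled") != "true":
            raise ValueError("spark.shuffle.service.enabled must be true")
        low = props.get("spark.dynamicAllocation.minExecutors")
        high = props.get("spark.dynamicAllocation.maxExecutors")
        if low is not None and high is not None and int(low) > int(high):
            raise ValueError("minExecutors must not exceed maxExecutors")
    # One "key value" pair per line, in insertion order.
    return "".join(f"{key} {value}\n" for key, value in props.items())


if __name__ == "__main__":
    print(render_defaults_conf({
        "spark.dynamicAllocation.enabled": "true",
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": "1",
        "spark.dynamicAllocation.maxExecutors": "10",
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": "hdfs:///spark/eventlog",  # hypothetical path
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    }), end="")
```

Generating the file from one dictionary keeps related settings, such as the two flags required for dynamic allocation, from drifting apart.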

II. spark-env.sh
1. HADOOP_CONF_DIR: path to the Hadoop configuration files
2. SPARK_HISTORY_OPTS: options for the Spark history server, for example
"-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=1000 -Dspark.history.ui.maxApplications=1000
-Dspark.history.fs.logDirectory=hdfs://ip:port/log/ -Dspark.history.fs.cleaner.enabled=true
-Dspark.history.fs.cleaner.interval=1d -Dspark.history.fs.cleaner.maxAge=7d"
3. export SPARK_LOG_DIR=/data/spark/log: directory where logs are stored
4. export SPARK_LOCAL_DIRS=/data/spark/local: storage directories to use on this node for shuffle and RDD data
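
The SPARK_HISTORY_OPTS value in item 2 is a space-separated list of -Dkey=value JVM options, which is easy to get wrong by hand. A minimal sketch that assembles it from a dictionary follows; the helper name `history_opts` is hypothetical, and the settings are taken from the example above.

```python
from typing import Dict


def history_opts(settings: Dict[str, str]) -> str:
    """Join spark.history.* settings into one string of -Dkey=value options."""
    for key in settings:
        if not key.startswith("spark.history."):
            raise ValueError(f"not a history server setting: {key}")
    return " ".join(f"-D{key}={value}" for key, value in settings.items())


if __name__ == "__main__":
    opts = history_opts({
        "spark.history.ui.port": "18080",
        "spark.history.fs.cleaner.enabled": "true",
        "spark.history.fs.cleaner.interval": "1d",
        "spark.history.fs.cleaner.maxAge": "7d",
    })
    # Line to paste into spark-env.sh.
    print(f'export SPARK_HISTORY_OPTS="{opts}"')
```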
