Spark Configuration Precedence
Settings made in code take the highest precedence, options passed at submit time via spark-submit come next, and the cluster configuration file spark-defaults.conf has the lowest precedence.
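The precedence rule can be pictured as a layered map merge, where later layers override earlier ones. This is only a toy illustration of the idea, not Spark's actual implementation:

```scala
// Toy model of Spark's config precedence: defaults < spark-submit < code.
// Map's ++ operator keeps the right-hand value on key conflicts, so the
// rightmost (highest-precedence) layer wins.
object ConfPrecedence {
  def resolve(defaults: Map[String, String],
              submit: Map[String, String],
              code: Map[String, String]): Map[String, String] =
    defaults ++ submit ++ code

  def main(args: Array[String]): Unit = {
    val defaults = Map("spark.executor.memory" -> "1g", "spark.executor.cores" -> "1")
    val submit   = Map("spark.executor.memory" -> "2g")
    val code     = Map("spark.executor.memory" -> "4g")
    val effective = resolve(defaults, submit, code)
    println(effective("spark.executor.memory")) // prints 4g: the code-level setting wins
    println(effective("spark.executor.cores"))  // prints 1: falls through to defaults
  }
}
```

A key that is set only in spark-defaults.conf still takes effect; it is only overridden where a higher layer sets the same key.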
The default content of spark-defaults.conf looks like this:
# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
The sections below walk through these layers from lowest to highest precedence.
Cluster configuration files
spark-env.sh
spark-env.sh holds Worker- and Executor-related settings.
The original file contains the following (excerpt):
# Options read in YARN client/cluster mode
# - SPARK_CONF_DIR, Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - YARN_CONF_DIR, to point Spark towards YARN configuration files when you use YARN
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)
# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
For example, you could set:
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=800m
SPARK_WORKER_INSTANCES=2
As the Spark hardware-provisioning guide puts it: "Finally, note that the Java VM does not always behave well with more than 200 GB of RAM. If you purchase machines with more RAM than this, you can run multiple worker JVMs per node. In Spark's standalone mode, you can set the number of workers per node with the SPARK_WORKER_INSTANCES variable in conf/spark-env.sh, and the number of cores per worker with SPARK_WORKER_CORES."
The supported size units are:
1b (bytes)
1k or 1kb (kibibytes = 1024 bytes)
1m or 1mb (mebibytes = 1024 kibibytes)
1g or 1gb (gibibytes = 1024 mebibytes)
1t or 1tb (tebibytes = 1024 gibibytes)
1p or 1pb (pebibytes = 1024 tebibytes)
Note: after editing, copy the file to the other nodes in the cluster.
spark-defaults.conf
spark.executor.cores — number of cores per executor
spark.executor.memory — memory per executor
spark.default.parallelism — default parallelism
spark.broadcast.blockSize — broadcast block size; defaults to 4m.
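Putting these together, a minimal spark-defaults.conf might look like the following (the values are illustrative, not recommendations):

```
spark.executor.cores        2
spark.executor.memory       1g
spark.default.parallelism   8
spark.broadcast.blockSize   4m
```

Properties and values are separated by whitespace, one setting per line.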
spark-submit options
Options such as executor-memory and executor-cores can be set at submit time.
Each option name is prefixed with two dashes, or you can use --conf:
--executor-memory 1200m, or equivalently
--conf spark.executor.memory=1200m
Example
./bin/spark-submit --master local[2] \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  --num-executors 3 \
  --class org.apache.spark.examples.JavaSparkPi \
  examples/jars/spark-examples_2.11-2.1.3.jar
(Note that --num-executors only takes effect on YARN; with --master local[2] it is ignored.)
Configuration in code
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
conf.setAppName("example")
conf.set("spark.executor.cores", "1")
conf.set("spark.executor.memory", "800m")
// settings made here override both spark-submit options and spark-defaults.conf
val sc = new SparkContext(conf)
References
http://spark.apache.org/docs/latest/configuration.html
http://spark.apache.org/docs/latest/hardware-provisioning.html