Spark Configuration Precedence
Settings made in code take the highest precedence, options passed at submit time via spark-submit come next, and the cluster configuration file spark-defaults.conf has the lowest precedence.
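The precedence rule can be pictured as a layered map merge, where later layers override earlier ones. This is only a toy illustration of the idea, not Spark's actual implementation:

```scala
// Toy model of Spark's config precedence: defaults < spark-submit < code.
// Map's ++ operator keeps the right-hand value on key conflicts, so the
// rightmost (highest-precedence) layer wins.
object ConfPrecedence {
  def resolve(defaults: Map[String, String],
              submit: Map[String, String],
              code: Map[String, String]): Map[String, String] =
    defaults ++ submit ++ code

  def main(args: Array[String]): Unit = {
    val defaults = Map("spark.executor.memory" -> "1g", "spark.executor.cores" -> "1")
    val submit   = Map("spark.executor.memory" -> "2g")
    val code     = Map("spark.executor.memory" -> "4g")
    val effective = resolve(defaults, submit, code)
    println(effective("spark.executor.memory")) // prints 4g: the code-level setting wins
    println(effective("spark.executor.cores"))  // prints 1: falls through to defaults
  }
}
```

A key that is set only in spark-defaults.conf still takes effect; it is only overridden where a higher layer sets the same key.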
The default content of spark-defaults.conf looks like this:
# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
The sections below walk through these layers from lowest to highest precedence.
Cluster configuration files
spark-env.sh
spark-env.sh holds Worker- and Executor-related settings.
The original file contains the following (excerpt):
# Options read in YARN client/cluster mode
# - SPARK_CONF_DIR, Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - YARN_CONF_DIR, to point Spark towards YARN configuration files when you use YARN
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)
# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
For example, you could set:
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=800m
SPARK_WORKER_INSTANCES=2
As the Spark hardware-provisioning guide puts it: "Finally, note that the Java VM does not always behave well with more than 200 GB of RAM. If you purchase machines with more RAM than this, you can run multiple worker JVMs per node. In Spark's standalone mode, you can set the number of workers per node with the SPARK_WORKER_INSTANCES variable in conf/spark-env.sh, and the number of cores per worker with SPARK_WORKER_CORES."
The supported size units are:
1b (bytes)
1k or 1kb (kibibytes = 1024 bytes)
1m or 1mb (mebibytes = 1024 kibibytes)
1g or 1gb (gibibytes = 1024 mebibytes)
1t or 1tb (tebibytes = 1024 gibibytes)
1p or 1pb (pebibytes = 1024 tebibytes)
Note: after editing, copy the file to the other nodes in the cluster.
spark-defaults.conf
spark.executor.cores — number of cores per executor
spark.executor.memory — memory per executor
spark.default.parallelism — default parallelism
spark.broadcast.blockSize — broadcast block size; defaults to 4m.
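Putting these together, a minimal spark-defaults.conf might look like the following (the values are illustrative, not recommendations):

```
spark.executor.cores        2
spark.executor.memory       1g
spark.default.parallelism   8
spark.broadcast.blockSize   4m
```

Properties and values are separated by whitespace, one setting per line.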
spark-submit options
Options such as executor-memory and executor-cores can be set at submit time.
Each option name is prefixed with two dashes, or you can use --conf:
--executor-memory 1200m, or equivalently
--conf spark.executor.memory=1200m
Example
./bin/spark-submit --master local[2] \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  --num-executors 3 \
  --class org.apache.spark.examples.JavaSparkPi \
  examples/jars/spark-examples_2.11-2.1.3.jar
(Note that --num-executors only takes effect on YARN; with --master local[2] it is ignored.)
Configuration in code
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
conf.setAppName("example")
conf.set("spark.executor.cores", "1")
conf.set("spark.executor.memory", "800m")
// settings made here override both spark-submit options and spark-defaults.conf
val sc = new SparkContext(conf)
References
http://spark.apache.org/docs/latest/configuration.html
http://spark.apache.org/docs/latest/hardware-provisioning.html