Spark Configuration

有数的编程笔记

Article link: https://blog.csdn.net/qq_33446500/article/details/109064383

Contents

1. Spark Configuration
2. Overriding the Configuration Directory
3. Inheriting Hadoop Cluster Configuration
4. Custom Hadoop/Hive Configuration

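1. Spark Configuration

Spark provides three locations for configuring the system:

Spark properties control most application parameters and can be set through a SparkConf object, bin/spark-submit options, the conf/spark-defaults.conf file, or Java system properties.
Environment variables set per-machine settings, such as the IP address, through the conf/spark-env.sh script on each node.
Logging is configured through log4j.properties.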

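1.1. Spark Properties

Spark properties control most application settings and are configured separately for each application. They can be set directly on the SparkConf passed to the SparkContext. SparkConf has dedicated setters for some common properties (such as the master URL and the application name) and accepts arbitrary key-value pairs through the set() method. For example, the following initializes an application with a master URL and an application name: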

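import org.apache.spark.{SparkConf, SparkContext}

// Run locally with two threads under the application name "CountingSheep".
val conf = new SparkConf()
    .setMaster("local[2]")
    .setAppName("CountingSheep")
val sc = new SparkContext(conf)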

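Properties that specify a duration should be configured with a time unit. The following formats are accepted:

Unit	Format (example)
Milliseconds	25ms
Seconds	5s
Minutes	10m or 10min
Hours	3h
Days	5d
Years	1y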

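Properties that specify a byte size should be configured with a size unit. The following formats are accepted:

Unit	Format (example)
Bytes	1b
Kibibytes (1024 bytes)	1k or 1kb
Mebibytes (1024 kibibytes)	1m or 1mb
Gibibytes (1024 mebibytes)	1g or 1gb
Tebibytes (1024 gibibytes)	1t or 1tb
Pebibytes (1024 tebibytes)	1p or 1pb

Numbers without a unit are generally interpreted as bytes, although a few properties interpret them as KiB or MiB. See the documentation of each property. Where possible, specify the unit explicitly.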
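
To make the suffix table concrete, here is a minimal shell sketch that converts a size string into bytes. It is not Spark's own parser: the function name to_bytes is hypothetical, only lower-case suffixes up to gigabytes are handled (larger units follow the same pattern), a bare number is assumed to mean bytes, and the shell is assumed to use 64-bit arithmetic.

```shell
# Illustrative only: mimics the suffix table above, not Spark's real parser.
to_bytes() {
  value=$1
  number=${value%%[!0-9]*}      # leading digits, e.g. "4" from "4g"
  suffix=${value#"$number"}     # what follows the digits, e.g. "g"
  if [ -z "$number" ]; then
    echo "no numeric part in: $value" >&2
    return 1
  fi
  case $suffix in
    ""|b)  factor=1 ;;                          # assumption: bare number = bytes
    k|kb)  factor=1024 ;;
    m|mb)  factor=$((1024 * 1024)) ;;
    g|gb)  factor=$((1024 * 1024 * 1024)) ;;
    *)     echo "unsupported size: $value" >&2; return 1 ;;
  esac
  echo $((number * factor))
}

to_bytes 1kb    # prints 1024
to_bytes 10m    # prints 10485760
```

Writing the unit explicitly, as the text recommends, avoids depending on the per-property default that this sketch glosses over.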

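1.1.1. Dynamically Loading Spark Properties

In some cases you may want to avoid hard-coding certain configurations in a SparkConf, for example to run the same application with different masters or different amounts of memory. Spark allows you to simply create an empty SparkConf: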

val sc = new SparkContext(new SparkConf())

You can then supply configuration values as options when submitting the application:

./bin/spark-submit \
  --name "My app" \
  --master local[4] \
  --conf spark.eventLog.enabled=false \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  myApp.jar

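The bin/spark-shell and bin/spark-submit tools support two ways of loading configuration dynamically.

The first is command-line options, such as --master.
spark-submit accepts any Spark property through the --conf flag, but uses dedicated flags for properties that play a part in launching the Spark application. Run bin/spark-submit --help to see the full list of supported options.
The second is reading configuration options from conf/spark-defaults.conf, in which each line consists of a key and a value separated by whitespace. For example: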
spark.master spark://5.6.7.8:7077
spark.executor.memory 4g
spark.eventLog.enabled true
spark.serializer org.apache.spark.serializer.KryoSerializer
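
As a rough illustration of the "key, whitespace, value" layout, the sketch below looks up a key in such a file with awk. The helper name conf_get is hypothetical, and the sketch only handles the whitespace-separated form shown above plus # comment lines; it is not how Spark itself loads the file.

```shell
# Print the value of a key from a spark-defaults.conf style file.
# Usage: conf_get FILE KEY
conf_get() {
  awk -v key="$2" '
    /^[ \t]*(#|$)/ { next }                      # skip comments and blank lines
    $1 == key { $1 = ""; sub(/^[ \t]+/, ""); value = $0 }
    END { if (value != "") print value }         # last occurrence wins
  ' "$1"
}

# Small demo against a throwaway file.
conf_file=$(mktemp)
cat > "$conf_file" <<'EOF'
# sample defaults
spark.master spark://5.6.7.8:7077
spark.executor.memory 4g
spark.eventLog.enabled true
EOF
conf_get "$conf_file" spark.executor.memory   # prints 4g
rm -f "$conf_file"
```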

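Any values given as command-line flags or in conf/spark-defaults.conf are passed to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then the options passed to bin/spark-shell or bin/spark-submit, and finally the options in conf/spark-defaults.conf.
A few configuration keys have been renamed since earlier versions of Spark. In such cases the older key names are still accepted, but take lower precedence than any instance of the newer key.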
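
The precedence order can be modelled as "first non-empty value wins". The sketch below is only a toy model of that rule, not Spark code; resolve_property is a hypothetical helper whose arguments are given in order of decreasing priority (SparkConf, command line, spark-defaults.conf).

```shell
# Print the first non-empty argument; arguments are ordered from highest
# to lowest priority. Returns 1 if every source is empty.
resolve_property() {
  for candidate in "$@"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Not set on SparkConf, "2g" on the command line, "4g" in spark-defaults.conf:
resolve_property "" "2g" "4g"   # prints 2g
```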

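Spark properties fall mainly into two kinds:

The first kind is deployment-related, such as spark.driver.memory and spark.executor.instances. Such properties may not take effect when set programmatically through SparkConf at runtime, or their behavior may depend on the cluster manager and deploy mode you choose.
It is therefore recommended to set them through conf/spark-defaults.conf or bin/spark-submit command-line options.
The second kind mainly controls Spark runtime behavior, such as spark.task.maxFailures. Properties of this kind can be set in either way.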

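1.1.2. Viewing Spark Properties

The application web UI at http://<driver>:4040 lists Spark properties in its Environment tab. This is a useful place to check that your properties have been set correctly.
Note that only values explicitly specified through conf/spark-defaults.conf, SparkConf, or the command line appear there. For all other configuration properties, you can assume the default value is used.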

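1.2. Environment Variables

Certain Spark settings can be configured through environment variables, which are read from the $SPARK_HOME/conf/spark-env.sh script. In Standalone and Mesos modes, this file can supply machine-specific information such as the hostname. It is also sourced when running local Spark applications or the bin/spark-submit script.
Note that this file does not exist by default. Create it by copying conf/spark-env.sh.template to conf/spark-env.sh, and make sure the copy is executable.
Because conf/spark-env.sh is a shell script, some of these variables can be set programmatically. For example, SPARK_LOCAL_IP can be computed by looking up the IP of a specific network interface.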
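
One way to compute SPARK_LOCAL_IP inside spark-env.sh might look like the sketch below. It assumes a Linux host with the iproute2 ip command; the helper first_ipv4 and the variable SPARK_IFACE (defaulting to eth0) are hypothetical names introduced only for this illustration.

```shell
# Read `ip -o -4 addr show` output on stdin and print the first IPv4 address.
first_ipv4() {
  awk '$3 == "inet" { split($4, parts, "/"); print parts[1]; exit }'
}

# Hypothetical knob: which interface Spark should bind to.
SPARK_IFACE=${SPARK_IFACE:-eth0}

# Only export SPARK_LOCAL_IP when an address was actually found.
if command -v ip >/dev/null 2>&1; then
  detected_ip=$(ip -o -4 addr show dev "$SPARK_IFACE" 2>/dev/null | first_ipv4)
  if [ -n "$detected_ip" ]; then
    SPARK_LOCAL_IP=$detected_ip
    export SPARK_LOCAL_IP
  fi
fi
```

Leaving SPARK_LOCAL_IP unset when the lookup fails keeps Spark's default behavior instead of binding to an empty address.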

The following variables can be set in spark-env.sh:

Environment variable	Description
JAVA_HOME	Location where Java is installed (set it if Java is not on the default PATH).
PYSPARK_PYTHON	Python binary executable to use for PySpark in both driver and workers (default is python2.7 if available, otherwise python). The property spark.pyspark.python takes precedence if it is set.
PYSPARK_DRIVER_PYTHON	Python binary executable to use for PySpark in the driver only (default is PYSPARK_PYTHON). The property spark.pyspark.driver.python takes precedence if it is set.
SPARKR_DRIVER_R	R binary executable to use for the SparkR shell (default is R). The property spark.r.shell.command takes precedence if it is set.
SPARK_LOCAL_IP	IP address of the machine to bind to.
SPARK_PUBLIC_DNS	Hostname that the Spark program advertises to other machines.

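In addition to the above, there are optional environment variables for Spark Standalone, such as the number of cores and the maximum memory to use on each machine.

For the environment variables available with Spark on YARN, see the third section (Options read in YARN client/cluster mode) of the template listing below.
When running Spark on YARN in cluster mode, environment variables must be set through the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in conf/spark-defaults.conf. Environment variables set in conf/spark-env.sh are not reflected in the YARN Application Master process in cluster mode. See the YARN-related Spark properties for more information.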

# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program

# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos

# Options read in YARN client/cluster mode
# - SPARK_CONF_DIR, Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - YARN_CONF_DIR, to point Spark towards YARN configuration files when you use YARN
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)

# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_MASTER_WEBUI_PORT, Port for the master web UI (default: 8080).
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_WEBUI_PORT, Port for the worker web UI (default: 8081).
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_DAEMON_CLASSPATH, to set the classpath for all daemons
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers
# - SPARK_LOCAL_DIRS, Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. 

# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR       Where log files are stored.  (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)
# - SPARK_NO_DAEMONIZE  Run the proposed command in the foreground. It will not output a PID file.

# Options for native BLAS, like Intel MKL, OpenBLAS, and so on.
# You might get better performance to enable these options if using native BLAS (see SPARK-21305).
# - MKL_NUM_THREADS=1        Disable multi-threading of Intel MKL
# - OPENBLAS_NUM_THREADS=1   Disable multi-threading of OpenBLAS
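
For the YARN cluster-mode case mentioned before the template listing, the property name is built by appending the variable name to spark.yarn.appMasterEnv. The sketch below merely formats such a setting as a --conf argument; the helper name app_master_env_conf and the JAVA_HOME path are hypothetical.

```shell
# Format an Application Master environment variable as a spark-submit
# --conf argument. Usage: app_master_env_conf NAME VALUE
app_master_env_conf() {
  printf '%s\n' "--conf spark.yarn.appMasterEnv.$1=$2"
}

app_master_env_conf JAVA_HOME /opt/jdk   # hypothetical path
# prints: --conf spark.yarn.appMasterEnv.JAVA_HOME=/opt/jdk
```

The same key and value could instead be written as a line in conf/spark-defaults.conf, which is the place the text above names.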

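2. Overriding the Configuration Directory

To use a configuration directory other than the default $SPARK_HOME/conf, set the SPARK_CONF_DIR environment variable. Spark will read its configuration files (spark-defaults.conf, spark-env.sh, log4j.properties, etc.) from that directory.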
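
A small sketch of switching the configuration directory before launching Spark follows. The helper use_conf_dir is a hypothetical name; it only validates the directory and exports SPARK_CONF_DIR, which is the variable Spark itself reads.

```shell
# Point Spark at an alternate configuration directory.
# Usage: use_conf_dir DIR
use_conf_dir() {
  if [ ! -d "$1" ]; then
    echo "not a directory: $1" >&2
    return 1
  fi
  SPARK_CONF_DIR=$1
  export SPARK_CONF_DIR
}

# Demo with a throwaway directory holding one per-job default.
job_conf_dir=$(mktemp -d)
printf 'spark.executor.memory 2g\n' > "$job_conf_dir/spark-defaults.conf"
use_conf_dir "$job_conf_dir"
echo "SPARK_CONF_DIR=$SPARK_CONF_DIR"
```

Any bin/spark-submit started from the same shell afterwards would pick up the files in that directory instead of $SPARK_HOME/conf.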

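3. Inheriting Hadoop Cluster Configuration

If you plan to read from and write to HDFS with Spark, two Hadoop configuration files should be on Spark's classpath:

hdfs-site.xml, which provides default behaviors for the HDFS client.
core-site.xml, which sets the default file system name.

To make these files visible to Spark, set the HADOOP_CONF_DIR environment variable in $SPARK_HOME/conf/spark-env.sh to the path of the Hadoop configuration files.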
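
A fragment for spark-env.sh along these lines might check that both files are present before exporting the variable. This is only a sketch: the helper set_hadoop_conf_dir is a hypothetical name, and /etc/hadoop/conf is an assumed location that varies between distributions.

```shell
# Export HADOOP_CONF_DIR only if the directory holds both required files.
# Usage: set_hadoop_conf_dir DIR
set_hadoop_conf_dir() {
  for required in core-site.xml hdfs-site.xml; do
    if [ ! -f "$1/$required" ]; then
      echo "missing $1/$required" >&2
      return 1
    fi
  done
  HADOOP_CONF_DIR=$1
  export HADOOP_CONF_DIR
}

# Assumed path; a failed check leaves HADOOP_CONF_DIR untouched.
set_hadoop_conf_dir /etc/hadoop/conf 2>/dev/null || true
```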

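4. Custom Hadoop/Hive Configuration

If your Spark application interacts with Hadoop, Hive, or both, the Hadoop/Hive configuration files need to be on Spark's classpath.
Different applications may need different Hadoop/Hive client-side configurations. You can give each application its own configuration by copying and modifying hdfs-site.xml, core-site.xml, yarn-site.xml, and hive-site.xml on that application's classpath.

With Spark on YARN, these configuration files are set cluster-wide and cannot safely be changed by the application.
A better choice is to use Spark Hadoop properties of the form spark.hadoop.* and Spark Hive properties of the form spark.hive.*. For example, setting spark.hadoop.abc.def=xyz sets the Hadoop property abc.def=xyz. These can be treated the same as normal Spark properties set in $SPARK_HOME/conf/spark-defaults.conf.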
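
The mapping for the spark.hadoop.* form amounts to dropping the prefix. The sketch below illustrates just that renaming on a key=value string; to_hadoop_property is a hypothetical helper, not part of Spark.

```shell
# Turn "spark.hadoop.<key>=<value>" into the Hadoop-side "<key>=<value>".
# Returns 1 for anything that does not carry the spark.hadoop. prefix.
to_hadoop_property() {
  case $1 in
    spark.hadoop.*) printf '%s\n' "${1#spark.hadoop.}" ;;
    *) return 1 ;;
  esac
}

to_hadoop_property spark.hadoop.abc.def=xyz   # prints abc.def=xyz
```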

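In some cases you may want to avoid hard-coding certain configurations in a SparkConf. For instance, Spark allows you to create an otherwise empty SparkConf and set Spark, Spark Hadoop, or Spark Hive properties on it: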

val conf = new SparkConf().set("spark.hadoop.abc.def","xyz")
val sc = new SparkContext(conf)

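To avoid hard-coding the values entirely, you can add the configuration through command-line options when submitting the application: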

./bin/spark-submit \
  --name "My app" \
  --master local[4] \
  --conf spark.eventLog.enabled=false \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --conf spark.hadoop.abc.def=xyz \
  myApp.jar
