Spark Configuration

Source: Spark Configuration

Spark Properties

Three concrete ways to set properties:
  1. SparkConf
  2. bin/spark-submit
  3. the conf/spark-defaults.conf file
Precedence: SparkConf > flags passed to spark-submit or spark-shell > the spark-defaults.conf file. The effective configuration is the merge of all three.


Spark properties configure each application separately; for example, an application initialized to run locally with 2 threads:

val conf = new SparkConf()
             .setMaster("local[2]")
             .setAppName("CountingSheep")
val sc = new SparkContext(conf)

Time and byte-size properties should be given with a unit, for example:

25ms (milliseconds)
5s (seconds)
10m or 10min (minutes)
3h (hours)
5d (days)
1y (years)
1b (bytes)
1k or 1kb (kibibytes = 1024 bytes)
1m or 1mb (mebibytes = 1024 kibibytes)
1g or 1gb (gibibytes = 1024 mebibytes)
1t or 1tb (tebibytes = 1024 gibibytes)
1p or 1pb (pebibytes = 1024 tebibytes)
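
As a minimal sketch (the property names are standard Spark settings, but the values are purely illustrative), units are passed as part of the string value:

import org.apache.spark.SparkConf

// Illustrative values only -- adjust for your own workload.
val conf = new SparkConf()
             .setAppName("UnitExample")
             .set("spark.executor.memory", "4g")    // byte-size property, given with a unit
             .set("spark.network.timeout", "120s")  // duration property, given with a unit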

Dynamically Loading Spark Properties

You can create an empty conf:

val sc = new SparkContext(new SparkConf())

and then supply configuration values at runtime:

./bin/spark-submit --name "My app" --master local[4] --conf spark.eventLog.enabled=false \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar

bin/spark-submit also reads configuration options from conf/spark-defaults.conf, in which each line consists of a key and a value separated by whitespace:

spark.master            spark://5.6.7.8:7077
spark.executor.memory   4g
spark.eventLog.enabled  true
spark.serializer        org.apache.spark.serializer.KryoSerializer


Viewing Spark Properties

The application web UI at http://<driver>:4040 lists Spark properties in the "Environment" tab; use it to verify that the properties you submitted took effect as intended.
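
Besides the web UI, a quick sketch for inspecting the effective configuration programmatically (only explicitly set values appear; built-in defaults are not listed):

// Print every property set via SparkConf, spark-submit flags, or spark-defaults.conf.
sc.getConf.getAll.foreach { case (k, v) => println(s"$k=$v") }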

Available Properties

Most of the properties that control internal settings have reasonable default values. Some of the most common options to set are:

Application Properties

  • spark.driver.maxResultSize: limits the total size of serialized results of a Spark action (e.g. collect) sent to the driver; a job is aborted if the total size exceeds this limit (see the sketch after this list).
  • spark.memory.fraction:0.6  Fraction of (heap space - 300MB) used for execution and storage. The lower this is, the more frequently spills and cached data eviction occur. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records. Leaving this at the default value is recommended. For more detail, including important information about correctly tuning JVM garbage collection when increasing this value, see this description.
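
A minimal sketch of setting these two properties on a SparkConf; the values shown are examples, not recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
             .setAppName("AppPropertiesExample")
             .set("spark.driver.maxResultSize", "2g")  // abort jobs whose serialized results exceed 2g
             .set("spark.memory.fraction", "0.6")      // the default; lowering it leaves more heap for user data structures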

Inheriting Hadoop Cluster Configuration

If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that should be included on Spark’s classpath:

  • hdfs-site.xml, which provides default behaviors for the HDFS client.
  • core-site.xml, which sets the default filesystem name.

The location of these configuration files varies across CDH and HDP versions, but a common location is inside of /etc/hadoop/conf. Some tools, such as Cloudera Manager, create configurations on-the-fly, but offer mechanisms to download copies of them.

To make these files visible to Spark, set HADOOP_CONF_DIR in $SPARK_HOME/conf/spark-env.sh to a location containing the configuration files.
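
For example, assuming the common /etc/hadoop/conf location mentioned above, a single line in $SPARK_HOME/conf/spark-env.sh is enough:

export HADOOP_CONF_DIR=/etc/hadoop/conf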
