Deploying applications with spark-submit, custom partitioners, checkpoint, and shared variables

Geek白先生

Category: Spark. Tags: spark-submit

Article link: https://blog.csdn.net/weixin_43699817/article/details/100822352

Deploying applications with spark-submit

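Whichever cluster manager you use, you can submit your application to it with spark-submit.
Through its configuration options, spark-submit connects to the corresponding cluster manager and controls how many resources the application uses.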

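Commonly used options:

--master specifies the cluster manager to connect to. Accepted values:

spark://host:port  Connect to a Spark standalone cluster at the given port. By default the standalone master uses port 7077.
mesos://host:port  Connect to a Mesos cluster at the given port. By default the Mesos master listens on port 5050.
yarn               Connect to a YARN cluster. When running on YARN, the HADOOP_CONF_DIR environment variable must point to the Hadoop configuration directory so that Spark can find the cluster information.
local              Run in local mode with a single core.
local[N]           Run in local mode with N cores.
local[*]           Run in local mode with as many cores as the machine has.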

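--deploy-mode chooses whether the driver program runs in "client" or "cluster" mode.
With "client", the driver runs on the machine where spark-submit was invoked; with "cluster", the driver is launched on a node inside the cluster.

For example, with Spark on YARN:
    Running {spark-submit --master yarn --deploy-mode client} on node slave1 starts the driver on slave1, inside the spark-submit process.
    Running {spark-submit --master yarn --deploy-mode cluster} on node slave1 starts the driver on one of the NodeManager nodes, inside the YARN ApplicationMaster.
    Running {spark-shell --master yarn --deploy-mode cluster} fails with the following error:
        Error: Cluster deploy mode is not applicable to Spark shells. (spark-shell is itself an interactive client, so its driver has to run locally)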

--executor-memory
Memory for each executor process of this submission, given as a size with a unit suffix (e.g. 1000M, 2G).

--driver-memory
Memory for the driver process, in the same format (e.g. 1000M, 2G).

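--class    The main class of the application when running a Java or Scala program.
--name     The display name of the application; it is shown in the Spark web UI.
--jars     A list of JAR files to upload and place on the application's CLASSPATH.
           If the application depends on a small number of third-party JARs, they can be listed here.
--files    A list of files to place in the application's working directory. This is typically used for data files that must be distributed to every node.
--py-files A list of files to add to the PYTHONPATH. It may contain .py, .egg, and .zip files.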
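
Several of the options above correspond to Spark configuration properties, so they can also be set in code. The following is a minimal sketch of that correspondence, assuming spark-core is on the classpath. The object name, the application name, the memory size, and the JAR path are placeholders chosen for illustration and are not taken from the original article.

```scala
import org.apache.spark.SparkConf

object SubmitOptionsSketch {
  // Builds a SparkConf that mirrors a few spark-submit flags.
  // Every value below is a placeholder.
  def build(): SparkConf =
    new SparkConf(false)                      // false: do not load spark.* system properties
      .setMaster("local[2]")                  // same as --master local[2]
      .setAppName("submit-options-sketch")    // same as --name
      .set("spark.executor.memory", "2g")     // same as --executor-memory 2g
      .set("spark.jars", "/path/to/dep.jar")  // same as --jars (hypothetical path)

  def main(args: Array[String]): Unit =
    // toDebugString prints one key=value pair per line
    println(build().toDebugString)
}
```

Values given on the spark-submit command line are usually preferable for anything environment-specific, because the same JAR can then be submitted to different clusters unchanged. Note also that in client mode the driver memory has to be given with --driver-memory on the command line, since the driver JVM is already running by the time a SparkConf is built.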

Configuration properties in spark-env.sh

[Spark on YARN mode]
      SPARK_EXECUTOR_INSTANCES, Number of executors to start (Default: 2)
         -- Total number of executor processes. More executors allow more parallelism, up to what the cluster's resources can support.
      SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
         -- Number of cores per executor; this determines how many tasks one executor can run in parallel.
      SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
         -- Memory for each executor; a unit suffix can be given.
      SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)
         -- Memory for the driver process.

[Spark Standalone mode]
      SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
         -- Hostname or IP address of the master.
      SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
         -- Ports of the master. The service port defaults to 7077 and the web UI port to 8080.
      SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
         -- Configuration properties that apply only to the master.
      SPARK_WORKER_CORES, to set the number of cores to use on this machine
         -- Number of cores the worker may use on this machine.
      SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
         -- Total memory the worker may hand out to executors on this machine (not the size of a single executor).
      SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
         -- Ports of the worker. The web UI port defaults to 8081.
      SPARK_WORKER_INSTANCES, to set the number of worker processes per node
         -- Number of worker processes to start on each node.
      SPARK_WORKER_DIR, to set the working directory of worker processes
         -- Directory in which the worker runs applications; it holds their logs and scratch space.
      SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
         -- Configuration properties that apply only to the worker.
      SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
         -- Memory for the daemon processes themselves; defaults to 1G.
      SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
         -- Configuration properties that apply only to the history server.
      SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
         -- Configuration properties that apply only to the external shuffle service.
      SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
         -- JVM options that apply to all Spark daemons.
      SPARK_PUBLIC_DNS, to set the public dns name of the master or workers
         -- The public DNS name that the master or worker advertises (this is not the address of a DNS server).

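Configuring resource allocation parameters

In Standalone mode, by default each application gets at most one executor per worker node, and that executor takes all the cores available on the worker.
If --executor-cores (spark.executor.cores) is set, one worker can host several executors of the same application.
Resource allocation parameters:

 --executor-memory      Memory used by each executor
 --total-executor-cores Total number of cores used by the application, summed over all of its executors

In YARN mode, the resource allocation parameters are:

--num-executors   Number of executors
--executor-memory Memory for each executor
--executor-cores  Number of cores for each executor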

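Tuning case study

The Spark cluster has 6 nodes, each with 64G of memory and 16 cores.
Task: submit an application on YARN and write out the resource settings.
The schemes below assume that 1 core and 1G of memory per node are held back (for example for the operating system and Hadoop daemons), leaving 15 cores and 63G per node.

1. spark-submit --master yarn \
     --deploy-mode cluster \
     --num-executors 6 \
     --executor-memory 63G \
     --executor-cores 15

   One executor per node, taking all 15 cores and all 63G of that node.

(Improved) 2. spark-submit --master yarn \
     --deploy-mode cluster \
     --num-executors 29 \
     --executor-memory 12G \
     --executor-cores 3

   --num-executors 29:   start with 5 executors per node, 5*6=30, then leave one slot for the driver, giving 29.
   --executor-memory 12G: with 5 executors per node, 63G/5=12.6G, rounded down to 12G per executor.
   --executor-cores 3:   15 cores per node shared by 5 executors, 15/5=3 cores each.

(Best) 3. spark-submit --master yarn \
     --deploy-mode cluster \
     --num-executors 17 \
     --executor-memory 20G \
     --executor-cores 5

   --executor-cores 5:   fix the parallelism of each executor at 5 first.
   --num-executors 17:   15 cores per node at 5 cores each means 3 executors per node, 3*6=18, then leave one slot for the driver, giving 17.
   --executor-memory 20G: 63G per node shared by 3 executors gives an upper bound of 63G/3=21G. YARN also adds a memory overhead on top of each executor's heap (spark.executor.memoryOverhead), so the value has to stay below this bound and may need to be lower than 20G in practice.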
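
The arithmetic behind schemes 2 and 3 can be written down as a small calculation. The following is a sketch in plain Scala with no Spark dependency. The names ExecutorSizing, Plan, and plan are made up for this illustration, and the 10% overhead fraction is an assumption based on the usual default of spark.executor.memoryOverhead, not a figure from the original article.

```scala
object ExecutorSizing {
  // Result of the sizing arithmetic for one choice of cores per executor.
  final case class Plan(
      executorsPerNode: Int,    // executors that fit on one node by core count
      numExecutors: Int,        // value for --num-executors (one slot left for the driver)
      memCeilingGb: Int,        // usable memory per node divided by executors per node
      heapAfterOverheadGb: Int  // ceiling reduced by the assumed YARN memory overhead
  )

  def plan(nodes: Int, usableCores: Int, usableMemGb: Int,
           coresPerExecutor: Int, overheadFraction: Double = 0.10): Plan = {
    val perNode = usableCores / coresPerExecutor
    val total = perNode * nodes - 1  // reserve one slot for the driver
    val ceiling = usableMemGb / perNode
    // heap + overheadFraction * heap must not exceed the ceiling
    val heap = (ceiling / (1.0 + overheadFraction)).toInt
    Plan(perNode, total, ceiling, heap)
  }

  def main(args: Array[String]): Unit = {
    println(plan(nodes = 6, usableCores = 15, usableMemGb = 63, coresPerExecutor = 3))
    println(plan(nodes = 6, usableCores = 15, usableMemGb = 63, coresPerExecutor = 5))
  }
}
```

With 3 cores per executor this gives 5 executors per node, 29 executors in total, and a 12G ceiling. With 5 cores per executor it gives 3 per node, 17 in total, and a 21G ceiling. Under the assumed 10% overhead, the heap that fits under the 21G ceiling is about 19G.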

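Custom partitioners

To implement a custom partitioner, extend the Partitioner class and override the following:
numPartitions: Int  [returns the number of partitions to create]
getPartition(key: Any): Int  [returns the partition ID (0 to numPartitions-1) for the given key]
equals()  [the standard Java equality method] Spark uses this method to check whether your partitioner object is the same as another partitioner instance, which is how it decides whether two RDDs are partitioned in the same way.

Example: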

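import org.apache.spark.{Partitioner, SparkConf, SparkContext}

class UsridPartitioner(numParts: Int) extends Partitioner {
  // Number of partitions
  override def numPartitions: Int = numParts
  // Partition ID for a key; floorMod keeps the result in 0 until numParts, even for negative keys
  override def getPartition(key: Any): Int =
    Math.floorMod(key.toString.toInt, numParts)
  // Two UsridPartitioners with the same partition count partition data identically
  override def equals(other: Any): Boolean = other match {
    case p: UsridPartitioner => p.numPartitions == numPartitions
    case _ => false
  }
  override def hashCode: Int = numPartitions
}

object Test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("Custom partitioner")
    val sc = new SparkContext(conf)
    val data = sc.parallelize(1 to 10, 5) // simulate data spread over 5 partitions
    // Repartition into 10 partitions by last digit and write them out as 10 files
    data.map((_, 1)).partitionBy(new UsridPartitioner(10)).saveAsTextFile("/chenm/partition")
    sc.stop()
  }
}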
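
Because getPartition must return a value between 0 and numPartitions-1, it matters how the remainder is computed when a key can be negative: on the JVM, the % operator keeps the sign of the dividend. The following plain Scala sketch compares the two ways of computing the index. The object and method names are made up for illustration.

```scala
object PartitionIndexSketch {
  // Naive index: may be negative, because % keeps the sign of the dividend.
  def naive(key: Int, numParts: Int): Int = key % numParts

  // Safe index: always falls in 0 until numParts.
  def safe(key: Int, numParts: Int): Int = Math.floorMod(key, numParts)

  def main(args: Array[String]): Unit = {
    val keys = Seq(1, 10, 23, -3)
    keys.foreach { k =>
      println(s"key=$k naive=${naive(k, 10)} safe=${safe(k, 10)}")
    }
  }
}
```

For non-negative keys both functions agree. For the key -3 with 10 partitions, the naive version yields -3, which is not a valid partition ID, while floorMod yields 7.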

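Checkpoint

cache() keeps an RDD in memory but leaves its lineage (the chain of dependencies) untouched.
checkpoint materializes an RDD:
a. The RDD is written to disk in binary form, normally on a reliable file system such as HDFS.
b. Once materialized, the RDD's dependencies on its parent RDDs are dropped; the RDD is then read back from the checkpoint files.
Points to note when checkpointing:
a. A checkpoint directory must be set with sc.setCheckpointDir(). There is no default, and an exception is thrown if none is set.
b. checkpoint() is lazy, like a transformation: the RDD is only materialized after an action has run, and the RDD is then computed a second time in order to write the checkpoint.
c. It is therefore common to call cache() before checkpoint(), so that the cached data is written straight to the checkpoint directory instead of being recomputed.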
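
The order of calls described above can be sketched as follows, assuming spark-core is on the classpath and the job runs in local mode. The object name and the temporary checkpoint directory are choices made for this illustration; on a real cluster the directory would normally be an HDFS path.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  // Returns (lineage before the action, lineage after the action, isCheckpointed).
  def run(): (String, String, Boolean) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("checkpoint-sketch")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical location: a temporary local directory
      sc.setCheckpointDir(Files.createTempDirectory("spark-ckpt").toString)

      val squares = sc.parallelize(1 to 100, 4).map(x => x * x)
      squares.cache()       // lets the checkpoint job reuse the cached data
      squares.checkpoint()  // only marks the RDD; nothing is written yet
      val before = squares.toDebugString

      squares.count()       // the first action triggers the checkpoint
      (before, squares.toDebugString, squares.isCheckpointed)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (before, after, done) = run()
    println(s"checkpointed: $done")
    println(s"lineage before:\n$before")
    println(s"lineage after:\n$after")
  }
}
```

Comparing the two lineage strings shows the effect of point b in the first list: after the action, the lineage no longer leads back to the original parallelized collection.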

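Spark shared variables

There are two kinds of shared variables, broadcast variables and accumulators. They let tasks share access to a variable.
Broadcast variables
A broadcast variable ships a read-only value to every machine and caches it there, much like the distributed cache in Hadoop.
Characteristics: a. read-only; b. cached on each node
Creation: call sc.broadcast() to create a broadcast variable; read it with bc.value
Accumulators: similar to counters in Hadoop. Tasks can only "add" to them, which makes them suitable for counting and statistics; the total is read on the driver with acc.value.
Creation: old API: val acc = sc.accumulator(0) (deprecated since Spark 2.0);
new API: val acc = sc.longAccumulator, updated with acc.add(v: Long)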
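
Both kinds of shared variable can be seen together in a small lookup job. This is a sketch assuming spark-core 2.x or later on the classpath and local mode. The object name, the lookup table, and the accumulator name are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesSketch {
  // Returns the resolved names and the number of codes missing from the lookup table.
  def run(): (Seq[String], Long) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("shared-variables-sketch")
    val sc = new SparkContext(conf)
    try {
      // Read-only table shipped once to each executor
      val lookup = sc.broadcast(Map("cn" -> "China", "us" -> "United States"))
      // Counter that tasks can only add to; the driver reads the total
      val unknown = sc.longAccumulator("unknownCodes")

      val codes = sc.parallelize(Seq("cn", "us", "xx", "cn"), 2)
      val names = codes.map { code =>
        lookup.value.get(code) match {
          case Some(name) => name
          case None =>
            // Updates made inside a transformation can be applied again if a task is retried
            unknown.add(1L)
            "?"
        }
      }.collect().toSeq

      (names, unknown.value.longValue)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (names, missing) = run()
    println(names.mkString(", "))
    println(s"unknown codes: $missing")
  }
}
```

When an exact count matters, it is safer to update the accumulator inside an action such as foreach, since Spark only guarantees that updates made in actions are applied once per task.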