10.8 Spark Resource Scheduling: Source Code Analysis

心雨先生

Article link: https://blog.csdn.net/u011418530/article/details/81231007

After a Worker starts, it registers itself with the Master.

In code terms, registration simply inserts a record into one of the Master's data structures, a HashSet, which the Master declares as val workers = new HashSet().

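1. The client runs the spark-submit command, which asks the Master for resources to start the Driver process and registers the current Driver's information with the Master. In code terms, a record is inserted into one of the Master's data structures: val waitingDrivers = new ArrayBuffer() gains one entry.

2. The Driver starts and initializes the SparkContext. During initialization two objects are created, DAGScheduler and TaskScheduler. The TaskScheduler asks the Master for resources to start Executors. In code terms, the current Application's information is registered in the Master's data structures, i.e. a record is inserted into val waitingApps = new ArrayBuffer().

3. Whenever one of these Master data structures changes, the source code calls the schedule method, which is the method that performs resource scheduling.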
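The registration flow above can be pictured with a small Python model. This is only a sketch under stated assumptions: `MasterModel` and its method names are hypothetical stand-ins, not Spark's actual Scala classes, and `schedule` here merely counts how often it is triggered instead of launching anything.

```python
from dataclasses import dataclass, field


@dataclass
class MasterModel:
    """Toy stand-in for the Master's bookkeeping (hypothetical, not Spark's class)."""
    workers: set = field(default_factory=set)            # plays the role of the workers HashSet
    waiting_drivers: list = field(default_factory=list)  # plays the role of the Driver ArrayBuffer
    waiting_apps: list = field(default_factory=list)     # plays the role of waitingApps
    schedule_calls: int = 0

    def register_worker(self, worker_id: str) -> None:
        self.workers.add(worker_id)
        self.schedule()

    def register_driver(self, driver_id: str) -> None:
        self.waiting_drivers.append(driver_id)
        self.schedule()

    def register_application(self, app_id: str) -> None:
        self.waiting_apps.append(app_id)
        self.schedule()

    def schedule(self) -> None:
        # The real method would place Drivers and Executors on Workers;
        # this model only records that scheduling was triggered.
        self.schedule_calls += 1


master = MasterModel()
master.register_worker("worker-1")
master.register_driver("driver-1")
master.register_application("app-1")
print(master.schedule_calls)
```

Each registration changes one collection and then triggers scheduling, which is the pattern the three steps describe.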

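spark-submit --master spark://hadoop1:7077 \
  --deploy-mode cluster \
  --executor-cores 2 \
  --executor-memory 1G \
  --total-executor-cores 10 \
  --driver-memory 1G

Here --executor-cores is the number of cores each Executor process uses, --executor-memory is the amount of memory each Executor process uses, and --total-executor-cores is the total number of cores used by all Executors together.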

When memory is sufficient, the number of Executors started is 10 / 2 = 5.
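The arithmetic can be written as a tiny helper. This is a minimal sketch, assuming memory is not the limiting factor; the function name `expected_executor_count` is made up for illustration.

```python
def expected_executor_count(total_executor_cores: int, executor_cores: int) -> int:
    """Number of Executors that fit in the core budget, ignoring memory limits."""
    if executor_cores <= 0:
        raise ValueError("executor_cores must be positive")
    # Integer division: a leftover core that cannot fill a whole Executor is unused.
    return total_executor_cores // executor_cores


# --total-executor-cores 10 with --executor-cores 2
print(expected_executor_count(10, 2))  # 5
```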

Conclusions:

(1) By default, each Worker starts one Executor for the current Application, and that Executor uses all of the Worker's cores.

./spark-submit --master spark://hadoop1:7077 --class org.apache.spark.examples.SparkPi ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 1000

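(2) Provided memory is sufficient, starting more than one Executor on a single Worker requires setting --executor-cores.

For this test, spark-env.sh was modified so that each Worker has 2G of memory and 3 cores.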

./spark-submit --master spark://hadoop1:7077 --executor-cores 1 --class org.apache.spark.examples.SparkPi ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 1000

The number of Executors each Worker can start is given by:

executorNums = Math.min(Worker.sumCores / coresPerExecutor, Worker.sumMemory / memoryPerExecutor)
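As a worked example of this formula, the following Python sketch plugs in the Worker size used above (3 cores, 2G). The function name and the megabyte units are assumptions made for illustration; the `None` branch models the default case from conclusion (1), where a single Executor takes all cores.

```python
from typing import Optional


def executors_per_worker(worker_cores: int,
                         worker_memory_mb: int,
                         memory_per_executor_mb: int,
                         cores_per_executor: Optional[int] = None) -> int:
    """How many Executors one Worker can host for a single Application."""
    if cores_per_executor is None:
        # Default: one Executor that uses every core, as long as its memory fits.
        return 1 if worker_memory_mb >= memory_per_executor_mb else 0
    # Both cores and memory cap the count; the tighter limit wins.
    return min(worker_cores // cores_per_executor,
               worker_memory_mb // memory_per_executor_mb)


# Worker with 3 cores and 2G, --executor-cores 1, default executor memory of 1G
print(executors_per_worker(3, 2048, 1024, cores_per_executor=1))  # 2
# Same Worker without --executor-cores
print(executors_per_worker(3, 2048, 1024))  # 1
```

With 1G per Executor, memory rather than cores is the limit on this Worker: three cores would allow three Executors, but 2G only holds two.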

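(3) By default, Executors are started spread out across the cluster (which helps data locality). What if you do not want them spread out?

new SparkConf().set("spark.deploy.spreadOut","false")

Note that the Spark standalone documentation lists spark.deploy.spreadOut among the Master's properties (set through SPARK_MASTER_OPTS), so it may need to be configured on the Master to take effect.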

./spark-submit --master spark://hadoop1:7077 --executor-cores 1 --total-executor-cores 2 --class org.apache.spark.examples.SparkPi ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 1000
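To make the difference between the two modes concrete, here is a simplified Python model of core assignment. It is a sketch, not the Master's actual algorithm: it ignores memory, the function name `assign_cores` is hypothetical, and the cluster of three Workers is an assumption (the text only says each Worker has 3 cores).

```python
def assign_cores(free_cores, cores_to_assign, cores_per_executor, spread_out):
    """Return the cores assigned on each Worker, in the same order as free_cores."""
    free = list(free_cores)
    assigned = [0] * len(free)
    remaining = cores_to_assign

    def can_launch(i):
        # An Executor fits if both the app's budget and the Worker have enough cores.
        return remaining >= cores_per_executor and free[i] >= cores_per_executor

    usable = [i for i in range(len(free)) if can_launch(i)]
    while usable:
        for i in usable:
            # Spread out: place one Executor, then move to the next Worker.
            # Packed: keep placing Executors here until this Worker is full.
            while can_launch(i):
                free[i] -= cores_per_executor
                assigned[i] += cores_per_executor
                remaining -= cores_per_executor
                if spread_out:
                    break
        usable = [i for i in usable if can_launch(i)]
    return assigned


# Three Workers with 3 free cores each (assumed cluster size),
# --executor-cores 1 --total-executor-cores 2
print(assign_cores([3, 3, 3], 2, 1, spread_out=True))   # [1, 1, 0]
print(assign_cores([3, 3, 3], 2, 1, spread_out=False))  # [2, 0, 0]
```

In the spread-out case the two Executors land on two different Workers; with spreading disabled both end up on the first Worker.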

spark-submit options

--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.

--deploy-mode DEPLOY_MODE

--class

--name

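--jars Paths of dependency jars; code running in both the Driver and the Executors can depend on these jars.

--files The specified files are downloaded to the working directory of every Executor, which by default is the work directory under the Spark installation.

--conf For example, --conf spark.deploy.spreadOut=false can be used in place of new SparkConf().set("spark.deploy.spreadOut","false") in code.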

--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).

--driver-java-options Extra Java options to pass to the driver.

--driver-library-path Extra library path entries to pass to the driver.

--driver-class-path Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath.

--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).

--help, -h Show this help message and exit

--verbose, -v Print additional debug output

--version, Print the version of current Spark

Spark standalone with cluster deploy mode only:

--driver-cores NUM Cores for driver (Default: 1).

Spark standalone or Mesos with cluster deploy mode only:

--supervise If the Driver dies in cluster mode, it is restarted.

Spark standalone and Mesos only:

--total-executor-cores NUM Total cores for all executors.

Spark standalone and YARN only:

--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode, or all available cores on the worker in standalone mode)

YARN-only:

--driver-cores NUM Number of cores used by the driver, only in cluster mode (Default: 1).

--queue QUEUE_NAME Name of the resource queue to submit to.

--num-executors NUM Number of executors to launch (Default: 2).
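For scripted submissions, the options above can be assembled programmatically. The helper below is a hypothetical sketch (the name `build_spark_submit` and its parameters are not part of Spark); it only builds the argument list from flags listed in this section and does not run anything.

```python
import shlex


def build_spark_submit(master, main_class, app_jar, app_args=(),
                       deploy_mode=None, executor_cores=None,
                       executor_memory=None, total_executor_cores=None,
                       conf=None):
    """Assemble a spark-submit argument list from the options described above."""
    argv = ["spark-submit", "--master", master]
    if deploy_mode is not None:
        argv += ["--deploy-mode", deploy_mode]
    if executor_cores is not None:
        argv += ["--executor-cores", str(executor_cores)]
    if executor_memory is not None:
        argv += ["--executor-memory", executor_memory]
    if total_executor_cores is not None:
        argv += ["--total-executor-cores", str(total_executor_cores)]
    # Each property becomes its own "--conf key=value" pair.
    for key, value in (conf or {}).items():
        argv += ["--conf", f"{key}={value}"]
    # The application jar and its arguments must come after all options.
    argv += ["--class", main_class, app_jar, *app_args]
    return argv


argv = build_spark_submit(
    master="spark://hadoop1:7077",
    main_class="org.apache.spark.examples.SparkPi",
    app_jar="../lib/spark-examples-1.6.0-hadoop2.6.0.jar",
    app_args=["1000"],
    executor_cores=1,
    total_executor_cores=2,
    conf={"spark.deploy.spreadOut": "false"},
)
print(shlex.join(argv))
```

Keeping the command as a list avoids shell quoting problems if it is later passed to something like `subprocess.run`.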
