What value should spark.streaming.concurrentJobs be set to?

While tuning a Spark Streaming application recently, I came across spark.streaming.concurrentJobs, a parameter that increases job parallelism. Its default value is 1. After raising it to 2 (configured in spark-defaults), the streaming application UI shows two Active Jobs when processing is slow (with the default value there is only one), meaning two batches of streaming jobs can execute at the same time. The rest of this post analyzes how this parameter affects streaming execution.

Where the parameter is read

The parameter is read in Spark Streaming's JobScheduler, at line 47:

private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
private val jobExecutor =  ThreadUtils.newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")

The concurrentJobs parameter is used to initialize the jobExecutor thread pool, so it directly determines the number of threads in the job executor pool.

job executor

The job executor thread pool is used to execute JobHandler threads. The JobScheduler also holds a job container, jobSets:

private val jobSets: java.util.Map[Time, JobSet] = new ConcurrentHashMap[Time, JobSet]

It maps each batch time to the JobSet generated at that time, and a JobSet contains multiple Jobs.
The JobSet submission logic:

  def submitJobSet(jobSet: JobSet) {
    if (jobSet.jobs.isEmpty) {
      logInfo("No jobs added for time " + jobSet.time)
    } else {
      listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
      jobSets.put(jobSet.time, jobSet)
      jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
      logInfo("Added jobs for time " + jobSet.time)
    }
  }
  • It is easy to see that the capacity of jobExecutor determines how many JobHandler threads can run in the pool at the same time. Since JobHandler is the thread that executes a job, the capacity therefore determines how many jobs can be submitted and run concurrently.
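
To make the pooling behaviour concrete, here is a minimal, Spark-free sketch. It uses java.util.concurrent.Executors instead of Spark's internal ThreadUtils, and the job count and sleep time are made-up values; only numConcurrentJobs "jobs" run at any moment, the rest queue up, just like queued batches in Spark Streaming.

import java.util.concurrent.{Executors, TimeUnit}

object FixedPoolDemo {
  def main(args: Array[String]): Unit = {
    // stand-in for spark.streaming.concurrentJobs
    val numConcurrentJobs = 2
    // Spark uses ThreadUtils.newDaemonFixedThreadPool; a plain fixed pool behaves the same here
    val jobExecutor = Executors.newFixedThreadPool(numConcurrentJobs)

    // Submit 5 "jobs": only numConcurrentJobs of them run at any moment,
    // the rest wait in the pool's queue, like queued batches in Spark Streaming.
    (1 to 5).foreach { i =>
      jobExecutor.execute(new Runnable {
        override def run(): Unit = {
          println(s"job $i started on ${Thread.currentThread().getName}")
          Thread.sleep(1000) // simulate batch processing time
          println(s"job $i finished")
        }
      })
    }

    jobExecutor.shutdown()
    jobExecutor.awaitTermination(1, TimeUnit.MINUTES)
  }
}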

Usage

There are several ways to configure this parameter for a streaming job.
- Modify spark-defaults
This is a global change; all streaming jobs will be affected.
- Add --conf when submitting the streaming job (recommended)
When submitting a job, use the --conf flag to give that job its own value. For example:
bin/spark-submit --master yarn --conf spark.streaming.concurrentJobs=5
This sets the job executor thread pool size for this streaming job to 5; with sufficient resources, 5 batch jobs can execute at the same time.
- Set it in code
Set it on the SparkConf in code (a fuller sketch follows this list):
sparkConf.set("spark.streaming.concurrentJobs", "5");
or
System.setProperty("spark.streaming.concurrentJobs", "5");
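
For completeness, a minimal end-to-end sketch of the code-level approach (the socket source on localhost:9999, the 5-second batch interval, and the app name are placeholder choices, not from the original post):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ConcurrentJobsExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("concurrent-jobs-example")
      // allow up to 5 batch jobs to run at the same time
      .set("spark.streaming.concurrentJobs", "5")

    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999)
    // each output operation produces one job per batch
    lines.count().print()
    lines.map(_.length).print()

    ssc.start()
    ssc.awaitTermination()
  }
}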

 

Using the scheduler mode

  • FIFO scheduling

By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.

 

  • FAIR scheduling

Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
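
When spark.streaming.concurrentJobs is greater than 1, the jobs of several batches compete for the same executors, so the intra-application scheduling mode becomes relevant. A minimal fragment (e.g. for spark-shell or a driver program; the values are placeholders) that enables FAIR scheduling together with concurrent jobs:

import org.apache.spark.SparkConf

// With FAIR mode, concurrently running batch jobs share resources round-robin
// instead of the first job taking priority (the FIFO default).
val conf = new SparkConf()
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.streaming.concurrentJobs", "2")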

 

Choosing the value of spark.streaming.concurrentJobs

It is usually set to the number of jobs in the JobSet, or to twice that number.

spark.streaming.concurrentJobs=size(jobSet)    // just enough, nothing wasted

spark.streaming.concurrentJobs=size(jobSet) * 2    // reserve some headroom to prevent backlog

If all the jobs of a batch can finish within one batch interval, configuring multiple job execution threads does nothing and is somewhat wasteful. If the JobSet's jobs cannot all finish within one batch interval, then running several jobs concurrently helps, provided the resources allocated to the application are sufficient.
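
As a reminder of where size(jobSet) comes from: each output operation on a DStream generates one job per batch. A minimal sketch (the two foreachRDD actions and the output path are placeholders) in which every batch produces a JobSet of two jobs, so spark.streaming.concurrentJobs = 2 (or 4 with headroom) would be the natural starting point:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

object JobSetSizeExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("jobset-size-example")
    val ssc = new StreamingContext(conf, Seconds(10))

    val events = ssc.socketTextStream("localhost", 9999)

    // Two output operations => two jobs in every batch's JobSet
    events.foreachRDD { (rdd, time: Time) =>
      rdd.saveAsTextFile(s"/tmp/raw-events-${time.milliseconds}")   // job 1
    }
    events.map(_.toUpperCase).foreachRDD { rdd =>
      println(s"upper-cased records in this batch: ${rdd.count()}") // job 2
    }

    ssc.start()
    ssc.awaitTermination()
  }
}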

 

 

References:

https://blog.csdn.net/qq_16146103/article/details/107591802#comments_13437962

http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application

https://blog.csdn.net/xueba207/article/details/51152627

https://www.jianshu.com/p/ab3810a4de97