Issues with setting Flink parallelism

I previously wrote an article introducing Flink parallelism: https://blog.csdn.net/L13763338360/article/details/106632612

Parallelism can be set at several levels; from highest to lowest priority they are (see the sketch after this list):

  • Operator level
  • Execution environment level
  • Command-line level
  • Configuration file level
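
A minimal DataStream sketch of the first two levels (the topology here is made up for illustration; the last two levels live outside the code):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ParallelismLevels {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Execution environment level: the default for every operator in this job
            env.setParallelism(4);

            env.fromElements("a", "b", "c")
               .map(String::toUpperCase)
               .setParallelism(2)  // operator level: overrides the environment default for this map only
               .print();

            // Command-line level:       flink run -p 4 job.jar
            // Configuration file level: parallelism.default: 4 in flink-conf.yaml

            env.execute("parallelism-levels-demo");
        }
    }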

The Flink we use at my company is a fork of the open-source version, and it differs from upstream in some ways. I ran into a few issues while using it, so here is a short summary.

There are two parallelism-related configuration options:

  • taskmanager.numberOfTaskManagers: the number of TaskManagers
  • taskmanager.numberOfTaskSlots: the number of slots per TaskManager

When the job starts, slot count = numberOfTaskManagers * numberOfTaskSlots.
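
Written out in code form (note that taskmanager.numberOfTaskManagers appears to be specific to our fork; open-source Flink only exposes taskmanager.numberOfTaskSlots):

    // Theoretical slot count when the job starts
    int numberOfTaskManagers = 4;  // taskmanager.numberOfTaskManagers
    int numberOfTaskSlots = 2;     // taskmanager.numberOfTaskSlots
    int totalSlots = numberOfTaskManagers * numberOfTaskSlots;  // 4 * 2 = 8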

And two job-level resource settings:

  • number of CPU cores
  • memory per core

Preconditions: we requested 4 CPU cores and 8 GB of memory, with numberOfTaskManagers = 4 and numberOfTaskSlots = 2, so in theory the slot count is 4 * 2 = 8.

The usual assumption: 4 CPU cores and 8 GB of memory means 4 TaskManagers and 8 slots. When the job starts it requests every slot available, so here it should claim all 8 and leave the unused ones idle; if the job needs more than 8 slots, the resource request cannot be satisfied and the job fails to start.
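
As a sketch, that naive model looks like this (requiredSlots is a placeholder for whatever the job topology demands; the cases below show this is not how the fork actually behaves):

    // Naive model: claim all slots up front, fail fast if the job needs more
    int totalSlots = 4 * 2;     // numberOfTaskManagers * numberOfTaskSlots
    int requiredSlots = 29;     // placeholder; e.g. the topology in case 3 below
    boolean jobStarts = requiredSlots <= totalSlots;  // false -> request fails, job never starts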

What actually happened

  • Case 1: no parallelism set anywhere; the job actually requested and used 1 core, 2 GB of memory, 1 TaskManager, and 2 slots.
  • Case 2: operator-level parallelism set to 2 on a few operators; the job actually requested and used 3 cores, 6 GB of memory, 3 TaskManagers, and 6 slots.
  • Case 3: operator-level parallelism set while reading from two Kafka clusters, each with one 5-partition topic, so both sources were set to parallelism 5, plus parallelism on other operators; the job needed 29 slots but only 15 were allocated, so the slot request failed with the error below:

org.apache.flink.runtime.executiongraph.ExecutionGraph        - oceanus#trace#Job livelink-data-flow (000000000000eab20000000000000009) switched from state RUNNING to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 29, slots allocated: 15, previous allocation IDs: [], execution status: completed: Attempt #0 (Source: kafka-reader-_0 -> Map -> Filter (1/1)) @ org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@75b2a848 - [SCHEDULED], completed: Attempt #0 (Source: kafka-reader-_1 (1/1)) @ org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@7cefd6df - [SCHEDULED], completed: Attempt #0 (Source: Custom Source (1/1)) @ org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@7d4da7f1 - [SCHEDULED], completed: Attempt #0 (Map (1/8)) @ org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@6ceb85dd - [SCHEDULED], incomplete: java.util.concurrent.CompletableFuture@c7b90e0[Not completed, 1 dependents], completed: Attempt #0 (Map (3/8)) @ org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@ea0e69e - [SCHEDULED], completed: Attempt #0 (Map (4/8)) @ org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@1ec2aa94 - [SCHEDULED], completed: Attempt #0 (Map (5/8)) @ org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@2e3711fc - [SCHEDULED], completed: Attempt #0 (Map (6/8)) @ org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@7e1b684a - [SCHEDULED], incomplete: java.util.concurrent.CompletableFuture@7706e501[Not completed, 1 dependents], completed: Attempt #0 (Map (8/8)) @ org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@7c1b677f - [SCHEDULED], incomplete: java.util.concurrent.CompletableFuture@6ab521f1[Not completed, 1 dependents], incomplete: java.util.concurrent.CompletableFuture@4f494dd5[Not completed, 1 dependents], completed: Attempt #0 (Co-Process-Broadcast (2/8)) @ org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@fa797d0 - [SCHEDULED], completed: Attempt #0 (Co-Process-Broadcast (3/8)) @ org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@282029c4 - [SCHEDULED], completed exceptionally: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException/java.util.concurrent.CompletableFuture@6b592f98[Completed exceptionally], completed: Attempt #0 (Co-Process-Broadcast (5/8)) @ org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@6efbe0c3 - [SCHEDULED], completed: Attempt #0 (Co-Process-Broadcast (6/8)) @ org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@5dd1f3db - [SCHEDULED], completed: Attempt #0 (Co-Process-Broadcast (7/8)) @ org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@74168fac - [SCHEDULED], completed: Attempt #0 (Co-Process-Broadcast (8/8)) @ org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@56914b59 - [SCHEDULED], incomplete: java.util.concurrent.CompletableFuture@675d54b0[Not completed, 1 dependents], incomplete: java.util.concurrent.CompletableFuture@35402cab[Not completed, 1 dependents], incomplete: java.util.concurrent.CompletableFuture@75ebbe40[Not completed, 1 dependents], incomplete: java.util.concurrent.CompletableFuture@297cf827[Not completed, 1 dependents], incomplete: java.util.concurrent.CompletableFuture@2489d16c[Not completed, 1 dependents], incomplete: java.util.concurrent.CompletableFuture@10188422[Not completed, 1 dependents], incomplete: java.util.concurrent.CompletableFuture@382931df[Not completed, 1 dependents], incomplete: java.util.concurrent.CompletableFuture@765a5b3d[Not completed, 1 dependents], incomplete: 
java.util.concurrent.CompletableFuture@1c32e62b[Not completed, 1 dependents]
    at org.apache.flink.runtime.executiongraph.SchedulingUtils.lambda$scheduleEager$1(SchedulingUtils.java:194)
    at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884)
    at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
    at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:634)
    at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.lambda$new$0(FutureUtils.java:657)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
    at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:190)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
    at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:700)
    at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:484)
    at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:380)
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
    at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:999)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Why were 15 slots already allocated, rather than a maximum of 8?

Apparently the relationships between parallelism, numberOfTaskManagers, numberOfTaskSlots, CPU, memory, and the actual slot count are rather deceptive.

  • Parallelism is hard to get right: too low and resources go underused; too high and the resource request fails and the job never starts.
  • The CPU and memory actually requested also depend on the numberOfTaskManagers and numberOfTaskSlots settings, which deserves attention (see the sketch after this list).
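
Putting the three cases together, the fork seems to allocate lazily rather than up front: TaskManagers spin up only as slots are demanded, and each one gets an equal share of the requested CPU and memory (1 core and 2 GB here, i.e. 4 cores and 8 GB spread over 4 TaskManagers). That would also fit case 3: the TaskManager count apparently is not capped at numberOfTaskManagers when operators demand more slots, hence 15 slots allocated instead of at most 8. A sketch of this inference (my reading of the numbers above, not documented behavior):

    // Inferred: TaskManagers start on demand, numberOfTaskSlots slots at a time
    int slotsPerTm = 2;
    int requiredSlots = 6;                                             // e.g. case 2
    int taskManagers = (requiredSlots + slotsPerTm - 1) / slotsPerTm;  // ceil -> 3
    int cores = taskManagers;          // 1 core per TaskManager
    int memoryGb = taskManagers * 2;   // 2 GB per TaskManager -> 3 cores, 6 GB, matching case 2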

As mentioned above, parallelism can be set at four levels, in priority order: operator, execution environment, command line, configuration file. Since operator-level parallelism makes it hard to control how many TaskManagers and slots actually start, let's try one of the other levels.

env.setParallelism(8);

At the execution environment level, setting the global parallelism to 8 produced the expected result: 4 cores, 8 GB, 4 TaskManagers, 8 slots, and 5 tasks in total, fewer tasks than when I set parallelism per operator.
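
In context, that single line goes right after the environment is created; a minimal skeleton (the pipeline itself is elided):

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(8);  // global default: every operator runs at parallelism 8
    // build the two Kafka sources and the rest of the pipeline here, with no per-operator
    // setParallelism calls, so everything inherits the global value and can chain freely
    env.execute("livelink-data-flow");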

Why does this happen? It calls for revisiting the concepts of parallelism, subtask, task, slot, and operator chain, and also the slot sharing group concept. That is a lot of material, so I won't go into detail here.
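
For reference, these are the DataStream API hooks that touch those concepts; a quick sketch of where they attach, not an explanation of the mechanics:

    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ChainingHooks {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // env.disableOperatorChaining();  // switch operator chaining off job-wide

            SingleOutputStreamOperator<String> mapped =
                    env.fromElements("a", "b").map(String::toUpperCase);
            mapped.startNewChain();               // start a new operator chain at this operator
            mapped.slotSharingGroup("isolated");  // a separate sharing group forces extra slots
            mapped.print();

            env.execute("chaining-hooks-demo");
        }
    }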
