Eager Rebalance vs. Incremental Cooperative Rebalance

Preface: Load balancing and scheduling play an important role in Kafka. Following common practice in distributed systems, Kafka clients use the group management API to form groups of cooperating client processes. Group membership is maintained by the Kafka coordinator, which also coordinates the members within each group. Based on Kafka Connect 2.6, this article introduces the two rebalance protocols coordinated by the Kafka coordinator: Eager Rebalance and Incremental Cooperative Rebalance.

Concepts, terminology, and abbreviations

coordinator: handles the events of the rebalance lifecycle; it is essentially a broker;

connect: an application responsible for exporting data from / importing data into Kafka, split into source and sink modules; in this article it usually refers to a single Connect worker node;

leader: a special node within a Connect group, chosen by the coordinator, responsible for assigning connectors and tasks to the other Connect workers in the group;

Rebalancing: in this article, the operation of redistributing connectors and tasks among Connect workers;

assignment: the connectors and tasks computed by the leader and handed to each Connect member;

JoinGroupRequest: the request a Connect worker sends to join the group, containing the protocols it supports and the assignment it received in the previous round;

JoinGroupResponse: the structure the coordinator sends back for joining the group, containing the leader of the current Connect group and, in the leader's copy, the member information from which the assignment is computed;

SyncGroupRequest: the request each Connect worker sends to the coordinator after receiving the JoinGroupResponse; for the leader node it carries the computed assignment;

SyncGroupResponse: the coordinator's message to each Connect worker carrying the assignment computed by the leader and related information;

HeartbeatResponse: the coordinator's response to each Connect worker's heartbeat;

generation: the sequence number of the current rebalance, similar to an epoch; the coordinator increments it by one each round.
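
To make these messages concrete, here is a minimal, hypothetical sketch of the information they carry, written as Java records. The field names are illustrative only; the real Kafka protocol encodes member metadata and assignments as opaque byte buffers.

    import java.util.List;
    import java.util.Map;

    // Hypothetical, simplified view of the group-management messages defined above.
    public class GroupProtocolSketch {

        record Assignment(List<String> connectors, List<String> tasks) { }

        // Sent by each Connect worker when (re)joining the group.
        record JoinGroupRequest(String memberId,
                                List<String> supportedProtocols,    // e.g. "eager", "compatible"
                                Assignment previousAssignment) { }  // what this worker ran in the last round

        // Sent by the coordinator to every member; only the leader's copy carries all members' metadata.
        record JoinGroupResponse(int generation,
                                 String leaderId,
                                 Map<String, JoinGroupRequest> memberMetadata) { }

        // Sent back to the coordinator; only the leader fills in the computed assignments.
        record SyncGroupRequest(int generation,
                                String memberId,
                                Map<String, Assignment> computedAssignments) { }

        // The coordinator fans the leader's result back out to each member.
        record SyncGroupResponse(Assignment myAssignment) { }
    }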

1. When Rebalancing Is Triggered

A rebalance is triggered when any of the following happens in the cluster:

  • A new Connect worker joins the cluster, or an existing one leaves;
  • A connector is submitted or deleted;
  • A task is submitted or deleted.

When a rebalance needs to be triggered, every Connect worker receives an error (Errors.REBALANCE_IN_PROGRESS) from the coordinator in its heartbeat thread.
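
As a rough sketch of what the worker-side heartbeat loop does with that error (the GroupClient interface below is hypothetical; the real logic lives in Kafka's AbstractCoordinator / WorkerCoordinator):

    import org.apache.kafka.common.protocol.Errors;

    // Minimal sketch of the worker-side heartbeat loop, assuming a hypothetical GroupClient wrapper.
    public class HeartbeatSketch {

        interface GroupClient {                    // hypothetical wrapper around the coordinator RPCs
            Errors sendHeartbeat();                // error code carried by the HeartbeatResponse
            void revokeEverything();               // stop all running connectors and tasks (Eager only)
            void rejoinGroup();                    // send a new JoinGroupRequest
            boolean isRunning();
            long heartbeatIntervalMs();
        }

        void heartbeatLoop(GroupClient client, boolean eagerProtocol) throws InterruptedException {
            while (client.isRunning()) {
                if (client.sendHeartbeat() == Errors.REBALANCE_IN_PROGRESS) {
                    // The coordinator has started a rebalance: this worker must rejoin the group.
                    if (eagerProtocol) {
                        client.revokeEverything(); // Eager: give up all work before rejoining
                    }
                    client.rejoinGroup();          // Cooperative: rejoin while keeping current work
                }
                Thread.sleep(client.heartbeatIntervalMs());
            }
        }
    }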

2、Eager Rebalance

Before Kafka 2.3, Connect's rebalancing assignment strategy was based on the Eager protocol. The defining characteristic of the Eager protocol is that, when a rebalance starts, every Connect worker must give up all the connectors and tasks it currently runs.

Here is an example. Suppose the Connect cluster currently has 2 connectors, ct1 and ct2, and 3 tasks, t1, t2 and t3, distributed across two Connect workers c1 and c2 as: {c1 -> ct1, t1, t2}, {c2 -> ct2, t3}. If a new Connect member (call it c3) joins the group, a rebalance is triggered and the following happens:

  1. Initially, c1 and c2 send heartbeats to the group coordinator from their heartbeat threads;
  2. The coordinator receives a JoinGroupRequest from c3 and learns that a new member wants to join;
  3. In the next round of HeartbeatResponses, the coordinator tells every Connect member that the cluster is rebalancing and that they must rejoin the group;
  4. c1 and c2 stop the connectors and tasks they are currently running (this step is called "revoke"), package up their own information (supported protocols, the previous round's assignment, and their member id) into a JoinGroupRequest, and send it to the coordinator;
  5. The coordinator picks one of the joined members as the leader (by default, the first to join) and sends out JoinGroupResponses; the leader's response contains the information each Connect member submitted, from which the leader computes the assignment and sends it back to the coordinator in a SyncGroupRequest;
  6. The coordinator sends every Connect member a SyncGroupResponse containing the leader information and that member's assignment; at this point the rebalance is complete.

Note 1: In step 4, c1 and c2 give up the connectors and tasks they are running as soon as they receive the rebalance signal from the coordinator, not after the rebalance completes. This is probably designed so that a connector or task that fails to stop cannot end up running on more than one Connect worker: every worker that rejoins the group is running nothing.

Note 2: In step 5, the leader is not the only one to receive a JoinGroupResponse; non-leader members receive one as well. One important field of the JoinGroupResponse is the generation, which records how many rebalances have taken place, much like an epoch. If a member rejoins the group carrying an old generation, its information is ignored (its assignment is considered stale). Non-leader members also send a SyncGroupRequest, just with an empty assignment.
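
For the running example, the leader ends up computing a brand-new assignment from scratch. A simplified stand-alone sketch of the resulting round-robin distribution is shown below (worker, connector and task names are taken from the example; the real EagerAssignor sorts member ids and task ids, which the sketch approximates with fixed lists):

    import java.util.*;

    // Stand-alone sketch of the Eager round-robin result for the example:
    // workers c1, c2, c3; connectors ct1, ct2; tasks t1, t2, t3.
    public class EagerRoundRobinExample {
        public static void main(String[] args) {
            List<String> workers = List.of("c1", "c2", "c3");   // sorted member ids
            List<String> connectors = List.of("ct1", "ct2");
            List<String> tasks = List.of("t1", "t2", "t3");

            Map<String, List<String>> assignment = new LinkedHashMap<>();
            workers.forEach(w -> assignment.put(w, new ArrayList<>()));

            int i = 0;
            for (String connector : connectors) {               // connectors first, round-robin
                assignment.get(workers.get(i++ % workers.size())).add(connector);
            }
            for (String task : tasks) {                         // then tasks, continuing around the same circle
                assignment.get(workers.get(i++ % workers.size())).add(task);
            }
            // Prints {c1=[ct1, t2], c2=[ct2, t3], c3=[t1]}: every worker's previous work was stopped
            // and reassigned, even though only one new, empty worker joined.
            System.out.println(assignment);
        }
    }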

(Sequence diagram of the Eager rebalance flow described above.)

To summarize: the whole rebalance involves three roles, the coordinator, the leader, and the remaining Connect workers (called members below).

  1. All members first register with the coordinator, the coordinator elects a leader, and the leader computes the assignment. These three roles are similar to YARN's ResourceManager, ApplicationMaster and NodeManager, and exist to solve the same scalability problem. A single Kafka cluster may host far more Connect groups than there are brokers, and each group can be unstable and change frequently, with every change requiring a round of coordination; if the brokers did the actual coordination work, it would add a lot of load to them. By electing one of the Connect workers as leader and letting the leader perform the expensive coordination, the load is pushed to the client side, easing the pressure on brokers and allowing more groups to be supported;
  2. Unlike YARN, Kafka has no NodeManager that can monitor node state in real time. This means the leader, just like every other Connect worker, has to send periodic heartbeats to the coordinator;
  3. YARN's RM is only responsible for resource allocation, whereas Kafka's coordinator also determines group membership. Even after a leader is chosen, the leader does not decide who is in the group; every group member must heartbeat to the coordinator so that the coordinator can determine the membership. Why not send heartbeats directly to the leader? Probably for reliability: the leader and the other members may well be separated by a network partition, while the coordinator is a broker, and any member that cannot reach the coordinator cannot be part of the group anyway. This also means the group management protocol should not depend on reliable communication between members and the leader; the leader should not interact with members directly, and the group should be managed through the coordinator. This is a clear difference from YARN, where every node sits inside the cluster, whereas Kafka clients and Connect workers are not part of the brokers and may live in all sorts of network environments and geographic locations.

Let's now look at how the Eager protocol computes the assignment.

After receiving the member information from the coordinator, the leader arranges all Connect workers (including itself) into a circular list, then iterates over all connectors, handing each connector to the next node in the circle; when the tail of the circle is reached, assignment wraps back to the head, until every connector has been assigned. Tasks are assigned the same way. The code is shown below:

    // EagerAssignor, line 89
    private Map<String, ByteBuffer> performTaskAssignment(String leaderId, long maxOffset,
                                                          Map<String, ExtendedWorkerState> memberConfigs,
                                                          WorkerCoordinator coordinator) {
        // The keys of these maps are Connect worker ids; the values are the collections of
        // connector ids / task ids, i.e. the connectors and tasks assigned to each worker
        Map<String, Collection<String>> connectorAssignments = new HashMap<>();
        Map<String, Collection<ConnectorTaskId>> taskAssignments = new HashMap<>();

        // Connectors and tasks are assigned in separate round-robin passes; assigning a connector
        // together with its tasks can make the load very uneven (tasks are, on average, more resource
        // intensive than connectors).
        List<String> connectorsSorted = sorted(coordinator.configSnapshot().connectors());
        // Sort all workers and arrange them into a circular (head-to-tail) list
        CircularIterator<String> memberIt = new CircularIterator<>(sorted(memberConfigs.keySet()));
        // Iterate over every connector and hand it to the next node in the circular list;
        // CircularIterator wraps back to the head after reaching the tail
        for (String connectorId : connectorsSorted) {
            String connectorAssignedTo = memberIt.next();
            log.trace("Assigning connector {} to {}", connectorId, connectorAssignedTo);
            Collection<String> memberConnectors = connectorAssignments.get(connectorAssignedTo);
            if (memberConnectors == null) {
                memberConnectors = new ArrayList<>();
                connectorAssignments.put(connectorAssignedTo, memberConnectors);
            }
            memberConnectors.add(connectorId);
        }
        for (String connectorId : connectorsSorted) {
            // Tasks are assigned separately from connectors, but in the same round-robin way:
            // each task goes to the next node in the circle
            for (ConnectorTaskId taskId : sorted(coordinator.configSnapshot().tasks(connectorId))) {
                String taskAssignedTo = memberIt.next();
                log.trace("Assigning task {} to {}", taskId, taskAssignedTo);
                Collection<ConnectorTaskId> memberTasks = taskAssignments.get(taskAssignedTo);
                if (memberTasks == null) {
                    memberTasks = new ArrayList<>();
                    taskAssignments.put(taskAssignedTo, memberTasks);
                }
                memberTasks.add(taskId);
            }
        }

        coordinator.leaderState(new LeaderState(memberConfigs, connectorAssignments, taskAssignments));

        // Serialize the computed assignment
        return fillAssignmentsAndSerialize(memberConfigs.keySet(), Assignment.NO_ERROR,
                leaderId, memberConfigs.get(leaderId).url(), maxOffset, connectorAssignments, taskAssignments);
    }

Note: connectors and tasks are assigned in two separate round-robin loops. Assigning a connector together with its tasks can produce a very uneven distribution of work in some common cases (for example, connectors that each generate only one task: in a cluster of 2, or any even number of, Connect workers, only the even-numbered workers would receive connectors and only the odd-numbered workers would receive tasks, even though tasks are, on average, more resource intensive than connectors).
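
A quick way to see the imbalance described in the note: with four connectors of one task each and two workers, interleaving each connector with its own task sends every connector to one worker and every task to the other, while two separate passes spread both kinds of work evenly. A minimal stand-alone sketch (names are illustrative):

    import java.util.*;

    // Sketch: why connectors and tasks get separate round-robin passes.
    // Four connectors, each generating exactly one task, distributed over two workers.
    public class SeparatePassesExample {
        public static void main(String[] args) {
            List<String> workers = List.of("w0", "w1");
            List<String> connectors = List.of("conn0", "conn1", "conn2", "conn3");

            // Interleaved: each connector is assigned immediately followed by its task.
            Map<String, List<String>> interleaved = emptyLoad(workers);
            int i = 0;
            for (String c : connectors) {
                interleaved.get(workers.get(i++ % 2)).add(c);          // the connector
                interleaved.get(workers.get(i++ % 2)).add(c + "-t0");  // its single task
            }
            // w0 gets every connector, w1 gets every (heavier) task:
            // {w0=[conn0, conn1, conn2, conn3], w1=[conn0-t0, conn1-t0, conn2-t0, conn3-t0]}
            System.out.println(interleaved);

            // Separate passes: all connectors first, then all tasks, continuing around the same circle.
            Map<String, List<String>> separate = emptyLoad(workers);
            int j = 0;
            for (String c : connectors) separate.get(workers.get(j++ % 2)).add(c);
            for (String c : connectors) separate.get(workers.get(j++ % 2)).add(c + "-t0");
            // Each worker ends up with two connectors and two tasks:
            // {w0=[conn0, conn2, conn0-t0, conn2-t0], w1=[conn1, conn3, conn1-t0, conn3-t0]}
            System.out.println(separate);
        }

        static Map<String, List<String>> emptyLoad(List<String> workers) {
            Map<String, List<String>> load = new LinkedHashMap<>();
            workers.forEach(w -> load.put(w, new ArrayList<>()));
            return load;
        }
    }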

3. Incremental Cooperative Rebalance

The biggest problem with the Eager protocol is that any change in group membership, or any request to the cluster such as adding a new connector, immediately triggers a rebalance that forces every Connect worker to give up the connectors and tasks it is running. This behavior is known as "stop the world" (STW below). For Kafka consumers, which also use the Eager protocol, this matters less, since the consumers in a group all do essentially the same work. Kafka Connect is different: the connectors and tasks running on different workers can differ widely (for example, c1 runs a Debezium MySQL connector while c2 runs a MongoDB connector), yet a change to any one connector affects all the others. After recognizing this problem, the Kafka community introduced a new, compatible rebalance protocol in version 2.3: Incremental Cooperative Rebalance.

Incremental Cooperative Rebalance is not a brand-new protocol; it extends Eager by having every Connect worker report its currently running assignment when it joins the group. Walking through the same example under this protocol:

  1. When c1 and c2 learn that they need to rejoin the group, they do not revoke their current work; instead they include the assignment from the previous rebalance in their JoinGroupRequest;
  2. After receiving the JoinGroupResponse from the coordinator, the leader does not reshuffle all connectors and tasks either. It divides the current number of connectors and tasks by the number of Connect workers (including the newly joined ones) to work out each worker's fair share, and from that decides which connectors and tasks each worker has to revoke;
  3. Each worker, on receiving the SyncGroupResponse, stops the connectors and tasks listed in its revocation (if the revocation is empty, this step is skipped);
  4. At this point the cluster has only completed the first phase of joining. Incremental Cooperative Rebalance is really a protocol that makes every worker join twice: after step 3, when all workers have received their SyncGroupResponse and stopped what needed stopping, the coordinator has not yet ended the rebalance; the workers notice during their heartbeats that a rebalance is still in progress and have to join once more;
  5. The purpose of this second join is to distribute the connectors and tasks given up in step 3. The distribution itself is simple. Taking connectors as an example: sort all workers by how many connectors they already hold, take the first worker, say w1, which holds the fewest connectors, and find the first worker that holds more, say w2 (in effect, take the first and the second); pour connectors into w1 until it holds as many as w2, at which point w1 and w2 hold the same number. Then take the next worker w3; assuming w3 > w2 = w1, pour connectors into w1 and w2 until w3 = w2 = w1, and so on. A hand-computed trace of both phases for the running example follows this list.
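
Tracing the running example through the steps above (a hand-computed sketch; the floor/ceiling arithmetic mirrors performTaskRevocation, shown later): with 2 connectors, 3 tasks and 3 workers after c3 joins, each worker should own at most ceil(2/3) = 1 connector and ceil(3/3) = 1 task, so only c1 has anything to give up.

    // Hand-traced sketch of the two-phase cooperative rebalance for the running example:
    // workers c1 {ct1, t1, t2} and c2 {ct2, t3}, plus the new, empty worker c3.
    public class CooperativeTraceExample {
        public static void main(String[] args) {
            int totalConnectors = 2, totalTasks = 3, totalWorkers = 3;
            int ceilConnectors = ceilDiv(totalConnectors, totalWorkers); // 1
            int ceilTasks = ceilDiv(totalTasks, totalWorkers);           // 1

            // Phase 1 (first join): each existing worker revokes whatever exceeds the ceiling.
            // c1: 1 connector (<= 1, keep) and 2 tasks (> 1, revoke one, e.g. t2).
            // c2: 1 connector and 1 task, nothing to revoke.
            System.out.printf("c1 revokes %d connector(s), %d task(s)%n",
                    Math.max(0, 1 - ceilConnectors), Math.max(0, 2 - ceilTasks));
            System.out.printf("c2 revokes %d connector(s), %d task(s)%n",
                    Math.max(0, 1 - ceilConnectors), Math.max(0, 1 - ceilTasks));

            // Phase 2 (second join): the single freed task t2 is handed to the least-loaded worker, c3.
            // Final assignment: {c1 -> ct1, t1}, {c2 -> ct2, t3}, {c3 -> t2}.
        }

        static int ceilDiv(int a, int b) {
            return (a + b - 1) / b;
        }
    }

Compared with the Eager protocol above, where every running connector and task would have been stopped and reshuffled, only a single task moves here.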

(Sequence diagram of the Incremental Cooperative rebalance flow described above.)

Here is the code that computes the assignment under this protocol:

    protected Map<String, ByteBuffer> performTaskAssignment(String leaderId, long maxOffset,
                                                            Map<String, ExtendedWorkerState> memberConfigs,
                                                            WorkerCoordinator coordinator, short protocolVersion) {
        log.debug("Performing task assignment during generation: {} with memberId: {}",
                coordinator.generationId(), coordinator.memberId());

        // Base set: The previous assignment of connectors-and-tasks is a standalone snapshot that
        // can be used to calculate derived sets
        log.debug("Previous assignments: {}", previousAssignment);
        // Check whether the leader's generation id matches the last generation the coordinator completed.
        // If not, the assignment this leader computed in an earlier round must be discarded. This can happen
        // when this worker was the leader in some round but got separated from the coordinator after computing
        // the assignment, so the assignment was never synced. If, after rejoining, this worker is elected
        // leader again, the view of previous assignments it kept is stale and has to be cleared.
        int lastCompletedGenerationId = coordinator.lastCompletedGenerationId();
        if (previousGenerationId != lastCompletedGenerationId) {
            log.debug("Clearing the view of previous assignments due to generation mismatch between "
                    + "previous generation ID {} and last completed generation ID {}. This can "
                    + "happen if the leader fails to sync the assignment within a rebalancing round. "
                    + "The following view of previous assignments might be outdated and will be "
                    + "ignored by the leader in the current computation of new assignments. "
                    + "Possibly outdated previous assignments: {}",
                    previousGenerationId, lastCompletedGenerationId, previousAssignment);
            // Clear the leader's saved view of previous assignments
            this.previousAssignment = ConnectorsAndTasks.EMPTY;
        }

        // Fetch the cluster's current configuration snapshot (configSnapshot) from the coordinator;
        // it describes the connectors and tasks that should currently exist
        ClusterConfigState snapshot = coordinator.configSnapshot();
        // Extract the connector and task sets from configSnapshot
        Set<String> configuredConnectors = new TreeSet<>(snapshot.connectors());
        Set<ConnectorTaskId> configuredTasks = configuredConnectors.stream()
                .flatMap(c -> snapshot.tasks(c).stream())
                .collect(Collectors.toSet());

        // Base set: The set of configured connectors-and-tasks is a standalone snapshot that can
        // be used to calculate derived sets
        ConnectorsAndTasks configured = new ConnectorsAndTasks.Builder()
                .with(configuredConnectors, configuredTasks).build();
        log.debug("Configured assignments: {}", configured);

        // Base set: The set of active connectors-and-tasks is a standalone snapshot that can be
        // used to calculate derived sets
        // Collect the connectors and tasks from the assignments each worker reported when joining.
        // Unlike the coordinator's snapshot, these worker-reported assignments may still include
        // connectors and tasks that have since been deleted
        ConnectorsAndTasks activeAssignments = assignment(memberConfigs);
        log.debug("Active assignments: {}", activeAssignments);

        // This means that a previous revocation did not take effect. In this case, reset
        // appropriately and be ready to re-apply revocation of tasks
        // The leader also remembers what it asked the workers to revoke in the previous round (previousRevocation).
        // If previousRevocation is not empty and any of those connectors or tasks still show up in the assignments
        // the workers just reported, the previous revocation did not take effect: reset the previous assignment
        // to the reported one and allow the revocation to be computed and applied again
        if (!previousRevocation.isEmpty()) {
            if (previousRevocation.connectors().stream().anyMatch(c -> activeAssignments.connectors().contains(c))
                    || previousRevocation.tasks().stream().anyMatch(t -> activeAssignments.tasks().contains(t))) {
                previousAssignment = activeAssignments;
                canRevoke = true;
            }
            previousRevocation.connectors().clear();
            previousRevocation.tasks().clear();
        }

        // Derived set: The set of deleted connectors-and-tasks is a derived set from the set
        // difference of previous - configured
        // Compare the leader's previous assignment snapshot with the coordinator's current snapshot to find
        // the deleted connectors and tasks
        // (p.s. why not compare the worker-reported assignments with the coordinator's snapshot instead?)
        ConnectorsAndTasks deleted = diff(previousAssignment, configured);
        log.debug("Deleted assignments: {}", deleted);

        // Derived set: The set of remaining active connectors-and-tasks is a derived set from the
        // set difference of active - deleted
        // Subtract the deleted set from the worker-reported assignments to get what must keep running
        ConnectorsAndTasks remainingActive = diff(activeAssignments, deleted);
        log.debug("Remaining (excluding deleted) active assignments: {}", remainingActive);

        // Derived set: The set of lost or unaccounted connectors-and-tasks is a derived set from
        // the set difference of previous - active - deleted
        // Start from the leader's previous assignment snapshot and subtract the worker-reported assignments
        // and the deleted set to get the lost assignments; assignments are usually lost because some workers
        // died and could not report them in this rebalance
        ConnectorsAndTasks lostAssignments = diff(previousAssignment, activeAssignments, deleted);
        log.debug("Lost assignments: {}", lostAssignments);

        // Derived set: The set of new connectors-and-tasks is a derived set from the set
        // difference of configured - previous - active
        // Start from the coordinator's snapshot and subtract the leader's previous snapshot and the
        // worker-reported assignments to get the newly submitted connectors and tasks
        ConnectorsAndTasks newSubmissions = diff(configured, previousAssignment, activeAssignments);
        log.debug("New assignments: {}", newSubmissions);

        // A collection of the complete assignment
        // completeWorkerAssignment is each worker's current load: the connectors and tasks it currently holds
        List<WorkerLoad> completeWorkerAssignment = workerAssignment(memberConfigs, ConnectorsAndTasks.EMPTY);
        log.debug("Complete (ignoring deletions) worker assignments: {}", completeWorkerAssignment);

        // Per worker connector assignments without removing deleted connectors yet
        // connectorAssignments maps each worker to the connectors it currently holds
        Map<String, Collection<String>> connectorAssignments =
                completeWorkerAssignment.stream().collect(Collectors.toMap(WorkerLoad::worker, WorkerLoad::connectors));
        log.debug("Complete (ignoring deletions) connector assignments: {}", connectorAssignments);

        // Per worker task assignments without removing deleted connectors yet
        // taskAssignments maps each worker to the tasks it currently holds
        Map<String, Collection<ConnectorTaskId>> taskAssignments =
                completeWorkerAssignment.stream().collect(Collectors.toMap(WorkerLoad::worker, WorkerLoad::tasks));
        log.debug("Complete (ignoring deletions) task assignments: {}", taskAssignments);

        // A collection of the current assignment excluding the connectors-and-tasks to be deleted
        // currentWorkerAssignment is completeWorkerAssignment with the to-be-deleted connectors and tasks
        // removed, i.e. each worker's load after deletions
        List<WorkerLoad> currentWorkerAssignment = workerAssignment(memberConfigs, deleted);

        // 'deleted' only lists what must go; walking each worker's current assignment tells us which worker
        // holds each deleted connector/task, and therefore what each worker has to revoke
        Map<String, ConnectorsAndTasks> toRevoke = computeDeleted(deleted, connectorAssignments, taskAssignments);
        log.debug("Connector and task to delete assignments: {}", toRevoke);

        // Revoke redundant connectors/tasks if the workers have duplicate assignments
        // This step removes connectors and tasks that ended up assigned to more than one worker
        toRevoke.putAll(computeDuplicatedAssignments(memberConfigs, connectorAssignments, taskAssignments));
        log.debug("Connector and task to revoke assignments (include duplicated assignments): {}", toRevoke);

        // Recompute the complete assignment excluding the deleted connectors-and-tasks
        // Recompute each worker's assignment with the deleted connectors and tasks removed
        completeWorkerAssignment = workerAssignment(memberConfigs, deleted);
        connectorAssignments =
                completeWorkerAssignment.stream().collect(Collectors.toMap(WorkerLoad::worker, WorkerLoad::connectors));
        taskAssignments =
                completeWorkerAssignment.stream().collect(Collectors.toMap(WorkerLoad::worker, WorkerLoad::tasks));

        // Handle the lost assignments
        handleLostAssignments(lostAssignments, newSubmissions, completeWorkerAssignment, memberConfigs);

        // Do not revoke resources for re-assignment while a delayed rebalance is active
        // Also we do not revoke in two consecutive rebalances by the same leader
        canRevoke = delay == 0 && canRevoke;

        // Compute the connectors-and-tasks to be revoked for load balancing without taking into
        // account the deleted ones.
        log.debug("Can leader revoke tasks in this assignment? {} (delay: {})", canRevoke, delay);
        if (canRevoke) {
            // Compute which connectors and tasks each worker must give up so that the load becomes balanced
            Map<String, ConnectorsAndTasks> toExplicitlyRevoke =
                    performTaskRevocation(activeAssignments, currentWorkerAssignment);

            log.debug("Connector and task to revoke assignments: {}", toRevoke);

            toExplicitlyRevoke.forEach(
                (worker, assignment) -> {
                    ConnectorsAndTasks existing = toRevoke.computeIfAbsent(
                        worker,
                        v -> new ConnectorsAndTasks.Builder().build());
                    existing.connectors().addAll(assignment.connectors());
                    existing.tasks().addAll(assignment.tasks());
                }
            );
            canRevoke = toExplicitlyRevoke.size() == 0;
        } else {
            canRevoke = delay == 0;
        }

        // Distribute the newly submitted connectors and tasks across all workers
        assignConnectors(completeWorkerAssignment, newSubmissions.connectors());
        assignTasks(completeWorkerAssignment, newSubmissions.tasks());
        log.debug("Current complete assignments: {}", currentWorkerAssignment);
        log.debug("New complete assignments: {}", completeWorkerAssignment);

        Map<String, Collection<String>> currentConnectorAssignments =
                currentWorkerAssignment.stream().collect(Collectors.toMap(WorkerLoad::worker, WorkerLoad::connectors));
        Map<String, Collection<ConnectorTaskId>> currentTaskAssignments =
                currentWorkerAssignment.stream().collect(Collectors.toMap(WorkerLoad::worker, WorkerLoad::tasks));
        Map<String, Collection<String>> incrementalConnectorAssignments =
                diff(connectorAssignments, currentConnectorAssignments);
        Map<String, Collection<ConnectorTaskId>> incrementalTaskAssignments =
                diff(taskAssignments, currentTaskAssignments);

        log.debug("Incremental connector assignments: {}", incrementalConnectorAssignments);
        log.debug("Incremental task assignments: {}", incrementalTaskAssignments);

        coordinator.leaderState(new LeaderState(memberConfigs, connectorAssignments, taskAssignments));

        Map<String, ExtendedAssignment> assignments =
                fillAssignments(memberConfigs.keySet(), Assignment.NO_ERROR, leaderId,
                                memberConfigs.get(leaderId).url(), maxOffset, incrementalConnectorAssignments,
                                incrementalTaskAssignments, toRevoke, delay, protocolVersion);
        previousAssignment = computePreviousAssignment(toRevoke, connectorAssignments, taskAssignments, lostAssignments);
        previousGenerationId = coordinator.generationId();
        previousMembers = memberConfigs.keySet();
        log.debug("Actual assignments: {}", assignments);
        return serializeAssignments(assignments);
    }
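
The derived sets above are plain set differences. A small stand-alone illustration under made-up names: suppose connector conn-b was just removed from the config, conn-c was just added, and conn-a keeps running on some worker.

    import java.util.*;

    // Stand-alone illustration of the derived sets computed above (pure set differences).
    public class DerivedSetsExample {
        public static void main(String[] args) {
            Set<String> previous   = Set.of("conn-a", "conn-b"); // the leader's snapshot from the last round
            Set<String> configured = Set.of("conn-a", "conn-c"); // the coordinator's current config snapshot
            Set<String> active     = Set.of("conn-a", "conn-b"); // what the workers report they still run

            Set<String> deleted = diff(previous, configured);                // [conn-b]: removed from the config
            Set<String> remainingActive = diff(active, deleted);             // [conn-a]: keeps running
            Set<String> lost = diff(previous, active, deleted);              // []: nothing unaccounted for
            Set<String> newSubmissions = diff(configured, previous, active); // [conn-c]: to be assigned this round

            System.out.println(deleted + " " + remainingActive + " " + lost + " " + newSubmissions);
        }

        @SafeVarargs
        static Set<String> diff(Set<String> base, Set<String>... toSubtract) {
            Set<String> result = new HashSet<>(base);
            for (Set<String> s : toSubtract) result.removeAll(s);
            return result;
        }
    }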

The logic that computes what each worker has to revoke:

    private Map<String, ConnectorsAndTasks> performTaskRevocation(ConnectorsAndTasks activeAssignments,
                                                                  Collection<WorkerLoad> completeWorkerAssignment) {
        // Number of connectors already assigned to the workers joining this rebalance (i.e. the connector count from the previous round)
        int totalActiveConnectorsNum = activeAssignments.connectors().size();
        // Number of tasks already assigned to the workers joining this rebalance (i.e. the task count from the previous round)
        int totalActiveTasksNum = activeAssignments.tasks().size();
        // existingWorkers are the workers that already hold some assignment; newly joined workers are filtered out
        Collection<WorkerLoad> existingWorkers = completeWorkerAssignment.stream()
                .filter(wl -> wl.size() > 0)
                .collect(Collectors.toList());
        int existingWorkersNum = existingWorkers.size();
        int totalWorkersNum = completeWorkerAssignment.size();
        // Number of newly joined workers
        int newWorkersNum = totalWorkersNum - existingWorkersNum;

        if (log.isDebugEnabled()) {
            completeWorkerAssignment.forEach(wl -> log.debug(
                    "Per worker current load size; worker: {} connectors: {} tasks: {}",
                    wl.worker(), wl.connectorsSize(), wl.tasksSize()));
        }

        Map<String, ConnectorsAndTasks> revoking = new HashMap<>();
        // If there are no new workers, or no existing workers to revoke tasks from return early
        // after logging the status
        // Revocation is only needed when there is at least one newly joined worker and at least one existing worker with load; otherwise return early
        if (!(newWorkersNum > 0 && existingWorkersNum > 0)) {
            log.debug("No task revocation required; workers with existing load: {} workers with "
                    + "no load {} total workers {}",
                    existingWorkersNum, newWorkersNum, totalWorkersNum);
            // This is intentionally empty but mutable, because the map is used to include deleted
            // connectors and tasks as well
            return revoking;
        }

        log.debug("Task revocation is required; workers with existing load: {} workers with "
                + "no load {} total workers {}",
                existingWorkersNum, newWorkersNum, totalWorkersNum);

        // We have at least one worker assignment (the leader itself) so totalWorkersNum can't be 0
        log.debug("Previous rounded down (floor) average number of connectors per worker {}", totalActiveConnectorsNum / existingWorkersNum);
        // Divide the total number of reported connectors by the total number of workers in this rebalance
        // (including the new ones) to get floorConnectors, the minimum each worker should own after balancing
        int floorConnectors = totalActiveConnectorsNum / totalWorkersNum;
        // Round up to get ceilConnectors
        int ceilConnectors = floorConnectors + ((totalActiveConnectorsNum % totalWorkersNum == 0) ? 0 : 1);
        log.debug("New average number of connectors per worker rounded down (floor) {} and rounded up (ceil) {}", floorConnectors, ceilConnectors);


        log.debug("Previous rounded down (floor) average number of tasks per worker {}", totalActiveTasksNum / existingWorkersNum);
        int floorTasks = totalActiveTasksNum / totalWorkersNum;
        int ceilTasks = floorTasks + ((totalActiveTasksNum % totalWorkersNum == 0) ? 0 : 1);
        log.debug("New average number of tasks per worker rounded down (floor) {} and rounded up (ceil) {}", floorTasks, ceilTasks);
        int numToRevoke;

        // Iterate over the workers that already hold connectors and tasks (newly joined workers are excluded)
        for (WorkerLoad existing : existingWorkers) {
            Iterator<String> connectors = existing.connectors().iterator();
            // This worker's connector count minus the balanced ceiling gives numToRevoke, the number of
            // connectors this worker must give up
            numToRevoke = existing.connectorsSize() - ceilConnectors;
            // Note: a worker may hold fewer connectors than ceilConnectors, in which case it revokes nothing
            // and may even be given additional connectors later
            for (int i = existing.connectorsSize(); i > floorConnectors && numToRevoke > 0; --i, --numToRevoke) {
                ConnectorsAndTasks resources = revoking.computeIfAbsent(
                    existing.worker(),
                    w -> new ConnectorsAndTasks.Builder().build());
                resources.connectors().add(connectors.next());
            }
        }

        for (WorkerLoad existing : existingWorkers) {
            Iterator<ConnectorTaskId> tasks = existing.tasks().iterator();
            numToRevoke = existing.tasksSize() - ceilTasks;
            log.debug("Tasks on worker {} is higher than ceiling, so revoking {} tasks", existing, numToRevoke);
            for (int i = existing.tasksSize(); i > floorTasks && numToRevoke > 0; --i, --numToRevoke) {
                ConnectorsAndTasks resources = revoking.computeIfAbsent(
                    existing.worker(),
                    w -> new ConnectorsAndTasks.Builder().build());
                resources.tasks().add(tasks.next());
            }
        }

        return revoking;
    }

The part that assigns the new connectors (tasks are handled similarly):

    protected void assignConnectors(List<WorkerLoad> workerAssignment, Collection<String> connectors) {
        // The strategy is simple: sort the workers by how many connectors they already hold (deletions already
        // removed). Take the first worker, say w1, and find the first worker that holds more connectors, say w2
        // (in effect, take the first and the second). Pour connectors into w1 until it holds as many as w2, at
        // which point w1 and w2 hold the same number. Then take the next worker w3; assuming w3 > w2 = w1, pour
        // connectors into w1 and w2 until w3 = w2 = w1, and so on.

        // Sort the workers by connector count
        workerAssignment.sort(WorkerLoad.connectorComparator());
        // The first worker is the one with the fewest connectors
        WorkerLoad first = workerAssignment.get(0);

        // Iterate over the connectors that need to be assigned
        Iterator<String> load = connectors.iterator();
        while (load.hasNext()) {
            // Number of connectors held by the first worker
            int firstLoad = first.connectorsSize();
            // Find the index of the first worker holding more connectors than the first one
            int upTo = IntStream.range(0, workerAssignment.size())
                    .filter(i -> workerAssignment.get(i).connectorsSize() > firstLoad)
                    .findFirst()
                    .orElse(workerAssignment.size());
            // subList covers the workers up to (but not including) the first worker with a higher count;
            // keep handing out connectors to these workers
            for (WorkerLoad worker : workerAssignment.subList(0, upTo)) {
                String connector = load.next();
                log.debug("Assigning connector {} to {}", connector, worker.worker());
                worker.assign(connector);
                if (!load.hasNext()) {
                    break;
                }
            }
        }
    }
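
To see the fill pattern in isolation, here is a simplified stand-alone version of the same idea working on plain lists instead of WorkerLoad instances (it re-sorts on every pass, which is a slight simplification of the code above):

    import java.util.*;

    // Simplified stand-alone version of the "fill the least-loaded workers first" strategy.
    public class FillLeastLoadedExample {
        public static void main(String[] args) {
            // Current loads after revocations: one empty worker, one with 1 connector, one with 3.
            List<List<String>> workers = new ArrayList<>();
            workers.add(new ArrayList<>());
            workers.add(new ArrayList<>(List.of("existing-1")));
            workers.add(new ArrayList<>(List.of("existing-2", "existing-3", "existing-4")));

            Deque<String> toAssign = new ArrayDeque<>(List.of("new-1", "new-2", "new-3"));

            while (!toAssign.isEmpty()) {
                workers.sort(Comparator.comparingInt(List::size));  // least-loaded worker first
                int firstLoad = workers.get(0).size();
                // Every worker tied with the least-loaded one receives a connector before heavier workers do.
                for (List<String> worker : workers) {
                    if (worker.size() > firstLoad || toAssign.isEmpty()) {
                        break;
                    }
                    worker.add(toAssign.poll());
                }
            }
            // Loads level out at 2, 2 and 3:
            // [[new-1, new-2], [existing-1, new-3], [existing-2, existing-3, existing-4]]
            System.out.println(workers);
        }
    }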

4. Comparing the Old and New Protocols

The figure below, taken from the official Kafka blog, compares the time a rebalance takes under the Eager protocol and the Incremental Cooperative protocol, with a total of 90 connectors of 10 tasks each (900 tasks).

On the left of each chart, the cost of Eager rebalancing (which stops the world whenever a connector and its tasks start or stop) is proportional to the number of tasks currently running in the cluster. The cost is similar whether connectors are starting or stopping, and the cluster takes roughly 14 and 12 minutes respectively to stabilize. In contrast, on the right, Incremental Cooperative rebalancing balances 900 tasks in under a minute, and the cost of each individual rebalance is clearly independent of the number of tasks in the cluster. The bar charts below show this plainly by comparing the time spent starting and stopping 90 connectors and 900 tasks.

Generally speaking, Incremental Cooperative Rebalance is the better fit when a Connect cluster runs many different connectors or performs rolling plugin upgrades; Eager Rebalance suits clusters with only a few connectors and tasks, where skipping the second join lets the cluster reach a balanced state immediately.

5. References

incremental-cooperative-rebalancing-in-kafka/
