Preface: Load balancing and scheduling occupy an important place in Kafka. Following common distributed-systems practice, Kafka clients use the group management API to form groups of cooperating client processes. These client groups are maintained by the Kafka coordinator, which is also responsible for coordinating the members within each group. Based on Kafka Connect 2.6, this article introduces the two protocols that operate under the Kafka coordinator: Eager Rebalance and Incremental Cooperative Rebalancing.
Concepts, terminology, and abbreviations
coordinator: handles the events of the rebalance lifecycle; it is, in essence, just a broker;
connect: the application that imports data into and exports data out of Kafka, split into source and sink modules (in this article, a "connect" also refers to a single Connect worker node);
leader: a special node within a Connect group, chosen by the coordinator and responsible for assigning connectors and tasks to the other Connect nodes in the group;
Rebalancing: in this article, the operation of redistributing connectors and tasks among the Connect workers;
assignment: the connectors and tasks computed by the leader and handed out to each Connect member;
JoinGroupRequest: the request a Connect worker sends to join the group, carrying the protocols that worker supports and the assignment it received in the previous rebalance round;
JoinGroupResponse: the response the coordinator sends back to a joining Connect worker, identifying the current group's leader and carrying the member information from which the leader computes the assignment;
SyncGroupRequest: the request every Connect worker must send to the coordinator after receiving its JoinGroupResponse; if the sender is the leader, it contains the assignment;
SyncGroupResponse: the message in which the coordinator sends each Connect worker the assignment computed by the leader, among other information;
HeartbeatResponse: the coordinator's response to each Connect worker's heartbeat;
generation: which round of rebalancing is currently executing, similar to an epoch; the coordinator increments it by one for each round.
1. When Rebalancing is triggered
Rebalancing is triggered when any of the following happens in the cluster:
- a new Connect worker joins, or an existing one leaves;
- a connector is submitted or deleted;
- a task is submitted or deleted.
When a rebalance needs to be triggered, every Connect worker learns about it in its heartbeat thread, via an error from the coordinator (Errors.REBALANCE_IN_PROGRESS).
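As a rough sketch of that detection path (the types and method names below are simplified stand-ins for illustration, not the actual Kafka Connect classes):

// A minimal sketch, assuming stand-in types (HeartbeatResponse, Errors, rejoinGroup);
// the real logic lives in the client's heartbeat thread.
void onHeartbeatResponse(HeartbeatResponse response) {
    if (response.error() == Errors.REBALANCE_IN_PROGRESS) {
        // The coordinator signals that a rebalance has begun: this worker must
        // give up its stable-member status and re-join with a JoinGroupRequest.
        rejoinGroup();
    }
}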
2. Eager Rebalance
Before Kafka 2.3, Connect rebalancing was based on the Eager protocol. Its defining characteristic is that, whenever a rebalance occurs, every Connect worker must give up all of the connectors and tasks it currently owns.
An example: suppose the Connect cluster currently has 2 connectors, ct1 and ct2, and 3 tasks, t1, t2 and t3, distributed over two Connect workers c1 and c2 as {c1 -> ct1, t1, t2}, {c2 -> ct2, t3}. If a new Connect member (call it c3) now joins the group and triggers a rebalance, the following happens:
- Initially, c1 and c2 send heartbeats to the group coordinator from their own heartbeat threads;
- The coordinator receives a JoinGroupRequest from c3 and learns that a new member is joining;
- In the next round of HeartbeatResponses, the coordinator tells each Connect member that the cluster is rebalancing and that they need to re-join the group;
- c1 and c2 stop the connectors and tasks they are currently running (this step is also called "revoke"), then package up their own information (supported protocols, last round's assignment, and their own member id) into a JoinGroupRequest and send it to the coordinator;
- The coordinator picks one of the joining members as leader (by default the first to join) and sends out JoinGroupResponses; the leader's response includes the information every member sent to the coordinator, from which the leader computes the assignment and sends it back to the coordinator inside a SyncGroupRequest;
- The coordinator sends every Connect worker a SyncGroupResponse containing the leader information and the assignment allocated to that worker; the rebalance is then complete.
Note 1: in step 4, c1 and c2 give up the connectors and tasks they are running as soon as they receive the rebalance signal from the coordinator, not after the rebalance completes. The likely intent of this design is to avoid a connector or task that fails to stop ending up running on two different workers at once; every worker re-joins the cluster with nothing running.
Note 2: in step 5, the leader is not the only one to receive a JoinGroupResponse; the non-leader workers receive one as well. One important field of the JoinGroupResponse is the generation, which says which round of rebalancing this is, similar to an epoch. If a worker re-joins the group carrying an old generation, its information is not taken into account (its assignment is considered stale). The non-leader workers also send SyncGroupRequests, just with an empty assignment.
The process as a sequence diagram:
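In text form (heartbeats after the initial signal omitted):

 c1, c2                          coordinator                        c3
   |-------- heartbeat ----------->|                                 |
   |                               |<------ JoinGroupRequest --------|
   |<- REBALANCE_IN_PROGRESS ------|                                 |
   | (revoke ALL connectors/tasks) |                                 |
   |-------- JoinGroupRequest ---->|                                 |
   |<------- JoinGroupResponse ----|------- JoinGroupResponse ------>|
   | (leader computes assignment)  |                                 |
   |-------- SyncGroupRequest ---->|<------ SyncGroupRequest --------|
   |<------- SyncGroupResponse ----|------- SyncGroupResponse ------>|
   | (start assigned connectors/tasks)                               |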
To summarize: the whole rebalance involves three roles — the coordinator, the leader, and the other Connect workers (members below).
- All members first register with the coordinator, the coordinator elects a leader, and the leader computes the assignment. These three roles resemble YARN's ResourceManager, ApplicationMaster and NodeManager, and exist to solve the same scalability problem. A single Kafka cluster may host far more Connect groups than it has brokers, and each Connect worker can be unstable and change frequently; every change requires a round of coordination. If the brokers performed the actual coordination work, it would add considerable load on them. Electing one Connect worker as leader and letting it run the expensive coordination shifts that load to the client side, relieving the brokers and allowing a larger number of groups;
- Unlike YARN with its NodeManagers, Kafka has nothing that monitors node state in real time. So the leader, just like every other Connect worker, must periodically send heartbeats to the coordinator;
- YARN's RM is only responsible for resource allocation, whereas Kafka's coordinator must also determine group membership; even once a leader exists, the leader does not decide who is in the group. It follows that all group members must heartbeat to the coordinator, since that is how the coordinator establishes membership. Why not heartbeat directly to the leader? Probably for reliability: a network partition can exist between the leader and other members, but the coordinator is a broker, and any member that cannot reach the coordinator cannot be part of the group anyway. This also dictates that the Group Management Protocol must not rely on reliable networking between members and the leader; the leader should not interact with members directly, and the group should be managed through the coordinator. This is a clear difference from YARN, where every node lives inside the cluster; Kafka clients and Connect workers are not part of the brokers and may sit in all sorts of network environments and geographic locations.
Now let's look at how the Eager protocol computes its assignment:
After receiving the member information from the coordinator, the leader arranges all Connect workers (itself included) into a head-to-tail circular list, then walks through all connectors, handing each to the next node in the circle; when it reaches the tail it wraps back to the head, until every connector has been assigned (tasks go through the same process). The code:
// EagerAssignor, line 89
private Map<String, ByteBuffer> performTaskAssignment(String leaderId, long maxOffset,
Map<String, ExtendedWorkerState> memberConfigs,
WorkerCoordinator coordinator) {
// The keys of these maps are Connect worker ids; the values are the collections of
// connectorIds / taskIds, i.e. the connectors and tasks assigned to each worker
Map<String, Collection<String>> connectorAssignments = new HashMap<>();
Map<String, Collection<ConnectorTaskId>> taskAssignments = new HashMap<>();
// Perform round-robin task assignment. Assign all connectors and then all tasks because
// assigning both the connector and its tasks can lead to very uneven distribution of work
// in some common cases (e.g. for connectors that generate only 1 task each; in a cluster
// of 2 or an even # of nodes, only even nodes will be assigned connectors and only odd
// nodes will be assigned tasks, but tasks are, on average, actually more resource
// intensive than connectors).
List<String> connectorsSorted = sorted(coordinator.configSnapshot().connectors());
// Sort all the Connect workers and arrange them into a head-to-tail circular list
CircularIterator<String> memberIt = new CircularIterator<>(sorted(memberConfigs.keySet()));
// Iterate over every connector, assigning it to the next node in the circle;
// CircularIterator wraps back to the head after reaching the tail
for (String connectorId : connectorsSorted) {
String connectorAssignedTo = memberIt.next();
log.trace("Assigning connector {} to {}", connectorId, connectorAssignedTo);
Collection<String> memberConnectors = connectorAssignments.get(connectorAssignedTo);
if (memberConnectors == null) {
memberConnectors = new ArrayList<>();
connectorAssignments.put(connectorAssignedTo, memberConnectors);
}
memberConnectors.add(connectorId);
}
for (String connectorId : connectorsSorted) {
// Tasks are assigned separately from connectors, but in exactly the same way:
// each task is handed to the next node in the circle
for (ConnectorTaskId taskId : sorted(coordinator.configSnapshot().tasks(connectorId))) {
String taskAssignedTo = memberIt.next();
log.trace("Assigning task {} to {}", taskId, taskAssignedTo);
Collection<ConnectorTaskId> memberTasks = taskAssignments.get(taskAssignedTo);
if (memberTasks == null) {
memberTasks = new ArrayList<>();
taskAssignments.put(taskAssignedTo, memberTasks);
}
memberTasks.add(taskId);
}
}
coordinator.leaderState(new LeaderState(memberConfigs, connectorAssignments, taskAssignments));
// Serialize the resulting assignment
return fillAssignmentsAndSerialize(memberConfigs.keySet(), Assignment.NO_ERROR,
leaderId, memberConfigs.get(leaderId).url(), maxOffset, connectorAssignments, taskAssignments);
}
Note: connectors and tasks are assigned in two separate round-robin loops. Assigning a connector together with its tasks can lead to a very uneven distribution of work in some common cases (e.g. for connectors that each generate only 1 task: in a Connect cluster with 2, or any even number of, nodes, only the even nodes would be assigned connectors and only the odd nodes would be assigned tasks, while tasks are, on average, actually more resource intensive than connectors).
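A tiny self-contained simulation of that effect (my own illustration, not Kafka code): with 4 connectors that each generate exactly 1 task and 2 workers, assigning each connector immediately followed by its task puts every connector on one worker and every task on the other, while two separate round-robin passes spread both kinds evenly.

import java.util.*;

public class RoundRobinDemo {
    public static void main(String[] args) {
        List<String> workers = List.of("w0", "w1");
        int numConnectors = 4; // each connector generates exactly 1 task

        // Strategy A: assign each connector immediately followed by its task.
        Map<String, List<String>> combined = new HashMap<>();
        int slot = 0;
        for (int c = 0; c < numConnectors; c++) {
            combined.computeIfAbsent(workers.get(slot++ % 2), w -> new ArrayList<>()).add("ct" + c);
            combined.computeIfAbsent(workers.get(slot++ % 2), w -> new ArrayList<>()).add("t" + c);
        }
        // With 2 workers, all connectors land on w0 and all tasks on w1:
        System.out.println("combined: " + combined);

        // Strategy B (what EagerAssignor does): one pass for connectors, one for tasks.
        Map<String, List<String>> separate = new HashMap<>();
        slot = 0;
        for (int c = 0; c < numConnectors; c++)
            separate.computeIfAbsent(workers.get(slot++ % 2), w -> new ArrayList<>()).add("ct" + c);
        for (int c = 0; c < numConnectors; c++)
            separate.computeIfAbsent(workers.get(slot++ % 2), w -> new ArrayList<>()).add("t" + c);
        // Each worker ends up with 2 connectors and 2 tasks:
        System.out.println("separate: " + separate);
    }
}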
3. Incremental Cooperative Rebalance
The biggest problem with the Eager protocol is that any change in group membership, or any request sent to the cluster such as adding a connector, immediately triggers a rebalance in which every Connect worker must abandon the connectors and tasks it is currently running. This behavior is called Stop The World (STW below). For the Kafka consumer, which also uses the Eager protocol, this matters little, since consumers all execute essentially the same kind of work. Kafka Connect is different: the connectors and tasks running on different workers can differ completely (say, c1 runs a debezium-mysql job while c2 runs a mongodb job), yet a change to any one connector disturbs all the others. After the community recognized this problem, Kafka 2.3 shipped a new, compatible rebalance protocol: Incremental Cooperative Rebalance.
Incremental Cooperative Rebalance is not a brand-new protocol but an extension of Eager: each Connect worker, when joining the cluster, reports the assignment it is currently executing. Continuing with the earlier example:
- When c1 and c2 learn they need to re-join the cluster, they do not revoke their current work; instead, the JoinGroupRequest each sends carries the assignment it received in the previous rebalance round;
- After receiving the JoinGroupResponse from the coordinator, the leader likewise does not reshuffle all connectors and tasks. It computes, from the current connectors, tasks and workers, how much each node should carry: the connector and task counts divided by the number of workers, newly joined ones included. Based on that share, the leader decides which connectors and tasks each worker must revoke (give up);
- On receiving its SyncGroupResponse, each worker stops the connectors and tasks listed in the revocation (skipping this step if the revocation is empty);
- Only now has the cluster completed the first join. Incremental Cooperative Rebalance is in fact a protocol that makes every worker join twice: after step 3, once every worker has received its SyncGroupResponse and stopped what had to be stopped, the coordinator does not end the rebalance; every worker notices during its heartbeat that the group is still rebalancing and has to join again;
- The point of this second join is to distribute the connectors and tasks that the other workers of the group gave up in step 3. The distribution is simple; taking connectors as an example, sort all the workers by the connectors already assigned to them and take the first one, say ct1, the node holding the fewest connectors; find the first worker holding more than it, say ct2 (effectively the first and the second in sorted order), and pour connectors into ct1 until its count equals ct2's, at which point ct1 and ct2 hold the same number. Then take the next one, ct3; with the counts now ct3 > ct2 = ct1, pour connectors into ct1 and ct2 until ct3 = ct2 = ct1, and so on (a minimal sketch of this fill-up follows this list).
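To make the fill-up concrete, here is a minimal, self-contained sketch (my own simplification; the real implementation is assignConnectors, shown in full at the end of this section):

import java.util.*;

public class FillUpDemo {
    // Simplified stand-in for the fill-up in assignConnectors: workers are plain
    // lists of connector names; new connectors are handed out from a queue.
    static void assign(List<List<String>> workers, Deque<String> newConnectors) {
        // Sort once by current connector count; topping up the prefix keeps the order sorted.
        workers.sort(Comparator.comparingInt(List::size));
        while (!newConnectors.isEmpty()) {
            int firstLoad = workers.get(0).size();
            // Index of the first worker holding more than the least-loaded one
            int upTo = workers.size();
            for (int i = 0; i < workers.size(); i++) {
                if (workers.get(i).size() > firstLoad) { upTo = i; break; }
            }
            // Top up every worker before that index by one connector.
            for (int i = 0; i < upTo && !newConnectors.isEmpty(); i++) {
                workers.get(i).add(newConnectors.poll());
            }
        }
    }

    public static void main(String[] args) {
        List<List<String>> workers = new ArrayList<>();
        workers.add(new ArrayList<>(List.of("a")));           // load 1
        workers.add(new ArrayList<>(List.of("b", "c", "d"))); // load 3
        workers.add(new ArrayList<>());                       // newly joined, load 0
        assign(workers, new ArrayDeque<>(List.of("n1", "n2", "n3", "n4", "n5")));
        workers.forEach(System.out::println);                 // loads end up 3 / 3 / 3
    }
}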
Again as a sequence diagram:
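In text form (note the two join rounds):

 c1, c2 (tasks keep running)      coordinator                        c3
   |                                |<------ JoinGroupRequest -------|
   |<- REBALANCE_IN_PROGRESS -------|                                |
   |-- JoinGroupRequest ----------->|                                |
   |   (carries current assignment) |                                |
   |<------- JoinGroupResponse -----|------- JoinGroupResponse ----->|
   |-- SyncGroupRequest ----------->|<------ SyncGroupRequest -------|
   |   (leader: assignment + revocations)                            |
   |<- SyncGroupResponse (revoke) --|------- SyncGroupResponse ----->|
   | (stop ONLY revoked work)       |                                |
   |-- JoinGroupRequest (2nd join) >|<------ JoinGroupRequest -------|
   |<------- SyncGroupResponse -----|------- SyncGroupResponse ----->|
   |   (revoked work re-assigned, e.g. to c3)                        |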
Here is the code that computes the assignment (IncrementalCooperativeAssignor):
protected Map<String, ByteBuffer> performTaskAssignment(String leaderId, long maxOffset,
Map<String, ExtendedWorkerState> memberConfigs,
WorkerCoordinator coordinator, short protocolVersion) {
log.debug("Performing task assignment during generation: {} with memberId: {}",
coordinator.generationId(), coordinator.memberId());
// Base set: The previous assignment of connectors-and-tasks is a standalone snapshot that
// can be used to calculate derived sets
log.debug("Previous assignments: {}", previousAssignment);
// Check whether this leader's previousGenerationId matches the coordinator's last
// completed generation. If not, the assignment view this leader computed last time
// must be cleared. This can happen if this worker was the leader in some round,
// computed an assignment, but was then partitioned from the coordinator, so the
// assignment never synced. If this worker is elected leader again after re-joining,
// the view it kept is stale and should be dropped.
int lastCompletedGenerationId = coordinator.lastCompletedGenerationId();
if (previousGenerationId != lastCompletedGenerationId) {
log.debug("Clearing the view of previous assignments due to generation mismatch between "
+ "previous generation ID {} and last completed generation ID {}. This can "
+ "happen if the leader fails to sync the assignment within a rebalancing round. "
+ "The following view of previous assignments might be outdated and will be "
+ "ignored by the leader in the current computation of new assignments. "
+ "Possibly outdated previous assignments: {}",
previousGenerationId, lastCompletedGenerationId, previousAssignment);
// Clear the assignment view this leader has kept
this.previousAssignment = ConnectorsAndTasks.EMPTY;
}
// Fetch the current cluster config snapshot from the coordinator; it contains the
// connectors and tasks that should exist (i.e. that need to be created or kept running)
ClusterConfigState snapshot = coordinator.configSnapshot();
// Extract the connector and task snapshots from configSnapshot
Set<String> configuredConnectors = new TreeSet<>(snapshot.connectors());
Set<ConnectorTaskId> configuredTasks = configuredConnectors.stream()
.flatMap(c -> snapshot.tasks(c).stream())
.collect(Collectors.toSet());
// Base set: The set of configured connectors-and-tasks is a standalone snapshot that can
// be used to calculate derived sets
ConnectorsAndTasks configured = new ConnectorsAndTasks.Builder()
.with(configuredConnectors, configuredTasks).build();
log.debug("Configured assignments: {}", configured);
// Base set: The set of active connectors-and-tasks is a standalone snapshot that can be
// used to calculate derived sets
// The connectors and tasks from the assignments the workers reported when joining.
// Unlike the snapshot fetched from the coordinator, these assignments still include
// the connectors and tasks that are about to be deleted
ConnectorsAndTasks activeAssignments = assignment(memberConfigs);
log.debug("Active assignments: {}", activeAssignments);
// This means that a previous revocation did not take effect. In this case, reset
// appropriately and be ready to re-apply revocation of tasks
// The leader also remembers, in previousRevocation, the assignment it asked workers
// to give up last round. If previousRevocation is non-empty, there were revocations
// last round; compare them with the assignments the workers just reported. If any
// revoked connector/task still shows up, the revocation failed: reset the view and
// prepare to re-apply the revocation
if (!previousRevocation.isEmpty()) {
if (previousRevocation.connectors().stream().anyMatch(c -> activeAssignments.connectors().contains(c))
|| previousRevocation.tasks().stream().anyMatch(t -> activeAssignments.tasks().contains(t))) {
previousAssignment = activeAssignments;
canRevoke = true;
}
previousRevocation.connectors().clear();
previousRevocation.tasks().clear();
}
// Derived set: The set of deleted connectors-and-tasks is a derived set from the set
// difference of previous - configured
// Diff the leader's snapshot of the previous assignment against the coordinator's
// current snapshot to compute the deleted connectors and tasks
// (ps: why not diff the worker-reported assignments against the coordinator snapshot?)
ConnectorsAndTasks deleted = diff(previousAssignment, configured);
log.debug("Deleted assignments: {}", deleted);
// Derived set: The set of remaining active connectors-and-tasks is a derived set from the
// set difference of active - deleted
// Subtract the deleted set from what the workers reported to get the assignment that must keep running
ConnectorsAndTasks remainingActive = diff(activeAssignments, deleted);
log.debug("Remaining (excluding deleted) active assignments: {}", remainingActive);
// Derived set: The set of lost or unaccounted connectors-and-tasks is a derived set from
// the set difference of previous - active - deleted
// Starting from the leader's snapshot, subtract the worker-reported assignments and
// the deleted set to get the lost assignments: connectors/tasks whose workers may
// have died and therefore never reported them in this rebalance
ConnectorsAndTasks lostAssignments = diff(previousAssignment, activeAssignments, deleted);
log.debug("Lost assignments: {}", lostAssignments);
// Derived set: The set of new connectors-and-tasks is a derived set from the set
// difference of configured - previous - active
// Starting from the coordinator's snapshot, subtract the leader's snapshot and the
// worker-reported assignments to get the newly submitted connectors and tasks
ConnectorsAndTasks newSubmissions = diff(configured, previousAssignment, activeAssignments);
log.debug("New assignments: {}", newSubmissions);
// A collection of the complete assignment
// completeWorkerAssignment is the connectors and tasks currently held by each worker, i.e. each worker's current load
List<WorkerLoad> completeWorkerAssignment = workerAssignment(memberConfigs, ConnectorsAndTasks.EMPTY);
log.debug("Complete (ignoring deletions) worker assignments: {}", completeWorkerAssignment);
// Per worker connector assignments without removing deleted connectors yet
// connectorAssignments: the connectors currently held by each worker
Map<String, Collection<String>> connectorAssignments =
completeWorkerAssignment.stream().collect(Collectors.toMap(WorkerLoad::worker, WorkerLoad::connectors));
log.debug("Complete (ignoring deletions) connector assignments: {}", connectorAssignments);
// Per worker task assignments without removing deleted connectors yet
// taskAssignments: the tasks currently held by each worker
Map<String, Collection<ConnectorTaskId>> taskAssignments =
completeWorkerAssignment.stream().collect(Collectors.toMap(WorkerLoad::worker, WorkerLoad::tasks));
log.debug("Complete (ignoring deletions) task assignments: {}", taskAssignments);
// A collection of the current assignment excluding the connectors-and-tasks to be deleted
// currentWorkerAssignment is completeWorkerAssignment minus the assignments to be
// deleted, i.e. each worker's load after the deletions
List<WorkerLoad> currentWorkerAssignment = workerAssignment(memberConfigs, deleted);
// deleted (computed above) must be mapped back onto the workers that currently hold
// those connectors and tasks, which tells us what each worker has to revoke
Map<String, ConnectorsAndTasks> toRevoke = computeDeleted(deleted, connectorAssignments, taskAssignments);
log.debug("Connector and task to delete assignments: {}", toRevoke);
// Revoking redundant connectors/tasks if the workers have duplicate assignments
// This step removes connectors and tasks that ended up assigned to more than one worker
toRevoke.putAll(computeDuplicatedAssignments(memberConfigs, connectorAssignments, taskAssignments));
log.debug("Connector and task to revoke assignments (include duplicated assignments): {}", toRevoke);
// Recompute the complete assignment excluding the deleted connectors-and-tasks
// Recompute each worker's assignment with the deleted connectors and tasks excluded
completeWorkerAssignment = workerAssignment(memberConfigs, deleted);
connectorAssignments =
completeWorkerAssignment.stream().collect(Collectors.toMap(WorkerLoad::worker, WorkerLoad::connectors));
taskAssignments =
completeWorkerAssignment.stream().collect(Collectors.toMap(WorkerLoad::worker, WorkerLoad::tasks));
// Handle the lost assignments
handleLostAssignments(lostAssignments, newSubmissions, completeWorkerAssignment, memberConfigs);
// Do not revoke resources for re-assignment while a delayed rebalance is active
// Also we do not revoke in two consecutive rebalances by the same leader
canRevoke = delay == 0 && canRevoke;
// Compute the connectors-and-tasks to be revoked for load balancing without taking into
// account the deleted ones.
log.debug("Can leader revoke tasks in this assignment? {} (delay: {})", canRevoke, delay);
if (canRevoke) {
// Compute which connectors and tasks each worker must give up to rebalance the overall load
Map<String, ConnectorsAndTasks> toExplicitlyRevoke =
performTaskRevocation(activeAssignments, currentWorkerAssignment);
log.debug("Connector and task to revoke assignments: {}", toRevoke);
toExplicitlyRevoke.forEach(
(worker, assignment) -> {
ConnectorsAndTasks existing = toRevoke.computeIfAbsent(
worker,
v -> new ConnectorsAndTasks.Builder().build());
existing.connectors().addAll(assignment.connectors());
existing.tasks().addAll(assignment.tasks());
}
);
canRevoke = toExplicitlyRevoke.size() == 0;
} else {
canRevoke = delay == 0;
}
// Distribute the newly submitted connectors and tasks across all the workers
assignConnectors(completeWorkerAssignment, newSubmissions.connectors());
assignTasks(completeWorkerAssignment, newSubmissions.tasks());
log.debug("Current complete assignments: {}", currentWorkerAssignment);
log.debug("New complete assignments: {}", completeWorkerAssignment);
Map<String, Collection<String>> currentConnectorAssignments =
currentWorkerAssignment.stream().collect(Collectors.toMap(WorkerLoad::worker, WorkerLoad::connectors));
Map<String, Collection<ConnectorTaskId>> currentTaskAssignments =
currentWorkerAssignment.stream().collect(Collectors.toMap(WorkerLoad::worker, WorkerLoad::tasks));
Map<String, Collection<String>> incrementalConnectorAssignments =
diff(connectorAssignments, currentConnectorAssignments);
Map<String, Collection<ConnectorTaskId>> incrementalTaskAssignments =
diff(taskAssignments, currentTaskAssignments);
log.debug("Incremental connector assignments: {}", incrementalConnectorAssignments);
log.debug("Incremental task assignments: {}", incrementalTaskAssignments);
coordinator.leaderState(new LeaderState(memberConfigs, connectorAssignments, taskAssignments));
Map<String, ExtendedAssignment> assignments =
fillAssignments(memberConfigs.keySet(), Assignment.NO_ERROR, leaderId,
memberConfigs.get(leaderId).url(), maxOffset, incrementalConnectorAssignments,
incrementalTaskAssignments, toRevoke, delay, protocolVersion);
previousAssignment = computePreviousAssignment(toRevoke, connectorAssignments, taskAssignments, lostAssignments);
previousGenerationId = coordinator.generationId();
previousMembers = memberConfigs.keySet();
log.debug("Actual assignments: {}", assignments);
return serializeAssignments(assignments);
}
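Stripped of the Connect types, the heart of this method is set arithmetic over three base sets: configured (the coordinator's snapshot), previous (the leader's view from last round), and active (what the workers reported). A toy recap with made-up ids (my own illustration, not Kafka code):

import java.util.*;

public class DerivedSetsDemo {
    @SafeVarargs
    static Set<String> diff(Set<String> base, Set<String>... toSubtract) {
        Set<String> result = new TreeSet<>(base);
        for (Set<String> s : toSubtract) result.removeAll(s);
        return result;
    }

    public static void main(String[] args) {
        Set<String> configured = Set.of("ct1", "ct2", "ct4"); // coordinator's config snapshot
        Set<String> previous   = Set.of("ct1", "ct2", "ct3"); // leader's view from last round
        Set<String> active     = Set.of("ct1", "ct3");        // reported by workers this round

        Set<String> deleted         = diff(previous, configured);         // [ct3]: removed from config
        Set<String> remainingActive = diff(active, deleted);              // [ct1]: must keep running
        Set<String> lost            = diff(previous, active, deleted);    // [ct2]: its worker vanished
        Set<String> newSubmissions  = diff(configured, previous, active); // [ct4]: newly submitted

        System.out.printf("deleted=%s remaining=%s lost=%s new=%s%n",
                deleted, remainingActive, lost, newSubmissions);
    }
}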
The logic that computes what each worker needs to revoke:
private Map<String, ConnectorsAndTasks> performTaskRevocation(ConnectorsAndTasks activeAssignments,
Collection<WorkerLoad> completeWorkerAssignment) {
// The number of connectors already assigned to the workers joining this rebalance, i.e. the connector count from the previous round
int totalActiveConnectorsNum = activeAssignments.connectors().size();
// Likewise, the task count from the previous round
int totalActiveTasksNum = activeAssignments.tasks().size();
// existingWorkers are the workers that already hold an assignment; this filters out the newly joined nodes
Collection<WorkerLoad> existingWorkers = completeWorkerAssignment.stream()
.filter(wl -> wl.size() > 0)
.collect(Collectors.toList());
int existingWorkersNum = existingWorkers.size();
int totalWorkersNum = completeWorkerAssignment.size();
// The number of newly joined workers
int newWorkersNum = totalWorkersNum - existingWorkersNum;
if (log.isDebugEnabled()) {
completeWorkerAssignment.forEach(wl -> log.debug(
"Per worker current load size; worker: {} connectors: {} tasks: {}",
wl.worker(), wl.connectorsSize(), wl.tasksSize()));
}
Map<String, ConnectorsAndTasks> revoking = new HashMap<>();
// If there are no new workers, or no existing workers to revoke tasks from return early
// after logging the status
// Assignments only need to be taken away from existing workers when there are both
// newly joined workers and existing workers holding load; otherwise return early
if (!(newWorkersNum > 0 && existingWorkersNum > 0)) {
log.debug("No task revocation required; workers with existing load: {} workers with "
+ "no load {} total workers {}",
existingWorkersNum, newWorkersNum, totalWorkersNum);
// This is intentionally empty but mutable, because the map is used to include deleted
// connectors and tasks as well
return revoking;
}
log.debug("Task revocation is required; workers with existing load: {} workers with "
+ "no load {} total workers {}",
existingWorkersNum, newWorkersNum, totalWorkersNum);
// We have at least one worker assignment (the leader itself) so totalWorkersNum can't be 0
log.debug("Previous rounded down (floor) average number of connectors per worker {}", totalActiveConnectorsNum / existingWorkersNum);
// Divide the number of connectors the workers reported by the total number of workers
// in this rebalance (new ones included) to get floorConnectors, each worker's share
// if the connectors were spread across every worker
int floorConnectors = totalActiveConnectorsNum / totalWorkersNum;
// ceilConnectors is the same share rounded up
int ceilConnectors = floorConnectors + ((totalActiveConnectorsNum % totalWorkersNum == 0) ? 0 : 1);
log.debug("New average number of connectors per worker rounded down (floor) {} and rounded up (ceil) {}", floorConnectors, ceilConnectors);
log.debug("Previous rounded down (floor) average number of tasks per worker {}", totalActiveTasksNum / existingWorkersNum);
int floorTasks = totalActiveTasksNum / totalWorkersNum;
int ceilTasks = floorTasks + ((totalActiveTasksNum % totalWorkersNum == 0) ? 0 : 1);
log.debug("New average number of tasks per worker rounded down (floor) {} and rounded up (ceil) {}", floorTasks, ceilTasks);
int numToRevoke;
// Iterate over the workers that already hold connectors and tasks (newly joined workers excluded)
for (WorkerLoad existing : existingWorkers) {
Iterator<String> connectors = existing.connectors().iterator();
// The worker's current connector count minus the balanced per-worker count gives
// numToRevoke, the number of connectors this worker must give up
numToRevoke = existing.connectorsSize() - ceilConnectors;
// Note: a worker may hold fewer connectors than ceilConnectors, in which case it
// revokes nothing and will instead be topped up with more connectors later
for (int i = existing.connectorsSize(); i > floorConnectors && numToRevoke > 0; --i, --numToRevoke) {
ConnectorsAndTasks resources = revoking.computeIfAbsent(
existing.worker(),
w -> new ConnectorsAndTasks.Builder().build());
resources.connectors().add(connectors.next());
}
}
for (WorkerLoad existing : existingWorkers) {
Iterator<ConnectorTaskId> tasks = existing.tasks().iterator();
numToRevoke = existing.tasksSize() - ceilTasks;
log.debug("Tasks on worker {} is higher than ceiling, so revoking {} tasks", existing, numToRevoke);
for (int i = existing.tasksSize(); i > floorTasks && numToRevoke > 0; --i, --numToRevoke) {
ConnectorsAndTasks resources = revoking.computeIfAbsent(
existing.worker(),
w -> new ConnectorsAndTasks.Builder().build());
resources.tasks().add(tasks.next());
}
}
return revoking;
}
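Plugging the running example into this arithmetic (c1 and c2 holding {ct1, t1, t2} and {ct2, t3}, new empty worker c3, so totalWorkersNum = 3) shows why exactly one task ends up revoked and handed to c3. A worked example under those assumptions:

public class RevocationMathDemo {
    public static void main(String[] args) {
        // The running example: {c1 -> ct1, t1, t2}, {c2 -> ct2, t3}; c3 joins empty.
        int totalActiveConnectors = 2, totalActiveTasks = 3, totalWorkers = 3;

        int floorConnectors = totalActiveConnectors / totalWorkers;                                  // 0
        int ceilConnectors  = floorConnectors + (totalActiveConnectors % totalWorkers == 0 ? 0 : 1); // 1
        int floorTasks      = totalActiveTasks / totalWorkers;                                       // 1
        int ceilTasks       = floorTasks + (totalActiveTasks % totalWorkers == 0 ? 0 : 1);           // 1

        // numToRevoke = current size - ceiling (never below zero):
        System.out.println("c1 connectors to revoke: " + Math.max(0, 1 - ceilConnectors)); // 0
        System.out.println("c2 connectors to revoke: " + Math.max(0, 1 - ceilConnectors)); // 0
        System.out.println("c1 tasks to revoke:      " + Math.max(0, 2 - ceilTasks));      // 1 (e.g. t2)
        System.out.println("c2 tasks to revoke:      " + Math.max(0, 1 - ceilTasks));      // 0
        // The revoked task is handed to c3 in the second join, ending with one task per worker.
    }
}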
The part that assigns new connectors (tasks are handled similarly):
protected void assignConnectors(List<WorkerLoad> workerAssignment, Collection<String> connectors) {
// The distribution here is simple: sort the workers by the number of connectors they
// already hold (deleted ones excluded). Take the first worker, say ct1, and find the
// first worker holding more than it, say ct2 (effectively the first and the second);
// pour connectors into ct1 until its count equals ct2's. Then take the next one, ct3;
// with counts ct3 > ct2 = ct1, pour connectors into ct1 and ct2 until ct3 = ct2 = ct1,
// and so on.
// Sort the workers by connector count
workerAssignment.sort(WorkerLoad.connectorComparator());
// Take the first worker, i.e. the one holding the fewest connectors
WorkerLoad first = workerAssignment.get(0);
// Iterate over all the connectors to be assigned
Iterator<String> load = connectors.iterator();
while (load.hasNext()) {
// The connector count of the least-loaded worker
int firstLoad = first.connectorsSize();
// Find the first worker holding more connectors than it
int upTo = IntStream.range(0, workerAssignment.size())
.filter(i -> workerAssignment.get(i).connectorsSize() > firstLoad)
.findFirst()
.orElse(workerAssignment.size());
// subList cuts off just before the first worker with a higher count (exclusive),
// and the new connectors are poured into that prefix
for (WorkerLoad worker : workerAssignment.subList(0, upTo)) {
String connector = load.next();
log.debug("Assigning connector {} to {}", connector, worker.worker());
worker.assign(connector);
if (!load.hasNext()) {
break;
}
}
}
}
4. Comparing the old and new protocols
The chart, taken from the official Kafka blog, compares the Eager and Incremental Cooperative protocols using 90 connectors with 10 tasks each (900 tasks in total), measuring how long a rebalance takes.
On the left of each chart, the cost of Eager rebalancing (which stops the world every time a connector and its tasks start or stop) is proportional to the number of tasks currently running in the cluster. The cost is similar whether connectors are starting or stopping, with the cluster taking roughly 14 and 12 minutes, respectively, to stabilize. In contrast, on the right, Incremental Cooperative rebalancing balances the 900 tasks in under a minute, and the cost of each individual rebalance is clearly independent of the number of tasks in the cluster. The bar chart below makes this plain by comparing the time taken to start and stop 90 connectors and 900 tasks.
In general, a Connect cluster that runs many different connectors, or that performs rolling plugin upgrades, is better served by Incremental Cooperative Rebalance. Eager Rebalance suits clusters with few connectors and tasks, where the single (rather than double) join lets the cluster reach a balanced state immediately.