Slot Management in the JobManager
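Compared with the TaskExecutor and the ResourceManager, resource management in the JobManager is somewhat more complex, mainly because Flink lets multiple subtasks run in the same slot through SlotSharingGroup and CoLocationGroup constraints. Inside the JobMaster, the SlotPool communicates with the ResourceManager and the TaskExecutors and manages the slots assigned to the current JobMaster, while scheduling and resource allocation for the individual subtasks of the job rely mainly on the Scheduler and the SlotSharingManager. The main scheduling and allocation flow is as follows: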
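Interactions between the JobManager and the other components:
- Scheduler -> SlotPool: the scheduler requests resources from the SlotPool.
- SlotPool -> ResourceManager: if the SlotPool cannot satisfy a request, it asks the RM for a new slot.
- JobMaster -> SlotPool: slots obtained from a TaskManager are handed to the SlotPool through the JobMaster.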
AllocatedSlot, LogicalSlot, MultiTaskSlot
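First, distinguish between AllocatedSlot and LogicalSlot. An AllocatedSlot represents a physical slot resource on a TaskExecutor, whereas a LogicalSlot represents a slot in the logical sense: a task can be deployed to one LogicalSlot, but a LogicalSlot does not correspond one-to-one to the concrete slot that an AllocatedSlot stands for. Because of mechanisms such as slot sharing, several LogicalSlots may be mapped to the same AllocatedSlot.
AllocatedSlot implements the SlotContext interface. It represents a slot that the JobMaster has obtained from a TaskExecutor, i.e. a share of the resources that the TaskExecutor has physically allocated: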
class AllocatedSlot implements SlotContext {
/** The ID under which the slot is allocated. Uniquely identifies the slot. */
private final AllocationID allocationId;
/** The location information of the TaskManager to which this slot belongs */
private final TaskManagerLocation taskManagerLocation;
/** The resource profile of the slot provides */
private final ResourceProfile resourceProfile;
/** RPC gateway to call the TaskManager that holds this slot */
private final TaskManagerGateway taskManagerGateway;
/** The number of the slot on the TaskManager to which slot belongs. Purely informational. */
private final int physicalSlotNumber;
private final AtomicReference<Payload> payloadReference;
public boolean tryAssignPayload(Payload payload) {
return payloadReference.compareAndSet(null, payload);
}
interface Payload { // Payload which can be assigned to an {@link AllocatedSlot}.
void release(Throwable cause); // Releases the payload
}
}
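The payload mechanism above boils down to a compare-and-set on an AtomicReference: a slot accepts a payload only while it holds none. The following is a minimal sketch of that idea, not Flink's class; PayloadSlotSketch is a hypothetical name and the release logic is simplified for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical, simplified stand-in for AllocatedSlot; not Flink's implementation. */
final class PayloadSlotSketch {
    /** Minimal payload contract, mirroring the shape of AllocatedSlot.Payload. */
    interface Payload {
        void release(Throwable cause);
    }

    private final AtomicReference<Payload> payloadReference = new AtomicReference<>(null);

    /** Succeeds only if no payload has been assigned yet. */
    boolean tryAssignPayload(Payload payload) {
        return payloadReference.compareAndSet(null, payload);
    }

    /** True while a payload occupies the slot. */
    boolean isUsed() {
        return payloadReference.get() != null;
    }

    /** Detaches the payload (if any) and notifies it; the slot is free again afterwards. */
    void releasePayload(Throwable cause) {
        Payload payload = payloadReference.getAndSet(null);
        if (payload != null) {
            payload.release(cause);
        }
    }
}
```

A second tryAssignPayload call fails until the first payload has been released, which is what keeps one physical slot bound to a single SingleLogicalSlot or a single root MultiTaskSlot at a time.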
The LogicalSlot interface and its implementation SingleLogicalSlot:
public interface LogicalSlot {
TaskManagerLocation getTaskManagerLocation();
TaskManagerGateway getTaskManagerGateway();
int getPhysicalSlotNumber();
AllocationID getAllocationId();
SlotRequestId getSlotRequestId();
Locality getLocality();
@Nullable
SlotSharingGroupId getSlotSharingGroupId();
CompletableFuture<?> releaseSlot(@Nullable Throwable cause);
boolean tryAssignPayload(Payload payload);
Payload getPayload();
interface Payload { // Payload for a logical slot.
void fail(Throwable cause);
CompletableFuture<?> getTerminalStateFuture();
}
}
public class SingleLogicalSlot implements LogicalSlot, AllocatedSlot.Payload {
private final SlotRequestId slotRequestId;
private final SlotContext slotContext;
// null if the logical slot does not belong to a slot sharing group, otherwise non-null
@Nullable
private final SlotSharingGroupId slotSharingGroupId;
// locality of this slot wrt the requested preferred locations
private final Locality locality;
// owner of this slot to which it is returned upon release
private final SlotOwner slotOwner;
private final CompletableFuture<Void> releaseFuture;
private volatile State state;
// LogicalSlot.Payload of this slot
private volatile Payload payload;
}
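SingleLogicalSlot implements the AllocatedSlot.Payload interface, which means a SingleLogicalSlot can be assigned to an AllocatedSlot as its payload. Similarly, LogicalSlot defines the payload it can carry: the LogicalSlot.Payload interface is implemented by Execution, i.e. a task that needs to be scheduled and run.
LogicalSlot also exposes two request-related IDs, AllocationID and SlotRequestId. The AllocationID identifies the allocation of a physical slot and is always associated with an AllocatedSlot, whereas the SlotRequestId identifies a request for a LogicalSlot made during task scheduling and is associated with the LogicalSlot.
To share a slot, several LogicalSlots have to be mapped to the same AllocatedSlot. This mapping relies on another implementation of AllocatedSlot.Payload: SlotSharingManager.MultiTaskSlot, an inner class of SlotSharingManager. MultiTaskSlot and SingleTaskSlot share the parent class TaskSlot, and both slot sharing and the mandatory CoLocationGroup constraint are realized by building a tree of TaskSlots. A MultiTaskSlot is an inner node of the tree and can have several children (either MultiTaskSlots or SingleTaskSlots), while a SingleTaskSlot is a leaf. The root of the tree is a MultiTaskSlot that is assigned a SlotContext, which represents the physical slot on a TaskExecutor; every task in the tree runs in that same slot. A MultiTaskSlot can hold several children as long as the AbstractIDs that identify them differ (an ID is either a JobVertexID or the ID of a CoLocationGroup).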
TaskSlot
abstract class TaskSlot {
private final SlotRequestId slotRequestId; // every TaskSlot has an associated slotRequestId
// all task slots except for the root slots have a group id assigned;
// the group id is either a JobVertexID or the ID of a CoLocationGroup
@Nullable
private final AbstractID groupId;
public boolean contains(AbstractID groupId) {
return Objects.equals(this.groupId, groupId);
}
public abstract void release(Throwable cause);
}
MultiTaskSlot extends TaskSlot and can have several children. It can act as the root or as an inner node. MultiTaskSlot also implements AllocatedSlot.Payload, so as a root node it can be assigned to an AllocatedSlot:
public final class MultiTaskSlot extends TaskSlot implements AllocatedSlot.Payload {
private final Map<AbstractID, TaskSlot> children;
// the root node has its parent set to null
@Nullable
private final MultiTaskSlot parent;
// underlying allocated slot
private final CompletableFuture<? extends SlotContext> slotContextFuture;
// slot request id of the allocated slot
@Nullable
private final SlotRequestId allocatedSlotRequestId;
}
A SingleTaskSlot can only be a leaf. It owns a LogicalSlot to which a concrete task can later be assigned:
public final class SingleTaskSlot extends TaskSlot {
private final MultiTaskSlot parent;
// future containing a LogicalSlot which is completed once the underlying SlotContext future is completed
private final CompletableFuture<SingleLogicalSlot> singleLogicalSlotFuture;
private SingleTaskSlot(
SlotRequestId slotRequestId,
AbstractID groupId,
MultiTaskSlot parent,
Locality locality) {
super(slotRequestId, groupId);
this.parent = Preconditions.checkNotNull(parent);
Preconditions.checkNotNull(locality);
singleLogicalSlotFuture = parent.getSlotContextFuture()
.thenApply(
(SlotContext slotContext) -> {
LOG.trace("Fulfill single task slot [{}] with slot [{}].", slotRequestId, slotContext.getAllocationId());
return new SingleLogicalSlot( // the LogicalSlot that a concrete task is later assigned to
slotRequestId,
slotContext,
slotSharingGroupId,
locality,
slotOwner);
});
}
}
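For a plain SlotSharingGroup constraint, the tree has a MultiTaskSlot as the root and several SingleTaskSlots as leaves; each leaf represents a different task, and the leaves are distinguished by their JobVertexIDs. For the mandatory CoLocationGroup constraint, a second-level MultiTaskSlot, identified by the CoLocationGroup ID, is created directly below the root, and the subtasks that share this CoLocationGroup constraint become the leaves of that second-level MultiTaskSlot.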
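The two tree shapes described above can be sketched with a small, self-contained model. This is only an illustration under simplifying assumptions: SlotTreeSketch and its nested classes are hypothetical names, group ids are plain strings instead of AbstractID, and no physical slot or future is attached to the root.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a TaskSlot tree; names and types are simplified, not Flink's. */
final class SlotTreeSketch {
    /** Common parent of inner nodes and leaves, identified by a group id. */
    abstract static class Node {
        final String groupId; // null only for the root

        Node(String groupId) {
            this.groupId = groupId;
        }

        boolean contains(String id) {
            return id.equals(groupId);
        }
    }

    /** Leaf node: stands for one subtask. */
    static final class Leaf extends Node {
        Leaf(String groupId) {
            super(groupId);
        }
    }

    /** Root or inner node: children are keyed by JobVertexID or CoLocationGroup id. */
    static final class Multi extends Node {
        final Map<String, Node> children = new HashMap<>();

        Multi(String groupId) {
            super(groupId);
        }

        /** True if this node or any descendant carries the given group id. */
        @Override
        boolean contains(String id) {
            if (super.contains(id)) {
                return true;
            }
            for (Node child : children.values()) {
                if (child.contains(id)) {
                    return true;
                }
            }
            return false;
        }

        Leaf addLeaf(String id) {
            checkAbsent(id);
            Leaf leaf = new Leaf(id);
            children.put(id, leaf);
            return leaf;
        }

        Multi addMulti(String id) {
            checkAbsent(id);
            Multi inner = new Multi(id);
            children.put(id, inner);
            return inner;
        }

        private void checkAbsent(String id) {
            if (contains(id)) {
                throw new IllegalStateException("Group " + id + " is already part of this tree");
            }
        }
    }
}
```

With this model, a plain sharing group is a root with leaves such as "source" and "map", while a co-location group adds a second-level Multi node under the root whose leaves are the co-located subtasks. Adding a second leaf with an id that is already present is rejected, which mirrors the rule that one shared slot never hosts two subtasks of the same vertex.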
SlotPool
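The JobManager uses the SlotPool to request slots from the ResourceManager and to manage all slots assigned to this JobManager. The slots meant here are always physical AllocatedSlots on a TaskExecutor. The only implementation of the SlotPool interface is SlotPoolImpl; its main fields are: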
class SlotPoolImpl implements SlotPool {
/** The book-keeping of all allocated slots. */
private final AllocatedSlots allocatedSlots; // slots of this JobManager that are currently assigned to a slot request
/** The book-keeping of all available slots. */
private final AvailableSlots availableSlots; // slots assigned to this JobManager that carry no payload yet
/** All pending requests waiting for slots. */
private final DualKeyMap<SlotRequestId, AllocationID, PendingRequest> pendingRequests; // requests that have already been sent to the ResourceManager
/** The requests that are waiting for the resource manager to be connected. */
private final HashMap<SlotRequestId, PendingRequest> waitingForResourceManager; // requests not yet sent because no ResourceManager is connected
}
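Every slot assigned to the SlotPool is uniquely identified by its AllocationID. getAvailableSlotsInformation returns the currently available slots (those without a payload), and allocateAvailableSlot then assigns the AllocatedSlot with a given AllocationID to the request identified by a given SlotRequestId: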
class SlotPoolImpl implements SlotPool {
@Override
public Collection<SlotInfo> getAvailableSlotsInformation() { // list the currently available slots
return availableSlots.listSlotInfo();
}
@Override
public Optional<PhysicalSlot> allocateAvailableSlot( // assign the slot identified by allocationID to the request identified by slotRequestId
@Nonnull SlotRequestId slotRequestId,
@Nonnull AllocationID allocationID) {
componentMainThreadExecutor.assertRunningInMainThread();
AllocatedSlot allocatedSlot = availableSlots.tryRemove(allocationID); // remove it from availableSlots
if (allocatedSlot != null) {
allocatedSlots.add(slotRequestId, allocatedSlot); // record it in the allocated-slot mapping
return Optional.of(allocatedSlot);
} else {
return Optional.empty();
}
}
}
If no slot is currently available, the SlotPool can be asked to request one from the ResourceManager:
class SlotPoolImpl implements SlotPool {
public CompletableFuture<PhysicalSlot> requestNewAllocatedSlot( // request a new slot from the RM
@Nonnull SlotRequestId slotRequestId,
@Nonnull ResourceProfile resourceProfile,
Time timeout) {
componentMainThreadExecutor.assertRunningInMainThread();
// build a PendingRequest for this slot request
final PendingRequest pendingRequest = PendingRequest.createStreamingRequest(slotRequestId, resourceProfile);
// register request timeout
// ............
return requestNewAllocatedSlotInternal(pendingRequest)
.thenApply((Function.identity()));
}
private CompletableFuture<AllocatedSlot> requestNewAllocatedSlotInternal(PendingRequest pendingRequest) {
if (resourceManagerGateway == null) {
stashRequestWaitingForResourceManager(pendingRequest); // not connected to the RM yet: wait until the connection is established
} else {
requestSlotFromResourceManager(resourceManagerGateway, pendingRequest); // already connected to the RM: request the slot from it
}
return pendingRequest.getAllocatedSlotFuture();
}
// without a connection to the RM, the request is parked in waitingForResourceManager;
// it is sent to the RM as soon as the connection is established
private void stashRequestWaitingForResourceManager(final PendingRequest pendingRequest) {
log.info("Cannot serve slot request, no ResourceManager connected. " +
"Adding as pending request [{}]", pendingRequest.getSlotRequestId());
waitingForResourceManager.put(pendingRequest.getSlotRequestId(), pendingRequest);
}
// a connection to the RM exists: request the slot from the RM
private void requestSlotFromResourceManager(
final ResourceManagerGateway resourceManagerGateway,
final PendingRequest pendingRequest) {
checkNotNull(resourceManagerGateway);
checkNotNull(pendingRequest);
log.info("Requesting new slot [{}] and profile {} from resource manager.", pendingRequest.getSlotRequestId(), pendingRequest.getResourceProfile());
final AllocationID allocationId = new AllocationID(); // generate an AllocationID; the slot assigned later is identified by it
pendingRequests.put(pendingRequest.getSlotRequestId(), allocationId, pendingRequest); // register it as a pending request
pendingRequest.getAllocatedSlotFuture().whenComplete(
(AllocatedSlot allocatedSlot, Throwable throwable) -> {
if (throwable != null || !allocationId.equals(allocatedSlot.getAllocationId())) {
// cancel the slot request if there is a failure or if the pending request has
// been completed with another allocated slot
resourceManagerGateway.cancelSlotRequest(allocationId);
}
});
// request the slot from the RM via RPC; how the RM handles resourceManagerGateway.requestSlot was covered earlier
CompletableFuture<Acknowledge> rmResponse = resourceManagerGateway.requestSlot(
jobMasterId,
new SlotRequest(jobId, allocationId, pendingRequest.getResourceProfile(), jobManagerAddress),
rpcTimeout);
FutureUtils.whenCompleteAsyncIfNotDone(
rmResponse,
componentMainThreadExecutor,
(Acknowledge ignored, Throwable failure) -> {
// on failure, fail the request future
if (failure != null) {
slotRequestToResourceManagerFailed(pendingRequest.getSlotRequestId(), failure);
}
});
}
}
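The branch between stashing a request and sending it can be reduced to two maps and a nullable gateway. The sketch below is a simplified model under stated assumptions: PendingRequestSketch, its ResourceManager interface and the string ids are hypothetical, and the flush on connect reflects the source comment that stashed requests are sent once the connection is established.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the stash-or-send decision; not Flink's SlotPoolImpl. */
final class PendingRequestSketch {
    /** Simplified stand-in for the ResourceManager gateway. */
    interface ResourceManager {
        void requestSlot(String allocationId, String profile);
    }

    /** Requests not yet sent: slot request id -> resource profile. */
    private final Map<String, String> waitingForResourceManager = new LinkedHashMap<>();
    /** Requests already sent: slot request id -> allocation id. */
    private final Map<String, String> pendingRequests = new LinkedHashMap<>();
    private ResourceManager resourceManager; // null while disconnected
    private int nextAllocation = 0;

    void requestNewSlot(String requestId, String profile) {
        if (resourceManager == null) {
            waitingForResourceManager.put(requestId, profile); // stash until connected
        } else {
            send(requestId, profile);
        }
    }

    /** Called once the connection exists; flushes every stashed request. */
    void connectToResourceManager(ResourceManager gateway) {
        resourceManager = gateway;
        for (Map.Entry<String, String> entry : waitingForResourceManager.entrySet()) {
            send(entry.getKey(), entry.getValue());
        }
        waitingForResourceManager.clear();
    }

    private void send(String requestId, String profile) {
        String allocationId = "alloc-" + (nextAllocation++); // fresh id per request
        pendingRequests.put(requestId, allocationId);
        resourceManager.requestSlot(allocationId, profile);
    }

    int waitingCount() {
        return waitingForResourceManager.size();
    }

    int pendingCount() {
        return pendingRequests.size();
    }
}
```

The point of keeping two collections is that a request is always in exactly one of them: either it has no allocation id yet, or it has one that a later slot offer can be matched against.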
The JobMaster requests slots from the ResourceManager through the SlotPool. When the ResourceManager receives the resourceManagerGateway.requestSlot() RPC from the JobMaster's SlotPool, it delegates the request to its internal SlotManager via slotManager.registerSlotRequest(slotRequest). The SlotManager in turn asks a TaskExecutor for a concrete slot through the taskExecutorGateway.requestSlot() RPC. On receiving that request, the TaskExecutor checks the free resources in its taskSlotTable and allocates a slot. Once the allocation succeeds, the TaskExecutor reports the slots allocated for that jobId to the corresponding JobMaster in TaskExecutor#offerSlotsToJobManager(jobId), which calls jobMasterGateway.offerSlots() over RPC to offer the slots to the JobMaster. In the end, the offerSlots() method of the JobMaster's SlotPool is invoked:
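class SlotPoolImpl {
// NOTE: this first method is the JobMaster-side entry point (it delegates to slotPool.offerSlots); offerSlot further below belongs to SlotPoolImpl.
// Offers slots to the SlotPool and returns the accepted ones. Slots that are not accepted can be assigned to other jobs by the RM.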
public CompletableFuture<Collection<SlotOffer>> offerSlots(
final ResourceID taskManagerId,
final Collection<SlotOffer> slots,
final Time timeout) {
Tuple2<TaskManagerLocation, TaskExecutorGateway> taskManager = registeredTaskManagers.get(taskManagerId);
if (taskManager == null) {
return FutureUtils.completedExceptionally(new Exception("Unknown TaskManager " + taskManagerId));
}
final TaskManagerLocation taskManagerLocation = taskManager.f0;
final TaskExecutorGateway taskExecutorGateway = taskManager.f1;
final RpcTaskManagerGateway rpcTaskManagerGateway = new RpcTaskManagerGateway(taskExecutorGateway, getFencingToken());
return CompletableFuture.completedFuture(
// the SlotPool decides for each slot whether to accept it; the returned collection contains the accepted slot offers
slotPool.offerSlots(
taskManagerLocation,
rpcTaskManagerGateway,
slots));
}
boolean offerSlot(
final TaskManagerLocation taskManagerLocation,
final TaskManagerGateway taskManagerGateway,
final SlotOffer slotOffer) {
componentMainThreadExecutor.assertRunningInMainThread();
// check if this TaskManager is valid
final ResourceID resourceID = taskManagerLocation.getResourceID();
final AllocationID allocationID = slotOffer.getAllocationId();
if (!registeredTaskManagers.contains(resourceID)) {
log.debug("Received outdated slot offering [{}] from unregistered TaskManager: {}",
slotOffer.getAllocationId(), taskManagerLocation);
return false;
}
// check whether we are already using this slot, i.e. whether its AllocationID is already known to the SlotPool
AllocatedSlot existingSlot;
if ((existingSlot = allocatedSlots.get(allocationID)) != null ||
(existingSlot = availableSlots.get(allocationID)) != null) {
// we need to figure out if this is a repeated offer for the exact same slot,
// or another offer that comes from a different TaskManager after the ResourceManager
// re-tried the request
// we write this in terms of comparing slot IDs, because the Slot IDs are the identifiers of
// the actual slots on the TaskManagers
// Note: The slotOffer should have the SlotID
final SlotID existingSlotId = existingSlot.getSlotId();
final SlotID newSlotId = new SlotID(taskManagerLocation.getResourceID(), slotOffer.getSlotIndex());
if (existingSlotId.equals(newSlotId)) { // the SlotPool already accepted this slot, so the TaskExecutor has sent a repeated offer
log.info("Received repeated offer for slot [{}]. Ignoring.", allocationID);
// return true here so that the sender will get a positive acknowledgement to the retry
// and mark the offering as a success
return true;
} else { // a different AllocatedSlot is already associated with this AllocationID, so this slot cannot be accepted
// the allocation has been fulfilled by another slot, reject the offer so the task executor
// will offer the slot to the resource manager
return false;
}
}
// this AllocationID has not been seen before; create an AllocatedSlot object for the newly assigned slot
final AllocatedSlot allocatedSlot = new AllocatedSlot(
allocationID,
taskManagerLocation,
slotOffer.getSlotIndex(),
slotOffer.getResourceProfile(),
taskManagerGateway);
// check whether we have a request waiting for this slot, i.e. a request associated with this AllocationID
PendingRequest pendingRequest = pendingRequests.removeKeyB(allocationID);
if (pendingRequest != null) {
// we were waiting for this!
allocatedSlots.add(pendingRequest.getSlotRequestId(), allocatedSlot);
// try to complete the waiting request
if (!pendingRequest.getAllocatedSlotFuture().complete(allocatedSlot)) {
// we could not complete the pending slot future --> try to fulfill another pending request
allocatedSlots.remove(pendingRequest.getSlotRequestId());
tryFulfillSlotRequestOrMakeAvailable(allocatedSlot);
} else {
log.debug("Fulfilled slot request [{}] with allocated slot [{}].", pendingRequest.getSlotRequestId(), allocationID);
}
}
else { // no request is waiting for this slot (it may already have been fulfilled); try to fulfill another pending request
// we were actually not waiting for this:
// - could be that this request had been fulfilled
// - we are receiving the slots from TaskManagers after becoming leaders
tryFulfillSlotRequestOrMakeAvailable(allocatedSlot);
}
// we accepted the request in any case. slot will be released after it idled for
// too long and timed out
return true;
}
}
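The accept-or-reject decision in offerSlot hinges on comparing the offered slot with whatever is already registered under the same AllocationID. A minimal sketch of just that decision follows; OfferSlotSketch is a hypothetical name, and ids are plain strings rather than AllocationID and SlotID.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the duplicate-offer check; not Flink's SlotPoolImpl. */
final class OfferSlotSketch {
    /** Known slots: allocation id -> id of the physical slot on a TaskManager. */
    private final Map<String, String> slotIdByAllocation = new HashMap<>();

    /**
     * Returns true if the offer is accepted.
     * A repeated offer for the same physical slot is acknowledged again,
     * an offer of a different physical slot for a known allocation is rejected.
     */
    boolean offerSlot(String allocationId, String slotId) {
        String existingSlotId = slotIdByAllocation.get(allocationId);
        if (existingSlotId != null) {
            return existingSlotId.equals(slotId);
        }
        slotIdByAllocation.put(allocationId, slotId); // first offer for this allocation
        return true;
    }
}
```

Acknowledging the repeated offer matters because the sender may be retrying after a lost response; rejecting it would make the TaskExecutor treat an accepted slot as unused.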
Whenever a new AllocatedSlot becomes available, SlotPoolImpl tries to use it to fulfill, ahead of time, another request that is still waiting for a response:
class SlotPoolImpl implements SlotPool {
private void tryFulfillSlotRequestOrMakeAvailable(AllocatedSlot allocatedSlot) {
Preconditions.checkState(!allocatedSlot.isUsed(), "Provided slot is still in use.");
// look for a pending request whose resource profile matches this AllocatedSlot
final PendingRequest pendingRequest = pollMatchingPendingRequest(allocatedSlot);
if (pendingRequest != null) {
log.debug("Fulfilling pending slot request [{}] early with returned slot [{}]",
pendingRequest.getSlotRequestId(), allocatedSlot.getAllocationId());
allocatedSlots.add(pendingRequest.getSlotRequestId(), allocatedSlot); // a matching request exists: assign the AllocatedSlot to it
pendingRequest.getAllocatedSlotFuture().complete(allocatedSlot);
} else {
log.debug("Adding returned slot [{}] to available slots", allocatedSlot.getAllocationId());
availableSlots.add(allocatedSlot, clock.relativeTimeMillis()); // otherwise the AllocatedSlot becomes available
}
}
private PendingRequest pollMatchingPendingRequest(final AllocatedSlot slot) { // find a pending request whose resource profile matches this AllocatedSlot
final ResourceProfile slotResources = slot.getResourceProfile();
// try the requests sent to the resource manager first
for (PendingRequest request : pendingRequests.values()) {
if (slotResources.isMatching(request.getResourceProfile())) {
pendingRequests.removeKeyA(request.getSlotRequestId());
return request;
}
}
// try the requests waiting for a resource manager connection next
for (PendingRequest request : waitingForResourceManager.values()) {
if (slotResources.isMatching(request.getResourceProfile())) {
waitingForResourceManager.remove(request.getSlotRequestId());
return request;
}
}
// no request pending, or no request matches
return null;
}
}
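The two-stage lookup in pollMatchingPendingRequest can be illustrated with a small model. This sketch makes an explicit assumption: a slot "matches" a request when it offers at least the requested CPU and memory, which is a stand-in for ResourceProfile.isMatching and not its actual rule. MatchingSketch and Profile are hypothetical names.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of polling a matching pending request; not Flink's SlotPoolImpl. */
final class MatchingSketch {
    /** Simplified resource profile (assumption: only CPU cores and memory). */
    static final class Profile {
        final double cpuCores;
        final int memoryMb;

        Profile(double cpuCores, int memoryMb) {
            this.cpuCores = cpuCores;
            this.memoryMb = memoryMb;
        }

        /** Assumed rule: the slot must offer at least what the request asks for. */
        boolean isMatching(Profile required) {
            return cpuCores >= required.cpuCores && memoryMb >= required.memoryMb;
        }
    }

    final Map<String, Profile> sentToResourceManager = new LinkedHashMap<>();
    final Map<String, Profile> waitingForResourceManager = new LinkedHashMap<>();

    /** Removes and returns the id of the first matching request, or null if none matches. */
    String pollMatching(Profile slot) {
        String requestId = pollFrom(sentToResourceManager, slot); // sent requests first
        return requestId != null ? requestId : pollFrom(waitingForResourceManager, slot);
    }

    private static String pollFrom(Map<String, Profile> requests, Profile slot) {
        Iterator<Map.Entry<String, Profile>> iterator = requests.entrySet().iterator();
        while (iterator.hasNext()) {
            Map.Entry<String, Profile> entry = iterator.next();
            if (slot.isMatching(entry.getValue())) {
                String requestId = entry.getKey(); // read the key before removing the entry
                iterator.remove();
                return requestId;
            }
        }
        return null;
    }
}
```

Requests that were already sent to the ResourceManager are served before stashed ones, and a request that is too large for the slot stays in its queue untouched.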
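When the SlotPool starts, it schedules a recurring task that periodically checks for idle slots. If a slot has been idle for too long, it is returned to the TaskManager through the taskManagerGateway.freeSlot() RPC: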
// class SlotPoolImpl implements SlotPool
public void start(
@Nonnull JobMasterId jobMasterId,
@Nonnull String newJobManagerAddress,
@Nonnull ComponentMainThreadExecutor componentMainThreadExecutor) throws Exception {
this.jobMasterId = jobMasterId;
this.jobManagerAddress = newJobManagerAddress;
this.componentMainThreadExecutor = componentMainThreadExecutor;
scheduleRunAsync(this::checkIdleSlot, idleSlotTimeout); // check for idle slots and return them to the TaskManager
scheduleRunAsync(this::checkBatchSlotTimeout, batchSlotTimeout);
if (log.isDebugEnabled()) {
scheduleRunAsync(this::scheduledLogStatus, STATUS_LOG_INTERVAL_MS, TimeUnit.MILLISECONDS);
}
}
/**
* Check the available slots, release the slot that is idle for a long time.
*/
protected void checkIdleSlot() {
// The timestamp in SlotAndTimestamp is relative
final long currentRelativeTimeMillis = clock.relativeTimeMillis();
final List<AllocatedSlot> expiredSlots = new ArrayList<>(availableSlots.size());
for (SlotAndTimestamp slotAndTimestamp : availableSlots.availableSlots.values()) {
if (currentRelativeTimeMillis - slotAndTimestamp.timestamp > idleSlotTimeout.toMilliseconds()) {
expiredSlots.add(slotAndTimestamp.slot);
}
}
final FlinkException cause = new FlinkException("Releasing idle slot.");
for (AllocatedSlot expiredSlot : expiredSlots) {
final AllocationID allocationID = expiredSlot.getAllocationId();
if (availableSlots.tryRemove(allocationID) != null) {
log.info("Releasing idle slot [{}].", allocationID);
final CompletableFuture<Acknowledge> freeSlotFuture = expiredSlot.getTaskManagerGateway().freeSlot( // return the idle slot to the TaskManager
allocationID,
cause,
rpcTimeout);
FutureUtils.whenCompleteAsyncIfNotDone(
freeSlotFuture,
componentMainThreadExecutor,
(Acknowledge ignored, Throwable throwable) -> {
if (throwable != null) {
// The slot status will be synced to task manager in next heartbeat.
log.debug("Releasing slot [{}] of registered TaskExecutor {} failed. Discarding slot.",
allocationID, expiredSlot.getTaskManagerId(), throwable);
}
});
}
}
scheduleRunAsync(this::checkIdleSlot, idleSlotTimeout);
}
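Stripped of RPC and scheduling, the idle check compares each available slot's timestamp with the current time and releases the slots that exceed the timeout. The sketch below shows only that comparison; IdleSlotSketch is a hypothetical name, timestamps are passed in explicitly instead of read from a clock, and releasing is modelled as returning the released ids.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the idle-slot check; not Flink's SlotPoolImpl. */
final class IdleSlotSketch {
    /** Available slots: allocation id -> time at which the slot became available. */
    private final Map<String, Long> availableSince = new LinkedHashMap<>();
    private final long idleTimeoutMillis;

    IdleSlotSketch(long idleTimeoutMillis) {
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    void makeAvailable(String allocationId, long nowMillis) {
        availableSince.put(allocationId, nowMillis);
    }

    /** Removes every slot that has been idle for longer than the timeout and returns their ids. */
    List<String> releaseIdle(long nowMillis) {
        List<String> released = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> iterator = availableSince.entrySet().iterator();
        while (iterator.hasNext()) {
            Map.Entry<String, Long> entry = iterator.next();
            if (nowMillis - entry.getValue() > idleTimeoutMillis) {
                released.add(entry.getKey());
                iterator.remove();
            }
        }
        return released;
    }

    int availableCount() {
        return availableSince.size();
    }
}
```

As in the listing above, the comparison is strict, so a slot that has been idle for exactly the timeout survives until the next check.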
Scheduler and SlotSharingManager
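Within the JobMaster, the SlotPool is mainly responsible for managing the physical slots (AllocatedSlots) assigned to the current JobMaster. The computing resources needed by each task, however, are scheduled and managed in terms of LogicalSlots: different tasks are given different LogicalSlots, but the physical AllocatedSlot on the TaskExecutor underneath them may be the same one. Most of this logic is encapsulated in SlotSharingManager and Scheduler. As mentioned earlier, building a tree of TaskSlots makes it possible to share resources within a SlotSharingGroup and to enforce the CoLocationGroup constraint; this is mainly the job of SlotSharingManager. Each SlotSharingGroup has its own SlotSharingManager.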
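Flink has two kinds of sharing groups:
- SlotSharingGroup: a non-mandatory sharing constraint. Slot sharing uses the JobVertexIDs in the group to look for a slot that can be shared; the only requirement is that the same JobVertexID must not appear twice in one shared slot. Among the slots that satisfy the resource requirements, those that do not yet contain the JobVertexID are candidates, and one of them is picked according to the slot selection strategy. If no slot qualifies, a new one is requested.
- CoLocationGroup: also called the co-location constraint group, with a mandatory slot-sharing restriction. CoLocationGroup is used for iterations, where the tasks of an iteration must share a slot of the same TaskManager. A CoLocationGroup can be seen as a special case of a SlotSharingGroup.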
The allocation of exclusive and shared slots is illustrated below:
SlotSharingManager
Its main fields are shown below. Besides the associated SlotSharingGroupId, the most important ones are the three maps that manage TaskSlots:
class SlotSharingManager {
private final SlotSharingGroupId slotSharingGroupId;
/** Actions to release allocated slots after a complete multi task slot hierarchy has been released. */
private final AllocatedSlotActions allocatedSlotActions;
/** Owner of the slots to which to return them when they are released from the outside. */
private final SlotOwner slotOwner;
private final Map<SlotRequestId, TaskSlot> allTaskSlots; // all TaskSlots: root, inner and leaf nodes
/** Root nodes which have not been completed because the allocated slot is still pending. */
private final Map<SlotRequestId, MultiTaskSlot> unresolvedRootSlots; // root MultiTaskSlots whose underlying physical slot has not been allocated yet
/** Root nodes which have been completed (the underlying allocated slot has been assigned). */
// root MultiTaskSlots whose underlying physical slot has been allocated, organized as a two-level map
// so that they can be looked up by the location of the TaskManager that hosts the physical slot
private final Map<TaskManagerLocation, Map<AllocationID, MultiTaskSlot>> resolvedRootSlots;
}
To build a new TaskSlot tree, createRootSlot is called to create the root node:
class SlotSharingManager {
MultiTaskSlot createRootSlot(
SlotRequestId slotRequestId,
CompletableFuture<? extends SlotContext> slotContextFuture,
SlotRequestId allocatedSlotRequestId) {
final MultiTaskSlot rootMultiTaskSlot = new MultiTaskSlot(
slotRequestId,
slotContextFuture,
allocatedSlotRequestId);
LOG.debug("Create multi task slot [{}] in slot [{}].", slotRequestId, allocatedSlotRequestId);
allTaskSlots.put(slotRequestId, rootMultiTaskSlot);
unresolvedRootSlots.put(slotRequestId, rootMultiTaskSlot); // the root starts out in unresolvedRootSlots
// add the root node to the set of resolved root nodes once the SlotContext future has
// been completed and we know the slot's TaskManagerLocation
slotContextFuture.whenComplete(
(SlotContext slotContext, Throwable throwable) -> {
if (slotContext != null) {
// once the physical slot has been allocated, move the root from unresolvedRootSlots to resolvedRootSlots
final MultiTaskSlot resolvedRootNode = unresolvedRootSlots.remove(slotRequestId);
if (resolvedRootNode != null) {
final AllocationID allocationId = slotContext.getAllocationId();
LOG.trace("Fulfill multi task slot [{}] with slot [{}].", slotRequestId, allocationId);
final Map<AllocationID, MultiTaskSlot> innerMap = resolvedRootSlots.computeIfAbsent(
slotContext.getTaskManagerLocation(),
taskManagerLocation -> new HashMap<>(4));
MultiTaskSlot previousValue = innerMap.put(allocationId, resolvedRootNode);
Preconditions.checkState(previousValue == null);
}
} else {
rootMultiTaskSlot.release(throwable);
}
});
return rootMultiTaskSlot;
}
}
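The move from "unresolved" to "resolved" is driven entirely by the completion of the slot context future. The following sketch models that bookkeeping with plain maps and a CompletableFuture. RootSlotRegistrySketch and its SlotContext are hypothetical, ids are strings, and on failure the root is simply dropped, whereas the listing above releases the whole tree.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch of the unresolved -> resolved bookkeeping; not Flink's SlotSharingManager. */
final class RootSlotRegistrySketch {
    /** Simplified stand-in for SlotContext: where the physical slot lives and its allocation id. */
    static final class SlotContext {
        final String taskManager;
        final String allocationId;

        SlotContext(String taskManager, String allocationId) {
            this.taskManager = taskManager;
            this.allocationId = allocationId;
        }
    }

    /** Root slots whose physical slot is still pending, keyed by slot request id. */
    final Map<String, CompletableFuture<SlotContext>> unresolved = new HashMap<>();
    /** Root slots with a physical slot: TaskManager -> (allocation id -> slot request id). */
    final Map<String, Map<String, String>> resolved = new HashMap<>();

    void createRootSlot(String requestId, CompletableFuture<SlotContext> slotContextFuture) {
        unresolved.put(requestId, slotContextFuture);
        slotContextFuture.whenComplete((context, failure) -> {
            // In either outcome the root stops being unresolved.
            boolean wasPending = unresolved.remove(requestId) != null;
            if (wasPending && context != null) {
                resolved.computeIfAbsent(context.taskManager, tm -> new HashMap<>())
                        .put(context.allocationId, requestId);
            }
            // On failure this sketch just drops the root.
        });
    }
}
```

The sketch is not thread-safe on its own; it assumes, like the listings in this article, that all calls and callbacks run on a single main thread.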
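In addition, tasks in Flink can share resources as long as they belong to the same SlotSharingGroup, with one implicit condition: the tasks must be subtasks of different operators. For example, if a map operator has a parallelism of three, the subtasks map[1] and map[2] cannot be placed in the same physical slot. Both listResolvedRootSlotInfo and getUnresolvedRootSlot apply the !multiTaskSlot.contains(groupId) logic, which ensures that a tree of TaskSlots never contains two subtasks of the same operator.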
class SlotSharingManager {
// list the root MultiTaskSlots that already have a physical slot and do not contain the given groupId
public Collection<SlotSelectionStrategy.SlotInfoAndResources> listResolvedRootSlotInfo(@Nullable AbstractID groupId) {
return resolvedRootSlots
.values()
.stream()
.flatMap((Map<AllocationID, MultiTaskSlot> map) -> createValidMultiTaskSlotInfos(map, groupId))
.map((MultiTaskSlotInfo multiTaskSlotInfo) -> {
SlotInfo slotInfo = multiTaskSlotInfo.getSlotInfo();
return new SlotSelectionStrategy.SlotInfoAndResources(
slotInfo,
slotInfo.getResourceProfile().subtract(multiTaskSlotInfo.getReservedResources()),
multiTaskSlotInfo.getTaskExecutorUtilization());
}).collect(Collectors.toList());
}
// look up the MultiTaskSlot for a SlotInfo (TaskManagerLocation and AllocationID)
public MultiTaskSlot getResolvedRootSlot(@Nonnull SlotInfo slotInfo) {
Map<AllocationID, MultiTaskSlot> forLocationEntry = resolvedRootSlots.get(slotInfo.getTaskManagerLocation());
return forLocationEntry != null ? forLocationEntry.get(slotInfo.getAllocationId()) : null;
}
/**
* Gets an unresolved slot which does not yet contain the given groupId. An unresolved
* slot is a slot whose underlying allocated slot has not been allocated yet.
*
* @param groupId which the returned slot must not contain
* @return the unresolved slot or null if there was no root slot with free capacities
*/
// find an unresolved root MultiTaskSlot that does not contain the given groupId
MultiTaskSlot getUnresolvedRootSlot(AbstractID groupId) {
return unresolvedRootSlots.values().stream()
.filter(validMultiTaskSlotAndDoesNotContain(groupId))
.findFirst()
.orElse(null);
}
}
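The effect of the "does not contain groupId" filter is easiest to see with a tiny placement model. The sketch below ignores resource profiles, locality and the slot selection strategy and simply takes the first root slot that does not yet host the vertex; SharingGroupSketch is a hypothetical name and vertex ids are strings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of placing subtasks into shared root slots; not Flink's scheduler. */
final class SharingGroupSketch {
    /** Each root slot is modelled as the set of job vertex ids it already hosts. */
    private final List<Set<String>> rootSlots = new ArrayList<>();

    /** Places one subtask of the given vertex and returns the index of the chosen root slot. */
    int assign(String jobVertexId) {
        for (int i = 0; i < rootSlots.size(); i++) {
            Set<String> hosted = rootSlots.get(i);
            if (!hosted.contains(jobVertexId)) { // same filter as !multiTaskSlot.contains(groupId)
                hosted.add(jobVertexId);
                return i;
            }
        }
        Set<String> fresh = new HashSet<>(); // no root qualifies: a new slot is needed
        fresh.add(jobVertexId);
        rootSlots.add(fresh);
        return rootSlots.size() - 1;
    }

    int rootCount() {
        return rootSlots.size();
    }
}
```

Placing three subtasks of "map" forces three root slots, while subsequent subtasks of "source" reuse those slots. This is why, in this simplified model, the number of slots a sharing group needs equals the highest parallelism among its vertices.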
During task scheduling, requests for LogicalSlots are managed through the Scheduler interface. Scheduler extends the SlotProvider interface, and its only implementation is SchedulerImpl.
public interface SlotProvider {
// request a slot; returns a future of a LogicalSlot
CompletableFuture<LogicalSlot> allocateSlot(
SlotRequestId slotRequestId,
ScheduledUnit scheduledUnit,
SlotProfile slotProfile,
boolean allowQueuedScheduling,
Time allocationTimeout);
void cancelSlotRequest(
SlotRequestId slotRequestId,
@Nullable SlotSharingGroupId slotSharingGroupId,
Throwable cause);
}
public interface Scheduler extends SlotProvider, SlotOwner {
void start(@Nonnull ComponentMainThreadExecutor mainThreadExecutor);
boolean requiresPreviousExecutionGraphAllocations();
}
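SchedulerImpl mainly uses the SlotPool to request PhysicalSlots and relies on SlotSharingManager to implement slot sharing. The SlotSelectionStrategy interface picks, from a set of slots, the one that best fits the preferences of the request. The main fields and methods of SchedulerImpl are: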
class SchedulerImpl implements Scheduler {
private final SlotSelectionStrategy slotSelectionStrategy; // Strategy that selects the best slot for a given slot allocation request.
private final SlotPool slotPool; // The slot pool from which slots are allocated
private final Map<SlotSharingGroupId, SlotSharingManager> slotSharingManagers; // Managers for the different slot sharing groups
public CompletableFuture<LogicalSlot> allocateSlot(
SlotRequestId slotRequestId,
ScheduledUnit scheduledUnit,
SlotProfile slotProfile,
boolean allowQueuedScheduling,
Time allocationTimeout) {
log.debug("Received slot request [{}] for task: {}", slotRequestId, scheduledUnit.getTaskToExecute());
componentMainThreadExecutor.assertRunningInMainThread();
final CompletableFuture<LogicalSlot> allocationResultFuture = new CompletableFuture<>();
// if no SlotSharingGroupId is set, the task does not allow slot sharing and needs a slot of its own
CompletableFuture<LogicalSlot> allocationFuture = scheduledUnit.getSlotSharingGroupId() == null ?
allocateSingleSlot(slotRequestId, slotProfile, allowQueuedScheduling, allocationTimeout) : // no sharing: request a single slot
allocateSharedSlot(slotRequestId, scheduledUnit, slotProfile, allowQueuedScheduling, allocationTimeout); // slot sharing
allocationFuture.whenComplete((LogicalSlot slot, Throwable failure) -> {
if (failure != null) {
Optional<SharedSlotOversubscribedException> sharedSlotOverAllocatedException =
ExceptionUtils.findThrowable(failure, SharedSlotOversubscribedException.class);
if (sharedSlotOverAllocatedException.isPresent() &&
sharedSlotOverAllocatedException.get().canRetry()) {
// Retry the allocation
internalAllocateSlot(
allocationResultFuture,
slotRequestId,
scheduledUnit,
slotProfile,
allocationTimeout);
} else {
cancelSlotRequest(
slotRequestId,
scheduledUnit.getSlotSharingGroupId(),
failure);
allocationResultFuture.completeExceptionally(failure);
}
} else {
allocationResultFuture.complete(slot);
}
});
return allocationResultFuture;
}
@Override
public void cancelSlotRequest(
SlotRequestId slotRequestId,
@Nullable SlotSharingGroupId slotSharingGroupId,
Throwable cause) {
componentMainThreadExecutor.assertRunningInMainThread();
if (slotSharingGroupId != null) {
releaseSharedSlot(slotRequestId, slotSharingGroupId, cause);
} else {
slotPool.releaseSlot(slotRequestId, cause);
}
}
@Override
public void returnLogicalSlot(LogicalSlot logicalSlot) {
SlotRequestId slotRequestId = logicalSlot.getSlotRequestId();
SlotSharingGroupId slotSharingGroupId = logicalSlot.getSlotSharingGroupId();
FlinkException cause = new FlinkException("Slot is being returned to the SlotPool.");
cancelSlotRequest(slotRequestId, slotSharingGroupId, cause);
}
}
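The logic of these public methods is straightforward, so let us look at the internals next. If slot sharing is not allowed, a PhysicalSlot is taken directly from the SlotPool and a LogicalSlot is created on top of it: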
class SchedulerImpl {
private CompletableFuture<LogicalSlot> allocateSingleSlot(
SlotRequestId slotRequestId,
SlotProfile slotProfile,
@Nullable Time allocationTimeout) {
Optional<SlotAndLocality> slotAndLocality = tryAllocateFromAvailable(slotRequestId, slotProfile); // first try the AllocatedSlots already available in the SlotPool
if (slotAndLocality.isPresent()) { // one is available: create a SingleLogicalSlot and assign it as the payload of the AllocatedSlot
// already successful from available
try {
return CompletableFuture.completedFuture(
completeAllocationByAssigningPayload(slotRequestId, slotAndLocality.get()));
} catch (FlinkException e) {
return FutureUtils.completedExceptionally(e);
}
} else {
// we allocate by requesting a new slot
// none is available right now; if queued scheduling is allowed, have the SlotPool request a new slot from the RM
return requestNewAllocatedSlot(slotRequestId, slotProfile, allocationTimeout)
.thenApply((PhysicalSlot allocatedSlot) -> {
try {
return completeAllocationByAssigningPayload(slotRequestId, new SlotAndLocality(allocatedSlot, Locality.UNKNOWN));
} catch (FlinkException e) {
throw new CompletionException(e);
}
});
}
}
private Optional<SlotAndLocality> tryAllocateFromAvailable(
@Nonnull SlotRequestId slotRequestId,
@Nonnull SlotProfile slotProfile) {
Collection<SlotSelectionStrategy.SlotInfoAndResources> slotInfoList =
slotPool.getAvailableSlotsInformation()
.stream()
.map(SlotSelectionStrategy.SlotInfoAndResources::fromSingleSlot)
.collect(Collectors.toList());
Optional<SlotSelectionStrategy.SlotInfoAndLocality> selectedAvailableSlot =
slotSelectionStrategy.selectBestSlotForProfile(slotInfoList, slotProfile);
return selectedAvailableSlot.flatMap(slotInfoAndLocality -> {
Optional<PhysicalSlot> optionalAllocatedSlot = slotPool.allocateAvailableSlot(
slotRequestId,
slotInfoAndLocality.getSlotInfo().getAllocationId());
return optionalAllocatedSlot.map(
allocatedSlot -> new SlotAndLocality(allocatedSlot, slotInfoAndLocality.getLocality()));
});
}
}
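If slot sharing is required, the mandatory CoLocationGroup constraint also has to be taken into account. The core idea is to build the tree of TaskSlots and then create a leaf node in it; the leaf wraps the LogicalSlot that is needed. The code below and its comments show the flow in more detail: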
class SchedulerImpl {
private CompletableFuture<LogicalSlot> allocateSharedSlot(
SlotRequestId slotRequestId,
ScheduledUnit scheduledUnit,
SlotProfile slotProfile,
boolean allowQueuedScheduling,
Time allocationTimeout) {
// each SlotSharingGroup has its own SlotSharingManager
// allocate slot with slot sharing
final SlotSharingManager multiTaskSlotManager = slotSharingManagers.computeIfAbsent(
scheduledUnit.getSlotSharingGroupId(),
id -> new SlotSharingManager(
id,
slotPool,
this));
// allocate the MultiTaskSlot
final SlotSharingManager.MultiTaskSlotLocality multiTaskSlotLocality;
try {
if (scheduledUnit.getCoLocationConstraint() != null) {
// a co-location constraint is present
multiTaskSlotLocality = allocateCoLocatedMultiTaskSlot(
scheduledUnit.getCoLocationConstraint(),
multiTaskSlotManager,
slotProfile,
allowQueuedScheduling,
allocationTimeout);
} else {
multiTaskSlotLocality = allocateMultiTaskSlot(
scheduledUnit.getJobVertexId(),
multiTaskSlotManager,
slotProfile,
allowQueuedScheduling,
allocationTimeout);
}
} catch (NoResourceAvailableException noResourceException) {
return FutureUtils.completedExceptionally(noResourceException);
}
// sanity check
Preconditions.checkState(!multiTaskSlotLocality.getMultiTaskSlot().contains(scheduledUnit.getJobVertexId()));
// create a SingleTaskSlot leaf under the MultiTaskSlot and obtain the SingleLogicalSlot that can be assigned to the task
final SlotSharingManager.SingleTaskSlot leaf = multiTaskSlotLocality.getMultiTaskSlot().allocateSingleTaskSlot(
slotRequestId,
scheduledUnit.getJobVertexId(),
multiTaskSlotLocality.getLocality());
return leaf.getLogicalSlotFuture();
}
private SlotSharingManager.MultiTaskSlotLocality allocateCoLocatedMultiTaskSlot(
CoLocationConstraint coLocationConstraint,
SlotSharingManager multiTaskSlotManager,
SlotProfile slotProfile,
boolean allowQueuedScheduling,
Time allocationTimeout) throws NoResourceAvailableException {
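// The coLocationConstraint is bound to the SlotRequestId of the MultiTaskSlot assigned to it (a nested slot, not the root).
// This binding only exists after a MultiTaskSlot has been allocated for the constraint.
// With the SlotRequestId, the MultiTaskSlot can be looked up directly.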
final SlotRequestId coLocationSlotRequestId = coLocationConstraint.getSlotRequestId();
if (coLocationSlotRequestId != null) {
// we have a slot assigned --> try to retrieve it
final SlotSharingManager.TaskSlot taskSlot = multiTaskSlotManager.getTaskSlot(coLocationSlotRequestId);
if (taskSlot != null) {
Preconditions.checkState(taskSlot instanceof SlotSharingManager.MultiTaskSlot);
return SlotSharingManager.MultiTaskSlotLocality.of(((SlotSharingManager.MultiTaskSlot) taskSlot), Locality.LOCAL);
} else {
// the slot may have been cancelled in the mean time
coLocationConstraint.setSlotRequestId(null);
}
}
if (coLocationConstraint.isAssigned()) {
// refine the preferred locations of the slot profile
slotProfile = new SlotProfile(
slotProfile.getResourceProfile(),
Collections.singleton(coLocationConstraint.getLocation()),
slotProfile.getPreferredAllocations());
}
// Allocate a MultiTaskSlot for this coLocationConstraint: first find a suitable root MultiTaskSlot
// get a new multi task slot
SlotSharingManager.MultiTaskSlotLocality multiTaskSlotLocality = allocateMultiTaskSlot(
coLocationConstraint.getGroupId(),
multiTaskSlotManager,
slotProfile,
allowQueuedScheduling,
allocationTimeout);
// check whether we fulfill the co-location constraint
if (coLocationConstraint.isAssigned() && multiTaskSlotLocality.getLocality() != Locality.LOCAL) {
multiTaskSlotLocality.getMultiTaskSlot().release(
new FlinkException("Multi task slot is not local and, thus, does not fulfill the co-location constraint."));
throw new NoResourceAvailableException("Could not allocate a local multi task slot for the " +
"co location constraint " + coLocationConstraint + '.');
}
// Create a second-level MultiTaskSlot under the root MultiTaskSlot and assign it to this coLocationConstraint
final SlotRequestId slotRequestId = new SlotRequestId();
final SlotSharingManager.MultiTaskSlot coLocationSlot =
multiTaskSlotLocality.getMultiTaskSlot().allocateMultiTaskSlot(
slotRequestId,
coLocationConstraint.getGroupId());
// Bind the slotRequestId to the coLocationConstraint so the MultiTaskSlot can later be located directly by this id
// mark the requested slot as co-located slot for other co-located tasks
coLocationConstraint.setSlotRequestId(slotRequestId);
// lock the co-location constraint once we have obtained the allocated slot
coLocationSlot.getSlotContextFuture().whenComplete(
(SlotContext slotContext, Throwable throwable) -> {
if (throwable == null) {
// check whether we are still assigned to the co-location constraint
if (Objects.equals(coLocationConstraint.getSlotRequestId(), slotRequestId)) {
// Lock the location (TaskManager) for this coLocationConstraint
coLocationConstraint.lockLocation(slotContext.getTaskManagerLocation());
} else {
log.debug("Failed to lock colocation constraint {} because assigned slot " +
"request {} differs from fulfilled slot request {}.",
coLocationConstraint.getGroupId(),
coLocationConstraint.getSlotRequestId(),
slotRequestId);
}
} else {
log.debug("Failed to lock colocation constraint {} because the slot " +
"allocation for slot request {} failed.",
coLocationConstraint.getGroupId(),
coLocationConstraint.getSlotRequestId(),
throwable);
}
});
return SlotSharingManager.MultiTaskSlotLocality.of(coLocationSlot, multiTaskSlotLocality.getLocality());
}
private SlotSharingManager.MultiTaskSlotLocality allocateMultiTaskSlot(
AbstractID groupId,
SlotSharingManager slotSharingManager,
SlotProfile slotProfile,
boolean allowQueuedScheduling,
Time allocationTimeout) throws NoResourceAvailableException {
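// Find the root MultiTaskSlots that already have an AllocatedSlot and are suitable for this request.
// "Suitable" means the root MultiTaskSlot does not yet contain the current groupId, so that different
// tasks with the same groupId (i.e. the same JobVertex) are never placed into the same slot.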
Collection<SlotInfo> resolvedRootSlotsInfo = slotSharingManager.listResolvedRootSlotInfo(groupId);
// Let the slotSelectionStrategy pick the best match
SlotSelectionStrategy.SlotInfoAndLocality bestResolvedRootSlotWithLocality =
slotSelectionStrategy.selectBestSlotForProfile(resolvedRootSlotsInfo, slotProfile).orElse(null);
// Wrap the MultiTaskSlot together with its Locality
final SlotSharingManager.MultiTaskSlotLocality multiTaskSlotLocality = bestResolvedRootSlotWithLocality != null ?
new SlotSharingManager.MultiTaskSlotLocality(
slotSharingManager.getResolvedRootSlot(bestResolvedRootSlotWithLocality.getSlotInfo()),
bestResolvedRootSlotWithLocality.getLocality()) :
null;
// If the AllocatedSlot behind the MultiTaskSlot is on the same TaskManager as the preferred location, use this MultiTaskSlot
if (multiTaskSlotLocality != null && multiTaskSlotLocality.getLocality() == Locality.LOCAL) {
return multiTaskSlotLocality;
}
// Two cases reach this point:
// 1) multiTaskSlotLocality == null: no suitable root MultiTaskSlot was found
// 2) multiTaskSlotLocality != null && multiTaskSlotLocality.getLocality() != Locality.LOCAL: the locality preference is not met
// Try to pick one of the unused slots in the SlotPool
final SlotRequestId allocatedSlotRequestId = new SlotRequestId();
final SlotRequestId multiTaskSlotRequestId = new SlotRequestId();
Optional<SlotAndLocality> optionalPoolSlotAndLocality = tryAllocateFromAvailable(allocatedSlotRequestId, slotProfile);
if (optionalPoolSlotAndLocality.isPresent()) {
// An unused slot was found in the SlotPool
SlotAndLocality poolSlotAndLocality = optionalPoolSlotAndLocality.get();
// Use it if the unused AllocatedSlot meets the locality preference, or if the previous step found no usable MultiTaskSlot
if (poolSlotAndLocality.getLocality() == Locality.LOCAL || bestResolvedRootSlotWithLocality == null) {
// Create a root MultiTaskSlot on top of the newly allocated AllocatedSlot
final PhysicalSlot allocatedSlot = poolSlotAndLocality.getSlot();
final SlotSharingManager.MultiTaskSlot multiTaskSlot = slotSharingManager.createRootSlot(
multiTaskSlotRequestId,
CompletableFuture.completedFuture(poolSlotAndLocality.getSlot()),
allocatedSlotRequestId);
// Assign the newly created root MultiTaskSlot as the payload of the AllocatedSlot
if (allocatedSlot.tryAssignPayload(multiTaskSlot)) {
return SlotSharingManager.MultiTaskSlotLocality.of(multiTaskSlot, poolSlotAndLocality.getLocality());
} else {
multiTaskSlot.release(new FlinkException("Could not assign payload to allocated slot " +
allocatedSlot.getAllocationId() + '.'));
}
}
}
if (multiTaskSlotLocality != null) {
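// Neither candidate meets the locality preference, or the SlotPool has no usable slot left:
// fall back to the existing (non-local) root MultiTaskSlot and give back any slot taken from the pool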
// prefer slot sharing group slots over unused slots
if (optionalPoolSlotAndLocality.isPresent()) {
slotPool.releaseSlot(
allocatedSlotRequestId,
new FlinkException("Locality constraint is not better fulfilled by allocated slot."));
}
return multiTaskSlotLocality;
}
// Reaching this point means 1) the slotSharingManager has no suitable root MultiTaskSlot and 2) the slotPool has no available slot
if (allowQueuedScheduling) {
// First check whether the slotSharingManager still has a root MultiTaskSlot whose slot allocation has not completed yet
// there is no slot immediately available --> check first for uncompleted slots at the slot sharing group
SlotSharingManager.MultiTaskSlot multiTaskSlot = slotSharingManager.getUnresolvedRootSlot(groupId);
if (multiTaskSlot == null) {
// If there is none, the slotPool has to request a new slot from the ResourceManager
// it seems as if we have to request a new slot from the resource manager, this is always the last resort!!!
final CompletableFuture<PhysicalSlot> slotAllocationFuture = slotPool.requestNewAllocatedSlot(
allocatedSlotRequestId,
slotProfile.getResourceProfile(),
allocationTimeout);
// After the request, the flow is the same as above: create a root MultiTaskSlot and make it the payload of the AllocatedSlot once it arrives
multiTaskSlot = slotSharingManager.createRootSlot(
multiTaskSlotRequestId,
slotAllocationFuture,
allocatedSlotRequestId);
slotAllocationFuture.whenComplete(
(PhysicalSlot allocatedSlot, Throwable throwable) -> {
final SlotSharingManager.TaskSlot taskSlot = slotSharingManager.getTaskSlot(multiTaskSlotRequestId);
if (taskSlot != null) {
// still valid
if (!(taskSlot instanceof SlotSharingManager.MultiTaskSlot) || throwable != null) {
taskSlot.release(throwable);
} else {
if (!allocatedSlot.tryAssignPayload(((SlotSharingManager.MultiTaskSlot) taskSlot))) {
taskSlot.release(new FlinkException("Could not assign payload to allocated slot " +
allocatedSlot.getAllocationId() + '.'));
}
}
} else {
slotPool.releaseSlot(
allocatedSlotRequestId,
new FlinkException("Could not find task slot with " + multiTaskSlotRequestId + '.'));
}
});
}
return SlotSharingManager.MultiTaskSlotLocality.of(multiTaskSlot, Locality.UNKNOWN);
}
throw new NoResourceAvailableException("Could not allocate a shared slot for " + groupId + '.');
}
}
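The priority order that allocateMultiTaskSlot follows can be easier to see once the futures and bookkeeping are stripped away. The following is only a minimal sketch of that decision order, not Flink code: the class SlotChoiceSketch, the enum Choice, and the method choose are hypothetical names introduced for illustration, and the sketch assumes that assigning the payload to a pool slot always succeeds (in the real method a failed tryAssignPayload releases the new root slot and falls through).

```java
import java.util.Optional;

/** Simplified model of the decision order; every name here is hypothetical, not Flink API. */
final class SlotChoiceSketch {

    enum Locality { LOCAL, NON_LOCAL, UNKNOWN }

    enum Choice {
        REUSE_SHARED_ROOT,        // use a resolved root MultiTaskSlot of the sharing group
        NEW_ROOT_FROM_POOL,       // build a new root on an unused slot from the SlotPool
        WAIT_FOR_UNRESOLVED_ROOT, // attach to a root whose slot request is still pending
        REQUEST_FROM_RM,          // last resort: ask the ResourceManager for a new slot
        FAIL                      // corresponds to NoResourceAvailableException
    }

    /**
     * @param bestResolvedRoot locality of the best resolved root slot, empty if none qualifies
     * @param poolSlot locality of an unused slot in the pool, empty if the pool has none
     * @param hasUnresolvedRoot whether a pending root slot without the groupId exists
     * @param allowQueuedScheduling whether the caller accepts a slot that is not ready yet
     */
    static Choice choose(
            Optional<Locality> bestResolvedRoot,
            Optional<Locality> poolSlot,
            boolean hasUnresolvedRoot,
            boolean allowQueuedScheduling) {

        // 1) A resolved root slot on the preferred TaskManager wins immediately.
        if (bestResolvedRoot.isPresent() && bestResolvedRoot.get() == Locality.LOCAL) {
            return Choice.REUSE_SHARED_ROOT;
        }
        // 2) An unused pool slot is taken if it is local, or if no resolved root exists at all.
        if (poolSlot.isPresent()
                && (poolSlot.get() == Locality.LOCAL || !bestResolvedRoot.isPresent())) {
            return Choice.NEW_ROOT_FROM_POOL;
        }
        // 3) Otherwise prefer the non-local shared root over an unused non-local slot.
        if (bestResolvedRoot.isPresent()) {
            return Choice.REUSE_SHARED_ROOT;
        }
        // 4) Nothing is available right now: queue if allowed, else give up.
        if (allowQueuedScheduling) {
            return hasUnresolvedRoot ? Choice.WAIT_FOR_UNRESOLVED_ROOT : Choice.REQUEST_FROM_RM;
        }
        return Choice.FAIL;
    }
}
```

For instance, calling choose with a NON_LOCAL resolved root and a NON_LOCAL pool slot yields REUSE_SHARED_ROOT, which mirrors the "prefer slot sharing group slots over unused slots" branch in the listing above, where the slot taken from the pool is released again.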