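1.Overview
Declarative resource management works as follows: the JobMaster first declares the resources it needs, the ResourceManager allocates based on what is actually available, and the JobMaster can adjust its parallelism to match the resources it actually receives.
The old model: the JobMaster requested an already fixed set of resources, one slot at a time, and the whole request failed if any single slot request could not be satisfied.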
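The difference between the two models can be illustrated with a minimal sketch. This is not Flink code: the class AllocationModels and both method names are hypothetical, and slots are reduced to plain counts.

```java
import java.util.OptionalInt;

/** Hypothetical model of the two request styles; not Flink code. */
final class AllocationModels {

    /** Old model: every requested slot must be granted, otherwise the whole request fails. */
    static OptionalInt allOrNothing(int requestedSlots, int freeSlots) {
        return requestedSlots <= freeSlots
                ? OptionalInt.of(requestedSlots)
                : OptionalInt.empty();
    }

    /**
     * Declarative model: grant what is available and report what is still missing.
     *
     * @return a two-element array {granted, missing}
     */
    static int[] declarative(int declaredSlots, int freeSlots) {
        int granted = Math.min(declaredSlots, freeSlots);
        return new int[] {granted, declaredSlots - granted};
    }

    public static void main(String[] args) {
        // Four slots wanted, three free: the old model fails outright.
        System.out.println("old model satisfied: " + allOrNothing(4, 3).isPresent());
        // The declarative model hands out three slots and records one as outstanding.
        int[] result = declarative(4, 3);
        System.out.println("granted=" + result[0] + ", missing=" + result[1]);
    }
}
```

In the declarative case the caller is free to run with the three granted slots, which is what lets a scheduler lower the parallelism instead of failing.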
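2.Key roles
ResourceManager: Flink's resource management role, responsible for overall management. Each deployment mode (YARN, Kubernetes) has its own ResourceManager implementation.
SlotManager: a member of the ResourceManager, responsible for managing slots.
JobMaster: the management role for a job, one per job.
SlotPool: a member of the JobMaster, responsible for accepting offered slots and matching them against the job's requirements.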
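3.Flows
3.1.TaskManager startup flow
The overall flow is as follows.
3.1.1.TaskManager startup and registration
When the TaskManager starts, it registers a listener for the ResourceManager leader so that it can connect to the ResourceManager.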
public void onStart() throws Exception {
try {
startTaskExecutorServices();
private void startTaskExecutorServices() throws Exception {
try {
// start by connecting to the ResourceManager
resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener());
The listener then initiates the connection to the ResourceManager.
private void notifyOfNewResourceManagerLeader(
String newLeaderAddress, ResourceManagerId newResourceManagerId) {
resourceManagerAddress =
createResourceManagerAddress(newLeaderAddress, newResourceManagerId);
reconnectToResourceManager(
new FlinkException(
String.format(
"ResourceManager leader changed to new address %s",
resourceManagerAddress)));
}
This eventually reaches establishResourceManagerConnection in the TaskExecutor class, which starts registering with the ResourceManager through the ResourceManager gateway.
final CompletableFuture<Acknowledge> slotReportResponseFuture =
resourceManagerGateway.sendSlotReport(
getResourceID(),
taskExecutorRegistrationId,
taskSlotTable.createSlotReport(getResourceID()),
taskManagerConfiguration.getRpcTimeout());
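Note the third argument: it is the TaskManager's slot status (the slot report), and it is the main thing submitted when registering with the ResourceManager. Note also that, judging from the code, the number of reported slots can exceed the configured number (taskmanager.numberOfTaskSlots).
As shown below, there are two for loops. The first is driven by the configuration and adds as many slots as configured. The second iterates over allocatedSlots and checks each slot index: if the index lies beyond the range covered by numberSlots, the slot is outside the configured range and is added as an extra entry.
allocatedSlots is modified when slots are allocated to a job; see the allocation flow.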
for (int i = 0; i < numberSlots; i++) {
for (TaskSlot<T> taskSlot : allocatedSlots.values()) {
if (isDynamicIndex(taskSlot.getIndex())) {
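To make the two loops concrete, here is a simplified sketch. It is not Flink's TaskSlotTable: SlotReportSketch, the string-based report entries, and the rule used in isDynamicIndex (an index at or beyond numberSlots counts as dynamic) are assumptions made for illustration, based only on the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical, simplified stand-in for building a slot report; not Flink's TaskSlotTable. */
final class SlotReportSketch {

    /** Assumption for this sketch: an index at or beyond numberSlots marks a dynamic slot. */
    static boolean isDynamicIndex(int index, int numberSlots) {
        return index >= numberSlots;
    }

    /**
     * @param numberSlots configured slot count
     * @param allocatedSlots slot index mapped to allocation id, for currently allocated slots
     * @return one entry per reported slot, "index:allocationId" or "index:free"
     */
    static List<String> createSlotReport(int numberSlots, Map<Integer, String> allocatedSlots) {
        List<String> report = new ArrayList<>();
        // First loop: one entry per configured slot, whether allocated or not.
        for (int i = 0; i < numberSlots; i++) {
            String allocationId = allocatedSlots.get(i);
            report.add(i + ":" + (allocationId == null ? "free" : allocationId));
        }
        // Second loop: add allocated slots whose index lies outside the configured range.
        // TreeMap is used only to make the output order deterministic.
        for (Map.Entry<Integer, String> slot : new TreeMap<>(allocatedSlots).entrySet()) {
            if (isDynamicIndex(slot.getKey(), numberSlots)) {
                report.add(slot.getKey() + ":" + slot.getValue());
            }
        }
        return report;
    }
}
```

With numberSlots set to 2 and allocated slots at indices 1 and 2, the report contains three entries, one more than the configured count.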
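3.1.2.ResourceManager handling
The ResourceManager receives sendSlotReport and calls the slotManager to register the TaskManager. The slotManager has different implementations depending on the resource manager; here we look at what the declarative one does.
The TaskManager and its slots are registered separately, each with some bookkeeping.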
if (slotManager.registerTaskManager(
workerTypeWorkerRegistration,
slotReport,
workerTypeWorkerRegistration.getTotalResourceProfile(),
workerTypeWorkerRegistration.getDefaultSlotResourceProfile())) {
onWorkerRegistered(workerTypeWorkerRegistration.getWorker());
}
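Registration starts with some basic checks. An important one is whether the cluster would exceed its slot limit; the limit comes from the configuration option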
slotmanager.number-of-slots.max
if (isMaxSlotNumExceededAfterRegistration(initialSlotReport)) {
private boolean isMaxSlotNumExceededAfterAdding(int numNewSlot) {
return getNumberRegisteredSlots() + getNumberPendingTaskManagerSlots() + numNewSlot
> maxSlotNum;
}
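A worked example of this check, as a minimal sketch: MaxSlotCheckSketch and exceedsMaxSlots are hypothetical names, and the long arithmetic is an addition of this sketch to avoid integer overflow, not something taken from the excerpt above.

```java
/** Hypothetical restatement of the slot limit check with plain numbers; not Flink code. */
final class MaxSlotCheckSketch {

    /**
     * Returns true if registering newSlots more slots would push the cluster past maxSlotNum.
     * The sum is computed as a long so that very large limits cannot overflow.
     */
    static boolean exceedsMaxSlots(
            int registeredSlots, int pendingSlots, int newSlots, int maxSlotNum) {
        return (long) registeredSlots + pendingSlots + newSlots > maxSlotNum;
    }

    public static void main(String[] args) {
        // Limit of 10 slots: 6 registered + 2 pending + 4 new = 12, so the registration is refused.
        System.out.println(exceedsMaxSlots(6, 2, 4, 10));
        // 6 registered + 0 pending + 4 new = 10, which is exactly at the limit and still allowed.
        System.out.println(exceedsMaxSlots(6, 0, 4, 10));
    }
}
```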
After that, checkResourceRequirements is called to perform the allocation. It first looks for a slot that meets the requirements.
final Optional<TaskManagerSlotInformation> reservedSlot =
slotMatchingStrategy.findMatchingSlot(
requiredResource, freeSlots, this::getNumberRegisteredSlotsOf);
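findMatchingSlot has two implementations: one picks the first slot that meets the requirements, the other goes by utilization and picks a matching slot starting from the least utilized TaskManager. The choice is controlled by the configuration option cluster.evenly-spread-out-slots.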
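The two strategies can be sketched as follows. This is an illustration only: SlotMatchingSketch, FreeSlot, and the single cpuCores dimension are hypothetical simplifications, and utilization is assumed here to mean the fraction of a TaskManager's registered slots that are not free.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

/** Hypothetical sketch of the two slot matching strategies; not Flink's implementation. */
final class SlotMatchingSketch {

    /** Simplified free slot: owning TaskManager, slot id, and available CPU cores. */
    static final class FreeSlot {
        final String taskManagerId;
        final String slotId;
        final double cpuCores;

        FreeSlot(String taskManagerId, String slotId, double cpuCores) {
            this.taskManagerId = taskManagerId;
            this.slotId = slotId;
            this.cpuCores = cpuCores;
        }
    }

    /** Strategy 1: take the first free slot that satisfies the requirement. */
    static Optional<FreeSlot> firstMatching(List<FreeSlot> freeSlots, double requiredCores) {
        return freeSlots.stream().filter(slot -> slot.cpuCores >= requiredCores).findFirst();
    }

    /** Strategy 2: among matching slots, prefer the one on the least utilized TaskManager. */
    static Optional<FreeSlot> leastUtilized(
            List<FreeSlot> freeSlots, double requiredCores, Map<String, Integer> registeredSlots) {
        // Count the free slots per TaskManager once, up front.
        Map<String, Long> freePerTaskManager =
                freeSlots.stream()
                        .collect(Collectors.groupingBy(s -> s.taskManagerId, Collectors.counting()));
        return freeSlots.stream()
                .filter(slot -> slot.cpuCores >= requiredCores)
                .min(Comparator.comparingDouble(
                        (FreeSlot slot) ->
                                utilization(
                                        freePerTaskManager.get(slot.taskManagerId),
                                        registeredSlots.get(slot.taskManagerId))));
    }

    /** Assumed definition: share of registered slots that are currently in use. */
    private static double utilization(long freeSlots, int registeredSlots) {
        return 1.0 - (double) freeSlots / registeredSlots;
    }
}
```

Given one free slot on a TaskManager with four registered slots and three free slots on another with four registered, firstMatching returns whichever comes first in the list, while leastUtilized picks a slot on the second TaskManager, which spreads load more evenly.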
Next comes allocateSlot: based on the matched slot's information, it looks up the owning TaskManager and sends it a request.
CompletableFuture<Acknowledge> requestFuture =
gateway.requestSlot(
slotId,
jobId,
allocationId,
resourceProfile,
targetAddress,
resourceManagerId,
taskManagerRequestTimeout);
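Open question: what exactly is an allocationId, and how does it relate to a slotId?
3.1.3.Slot allocation on the TaskManager
When allocating a slot, the TaskManager first looks it up by slotId and checks things such as whether it is already allocated. It then creates a TaskSlot object, records it in several bookkeeping collections, and registers a timeout for the slot.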
// register a timeout for this slot since it's in state allocated
timerService.registerTimeout(allocationId, slotTimeout.getSize(), slotTimeout.getUnit());
Next it creates a variable of type Job that holds information about the job, including address information.
job =
jobTable.getOrCreateJob(
jobId, () -> registerNewJobAndCreateServices(jobId, targetAddress));
Finally, it calls jobMasterGateway to offer the slots to the JobMaster.
CompletableFuture<Collection<SlotOffer>> acceptedSlotsFuture =
jobMasterGateway.offerSlots(
getResourceID(),
reservedSlots,
taskManagerConfiguration.getRpcTimeout());
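Once the offer returns, the result is handled through Java 8's CompletableFuture: whenCompleteAsync runs handleAcceptedSlotOffers on the executor returned by getMainThreadExecutor(). That executor is the TaskExecutor RPC endpoint's main thread, not the main function of the user job, so this step processes the accepted offers rather than launching the job.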
acceptedSlotsFuture.whenCompleteAsync(
handleAcceptedSlotOffers(
jobId, jobMasterGateway, jobMasterId, reservedSlots, slotOfferId),
getMainThreadExecutor());
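The role of the executor argument to whenCompleteAsync can be seen in a small standalone sketch. Nothing here is Flink code: MainThreadCallbackSketch and the thread name endpoint-main-thread are hypothetical, and a single-thread executor stands in for an endpoint's main thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical demo of whenCompleteAsync with an explicit executor; not Flink code. */
final class MainThreadCallbackSketch {

    /** Completes a future on the calling thread and reports which thread ran the callback. */
    static String callbackThreadName() {
        // A single-thread executor stands in for the endpoint's main thread.
        ExecutorService mainThread =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "endpoint-main-thread"));
        try {
            CompletableFuture<String> acceptedSlots = new CompletableFuture<>();
            CompletableFuture<String> ranOn = new CompletableFuture<>();
            // The callback is submitted to mainThread, regardless of who completes the future.
            acceptedSlots.whenCompleteAsync(
                    (offers, failure) -> ranOn.complete(Thread.currentThread().getName()),
                    mainThread);
            acceptedSlots.complete("slot-0");
            return ranOn.join();
        } finally {
            mainThread.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(callbackThreadName());
    }
}
```

Routing callbacks through one thread like this keeps state changes serialized, which is the usual reason for handing an explicit executor to whenCompleteAsync.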
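3.1.4.JobMaster handling
On the JobMaster side the work is done mainly by the SlotPool and eventually reaches offerSlots in DefaultDeclarativeSlotPool. The slot pool likewise adds the slots to several bookkeeping lists and finally calls notifyNewSlotsAreAvailable to signal that the slot allocation has changed.
What the listener notifies is in fact the scheduler. For the Adaptive scheduler, see the articles on the Adaptive scheduler (the listener fires, which in turn triggers a call to newResourcesAvailable).
The other branch goes through the DeclarativeSlotPoolBridge class, which is the path taken by the regular scheduler.
DeclarativeSlotPoolBridge schedules two slot timeout checks: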
componentMainThreadExecutor.schedule(
this::checkIdleSlotTimeout,
idleSlotTimeout.toMilliseconds(),
TimeUnit.MILLISECONDS);
componentMainThreadExecutor.schedule(
this::checkBatchSlotTimeout,
batchSlotTimeout.toMilliseconds(),
TimeUnit.MILLISECONDS);
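3.2.Job startup flow
The overall flow is as follows. Once processing reaches the ResourceManager side, checkResourceRequirements is still called.
3.2.1.JobMaster handling
JobMaster extends RpcEndpoint, so it starts through the RpcEndpoint startup mechanism, which calls onStart. During startup it sets up a listener for the ResourceManager leader.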
resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener());
Once the listener is established, the connection procedure starts. In establishResourceManagerConnection, the resource requirements are declared.
public void connectToResourceManager(ResourceManagerGateway resourceManagerGateway) {
assertHasBeenStarted();
resourceRequirementServiceConnectionManager.connect(
resourceRequirements ->
resourceManagerGateway.declareRequiredResources(
jobMasterId, resourceRequirements, rpcTimeout));
declareResourceRequirements(declarativeSlotPool.getResourceRequirements());
}
The actual call is a scheduled, retrying request, set up by declareResourceRequirements. Ultimately, triggerResourceRequirementsSubmission in DefaultDeclareResourceRequirementServiceConnectionManager contains the following call
FutureUtils.retryWithDelay(
() -> sendResourceRequirements(resourceRequirementsToSend),
new ExponentialBackoffRetryStrategy(
Integer.MAX_VALUE, sleepOnError, maxSleepOnError),
throwable -> !(throwable instanceof CancellationException),
scheduledExecutor);
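This call uses a scheduled executor to keep retrying after failures. Integer.MAX_VALUE is the retry count, so it effectively keeps trying until it succeeds. sendResourceRequirements is the method actually invoked, and it ends up calling declareRequiredResources.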
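The retry pattern can be sketched without Flink's FutureUtils. The code below is a minimal illustration using only java.util.concurrent; BackoffRetrySketch, retryWithBackoff, and attemptsUntilSuccess are hypothetical names, and the doubling delay capped at a maximum is an assumption about what an exponential backoff strategy does.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical retry-with-exponential-backoff helper; not Flink's FutureUtils. */
final class BackoffRetrySketch {

    static <T> CompletableFuture<T> retryWithBackoff(
            Supplier<CompletableFuture<T>> operation,
            int retries,
            long delayMillis,
            long maxDelayMillis,
            ScheduledExecutorService scheduler) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(operation, retries, delayMillis, maxDelayMillis, scheduler, result);
        return result;
    }

    private static <T> void attempt(
            Supplier<CompletableFuture<T>> operation,
            int retriesLeft,
            long delayMillis,
            long maxDelayMillis,
            ScheduledExecutorService scheduler,
            CompletableFuture<T> result) {
        if (result.isDone()) {
            return; // the caller cancelled or completed the result in the meantime
        }
        operation.get().whenComplete((value, failure) -> {
            if (failure == null) {
                result.complete(value);
            } else if (failure instanceof CancellationException || retriesLeft <= 0) {
                // Cancellation is never retried; other failures stop once retries run out.
                result.completeExceptionally(failure);
            } else {
                // Wait delayMillis, then retry with the delay doubled, capped at maxDelayMillis.
                long nextDelay = Math.min(delayMillis * 2, maxDelayMillis);
                scheduler.schedule(
                        () -> attempt(operation, retriesLeft - 1, nextDelay,
                                maxDelayMillis, scheduler, result),
                        delayMillis,
                        TimeUnit.MILLISECONDS);
            }
        });
    }

    /** Demo: an operation that fails a fixed number of times, then succeeds. */
    static int attemptsUntilSuccess(int failuresBeforeSuccess) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger attempts = new AtomicInteger();
        try {
            retryWithBackoff(() -> {
                CompletableFuture<String> response = new CompletableFuture<>();
                if (attempts.incrementAndGet() <= failuresBeforeSuccess) {
                    response.completeExceptionally(new IllegalStateException("not ready"));
                } else {
                    response.complete("acknowledged");
                }
                return response;
            }, Integer.MAX_VALUE, 1, 8, scheduler).join();
            return attempts.get();
        } finally {
            scheduler.shutdownNow();
        }
    }
}
```

With a retry count of Integer.MAX_VALUE the helper behaves like the call described above: it stops only on success or cancellation.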
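3.2.2.ResourceManager handling
The JobMaster makes the call through resourceManagerGateway, which hands processing over to the ResourceManager. The ResourceManager finally calls the slotManager's processResourceRequirements to declare the resources.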
slotManager.processResourceRequirements(resourceRequirements);
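At the end of processResourceRequirements, checkResourceRequirements is called. As described above, this step matches the resource requirements against the available slots and allocates them.
Internally, checkResourceRequirements calls internalTryAllocateSlots, which performs two main operations: 1. findMatchingSlot; 2. allocateSlot. See the previous section for details.
4.How the JobMaster starts
In the flow above, each slot is allocated individually, the TaskManager then offers the slot to the JobMaster, and execution begins once that call succeeds. But that is per task; where is the entry point for running the whole job?
As described above, the TaskManager offers slots to the JobMaster through the gateway, and the JobMaster's call chain ends with notifyNewSlotsAreAvailable announcing the new slots. What happens next depends on the scheduler.
For the Adaptive scheduler, the call is as follows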
private void newResourcesAvailable(Collection<? extends PhysicalSlot> physicalSlots) {
state.tryRun(
ResourceConsumer.class,
ResourceConsumer::notifyNewResourcesAvailable,
"newResourcesAvailable");
}
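For a new job, notifyNewResourcesAvailable uses the WaitingForResources implementation, which calls checkDesiredOrSufficientResourcesAvailable to check whether the resources are sufficient.
The regular scheduler also performs a resource check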
public void fulfillPendingRequest() {
Preconditions.checkState(
pendingRequest.fulfill(matchedSlot), "Pending requests must be fulfillable.");
}
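5.Miscellaneous
5.1.Characteristics of the declarative model
In older versions, the parallelism was fixed before contacting the ResourceManager and slots were then requested independently; if a request failed, the job failed. In the declarative model, the parallelism is not fixed before contacting the ResourceManager. Resources are requested, the JobMaster records what was successfully obtained, and the parallelism is finally determined from that amount (depending on the scheduler's strategy).
In older versions, the job failed outright if it could not obtain enough slots. fulfillPendingSlotRequestWithPendingTaskManagerSlot in SlotManagerImpl contains the following code, which throws an exception directly when not enough resources can be obtained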
OptionalConsumer.of(pendingTaskManagerSlotOptional)
.ifPresent(pendingTaskManagerSlot -> assignPendingTaskManagerSlot(pendingSlotRequest, pendingTaskManagerSlot))
.ifNotPresent(() -> {
// request can not be fulfilled by any free slot or pending slot that can be allocated,
// check whether it can be fulfilled by allocated slots
if (failUnfulfillableRequest && !isFulfillableByRegisteredOrPendingSlots(pendingSlotRequest.getResourceProfile())) {
throw new UnfulfillableSlotRequestException(pendingSlotRequest.getAllocationId(), pendingSlotRequest.getResourceProfile());
}
});
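In the declarative model, by contrast, DeclarativeSlotManager does not abort with an exception when the requirements are not fully met. It only records what is still missing and leaves it to later processing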
private ResourceCounter tryAllocateSlotsForJob(
JobID jobId, Collection<ResourceRequirement> missingResources) {
ResourceCounter outstandingRequirements = ResourceCounter.empty();
for (ResourceRequirement resourceRequirement : missingResources) {
int numMissingSlots =
internalTryAllocateSlots(
jobId, jobMasterTargetAddresses.get(jobId), resourceRequirement);
if (numMissingSlots > 0) {
outstandingRequirements =
outstandingRequirements.add(
resourceRequirement.getResourceProfile(), numMissingSlots);
}
}
return outstandingRequirements;
}
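5.2.Failure due to insufficient resources
With a non-adaptive scheduler, when DeclarativeSlotManager allocates resources and calls tryFulfillRequirementsWithPendingSlots, insufficient resources trigger a not-enough-resources notification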
if (!allocationResult.isSuccessfulAllocating()
&& sendNotEnoughResourceNotifications) {
LOG.warn(
"Could not fulfill resource requirements of job {}. Free slots: {}",
jobId,
slotTracker.getFreeSlots().size());
resourceActions.notifyNotEnoughResourcesAvailable(
jobId, resourceTracker.getAcquiredResources(jobId));
return pendingSlots;
}
The call chain ends in failPendingRequests in DeclarativeSlotPoolBridge, which fails the pending requests
if (pendingRequests.values().stream().anyMatch(predicate)) {
log.warn(
"Could not acquire the minimum required resources, failing slot requests. Acquired: {}. Current slot pool status: {}",
acquiredResources,
getSlotServiceStatus());
cancelPendingRequests(
predicate,
NoResourceAvailableException.withoutStackTrace(
"Could not acquire the minimum required resources."));
}