In the previous section we saw that after the JobMaster starts, it converts the JobGraph into an ExecutionGraph, passes the checkpoint-related configuration along to the ExecutionGraph, and creates the CheckpointCoordinator. Let's pick up from there and continue the analysis.
1.start
After the JobMaster starts and leader election completes, its start method is called. start in turn calls startJobExecution, which begins execution of the job.
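For reference, here is a paraphrased sketch of that entry point as it looks in the Flink 1.12-era sources (the exact code may differ slightly in your version):

    // JobMaster (paraphrased sketch): start the RPC endpoint, then kick off
    // job execution in the main thread under the given fencing token
    public CompletableFuture<Acknowledge> start(final JobMasterId newJobMasterId) throws Exception {
        // make sure we receive RPC and async calls
        start();

        return callAsyncWithoutFencing(() -> startJobExecution(newJobMasterId), RpcUtils.INF_TIMEOUT);
    }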
2.startJobExecution
Inside startJobExecution, two of its own methods are then called:
startJobMasterServices: this is where the JobMaster's services are actually started
resetAndStartScheduler: resets and starts the scheduler
    private Acknowledge startJobExecution(JobMasterId newJobMasterId) throws Exception {
        validateRunsInMainThread();

        checkNotNull(newJobMasterId, "The new JobMasterId must not be null.");

        if (Objects.equals(getFencingToken(), newJobMasterId)) {
            log.info("Already started the job execution with JobMasterId {}.", newJobMasterId);
            return Acknowledge.get();
        }

        setNewFencingToken(newJobMasterId);

        /* TODO: this is where the JobMaster services are actually started */
        startJobMasterServices();

        log.info("Starting execution of job {} ({}) under job master id {}.", jobGraph.getName(), jobGraph.getJobID(), newJobMasterId);

        /* TODO: reset and start the scheduler */
        resetAndStartScheduler();

        return Acknowledge.get();
    }
3.startJobMasterServices
This method does the following:
1. Starts the heartbeat services towards the TaskManagers and the ResourceManager.
2. Starts the SlotPool. The SlotPool is the component on the JobMaster side that manages slot resources; it holds the slots owned by this job. When slots run short, the SlotPool requests more from the ResourceManager (this is Flink's own ResourceManager, not YARN's). When the ResourceManager receives a slot request from the SlotPool, it asks the TaskManagers for free slots.
3. Establishes the connection to the ResourceManager described in point 2.
    private void startJobMasterServices() throws Exception {
        /* TODO: start the heartbeat services: taskmanager, resourcemanager */
        startHeartbeatServices();

        // start the slot pool make sure the slot pool now accepts messages for this leader
        /* TODO: start the slot pool */
        slotPool.start(getFencingToken(), getAddress(), getMainThreadExecutor());

        // TODO: Remove once the ZooKeeperLeaderRetrieval returns the stored address upon start
        // try to reconnect to previously known leader
        reconnectToResourceManager(new FlinkException("Starting JobMaster component."));

        // job is ready to go, try to establish connection with resource manager
        //   - activate leader retrieval for the resource manager
        //   - on notification of the leader, the connection will be established and
        //     the slot pool will start requesting slots
        /* TODO: connect to the ResourceManager; the slot pool starts requesting slots */
        resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener());
    }
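The last line registers a ResourceManagerLeaderListener: once a ResourceManager leader is known, the JobMaster connects to it and the SlotPool can start requesting slots. A minimal sketch of that inner listener, paraphrased from the JobMaster class (details vary by version):

    // inner class of JobMaster (paraphrased sketch)
    private class ResourceManagerLeaderListener implements LeaderRetrievalListener {

        @Override
        public void notifyLeaderAddress(final String leaderAddress, final UUID leaderSessionID) {
            // hop back onto the main thread and (re)connect to the new ResourceManager leader
            runAsync(() -> notifyOfNewResourceManagerLeader(
                leaderAddress, ResourceManagerId.fromUuidOrNull(leaderSessionID)));
        }

        @Override
        public void handleError(final Exception exception) {
            handleJobMasterError(new Exception("Fatal error in the ResourceManager leader service", exception));
        }
    }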
4.resetAndStartScheduler
This method makes sure the job has a scheduler assigned (creating and assigning a new one if the job is being rescheduled) and then calls startScheduling to kick off scheduling:
    private void resetAndStartScheduler() throws Exception {
        validateRunsInMainThread();

        final CompletableFuture<Void> schedulerAssignedFuture;

        if (schedulerNG.requestJobStatus() == JobStatus.CREATED) {
            schedulerAssignedFuture = CompletableFuture.completedFuture(null);
            schedulerNG.setMainThreadExecutor(getMainThreadExecutor());
        } else {
            suspendAndClearSchedulerFields(new FlinkException("ExecutionGraph is being reset in order to be rescheduled."));
            final JobManagerJobMetricGroup newJobManagerJobMetricGroup = jobMetricGroupFactory.create(jobGraph);
            // create a new scheduler and assign it to this job
            final SchedulerNG newScheduler = createScheduler(executionDeploymentTracker, newJobManagerJobMetricGroup);

            schedulerAssignedFuture = schedulerNG.getTerminationFuture().handle(
                (ignored, throwable) -> {
                    newScheduler.setMainThreadExecutor(getMainThreadExecutor());
                    assignScheduler(newScheduler, newJobManagerJobMetricGroup);
                    return null;
                }
            );
        }

        // call startScheduling to schedule the job; the result is obtained asynchronously
        FutureUtils.assertNoException(schedulerAssignedFuture.thenRun(this::startScheduling));
    }
In the JobMaster's startScheduling, a job status listener is registered for the job, and then the scheduler's startScheduling is called.
The scheduler's startScheduling (in SchedulerBase) asserts that the current thread is the main thread, registers the job metrics, starts the operator coordinators, and then calls startSchedulingInternal to begin scheduling.
In startSchedulingInternal, prepareExecutionGraphForNgScheduling sets some properties on the ExecutionGraph, for example transitioning the job status from CREATED to RUNNING, and then schedulingStrategy.startScheduling is called. SchedulingStrategy has several implementations; here we follow PipelinedRegionSchedulingStrategy, which is the default scheduling strategy for streaming jobs in recent versions.
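For reference, DefaultScheduler#startSchedulingInternal is only a few lines; a paraphrased sketch from the Flink 1.12-era sources:

    // DefaultScheduler (paraphrased sketch)
    @Override
    protected void startSchedulingInternal() {
        log.info("Starting scheduling with scheduling strategy [{}]", schedulingStrategy.getClass().getName());
        // e.g. transition the job status from CREATED to RUNNING
        prepareExecutionGraphForNgScheduling();
        // delegate to the configured strategy, e.g. PipelinedRegionSchedulingStrategy
        schedulingStrategy.startScheduling();
    }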
5.startScheduling
The schedulingTopology here is the topology built from the ExecutionVertices. startScheduling first fetches all pipelined regions, then picks out the source regions (those with no upstream inputs) and passes them to maybeScheduleRegions for scheduling.
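A paraphrased sketch of PipelinedRegionSchedulingStrategy#startScheduling from the Flink 1.12-era sources (check your version for the exact code):

    // PipelinedRegionSchedulingStrategy (paraphrased sketch):
    // a source region is one that consumes no upstream result partitions
    @Override
    public void startScheduling() {
        final Set<SchedulingPipelinedRegion> sourceRegions =
            IterableUtils.toStream(schedulingTopology.getAllPipelinedRegions())
                .filter(region -> !region.getConsumedResults().iterator().hasNext())
                .collect(Collectors.toSet());
        maybeScheduleRegions(sourceRegions);
    }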
6.maybeScheduleRegions
The source regions are sorted in topological order, and then each region in the sorted set is passed to maybeScheduleRegion for scheduling:
    private void maybeScheduleRegions(final Set<SchedulingPipelinedRegion> regions) {
        final List<SchedulingPipelinedRegion> regionsSorted =
            SchedulingStrategyUtils.sortPipelinedRegionsInTopologicalOrder(schedulingTopology, regions);
        for (SchedulingPipelinedRegion region : regionsSorted) {
            maybeScheduleRegion(region);
        }
    }
7.maybeScheduleRegion
This is where slots are obtained in preparation for deploying the corresponding ExecutionVertices:
    private void maybeScheduleRegion(final SchedulingPipelinedRegion region) {
        if (!areRegionInputsAllConsumable(region)) {
            return;
        }

        checkState(areRegionVerticesAllInCreatedState(region), "BUG: trying to schedule a region which is not in CREATED state");

        final List<ExecutionVertexDeploymentOption> vertexDeploymentOptions =
            SchedulingStrategyUtils.createExecutionVertexDeploymentOptions(
                regionVerticesSorted.get(region),
                id -> deploymentOption);
        schedulerOperations.allocateSlotsAndDeploy(vertexDeploymentOptions);
    }
What follows is the code that actually allocates the resources. It is fairly involved, so I'll just list the call chain briefly here; interested readers can trace through it on their own:
    SchedulerOperations#allocateSlotsAndDeploy
      -> DefaultScheduler#allocateSlotsAndDeploy
      -> DefaultScheduler#waitForAllSlotsAndDeploy
      -> DefaultScheduler#deployAll
      -> DefaultScheduler#deployOrHandleError
      -> DefaultScheduler#deployTaskSafe
      -> DefaultExecutionVertexOperations#deploy
      -> ExecutionVertex#deploy
      -> Execution#deploy
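To give a feel for how the middle links hang together, here is a paraphrased sketch of DefaultScheduler#waitForAllSlotsAndDeploy and deployAll from the Flink 1.12-era sources (details vary by version). The final link, Execution#deploy, is worth reading in full and follows below:

    // DefaultScheduler (paraphrased sketch)
    private void waitForAllSlotsAndDeploy(final List<DeploymentHandle> deploymentHandles) {
        // once every vertex in this batch has been assigned a slot, deploy them all
        FutureUtils.assertNoException(
            assignAllResources(deploymentHandles).handle(deployAll(deploymentHandles)));
    }

    private BiFunction<Void, Throwable, Void> deployAll(final List<DeploymentHandle> deploymentHandles) {
        return (ignored, throwable) -> {
            propagateIfNonNull(throwable);
            for (final DeploymentHandle deploymentHandle : deploymentHandles) {
                final SlotExecutionVertexAssignment slotExecutionVertexAssignment =
                    deploymentHandle.getSlotExecutionVertexAssignment();
                final CompletableFuture<LogicalSlot> slotAssigned =
                    slotExecutionVertexAssignment.getLogicalSlotFuture();
                checkState(slotAssigned.isDone());

                // deploy each vertex on its assigned slot, or handle the failure
                FutureUtils.assertNoException(
                    slotAssigned.handle(deployOrHandleError(deploymentHandle)));
            }
            return null;
        };
    }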
    public void deploy() throws JobException {
        assertRunningInJobMasterMainThread();

        // the slot assigned earlier via tryAssignResource
        final LogicalSlot slot = assignedResource;

        checkNotNull(slot, "In order to deploy the execution we first have to assign a resource via tryAssignResource.");

        // Check if the TaskManager died in the meantime
        // This only speeds up the response to TaskManagers failing concurrently to deployments.
        // The more general check is the rpcTimeout of the deployment call
        if (!slot.isAlive()) {
            throw new JobException("Target slot (TaskManager) for deployment is no longer alive.");
        }

        // make sure exactly one deployment call happens from the correct state
        // note: the transition from CREATED to DEPLOYING is for testing purposes only
        ExecutionState previous = this.state;
        // transition the execution's state to DEPLOYING
        if (previous == SCHEDULED || previous == CREATED) {
            if (!transitionState(previous, DEPLOYING)) {
                // race condition, someone else beat us to the deploying call.
                // this should actually not happen and indicates a race somewhere else
                throw new IllegalStateException("Cannot deploy task: Concurrent deployment call race.");
            }
        } else {
            // vertex may have been cancelled, or it was already scheduled
            throw new IllegalStateException("The vertex must be in CREATED or SCHEDULED state to be deployed. Found state " + previous);
        }

        if (this != slot.getPayload()) {
            throw new IllegalStateException(
                String.format("The execution %s has not been assigned to the assigned slot.", this));
        }

        try {
            // race double check, did we fail/cancel and do we need to release the slot?
            if (this.state != DEPLOYING) {
                slot.releaseSlot(new FlinkException("Actual state of execution " + this + " (" + state + ") does not match expected state DEPLOYING."));
                return;
            }

            LOG.info("Deploying {} (attempt #{}) with attempt id {} to {} with allocation id {}", vertex.getTaskNameWithSubtaskIndex(),
                attemptNumber, vertex.getCurrentExecutionAttempt().getAttemptId(), getAssignedResourceLocation(), slot.getAllocationId());

            if (taskRestore != null) {
                checkState(taskRestore.getTaskStateSnapshot().getSubtaskStateMappings().stream().allMatch(entry ->
                        entry.getValue().getInputRescalingDescriptor().equals(InflightDataRescalingDescriptor.NO_RESCALE) &&
                        entry.getValue().getOutputRescalingDescriptor().equals(InflightDataRescalingDescriptor.NO_RESCALE)),
                    "Rescaling from unaligned checkpoint is not yet supported.");
            }

            // convert each IntermediateResultPartition into a ResultPartition
            // convert each ExecutionEdge into an InputChannelDeploymentDescriptor (turned into an InputGate at execution time)
            final TaskDeploymentDescriptor deployment = TaskDeploymentDescriptorFactory
                .fromExecutionVertex(vertex, attemptNumber)
                .createDeploymentDescriptor(
                    slot.getAllocationId(),
                    slot.getPhysicalSlotNumber(),
                    taskRestore,
                    producedPartitions.values());

            // null taskRestore to let it be GC'ed
            taskRestore = null;

            final TaskManagerGateway taskManagerGateway = slot.getTaskManagerGateway();
            final ComponentMainThreadExecutor jobMasterMainThreadExecutor =
                vertex.getExecutionGraph().getJobMasterMainThreadExecutor();

            getVertex().notifyPendingDeployment(this);
            // We run the submission in the future executor so that the serialization of large TDDs does not block
            // the main thread and sync back to the main thread once submission is completed.
            // submit the corresponding task to the TaskManager
            CompletableFuture.supplyAsync(() -> taskManagerGateway.submitTask(deployment, rpcTimeout), executor)
                .thenCompose(Function.identity())
                .whenCompleteAsync(
                    (ack, failure) -> {
                        if (failure == null) {
                            vertex.notifyCompletedDeployment(this);
                        } else {
                            if (failure instanceof TimeoutException) {
                                String taskname = vertex.getTaskNameWithSubtaskIndex() + " (" + attemptId + ')';

                                markFailed(new Exception(
                                    "Cannot deploy task " + taskname + " - TaskManager (" + getAssignedResourceLocation()
                                        + ") not responding after a rpcTimeout of " + rpcTimeout, failure));
                            } else {
                                markFailed(failure);
                            }
                        }
                    },
                    jobMasterMainThreadExecutor);
        } catch (Throwable t) {
            markFailed(t);

            if (isLegacyScheduling()) {
                ExceptionUtils.rethrow(t);
            }
        }
    }
At this point our job has been deployed. The Execution here is what we usually call the physical execution graph; the next step is submitting the corresponding tasks.
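The handoff to the TaskManager happens through the taskManagerGateway.submitTask call seen in the code above. As a rough sketch (paraphrased; the exact interface depends on the Flink version), the RPC looks like this:

    // TaskManagerGateway (paraphrased sketch): the RPC through which the JobMaster
    // ships a TaskDeploymentDescriptor over to a TaskManager for execution
    public interface TaskManagerGateway {

        // submit a task to the task manager; the returned future acknowledges receipt
        CompletableFuture<Acknowledge> submitTask(TaskDeploymentDescriptor tdd, Time timeout);

        // ... plus cancelTask, freeSlot, triggerCheckpoint, etc.
    }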