Flink Runtime: Client Submission of the Job Graph (Part 2)

Latest recommended article published on 2023-03-04 00:00:00

adr5970

Article tags: Big Data

Original link: http://www.cnblogs.com/ljbguanli/p/7268669.html

Analysis of the submitJob method

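The JobClientActor submits a job by sending a SubmitJob message to the JobManager actor. When the JobManager receives the message, it builds a JobInfo object that encapsulates the basic information about the job, and then passes both the jobGraph and the jobInfo to the submitJob method: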

case SubmitJob(jobGraph, listeningBehaviour) =>  
    val client = sender()  
    val jobInfo = new JobInfo(client, listeningBehaviour, System.currentTimeMillis(),    
        jobGraph.getSessionTimeout)  
    submitJob(jobGraph, jobInfo)

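We walk through the main logic of submitJob by following its key method calls. It first checks the jobGraph argument; if it is null, it replies immediately with a JobResultFailure message: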

if (jobGraph == null) {  
    jobInfo.client ! decorateMessage(JobResultFailure(    
        new SerializedThrowable(      
            new JobSubmissionException(null, "JobGraph must not be null.")    
        )  
    ))
}

Next, it registers the job's library files and classpaths with the library cache manager:

libraryCacheManager.registerJob(jobGraph.getJobID, jobGraph.getUserJarBlobKeys,  
                                jobGraph.getClasspaths)

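This step has to run successfully before anything else: only then can the uploaded libraries and JARs be reliably removed from the library cache manager if any exception occurs later. From this point on, the whole code section is wrapped in a try block. As soon as any exception is caught, the related JAR files are removed through the unregisterJob method of libraryCacheManager: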

catch {
    case t: Throwable =>
        libraryCacheManager.unregisterJob(jobId)
        //...
}

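Next, it obtains the user-code class loader (userCodeLoader) and the restart strategy to apply on failure (restartStrategy). If the job graph carries no restart strategy configuration, the JobManager's defaultRestartStrategy is used: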

val userCodeLoader = libraryCacheManager.getClassLoader(jobGraph.getJobID)
val restartStrategy = Option(jobGraph.getRestartStrategyConfiguration())  
    .map(RestartStrategyFactory.createRestartStrategy(_)) match {    
        case Some(strategy) => strategy    
        case None => defaultRestartStrategy  
}
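
The fallback above relies on `Option(x)` turning a null into `None`. A minimal sketch of the same pattern follows; the types `RestartStrategy`, `NoRestart`, `FixedDelay` and the helper `create` are hypothetical stand-ins for illustration, not Flink's real classes:

```scala
object RestartStrategySketch {
  // Hypothetical stand-ins for restart strategy types (not Flink's real API).
  sealed trait RestartStrategy
  case object NoRestart extends RestartStrategy
  final case class FixedDelay(attempts: Int) extends RestartStrategy

  // Hypothetical factory: turns a job-level setting into a strategy.
  def create(attempts: Int): RestartStrategy = FixedDelay(attempts)

  // `config` may be null when the job configures nothing;
  // Option(null) is None, so the default is chosen in that case.
  def choose(config: Integer, default: RestartStrategy): RestartStrategy =
    Option(config).map(c => create(c.intValue)).getOrElse(default)
}
```

Here `getOrElse` replaces the explicit `match` on `Some`/`None`; both forms behave the same way.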

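Next, it obtains the ExecutionGraph instance. It first looks the job up in the currentJobs cache. If an entry exists, that graph is returned directly; otherwise a new graph is created and added to the cache: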

executionGraph = currentJobs.get(jobGraph.getJobID) match {  
    case Some((graph, currentJobInfo)) =>    
        currentJobInfo.setLastActive()    
        graph  
    case None =>    
        val graph = new ExecutionGraph(      
            executionContext,      
            jobGraph.getJobID,      
            jobGraph.getName,      
            jobGraph.getJobConfiguration,      
            timeout,      
            restartStrategy,      
            jobGraph.getUserJarBlobKeys,      
            jobGraph.getClasspaths,      
            userCodeLoader)    
        currentJobs.put(jobGraph.getJobID, (graph, jobInfo))    
        graph
}
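
The look-up-or-create step can be tried in isolation with a small sketch. `Graph` and `Info` below are hypothetical, heavily simplified stand-ins for ExecutionGraph and JobInfo:

```scala
import scala.collection.mutable

object JobCacheSketch {
  // Hypothetical, simplified stand-ins for ExecutionGraph and JobInfo.
  final class Graph(val jobId: String)
  final class Info {
    var lastActive: Long = 0L
    def setLastActive(): Unit = lastActive = System.currentTimeMillis()
  }

  val currentJobs = mutable.Map.empty[String, (Graph, Info)]

  // Reuse the cached graph if the job is known, otherwise create and cache one.
  def graphFor(jobId: String, newInfo: Info): Graph =
    currentJobs.get(jobId) match {
      case Some((graph, existingInfo)) =>
        existingInfo.setLastActive() // refresh the entry that is already cached
        graph
      case None =>
        val graph = new Graph(jobId)
        currentJobs.put(jobId, (graph, newInfo))
        graph
    }
}
```

Note that on a cache hit the already stored info object is refreshed and the newly built one is not stored, which mirrors the excerpt above.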

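Once the executionGraph has been obtained, several of its properties are set: the schedule mode, whether queued scheduling is allowed, and the JSON representation of the plan.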

executionGraph.setScheduleMode(jobGraph.getScheduleMode())
executionGraph.setQueuedSchedulingAllowed(jobGraph.getAllowQueuedScheduling())
executionGraph.setJsonPlan(JsonPlanGenerator.generatePlan(jobGraph))

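Next, some properties of each JobVertex are initialized. A vertex whose parallelism is set to PARALLELISM_AUTO_MAX gets the total number of slots known to the scheduler: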

val numSlots = scheduler.getTotalNumberOfSlots()
for (vertex <- jobGraph.getVertices.asScala) {  
    val executableClass = vertex.getInvokableClassName 
    if (vertex.getParallelism() == ExecutionConfig.PARALLELISM_AUTO_MAX) {    
        vertex.setParallelism(numSlots)  
    }  
    vertex.initializeOnMaster(userCodeLoader)
}
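
The parallelism rule in the loop above amounts to replacing a sentinel value. A minimal sketch, where `AutoMax` is a hypothetical sentinel chosen for illustration (the actual value of ExecutionConfig.PARALLELISM_AUTO_MAX is not assumed here):

```scala
object ParallelismSketch {
  // Hypothetical sentinel meaning "use every available slot".
  val AutoMax: Int = Int.MaxValue

  // A vertex that asks for AutoMax gets all slots; any other request is kept.
  def resolve(requested: Int, totalSlots: Int): Int =
    if (requested == AutoMax) totalSlots else requested
}
```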

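It then gets the JobGraph's vertices sorted in topological order starting from the sources, and attaches that collection to the ExecutionGraph. The attach step does a great deal of work, which we will analyze later: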

val sortedTopology = jobGraph.getVerticesSortedTopologicallyFromSources()
executionGraph.attachJobGraph(sortedTopology)
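
To illustrate what "sorted topologically from the sources" means, here is a minimal sketch using Kahn's algorithm on vertex names. It is only an illustration of the ordering guarantee and is not claimed to be the algorithm Flink uses internally; `sortFromSources` and its parameters are hypothetical:

```scala
import scala.collection.mutable

object TopoSortSketch {
  // edges maps a producer to its consumers. The result lists every producer
  // before its consumers; a cycle makes the require call fail.
  def sortFromSources(vertices: Seq[String], edges: Map[String, Seq[String]]): List[String] = {
    val inDegree = mutable.Map(vertices.map(_ -> 0): _*)
    for (targets <- edges.values; t <- targets) inDegree(t) += 1

    // Sources are the vertices without inputs.
    val ready = mutable.Queue(vertices.filter(inDegree(_) == 0): _*)
    val sorted = mutable.ListBuffer.empty[String]
    while (ready.nonEmpty) {
      val v = ready.dequeue()
      sorted += v
      for (t <- edges.getOrElse(v, Nil)) {
        inDegree(t) -= 1
        if (inDegree(t) == 0) ready.enqueue(t) // all inputs of t are placed
      }
    }
    require(sorted.size == vertices.size, "job graph contains a cycle")
    sorted.toList
  }
}
```

For a chain source → map → sink, the result starts with the source and ends with the sink regardless of the order in which the vertices are passed in.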

Next, the snapshot settings and checkpoint configuration are written into the ExecutionGraph:

val snapshotSettings = jobGraph.getSnapshotSettings
if (snapshotSettings != null) {  
    val jobId = jobGraph.getJobID()  
    val idToVertex: JobVertexID => ExecutionJobVertex = id => {    
        val vertex = executionGraph.getJobVertex(id)      
        vertex  
    }  
    val triggerVertices: java.util.List[ExecutionJobVertex] =    
        snapshotSettings.getVerticesToTrigger().asScala.map(idToVertex).asJava  
    val ackVertices: java.util.List[ExecutionJobVertex] =    
        snapshotSettings.getVerticesToAcknowledge().asScala.map(idToVertex).asJava  
    val confirmVertices: java.util.List[ExecutionJobVertex] =    
        snapshotSettings.getVerticesToConfirm().asScala.map(idToVertex).asJava  
    val completedCheckpoints = checkpointRecoveryFactory    
        .createCompletedCheckpoints(jobId, userCodeLoader)  
    val checkpointIdCounter = checkpointRecoveryFactory.createCheckpointIDCounter(jobId)  
    executionGraph.enableSnapshotCheckpointing(    
        snapshotSettings.getCheckpointInterval,    
        snapshotSettings.getCheckpointTimeout,    
        snapshotSettings.getMinPauseBetweenCheckpoints,    
        snapshotSettings.getMaxConcurrentCheckpoints,    
        triggerVertices,    
        ackVertices,    
        confirmVertices,    
        context.system,    
        leaderSessionID.orNull,    
        checkpointIdCounter,    
        completedCheckpoints,    
        recoveryMode,    
        savepointStore)
}

The JobManager registers itself as a listener for job status changes:

executionGraph.registerJobStatusListener(new AkkaActorGateway(self, leaderSessionID.orNull))

If the client also needs to be notified of the execution result and of job status changes, listeners are registered for the client as well:

if (jobInfo.listeningBehaviour == ListeningBehaviour.EXECUTION_RESULT_AND_STATE_CHANGES) {    
    val gateway = new AkkaActorGateway(jobInfo.client, leaderSessionID.orNull)  
    executionGraph.registerExecutionListener(gateway)  
    executionGraph.registerJobStatusListener(gateway)
}

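All of the code above, starting from the registration of the job's JARs with the library cache manager, is wrapped in a try block. If an exception is thrown, control enters the catch block for exception handling: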

catch {  
    case t: Throwable =>    
        log.error(s"Failed to submit job $jobId ($jobName)", t)    
        libraryCacheManager.unregisterJob(jobId)    
        currentJobs.remove(jobId)    
        if (executionGraph != null) {      
            executionGraph.fail(t)    
        }    
        val rt: Throwable = if (t.isInstanceOf[JobExecutionException]) {      
            t    
        } else {      
            new JobExecutionException(jobId, s"Failed to submit job $jobId ($jobName)", t)    
        }    
        jobInfo.client ! decorateMessage(JobResultFailure(new SerializedThrowable(rt)))    
        return
}

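Exception handling first removes the libraries associated with the current job from the library cache, using the job ID. It then removes the job's (ExecutionGraph, JobInfo) tuple from the currentJobs map and calls the ExecutionGraph's fail method to force the graph into failure. Finally, the exception is reported to the client in a JobResultFailure message, wrapped in a JobExecutionException if it is not one already, and the method returns.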
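
The register, try, and undo-on-failure pattern can be sketched without any Flink classes. Everything below (`SubmissionError`, `registeredLibraries`, `submit`) is a hypothetical stand-in that only mirrors the shape of the logic described above:

```scala
import scala.collection.mutable

object SubmitCleanupSketch {
  // Hypothetical error type standing in for JobExecutionException.
  final class SubmissionError(msg: String, cause: Throwable)
    extends RuntimeException(msg, cause)

  val registeredLibraries = mutable.Set.empty[String]
  val currentJobs = mutable.Map.empty[String, String]

  // Returns Right(jobId) on success, or Left(error) after undoing the bookkeeping.
  def submit(jobId: String)(setup: => Unit): Either[Throwable, String] =
    try {
      registeredLibraries += jobId                  // first step inside the try
      currentJobs.put(jobId, "graph-for-" + jobId)
      setup                                         // any later step may throw
      Right(jobId)
    } catch {
      case t: Throwable =>
        registeredLibraries -= jobId                // undo the library registration
        currentJobs.remove(jobId)                   // drop the cached entry
        val reported = t match {
          case e: SubmissionError => e              // already the reported type
          case other => new SubmissionError(s"Failed to submit job $jobId", other)
        }
        Left(reported)
    }
}
```

Catching Throwable follows the excerpt above; in code that does not have to mirror it, scala.util.control.NonFatal is often the safer choice.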

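The code from here to the end of the method may block, so it is wrapped in a future block and executed asynchronously. It first checks whether the submission is a recovery; if so, state is restored from the latest checkpoint: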

if (isRecovery) {  
    executionGraph.restoreLatestCheckpointedState()
}

If this is not a recovery but the snapshot settings contain a savepoint path, the state is reset from that savepoint:

executionGraph.restoreSavepoint(savepointPath)

The current JobGraph is then written to the SubmittedJobGraphStore, which exists mainly for recovery purposes:

submittedJobGraphs.putJobGraph(new SubmittedJobGraph(jobGraph, jobInfo))

At this point a JobSubmitSuccess message can be sent back to the client:

jobInfo.client ! decorateMessage(JobSubmitSuccess(jobGraph.getJobID))

Next, scheduling of the job is triggered on the ExecutionGraph, which is the prerequisite for any task to be executed:

if (leaderElectionService.hasLeadership) {  
    executionGraph.scheduleForExecution(scheduler)
} else {  
    self ! decorateMessage(RemoveJob(jobId, removeJobFromStateBackend = false))  
}

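To prevent several JobManagers from scheduling the same job at the same time, the code first checks whether the current node is the leader. Only the leader schedules the job. Otherwise the JobManager sends itself a RemoveJob message, which is handled by a separate code path.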
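
The leadership check boils down to a two-way decision, sketched below. `Action`, `Schedule` and `nextAction` are hypothetical names for illustration, and this `RemoveJob` is a local case class shaped after the message used in the excerpt above:

```scala
object LeaderGateSketch {
  sealed trait Action
  case object Schedule extends Action
  final case class RemoveJob(jobId: String, removeJobFromStateBackend: Boolean) extends Action

  // Only the leader schedules; a non-leader removes the job locally and
  // leaves the state backend untouched.
  def nextAction(hasLeadership: Boolean, jobId: String): Action =
    if (hasLeadership) Schedule
    else RemoveJob(jobId, removeJobFromStateBackend = false)
}
```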

This completes the walkthrough of the submitJob method. Because it is the JobManager's main handler for a job submitted by the client, it contains a fair amount of logic.