Preface
Flink on YARN has two main deployment modes:
1. Session mode
2. Per-job mode (a single job submitted directly to YARN)
This article focuses on the second mode and walks through the flow of submitting a job to YARN.
For details, see:
https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/deployment/yarn_setup.html
The launch command is:
./bin/flink run -m yarn-cluster -yn 2 ./examples/batch/WordCount.jar
The code I am reading is flink-1.8-SNAPSHOT, the latest version cloned from Git.
Entry point
Looking at the contents of bin/flink, the class that is ultimately executed is the one below; we start from its main method.
org.apache.flink.client.cli.CliFrontend
parseParameters dispatches each action to a different method. Since ./bin/flink was followed by run, the action here is run, so we enter the run method.
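The dispatch in parseParameters can be sketched as a simple switch on the first argument. This is a simplified, self-contained model, not Flink's actual code; the action names mirror real CLI actions, and the string return values here are just labels for illustration:

```java
// Hypothetical sketch of CliFrontend#parseParameters dispatch: the first
// CLI argument selects the action, and each action maps to a handler method.
public class CliDispatchSketch {
    public static String dispatch(String[] args) {
        if (args.length < 1) {
            return "usage";
        }
        switch (args[0]) {
            case "run":    return "run";            // -> CliFrontend#run
            case "list":   return "list";           // -> CliFrontend#list
            case "cancel": return "cancel";         // -> CliFrontend#cancel
            default:       return "unknown action";
        }
    }

    public static void main(String[] args) {
        // "./bin/flink run -m yarn-cluster ..." puts "run" first,
        // so the run path is taken
        System.out.println(dispatch(new String[] {"run", "-m", "yarn-cluster"}));
    }
}
```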
Running the program enters the run method:
final CustomCommandLine<?> customCommandLine = getActiveCustomCommandLine(commandLine);
Based on the command-line arguments, this returns FlinkYarnSessionCli, a subclass of CustomCommandLine.
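How the active command line is chosen can be sketched as follows. This is a hedged, self-contained model, not Flink's API: each registered command line is asked in order whether it is active, the first match wins, and a default acts as the always-active fallback (in Flink, FlinkYarnSessionCli matches "-m yarn-cluster" style arguments and DefaultCLI is the fallback):

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of CliFrontend#getActiveCustomCommandLine:
// iterate candidates, return the first whose isActive(...) is true.
public class ActiveCliSketch {
    interface CustomCommandLine {
        boolean isActive(List<String> args);
        String name();
    }

    static final CustomCommandLine YARN_CLI = new CustomCommandLine() {
        public boolean isActive(List<String> args) {
            int i = args.indexOf("-m");
            return i >= 0 && i + 1 < args.size()
                && "yarn-cluster".equals(args.get(i + 1));
        }
        public String name() { return "FlinkYarnSessionCli"; }
    };

    static final CustomCommandLine DEFAULT_CLI = new CustomCommandLine() {
        public boolean isActive(List<String> args) { return true; } // fallback
        public String name() { return "DefaultCLI"; }
    };

    static String activeCli(List<String> args) {
        for (CustomCommandLine cli : Arrays.asList(YARN_CLI, DEFAULT_CLI)) {
            if (cli.isActive(args)) {
                return cli.name();
            }
        }
        throw new IllegalStateException("no active command line");
    }

    public static void main(String[] args) {
        System.out.println(activeCli(Arrays.asList("run", "-m", "yarn-cluster")));
        System.out.println(activeCli(Arrays.asList("run")));
    }
}
```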
Entering runProgram:
private <T> void runProgram(
		CustomCommandLine<T> customCommandLine,
		CommandLine commandLine,
		RunOptions runOptions,
		PackagedProgram program) throws ProgramInvocationException, FlinkException {
	// customCommandLine is the FlinkYarnSessionCli instance from above; its
	// createClusterDescriptor method returns a YarnClusterDescriptor, an
	// implementation of ClusterDescriptor
	final ClusterDescriptor<T> clusterDescriptor = customCommandLine.createClusterDescriptor(commandLine);
	try {
		final T clusterId = customCommandLine.getClusterId(commandLine);
		final ClusterClient<T> client;
		...
		// deploy the cluster
		final ClusterSpecification clusterSpecification = customCommandLine.getClusterSpecification(commandLine);
		client = clusterDescriptor.deploySessionCluster(clusterSpecification);
Inside this method we obtain an AbstractYarnClusterDescriptor:
org.apache.flink.yarn.cli.FlinkYarnSessionCli#createDescriptor
constructs the AbstractYarnClusterDescriptor, sets its parameters, and returns the instance:
private AbstractYarnClusterDescriptor createDescriptor(
		Configuration configuration,
		YarnConfiguration yarnConfiguration,
		String configurationDirectory,
		CommandLine cmd) {
	AbstractYarnClusterDescriptor yarnClusterDescriptor = getClusterDescriptor(
		configuration,
		yarnConfiguration,
		configurationDirectory);
	...
	return yarnClusterDescriptor;
getClusterDescriptor constructs a YarnClient, which is used to communicate with the YARN cluster and handle YARN application operations:
private AbstractYarnClusterDescriptor getClusterDescriptor(
		Configuration configuration,
		YarnConfiguration yarnConfiguration,
		String configurationDirectory) {
	final YarnClient yarnClient = YarnClient.createYarnClient();
	yarnClient.init(yarnConfiguration);
	yarnClient.start();
	return new YarnClusterDescriptor(
		configuration,
		yarnConfiguration,
		configurationDirectory,
		yarnClient,
		false);
}
Starting the Flink cluster:
final ClusterSpecification clusterSpecification = customCommandLine.getClusterSpecification(commandLine);
client = clusterDescriptor.deploySessionCluster(clusterSpecification);
Tracing the code further, we end up in org.apache.flink.yarn.AbstractYarnClusterDescriptor#deployInternal.
This method requests resources from YARN and starts the ApplicationMaster/JobManager.
It does quite a lot: a series of checks and preparations, such as validating the configuration files and the YARN cluster's resources, before calling startAppMaster to start the ApplicationMaster/JobManager.
startAppMaster is roughly 400 lines long and performs extensive checking and initialization work, such as uploading the jar files, before finally submitting the application through the YARN client:
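The kind of resource sanity check performed before submission can be illustrated with a minimal sketch. The method name validate and all the numbers here are hypothetical; the real deployInternal compares the ClusterSpecification against, among other things, the YARN maximum container allocation:

```java
// Hypothetical sketch: reject a cluster specification whose container
// memory requests exceed what YARN can allocate.
public class ResourceCheckSketch {
    static void validate(int jmMemoryMb, int tmMemoryMb, int yarnMaxAllocationMb) {
        if (jmMemoryMb > yarnMaxAllocationMb) {
            throw new IllegalArgumentException(
                "JobManager memory " + jmMemoryMb
                + "MB exceeds YARN max " + yarnMaxAllocationMb + "MB");
        }
        if (tmMemoryMb > yarnMaxAllocationMb) {
            throw new IllegalArgumentException(
                "TaskManager memory " + tmMemoryMb
                + "MB exceeds YARN max " + yarnMaxAllocationMb + "MB");
        }
    }

    public static void main(String[] args) {
        validate(1024, 4096, 8192);      // fits, passes silently
        try {
            validate(1024, 16384, 8192); // TaskManager request too large
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```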
LOG.info("Submitting application master " + appId);
yarnClient.submitApplication(appContext);
Executing the user job
Back in CliFrontend's runProgram method: after the cluster has been started by the steps above, the user's job is actually executed:
executeProgram(program, client, userParallelism);
This in turn calls the run method
org.apache.flink.client.program.ClusterClient#run(org.apache.flink.client.program.PackagedProgram, int)
which finally executes the user program via reflection:
// invoke main method
prog.invokeInteractiveModeForExecution();
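The reflection mechanism behind invokeInteractiveModeForExecution can be shown with a self-contained sketch. UserJob here is a hypothetical stand-in for the user's entry class; in Flink, PackagedProgram resolves the real entry class name from the -c option or the jar's Main-Class manifest attribute:

```java
import java.lang.reflect.Method;

// Minimal sketch of invoking a program's main(String[]) via reflection,
// the same mechanism PackagedProgram uses for the user's jar.
public class ReflectiveMainSketch {
    // stands in for the user's entry class inside the jar
    public static class UserJob {
        public static void main(String[] args) {
            System.out.println("user job running with " + args.length + " args");
        }
    }

    public static void main(String[] args) throws Exception {
        // load the entry class by name, as would be done for a jar's manifest entry
        Class<?> entryClass = Class.forName(UserJob.class.getName());
        Method mainMethod = entryClass.getMethod("main", String[].class);
        // cast to Object so the String[] is passed as one argument,
        // not expanded as varargs
        mainMethod.invoke(null, (Object) new String[] {"--input", "file:///tmp/in"});
    }
}
```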
The cluster receives the user job
The user's job is only truly submitted when execute is finally called.
// set up the execution environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
First a StreamExecutionEnvironment is constructed; the StreamExecutionEnvironment class has the following subclasses.
(Figure: class hierarchy showing the subclasses of StreamExecutionEnvironment; original image unavailable.)
Taking RemoteStreamEnvironment as an example, we enter its execute method:
@Override
public JobExecutionResult execute(String jobName) throws ProgramInvocationException {
	StreamGraph streamGraph = getStreamGraph();
	streamGraph.setJobName(jobName);
	transformations.clear();
	return executeRemotely(streamGraph, jarFiles);
}
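The pattern in execute above (accumulate transformations as the pipeline is built, snapshot them into a graph, clear the buffer, then submit) can be modeled with a toy class. This is a hypothetical sketch, not Flink API; strings stand in for transformations and the graph:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the environment's execute() contract: transformations
// collected during pipeline construction are turned into a graph exactly
// once, and the buffer is cleared so the environment can be reused.
public class ExecuteSketch {
    private final List<String> transformations = new ArrayList<>();

    void addTransformation(String t) {
        transformations.add(t);
    }

    String execute(String jobName) {
        // snapshot the accumulated transformations into a "graph"
        String graph = jobName + ": " + String.join(" -> ", transformations);
        // clear, mirroring transformations.clear() in the real execute()
        transformations.clear();
        return graph; // the real code hands a StreamGraph to executeRemotely(...)
    }

    public static void main(String[] args) {
        ExecuteSketch env = new ExecuteSketch();
        env.addTransformation("source");
        env.addTransformation("map");
        env.addTransformation("sink");
        System.out.println(env.execute("WordCount"));
    }
}
```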
First a StreamGraph is built from the job information.
Continuing to trace the code, we enter ClusterClient's run method, which generates the JobGraph:
public JobSubmissionResult run(FlinkPlan compiledPlan,
		List<URL> libraries, List<URL> classpaths, ClassLoader classLoader, SavepointRestoreSettings savepointSettings)
		throws ProgramInvocationException {
	JobGraph job = getJobGraph(flinkConfig, compiledPlan, libraries, classpaths, savepointSettings);
	return submitJob(job, classLoader);
}
We then enter the ClusterClient subclass method
org.apache.flink.yarn.YarnClusterClient#submitJob, which checks which of the two execution modes is in use (detached or not) and calls a different method for each:
@Override
public JobSubmissionResult submitJob(JobGraph jobGraph, ClassLoader classLoader) throws ProgramInvocationException {
	if (isDetached()) {
		if (newlyCreatedCluster) {
			stopAfterJob(jobGraph.getJobID());
		}
		return super.runDetached(jobGraph, classLoader);
	} else {
		return super.run(jobGraph, classLoader);
	}
}
Finally, the job is sent to the JobManager server through the Akka client.
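The detached/attached branching in submitJob above can be reduced to a small self-contained sketch; the returned strings are illustrative labels for the two code paths, not real return values:

```java
// Hypothetical sketch of the submitJob decision: detached submission
// returns immediately (optionally tearing the cluster down after the job),
// while attached submission blocks until the job finishes.
public class SubmitModeSketch {
    static String submit(boolean detached, boolean newlyCreatedCluster) {
        if (detached) {
            // mirrors stopAfterJob(...) for clusters created just for this job
            String note = newlyCreatedCluster ? "stop-after-job, " : "";
            return note + "runDetached: submit and return immediately";
        }
        return "run: submit and block until the job finishes";
    }

    public static void main(String[] args) {
        System.out.println(submit(true, true));
        System.out.println(submit(false, false));
    }
}
```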