Flink1.7.2 Dataset 并行计算源码分析

Flink1.7.2 Dataset 并行计算源码分析

概述

  • 了解Flink处理流程(用户程序 -> JobGrapth -> ExecutionGraph -> JobVertex -> ExecutionVertex -> 并行度 -> Task(DataSourceTask,BatchTask,DataSinkTask)
  • 了解ExecutionVetex的构建,Task的构建,执行,任务之间的调用关系

原理分析

  • 程序会转成JobGrapth提交,JobGraph最终转为ExecutionGraph进行处理

  • ExecutionGraph会拆分成ExecutionJobVertex执行,按(DataSourceTask,BatchTask,DataSinkTask) 进行拆分

0 = jobVertex = {InputFormatVertex@7675} “CHAIN DataSource (at com.opensourceteams.module.bigdata.flink.example.dataset.worldcount.WordCountRun . m a i n ( W o r d C o u n t R u n . s c a l a : 19 ) ( o r g . a p a c h e . f l i n k . a p i . j a v a . i o . T e x t I n p ) − > F l a t M a p ( F l a t M a p a t c o m . o p e n s o u r c e t e a m s . m o d u l e . b i g d a t a . f l i n k . e x a m p l e . d a t a s e t . w o r l d c o u n t . W o r d C o u n t R u n .main(WordCountRun.scala:19) (org.apache.flink.api.java.io.TextInp) -> FlatMap (FlatMap at com.opensourceteams.module.bigdata.flink.example.dataset.worldcount.WordCountRun .main(WordCountRun.scala:19)(org.apache.flink.api.java.io.TextInp)>FlatMap(FlatMapatcom.opensourceteams.module.bigdata.flink.example.dataset.worldcount.WordCountRun.main(WordCountRun.scala:23)) -> Map (Map at com.opensourceteams.module.bigdata.flink.example.dataset.worldcount.WordCountRun$.main(WordCountRun.scala:23)) -> Combine (SUM(1)) (org.apache.flink.runtime.operators.DataSourceTask)”

1 = jobVertex = {JobVertex@7695} “Reduce (SUM(1)) (org.apache.flink.runtime.operators.BatchTask)”

2 = jobVertex = {OutputFormatVertex@6665} “DataSink (collect()) (org.apache.flink.runtime.operators.DataSinkTask)”
```

  • ExecutionJobVertex 执行流程CREATED -> DEPLOYING ,转成对应的Task(CREATED -->DEPLOYING --> RUNNING)

  • 默认作业调度模式为:LAZY_FROM_SOURCES,只启动Source任务,下游任务是当上游任务开始给他发送数据时才开始

    • 刚开始,只有DataSourceTask对应的ExecutionJobVertex的 jobVertex.inputs 为空(元素个数0个),所以只对DataSourceTask进行调度,部署,任务运行
    • 随着DataSourceTask开始处理,就会产生中间数据,这时候通过输出数据,按key进行分区,分到对应的BatchTask分区数据,这个时候BatchTask就开始调度,部署,任务运行
    • 随着BatchTask开始处理,就会产生中间数据,这时候通过输出数据,按key进行分区,分到对应的DataSinkTask分区数据,这个时候DataSinkTask就开始调度,部署,任务运行
    • 由于后面的任务依赖前边的任务,就不会一开始就运行所有的任务,串行到,只有该任务有上游的数据发送过来,该任务才会启动,运行,换句话说,就是下游的任务是不启动的,只有上游的任务发送数据过来时,才开始启动,运行,这样节省了计算资源
  • 有几个并行度,ExecutionJobVertex 会转成对应的几个ExecutionVertex,ExecutionVertex 是会转化成Task来运行,ExecutionVertex中并行度通过subTaskIndex来区分,第一个subTaskIndex=0 ,第二个subTaskIndex = 1

输入数据

c a a
b c a

程序

  • WordCount.scala进行单词统计
package com.opensourceteams.module.bigdata.flink.example.dataset.worldcount

import com.opensourceteams.module.bigdata.flink.common.ConfigurationUtil
import org.apache.flink.api.scala.ExecutionEnvironment


/**
  * 批处理,DataSet WordCount分析
  */
object WordCountRun {


  def main(args: Array[String]): Unit = {

    //调试设置超时问题
    val env : ExecutionEnvironment= ExecutionEnvironment.createLocalEnvironment(ConfigurationUtil.getConfiguration(true))
    env.setParallelism(2)

    val dataSet = env.readTextFile("file:/opt/n_001_workspaces/bigdata/flink/flink-maven-scala-2/src/main/resources/data/line.txt")


    import org.apache.flink.streaming.api.scala._
    val result = dataSet.flatMap(x => x.split(" ")).map((_,1)).groupBy(0).sum(1)



    result.print()




  }

}

源码分析

JobMaster

JobMaster

  • new JobMaster()

  • 把JobGraph 转换为ExecutionGrapth

    this.executionGraph = createAndRestoreExecutionGraph(jobManagerJobMetricGroup);
    
public JobMaster(
			RpcService rpcService,
			JobMasterConfiguration jobMasterConfiguration,
			ResourceID resourceId,
			JobGraph jobGraph,
			HighAvailabilityServices highAvailabilityService,
			SlotPoolFactory slotPoolFactory,
			JobManagerSharedServices jobManagerSharedServices,
			HeartbeatServices heartbeatServices,
			BlobServer blobServer,
			JobManagerJobMetricGroupFactory jobMetricGroupFactory,
			OnCompletionActions jobCompletionActions,
			FatalErrorHandler fatalErrorHandler,
			ClassLoader userCodeLoader) throws Exception {

		super(rpcService, AkkaRpcServiceUtils.createRandomName(JOB_MANAGER_NAME));

		final JobMasterGateway selfGateway = getSelfGateway(JobMasterGateway.class);

		this.jobMasterConfiguration = checkNotNull(jobMasterConfiguration);
		this.resourceId = checkNotNull(resourceId);
		this.jobGraph = checkNotNull(jobGraph);
		this.rpcTimeout = jobMasterConfiguration.getRpcTimeout();
		this.highAvailabilityServices = checkNotNull(highAvailabilityService);
		this.blobServer = checkNotNull(blobServer);
		this.scheduledExecutorService = jobManagerSharedServices.getScheduledExecutorService();
		this.jobCompletionActions = checkNotNull(jobCompletionActions);
		this.fatalErrorHandler = checkNotNull(fatalErrorHandler);
		this.userCodeLoader = checkNotNull(userCodeLoader);
		this.jobMetricGroupFactory = checkNotNull(jobMetricGroupFactory);

		this.taskManagerHeartbeatManager = heartbeatServices.createHeartbeatManagerSender(
			resourceId,
			new TaskManagerHeartbeatListener(selfGateway),
			rpcService.getScheduledExecutor(),
			log);

		this.resourceManagerHeartbeatManager = heartbeatServices.createHeartbeatManager(
				resourceId,
				new ResourceManagerHeartbeatListener(),
				rpcService.getScheduledExecutor(),
				log);

		final String jobName = jobGraph.getName();
		final JobID jid = jobGraph.getJobID();

		log.info("Initializing job {} ({}).", jobName, jid);

		final RestartStrategies.RestartStrategyConfiguration restartStrategyConfiguration =
				jobGraph.getSerializedExecutionConfig()
						.deserializeValue(userCodeLoader)
						.getRestartStrategy();

		this.restartStrategy = RestartStrategyResolving.resolve(restartStrategyConfiguration,
			jobManagerSharedServices.getRestartStrategyFactory(),
			jobGraph.isCheckpointingEnabled());

		log.info("Using restart strategy {} for {} ({}).", this.restartStrategy, jobName, jid);

		resourceManagerLeaderRetriever = highAvailabilityServices.getResourceManagerLeaderRetriever();

		this.slotPool = checkNotNull(slotPoolFactory).createSlotPool(jobGraph.getJobID());

		this.slotPoolGateway = slotPool.getSelfGateway(SlotPoolGateway.class);

		this.registeredTaskManagers = new HashMap<>(4);

		this.backPressureStatsTracker = checkNotNull(jobManagerSharedServices.getBackPressureStatsTracker());
		this.lastInternalSavepoint = null;

		this.jobManagerJobMetricGroup = jobMetricGroupFactory.create(jobGraph);
		this.executionGraph = createAndRestoreExecutionGraph(jobManagerJobMetricGroup);
		this.jobStatusListener = null;

		this.resourceManagerConnection = null;
		this.establishedResourceManagerConnection = null;
	}

ExecutionGraph

ExecutionGraph.scheduleForExecution()

  • 负责Execution的调度,也就是负责把ExecutionGrapth转成ExecutionJobVertex,ExecutionJobVertex转成ExecutionVertex,再转成任务,这是真正的开始逻辑的地方
  • 更新当前Job的状态,即更新ExecutionGraph的状态,从CREATED更新到RUNNING
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值