When a Spark application uses Spark SQL (including Hive), or needs to save task output to HDFS, the output commit coordinator OutputCommitCoordinator comes into play: it decides whether a task may commit its output to HDFS. Both the Driver and every Executor hold an OutputCommitCoordinator as a sub-component of their SparkEnv. On the Driver, an OutputCommitCoordinatorEndpoint is registered; the OutputCommitCoordinator on each Executor asks the Driver-side OutputCommitCoordinator, through the RpcEndpointRef of this OutputCommitCoordinatorEndpoint, whether output may be committed to HDFS.
The code in SparkEnv that creates the OutputCommitCoordinator is as follows:
//org.apache.spark.SparkEnv
val outputCommitCoordinator = mockOutputCommitCoordinator.getOrElse {
  new OutputCommitCoordinator(conf, isDriver)
}
val outputCommitCoordinatorRef = registerOrLookupEndpoint("OutputCommitCoordinator",
  new OutputCommitCoordinatorEndpoint(rpcEnv, outputCommitCoordinator))
outputCommitCoordinator.coordinatorRef = Some(outputCommitCoordinatorRef)
From the code above, OutputCommitCoordinator is created in the following steps (a sketch of the registerOrLookupEndpoint helper follows the list):
- 1) Create a new OutputCommitCoordinator instance.
- 2) If the current instance is the Driver, create an OutputCommitCoordinatorEndpoint and register it with the Dispatcher under the name OutputCommitCoordinator. If the current process is an Executor, look up a reference to the OutputCommitCoordinatorEndpoint from the Dispatcher of the remote Driver's NettyRpcEnv.
- 3) In either case, the resulting reference to the OutputCommitCoordinatorEndpoint is held by the coordinatorRef property of the OutputCommitCoordinator.
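The branching between Driver and Executor in step 2 is performed by SparkEnv's registerOrLookupEndpoint helper. For reference, in Spark 2.x its implementation looks roughly like the following sketch:
//org.apache.spark.SparkEnv (sketch)
def registerOrLookupEndpoint(
    name: String, endpointCreator: => RpcEndpoint): RpcEndpointRef = {
  if (isDriver) {
    logInfo("Registering " + name)
    rpcEnv.setupEndpoint(name, endpointCreator) // register with the local Dispatcher
  } else {
    RpcUtils.makeDriverRef(name, conf, rpcEnv) // look up the endpoint on the remote Driver
  }
}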
1 Implementation of OutputCommitCoordinatorEndpoint
The code of OutputCommitCoordinatorEndpoint is as follows:
//org.apache.spark.scheduler.OutputCommitCoordinator
private[spark] object OutputCommitCoordinator {

  private[spark] class OutputCommitCoordinatorEndpoint(
      override val rpcEnv: RpcEnv, outputCommitCoordinator: OutputCommitCoordinator)
    extends RpcEndpoint with Logging {

    override def receive: PartialFunction[Any, Unit] = {
      case StopCoordinator =>
        logInfo("OutputCommitCoordinator stopped!")
        stop()
    }

    override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
      case AskPermissionToCommitOutput(stage, partition, attemptNumber) =>
        context.reply(
          outputCommitCoordinator.handleAskPermissionToCommit(stage, partition, attemptNumber))
    }
  }
}
OutputCommitCoordinatorEndpoint receives two kinds of messages (their definitions are sketched after this list):
- StopCoordinator: stops the OutputCommitCoordinatorEndpoint.
- AskPermissionToCommitOutput: handled by OutputCommitCoordinator's handleAskPermissionToCommit method, which decides whether the asking client is allowed to commit its output to HDFS.
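For reference, both messages are simple definitions that live alongside OutputCommitCoordinator; in Spark 2.x they look roughly like this sketch:
//org.apache.spark.scheduler.OutputCommitCoordinator (message definitions, sketch)
private sealed trait OutputCommitCoordinationMessage extends Serializable
// Sent by OutputCommitCoordinator.stop() to shut the endpoint down
private case object StopCoordinator extends OutputCommitCoordinationMessage
// Sent by canCommit() to ask the Driver for permission to commit
private case class AskPermissionToCommitOutput(
    stage: Int, partition: Int, attemptNumber: Int)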
2 Implementation of OutputCommitCoordinator
OutputCommitCoordinator determines whether the tasks of a given Stage's partitions are allowed to commit their output to HDFS, and it coordinates the multiple attempts (TaskAttempt) of the same partition's task. OutputCommitCoordinator has the following properties (a sketch of their declarations follows the list):
- conf: the SparkConf.
- isDriver: whether the current node is the Driver.
- coordinatorRef: the NettyRpcEndpointRef pointing to the OutputCommitCoordinatorEndpoint.
- NO_AUTHORIZED_COMMITTER: a constant whose value is -1.
- authorizedCommittersByStage: caches, for each Stage, the task attempt authorized to commit each of its partitions.
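For reference, in Spark 2.x these members are declared roughly as follows (a sketch; the exact type aliases and visibility vary slightly between versions, and scala.collection.mutable is assumed to be imported at the top of the file):
//org.apache.spark.scheduler.OutputCommitCoordinator (property declarations, sketch)
private type StageId = Int
private type PartitionId = Int
private type TaskAttemptNumber = Int
// -1 marks a partition whose commit lock no attempt has taken yet
private val NO_AUTHORIZED_COMMITTER: TaskAttemptNumber = -1
// For each stage, an array indexed by partition that holds the authorized attempt number
private val authorizedCommittersByStage = mutable.Map[StageId, Array[TaskAttemptNumber]]()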
Having covered the properties of OutputCommitCoordinator, let us look at the methods it implements.
2.1 The handleAskPermissionToCommit method
This method decides whether a given task attempt is allowed to commit the data of the specified partition of the given Stage to HDFS:
//org.apache.spark.scheduler.OutputCommitCoordinator
private[scheduler] def handleAskPermissionToCommit(
    stage: StageId,
    partition: PartitionId,
    attemptNumber: TaskAttemptNumber): Boolean = synchronized {
  authorizedCommittersByStage.get(stage) match {
    case Some(authorizedCommitters) =>
      authorizedCommitters(partition) match {
        case NO_AUTHORIZED_COMMITTER =>
          logDebug(s"Authorizing attemptNumber=$attemptNumber to commit for stage=$stage, " +
            s"partition=$partition")
          authorizedCommitters(partition) = attemptNumber
          true
        case existingCommitter =>
          logDebug(s"Denying attemptNumber=$attemptNumber to commit for stage=$stage, " +
            s"partition=$partition; existingCommitter = $existingCommitter")
          false
      }
    case None =>
      logDebug(s"Stage $stage has completed, so not allowing attempt number $attemptNumber " +
        s"of partition $partition to commit")
      false
  }
}
From the code, the implementation proceeds as follows (a standalone simulation follows the list):
- 1) Look up, in the authorizedCommittersByStage cache, the TaskAttemptNumber (an Int) recorded for the specified partition of the given Stage.
- 2) If the TaskAttemptNumber obtained in step 1 equals NO_AUTHORIZED_COMMITTER, this is the first request to commit output for this partition of the Stage; following the first-committer-wins policy, the given attemptNumber is granted permission to commit the partition's output to HDFS. To let later task attempts know the partition has already been taken, the attemptNumber is stored at the partition's index in the TaskAttemptNumber array.
- 3) If the TaskAttemptNumber obtained in step 1 does not equal NO_AUTHORIZED_COMMITTER, some task attempt has already been authorized to commit this partition's output to HDFS; by the same first-committer-wins policy, the given attemptNumber is denied permission to commit.
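The first-committer-wins rule can be exercised in isolation with a minimal, self-contained simulation. The object below is a hypothetical stand-in that mirrors the logic above, not Spark's own code:
import scala.collection.mutable

object FirstCommitterWinsDemo {
  private val NO_AUTHORIZED_COMMITTER = -1
  private val authorizedCommittersByStage = mutable.Map[Int, Array[Int]]()

  // Mirror of stageStart: every partition starts with no authorized committer
  def stageStart(stage: Int, maxPartitionId: Int): Unit = {
    authorizedCommittersByStage(stage) = Array.fill(maxPartitionId + 1)(NO_AUTHORIZED_COMMITTER)
  }

  // Mirror of handleAskPermissionToCommit: the first attempt takes the commit lock
  def handleAskPermissionToCommit(stage: Int, partition: Int, attemptNumber: Int): Boolean =
    synchronized {
      authorizedCommittersByStage.get(stage) match {
        case Some(committers) if committers(partition) == NO_AUTHORIZED_COMMITTER =>
          committers(partition) = attemptNumber
          true
        case _ => false // stage already ended, or the partition already has a committer
      }
    }

  def main(args: Array[String]): Unit = {
    stageStart(stage = 0, maxPartitionId = 3)
    println(handleAskPermissionToCommit(0, 2, 0)) // true: first committer wins
    println(handleAskPermissionToCommit(0, 2, 1)) // false: partition 2 is already taken
    println(handleAskPermissionToCommit(9, 0, 0)) // false: unknown (completed) stage
  }
}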
2.2 isEmpty
This method checks whether authorizedCommittersByStage is empty:
//org.apache.spark.scheduler.OutputCommitCoordinator
def isEmpty: Boolean = {
  authorizedCommittersByStage.isEmpty
}
2.3 canCommit
This method sends an AskPermissionToCommitOutput message to the OutputCommitCoordinatorEndpoint and, based on the response, determines whether the caller may commit the output of the specified partition of the Stage to HDFS:
//org.apache.spark.scheduler.OutputCommitCoordinator
def canCommit(
    stage: StageId,
    partition: PartitionId,
    attemptNumber: TaskAttemptNumber): Boolean = {
  val msg = AskPermissionToCommitOutput(stage, partition, attemptNumber)
  coordinatorRef match {
    case Some(endpointRef) =>
      // Ask whether this attempt may commit the partition's output to HDFS
      endpointRef.askWithRetry[Boolean](msg)
    case None =>
      logError(
        "canCommit called after coordinator was stopped (is SparkEnv shutdown in progress)?")
      false
  }
}
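canCommit is the Executor-side entry point into this protocol. For example, SparkHadoopMapRedUtil.commitTask consults it before letting a Hadoop OutputCommitter commit a task's output; slightly abridged, the Spark 2.x call site looks roughly like the following sketch (jobId, splitId, ctx, committer, mrTaskContext, mrTaskAttemptID, and performCommit are locals of that method):
//org.apache.spark.mapred.SparkHadoopMapRedUtil (abridged sketch)
val outputCommitCoordinator = SparkEnv.get.outputCommitCoordinator
val canCommit = outputCommitCoordinator.canCommit(jobId, splitId, ctx.attemptNumber())
if (canCommit) {
  performCommit()
} else {
  val message =
    s"$mrTaskAttemptID: Not committed because the driver did not authorize commit"
  logInfo(message)
  // Abort the Hadoop-side task; the scheduler will see a TaskCommitDenied reason
  committer.abortTask(mrTaskContext)
  throw new CommitDeniedException(message, jobId, splitId, ctx.attemptNumber())
}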
2.4 stageStart
The stageStart method starts the commit coordination for a Stage. In essence it creates the Stage's TaskAttemptNumber array and fills every slot with NO_AUTHORIZED_COMMITTER:
//org.apache.spark.scheduler.OutputCommitCoordinator
private[scheduler] def stageStart(
    stage: StageId,
    maxPartitionId: Int): Unit = {
  val arr = new Array[TaskAttemptNumber](maxPartitionId + 1)
  java.util.Arrays.fill(arr, NO_AUTHORIZED_COMMITTER)
  synchronized {
    authorizedCommittersByStage(stage) = arr
  }
}
2.5 stageEnd
This method stops the commit coordination for the given Stage. In essence it removes the Stage and its TaskAttemptNumber array from authorizedCommittersByStage:
//org.apache.spark.scheduler.OutputCommitCoordinator
private[scheduler] def stageEnd(stage: StageId): Unit = synchronized {
  authorizedCommittersByStage.remove(stage)
}
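Both lifecycle hooks are driven by the DAGScheduler on the Driver: stageStart is called from submitMissingTasks when a stage is submitted, and stageEnd when the stage is marked as finished. In Spark 2.x the stageStart call site looks roughly like this sketch:
//org.apache.spark.scheduler.DAGScheduler (abridged sketch)
stage match {
  case s: ShuffleMapStage =>
    outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
  case s: ResultStage =>
    outputCommitCoordinator.stageStart(
      stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
}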
2.6 taskCompleted
The taskCompleted method is called when a task for the specified partition of the given Stage finishes:
//org.apache.spark.scheduler.OutputCommitCoordinator
private[scheduler] def taskCompleted(
    stage: StageId,
    partition: PartitionId,
    attemptNumber: TaskAttemptNumber,
    reason: TaskEndReason): Unit = synchronized {
  val authorizedCommitters = authorizedCommittersByStage.getOrElse(stage, {
    logDebug(s"Ignoring task completion for completed stage")
    return
  })
  reason match {
    case Success =>
      // The task output was committed successfully; nothing needs to be done
    case denied: TaskCommitDenied =>
      logInfo(s"Task was denied committing, stage: $stage, partition: $partition, " +
        s"attempt: $attemptNumber")
    case otherReason =>
      if (authorizedCommitters(partition) == attemptNumber) {
        logDebug(s"Authorized committer (attemptNumber=$attemptNumber, stage=$stage, " +
          s"partition=$partition) failed; clearing lock")
        authorizedCommitters(partition) = NO_AUTHORIZED_COMMITTER
      }
  }
}
From this code, three cases count as task completion (a small simulation of the third case follows the list):
- The task succeeded: reason is Success, a case object of the TaskEndReason trait.
- The task's commit was denied: reason is TaskCommitDenied, a subclass of the TaskEndReason trait.
- Any other reason: reason is some other subclass of TaskEndReason. In this case the value at the partition's index in the Stage's TaskAttemptNumber array is reset to NO_AUTHORIZED_COMMITTER, so that a later task attempt can be authorized to commit.
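The effect of the third case, releasing the commit lock so that a retry can commit, can be shown with another hypothetical, self-contained stand-in (a single partition kept in a plain array):
object CommitLockRecoveryDemo {
  private val NO_AUTHORIZED_COMMITTER = -1
  private val committers = Array.fill(1)(NO_AUTHORIZED_COMMITTER) // one partition

  // Mirror of handleAskPermissionToCommit for partition 0
  def ask(attempt: Int): Boolean =
    if (committers(0) == NO_AUTHORIZED_COMMITTER) { committers(0) = attempt; true }
    else false

  // Mirror of taskCompleted with a reason other than Success or TaskCommitDenied
  def taskFailed(attempt: Int): Unit =
    if (committers(0) == attempt) committers(0) = NO_AUTHORIZED_COMMITTER // clear the lock

  def main(args: Array[String]): Unit = {
    println(ask(0)) // true: attempt 0 takes the commit lock
    println(ask(1)) // false: attempt 0 still holds it
    taskFailed(0)   // attempt 0 fails, so its lock is released
    println(ask(1)) // true: the retry may now commit
  }
}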
2.7 stop
The stop method sends a StopCoordinator message to the OutputCommitCoordinatorEndpoint to stop it, and then clears authorizedCommittersByStage:
//org.apache.spark.scheduler.OutputCommitCoordinator
def stop(): Unit = synchronized {
  if (isDriver) {
    coordinatorRef.foreach(_ send StopCoordinator)
    coordinatorRef = None
    authorizedCommittersByStage.clear()
  }
}
3 How OutputCommitCoordinator Works
Having examined OutputCommitCoordinatorEndpoint and OutputCommitCoordinator in detail, we can summarize how OutputCommitCoordinator decides whether a task may commit its output to HDFS with the following figure:
authorizedCommittersByStage is the in-memory structure that caches every Stage and its partitions: S0, S1, and SN stand for different Stages, P0 and P1 for partitions of a Stage, and Pn and Pm indicate that different Stages contain different numbers of partitions. The numbered steps in the figure mean the following:
- ①: The OutputCommitCoordinatorEndpoint receives a StopCoordinator message and calls the stop method inherited from RpcEndpoint, which in turn calls the stop method of NettyRpcEnv, stopping the OutputCommitCoordinatorEndpoint.
- ②: On receiving an AskPermissionToCommitOutput message, the OutputCommitCoordinatorEndpoint calls OutputCommitCoordinator's handleAskPermissionToCommit method to decide whether the given task attempt may commit the data of the specified partition of the given Stage to HDFS.
- ③: The AskPermissionToCommitOutput message carries Stage S0, partition Pn, and task attempt number 1. handleAskPermissionToCommit finds that partition Pn of Stage S0 has not been taken by any attempt (its value is -1), so the current attempt is allowed to commit partition Pn's data to HDFS, and Pn's value is set to 1.
- ④: The AskPermissionToCommitOutput message carries Stage S1, partition Pm, and task attempt number 11. handleAskPermissionToCommit finds that partition Pm of Stage S1 is already taken by another attempt (with attempt number 10), so the current attempt is denied permission to commit partition Pm's data to HDFS.