The Executor's role: execute Tasks and send the computed results back to the Driver (the Driver handles Task partitioning and scheduling).
6.1 Executor Allocation in Standalone Mode
SparkContext -> TaskScheduler -> SchedulerBackend -> DAGScheduler -> AppClient (registers the Application) -> Master (launches the Executor) -> Worker -> ExecutorRunner (fetchAndRunExecutor) -> CoarseGrainedExecutorBackend (registers the Executor) -> SchedulerBackend (reviveOffers sends the LaunchTask message) -> Task starts
6.1.1 SchedulerBackend Creates the AppClient
SchedulerBackend creates an AppClient, and the AppClient registers the Application with the Master:
RegisterApplication(appDescription: ApplicationDescription)

private[spark] class ApplicationDescription(
  val name: String,                  // Application name
  val maxCores: Option[Int],         // maximum number of CPU cores to use
  val memoryPerSlave: Int,           // memory available to each Executor
  val command: Command,              // information needed to launch the Java process,
                                     // including the class name (CoarseGrainedExecutorBackend)
                                     // and the launch arguments:
  //   def substituteVariables(argument: String): String = argument match {
  //     case "WORKER_URL"  => workerUrl
  //     case "EXECUTOR_ID" => execId.toString
  //     case "HOSTNAME"    => host
  //     case "CORES"       => cores.toString
  //     case "APP_ID"      => appId
  //   }
  val appUiUrl: String,              // hostname of the web UI
  val eventLogFile: Option[String]   // event log directory
)
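The placeholder substitution in the launch command can be sketched as a plain pattern match. The concrete values below (workerUrl, execId, and so on) are made-up stand-ins for what the Worker supplies at launch time:

```scala
object SubstituteExample {
  // Illustrative runtime values; in Spark these come from the Worker/Executor context.
  val workerUrl = "akka://worker"
  val execId = 7
  val host = "node-1"
  val cores = 4
  val appId = "app-0001"

  // Mirrors the substitution shown above: each placeholder token in the
  // launch command is replaced by its runtime value; other arguments pass through.
  def substituteVariables(argument: String): String = argument match {
    case "WORKER_URL"  => workerUrl
    case "EXECUTOR_ID" => execId.toString
    case "HOSTNAME"    => host
    case "CORES"       => cores.toString
    case "APP_ID"      => appId
    case other         => other
  }
}
```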
The Executor uses the Driver URL to create a Driver actor instance for communicating with the Driver. In addition, SparkDeploySchedulerBackend implements AppClientListener, so that the AppClient can promptly notify SparkDeploySchedulerBackend when needed, calling its interface to tell the Driver that the Application's state has been updated.
6.1.2 AppClient Registers the Application with the Master
AppClient.ClientActor exchanges the following main messages with the Master:
1) RegisteredApplication: the Application was registered successfully
2) ApplicationRemoved: the Master removed the Application
3) ExecutorAdded: the Master launched an Executor
4) ExecutorUpdated: an Executor's state changed, reported by the Master
5) MasterChanged: a newly elected Master took over
6) StopAppClient: stop the Application
Client side: AppClient.ClientActor.tryRegisterAllMasters() {
  for (masterUrl <- masterUrls) {
    val actor = context.actorSelection(Master.toAkkaUrl(masterUrl))
    actor ! RegisterApplication(appDescription) // send the RegisterApplication message to the Master
  }
}
Server side: Master.receive {
  case RegisterApplication(description) => {
    if (state == RecoveryState.STANDBY) {
      // a Master that is still recovering ignores this message,
      // which triggers the client's timeout mechanism
    } else {
      val app = createApplication(description, sender) // build the Application from the received ApplicationDescription
      registerApplication(app) // save it into the Master's member variables:
      //   apps += app
      //   idToApp(app.id) = app
      //   actorToApp(app.driver) = app
      //   addressToApp(appAddress) = app
      //   waitingApps += app
      persistenceEngine.addApplication(app) // persist the app's metadata
      sender ! RegisteredApplication(app.id, masterUrl) // acknowledge the registration to the AppClient
      schedule() // allocate resources to Applications still waiting for them
    }
  }
}
If no registration acknowledgment arrives within 20 seconds, the AppClient re-registers; after 3 failed retries, the registration fails.
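The timeout-and-retry policy can be sketched as follows. The object name and the injected sendAndAwait callback are illustrative, not Spark's actual code:

```scala
object RegistrationRetry {
  val RegistrationTimeoutSec = 20 // how long to wait for RegisteredApplication
  val MaxRetries = 3              // give up after this many attempts

  // `sendAndAwait` stands in for sending RegisterApplication and waiting up to
  // 20 seconds for the RegisteredApplication reply; it returns true on success.
  def register(sendAndAwait: () => Boolean): Boolean = {
    var attempt = 0
    while (attempt < MaxRetries) {
      attempt += 1
      if (sendAndAwait()) return true // acknowledgment arrived in time
    }
    false // every attempt timed out: this registration fails
  }
}
```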
6.1.3 The Master Selects Workers for the AppClient's Submission
Master.schedule() allocates Workers to an Application using one of two strategies:
1) Spread out (spark.deploy.spreadOut): distribute the Application across as many nodes as possible
2) Consolidate: concentrate the Application on as few nodes as possible, suitable for CPU-intensive Applications with modest memory needs
Master.schedule() {
  if (spreadOutApps) { // the spread-out strategy
    for (app <- waitingApps if app.coresLeft > 0) { // app is an Application awaiting resources
      // Usable Workers are those in state ALIVE with enough free memory (canUse),
      // sorted in descending order of free cores, so the Worker with the most
      // available cores is used first
      val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
        .filter(canUse(app, _)).sortBy(_.coresFree).reverse
      // If the free cores cover the Application's demand, assign what it asks for;
      // otherwise assign all the remaining free cores
      val toAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
      // Cores are handed out round-robin, one at a time, skipping Workers
      // whose free cores are exhausted:
      //   if (usableWorkers(pos).coresFree - assigned(pos) > 0) { ... }
      // Executors are then launched with the assigned cores:
      val exec = app.addExecutor(usableWorkers(pos), assigned(pos))
      launchExecutor(usableWorkers(pos), exec)
      app.state = ApplicationState.RUNNING
    }
  }
}
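The round-robin core assignment in the spread-out branch can be sketched with each Worker reduced to a free-core count; this is an illustration of the allocation loop, not the Master's real code:

```scala
object SpreadOutAllocation {
  // Given the app's remaining core demand and each usable Worker's free cores
  // (already sorted descending), return how many cores each Worker gets.
  def allocate(coresLeft: Int, freeCores: Array[Int]): Array[Int] = {
    val usable = freeCores.clone()
    val assigned = Array.fill(usable.length)(0)
    var toAssign = math.min(coresLeft, usable.sum) // cap at what the cluster has
    var pos = 0
    while (toAssign > 0) {
      if (usable(pos) - assigned(pos) > 0) { // this Worker still has a spare core
        assigned(pos) += 1
        toAssign -= 1
      }
      pos = (pos + 1) % usable.length // round-robin to the next Worker
    }
    assigned
  }
}
```

One core per pass spreads the Executors thinly across Workers, which is the point of the spread-out strategy.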
Once the Workers and the CPU cores per Executor have been determined, Master.launchExecutor(worker, exec) launches the Executors:
Master.launchExecutor(worker: WorkerInfo, exec: ExecutorInfo) {
  worker.addExecutor(exec) // record the Executor on the Worker
  worker.actor ! LaunchExecutor(masterUrl, exec.application.id, exec.id,
    exec.application.desc, exec.cores, exec.memory) // Master tells the Worker to launch the Executor
  exec.application.driver ! ExecutorAdded(exec.id, worker.id, worker.hostPort,
    exec.cores, exec.memory) // Master tells the AppClient that an Executor was added
}
6.1.4 The Worker Creates the Executor Based on the Master's Allocation
1) On receiving the LaunchExecutor message from the Master, the Worker creates an ExecutorRunner
2) The ExecutorRunner launches the process described by the ApplicationDescription, substituting the launch arguments
3) ExecutorRunner.fetchAndRunExecutor() starts CoarseGrainedExecutorBackend
4) CoarseGrainedExecutorBackend creates a DriverActor from the driverUrl to communicate with the Driver: it sends a RegisterExecutor message and keeps the Executor's information locally
5) On receiving the RegisterExecutor acknowledgment, CoarseGrainedExecutorBackend creates the Executor, and DriverActor.makeOffers then runs Tasks on it
6.2 Task Execution
CoarseGrainedSchedulerBackend.DriverActor.launchTasks assigns the TaskSet to Executors:
  val executorData = executorDataMap(task.executorId) // map the Task to its Executor
  executorData.freeCores -= scheduler.CPUS_PER_TASK // each Task occupies CPUS_PER_TASK cores
  executorData.executorActor ! LaunchTask(new SerializableBuffer(serializedTask)) // Driver sends LaunchTask to the Executor
CoarseGrainedExecutorBackend handles the Driver's LaunchTask message:
case LaunchTask(data) =>
  if (executor == null) {
    System.exit(1) // no Executor was ever created: fatal error
  } else {
    val ser = env.closureSerializer.newInstance()
    val taskDesc = ser.deserialize[TaskDescription](data.value) // deserialize to obtain the Task
    executor.launchTask(this, taskDesc.taskId, taskDesc.name, taskDesc.serializedTask)
  }
The Executor wraps the Task in a TaskRunner and submits it to a thread pool:
  val tr = new TaskRunner(context, taskId, taskName, serializedTask)
  runningTasks.put(taskId, tr)
  threadPool.execute(tr)
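The TaskRunner-plus-thread-pool pattern can be sketched as follows. The task body is reduced to a plain function and a CountDownLatch is added so callers can wait for completion; Spark's real TaskRunner deserializes the Task and reports status via execBackend, which this sketch omits:

```scala
import java.util.concurrent.{ConcurrentHashMap, CountDownLatch, Executors}

object MiniExecutor {
  // taskId -> running task, so in-flight tasks can be tracked (or killed)
  private val runningTasks = new ConcurrentHashMap[Long, Runnable]()
  private val threadPool = Executors.newFixedThreadPool(4)

  def launchTask(taskId: Long, body: () => Unit, done: CountDownLatch): Unit = {
    val tr: Runnable = () => {
      try body()
      finally {
        runningTasks.remove(taskId) // the task finished: drop it from the map
        done.countDown()
      }
    }
    runningTasks.put(taskId, tr)
    threadPool.execute(tr) // the pool runs the task on a worker thread
  }

  def shutdown(): Unit = threadPool.shutdown()
}
```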
6.2.1 Creating and Distributing the Dependencies
When the Driver packages a Task, it includes the files the Task depends on: currentFiles, currentJars, and taskBytes.
On the Executor side, deserialization yields the dependency lists and the Task itself (taskFiles, taskJars, subBuffer); Executor.updateDependencies then downloads the dependencies:
  Utils.fetchFile(name, new File(SparkFiles.getRootDirectory), conf, env.securityManager, hadoopConf, timestamp, useCache = !isLocal)
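The bookkeeping behind updateDependencies can be sketched as a timestamp cache: a file is fetched only when the Driver advertises a newer timestamp than the one already downloaded. DependencyCache and the injected fetch function are illustrative; Spark's Utils.fetchFile performs the real transfer:

```scala
import scala.collection.mutable

class DependencyCache(fetch: String => Unit) {
  private val currentFiles = mutable.Map[String, Long]() // file name -> fetched timestamp

  // Returns true if the file was (re)fetched, false if the local copy was current.
  def update(name: String, timestamp: Long): Boolean = {
    if (currentFiles.getOrElse(name, -1L) < timestamp) {
      fetch(name)                    // missing or stale: download it
      currentFiles(name) = timestamp // remember what we now have
      true
    } else false                     // already up to date: skip the fetch
  }
}
```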
6.2.2 Running the Task
  taskStart = System.currentTimeMillis() // task start time
  val value = task.run(taskId.toInt) // execute the Task
  val taskFinish = System.currentTimeMillis() // task finish time
task.run(taskId.toInt) is the core of Task execution:
1) val context = new TaskContextImpl(stageId, partitionId, attemptId, runningLocally = false) // create the TaskContext
2) TaskContextHelper.setTaskContext(context), which in turn calls TaskContext.setTaskContext()
3) runTask(context) // execute this Task
runTask(context) is implemented differently by ResultTask and ShuffleMapTask.
Ø For a ResultTask:
  val ser = SparkEnv.get.closureSerializer.newInstance() // create the deserializer
  val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
    ByteBuffer.wrap(taskBinary.value),
    Thread.currentThread.getContextClassLoader) // recover the RDD and the function applied to its output
  metrics = Some(context.taskMetrics) // the Task's metrics
  func(context, rdd.iterator(partition, context)) // rdd.iterator drives the RDD computation
Ø For a ShuffleMapTask:
  val manager = SparkEnv.get.shuffleManager // obtain the ShuffleManager (hash-based or sort-based shuffle)
  writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context) // HashShuffleWriter or SortShuffleWriter
  writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]]) // compute the RDD and write the results to the file system
  return writer.stop(success = true).get // close the writer and return the result
4) context.markTaskCompleted() // run the TaskContext's completion callbacks
When the Task finishes, TaskContextImpl.markTaskCompleted() is invoked:
  completed = true // mark the Task as finished
  onCompleteCallbacks.reverse.foreach { listener =>
    try {
      listener.onTaskCompletion(this) // run each completion callback
    } catch {
      case e: Exception => // a failing listener does not stop the remaining ones
    }
  }
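The reverse-order callback execution can be sketched like this; MiniTaskContext and the function-typed listeners are simplifications of Spark's TaskContextImpl and its TaskCompletionListener interface:

```scala
import scala.collection.mutable.ArrayBuffer

class MiniTaskContext {
  private val onCompleteCallbacks = ArrayBuffer[() => Unit]()
  var completed = false

  def addTaskCompletionListener(f: () => Unit): Unit = onCompleteCallbacks += f

  def markTaskCompleted(): Unit = {
    completed = true // mark the task as finished
    // Traverse in reverse: listeners registered first run last,
    // and a throwing listener does not prevent the others from running.
    onCompleteCallbacks.reverse.foreach { listener =>
      try listener()
      catch { case e: Exception => () } // swallow so later callbacks still run
    }
  }
}
```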
6.2.3 Handling the Task's Result
  val valueBytes = resultSer.serialize(value) // serialize the computed value
  val directResult = new DirectTaskResult(valueBytes, accumUpdates, task.metrics.orNull) // wrap it in a DirectTaskResult
  val serializedDirectResult = ser.serialize(directResult) // serialize the wrapper
  val resultSize = serializedDirectResult.limit // size of the serialized result
A strategy is chosen based on resultSize:
1) If the result exceeds 1 GB, it is dropped outright
2) A large result is stored in the BlockManager with the taskId as the key, and only the block id is sent back
3) A small result is sent directly back to the Driver
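The three-way routing can be sketched as a pure function of the serialized size. The two thresholds below are illustrative constants; in Spark they come from the driver's maximum result size and the RPC frame size:

```scala
object ResultRouting {
  sealed trait Route
  case object Dropped extends Route         // too large: discard the result
  case object ViaBlockManager extends Route // large: store in BlockManager, send the block id
  case object Direct extends Route          // small: send the bytes straight to the Driver

  val MaxResultSize: Long = 1L << 30  // 1 GB, assumed upper limit
  val MaxDirectSize: Long = 10L << 20 // 10 MB, assumed frame-size stand-in

  def route(resultSize: Long): Route =
    if (resultSize > MaxResultSize) Dropped
    else if (resultSize > MaxDirectSize) ViaBlockManager
    else Direct
}
```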
execBackend is the Executor's communication interface to the Driver. The TaskRunner reports the Task's execution status to the Driver via execBackend.statusUpdate(), and the Driver forwards it to TaskSchedulerImpl.statusUpdate.
6.2.4 Driver-Side Processing
The handling depends on the TaskState:
1) TaskState.FINISHED: handled by TaskResultGetter.enqueueSuccessfulTask
For an IndirectTaskResult, the result is fetched by its blockId via SparkEnv.blockManager.getRemoteBytes(blockId)
For a DirectTaskResult, the following five calls run in sequence:
Ø TaskSchedulerImpl.handleSuccessfulTask
Ø TaskSetManager.handleSuccessfulTask
Ø DAGScheduler.taskEnded
Ø DAGScheduler.eventProcessActor
Ø DAGScheduler.handleTaskCompletion
For a ShuffleMapTask, the result is actually a MapStatus; DAGScheduler.handleTaskCompletion does:
  val status = event.result.asInstanceOf[MapStatus]
  mapOutputTracker.registerMapOutputs(...) // store the MapStatus list in a timestamped HashMap, keyed by shuffleId
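The registration step can be sketched with MapStatus reduced to a location string; MiniMapOutputTracker and its methods are illustrative stand-ins for MapOutputTrackerMaster and its timestamped map:

```scala
import scala.collection.mutable

object MiniMapOutputTracker {
  // shuffleId -> (per-partition map output locations, registration timestamp);
  // the timestamp lets stale entries be cleaned up later.
  private val mapStatuses = mutable.Map[Int, (Array[String], Long)]()

  def registerMapOutputs(shuffleId: Int, statuses: Array[String]): Unit =
    mapStatuses(shuffleId) = (statuses, System.currentTimeMillis())

  def getMapOutputs(shuffleId: Int): Option[Array[String]] =
    mapStatuses.get(shuffleId).map(_._1)
}
```

Reduce tasks later query this map by shuffleId to learn where each map task's output lives.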
2) TaskState.FAILED, TaskState.KILLED, or TaskState.LOST: handled by TaskResultGetter.enqueueFailedTask
6.3 Parameter Settings
6.3.1 spark.executor.memory
Sets the Executor's memory size.
6.3.2 Logging
spark.eventLog.enabled: set to true to enable event logging; spark.eventLog.dir: directory the logs are written to
spark.executor.logs.rolling.strategy: Executor log rolling strategy (time-based or size-based)
spark.executor.logs.rolling.time.interval: rolling interval for the time-based strategy
spark.executor.logs.rolling.size.maxBytes: maximum file size for the size-based strategy
spark.executor.logs.rolling.maxRetainedFiles: maximum number of log files the system retains
6.3.3 spark.executor.heartbeatInterval
Sets the heartbeat interval between the Executor and the Driver.
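Putting the section's parameters together, a spark-defaults.conf fragment might look like the following; all values are illustrative examples, not recommendations:

```properties
# Example spark-defaults.conf fragment (illustrative values)
spark.executor.memory                         4g
spark.eventLog.enabled                        true
spark.eventLog.dir                            hdfs:///spark-events
spark.executor.logs.rolling.strategy          time
spark.executor.logs.rolling.time.interval     daily
spark.executor.logs.rolling.maxRetainedFiles  7
spark.executor.heartbeatInterval              10s
```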