Apache Spark Source Code Walkthrough, Part 3: Function Call Analysis During Task Execution
I. How Spark handles intermediate results
1. When a ShuffleMapTask finishes its computation, it wraps the task's status in a MapStatus and returns it to the DAGScheduler
2. The DAGScheduler saves the MapStatus in the MapOutputTrackerMaster
3. When a ResultTask reaches a ShuffledRDD (every shuffle operation produces a ShuffledRDD as its intermediate result), it calls BlockStoreShuffleFetcher.fetch() to obtain the data; the fetcher first asks the MapOutputTrackerMaster for the MapStatus, then uses the answer to fetch the actual data via BlockManager.getMultiple() (see the sketch below)
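To make the three steps concrete, here is a minimal sketch of the bookkeeping they describe. The names (SimpleMapOutputTracker, registerMapOutput) and the simplified BlockManagerId/MapStatus shapes are illustrative stand-ins, not Spark's real API:
import scala.collection.mutable

case class BlockManagerId(host: String, port: Int)                  // where a map output lives
case class MapStatus(location: BlockManagerId, sizes: Array[Long])  // one size per reduce partition

class SimpleMapOutputTracker {
  private val statuses = mutable.Map[Int, mutable.ArrayBuffer[MapStatus]]()

  // Conceptually what the DAGScheduler does when a ShuffleMapTask finishes (step 2).
  def registerMapOutput(shuffleId: Int, status: MapStatus): Unit =
    statuses.getOrElseUpdate(shuffleId, mutable.ArrayBuffer()) += status

  // Conceptually what the fetch path asks on behalf of a ResultTask (step 3):
  // for each map output, return its location and the bytes destined for reduceId.
  def getServerStatuses(shuffleId: Int, reduceId: Int): Seq[(BlockManagerId, Long)] =
    statuses.getOrElse(shuffleId, mutable.ArrayBuffer())
      .map(s => (s.location, s.sizes(reduceId))).toSeq
}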
II. SparkEnv creates the BlockManager, ConnectionManager, CacheManager, and MapOutputTrackerMaster
private[spark] val env = SparkEnv.create(
  conf,
  "",
  conf.get("spark.driver.host"),
  conf.get("spark.driver.port").toInt,
  isDriver = true,
  isLocal = isLocal)
SparkEnv.set(env)
Apache Spark Source Code Walkthrough, Part 4: DStream Real-Time Stream Processing
I. DStreamGraph records the input streams and the output streams
private val inputStreams = new ArrayBuffer[InputDStream[_]]()
private val outputStreams = new ArrayBuffer[DStream[_]]()
Elements are added to outputStreams automatically whenever an output operation is applied to a DStream; the difference between the two collections is that output streams override generateJob
Taking socketStream as an example: SocketInputDStream starts a thread whose receive() method keeps accepting data; the received records are buffered first and ultimately stored by calling methods on the BlockManager, roughly as in the sketch below
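A hedged sketch of that receive-and-store pattern, assuming an illustrative class SimpleSocketReceiver and a store callback; the real SocketInputDStream ultimately stores through the BlockManager as the text says:
import java.io.BufferedReader
import java.io.InputStreamReader
import java.net.Socket

class SimpleSocketReceiver(host: String, port: Int, store: String => Unit) {
  def start(): Thread = {
    val t = new Thread("socket-receiver") {
      override def run(): Unit = receive()
    }
    t.setDaemon(true)
    t.start()
    t
  }

  // A dedicated thread reads from the socket and hands each record
  // to a storage callback (ultimately the BlockManager).
  private def receive(): Unit = {
    val socket = new Socket(host, port)
    val reader = new BufferedReader(new InputStreamReader(socket.getInputStream))
    var line = reader.readLine()
    while (line != null) {
      store(line) // buffered first, eventually stored via BlockManager
      line = reader.readLine()
    }
    socket.close()
  }
}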
II. Creating the recurring timer
private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  longTime => eventActor ! GenerateJobs(new Time(longTime)), "JobGenerator")
// Each time the timer fires, a GenerateJobs event is sent, which is handled by generateJobs()
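For intuition, here is a minimal recurring timer in the spirit of Spark's RecurringTimer; SimpleRecurringTimer is an illustrative sketch, not the actual class:
class SimpleRecurringTimer(periodMs: Long, callback: Long => Unit, name: String) {
  private val thread = new Thread(name) {
    override def run(): Unit =
      try {
        var nextTime = System.currentTimeMillis() + periodMs
        while (true) {
          val sleep = nextTime - System.currentTimeMillis()
          if (sleep > 0) Thread.sleep(sleep)
          callback(nextTime) // Spark's JobGenerator fires GenerateJobs(new Time(nextTime)) here
          nextTime += periodMs
        }
      } catch {
        case _: InterruptedException => () // stop() interrupts the thread
      }
  }
  def start(): Unit = { thread.setDaemon(true); thread.start() }
  def stop(): Unit = thread.interrupt()
}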
private def generateJobs(time: Time) {
  SparkEnv.set(ssc.env)
  Try(graph.generateJobs(time)) match {
    case Success(jobs) =>
      val receivedBlockInfo = graph.getReceiverInputStreams.map { stream =>
        val streamId = stream.id
        val receivedBlockInfo = stream.getReceivedBlockInfo(time)
        (streamId, receivedBlockInfo)
      }.toMap
      jobScheduler.submitJobSet(JobSet(time, jobs, receivedBlockInfo))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventActor ! DoCheckpoint(time)
}
Following this path eventually calls Job.run(), which in turn calls sc.runJob()
Apache Spark Source Code Walkthrough, Part 5: Fault-Tolerance Analysis of DStream Processing
Consider the following code:
val ssc = new StreamingContext(sc, Seconds(3))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCount = pairs.reduceByKey(_ + _)
wordCount.print()
Data reception:
Execution flow of ssc.socketTextStream("localhost", 9999):
1) SocketReceiver.receive() keeps receiving data from the socket process at localhost:9999 and appends it to BlockGenerator.currentBuffer
2) BlockGenerator holds a recurring timer; each time it fires, updateCurrentBuffer() wraps the data currently in the buffer into a new Block and puts it on an ArrayBlockingQueue
3) BlockGenerator also runs a thread that keeps handing Blocks from the queue to the BlockManager via pushArrayBuffer(), so the BlockManager holds the data blocks in memory
4) pushArrayBuffer() also passes the id of each stored Block to the ReceiverTracker, which appends the blockId to the queue for that streamId and records blocks that are received but not yet processed in receivedBlockInfo (see the sketch after this list)
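The sketch below condenses steps 1-4 into one illustrative class; SimpleBlockGenerator, addData, and pushBlock are hypothetical names rather than the real BlockGenerator API:
import java.util.concurrent.ArrayBlockingQueue
import scala.collection.mutable.ArrayBuffer

class SimpleBlockGenerator(pushBlock: (String, ArrayBuffer[Any]) => Unit) {
  private var currentBuffer = new ArrayBuffer[Any]
  private val blockQueue = new ArrayBlockingQueue[(String, ArrayBuffer[Any])](10)

  // Step 1: the receiver appends incoming records here.
  def addData(data: Any): Unit = synchronized { currentBuffer += data }

  // Step 2: invoked by the recurring timer; rolls the buffer into a new block.
  def updateCurrentBuffer(time: Long): Unit = {
    val newBlock = synchronized {
      if (currentBuffer.nonEmpty) {
        val block = ("block-" + time, currentBuffer)
        currentBuffer = new ArrayBuffer[Any]
        Some(block)
      } else None
    }
    newBlock.foreach(b => blockQueue.put(b))
  }

  // Steps 3-4: a dedicated thread drains the queue and pushes each block
  // (in Spark, pushArrayBuffer stores it via the BlockManager and reports
  // the block id to the ReceiverTracker).
  def startPushing(): Thread = {
    val t = new Thread("block-pusher") {
      override def run(): Unit =
        try {
          while (true) {
            val (id, buffer) = blockQueue.take()
            pushBlock(id, buffer)
          }
        } catch { case _: InterruptedException => () }
    }
    t.setDaemon(true)
    t.start()
    t
  }
}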
The core function, pushArrayBuffer:
def pushArrayBuffer(
    arrayBuffer: ArrayBuffer[_],
    optionalMetadata: Option[Any],
    optionalBlockId: Option[StreamBlockId]
  ) {
  val blockId = optionalBlockId.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  blockManager.put(blockId, arrayBuffer.asInstanceOf[ArrayBuffer[Any]],
    storageLevel, tellMaster = true)
  reportPushedBlock(blockId, arrayBuffer.size, optionalMetadata)
}
Job processing:
private def generateJobs(time: Time) {
  SparkEnv.set(ssc.env)
  Try(graph.generateJobs(time)) match {
    case Success(jobs) =>
      val receivedBlockInfo = graph.getReceiverInputStreams.map { stream =>
        val streamId = stream.id
        val receivedBlockInfo = stream.getReceivedBlockInfo(time)
        (streamId, receivedBlockInfo)
      }.toMap
      jobScheduler.submitJobSet(JobSet(time, jobs, receivedBlockInfo))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventActor ! DoCheckpoint(time)
}
When the GenerateJobs recurring timer fires, it sends a message that is handled by generateJobs() -> DStreamGraph.generateJobs(time) -> DStreamGraph.getReceiverInputStreams() -> ReceiverInputDStream.getReceivedBlockInfo() -> JobScheduler.submitJobSet() -> sc.runJob()
The Spark Streaming fault-tolerance mechanism
At the end of generateJobs() a DoCheckpoint message is sent; the corresponding actor writes the DStreamCheckpointData to HDFS (the actual write is performed by a CheckpointWriteHandler) and a CheckpointReader reads it back. On restart, Spark first tries to recover from the checkpoint; otherwise a new StreamingContext has to be created:
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(...)   // new context
  val lines = ssc.socketTextStream(...) // create DStreams
  ...
  ssc.checkpoint(checkpointDirectory)   // set checkpoint directory
  ssc
}
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
Apache Spark Source Code Walkthrough, Part 6: The Storage Subsystem
Functional modules of the Spark storage subsystem:
1) CacheManager: when an RDD is computed, data is fetched through the CacheManager, and computation results are stored through it as well
2) BlockManager: decides whether the CacheManager reads and writes data in memory or on disk
3) MemoryStore: reads and writes data in memory
4) DiskStore: reads and writes data on disk
5) BlockManagerWorker: performs the data replication used for fault tolerance, i.e. asynchronously copies data to other nodes
6) ConnectionManager: establishes connections to other nodes and handles sending and receiving data
7) BlockManagerMaster: exists only on the Executor where the Driver runs; it records which slave worker stores each blockId
All of the modules above are created in SparkEnv.create(); the sketch below models how the CacheManager sits in front of the BlockManager
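A minimal sketch, assuming illustrative types (SimpleBlockManager, SimpleCacheManager), of the get-or-compute pattern the CacheManager provides on top of the BlockManager:
import scala.collection.mutable

class SimpleBlockManager {
  private val memoryStore = mutable.Map[String, Any]()
  def get(blockId: String): Option[Any] = memoryStore.get(blockId)
  def put(blockId: String, value: Any): Unit = memoryStore(blockId) = value
}

class SimpleCacheManager(blockManager: SimpleBlockManager) {
  // Return the cached value if present; otherwise compute, store, and return it.
  def getOrCompute[T](blockId: String)(compute: => T): T =
    blockManager.get(blockId) match {
      case Some(v) => v.asInstanceOf[T]
      case None =>
        val v = compute
        blockManager.put(blockId, v)
        v
    }
}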
The write path, in brief:
1) RDD.iterator() is the entry point for interacting with the storage subsystem
2) CacheManager.getOrCompute() calls the BlockManager's put interface to write the data
3) Data is written to the MemoryStore first; once memory fills up, infrequently accessed data is written to disk (see the sketch below)
4) The BlockManager notifies the BlockManagerMaster that new data has been written, and the BlockManagerMaster records the metadata
5) The written data is synchronized to a slave worker, i.e. replicated to another node
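Step 3 can be illustrated with a simplified memory-first store that spills to disk when full. TieredStore is a toy model under assumed semantics (evict the oldest block), not Spark's actual MemoryStore eviction logic:
import scala.collection.mutable

class TieredStore(memoryCapacity: Int) {
  private val memory = mutable.LinkedHashMap[String, Array[Byte]]() // keeps insertion order
  private val disk = mutable.Map[String, Array[Byte]]()

  def put(blockId: String, bytes: Array[Byte]): Unit = {
    if (memory.size >= memoryCapacity) {
      val (oldestId, oldestBytes) = memory.head // least recently inserted block
      memory -= oldestId
      disk(oldestId) = oldestBytes              // spill to disk
    }
    memory(blockId) = bytes
  }

  def get(blockId: String): Option[Array[Byte]] =
    memory.get(blockId).orElse(disk.get(blockId)) // memory first, then disk
}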
The read path:
1) If the data block exists in local memory or on local disk, it is read directly via getLocal(blockId)
2) Otherwise the remote block is read: getRemote(blockId) -> doGetRemote(blockId) -> BlockManagerWorker.syncGetBlock fetches the block from the remote node (a condensed sketch and then the actual syncGetBlock follow)
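The local-first fallback reduces to a one-liner; getLocal and getRemote here are function parameters standing in for the BlockManager methods of the same names:
def getBlock(blockId: String,
             getLocal: String => Option[Array[Byte]],
             getRemote: String => Option[Array[Byte]]): Option[Array[Byte]] =
  getLocal(blockId).orElse(getRemote(blockId)) // try memory/disk first, then the network
The actual remote fetch, syncGetBlock, looks like this: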
def syncGetBlock(msg: GetBlock, toConnManagerId: ConnectionManagerId): ByteBuffer = {
  val blockManager = blockManagerWorker.blockManager
  val connectionManager = blockManager.connectionManager
  val blockMessage = BlockMessage.fromGetBlock(msg)           // build the BlockMessage
  val blockMessageArray = new BlockMessageArray(blockMessage) // wrap it in a BlockMessageArray
  // Send the request to the remote node toConnManagerId and wait for the response
  val responseMessage = connectionManager.sendMessageReliablySync(
    toConnManagerId, blockMessageArray.toBufferMessage)
  responseMessage match {
    case Some(message) => {
      val bufferMessage = message.asInstanceOf[BufferMessage]
      logDebug("Response message received " + bufferMessage)
      // Convert the BufferMessage back into BlockMessages and return the payload
      BlockMessageArray.fromBufferMessage(bufferMessage).foreach(blockMessage => {
        logDebug("Found " + blockMessage)
        return blockMessage.getData
      })
    }
    case None => logDebug("No response message received")
  }
  null
}
The asynchronous variant, sendMessageReliably(), returns a Future instead of blocking:
def sendMessageReliably(connectionManagerId: ConnectionManagerId, message: Message)
    : Future[Option[Message]] = {
  val promise = Promise[Option[Message]]
  val status = new MessageStatus(
    message, connectionManagerId, s => promise.success(s.ackMessage))
  messageStatuses.synchronized {
    messageStatuses += ((message.id, status))
  }
  sendMessage(connectionManagerId, message)
  promise.future // this is the asynchronous part: the future completes with ackMessage when the reply arrives
}
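The Promise/Future pattern above is worth a self-contained illustration. PromiseDemo below is plain Scala with no Spark types; a second thread stands in for the network layer delivering the ack:
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._

object PromiseDemo extends App {
  val promise = Promise[String]()
  val future: Future[String] = promise.future // handed back to the caller right away

  new Thread(() => {                          // stands in for the network layer
    Thread.sleep(100)                         // ... the remote node processes the request
    promise.success("ack")                    // completing the promise fulfils the future
  }).start()

  println(Await.result(future, 1.second))     // prints "ack"
}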
Apache Spark Source Code Walkthrough, Part 7: Standalone Deployment Mode
1) A Worker sends a RegisterWorker message to the Master
case RegisterWorker(id, workerHost, workerPort, cores, memory, workerUiPort, publicAddress) => {
  if (state == RecoveryState.STANDBY) {
    // ignore, don't send response
  } else if (idToWorker.contains(id)) { // already registered
    sender ! RegisterWorkerFailed("Duplicate worker ID")
  } else {
    // Create the WorkerInfo object
    val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
      sender, workerUiPort, publicAddress)
    if (registerWorker(worker)) { // registration succeeded
      persistenceEngine.addWorker(worker) // persist the worker
      // The Master tells the Worker that registration succeeded
      sender ! RegisteredWorker(masterUrl, masterWebUiUrl)
      schedule()
    } else {
      val workerAddress = worker.actor.path.address
      logWarning("Worker registration failed. Attempted to re-register worker at same " +
        "address: " + workerAddress)
      sender ! RegisterWorkerFailed("Attempted to re-register worker at same address: "
        + workerAddress)
    }
  }
}
2) The Driver sends a RegisterApplication message to the Master
case RegisterApplication(description) => {
  if (state == RecoveryState.STANDBY) {
    // ignore, don't send response
  } else {
    logInfo("Registering app " + description.name)
    val app = createApplication(description, sender) // create the ApplicationInfo object
    registerApplication(app)                          // register the application
    logInfo("Registered app " + description.name + " with ID " + app.id)
    persistenceEngine.addApplication(app)             // persist the application
    sender ! RegisteredApplication(app.id, masterUrl) // acknowledge the registration
    schedule()
  }
}
3) Once the application is registered, the Master asks the Workers to launch ExecutorBackends
case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
  if (masterUrl != activeMasterUrl) { // check that the request comes from the active Master
    logWarning("Invalid Master (" + masterUrl + ") attempted to launch executor.")
  } else {
    try {
      logInfo("Asked to launch executor %s/%d for %s".format(appId, execId, appDesc.name))
      // Create an ExecutorRunner with state RUNNING
      val manager = new ExecutorRunner(appId, execId, appDesc, cores_, memory_,
        self, workerId, host,
        appDesc.sparkHome.map(userSparkHome => new File(userSparkHome)).getOrElse(sparkHome),
        workDir, akkaUrl, ExecutorState.RUNNING)
      executors(appId + "/" + execId) = manager
      manager.start() // launch the executor process
      coresUsed += cores_
      memoryUsed += memory_
      masterLock.synchronized {
        // Report appId, execId, and the runner's state back to the Master
        master ! ExecutorStateChanged(appId, execId, manager.state, None, None)
      }
    } catch {
      case e: Exception => {
        logError("Failed to launch executor %s/%d for %s".format(appId, execId, appDesc.name))
        if (executors.contains(appId + "/" + execId)) {
          executors(appId + "/" + execId).kill()
          executors -= appId + "/" + execId
        }
        masterLock.synchronized {
          master ! ExecutorStateChanged(appId, execId, ExecutorState.FAILED, None, None)
        }
      }
    }
  }
Execution flow in cluster deploy mode:
1) After receiving a RegisterDriver request, the Master sends LaunchDriver to a Worker, asking it to start a JVM process for the Driver
2) As the Driver application starts running in the newly spawned JVM, it registers with the Master by sending RegisterApplication
3) The Master sends LaunchExecutor to Workers, asking them to start JVM processes running ExecutorBackend
4) Once the ExecutorBackends are up, the Driver application can submit tasks to them via LaunchTask messages (summarized below)
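For reference, the four steps can be summarized as the messages exchanged; the message shapes below are simplified illustrations, not the exact fields of the real case classes:
sealed trait StandaloneMessage
case class RegisterDriver(driverDescription: String) extends StandaloneMessage      // step 1: client -> Master
case class LaunchDriver(driverId: String) extends StandaloneMessage                  // step 1: Master -> Worker
case class RegisterApplication(appDescription: String) extends StandaloneMessage     // step 2: Driver -> Master
case class LaunchExecutor(appId: String, execId: Int) extends StandaloneMessage      // step 3: Master -> Worker
case class LaunchTask(taskId: Long, serializedTask: Array[Byte]) extends StandaloneMessage // step 4: Driver -> ExecutorBackend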
The code that creates the SparkEnv differs between the Executor and the Driver:
Executor:
private val env = {
  if (!isLocal) {
    val _env = SparkEnv.create(conf, executorId, slaveHostname, 0,
      isDriver = false, isLocal = false)
    SparkEnv.set(_env)
    _env.metricsSystem.registerSource(executorSource)
    _env
  } else {
    SparkEnv.get
  }
}
Driver:
private[spark] val env = SparkEnv.create(
  conf,
  "",
  conf.get("spark.driver.host"),
  conf.get("spark.driver.port").toInt,
  isDriver = true,
  isLocal = isLocal,
  listenerBus = listenerBus)
SparkEnv.set(env)
Apache Spark Source Code Walkthrough, Part 8: YARN Deployment Mode
The YARN resource management framework:
1) The Client sends an application registration message to the ResourceManager (RM); based on the application's resource requirements and each node's current usage, the RM starts a Container on a chosen NodeManager and launches the ApplicationMaster in it
2) The ApplicationMaster registers itself with the ResourceManager
3) The ApplicationMaster negotiates with the ResourceManager to schedule tasks onto specific NodeManagers, where Containers are started to execute them
Spark on YARN therefore acquires the compute resources first and only then performs task scheduling.
Apache Spark Source Code Walkthrough, Part 9: Saving and Reading ShuffleMapTask Results
The write path:
override def runTask(context: TaskContext): MapStatus = {
  metrics = Some(context.taskMetrics)
  var writer: ShuffleWriter[Any, Any] = null
  try {
    val manager = SparkEnv.get.shuffleManager
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    writer.write(rdd.iterator(split, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    return writer.stop(success = true).get
  } catch {
    case e: Exception =>
      if (writer != null) {
        writer.stop(success = false)
      }
      throw e
  } finally {
    context.executeOnCompleteCallbacks()
  }
}
ShuffleManager.getWriter() returns a HashShuffleWriter; HashShuffleWriter.write() in turn calls BlockObjectWriter.write()
override def write(records: Iterator[_ <: Product2[K, V]]): Unit = {
  val iter = if (dep.aggregator.isDefined) {
    if (dep.mapSideCombine) { // combine on the map side?
      dep.aggregator.get.combineValuesByKey(records, context)
    } else {
      records
    }
  } else if (dep.aggregator.isEmpty && dep.mapSideCombine) {
    throw new IllegalStateException("Aggregator is empty for map-side combine")
  } else {
    records
  }
  for (elem <- iter) {
    val bucketId = dep.partitioner.getPartition(elem._1)
    shuffle.writers(bucketId).write(elem)
  }
}
HashShuffleWriter.write() mainly does two things:
1) decide whether map-side aggregation is needed
2) use the Partitioner to determine which bucket each elem should be written to, as illustrated below
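A quick, concrete illustration of step 2, using Spark's own HashPartitioner:
import org.apache.spark.HashPartitioner

// HashPartitioner maps a record's key to a bucket id in [0, numPartitions)
val partitioner = new HashPartitioner(4)
val bucketId = partitioner.getPartition("apple") // selects which shuffle output file the record goes to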
The bucket id then selects the shuffle block and its backing file:
val blockId = ShuffleBlockId(shuffleId, mapId, bucketId) // a shuffle block id built from these three pieces of information
val blockFile = blockManager.diskBlockManager.getFile(blockId) // map the BlockId to a file managed by the BlockManager
blockManager.getDiskWriter(blockId, blockFile, serializer, bufferSize) // open a disk writer on that file
After the shuffle files are written, a MapStatus still has to be built and sent to the Driver:
HashShuffleWriter.stop() -> commitWritesAndBuildStatus()
private def commitWritesAndBuildStatus(): MapStatus = {
  // Commit the writes. Get the size of each bucket block (total block size).
  var totalBytes = 0L
  var totalTime = 0L
  val compressedSizes = shuffle.writers.map { writer: BlockObjectWriter =>
    writer.commit()
    writer.close()
    val size = writer.fileSegment().length
    totalBytes += size
    totalTime += writer.timeWriting()
    MapOutputTracker.compressSize(size)
  }
  // Update shuffle metrics.
  val shuffleMetrics = new ShuffleWriteMetrics
  shuffleMetrics.shuffleBytesWritten = totalBytes
  shuffleMetrics.shuffleWriteTime = totalTime
  metrics.shuffleWriteMetrics = Some(shuffleMetrics)
  new MapStatus(blockManager.blockManagerId, compressedSizes)
}
The MapStatus is wrapped in a StatusUpdate message and reported to the SchedulerBackend; DAGScheduler.handleTaskCompletion() then attaches the status to the corresponding Stage, i.e. registers it with the MapOutputTrackerMaster
The read path:
ShuffledRDD.compute() is the entry point of the read path:
override def compute(split: Partition, context: TaskContext): Iterator[P] = {
  val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
  SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
    .read()
    .asInstanceOf[Iterator[P]]
}
ShuffleManager.getReader() returns a HashShuffleReader, whose read() method does the work:
override def read(): Iterator[Product2[K, C]] = {
  val iter = BlockStoreShuffleFetcher.fetch(handle.shuffleId, startPartition, context,
    Serializer.getSerializer(dep.serializer))
  if (dep.aggregator.isDefined) {
    if (dep.mapSideCombine) {
      new InterruptibleIterator(context, dep.aggregator.get.combineCombinersByKey(iter, context))
    } else {
      new InterruptibleIterator(context, dep.aggregator.get.combineValuesByKey(iter, context))
    }
  } else if (dep.aggregator.isEmpty && dep.mapSideCombine) {
    throw new IllegalStateException("Aggregator is empty for map-side combine")
  } else {
    iter
  }
}
HashShuffleReader.read() has two main jobs:
1) obtain an iterator over the fetched blocks via BlockStoreShuffleFetcher.fetch()
2) decide whether the fetched records need to be combined
Execution of BlockStoreShuffleFetcher.fetch():
val blockManager = SparkEnv.get.blockManager
val startTime = System.currentTimeMillis
val statuses = SparkEnv.get.mapOutputTracker.getServerStatuses(shuffleId, reduceId)
// Group the map outputs by the BlockManager that holds them
val splitsByAddress = new HashMap[BlockManagerId, ArrayBuffer[(Int, Long)]]
for (((address, size), index) <- statuses.zipWithIndex) {
  splitsByAddress.getOrElseUpdate(address, ArrayBuffer()) += ((index, size))
}
val blocksByAddress = splitsByAddress.toSeq.map { case (address, splits) =>
  (address, splits.map(s => (ShuffleBlockId(shuffleId, s._1, reduceId), s._2)))
}
val blockFetcherItr = blockManager.getMultiple(blocksByAddress, serializer)
val itr = blockFetcherItr.flatMap(unpackBlock)
1) The MapStatuses are obtained from the mapOutputTracker; since the MapOutputTrackerWorker periodically syncs this data to the local node, the lookup tries the local cache first and only then falls back to the MapOutputTrackerMaster, as in the sketch below
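A hedged sketch of that local-first lookup; SimpleTrackerWorker and fetchFromMaster are hypothetical names, and the status entries are simplified to (location, size) pairs:
import scala.collection.mutable

class SimpleTrackerWorker(fetchFromMaster: Int => Array[(String, Long)]) {
  private val cachedStatuses = mutable.Map[Int, Array[(String, Long)]]()

  // Consult the locally cached statuses; only ask the master on a cache miss.
  def getServerStatuses(shuffleId: Int): Array[(String, Long)] =
    cachedStatuses.getOrElseUpdate(shuffleId, fetchFromMaster(shuffleId))
}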