In this installment:
1 Executor WAL
2 Message replay
3 Other topics
Data that cannot be processed as a real-time stream is effectively worthless. In the streaming era, Spark Streaming is highly attractive and has broad prospects; backed by the Spark ecosystem, Streaming can readily call on powerful sibling frameworks such as Spark SQL and MLlib, and it is poised to dominate the field.
At runtime, Spark Streaming is less a streaming framework built on Spark Core than the most complex application ever written on top of Spark Core. Once you have mastered an application as complex as Spark Streaming, no other application, however complex, will be beyond you. That is why Spark Streaming is the natural entry point for this series on customizing the Spark release.
Fault tolerance for data on the Executor is critically important. (Fault tolerance of the computation itself rests on Spark Core's mechanisms and is therefore covered by design; the fault tolerance discussed here concerns the safety of the received data.)
There are two handlers for writing received data: WriteAheadLogBasedBlockHandler and BlockManagerBasedBlockHandler. These handlers make the received data recoverable, the WAL-based one guaranteeing that the data can be fully replayed. In the WAL path, if no checkpoint directory has been specified, an exception is thrown.
Here we focus on the WAL approach, i.e. WriteAheadLogBasedBlockHandler; a quick sketch of enabling it at the application level follows.
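Before diving into the source, here is a minimal sketch of how an application turns this path on. The receiver WAL is enabled via the spark.streaming.receiver.writeAheadLog.enable property, and a checkpoint directory is mandatory; the host, port, and paths below are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("WALDemo")
  // Turn on the write-ahead log for receiver-based input streams.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(5))
// Required: the WAL files live under <checkpointDir>/receivedData/<streamId>.
ssc.checkpoint("hdfs:///tmp/wal-demo-checkpoint")  // placeholder path

// With the WAL providing durability, in-memory replication is redundant,
// so a non-replicated serialized storage level is the usual choice.
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
lines.count().print()

ssc.start()
ssc.awaitTermination()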
private[streaming] object WriteAheadLogBasedBlockHandler {
  def checkpointDirToLogDir(checkpointDir: String, streamId: Int): String = {
    new Path(checkpointDir, new Path("receivedData", streamId.toString)).toString
  }
}
Here checkpointDirToLogDir derives, from the checkpoint directory and the stream id, the directory under which the received data will be logged.
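For example, with values chosen purely for illustration:

WriteAheadLogBasedBlockHandler.checkpointDirToLogDir("hdfs:///tmp/wal-demo-checkpoint", 0)
// => "hdfs:///tmp/wal-demo-checkpoint/receivedData/0"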
def createLogForReceiver(
    sparkConf: SparkConf,
    fileWalLogDirectory: String,
    fileWalHadoopConf: Configuration): WriteAheadLog = {
  createLog(false, sparkConf, fileWalLogDirectory, fileWalHadoopConf)
}
This method simply delegates to createLog with isDriver = false.
private def createLog(
    isDriver: Boolean,
    sparkConf: SparkConf,
    fileWalLogDirectory: String,
    fileWalHadoopConf: Configuration): WriteAheadLog = {

  val classNameOption = if (isDriver) {
    sparkConf.getOption(DRIVER_WAL_CLASS_CONF_KEY)
  } else {
    sparkConf.getOption(RECEIVER_WAL_CLASS_CONF_KEY)
  }
  val wal = classNameOption.map { className =>
    try {
      instantiateClass(
        Utils.classForName(className).asInstanceOf[Class[_ <: WriteAheadLog]], sparkConf)
    } catch {
      case NonFatal(e) =>
        throw new SparkException(s"Could not create a write ahead log of class $className", e)
    }
  }.getOrElse {
    new FileBasedWriteAheadLog(sparkConf, fileWalLogDirectory, fileWalHadoopConf,
      getRollingIntervalSecs(sparkConf, isDriver), getMaxFailures(sparkConf, isDriver),
      shouldCloseFileAfterWrite(sparkConf, isDriver))
  }
  if (isBatchingEnabled(sparkConf, isDriver)) {
    new BatchedWriteAheadLog(wal, sparkConf)
  } else {
    wal
  }
}
By default, when no custom class is configured, a FileBasedWriteAheadLog is created; if batching is enabled, it is additionally wrapped in a BatchedWriteAheadLog.
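As the code shows, the concrete WAL implementation is pluggable through the two configuration keys behind DRIVER_WAL_CLASS_CONF_KEY and RECEIVER_WAL_CLASS_CONF_KEY. A hedged sketch of swapping in a custom implementation (the class name is hypothetical; the class must extend org.apache.spark.streaming.util.WriteAheadLog and expose a constructor that instantiateClass can invoke, i.e. one taking a SparkConf or a no-arg one):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // com.example.MyWriteAheadLog is a placeholder for your own
  // WriteAheadLog subclass on the application classpath.
  .set("spark.streaming.driver.writeAheadLog.class", "com.example.MyWriteAheadLog")
  .set("spark.streaming.receiver.writeAheadLog.class", "com.example.MyWriteAheadLog")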
Returning to storeBlock: it builds two futures, storeInBlockManagerFuture and storeInWriteAheadLogFuture, so the data is stored in the BlockManager and written to the WAL in parallel. Once both complete, the resulting block metadata can be reported to the trackerEndpoint message loop.
def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {

  var numRecords = None: Option[Long]
  // Serialize the block so that it can be inserted into both
  val serializedBlock = block match {
    case ArrayBufferBlock(arrayBuffer) =>
      numRecords = Some(arrayBuffer.size.toLong)
      blockManager.dataSerialize(blockId, arrayBuffer.iterator)
    case IteratorBlock(iterator) =>
      val countIterator = new CountingIterator(iterator)
      val serializedBlock = blockManager.dataSerialize(blockId, countIterator)
      numRecords = countIterator.count
      serializedBlock
    case ByteBufferBlock(byteBuffer) =>
      byteBuffer
    case _ =>
      throw new Exception(s"Could not push $blockId to block manager, unexpected block type")
  }

  // Store the block in block manager
  val storeInBlockManagerFuture = Future {
    val putResult =
      blockManager.putBytes(blockId, serializedBlock, effectiveStorageLevel, tellMaster = true)
    if (!putResult.map { _._1 }.contains(blockId)) {
      throw new SparkException(
        s"Could not store $blockId to block manager with storage level $storageLevel")
    }
  }

  // Store the block in write ahead log
  val storeInWriteAheadLogFuture = Future {
    writeAheadLog.write(serializedBlock, clock.getTimeMillis())
  }

  // Combine the futures, wait for both to complete, and return the write ahead log record handle
  val combinedFuture = storeInBlockManagerFuture.zip(storeInWriteAheadLogFuture).map(_._2)
  val walRecordHandle = Await.result(combinedFuture, blockStoreTimeout)
  WriteAheadLogBasedStoreResult(blockId, numRecords, walRecordHandle)
}
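The concurrency here is ordinary Scala Future composition. A minimal, self-contained sketch of the same zip-then-await pattern (the string results stand in for the real put result and WAL record handle):

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Two independent stores running in parallel (placeholder work).
val storeInBlockManager = Future { /* put bytes into BlockManager */ "blockId" }
val storeInWriteAheadLog = Future { /* append record to the WAL */ "walRecordHandle" }

// zip pairs the two results; map(_._2) keeps only the WAL record
// handle, which is what the tracker needs to replay the block later.
val combined = storeInBlockManager.zip(storeInWriteAheadLog).map(_._2)

println(Await.result(combined, 30.seconds))

Because Await.result propagates the first failure from either future, a block is acknowledged only when both the BlockManager write and the WAL append have succeeded, which is exactly what makes replay safe.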
Notes:
Source material: DT_大数据梦工厂 (Spark release customization).
For more exclusive content, follow the WeChat public account: DT_Spark.
If you are interested in big data and Spark, you can join Wang Jialin's free Spark open course, held every evening at 20:00, YY room number: 68917580.