In this installment:
1 Executor WAL
2 Message replay
3 Other topics
Data that cannot be processed as a real-time stream is effectively worthless. In the streaming era, Spark Streaming is highly attractive and has broad prospects; backed by the Spark ecosystem, Streaming can readily call on powerful sibling frameworks such as Spark SQL and MLlib, and it is poised to dominate the field.
At runtime, Spark Streaming is less a streaming framework built on Spark Core than the most complex application ever written on top of Spark Core. Once you have mastered an application as complex as Spark Streaming, no other application, however complex, will be beyond you. That is why Spark Streaming is the natural entry point for this series on customizing the Spark release.
Fault tolerance for data on the Executor is critically important. (Fault tolerance of the computation itself rests on Spark Core's mechanisms and is therefore covered by design; the fault tolerance discussed here concerns the safety of the received data.)
There are two handlers for writing received data: WriteAheadLogBasedBlockHandler and BlockManagerBasedBlockHandler. These handlers make the received data recoverable, the WAL-based one guaranteeing that the data can be fully replayed. In the WAL path, if no checkpoint directory has been specified, an exception is thrown.
Here we focus on the WAL approach, i.e. WriteAheadLogBasedBlockHandler; a quick sketch of enabling it at the application level follows.
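Before diving into the source, here is a minimal sketch of how an application turns this path on. The receiver WAL is enabled via the spark.streaming.receiver.writeAheadLog.enable property, and a checkpoint directory is mandatory; the host, port, and paths below are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("WALDemo")
  // Turn on the write-ahead log for receiver-based input streams.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(5))
// Required: the WAL files live under <checkpointDir>/receivedData/<streamId>.
ssc.checkpoint("hdfs:///tmp/wal-demo-checkpoint")  // placeholder path

// With the WAL providing durability, in-memory replication is redundant,
// so a non-replicated serialized storage level is the usual choice.
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
lines.count().print()

ssc.start()
ssc.awaitTermination()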
private[streaming] object WriteAheadLogBasedBlockHandler {
  def checkpointDirToLogDir(checkpointDir: String, streamId: Int): String = {
    new Path(checkpointDir, new Path("receivedData", streamId.toString)).toString
  }
}
Here checkpointDirToLogDir derives, from the checkpoint directory and the stream id, the directory under which the received data will be logged.
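For example, with values chosen purely for illustration:

WriteAheadLogBasedBlockHandler.checkpointDirToLogDir("hdfs:///tmp/wal-demo-checkpoint", 0)
// => "hdfs:///tmp/wal-demo-checkpoint/receivedData/0"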
def createLogForReceiver(
    sparkConf: SparkConf,
    fileWalLogDirectory: String,
    fileWalHadoopConf: Configuration): WriteAheadLog = {
  createLog(false, sparkConf, fileWalLogDirectory, fileWalHadoopConf)
}
This method simply delegates to createLog with isDriver = false.
private def createLog(
    isDriver: Boolean,
    sparkConf: SparkConf,
    fileWalLogDirectory: String,
    fileWalHadoopConf: Configuration): WriteAheadLog = {

  val classNameOption = if (isDriver) {
    sparkConf.getOption(DRIVER_WAL_CLASS_CONF_KEY)
  } else {
    sparkConf.getOption(RECEIVER_WAL_CLASS_CONF_KEY)
  }
  val wal = classNameOption.map { className =>
    try {
      instantiateClass(
        Utils.classForName(className).asInstanceOf[Class[_ <: WriteAheadLog]], sparkConf)
    } catch {
      case NonFatal(e) =>
        throw new SparkException(s"Could not create a write ahead log of class $className", e)
    }
  }.getOrElse {
    new FileBasedWriteAheadLog(sparkConf, fileWalLogDirectory, fileWalHadoopConf,
      getRollingIntervalSecs(sparkConf, isDriver), getMaxFailures(sparkConf, isDriver),
      shouldCloseFileAfterWrite(sparkConf, isDriver))
  }
  if (isBatchingEnabled(sparkConf, isDriver)) {
    new BatchedWriteAheadLog(wal, sparkConf)
  } else {
    wal
  }
}
By default, when no custom class is configured, a FileBasedWriteAheadLog is created; if batching is enabled, it is additionally wrapped in a BatchedWriteAheadLog.
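As the code shows, the concrete WAL implementation is pluggable through the two configuration keys behind DRIVER_WAL_CLASS_CONF_KEY and RECEIVER_WAL_CLASS_CONF_KEY. A hedged sketch of swapping in a custom implementation (the class name is hypothetical; the class must extend org.apache.spark.streaming.util.WriteAheadLog and expose a constructor that instantiateClass can invoke, i.e. one taking a SparkConf or a no-arg one):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // com.example.MyWriteAheadLog is a placeholder for your own
  // WriteAheadLog subclass on the application classpath.
  .set("spark.streaming.driver.writeAheadLog.class", "com.example.MyWriteAheadLog")
  .set("spark.streaming.receiver.writeAheadLog.class", "com.example.MyWriteAheadLog")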
Returning to storeBlock: it builds two futures, storeInBlockManagerFuture and storeInWriteAheadLogFuture, so the data is stored in the BlockManager and written to the WAL in parallel. Once both complete, the resulting block metadata can be reported to the trackerEndpoint message loop.
def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {

  var numRecords = None: Option[Long]
  // Serialize the block so that it can be inserted into both
  val serializedBlock = block match {
    case ArrayBufferBlock(arrayBuffer) =>
      numRecords = Some(arrayBuffer.size.toLong)
      blockManager.dataSerialize(blockId, arrayBuffer.iterator)
    case IteratorBlock(iterator) =>
      val countIterator = new CountingIterator(iterator)
      val serializedBlock = blockManager.dataSerialize(blockId, countIterator)
      numRecords = countIterator.count
      serializedBlock
    case ByteBufferBlock(byteBuffer) =>
      byteBuffer
    case _ =>
      throw new Exception(s"Could not push $blockId to block manager, unexpected block type")
  }

  // Store the block in block manager
  val storeInBlockManagerFuture = Future {
    val putResult =
      blockManager.putBytes(blockId, serializedBlock, effectiveStorageLevel, tellMaster = true)
    if (!putResult.map { _._1 }.contains(blockId)) {
      throw new SparkException(
        s"Could not store $blockId to block manager with storage level $storageLevel")
    }
  }

  // Store the block in write ahead log
  val storeInWriteAheadLogFuture = Future {
    writeAheadLog.write(serializedBlock, clock.getTimeMillis())
  }

  // Combine the futures, wait for both to complete, and return the write ahead log record handle
  val combinedFuture = storeInBlockManagerFuture.zip(storeInWriteAheadLogFuture).map(_._2)
  val walRecordHandle = Await.result(combinedFuture, blockStoreTimeout)
  WriteAheadLogBasedStoreResult(blockId, numRecords, walRecordHandle)
}
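The concurrency here is ordinary Scala Future composition. A minimal, self-contained sketch of the same zip-then-await pattern (the string results stand in for the real put result and WAL record handle):

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Two independent stores running in parallel (placeholder work).
val storeInBlockManager = Future { /* put bytes into BlockManager */ "blockId" }
val storeInWriteAheadLog = Future { /* append record to the WAL */ "walRecordHandle" }

// zip pairs the two results; map(_._2) keeps only the WAL record
// handle, which is what the tracker needs to replay the block later.
val combined = storeInBlockManager.zip(storeInWriteAheadLog).map(_._2)

println(Await.result(combined, 30.seconds))

Because Await.result propagates the first failure from either future, a block is acknowledged only when both the BlockManager write and the WAL append have succeeded, which is exactly what makes replay safe.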
Notes:
Source material: DT_大数据梦工厂 (Spark release customization).
For more exclusive content, follow the WeChat public account: DT_Spark.
If you are interested in big data and Spark, you can join Wang Jialin's free Spark open course, held every evening at 20:00, YY room number: 68917580.