Spark Streaming Source Code Walkthrough: Executor Fault Tolerance

askvinson

Article link: https://blog.csdn.net/askvinson/article/details/51493868

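Data received by a Receiver is handed to ReceiverSupervisorImpl, which manages it.

Once ReceiverSupervisorImpl has the data, it stores it and reports the block's metadata to ReceiverTracker.

Executor-side data can be made fault tolerant in three ways:

Write-ahead log (WAL)
Data replication
Replaying the data stream from the receiver's source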
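
The write-ahead log idea from the list above can be sketched in a few lines. This is a toy model under stated assumptions, not Spark's WAL implementation: `WalSketch`, `DurableLog`, `Store`, and `recover` are hypothetical names, and the "durable" log is an in-memory buffer standing in for files that would survive a crash.

```scala
import scala.collection.mutable

// Toy write-ahead log; purely illustrative, not Spark's WriteAheadLog API.
object WalSketch {
  // Stands in for durable storage (for example files on HDFS) that survives a crash.
  class DurableLog {
    private val entries = mutable.ArrayBuffer.empty[(String, String)]
    def append(key: String, value: String): Unit = entries += ((key, value))
    def readAll: List[(String, String)] = entries.toList
  }

  // Volatile state that is lost when the process dies.
  class Store(log: DurableLog) {
    private val memory = mutable.Map.empty[String, String]
    def put(key: String, value: String): Unit = {
      log.append(key, value) // write ahead: record the entry in the log first
      memory(key) = value    // only then update the in-memory state
    }
    def get(key: String): Option[String] = memory.get(key)
  }

  // After a crash, rebuild the state by replaying the log in order.
  def recover(log: DurableLog): Map[String, String] = log.readAll.toMap
}
```

Because every entry reaches the log before it reaches memory, anything visible in memory before the crash can be rebuilt from the log afterwards.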

 
    /** Store block and report it to driver */
    def pushAndReportBlock(
        receivedBlock: ReceivedBlock,
        metadataOption: Option[Any],
        blockIdOption: Option[StreamBlockId]
      ) {
      val blockId = blockIdOption.getOrElse(nextBlockId)
      val time = System.currentTimeMillis
      val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
      logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
      val numRecords = blockStoreResult.numRecords
      val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
      trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
      logDebug(s"Reported block $blockId")
    }
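
The store-then-report order in this method can be illustrated with a minimal, self-contained sketch. All names here (`StoreThenReportSketch`, `InMemoryHandler`, `InMemoryTracker`, `BlockInfo`, `pushAndReport`) are hypothetical stand-ins rather than Spark classes; the real code stores through the block handler and reports to the driver over RPC.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of pushAndReportBlock; none of these types are Spark's.
object StoreThenReportSketch {
  final case class BlockInfo(streamId: Int, blockId: String, numRecords: Option[Long])

  // Stands in for ReceivedBlockHandler: keeps the block data on the "executor".
  class InMemoryHandler {
    private val blocks = mutable.Map.empty[String, Seq[String]]
    def storeBlock(blockId: String, records: Seq[String]): Option[Long] = {
      blocks(blockId) = records
      Some(records.size.toLong)
    }
    def contains(blockId: String): Boolean = blocks.contains(blockId)
  }

  // Stands in for ReceiverTracker: it only ever sees metadata, never the data itself.
  class InMemoryTracker {
    private val reported = mutable.ArrayBuffer.empty[BlockInfo]
    def addBlock(info: BlockInfo): Boolean = { reported += info; true }
    def blocks: List[BlockInfo] = reported.toList
  }

  def pushAndReport(
      streamId: Int,
      blockId: String,
      records: Seq[String],
      handler: InMemoryHandler,
      tracker: InMemoryTracker): Boolean = {
    // 1. Store first, so the metadata never points at data that does not exist.
    val numRecords = handler.storeBlock(blockId, records)
    // 2. Then report the metadata.
    tracker.addBlock(BlockInfo(streamId, blockId, numRecords))
  }
}
```

The point of the ordering is that the tracker can schedule work only on blocks that were already stored successfully.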

Storage itself is delegated to receivedBlockHandler, which has two implementations:

 
    private val receivedBlockHandler: ReceivedBlockHandler = {
      if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
        if (checkpointDirOption.isEmpty) {
          throw new SparkException(
            "Cannot enable receiver write-ahead log without checkpoint directory set. " +
              "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
              "See documentation for more details.")
        }
        new WriteAheadLogBasedBlockHandler(env.blockManager, receiver.streamId,
          receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
      } else {
        new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
      }
    }
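
The selection logic above boils down to one flag and one precondition. The sketch below models it with hypothetical types (`HandlerSelectionSketch`, `WalBasedHandler`, `BlockManagerOnlyHandler`, `chooseHandler`); in Spark the flag read by WriteAheadLogUtils.enableReceiverLog corresponds to the configuration key `spark.streaming.receiver.writeAheadLog.enable`.

```scala
// Hypothetical, simplified model of how the handler is chosen; not Spark's real types.
object HandlerSelectionSketch {
  sealed trait Handler
  final case class WalBasedHandler(checkpointDir: String) extends Handler
  case object BlockManagerOnlyHandler extends Handler

  def chooseHandler(walEnabled: Boolean, checkpointDir: Option[String]): Handler =
    if (walEnabled) {
      // The log is written under the checkpoint directory, so it must be set.
      val dir = checkpointDir.getOrElse(
        throw new IllegalStateException(
          "Cannot enable receiver write-ahead log without checkpoint directory set."))
      WalBasedHandler(dir)
    } else {
      BlockManagerOnlyHandler
    }
}
```

Failing fast at construction time means a misconfigured job stops before any data is received, instead of silently running without the log.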
 
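 WriteAheadLogBasedBlockHandler hands the data to BlockManager and also writes it to the write-ahead log.
 If a node crashes, the data that was in memory can be recovered from the WAL. Once the WAL is enabled, storing multiple replicas of the data is no longer recommended.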
 
    private val effectiveStorageLevel = {
      if (storageLevel.deserialized) {
        logWarning(s"Storage level serialization ${storageLevel.deserialized} is not supported when" +
          s" write ahead log is enabled, change to serialization false")
      }
      if (storageLevel.replication > 1) {
        logWarning(s"Storage level replication ${storageLevel.replication} is unnecessary when " +
          s"write ahead log is enabled, change to replication 1")
      }
      StorageLevel(storageLevel.useDisk, storageLevel.useMemory, storageLevel.useOffHeap, false, 1)
    }
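
What this normalization does to a requested level can be shown with a small stand-in. `EffectiveLevelSketch`, `Level`, and `effective` are hypothetical names; the case class only mirrors the five fields of StorageLevel and is not Spark's class.

```scala
// Hypothetical, simplified stand-in for StorageLevel; not Spark's class.
object EffectiveLevelSketch {
  final case class Level(
      useDisk: Boolean,
      useMemory: Boolean,
      useOffHeap: Boolean,
      deserialized: Boolean,
      replication: Int)

  // With a write-ahead log in place, keep the storage media as requested
  // but force serialized storage and a single replica.
  def effective(level: Level): Level =
    level.copy(deserialized = false, replication = 1)
}
```

For example, a level equivalent to MEMORY_AND_DISK_2 (disk, memory, deserialized, two replicas) comes out as disk plus memory, serialized, one replica.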
 
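 BlockManagerBasedBlockHandler, by contrast, simply hands the data to BlockManager.
 If no WAL is written, is data necessarily lost when a node crashes? Not necessarily. Both WriteAheadLogBasedBlockHandler and BlockManagerBasedBlockHandler are given the receiver's storageLevel when they are constructed. The storageLevel describes where the data is kept (memory, disk) and how many replicas are made, so with a replication factor above 1 a copy on another node can survive the crash.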
 
    class StorageLevel private(
        private var _useDisk: Boolean,
        private var _useMemory: Boolean,
        private var _useOffHeap: Boolean,
        private var _deserialized: Boolean,
        private var _replication: Int = 1)
      extends Externalizable
 
The following StorageLevel values are predefined:

 
    val NONE = new StorageLevel(false, false, false, false)
    val DISK_ONLY = new StorageLevel(true, false, false, false)
    val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
    val MEMORY_ONLY = new StorageLevel(false, true, false, true)
    val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
    val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
    val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
    val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
    val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
    val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
    val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
    val OFF_HEAP = new StorageLevel(false, false, true, false)
 
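 By default, Spark's built-in receiver-based input streams (for example socketTextStream) use MEMORY_AND_DISK_SER_2: the data is stored serialized in two replicas and is written to disk when memory runs short.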
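
Why a second replica helps when no WAL is written can be seen in a toy model. Everything in this sketch is hypothetical (`ReplicationSketch`, `Cluster`, `put`, `crash`, `isAvailable`), and the placement rule is deliberately naive; it is not how BlockManager picks peers.

```scala
// Toy model of block replication across executors; purely illustrative, not Spark code.
object ReplicationSketch {
  // executor name -> ids of the blocks it holds
  type Cluster = Map[String, Set[String]]

  // Place blockId on the first `replication` executors in name order.
  def put(cluster: Cluster, blockId: String, replication: Int): Cluster = {
    val targets = cluster.keys.toList.sorted.take(replication)
    cluster.map { case (executor, blocks) =>
      if (targets.contains(executor)) executor -> (blocks + blockId) else executor -> blocks
    }
  }

  // Losing an executor loses every block copy it held.
  def crash(cluster: Cluster, executor: String): Cluster = cluster - executor

  def isAvailable(cluster: Cluster, blockId: String): Boolean =
    cluster.values.exists(_.contains(blockId))
}
```

With one replica, crashing the executor that holds the block makes it unavailable; with two, the copy on the other executor is still there.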
 
 
 In the end, the data is stored and managed by BlockManager:
 
 
 
    def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {

      var numRecords = None: Option[Long]

      val putResult: Seq[(BlockId, BlockStatus)] = block match {
        case ArrayBufferBlock(arrayBuffer) =>
          numRecords = Some(arrayBuffer.size.toLong)
          blockManager.putIterator(blockId, arrayBuffer.iterator, storageLevel,
            tellMaster = true)
        case IteratorBlock(iterator) =>
          val countIterator = new CountingIterator(iterator)
          val putResult = blockManager.putIterator(blockId, countIterator, storageLevel,
            tellMaster = true)
          numRecords = countIterator.count
          putResult
        case ByteBufferBlock(byteBuffer) =>
          blockManager.putBytes(blockId, byteBuffer, storageLevel, tellMaster = true)
        case o =>
          throw new SparkException(
            s"Could not store $blockId to block manager, unexpected block type ${o.getClass.getName}")
      }
      if (!putResult.map { _._1 }.contains(blockId)) {
        throw new SparkException(
          s"Could not store $blockId to block manager with storage level $storageLevel")
      }
      BlockManagerBasedStoreResult(blockId, numRecords)
    }
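
In the IteratorBlock case the record count is read only after putIterator has consumed the iterator, because an iterator's length is unknown until it has been walked. A minimal sketch of such a counting wrapper follows; it is a simplified illustration, Spark's own CountingIterator may differ in detail, and `recordCount` is a name chosen here to avoid clashing with the `count(p)` method that Scala's Iterator already defines.

```scala
// Simplified sketch of a counting iterator; not a copy of Spark's CountingIterator.
object CountingIteratorSketch {
  class CountingIterator[T](underlying: Iterator[T]) extends Iterator[T] {
    private var consumed = 0L

    override def hasNext: Boolean = underlying.hasNext

    override def next(): T = {
      val element = underlying.next()
      consumed += 1
      element
    }

    // The count is only meaningful once every element has been read.
    def recordCount: Option[Long] =
      if (underlying.hasNext) None else Some(consumed)
  }
}
```

Wrapping the iterator this way lets the records be counted in the same pass that stores them, without buffering the whole block first.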