Spark RDD Caching

Reposted from: https://www.iteblog.com/archives/1532.html

As we know, one of Spark's biggest advantages over Hadoop is that it can cache data in memory for use by later computations. This post walks through the code behind that feature.

We can cache an RDD's data by calling rdd.persist() or rdd.cache(); cache() is in fact implemented by calling persist(). persist() supports the following storage levels:

val NONE = new StorageLevel(false, false, false, false)
val DISK_ONLY = new StorageLevel(true, false, false, false)
val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
val MEMORY_ONLY = new StorageLevel(false, true, false, true)
val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
val OFF_HEAP = new StorageLevel(false, false, true, false)

cache() ultimately calls persist(StorageLevel.MEMORY_ONLY), which is the default storage level. You can pick a different level to suit your needs; I won't go over the meaning of each level here, see the official documentation for details.
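
Here is a minimal usage sketch (the app name, master, and data are made up for illustration) showing that cache() is just persist(StorageLevel.MEMORY_ONLY), and that nothing is cached until an action runs:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cache-example").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 1000000).map(_ * 2)

    rdd.cache()  // equivalent to rdd.persist(StorageLevel.MEMORY_ONLY)
    // rdd.persist(StorageLevel.MEMORY_AND_DISK)  // would throw here: an RDD's
    //                                            // storage level cannot be changed once set

    println(rdd.count())  // first action: computes the RDD and fills the cache
    println(rdd.count())  // second action: reads the cached blocks instead

    sc.stop()
  }
}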

Caching an RDD's data via rdd.persist() always ends up in the following code:

private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
  // TODO: Handle changes of StorageLevel
  if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
    throw new UnsupportedOperationException(
      "Cannot change storage level of an RDD after it was already assigned a level")
  }
  // If this is the first time this RDD is marked for persisting, register it
  // with the SparkContext for cleanups and accounting. Do this only once.
  if (storageLevel == StorageLevel.NONE) {
    sc.cleaner.foreach(_.registerRDDForCleanup(this))
    sc.persistRDD(this)
  }
  storageLevel = newLevel
  this
}

The main job of this code is simply to set storageLevel to the level passed into persist(). Once an RDD's storage level has been set, it cannot be changed to a different one; attempting to do so throws an exception. Setting the storage level alone does not cache anything: the actual caching happens only when an action is triggered. When an action fires, it calls sc.runJob to do the real computation; that eventually invokes org.apache.spark.scheduler.Task#run, which calls the runTask method of either ResultTask or ShuffleMapTask, and runTask in turn ends up calling org.apache.spark.rdd.RDD#iterator, whose code is shown below:

final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
  } else {
    computeOrReadCheckpoint(split, context)
  }
}

If the current RDD has a storage level set (i.e., via rdd.persist() as above), the cacheManager checks whether the data is already cached: if so, it is fetched directly; if not, it is computed. The code of getOrCompute is as follows:

def getOrCompute[T](
    rdd: RDD[T],
    partition: Partition,
    context: TaskContext,
    storageLevel: StorageLevel): Iterator[T] = {

  val key = RDDBlockId(rdd.id, partition.index)
  logDebug(s"Looking for partition $key")
  blockManager.get(key) match {
    case Some(blockResult) =>
      // Partition is already materialized, so just return its values
      val existingMetrics = context.taskMetrics
        .getInputMetricsForReadMethod(blockResult.readMethod)
      existingMetrics.incBytesRead(blockResult.bytes)

      val iter = blockResult.data.asInstanceOf[Iterator[T]]
      new InterruptibleIterator[T](context, iter) {
        override def next(): T = {
          existingMetrics.incRecordsRead(1)
          delegate.next()
        }
      }
    case None =>
      // Acquire a lock for loading this partition
      // If another thread already holds the lock, wait for it to finish and return its results
      val storedValues = acquireLockForPartition[T](key)
      if (storedValues.isDefined) {
        return new InterruptibleIterator[T](context, storedValues.get)
      }

      // Otherwise, we have to load the partition ourselves
      try {
        logInfo(s"Partition $key not found, computing it")
        val computedValues = rdd.computeOrReadCheckpoint(partition, context)

        // If the task is running locally, do not persist the result
        if (context.isRunningLocally) {
          return computedValues
        }

        // Otherwise, cache the values and keep track of any updates in block statuses
        val updatedBlocks = new ArrayBuffer[(BlockId, BlockStatus)]
        val cachedValues = putInBlockManager(key, computedValues, storageLevel, updatedBlocks)
        val metrics = context.taskMetrics
        val lastUpdatedBlocks = metrics.updatedBlocks.getOrElse(Seq[(BlockId, BlockStatus)]())
        metrics.updatedBlocks = Some(lastUpdatedBlocks ++ updatedBlocks.toSeq)
        new InterruptibleIterator(context, cachedValues)

      } finally {
        loading.synchronized {
          loading.remove(key)
          loading.notifyAll()
        }
      }
  }
}

First, a key is built from the RDD's ID and the index of the partition being computed, and the blockManager is queried for a matching block. If the block is found, the partition has already been cached; otherwise it has to be computed. To recompute it, the thread must first acquire the corresponding lock, since several threads may request the same partition's data at the same time. The thread that acquires the lock calls rdd.computeOrReadCheckpoint(partition, context) to compute the partition's data and puts the result into the BlockManager; if other threads are waiting on the same partition, it must also notify them once the computation finishes (loading.notifyAll()).

If acquiring the lock fails, another thread is already computing that partition's data, so we have to wait (loading.wait()). The lock-acquisition code is as follows:

private def acquireLockForPartition[T](id: RDDBlockId): Option[Iterator[T]] = {
  loading.synchronized {
    if (!loading.contains(id)) {
      // If the partition is free, acquire its lock to compute its value
      loading.add(id)
      None
    } else {
      // Otherwise, wait for another thread to finish and return its result
      logInfo(s"Another thread is loading $id, waiting for it to finish...")
      while (loading.contains(id)) {
        try {
          loading.wait()
        } catch {
          case e: Exception =>
            logWarning(s"Exception while waiting for another thread to load $id", e)
        }
      }
      logInfo(s"Finished waiting for $id")
      val values = blockManager.get(id)
      if (!values.isDefined) {
        /* The block is not guaranteed to exist even after the other thread has finished.
         * For instance, the block could be evicted after it was put, but before our get.
         * In this case, we still need to load the partition ourselves. */
        logInfo(s"Whoever was loading $id failed; we'll try it ourselves")
        loading.add(id)
      }
      values.map(_.data.asInstanceOf[Iterator[T]])
    }
  }
}

Waiting threads (those that failed to acquire the lock) are woken up when the thread holding the lock calls loading.notifyAll(); after waking, they call new InterruptibleIterator[T](context, storedValues.get) to obtain the already-cached data. From then on, any downstream RDD that needs this RDD's data can read it straight from the cache instead of recomputing it. I'll analyze the checkpoint-related code in a later post.
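
To make the monitor semantics above concrete, here is a standalone sketch of the same synchronized/wait/notifyAll pattern over a shared loading set. This is not Spark's actual code; the name LoadingGuard and its two methods are invented for illustration:

import scala.collection.mutable

object LoadingGuard {
  private val loading = new mutable.HashSet[String]

  // Returns true if the caller acquired the "lock" and must compute the
  // value itself; returns false after some other thread has finished it.
  def tryAcquire(key: String): Boolean = loading.synchronized {
    if (!loading.contains(key)) {
      loading.add(key)
      true
    } else {
      // wait() releases the monitor; we are woken by notifyAll() in release()
      while (loading.contains(key)) loading.wait()
      false
    }
  }

  def release(key: String): Unit = loading.synchronized {
    loading.remove(key)
    loading.notifyAll()  // wake every thread blocked in tryAcquire
  }
}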

The cached data is released when the application finishes. If memory runs short, cached partitions may be evicted and their data lost; when that happens, fault tolerance kicks in and the lost partitions are recomputed from their lineage.
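
If recomputing evicted partitions would be expensive, a storage level that spills to disk avoids re-running the lineage. A short sketch, assuming sc is an existing SparkContext as in the earlier example and the data is invented:

import org.apache.spark.storage.StorageLevel

val expensive = sc.parallelize(1 to 1000000).map(i => (i, i.toString * 10))
expensive.persist(StorageLevel.MEMORY_AND_DISK)  // partitions that don't fit in memory spill to disk
expensive.count()  // first action materializes the cache
expensive.count()  // read back from memory and/or disk; the lineage is not re-run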

