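1. If storageLevel is not NONE, persist() has been called on this RDD, so iterator first tries to read the cached data and recomputes the partition only when the cache read misses.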
/**
 * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
 * This should ''not'' be called by users directly, but is available for implementors of custom
 * subclasses of RDD.
 */
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    // A storage level other than NONE means persist() was called on this RDD
    getOrCompute(split, context)
  } else {
    computeOrReadCheckpoint(split, context)
  }
}
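2. Read from the cache first: the local BlockManager is tried before remote ones, and the data is loaded from memory or disk according to the storage level.
If the block is found, it is returned.
If it is not found, the partition is computed, persisted, and then returned.
The computation goes through computeOrReadCheckpoint, so on a cache miss the data is loaded from the checkpoint if the RDD has been checkpointed, and computed otherwise.
2.1 If the data came from the cache, the bytes read are added to the task's inputMetrics (existingMetrics), the data is wrapped in an InterruptibleIterator, and the records-read counter is incremented by 1 for every element returned by next().
2.2 If the data was not read from the cache but was computed and then stored in the cache successfully, it is wrapped in an InterruptibleIterator directly.
2.3 If the computed data could not be stored in the cache, the returned iterator iter is wrapped in an InterruptibleIterator directly.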
/**
 * Gets or computes an RDD partition. Used by RDD.iterator() when an RDD is cached.
 */
private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
  val blockId = RDDBlockId(id, partition.index)
  var readCachedBlock = true
  // This method is called on executors, so we need call SparkEnv.get instead of sc.env.
  // Fetch the data through blockManager.getOrElseUpdate
  SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
    // Anonymous function: skipped when the block is found in the cache, run only on a miss
    readCachedBlock = false
    computeOrReadCheckpoint(partition, context) // read the checkpoint if there is one, else compute
  }) match {
    case Left(blockResult) => // the block is available from the cache
      if (readCachedBlock) { // read straight from the cache, nothing was computed
        val existingMetrics = context.taskMetrics().inputMetrics
        existingMetrics.incBytesRead(blockResult.bytes)
        new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
          override def next(): T = {
            existingMetrics.incRecordsRead(1) // records read +1 for each element returned
            delegate.next()
          }
        }
      } else {
        // computed and cached just now: wrap the data directly
        new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
      }
    case Right(iter) =>
      // the block could not be cached: wrap iter in an InterruptibleIterator
      new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
  }
}
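The control flow of getOrCompute can be modelled without Spark. The following is only a simplified sketch, not Spark's implementation: TinyBlockStore, its capacity parameter and GetOrComputeSketch are hypothetical names introduced for illustration, a block is just a Vector[Int], and a failed put is assumed to return the fully computed values. It shows that the closure flips readCachedBlock only on a cache miss, and that the Either result separates "available from the cache" (Left) from "could not be cached" (Right).

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for BlockManager: a block is a Vector[Int]
// and is only cached when it has at most `capacity` elements.
final class TinyBlockStore(capacity: Int) {
  private val blocks = mutable.Map.empty[String, Vector[Int]]

  def getOrElseUpdate(
      blockId: String,
      makeIterator: () => Iterator[Int]): Either[Vector[Int], Iterator[Int]] = {
    blocks.get(blockId) match {
      case Some(cached) =>
        Left(cached) // cache hit: makeIterator is never called
      case None =>
        val computed = makeIterator().toVector
        if (computed.size <= capacity) {
          blocks(blockId) = computed
          Left(computed) // computed and cached
        } else {
          Right(computed.iterator) // too large to cache: hand the values back
        }
    }
  }
}

object GetOrComputeSketch {
  // Returns the partition's values and whether they were read from the cache.
  def getOrCompute(
      store: TinyBlockStore,
      blockId: String,
      compute: () => Iterator[Int]): (List[Int], Boolean) = {
    var readCachedBlock = true
    val result = store.getOrElseUpdate(blockId, () => {
      readCachedBlock = false // only runs on a cache miss
      compute()
    })
    result match {
      case Left(values) => (values.toList, readCachedBlock)
      case Right(iter)  => (iter.toList, false)
    }
  }
}
```

In this sketch the first call for a block id computes and caches, a second call for the same id returns the cached values with the flag still true, and a block larger than capacity comes back through the Right branch without being cached.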
// BlockManager class
/**
 * Retrieve the given block if it exists, otherwise call the provided `makeIterator` method
 * to compute the block, persist it, and return its values.
 *
 * @return either a BlockResult if the block was successfully cached, or an iterator if the block
 *         could not be cached.
 */
def getOrElseUpdate[T](
    blockId: BlockId,
    level: StorageLevel,
    classTag: ClassTag[T],
    makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
  // Attempt to read the block from local or remote storage. If it's present, then we don't need
  // to go through the local-get-or-put path.
  get[T](blockId)(classTag) match {
    case Some(block) =>
      return Left(block) // found: return it immediately
    case _ =>
      // Need to compute the block.
  }
  // Initially we hold no locks on this block.
  // The block was not found, so compute it and try to store it
  doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
    case None =>
      // doPut() didn't hand work back to us, so the block already existed or was successfully
      // stored. Therefore, we now hold a read lock on the block.
      // The data is now in the local store: read it back locally and return it,
      // or throw if the read fails
      val blockResult = getLocalValues(blockId).getOrElse {
        // Since we held a read lock between the doPut() and get() calls, the block should not
        // have been evicted, so get() not returning the block indicates some internal error.
        releaseLock(blockId)
        throw new SparkException(s"get() failed for block $blockId even though we held a lock")
      }
      // We already hold a read lock on the block from the doPut() call and getLocalValues()
      // acquires the lock again, so we need to call releaseLock() here so that the net number
      // of lock acquisitions is 1 (since the caller will only call release() once).
      releaseLock(blockId)
      Left(blockResult)
    case Some(iter) =>
      // The block could not be stored in the cache
      // The put failed, likely because the data was too large to fit in memory and could not be
      // dropped to disk. Therefore, we need to pass the input iterator back to the caller so
      // that they can decide what to do with the values (e.g. process them without caching).
      Right(iter)
  }
}
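The lock accounting described in the comments of the None branch can be checked with a toy counter. This is a sketch under stated assumptions, not Spark code: ReadLockCounter and LockAccountingSketch are hypothetical names, and the counter only mimics the idea of counting read locks per block (in Spark this bookkeeping is done by BlockInfoManager). It shows why one releaseLock call is needed after getLocalValues so that exactly one read lock is left for the caller to release.

```scala
import scala.collection.mutable

// Hypothetical read-lock counter, one count per block id.
final class ReadLockCounter {
  private val counts = mutable.Map.empty[String, Int].withDefaultValue(0)

  def lockForReading(blockId: String): Unit = counts(blockId) += 1

  def releaseLock(blockId: String): Unit = {
    require(counts(blockId) > 0, s"no read lock held on $blockId")
    counts(blockId) -= 1
  }

  def held(blockId: String): Int = counts(blockId)
}

object LockAccountingSketch {
  // Mirrors the order of lock operations in the None branch and returns
  // the number of read locks still held when the block is handed to the caller.
  def putThenGet(locks: ReadLockCounter, blockId: String): Int = {
    locks.lockForReading(blockId) // stands in for doPutIterator(..., keepReadLock = true)
    locks.lockForReading(blockId) // stands in for getLocalValues acquiring the lock again
    locks.releaseLock(blockId)    // the extra releaseLock(blockId) in getOrElseUpdate
    locks.held(blockId)
  }
}
```

Without the extra release, two read locks would remain while the caller releases only one, so the block would stay locked for reading.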