As usual, let's start with a diagram of the whole flow.
Into the source code:
RDD#iterator
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
  } else {
    computeOrReadCheckpoint(split, context)
  }
}
First it checks whether a storage level has been set for this RDD.
Let's follow the if branch first, i.e. the RDD has been persisted.
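As a quick reminder of what puts us in this branch, here is a minimal driver-side sketch (app name and master are illustrative): calling persist() is what makes storageLevel != StorageLevel.NONE, so iterator() goes through the CacheManager instead of recomputing the partition on every action.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("persist-demo").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 1000).map(_ * 2)
    rdd.persist(StorageLevel.MEMORY_AND_DISK) // sets storageLevel on this RDD
    rdd.count() // first action: partitions are computed, then cached
    rdd.count() // second action: partitions are served by the BlockManager
    sc.stop()
  }
}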
CacheManager#getOrCompute
blockManager.get(key)
Stepping into it:
def get(blockId: BlockId): Option[BlockResult] = {
  val local = getLocal(blockId)
  if (local.isDefined) {
    logInfo(s"Found block $blockId locally")
    return local
  }
  val remote = getRemote(blockId)
  if (remote.isDefined) {
    logInfo(s"Found block $blockId remotely")
    return remote
  }
  None
}
As you can see, it first tries to get the block locally, then remotely. We analyzed this in the previous section, so we won't go over it again.
If no data was found:
val storedValues = acquireLockForPartition[T](key)
This tries to fetch the data one more time. Stepping in, we find this line:
val values = blockManager.get(id)
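The reason for this second lookup: another task on the same executor may be computing the very same partition, so acquireLockForPartition waits for it and then re-checks the cache. A simplified, self-contained sketch of this "check, wait, re-check" pattern (the loading set here stands in for the CacheManager's internal bookkeeping):

import scala.collection.mutable

object DoubleCheckedCache {
  private val loading = mutable.HashSet[String]()

  def getOrCompute[T](key: String, lookup: String => Option[T], compute: => T): T = {
    lookup(key).getOrElse {
      loading.synchronized {
        // Someone else may be computing this key; wait until they finish.
        while (loading.contains(key)) loading.wait()
        lookup(key) match {
          case Some(v) => return v         // the other task cached it for us
          case None    => loading.add(key) // we take responsibility for computing it
        }
      }
      val value = compute
      loading.synchronized { loading.remove(key); loading.notifyAll() }
      value
    }
  }
}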
If the data still isn't there, we fall back to computing the partition (or reading it from a checkpoint):
val computedValues = rdd.computeOrReadCheckpoint(partition, context)
If the RDD has been checkpointed and materialized, the data is read through the checkpoint: after checkpointing, the CheckpointRDD has become this RDD's parent, so firstParent[T].iterator does the read. Otherwise the partition is recomputed:
if (isCheckpointedAndMaterialized) {
  firstParent[T].iterator(split, context)
} else {
  compute(split, context)
}
Back in CacheManager#getOrCompute, the computed values are then handed to putInBlockManager to be cached:
val cachedValues = putInBlockManager(key, computedValues, storageLevel, updatedBlocks)
if (!putLevel.useMemory) {
  // ...
} else {
What happens here depends on the storage level. The disk path is straightforward, so we skip it and look at the else branch (the memory path):
blockManager.memoryStore.unrollSafely(key, values, updatedBlocks) match {
  case Left(arr) =>
    // We have successfully unrolled the entire partition, so cache it in memory
    updatedBlocks ++=
      blockManager.putArray(key, arr, level, tellMaster = true, effectiveStorageLevel)
    arr.iterator.asInstanceOf[Iterator[T]]
  case Right(it) =>
    // There is not enough space to cache this partition in memory
    val returnValues = it.asInstanceOf[Iterator[T]]
    if (putLevel.useDisk) {
      logWarning(s"Persisting partition $key to disk instead.")
      val diskOnlyLevel = StorageLevel(useDisk = true, useMemory = false,
        useOffHeap = false, deserialized = false, putLevel.replication)
      putInBlockManager[T](key, returnValues, level, updatedBlocks, Some(diskOnlyLevel))
    } else {
      returnValues
    }
}
First, unrollSafely is called to try to materialize the data in memory. Let's step into unrollSafely:
keepUnrolling = reserveUnrollMemoryForThisTask(blockId, initialMemoryThreshold, droppedBlocks)
This reserves an initial chunk of unroll memory for the task, potentially dropping existing cached blocks to make room. Then the unrolling loop begins:
while (values.hasNext && keepUnrolling) {
  vector += values.next()
  if (elementsUnrolled % memoryCheckPeriod == 0) {
    // If our vector's size has exceeded the threshold, request more memory
    val currentSize = vector.estimateSize()
    if (currentSize >= memoryThreshold) {
      val amountToRequest = (currentSize * memoryGrowthFactor - memoryThreshold).toLong
      keepUnrolling = reserveUnrollMemoryForThisTask(
        blockId, amountToRequest, droppedBlocks)
      // New threshold is currentSize * memoryGrowthFactor
      memoryThreshold += amountToRequest
    }
  }
  elementsUnrolled += 1
}
This loop keeps appending elements to the in-memory vector; every memoryCheckPeriod elements it estimates the current size, and once the threshold is exceeded it calls reserveUnrollMemoryForThisTask to reserve more memory (evicting cached blocks if necessary).
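The periodic check (rather than estimating the size on every element) keeps the cost of estimateSize low, and over-requesting by memoryGrowthFactor avoids asking for memory too often. A self-contained sketch of the same incremental-unroll idea, where a plain counter stands in for the MemoryStore's bookkeeping:

import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

object UnrollSketch {
  val memoryCheckPeriod = 16      // estimate the size only every 16 elements
  val memoryGrowthFactor = 1.5    // over-request so we don't ask too often
  var freeMemory: Long = 1L << 20 // pretend we have 1 MB to hand out

  def reserve(bytes: Long): Boolean =
    if (bytes <= freeMemory) { freeMemory -= bytes; true } else false

  // Left(array) if the whole iterator fit in memory, Right(leftover) otherwise.
  def unrollSafely[T: ClassTag](values: Iterator[T], sizeOf: T => Long): Either[Array[T], Iterator[T]] = {
    val vector = ArrayBuffer[T]()
    var used = 0L
    var threshold = 4096L
    var keepUnrolling = reserve(threshold)
    var elementsUnrolled = 0
    while (values.hasNext && keepUnrolling) {
      val v = values.next(); vector += v; used += sizeOf(v)
      if (elementsUnrolled % memoryCheckPeriod == 0 && used >= threshold) {
        val amountToRequest = (used * memoryGrowthFactor - threshold).toLong
        keepUnrolling = reserve(amountToRequest)
        threshold += amountToRequest
      }
      elementsUnrolled += 1
    }
    if (keepUnrolling && !values.hasNext) Left(vector.toArray)
    else Right(vector.iterator ++ values)
  }
}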
After unrollSafely returns there are two branches: if the whole partition fit in memory (Left), it is cached there via putArray; if not (Right), there is a further branch:
if (putLevel.useDisk) {
  logWarning(s"Persisting partition $key to disk instead.")
  val diskOnlyLevel = StorageLevel(useDisk = true, useMemory = false,
    useOffHeap = false, deserialized = false, putLevel.replication)
  putInBlockManager[T](key, returnValues, level, updatedBlocks, Some(diskOnlyLevel))
} else {
  returnValues
}
If the storage level allows disk, the partition is written out under a disk-only level instead; if not, the iterator is returned as-is and the data is simply not cached (it will be recomputed the next time it is needed).
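From the user's point of view this is exactly why MEMORY_ONLY can silently skip partitions that don't fit, while MEMORY_AND_DISK spills them. An illustrative snippet (assuming an existing SparkContext sc):

import org.apache.spark.storage.StorageLevel

val big = sc.parallelize(1 to 10000000)
big.persist(StorageLevel.MEMORY_ONLY)
big.count() // partitions that did not fit in memory are NOT cached...
big.count() // ...so they are recomputed here

// With StorageLevel.MEMORY_AND_DISK, the Right(it) branch above would have
// written those partitions to disk, and the second count would read from disk.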
Back in CacheManager#getOrCompute:
val metrics = context.taskMetrics
val lastUpdatedBlocks = metrics.updatedBlocks.getOrElse(Seq[(BlockId, BlockStatus)]())
metrics.updatedBlocks = Some(lastUpdatedBlocks ++ updatedBlocks.toSeq)
This appends the blocks updated by this task to the task's metrics, which feeds monitoring (e.g. the storage information shown in the web UI).
---------------------
Next, let's talk about checkpoint.
First, a diagram to illustrate the checkpoint mechanism:
Differences between checkpoint and persistence:
1. Checkpoint is safer than persistence: the data goes to reliable external storage (typically HDFS), whereas persisted data lives in executor memory or on local disk and is lost if the executor dies.
2. Checkpoint changes the RDD's lineage (the checkpoint RDD becomes the new parent, truncating everything before it), while persistence leaves the lineage untouched.
An RDD that will be checkpointed should usually also be persisted, because checkpointing runs only after the job finishes: without persistence, the RDD would be recomputed from scratch just to write the checkpoint files.
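A minimal sketch of the recommended persist-then-checkpoint pattern (the checkpoint directory is illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CheckpointDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-demo").setMaster("local[*]"))
    sc.setCheckpointDir("hdfs://namenode:8020/tmp/checkpoints") // reliable storage
    val rdd = sc.parallelize(1 to 100).map(_ + 1)
    rdd.persist(StorageLevel.MEMORY_AND_DISK) // avoid recomputing when the checkpoint is written
    rdd.checkpoint()                          // only marks the RDD; nothing is written yet
    rdd.count()                // runs the job, then doCheckpoint() writes the files
    println(rdd.toDebugString) // the lineage now starts at a ReliableCheckpointRDD
    sc.stop()
  }
}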
Source code:
Find RDD#doCheckpoint:
private[spark] def doCheckpoint(): Unit = {
  RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
    if (!doCheckpointCalled) {
      doCheckpointCalled = true
      if (checkpointData.isDefined) {
        checkpointData.get.checkpoint()
      } else {
        dependencies.foreach(_.rdd.doCheckpoint())
      }
    }
  }
}
The call checkpointData.get.checkpoint() brings us to RDDCheckpointData#checkpoint:
final def checkpoint(): Unit = {
  // Guard against multiple threads checkpointing the same RDD by
  // atomically flipping the state of this RDDCheckpointData
  RDDCheckpointData.synchronized {
    if (cpState == Initialized) {
      cpState = CheckpointingInProgress
    } else {
      return
    }
  }
  // ... (the actual checkpoint work follows)
}
So doCheckpoint walks down the dependency chain, and for an RDD marked for checkpointing, checkpoint() atomically flips its state to CheckpointingInProgress so that it is checkpointed only once.
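The state machine is Initialized -> CheckpointingInProgress -> Checkpointed. A stripped-down sketch of the same guard (state names match the source, but the surrounding class is simplified, and the real code locks on the RDDCheckpointData companion object rather than on this):

sealed trait CheckpointState
case object Initialized extends CheckpointState
case object CheckpointingInProgress extends CheckpointState
case object Checkpointed extends CheckpointState

class CheckpointGuard {
  private var cpState: CheckpointState = Initialized

  def checkpointOnce(doWork: => Unit): Unit = {
    // Only the first caller flips the state; later callers return immediately.
    this.synchronized {
      if (cpState == Initialized) cpState = CheckpointingInProgress else return
    }
    doWork
    this.synchronized { cpState = Checkpointed }
  }
}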
For reliable checkpointing we end up in ReliableRDDCheckpointData#doCheckpoint:
protected override def doCheckpoint(): CheckpointRDD[T] = {
  val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)
  // Optionally clean our checkpoint files if the reference is out of scope
  if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
    rdd.context.cleaner.foreach { cleaner =>
      cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
    }
  }
  logInfo(s"Done checkpointing RDD ${rdd.id} to $cpDir, new parent is RDD ${newRDD.id}")
  newRDD
}
Here the RDD's data is written out under the checkpoint directory (cpDir) and a new ReliableCheckpointRDD is returned; that new RDD becomes the parent of the original one, which is how the lineage gets truncated.
Reading the checkpointed data back goes through the compute method of ReliableCheckpointRDD:
override def compute(split: Partition, context: TaskContext): Iterator[T] = {
  val file = new Path(checkpointPath, ReliableCheckpointRDD.checkpointFileName(split.index))
  ReliableCheckpointRDD.readCheckpointFile(file, broadcastedConf, context)
}
As you can see, at this point we are already down to the Hadoop/HDFS API: each partition is read back from its own file under the checkpoint directory.
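To make that concrete, here is a hedged sketch of what opening one checkpoint part file boils down to. Assumptions: the "part-%05d" file-name pattern (matching checkpointFileName), and deserialization is elided; roughly speaking, the real readCheckpointFile wraps this stream with the SparkEnv serializer's deserializeStream.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def openPartition(checkpointDir: String, partitionIndex: Int): java.io.InputStream = {
  val file = new Path(checkpointDir, "part-%05d".format(partitionIndex))
  val fs: FileSystem = file.getFileSystem(new Configuration())
  fs.open(file) // from here on it is plain HDFS stream I/O
}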