From reading the Spark 2.0 source code: DAGScheduler holds a variable named cacheLocs, which stores the cache locations of every RDD partition. It is defined as follows:
//org.apache.spark.scheduler.DAGScheduler
/**
* Contains the locations that each RDD's partitions are cached on. This map's keys are RDD ids
* and its values are arrays indexed by partition numbers. Each array value is the set of
* locations where that RDD partition is cached.
*
* All accesses to this map should be guarded by synchronizing on it (see SPARK-4454).
*/
private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]
The main call path through which the scheduler learns whether an RDD is cached is:
//org.apache.spark.scheduler.DAGScheduler
1)handleJobSubmitted
2)submitStage
3)getMissingParentStages
4)getCacheLocs
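Calling getCacheLocs() reads the cacheLocs variable and returns the cache locations of the RDD's partitions.
While stages are being resolved, the scheduler walks the lineage backwards from the final RDD. Once it reaches an RDD that is already cached (that is, every one of its partitions has a cache location), it stops backtracking along that branch.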
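
The backtracking rule described above can be sketched in a few lines of Scala. This is only a simplified model, not Spark's actual code: `SimpleRDD`, `registerCache` and `visitedLineage` are hypothetical names invented for this illustration, host names stand in for `TaskLocation`, and the real `getCacheLocs` additionally consults the block manager for RDDs whose storage level is not NONE.

```scala
import scala.collection.mutable

// Simplified, hypothetical model of the lookup; these are NOT Spark's real classes.
object CacheLocsSketch {
  // Stand-in for an RDD: an id, a partition count, and its parent RDDs.
  final case class SimpleRDD(id: Int, numPartitions: Int, parents: Seq[SimpleRDD] = Nil)

  // RDD id -> one entry per partition; each entry lists the hosts caching that partition.
  // An empty entry (Nil) means the partition is not cached anywhere.
  private val cacheLocs = new mutable.HashMap[Int, IndexedSeq[Seq[String]]]

  // Hypothetical helper that records cache locations (Spark obtains them from the block manager).
  def registerCache(rdd: SimpleRDD, locs: IndexedSeq[Seq[String]]): Unit =
    cacheLocs.synchronized {
      require(locs.length == rdd.numPartitions, "one entry per partition expected")
      cacheLocs(rdd.id) = locs
    }

  // All access is guarded by synchronizing on the map, as the Scaladoc above requires.
  def getCacheLocs(rdd: SimpleRDD): IndexedSeq[Seq[String]] =
    cacheLocs.synchronized {
      cacheLocs.getOrElseUpdate(rdd.id, IndexedSeq.fill(rdd.numPartitions)(Nil))
    }

  // Walk the lineage backwards from `last` and return the ids of the RDDs visited, in order.
  // Parents are explored only when the current RDD has at least one uncached partition.
  def visitedLineage(last: SimpleRDD): Seq[Int] = {
    val visited = mutable.LinkedHashSet[Int]()
    var waiting: List[SimpleRDD] = List(last)
    while (waiting.nonEmpty) {
      val rdd = waiting.head
      waiting = waiting.tail
      if (visited.add(rdd.id)) {
        val hasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
        if (hasUncachedPartitions) waiting = rdd.parents.toList ::: waiting
      }
    }
    visited.toSeq
  }
}
```

For a chain a -> b -> c, `visitedLineage(c)` visits all three RDDs while nothing is cached. After `registerCache(b, ...)` supplies a location for every partition of b, the walk stops at b and never reaches a. If only some partitions of b have a location, the walk still continues to a.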