How does Spark find out whether an RDD is cached while it builds stages?


孤影渐苍茫


Article link: https://blog.csdn.net/u013106951/article/details/52238820


Reading the Spark 2.0 source shows that DAGScheduler holds a field named cacheLocs, which stores the cache locations of every RDD partition. It is defined as follows:

  //org.apache.spark.scheduler.DAGScheduler

  /**
   * Contains the locations that each RDD's partitions are cached on.  This map's keys are RDD ids
   * and its values are arrays indexed by partition numbers. Each array value is the set of
   * locations where that RDD partition is cached.
   *
   * All accesses to this map should be guarded by synchronizing on it (see SPARK-4454).
   */
  private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]
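To make the shape of this map concrete, here is a minimal sketch that does not depend on Spark. The `Location` alias (a plain host name standing in for `TaskLocation`), the object name, the RDD ids and the host names are all assumptions made for illustration. It shows how a partition without any cache location appears as an empty sequence, which is what a `contains(Nil)` check can detect.

```scala
import scala.collection.mutable.HashMap

object CacheLocsShape {
  // Stand-in for Spark's TaskLocation; using a plain host name is an assumption of this sketch.
  type Location = String

  // Same shape as cacheLocs: RDD id -> (partition index -> locations caching that partition).
  val cacheLocs = new HashMap[Int, IndexedSeq[Seq[Location]]]

  // Hypothetical RDD 7: two partitions, both cached (partition 1 on two hosts).
  cacheLocs(7) = IndexedSeq(Seq("host-a"), Seq("host-a", "host-b"))
  // Hypothetical RDD 8: two partitions, only partition 0 is cached.
  cacheLocs(8) = IndexedSeq(Seq("host-c"), Nil)

  // A partition with no cache location is represented by an empty Seq (Nil).
  // Access is synchronized on the map, as the source comment above requires.
  def hasUncachedPartitions(rddId: Int): Boolean =
    cacheLocs.synchronized { cacheLocs(rddId).contains(Nil) }
}
```

With these sample entries, `CacheLocsShape.hasUncachedPartitions(7)` is false and `CacheLocsShape.hasUncachedPartitions(8)` is true.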

The main call path for finding out an RDD's cache status is:

//org.apache.spark.scheduler.DAGScheduler

1)handleJobSubmitted
2)submitStage
3)getMissingParentStages
4)getCacheLocs

Calling getCacheLocs() reads the cacheLocs map, and that is how the scheduler learns the cache status of an RDD.
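In the Spark 2.0 source, getCacheLocs also appears to fill in the map on first access: an RDD whose storage level is NONE gets an empty location list for every partition, and otherwise the locations are requested from the block manager master. The sketch below imitates only that lazy-fill pattern. `RddInfo`, `lookupBlockLocations`, the `lookups` counter and the generated host names are hypothetical stand-ins, not Spark APIs.

```scala
import scala.collection.mutable.HashMap

object GetCacheLocsSketch {
  // Stand-in for Spark's TaskLocation (assumption of this sketch).
  type Location = String

  // Hypothetical, simplified RDD description: id, partition count, and whether it was persisted.
  final case class RddInfo(id: Int, numPartitions: Int, persisted: Boolean)

  private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[Location]]]

  // Counts how often the hypothetical block-location lookup runs.
  var lookups = 0

  // Hypothetical stand-in for asking the block manager master where each partition is stored.
  private def lookupBlockLocations(rdd: RddInfo): IndexedSeq[Seq[Location]] = {
    lookups += 1
    IndexedSeq.tabulate(rdd.numPartitions)(i => Seq(s"host-$i"))
  }

  def getCacheLocs(rdd: RddInfo): IndexedSeq[Seq[Location]] = cacheLocs.synchronized {
    if (!cacheLocs.contains(rdd.id)) {
      val locs: IndexedSeq[Seq[Location]] =
        if (!rdd.persisted) IndexedSeq.fill(rdd.numPartitions)(Nil) // never persisted: no lookup needed
        else lookupBlockLocations(rdd)
      cacheLocs(rdd.id) = locs // remember the answer for later calls
    }
    cacheLocs(rdd.id)
  }
}
```

Calling `getCacheLocs` twice for the same persisted RDD triggers the lookup only once, because the second call is served from the map.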

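While stages are being built, the scheduler walks the RDD lineage backwards. As soon as it reaches an RDD that is known to be cached (every partition has at least one cache location), it stops walking further back along that branch.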
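The stop-at-cached-RDD behaviour can be sketched as a small backward traversal. This is a simplified illustration, not the DAGScheduler code: `Node`, `fullyCached` and `visitedIds` are hypothetical names, and the sketch ignores shuffle dependencies and stage creation entirely. It only shows that the parents of a fully cached RDD are never visited.

```scala
import scala.collection.mutable

object MissingParentsSketch {
  // Hypothetical, simplified lineage node. `parents` are the RDDs this one is computed from;
  // `fullyCached` stands in for "no partition has an empty cache-location list".
  final case class Node(id: Int, parents: Seq[Node], fullyCached: Boolean)

  // Returns the ids of all RDDs visited while walking the lineage backwards from `last`.
  def visitedIds(last: Node): Set[Int] = {
    val visited = mutable.Set[Int]()
    var waiting: List[Node] = List(last) // explicit stack instead of recursion
    while (waiting.nonEmpty) {
      val rdd = waiting.head
      waiting = waiting.tail
      if (!visited(rdd.id)) {
        visited += rdd.id
        // Only keep walking back when some partition still has to be computed.
        if (!rdd.fullyCached) waiting = rdd.parents.toList ::: waiting
      }
    }
    visited.toSet
  }
}
```

For a chain 1 <- 2 <- 3 in which RDD 2 is fully cached, a walk starting at RDD 3 visits 3 and 2 but never reaches RDD 1.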
