njzhujinhua@2017/12/15
1. RDD Implementation
1.1 Job Scheduling
When an action is run on an RDD, the scheduler examines the RDD's lineage (Lineage) and builds a directed acyclic graph (DAG) of stages, where each stage contains as many consecutive narrow-dependency transformations as possible. The scheduler then computes the stages in DAG order to produce the final RDD.
The scheduler assigns tasks to executor nodes lazily, placing each task according to where its data actually resides: instead of moving data, Spark moves the computation to the data.
What actually triggers RDD computation is an action. When an action executes, stages are delimited at wide dependencies; inside a stage all dependencies are narrow, so the transformations within a stage chain together into a pipeline.
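The stage-cutting rule above can be sketched as follows (a minimal sketch, assuming a running SparkContext `sc`; the input path is hypothetical):

```scala
// flatMap and map are narrow dependencies: they pipeline within one stage.
// reduceByKey needs a shuffle (wide dependency), so the scheduler cuts the
// DAG into two stages at that point. Nothing runs until the action.
val pairs = sc.textFile("hdfs:///data/input.txt")  // hypothetical path
  .flatMap(_.split(" "))                           // narrow: same stage
  .map(word => (word, 1))                          // narrow: same stage
val counts = pairs.reduceByKey(_ + _)              // wide: stage boundary
counts.collect()                                   // action: triggers the whole DAG
```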
1.2 Memory Management
During job execution, a lost partition can be recovered by recomputing its parent partitions; under a narrow dependency only the corresponding parent partitions are needed, but under a wide dependency data from all parent partitions must be recomputed. Moreover, every action on an RDD triggers execution of the stages it depends on, which is very bad for performance if the RDD is recomputed each time.
Spark supports three storage strategies for persisted RDDs: deserialized objects in memory, serialized data in memory, and on-disk storage. The first is clearly the fastest; the third is intended for RDDs that are too large for memory but are used repeatedly and too expensive to recompute each time.
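The three strategies map directly to StorageLevel constants (a sketch, assuming a SparkContext `sc`; note that a storage level can be assigned to an RDD only once, hence the commented alternatives):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)
rdd.persist(StorageLevel.MEMORY_ONLY)        // deserialized objects in memory (fastest)
// rdd.persist(StorageLevel.MEMORY_ONLY_SER) // serialized bytes in memory (more compact)
// rdd.persist(StorageLevel.DISK_ONLY)       // disk storage for very large RDDs
rdd.count()  // the first action materializes the persisted data
```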
Lineage can be used to rebuild an RDD after a failure, but if the lineage is long, recovery time becomes considerable; such an RDD should be saved to external storage via a checkpoint. In general, checkpointing is well worth it for an RDD with a long lineage containing wide dependencies, while checkpointing an RDD with only narrow dependencies is essentially unnecessary.
1.3 The Relationship Between cache(), persist() and checkpoint()
cache vs. persist
cache suits RDDs that are used frequently but are not too large. cache can only store to memory, whereas persist supports multiple storage levels; looking at the source, cache is simply implemented as a special case of persist with a fixed argument.
/**
 * Mark this RDD for persisting using the specified level.
 *
 * @param newLevel the target storage level
 * @param allowOverride whether to override any existing level with the new one
 */
// The private persist implementation: takes a custom storage level and a flag
// controlling whether an already-assigned level may be overridden
private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
  // TODO: Handle changes of StorageLevel
  // Current behavior: if a different level was already set and override is not
  // allowed, throw an exception
  if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
    throw new UnsupportedOperationException(
      "Cannot change storage level of an RDD after it was already assigned a level")
  }
  // If this is the first time this RDD is marked for persisting, register it
  // with the SparkContext for cleanups and accounting. Do this only once.
  if (storageLevel == StorageLevel.NONE) {
    sc.cleaner.foreach(_.registerRDDForCleanup(this))
    sc.persistRDD(this)
  }
  storageLevel = newLevel
  this
}
/**
 * Set this RDD's storage level to persist its values across operations after the first time
 * it is computed. This can only be used to assign a new storage level if the RDD does not
 * have a storage level set yet. Local checkpointing is an exception.
 */
def persist(newLevel: StorageLevel): this.type = {
  if (isLocallyCheckpointed) {
    // This means the user previously called localCheckpoint(), which should have already
    // marked this RDD for persisting. Here we should override the old storage level with
    // one that is explicitly requested by the user (after adapting it to use disk).
    persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
  } else {
    persist(newLevel, allowOverride = false)
  }
}
/**
 * Persist this RDD with the default storage level (`MEMORY_ONLY`).
 */
The no-argument persist likewise defaults to memory-only storage:
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
/**
 * Persist this RDD with the default storage level (`MEMORY_ONLY`).
 */
cache is identical to the no-argument persist:
def cache(): this.type = persist()
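Putting the pieces above together, a short usage sketch (assuming a SparkContext `sc`):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.cache()  // equivalent to rdd.persist(StorageLevel.MEMORY_ONLY)
assert(rdd.getStorageLevel == StorageLevel.MEMORY_ONLY)

// Per the private persist shown above, assigning a *different* level now
// throws UnsupportedOperationException:
// rdd.persist(StorageLevel.DISK_ONLY)  // would throw
```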
cache/persist vs. checkpoint
With cache or persist, the data is written to memory or disk directly as it is computed, whereas checkpoint waits until the job finishes and then launches a separate job to perform the checkpoint; the RDD is therefore computed twice. To avoid the redundant work, call rdd.cache() before checkpointing so that the checkpoint job can read the cached data directly.
cache saves the RDD to memory or disk while its lineage remains recorded in the RDD's dependencies, so if part of a cached RDD is lost, it can still be recomputed. checkpoint, by contrast, clears the lineage entirely once the RDD has been saved to HDFS; it can thus be used to truncate a very long lineage while storing the data durably on HDFS.
Both persist and checkpoint can write data to disk, but beyond the difference above they also differ in lifecycle. When an RDD is persisted to disk, its partitions are managed by the BlockManager; once the driver process finishes, the executor processes exit, the BlockManager shuts down, and the persisted data is cleared. Data checkpointed to HDFS or a local directory, however, survives until it is explicitly deleted by hand, and can even be read by other applications, whereas persisted data cannot be shared across applications.
The implementation of checkpoint
/**
* Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
* directory set with `SparkContext#setCheckpointDir` and all references to its parent
* RDDs will be removed. This function must be called before any job has been
* executed on this RDD. It is strongly recommended that this RDD is persisted in
* memory, otherwise saving it on a file will require recomputation.
*/
checkpoint() merely marks this RDD for checkpointing. The data is saved under the directory set via SparkContext#setCheckpointDir, and all references to parent RDDs (the lineage) are removed.
The function must be called before any job has been executed on this RDD, and it is strongly recommended to persist the RDD in memory first, otherwise it will be recomputed.
def checkpoint(): Unit = RDDCheckpointData.synchronized {
  // NOTE: we use a global lock here due to complexities downstream with ensuring
  // children RDD partitions point to the correct parent partitions. In the future
  // we should revisit this consideration.
  if (context.checkpointDir.isEmpty) {
    throw new SparkException("Checkpoint directory has not been set in the SparkContext")
  } else if (checkpointData.isEmpty) {
    checkpointData = Some(new ReliableRDDCheckpointData(this))
  }
}
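The checkpoint workflow implied by this code can be sketched as follows (assuming a SparkContext `sc`; the checkpoint directory is hypothetical):

```scala
// The checkpoint directory must be set first, or checkpoint() throws.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // hypothetical directory

val rdd = sc.parallelize(1 to 1000).map(_ + 1)
rdd.cache()       // recommended: the checkpoint job then reads the cache
rdd.checkpoint()  // only marks the RDD; must be called before any job on it
rdd.count()       // the first action runs the job, then a separate checkpoint job
```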