A Walkthrough of the Spark RDD Operator Source Code

Based on the Spark 1.5.0 source code, this article gives a brief introduction to what each Spark RDD operator does, how it works, and how it is called. Only the default operators defined on RDD itself are covered here; the extension operators on RDD[(K, V)] (that is, the methods in PairRDDFunctions) are not included and will be covered in a follow-up article.


  • ++
    Returns the RDD formed by merging the elements of the two RDDs; duplicate elements are kept.
  /**
   * Return the union of this RDD and another one. Any identical elements will appear multiple
   * times (use `.distinct()` to eliminate them).
   */
  def ++(other: RDD[T]): RDD[T] = withScope {
    this.union(other)
  }
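
    A minimal usage sketch, assuming a SparkContext sc is already available (e.g. in spark-shell):

  val a = sc.parallelize(Seq(1, 2, 3))
  val b = sc.parallelize(Seq(3, 4))
  val merged = a ++ b            // same as a.union(b)
  merged.collect()               // contains 1, 2, 3, 3, 4 -- the duplicate 3 is kept
  merged.distinct().collect()    // duplicates removed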

  • aggregate
    Aggregates the elements of the RDD; a zero value must be supplied as the initial accumulator (since elements are being accumulated, the accumulation needs a starting value). Elements are first aggregated within each partition with seqOp (folding elements of type T into a per-partition result of type U), then the per-partition results are merged across partitions with combOp, and the resulting scalar is returned.
  /**
   * Aggregate the elements of each partition, and then the results for all the partitions, using
   * given combine functions and a neutral "zero value". This function can return a different result
   * type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U
   * and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are
   * allowed to modify and return their first argument instead of creating a new U to avoid memory
   * allocation.
   */
  def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
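
    A sketch that computes the sum and the count of an RDD[Int] in a single pass, assuming a SparkContext sc is available:

  val nums = sc.parallelize(1 to 100, 4)
  val (sum, count) = nums.aggregate((0, 0))(
    (acc, x) => (acc._1 + x, acc._2 + 1),    // seqOp: fold one Int into the (sum, count) accumulator
    (a, b) => (a._1 + b._1, a._2 + b._2)     // combOp: merge two per-partition accumulators
  )
  // sum = 5050, count = 100; (0, 0) is the neutral zero value for both operations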

  • cache
    Caches the RDD in memory.
  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def cache(): this.type = persist()
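
    A sketch of caching an RDD that is reused by several actions, assuming sc is available (the input path is a placeholder):

  val lines = sc.textFile("hdfs:///path/to/input.txt")
  val words = lines.flatMap(_.split(" ")).cache()   // same as persist(StorageLevel.MEMORY_ONLY)
  words.count()   // the first action computes the partitions and caches them
  words.first()   // later actions read from the in-memory copy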

  • cartesian
    Returns an RDD whose elements are the Cartesian product of the elements of the two RDDs.
    For example, with RDD1 = {a, c} and RDD2 = {b, d}, the result RDD3 is {(a,b), (a,d), (c,b), (c,d)}.
  /**
   * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
   * elements (a, b) where a is in `this` and b is in `other`.
   */
  def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
    new CartesianRDD(sc, this, other)
  }
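
    A sketch matching the example above, assuming sc is available:

  val left  = sc.parallelize(Seq("a", "c"))
  val right = sc.parallelize(Seq("b", "d"))
  left.cartesian(right).collect()
  // Array((a,b), (a,d), (c,b), (c,d)) -- one pair per combination of left and right elements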

  • checkpoint
    Marks the RDD for checkpointing, i.e. its contents are written to a distributed file system (such as HDFS) and its dependency lineage is removed. This is a classic space-for-time trade-off: the data is stored so that, after a failure, the cost of recomputing the whole dependency chain is avoided. Note that this is not the same thing as persist.
  /**
   * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
   * directory set with `SparkContext#setCheckpointDir` and all references to its parent
   * RDDs will be removed. This function must be called before any job has been
   * executed on this RDD. It is strongly recommended that this RDD is persisted in
   * memory, otherwise saving it on a file will require recomputation.
   */
  def checkpoint(): Unit = RDDCheckpointData.synchronized {
    // NOTE: we use a global lock here due to complexities downstream with ensuring
    // children RDD partitions point to the correct parent partitions. In the future
    // we should revisit this consideration.
    if (context.checkpointDir.isEmpty) {
      throw new SparkException("Checkpoint directory has not been set in the SparkContext")
    } else if (checkpointData.isEmpty) {
      checkpointData = Some(new ReliableRDDCheckpointData(this))
    }
  }
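
    A sketch of checkpointing a long-lineage RDD, assuming sc is available and the checkpoint directory (a placeholder here) is writable:

  sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
  val data = sc.parallelize(1 to 1000000).map(_ * 2)
  data.persist()      // recommended, so the RDD is not recomputed just to write the checkpoint
  data.checkpoint()   // only marks the RDD; the files are written when the next job runs
  data.count()        // triggers a job, which also materializes the checkpoint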

  • coalesce
    Repartitions the elements of the RDD.
    When the partition count is reduced moderately, e.g. 1000 -> 100, Spark treats every 10 of the old partitions as one new partition; the parallelism becomes 100 and no shuffle is triggered.
    When the partition count is reduced drastically, e.g. 1000 -> 1, the parallelism of the computation drops to 1. To avoid this loss of parallelism, set shuffle to true: the computation before the shuffle and the computation after it then run in separate stages, with parallelism 1000 and 1 respectively.
    When the partition count is increased, a shuffle is unavoidable, so shuffle must be set to true.
  /**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   *
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * Note: With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
      : RDD[T]
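
    A sketch of the three cases described above, assuming sc is available:

  val wide = sc.parallelize(1 to 100000, 1000)
  val narrowed = wide.coalesce(100)                   // narrow dependency, no shuffle
  val single   = wide.coalesce(1, shuffle = true)     // drastic reduction: the shuffle keeps the upstream stage at 1000 tasks
  val widened  = wide.coalesce(2000, shuffle = true)  // growing the partition count requires shuffle = true
  narrowed.partitions.length                          // 100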

  • collect
    Returns an array containing all the elements of the RDD. When the RDD has a very large number of elements, the driver node may run out of memory.
  /**
   * Return an array that contains all of the elements in this RDD.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

  /**
   * Return an RDD that contains all matching values by applying `f`.
   */
  def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    filter(cleanF.isDefinedAt).map(cleanF)
  }
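
    A sketch of both overloads, assuming sc is available:

  val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
  rdd.collect()                                   // Array(1, 2, 3, 4, 5), pulled back to the driver
  rdd.collect { case x if x % 2 == 0 => x * 10 }  // RDD containing 20 and 40: filter + map via a partial function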

  • count
    Returns the number of elements in the RDD.
  /**
   * Return the number of elements in the RDD.
   */
  def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
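
    A sketch, assuming sc is available:

  sc.parallelize(Seq("a", "b", "c")).count()   // 3L: one job, partition sizes summed on the driver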

  • countByValue
    Returns the count of each distinct value in the RDD, as a local map of (value, count) pairs on the driver.
  /**
   * Return the count of each unique value in this RDD as a local map of (value, count) pairs.
   *
   * Note that this method should only be used if the resulting map is expected to be small, as
   * the whole thing is loaded into the driver's memory.
   * To handle very large results, consider using rdd.map(x => (x, 1L)).reduceByKey(_ + _), which
   * returns an RDD[T, Long] instead of a map.
   */
  def countByValue()(implicit ord: Ordering[T] = null): Map[T, Long] = withScope {
    map(value => (value, null)).countByKey()
  }
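
    A sketch, assuming sc is available; use this only when the number of distinct values is small:

  val tags = sc.parallelize(Seq("spark", "rdd", "spark", "spark"))
  tags.countByValue()                                   // Map(spark -> 3, rdd -> 1), held in driver memory
  tags.map(x => (x, 1L)).reduceByKey(_ + _).collect()   // distributed alternative for large result sets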

  • dependencies
    Returns the list of this RDD's dependencies, taking into account whether the RDD has been checkpointed (the lineage is cut at the checkpoint).
  /**
   * Get the list of dependencies of this RDD, taking into account whether the
   * RDD is checkpointed or not.
   */
  final def dependencies: Seq[Dependency[_]] = {
    checkpointRDD.map(r => List(new OneToOneDependency(r))).getOrElse {
      if (dependencies_ == null) {
        dependencies_ = getDependencies
      }
      dependencies_
    }
  }
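
    A sketch of inspecting the lineage, assuming sc is available:

  val base   = sc.parallelize(1 to 10)
  val mapped = base.map(_ + 1)
  mapped.dependencies    // a single OneToOneDependency pointing at base
  mapped.toDebugString   // the full lineage rendered as a string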

  • distinct
    Returns a new RDD containing the elements of this RDD with duplicates removed.
  /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(): RDD[T] = withScope {
    distinct(partitions.length)
  }

  def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
  }
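
    A sketch, assuming sc is available; note that distinct is built on reduceByKey and therefore shuffles:

  val dup = sc.parallelize(Seq(1, 2, 2, 3, 3, 3))
  dup.distinct().collect()   // the elements 1, 2, 3, in no particular order
  dup.distinct(2)            // same result, reduced into 2 partitions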

  • filter
    Filters the RDD, returning an RDD that contains only the elements satisfying the predicate f.
  /**
   * Return a new RDD containing only the elements that satisfy a predicate.
   */
  def filter(f: T => Boolean): RDD[T] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[T, T](
      this,
      (context, pid, iter) => iter.filter(cleanF),
      preservesPartitioning = true)
  }
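
    A sketch, assuming sc is available; the predicate is applied per element and partitioning is preserved:

  val nums = sc.parallelize(1 to 10)
  nums.filter(_ % 2 == 0).collect()   // Array(2, 4, 6, 8, 10)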

  • filterWith
    Filters the RDD, returning an RDD that contains only the elements satisfying the predicate p. The predicate takes an extra argument that is built once per partition from the partition index, so different partitions can be filtered differently.
    Deprecated; see mapPartitionsWithIndex.
  /**
   * Filters this RDD with p, where p takes an additional parameter of type A.  This
   * additional parameter is produced by constructA, which is called in each
   * partition with the index of that partition.
   */
  @deprecated("use mapPartitionsWithIndex and filter", "1.0.0")
  def filterWith[A](constructA: Int => A)(p: (T, A) => Boolean): RDD[T] = withScope {
    val cleanP = sc.clean(p)
    val cleanA = sc.clean(constructA)
    mapPartitionsWithIndex((index, iter) => {
      val a = cleanA(index)
      iter.filter(t => cleanP(t, a))
    }, preservesPartitioning = true)
  }
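
    A sketch of the deprecated call and its mapPartitionsWithIndex replacement, assuming sc is available:

  val nums = sc.parallelize(1 to 8, 4)
  // keep an element only if it is greater than its partition index
  nums.filterWith(index => index)((x, partIndex) => x > partIndex).collect()
  // the non-deprecated equivalent:
  nums.mapPartitionsWithIndex((index, iter) => iter.filter(_ > index),
    preservesPartitioning = true).collect()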

  • first
    Returns the first element of the RDD.
    Throws an exception if the RDD is empty.
  /**
   * Return the first element in this RDD.
   */
  def first(): T = withScope {
    take(1) match {
      case Array(t) => t
      case _ => throw new UnsupportedOperationException("empty collection")
    }
  }
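
    A sketch, assuming sc is available:

  sc.parallelize(Seq(10, 20, 30)).first()   // 10
  sc.parallelize(Seq.empty[Int]).first()    // throws UnsupportedOperationException: empty collection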