Spark 2.2.0 Source Code Reading (Part 1): RDD Source Code


Spark ships with a lot of RDD operators, each with its own purpose; picking the right one can get twice the result for half the effort.
Enough talk, straight to the code.
cache

 /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def cache(): JavaPairRDD[K, V] = new JavaPairRDD[K, V](rdd.cache())

persist

/**
   * Set this RDD's storage level to persist its values across operations after the first time
   * it is computed. Can only be called once on each RDD.
    * persist supports twelve storage levels; the differences between persist and cache,
    * and the individual levels, are covered in a separate article.
   */
  def persist(newLevel: StorageLevel): JavaPairRDD[K, V] =
    new JavaPairRDD[K, V](rdd.persist(newLevel))

unpersist

 /**
   * Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
   * This method blocks until all blocks are deleted.
   */
  def unpersist(): JavaPairRDD[K, V] = wrapRDD(rdd.unpersist())
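
For context, here is a minimal usage sketch of cache/persist/unpersist (my own example, not part of the Spark source), written against the Scala RDD API and assuming a local SparkContext named sc:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.storage.StorageLevel

  val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  pairs.cache()      // shorthand for persist(StorageLevel.MEMORY_ONLY)
  pairs.count()      // the first action materializes and caches the partitions
  pairs.count()      // served from the cached blocks
  pairs.unpersist()  // drop the cached blocks from memory and disk

  // a different RDD can pick any of the other storage levels
  val onDisk = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_AND_DISK)

The later sketches in this post reuse this sc.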

distinct

 /**
   * Return a new RDD containing the distinct elements in this RDD.
    * In short: deduplication.
   */
  def distinct(): JavaPairRDD[K, V] = new JavaPairRDD[K, V](rdd.distinct())

filter

/**
   * Return a new RDD containing only the elements that satisfy a predicate.
    * In other words, drop the elements that do not satisfy the predicate.
   */
  def filter(f: JFunction[(K, V), java.lang.Boolean]): JavaPairRDD[K, V] =
    new JavaPairRDD[K, V](rdd.filter(x => f.call(x).booleanValue()))
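
A quick filter sketch (my own example), reusing the sc from the cache/persist sketch above:

  val nums = sc.parallelize(Seq(("a", 1), ("b", -2), ("c", 3)))
  val positive = nums.filter { case (_, v) => v > 0 }   // keep only pairs with a positive value
  positive.collect()                                    // Array(("a", 1), ("c", 3))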

coalesce

 /**
   * Return a new RDD that is reduced into `numPartitions` partitions.
    * coalesce is similar to repartition but not identical; the concrete differences
    * between the two are covered in a separate article.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean): JavaPairRDD[K, V] =
    fromRDD(rdd.coalesce(numPartitions, shuffle))

repartition

  /**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
    *
    * Optimization note: why repartition at all? Load balancing.
    * Example: an Elasticsearch index with 5 primary shards is read in as 5 partitions:
    *   A1 (100 rows), A2 (100 rows), A3 (600 rows), A4 (100 rows), A5 (100 rows)
    * Processed as-is they take roughly 1s, 1s, 100s, 1s, 1s: A3 takes about 100x longer
    * than the others and drags down the whole job.
    * After repartitioning (the partition count can go up or down, depending on the case),
    * e.g. into 4 partitions of 250 rows each, every partition finishes in about 1.5s.
    * In short, repartition scatters the data and redistributes it evenly; it is also one
    * way to fight data skew. (See the usage sketch after this block.)
   */
  def repartition(numPartitions: Int): JavaPairRDD[K, V] = fromRDD(rdd.repartition(numPartitions))
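
To make the load-balancing note above concrete, a small sketch (my own example) contrasting repartition and coalesce; the partition counts mirror the 5-shard example in the comment:

  val skewed = sc.parallelize(1 to 1000, 5)   // 5 input partitions, possibly very uneven in practice
  skewed.getNumPartitions                     // 5

  // repartition always shuffles, so the data is spread evenly over the new partitions
  val balanced = skewed.repartition(4)

  // coalesce(n) with shuffle = false only merges existing partitions and avoids a shuffle;
  // prefer it when you are only reducing the partition count and even sizes are not critical
  val merged = skewed.coalesce(2)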

sample

  /**
   * Return a sampled subset of this RDD.
    * In short: randomly draw a sample of the RDD's elements.
   */
  def sample(withReplacement: Boolean, fraction: Double): JavaPairRDD[K, V] =
    sample(withReplacement, fraction, Utils.random.nextLong)

sampleByKey

/**
   * Return a subset of this RDD sampled by key (via stratified sampling).
   *
   * Create a sample of this RDD using variable sampling rates for different keys as specified by
   * `fractions`, a key to sampling rate map, via simple random sampling with one pass over the
   * RDD, to produce a sample of size that's approximately equal to the sum of
   * math.ceil(numItems * samplingRate) over all key values.
   */
  def sampleByKey(withReplacement: Boolean,
      fractions: java.util.Map[K, jl.Double],
      seed: Long): JavaPairRDD[K, V] =
    new JavaPairRDD[K, V](rdd.sampleByKey(
      withReplacement,
      fractions.asScala.mapValues(_.toDouble).toMap, // map to Scala Double; toMap to serialize
      seed))
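
A stratified-sampling sketch (my own example) using the Scala RDD API's sampleByKey, where the fractions map gives a sampling rate per key:

  val byKey = sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5)))
  // sample roughly 50% of the "a" records and all of the "b" records
  val fractions = Map("a" -> 0.5, "b" -> 1.0)
  val sampled = byKey.sampleByKey(false, fractions, 42L)   // withReplacement = false, fixed seed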

union

/**
   * Return the union of this RDD and another one. Any identical elements will appear multiple
   * times (use `.distinct()` to eliminate them).
   */
  def union(other: JavaPairRDD[K, V]): JavaPairRDD[K, V] =
    new JavaPairRDD[K, V](rdd.union(other.rdd))

intersection

  /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * @note This method performs a shuffle internally.
   */
  def intersection(other: JavaPairRDD[K, V]): JavaPairRDD[K, V] =
    new JavaPairRDD[K, V](rdd.intersection(other.rdd))

first

  // first() has to be overridden here so that the generated method has the signature
  // 'public scala.Tuple2 first()'; if the trait's definition is used,
  // then the method has the signature 'public java.lang.Object first()',
  // causing NoSuchMethodErrors at runtime.
  // Simply put: return the first (K, V) element of the RDD.
  override def first(): (K, V) = rdd.first()

combineByKey

 /**
   * Simplified version of combineByKey that hash-partitions the output RDD and uses map-side
   * aggregation.
    *
    * In short, the values of a (K, V) RDD are combined per key; reduceByKey and groupByKey
    * are both built on top of this operator.
    *
   */
  def combineByKey[C](createCombiner: JFunction[V, C],
      mergeValue: JFunction2[C, V, C],
      mergeCombiners: JFunction2[C, C, C],
      numPartitions: Int): JavaPairRDD[K, C] =
    combineByKey(createCombiner, mergeValue, mergeCombiners, new HashPartitioner(numPartitions))
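
The classic per-key average is a good way to see the three functions in action. A sketch (my own example) using the Scala RDD API, where the combiners are plain Scala functions:

  val scores = sc.parallelize(Seq(("a", 90.0), ("a", 70.0), ("b", 80.0)))
  val sumCount = scores.combineByKey(
    (v: Double) => (v, 1),                                              // createCombiner: first value seen in a partition
    (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold a value into a partition-local combiner
    (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners: merge combiners across partitions
  )
  val avg = sumCount.mapValues { case (sum, n) => sum / n }             // ("a", 80.0), ("b", 80.0)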

reduceByKey

 /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce.
    *
    * In short, values that share the same key are merged with the given function. It differs
    * from groupByKey in several ways; the concrete differences are covered in a separate article.
    *
   */
  def reduceByKey(partitioner: Partitioner, func: JFunction2[V, V, V]): JavaPairRDD[K, V] =
    fromRDD(rdd.reduceByKey(partitioner, func))
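
A word-count style sketch (my own example) of reduceByKey:

  val words = sc.parallelize(Seq("spark", "rdd", "spark"))
  val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // values with the same key are summed
  counts.collect()                                         // Array(("spark", 2), ("rdd", 1))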

reduceByKeyLocally

/**
   * Merge the values for each key using an associative and commutative reduce function, but return
   * the result immediately to the master as a Map. This will also perform the merging locally on
   * each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
   */
  def reduceByKeyLocally(func: JFunction2[V, V, V]): java.util.Map[K, V] =
    mapAsSerializableJavaMap(rdd.reduceByKeyLocally(func))

countByKey

/** Count the number of elements for each key, and return the result to the master as a Map. */
  def countByKey(): java.util.Map[K, jl.Long] =
    mapAsSerializableJavaMap(rdd.countByKey()).asInstanceOf[java.util.Map[K, jl.Long]]

countByKeyApprox

/**
   * Approximate version of countByKey that can return a partial result if it does
   * not finish within a timeout.
    *
    * In short, the same as countByKey but with a time limit added.
   */
  def countByKeyApprox(timeout: Long): PartialResult[java.util.Map[K, BoundedDouble]] =
    rdd.countByKeyApprox(timeout).map(mapAsSerializableJavaMap)

aggregateByKey

 /**
   * Aggregate the values of each key, using given combine functions and a neutral "zero value".
   * This function can return a different result type, U, than the type of the values in this RDD,
   * V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
   * as in scala.TraversableOnce. The former operation is used for merging values within a
   * partition, and the latter is used for merging values between partitions. To avoid memory
   * allocation, both of these functions are allowed to modify and return their first argument
   * instead of creating a new U.
   */
  def aggregateByKey[U](zeroValue: U, partitioner: Partitioner, seqFunc: JFunction2[U, V, U],
      combFunc: JFunction2[U, U, U]): JavaPairRDD[K, U] = {
    implicit val ctag: ClassTag[U] = fakeClassTag
    fromRDD(rdd.aggregateByKey(zeroValue, partitioner)(seqFunc, combFunc))
  }
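
A sketch (my own example) that tracks a (sum, count) pair per key, mirroring the combineByKey example but starting from an explicit zero value:

  val marks = sc.parallelize(Seq(("a", 90), ("a", 70), ("b", 80)))
  val sumCount = marks.aggregateByKey((0, 0))(
    (acc, v) => (acc._1 + v, acc._2 + 1),     // seqFunc: merge a value into the per-partition accumulator
    (x, y)   => (x._1 + y._1, x._2 + y._2)    // combFunc: merge accumulators from different partitions
  )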

groupByKey

 /**
   * Group the values for each key in the RDD into a single sequence. Allows controlling the
   * partitioning of the resulting key-value pair RDD by passing a Partitioner.
   *
   * @note If you are grouping in order to perform an aggregation (such as a sum or average) over
   * each key, using `JavaPairRDD.reduceByKey` or `JavaPairRDD.combineByKey`
   * will provide much better performance.
   */
  def groupByKey(partitioner: Partitioner): JavaPairRDD[K, JIterable[V]] =
    fromRDD(groupByResultToJava(rdd.groupByKey(partitioner)))

subtract

 /**
   * Return an RDD with the elements from `this` that are not in `other`.
   *
   * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
   * RDD will be <= us.
    *
    * In short, the difference of the two RDDs: the pairs that are in `this` but not in `other`.
   */
  def subtract(other: JavaPairRDD[K, V]): JavaPairRDD[K, V] =
    fromRDD(rdd.subtract(other))

subtractByKey

/**
   * Return an RDD with the pairs from `this` whose keys are not in `other`.
    * In short, the difference by key: pairs from `this` whose key does not appear in `other`.
   *
   * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
   * RDD will be <= us.
   */
  def subtractByKey[W](other: JavaPairRDD[K, W]): JavaPairRDD[K, V] = {
    implicit val ctag: ClassTag[W] = fakeClassTag
    fromRDD(rdd.subtractByKey(other))
  }
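
A sketch (my own example) contrasting subtract, which compares whole pairs, with subtractByKey, which compares keys only:

  val left  = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
  val right = sc.parallelize(Seq(("b", 2), ("c", 99)))

  left.subtract(right).collect()        // ("a",1), ("c",3)  -- ("c",3) stays because the whole pair differs
  left.subtractByKey(right).collect()   // ("a",1)           -- only keys absent from `right` survive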

partitionBy

 /**
   * Return a copy of the RDD partitioned using the specified partitioner.
   */
  def partitionBy(partitioner: Partitioner): JavaPairRDD[K, V] =
    fromRDD(rdd.partitionBy(partitioner))

join

 /**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
    *
    * In short, this works like a SQL join: records with the same key are matched up. If the first
    * RDD contains (keyA, valueA) and the second contains (keyA, valueB), the result contains the
    * tuple (keyA, (valueA, valueB)).
    *
   */
  def join[W](other: JavaPairRDD[K, W], partitioner: Partitioner): JavaPairRDD[K, (V, W)] =
    fromRDD(rdd.join(other, partitioner))
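
A small join sketch (my own example); only keys present on both sides appear in the result:

  val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
  val orders = sc.parallelize(Seq((1, "book"), (1, "pen"), (3, "lamp")))

  users.join(orders).collect()
  // Array((1, ("alice", "book")), (1, ("alice", "pen")))  -- keys 2 and 3 are dropped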

leftOuterJoin

 /**
   * Perform a left outer join of `this` and `other`. For each element (k, v) in `this`, the
   * resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`, or the
   * pair (k, (v, None)) if no elements in `other` have key k. Uses the given Partitioner to
   * partition the output RDD.
    *
    * In short, a left outer join.
   */
  def leftOuterJoin[W](other: JavaPairRDD[K, W], partitioner: Partitioner)
  : JavaPairRDD[K, (V, Optional[W])] = {
    val joinResult = rdd.leftOuterJoin(other, partitioner)
    fromRDD(joinResult.mapValues{case (v, w) => (v, JavaUtils.optionToOptional(w))})
  }

rightOuterJoin

  /**
   * Perform a right outer join of `this` and `other`. For each element (k, w) in `other`, the
   * resulting RDD will either contain all pairs (k, (Some(v), w)) for v in `this`, or the
   * pair (k, (None, w)) if no elements in `this` have key k. Uses the given Partitioner to
   * partition the output RDD.
    *
    * In short, a right outer join.
   */
  def rightOuterJoin[W](other: JavaPairRDD[K, W], partitioner: Partitioner)
  : JavaPairRDD[K, (Optional[V], W)] = {
    val joinResult = rdd.rightOuterJoin(other, partitioner)
    fromRDD(joinResult.mapValues{case (v, w) => (JavaUtils.optionToOptional(v), w)})
  }

fullOuterJoin

/**
   * Perform a full outer join of `this` and `other`. For each element (k, v) in `this`, the
   * resulting RDD will either contain all pairs (k, (Some(v), Some(w))) for w in `other`, or
   * the pair (k, (Some(v), None)) if no elements in `other` have key k. Similarly, for each
   * element (k, w) in `other`, the resulting RDD will either contain all pairs
   * (k, (Some(v), Some(w))) for v in `this`, or the pair (k, (None, Some(w))) if no elements
   * in `this` have key k. Uses the given Partitioner to partition the output RDD.
    *
    * In short, a full outer join.
    *
   */
  def fullOuterJoin[W](other: JavaPairRDD[K, W], partitioner: Partitioner)
  : JavaPairRDD[K, (Optional[V], Optional[W])] = {
    val joinResult = rdd.fullOuterJoin(other, partitioner)
    fromRDD(joinResult.mapValues{ case (v, w) =>
      (JavaUtils.optionToOptional(v), JavaUtils.optionToOptional(w))
    })
  }
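
Continuing the join sketch above (my own example): with the Scala RDD API the missing side comes back as an Option, whereas the Java API shown here wraps it in Optional:

  users.leftOuterJoin(orders).collect()
  // (1,("alice",Some("book"))), (1,("alice",Some("pen"))), (2,("bob",None))

  users.fullOuterJoin(orders).collect()
  // additionally contains (3,(None,Some("lamp"))) for the key that exists only in `orders`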

collectAsMap

  /**
   * Return the key-value pairs in this RDD to the master as a Map.
   *
   * @note this method should only be used if the resulting data is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collectAsMap(): java.util.Map[K, V] = mapAsSerializableJavaMap(rdd.collectAsMap())

mapValues

 /**
   * Pass each value in the key-value pair RDD through a map function without changing the keys;
   * this also retains the original RDD's partitioning.
   */
  def mapValues[U](f: JFunction[V, U]): JavaPairRDD[K, U] = {
    implicit val ctag: ClassTag[U] = fakeClassTag
    fromRDD(rdd.mapValues(f))
  }
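
A one-liner sketch (my own example) of mapValues; the keys and the partitioning are left untouched:

  val prices = sc.parallelize(Seq(("book", 10.0), ("pen", 2.0)))
  val withTax = prices.mapValues(_ * 1.1)   // roughly ("book", 11.0), ("pen", 2.2)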

saveAsHadoopFile

 /** Output the RDD to any Hadoop-supported file system. */
  def saveAsHadoopFile[F <: OutputFormat[_, _]](
      path: String,
      keyClass: Class[_],
      valueClass: Class[_],
      outputFormatClass: Class[F],
      conf: JobConf) {
    rdd.saveAsHadoopFile(path, keyClass, valueClass, outputFormatClass, conf)
  }

saveAsNewAPIHadoopDataset

 /**
   * Output the RDD to any Hadoop-supported storage system, using
   * a Configuration object for that storage system.
   */
  def saveAsNewAPIHadoopDataset(conf: Configuration) {
    rdd.saveAsNewAPIHadoopDataset(conf)
  }

saveAsHadoopDataset

 /**
   * Output the RDD to any Hadoop-supported storage system, using a Hadoop JobConf object for
   * that storage system. The JobConf should set an OutputFormat and any output paths required
   * (e.g. a table name to write to) in the same way as it would be configured for a Hadoop
   * MapReduce job.
   */
  def saveAsHadoopDataset(conf: JobConf) {
    rdd.saveAsHadoopDataset(conf)
  }

repartitionAndSortWithinPartitions

/**
   * Repartition the RDD according to the given partitioner and, within each resulting partition,
   * sort records by their keys.
   *
   * This is more efficient than calling `repartition` and then sorting within each partition
   * because it can push the sorting down into the shuffle machinery.
   */
  def repartitionAndSortWithinPartitions(partitioner: Partitioner): JavaPairRDD[K, V] = {
    val comp = com.google.common.collect.Ordering.natural().asInstanceOf[Comparator[K]]
    repartitionAndSortWithinPartitions(partitioner, comp)
  }
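
A sketch (my own example): shuffle into two partitions and sort by key inside each partition in a single pass:

  import org.apache.spark.HashPartitioner

  val data = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b"), (1, "z")))
  val sorted = data.repartitionAndSortWithinPartitions(new HashPartitioner(2))
  // each output partition is internally ordered by key; no extra sortByKey pass is needed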

sortByKey

 /**
   * Sort the RDD by key, so that each partition contains a sorted range of the elements in
   * ascending order. Calling `collect` or `save` on the resulting RDD will return or output an
   * ordered list of records (in the `save` case, they will be written to multiple `part-X` files
   * in the filesystem, in order of the keys).
   */
  def sortByKey(): JavaPairRDD[K, V] = sortByKey(true)
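
A sortByKey sketch (my own example); ascending order is the default, pass false for descending:

  val kv = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
  kv.sortByKey().collect()                    // ("a",1), ("b",2), ("c",3)
  kv.sortByKey(ascending = false).collect()   // ("c",3), ("b",2), ("a",1)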

keys

  /**
   * Return an RDD with the keys of each tuple.
   */
  def keys(): JavaRDD[K] = JavaRDD.fromRDD[K](rdd.map(_._1))

values

/**
   * Return an RDD with the values of each tuple.
   */
  def values(): JavaRDD[V] = JavaRDD.fromRDD[V](rdd.map(_._2))

countApproxDistinctByKey

 /**
   * Return approximate number of distinct values for each key in this RDD.
   *
   * The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice:
   * Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", available
   * <a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>.
   *
   * @param relativeSD Relative accuracy. Smaller values create counters that require more space.
   *                   It must be greater than 0.000017.
   * @param partitioner partitioner of the resulting RDD.
   */
  def countApproxDistinctByKey(relativeSD: Double, partitioner: Partitioner)
  : JavaPairRDD[K, jl.Long] = {
    fromRDD(rdd.countApproxDistinctByKey(relativeSD, partitioner)).
      asInstanceOf[JavaPairRDD[K, jl.Long]]
  }

That covers these RDD APIs. If anything is unclear, please leave a comment.
