Spark 2.2.0 Source Code Walkthrough (1): RDD Source Code
Spark's RDD API offers a host of operations, each with its own purpose; using the right one gets twice the result for half the effort. Without further ado, straight to the code: the snippets below are from JavaPairRDD.scala in Spark 2.2.0, and each method is followed by a short usage sketch. The sketches use the plain Scala RDD API (which these Java wrappers delegate to) and assume an existing SparkContext `sc` plus made-up sample data.
cache
/**
* Persist this RDD with the default storage level (`MEMORY_ONLY`).
*/
def cache(): JavaPairRDD[K, V] = new JavaPairRDD[K, V](rdd.cache())
persist
/**
* Set this RDD's storage level to persist its values across operations after the first time
* it is computed. Can only be called once on each RDD.
* persist supports twelve storage levels; I will cover the difference between persist and cache, and the twelve levels themselves, in another article.
*/
def persist(newLevel: StorageLevel): JavaPairRDD[K, V] =
new JavaPairRDD[K, V](rdd.persist(newLevel))
unpersist
/**
* Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
* This method blocks until all blocks are deleted.
*/
def unpersist(): JavaPairRDD[K, V] = wrapRDD(rdd.unpersist())
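A minimal lifecycle sketch covering cache, persist, and unpersist (like all the sketches below, it uses the Scala RDD API, an assumed `sc`, and made-up data):

import org.apache.spark.storage.StorageLevel

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
pairs.persist(StorageLevel.MEMORY_AND_DISK) // explicit level; cache() is shorthand for persist(MEMORY_ONLY)
pairs.count()                               // the first action materializes the persisted blocks
pairs.unpersist()                           // removes the blocks from memory and disk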
distinct
/**
* Return a new RDD containing the distinct elements in this RDD.
* In short: deduplication.
*/
def distinct(): JavaPairRDD[K, V] = new JavaPairRDD[K, V](rdd.distinct())
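For example:

val nums = sc.parallelize(Seq(1, 2, 2, 3, 3, 3))
nums.distinct().collect() // => 1, 2, 3 (order not guaranteed)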
filter
/**
* Return a new RDD containing only the elements that satisfy a predicate.
* In other words, elements that fail the predicate are filtered out.
*/
def filter(f: JFunction[(K, V), java.lang.Boolean]): JavaPairRDD[K, V] =
new JavaPairRDD[K, V](rdd.filter(x => f.call(x).booleanValue()))
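For example, keeping only pairs whose value passes the predicate:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
pairs.filter { case (_, v) => v >= 2 }.collect() // => ("b", 2), ("c", 3)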
coalesce
/**
* Return a new RDD that is reduced into `numPartitions` partitions.
* coalesce and repartition are similar but not identical; I will explain their concrete differences in another article.
*/
def coalesce(numPartitions: Int, shuffle: Boolean): JavaPairRDD[K, V] =
fromRDD(rdd.coalesce(numPartitions, shuffle))
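A sketch of the no-shuffle path:

val wide = sc.parallelize(1 to 100, 10)        // 10 partitions
val narrow = wide.coalesce(2, shuffle = false) // merges down to 2 partitions without a shuffle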
repartition
/**
* Return a new RDD that has exactly numPartitions partitions.
*
* Can increase or decrease the level of parallelism in this RDD. Internally, this uses
* a shuffle to redistribute data.
*
* If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
* which can avoid performing a shuffle.
*
* Optimization note: repartitioning.
* Why repartition? For load balancing.
* Scenario: an Elasticsearch index has 5 primary shards, so the RDD read from it defaults to 5 partitions:
* A1 (100 rows), A2 (100 rows), A3 (600 rows), A4 (100 rows), A5 (100 rows)
* Processing as-is: 1s, 1s, 100s, 1s, 1s per partition.
* Problem: A3 holds far more data, takes 100x as long as the others, and drags down the whole job.
* After repartitioning (more or fewer partitions, depending on the case): A1 (250 rows), A2 (250 rows), A3 (250 rows), A4 (250 rows)
* Processing: 1.5s, 1.5s, 1.5s, 1.5s.
* Put simply, repartitioning scatters the data and redistributes it; it is also one way to mitigate data skew.
*/
def repartition(numPartitions: Int): JavaPairRDD[K, V] = fromRDD(rdd.repartition(numPartitions))
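A sketch of the skew scenario above, with the uneven input merely simulated:

val skewed = sc.parallelize(1 to 1000, 5) // imagine one partition holding most of the rows
val even = skewed.repartition(4)          // full shuffle: rows are spread evenly across 4 partitions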
sample
/**
* Return a sampled subset of this RDD.
* In short: randomly draws a sample of the RDD's data.
*/
def sample(withReplacement: Boolean, fraction: Double): JavaPairRDD[K, V] =
sample(withReplacement, fraction, Utils.random.nextLong)
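For example, drawing roughly 10% of the elements:

val data = sc.parallelize(1 to 100)
data.sample(withReplacement = false, fraction = 0.1).collect() // ~10 random elements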
sampleByKey
/**
* Return a subset of this RDD sampled by key (via stratified sampling).
*
* Create a sample of this RDD using variable sampling rates for different keys as specified by
* `fractions`, a key to sampling rate map, via simple random sampling with one pass over the
* RDD, to produce a sample of size that's approximately equal to the sum of
* math.ceil(numItems * samplingRate) over all key values.
*
*/
def sampleByKey(withReplacement: Boolean,
fractions: java.util.Map[K, jl.Double],
seed: Long): JavaPairRDD[K, V] =
new JavaPairRDD[K, V](rdd.sampleByKey(
withReplacement,
fractions.asScala.mapValues(_.toDouble).toMap, // map to Scala Double; toMap to serialize
seed))
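A stratified-sampling sketch with made-up per-key rates:

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4)))
val fractions = Map("a" -> 0.5, "b" -> 1.0) // sampling rate per key
pairs.sampleByKey(false, fractions)         // withReplacement = false; ~50% of "a" pairs, all "b" pairs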
union
/**
* Return the union of this RDD and another one. Any identical elements will appear multiple
* times (use `.distinct()` to eliminate them).
*/
def union(other: JavaPairRDD[K, V]): JavaPairRDD[K, V] =
new JavaPairRDD[K, V](rdd.union(other.rdd))
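For example:

val a = sc.parallelize(Seq(1, 2, 3))
val b = sc.parallelize(Seq(3, 4))
a.union(b).collect()            // => 1, 2, 3, 3, 4 (duplicates kept)
a.union(b).distinct().collect() // => 1, 2, 3, 4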
intersection
/**
* Return the intersection of this RDD and another one. The output will not contain any duplicate
* elements, even if the input RDDs did.
*
* @note This method performs a shuffle internally.
*/
def intersection(other: JavaPairRDD[K, V]): JavaPairRDD[K, V] =
new JavaPairRDD[K, V](rdd.intersection(other.rdd))
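For example:

val a = sc.parallelize(Seq(1, 2, 2, 3))
val b = sc.parallelize(Seq(2, 3, 4))
a.intersection(b).collect() // => 2, 3 (deduplicated, at the cost of a shuffle)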
first
// first() has to be overridden here so that the generated method has the signature
// 'public scala.Tuple2 first()'; if the trait's definition is used,
// then the method has the signature 'public java.lang.Object first()',
// causing NoSuchMethodErrors at runtime.
// Returns the first element of the RDD.
override def first(): (K, V) = rdd.first()
combineByKey
/**
* Simplified version of combineByKey that hash-partitions the output RDD and uses map-side
* aggregation.
*
* It merges the values of key-value pairs by key; reduceByKey and groupByKey are both built on this operator.
*
*/
def combineByKey[C](createCombiner: JFunction[V, C],
mergeValue: JFunction2[C, V, C],
mergeCombiners: JFunction2[C, C, C],
numPartitions: Int): JavaPairRDD[K, C] =
combineByKey(createCombiner, mergeValue, mergeCombiners, new HashPartitioner(numPartitions))
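A classic use is a per-key average, carrying a (sum, count) combiner through the three functions:

val scores = sc.parallelize(Seq(("a", 90), ("a", 70), ("b", 80)))
val sumCount = scores.combineByKey(
  (v: Int) => (v, 1),                                           // createCombiner: first value seen in a partition
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold another value in, within a partition
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)) // mergeCombiners: merge results across partitions
sumCount.mapValues { case (sum, n) => sum.toDouble / n }.collect() // => ("a", 80.0), ("b", 80.0)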
reduceByKey
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce.
*
* Given key-value pairs, this combines the values of all elements sharing the same key. It differs
* from groupByKey in several ways; I will cover the specifics in another article.
*
*/
def reduceByKey(partitioner: Partitioner, func: JFunction2[V, V, V]): JavaPairRDD[K, V] =
fromRDD(rdd.reduceByKey(partitioner, func))
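The canonical word count:

val words = sc.parallelize(Seq("spark", "rdd", "spark"))
words.map(w => (w, 1)).reduceByKey(_ + _).collect() // => ("spark", 2), ("rdd", 1)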
reduceByKeyLocally
/**
* Merge the values for each key using an associative and commutative reduce function, but return
* the result immediately to the master as a Map. This will also perform the merging locally on
* each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
*/
def reduceByKeyLocally(func: JFunction2[V, V, V]): java.util.Map[K, V] =
mapAsSerializableJavaMap(rdd.reduceByKeyLocally(func))
countByKey
/** Count the number of elements for each key, and return the result to the master as a Map.
*/
def countByKey(): java.util.Map[K, jl.Long] =
mapAsSerializableJavaMap(rdd.countByKey()).asInstanceOf[java.util.Map[K, jl.Long]]
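Both reduceByKeyLocally (above) and countByKey hand a plain Map back to the driver instead of a new RDD:

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
pairs.countByKey()              // => Map(a -> 2, b -> 1): number of elements per key
pairs.reduceByKeyLocally(_ + _) // => Map(a -> 3, b -> 3): values merged per key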
countByKeyApprox
/**
* Approximate version of countByKey that can return a partial result if it does
* not finish within a timeout.
*
* In short: countByKey with a time limit.
*/
def countByKeyApprox(timeout: Long): PartialResult[java.util.Map[K, BoundedDouble]] =
rdd.countByKeyApprox(timeout).map(mapAsSerializableJavaMap)
aggregateByKey
/**
* Aggregate the values of each key, using given combine functions and a neutral "zero value".
* This function can return a different result type, U, than the type of the values in this RDD,
* V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
* as in scala.TraversableOnce. The former operation is used for merging values within a
* partition, and the latter is used for merging values between partitions. To avoid memory
* allocation, both of these functions are allowed to modify and return their first argument
* instead of creating a new U.
*/
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner, seqFunc: JFunction2[U, V, U],
combFunc: JFunction2[U, U, U]): JavaPairRDD[K, U] = {
implicit val ctag: ClassTag[U] = fakeClassTag
fromRDD(rdd.aggregateByKey(zeroValue, partitioner)(seqFunc, combFunc))
}
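A sketch where the result type U differs from the value type V, collecting the distinct values of each key into a Set:

val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)))
val distinctVals = pairs.aggregateByKey(Set.empty[Int])( // zero value of type U = Set[Int], while V = Int
  (set, v) => set + v,  // seqFunc: merge a V into a U within a partition
  (s1, s2) => s1 ++ s2) // combFunc: merge two U's across partitions
distinctVals.collect()  // => ("a", Set(1, 2)), ("b", Set(3))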
groupByKey
/**
* Group the values for each key in the RDD into a single sequence. Allows controlling the
* partitioning of the resulting key-value pair RDD by passing a Partitioner.
*
* @note If you are grouping in order to perform an aggregation (such as a sum or average) over
* each key, using `JavaPairRDD.reduceByKey` or `JavaPairRDD.combineByKey`
* will provide much better performance.
*/
def groupByKey(partitioner: Partitioner): JavaPairRDD[K, JIterable[V]] =
fromRDD(groupByResultToJava(rdd.groupByKey(partitioner)))
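For example (and note the performance caveat above):

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
pairs.groupByKey().mapValues(_.toList).collect() // => ("a", List(1, 2)), ("b", List(3))
pairs.reduceByKey(_ + _)                         // for a sum, this ships far less data over the network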
subtract
/**
* Return an RDD with the elements from `this` that are not in `other`.
*
* Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
* RDD will be <= us.
*
* In short: the set difference of two RDDs (elements of `this` that are not in `other`).
*/
def subtract(other: JavaPairRDD[K, V]): JavaPairRDD[K, V] =
fromRDD(rdd.subtract(other))
subtractByKey
/**
* Return an RDD with the pairs from `this` whose keys are not in `other`.
* In other words: keeps the pairs whose keys appear in `this` (A) but not in `other` (B).
*
* Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
* RDD will be <= us.
*/
def subtractByKey[W](other: JavaPairRDD[K, W]): JavaPairRDD[K, V] = {
implicit val ctag: ClassTag[W] = fakeClassTag
fromRDD(rdd.subtractByKey(other))
}
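A sketch of both subtract variants:

val a = sc.parallelize(Seq(("k1", 1), ("k2", 2)))
val b = sc.parallelize(Seq(("k2", 2), ("k3", 3)))
a.subtract(b).collect()      // whole (key, value) pairs in `a` but not in `b` => ("k1", 1)
val c = sc.parallelize(Seq(("k2", "x")))
a.subtractByKey(c).collect() // pairs whose key is in `a` but not in `c` => ("k1", 1)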
partitionBy
/**
* Return a copy of the RDD partitioned using the specified partitioner.
*/
def partitionBy(partitioner: Partitioner): JavaPairRDD[K, V] =
fromRDD(rdd.partitionBy(partitioner))
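For example, hash-partitioning into 4 partitions:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val partitioned = pairs.partitionBy(new HashPartitioner(4)) // each key lands in partition hash(key) % 4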
join
/**
* Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
* pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
* (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
*
* This works like a join query in MySQL: elements whose keys match are connected. If the first RDD
* contains (keyA, valueA) and the second contains (keyA, valueB), the result contains the tuple
* (keyA, (valueA, valueB)).
*
*/
def join[W](other: JavaPairRDD[K, W], partitioner: Partitioner): JavaPairRDD[K, (V, W)] =
fromRDD(rdd.join(other, partitioner))
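Continuing the SQL analogy, with made-up tables:

val users = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val orders = sc.parallelize(Seq((1, "book"), (1, "pen")))
users.join(orders).collect() // => (1, ("alice", "book")), (1, ("alice", "pen")); key 2 has no match and is dropped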
leftOuterJoin
/**
* Perform a left outer join of `this` and `other`. For each element (k, v) in `this`, the
* resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`, or the
* pair (k, (v, None)) if no elements in `other` have key k. Uses the given Partitioner to
* partition the output RDD.
*
* A left outer join.
*/
def leftOuterJoin[W](other: JavaPairRDD[K, W], partitioner: Partitioner)
: JavaPairRDD[K, (V, Optional[W])] = {
val joinResult = rdd.leftOuterJoin(other, partitioner)
fromRDD(joinResult.mapValues{case (v, w) => (v, JavaUtils.optionToOptional(w))})
}
rightOuterJoin
/**
* Perform a right outer join of `this` and `other`. For each element (k, w) in `other`, the
* resulting RDD will either contain all pairs (k, (Some(v), w)) for v in `this`, or the
* pair (k, (None, w)) if no elements in `this` have key k. Uses the given Partitioner to
* partition the output RDD.
*
* A right outer join.
*/
def rightOuterJoin[W](other: JavaPairRDD[K, W], partitioner: Partitioner)
: JavaPairRDD[K, (Optional[V], W)] = {
val joinResult = rdd.rightOuterJoin(other, partitioner)
fromRDD(joinResult.mapValues{case (v, w) => (JavaUtils.optionToOptional(v), w)})
}
fullOuterJoin
/**
* Perform a full outer join of `this` and `other`. For each element (k, v) in `this`, the
* resulting RDD will either contain all pairs (k, (Some(v), Some(w))) for w in `other`, or
* the pair (k, (Some(v), None)) if no elements in `other` have key k. Similarly, for each
* element (k, w) in `other`, the resulting RDD will either contain all pairs
* (k, (Some(v), Some(w))) for v in `this`, or the pair (k, (None, Some(w))) if no elements
* in `this` have key k. Uses the given Partitioner to partition the output RDD.
*
* A full outer join.
*
*/
def fullOuterJoin[W](other: JavaPairRDD[K, W], partitioner: Partitioner)
: JavaPairRDD[K, (Optional[V], Optional[W])] = {
val joinResult = rdd.fullOuterJoin(other, partitioner)
fromRDD(joinResult.mapValues{ case (v, w) =>
(JavaUtils.optionToOptional(v), JavaUtils.optionToOptional(w))
})
}
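A sketch contrasting the three outer joins; in the Scala API the unmatched side comes back as an Option rather than a Java Optional:

val users = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val orders = sc.parallelize(Seq((1, "book"), (3, "pen")))
users.leftOuterJoin(orders).collect()  // keeps (2, ("bob", None)) from the left side
users.rightOuterJoin(orders).collect() // keeps (3, (None, "pen")) from the right side
users.fullOuterJoin(orders).collect()  // keeps the unmatched elements of both sides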
collectAsMap
/**
* Return the key-value pairs in this RDD to the master as a Map.
*
* @note this method should only be used if the resulting data is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def collectAsMap(): java.util.Map[K, V] = mapAsSerializableJavaMap(rdd.collectAsMap())
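For example (mind the driver-memory caveat above):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
pairs.collectAsMap() // => Map(a -> 1, b -> 2); if a key repeats, only one of its values is kept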
mapValues
/**
* Pass each value in the key-value pair RDD through a map function without changing the keys;
* this also retains the original RDD's partitioning.
*/
def mapValues[U](f: JFunction[V, U]): JavaPairRDD[K, U] = {
implicit val ctag: ClassTag[U] = fakeClassTag
fromRDD(rdd.mapValues(f))
}
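For example:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
pairs.mapValues(_ * 10).collect() // => ("a", 10), ("b", 20); keys and partitioning untouched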
saveAsHadoopFile
/** Output the RDD to any Hadoop-supported file system.
*/
def saveAsHadoopFile[F <: OutputFormat[_, _]](
path: String,
keyClass: Class[_],
valueClass: Class[_],
outputFormatClass: Class[F],
conf: JobConf) {
rdd.saveAsHadoopFile(path, keyClass, valueClass, outputFormatClass, conf)
}
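A hedged sketch using the old mapred API and a made-up output path; TextOutputFormat here requires Hadoop Writable key/value types:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}

val writables = sc.parallelize(Seq(("spark", 1)))
  .map { case (k, v) => (new Text(k), new IntWritable(v)) }
writables.saveAsHadoopFile("/tmp/out", classOf[Text], classOf[IntWritable],
  classOf[TextOutputFormat[Text, IntWritable]], new JobConf())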
saveAsNewAPIHadoopDataset
/**
* Output the RDD to any Hadoop-supported storage system, using
* a Configuration object for that storage system.
*/
def saveAsNewAPIHadoopDataset(conf: Configuration) {
rdd.saveAsNewAPIHadoopDataset(conf)
}
saveAsHadoopDataset
/**
* Output the RDD to any Hadoop-supported storage system, using a Hadoop JobConf object for
* that storage system. The JobConf should set an OutputFormat and any output paths required
* (e.g. a table name to write to) in the same way as it would be configured for a Hadoop
* MapReduce job.
*/
def saveAsHadoopDataset(conf: JobConf) {
rdd.saveAsHadoopDataset(conf)
}
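A hedged sketch (made-up path): here the JobConf itself carries the output format and path, configured just as for a Hadoop MapReduce job; saveAsNewAPIHadoopDataset works analogously with a new-API Configuration:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, TextOutputFormat}

val conf = new JobConf()
conf.setOutputKeyClass(classOf[Text])
conf.setOutputValueClass(classOf[IntWritable])
conf.setOutputFormat(classOf[TextOutputFormat[Text, IntWritable]])
FileOutputFormat.setOutputPath(conf, new Path("/tmp/out"))
writables.saveAsHadoopDataset(conf) // `writables` as in the saveAsHadoopFile sketch above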
repartitionAndSortWithinPartitions
/**
* Repartition the RDD according to the given partitioner and, within each resulting partition,
* sort records by their keys.
*
* This is more efficient than calling `repartition` and then sorting within each partition
* because it can push the sorting down into the shuffle machinery.
*/
def repartitionAndSortWithinPartitions(partitioner: Partitioner): JavaPairRDD[K, V] = {
val comp = com.google.common.collect.Ordering.natural().asInstanceOf[Comparator[K]]
repartitionAndSortWithinPartitions(partitioner, comp)
}
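For example:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))
// each output partition holds its keys in sorted order; the sort piggybacks on the shuffle itself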
sortByKey
/**
* Sort the RDD by key, so that each partition contains a sorted range of the elements in
* ascending order. Calling `collect` or `save` on the resulting RDD will return or output an
* ordered list of records (in the `save` case, they will be written to multiple `part-X` files
* in the filesystem, in order of the keys).
*/
def sortByKey(): JavaPairRDD[K, V] = sortByKey(true)
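For example:

val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
pairs.sortByKey().collect()      // => (1, "a"), (2, "b"), (3, "c")
pairs.sortByKey(false).collect() // descending; the Boolean is the `ascending` flag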
keys
/**
* Return an RDD with the keys of each tuple.
*/
def keys(): JavaRDD[K] = JavaRDD.fromRDD[K](rdd.map(_._1))
values
/**
* Return an RDD with the values of each tuple.
*/
def values(): JavaRDD[V] = JavaRDD.fromRDD[V](rdd.map(_._2))
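A sketch covering both keys and values:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
pairs.keys.collect()   // => "a", "b"
pairs.values.collect() // => 1, 2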
countApproxDistinctByKey
/**
* Return approximate number of distinct values for each key in this RDD.
*
* The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice:
* Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", available
* <a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>.
*
* @param relativeSD Relative accuracy. Smaller values create counters that require more space.
* It must be greater than 0.000017.
* @param partitioner partitioner of the resulting RDD.
*/
def countApproxDistinctByKey(relativeSD: Double, partitioner: Partitioner)
: JavaPairRDD[K, jl.Long] = {
fromRDD(rdd.countApproxDistinctByKey(relativeSD, partitioner)).
asInstanceOf[JavaPairRDD[K, jl.Long]]
}
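For example, with a made-up 5% relative error:

val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("a", 2), ("b", 9)))
pairs.countApproxDistinctByKey(relativeSD = 0.05).collect()
// ≈ ("a", 2), ("b", 1): a HyperLogLog estimate, far cheaper than an exact distinct count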
That wraps it up for the RDD API. If you have any questions, please leave a comment.