Spark RDD Functions in Detail

Code_LT

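RDDs support two kinds of operations: transformations, which create a new dataset from an existing one, and actions, which run a computation on the dataset and return a value to the driver program. A transformation takes an RDD and returns an RDD; an action takes an RDD and returns something that is not an RDD. Transformations are evaluated lazily, while actions run immediately. For example, for an RDD df1, df1.map is a transformation: calling it does not execute anything. Only when an action involving df1 runs is df1 computed and loaded into memory, and even if df1 was computed before, it is computed again unless it was stored with persist or cache. reduce is an action: it combines all elements with a function and returns the final result to the driver. (There is also a parallel reduceByKey, which is a transformation and returns a distributed dataset.)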

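All transformations in Spark are lazy: they do not compute their results right away. Instead, they only record the transformations applied to a base dataset (for example, a file). The transformations actually run only when an action requires a result to be returned to the driver. This design lets Spark run more efficiently. For example, a dataset created through map can be consumed by reduce, so that only the result of the reduce is returned to the driver rather than the larger mapped dataset.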
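The lazy behaviour can be observed with a small sketch. It is written in script style (for example for spark-shell) and assumes a local master; the app name and the accumulator name `mapCalls` are illustrative, not taken from the article. The accumulator is used only to see when the map function runs, and the counts hold for a local run without task retries.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local two-thread context; master URL and app name are illustrative choices.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("lazy-demo"))

// Accumulator used only to observe when the map function actually runs.
val calls = sc.longAccumulator("mapCalls")

// map is a transformation: this line only records what should be done.
val squares = sc.parallelize(Seq(1, 2, 3, 4)).map { x =>
  calls.add(1)
  x * x
}

// No action has run yet, so the map function has not been invoked.
val callsBeforeAction = calls.sum

// reduce is an action: it triggers the job and returns only the sum to the driver.
val sumOfSquares = squares.reduce(_ + _)
val callsAfterAction = calls.sum
```

Only `sumOfSquares`, a single Int, travels back to the driver; the mapped dataset itself is never collected.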

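By default, each transformed RDD is recomputed every time you run an action on it. However, you can use the persist (or cache) method to keep an RDD in memory. Spark then keeps the elements around on the cluster, so the next query against that RDD is much faster. Persisting datasets on disk, or replicating them across nodes, is also supported.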
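A minimal sketch of the difference, again in script style with an assumed local master. The helper `doubled` and the accumulator name `evaluations` are hypothetical names introduced for this illustration; the counts in the comments assume a local run without task failures and with enough memory to keep the cached partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("persist-demo"))

// Counts how many times the map function is evaluated.
val evaluations = sc.longAccumulator("evaluations")

// Hypothetical helper: builds a fresh mapped RDD over the numbers 1 to 4.
def doubled() = sc.parallelize(1 to 4).map { x => evaluations.add(1); x * 2 }

// Without persist: each action recomputes the map over all 4 elements.
val plain = doubled()
plain.count()
plain.count()
val withoutPersist = evaluations.sum   // 8: two full evaluations

evaluations.reset()

// With cache (persist at the default storage level): the second action reads cached partitions.
val cached = doubled().cache()
cached.count()
cached.count()
val withPersist = evaluations.sum      // 4: evaluated once, then served from the cache
```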

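The tables below list the RDD transformations and actions in Spark (Spark 1.5.1). Each operation is given with its signature, where square brackets denote type parameters. As noted above, transformations are lazy operations that define a new RDD, while actions launch a computation and either return a value to the user program or write data to external storage.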

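Transformations:

Function	Description
map[U: ClassTag](f: T => U): RDD[U]	Applies the function f to every element of the RDD and returns the results as a new RDD
flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]	Applies the function f to every element of the RDD, flattens the results, and returns them as a new RDD
filter(f: T => Boolean): RDD[T]	Returns a new RDD containing only the elements for which the predicate is true
distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]	Returns a new RDD containing the distinct elements of this RDD
repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]	Returns an RDD with numPartitions partitions; the data is always shuffled and then redistributed
coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty) (implicit ord: Ordering[T] = null): RDD[T]	With shuffle = false it can only reduce the number of partitions to numPartitions; with shuffle = true it can also increase the number of partitions to numPartitions
sample(withReplacement: Boolean,fraction: Double, seed: Long = Utils.random.nextLong): RDD[T]	Returns an RDD holding a sampled subset of this RDD. withReplacement = true: elements may be chosen more than once, and fraction is the expected number of times each element is chosen (>= 0). withReplacement = false: each element is chosen at most once, and fraction is the probability that each element is chosen ([0, 1])
randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]	Randomly splits the RDD into an array of RDDs according to the weights array; weights that do not sum to 1 are normalized. Note that the elements of each split are drawn at random based on seed
takeSample(withReplacement: Boolean,num: Int, seed: Long = Utils.random.nextLong): Array[T]	Returns an array of num randomly chosen elements. Although listed here, this is an action: the array is held in driver memory, so use it with care
union(other: RDD[T]): RDD[T]	Returns the union of this RDD and other; identical elements appear multiple times
++(other: RDD[T]): RDD[T]	Same as union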
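sortBy[K](f: (T) => K,ascending: Boolean = true, numPartitions: Int = this.partitions.length) (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]	Sorts by the key computed with f and returns the sorted RDD
intersection(other: RDD[T]): RDD[T]	Returns the intersection of this RDD and other. The output contains no duplicates, even if the inputs do, and the operation shuffles the data
intersection(other: RDD[T],partitioner: Partitioner) (implicit ord: Ordering[T] = null): RDD[T]	Same as above, using the given partitioner
intersection(other: RDD[T], numPartitions: Int): RDD[T]	Same as the first form, hash-partitioning the resulting RDD into numPartitions partitions
glom(): RDD[Array[T]]	Collects the elements of type T in each partition into an Array[T], so that each partition holds a single array element
cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]	Returns a pair RDD holding the Cartesian product of this RDD and other, that is, all pairs (a, b) with a in this RDD and b in other
groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])]	Returns an RDD of groups, each consisting of a key and the sequence of elements that map to that key. The order of elements within a group is not guaranteed and may differ between evaluations. Note: this operation is very expensive. If the grouping is only a step towards an aggregation (such as count or sum), the more efficient `PairRDDFunctions.aggregateByKey` or `PairRDDFunctions.reduceByKey` is recommended
groupBy[K](f: T => K,numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])]	Same as above, with a setting for the number of partitions
groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null): RDD[(K, Iterable[T])]	Same as above, with a partitioner
pipe(command: String): RDD[String]	Given an external program A, rdd.pipe(A) writes the elements of each partition to the standard input of A, one per line, and builds a new RDD of strings from the lines A prints. Somewhat like map
pipe(command: String, env: Map[String, String]): RDD[String]	Same as above, with environment variables for the external process
pipe(command: Seq[String],env: Map[String, String] = Map(), printPipeContext: (String => Unit) => Unit = null, printRDDElement: (T, String => Unit) => Unit = null, separateWorkingDir: Boolean = false,bufferSize: Int = 8192, encoding: String = Codec.defaultCharsetCodec.name): RDD[String]	Considerably more complex; omitted for now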
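mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]	Applies f to the iterator over each partition rather than to individual elements (the difference from map is not covered here). preservesPartitioning should be true only for a pair RDD whose keys are left unchanged by f
mapPartitionsWithIndex[U: ClassTag]( f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning : Boolean = false): RDD[U]	Same as above, additionally passing the index of the original partition to f
zip[U: ClassTag](other: RDD[U]): RDD[(T, U)]	Zips two RDDs element by element into a key-value RDD. Both RDDs must have the same number of partitions and the same number of elements in each partition (the easiest way to guarantee this is to derive one RDD from the other with map)
zipPartitions[B: ClassTag, V: ClassTag] (rdd2: RDD[B], preservesPartitioning: Boolean) (f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V]	Zips two RDDs partition by partition. The RDDs must have the same number of partitions, but not the same number of elements per partition; f is applied to the paired partition iterators to produce the new RDD
zipPartitions[B: ClassTag, V: ClassTag] (rdd2: RDD[B]) (f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V]	Same as above, without the preservesPartitioning parameter
zipPartitions[B: ClassTag, C: ClassTag, D: ClassTag, V: ClassTag] (rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D]) (f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]): RDD[V]	Same as above with more RDD arguments. This is the largest supported arity; a variant with preservesPartitioning also exists and is omitted here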
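A few of the transformations above can be tried together in a short sketch. It assumes a local master and small in-memory data; the app name and variable names are illustrative. glom is used to make the partition layout visible.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("transformations-demo"))

val nums = sc.parallelize(1 to 6, numSlices = 3)

// glom exposes the partition layout: one array per partition.
val layout = nums.glom().collect().map(_.toList).toList

// flatMap flattens the per-element results; filter keeps elements matching the predicate.
val evens = nums.flatMap(x => Seq(x, x * 10)).filter(_ % 2 == 0).collect().toList

// coalesce without shuffle can only reduce the partition count ...
val fewer = nums.coalesce(2).getNumPartitions
val notMore = nums.coalesce(6).getNumPartitions
// ... while repartition always shuffles and can also increase it.
val more = nums.repartition(6).getNumPartitions

// zip needs equal partition counts and equal element counts per partition,
// which holds when one RDD is derived from the other by map.
val zipped = nums.zip(nums.map(_ * 2)).collect().toList

// mapPartitionsWithIndex passes the partition index alongside the iterator.
val sumsByPartition = nums
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.sum)))
  .collect().toList

// union keeps duplicates; intersection removes them.
val a = sc.parallelize(Seq(1, 2, 2, 3))
val b = sc.parallelize(Seq(2, 3, 3, 4))
val unionSorted = a.union(b).collect().sorted.toList
val common = a.intersection(b).collect().sorted.toList
```

Every `collect()` here is an action used only to inspect the result; the transformations themselves stay lazy until that point.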

Actions:

A&C function: a function that is both associative and commutative

Associative: a + (b + c) = (a + b) + c, i.e. f(a, f(b, c)) = f(f(a, b), c) for f(a, b) = a + b

Commutative: ab = ba, i.e. f(a, b) = f(b, a) for f(a, b) = a * b
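Why the A&C property matters for the actions listed next can be sketched with reduce. The example assumes a local master; the data and variable names are made up for illustration. Addition gives the same answer for any partitioning, whereas subtraction, which is neither associative nor commutative, depends on how the elements are grouped.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("ac-demo"))

val data = Seq(1, 2, 3, 4)

// Addition is associative and commutative: the partitioning does not matter.
val sumOnePartition  = sc.parallelize(data, 1).reduce(_ + _)
val sumTwoPartitions = sc.parallelize(data, 2).reduce(_ + _)

// Subtraction is neither, so the result depends on how elements are grouped.
// One partition: ((1 - 2) - 3) - 4
val diffOnePartition  = sc.parallelize(data, 1).reduce(_ - _)
// Two partitions: (1 - 2) and (3 - 4) are computed separately, then merged.
val diffTwoPartitions = sc.parallelize(data, 2).reduce(_ - _)
```

With more partitions the merge order can also vary with task completion order, so a non-A&C function may not even give the same result from run to run.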

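Function	Description
foreach(f: T => Unit): Unit	Applies a function with no return value to every element of the RDD
foreachPartition(f: Iterator[T] => Unit): Unit	Applies a function with no return value to each partition of the RDD
collect(): Array[T]	Returns an array containing all elements of the RDD to the driver; watch driver memory
toLocalIterator: Iterator[T]	Returns an iterator over all elements. It uses as much driver memory as the largest partition of the RDD. Note: it is best to persist the RDD before using this action
collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U]	Returns an RDD of the results of applying f to every element on which it is defined (a transformation, since it returns an RDD)
subtract(other: RDD[T]): RDD[T]	Returns an RDD with the elements that are in this RDD but not in other, using this RDD's partitioning (also a transformation)
subtract(other: RDD[T], numPartitions: Int): RDD[T]	Same as above, with a configurable number of partitions
subtract( other: RDD[T],p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]	Same as above, with a partitioner and an ordering
reduce(f: (T, T) => T): T	Combines the elements of the RDD pairwise with the A&C function f and returns the result; if f is not A&C, the result is unpredictable
treeReduce(f: (T, T) => T, depth: Int = 2): T	Reduces in a multi-level tree pattern, which can lower the cost of the reduce; f must be an A&C function
fold(zeroValue: T)(op: (T, T) => T): T	Aggregates within each partition, then aggregates the partition results. The order of aggregation is not fixed, so op must be A&C
aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U	Aggregates within each partition, then aggregates the partition results. The result type may differ from the element type (RDD[T] => U): seqOp works within a partition and may change the type, combOp merges partition results and must be associative
treeAggregate[U: ClassTag](zeroValue: U)( seqOp: (U, T) => U, combOp: (U, U) => U, depth: Int = 2): U	Same as above, using a multi-level tree pattern, which can be more efficient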
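count(): Long	Returns the number of elements in the RDD
countApprox(timeout: Long,confidence: Double = 0.95) : PartialResult[BoundedDouble]	Returns an approximate result, with confidence level confidence, within at most timeout milliseconds
countByValue()(implicit ord: Ordering[T] = null): Map[T, Long]	Counts the occurrences of each element and returns a Map to the driver. For very large RDDs, prefer rdd.map(x => (x, 1L)).reduceByKey(_ + _), which yields an RDD[(T, Long)] instead of a local Map
countByValueApprox(timeout: Long, confidence: Double = 0.95) (implicit ord: Ordering[T] = null) : PartialResult[Map[T, BoundedDouble]]	Approximate version of the above
countApproxDistinct(p: Int, sp: Int): Long	Returns the approximate number of distinct elements in the RDD
countApproxDistinct(relativeSD: Double = 0.05): Long	Same as above, with an accuracy setting: smaller values are more accurate but use more space; must be greater than 0.000017
zipWithIndex(): RDD[(T, Long)]	Zips the elements with their indices (a transformation). Indices are ordered first by partition and then by position within the partition: the first element of the first partition gets index 0 and the last element of the last partition gets the largest index. If a fixed ordering is needed to guarantee the same index assignments, sort the RDD with sortByKey first
zipWithUniqueId(): RDD[(T, Long)]	Zips the elements with unique ids (a transformation). Unlike indices, unique ids may have gaps. If a fixed ordering is needed, sort the RDD with sortByKey first
take(num: Int): Array[T]	Returns the first num elements of the RDD as an array to the driver. Throws an exception if called on an RDD of Nothing or Null
first(): T	Returns the first element of the RDD
top(num: Int)(implicit ord: Ordering[T]): Array[T]	Returns the top num elements under the given ordering as an array to the driver, in descending order by default. sc.parallelize(Seq(2, 3, 4, 5, 6)).top(2) returns Array(6, 5)
takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]	Same as above with the opposite order. sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2) returns Array(2, 3)
max()(implicit ord: Ordering[T]): T	Returns the maximum element of the RDD under the given ordering
min()(implicit ord: Ordering[T]): T	Returns the minimum element of the RDD under the given ordering
isEmpty(): Boolean	Checks whether the RDD is empty. It is empty if it has no partitions or no elements; an RDD with one or more partitions is still empty when those partitions hold no elements. Note: calling this on an RDD of Nothing or Null throws an exception. `parallelize(Seq())` has type `RDD[Nothing]`; avoid it by writing `parallelize(Seq[T]())` instead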
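saveAsTextFile(path: String): Unit	Saves the RDD as a text file, using the toString representation of each element, so avoid element types such as Array whose toString loses information. This function tolerates null values
saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit	Same as above, writing compressed output
saveAsObjectFile(path: String): Unit	Saves the RDD as a binary file of serialized objects
keyBy[K](f: T => K): RDD[(K, T)]	Builds a key-value RDD by applying f to obtain the key of each element (a transformation)
private[spark] def collectPartitions(): Array[Array[T]]	Private testing method that exposes the contents of each partition
checkpoint(): Unit	Saves the RDD under the directory set with SparkContext#setCheckpointDir. All lineage information is removed afterwards, and the data is not deleted when the job finishes. Persisting the RDD first is recommended; otherwise it is computed twice
localCheckpoint(): this.type	Used to shorten the lineage; it uses local temporary storage rather than fault-tolerant storage
isCheckpointed: Boolean	Returns whether the RDD has been checkpointed and materialized
private[rdd] def isLocallyCheckpointed: Boolean	Private method that returns whether the RDD has been locally checkpointed
getCheckpointFile: Option[String]	Returns the name of the file the RDD was checkpointed to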
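Several of the actions above can be exercised in one short sketch. It assumes a local master and a small made-up dataset; all variable names are illustrative. aggregate is shown computing a (sum, count) pair to illustrate that the result type may differ from the element type.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("actions-demo"))

val rdd = sc.parallelize(Seq(3, 1, 4, 1, 5, 9, 2, 6), numSlices = 2)

// aggregate can change the result type: RDD[Int] => (sum, count).
val (sum, count) = rdd.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),   // seqOp: fold one element into a partition result
  (l, r) => (l._1 + r._1, l._2 + r._2))   // combOp: merge two partition results
val mean = sum.toDouble / count

// top returns the largest elements in descending order,
// takeOrdered the smallest in ascending order.
val largest  = rdd.top(2).toList
val smallest = rdd.takeOrdered(2).toList

// countByValue brings a local Map to the driver; fine here, risky for many distinct values.
val frequencies = rdd.countByValue()

// zipWithIndex numbers elements by partition order, then by position within the partition.
val indexed = rdd.zipWithIndex().collect().toList

// take returns the leading elements; isEmpty is given an explicitly typed empty Seq.
val firstThree = rdd.take(3).toList
val nothingHere = sc.parallelize(Seq.empty[Int]).isEmpty()
```

The explicitly typed `Seq.empty[Int]` follows the note on isEmpty: an untyped `Seq()` would produce an `RDD[Nothing]`.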

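Some other internal methods and fields are omitted.

Special case:

sc.parallelize(Array(2.0, 3.0)).fold(0.0)((p, v) => p + v * v)

Because (p, v) => p + v * v is not commutative (nor associative), and fold also applies it when merging the partition results, the outcome is unpredictable. It can be rewritten as:

sc.parallelize(Array(2.0, 3.0)).map(v => v * v).reduce(_ + _)

Other: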
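The effect can be made visible by fixing the number of partitions. This is a sketch under the assumption of a local master; the name `sumSq` and the aggregate variant are additions for illustration and are not part of the original example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("fold-demo"))

val values = Array(2.0, 3.0)
val sumSq = (p: Double, v: Double) => p + v * v

// fold applies op within each partition and again when merging partition results,
// so the partial sums themselves get squared and the answer changes with the layout.
val foldOnePartition  = sc.parallelize(values, 1).fold(0.0)(sumSq)
val foldTwoPartitions = sc.parallelize(values, 2).fold(0.0)(sumSq)

// Squaring first leaves an associative, commutative reduction.
val viaMapReduce = sc.parallelize(values, 2).map(v => v * v).reduce(_ + _)

// aggregate separates the per-element step (seqOp) from the merge step (combOp).
val viaAggregate = sc.parallelize(values, 2).aggregate(0.0)((p, v) => p + v * v, _ + _)
```

Tracing fold by hand, one partition gives 0 + (4 + 9)^2 = 169.0 and two partitions give 4^2 + 9^2 = 97.0, while the intended sum of squares is 13.0, which both rewritten forms return.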

class BoundedDouble(val mean: Double, val confidence: Double, val low: Double, val high: Double) {
  override def toString(): String = "[%.3f, %.3f]".format(low, high)
}

Differences between persist and checkpoint:

For details, see: https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/6-CacheAndCheckpoint.md

https://stackoverflow.com/questions/35127720/what-is-the-difference-between-spark-checkpoint-and-persist-to-a-disk

Persist

Persisting or caching with StorageLevel.DISK_ONLY causes the RDD to be computed and stored in a location such that subsequent uses of that RDD do not go beyond that point when recomputing the lineage.
After persist is called, Spark still remembers the lineage of the RDD even though it does not need to replay it.
Secondly, after the application terminates, the cache is cleared or the file is destroyed.

Checkpointing

Checkpointing stores the RDD physically to HDFS and destroys the lineage that created it.
The checkpoint file won't be deleted even after the Spark application has terminated.
Checkpoint files can be used in a subsequent job run or driver program.
Checkpointing an RDD causes double computation unless the RDD is persisted first: the action computes the RDD once, and a separate job then computes it again to write it to the checkpoint directory.
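The points above can be sketched as follows. The example assumes a local master and uses a temporary local directory in place of HDFS, which is an assumption made only so the snippet can run on one machine; the app name and variable names are illustrative.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("checkpoint-demo"))

// A temporary local directory stands in for HDFS here (assumption for the demo).
sc.setCheckpointDir(Files.createTempDirectory("rdd-checkpoint").toString)

val base = sc.parallelize(1 to 100).map(_ * 2).filter(_ % 3 == 0)

// Persist first so that the checkpoint job can read cached data instead of recomputing.
base.persist()
base.checkpoint()                      // only marks the RDD; nothing is written yet

val markedOnly = base.isCheckpointed   // still false: no action has run
val total = base.count()               // the action triggers computation and the checkpoint write
val written = base.isCheckpointed
val location = base.getCheckpointFile  // path under the checkpoint directory, if written

// After checkpointing, the printed lineage starts from the checkpointed data.
println(base.toDebugString)
```

checkpoint() has to be called before the first action on the RDD; calling it afterwards does not checkpoint the already computed result.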

PairRDD functions

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala

References

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala