RDD Action Operators

数智侠

Category: Spark. Tags: programming languages, Spark, RDD

Article link: https://blog.csdn.net/taogumo/article/details/141026613

Official documentation

RDD Programming Guide - Spark 3.5.1 Documentation

Action	Meaning
reduce(func)	Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
collect()	Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count()	Return the number of elements in the dataset.
first()	Return the first element of the dataset (similar to take(1)).
take(n)	Return an array with the first n elements of the dataset.
takeSample(withReplacement, num, [seed])	Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
takeOrdered(n, [ordering])	Return the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path)	Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
saveAsSequenceFile(path) (Java and Scala)	Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).
saveAsObjectFile(path) (Java and Scala)	Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using `SparkContext.objectFile()`.
countByKey()	Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
foreach(func)	Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. Note: modifying variables other than Accumulators outside of the `foreach()` may result in undefined behavior. See Understanding closures for more details.

reduce(func)

Aggregates all elements of the RDD using the function func.

val rdd1 = sc.makeRDD(1 to 10)
rdd1.reduce(_+_)

Int = 55

val rdd2 = sc.makeRDD(Array(("a",1),("a",3),("c",3),("d",5)))
rdd2.reduce((x,y)=>(x._1 + y._1, x._2 + y._2))

(String, Int) = (aacd,12)
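
The string part of the result above depends on the order in which partial results are merged, because string concatenation is not commutative. The following is a minimal plain-Scala sketch of that effect (no Spark required). The two-partition layout and the names partitions, merge, partials, inOrder and reversed are assumptions made for illustration; the real layout depends on the default parallelism.

```scala
// Plain-Scala model: each inner Seq stands for one (hypothetical) partition.
val partitions = Seq(Seq(("a", 1), ("a", 3)), Seq(("c", 3), ("d", 5)))
val merge = (x: (String, Int), y: (String, Int)) => (x._1 + y._1, x._2 + y._2)

// Step 1: reduce inside each partition.
val partials = partitions.map(_.reduce(merge))

// Step 2: merge the partial results. In a real job the arrival order is not fixed,
// so both of the following are possible outcomes.
val inOrder  = partials.reduce(merge)          // (aacd,12)
val reversed = partials.reverse.reduce(merge)  // (cdaa,12)
```

The integer sum is 12 either way because addition is commutative and associative, which is the property the official documentation asks of func.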

collect()

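Returns all elements of the dataset to the driver program as an array. The elements distributed across the cluster are gathered into a single Scala array on the driver, similar to toArray. Because the whole result has to fit in driver memory, collect is best used on small datasets.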

val rdd = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
rdd.collect

Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu, Rat)

var rdd1 = sc.makeRDD(1 to 10)
rdd1.collect()

Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

collectAsMap

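Similar to collect, but for RDDs of key-value pairs: the elements are returned as a Scala Map, preserving the key-value structure. If a key occurs more than once, only one value is kept for it.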

val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.zip(a)
b.collectAsMap
b.collect

a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:24
b: org.apache.spark.rdd.RDD[(Int, Int)] = ZippedPartitionsRDD2[5] at zip at <console>:26
scala.collection.Map[Int,Int] = Map(2 -> 2, 1 -> 1, 3 -> 3)
Array[(Int, Int)] = Array((1,1), (2,2), (1,1), (3,3))
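
The example above hides the effect of duplicate keys, since both (1,1) pairs carry the same value. A minimal plain-Scala sketch using toMap on an ordinary collection shows the analogous behaviour. The names pairs and asMap are illustrative, and which value Spark keeps for a duplicated key is treated here as an assumption that should not be relied upon.

```scala
// Plain-Scala analogue: converting pairs with a duplicated key into a Map.
val pairs = Seq((1, "a"), (2, "b"), (1, "c"), (3, "d"))

// The later pair for key 1 replaces the earlier one, so four pairs become three entries.
val asMap = pairs.toMap  // Map(1 -> c, 2 -> b, 3 -> d)
```

When every value for a key matters, collect keeps all pairs, as the Array output above shows.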

foreach(func)

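foreach applies the function f to every element of the RDD and returns Unit. f takes one element as its argument, and whatever it returns is discarded. It is defined as: def foreach(f: T => Unit): Unit

Unlike map, which is a transformation that returns a new RDD, foreach is an action that returns nothing and is used for its side effects. With more than one partition the elements are processed in parallel, so printed output can appear interleaved and its order is not deterministic.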

var rdd = sc.makeRDD(1 to 10,2)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at makeRDD at <console>:24

rdd.foreach(println)

1
6
2
7
3
8
4
9
5
10

count

Returns the number of elements in the RDD.

var rdd2 = sc.makeRDD(1 to 10)
rdd2.count()

Long = 10

top()

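Returns the K largest elements of the RDD in descending order under the default ordering. Passing a reversed ordering yields the K smallest elements instead.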

val rdd=sc.parallelize(Array(6, 9, 4, 7, 5, 8), 2)
rdd.top(2)

Array[Int] = Array(9, 8)

first()

Returns the first element of the RDD.

var rdd3=sc.makeRDD(1 to 10)
rdd3.first()

Int = 1

take(n)

Returns an array with the first n elements of the dataset.

var rdd = sc.makeRDD(1 to 10)
rdd.take(5)

Array[Int] = Array(1, 2, 3, 4, 5)

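takeSample(withReplacement, num, [seed])

Returns an array holding a random sample of num elements, drawn with or without replacement. An optional seed fixes the random number generator.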

val x = sc.parallelize(1 to 10, 3)
x.takeSample(true, 5, 1)

res3: Array[Int] = Array(1, 8, 10, 3, 6)
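
The withReplacement flag decides whether the same element may be drawn more than once. The sketch below illustrates the two modes on a plain Scala collection using scala.util.Random. It is not Spark's sampling algorithm, and the names data, rng, withReplacement and withoutReplacement are chosen only for illustration.

```scala
import scala.util.Random

val data = (1 to 10).toVector
val rng = new Random(1L) // fixed seed so that repeated runs draw the same sample

// With replacement: every draw is independent, so an element may appear more than once.
val withReplacement = Vector.fill(5)(data(rng.nextInt(data.size)))

// Without replacement: shuffle, then take a prefix, so no element can repeat.
val withoutReplacement = rng.shuffle(data).take(5)
```

Without replacement, a sample can never hold more elements than the dataset itself; with replacement it can.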

takeOrdered(n)

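Returns the first n elements in ascending order, or according to a custom ordering.

val rdd6 = sc.makeRDD(Seq(10, 4, 2, 12, 3))
// top: the n largest elements, in descending order
rdd6.top(2)
// takeOrdered: the n smallest elements, in ascending order
rdd6.takeOrdered(4)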

Array[Int] = Array(12, 10)

Array[Int] = Array(2, 3, 4, 10)
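
The relationship between top and takeOrdered can be modelled on a plain Scala collection: both sort under an ordering and keep a prefix, and top simply uses the reversed ordering. This is a sketch of the semantics only, not of how Spark computes the result across partitions, and the names smallest4, largest2 and closestTo5 are illustrative.

```scala
val data = Seq(10, 4, 2, 12, 3)

// takeOrdered(4): the four smallest elements under the natural ordering.
val smallest4 = data.sorted.take(4)                       // List(2, 3, 4, 10)

// top(2): the same idea with the ordering reversed.
val largest2 = data.sorted(Ordering[Int].reverse).take(2) // List(12, 10)

// A custom ordering: rank elements by their distance from 5.
val byDistanceFrom5 = Ordering.by[Int, Int](x => math.abs(x - 5))
val closestTo5 = data.sorted(byDistanceFrom5).take(2)     // List(4, 3)
```

On an RDD, the custom ordering would be passed in the second parameter list, for example rdd6.takeOrdered(2)(byDistanceFrom5).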

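aggregate(zeroValue)(seqOp, combOp)

aggregate first combines the elements of each partition with the initial value (zeroValue) using seqOp, then merges the per-partition results, again together with the initial value, using combOp. The type of the final result does not have to match the element type of the RDD.

An example first:

val rdd4 = sc.parallelize(List(1, 2, 3, 4, 5), 2)
// Initial value 1: sum the elements within each partition, then sum the partition results
val result = rdd4.aggregate(1)(_ + _, _ + _)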

result: Int = 18

Now the execution in detail:

val rdd4 = sc.parallelize(List(1, 2, 3, 4, 5), 2)
// Initial value 1: sum the elements within each partition, then sum the partition results
val result = rdd4.aggregate(1)((x, y) => {
    println("x1:" + x + ",y1:" + y)
    x + y
}, (x, y) => {
        println("x2:" + x + ",y2:" + y)
        x + y
    })

println(result)

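First partition (the x1 on the first line is the initial value):

x1:1,y1:1

x1:2,y1:2

Second partition:

x1:1,y1:3

x1:4,y1:4

x1:8,y1:5

Merging the results of the two partitions (the x2 on the first line is again the initial value):

x2:1,y2:4

x2:5,y2:13

The final result is: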

18

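aggregate is similar to reduce: both combine all elements into a single value. aggregate additionally takes an initial value and uses separate functions for combining within a partition and across partitions.

The first argument list holds the initial value; the second holds two functions. The first function aggregates the elements within each partition, and the second function merges the per-partition results.

aggregate(1): 1 is the initial value. Within each partition, the initial value is first combined with the first element using the first function. When the partition results are merged, the initial value is included once more. With N partitions the initial value therefore enters the result N + 1 times.

The first _ + _ is shorthand for (x, y) => x + y and sums the elements within a partition.

The second _ + _ sums the per-partition results.

For example, with the first partition holding 1 2 and the second holding 3 4 5, as in the trace above, the first function yields 3 and 12, and with an initial value of 0 the second function yields 3 + 12 = 15.

With an initial value of 1 the computation becomes: 1 + 1+2 = 4, 1 + 3+4+5 = 13, 1 + 4+13 = 18.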
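
The two-stage computation can be reproduced without a cluster. The sketch below models aggregate on plain Scala collections, with each inner Seq standing for one partition. aggregateModel is a hypothetical helper written for this illustration, not a Spark API, and it merges partition results in a fixed order, which a real job does not guarantee.

```scala
// Each inner Seq stands for one partition of List(1, 2, 3, 4, 5) split in two.
val partitions = Seq(Seq(1, 2), Seq(3, 4, 5))

// Hypothetical model of aggregate: fold each partition with seqOp starting from zero,
// then fold the partition results with combOp, starting from zero once more.
def aggregateModel[T, U](parts: Seq[Seq[T]], zero: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = {
  val perPartition = parts.map(_.foldLeft(zero)(seqOp))
  perPartition.foldLeft(zero)(combOp)
}

// Initial value 1 with two partitions: (1+1+2) and (1+3+4+5), then 1 + 4 + 13.
val sumWithOne = aggregateModel(partitions, 1)(_ + _, _ + _)  // 18

// The result type may differ from the element type: accumulate (sum, count) pairs.
val (total, count) = aggregateModel(partitions, (0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),
  (a, b) => (a._1 + b._1, a._2 + b._2)
)
val mean = total.toDouble / count  // 3.0
```

The (sum, count) variant uses (0, 0) as the initial value, which is neutral for addition, so the result does not depend on the number of partitions.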

fold(zeroValue)(func)

A simplified form of aggregate that uses a single function both within each partition and across partitions.

Unlike reduce, it takes an initial value.

var rdd = sc.makeRDD(1 to 4,2)
rdd.aggregate(1)(
  (x: Int, y: Int) => x + y,
  (a: Int, b: Int) => a + b
)

Int = 13

rdd.fold(1)(_+_)

Int = 13
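
Because the initial value is applied once per partition and once more in the final merge, a non-neutral initial value makes the result depend on the number of partitions. The following plain-Scala sketch illustrates this; foldModel is a hypothetical helper for this illustration, not a Spark API.

```scala
// Hypothetical model of fold for integer addition: the zero value is added once per
// partition and once more when the partition results are merged.
def foldModel(parts: Seq[Seq[Int]], zero: Int): Int =
  parts.map(_.foldLeft(zero)(_ + _)).foldLeft(zero)(_ + _)

// The same data, 1 to 4, laid out in two and in four partitions.
val twoPartitions  = foldModel(Seq(Seq(1, 2), Seq(3, 4)), 1)            // 13
val fourPartitions = foldModel(Seq(Seq(1), Seq(2), Seq(3), Seq(4)), 1)  // 15

// With the neutral element 0 the layout no longer matters.
val neutralZero = foldModel(Seq(Seq(1), Seq(2), Seq(3), Seq(4)), 0)     // 10
```

For that reason the initial value is normally chosen to be the neutral element of the operation, such as 0 for addition or Nil for list concatenation.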

saveAsTextFile(path)

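Saves the elements of the dataset as text files in HDFS or another supported file system. Spark calls toString on each element to convert it into a line of text in the file.

val rdd = sc.parallelize(1 to 10000, 3)
rdd.saveAsTextFile("/data/output1")

The output directory can then be inspected in HDFS; it contains one part file per partition.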

saveAsObjectFile

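Groups the elements of each partition into arrays of 10, serializes each array, maps it to a (null, BytesWritable(Y)) element, and writes the result to HDFS in SequenceFile format.

val rdd = sc.parallelize(1 to 100, 3)
rdd.saveAsObjectFile("/data/output2")
// Read the objects back from the same path that was written above
val rdd1 = sc.objectFile[Int]("/data/output2")
rdd1.collect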

Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)

countByKey()

For RDDs of type (K, V), returns a Map[K, Long] that holds the number of elements for each key.

val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)))
rdd.countByKey()

scala.collection.Map[Int,Long] = Map(2 -> 1, 1 -> 3, 3 -> 2)
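
The counting logic corresponds to grouping the pairs by key and taking the size of each group. A minimal plain-Scala sketch of that equivalence follows; the names pairs and counts are illustrative.

```scala
val pairs = List((1, 3), (1, 2), (1, 4), (2, 3), (3, 6), (3, 8))

// Group by key and count the members of each group; the values themselves are ignored.
val counts: Map[Int, Long] =
  pairs.groupBy(_._1).map { case (key, group) => key -> group.size.toLong }
// Map(1 -> 3, 2 -> 1, 3 -> 2)
```

countByKey brings the whole map to the driver, so it suits a modest number of distinct keys. For very many keys, rdd.mapValues(_ => 1L).reduceByKey(_ + _) keeps the counts distributed as an RDD.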

lookup

Scans the elements of the RDD, selects those whose key matches the argument, and returns their values as a Scala sequence.

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.lookup(5)

Seq[String] = WrappedArray(tiger, eagle)
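
The semantics of lookup can be modelled on a plain Scala collection as a filter on the key followed by a projection onto the value. lookupModel is a hypothetical helper for this illustration, not a Spark API.

```scala
val words = List("dog", "tiger", "lion", "cat", "panther", "eagle")
val byLength = words.map(w => (w.length, w))

// Hypothetical model of lookup: keep the pairs whose key matches and return their values.
def lookupModel[K, V](pairs: Seq[(K, V)], key: K): Seq[V] =
  pairs.collect { case (k, v) if k == key => v }

val lengthFive = lookupModel(byLength, 5)  // List(tiger, eagle)
val lengthNine = lookupModel(byLength, 9)  // List(), no key matches
```

A key that does not occur yields an empty sequence rather than an error.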

mean / sum / max / min

var rdd = sc.makeRDD(1 to 100)
rdd.sum()
rdd.max()

Double = 5050.0

Int = 100
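
The snippet above only exercises sum and max; on a numeric RDD, rdd.mean() and rdd.min() are called the same way. The expected values for 1 to 100 can be checked on a plain Scala range, as in this small sketch (the names data, total, largest, smallest and average are illustrative).

```scala
val data = 1 to 100

val total    = data.sum.toDouble  // 5050.0, converted because Spark's sum() returns a Double
val largest  = data.max           // 100
val smallest = data.min           // 1
val average  = total / data.size  // 50.5
```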
