Learning Spark RDD by Example: Action Functions
Action functions
An action function triggers a job and submits it to the Spark cluster.
Note: unlike transformations, actions do not return a new RDD. They either return a result to the driver (an array, a count, and so on) or write data out, and some of them, such as foreach, simply return Unit.
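For example, a minimal sketch (the variable names are just for illustration) of the difference between a transformation, which is lazy, and an action, which actually submits a job:
val nums = sc.parallelize(1 to 10)   // building the RDD does not run a job
val doubled = nums.map(_ * 2)        // map is a transformation: still no job
val total = doubled.reduce(_ + _)    // reduce is an action: a job is submitted, total = 110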
foreach
What it does
Applies the function f to every element of the RDD.
Function signature
def foreach(f: T => Unit): Unit
- Implementation
/**
* Applies a function f to all elements of this RDD.
*/
def foreach(f: T => Unit): Unit = withScope {
// Clean the closure so it can be serialized and shipped to the tasks
val cleanF = sc.clean(f)
// Submit a job to the cluster and apply f to every element of every partition
sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
- Example
Note: when Spark runs in cluster mode you will not see foreach's output on the driver, because the println calls are executed inside the executors (see the sketch after the example below).
scala> val r1 = sc.parallelize(Array("a", "b", "c", "d"))
r1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[19] at parallelize at <console>:24
scala> r1.foreach(item => println("hello " + item))
On a local (single-machine) Spark, you will see output like this:
scala> val r1 = sc.parallelize(Array("a", "b", "c", "d"))
r1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> r1.foreach(item => println("hello " + item))
[Stage 0:> (0 + 0) / 2]hello a
hello b
hello c
hello d
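If you need to see the elements from the driver console in cluster mode, a common workaround (a sketch, only safe for small RDDs) is to bring the data back to the driver first and print it locally:
// collect() returns the elements to the driver; println then runs locally
r1.collect().foreach(item => println("hello " + item))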
foreachPartition
What it does
Applies the same processing function to each partition of the RDD.
Function signature
def foreachPartition(f: Iterator[T] => Unit): Unit
- Implementation
/**
* Applies a function f to each partition of this RDD.
*/
def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
val cleanF = sc.clean(f)
// Note: the function's argument here is a whole partition (an iterator), not a single element
sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}
- Example
Check how many elements each partition holds:
scala> val r1 = sc.parallelize(Array("a", "b", "c", "d", "e", "f", "g", "h", "i"), 3)
r1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> r1.foreachPartition(c=>println(c.length))
3
3
3
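foreachPartition is typically used when some setup is expensive and should happen once per partition rather than once per element, for example opening a database connection. A minimal sketch, where the buffer is only a stand-in for a real connection or client:
r1.foreachPartition { iter =>
  // hypothetical per-partition setup, e.g. opening a connection
  val buffer = scala.collection.mutable.ArrayBuffer[String]()
  iter.foreach(elem => buffer += elem)          // handle each element with the shared resource
  println(s"flushed ${buffer.size} records")    // stand-in for flushing/closing the resource
}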
collect
What it does
Returns all elements of the RDD as a local array on the driver.
It is handy for printing an RDD's contents while debugging, but be careful: if the data is too large, it can cause an OOM on the driver.
Function signatures
def collect(): Array[T]
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U]
Implementation
/**
* Return an array that contains all of the elements in this RDD.
*
* @note this method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def collect(): Array[T] = withScope {
val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
Array.concat(results: _*)
}
- Example
scala> val r1 = sc.parallelize(Array("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"), 3)
r1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[4] at parallelize at <console>:24
scala> r1.foreachPartition(c=>println(c.length))
3
4
4
scala> r1.collect()
res23: Array[String] = Array(a, b, c, d, e, f, g, h, i, j, k)
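The second overload listed above takes a PartialFunction and is actually a transformation: it keeps only the elements the function is defined for and maps them into a new RDD. A small sketch using the r1 defined above:
// keep only the strings smaller than "e" and upper-case them
r1.collect { case s if s < "e" => s.toUpperCase }.collect()   // Array(A, B, C, D)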
subtract
What it does
Performs the standard set subtraction A - B: it returns the elements of this RDD that do not appear in the other RDD. Both RDDs must have the same element type. Note that subtract itself returns a new RDD lazily; nothing is computed until an action such as collect is run on the result.
Function signatures
def subtract(other: RDD[T]): RDD[T]
def subtract(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], p: Partitioner): RDD[T]
- Example
scala> val r1 = sc.parallelize(1 to 10, 3)
r1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[23] at parallelize at <console>:24
scala> val r2 = sc.parallelize(5 to 15, 3)
r2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24] at parallelize at <console>:24
scala> val r3 = r1.subtract(r2)
r3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[28] at subtract at <console>:28
scala> r2.collect()
res26: Array[Int] = Array(5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
scala> r1.collect()
res27: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> r3.collect()
res28: Array[Int] = Array(3, 1, 4, 2)
reduce
What it does
Reduces the elements of this RDD using the given commutative and associative binary operator.
Function signature
def reduce(f: (T, T) => T): T
- Example
scala> val a = sc.parallelize(1 to 100, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[33] at parallelize at <console>:24
scala> a.reduce(_+_)
res29: Int = 5050
// combined with map
scala> a.map(_*2).reduce(_+_)
res30: Int = 10100
treeReduce
What it does
Similar to reduce, except that the partial results are aggregated in a tree pattern (over several levels) instead of all being sent to the driver at once.
Function signature
def treeReduce(f: (T, T) => T, depth: Int = 2): T
- Example
val z = sc.parallelize(List(1,2,3,4,5,6), 2)
z.treeReduce(_+_)
res49: Int = 21
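When the RDD has many partitions, the depth argument controls how many rounds of intermediate aggregation happen on the executors before the final value reaches the driver. A sketch (partition count and depth are chosen arbitrarily):
val big = sc.parallelize(1 to 1000, 16)
// with depth = 3 the partial sums are merged in up to three rounds instead of
// sending all 16 per-partition results straight to the driver
big.treeReduce(_ + _, depth = 3)   // 500500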
fold
What it does
Aggregates the elements of each partition, and then the per-partition results, using the given associative function and a neutral "zero value". The accumulator inside each partition is initialized with zeroValue, and the per-partition results are then combined into the final result.
Function signature
def fold(zeroValue: T)(op: (T, T) => T): T
- Example
scala> val a = sc.parallelize(1 to 100, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[33] at parallelize at <console>:24
scala> a.foreachPartition(c=>println(c.length))
scala> a.fold(0)(_+_)
res38: Int = 5050
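Note that zeroValue is folded in once per partition and once more when the per-partition results are merged on the driver, so it should be the identity element of op. A sketch of what happens when it is not, assuming the 3-partition RDD a defined above:
// the zero value 1 is applied 3 times (once per partition) plus once when merging,
// so the result is 5050 + 4 = 5054 rather than 5051
a.fold(1)(_ + _)   // 5054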
aggregate
What it does
The aggregate function lets you apply two different reduce functions to an RDD. The first reduce function is applied within each partition, reducing that partition's data to a single value. The second reduce function then combines the per-partition results into the final result. Having separate functions for the within-partition and the cross-partition reduction adds a lot of flexibility; for example, the first function could be max and the second could be sum. You also supply an initial value (zeroValue). Two things to keep in mind:
- do not assume any execution order, either for the per-partition computation or for combining partitions;
- the initial value is applied in both reduce functions.
Function signature
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
- Example
val z = sc.parallelize(List(1,2,3,4,5,6), 2)
z.aggregate(0)(math.max(_,_), _+_)
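// partitions are (1,2,3) and (4,5,6): the per-partition max (seeded with 0) is 3 and 6, and combining gives 0 + 3 + 6 = 9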
// using a different kind of aggregation function
val z = sc.parallelize(List("12","23","345","4567"),2)
z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
res141: String = 42
aggregateByKey
What it does
Works like aggregate, except that the aggregation is applied separately to the values sharing the same key. Also, unlike aggregate, the initial value is not applied in the second reduce function (the combiner).
The zeroValue parameter is the initial value for the first reduce function (the per-partition seqOp).
Function signature
def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]
- Example
scala> val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 2)
scala> pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
res3: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))
scala> pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
res4: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))
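With zeroValue 100 the per-partition maximum is always 100; cat and mouse appear in both partitions, so their two per-partition results are summed to 200, while dog appears in only one partition and stays at 100.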
count
What it does
Returns the number of elements in the RDD.
Function signature
def count(): Long
- Example
scala> val a = sc.parallelize(1 to 100, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> a.count()
res30: Long = 100
countByValue
What it does
Returns the number of occurrences of each distinct value in the RDD, as a local (Scala) Map on the driver, so like collect it should only be used when the result is expected to be small.
Function signature
def countByValue()(implicit ord: Ordering[T] = null): Map[T, Long]
- Example
scala> val a = sc.parallelize(1 to 10, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:24
scala> a.countByValue
res35: scala.collection.Map[Int,Long] = Map(5 -> 1, 10 -> 1, 1 -> 1, 6 -> 1, 9 -> 1, 2 -> 1, 7 -> 1, 3 -> 1, 8 -> 1, 4 -> 1)
scala> val r1 = sc.parallelize(Array("a", "b", "c", "d", "e", "f", "g", "c", "b", "a", "a"), 3)
r1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> r1.countByValue
res0: scala.collection.Map[String,Long] = Map(e -> 1, f -> 1, a -> 3, b -> 2, g -> 1, c -> 2, d -> 1)
take
What it does
Returns the first n elements of the RDD.
It first scans one partition; if that partition does not have enough elements, it scans further partitions until n elements have been collected.
Note: the result is held in the driver's memory, so taking too many elements can cause an OOM.
Function signature
def take(num: Int): Array[T]
- Example
scala> val r1 = sc.parallelize(Array("a", "b", "c", "d", "e", "f", "g", "c", "b", "a", "a"), 3)
r1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> r1.take(3)
res4: Array[String] = Array(a, b, c)
first
What it does
Returns the first element of the RDD.
Function signature
def first(): T
- Example
scala> val r1 = sc.parallelize(Array("a", "b", "c", "d", "e", "f", "g", "c", "b", "a", "a"), 3)
r1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> r1.first
res1: String = a
takeOrdered
What it does
Returns the first n elements of the RDD according to their natural order (or the implicit Ordering in scope), i.e. the n smallest elements, sorted.
Function signature
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]
- Example
scala> val r1 = sc.parallelize(Array("a", "b", "c", "d", "e", "f", "g", "c", "b", "a", "a"), 3)
r1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> r1.takeOrdered(3)
res6: Array[String] = Array(a, a, a)
scala> val b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"), 2)
b: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> b.takeOrdered(2)
res7: Array[String] = Array(ape, cat)
scala> b.takeOrdered(3)
res8: Array[String] = Array(ape, cat, dog)
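Since takeOrdered accepts an implicit Ordering, you can also pass one explicitly, for example a reversed ordering to get the largest elements, which is essentially what top does:
b.takeOrdered(2)(Ordering[String].reverse)   // Array(salmon, gnu)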
top
What it does
Returns the top n elements of the RDD, i.e. the n largest elements: it is takeOrdered with the ordering reversed.
Implementation
def top(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
takeOrdered(num)(ord.reverse)
}
- Example
scala> val b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"), 2)
b: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> b.top(3)
res12: Array[String] = Array(salmon, gnu, dog)
max
What it does
Returns the largest element in the RDD.
Function signature
def max()(implicit ord: Ordering[T]): T
- Example
scala> val b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"), 2)
scala> b.max()
res13: String = salmon
isEmpty
What it does
Checks whether the RDD is empty.
Function signature
def isEmpty(): Boolean
- Example
scala> val b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"), 2)
b: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> b.isEmpty()
res16: Boolean = false
saveAsTextFile
What it does
Saves the contents of the RDD as text files (one part file per partition).
Function signatures
def saveAsTextFile(path: String): Unit
def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit
- Example
scala> val b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"), 3)
scala> b.saveAsTextFile("/user/zxh/testdata/rddresult1/")
$ hadoop fs -cat /user/zxh/testdata/rddresult1/*
dog
cat
ape
salmon
gnu
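The two-argument overload writes compressed output. A sketch using Hadoop's GzipCodec (the output path here is just an illustration):
import org.apache.hadoop.io.compress.GzipCodec
// each partition is written as a gzip-compressed part file
b.saveAsTextFile("/user/zxh/testdata/rddresult2/", classOf[GzipCodec])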
keyBy
What it does
Builds key-value tuples by applying a function to each element: the function's result becomes the key, and the original element becomes the value of the newly created tuple.
Function signature
def keyBy[K](f: T => K): RDD[(K, T)]
- Example
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
b.collect
res26: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon), (3,rat), (8,elephant))
keys
What it does
Extracts the keys from all contained tuples and returns them as a new RDD.
Function signature
def keys: RDD[K]
- Example
scala> val b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"), 3)
b: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> b.collect()
res2: Array[String] = Array(dog, cat, ape, salmon, gnu)
scala> val c = b.keyBy(_.length)
c: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[4] at keyBy at <console>:26
scala> c.collect()
res4: Array[(Int, String)] = Array((3,dog), (3,cat), (3,ape), (6,salmon), (3,gnu))
scala> c.keys
res5: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[5] at keys at <console>:29
scala> c.keys.collect()
res6: Array[Int] = Array(3, 3, 3, 6, 3)