Spark operators fall into two broad categories: transformations and actions.
1. Transformations
1.1 Value type
1.1.1 map(func): returns a new RDD formed by passing each element of the source RDD through the function func.
scala> var source=sc.parallelize(1 to 10)
source: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:25
scala> source.map(_*2).collect
res18: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
1.1.2 mapPartitions(func): similar to map, but while map operates on individual elements, mapPartitions operates on a whole partition at a time.
scala> val res=source.mapPartitions(x=>x.map(_*3))
res: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[11] at mapPartitions at <console>:27
scala> res.collect
res21: Array[Int] = Array(3, 6, 9, 12, 15, 18, 21, 24, 27, 30)
1.1.3 mapPartitionsWithIndex(func): similar to mapPartitions, but func takes an extra integer argument that is the partition index, so when running on an RDD of type T the function must have type (Int, Iterator[T]) => Iterator[U].
scala> val indexRDD=source.mapPartitionsWithIndex((index,items)=>(items.map((index,_))))
indexRDD: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[12] at mapPartitionsWithIndex at <console>:27
scala> indexRDD.collect
res22: Array[(Int, Int)] = Array((0,1), (0,2), (1,3), (1,4), (1,5), (2,6), (2,7), (3,8), (3,9), (3,10))
1.1.4 flatMap(func): similar to map, but each input element can be mapped to zero or more output elements (so func should return a sequence rather than a single element).
scala> val sourceFlat = sc.parallelize(1 to 5)
sourceFlat: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[13] at parallelize at <console>:25
scala> sourceFlat.flatMap(1 to _).collect
res24: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5)
1.1.5 glom: gathers each partition into an array, producing a new RDD of type RDD[Array[T]].
scala> rdd.collect
res26: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
scala> rdd.glom.collect
res27: Array[Array[Int]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8), Array(9, 10, 11, 12), Array(13, 14, 15, 16))
1.1.6 groupBy(func): groups elements by the return value of the supplied function; values with the same key are placed into one iterator.
scala> rdd.collect
res28: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
scala> rdd.groupBy(_%2).collect
res29: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(2, 4, 6, 8, 10, 12, 14, 16)), (1,CompactBuffer(1, 3, 5, 7, 9, 11, 13, 15)))
1.1.7 filter(func): returns a new RDD consisting of the input elements for which func returns true.
scala> rdd.collect
res30: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
scala> rdd.filter(_%2==0).collect
res33: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16)
1.1.8 sample(withReplacement, fraction, seed): randomly samples roughly the given fraction of the data using the specified random seed. withReplacement controls whether sampling is done with replacement (true) or without replacement (false); seed specifies the seed for the random number generator.
scala> rdd.collect
res34: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
scala> rdd.sample(true,0.4,2).collect
res36: Array[Int] = Array(1, 2, 2, 10, 10, 11, 12, 13, 14)
1.1.9 distinct([numTasks]): removes duplicates from the source RDD and returns a new RDD. By default the result uses the parallelism of the source RDD, but the optional numTasks argument can change it.
scala> val distinctRdd = sc.parallelize(List(1,2,1,5,2,9,6,1))
distinctRdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24] at parallelize at <console>:25
scala> distinctRdd.distinct.collect
res38: Array[Int] = Array(1, 9, 5, 6, 2)
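The optional argument only sets the number of partitions of the result RDD; a minimal sketch in the same spark-shell session:
scala> distinctRdd.distinct(2).partitions.size   // the deduplicated RDD now has 2 partitions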
1.1.10 coalesce(numPartitions): reduces the number of partitions; useful for running a small dataset more efficiently after filtering a large one.
scala> rdd.glom.collect
res40: Array[Array[Int]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8), Array(9, 10, 11, 12), Array(13, 14, 15, 16))
scala> rdd.coalesce(3).glom.collect
res41: Array[Array[Int]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8), Array(9, 10, 11, 12, 13, 14, 15, 16))
1.1.11 repartition(numPartitions): randomly reshuffles all data over the network into the given number of partitions.
scala> rdd.collect
res42: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
scala> rdd.partitions.size
res44: Int = 4
scala> var rerdd=rdd.repartition(2)
rerdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[34] at repartition at <console>:27
scala> rerdd.partitions.size
res45: Int = 2
scala> rerdd.glom.collect
res46: Array[Array[Int]] = Array(Array(1, 3, 5, 7, 9, 11, 13, 15), Array(2, 4, 6, 8, 10, 12, 14, 16))
coalesce can repartition with or without a shuffle, controlled by the parameter shuffle: Boolean (false by default).
repartition is actually implemented by calling coalesce with shuffle = true.
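A minimal sketch of the shuffle flag, in the same session (after a shuffle the exact element-to-partition assignment can vary between runs):
scala> rdd.coalesce(3).glom.collect                  // narrow dependency: partitions are merged without a shuffle
scala> rdd.coalesce(3, shuffle = true).glom.collect  // forces a shuffle, equivalent to repartition(3)
scala> rdd.repartition(3).partitions.size            // repartition(n) is coalesce(n, shuffle = true)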
1.1.12 sortBy(func, [ascending], [numTasks]): applies func to the data first, then sorts the elements by comparing the results; ascending order by default.
scala> rdd.collect
res47: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
scala> rdd.sortBy(x=>x%6).collect
res48: Array[Int] = Array(6, 12, 1, 7, 13, 2, 8, 14, 3, 9, 15, 4, 10, 16, 5, 11)
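The optional ascending flag reverses the order; a one-line sketch in the same session:
scala> rdd.sortBy(x => x % 6, ascending = false).collect   // sorts from the largest remainder (5) down to the smallest (0)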
1.2 Two-value-type operations (between two RDDs)
1.2.1 union(otherDataset): returns a new RDD that is the union of the source RDD and the argument RDD.
scala> val rdd1 = sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[41] at parallelize at <console>:25
scala> val rdd2 = sc.parallelize(5 to 10)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[42] at parallelize at <console>:25
scala> rdd1.union(rdd2).collect
res49: Array[Int] = Array(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10)
1.2.2 subtract(otherDataset): computes the difference of two RDDs: elements that also appear in the argument RDD are removed, and the remaining elements of the source RDD are kept.
scala> val rdd = sc.parallelize(3 to 8)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val rdd1 = sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> rdd.subtract(rdd1).collect
res0: Array[Int] = Array(8, 6, 7)
1.2.3 intersection(otherDataset): returns a new RDD that is the intersection of the source RDD and the argument RDD.
scala> val rdd1 = sc.parallelize(1 to 7)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(5 to 10)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24
scala> rdd1.intersection(rdd2).collect
res0: Array[Int] = Array(5, 6, 7)
1.2.4 cartesian(otherDataset): Cartesian product (avoid whenever possible).
scala> rdd1.cartesian(rdd2).collect
res1: Array[(Int, Int)] = Array((1,5), (1,6), (1,7), (1,8), (1,9), (1,10), (2,5), (3,5), (2,6), (2,7), (3,6), (3,7), (2,8), (3,8), (2,9), (2,10), (3,9), (3,10), (4,5), (5,5), (4,6), (4,7), (5,6), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10), (6,5), (7,5), (6,6), (6,7), (7,6), (7,7), (6,8), (7,8), (6,9), (6,10), (7,9), (7,10))
1.2.5 zip(otherDataset): combines two RDDs into an RDD of key/value pairs. The two RDDs are assumed to have the same number of partitions and the same number of elements per partition; otherwise an exception is thrown.
scala> val rdd1 = sc.parallelize(Array(1,2,3),3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(Array("a","b","c"),3)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[11] at parallelize at <console>:24
scala> rdd1.zip(rdd2).collect
res2: Array[(Int, String)] = Array((1,a), (2,b), (3,c))
1.3 Key-value type
1.3.1 partitionBy: repartitions a pair RDD. If the RDD already has the same partitioner as the one supplied, nothing is done; otherwise a ShuffledRDD is created, i.e. a shuffle takes place.
scala> val rdd = sc.parallelize(Array((1,"aaa"),(2,"bbb"),(3,"ccc"),(4,"ddd")),4)
scala> val rdd2=rdd.partitionBy(new org.apache.spark.HashPartitioner(2))
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[15] at partitionBy at <console>:26
scala> rdd2.partitions.size
res7: Int = 2
1.3.2 groupByKey: groupByKey also operates per key, but it only produces one sequence of values for each key.
scala> val words = Array("one", "two", "two", "three", "three", "three")
words: Array[String] = Array(one, two, two, three, three, three)
scala> val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
scala> val group=wordPairsRDD.groupByKey()
scala> group.collect
res9: Array[(String, Iterable[Int])] = Array((two,CompactBuffer(1, 1)), (one,CompactBuffer(1)), (three,CompactBuffer(1, 1, 1)))
1.3.3 reduceByKey(func, [numTasks]): called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs in which the values of each key are aggregated using the given reduce function; the number of reduce tasks can be set with the optional second argument.
scala> wordPairsRDD.collect
res10: Array[(String, Int)] = Array((one,1), (two,1), (two,1), (three,1), (three,1), (three,1))
scala> wordPairsRDD.reduceByKey((x,y)=>x+y).collect
res11: Array[(String, Int)] = Array((two,2), (one,1), (three,3))
reduceByKey: aggregates by key, with a combine (map-side pre-aggregation) step before the shuffle; the result is an RDD[(K, V)].
groupByKey: groups by key and shuffles the data directly, with no pre-aggregation.
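For a word count the two produce the same result, but reduceByKey pre-aggregates inside each partition before shuffling; a sketch on the wordPairsRDD above:
scala> wordPairsRDD.reduceByKey(_ + _).collect                       // (word, 1) pairs are combined map-side before the shuffle
scala> wordPairsRDD.groupByKey().map(t => (t._1, t._2.sum)).collect  // every (word, 1) pair is shuffled, then summed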
1.3.4 aggregateByKey(zeroValue)(seqOp, combOp): on an RDD of key-value pairs, the values are grouped and merged by key. Within each partition, each value is combined with the accumulator (starting from the initial value) by the seq function; the per-key results of the different partitions are then merged with the combine function (the first two values are combined, the result is combined with the next value, and so on), and each key is emitted with its final result as a new key-value pair.
(1) zeroValue: the initial value given to each key within each partition;
(2) seqOp: the function used within each partition to fold the values into the accumulator, starting from the initial value;
(3) combOp: the function used to merge the results of the different partitions.
Example: take the maximum value for each key within each partition, then add those maxima together.
scala> rdd.glom.collect
res12: Array[Array[(String, Int)]] = Array(Array((a,3), (a,2), (c,4)), Array((b,3), (c,6), (c,8)))
scala> rdd.aggregateByKey(0)(math.max(_,_),_+_).collect
res14: Array[(String, Int)] = Array((b,3), (a,3), (c,12))
1.3.5 foldByKey: a simplified form of aggregateByKey in which seqOp and combOp are the same function.
scala> rdd.glom.collect
res0: Array[Array[(String, Int)]] = Array(Array((a,3), (a,2), (c,4)), Array((b,3), (c,6), (c,8)))
scala> rdd.foldByKey(0)(_+_).collect
res1: Array[(String, Int)] = Array((b,3), (a,5), (c,18))
1.3.6 combineByKey[C]: for each key K, combines all of its values V into a single combined value of type C.
Example: compute the average value for each key (first compute, for each key, the number of occurrences and the sum of its values, then divide the sum by the count).
scala> input.glom.collect
res2: Array[Array[(String, Int)]] = Array(Array((a,88), (b,95), (a,91)), Array((b,93), (a,95), (b,98)))
scala> val combine = input.combineByKey((_,1),(acc:(Int,Int),v)=>(acc._1+v,acc._2+1),(acc1:(Int,Int),acc2:(Int,Int))=>(acc1._1+acc2._1,acc1._2+acc2._2))
combine: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ShuffledRDD[5] at combineByKey at <console>:26
scala> combine.collect
res3: Array[(String, (Int, Int))] = Array((b,(286,3)), (a,(274,3)))
scala> combine.map{case (key,value)=>(key,value._1/value._2.toDouble)}.collect
res4: Array[(String, Double)] = Array((b,95.33333333333333), (a,91.33333333333333))
1.3.7 sortByKey([ascending], [numTasks]): called on an RDD of (K, V) pairs where K implements the Ordered trait, returns an RDD of (K, V) pairs sorted by key.
scala> rdd.collect
res5: Array[(String, Int)] = Array((a,3), (a,2), (c,4), (b,3), (c,6), (c,8))
scala> rdd.sortByKey(true).collect
res6: Array[(String, Int)] = Array((a,3), (a,2), (b,3), (c,4), (c,6), (c,8))
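Passing false sorts by key in descending order; a one-line sketch:
scala> rdd.sortByKey(false).collect   // the c entries come first, then b, then the a entries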
1.3.8 mapValues: on an RDD of (K, V) pairs, applies the function to the values only, leaving the keys unchanged.
scala> rdd.collect
res7: Array[(String, Int)] = Array((a,3), (a,2), (c,4), (b,3), (c,6), (c,8))
scala> rdd.mapValues(_+"swh").collect
res8: Array[(String, String)] = Array((a,3swh), (a,2swh), (c,4swh), (b,3swh), (c,6swh), (c,8swh))
1.3.9 join(otherDataset, [numTasks]): called on RDDs of types (K, V) and (K, W), returns an RDD of (K, (V, W)) pairs with all pairs of elements for each matching key.
scala> val rdd = sc.parallelize(Array((1,"a"),(2,"b"),(3,"c")))
rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[11] at parallelize at <console>:24
scala> val rdd1 = sc.parallelize(Array((1,4),(2,5),(3,6)))
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[12] at parallelize at <console>:24
scala> rdd.join(rdd1).collect
res9: Array[(Int, (String, Int))] = Array((1,(a,4)), (2,(b,5)), (3,(c,6)))
1.3.10 cogroup(otherDataset, [numTasks]): called on RDDs of types (K, V) and (K, W), returns an RDD of type (K, (Iterable<V>, Iterable<W>)).
scala> rdd.cogroup(rdd1).collect
res10: Array[(Int, (Iterable[String], Iterable[Int]))] = Array((1,(CompactBuffer(a),CompactBuffer(4))), (2,(CompactBuffer(b),CompactBuffer(5))), (3,(CompactBuffer(c),CompactBuffer(6))))
2. Actions
2.1 reduce(func): aggregates all elements of the RDD using func, first within each partition and then across partitions.
scala> rdd1.collect
res11: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> rdd1.reduce(_+_)
res13: Int = 55
2.2 collect(): returns all elements of the dataset to the driver program as an array.
scala> val rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd.collect
res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
2.3 count(): returns the number of elements in the RDD.
scala> rdd.count
res1: Long = 10
2.4 first(): returns the first element of the RDD.
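For example, with the 1-to-10 RDD created in 2.2:
scala> rdd.first   // returns 1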
2.5 take(n): returns an array of the first n elements of the RDD.
scala> rdd.take(5)
res2: Array[Int] = Array(1, 2, 3, 4, 5)
2.6 takeOrdered(n): returns an array of the first n elements of the RDD after sorting.
scala> val rdd = sc.parallelize(Array(2,5,4,6,8,3))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> rdd.takeOrdered(4)
res3: Array[Int] = Array(2, 3, 4, 5)
2.7 aggregate(zeroValue)(seqOp, combOp): aggregates the elements within each partition using seqOp and the initial value, then combines the per-partition results with the initial value (zeroValue) using combOp. The final return type does not have to match the element type of the RDD.
scala> val rdd = sc.parallelize(Array(2,5,4,6,8,3))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> rdd.aggregate(0)(_+_,_+_)
res4: Int = 28
2.8 fold(zeroValue)(func): a fold operation; a simplified form of aggregate in which seqOp and combOp are the same function.
scala> val rdd = sc.parallelize(Array(2,5,4,6,8,3))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> rdd.fold(0)(_+_)
res5: Int = 28
2.9 saveAsTextFile(path): writes the elements of the dataset as a text file to HDFS or another supported file system. Spark calls toString on each element to convert it to a line of text in the file.
2.10 saveAsSequenceFile(path): writes the elements of the dataset in the Hadoop SequenceFile format to the given directory, which can be on HDFS or any other Hadoop-supported file system.
2.11 saveAsObjectFile(path): serializes the elements of the RDD as objects and stores them in a file.
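A minimal sketch of the three save actions; the output paths below are only placeholders (the directories must not already exist), and the pair mapping is there because saveAsSequenceFile requires key-value records:
scala> val nums = sc.parallelize(1 to 10)
scala> nums.saveAsTextFile("/tmp/spark-text-out")                         // one line per element, produced via toString
scala> nums.map(x => (x, x * 2)).saveAsSequenceFile("/tmp/spark-seq-out") // (K, V) records written as a Hadoop SequenceFile
scala> nums.saveAsObjectFile("/tmp/spark-obj-out")                        // elements serialized and written as object files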
2.12 countByKey(): on an RDD of (K, V) pairs, returns a (K, Long) map giving the number of elements for each key.
scala> rdd.collect
res6: Array[(Int, Int)] = Array((1,3), (1,2), (1,4), (2,3), (3,6), (3,8))
scala> rdd.countByKey
res7: scala.collection.Map[Int,Long] = Map(3 -> 2, 1 -> 3, 2 -> 1)
2.13 foreach(func): runs the function func on each element of the dataset, e.g. to perform updates or other side effects.
scala> rdd.foreach(println)
(1,3)
(1,2)
(1,4)
(3,6)
(3,8)
(2,3)