Spark operators fall into two broad categories: transformations and actions.
1. Transformations
1.1 Value type
1.1.1 map(func): returns a new RDD formed by passing each element of the source RDD through the function func.
scala> val source = sc.parallelize(1 to 10)
source: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:25
scala> source.map(_*2).collect
res18: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
1.1.2 mapPartitions(func): similar to map, but while map operates on each element individually, mapPartitions operates on one whole partition at a time, treating the partition as the unit of computation.
scala> val res=source.mapPartitions(x=>x.map(_*3))
res: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[11] at mapPartitions at <console>:27
scala> res.collect
res21: Array[Int] = Array(3, 6, 9, 12, 15, 18, 21, 24, 27, 30)
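The practical difference is that mapPartitions invokes the function once per partition rather than once per element, which matters when each invocation has setup cost (for example, opening a database connection). A plain-Scala sketch of that idea, with the partition layout of 1 to 10 modeled as nested sequences (the layout is an assumption for illustration, not Spark output):

```scala
// Model an RDD's partitions as a Seq of Seqs (assumed layout of 1 to 10
// across 4 partitions) and count how often the "partition function" runs.
val partitions = Seq(Seq(1, 2), Seq(3, 4, 5), Seq(6, 7), Seq(8, 9, 10))

var calls = 0
val res = partitions.flatMap { part =>
  calls += 1            // one invocation per partition, not per element
  part.map(_ * 3)       // same element-wise work as the map inside
}
println(calls)          // 4 partitions => 4 invocations
println(res)            // all 10 elements, each tripled
```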
1.1.3 mapPartitionsWithIndex(func): similar to mapPartitions, but func takes an extra integer parameter holding the partition index, so when running on an RDD of type T, func must have type (Int, Iterator[T]) => Iterator[U].
scala> val indexRDD=source.mapPartitionsWithIndex((index,items)=>(items.map((index,_))))
indexRDD: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[12] at mapPartitionsWithIndex at <console>:27
scala> indexRDD.collect
res22: Array[(Int, Int)] = Array((0,1), (0,2), (1,3), (1,4), (1,5), (2,6), (2,7), (3,8), (3,9), (3,10))
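The (partitionIndex, element) pairs above can be mirrored in plain Scala by modeling the partitions as a Seq of Seqs (the 4-partition layout is an assumption matching the output) and tagging each element with its partition's index:

```scala
// Plain-Scala sketch of mapPartitionsWithIndex: zipWithIndex supplies the
// partition index, and each element in the partition is paired with it.
val partitions = Seq(Seq(1, 2), Seq(3, 4, 5), Seq(6, 7), Seq(8, 9, 10))
val tagged = partitions.zipWithIndex.flatMap { case (items, index) =>
  items.map(x => (index, x))
}
println(tagged)  // (0,1), (0,2), (1,3), ..., (3,10)
```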
1.1.4 flatMap(func): similar to map, but each input element can be mapped to 0 or more output elements (so func should return a sequence rather than a single element).
scala> val sourceFlat = sc.parallelize(1 to 5)
sourceFlat: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[13] at parallelize at <console>:25
scala> sourceFlat.flatMap(1 to _).collect
res24: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5)
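The same flatMap works identically on a plain Scala collection, which makes it easy to see how `1 to _` expands each element into a range before the results are concatenated:

```scala
// 1 to _ maps each element n to the sequence 1, 2, ..., n; flatMap then
// flattens those sequences into one collection.
val expanded = (1 to 5).flatMap(1 to _)
println(expanded)  // Vector(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5)
```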
1.1.5 glom: turns each partition into an array, yielding a new RDD of type RDD[Array[T]]. (The rdd used in this and the following examples is not defined in the transcript; judging from the output, it was presumably created with something like val rdd = sc.parallelize(1 to 16, 4), i.e. 16 elements across 4 partitions.)
scala> rdd.collect
res26: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
scala> rdd.glom.collect
res27: Array[Array[Int]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8), Array(9, 10, 11, 12), Array(13, 14, 15, 16))
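A rough plain-Scala analogue of that result, assuming the 16 elements are split evenly into 4 partitions, is grouped:

```scala
// grouped(4) splits 1 to 16 into chunks of 4, much like glom turns each
// 4-element partition into an Array.
val chunks = (1 to 16).grouped(4).map(_.toArray).toArray
chunks.foreach(a => println(a.mkString("Array(", ", ", ")")))
```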
1.1.6 groupBy(func): groups the elements by the return value of func; all values mapped to the same key are collected into one iterable.
scala> rdd.collect
res28: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
scala> rdd.groupBy(_%2).collect
res29: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(2, 4, 6, 8, 10, 12, 14, 16)), (1,CompactBuffer(1, 3, 5, 7, 9, 11, 13, 15)))
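Scala collections have a groupBy with the same semantics; on a plain range it returns a Map from key to the grouped values, mirroring the (key, Iterable) pairs above:

```scala
// groupBy(_ % 2) partitions the numbers by parity: key 0 -> evens, 1 -> odds.
val groups = (1 to 16).groupBy(_ % 2)
println(groups(0))  // the even numbers, in encounter order
println(groups(1))  // the odd numbers, in encounter order
```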
1.1.7 filter(func): returns a new RDD consisting of the input elements for which func returns true.
scala> rdd.collect
res30: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
scala> rdd.filter(_%2==0).collect
res33: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16)
1.1.8 sample(withReplacement, fraction, seed): randomly samples the data using the given seed. withReplacement selects sampling with replacement (true) or without (false); fraction is the expected fraction of elements to sample, so the result size is approximate rather than exact; seed seeds the random number generator.
scala> rdd.collect
res34: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
scala> rdd.sample(true,0.4,2).collect
res36: Array[Int] = Array(1, 2, 2, 10, 10, 11, 12, 13, 14)
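The duplicates in the output above come from withReplacement = true. For withReplacement = false, the sampling is Bernoulli: each element is kept independently with probability fraction. A plain-Scala sketch of that idea (bernoulliSample is a hypothetical helper for illustration, not a Spark API, and its results will not match Spark's own sampler):

```scala
import scala.util.Random

// Hypothetical sketch of Bernoulli (withReplacement = false) sampling:
// each element is kept independently with probability `fraction`, so the
// result size is only approximately fraction * n, never guaranteed exact.
def bernoulliSample[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
  val rng = new Random(seed)
  data.filter(_ => rng.nextDouble() < fraction)
}

val data = 1 to 16
val sampled = bernoulliSample(data, 0.4, seed = 2)
println(sampled)  // a subset of data; the same seed always gives the same subset
```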