All transformations on an RDD are lazy: they do not compute their results right away. Instead, they simply remember the transformations applied to the base dataset (for example, a file). The transformations are only computed when an action requires a result to be returned to the Driver. This design lets Spark run more efficiently.
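As a minimal sketch of this laziness (an added illustration, assuming the sc SparkContext that spark-shell provides): the map below only records lineage; nothing is computed until an action such as collect is called.
val doubled = sc.parallelize(1 to 5).map { x => println(s"mapping $x"); x * 2 }  // lazy: no computation happens here
doubled.collect()  // the action triggers the job; only now does the map function actually run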
Commonly used transformations:
map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, sample, takeSample, union, intersection, distinct, partitionBy, reduceByKey, groupByKey, combineByKey, aggregateByKey, foldByKey, sortByKey, sortBy, join, cogroup, cartesian, pipe, coalesce, repartition, repartitionAndSortWithinPartitions, glom, mapValues, subtract
map(func): returns a new RDD formed by passing each element of the source RDD through the function func.
map converts each element of the source RDD into a new element via the user-defined function f. In the Spark source, the map operator essentially constructs a new RDD, MappedRDD(this, sc.clean(f)).
scala> val rdd = sc.makeRDD(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at makeRDD at <console>:24
scala> rdd.map(_*2).collect
res4: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
filter(func): returns a new RDD formed by selecting those elements of the source RDD on which func returns true.
scala> val rdd = sc.makeRDD(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at makeRDD at <console>:24
scala> rdd.filter(_%2==0).collect
res5: Array[Int] = Array(2, 4, 6, 8, 10)
flatMap(func): similar to map, but each input element can be mapped to 0 or more output elements (so func should return a sequence rather than a single element).
flatMap converts each element of the source RDD into a collection via the function f and flattens the resulting collections into a single RDD. Internally it creates FlatMappedRDD(this, sc.clean(f)).
scala> val rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24
scala> val flatMap = rdd.flatMap(1 to _)
flatMap: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[10] at flatMap at <console>:26
scala> flatMap.collect
res6: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
mapPartitions(func): similar to map, but runs independently on each partition of the RDD, so when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U]. If there are N elements in M partitions, the map function is invoked N times while the mapPartitions function is invoked only M times, one call processing an entire partition at a time.
# create the RDD with 4 partitions
scala> val rdd = sc.parallelize(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:24
scala> rdd.partitions.size
res13: Int = 4
# items is an iterator over the data in one partition
scala> rdd.mapPartitions(items => items.map(_*2)).collect
res14: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
mapPartitionsWithIndex(func): similar to mapPartitions, but func takes an extra integer parameter giving the index of the partition, so when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U].
scala> val rdd = sc.parallelize(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[22] at parallelize at <console>:24
scala> rdd.mapPartitionsWithIndex((index,items) => Iterator(index + ":" + items.toList)).collect
res18: Array[String] = Array(0:List(1, 2), 1:List(3, 4, 5), 2:List(6, 7), 3:List(8, 9, 10))
sample(withReplacement, fraction, seed): samples a fraction of the data using the given random seed. withReplacement indicates whether the sampling is done with replacement (true) or without (false), and seed seeds the random number generator. The example below samples roughly 40% of the RDD with replacement, using seed 2.
scala> val rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[44] at parallelize at <console>:24
scala> rdd.sample(true,0.4,2).collect
res36: Array[Int] = Array(1, 2, 2)
takeSample: differs from sample in that takeSample is an action: it takes an exact number of elements rather than a fraction and returns the final sampled collection (an array) directly to the driver.
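A minimal takeSample sketch (an added illustration, assuming the spark-shell sc):
val rdd = sc.parallelize(1 to 10)
// withReplacement = false, num = 3, seed = 1; the result is an Array[Int] of 3 elements on the driver
val sampled = rdd.takeSample(false, 3, 1)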
union(otherDataset): returns a new RDD containing the union of the elements in the source RDD and the argument RDD.
scala> val rdd1 = sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[54] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(5 to 10)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[55] at parallelize at <console>:24
scala> rdd1.union(rdd2).collect
res44: Array[Int] = Array(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10)
intersection(otherDataset): returns a new RDD containing only the elements present in both the source RDD and the argument RDD.
scala> val rdd1 = sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[73] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(3 to 7)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[74] at parallelize at <console>:24
scala> rdd1.intersection(rdd2).collect
res50: Array[Int] = Array(4, 3, 5)
distinct([numTasks]): returns a new RDD containing the distinct elements of the source RDD, with duplicates removed. The optional numTasks parameter sets the number of parallel tasks.
scala> val rdd = sc.parallelize(List(1,2,3,4,4,2,1))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[82] at parallelize at <console>:24
scala> rdd.distinct.collect
res51: Array[Int] = Array(4, 2, 1, 3)
partitionBy: repartitions the RDD using the given partitioner. If the RDD's existing partitioner is the same as the one supplied, no repartitioning happens; otherwise a ShuffledRDD is produced.
scala> val rdd = sc.parallelize(Array((1,"aaa"),(2,"bbb"),(3,"ccc"),(4,"ddd")),4)
rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[90] at parallelize at <console>:24
scala> rdd.collect
res55: Array[(Int, String)] = Array((1,aaa), (2,bbb), (3,ccc), (4,ddd))
scala> val rdd1 = rdd.partitionBy(new org.apache.spark.HashPartitioner(2))
rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[91] at partitionBy at <console>:26
scala> rdd1.collect
res56: Array[(Int, String)] = Array((2,bbb), (4,ddd), (1,aaa), (3,ccc))
reduceByKey(func, [numTasks]): when called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs in which the values for each key are aggregated using the given reduce function. The number of reduce tasks can be set via the optional second argument.
scala> val words = Array("one","two","two","three","three","three")
words: Array[String] = Array(one, two, two, three, three, three)
scala> val wordPairsRDD = sc.parallelize(words).map((_,1))
wordPairsRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[100] at map at <console>:26
scala> val reduced = wordPairsRDD.reduceByKey(_+_)
reduced: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[101] at reduceByKey at <console>:28
scala> reduced.collect
res61: Array[(String, Int)] = Array((two,2), (one,1), (three,3))
groupByKey: also operates per key, but simply collects the values for each key into a single sequence.
scala> val words = Array("one","two","two","three","three","three")
words: Array[String] = Array(one, two, two, three, three, three)
scala> val wordPairsRDD = sc.parallelize(words).map((_,1))
wordPairsRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[100] at map at <console>:26
scala> val group = wordPairsRDD.groupByKey
group: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[101] at groupByKey at <console>:28
scala> group.collect
res61: Array[(String, Iterable[Int])] = Array((two,CompactBuffer(1, 1)), (one,CompactBuffer(1)), (three,CompactBuffer(1, 1, 1)))
combineByKey[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C)
Merges the values V for each key K into a combined result of type C.
createCombiner: combineByKey() traverses all the elements in a partition, so each element's key is either new or the same as the key of a previously seen element. If the key is new, combineByKey() uses the createCombiner() function to create the initial value of that key's accumulator.
mergeValue: if the key has already been seen while processing the current partition, the mergeValue() function merges the accumulator's current value for that key with the new value.
mergeCombiners: because each partition is processed independently, the same key can end up with multiple accumulators. If two or more partitions have an accumulator for the same key, the user-supplied mergeCombiners() function merges their results.
scala> val scores = Array(("Fred",88),("Fred",95),("Fred",91),("Wilma",93),("Wilma",98))
scores: Array[(String, Int)] = Array((Fred,88), (Fred,95), (Fred,91), (Wilma,93), (Wilma,98))
scala> val input = sc.makeRDD(scores)
input: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[113] at makeRDD at <console>:26
scala> val combine = input.combineByKey(
| v=>(v,1),
| (c:(Int,Int),v)=>(c._1+v,c._2+1),
| (c1:(Int,Int),c2:(Int,Int))=>(c1._1+c2._1,c1._2+c2._2))
combine: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ShuffledRDD[114] at combineByKey at <console>:28
scala> combine.map{case (key,value) => (key,value._1/value._2.toDouble)}.collect
res75: Array[(String, Double)] = Array((Wilma,95.5), (Fred,91.33333333333333))
aggregateByKey(zeroValue: U, [partitioner: Partitioner])(seqOp: (U, V) => U, combOp: (U, U) => U)
On an RDD of (K, V) pairs, groups the values by key and aggregates them. Within each partition, each value for a key is folded into the zero value with the seqOp function, producing one intermediate result per key per partition; the per-partition results for each key are then merged with the combOp function (the first two results are combined, that result is combined with the next, and so on), and the key together with the final result is emitted as a new (K, U) pair. In short, seqOp iterates over the values within a partition starting from the initial value, and combOp merges the per-partition results.
scala> val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),3)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[12] at parallelize at <console>:24
scala> val agg = rdd.aggregateByKey(0)(math.max(_,_),_+_)
agg: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[13] at aggregateByKey at <console>:26
scala> agg.collect()
res7: Array[(Int, Int)] = Array((3,8), (1,7), (2,3))
scala> agg.partitions.size
res8: Int = 3
scala> val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),1)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[10] at parallelize at <console>:24
scala> val agg = rdd.aggregateByKey(0)(math.max(_,_),_+_).collect()
agg: Array[(Int, Int)] = Array((1,4), (3,8), (2,3))
foldByKey(zeroValue: V)(func:(V, V) => V): RDD[(K, V)]
A simplified form of aggregateByKey in which seqOp and combOp are the same function.
scala> val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),3)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[91] at parallelize at <console>:24
scala> val agg = rdd.foldByKey(0)(_+_)
agg: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[92] at foldByKey at <console>:26
scala> agg.collect()
res61: Array[(Int, Int)] = Array((3,14), (1,9), (2,3))
sortByKey([ascending], [numTasks]): when called on an RDD of (K, V) pairs where K implements the Ordered trait, returns an RDD of (K, V) pairs sorted by key.
scala> val rdd = sc.parallelize(Array((3,"aa"),(6,"cc"),(2,"bb"),(1,"dd")))
rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[14] at parallelize at <console>:24
scala> rdd.sortByKey(true).collect()
res9: Array[(Int, String)] = Array((1,dd), (2,bb), (3,aa), (6,cc))
scala> rdd.sortByKey(false).collect()
res10: Array[(Int, String)] = Array((6,cc), (3,aa), (2,bb), (1,dd))
sortBy(func, [ascending], [numTasks]): similar to sortByKey but more flexible: func is applied to each element first, and the data is sorted by comparing the results.
scala> val rdd = sc.parallelize(List(1,2,3,4))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[21] at parallelize at <console>:24
scala> rdd.sortBy(x => x).collect()
res11: Array[Int] = Array(1, 2, 3, 4)
scala> rdd.sortBy(x => x%3).collect()
res12: Array[Int] = Array(3, 4, 1, 2)
join(otherDataset, [numTasks]): when called on RDDs of type (K, V) and (K, W), returns an RDD of (K, (V, W)) pairs containing all pairs of elements for each matching key.
scala> val rdd = sc.parallelize(Array((1,"a"),(2,"b"),(3,"c")))
rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[32] at parallelize at <console>:24
scala> val rdd1 = sc.parallelize(Array((1,4),(2,5),(3,6)))
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[33] at parallelize at <console>:24
scala> rdd.join(rdd1).collect()
res13: Array[(Int, (String, Int))] = Array((1,(a,4)), (2,(b,5)), (3,(c,6)))
cogroup(otherDataset, [numTasks]): when called on RDDs of type (K, V) and (K, W), returns an RDD of type (K, (Iterable<V>, Iterable<W>)).
scala> val rdd = sc.parallelize(Array((1,"a"),(2,"b"),(3,"c")))
rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[37] at parallelize at <console>:24
scala> val rdd1 = sc.parallelize(Array((1,4),(2,5),(3,6)))
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[38] at parallelize at <console>:24
scala> rdd.cogroup(rdd1).collect()
res14: Array[(Int, (Iterable[String], Iterable[Int]))] = Array((1,(CompactBuffer(a),CompactBuffer(4))), (2,(CompactBuffer(b),CompactBuffer(5))), (3,(CompactBuffer(c),CompactBuffer(6))))
scala> val rdd2 = sc.parallelize(Array((4,4),(2,5),(3,6)))
rdd2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[41] at parallelize at <console>:24
scala> rdd.cogroup(rdd2).collect()
res15: Array[(Int, (Iterable[String], Iterable[Int]))] = Array((4,(CompactBuffer(),CompactBuffer(4))), (1,(CompactBuffer(a),CompactBuffer())), (2,(CompactBuffer(b),CompactBuffer(5))), (3,(CompactBuffer(c),CompactBuffer(6))))
scala> val rdd3 = sc.parallelize(Array((1,"a"),(1,"d"),(2,"b"),(3,"c")))
rdd3: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[44] at parallelize at <console>:24
scala> rdd3.cogroup(rdd2).collect()
res16: Array[(Int, (Iterable[String], Iterable[Int]))] = Array((4,(CompactBuffer(),CompactBuffer(4))), (1,(CompactBuffer(d, a),CompactBuffer())), (2,(CompactBuffer(b),CompactBuffer(5))), (3,(CompactBuffer(c),CompactBuffer(6))))
cartesian(otherDataset): returns the Cartesian product of the source RDD and the argument RDD.
scala> val rdd1 = sc.parallelize(1 to 3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[47] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(2 to 5)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[48] at parallelize at <console>:24
scala> rdd1.cartesian(rdd2).collect()
res17: Array[(Int, Int)] = Array((1,2), (1,3), (1,4), (1,5), (2,2), (2,3), (2,4), (2,5), (3,2), (3,3), (3,4), (3,5))
pipe(command, [envVars]): for each partition, runs an external command (for example a Perl or shell script); the partition's elements are piped into the process and the lines it prints form the resulting RDD.
Note: the script must be accessible on every node in the cluster. A sketch is given below.
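A minimal pipe sketch (an added illustration; it assumes grep is installed on every worker node):
val lines = sc.parallelize(List("hi", "hello", "how", "are", "you"), 2)
// each partition's elements are written to the command's stdin, one per line;
// every line the command writes to stdout becomes an element of the new RDD
val piped = lines.pipe("grep h")  // example external command, not from the original text
piped.collect()  // keeps only the elements containing the letter h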
coalesce(numPartitions): reduces the number of partitions; useful for running operations more efficiently on a small dataset produced by filtering a large one.
scala> val rdd = sc.parallelize(1 to 16,4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[54] at parallelize at <console>:24
scala> rdd.partitions.size
res20: Int = 4
scala> val coalesceRDD = rdd.coalesce(3)
coalesceRDD: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[55] at coalesce at <console>:26
scala> coalesceRDD.partitions.size
res21: Int = 3
repartition(numPartitions): reshuffles all the data in the RDD randomly across the network to produce the given number of partitions.
scala> val rdd = sc.parallelize(1 to 16,4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[56] at parallelize at <console>:24
scala> rdd.partitions.size
res22: Int = 4
scala> val rerdd = rdd.repartition(2)
rerdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[60] at repartition at <console>:26
scala> rerdd.partitions.size
res23: Int = 2
scala> val rerdd = rdd.repartition(4)
rerdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[64] at repartition at <console>:26
scala> rerdd.partitions.size
res24: Int = 4
repartitionAndSortWithinPartitions(partitioner)
repartitionAndSortWithinPartitions is a variant of repartition. Unlike repartition, it sorts the records by key within each partition produced by the given partitioner as part of the shuffle, so it performs better than calling repartition and then sorting; see the sketch below.
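A minimal sketch (an added illustration, assuming the spark-shell sc): the operator is defined on key-value RDDs and takes a Partitioner.
import org.apache.spark.HashPartitioner
val pairs = sc.parallelize(List((3,"c"),(1,"a"),(4,"d"),(2,"b")), 2)
// a single shuffle: hash-partition into 2 partitions and sort each output partition by key
val resorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))
resorted.glom().collect()  // each inner array comes back sorted by key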
glom: assembles the elements of each partition into an array, producing a new RDD of type RDD[Array[T]].
scala> val rdd = sc.parallelize(1 to 16,4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[65] at parallelize at <console>:24
scala> rdd.glom().collect()
res25: Array[Array[Int]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8), Array(9, 10, 11, 12), Array(13, 14, 15, 16))
mapValues: for an RDD of (K, V) pairs, applies the function only to the values, leaving the keys unchanged.
scala> val rdd3 = sc.parallelize(Array((1,"a"),(1,"d"),(2,"b"),(3,"c")))
rdd3: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[67] at parallelize at <console>:24
scala> rdd3.mapValues(_+"|||").collect()
res26: Array[(Int, String)] = Array((1,a|||), (1,d|||), (2,b|||), (3,c|||))
subtract: computes the set difference of two RDDs: elements of the source RDD that also appear in the argument RDD are removed, and the remaining elements are kept.
scala> val rdd = sc.parallelize(3 to 8)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[70] at parallelize at <console>:24
scala> val rdd1 = sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[71] at parallelize at <console>:24
scala> rdd.subtract(rdd1).collect()
res27: Array[Int] = Array(8, 6, 7)