Spark operators fall into two broad categories: transformations and actions.
1. Transformations
1.1 Value type
1.1.1 map(func): returns a new RDD formed by passing each element of the source RDD through the function func.
scala> val source = sc.parallelize(1 to 10)
source: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:25
scala> source.map(_*2).collect
res18: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
1.1.2 mapPartitions(func): similar to map, but while map operates on each element individually, mapPartitions operates on one whole partition at a time, treating the partition as the unit of computation.
scala> val res=source.mapPartitions(x=>x.map(_*3))
res: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[11] at mapPartitions at <console>:27
scala> res.collect
res21: Array[Int] = Array(3, 6, 9, 12, 15, 18, 21, 24, 27, 30)
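The practical difference is that mapPartitions invokes the function once per partition rather than once per element, which matters when each invocation has setup cost (for example, opening a database connection). A plain-Scala sketch of that idea, with the partition layout of 1 to 10 modeled as nested sequences (the layout is an assumption for illustration, not Spark output):

```scala
// Model an RDD's partitions as a Seq of Seqs (assumed layout of 1 to 10
// across 4 partitions) and count how often the "partition function" runs.
val partitions = Seq(Seq(1, 2), Seq(3, 4, 5), Seq(6, 7), Seq(8, 9, 10))

var calls = 0
val res = partitions.flatMap { part =>
  calls += 1            // one invocation per partition, not per element
  part.map(_ * 3)       // same element-wise work as the map inside
}
println(calls)          // 4 partitions => 4 invocations
println(res)            // all 10 elements, each tripled
```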
1.1.3 mapPartitionsWithIndex(func): similar to mapPartitions, but func takes an extra integer parameter holding the partition index, so when running on an RDD of type T, func must have type (Int, Iterator[T]) => Iterator[U].
scala> val indexRDD=source.mapPartitionsWithIndex((index,items)=>(items.map((index,_))))
indexRDD: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[12] at mapPartitionsWithIndex at <console>:27
scala> indexRDD.collect
res22: Array[(Int, Int)] = Array((0,1), (0,2), (1,3), (1,4), (1,5), (2,6), (2,7), (3,8), (3,9), (3,10))
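The (partitionIndex, element) pairs above can be mirrored in plain Scala by modeling the partitions as a Seq of Seqs (the 4-partition layout is an assumption matching the output) and tagging each element with its partition's index:

```scala
// Plain-Scala sketch of mapPartitionsWithIndex: zipWithIndex supplies the
// partition index, and each element in the partition is paired with it.
val partitions = Seq(Seq(1, 2), Seq(3, 4, 5), Seq(6, 7), Seq(8, 9, 10))
val tagged = partitions.zipWithIndex.flatMap { case (items, index) =>
  items.map(x => (index, x))
}
println(tagged)  // (0,1), (0,2), (1,3), ..., (3,10)
```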
1.1.4 flatMap(func): similar to map, but each input element can be mapped to 0 or more output elements (so func should return a sequence rather than a single element).
scala> val sourceFlat = sc.parallelize(1 to 5)
sourceFlat: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[13] at parallelize at <console>:25
scala> sourceFlat.flatMap(1 to _).collect
res24: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5)
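The same flatMap works identically on a plain Scala collection, which makes it easy to see how `1 to _` expands each element into a range before the results are concatenated:

```scala
// 1 to _ maps each element n to the sequence 1, 2, ..., n; flatMap then
// flattens those sequences into one collection.
val expanded = (1 to 5).flatMap(1 to _)
println(expanded)  // Vector(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5)
```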
1.1.5 glom: turns each partition into an array, yielding a new RDD of type RDD[Array[T]]. (The rdd used in this and the following examples is not defined in the transcript; judging from the output, it was presumably created with something like val rdd = sc.parallelize(1 to 16, 4), i.e. 16 elements across 4 partitions.)
scala> rdd.collect
res26: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
scala> rdd.glom.collect
res27: Array[Array[Int]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8), Array(9, 10, 11, 12), Array(13, 14, 15, 16))
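A rough plain-Scala analogue of that result, assuming the 16 elements are split evenly into 4 partitions, is grouped:

```scala
// grouped(4) splits 1 to 16 into chunks of 4, much like glom turns each
// 4-element partition into an Array.
val chunks = (1 to 16).grouped(4).map(_.toArray).toArray
chunks.foreach(a => println(a.mkString("Array(", ", ", ")")))
```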
1.1.6 groupBy(func): groups the elements by the return value of func; all values mapped to the same key are collected into one iterable.
scala> rdd.collect
res28: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
scala> rdd.groupBy(_%2).collect
res29: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(2, 4, 6, 8, 10, 12, 14, 16)), (1,CompactBuffer(1, 3, 5, 7, 9, 11, 13, 15)))
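Scala collections have a groupBy with the same semantics; on a plain range it returns a Map from key to the grouped values, mirroring the (key, Iterable) pairs above:

```scala
// groupBy(_ % 2) partitions the numbers by parity: key 0 -> evens, 1 -> odds.
val groups = (1 to 16).groupBy(_ % 2)
println(groups(0))  // the even numbers, in encounter order
println(groups(1))  // the odd numbers, in encounter order
```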
1.1.7 filter(func): returns a new RDD consisting of the input elements for which func returns true.
scala> rdd.collect
res30: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
scala> rdd.filter(_%2==0).collect
res33: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16)
1.1.8 sample(withReplacement, fraction, seed): randomly samples the data using the given seed. withReplacement selects sampling with replacement (true) or without (false); fraction is the expected fraction of elements to sample, so the result size is approximate rather than exact; seed seeds the random number generator.
scala> rdd.collect
res34: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
scala> rdd.sample(true,0.4,2).collect
res36: Array[Int] = Array(1, 2, 2, 10, 10, 11, 12, 13, 14)
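The duplicates in the output above come from withReplacement = true. For withReplacement = false, the sampling is Bernoulli: each element is kept independently with probability fraction. A plain-Scala sketch of that idea (bernoulliSample is a hypothetical helper for illustration, not a Spark API, and its results will not match Spark's own sampler):

```scala
import scala.util.Random

// Hypothetical sketch of Bernoulli (withReplacement = false) sampling:
// each element is kept independently with probability `fraction`, so the
// result size is only approximately fraction * n, never guaranteed exact.
def bernoulliSample[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
  val rng = new Random(seed)
  data.filter(_ => rng.nextDouble() < fraction)
}

val data = 1 to 16
val sampled = bernoulliSample(data, 0.4, seed = 2)
println(sampled)  // a subset of data; the same seed always gives the same subset
```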