map
Transforms an RDD[U] into an RDD[T]. The user supplies a function func: U => T that is applied to every element.
scala> var rdd:RDD[String]=sc.makeRDD(List("a","b","c","a"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[120] at makeRDD at
<console>:25
scala> val mapRDD:RDD[(String,Int)] = rdd.map(w => (w, 1))
mapRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[121] at map at
<console>:26
filter
Filters the elements of an RDD[U], producing a new RDD[U]. The user supplies func: U => Boolean, and only the elements for which the function returns true are kept.
scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[122] at makeRDD at
<console>:25
scala> val mapRDD:RDD[Int]=rdd.filter(num=> num %2 == 0)
mapRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[123] at filter at
<console>:26
scala> mapRDD.collect
res63: Array[Int] = Array(2, 4)
flatMap
Similar to map, it also transforms an RDD[U] into an RDD[T], but the user supplies a function that returns a collection of results per input element, func: U => TraversableOnce[T], and the collections are flattened into the output.
scala> var rdd:RDD[String]=sc.makeRDD(List("this is","good good"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[124] at makeRDD at
<console>:25
scala> var flatMapRDD:RDD[(String,Int)]=rdd.flatMap( line=>
line.split("\\s+").map((_,1)))
flatMapRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[126] at flatMap
at <console>:26
scala> flatMapRDD.collect
res64: Array[(String, Int)] = Array((this,1), (is,1), (good,1), (good,1))
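For contrast, plain map with the same splitting logic would keep the per-line arrays nested instead of flattening them; a quick check in the same shell (output shown as a comment):

rdd.map(line => line.split("\\s+").map((_, 1))).collect
// Array(Array((this,1), (is,1)), Array((good,1), (good,1)))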
mapPartitions
Similar to map, but the input is the full data of one partition, so the user supplies a per-partition transformation func: Iterator[U] => Iterator[T].
scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[128] at makeRDD at
<console>:25
scala> var mapPartitionsRDD=rdd.mapPartitions(values => values.map(n=>(n,n%2==0)),true)
mapPartitionsRDD: org.apache.spark.rdd.RDD[(Int, Boolean)] = MapPartitionsRDD[129] at
mapPartitions at <console>:26
scala> mapPartitionsRDD.collect
res70: Array[(Int, Boolean)] = Array((1,false), (2,true), (3,false), (4,true),
(5,false))
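A common reason to prefer mapPartitions over map is to pay a setup cost once per partition rather than once per element (opening a database connection, for example). A minimal sketch, assuming the same spark-shell session; ExpensiveResource is a hypothetical placeholder, not a real API:

// hypothetical stand-in for something costly to construct
class ExpensiveResource extends Serializable {
  def transform(n: Int): String = "value-" + n
}

val data = sc.makeRDD(List(1, 2, 3, 4, 5), 2)
val out = data.mapPartitions { values =>
  val resource = new ExpensiveResource // constructed once per partition
  values.map(resource.transform)       // reused for every element in the partition
}
out.collect // Array(value-1, value-2, value-3, value-4, value-5)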
mapPartitionsWithIndex
Same as mapPartitions, except that the partition index is passed to the user function as an additional argument.
scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6),2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[139] at makeRDD at
<console>:25
scala> var mapPartitionsWithIndexRDD=rdd.mapPartitionsWithIndex((p,values) =>
values.map(n=>(n,p)))
mapPartitionsWithIndexRDD: org.apache.spark.rdd.RDD[(Int, Int)] =
MapPartitionsRDD[140] at mapPartitionsWithIndex at <console>:26
scala> mapPartitionsWithIndexRDD.collect
res77: Array[(Int, Int)] = Array((1,0), (2,0), (3,0), (4,1), (5,1), (6,1))
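Because the function also receives the partition index, a single partition can be treated specially, for example dropping a header row known to sit in partition 0. A minimal sketch under the same assumptions:

val lines = sc.makeRDD(List("name,age", "tom,21", "jerry,19"), 1)
// drop the first element of partition 0 only (the header row)
val records = lines.mapPartitionsWithIndex((p, iter) => if (p == 0) iter.drop(1) else iter)
records.collect // Array(tom,21, jerry,19)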
sample
sample( withReplacement , fraction , seed )
Draws a sample of the RDD's elements: withReplacement controls whether an element may be drawn more than once, fraction sets the approximate proportion of elements to keep, and seed drives the random number generator used during sampling.
scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[150] at makeRDD at
<console>:25
scala> var simpleRDD:RDD[Int]=rdd.sample(false,0.5d,1L)
simpleRDD: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[7] at sample at <console>:26
scala> simpleRDD.collect
res6: Array[Int] = Array(1, 5, 6)
scala> var simpleRDD:RDD[Int]=rdd.sample(false,0.5d,2L)
simpleRDD: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[8] at sample at <console>:26
scala> simpleRDD.collect
res7: Array[Int] = Array(2, 3)
scala> var simpleRDD:RDD[Int]=rdd.sample(false,0.5d,2L)
simpleRDD: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[10] at sample at <console>:26
scala> simpleRDD.collect
res9: Array[Int] = Array(2, 3)
The first argument controls whether sampling is done with replacement, the second the (approximate) fraction of elements to draw, and the third the seed; as the repeated runs above show, the same seed always yields the same sample.
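Because fraction only sets the expected proportion, the actual sample size fluctuates around fraction * count; this is easy to observe on a larger RDD:

val big = sc.makeRDD(0 until 10000)
big.sample(withReplacement = false, fraction = 0.1).count // roughly 1000, varies with the seed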
union( otherDataset )
Merges the elements of two RDDs of the same element type.
scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[154] at makeRDD at
<console>:25
scala> var rdd2:RDD[Int]=sc.makeRDD(List(6,7))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[155] at makeRDD at
<console>:25
scala> rdd.union(rdd2).collect
res95: Array[Int] = Array(1, 2, 3, 4, 5, 6, 6, 7)
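Note that union simply concatenates the two RDDs and keeps duplicates (6 appears twice above); chaining distinct afterwards gives set-style union semantics:

rdd.union(rdd2).distinct().collect // the duplicate 6 collapses to a single element (order not guaranteed)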
intersection( otherDataset )
Computes the intersection of the elements of two RDDs of the same element type.
scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[154] at makeRDD at
<console>:25
scala> var rdd2:RDD[Int]=sc.makeRDD(List(6,7))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[155] at makeRDD at
<console>:25
scala> rdd.intersection(rdd2).collect
res100: Array[Int] = Array(6)
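Unlike union, intersection also removes duplicates within each input (and triggers a shuffle), so every common element appears exactly once:

val a = sc.makeRDD(List(1, 1, 2, 3))
val b = sc.makeRDD(List(1, 1, 3, 4))
a.intersection(b).collect // Array(1, 3) in some order, each element exactly once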
distinct([ numPartitions ])
Removes duplicate elements from the RDD. The optional numPartitions argument sets the partition count of the result; when deduplication shrinks the data by orders of magnitude, passing a smaller numPartitions is a good way to reduce the number of partitions.
scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[154] at makeRDD at
<console>:25
scala> rdd.distinct(3).collect
res106: Array[Int] = Array(6, 3, 4, 1, 5, 2)
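To see the partition shrink described above, compare partition counts before and after deduplicating a deliberately over-partitioned RDD:

val many = sc.makeRDD(1 to 1000, 100)
many.getNumPartitions              // Int = 100
many.distinct(3).getNumPartitions  // Int = 3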
join( otherDataset , [ numPartitions ])
When called on an RDD[(K,V)] and an RDD[(K,W)], returns a new RDD[(K,(V,W))] (an inner join by default); leftOuterJoin, rightOuterJoin, and fullOuterJoin are also supported.
scala> var userRDD:RDD[(Int,String)]=sc.makeRDD(List((1,"zhangsan"),(2,"lisi")))
userRDD: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[204] at
makeRDD at <console>:25
scala> case class OrderItem(name:String,price:Double,count:Int)
defined class OrderItem
scala> var
orderItemRDD:RDD[(Int,OrderItem)]=sc.makeRDD(List((1,OrderItem("apple",4.5,2))))
orderItemRDD: org.apache.spark.rdd.RDD[(Int, OrderItem)] = ParallelCollectionRDD[206]
at makeRDD at <console>:27
scala> userRDD.join(orderItemRDD).collect
res107: Array[(Int, (String, OrderItem))] = Array((1,
(zhangsan,OrderItem(apple,4.5,2))))
scala> userRDD.leftOuterJoin(orderItemRDD).collect
res108: Array[(Int, (String, Option[OrderItem]))] = Array((1,
(zhangsan,Some(OrderItem(apple,4.5,2)))), (2,(lisi,None)))
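The other variants differ only in which side becomes Option-wrapped; continuing the same session (element order in the output may vary):

// right side always present, left side optional: RDD[(Int, (Option[String], OrderItem))]
userRDD.rightOuterJoin(orderItemRDD).collect
// Array((1,(Some(zhangsan),OrderItem(apple,4.5,2))))

// both sides optional: RDD[(Int, (Option[String], Option[OrderItem]))]
userRDD.fullOuterJoin(orderItemRDD).collect
// Array((1,(Some(zhangsan),Some(OrderItem(apple,4.5,2)))), (2,(Some(lisi),None)))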
repartition( numPartitions )
Similar to coalesce, but this operator can either increase or decrease the RDD's partition count (it always performs a full shuffle).
scala> var rdd1:RDD[Int]=sc.makeRDD(0 to 100)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[252] at makeRDD at
<console>:25
scala> rdd1.getNumPartitions
res129: Int = 6
scala> rdd1.filter(n=> n%2 == 0).repartition(12).getNumPartitions
res130: Int = 12
coalesce( numPartitions )
After a large amount of data has been filtered out, coalesce can shrink the RDD's partition count (it can only decrease the number of partitions, never increase them, as the last example below shows).
scala> var rdd1:RDD[Int]=sc.makeRDD(0 to 100)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[252] at makeRDD at
<console>:25
scala> rdd1.getNumPartitions
res129: Int = 6
scala> rdd1.filter(n=> n%2 == 0).coalesce(3).getNumPartitions
res127: Int = 3
scala> rdd1.filter(n=> n%2 == 0).coalesce(12).getNumPartitions
res128: Int = 6
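Under the hood, repartition(n) is just coalesce(n, shuffle = true), which is why coalesce can grow the partition count after all when the shuffle is requested explicitly:

rdd1.filter(n => n % 2 == 0).coalesce(12, shuffle = true).getNumPartitions // Int = 12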