Spark: RDD Transformation Operators

map

Transforms an RDD[U] into an RDD[T]. The user supplies an anonymous function func: U => T, which is applied to every element.

scala> var rdd:RDD[String]=sc.makeRDD(List("a","b","c","a"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[120] at makeRDD at <console>:25
scala> val mapRDD:RDD[(String,Int)] = rdd.map(w => (w, 1))
mapRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[121] at map at <console>:26

filter

Filters the elements of an RDD[U], producing a new RDD[U]. The user supplies a predicate func: U => Boolean; only the elements for which it returns true are kept.

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[122] at makeRDD at <console>:25
scala> val mapRDD:RDD[Int]=rdd.filter(num=> num %2 == 0)
mapRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[123] at filter at <console>:26
scala> mapRDD.collect
res63: Array[Int] = Array(2, 4)

flatMap

Similar to map, it also transforms an RDD[U] into an RDD[T], but the user supplies a function func: U => Seq[T]; each element is mapped to a sequence, and the sequences are flattened into the resulting RDD.

scala> var rdd:RDD[String]=sc.makeRDD(List("this is","good good"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[124] at makeRDD at <console>:25
scala> var flatMapRDD:RDD[(String,Int)]=rdd.flatMap( line=> line.split("\\s+").map((_,1)))
flatMapRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[126] at flatMap at <console>:26
scala> flatMapRDD.collect
res64: Array[(String, Int)] = Array((this,1), (is,1), (good,1), (good,1))
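
For contrast, map with the same function keeps the nesting: each input line becomes a single array element instead of being flattened. A minimal sketch on the same rdd as above:

// map keeps one output element per input line, so the result is an RDD of arrays
val nested: RDD[Array[(String, Int)]] = rdd.map(line => line.split("\\s+").map((_, 1)))
// nested.collect => Array(Array((this,1), (is,1)), Array((good,1), (good,1)))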

mapPartitions

Similar to map, but the function receives the entire contents of a partition at once, so the user supplies a per-partition transformation func: Iterator[U] => Iterator[T].

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[128] at makeRDD at <console>:25
scala> var mapPartitionsRDD=rdd.mapPartitions(values => values.map(n=>(n,n%2==0)),true)
mapPartitionsRDD: org.apache.spark.rdd.RDD[(Int, Boolean)] = MapPartitionsRDD[129] at mapPartitions at <console>:26
scala> mapPartitionsRDD.collect
res70: Array[(Int, Boolean)] = Array((1,false), (2,true), (3,false), (4,true), (5,false))
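
The second argument true in the example above is preservesPartitioning, which tells Spark that the parent RDD's partitioning still applies to the output. A common reason to use mapPartitions instead of map is to perform expensive setup once per partition rather than once per element; a minimal sketch, where createConnection and lookup are hypothetical helpers:

rdd.mapPartitions { values =>
  val conn = createConnection()      // hypothetical: open one connection per partition
  values.map(n => lookup(conn, n))   // hypothetical lookup applied to each element of the partition
}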

mapPartitionsWithIndex

Same as mapPartitions, except that the function also receives the index of the partition (within the parent RDD) that is being processed.

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6),2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[139] at makeRDD at <console>:25
scala> var mapPartitionsWithIndexRDD=rdd.mapPartitionsWithIndex((p,values) => values.map(n=>(n,p)))
mapPartitionsWithIndexRDD: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[140] at mapPartitionsWithIndex at <console>:26
scala> mapPartitionsWithIndexRDD.collect
res77: Array[(Int, Int)] = Array((1,0), (2,0), (3,0), (4,1), (5,1), (6,1))

sample

sample( withReplacement , fraction , seed )

A sampling operator: extracts sample data from the RDD. withReplacement controls whether sampling is done with replacement, fraction controls the approximate proportion of elements sampled, and seed seeds the random number generator used during sampling.

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[150] at makeRDD at <console>:25
scala> var simpleRDD:RDD[Int]=rdd.sample(false,0.5d,1L)
simpleRDD: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[7] at sample at <console>:26
scala> simpleRDD.collect
res6: Array[Int] = Array(1, 5, 6)
scala> var simpleRDD:RDD[Int]=rdd.sample(false,0.5d,2L)
simpleRDD: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[8] at sample at <console>:26
scala> simpleRDD.collect
res7: Array[Int] = Array(2, 3)                                                  
scala> var simpleRDD:RDD[Int]=rdd.sample(false,0.5d,2L)
simpleRDD: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[10] at sample at <console>:26
scala> simpleRDD.collect
res9: Array[Int] = Array(2, 3)

The first parameter controls whether sampling is done with replacement, the second is the approximate fraction of elements to sample, and the third is the seed: the same seed always produces the same sample.
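
With withReplacement = true the same element may be drawn more than once, and fraction is then interpreted as the expected number of times each element is chosen (so it may exceed 1.0). A minimal sketch on the same rdd; the seed 42L is arbitrary:

val withRep = rdd.sample(true, 2.0, 42L)   // each element is expected to be drawn about twice
// withRep.collect  // exact contents depend on the seed, but duplicates are possible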

union( otherDataset )

Merges the elements of two RDDs of the same element type; duplicates are kept.

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[154] at makeRDD at <console>:25
scala> var rdd2:RDD[Int]=sc.makeRDD(List(6,7))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[155] at makeRDD at <console>:25
scala> rdd.union(rdd2).collect
res95: Array[Int] = Array(1, 2, 3, 4, 5, 6, 6, 7)

intersection( otherDataset )

Computes the intersection of the elements of two RDDs of the same element type.

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[154] at makeRDD at <console>:25
scala> var rdd2:RDD[Int]=sc.makeRDD(List(6,7))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[155] at makeRDD at <console>:25
scala> rdd.intersection(rdd2).collect
res100: Array[Int] = Array(6)

distinct([ numPartitions ])

Removes duplicate elements from the RDD. The optional numPartitions parameter changes the number of partitions of the result; when deduplication shrinks the data set dramatically, passing a smaller numPartitions can reduce the partition count.

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[154] at makeRDD at <console>:25
scala> rdd.distinct(3).collect
res106: Array[Int] = Array(6, 3, 4, 1, 5, 2)
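
The numPartitions argument is reflected in the partition count of the result; a quick check on the same rdd:

rdd.distinct(3).getNumPartitions   // 3: the deduplicated RDD uses the requested number of partitions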

join( otherDataset , [ numPartitions ])

When called on an RDD[(K,V)] and an RDD[(K,W)], the system returns a new RDD[(K,(V,W))] (an inner join by default). leftOuterJoin, rightOuterJoin, and fullOuterJoin are also supported.

scala> var userRDD:RDD[(Int,String)]=sc.makeRDD(List((1,"zhangsan"),(2,"lisi")))
userRDD: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[204] at makeRDD at <console>:25
scala> case class OrderItem(name:String,price:Double,count:Int)
defined class OrderItem
scala> var orderItemRDD:RDD[(Int,OrderItem)]=sc.makeRDD(List((1,OrderItem("apple",4.5,2))))
orderItemRDD: org.apache.spark.rdd.RDD[(Int, OrderItem)] = ParallelCollectionRDD[206] at makeRDD at <console>:27
scala> userRDD.join(orderItemRDD).collect
res107: Array[(Int, (String, OrderItem))] = Array((1,(zhangsan,OrderItem(apple,4.5,2))))
scala> userRDD.leftOuterJoin(orderItemRDD).collect
res108: Array[(Int, (String, Option[OrderItem]))] = Array((1,(zhangsan,Some(OrderItem(apple,4.5,2)))), (2,(lisi,None)))
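
The other join variants mentioned above behave analogously: unmatched keys are wrapped in Option on whichever side may be missing. A minimal sketch on the same userRDD and orderItemRDD:

// rightOuterJoin keeps every key of orderItemRDD; the user side becomes an Option
val right: RDD[(Int, (Option[String], OrderItem))] = userRDD.rightOuterJoin(orderItemRDD)
// fullOuterJoin keeps keys from both sides; both values become Options
val full: RDD[(Int, (Option[String], Option[OrderItem]))] = userRDD.fullOuterJoin(orderItemRDD)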

repartition( numPartitions )

Similar to coalesce, but this operator can either increase or decrease the number of partitions of an RDD (it always performs a shuffle).

scala> var rdd1:RDD[Int]=sc.makeRDD(0 to 100)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[252] at makeRDD at <console>:25
scala> rdd1.getNumPartitions
res129: Int = 6
scala> rdd1.filter(n=> n%2 == 0).repartition(12).getNumPartitions
res130: Int = 12

coalesce( numPartitions )

After a large amount of data has been filtered out, coalesce can be used to shrink the number of partitions of an RDD (with its default settings it can only reduce the partition count, not increase it).

scala> var rdd1:RDD[Int]=sc.makeRDD(0 to 100)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[252] at makeRDD at <console>:25
scala> rdd1.getNumPartitions
res129: Int = 6
scala> rdd1.filter(n=> n%2 == 0).coalesce(3).getNumPartitions
res127: Int = 3
scala> rdd1.filter(n=> n%2 == 0).coalesce(12).getNumPartitions
res128: Int = 6
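
The reason coalesce(12) above did not increase the partition count is that coalesce defaults to shuffle = false; passing shuffle = true allows it to grow the number of partitions, which is essentially what repartition does internally. A minimal sketch:

rdd1.coalesce(12).getNumPartitions                  // stays at the original count: no shuffle, cannot grow
rdd1.coalesce(12, shuffle = true).getNumPartitions  // 12: with a shuffle the partition count can increase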