RDD Methods

Method descriptions with simple usage examples.

flatMap
Maps a function over every element of the RDD, then flattens the results, returning the flattened collection.

scala> sc.parallelize(Array("a b c", "d e f", "h i j")).collect
res31: Array[String] = Array(a b c, d e f, h i j)

scala> sc.parallelize(Array("a b c", "d e f", "h i j")).flatMap(_.split(" ")).collect
res32: Array[String] = Array(a, b, c, d, e, f, h, i, j)
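The "map then flatten" behaviour can be checked with plain Scala collections, which share the same flatMap semantics; this sketch needs no SparkContext:

```scala
val lines = List("a b c", "d e f", "h i j")

// map alone leaves a nested structure...
val mapped: List[List[String]] = lines.map(_.split(" ").toList)

// ...while flatMap is equivalent to map followed by flatten
val words: List[String] = lines.flatMap(_.split(" ").toList)
// words: List(a, b, c, d, e, f, h, i, j)
```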

sortBy

sortBy(x=>x,true) // ascending by default

sortBy(x=>x+"",true) // each element is turned into a String, so the result is in lexicographic (dictionary) order

 

scala> sc.parallelize(List(5, 6, 4, 7, 3, 8, 2, 9, 1, 10)).sortBy(x=>x,true).collect
res33: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> sc.parallelize(List(5, 6, 4, 7, 3, 8, 2, 9, 1, 10)).sortBy(x=>x+"",true).collect
res34: Array[Int] = Array(1, 10, 2, 3, 4, 5, 6, 7, 8, 9)
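The second result is easy to verify with plain Scala collections: sorting by the String form compares character by character, so "10" sorts between "1" and "2":

```scala
val nums = List(5, 6, 4, 7, 3, 8, 2, 9, 1, 10)

val ascending = nums.sortBy(identity)    // numeric order
val lexical   = nums.sortBy(_.toString)  // "1" < "10" < "2" lexicographically
```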

union: union of two RDDs
intersection: intersection of two RDDs
subtract: difference (elements in the first RDD but not in the second)
cartesian: Cartesian product
join (inner join): pairs up the values that share the same key into tuples
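The four set-style operators behave much like their Scala-collection counterparts; a plain-Scala sketch of the semantics (like RDD.union, ++ keeps duplicates):

```scala
val a = List(1, 2, 3, 4)
val b = List(3, 4, 5, 6)

val uni   = a ++ b                             // union: keeps duplicates
val inter = a.intersect(b)                     // intersection: in both
val sub   = a.diff(b)                          // subtract: in a but not in b
val cart  = for (x <- a; y <- b) yield (x, y)  // cartesian: every (x, y) pair
```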

val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 2), ("kitty", 3)))
val rdd2 = sc.parallelize(List(("jerry", 9), ("tom", 8), ("shuke", 7), ("tom", 2)))

 

scala> rdd1.join(rdd2).collect
res39: Array[(String, (Int, Int))] = Array((tom,(1,8)), (tom,(1,2)), (jerry,(2,9)))

rightOuterJoin
Keeps every key of the right-hand RDD; missing left-side values become None.

scala> rdd1.rightOuterJoin(rdd2).collect
res43: Array[(String, (Option[Int], Int))] = Array((tom,(Some(1),8)), (tom,(Some(1),2)), (jerry,(Some(2),9)), (shuke,(None,7)))
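The inner-join semantics above can be sketched with plain Scala collections; the helper name join here is my own, not a Spark API (keys present on only one side are simply dropped):

```scala
// inner join: emit (key, (leftValue, rightValue)) for every pair of
// records that share a key
def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
  for ((k, v) <- left; (k2, w) <- right if k == k2) yield (k, (v, w))

val left  = List(("tom", 1), ("jerry", 2), ("kitty", 3))
val right = List(("jerry", 9), ("tom", 8), ("shuke", 7), ("tom", 2))

val joined = join(left, right)
// kitty (left only) and shuke (right only) do not appear in the result
```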
groupByKey
groupByKey() groups together the values that have the same key.

scala> val rdd6 = sc.parallelize(Array(("tom",1), ("jerry",2), ("kitty",3),
     | ("jerry",9), ("tom",8), ("shuke",7), ("tom",2)))
rdd6: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[109] at parallelize at <console>:24

scala> rdd6.groupByKey.collect
res44: Array[(String, Iterable[Int])] = Array((tom,CompactBuffer(8, 2, 1)), (jerry,CompactBuffer(9, 2)), (shuke,CompactBuffer(7)), (kitty,CompactBuffer(3)))

cogroup: first groups values by key within each RDD, then pairs up the per-key groups across the RDDs.

val rdd1 = sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3), ("kitty", 2)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))
val rdd3 = rdd1.cogroup(rdd2)

scala> rdd1.cogroup(rdd2).collect
res47: Array[(String, (Iterable[Int], Iterable[Int]))] = Array((tom,(CompactBuffer(1, 2),CompactBuffer(1))), (jerry,(CompactBuffer(3),CompactBuffer(2))), (shuke,(CompactBuffer(),CompactBuffer(2))), (kitty,(CompactBuffer(2),CompactBuffer())))
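The same result can be sketched in plain Scala: for every key seen on either side, collect that key's values from each input. The helper name cogroup is my own, not Spark's API:

```scala
def cogroup[K, V, W](a: Seq[(K, V)], b: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
  // every key that appears in either input gets an entry,
  // with an empty group for the side that lacks it
  val keys = (a.map(_._1) ++ b.map(_._1)).distinct
  keys.map { k =>
    k -> (a.collect { case (`k`, v) => v }, b.collect { case (`k`, w) => w })
  }.toMap
}

val left    = List(("tom", 1), ("tom", 2), ("jerry", 3), ("kitty", 2))
val right   = List(("jerry", 2), ("tom", 1), ("shuke", 2))
val grouped = cogroup(left, right)
```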

groupBy: groups elements according to the key produced by the given function.


scala> val intRdd = sc.parallelize(List(1,2,3,4,5,6))
intRdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[117] at parallelize at <console>:24

scala> intRdd.groupBy(x=>{if(x%2==0) "even" else "odd"}).collect
res48: Array[(String, Iterable[Int])] = Array((even,CompactBuffer(4, 6, 2)), (odd,CompactBuffer(1, 3, 5)))
 

reduce
Note: reduce is an action operator.

val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5))
// reduce aggregation
val result = rdd1.reduce(_ + _) // the first _ is the result of the previous step; the second _ is the next incoming element

reduceByKey
Note: reduceByKey is a transformation operator.
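reduceByKey's semantics (group the values that share a key, then fold each group with the given function) can be sketched with plain Scala collections. Unlike this sketch, Spark's reduceByKey also combines values locally within each partition before shuffling, which is why it is usually preferred over groupByKey followed by a reduce:

```scala
val pairs = List(("tom", 1), ("jerry", 2), ("tom", 8), ("jerry", 9), ("shuke", 7))

// group by key, then reduce each group's values with _ + _
val reduced: Map[String, Int] =
  pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(_ + _) }
// reduced: Map(tom -> 9, jerry -> 11, shuke -> 7)
```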
repartition

Changes the number of partitions:

Note:
repartition can both increase and decrease the number of partitions of an RDD.
coalesce by default only decreases the partition count; asking it to increase the count has no effect (pass shuffle = true to allow an increase).
Whether you increase or decrease, the original RDD's partition count does not change; a new RDD is returned.

scala> val rdd1 = sc.parallelize(1 to 10,3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[124] at parallelize at <console>:24

scala> rdd1.repartition(2).partitions.length
res52: Int = 2

scala> rdd1.partitions.length
res53: Int = 3

collect: returns the RDD's elements to the driver and displays them.
count: returns the number of top-level elements in the RDD.

scala> val rdd3 = sc.parallelize(List(List("a b c", "a b b"),List("e f g", "a f g"), List("h i j", "a a b")))
rdd3: org.apache.spark.rdd.RDD[List[String]] = ParallelCollectionRDD[129] at parallelize at <console>:24

scala> rdd3.count
res54: Long = 3

distinct: removes duplicate elements.


scala> val rdd = sc.parallelize(Array(1,2,3,4,5,5,6,7,8,1,2,3,4), 3)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[130] at parallelize at <console>:24

scala> rdd.distinct.collect
res55: Array[Int] = Array(6, 3, 4, 1, 7, 8, 5, 2)

top: returns the largest N elements.

scala> sc.parallelize(List(3,6,1,2,4,5)).top(2)
res56: Array[Int] = Array(6, 5)

take: returns the first N elements in the original order.

scala> sc.parallelize(List(3,6,1,2,4,5)).take(2)
res57: Array[Int] = Array(3, 6)

first: returns the first element in the original order.

scala> sc.parallelize(List(3,6,1,2,4,5)).first
res58: Int = 3

 
