RDD Exercises (Part 2)


1.aggregateByKey

   val rdd1 = sc.parallelize(List(("tom",20),("tom",12),("tom1",18),("tom1",19),("tom2",22),("tom2",10),("tom2",30),("tom1",19)),2)
   rdd1.aggregateByKey(0)(math.max(_,_),_+_).collect

Result

   res38: Array[(String, Int)] = Array((tom2,30), (tom,20), (tom1,38))
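The result above follows from the two functions running at different stages: `math.max` within each partition, then `_+_` across partitions. The following is a minimal sketch of that behaviour in plain Scala collections, not Spark's actual implementation. The object `AggregateByKeySketch` and its `aggregateByKey` helper are hypothetical names, and splitting the list into two groups of four is an assumption about how `parallelize(..., 2)` distributes these eight elements.

```scala
object AggregateByKeySketch {
  // Plain-Scala stand-in for aggregateByKey: `partitions` plays the role of the RDD's partitions.
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]], zero: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] = {
    // Step 1: inside each partition, fold the values of every key, starting from `zero`.
    val perPartition: Seq[Map[K, U]] = partitions.map { part =>
      part.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).foldLeft(zero)(seqOp) }
    }
    // Step 2: merge the per-partition results of each key with combOp.
    perPartition
      .flatMap(_.toSeq)
      .groupBy(_._1)
      .map { case (k, kus) => k -> kus.map(_._2).reduce(combOp) }
  }

  def main(args: Array[String]): Unit = {
    val data = List(("tom",20),("tom",12),("tom1",18),("tom1",19),
                    ("tom2",22),("tom2",10),("tom2",30),("tom1",19))
    // Assumed layout: first four elements in one partition, last four in the other.
    val partitions = data.grouped(4).toSeq
    println(aggregateByKey(partitions, 0)(math.max(_, _), _ + _))
  }
}
```

Under that layout, tom1 has a maximum of 19 in each partition, and the two maxima are then added, which gives 38.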

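2.combineByKey
(createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C)
First parameter: takes the first value seen for each key within a partition and converts it to the combiner type C using the given function
Second parameter: local aggregation, merging each further value of that key within the same partition into the combiner
Third parameter: global aggregation, merging the combiners of the same key across partitions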
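To make the roles of the three parameters concrete, here is a minimal sketch in plain Scala collections. It imitates the behaviour described above and is not Spark's implementation. `CombineByKeySketch`, its `combineByKey` helper, and the sample data are hypothetical. The createCombiner deliberately adds 10, so it is easy to see how often it runs.

```scala
object CombineByKeySketch {
  // Plain-Scala stand-in for combineByKey: `partitions` plays the role of the RDD's partitions.
  def combineByKey[K, V, C](partitions: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.get(k) match {
          // First value of this key in this partition: build the combiner.
          case None    => acc.updated(k, createCombiner(v))
          // Later values of this key in the same partition: local aggregation.
          case Some(c) => acc.updated(k, mergeValue(c, v))
        }
      }
    }
    // Global aggregation: merge the combiners of the same key across partitions.
    perPartition
      .flatMap(_.toSeq)
      .groupBy(_._1)
      .map { case (k, kcs) => k -> kcs.map(_._2).reduce(mergeCombiners) }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical data: "a" occurs in both partitions, "b" only in the first.
    val partitions = Seq(Seq(("a", 1), ("b", 1), ("a", 1)), Seq(("a", 1)))
    println(combineByKey[String, Int, Int](partitions)(x => 10 + x, _ + _, _ + _))
  }
}
```

In this sketch "a" ends at 23 rather than 3 and "b" at 11 rather than 1: the extra 10 is added once for every partition in which the key appears, not once per key overall.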

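2.1.Counting strings

   val rdd1= sc.textFile("hdfs://hlm1:9000/user/hdfs/emp.txt").flatMap(_.split(",")).map((_,1))

   // createCombiner adds 10 once per key per partition, so the totals exceed the plain counts
   val rdd2= rdd1.combineByKey(x=>10+x,(a:Int,b :Int)=>a+b,(a:Int,b :Int)=>a+b)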

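2.2.Computing the average score

   val score = Array(("tom",80),("tom",90),("tom",70),("tom1",80),("tom1",88),("tom1",87))
   val rdd = sc.parallelize(score)
   // The combiner is (count, total): start at (1, score), add values locally, then merge partitions
   rdd.combineByKey(score=> (1,score),(x:Tuple2[Int,Int],y:Int)=>(x._1+1,x._2+y),(m:Tuple2[Int,Int],n:Tuple2[Int,Int])=>(m._1+n._1,m._2+n._2)).collect
  (computes the count and total score per key)

Result 1 (count and total score)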

  res41: Array[(String, (Int, Int))] = Array((tom,(3,240)), (tom1,(3,255)))

   rdd.combineByKey(score=> (1,score),(x:Tuple2[Int,Int],y:Int)=>(x._1+1,x._2+y),(m:Tuple2[Int,Int],n:Tuple2[Int,Int])=>(m._1+n._1,m._2+n._2))
   .map{case (name,(num,scores))=>(name,scores/num)}.collect
  (computes the average score; scores/num is integer division, so any fractional part is dropped)

Result 2 (average score)

  res42: Array[(String, Int)] = Array((tom,80), (tom1,85))
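Both averages above happen to be whole numbers, so the integer division loses nothing. As a minimal sketch in plain Scala collections (not Spark code), the same (count, total) accumulation can be divided as a Double to keep the fraction. `AverageScoreSketch` and the extra key "tom2" are hypothetical and not part of the data set above.

```scala
object AverageScoreSketch {
  // Accumulate (count, total) per key, then divide as Double to keep the fractional part.
  def averages(scores: Seq[(String, Int)]): Map[String, Double] =
    scores.groupBy(_._1).map { case (name, kvs) =>
      val (count, total) = kvs.foldLeft((0, 0)) { case ((n, sum), (_, s)) => (n + 1, sum + s) }
      name -> (total.toDouble / count)
    }

  def main(args: Array[String]): Unit = {
    // "tom2" is made-up data whose average is not a whole number.
    val scores = Seq(("tom", 80), ("tom", 90), ("tom", 70), ("tom2", 80), ("tom2", 85))
    println(averages(scores)) // tom -> 80.0, tom2 -> 82.5 (integer division would give 82)
  }
}
```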

3.Action operators

  val rdd = sc.parallelize(List(2,1,3,6,5),2)

  rdd.reduce(_+_)    // res43: Int = 17
  rdd.count          // res44: Long = 5
  rdd.top(3)         // res45: Array[Int] = Array(6, 5, 3)
  rdd.take(3)        // res47: Array[Int] = Array(2, 1, 3)
  rdd.takeOrdered(3) // res48: Array[Int] = Array(1, 2, 3)
  rdd.first          // res49: Int = 2
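The difference between top, take and takeOrdered comes down to ordering. The sketch below reproduces the three results on a plain Scala List. It only shows the equivalent local operations and says nothing about how Spark computes them across partitions. `TopTakeSketch` is a hypothetical name.

```scala
object TopTakeSketch {
  val data = List(2, 1, 3, 6, 5)

  // top(3): the three largest elements, in descending order.
  val top3: List[Int] = data.sorted(Ordering[Int].reverse).take(3)

  // takeOrdered(3): the three smallest elements, in ascending order.
  val takeOrdered3: List[Int] = data.sorted.take(3)

  // take(3): the first three elements in their original order, no sorting.
  val take3: List[Int] = data.take(3)

  def main(args: Array[String]): Unit = {
    println(top3)         // List(6, 5, 3)
    println(takeOrdered3) // List(1, 2, 3)
    println(take3)        // List(2, 1, 3)
  }
}
```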

countByKey

  val rdd1 = sc.parallelize(List(("tom",80),("tom",90),("tom",70),("tom1",80),("tom1",88),("tom1",87)),2)
  rdd1.countByKey

Result

  res51: scala.collection.Map[String,Long] = Map(tom -> 3, tom1 -> 3)

countByValue

  rdd1.countByValue

Result

  res52: scala.collection.Map[(String, Int),Long] = Map((tom1,87) -> 1, (tom1,88) -> 1, (tom,80) -> 1, (tom,90) -> 1, (tom1,80) -> 1, (tom,70) -> 1)
  (counts how many times each distinct element occurs)

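Difference between foreach and foreachPartition
Neither returns a value. foreach applies func to each element.
foreachPartition applies func to each partition, passing it an iterator over that partition's elements.
Typical use: writing results to external storage.
For a small result set, foreach is fine for storing the data.
For a large one, opening a connection per element inside foreach creates a huge number of connections and can bring the database down. Use foreachPartition instead, with one connection per partition.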

 val rdd = sc.parallelize(List(2,1,3,6,5,3,5,6,8,33),2)
  rdd.foreachPartition(x=>println(x.reduce(_+_)))
  17
  55
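The connection argument can be illustrated with a minimal sketch in plain Scala collections. `FakeConnection` is a hypothetical stand-in for a database connection that only counts how often it is opened, and the nested Seq stands in for an RDD with two partitions. No real database or Spark API is involved.

```scala
import java.util.concurrent.atomic.AtomicInteger

object ConnectionPerPartitionSketch {
  // Hypothetical connection: constructing it counts as opening one connection.
  final class FakeConnection(opened: AtomicInteger) {
    opened.incrementAndGet()
    def save(x: Int): Unit = () // a real connection would write x here
    def close(): Unit = ()
  }

  // foreach style: the function runs per element, so a connection opened inside it is opened per element.
  def connectionsPerElement(partitions: Seq[Seq[Int]]): Int = {
    val opened = new AtomicInteger(0)
    partitions.foreach(_.foreach { x =>
      val conn = new FakeConnection(opened)
      try conn.save(x) finally conn.close()
    })
    opened.get()
  }

  // foreachPartition style: the function runs per partition, so one connection serves all its elements.
  def connectionsPerPartition(partitions: Seq[Seq[Int]]): Int = {
    val opened = new AtomicInteger(0)
    partitions.foreach { part =>
      val conn = new FakeConnection(opened)
      try part.foreach(x => conn.save(x)) finally conn.close()
    }
    opened.get()
  }

  def main(args: Array[String]): Unit = {
    val partitions = List(2, 1, 3, 6, 5, 3, 5, 6, 8, 33).grouped(5).toSeq
    println(connectionsPerElement(partitions))   // 10
    println(connectionsPerPartition(partitions)) // 2
  }
}
```

With ten elements in two partitions, the per-element style opens ten connections and the per-partition style opens two. The gap grows with the data size, not with the number of partitions.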

filterByRange

 val rdd1 = sc.parallelize(List(("tom",80),("tom",90),("tom",70),("tom1",80),("tom1",88),("tom1",87)),2)
  rdd1.filterByRange("tom","tom").collect

Result (tom to tom)

  res60: Array[(String, Int)] = Array((tom,80), (tom,90), (tom,70))

tom1 to tom

 rdd1.filterByRange("tom1","tom").collect

Result

  res62: Array[(String, Int)] = Array()

tom to tom1

  rdd1.filterByRange("tom","tom1").collect

Result

  res64: Array[(String, Int)] = Array((tom,80), (tom,90), (tom,70), (tom1,80), (tom1,88), (tom1,87))
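The three results are consistent with filterByRange keeping every pair whose key lies in the inclusive range from lower to upper under the key ordering. For strings, "tom" sorts before "tom1", so the range from "tom1" to "tom" is empty. A minimal sketch of that rule on a plain Scala Seq follows. `FilterByRangeSketch` and its helper are hypothetical, and the sketch ignores the partition pruning Spark can do on range-partitioned RDDs.

```scala
object FilterByRangeSketch {
  // Keep pairs whose key k satisfies lower <= k <= upper (both ends inclusive).
  def filterByRange[V](data: Seq[(String, V)], lower: String, upper: String): Seq[(String, V)] =
    data.filter { case (k, _) => k.compareTo(lower) >= 0 && k.compareTo(upper) <= 0 }

  def main(args: Array[String]): Unit = {
    val data = List(("tom",80),("tom",90),("tom",70),("tom1",80),("tom1",88),("tom1",87))
    println(filterByRange(data, "tom", "tom"))   // only the three "tom" pairs
    println(filterByRange(data, "tom1", "tom"))  // empty: lower bound sorts after upper bound
    println(filterByRange(data, "tom", "tom1"))  // all six pairs
  }
}
```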

flatMapValues

  val rdd1 = sc.parallelize(List(("tom","80 90"),("tom1","88 90"),("tom2","87 94")),2)
  rdd1.flatMapValues(_.split(" ")).collect

Result

  res65: Array[(String, String)] = Array((tom,80), (tom,90), (tom1,88), (tom1,90), (tom2,87), (tom2,94))

foldByKey

  val rdd2 = sc.parallelize(Array((12,"uzi1"),(2,"uzi2"),(3,"uzi3"),(3,"uzi4")))
  rdd2.foldByKey("ADC")(_+_).collect

Result

  res69: Array[(Int, String)] = Array((12,ADCuzi1), (2,ADCuzi2), (3,ADCuzi3uzi4))
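Note that rdd2 was created without an explicit partition count, and the output of foldByKey can depend on the partitioning: the zero value is folded in once per key per partition. The result above, (3,ADCuzi3uzi4), matches the case where both values of key 3 sit in the same partition. The sketch below imitates this in plain Scala collections. `FoldByKeySketch` and both partition layouts are assumptions for illustration, and real Spark gives no guarantee about the order in which partitions are merged.

```scala
object FoldByKeySketch {
  // Plain-Scala stand-in for foldByKey: `partitions` plays the role of the RDD's partitions.
  def foldByKey[K, V](partitions: Seq[Seq[(K, V)]], zero: V)(op: (V, V) => V): Map[K, V] = {
    // Within each partition, fold the values of every key, starting from the zero value.
    val perPartition = partitions.map { part =>
      part.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).foldLeft(zero)(op) }
    }
    // Across partitions, merge the partial results of each key with the same function.
    perPartition
      .flatMap(_.toSeq)
      .groupBy(_._1)
      .map { case (k, kvs) => k -> kvs.map(_._2).reduce(op) }
  }

  def main(args: Array[String]): Unit = {
    // Assumed layout 1: both values of key 3 share a partition.
    val together = Seq(Seq((12, "uzi1"), (2, "uzi2")), Seq((3, "uzi3"), (3, "uzi4")))
    // Assumed layout 2: the values of key 3 land in different partitions.
    val apart = Seq(Seq((12, "uzi1"), (2, "uzi2"), (3, "uzi3")), Seq((3, "uzi4")))
    println(foldByKey(together, "ADC")(_ + _)(3)) // ADCuzi3uzi4
    println(foldByKey(apart, "ADC")(_ + _)(3))    // ADCuzi3ADCuzi4
  }
}
```

For this reason a zero value that is not neutral for the operation (such as "ADC" for string concatenation) makes the result partition-dependent, whereas a neutral one (such as "" or 0) does not.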

keyBy

 val rdd = sc.parallelize(List("jxlg","hdjd","shjps","javaweb","hadoop"))
 rdd.keyBy(_.length).collect

Result

 res71: Array[(Int, String)] = Array((4,jxlg), (4,hdjd), (5,shjps), (7,javaweb), (6,hadoop))

keys

 val rdd = sc.parallelize(List("jxlg","hdjd","shjps","javaweb","hadoop")).map(x=>(x.length,x))
 rdd.keys.collect

Result

 res73: Array[Int] = Array(4, 4, 5, 7, 6)

values

 rdd.values.collect
 res74: Array[String] = Array(jxlg, hdjd, shjps, javaweb, hadoop)

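collect:

 An Action operator: it gathers the results computed by every Executor to the Driver and wraps them in an Array, so the whole result must fit in the Driver's memory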

collectAsMap

 val rdd = sc.parallelize(List(("a",1),("b",2)))
 rdd.collectAsMap

Result

 res75: scala.collection.Map[String,Int] = Map(b -> 2, a -> 1)