1. Introduction
Every Spark application consists of a driver program that runs the user's main method on the cluster and executes various parallel operations. The main abstraction Spark provides is the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. An RDD can be created by reading data from HDFS (or any other Hadoop-supported file system), or by parallelizing an existing Scala collection.
2. RDD operators fall into two broad categories:
Transformation: transformation operators do not trigger job submission; they define the intermediate steps of a job (lazy evaluation).
Action: action operators trigger the SparkContext to submit a job.
Transformations are evaluated lazily: a transformation only records metadata (the lineage), and the actual computation starts only when an action is triggered.
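The same record-now, compute-later behavior can be illustrated with a plain Scala view. This is only an analogy using the standard library, not Spark itself: the mapping function is recorded but not run until the collection is forced, just as a transformation is recorded but not run until an action fires.

```scala
var evaluations = 0

// A view records the map function without running it,
// much like a Spark transformation records lineage without computing.
val lazyMapped = (1 to 5).view.map { x => evaluations += 1; x * 10 }
println(evaluations)   // 0: nothing computed yet (the "transformation" stage)

// Forcing the view plays the role of an action such as collect.
val result = lazyMapped.toList
println(evaluations)   // 5: the function finally ran once per element
println(result)        // List(10, 20, 30, 40, 50)
```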
3. Some small examples
scala> val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> val rdd2 = rdd1.map(_*10)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:29
Note: the elements of rdd2 have not actually been multiplied by 10 yet; Spark only records that map was applied and which anonymous function to run later.
scala> val rdd3 = rdd2.filter(_<50)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at filter at <console>:25
Note: the filter is not applied immediately either.
scala> rdd3.collect
res0: Array[Int] = Array(10, 20, 30, 40)
Summary:
There are two ways to create an RDD:
1. From a file in HDFS (or any other Hadoop-supported file system); at this point the RDD holds no actual data, only metadata describing it.
2. By parallelizing an existing Scala collection or array.
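In spark-shell the two creation paths look like this (a sketch only: it assumes `sc` is the SparkContext the shell provides, and `hdfs:///data/words.txt` is a hypothetical path):

```scala
// 1. From a Hadoop-supported file system: only metadata is recorded here,
//    no data is read until an action runs (path is hypothetical).
val fromFile = sc.textFile("hdfs:///data/words.txt")

// 2. From an existing Scala collection, parallelized across 4 partitions.
val fromCollection = sc.parallelize(1 to 100, 4)
```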
An RDD has five characteristics:
--A list of partitions (each partition lives entirely on one machine, but one machine can host several partitions)
--A function that is applied to every partition
--A list of dependencies on other RDDs (the lineage used to recompute lost partitions)
--Optionally, a partitioner for key-value RDDs (hash partitioning by default)
--Optionally, a list of preferred locations for computing each partition (e.g. the block locations of an HDFS file)
# Check the number of partitions of the RDD
rdd1.partitions.length
# The number of partitions can also be specified explicitly, here five
val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10), 5)
# Sorting
rdd1.map(_*10).sortBy(x => x, true).collect
rdd1.map(_*10).sortBy(x => x + "", true).collect
Note: x + "" turns each number into a string, so the second sort is lexicographic ("100" sorts before "20").
scala> val rdd4 = sc.parallelize(Array("a b c","d e f","h i j"))
rdd4: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> rdd4.flatMap(_.split(" ")).collect
res7: Array[String] = Array(a, b, c, d, e, f, h, i, j)
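flatMap behaves the same way on plain Scala collections, which makes the split-then-flatten step easy to check without a cluster:

```scala
val lines = List("a b c", "d e f", "h i j")

// split(" ") turns each line into an Array of words;
// flatMap flattens the per-line arrays into one collection.
val words = lines.flatMap(_.split(" "))
println(words)   // List(a, b, c, d, e, f, h, i, j)
```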
scala> val rdd5 = sc.parallelize(List(List("a b c","a b b"), List("e f g","a f g"), List("h i j","a a b")))
rdd5: org.apache.spark.rdd.RDD[List[String]] = ParallelCollectionRDD[4] at parallelize at <console>:27
scala> rdd5.flatMap(_.flatMap(_.split(" "))).collect
res9: Array[String] = Array(a, b, c, a, b, b, e, f, g, a, f, g, h, i, j, a, a, b)
# Union
scala> val rdd6 = sc.parallelize(List(5,6,4,7))
rdd6: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:27
scala> val rdd7 = sc.parallelize(List(1,2,3,4))
rdd7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:27
scala> val rdd8 = rdd6.union(rdd7)
rdd8: org.apache.spark.rdd.RDD[Int] = UnionRDD[9] at union at <console>:31
scala> rdd8.collect
res10: Array[Int] = Array(5, 6, 4, 7, 1, 2, 3, 4)
Note: union keeps duplicates; use distinct to remove them.
# Intersection
scala> val rdd9 = rdd6.intersection(rdd7)
rdd9: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[15] at intersection at <console>:31
scala> rdd9.collect
res11: Array[Int] = Array(4)
scala> val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 2), ("kitty", 3)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[16] at parallelize at <console>:27
scala> val rdd2 = sc.parallelize(List(("jerry", 9), ("tom", 8), ("shuke", 7)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[17] at parallelize at <console>:27
scala> rdd1.intersection(rdd2).collect
res12: Array[(String, Int)] = Array()
Note: the intersection is empty because no complete (key, value) pair appears in both RDDs.
# join (pairs with matching keys)
scala> rdd1.join(rdd2).collect
res13: Array[(String, (Int, Int))] = Array((tom,(1,8)), (jerry,(2,9)))
# Redefine rdd2 with a duplicate key
scala> val rdd2 = sc.parallelize(List(("jerry", 9), ("tom", 8), ("shuke", 7), ("tom", 2)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[1] at parallelize at <console>:27
scala> rdd1.join(rdd2).collect
res0: Array[(String, (Int, Int))] = Array((tom,(1,8)), (tom,(1,2)), (jerry,(2,9)))
scala> rdd1.leftOuterJoin(rdd2).collect
res1: Array[(String, (Int, Option[Int]))] = Array((tom,(1,Some(2))), (tom,(1,Some(8))), (jerry,(2,Some(9))), (kitty,(3,None)))
Note: leftOuterJoin keeps every key from the left RDD; keys with no match on the right get None.
scala> rdd1.rightOuterJoin(rdd2).collect
res2: Array[(String, (Option[Int], Int))] = Array((tom,(Some(1),2)), (tom,(Some(1),8)), (jerry,(Some(2),9)), (shuke,(None,7)))
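The Option results above follow directly from left-outer-join semantics. A minimal plain-Scala sketch (no Spark required) reproduces them on ordinary sequences, using the same data as rdd1 and rdd2:

```scala
val left  = Seq(("tom", 1), ("jerry", 2), ("kitty", 3))
val right = Seq(("jerry", 9), ("tom", 8), ("shuke", 7), ("tom", 2))

// Index the right side by key, then keep every left key:
// matched keys pair with Some(value), unmatched keys with None.
val rightByKey = right.groupBy(_._1)
val leftOuter: Seq[(String, (Int, Option[Int]))] = left.flatMap { case (k, v) =>
  rightByKey.get(k) match {
    case Some(matches) => matches.map { case (_, w) => (k, (v, Some(w))) }
    case None          => Seq((k, (v, None)))
  }
}
// kitty has no match on the right, so it carries None.
```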
#groupByKey
scala> val rdd3 = rdd1 union rdd2
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = UnionRDD[3] at union at <console>:31
scala> rdd3.groupByKey
res0: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[4] at groupByKey at <console>:34
scala> rdd3.groupByKey.collect
res1: Array[(String, Iterable[Int])] = Array((tom,CompactBuffer(1, 8, 2)), (jerry,CompactBuffer(2, 9)), (shuke,CompactBuffer(7)), (kitty,CompactBuffer(3)))
scala> rdd3.groupByKey.map(x => (x._1, x._2.sum)).collect
res5: Array[(String, Int)] = Array((tom,11), (jerry,11), (shuke,7), (kitty,3))
scala> rdd3.groupByKey.mapValues(_.sum).collect
res7: Array[(String, Int)] = Array((tom,11), (jerry,11), (shuke,7), (kitty,3))
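The group-then-sum pattern has a direct counterpart on plain Scala collections, which is handy for checking expected results locally (same pairs as rdd1 union rdd2 above):

```scala
// Same pairs as rdd1 union rdd2 above.
val pairs = List(("tom", 1), ("jerry", 2), ("kitty", 3),
                 ("jerry", 9), ("tom", 8), ("shuke", 7), ("tom", 2))

// groupBy collects values per key; the map then sums each group.
val summed: Map[String, Int] =
  pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

println(summed("tom"))    // 11
println(summed("jerry"))  // 11
```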
# cogroup
scala> val rdd1 = sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3), ("kitty", 2)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[15] at parallelize at <console>:27
scala> val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[16] at parallelize at <console>:27
scala> val rdd3 = rdd1.cogroup(rdd2)
rdd3: org.apache.spark.rdd.RDD[(String, (Iterable[Int], Iterable[Int]))] = MapPartitionsRDD[18] at cogroup at <console>:31
scala> val rdd4 = rdd3.map(t => (t._1, t._2._1.sum + t._2._2.sum))
rdd4: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[19] at map at <console>:33
# cartesian (Cartesian product)
scala> val rdd1 = sc.parallelize(List("tom", "jerry"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[20] at parallelize at <console>:27
scala> val rdd2 = sc.parallelize(List("tom", "kitty", "shuke"))
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[21] at parallelize at <console>:27
scala> val rdd3 = rdd1.cartesian(rdd2)
rdd3: org.apache.spark.rdd.RDD[(String, String)] = CartesianRDD[22] at cartesian at <console>:31
scala> rdd3.collect
res9: Array[(String, String)] = Array((tom,tom), (tom,kitty), (tom,shuke), (jerry,tom), (jerry,kitty), (jerry,shuke))
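A Cartesian product is just a nested iteration; in plain Scala a for-comprehension produces the same six pairs:

```scala
val xs = List("tom", "jerry")
val ys = List("tom", "kitty", "shuke")

// Every x is paired with every y: |xs| * |ys| = 6 pairs.
val product = for (x <- xs; y <- ys) yield (x, y)
println(product.size)   // 6
println(product.head)   // (tom,tom)
```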
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5), 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[23] at parallelize at <console>:27
scala> rdd1.collect
res10: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val rdd2 = rdd1.reduce(_+_)
rdd2: Int = 15
Note: reduce is an action, so it returns a plain value (Int) rather than an RDD.
scala> rdd1.count
res11: Long = 5
scala> rdd1.top(2)
res12: Array[Int] = Array(5, 4)
scala> rdd1.take(2)
res13: Array[Int] = Array(1, 2)
scala> rdd1.first
res14: Int = 1
scala> rdd1.takeOrdered(3)
Note: takeOrdered(n) returns the n smallest elements in ascending order, here Array(1, 2, 3).
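top and takeOrdered differ only in sort direction. With plain Scala collections the same results can be reproduced by sorting and taking; this is an analogy only, not Spark's implementation, which avoids a full sort:

```scala
val nums = List(1, 2, 3, 4, 5)

// Like top(2): the 2 largest elements, descending.
val largest = nums.sorted(Ordering[Int].reverse).take(2)   // List(5, 4)

// Like takeOrdered(3): the 3 smallest elements, ascending.
val smallest = nums.sorted.take(3)                         // List(1, 2, 3)
```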