Convert an array to an RDD (create an RDD by parallelizing a Scala collection). Note: transformations on an RDD are lazily evaluated (deferred execution).
scala> val r1 = sc.parallelize(Array(1,2,3,4,5,6))
r1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:24
Check the number of partitions of this RDD:
scala> r1.partitions.length
res26: Int = 1
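parallelize also accepts an explicit partition count as its second argument (numSlices); a minimal sketch, with r1b as a hypothetical name:

scala> val r1b = sc.parallelize(Array(1,2,3,4,5,6), 3)
scala> r1b.partitions.length
res: Int = 3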
Convert an immutable List to an RDD (transformation, lazily evaluated):
scala> val r2 = sc.parallelize(List(4,5,6,7,8))
r2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[32] at parallelize at <console>:24
Convert a List to an RDD, multiply each element by 2, then sort; the second argument true means ascending order:
scala> val r3 = sc.parallelize(List(1,2,3,4,5,6,10,1)).map(_*2).sortBy(x=>x,true)
r3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[37] at sortBy at <console>:24
scala> r3.collect
res27: Array[Int] = Array(2, 2, 4, 6, 8, 10, 12, 20)
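Passing false as sortBy's second argument sorts in descending order; a sketch reusing the same input:

scala> sc.parallelize(List(1,2,3,4,5,6,10,1)).map(_*2).sortBy(x=>x,false).collect
res: Array[Int] = Array(20, 12, 10, 8, 6, 4, 2, 2)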
filter: keep only the elements greater than 5
scala> val r4 = r2.filter(_ > 5)
r4: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[38] at filter at <console>:26
scala> val r4 = r2.filter(_ > 5).collect
r4: Array[Int] = Array(6, 7, 8)
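filter accepts any boolean predicate, not just comparisons; for example, keeping only the even elements of r2 (a sketch):

scala> r2.filter(_ % 2 == 0).collect
res: Array[Int] = Array(4, 6, 8)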
Convert a List to an RDD, multiply each element by 2, then sort by string value (lexicographic order); true means ascending:
scala> val r2 = sc.parallelize(List(1,2,3,4,5,3,7,9)).map(_*2).sortBy(x=>x+"",true)
r2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[44] at sortBy at <console>:24
scala> val r2 = sc.parallelize(List(1,2,3,4,5,3,7,9)).map(_*2).sortBy(x=>x+"",true).collect
r2: Array[Int] = Array(10, 14, 18, 2, 4, 6, 6, 8)
Same meaning as above (sort by the string representation):
scala> val r2 = sc.parallelize(List(1,2,3,4,5)).map(_*2).sortBy(x=>x.toString,true)
r2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[54] at sortBy at <console>:24
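Collecting this confirms the lexicographic ordering: the doubled values 2, 4, 6, 8, 10 compare as strings, so "10" sorts first:

scala> sc.parallelize(List(1,2,3,4,5)).map(_*2).sortBy(x=>x.toString,true).collect
res: Array[Int] = Array(10, 2, 4, 6, 8)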
Split each string on spaces and flatten the results with flatMap. (The empty string "" in the output below suggests the input actually contained two consecutive spaces; split(" ") preserves empty tokens between adjacent delimiters.)
scala> val r4 = sc.parallelize(Array("1 2 a b","c d e f","g h j"))
r4: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[55] at parallelize at <console>:24
scala> r4.flatMap(_.split(" ")).collect
res28: Array[String] = Array(1, 2, a, b, c, d, e, f, "", g, h, j)
Split and flatten the nested lists. Note: the inner flatMap here is the method on the Scala List itself, not on the RDD.
scala> val r5 = sc.parallelize(List(List("a b c","1 2 3"),List("1 2 c","d f g")))
r5: org.apache.spark.rdd.RDD[List[String]] = ParallelCollectionRDD[57] at parallelize at <console>:24
scala> r5.flatMap(_.flatMap(_.split(" "))).collect
res29: Array[String] = Array(a, b, c, 1, 2, 3, 1, 2, c, d, f, g)
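flatMap is often followed by map and reduceByKey for word counting; a minimal sketch (the ordering of the result pairs may vary across partitions):

scala> sc.parallelize(List("a b a")).flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect
res: Array[(String, Int)] = Array((a,2), (b,1))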