RDD Basic Operations
The RDD is the basic object Spark exposes to programmers; most Map/Reduce-style operations are performed on RDDs.
1. Convert a List into an RDD
scala> val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
2. Filter the RDD
scala> val filteredRDD = rdd.filter(_ >= 4)
filteredRDD: org.apache.spark.rdd.RDD[Int] = FilteredRDD[1] at filter at <console>:14
3. Run collect on filteredRDD
scala> filteredRDD.collect()
res0: Array[Int] = Array(4, 5) // the elements satisfying the predicate are 4 and 5
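The filter/collect pair above has the same semantics as `filter` on an ordinary Scala collection; a minimal plain-Scala sketch (no Spark cluster needed) of the same step:

```scala
// Plain-Scala analogue of the RDD filter + collect steps above.
// Note: RDD.filter is lazy (nothing runs until collect), whereas
// List.filter here evaluates eagerly; the final result is the same.
object FilterDemo extends App {
  val data = List(1, 2, 3, 4, 5)
  val filtered = data.filter(_ >= 4) // keep only elements >= 4
  println(filtered)                  // List(4, 5)
}
```

The key difference is that on an RDD, `filter` only records the transformation; the work is actually distributed and executed when an action such as `collect` is called.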
4. Apply map to the RDD
scala> var mappedRDD = rdd.map(_ * 2)
mappedRDD: org.apache.spark.rdd.RDD[Int] = MappedRDD[2] at map at <console>:14
5. Run collect on mappedRDD
scala> mappedRDD.collect();
res1: Array[Int] = Array(2, 4, 6, 8, 10)
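As with filter, `map` on an RDD mirrors `map` on a local Scala collection; the same transformation can be sketched without Spark:

```scala
// Plain-Scala analogue of the RDD map + collect steps above.
object MapDemo extends App {
  val data = List(1, 2, 3, 4, 5)
  val doubled = data.map(_ * 2) // multiply every element by 2
  println(doubled)              // List(2, 4, 6, 8, 10)
}
```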
6. Functional-style chaining on the RDD
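The transcript for this step is truncated in the source. The general idea, composing several transformations in one functional-style expression before an action, can be sketched with plain Scala collections (the chained expression and its values are an illustrative assumption, not the original transcript):

```scala
// Chaining map and filter in one expression: the plain-Scala
// analogue of composing RDD transformations before an action.
object ChainDemo extends App {
  val result = List(1, 2, 3, 4, 5)
    .map(_ * 2)     // 2, 4, 6, 8, 10
    .filter(_ >= 4) // drop values below 4
  println(result)   // List(4, 6, 8, 10)
}
```

On an RDD, the equivalent chain (e.g. `rdd.map(_ * 2).filter(_ >= 4).collect()`) builds a lineage of lazy transformations that only executes when the final action runs.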