一、Creating RDDs
1、External datasets
val rdd = sc.textFile("hdfs://hadoop-senior.ibeifeng.com:8020/user/root/mapreduce/wordcount/input/wc.input")
rdd.collect
res6: Array[String] = Array(hadoop hive, hive hadoop, hbase sqoop, hbase sqoop, hadoop hive)
rdd.map(s => s.length).reduce((x, y) => (x + y))
res9: Int = 55
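Building on the same file, the classic word count can be sketched in the spark-shell. This is a sketch assuming the same `sc` and the HDFS path shown above are available:

```scala
// Word count over the same external dataset (sketch; assumes spark-shell
// provides `sc` and the HDFS path above is reachable)
val lines = sc.textFile("hdfs://hadoop-senior.ibeifeng.com:8020/user/root/mapreduce/wordcount/input/wc.input")
val counts = lines
  .flatMap(line => line.split(" "))   // split each line into words
  .map(word => (word, 1))             // pair each word with a count of 1
  .reduceByKey((a, b) => a + b)       // sum the counts per word
counts.collect.foreach(println)       // e.g. (hadoop,3), (hive,3), (hbase,2), (sqoop,2)
```

Given the five lines collected above, hadoop and hive each appear 3 times, hbase and sqoop twice each.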
2、Parallelized collections
scala> val datas = Array(1, 2, 3, 4, 5, 6)
datas: Array[Int] = Array(1, 2, 3, 4, 5, 6)
scala> val arrRdd = sc.parallelize(datas)
arrRdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:23
scala> arrRdd.map(num => num * num).collect
res5: Array[Int] = Array(1, 4, 9, 16, 25, 36)
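`parallelize` also accepts an optional second argument: the number of partitions (slices) to split the collection into. A sketch in the same spark-shell session:

```scala
// Distribute the local collection across 3 partitions explicitly
val arrRdd3 = sc.parallelize(datas, 3)
arrRdd3.getNumPartitions             // 3
arrRdd3.map(num => num * num).collect // same result: Array(1, 4, 9, 16, 25, 36)
```

Spark runs one task per partition, so the slice count controls the parallelism of subsequent operations on this RDD.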
二、Transformations
A transformation creates a new RDD from an existing dataset. Transformations are lazy: they only record the computation to perform and do not execute it until an action needs a result.
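The laziness can be observed directly in the spark-shell; this sketch assumes `sc` is available:

```scala
// Laziness sketch: defining the transformation returns immediately;
// no Spark job is launched until an action is invoked
val nums = sc.parallelize(1 to 1000000)
val squares = nums.map(n => n * n)   // lazy: nothing is computed here
squares.take(3)                      // action: triggers computation, returns Array(1, 4, 9)
```

Because `take(3)` only needs the first three elements, Spark can even avoid computing the rest of the dataset.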