Creating an RDD: the dead-simple way
When getting started, spark-shell is the quickest way to explore RDD properties and get familiar with RDD operators.
Launch the shell:

```shell
spark-shell
```

Then parallelize a local collection:

```scala
val rdd = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)) // from an array
// rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24
```
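With the RDD in hand, a few basic operators can be tried right away. A minimal sketch, continuing the spark-shell session above (where `sc` is predefined by the shell):

```scala
// Continuing in spark-shell, where `sc` (SparkContext) is already defined.
val rdd = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

// Transformations are lazy; actions trigger the actual computation.
val doubled = rdd.map(_ * 2) // transformation: nothing runs yet
doubled.collect()            // action: Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
rdd.reduce(_ + _)            // action: 55
rdd.getNumPartitions         // how many partitions the data was split into
```

`collect()` pulls all results back to the driver, so it is also something to use only on small, test-sized data.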
```scala
// parallelize is handy for testing, but not for production: the entire
// dataset sits on the driver node before being distributed.
val rdd = spark.sparkContext.parallelize(
  Seq(("Java", 20000), ("Python", 100000), ("Scala", 3000)))
```
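Because each element here is a `(key, value)` tuple, the result is a pair RDD, which supports extra by-key operators. A small sketch, again assuming a spark-shell session where `spark` (a SparkSession) is predefined:

```scala
// Continuing in spark-shell, where `spark` (SparkSession) is already defined.
val rdd = spark.sparkContext.parallelize(
  Seq(("Java", 20000), ("Python", 100000), ("Scala", 3000)))

rdd.foreach(println) // prints each (language, count) pair

// Sort by the value (second tuple element), descending.
val sorted = rdd.sortBy(_._2, ascending = false)
sorted.collect() // Array((Python,100000), (Java,20000), (Scala,3000))
```

For real workloads, RDDs are typically created from external storage instead, e.g. `sc.textFile("hdfs://...")`, so the data never has to fit on a single node.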
```scala
import org.apache
```