Create an ordinary Scala collection
scala> val a1=Array(1,2,3,4,5,6)
a1: Array[Int] = Array(1, 2, 3, 4, 5, 6)
Create an RDD with two partitions
scala> val r1=sc.parallelize(a1,2)
r1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:26
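Note that parallelize is lazy: building r1 runs no job until an action such as collect is called. The lineage can be inspected at any time (a sketch; the exact string varies by Spark version):
scala> r1.toDebugString
// e.g. (2) ParallelCollectionRDD[2] at parallelize at <console>:26 []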
Check the number of partitions
scala> r1.partitions.size
res6: Int = 2
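If the partition count is omitted, parallelize falls back to sc.defaultParallelism, which depends on the master URL and available cores (a sketch; the resulting values are environment-dependent):
scala> sc.defaultParallelism
scala> sc.parallelize(a1).partitions.size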
View the data in each partition
scala> r1.glom.collect
res7: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6))
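glom shows the contents of each partition as an array; to tag each element with its partition index instead, mapPartitionsWithIndex works as a quick check (a sketch; the expected result follows from the split shown above):
scala> r1.mapPartitionsWithIndex((idx, it) => it.map(x => (idx, x))).collect
// expected: Array((0,1), (0,2), (0,3), (1,4), (1,5), (1,6))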
Read data from a local file
scala> val r3=sc.textFile("file:///home/hadoop/1.txt",2)
r3: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/1.txt MapPartitionsRDD[2] at textFile at <console>:24
scala> r3.collect
res1: Array[String] = Array(192.168.234.21, 192.168.234.22, 192.168.234.23, "")
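The trailing empty string suggests 1.txt ends with a blank line; blank lines can be filtered out before further processing (a sketch):
scala> r3.filter(_.trim.nonEmpty).collect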
Read file data from HDFS
scala> val r3=sc.textFile("hdfs://mast1:9000/txt/ip.txt",2)
r3: org.apache.spark.rdd.RDD[String] = hdfs://mast1:9000/txt/ip.txt MapPartitionsRDD[4] at textFile at <console>:24
scala> r3.collect
res2: Array[String] = Array(10.9.80.16, 10.9.132.111, 10.9.152.65, 10.9.21.119, 10.9.132.111, 10.9.130.83, 10.9.80.16, 10.9.152.65, 10…
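A common next step with an IP list like this is counting occurrences per address with a map/reduceByKey pair (a minimal sketch using the r3 above; the actual counts depend on ip.txt):
scala> r3.map(ip => (ip, 1)).reduceByKey(_ + _).collect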