parallelize
Calling parallelize() on the SparkContext turns an existing in-memory collection into an RDD.
Scala version
def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T]
- The first parameter is a Seq collection
- The second parameter, numSlices, is the number of partitions; it defaults to defaultParallelism
- The return value is an RDD[T]
scala> sc.parallelize(List("nanjing","is a beautiful city"))
res0: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:25
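A minimal sketch of the numSlices parameter: requesting 4 slices splits the collection into 4 partitions, which getNumPartitions confirms (run in spark-shell; the comment shows the expected value).
// numSlices controls how many partitions the resulting RDD has
val rdd = sc.parallelize(1 to 10, 4)
rdd.getNumPartitions  // Int = 4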
Java version
def parallelize[T](list : java.util.List[T], numSlices : scala.Int) : org.apache.spark.api.java.JavaRDD[T] = { /* compiled code */ }
- The first parameter is a java.util.List collection
- The second parameter is the number of partitions; an overload without it falls back to the default parallelism
- The return value is a JavaRDD[T]
Note: the Java version accepts only a List collection:
List<String> strings = Arrays.asList("nanjing","is a beautiful city");
makeRDD
Only the Scala API provides makeRDD.
scala> sc.makeRDD(List("nanjing","is a beautiful city"))
res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at makeRDD at <console>:25
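Under the hood, makeRDD simply delegates to parallelize, so numSlices behaves the same way; a second overload additionally records preferred host locations for each partition. A minimal sketch (the hostnames are hypothetical):
// Equivalent to parallelize: split the list into 2 partitions
val rdd = sc.makeRDD(List("nanjing", "is a beautiful city"), 2)

// Overload unique to makeRDD: each element carries its preferred hosts
// ("hadoop100" and "hadoop101" are hypothetical hostnames)
val located = sc.makeRDD(Seq(
  (1, Seq("hadoop100")),
  (2, Seq("hadoop101"))
))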
textFile
Calling the SparkContext.textFile() method creates an RDD by reading data from external storage.
Scala version
val rdd2 = sc.textFile("D:/ideashuju/sparkdemo/in/word.txt")
Note: textFile accepts a minimum-partition hint, supports wildcard (glob) paths, and can also read files from HDFS, as sketched below.
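A minimal sketch of those three points (the local paths are illustrative; the HDFS URI matches the demo further down):
// Hint that at least 4 partitions should be used when reading
val rdd1 = sc.textFile("in/word.txt", 4)

// Wildcard (glob) path: read every .txt file under the directory
val rdd2 = sc.textFile("in/*.txt")

// Read directly from HDFS
val rdd3 = sc.textFile("hdfs://hadoop100:9000/kb09workspace/word2.txt")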
Java version
JavaRDD<String> stringJavaRDD = sc.textFile("D:/ideashuju/sparkdemo/in/word.txt");
Code demo
Scala version
package nj.zb.sparkstu

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object ParallelizeScala {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("scala1")
    val sc = new SparkContext(conf)

    // Create an RDD from an in-memory collection
    val rdd1: RDD[String] = sc.parallelize(List("hello world", "hello java", "hello spark"))
    rdd1.collect.foreach(println)
    println("-------------------------------------")

    // Create an RDD from a local text file
    val rdd2: RDD[String] = sc.textFile("in/word.txt")
    rdd2.collect.foreach(println)
    println("------------------------------------")

    // Create an RDD from a file on HDFS
    val rdd3: RDD[String] = sc.textFile("hdfs://hadoop100:9000/kb09workspace/word2.txt")
    rdd3.collect.foreach(println)

    sc.stop()
  }
}
Result: the contents of all three RDDs are printed to the console (output screenshot omitted).
Java version
package nj.zb.sparkstu;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class ParallelizeJava {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("java1");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Create a JavaRDD from an in-memory List
        List<String> strings = Arrays.asList("hello world", "hello java", "hello spark");
        JavaRDD<String> rdd1 = sc.parallelize(strings);
        List<String> collect = rdd1.collect();
        for (String value : collect) {
            System.out.println(value);
        }
        System.out.println("-------------------------------------------");

        // Create a JavaRDD from a local text file
        JavaRDD<String> stringJavaRDD = sc.textFile("in/word.txt");
        List<String> collect1 = stringJavaRDD.collect();
        for (String str : collect1) {
            System.out.println(str);
        }
        System.out.println("----------------------------------------------");

        // Create a JavaRDD from a file on HDFS
        JavaRDD<String> stringJavaRDD1 = sc.textFile("hdfs://hadoop100:9000/kb09workspace/word2.txt");
        List<String> collect2 = stringJavaRDD1.collect();
        for (String st1 : collect2) {
            System.out.println(st1);
        }

        sc.close();
    }
}
Result: the contents of all three RDDs are printed to the console (output screenshot omitted).