Contents:
1. Convert a List to a JavaRDD and print the JavaRDD
2. Convert a List to a JavaRDD, convert the JavaRDD to a JavaPairRDD, and print the JavaPairRDD
3. Convert a JavaRDD<String> to a JavaRDD<Row>
1. Convert a List to a JavaRDD, then print the JavaRDD with collect() and foreach
/** @author Yu Wanlong */
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

public class ReadTextToRDD {
    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Text to RDD")
                .setMaster("local[2]").set("spark.executor.memory", "2g");
        // start a spark context
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        // build List
        List<String> list = Arrays.asList("a:1", "a:2", "b:1", "b:1", "c:1", "d:1");
        // List to JavaRDD
        JavaRDD<String> javaRDD = jsc.parallelize(list);
        // print the JavaRDD with collect()
        for (String str : javaRDD.collect()) {
            System.out.println(str);
        }
        // print the JavaRDD with foreach
        javaRDD.foreach(new VoidFunction<String>() {
            @Override public void call(String s) throws Exception {
                System.out.println(s);
            }
        });
    }
}
a:1
a:2
b:1
b:1
c:1
d:1
2. Convert a List to a JavaRDD, convert the JavaRDD to a JavaPairRDD, and print the JavaPairRDD
/** @author Yu Wanlong */
import java.util.Arrays;
import java.util.List;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;

public class ReadTextToRDD {
    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Text to RDD")
                .setMaster("local[2]").set("spark.executor.memory", "2g");
        // start a spark context
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        // build List
        List<String> list = Arrays.asList("a:1", "a:2", "b:1", "b:1", "c:1", "d:1");
        // List to JavaRDD
        JavaRDD<String> javaRDD = jsc.parallelize(list);
        // JavaRDD to JavaPairRDD
        JavaPairRDD<String, Integer> javaPairRDD = javaRDD.mapToPair(
                new PairFunction<String, String, Integer>() {
            @Override public Tuple2<String, Integer> call(String s) throws Exception {
                String[] ss = s.split(":");
                return new Tuple2<String, Integer>(ss[0], Integer.parseInt(ss[1]));
            }
        });
        // print the JavaPairRDD with collect()
        for (Tuple2<String, Integer> str : javaPairRDD.collect()) {
            System.out.println(str.toString());
        }
    }
}
(a,1)
(a,2)
(b,1)
(b,1)
(c,1)
(d,1)
The key points in converting a JavaRDD<String> to a JavaPairRDD<String, Integer> are:
First: the function passed to mapToPair is a PairFunction<String, String, Integer> — the type parameters are the input element type, then the key type and value type of the output.
Second: since a JavaPairRDD stores its data as key-value pairs, Tuple2<String, Integer> is the key-value type to return.
Third: in call(String s), String is the element type of the source JavaRDD, and s holds the element's value.
Fourth: return new Tuple2<String, Integer>(ss[0], Integer.parseInt(ss[1])) builds the key-value pair that is returned for each element.
Summary: converting a JavaRDD to a JavaPairRDD reorganizes each record into a key-value pair; once the data is keyed, JavaPairRDD's key-based operations can run much more efficiently.
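The per-record reshaping that mapToPair performs can be sketched in plain Java, without Spark, to make the key-value conversion explicit. This is a minimal sketch: the class `PairSketch` and the helper `splitToPair` are names introduced here for illustration, not part of the Spark API; `Map.Entry` stands in for Spark's `Tuple2`.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class PairSketch {
    // Mirrors the PairFunction body above: split "key:value" into a
    // (String, Integer) pair. splitToPair is a hypothetical helper,
    // not a Spark API.
    static Map.Entry<String, Integer> splitToPair(String s) {
        String[] ss = s.split(":");
        return new SimpleEntry<>(ss[0], Integer.parseInt(ss[1]));
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("a:1", "a:2", "b:1", "b:1", "c:1", "d:1");
        // Apply the transform to every record, as mapToPair would do
        // across partitions, and print in Tuple2's "(key,value)" style.
        for (String s : list) {
            Map.Entry<String, Integer> pair = splitToPair(s);
            System.out.println("(" + pair.getKey() + "," + pair.getValue() + ")");
        }
    }
}
```

The only difference from the Spark version is distribution: mapToPair applies the same function lazily, per element, across the RDD's partitions.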
3. Convert a JavaRDD<String> to a JavaRDD<Row>
/** @author Yu Wanlong */
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

public class ReadTextToRDD {
    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Text to RDD")
                .setMaster("local[2]").set("spark.executor.memory", "2g");
        // start a spark context
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        // build List
        List<String> list = Arrays.asList("a:1", "a:2", "b:1", "b:1", "c:1", "d:1");
        // List to JavaRDD
        JavaRDD<String> javaRDD = jsc.parallelize(list);
        // JavaRDD<String> to JavaRDD<Row>
        JavaRDD<Row> javaRDDRow = javaRDD.map(new Function<String, Row>() {
            @Override public Row call(String s) throws Exception {
                String[] ss = s.split(":");
                return RowFactory.create(ss[0], ss[1]);
            }
        });
        // print the JavaRDD<Row>
        for (Row str : javaRDDRow.collect()) {
            System.out.println(str.toString());
        }
    }
}
[a,1]
[a,2]
[b,1]
[b,1]
[c,1]
[d,1]