1. distinct
distinct removes duplicate elements from an RDD. It triggers a shuffle, so it is an expensive operation.
Scala version
val conf=new SparkConf().setMaster("local[2]").setAppName("distinctdemo")
val sc=new SparkContext(conf)
val rdd=sc.makeRDD(List("a","b","a","c"))
rdd.distinct.collect.foreach(println)
Java version
SparkConf conf=new SparkConf().setMaster("local[2]").setAppName("distinctJava");
JavaSparkContext sc=new JavaSparkContext(conf);
List<String> strings =Arrays.asList("a","b","a","b","c");
JavaRDD<String> strRdd=sc.parallelize(strings);
JavaRDD<String> distinctRdd=strRdd.distinct();
List<String> collect=distinctRdd.collect();
for(String s:collect){
System.out.println(s);
}
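To make the shuffle cost concrete, here is a minimal plain-Java sketch (java.util only, no Spark). It is a simplified model, not Spark's actual implementation: the class `DistinctSketch`, the helper `distinct`, and the bucket count are hypothetical names chosen for illustration. Equal values that start in different partitions can only be deduplicated after they are routed to the same place, and that routing is the expensive part.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class DistinctSketch {
    // Hypothetical helper: models distinct() over data split into partitions.
    static List<String> distinct(List<List<String>> partitions, int numBuckets) {
        List<Set<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new LinkedHashSet<>());
        }
        // "Shuffle" step: route each element to a bucket chosen by its hash,
        // so equal elements from different partitions meet in one bucket.
        for (List<String> partition : partitions) {
            for (String s : partition) {
                buckets.get(Math.floorMod(s.hashCode(), numBuckets)).add(s);
            }
        }
        // Each bucket is a set, so every value now appears exactly once.
        List<String> out = new ArrayList<>();
        for (Set<String> bucket : buckets) {
            out.addAll(bucket);
        }
        return out;
    }

    public static void main(String[] args) {
        // "a" appears in both partitions, so no partition can dedupe it alone.
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("a", "c"));
        System.out.println(distinct(partitions, 2));
    }
}
```

The order of the result follows the buckets, not the input, which mirrors why the output order of distinct should not be relied on.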
2. union
union merges two RDDs into one. Duplicates are kept, and no shuffle is required.
Scala version
val rdd1=sc.makeRDD(List("a","b"))
val rdd2=sc.makeRDD(List("c","d"))
val unionRdd=rdd1.union(rdd2)
unionRdd.collect.foreach(println)
Java version
JavaRDD<String> rdd1=sc.parallelize(Arrays.asList("a","b"));
JavaRDD<String> rdd2=sc.parallelize(Arrays.asList("c","d"));
JavaRDD<String> unionRdd=rdd1.union(rdd2);
List<String> collect=unionRdd.collect();
for(String s:collect){
System.out.println(s);
}
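The sample data above has no overlapping elements, so it does not show what happens to duplicates. The plain-Java sketch below (java.util only, no Spark; `UnionSketch` and `union` are hypothetical names) models union as simple concatenation, which is why a shared element appears twice in the result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UnionSketch {
    // Hypothetical helper: models union() as concatenation, keeping duplicates.
    static <T> List<T> union(List<T> a, List<T> b) {
        List<T> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    public static void main(String[] args) {
        // "b" is present in both inputs and is kept twice.
        System.out.println(union(Arrays.asList("a", "b"), Arrays.asList("b", "c")));
        // prints [a, b, b, c]
    }
}
```

If a deduplicated result is needed in Spark, chaining distinct() after union is the usual approach, at the cost of the shuffle described in the distinct section.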
3. intersection
intersection returns the elements common to both RDDs, with duplicates removed.
It requires a shuffle, so it is relatively expensive.
Scala version
val rdd1=sc.makeRDD(List("aa","bb","cc"))
val rdd2=sc.makeRDD(List("cc","bb","dd"))
val intersectionRdd = rdd1.intersection(rdd2)
intersectionRdd.collect.foreach(println)
Java version
JavaRDD<String> rdd1=sc.parallelize(Arrays.asList("aa","bb","cc"));
JavaRDD<String> rdd2=sc.parallelize(Arrays.asList("cc","bb","dd"));
JavaRDD<String> intersectionRdd=rdd1.intersection(rdd2);
List<String> collect=intersectionRdd.collect();
for(String s:collect){
System.out.println(s);
}
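The deduplication is easy to miss because the sample inputs contain no repeated values. As a rough model in plain Java (java.util only, no Spark; `IntersectionSketch` and `intersection` are hypothetical names), the result behaves like a set intersection: a value repeated in one input still appears only once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class IntersectionSketch {
    // Hypothetical helper: models intersection() as a set intersection.
    static List<String> intersection(List<String> a, List<String> b) {
        Set<String> common = new LinkedHashSet<>(a); // copying into a set removes duplicates from a
        common.retainAll(new HashSet<>(b));          // keep only values that also occur in b
        return new ArrayList<>(common);
    }

    public static void main(String[] args) {
        // "bb" is repeated in the first input but appears once in the result.
        System.out.println(intersection(
                Arrays.asList("aa", "bb", "bb", "cc"),
                Arrays.asList("cc", "bb", "dd")));
        // prints [bb, cc]
    }
}
```

The order shown here comes from the LinkedHashSet in this sketch; the order of a collected Spark result should not be assumed to match.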
4. subtract
subtract returns the elements that appear in rdd1 but not in rdd2. Duplicates are not removed.
Scala version
val rdd1=sc.makeRDD(List("aa","bb","cc"))
val rdd2=sc.makeRDD(List("cc","bb","dd"))
val subtractRdd = rdd1.subtract(rdd2)
subtractRdd.collect.foreach(println)
Java version
JavaRDD<String> rdd1=sc.parallelize(Arrays.asList("aa","bb","cc"));
JavaRDD<String> rdd2=sc.parallelize(Arrays.asList("cc","bb","dd"));
JavaRDD<String> subtractRdd=rdd1.subtract(rdd2);
List<String> collect=subtractRdd.collect();
for(String s:collect){
System.out.println(s);
}
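To illustrate the "duplicates are not removed" point, the following plain-Java sketch (java.util only, no Spark; `SubtractSketch` and `subtract` are hypothetical names) filters the first list against the second. Unlike intersection, a value repeated in the first input stays repeated in the result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SubtractSketch {
    // Hypothetical helper: models subtract() as a filter, keeping duplicates.
    static List<String> subtract(List<String> a, List<String> b) {
        Set<String> exclude = new HashSet<>(b);
        List<String> out = new ArrayList<>();
        for (String s : a) {
            if (!exclude.contains(s)) {
                out.add(s); // repeated values in a are all kept
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "aa" occurs twice in the first input and twice in the result.
        System.out.println(subtract(
                Arrays.asList("aa", "aa", "bb", "cc"),
                Arrays.asList("cc", "bb", "dd")));
        // prints [aa, aa]
    }
}
```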
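5. cartesian
cartesian returns the Cartesian product of rdd1 and rdd2. The result contains one pair for every combination of elements, so the cost grows with the product of the two sizes; use it with caution.
Scala version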
val rdd1 = sc.makeRDD(List("1","2","3"))
val rdd2 = sc.makeRDD(List("a","b","c"))
val cartesianRdd=rdd1.cartesian(rdd2)
cartesianRdd.collect.foreach(println)
Java version
JavaRDD<String> rdd1 = sc.parallelize(Arrays.asList("1","2","3"));
JavaRDD<String> rdd2 = sc.parallelize(Arrays.asList("a","b","c"));
JavaPairRDD<String,String> cartesianRdd=rdd1.cartesian(rdd2);
List<Tuple2<String,String>> collect=cartesianRdd.collect();
for(Tuple2<String,String> t:collect){
System.out.println(t);
}
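The cost warning can be checked with a small plain-Java sketch (java.util only, no Spark; `CartesianSketch` and `cartesian` are hypothetical names, and `Map.Entry` stands in for Spark's `Tuple2`). Two inputs of 3 elements each already produce 9 pairs, and the count is always the product of the two sizes.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class CartesianSketch {
    // Hypothetical helper: models cartesian() with nested loops.
    static List<Map.Entry<String, String>> cartesian(List<String> a, List<String> b) {
        List<Map.Entry<String, String>> out = new ArrayList<>(a.size() * b.size());
        for (String x : a) {
            for (String y : b) {
                out.add(new SimpleImmutableEntry<>(x, y)); // one pair per combination
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = cartesian(
                Arrays.asList("1", "2", "3"),
                Arrays.asList("a", "b", "c"));
        System.out.println(pairs.size()); // prints 9 (3 * 3)
    }
}
```

At the same rate, two inputs of one million elements each would yield 10^12 pairs, which is why the operation is rarely practical on large RDDs.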