Contents
1. distinct
distinct removes duplicate elements from an RDD. Because deduplication must compare elements across partitions, it triggers a shuffle, which makes it an expensive operation.
Scala version
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[3]").setAppName("rdddemo")
val sc = SparkContext.getOrCreate(conf)
println("-----------distinct operator---------------")
val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 2, 6))
val rdd2 = rdd1.distinct
println("Number of partitions in rdd1: " + rdd1.partitions.length)
println("Number of partitions in rdd2: " + rdd2.partitions.length)
rdd2.collect.foreach(println)
// distinct also accepts a numPartitions argument to control the output partitioning
val rdd3 = rdd1.distinct(2)
println("Number of partitions in rdd3: " + rdd3.partitions.length)
Java version
// sc is a JavaSparkContext created the same way as above
List<String> strings = Arrays.asList("aa", "bb", "aa", "bb", "cc", "dd");
JavaRDD<String> strRdd = sc.parallelize(strings);
JavaRDD<String> distinctRdd = strRdd.distinct();
List<String> collect = distinctRdd.collect();
for (String str : collect) {
    System.out.println(str);
}
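The shuffle cost mentioned above comes from how distinct works internally: each element is keyed, one copy per key is kept via a by-key reduction, and the keys are then unwrapped. A minimal plain-Scala sketch of that pattern (ordinary collections stand in for RDDs, and groupBy stands in for the shuffle, so no Spark is required to run it):

```scala
// Sketch of the keyed-deduplication pattern behind distinct:
// pair each element with a dummy value, group by key (the "shuffle"),
// keep one copy of each key, then drop the pairing.
val data = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 2, 6)

val deduped = data
  .map(x => (x, null))        // key each element with a placeholder value
  .groupBy(_._1)              // stands in for the shuffle in a by-key reduction
  .map { case (k, _) => k }   // keep exactly one copy of each key
  .toList
  .sorted                     // element order after a shuffle is not guaranteed

println(deduped) // List(1, 2, 3, 4, 5, 6, 7, 8, 9)
```

Sorting at the end is only for readable output; a real distinct makes no ordering promise, which is why the results printed by the Spark examples above may appear in any order.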