[Spark API, Java Edition] JavaPairRDD: distinct (Part 14)

An explanation of the distinct method of JavaPairRDD
Official documentation
 /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
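Description

Returns a new RDD with duplicates removed. For a JavaPairRDD, an element is the whole (key, value) tuple, so two pairs count as duplicates only when both the key and the value are equal.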
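
The comparison rule can be tried without a Spark cluster. The following is a minimal plain-Java sketch, not Spark code: `Map.Entry` stands in for `Tuple2`, and the class name `PairDistinctSemantics` and the helpers `pair` and `distinctPairs` are hypothetical names introduced only for this illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

public class PairDistinctSemantics {

    // Hypothetical helper: a Map.Entry stands in for Tuple2<String, String>.
    public static Map.Entry<String, String> pair(String key, String value) {
        return new SimpleEntry<>(key, value);
    }

    // Keeps the first occurrence of each pair. Entry equality covers both
    // the key and the value, so pairs sharing only a key are all kept.
    public static List<Map.Entry<String, String>> distinctPairs(
            List<Map.Entry<String, String>> pairs) {
        return new ArrayList<>(new LinkedHashSet<>(pairs));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> input = Arrays.asList(
                pair("cat", "11"), pair("cat", "11"),
                pair("cat", "12"), pair("dog", "22"));
        // prints [cat=11, cat=12, dog=22]
        System.out.println(distinctPairs(input));
    }
}
```

The repeated (cat, 11) collapses to one element, while (cat, 12) survives because its value differs. To keep one value per key instead, a key-based operation such as reduceByKey is the usual choice rather than distinct.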

Function signatures
// java
public JavaPairRDD<K,V> distinct()
public JavaPairRDD<K,V> distinct(int numPartitions)
// scala
def distinct(): JavaPairRDD[K, V]
def distinct(numPartitions: Int): JavaPairRDD[K, V]
Example
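import com.google.common.collect.Lists;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

public class Distinct {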
    public static void main(String[] args) {

        System.setProperty("hadoop.home.dir", "E:\\hadoop-2.7.1");
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("Spark_DEMO");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        JavaPairRDD<String, String> javaPairRDD1 = sc.parallelizePairs(Lists.newArrayList(
                new Tuple2<String, String>("cat", "11"), new Tuple2<String, String>("dog", "22"),
                new Tuple2<String, String>("cat", "11"), new Tuple2<String, String>("pig", "44"),
                new Tuple2<String, String>("duck", "55"), new Tuple2<String, String>("cat", "11"),
                new Tuple2<String, String>("cat", "12"), new Tuple2<String, String>("dog", "23"),
                new Tuple2<String, String>("cat", "11"), new Tuple2<String, String>("pig", "22"),
                new Tuple2<String, String>("duck", "55"), new Tuple2<String, String>("cat", "15")), 2);
        javaPairRDD1.foreach(new VoidFunction<Tuple2<String, String>>() {
            public void call(Tuple2<String, String> stringStringTuple2) throws Exception {
                System.out.println(stringStringTuple2);
            }
        });
        
        // Remove duplicates, keeping the current number of partitions
        JavaPairRDD<String,String> javaPairRDD = javaPairRDD1.distinct();
        javaPairRDD.foreach(new VoidFunction<Tuple2<String, String>>() {
            public void call(Tuple2<String, String> stringStringTuple2) throws Exception {
                System.out.println(stringStringTuple2);
            }
        });
        // Print the number of partitions
        System.out.println("Number of partitions: " + javaPairRDD.partitions().size());

        // The overload that takes a numPartitions argument
        JavaPairRDD<String,String> javaPairRDD2 = javaPairRDD1.distinct(3);

        javaPairRDD2.foreach(new VoidFunction<Tuple2<String, String>>() {
            public void call(Tuple2<String, String> stringStringTuple2) throws Exception {
                System.out.println("-->"+stringStringTuple2);
            }
        });
        // Print the number of partitions
        System.out.println("Number of partitions: " + javaPairRDD2.partitions().size());
    }
}
Output
19/03/21 10:53:57 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
(cat,11)
(dog,22)
(cat,11)
(pig,44)
(duck,55)
(cat,11)
19/03/21 10:53:57 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
(cat,12)
(dog,23)
(cat,11)
(pig,22)
(duck,55)
(cat,15)
19/03/21 10:53:57 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 110 ms on localhost (executor driver) (1/2)
19/03/21 10:53:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 16 ms
(cat,15)
(cat,12)
(cat,11)
(pig,22)
(pig,44)
19/03/21 10:53:57 INFO Executor: Finished task 0.0 in stage 2.0 (TID 4). 1052 bytes result sent to driver
19/03/21 10:53:57 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 4) in 62 ms on localhost (executor driver) (1/2)
(dog,23)
(dog,22)
(duck,55)
Number of partitions: 2
19/03/21 10:53:57 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks



19/03/21 10:53:57 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 8) in 16 ms on localhost (executor driver) (1/3)
-->(cat,15)
-->(dog,22)
-->(duck,55)
-->(pig,22)
-->(pig,44)
19/03/21 10:53:57 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 2 blocks
19/03/21 10:53:57 INFO TaskSetManager: Finished task 1.0 in stage 4.0 (TID 9) in 16 ms on localhost (executor driver) (2/3)
19/03/21 10:53:57 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/03/21 10:53:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/03/21 10:53:57 INFO Executor: Finished task 2.0 in stage 4.0 (TID 10). 923 bytes result sent to driver
19/03/21 10:53:57 INFO TaskSetManager: Finished task 2.0 in stage 4.0 (TID 10) in 0 ms on localhost (executor driver) (3/3)
19/03/21 10:53:57 INFO DAGScheduler: ResultStage 4 (foreach at Distinct.java:45) finished in 0.032 s
-->(dog,23)
-->(cat,12)
-->(cat,11)
Number of partitions: 3
19/03/21 10:53:57 INFO DAGScheduler: Job 2 finished: foreach at Distinct.java:45, took 0.127207 s

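The log shows that distinct() without arguments leaves the number of partitions unchanged (2 in this run), but the contents of each partition change: duplicates are removed across the whole RDD through a shuffle (note the ShuffleBlockFetcherIterator lines), not separately inside each original partition. When numPartitions is given, the result has exactly that many partitions, and some of them may be empty.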
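
Both observations follow from routing every element to a partition chosen by its hash. The following plain-Java simulation is a sketch of that idea under stated assumptions, not Spark's implementation: `Map.Entry` stands in for `Tuple2`, the class `DistinctPartitionSketch` and its methods are hypothetical names, and Java's entry hash differs from the hash Spark computes for a `Tuple2`, so the bucket each pair lands in will not match the log above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DistinctPartitionSketch {

    // Hypothetical helper: a Map.Entry stands in for Tuple2<String, String>.
    public static Map.Entry<String, String> pair(String key, String value) {
        return new SimpleEntry<>(key, value);
    }

    // Sends each element to the bucket selected by its hash. Equal elements
    // always share a bucket, where the set drops the duplicates, so the
    // deduplication is global no matter where the copies started out.
    public static List<Set<Map.Entry<String, String>>> distinct(
            List<Map.Entry<String, String>> elements, int numPartitions) {
        List<Set<Map.Entry<String, String>>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new LinkedHashSet<>());
        }
        for (Map.Entry<String, String> element : elements) {
            // floorMod keeps the index non-negative for negative hash codes.
            int index = Math.floorMod(element.hashCode(), numPartitions);
            partitions.get(index).add(element);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // The same twelve pairs as in the example above.
        List<Map.Entry<String, String>> input = Arrays.asList(
                pair("cat", "11"), pair("dog", "22"), pair("cat", "11"), pair("pig", "44"),
                pair("duck", "55"), pair("cat", "11"), pair("cat", "12"), pair("dog", "23"),
                pair("cat", "11"), pair("pig", "22"), pair("duck", "55"), pair("cat", "15"));
        List<Set<Map.Entry<String, String>>> result = distinct(input, 3);
        for (int i = 0; i < result.size(); i++) {
            System.out.println("partition " + i + ": " + result.get(i));
        }
    }
}
```

The number of buckets is fixed by numPartitions before any data is placed, which is why a bucket that no hash maps to simply stays empty. Asking for more partitions than there are distinct elements guarantees at least one empty partition.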
