Spark Secondary Sort

Oasen

Article link: https://blog.csdn.net/dec_sun/article/details/90597440

Secondary sort

Custom class implementing Comparable and Serializable

Requirement

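Display the old data in the order shown in the new column.

old	new
1 5	3 6
2 4	1 5
3 6	2 4
1 3	1 3
2 1	2 1

Note that the new column lists the rows by the second field in descending order. The code in this section compares the first field and then the second, so with sortByKey(false) it orders the rows as 3 6, 2 4, 2 1, 1 5, 1 3.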

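Steps:

Define a custom class that implements Comparable<CustomClassName> and Serializable, and write its compareTo() method.
In the job itself, wrap the fields that need the secondary sort in the custom class and use it as the key, then call sortByKey to trigger the sort.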

JavaPairRDD<SecondarySortKey, String> pairs = lines.mapToPair(new PairFunction<String, SecondarySortKey, String>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Tuple2<SecondarySortKey, String> call(String s) throws Exception {
        String[] lineSplited = s.split(" ");
        SecondarySortKey key = new SecondarySortKey(Integer.parseInt(lineSplited[0]), Integer.parseInt(lineSplited[1]));
        return new Tuple2<SecondarySortKey, String>(key, s);
    }
});

// sortByKey orders the pairs with SecondarySortKey.compareTo; passing false sorts in descending order.
JavaPairRDD<SecondarySortKey, String> sortedPairs = pairs.sortByKey(false);

Example

Custom class implementing Comparable and Serializable

import java.io.Serializable;

public class SecondarySortKey implements Comparable<SecondarySortKey>, Serializable {
    private static final long serialVersionUID = 1L;

    private int first;
    private int second;

    public SecondarySortKey(int first, int second) {
        this.first = first;
        this.second = second;
    }

    public int getFirst() {
        return first;
    }

    public void setFirst(int first) {
        this.first = first;
    }

    public int getSecond() {
        return second;
    }

    public void setSecond(int second) {
        this.second = second;
    }

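    @Override
    public int compareTo(SecondarySortKey other) {
        // Compare on the first field; fall back to the second field only on a tie.
        // Integer.compare avoids the overflow that subtraction can cause.
        int compare = Integer.compare(this.first, other.first);
        if (compare == 0) {
            return Integer.compare(this.second, other.second);
        }
        return compare;
    }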
}

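The idea is to sort by key, where each key is an instance of the custom class above. sortByKey orders the pairs by the natural ordering of their keys, so when it is called on the RDD the keys are compared with SecondarySortKey.compareTo.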
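A compareTo over int fields can be written with subtraction or with Integer.compare. The two agree for small values but differ when the subtraction overflows, which flips the sign of the result. The sketch below is plain Java without Spark; the class and method names (CompareOverflowDemo, bySubtraction, bySafeCompare) are illustrative and do not come from the original code.

```java
final class CompareOverflowDemo {
    // Subtraction-based comparison: compact, but the difference can overflow.
    static int bySubtraction(int a, int b) {
        return a - b;
    }

    // Integer.compare never overflows; it returns a negative, zero, or positive value.
    static int bySafeCompare(int a, int b) {
        return Integer.compare(a, b);
    }

    public static void main(String[] args) {
        int small = Integer.MIN_VALUE;
        int large = 1;
        // small < large, so a correct comparison must be negative.
        System.out.println(bySubtraction(small, large) < 0); // false: MIN_VALUE - 1 wraps to MAX_VALUE
        System.out.println(bySafeCompare(small, large) < 0); // true
    }
}
```

For keys like the small sample values in this article the two forms behave the same; the difference only matters when the fields can be far apart.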

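import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class SecondarySort {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setAppName("SecondarySort");
        conf.setMaster("local");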

        JavaSparkContext jsc = new JavaSparkContext(conf);
        jsc.setLogLevel("ERROR");

        JavaRDD<String> lines = jsc.textFile("in/sort");
        
        JavaPairRDD<SecondarySortKey,String> pairs = lines.mapToPair(new PairFunction<String, SecondarySortKey, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<SecondarySortKey, String> call(String s) throws Exception {
                String[] lineSplited = s.split(" ");
                SecondarySortKey key = new SecondarySortKey(Integer.valueOf(lineSplited[0]),Integer.valueOf(lineSplited[1]));
                return new Tuple2<SecondarySortKey,String>(key,s);
            }
        });

        // sortByKey orders the pairs with SecondarySortKey.compareTo; passing false sorts in descending order.
        JavaPairRDD<SecondarySortKey,String> sortedPairs = pairs.sortByKey(false);

        JavaRDD<String> mapLines = sortedPairs.map(new Function<Tuple2<SecondarySortKey,String>, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public String call(Tuple2<SecondarySortKey, String> v1) throws Exception {
                return v1._2;
            }
        });

        mapLines.foreach(new VoidFunction<String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public void call(String s) throws Exception {
                System.out.println("s: "+s);
            }
        });

        jsc.stop();
    }
}

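In summary, this approach defines a custom class that wraps the fields needed for the secondary sort and uses it as the key; sortByKey then sorts the pairs by calling that class's compareTo() method.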
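To see which order this comparison produces for the sample rows, it can be tried without a Spark cluster. The following is a minimal sketch in plain Java: Key is a hypothetical stand-in for SecondarySortKey reduced to its two fields, and Collections.reverseOrder() plays the role of sortByKey(false).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class SecondarySortOrderDemo {
    // Stand-in for SecondarySortKey: compare on first, then on second.
    static final class Key implements Comparable<Key> {
        final int first;
        final int second;

        Key(int first, int second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public int compareTo(Key other) {
            int byFirst = Integer.compare(first, other.first);
            return byFirst != 0 ? byFirst : Integer.compare(second, other.second);
        }

        @Override
        public String toString() {
            return first + " " + second;
        }
    }

    // Parses "a b" lines into keys and returns them sorted in descending order.
    static List<String> sortDescending(List<String> lines) {
        List<Key> keys = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split(" ");
            keys.add(new Key(Integer.parseInt(parts[0]), Integer.parseInt(parts[1])));
        }
        keys.sort(Collections.reverseOrder());
        List<String> result = new ArrayList<>();
        for (Key key : keys) {
            result.add(key.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1 5", "2 4", "3 6", "1 3", "2 1");
        System.out.println(sortDescending(lines)); // [3 6, 2 4, 2 1, 1 5, 1 3]
    }
}
```

Rows are ordered by the first field, and the second field only decides between rows whose first fields are equal.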

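Sorting the wordCount results

Requirement

Display the wordCount results ordered by the number of occurrences of each word, from highest to lowest.

Steps:

Compute the number of occurrences of every word with wordCount.
Use a map (mapToPair in the Java API) to swap <word, count> into <count, word>, sort with sortByKey, then use another map to swap <count, word> back into <word, count>.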
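The swap is needed in Spark because sortByKey sorts on the key only. The expected result can be checked on a small input with java.util collections, where the entries can be sorted on the count directly. The sketch below is plain Java, not the Spark job itself; breaking ties by word is an assumption added here so that the output is deterministic, and the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WordCountOrderDemo {
    // Counts words separated by single spaces and returns "word=count" strings,
    // highest count first; equal counts are ordered by word (an added assumption).
    static List<String> countAndSort(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue()); // descending count
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            result.add(entry.getKey() + "=" + entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("spark sort spark", "sort spark java");
        System.out.println(countAndSort(lines)); // [spark=3, sort=2, java=1]
    }
}
```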

Example

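import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class WordCountSort {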
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setAppName("WordCountSort");
        conf.setMaster("local");

        JavaSparkContext jsc = new JavaSparkContext(conf);
        jsc.setLogLevel("ERROR");

        JavaRDD<String> lines = jsc.textFile("in/README.md");

        // Split each input line on spaces.
        JavaRDD<String> flatMapWords = lines.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Iterator<String> call(String s) throws Exception {
                return Arrays.asList(s.split(" ")).iterator();
            }
        });

        // Pair each word with an initial count of 1.
        JavaPairRDD<String,Integer> pairs = flatMapWords.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<String, Integer> call(String s) throws Exception {
                return new Tuple2<String, Integer>(s,1);
            }
        });

        // Sum the values that share a key to get the number of occurrences of each key.
        JavaPairRDD<String, Integer> reducePairCounts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });

        // Swap to produce <count, word>.
        JavaPairRDD<Integer, String> countWords = reducePairCounts.mapToPair(new PairFunction<Tuple2<String, Integer>, Integer, String>() {
            @Override
            public Tuple2<Integer, String> call(Tuple2<String, Integer> t2) throws Exception {
                return new Tuple2<Integer, String>(t2._2,t2._1);
            }
        });

        // Sort the swapped pairs in descending order, then swap back to <word, count>.
        JavaPairRDD<String,Integer> wordCounts = countWords.sortByKey(false).mapToPair(new PairFunction<Tuple2<Integer,String>, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(Tuple2<Integer, String> t) throws Exception {
                return new Tuple2<String, Integer>(t._2,t._1);
            }
        });

        wordCounts.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            @Override
            public void call(Tuple2<String, Integer> s2) throws Exception {
                System.out.println("s2: "+s2._1+" appears "+s2._2+" times");
            }
        });

        jsc.stop();
    }
}

Sorting names

Requirement

pollock,Divito
pollock,Divlio
richard,Kingry
richard,Kings
pollock,Dixey
pollock,Dixie
pollock,Dixion
pollock,Dixon
richard,Mathwich
richard,Mathys
richard,Matias
richard,Matice
richard,Matier
richard,Matin

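Apply a secondary sort to the data above: order the rows by the first field, then by the second field within each first field.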
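The group-then-sort approach can be tried on a few rows without Spark. The following is a minimal plain-Java sketch, assuming natural (case-sensitive) String ordering: a TreeMap stands in for groupByKey followed by sortByKey, and NameGroupingDemo and groupAndSort are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class NameGroupingDemo {
    // Groups "key,value" rows by key, sorts the values of each key,
    // and returns the rows ordered by key and then by value.
    static List<String> groupAndSort(List<String> lines) {
        // TreeMap keeps its keys sorted, like sortByKey on the grouped data.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            grouped.computeIfAbsent(fields[0], k -> new ArrayList<>()).add(fields[1]);
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            List<String> values = entry.getValue();
            Collections.sort(values); // sort the values within one key
            for (String value : values) {
                result.add(entry.getKey() + "," + value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "richard,Kings", "pollock,Dixon", "richard,Kingry", "pollock,Divito");
        for (String row : groupAndSort(lines)) {
            System.out.println(row);
        }
        // pollock,Divito
        // pollock,Dixon
        // richard,Kingry
        // richard,Kings
    }
}
```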

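import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class SecondarySortName {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setAppName("SecondarySortName");
        conf.setMaster("local");

        JavaSparkContext jsc = new JavaSparkContext(conf);
        jsc.setLogLevel("ERROR");

        JavaRDD<String> lines = jsc.textFile("in/names.csv");
        JavaPairRDD<String, String> pairRDD = lines.mapToPair(new PairFunction<String, String, String>() {
            @Override
            public Tuple2<String, String> call(String s) throws Exception {
                String[] fields = s.split(",");
                return new Tuple2<String, String>(fields[0], fields[1]);
            }
        });

        // Sort the values within each key.
        JavaPairRDD<String, Iterable<String>> listRDD = pairRDD.groupByKey().mapValues(new Function<Iterable<String>, Iterable<String>>() {
            @Override
            public Iterable<String> call(Iterable<String> v1) throws Exception {
                // Copy the grouped values into a list so they can be sorted.
                List<String> list = new ArrayList<String>();
                for (String value : v1) {
                    list.add(value);
                }
                list.sort(new StringComparator());
                return list;
            }
        });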

        // Sort by key, then flatten each (key, sorted values) group back into (key, value) rows.
        JavaRDD<Tuple2<String, String>> resultRDD = listRDD.sortByKey().flatMap(new FlatMapFunction<Tuple2<String, Iterable<String>>, Tuple2<String, String>>() {
            @Override
            public Iterator<Tuple2<String, String>> call(Tuple2<String, Iterable<String>> t2) throws Exception {
                List<Tuple2<String, String>> list = new ArrayList<Tuple2<String, String>>();
                String key = t2._1;
                for (String str : t2._2) {
                    list.add(new Tuple2<String, String>(key, str));
                }
                return list.iterator();
            }
        });

        resultRDD.foreach(new VoidFunction<Tuple2<String,String>>() {
            @Override
            public void call(Tuple2<String, String> t2) throws Exception {
                System.out.println(t2._1+","+t2._2);
            }
        });


        jsc.stop();
    }

    // Orders strings by their natural (case-sensitive) order.
    private static class StringComparator implements Comparator<String> {

        @Override
        public int compare(String name1, String name2) {
            return name1.compareTo(name2);
        }
    }
}
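The grouping approach holds all values of one key in memory while it sorts them. An alternative, in line with the first example in this article, is a composite key that carries both fields so that a single sort orders the rows. The class below is a sketch under that assumption: NameKey is a hypothetical name, and the demo sorts a plain list rather than an RDD. In a Spark job such a key would be used with mapToPair and sortByKey in the same way as SecondarySortKey.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class NameKeyDemo {
    // Composite key: order by the first field, then by the second field.
    static final class NameKey implements Comparable<NameKey>, Serializable {
        private static final long serialVersionUID = 1L;

        final String first;
        final String second;

        NameKey(String first, String second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public int compareTo(NameKey other) {
            int byFirst = first.compareTo(other.first);
            return byFirst != 0 ? byFirst : second.compareTo(other.second);
        }

        @Override
        public String toString() {
            return first + "," + second;
        }
    }

    // Parses "first,second" rows into keys and returns them in ascending order.
    static List<String> sortRows(List<String> lines) {
        List<NameKey> keys = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            keys.add(new NameKey(fields[0], fields[1]));
        }
        Collections.sort(keys);
        List<String> result = new ArrayList<>();
        for (NameKey key : keys) {
            result.add(key.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "richard,Matin", "pollock,Dixie", "richard,Matias", "pollock,Dixey");
        System.out.println(sortRows(lines));
        // [pollock,Dixey, pollock,Dixie, richard,Matias, richard,Matin]
    }
}
```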