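1. reduce works on an RDD of single values (not key/value pairs): it goes through the elements and combines them two at a time until one result is left.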
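import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function2;

// jsc is an existing JavaSparkContext
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6);
JavaRDD<Integer> parallelizeRdd = jsc.parallelize(data);
Integer reduceSum = parallelizeRdd.reduce(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        // v1 is the running result, v2 is the next value
        return v1 + v2;
    }
});
System.out.println(reduceSum); // 21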
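Processed in order, the successive calls are: 1+2=3; 3+3=6; 6+4=10; 10+5=15; 15+6=21. With more than one partition, the values may be combined in a different order, but the sum is still 21 because addition is associative and commutative.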
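The step-by-step sum above can be reproduced without a cluster. What follows is a minimal sketch in plain Java (no Spark dependency), assuming a hypothetical class `ReduceSketch` and a hand-picked split into two parts. It only imitates the idea of folding values inside each partition and then merging the partial results; it is not how Spark is implemented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

// Hypothetical helper class for illustration only; not part of Spark.
class ReduceSketch {

    // Folds a non-empty list left to right with f,
    // the same way the trace above walks through the values.
    static Integer fold(List<Integer> values, BinaryOperator<Integer> f) {
        Integer acc = values.get(0);
        for (int i = 1; i < values.size(); i++) {
            acc = f.apply(acc, values.get(i));
        }
        return acc;
    }

    // Imitates a two-partition reduce: fold each part, then merge the two
    // partial results. split must satisfy 0 < split < data.size().
    static Integer reduceInTwoParts(List<Integer> data, int split, BinaryOperator<Integer> f) {
        Integer left = fold(data.subList(0, split), f);
        Integer right = fold(data.subList(split, data.size()), f);
        return f.apply(left, right);
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6);
        BinaryOperator<Integer> add = (v1, v2) -> v1 + v2;
        System.out.println(fold(data, add));                // 21
        System.out.println(reduceInTwoParts(data, 3, add)); // 21, as (1+2+3) + (4+5+6)
    }
}
```

Both calls agree because addition is associative and commutative. A function such as subtraction would give different results depending on the split, which is why it should not be passed to reduce.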
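2. reduceByKey works on key/value pairs (two-element tuples). When called on a dataset of (K, V) pairs, it returns a dataset of (K, V) pairs in which all values that share a key have been merged by the given function, which takes two values of type V and returns one V.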
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

List<String> listKeys = Arrays.asList("k1", "k2", "k3", "k1", "k1", "k2");
JavaRDD<String> keysRDD = jsc.parallelize(listKeys);
keysRDD.mapToPair(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) throws Exception {
        // pair every key with an initial count of 1
        return new Tuple2<>(s, 1);
    }
}).reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        // merge two values that belong to the same key
        return v1 + v2;
    }
}).sortByKey(false) // false sorts the keys in descending order
  .foreach(new VoidFunction<Tuple2<String, Integer>>() {
    @Override
    public void call(Tuple2<String, Integer> wordCount) throws Exception {
        System.out.println(wordCount._1 + "=====" + wordCount._2);
    }
});
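The call method of reduceByKey only ever receives values that share the same key:
k1: 1+1=2, 2+1=3
k2: 1+1=2
k3: 1 (only one value, so call is never invoked for this key)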
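The per-key merging can also be tried in plain Java. The sketch below assumes a hypothetical class `ReduceByKeySketch` and uses `Map.merge` from the standard library in place of Spark. It mirrors the result of pairing each key with 1 and merging the values key by key, and leaves out shuffling and partitions.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Hypothetical helper class for illustration only; not part of Spark.
class ReduceByKeySketch {

    // Pairs every key with 1, then merges the values of equal keys with f.
    static Map<String, Integer> countByKey(List<String> keys, BinaryOperator<Integer> f) {
        // A TreeMap with reversed ordering keeps the keys in descending order,
        // similar in effect to sortByKey(false).
        Map<String, Integer> result = new TreeMap<>(Comparator.reverseOrder());
        for (String key : keys) {
            // First occurrence stores 1; later occurrences call f(current, 1).
            result.merge(key, 1, f);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("k1", "k2", "k3", "k1", "k1", "k2");
        countByKey(keys, Integer::sum)
                .forEach((k, v) -> System.out.println(k + "=====" + v));
        // prints:
        // k3=====1
        // k2=====2
        // k1=====3
    }
}
```

As in the trace above, the merge function runs twice for k1, once for k2, and not at all for k3.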