countByKey
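Counts how many times each key occurs in a pair RDD (an RDD of key-value pairs). The result is a Map whose keys are the original keys and whose values are the number of times each key occurs.
Scala version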
val rdd = sc.parallelize(List(("a",1),("a",2),("b",1),("c",3),("a",5)))
rdd.countByKey
Java version
JavaRDD<String> rdd1 = sc.parallelize(Arrays.asList("a", "a", "b", "b", "c", "a"));
// Pair every element with 1 so that the elements become keys
JavaPairRDD<String, Integer> pairRDD = rdd1.mapToPair(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) throws Exception {
        return new Tuple2<>(s, 1);
    }
});
// countByKey is an action: the counts come back to the driver as a local Map
Map<String, Long> countByKeyResult = pairRDD.countByKey();
Set<Map.Entry<String, Long>> entrySet = countByKeyResult.entrySet();
for (Map.Entry<String, Long> entry : entrySet) {
    System.out.println("(" + entry.getKey() + "," + entry.getValue() + ")");
}
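To make the result of countByKey concrete, here is a minimal plain-Java sketch of the same counting on a local list, using the data from the Scala example above. It is not Spark's implementation; the class `CountByKeySketch`, the helper `countByKey`, and the use of `Map.Entry` in place of `Tuple2` are assumptions made purely for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountByKeySketch {

    // Hypothetical helper: counts how often each key occurs in a local list of pairs.
    // The values of the pairs are ignored, just as countByKey ignores them.
    public static <K, V> Map<K, Long> countByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, Long> counts = new HashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            // Start at 1 for a new key, otherwise add 1 to the existing count
            counts.merge(pair.getKey(), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<String, Integer>("a", 1));
        pairs.add(new SimpleEntry<String, Integer>("a", 2));
        pairs.add(new SimpleEntry<String, Integer>("b", 1));
        pairs.add(new SimpleEntry<String, Integer>("c", 3));
        pairs.add(new SimpleEntry<String, Integer>("a", 5));

        Map<String, Long> counts = countByKey(pairs);
        System.out.println("a occurs " + counts.get("a") + " times"); // a occurs 3 times
    }
}
```

Note that only the keys matter here: the values 1, 2 and 5 attached to "a" play no part in the count.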
countByValue
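Counts how many times each element occurs in the RDD and returns the result as a Map (element -> number of occurrences).
Scala version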
val rdd = sc.parallelize(List(1, 2, 1, 3, 3, 3, 4))
rdd.countByValue
Java version
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 1, 3, 3, 3, 4));
Map<Integer, Long> result = rdd.countByValue();
Set<Integer> keySet = result.keySet();
for (Integer i : keySet) {
    System.out.println(i + " occurrences: " + result.get(i));
}
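The effect of countByValue on the sample data can be sketched in plain Java as grouping equal elements and counting each group. This is an illustration on a local list under stated assumptions, not how Spark computes the result on a cluster; the class `CountByValueSketch` and the helper `countByValue` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class CountByValueSketch {

    // Hypothetical helper: groups equal elements together and counts each group.
    public static <T> Map<T, Long> countByValue(List<T> elements) {
        return elements.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Integer> elements = Arrays.asList(1, 2, 1, 3, 3, 3, 4);
        Map<Integer, Long> counts = countByValue(elements);
        for (Map.Entry<Integer, Long> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + " occurrences: " + entry.getValue());
        }
    }
}
```

For this input the element 3 is counted three times, 1 twice, and 2 and 4 once each.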
collectAsMap
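Converts a pair RDD (an RDD of key-value pairs) into a Map. If a key occurs more than once, the later occurrence overwrites the earlier one, so the resulting map has unique keys and holds, for each key, the value from its last occurrence.
Scala version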
val rdd = sc.parallelize(List(("a",1),("a",5),("b",1),("c",3),("a",3)))
rdd.collectAsMap
Java version
List<Tuple2<String, Integer>> list = new ArrayList<>();
// Use the diamond operator instead of raw Tuple2 types to avoid unchecked warnings
list.add(new Tuple2<>("a", 1));
list.add(new Tuple2<>("a", 3));
list.add(new Tuple2<>("b", 1));
list.add(new Tuple2<>("c", 1));
list.add(new Tuple2<>("a", 5));
list.add(new Tuple2<>("c", 8));
JavaRDD<Tuple2<String, Integer>> rdd1 = sc.parallelize(list);
// Turn the RDD of tuples into a JavaPairRDD so that collectAsMap is available
JavaPairRDD<String, Integer> pairRDD = rdd1.mapToPair(new PairFunction<Tuple2<String, Integer>, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(Tuple2<String, Integer> t) throws Exception {
        return new Tuple2<>(t._1, t._2);
    }
});
Map<String, Integer> collectAsMap = pairRDD.collectAsMap();
Set<Map.Entry<String, Integer>> entrySet = collectAsMap.entrySet();
for (Map.Entry<String, Integer> entry : entrySet) {
    System.out.println("(" + entry.getKey() + "," + entry.getValue() + ")");
}
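The overwrite behaviour for duplicate keys can be sketched in plain Java by putting the pairs into a map one after another, so that each later put replaces the earlier value for the same key. This only models the outcome on a local, ordered list; the class `CollectAsMapSketch`, the helper `collectAsMap`, and the use of `Map.Entry` in place of `Tuple2` are assumptions for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CollectAsMapSketch {

    // Hypothetical helper: inserts the pairs in list order into a map.
    public static <K, V> Map<K, V> collectAsMap(List<Map.Entry<K, V>> pairs) {
        Map<K, V> result = new HashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            // put replaces any value stored earlier under the same key
            result.put(pair.getKey(), pair.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Same data as the Java example above
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<String, Integer>("a", 1));
        pairs.add(new SimpleEntry<String, Integer>("a", 3));
        pairs.add(new SimpleEntry<String, Integer>("b", 1));
        pairs.add(new SimpleEntry<String, Integer>("c", 1));
        pairs.add(new SimpleEntry<String, Integer>("a", 5));
        pairs.add(new SimpleEntry<String, Integer>("c", 8));

        Map<String, Integer> result = collectAsMap(pairs);
        System.out.println(result.get("a")); // 5
        System.out.println(result.get("b")); // 1
        System.out.println(result.get("c")); // 8
    }
}
```

Six input pairs shrink to three map entries, with "a" and "c" keeping only the values from their last occurrences.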