An explanation of JavaPairRDD's collectAsMap method
Official documentation
/**
* Return the key-value pairs in this RDD to the master as a Map.
*
* @note this method should only be used if the resulting data is expected to be small, as
* all the data is loaded into the driver's memory.
*/
Explanation
Returns the key-value pairs in this RDD to the driver as a Map.
Note: this method should only be used when the resulting data is expected to be small, because all of the data is loaded into the driver's memory.
Method signature
//scala
/**
* Return the key-value pairs in this RDD to the master as a Map.
*/
def collectAsMap(): Map[K, V]
//java
public java.util.Map<K,V> collectAsMap()
Example
import com.google.common.collect.Lists;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Map;

public class CollectAsMap {
    public static void main(String[] args) {
        System.setProperty("hadoop.home.dir", "E:\\hadoop-2.7.1");
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("Spark_DEMO");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        JavaPairRDD<String, String> javaPairRDD1 = sc.parallelizePairs(Lists.newArrayList(
                new Tuple2<>("1", "abc11"),
                new Tuple2<>("2", "abc22"),
                new Tuple2<>("3", "33333"),
                new Tuple2<>("3", "mmmmmm")));
        // Bring the whole RDD back to the driver as a Map
        Map<String, String> map = javaPairRDD1.collectAsMap();
        for (Map.Entry<String, String> entry : map.entrySet()) {
            System.out.println(entry.getKey() + "->" + entry.getValue());
        }
        sc.stop();
    }
}
Output
19/03/19 16:16:26 INFO DAGScheduler: Job 0 finished: collectAsMap at CollectAsMap.java:22, took 0.742896 s
19/03/19 16:16:26 INFO SparkContext: Invoking stop() from shutdown hook
2->abc22
1->abc11
3->mmmmmm
19/03/19 16:16:26 INFO SparkUI: Stopped Spark web UI at http://10.124.209.6:4040
19/03/19 16:16:26 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/03/19 16:16:26 INFO MemoryStore: MemoryStore cleared
19/03/19 16:16:26 INFO BlockManager: BlockManager stopped
As the output shows, when a key appears multiple times in the RDD, the later value overwrites the earlier one in the returned map, so each key ends up with a single value (here key 3 keeps "mmmmmm" rather than "33333").
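This overwrite behavior is simply what happens when entries with duplicate keys are inserted into a java.util.Map. A minimal plain-Java sketch (no Spark required) of the same semantics, plus one way to keep every value per key (with an RDD you would call groupByKey() before collecting instead):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateKeyDemo {
    public static void main(String[] args) {
        // Same pairs as the Spark example, including the duplicate key "3"
        List<Map.Entry<String, String>> pairs = List.of(
                Map.entry("1", "abc11"),
                Map.entry("2", "abc22"),
                Map.entry("3", "33333"),
                Map.entry("3", "mmmmmm"));

        // Like collectAsMap: a later value for a duplicate key overwrites the earlier one
        Map<String, String> lastWins = new HashMap<>();
        for (Map.Entry<String, String> e : pairs) {
            lastWins.put(e.getKey(), e.getValue());
        }
        System.out.println(lastWins.get("3")); // mmmmmm

        // To keep all values, accumulate a list per key instead
        Map<String, List<String>> grouped = new HashMap<>();
        for (Map.Entry<String, String> e : pairs) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        System.out.println(grouped.get("3")); // [33333, mmmmmm]
    }
}
```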
Note
When the dataset is large, do not use collect (or collectAsMap): loading everything into the driver's memory can cause an OutOfMemoryError.
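When the full dataset will not fit on the driver, two standard alternatives from Spark's Java RDD API are take(n), which bounds how many elements reach the driver, and toLocalIterator(), which streams the RDD one partition at a time. A sketch only, assuming a JavaPairRDD<String, String> named pairs already exists (the variable name is hypothetical):

```java
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

import java.util.Iterator;
import java.util.List;

public class SafeCollectSketch {
    // Sketch: 'pairs' is an assumed pre-built JavaPairRDD<String, String>
    static void inspect(JavaPairRDD<String, String> pairs) {
        // take(n) returns at most n elements, bounding driver memory use
        List<Tuple2<String, String>> sample = pairs.take(10);
        sample.forEach(t -> System.out.println(t._1() + "->" + t._2()));

        // toLocalIterator() pulls one partition at a time to the driver,
        // so only a single partition must fit in memory at once
        Iterator<Tuple2<String, String>> it = pairs.toLocalIterator();
        while (it.hasNext()) {
            Tuple2<String, String> t = it.next();
            System.out.println(t._1() + "->" + t._2());
        }
    }
}
```

toLocalIterator still triggers a job per partition, so it trades speed for memory; for pure sampling, take(n) is cheaper.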