Spark pair RDD operations

qq_32216775

Article link: https://blog.csdn.net/qq_32216775/article/details/79710261

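This article covers the ways to create and operate on pair RDDs in Spark: creating them with map, mapToPair, and parallelize, and transforming them with filter, mapValues, reduceByKey, countByValue, combineByKey, and others. Python and Java examples illustrate how pair RDDs are used and transformed.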

1. Ways to create a pair RDD

① Python: using the map() function

The example uses the first word of each sentence as the key and the whole sentence as the value:

   >>> line = sc.parallelize(["hello world", "very good", "yes right"])
   >>> # Named pairs rather than map so the built-in map() is not shadowed
   >>> pairs = line.map(lambda s: (s.split(" ")[0], s))
   >>> pairs.collect()
   [('hello', 'hello world'), ('very', 'very good'), ('yes', 'yes right')]

As you can see, the conversion is very convenient in Python: the function simply returns a 2-tuple.
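
Because the function passed to map() is ordinary Python, the key-extraction logic can be checked locally before it runs on a cluster. The following is a minimal sketch under that assumption: the helper name first_word_pair is hypothetical and not part of Spark, and the built-in map stands in for RDD.map so no SparkContext is needed.

```python
def first_word_pair(sentence):
    """Return (first word, whole sentence) as a 2-tuple."""
    # split(" ") matches the lambda in the example; the first token becomes the key.
    return (sentence.split(" ")[0], sentence)


# The built-in map stands in for RDD.map here, so this runs without Spark.
sentences = ["hello world", "very good", "yes right"]
local_pairs = list(map(first_word_pair, sentences))
print(local_pairs)
```

With a live SparkContext, the same function could be passed directly, for example line.map(first_word_pair).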

② Java: using the mapToPair() function, which takes a PairFunction object as its argument

   // Use mapToPair to turn each element into a key-value pair, e.g. (world, 1).
   // words is a JavaRDD<String> created earlier.
   JavaPairRDD<String, Integer> pair = words.mapToPair(new PairFunction<String, String, Integer>() {
       @Override
       public Tuple2<String, Integer> call(String s) throws Exception {
           return new Tuple2<String, Integer>(s, 1);
       }
   });

In Java, a key-value RDD has the type JavaPairRDD<K, V>.
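
For comparison with the Python style used elsewhere in this article, the same (word, 1) pairing can be sketched in plain Python. This is only an illustration: the sample words list is made up, and a list comprehension stands in for the RDD transformation.

```python
# Hypothetical sample data standing in for the words RDD in the Java snippet.
words = ["hello", "world", "hello"]

# Each word becomes a (word, 1) tuple, mirroring the Tuple2 built in call().
word_pairs = [(w, 1) for w in words]
print(word_pairs)
```

In PySpark the equivalent transformation would be a map() whose function returns the tuple, since Python has no separate mapToPair().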

③ Creating a pair RDD from an in-memory collection

In Python and Scala, it is enough to pass a collection of key-value pairs to parallelize().

Python example:

   >>> pair = sc.parallelize([("hello", 1), ("good", 2), ("yes", 3)])
   >>> pair.collect()
   [('hello', 1), ('good', 2), ('yes', 3)]

In Java, creating a pair RDD from memory requires the parallelizePairs() function.

Example:

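   import java.util.ArrayList;
   import java.util.List;
   import org.apache.spark.SparkConf;
   import org.apache.spark.api.java.JavaPairRDD;
   import org.apache.spark.api.java.JavaSparkContext;
   import scala.Tuple2;
   // The class name is arbitrary; it is added only so the method compiles.
   public class Test07 {
       public static void main(String[] args) {
           SparkConf conf = new SparkConf().setMaster("local").setAppName("test07");
           JavaSparkContext sc = new JavaSparkContext(conf);
           // Build the key-value pairs in memory
           List<Tuple2<String, String>> list = new ArrayList<Tuple2<String, String>>();
           list.add(new Tuple2<String, String>("hello", "hello world"));
           list.add(new Tuple2<String, String>("good", "good job"));
           list.add(new Tuple2<String, String>("hi", "hi world"));
           JavaPairRDD<String, String> pair = sc.parallelizePairs(list);
           System.out.println(pair.collect());
           sc.stop();
       }
   }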

2. Pair RDD transformations

① The filter() function also works on key-value RDDs

Python example:

   >>> pair.collect()
   [(1, 2), (2, 3), (1, 4), (3, 4), (3, 5), (5, 5)]
   >>> # kv is the whole (key, value) tuple; kv[1] is the value
   >>> pair2 = pair.filter(lambda kv: kv[1] < 5)
   >>> pair2.collect()
   [(1, 2), (2, 3), (1, 4), (3, 4)]
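
The predicate given to filter() receives the whole (key, value) tuple, so it can test the key, the value, or both. The following is a minimal sketch of that idea in plain Python: the function names are hypothetical, the data is copied from the example above, and the built-in filter stands in for RDD.filter so no SparkContext is required.

```python
# Same sample data as the pair RDD in the example above, held in a plain list.
pairs = [(1, 2), (2, 3), (1, 4), (3, 4), (3, 5), (5, 5)]


def value_below_five(kv):
    """Keep pairs whose value (second element) is less than 5."""
    _key, value = kv
    return value < 5


def key_is_odd(kv):
    """Keep pairs whose key (first element) is odd."""
    key, _value = kv
    return key % 2 == 1


# The built-in filter stands in for RDD.filter here.
by_value = list(filter(value_below_five, pairs))
by_key = list(filter(key_is_odd, pairs))
print(by_value)
print(by_key)
```

Either function could be passed to filter() on a real pair RDD in the same way as the lambda in the example.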

Java example:

   public static void main(String[] args) {
       SparkConf conf = new SparkConf().setMaster("local").setAppName("test08");
       JavaSparkContext sc = new JavaSparkContext(conf);
       // Insert the sample data
       List<Tuple2<Integer, Integer>> list = new ArrayList<Tuple2<Integer, Integer>>();
       list.add(new Tuple2<Integer, Integer>(1, 2));
       list.add(new Tuple2<Integer, Integer>(2, 4));
       list.add(new Tuple2<Integer, Integer>(3, 5));
       JavaPairRDD<Integer, Integer> pair = sc.parallelizePairs(list);
       // Select pairs with filter
       JavaPairRDD<Integer, Integer> filter =