Spark RDD Pair (k,v) Operations

This article covers Key/Value Pair operations on Spark RDDs.

1. Creating a k/v pair RDD
import org.apache.spark.{SparkConf, SparkContext}
// example configuration for running locally; the app name and master are placeholders
val conf = new SparkConf().setAppName("PairRDDExample").setMaster("local[*]");
val sc = new SparkContext(conf);
val strArray = List("this is spark","It is fun!","spark is cool");
val strRDD = sc.parallelize(strArray);
// split each line into words, then key each word by its length
val lenRDD = strRDD.flatMap(l=>l.split(" ")).map(l=>(l.length,l));
lenRDD.collect().foreach(println);

Output

(4,this)
(2,is)
(5,spark)
(2,It)
(2,is)
(4,fun!)
(5,spark)
(2,is)
(4,cool)
2. Key/Value Pair RDD Transformations
| Name | Description |
| --- | --- |
| groupByKey([numTasks]) | Groups all the values of the same key together. For a dataset of (K,V) pairs, the returned RDD has the type (K, Iterable[V]). |
| reduceByKey(func,[numTasks]) | First groups the values with the same key, then applies the specified func to reduce the list of values down to a single value. For a dataset of (K,V) pairs, the returned RDD has the type (K,V). |
| sortByKey([ascending],[numTasks]) | Sorts the rows according to the keys. By default, the keys are sorted in ascending order. |
| join(otherRDD,[numTasks]) | Joins the rows in both RDDs by matching their keys. Each row of the returned RDD contains a tuple whose first element is the key and whose second element is another tuple containing the values from both RDDs. |
Example 1. groupByKey([numTasks])
val sc = new SparkContext(conf);
val strArray = List("this is spark","It is fun!","spark is cool");
val strRDD = sc.parallelize(strArray);
val lenRDD = strRDD.flatMap(l=>l.split(" ")).map(l=>(l.length,l));
lenRDD.groupByKey().collect().foreach(println);
Output
(5,CompactBuffer(spark, spark))
(4,CompactBuffer(this, fun!, cool))
(2,CompactBuffer(is, It, is, is))
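The grouped values can be processed further with mapValues. A minimal sketch, reusing lenRDD from the example above, counts how many words share each length; for a simple aggregation like this, reduceByKey (Example 2) is generally preferred because it combines values on each partition before shuffling.

// a minimal sketch reusing lenRDD from the example above;
// mapValues turns each grouped Iterable into its size, i.e. a count per length
lenRDD.groupByKey().mapValues(words => words.size).collect().foreach(println);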
Example 2. reduceByKey(func, [numTasks])
val sc = new SparkContext(conf);
val strArray = List("this is spark","It is fun!","spark is cool");
val strRDD = sc.parallelize(strArray);
val countRDD = strRDD.flatMap(l=>l.split(" ")).map(l=>(l,1));
countRDD.reduceByKey(_ + _).collect().foreach(println);
Output
(this,1)
(is,3)
(fun!,1)
(cool,1)
(spark,2)
(It,1)
Example 3. sortByKey([ascending],[numTasks])
Ascending order by default; pass false to sort in descending order, as the sketch after this example shows.
val sc = new SparkContext(conf);
val strArray = List("this is spark","It is fun!","spark is cool");
val strRDD = sc.parallelize(strArray);
val countRDD = strRDD.flatMap(l=>l.split(" ")).map(l=>(l,1));
countRDD.reduceByKey(_ + _).map(t=>(t._2,t._1)).sortByKey().collect().foreach(println);
Output
(1,this)
(1,fun!)
(1,cool)
(1,It)
(2,spark)
(3,is)
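The example above uses the default ascending order. A minimal sketch of the descending case, reusing countRDD from above and passing false to sortByKey:

// a minimal sketch reusing countRDD from the example above;
// sortByKey(false) sorts the keys in descending order
countRDD.reduceByKey(_ + _).map(t=>(t._2,t._1)).sortByKey(false).collect().foreach(println);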
Example 4. join(otherRDD)
Joins two RDDs: by joining a dataset of type (K,V) with a dataset of type (K,W), the result is a dataset of type (K,(V,W)).
val sc = new SparkContext(conf);
val parentRDD = sc.parallelize(List((1,"Jason")));
val childRDD = sc.parallelize(List((1,"Tom"),(1,"Mike")));
parentRDD.join(childRDD).map(t => t._2._1 + "-->" + t._2._2).collect().foreach(println);
Output
Jason-->Tom
Jason-->Mike
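Because the example above maps each joined row to a string right away, the (K,(V,W)) structure is not visible in the output. A minimal sketch, reusing parentRDD and childRDD, that prints the raw joined tuples:

// a minimal sketch reusing parentRDD and childRDD from the example above;
// each joined row has the shape (key, (value from parentRDD, value from childRDD))
parentRDD.join(childRDD).collect().foreach(println);
// expected rows: (1,(Jason,Tom)) and (1,(Jason,Mike))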
3. Key/Value Pair RDD Actions
| Name | Description |
| --- | --- |
| countByKey() | Returns a map where each entry contains the key and a count of its values. |
| collectAsMap() | Similar behavior to the collect action, but the return type is a map. |
| lookup(key) | Performs a lookup by key and returns all values that have the specified key. |
Example 1. countByKey()
val sc = new SparkContext(conf);
sc.setLogLevel("ERROR");
sc.parallelize(List(("Jason",1),("Jason",3),("Tom",1),("Tom",9))).
  countByKey().map(t=>{t._1+"-->"+t._2}).foreach(println);
Output
Tom-->2
Jason-->2
Example 2. collectAsMap()
Since a Map holds only one value per key, duplicate keys keep only one of their values, as the output below shows.
val sc = new SparkContext(conf);
sc.setLogLevel("ERROR");
sc.parallelize(List(("Jason",1),("Jason",3),("Tom",1),("Tom",9))).collectAsMap().foreach(println);
Output
(Tom,9)
(Jason,3)
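If one aggregated value per key is wanted instead of an arbitrary surviving one, a minimal sketch would reduce by key before collecting, assuming the same input list:

// a minimal sketch: sum the values per key, then collect the result as a map
sc.parallelize(List(("Jason",1),("Jason",3),("Tom",1),("Tom",9))).
  reduceByKey(_ + _).collectAsMap().foreach(println);
// expected entries: (Jason,4) and (Tom,10)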
Example 3. lookup(key)
val sc = new SparkContext(conf);
sc.setLogLevel("ERROR");
val mapRDD = sc.parallelize(List(("Jason",1),("Jason",3),("Tom",1),("Tom",9))).reduceByKey(_ + _);
mapRDD.lookup("Jason").foreach(println);
mapRDD.lookup("Tom").foreach(println);
Output
4
10
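lookup returns a Seq of values; for a key that is not present, the Seq is empty and nothing is printed. A minimal sketch, reusing mapRDD from the example above with a hypothetical key "Mike" that does not occur in the data:

// a minimal sketch reusing mapRDD from the example above;
// "Mike" is a hypothetical key absent from the data, so lookup returns an empty Seq
mapRDD.lookup("Mike").foreach(println);  // prints nothing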