Spark RDD Pair (k,v) Operations

This article covers Key/Value Pair operations on Spark RDDs.

1. Creating a k/v pair RDD
import org.apache.spark.{SparkConf, SparkContext}
// example configuration for running locally; the app name and master are placeholders
val conf = new SparkConf().setAppName("PairRDDExample").setMaster("local[*]");
val sc = new SparkContext(conf);
val strArray = List("this is spark","It is fun!","spark is cool");
val strRDD = sc.parallelize(strArray);
// split each line into words, then key each word by its length
val lenRDD = strRDD.flatMap(l=>l.split(" ")).map(l=>(l.length,l));
lenRDD.collect().foreach(println);

Output

(4,this)
(2,is)
(5,spark)
(2,It)
(2,is)
(4,fun!)
(5,spark)
(2,is)
(4,cool)
2. Key/Value Pair RDD Transformations
| Name | Description |
| --- | --- |
| groupByKey([numTasks]) | Groups all the values of the same key together. For a dataset of (K,V) pairs, the returned RDD has the type (K, Iterable[V]). |
| reduceByKey(func,[numTasks]) | First groups the values with the same key, then applies the specified func to reduce the list of values down to a single value. For a dataset of (K,V) pairs, the returned RDD has the type (K,V). |
| sortByKey([ascending],[numTasks]) | Sorts the rows according to the keys. By default, the keys are sorted in ascending order. |
| join(otherRDD,[numTasks]) | Joins the rows in both RDDs by matching their keys. Each row of the returned RDD contains a tuple whose first element is the key and whose second element is another tuple containing the values from both RDDs. |
Example 1. groupByKey([numTasks])
val sc = new SparkContext(conf);
val strArray = List("this is spark","It is fun!","spark is cool");
val strRDD = sc.parallelize(strArray);
val lenRDD = strRDD.flatMap(l=>l.split(" ")).map(l=>(l.length,l));
lenRDD.groupByKey().collect().foreach(println);
Output
(5,CompactBuffer(spark, spark))
(4,CompactBuffer(this, fun!, cool))
(2,CompactBuffer(is, It, is, is))
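The grouped values can be processed further with mapValues. A minimal sketch, reusing lenRDD from the example above, counts how many words share each length; for a simple aggregation like this, reduceByKey (Example 2) is generally preferred because it combines values on each partition before shuffling.

// a minimal sketch reusing lenRDD from the example above;
// mapValues turns each grouped Iterable into its size, i.e. a count per length
lenRDD.groupByKey().mapValues(words => words.size).collect().foreach(println);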
Example 2. reduceByKey(func, [numTasks])
val sc = new SparkContext(conf);
val strArray = List("this is spark","It is fun!","spark is cool");
val strRDD = sc.parallelize(strArray);
val countRDD = strRDD.flatMap(l=>l.split(" ")).map(l=>(l,1));
countRDD.reduceByKey(_ + _).collect().foreach(println);
Output
(this,1)
(is,3)
(fun!,1)
(cool,1)
(spark,2)
(It,1)
Example 3. sortByKey([ascending],[numTasks])
Ascending order by default; pass false to sort in descending order, as the sketch after this example shows.
val sc = new SparkContext(conf);
val strArray = List("this is spark","It is fun!","spark is cool");
val strRDD = sc.parallelize(strArray);
val countRDD = strRDD.flatMap(l=>l.split(" ")).map(l=>(l,1));
countRDD.reduceByKey(_ + _).map(t=>(t._2,t._1)).sortByKey().collect().foreach(println);
Output
(1,this)
(1,fun!)
(1,cool)
(1,It)
(2,spark)
(3,is)
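The example above uses the default ascending order. A minimal sketch of the descending case, reusing countRDD from above and passing false to sortByKey:

// a minimal sketch reusing countRDD from the example above;
// sortByKey(false) sorts the keys in descending order
countRDD.reduceByKey(_ + _).map(t=>(t._2,t._1)).sortByKey(false).collect().foreach(println);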
Example 4. join(otherRDD)
Joins two RDDs: by joining a dataset of type (K,V) with a dataset of type (K,W), the result is a dataset of type (K,(V,W)).
val sc = new SparkContext(conf);
val parentRDD = sc.parallelize(List((1,"Jason")));
val childRDD = sc.parallelize(List((1,"Tom"),(1,"Mike")));
parentRDD.join(childRDD).map(t => t._2._1 + "-->" + t._2._2).collect().foreach(println);
Output
Jason-->Tom
Jason-->Mike
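Because the example above maps each joined row to a string right away, the (K,(V,W)) structure is not visible in the output. A minimal sketch, reusing parentRDD and childRDD, that prints the raw joined tuples:

// a minimal sketch reusing parentRDD and childRDD from the example above;
// each joined row has the shape (key, (value from parentRDD, value from childRDD))
parentRDD.join(childRDD).collect().foreach(println);
// expected rows: (1,(Jason,Tom)) and (1,(Jason,Mike))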
3. Key/Value Pair RDD Actions
| Name | Description |
| --- | --- |
| countByKey() | Returns a map where each entry contains the key and a count of its values. |
| collectAsMap() | Similar behavior to the collect action, but the return type is a map. |
| lookup(key) | Performs a lookup by key and returns all values that have the specified key. |
Example 1. countByKey()
val sc = new SparkContext(conf);
sc.setLogLevel("ERROR");
sc.parallelize(List(("Jason",1),("Jason",3),("Tom",1),("Tom",9))).
  countByKey().map(t=>{t._1+"-->"+t._2}).foreach(println);
Output
Tom-->2
Jason-->2
Example 2. collectAsMap()
Since a Map holds only one value per key, duplicate keys keep only one of their values, as the output below shows.
val sc = new SparkContext(conf);
sc.setLogLevel("ERROR");
sc.parallelize(List(("Jason",1),("Jason",3),("Tom",1),("Tom",9))).collectAsMap().foreach(println);
Output
(Tom,9)
(Jason,3)
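If one aggregated value per key is wanted instead of an arbitrary surviving one, a minimal sketch would reduce by key before collecting, assuming the same input list:

// a minimal sketch: sum the values per key, then collect the result as a map
sc.parallelize(List(("Jason",1),("Jason",3),("Tom",1),("Tom",9))).
  reduceByKey(_ + _).collectAsMap().foreach(println);
// expected entries: (Jason,4) and (Tom,10)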
Example 3. lookup(key)
val sc = new SparkContext(conf);
sc.setLogLevel("ERROR");
val mapRDD = sc.parallelize(List(("Jason",1),("Jason",3),("Tom",1),("Tom",9))).reduceByKey(_ + _);
mapRDD.lookup("Jason").foreach(println);
mapRDD.lookup("Tom").foreach(println);
Output
4
10
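lookup returns a Seq of values; for a key that is not present, the Seq is empty and nothing is printed. A minimal sketch, reusing mapRDD from the example above with a hypothetical key "Mike" that does not occur in the data:

// a minimal sketch reusing mapRDD from the example above;
// "Mike" is a hypothetical key absent from the data, so lookup returns an empty Seq
mapRDD.lookup("Mike").foreach(println);  // prints nothing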