val sc = new SparkContext(conf)
val strArray = List("this is spark", "It is fun!", "spark is cool")
val strRDD = sc.parallelize(strArray)
val lenRDD = strRDD.flatMap(l => l.split(" ")).map(l => (l.length, l))
lenRDD.collect().foreach(println)
Groups all the values of the same key together. For a dataset of (K,V) pairs, the returned RDD has the type (K, Iterable[V]).
reduceByKey(func, [numTasks])
First groups the values with the same key, then applies the specified func to reduce each key's list of values down to a single value. For a dataset of (K,V) pairs, the returned RDD has the type (K,V).
sortByKey([ascending], [numTasks])
Sorts the rows according to the keys. By default, the keys are sorted in ascending order.
join(otherRDD, [numTasks])
Joins the rows in both RDDs by matching their keys. Each row of the returned RDD contains a tuple where the first element is the key and the second element is another tuple containing the values from both RDDs.
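To make the semantics of the four transformations concrete, they can be mimicked on plain Scala collections, with no SparkContext needed. This is an illustrative sketch of the behavior, not Spark code:

```scala
// Plain-Scala sketch of the pair-RDD transformation semantics (local lists, no Spark).
object PairOpsSketch {
  val pairs = List((2, "is"), (4, "this"), (5, "spark"), (2, "is"))

  // groupByKey: collect all values of the same key -> Map[Int, List[String]]
  val grouped = pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  // reduceByKey: fold each key's values down to a single value (word counting here)
  val counts = pairs.map { case (_, w) => (w, 1) }
    .groupBy(_._1)
    .map { case (w, kvs) => (w, kvs.map(_._2).sum) }

  // sortByKey: sort rows by key, ascending by default
  val sorted = pairs.sortBy(_._1)

  // join: match (K, V) rows against (K, W) rows to produce (K, (V, W))
  val left   = List((1, "a"), (2, "b"))
  val right  = List((1, "x"), (1, "y"))
  val joined = for ((k, v) <- left; (k2, w) <- right if k == k2) yield (k, (v, w))
}
```

Unlike Spark's lazy, distributed transformations, these local versions evaluate eagerly, but the key/value shapes of the results are the same.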
Example 1. groupByKey([numTasks])
val sc = new SparkContext(conf)
val strArray = List("this is spark", "It is fun!", "spark is cool")
val strRDD = sc.parallelize(strArray)
val lenRDD = strRDD.flatMap(l => l.split(" ")).map(l => (l.length, l))
lenRDD.groupByKey().foreach(println)
Output
(5,CompactBuffer(spark, spark))
(4,CompactBuffer(this, fun!, cool))
(2,CompactBuffer(is, It, is, is))
Example 2. reduceByKey(func, [numTasks])
val sc = new SparkContext(conf)
val strArray = List("this is spark", "It is fun!", "spark is cool")
val strRDD = sc.parallelize(strArray)
val countRDD = strRDD.flatMap(l => l.split(" ")).map(l => (l, 1))
countRDD.reduceByKey((v1, v2) => v1 + v2).collect().foreach(println)
Output
(this,1)
(is,3)
(fun!,1)
(cool,1)
(spark,2)
(It,1)
Example 3. sortByKey([ascending], [numTasks])
Ascending by default; pass false as the first argument for descending order.
val sc = new SparkContext(conf)
val strArray = List("this is spark", "It is fun!", "spark is cool")
val strRDD = sc.parallelize(strArray)
val countRDD = strRDD.flatMap(l => l.split(" ")).map(l => (l, 1))
countRDD.reduceByKey((v1, v2) => v1 + v2).map(t => (t._2, t._1)).sortByKey().collect().foreach(println)
Output
(1,this)
(1,fun!)
(1,cool)
(1,It)
(2,spark)
(3,is)
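The descending variant mentioned above, sortByKey(false), can be sketched on a plain Scala list, with Ordering.Int.reverse playing the role of the false flag. This is a local illustration of the ordering, not Spark code:

```scala
// Local sketch of a descending key sort, mirroring sortByKey(false).
object DescSortSketch {
  val pairs = List((1, "this"), (3, "is"), (2, "spark"))
  // Ordering.Int.reverse inverts the default ascending order on the keys.
  val desc = pairs.sortBy(_._1)(Ordering.Int.reverse)
}
```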
Example 4. join(otherRDD)
Used to connect two RDDs: joining a dataset of type (K,V) with a dataset of type (K,W) produces a dataset of type (K,(V,W)).
val sc = new SparkContext(conf)
val parentRDD = sc.parallelize(List((1, "Jason")))
val childRDD = sc.parallelize(List((1, "Tom"), (1, "Mike")))
parentRDD.join(childRDD).map(t => t._2._1 + "-->" + t._2._2).foreach(println)
Output
Jason-->Tom
Jason-->Mike
3. Key/Value Pair RDD Actions
countByKey()
Returns a map where each entry contains the key and the count of that key's values.
collectAsMap()
Similar behavior to the collect action, but the return type is a map.
lookup(key)
Performs a lookup by key and returns all values that have the specified key.
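The three actions can likewise be mimicked on a plain Scala list of pairs, without a SparkContext. This sketch is only meant to show the shape of each result:

```scala
// Plain-Scala sketch of the pair-RDD action semantics (local list, no Spark).
object PairActionsSketch {
  val pairs = List(("Jason", 1), ("Jason", 3), ("Tom", 1), ("Tom", 9))

  // countByKey: number of rows per key (counts occurrences, not value sums)
  val countByKey = pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.size) }

  // collectAsMap: for a duplicate key, a later value overwrites an earlier one
  val asMap = pairs.toMap

  // lookup(key): every value stored under the given key
  def lookup(key: String): List[Int] = pairs.collect { case (`key`, v) => v }
}
```

The collectAsMap behavior explains why the collectAsMap example below prints only one entry per key, even though the source data holds two rows for each name.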
Example 1. countByKey()
val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
sc.parallelize(List(("Jason", 1), ("Jason", 3), ("Tom", 1), ("Tom", 9)))
  .countByKey()
  .map(t => t._1 + "-->" + t._2)
  .foreach(println)
Output
Tom-->2
Jason-->2
Example 2. collectAsMap()
val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
sc.parallelize(List(("Jason", 1), ("Jason", 3), ("Tom", 1), ("Tom", 9)))
  .collectAsMap()
  .foreach(println)
Output
(Tom,9)
(Jason,3)
Example 3. lookup(key)
val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
val mapRDD = sc.parallelize(List(("Jason", 1), ("Jason", 3), ("Tom", 1), ("Tom", 9)))
  .reduceByKey((v1, v2) => v1 + v2)
mapRDD.lookup("Jason").foreach(println)
mapRDD.lookup("Tom").foreach(println)
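Since reduceByKey sums each key's values here, the two lookups should yield 4 for "Jason" (1 + 3) and 10 for "Tom" (1 + 9). A plain-Scala check of that arithmetic, without Spark:

```scala
// Local verification of the lookup results: sum the values per key.
object LookupSketch {
  val summed = List(("Jason", 1), ("Jason", 3), ("Tom", 1), ("Tom", 9))
    .groupBy(_._1)
    .map { case (k, kvs) => (k, kvs.map(_._2).sum) }
}
```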