RDDs containing key/value pairs are called pair RDDs.
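As a minimal sketch, a pair RDD can be built from a regular RDD by mapping each element to a tuple. The sample lines and the choice of the first word as key are assumptions made for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse an existing context (e.g. in spark-shell) or start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("pair-rdd-sketch"))

// Hypothetical input: a few lines of text.
val lines = sc.parallelize(Seq("spark is fast", "rdds are lazy"))

// Key each line by its first word; the result is an RDD[(String, String)].
val pairs = lines.map(line => (line.split(" ")(0), line))

val firstWords = pairs.keys.collect().toList
```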
Aggregation
combineByKey()
The procedure is as follows.
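In outline, combineByKey() takes three functions: createCombiner is called the first time a key is seen in a partition, mergeValue folds each further value for that key into the partition's accumulator, and mergeCombiners merges accumulators for the same key across partitions. The sketch below uses them to compute a per-key average; the sample data and variable names are assumptions, not taken from the notes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse an existing context (e.g. in spark-shell) or start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("combine-by-key-sketch"))

// Hypothetical scores keyed by name.
val scores = sc.parallelize(Seq(("a", 3.0), ("b", 4.0), ("a", 1.0)))

// Accumulate a (sum, count) pair for every key.
val sumCount = scores.combineByKey(
  (v: Double) => (v, 1),                                              // createCombiner
  (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),        // mergeValue
  (x: (Double, Int), y: (Double, Int)) => (x._1 + y._1, x._2 + y._2)) // mergeCombiners

// Turn each (sum, count) into an average.
val averages = sumCount.mapValues { case (sum, n) => sum / n }.collectAsMap()
```

Here the average for "a" is 2.0 and for "b" is 4.0.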
Tuning the level of parallelism
sc.parallelize(data).reduceByKey((x, y) => x + y, 10).partitions.size
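The one-liner above leaves data undefined. A runnable version might look like the following, where the contents of data are a made-up placeholder. The second argument to reduceByKey() sets the number of partitions of the resulting RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse an existing context (e.g. in spark-shell) or start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("parallelism-sketch"))

// Hypothetical stand-in for `data`.
val data = Seq(("a", 3), ("b", 4), ("a", 1))

// Ask for 10 partitions in the shuffled result.
val reduced = sc.parallelize(data).reduceByKey((x, y) => x + y, 10)

val numPartitions = reduced.partitions.size // 10
```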
Grouping Data
In addition to grouping data from a single RDD, we can group data sharing the same key from multiple RDDs using a function called cogroup(). Calling cogroup() on two RDDs that share the same key type K, with value types V and W respectively, gives us back RDD[(K, (Iterable[V], Iterable[W]))]. If one of the RDDs doesn’t have elements for a given key that is present in the other RDD, the corresponding Iterable is simply empty.
cogroup() is used as a building block for the joins.
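A small sketch of cogroup() on two RDDs, including keys that appear on only one side. The addresses and orders data sets are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse an existing context (e.g. in spark-shell) or start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("cogroup-sketch"))

// Hypothetical data sets keyed by customer id.
val addresses = sc.parallelize(Seq((1, "Oak St"), (2, "Elm St")))
val orders    = sc.parallelize(Seq((1, "book"), (1, "pen"), (3, "lamp")))

// RDD[(Int, (Iterable[String], Iterable[String]))]
val cogrouped = addresses.cogroup(orders)

// Convert the Iterables to sorted lists so the result is easy to inspect.
val grouped = cogrouped
  .mapValues { case (left, right) => (left.toList.sorted, right.toList.sorted) }
  .collectAsMap()
```

Key 1 has values on both sides, key 2 has an empty list of orders, and key 3 has an empty list of addresses.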
Joins
leftOuterJoin()
rightOuterJoin()
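The difference between the join variants is easiest to see on a tiny example. In the sketch below (sample data assumed), the outer joins wrap the side that may be missing in an Option.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse an existing context (e.g. in spark-shell) or start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("join-sketch"))

// Hypothetical inputs: key 2 is the only key present in both.
val left  = sc.parallelize(Seq((1, "a"), (2, "b")))
val right = sc.parallelize(Seq((2, "x"), (3, "y")))

// Inner join: only keys present in both RDDs.
val inner = left.join(right).collectAsMap()

// Left outer join: every key of `left`; the right value is an Option.
val leftOuter = left.leftOuterJoin(right).collectAsMap()

// Right outer join: every key of `right`; the left value is an Option.
val rightOuter = left.rightOuterJoin(right).collectAsMap()
```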
Sorting data
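For example, sortByKey() sorts a pair RDD by key, ascending by default. This is a short sketch with made-up data.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse an existing context (e.g. in spark-shell) or start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("sort-sketch"))

// Hypothetical unsorted pairs.
val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))

// Ascending order is the default.
val ascending = pairs.sortByKey().collect().toList

// Pass ascending = false for descending order.
val descendingKeys = pairs.sortByKey(ascending = false).keys.collect().toList
```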
Actions Available on Pair RDDs
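Three actions specific to pair RDDs are countByKey(), collectAsMap(), and lookup(). The sketch below exercises each on assumed sample data.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse an existing context (e.g. in spark-shell) or start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("pair-actions-sketch"))

// Hypothetical pairs with a repeated key.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Number of elements per key, returned to the driver as a Map[String, Long].
val counts = pairs.countByKey()

// All values associated with one key.
val valuesForA = pairs.lookup("a").sorted

// collectAsMap() keeps a single value per key, so reduce duplicates first.
val totals = pairs.reduceByKey(_ + _).collectAsMap()
```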
Determining an RDD’s Partitioner
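An RDD's partitioner is exposed through its partitioner property, which is an Option. In this sketch the data and the choice of a two-partition HashPartitioner are assumptions for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Reuse an existing context (e.g. in spark-shell) or start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("partitioner-sketch"))

val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))

// A freshly parallelized RDD has no partitioner.
val before = pairs.partitioner // None

// After partitionBy(), the partitioner is known; persist to avoid re-shuffling.
val partitioned = pairs.partitionBy(new HashPartitioner(2)).persist()
val after = partitioned.partitioner // Some(HashPartitioner with 2 partitions)
```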
Operations That Affect Partitioning
Example: PageRank
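The following is a simplified PageRank sketch, not necessarily the exact version the notes refer to. The three-page link graph, the damping factor of 0.85, and the ten iterations are all assumptions. The links RDD is partitioned and persisted once so that the join in each iteration does not reshuffle it.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Reuse an existing context (e.g. in spark-shell) or start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("pagerank-sketch"))

// Hypothetical link graph: page -> pages it links to.
val links = sc.parallelize(Seq(
    ("a", Seq("b", "c")),
    ("b", Seq("c")),
    ("c", Seq("a"))))
  .partitionBy(new HashPartitioner(2))
  .persist()

// Start every page at rank 1.0; mapValues keeps the partitioner of `links`.
var ranks = links.mapValues(_ => 1.0)

for (i <- 1 to 10) {
  // Each page splits its current rank evenly among the pages it links to.
  val contributions = links.join(ranks).flatMap {
    case (_, (dests, rank)) => dests.map(dest => (dest, rank / dests.size))
  }
  // New rank = 0.15 + 0.85 * (sum of received contributions).
  ranks = contributions.reduceByKey(_ + _).mapValues(sum => 0.15 + 0.85 * sum)
}

val finalRanks = ranks.collectAsMap()
```

Because every page in this toy graph has both incoming and outgoing links, the ranks always sum to the number of pages.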
Tip
To maximize the potential for partitioning-related optimizations, you should use mapValues() or flatMapValues() whenever you are not changing an element’s key.
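A quick way to see why: mapValues() keeps the parent's partitioner, while map() discards it because Spark cannot know whether the keys changed. The data and the two-partition HashPartitioner in this sketch are assumptions.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Reuse an existing context (e.g. in spark-shell) or start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("mapvalues-sketch"))

val partitioned = sc.parallelize(Seq((1, "a"), (2, "b")))
  .partitionBy(new HashPartitioner(2))

// Keys cannot change, so the partitioner is preserved.
val viaMapValues = partitioned.mapValues(_.toUpperCase)

// Same result data, but the partitioner is lost.
val viaMap = partitioned.map { case (k, v) => (k, v.toUpperCase) }

val kept = viaMapValues.partitioner // Some(HashPartitioner)
val lost = viaMap.partitioner       // None
```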