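RDD operations:
Common transformation operators:
The map operator
Applies a given function to every element of the RDD to produce a new RDD.
Every element of the source RDD maps to exactly one element of the new RDD.
Input partitions correspond one-to-one to output partitions.
// map turns an ordinary RDD into a pair RDD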
val a=sc.parallelize(List("dog","tiger","lion","cat","panther","eagle"))
val b=a.map(x=>(x,1))
b.collect
// Multiply every element of the source RDD by 2 to produce a new RDD
val a=sc.parallelize(1 to 9)
val b=a.map(x=>x*2)
a.collect
b.collect
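The filter operator
Filters the elements: the given predicate is applied to each element, and the elements for which it returns true are kept in the new RDD.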
val a=sc.parallelize(1 to 10)
a.filter(_%2==0).collect
a.filter(_<4).collect
// map & filter
val rdd=sc.parallelize(1 to 6) // pass the range itself; List(1 to 6) would be a one-element List[Range] and _*2 would not compile
val mapRdd=rdd.map(_*2)
mapRdd.collect
val filterRdd=mapRdd.filter(_>5)
filterRdd.collect
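The mapValues operator
Keeps each key of the source RDD unchanged and pairs it with the new value to form the elements of the new RDD. Applies only to pair RDDs.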
val a=sc.parallelize(List("dog","tiger","lion","cat","panther","eagle"))
val b=a.map(x=>(x.length,x))
b.mapValues("x"+_+"x").collect
Output:
Array((3,xdogx), (5,xtigerx), (4,xlionx), (3,xcatx), (7,xpantherx), (5,xeaglex))
More common transformation operators
distinct, reduceByKey, groupByKey, sortByKey, union, join
...
// distinct
val dis = sc.parallelize(List(1,2,3,4,5,6,7,8,9,9,2,6))
dis.distinct.collect
dis.distinct(2).partitions.length
// reduceByKey, groupByKey
val a = sc.parallelize(List("dog", "salmon", "pig"), 3)
val f = a.map(x=>(x.length,x))
f.reduceByKey((a,b)=>(a+b)).collect
f.reduceByKey(_+_).collect
f.groupByKey.collect
//sortByKey
val a = sc.parallelize(List("dog", "salmon", "pig"), 3)
val f = a.map(x=>(x.length,x))
f.sortByKey().collect
f.sortByKey(false).collect
//union
val u1 = sc.parallelize(1 to 3)
val u2 = sc.parallelize(3 to 4)
u1.union(u2).collect
(u1 ++ u2).collect
u1.intersection(u2).collect
// join, leftOuterJoin, rightOuterJoin
val j1 = sc.parallelize(List("abe", "abby", "apple")).map(a => (a, 1))
val j2 = sc.parallelize(List("apple", "beatty", "beatrice")).map(a => (a, 1))
j1.join(j2).collect
j1.leftOuterJoin(j2).collect
j1.rightOuterJoin(j2).collect
Common action operators
count
Returns the number of elements in the dataset.
val rdd=sc.parallelize(List(1,2,3,4,5,6))
rdd.count
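collect
Returns all elements of the RDD as an Array. Typically used after filtering, or whenever the result is small enough to fit in the driver's memory.
Note that all the transformation examples above were paired with the collect action to compute and print their output.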
val rdd=sc.parallelize(List(1,2,3,4,5,6))
rdd.collect
take
Returns the first n elements.
val rdd=sc.parallelize(List(1,2,3,4,5,6))
rdd.take(3)
first
Returns the first element of the RDD.
val rdd=sc.parallelize(List(1,2,3,4,5,6))
rdd.first
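reduce
Combines the elements of the RDD pairwise with the given function and returns the result. The function should be associative and commutative.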
val a=sc.parallelize(1 to 100)
a.reduce((x,y)=>x+y)
a.reduce(_+_) // equivalent to the line above
val b=sc.parallelize(Array(("A",0), ("A",2), ("B",1), ("B",2), ("C",1)))
b.reduce((x,y)=>{(x._1+y._1,x._2+y._2)}) // e.g. (AABBC,6); the order of the concatenated keys depends on partitioning
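The dependence on partitioning can be made concrete with a minimal sketch in plain Scala (no Spark required). `ReduceSketch` and `reducePartitioned` are hypothetical names introduced here; the nested `Seq` is only a stand-in for an RDD's partitions, not Spark's implementation.

```scala
object ReduceSketch {
  // Reduce each non-empty partition on its own, then reduce the partial results.
  def reducePartitioned[T](partitions: Seq[Seq[T]], f: (T, T) => T): T =
    partitions.filter(_.nonEmpty).map(_.reduce(f)).reduce(f)
}
```

With addition, the split `Seq(Seq(1, 2), Seq(3, 4))` and the split `Seq(Seq(1), Seq(2, 3, 4))` both give 10. With subtraction, which is neither associative nor commutative, the same two splits give 0 and 6, which is why reduce expects an associative and commutative function.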
foreach
Applies the given function to every element of the RDD; returns nothing.
val rdd=sc.parallelize(1 to 100)
rdd.foreach(println)
lookup
For pair RDDs: returns all values V associated with the key K.
val rdd=sc.parallelize(List(('a',1), ('a',2), ('b',3), ('c',4)))
rdd.lookup('a') // output: WrappedArray(1, 2)
Extremes: max, min
Return the maximum and the minimum value.
val y=sc.parallelize(10 to 30)
y.max // maximum
y.min // minimum
saveAsTextFile
Saves the RDD's data to a file system.
val rdd=sc.parallelize(1 to 10,2)
rdd.saveAsTextFile("hdfs://hadoop000:8020/data/rddsave/")
Classroom code:
Add the dependency to the pom.xml configuration file:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.1</version>
</dependency>
Configure the log level:
Maven:org.apache.spark:spark-core_2.11:2.1.1
->spark-core_2.11-2.1.1.jar -> org.apache.spark ->log4j-defaults.properties
Copy this file into the resources directory and rename it to log4j.properties
// Change the log level
log4j.rootCategory=ERROR, console
MapDemo:
map:
// The map operation creates a MapPartitionsRDD
/**
* Return a new RDD by applying a function to all elements of this RDD.
*/
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
//MapPartitionsRDD
/**
* An RDD that applies the provided function to every partition of the parent RDD.
*/
private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
var prev: RDD[T],
f: (TaskContext, Int, Iterator[T]) => Iterator[U], // (TaskContext, partition index, iterator)
preservesPartitioning: Boolean = false)
extends RDD[U](prev) {
override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None
override def getPartitions: Array[Partition] = firstParent[T].partitions
override def compute(split: Partition, context: TaskContext): Iterator[U] =
f(context, split.index, firstParent[T].iterator(split, context))
override def clearDependencies() {
super.clearDependencies()
prev = null
}
}
// The iter.map method
/** Creates a new iterator that maps all produced values of this iterator
* to new values using a transformation function.
*
* @param f the transformation function
* @return a new iterator which transforms every value produced by this
* iterator by applying the function `f` to it.
* @note Reuse: $consumesAndProducesIterator
*/
def map[B](f: A => B): Iterator[B] = new AbstractIterator[B] {
def hasNext = self.hasNext
def next() = f(self.next())
}
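Because `Iterator.map` only wraps `hasNext` and `next()`, the function runs when an element is pulled, not when `map` is called. A small plain-Scala sketch of that laziness follows; `LazyIteratorMap` and `demo` are hypothetical names used only for illustration.

```scala
import scala.collection.mutable.ListBuffer

object LazyIteratorMap {
  // Returns (elements seen before consuming, elements seen after one next(), all mapped values).
  def demo(): (List[Int], List[Int], List[Int]) = {
    val seen = ListBuffer.empty[Int]
    // Record every element the mapping function is actually applied to.
    val mapped = Iterator(1, 2, 3).map { x => seen += x; x * 2 }
    val beforeConsume = seen.toList // map has been called, but f has not run yet
    val first = mapped.next()       // applies f to the first element only
    val afterOne = seen.toList
    val rest = mapped.toList        // drains the remaining elements
    (beforeConsume, afterOne, first :: rest)
  }
}
```

`LazyIteratorMap.demo()` returns `(List(), List(1), List(2, 4, 6))`. This element-at-a-time evaluation is what lets a chain of narrow transformations run over a partition's iterator in a single pass.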
map exercise 1:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object MapDemo {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[1]").setAppName("mapdemo")
    val sc: SparkContext = SparkContext.getOrCreate(conf)
    val rdd1: RDD[Int] = sc.parallelize(1 to 9 ,3)
    val rdd2: RDD[Int] = rdd1.map(_*2)
    rdd2.collect().foreach(println) // output: 2 4 6 8 10 12 14 16 18
  }
}
val rdd1: RDD[Int] = sc.parallelize(1 to 9 ,3)
// SparkContext's parallelize method
def parallelize[T: ClassTag](
seq: Seq[T],
numSlices: Int = defaultParallelism): RDD[T] = withScope {
assertNotStopped()
new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}
// First parameter: a Seq; second parameter: numSlices, the number of partitions (defaults to defaultParallelism)
val seq: Seq[Int] = 1 to 9
// The Seq type
type Seq[+A] = scala.collection.Seq[A]
val Seq = scala.collection.Seq
// What 1 to 9 actually is
class Inclusive(start: Int, end: Int, step: Int) extends Range(start, end, step) {
override def par = new ParRange(this)
override def isInclusive = true
override protected def copy(start: Int, end: Int, step: Int): Range = new Inclusive(start, end, step)
}
// The ParallelCollectionRDD object that gets created
private[spark] class ParallelCollectionRDD[T: ClassTag](
sc: SparkContext,
@transient private val data: Seq[T],
numSlices: Int,
locationPrefs: Map[Int, Seq[String]])
extends RDD[T](sc, Nil) {
// TODO: Right now, each split sends along its full data, even if later down the RDD chain it gets
// cached. It might be worthwhile to write the data to a file in the DFS and read it in the split
// instead.
// UPDATE: A parallel collection can be checkpointed to HDFS, which achieves this goal.
override def getPartitions: Array[Partition] = {
val slices = ParallelCollectionRDD.slice(data, numSlices).toArray
slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray
}
override def compute(s: Partition, context: TaskContext): Iterator[T] = {
new InterruptibleIterator(context, s.asInstanceOf[ParallelCollectionPartition[T]].iterator)
}
override def getPreferredLocations(s: Partition): Seq[String] = {
locationPrefs.getOrElse(s.index, Nil)
}
}
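`getPartitions` relies on `ParallelCollectionRDD.slice`, whose body is not shown above. The following plain-Scala sketch shows one way a sequence can be cut into `numSlices` contiguous pieces. The index arithmetic is an assumption modeled on my recollection of Spark's helper, and `SliceSketch`, `positions` and `slice` are hypothetical names, so treat it as an illustration rather than the actual source.

```scala
object SliceSketch {
  // Start (inclusive) and end (exclusive) index of each slice.
  def positions(length: Long, numSlices: Int): Seq[(Int, Int)] =
    (0 until numSlices).map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }

  // Cut the sequence into contiguous slices, one per partition.
  def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] =
    positions(seq.length, numSlices).map { case (s, e) => seq.slice(s, e) }
}
```

Under this arithmetic, `SliceSketch.slice(1 to 9, 3)` yields the slices 1..3, 4..6 and 7..9, and `SliceSketch.slice(1 to 10, 3)` puts the extra element in the last slice (7..10).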
map exercise 2:
val strRdd1: RDD[String] = sc.parallelize(List("kb02","kb05","kb07","kb09","spark","study"),2)
val strRdd2: RDD[(String, Int)] = strRdd1.map(x=>(x,1))
strRdd2.collect().foreach(println) //(kb02,1) (kb05,1) (kb07,1) (kb09,1) (spark,1) (study,1)
filter:
// The filter operation creates a MapPartitionsRDD object
/**
* Return a new RDD containing only the elements that satisfy a predicate.
*/
def filter(f: T => Boolean): RDD[T] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[T, T](
this,
// Difference from map: map passes (context, pid, iter) => iter.map(cleanF)
(context, pid, iter) => iter.filter(cleanF),
preservesPartitioning = true)
}
// The iter.filter method
/** Returns an iterator over all the elements of this iterator that satisfy the predicate `p`.
* The order of the elements is preserved.
*
* @param p the predicate used to test values.
* @return an iterator which produces those values of this iterator which satisfy the predicate `p`.
* @note Reuse: $consumesAndProducesIterator
*/
def filter(p: A => Boolean): Iterator[A] = new AbstractIterator[A] {
// TODO 2.12 - Make a full-fledged FilterImpl that will reverse sense of p
private var hd: A = _
private var hdDefined: Boolean = false
def hasNext: Boolean = hdDefined || {
do {
if (!self.hasNext) return false
hd = self.next()
} while (!p(hd))
hdDefined = true
true
}
def next() = if (hasNext) { hdDefined = false; hd } else empty.next()
}
filter exercise 1:
val filterRdd1: RDD[Int] = sc.makeRDD(List(1,2,3,4,5,6,7,8,9,10),3)
val filterRdd2: RDD[Int] = filterRdd1.filter(_%2==0)
filterRdd2.collect.foreach(println) // output: 2 4 6 8 10
mapValues:
// mapValues creates a MapPartitionsRDD object
/**
* Pass each value in the key-value pair RDD through a map function without changing the keys;
* this also retains the original RDD's partitioning.
*/
def mapValues[U](f: V => U): RDD[(K, U)] = self.withScope {
// Difference from map: map uses val cleanF = sc.clean(f)
val cleanF = self.context.clean(f)
// Difference from map, which uses new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF)): the key is kept unchanged and only the value is transformed
new MapPartitionsRDD[(K, U), (K, V)](self,
(context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
preservesPartitioning = true)
}
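Both filter and mapValues pass `preservesPartitioning = true`, while map leaves it at the default `false`. A plain-Scala sketch of why this is safe for the first two and not for map follows. `PartitioningSketch` and its methods are hypothetical, and `partitionOf` is only a stand-in for a hash partitioner (non-negative hash modulo the partition count).

```scala
object PartitioningSketch {
  // Hypothetical stand-in for a hash partitioner.
  def partitionOf(key: Int, n: Int): Int = ((key.hashCode % n) + n) % n

  // Place each pair in the partition chosen by its key.
  def partitionByKey(data: Seq[(Int, String)], n: Int): Vector[Vector[(Int, String)]] =
    Vector.tabulate(n)(p => data.filter { case (k, _) => partitionOf(k, n) == p }.toVector)

  // True when every pair still sits in the partition its key hashes to.
  def wellPartitioned(parts: Vector[Vector[(Int, String)]]): Boolean =
    parts.zipWithIndex.forall { case (part, idx) =>
      part.forall { case (k, _) => partitionOf(k, parts.size) == idx }
    }
}
```

Starting from `partitionByKey(Seq((1, "a"), (2, "b"), (3, "c"), (4, "d")), 2)`, filtering pairs or rewriting only the values inside each partition keeps `wellPartitioned` true, because no key changes. Mapping each key to `k + 1` makes it false, so after a general map the old partitioner can no longer be trusted.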
map and mapValues exercise:
val mapValuesRdd1: RDD[String] = sc.parallelize(List("tiger","dog","lion","eagle","panda"))
val mapValuesRdd2: RDD[(Int, String)] = mapValuesRdd1.map(x=>(x.length,x))
mapValuesRdd2.collect.foreach(println) // output: (5,tiger) (3,dog) (4,lion) (5,eagle) (5,panda)
val mapValuesRdd3: RDD[(Int, String)] = mapValuesRdd2.mapValues(x=>"_"+x+"_")
mapValuesRdd3.collect.foreach(println) // output: (5,_tiger_) (3,_dog_) (4,_lion_) (5,_eagle_) (5,_panda_)
reduceByKey
// reduceByKey(func) delegates to the overload reduceByKey(defaultPartitioner(self), func), which takes an additional partitioner argument
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
* parallelism level.
*/
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
reduceByKey(defaultPartitioner(self), func)
}
// That overload of reduceByKey calls combineByKeyWithClassTag
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce.
*/
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
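// combineByKeyWithClassTag validates its arguments and builds an Aggregator. If the RDD is already partitioned by the requested partitioner, it aggregates inside each partition via mapPartitions, wrapping aggregator.combineValuesByKey(...) in an InterruptibleIterator; otherwise it creates a ShuffledRDD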
/**
* :: Experimental ::
* Generic function to combine the elements for each key using a custom set of aggregation
* functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
*
* Users provide three functions:
*
* - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
* - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
* - `mergeCombiners`, to combine two C's into a single one.
*
* In addition, users can control the partitioning of the output RDD, and whether to perform
* map-side aggregation (if a mapper can produce multiple items with the same key).
*
* @note V and C can be different -- for example, one might group an RDD of type
* (Int, Int) into an RDD of type (Int, Seq[Int]).
*/
@Experimental
def combineByKeyWithClassTag[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
partitioner: Partitioner,
mapSideCombine: Boolean = true,
serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
if (keyClass.isArray) {
if (mapSideCombine) {
throw new SparkException("Cannot use map-side combining with array keys.")
}
if (partitioner.isInstanceOf[HashPartitioner]) {
throw new SparkException("HashPartitioner cannot partition array keys.")
}
}
val aggregator = new Aggregator[K, V, C](
self.context.clean(createCombiner),
self.context.clean(mergeValue),
self.context.clean(mergeCombiners))
if (self.partitioner == Some(partitioner)) {
self.mapPartitions(iter => {
val context = TaskContext.get()
new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
}, preservesPartitioning = true)
} else {
new ShuffledRDD[K, V, C](self, partitioner)
.setSerializer(serializer)
.setAggregator(aggregator)
.setMapSideCombine(mapSideCombine)
}
}
//combineValuesByKey
def combineValuesByKey(
iter: Iterator[_ <: Product2[K, V]],
context: TaskContext): Iterator[(K, C)] = {
val combiners = new ExternalAppendOnlyMap[K, V, C](createCombiner, mergeValue, mergeCombiners)
combiners.insertAll(iter)
updateMetrics(context, combiners)
combiners.iterator
}
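The roles of `createCombiner`, `mergeValue` and `mergeCombiners` can be sketched with plain Scala collections. This is a model of the semantics only, not Spark's implementation: `CombineByKeySketch` and its methods are hypothetical, the nested `Seq` stands in for partitions, and an immutable `Map` replaces `ExternalAppendOnlyMap`.

```scala
object CombineByKeySketch {
  // Within one partition: createCombiner on the first value of a key, mergeValue afterwards.
  def combinePartition[K, V, C](
      part: Seq[(K, V)],
      createCombiner: V => C,
      mergeValue: (C, V) => C): Map[K, C] =
    part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(c => mergeValue(c, v)).getOrElse(createCombiner(v)))
    }

  // Across partitions: merge the per-partition combiners of each key with mergeCombiners.
  def combineByKey[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] =
    partitions
      .map(p => combinePartition(p, createCombiner, mergeValue))
      .foldLeft(Map.empty[K, C]) { (acc, m) =>
        m.foldLeft(acc) { case (a, (k, c)) =>
          a.updated(k, a.get(k).map(prev => mergeCombiners(prev, c)).getOrElse(c))
        }
      }

  // reduceByKey is the special case C = V with func used for both merge steps.
  def reduceByKey[K, V](partitions: Seq[Seq[(K, V)]], func: (V, V) => V): Map[K, V] =
    combineByKey[K, V, V](partitions, (v: V) => v, func, func)
}
```

For the partitions `Seq(Seq((5, "tiger"), (3, "dog")), Seq((4, "lion"), (5, "eagle"), (5, "panda")))`, `reduceByKey` with string concatenation gives `Map(5 -> "tigereaglepanda", 3 -> "dog", 4 -> "lion")`. Using `Vector(_)`, `:+` and `++` as the three functions gives a groupByKey-style result instead.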
//shuffledRDD
/**
* :: DeveloperApi ::
* The resulting RDD from a shuffle (e.g. repartitioning of data).
* @param prev the parent RDD.
* @param part the partitioner used to partition the RDD
* @tparam K the key class.
* @tparam V the value class.
* @tparam C the combiner class.
*/
// TODO: Make this return RDD[Product2[K, C]] or have some way to configure mutable pairs
@DeveloperApi
class ShuffledRDD[K: ClassTag, V: ClassTag, C: ClassTag](
@transient var prev: RDD[_ <: Product2[K, V]],
part: Partitioner)
extends RDD[(K, C)](prev.context, Nil) {
private var userSpecifiedSerializer: Option[Serializer] = None
private var keyOrdering: Option[Ordering[K]] = None
private var aggregator: Option[Aggregator[K, V, C]] = None
private var mapSideCombine: Boolean = false
/** Set a serializer for this RDD's shuffle, or null to use the default (spark.serializer) */
def setSerializer(serializer: Serializer): ShuffledRDD[K, V, C] = {
this.userSpecifiedSerializer = Option(serializer)
this
}
/** Set key ordering for RDD's shuffle. */
def setKeyOrdering(keyOrdering: Ordering[K]): ShuffledRDD[K, V, C] = {
this.keyOrdering = Option(keyOrdering)
this
}
/** Set aggregator for RDD's shuffle. */
def setAggregator(aggregator: Aggregator[K, V, C]): ShuffledRDD[K, V, C] = {
this.aggregator = Option(aggregator)
this
}
/** Set mapSideCombine flag for RDD's shuffle. */
def setMapSideCombine(mapSideCombine: Boolean): ShuffledRDD[K, V, C] = {
this.mapSideCombine = mapSideCombine
this
}
override def getDependencies: Seq[Dependency[_]] = {
val serializer = userSpecifiedSerializer.getOrElse {
val serializerManager = SparkEnv.get.serializerManager
if (mapSideCombine) {
serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[C]])
} else {
serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[V]])
}
}
List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))
}
override val partitioner = Some(part)
override def getPartitions: Array[Partition] = {
Array.tabulate[Partition](part.numPartitions)(i => new ShuffledRDDPartition(i))
}
override protected def getPreferredLocations(partition: Partition): Seq[String] = {
val tracker = SparkEnv.get.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster]
val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
tracker.getPreferredLocationsForShuffle(dep, partition.index)
}
override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
.read()
.asInstanceOf[Iterator[(K, C)]]
}
override def clearDependencies() {
super.clearDependencies()
prev = null
}
}
reduceByKey exercise:
val mapValuesRdd1: RDD[String] = sc.parallelize(List("tiger","dog","lion","eagle","panda"))
val mapValuesRdd2: RDD[(Int, String)] = mapValuesRdd1.map(x=>(x.length,x))
val reduceByKeyRdd1: RDD[(Int, String)] = mapValuesRdd2.reduceByKey((a,b)=>a+b)
reduceByKeyRdd1.collect.foreach(println) // e.g. (3,dog) (4,lion) (5,tigereaglepanda); key order and concatenation order may vary with partitioning
groupByKey:
/**
* Group the values for each key in the RDD into a single sequence. Hash-partitions the
* resulting RDD with the existing partitioner/parallelism level. The ordering of elements
* within each group is not guaranteed, and may even differ each time the resulting RDD is
* evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
*/
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
groupByKey(defaultPartitioner(self))
}
// This calls groupByKey(partitioner: Partitioner), which calls combineByKeyWithClassTag and casts the result with bufs.asInstanceOf[RDD[(K, Iterable[V])]], returning an RDD[(K, Iterable[V])]
/**
* Group the values for each key in the RDD into a single sequence. Allows controlling the
* partitioning of the resulting key-value pair RDD by passing a Partitioner.
* The ordering of elements within each group is not guaranteed, and may even differ
* each time the resulting RDD is evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
*
* @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
* key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
*/
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
// groupByKey shouldn't use map side combine because map side combine does not
// reduce the amount of data shuffled and requires all map side data be inserted
// into a hash table, leading to more objects in the old gen.
val createCombiner = (v: V) => CompactBuffer(v)
val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
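The scaladoc above recommends reduceByKey over groupByKey for aggregations. A rough plain-Scala sketch of the reason is to count the records that would have to cross the shuffle in each case. `ShuffleVolumeSketch` and its methods are hypothetical, and the counting model (one combined record per key per partition versus every record) is a simplifying assumption.

```scala
object ShuffleVolumeSketch {
  // With map-side combine, each partition sends one combined record per distinct key.
  def withMapSideCombine[K, V](partitions: Seq[Seq[(K, V)]]): Int =
    partitions.map(_.map(_._1).distinct.size).sum

  // Without it (as in groupByKey), every record is sent as-is.
  def withoutMapSideCombine[K, V](partitions: Seq[Seq[(K, V)]]): Int =
    partitions.map(_.size).sum
}
```

For two partitions holding `("a",1), ("a",1), ("b",1)` and `("a",1), ("b",1), ("b",1)`, the first count is 4 and the second is 6. The gap grows with the number of repeated keys per partition.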
groupByKey exercise:
val mapValuesRdd1: RDD[String] = sc.parallelize(List("tiger","dog","lion","eagle","panda"))
val mapValuesRdd2: RDD[(Int, String)] = mapValuesRdd1.map(x=>(x.length,x))
val groupByKeyRdd: RDD[(Int, Iterable[String])] = mapValuesRdd2.groupByKey()
groupByKeyRdd.collect.foreach(println) //(3,CompactBuffer(dog)) (4,CompactBuffer(lion)) (5,CompactBuffer(tiger, eagle, panda))
sortByKey
/**
* Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
* `collect` or `save` on the resulting RDD will return or output an ordered list of records
* (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
* order of the keys).
*/
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
: RDD[(K, V)] = self.withScope
{
val part = new RangePartitioner(numPartitions, self, ascending)
new ShuffledRDD[K, V, V](self, part)
.setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
// Creates a ShuffledRDD and sets the ordering with setKeyOrdering(if (ascending) ordering else ordering.reverse); ascending defaults to true, and false uses ordering.reverse
/** Return the opposite ordering of this one. */
override def reverse: Ordering[T] = new Ordering[T] {
override def reverse = outer
def compare(x: T, y: T) = outer.compare(y, x)
}
val mapValuesRdd1: RDD[String] = sc.parallelize(List("tiger","dog","lion","eagle","panda"))
val mapValuesRdd2: RDD[(Int, String)] = mapValuesRdd1.map(x=>(x.length,x))
val sortByKeyRdd: RDD[(Int, String)] = mapValuesRdd2.sortByKey()
sortByKeyRdd.collect.foreach(println) //(3,dog) (4,lion) (5,tiger) (5,eagle) (5,panda)
val sortByKeyRdd2: RDD[(Int, String)] = mapValuesRdd2.sortByKey(false)
sortByKeyRdd2.collect.foreach(println) //(5,tiger) (5,eagle) (5,panda) (4,lion) (3,dog)
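The `if (ascending) ordering else ordering.reverse` choice can be tried on a local collection. In this plain-Scala sketch `SortByKeySketch` is a hypothetical name, and a single in-memory stable sort stands in for Spark's range partitioning and shuffle, so the order of equal keys here is not a guarantee about Spark.

```scala
object SortByKeySketch {
  // Sort pairs by key, flipping the implicit Ordering when ascending is false.
  def sortByKey[K, V](data: Seq[(K, V)], ascending: Boolean = true)(
      implicit ordering: Ordering[K]): Seq[(K, V)] = {
    val ord = if (ascending) ordering else ordering.reverse
    data.sortBy(_._1)(ord)
  }
}
```

Calling `SortByKeySketch.sortByKey(pairs)` on the (length, word) pairs from the exercise puts `(3,dog)` first, and passing `ascending = false` puts the three length-5 words first.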
++, union and intersection
/**
* Return the union of this RDD and another one. Any identical elements will appear multiple
* times (use `.distinct()` to eliminate them).
*/
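def union(other: RDD[T]): RDD[T] = withScope {
sc.union(this, other)
}
def ++(other: RDD[T]): RDD[T] = withScope {
this.union(other)
}
// Both delegate to SparkContext.union: if every RDD has a partitioner defined and they all share the same one, it creates a PartitionerAwareUnionRDD; otherwise it creates a UnionRDD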
/** Build the union of a list of RDDs. */
def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = withScope {
val partitioners = rdds.flatMap(_.partitioner).toSet
if (rdds.forall(_.partitioner.isDefined) && partitioners.size == 1) {
new PartitionerAwareUnionRDD(this, rdds)
} else {
new UnionRDD(this, rdds)
}
}
// intersection: returns the elements common to both RDDs
/**
* Return the intersection of this RDD and another one. The output will not contain any duplicate
* elements, even if the input RDDs did.
*
* @note This method performs a shuffle internally.
*/
def intersection(other: RDD[T]): RDD[T] = withScope {
this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
.filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
.keys
}
Exercise:
val u1=sc.parallelize(1 to 3)
val u2=sc.parallelize(3 to 4)
println("----------union------------------")
u1.union(u2).collect.foreach(println) //1 2 3 3 4 (duplicates are kept)
println("------------- ++ ---------------")
(u1++u2).collect.foreach(println) //1 2 3 3 4
println("---------- intersection ------------------")
u1.intersection(u2).collect.foreach(println) // 3
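The difference in duplicate handling between union and intersection can be sketched with plain Scala sequences. `SetOpsSketch` is a hypothetical name; the intersection here imitates the cogroup-then-filter idea (keep a key only if both sides contain it) rather than reproducing Spark's code.

```scala
object SetOpsSketch {
  // union simply concatenates, so duplicates survive.
  def union[T](a: Seq[T], b: Seq[T]): Seq[T] = a ++ b

  // intersection keeps each distinct element that occurs on both sides, exactly once.
  def intersection[T](a: Seq[T], b: Seq[T]): Seq[T] = {
    val keys = (a ++ b).distinct
    keys.filter(k => a.contains(k) && b.contains(k))
  }
}
```

`SetOpsSketch.union(1 to 3, 3 to 4)` contains 3 twice, while `SetOpsSketch.intersection(Seq(1, 2, 3, 3), Seq(3, 3, 4))` contains it once.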
join, leftOuterJoin and rightOuterJoin
/**
* Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
* pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
* (k, v2) is in `other`. Performs a hash join across the cluster.
*/
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = self.withScope {
join(other, defaultPartitioner(self, other))
}
//join[W](other: RDD[(K, W)], partitioner: Partitioner)
/**
* Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
* pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
* (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
*/
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
this.cogroup(other, partitioner).flatMapValues( pair =>
for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
)
}
//cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)方法
/**
* For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
* list of values for that key in `this` as well as `other`.
*/
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
: RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
throw new SparkException("HashPartitioner cannot partition array keys.")
}
val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
cg.mapValues { case Array(vs, w1s) =>
(vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
}
}
//leftOuterJoin
/**
* Perform a left outer join of `this` and `other`. For each element (k, v) in `this`, the
* resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`, or the
* pair (k, (v, None)) if no elements in `other` have key k. Hash-partitions the output
* using the existing partitioner/parallelism level.
*/
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))] = self.withScope {
leftOuterJoin(other, defaultPartitioner(self, other))
}
//leftOuterJoin(other, defaultPartitioner(self, other))
def leftOuterJoin[W](
other: RDD[(K, W)],
partitioner: Partitioner): RDD[(K, (V, Option[W]))] = self.withScope {
this.cogroup(other, partitioner).flatMapValues { pair =>
// Unlike join, this also handles keys that have no match on the right side
if (pair._2.isEmpty) {
pair._1.iterator.map(v => (v, None))
} else {
for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, Some(w))
}
}
}
val j1=sc.parallelize(List("abe","abby","apple")).map(a=>(a,2))
val j2: RDD[(String, Int)] = sc.parallelize(List("apple","beatty","beatrice")).map(a=>(a,1))
println("----------join------------------")
j1.join(j2).collect.foreach(println) //(apple,(2,1))
println("----------leftjoin------------------")
j1.leftOuterJoin(j2).collect.foreach(println) //(abby,(2,None)) (apple,(2,Some(1))) (abe,(2,None))
println("----------rightjoin------------------")
j1.rightOuterJoin(j2).collect.foreach(println) //(apple,(Some(2),1)) (beatty,(None,1)) (beatrice,(None,1))
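How the three joins are derived from cogroup can be sketched with plain Scala sequences. `JoinSketch` and its methods are hypothetical; the sketch keeps keys in first-seen order for readability, whereas Spark's output order depends on partitioning, and `rightOuterJoin` is written by symmetry with the `leftOuterJoin` source shown above.

```scala
object JoinSketch {
  // For every key on either side, collect that key's values from both sides.
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (Seq[V], Seq[W]))] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      (k, (left.filter(_._1 == k).map(_._2), right.filter(_._1 == k).map(_._2)))
    }
  }

  // Inner join: the cross product of both value lists, empty if either side is empty.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    cogroup(left, right).flatMap { case (k, (vs, ws)) =>
      for (v <- vs; w <- ws) yield (k, (v, w))
    }

  // Left outer join: keep left values even when the right side has none.
  def leftOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    cogroup(left, right).flatMap { case (k, (vs, ws)) =>
      if (ws.isEmpty) vs.map(v => (k, (v, Option.empty[W])))
      else for (v <- vs; w <- ws) yield (k, (v, Option(w)))
    }

  // Right outer join: the mirror image of leftOuterJoin.
  def rightOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (Option[V], W))] =
    cogroup(left, right).flatMap { case (k, (vs, ws)) =>
      if (vs.isEmpty) ws.map(w => (k, (Option.empty[V], w)))
      else for (v <- vs; w <- ws) yield (k, (Option(v), w))
    }
}
```

Applied to the same word lists as the exercise (values 2 on the left, 1 on the right), `join` yields only `(apple,(2,1))`, `leftOuterJoin` adds `(abe,(2,None))` and `(abby,(2,None))`, and `rightOuterJoin` adds `(beatty,(None,1))` and `(beatrice,(None,1))`.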