2020.11.4 Class Notes (Common RDD Operators)

RDD operations:


Common transformation operators:

map operator

Applies the given function to every element of the source RDD to produce a new RDD.
Every element of the source RDD has exactly one corresponding element in the new RDD.
Input partitions correspond one-to-one to output partitions.

//map turns a plain RDD into a PairRDD
val a=sc.parallelize(List("dog","tiger","lion","cat","panther","eagle"))
val b=a.map(x=>(x,1))
b.collect
//multiply every element of the source RDD by 2 to produce a new RDD
val a=sc.parallelize(1 to 9)
val b=a.map(x=>x*2)
a.collect
b.collect

filter operator

Filters elements: the given function is applied to each element, and only the elements for which it returns true are kept in the new RDD.

val a=sc.parallelize(1 to 10)
a.filter(_%2==0).collect      
a.filter(_<4).collect	
//map&filter
val rdd=sc.parallelize(1 to 6)
val mapRdd=rdd.map(_*2)
mapRdd.collect
val filterRdd=mapRdd.filter(_>5)
filterRdd.collect

mapValues operator

The keys of the source RDD stay unchanged; each key is paired with the new value to form an element of the new RDD. Applies only to PairRDDs.

val a=sc.parallelize(List("dog","tiger","lion","cat","panther","eagle"))
val b=a.map(x=>(x.length,x))
b.mapValues("x"+_+"x").collect

Output:

Array((3,xdogx), (5,xtigerx), (4,xlionx), (3,xcatx), (7,xpantherx), (5,xeaglex))

More common transformation operators

distinct、reduceByKey、groupByKey、sortByKey、union、join、count
……

//distinct
val dis = sc.parallelize(List(1,2,3,4,5,6,7,8,9,9,2,6))
dis.distinct.collect
dis.distinct(2).partitions.length
//reduceByKey、groupByKey
val a = sc.parallelize(List("dog", "salmon", "pig"), 3)
val f = a.map(x=>(x.length,x))
f.reduceByKey((a,b)=>(a+b)).collect
f.reduceByKey(_+_).collect
f.groupByKey.collect
//sortByKey
val a = sc.parallelize(List("dog", "salmon", "pig"), 3)
val f = a.map(x=>(x.length,x))
f.sortByKey().collect
f.sortByKey(false).collect
//union
val u1 = sc.parallelize(1 to 3)
val u2 = sc.parallelize(3 to 4)
u1.union(u2).collect
(u1 ++ u2).collect
u1.intersection(u2).collect
//join、leftOuterJoin、rightOuterJoin
val j1 = sc.parallelize(List("abe", "abby", "apple")).map(a => (a, 1))
val j2 = sc.parallelize(List("apple", "beatty", "beatrice")).map(a => (a, 1))
j1.join(j2).collect
j1.leftOuterJoin(j2).collect
j1.rightOuterJoin(j2).collect

Common action operators

count

Returns the number of elements in the dataset.

val rdd=sc.parallelize(List(1,2,3,4,5,6))
rdd.count

collect

Returns all elements of the RDD as an Array. Generally used after filtering, or when the result is small enough to fit on the driver.
Note that all of the transformation examples above were combined with the collect action to trigger computation and show the output.

val rdd=sc.parallelize(List(1,2,3,4,5,6))
rdd.collect

take

Returns the first n elements.

val rdd=sc.parallelize(List(1,2,3,4,5,6))
rdd.take(3)

first

Returns the first element of the RDD.

val rdd=sc.parallelize(List(1,2,3,4,5,6))
rdd.first

reduce

Combines the elements of the RDD pairwise with the given function and returns the result.

val a=sc.parallelize(1 to 100)
a.reduce((x,y)=>x+y)
a.reduce(_+_)		//equivalent to the line above
val b=sc.parallelize(Array(("A",0), ("A",2), ("B",1), ("B",2), ("C",1)))
b.reduce((x,y)=>{(x._1+y._1,x._2+y._2)})		//(AABBC,6); the key concatenation order may vary

foreach

Applies the given function to every element of the RDD; returns nothing.

val rdd=sc.parallelize(1 to 100)
rdd.foreach(println)

lookup

For PairRDDs only: returns all values associated with the key K.

val rdd=sc.parallelize(List(('a',1), ('a',2), ('b',3), ('c',4)))
rdd.lookup('a')		//output: WrappedArray(1, 2)

Extremes: max and min

Return the maximum and minimum value, respectively.

val y=sc.parallelize(10 to 30)
y.max	//maximum value
y.min	//minimum value

saveAsTextFile

Saves the RDD's data to a file system.

val rdd=sc.parallelize(1 to 10,2)
rdd.saveAsTextFile("hdfs://hadoop000:8020/data/rddsave/")
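
A quick check (a sketch, assuming the HDFS path above is reachable, not part of the lesson): the saved part files can be read back with sc.textFile.

//assumed follow-up: read the saved part files back as an RDD of strings
val saved = sc.textFile("hdfs://hadoop000:8020/data/rddsave/")
saved.collect   //the numbers 1 to 10 as strings, one line per element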

Classroom code:

Add the dependency to the pom.xml configuration file:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.1</version>
</dependency>

Configure the log level:
Maven: org.apache.spark:spark-core_2.11:2.1.1
-> spark-core_2.11-2.1.1.jar -> org.apache.spark -> log4j-defaults.properties
Copy this file into the resources directory and rename it to log4j.properties.

# change the log level
log4j.rootCategory=ERROR, console

MapDemo:

map:

//the map operation creates a MapPartitionsRDD
/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
//MapPartitionsRDD
/**
 * An RDD that applies the provided function to every partition of the parent RDD.
 */
private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false)
  extends RDD[U](prev) {

  override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(context, split.index, firstParent[T].iterator(split, context))

  override def clearDependencies() {
    super.clearDependencies()
    prev = null
  }
}
//the iterator's map method
/** Creates a new iterator that maps all produced values of this iterator
 *  to new values using a transformation function.
 *
 *  @param f  the transformation function
 *  @return a new iterator which transforms every value produced by this
 *          iterator by applying the function `f` to it.
 *  @note   Reuse: $consumesAndProducesIterator
 */
def map[B](f: A => B): Iterator[B] = new AbstractIterator[B] {
  def hasNext = self.hasNext
  def next() = f(self.next())
}
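
Because map only wraps the parent's iterator, as shown above, nothing is computed until an action pulls elements through the chain. A minimal sketch of this laziness (assumed, spark-shell style, not from the lesson):

//assumed sketch: the println side effect runs only when collect forces the iterators
val lazyRdd = sc.parallelize(1 to 3).map { x => println("computing " + x); x * 2 }
//nothing has been printed yet: map merely built a MapPartitionsRDD
lazyRdd.collect   //now the printlns run (in the same console in local mode) and Array(2, 4, 6) comes back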

map exercise 1:

object MapDemo {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[1]").setAppName("mapdemo")
    val sc: SparkContext = SparkContext.getOrCreate(conf)
    val rdd1: RDD[Int] = sc.parallelize(1 to 9 ,3)
    val rdd2: RDD[Int] = rdd1.map(_*2)
    rdd2.collect().foreach(println)  //output: 2 4 6 8 10 12 14 16 18
  }
}

val rdd1: RDD[Int] = sc.parallelize(1 to 9 ,3)

//SparkContext's parallelize method
def parallelize[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  assertNotStopped()
  new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}
//first parameter: a Seq; second parameter: numSlices, which defaults to defaultParallelism (the parallelism level)
//note: in parallelize(1 to 9, 3) the Range itself is the Seq[Int]; wrapping it as below would produce a Seq[Range.Inclusive] instead
val seq: Seq[Range.Inclusive] = Seq(1 to 9)
//the Seq type
type Seq[+A] = scala.collection.Seq[A]
val Seq = scala.collection.Seq
//what 1 to 9 actually is: a Range.Inclusive
class Inclusive(start: Int, end: Int, step: Int) extends Range(start, end, step) {
    override def par = new ParRange(this)
  override def isInclusive = true
  override protected def copy(start: Int, end: Int, step: Int): Range = new Inclusive(start, end, step)
}
//the newly created ParallelCollectionRDD
private[spark] class ParallelCollectionRDD[T: ClassTag](
    sc: SparkContext,
    @transient private val data: Seq[T],
    numSlices: Int,
    locationPrefs: Map[Int, Seq[String]])
    extends RDD[T](sc, Nil) {
  // TODO: Right now, each split sends along its full data, even if later down the RDD chain it gets
  // cached. It might be worthwhile to write the data to a file in the DFS and read it in the split
  // instead.
  // UPDATE: A parallel collection can be checkpointed to HDFS, which achieves this goal.

  override def getPartitions: Array[Partition] = {
    val slices = ParallelCollectionRDD.slice(data, numSlices).toArray
    slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray
  }

  override def compute(s: Partition, context: TaskContext): Iterator[T] = {
    new InterruptibleIterator(context, s.asInstanceOf[ParallelCollectionPartition[T]].iterator)
  }

  override def getPreferredLocations(s: Partition): Seq[String] = {
    locationPrefs.getOrElse(s.index, Nil)
  }
}
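
getPartitions above calls ParallelCollectionRDD.slice to cut the input Seq into numSlices pieces. A minimal sketch (assumed, not from the lesson) that makes the slices visible with glom:

//assumed sketch: glom gathers each partition into an Array, exposing how 1 to 9 was sliced into 3 partitions
val partsRdd = sc.parallelize(1 to 9, 3)
partsRdd.partitions.length   //3
partsRdd.glom.collect        //Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9))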

map exercise 2:

val strRdd1: RDD[String] = sc.parallelize(List("kb02","kb05","kb07","kb09","spark","study"),2)
val strRdd2: RDD[(String, Int)] = strRdd1.map(x=>(x,1))
strRdd2.collect().foreach(println)  //(kb02,1) (kb05,1) (kb07,1) (kb09,1) (spark,1) (study,1)

filter:

//the filter operation also creates a MapPartitionsRDD
/**
 * Return a new RDD containing only the elements that satisfy a predicate.
 */
def filter(f: T => Boolean): RDD[T] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[T, T](
    this,
    //difference from map: map passes (context, pid, iter) => iter.map(cleanF)
    (context, pid, iter) => iter.filter(cleanF),
    preservesPartitioning = true)
}
//the iterator's filter method
/** Returns an iterator over all the elements of this iterator that satisfy the predicate `p`.
 *  The order of the elements is preserved.
 *
 *  @param p the predicate used to test values.
 *  @return  an iterator which produces those values of this iterator which satisfy the predicate `p`.
 *  @note    Reuse: $consumesAndProducesIterator
 */
def filter(p: A => Boolean): Iterator[A] = new AbstractIterator[A] {
  // TODO 2.12 - Make a full-fledged FilterImpl that will reverse sense of p
  private var hd: A = _
  private var hdDefined: Boolean = false
  def hasNext: Boolean = hdDefined || {
    do {
      if (!self.hasNext) return false
      hd = self.next()
    } while (!p(hd))
    hdDefined = true
    true
  }
  def next() = if (hasNext) { hdDefined = false; hd } else empty.next()
}

filter exercise 1:

val filterRdd1: RDD[Int] = sc.makeRDD(List(1,2,3,4,5,6,7,8,9,10),3)
val filterRdd2: RDD[Int] = filterRdd1.filter(_%2==0)
filterRdd2.collect.foreach(println)  //output: 2 4 6 8 10

mapValues:

//mapValues also creates a MapPartitionsRDD
/**
 * Pass each value in the key-value pair RDD through a map function without changing the keys;
 * this also retains the original RDD's partitioning.
 */
def mapValues[U](f: V => U): RDD[(K, U)] = self.withScope {
  //difference from map: map uses val cleanF = sc.clean(f)
  val cleanF = self.context.clean(f)
  //difference from map: map builds new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF)); here the key stays unchanged and only the value is transformed
  new MapPartitionsRDD[(K, U), (K, V)](self,
    (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
    preservesPartitioning = true)
}
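
The preservesPartitioning = true flag has a visible effect: a sketch (assumed, not from the lesson) showing that mapValues keeps the parent's partitioner while a plain map drops it.

import org.apache.spark.HashPartitioner
//assumed sketch: mapValues retains the HashPartitioner, map does not
val pairs = sc.parallelize(List((1, "a"), (2, "b"), (3, "c"))).partitionBy(new HashPartitioner(2))
pairs.mapValues(_.toUpperCase).partitioner                    //Some(org.apache.spark.HashPartitioner@...)
pairs.map { case (k, v) => (k, v.toUpperCase) }.partitioner   //None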

map and mapValues exercise:

val mapValuesRdd1: RDD[String] = sc.parallelize(List("tiger","dog","lion","eagle","panda"))
val mapValuesRdd2: RDD[(Int, String)] = mapValuesRdd1.map(x=>(x.length,x))
mapValuesRdd2.collect.foreach(println)    //output: (5,tiger) (3,dog) (4,lion) (5,eagle) (5,panda)
val mapValuesRdd3: RDD[(Int, String)] = mapValuesRdd2.mapValues(x=>"_"+x+"_")
mapValuesRdd3.collect.foreach(println)    //output: (5,_tiger_) (3,_dog_) (4,_lion_) (5,_eagle_) (5,_panda_)

reduceByKey

//the single-argument reduceByKey delegates to the overload reduceByKey(defaultPartitioner(self), func)
/**
 * Merge the values for each key using an associative and commutative reduce function. This will
 * also perform the merging locally on each mapper before sending results to a reducer, similarly
 * to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
 * parallelism level.
 */
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  reduceByKey(defaultPartitioner(self), func)
}

//reduceByKey in turn calls combineByKeyWithClassTag
/**
 * Merge the values for each key using an associative and commutative reduce function. This will
 * also perform the merging locally on each mapper before sending results to a reducer, similarly
 * to a "combiner" in MapReduce.
 */
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}

//after a series of checks, combineByKeyWithClassTag either wraps aggregator.combineValuesByKey() in an InterruptibleIterator (when the RDD is already partitioned by the given partitioner) or creates a new ShuffledRDD
/**
 * :: Experimental ::
 * Generic function to combine the elements for each key using a custom set of aggregation
 * functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
 *
 * Users provide three functions:
 *
 *  - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
 *  - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
 *  - `mergeCombiners`, to combine two C's into a single one.
 *
 * In addition, users can control the partitioning of the output RDD, and whether to perform
 * map-side aggregation (if a mapper can produce multiple items with the same key).
 *
 * @note V and C can be different -- for example, one might group an RDD of type
 * (Int, Int) into an RDD of type (Int, Seq[Int]).
 */
@Experimental
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}
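
The three user functions documented above (createCombiner, mergeValue, mergeCombiners) are exactly what the public combineByKey API takes; a minimal sketch (assumed, not part of the lesson) that uses them to compute a per-key average:

//assumed sketch: per-key average via combineByKey, which delegates to combineByKeyWithClassTag
val scores = sc.parallelize(List(("A", 1.0), ("A", 3.0), ("B", 2.0)))
val sumCount = scores.combineByKey(
  (v: Double) => (v, 1),                                              //createCombiner: first value seen for a key
  (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),        //mergeValue: fold another value into the running (sum, count)
  (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2)) //mergeCombiners: merge partial results across partitions
sumCount.mapValues { case (sum, n) => sum / n }.collect   //Array((A,2.0), (B,2.0)), order may vary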

//combineValuesByKey
def combineValuesByKey(
    iter: Iterator[_ <: Product2[K, V]],
    context: TaskContext): Iterator[(K, C)] = {
  val combiners = new ExternalAppendOnlyMap[K, V, C](createCombiner, mergeValue, mergeCombiners)
  combiners.insertAll(iter)
  updateMetrics(context, combiners)
  combiners.iterator
}

//ShuffledRDD
/**
 * :: DeveloperApi ::
 * The resulting RDD from a shuffle (e.g. repartitioning of data).
 * @param prev the parent RDD.
 * @param part the partitioner used to partition the RDD
 * @tparam K the key class.
 * @tparam V the value class.
 * @tparam C the combiner class.
 */
// TODO: Make this return RDD[Product2[K, C]] or have some way to configure mutable pairs
@DeveloperApi
class ShuffledRDD[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient var prev: RDD[_ <: Product2[K, V]],
    part: Partitioner)
  extends RDD[(K, C)](prev.context, Nil) {

  private var userSpecifiedSerializer: Option[Serializer] = None

  private var keyOrdering: Option[Ordering[K]] = None

  private var aggregator: Option[Aggregator[K, V, C]] = None

  private var mapSideCombine: Boolean = false

  /** Set a serializer for this RDD's shuffle, or null to use the default (spark.serializer) */
  def setSerializer(serializer: Serializer): ShuffledRDD[K, V, C] = {
    this.userSpecifiedSerializer = Option(serializer)
    this
  }

  /** Set key ordering for RDD's shuffle. */
  def setKeyOrdering(keyOrdering: Ordering[K]): ShuffledRDD[K, V, C] = {
    this.keyOrdering = Option(keyOrdering)
    this
  }

  /** Set aggregator for RDD's shuffle. */
  def setAggregator(aggregator: Aggregator[K, V, C]): ShuffledRDD[K, V, C] = {
    this.aggregator = Option(aggregator)
    this
  }

  /** Set mapSideCombine flag for RDD's shuffle. */
  def setMapSideCombine(mapSideCombine: Boolean): ShuffledRDD[K, V, C] = {
    this.mapSideCombine = mapSideCombine
    this
  }

  override def getDependencies: Seq[Dependency[_]] = {
    val serializer = userSpecifiedSerializer.getOrElse {
      val serializerManager = SparkEnv.get.serializerManager
      if (mapSideCombine) {
        serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[C]])
      } else {
        serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[V]])
      }
    }
    List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))
  }

  override val partitioner = Some(part)

  override def getPartitions: Array[Partition] = {
    Array.tabulate[Partition](part.numPartitions)(i => new ShuffledRDDPartition(i))
  }

  override protected def getPreferredLocations(partition: Partition): Seq[String] = {
    val tracker = SparkEnv.get.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster]
    val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
    tracker.getPreferredLocationsForShuffle(dep, partition.index)
  }

  override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
    val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
    SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
      .read()
      .asInstanceOf[Iterator[(K, C)]]
  }

  override def clearDependencies() {
    super.clearDependencies()
    prev = null
  }
}

reduceByKey exercise:

val mapValuesRdd1: RDD[String] = sc.parallelize(List("tiger","dog","lion","eagle","panda"))
val mapValuesRdd2: RDD[(Int, String)] = mapValuesRdd1.map(x=>(x.length,x))
val reduceByKeyRdd1: RDD[(Int, String)] = mapValuesRdd2.reduceByKey((a,b)=>a+b)
reduceByKeyRdd1.collect.foreach(println)    //(3,dog) (4,lion) (5,tigereaglepanda)

groupByKey:

/**
 * Group the values for each key in the RDD into a single sequence. Hash-partitions the
 * resulting RDD with the existing partitioner/parallelism level. The ordering of elements
 * within each group is not guaranteed, and may even differ each time the resulting RDD is
 * evaluated.
 *
 * @note This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
 * or `PairRDDFunctions.reduceByKey` will provide much better performance.
 */
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(defaultPartitioner(self))
}

//delegates to groupByKey(partitioner: Partitioner), which calls combineByKeyWithClassTag (with mapSideCombine = false) and casts the result via bufs.asInstanceOf[RDD[(K, Iterable[V])]] to return an RDD[(K, Iterable[V])]
/**
 * Group the values for each key in the RDD into a single sequence. Allows controlling the
 * partitioning of the resulting key-value pair RDD by passing a Partitioner.
 * The ordering of elements within each group is not guaranteed, and may even differ
 * each time the resulting RDD is evaluated.
 *
 * @note This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
 * or `PairRDDFunctions.reduceByKey` will provide much better performance.
 *
 * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
 * key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
 */
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}

groupByKey exercise:

val mapValuesRdd1: RDD[String] = sc.parallelize(List("tiger","dog","lion","eagle","panda"))
val mapValuesRdd2: RDD[(Int, String)] = mapValuesRdd1.map(x=>(x.length,x))
val groupByKeyRdd: RDD[(Int, Iterable[String])] = mapValuesRdd2.groupByKey()
groupByKeyRdd.collect.foreach(println)    //(3,CompactBuffer(dog)) (4,CompactBuffer(lion)) (5,CompactBuffer(tiger, eagle, panda))

sortByKey

/**
 * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
 * `collect` or `save` on the resulting RDD will return or output an ordered list of records
 * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
 * order of the keys).
 */
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)] = self.withScope
{
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}

//creates a ShuffledRDD and sets the sort order with setKeyOrdering(if (ascending) ordering else ordering.reverse); ascending defaults to true, and passing false switches to ordering.reverse
/** Return the opposite ordering of this one. */
override def reverse: Ordering[T] = new Ordering[T] {
  override def reverse = outer
  def compare(x: T, y: T) = outer.compare(y, x)
}

sortByKey exercise:

val mapValuesRdd1: RDD[String] = sc.parallelize(List("tiger","dog","lion","eagle","panda"))
val mapValuesRdd2: RDD[(Int, String)] = mapValuesRdd1.map(x=>(x.length,x))
val sortByKeyRdd: RDD[(Int, String)] = mapValuesRdd2.sortByKey()
sortByKeyRdd.collect.foreach(println)    //(3,dog) (4,lion) (5,tiger) (5,eagle) (5,panda)
val sortByKeyRdd2: RDD[(Int, String)] = mapValuesRdd2.sortByKey(false)
sortByKeyRdd2.collect.foreach(println)    //(5,tiger) (5,eagle) (5,panda) (4,lion) (3,dog)

++, union, and intersection

/**
 * Return the union of this RDD and another one. Any identical elements will appear multiple
 * times (use `.distinct()` to eliminate them).
 */
def union(other: RDD[T]): RDD[T] = withScope {
  sc.union(this, other)
}
def ++(other: RDD[T]): RDD[T] = withScope {
  this.union(other)
}

//this calls SparkContext.union: if every input RDD has a partitioner defined and they all share the same one, it builds a PartitionerAwareUnionRDD; otherwise it builds a UnionRDD
/** Build the union of a list of RDDs. */
def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = withScope {
  val partitioners = rdds.flatMap(_.partitioner).toSet
  if (rdds.forall(_.partitioner.isDefined) && partitioners.size == 1) {
    new PartitionerAwareUnionRDD(this, rdds)
  } else {
    new UnionRDD(this, rdds)
  }
}

//intersection returns the elements common to both RDDs
/**
 * Return the intersection of this RDD and another one. The output will not contain any duplicate
 * elements, even if the input RDDs did.
 *
 * @note This method performs a shuffle internally.
 */
def intersection(other: RDD[T]): RDD[T] = withScope {
  this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
      .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
      .keys
}

Exercise:

val u1=sc.parallelize(1 to 3)
val u2=sc.parallelize(3 to 4)
println("----------union------------------")
u1.union(u2).collect.foreach(println)   //1 2 3 4
println("------------- ++ ---------------")
(u1++u2).collect.foreach(println)    //1 2 3 4
println("---------- intersection ------------------")
u1.intersection(u2).collect.foreach(println)    // 3

join, leftOuterJoin, and rightOuterJoin

/**
 * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
 * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
 * (k, v2) is in `other`. Performs a hash join across the cluster.
 */
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = self.withScope {
  join(other, defaultPartitioner(self, other))
}

//join[W](other: RDD[(K, W)], partitioner: Partitioner)
/**
 * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
 * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
 * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
 */
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues( pair =>
    for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
  )
}

//the cogroup[W](other: RDD[(K, W)], partitioner: Partitioner) method
/**
 * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
 * list of values for that key in `this` as well as `other`.
 */
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
  cg.mapValues { case Array(vs, w1s) =>
    (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
  }
}
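
join is therefore just a flatMapValues over cogroup's output; a minimal sketch (assumed, not from the lesson) of calling cogroup directly:

//assumed sketch: cogroup keeps, for every key, the values from both sides as two Iterables
val c1 = sc.parallelize(List(("a", 1), ("a", 2), ("b", 3)))
val c2 = sc.parallelize(List(("a", 9), ("c", 4)))
c1.cogroup(c2).collect
//Array((a,(CompactBuffer(1, 2),CompactBuffer(9))),
//      (b,(CompactBuffer(3),CompactBuffer())),
//      (c,(CompactBuffer(),CompactBuffer(4))))   order may vary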

//leftOuterJoin
/**
 * Perform a left outer join of `this` and `other`. For each element (k, v) in `this`, the
 * resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`, or the
 * pair (k, (v, None)) if no elements in `other` have key k. Hash-partitions the output
 * using the existing partitioner/parallelism level.
 */
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))] = self.withScope {
  leftOuterJoin(other, defaultPartitioner(self, other))
}

//leftOuterJoin(other, defaultPartitioner(self, other))
 def leftOuterJoin[W](
     other: RDD[(K, W)],
     partitioner: Partitioner): RDD[(K, (V, Option[W]))] = self.withScope {
   this.cogroup(other, partitioner).flatMapValues { pair =>
     //compared with join, this adds a check for an empty right-hand group
     if (pair._2.isEmpty) {
       pair._1.iterator.map(v => (v, None))
     } else {
       for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, Some(w))
     }
   }
 }

Exercise:

val j1=sc.parallelize(List("abe","abby","apple")).map(a=>(a,2))
val j2: RDD[(String, Int)] = sc.parallelize(List("apple","beatty","beatrice")).map(a=>(a,1))
println("----------join------------------")
j1.join(j2).collect.foreach(println)    //(apple,(2,1))
println("----------leftjoin------------------")
j1.leftOuterJoin(j2).collect.foreach(println)    //(abby,(2,None)) (apple,(2,Some(1))) (abe,(2,None))
println("----------rightjoin------------------")
j1.rightOuterJoin(j2).collect.foreach(println)    //(apple,(Some(2),1)) (beatty,(None,1)) (beatrice,(None,1))