7.2 Getting Started with RDDs

1. RDD Overview

/**
 * A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
 * partitioned collection of elements that can be operated on in parallel. This class contains the
 * basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
 * [[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
 * pairs, such as `groupByKey` and `join`;
 *
 * Internally, each RDD is characterized by five main properties:
 *
 *  - A list of partitions  			(an RDD is made up of a set of partitions)
 *  - A function for computing each split	  (one function is applied to each partition)
 *  - A list of dependencies on other RDDs		(each new RDD depends on its predecessors: R1 ----> R2 ----> R3)
 *  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
 * 			(a partitioner can be applied to key-value RDDs)
 *  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
 *    an HDFS file)
 * 			(tasks are scheduled to the best locations for the data)
 */


1 A list of partitions
	protected def getPartitions: Array[Partition]
2 A function for computing each partition
	def compute(split: Partition, context: TaskContext): Iterator[T]
3 A list of dependencies on other RDDs: R1 ----> R2 ----> R3
	protected def getDependencies: Seq[Dependency[_]] = deps
4 Optionally, a Partitioner for key-value RDDs
	val partitioner: Option[Partitioner] = None
5 Optionally, a list of preferred locations for computing each partition
	protected def getPreferredLocations(split: Partition): Seq[String] = Nil  // Nil is the empty List
			case object Nil extends List[Nothing]
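To make these five properties concrete, here is a rough sketch of a toy RDD that overrides the two required members. It is only an illustration under the assumption of a small Seq held on the driver, not how Spark's built-in RDDs are implemented:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{Partition, SparkContext, TaskContext}

    // Illustrative only: a toy RDD distributing a driver-side Seq[Int] across `slices` partitions
    class ListRDD(sc: SparkContext, data: Seq[Int], slices: Int)
      extends RDD[Int](sc, Nil) {                  // Nil: no parent dependencies (property 3)

      // Property 1: the list of partitions
      override protected def getPartitions: Array[Partition] =
        (0 until slices).map(i => new Partition { override def index: Int = i }).toArray

      // Property 2: a function computing one partition
      override def compute(split: Partition, context: TaskContext): Iterator[Int] =
        data.zipWithIndex.collect { case (v, i) if i % slices == split.index => v }.iterator

      // Properties 4 and 5 keep the defaults: no partitioner, no preferred locations
    }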

2. Creating RDDs

There are several ways to create an RDD in Spark; the two most commonly used are covered here:

2.1 Creating an RDD from a collection (in memory)

An RDD can be created from a collection with either of two methods: parallelize and makeRDD.

  1. parallelize
    val rdd1 = sc.parallelize(List(1, 2, 3, 4))
    rdd1.collect().foreach(println)

  2. makeRDD
    val rdd2 = sc.makeRDD(List(1, 2, 3, 4))
    rdd2.collect().foreach(println)

Looking at the underlying implementation, makeRDD simply calls parallelize:

  def makeRDD[T: ClassTag](
      seq: Seq[T],
      numSlices: Int = defaultParallelism): RDD[T] = withScope {
    parallelize(seq, numSlices)
  }
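Both methods also take an optional numSlices argument that sets the number of partitions; a small sketch, reusing the sc from above:

    val rdd3 = sc.makeRDD(List(1, 2, 3, 4), numSlices = 2)   // explicitly request 2 partitions
    println(rdd3.getNumPartitions)                            // 2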

2.2 Creating an RDD from external storage (files)

Sources include the local file system and HDFS:

    sc.textFile("hdfs://ifeng:9000/hdfsapi/wc.txt")
        .flatMap(_.split(","))
        .map((_,1))
        .reduceByKey(_+_).collect()
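textFile can also read from the local file system; a sketch assuming a hypothetical local file /tmp/wc.txt exists:

    sc.textFile("file:///tmp/wc.txt", minPartitions = 2)   // ask for at least 2 partitions
      .flatMap(_.split(","))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()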

3. RDD Operations

RDD operations fall into two categories:

  1. transformations

  2. actions

The Spark programming guide describes RDD operations as follows:
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
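A small sketch of this lazy behaviour and of caching, reusing the sc from above:

    val lines = sc.parallelize(Seq("a b", "b c"))
    val words = lines.flatMap(_.split(" "))   // transformation: nothing is computed yet
    words.cache()                             // mark for in-memory persistence, still lazy
    val total = words.count()                 // action: triggers the actual computation
    val unique = words.distinct().count()     // reuses the cached words instead of recomputing lines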

1 map

An element-wise mapping transformation; it can change an element's type or its value.

    val RDD1 = sc.parallelize(List(1, 2, 3, 4, 5))

    RDD1.map(_ * 2).foreach(println)   // prints 2 4 6 8 10 (output order may vary)
2 mapPartitions

map : f —> element
mapPartitions : f —> partition

mapPartitions operates on one whole partition at a time (see the sketch after the example below).

    val RDD2 = sc.parallelize(List(1, 2, 3, 4, 5),2)
    RDD2.mapPartitions(_.map(_*2)).foreach(println)
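A small sketch contrasting the two, reusing the same sc; the per-partition sums are only meant to show that the function sees an entire partition at once:

    val nums = sc.parallelize(1 to 4, 2)   // two partitions: (1, 2) and (3, 4)

    // map: the function is invoked once per element
    nums.map(_ * 2).collect()                                  // Array(2, 4, 6, 8)

    // mapPartitions: the function receives each partition's iterator as a whole
    nums.mapPartitions(iter => Iterator(iter.sum)).collect()   // Array(3, 7)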
3 mapPartitionsWithIndex

Each partition corresponds to one task.

    RDD2.mapPartitionsWithIndex((index, partition) => {
      println("processing one partition")
      partition.map(x => s"partition: $index, element: $x")
    }).foreach(println)


4 filter
  /**
   * Return a new RDD containing only the elements that satisfy a predicate.
   */
  def filter(f: T => Boolean): RDD[T] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[T, T](
      this,
      (context, pid, iter) => iter.filter(cleanF),
      preservesPartitioning = true)
  }

    RDD1.filter(_ > 2).foreach(println)
    RDD1.filter(_ % 2 == 0).filter(_ > 2).foreach(println)
    RDD1.filter(x => x % 2 == 0 && x > 2).foreach(println)
5 glom

Gathers all the elements of each partition into a single Array.

  /**
   * Return an RDD created by coalescing all elements within each partition into an array.
   */
  def glom(): RDD[Array[T]] = withScope {
    new MapPartitionsRDD[Array[T], T](this, (context, pid, iter) => Iterator(iter.toArray))
  }
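A minimal usage sketch, reusing the sc from above:

    val rdd = sc.parallelize(1 to 6, 2)
    rdd.glom().collect().foreach(a => println(a.mkString(",")))
    // prints "1,2,3" and "4,5,6" -- one Array per partition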

6 sample

Takes a random sample of the RDD.

/**
   * Return a sampled subset of this RDD.
   *
   * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
   * @param fraction expected size of the sample as a fraction of this RDD's size
   *  without replacement: probability that each element is chosen; fraction must be [0, 1]
   *  with replacement: expected number of times each element is chosen; fraction must be greater
   *  than or equal to 0
   * @param seed seed for the random number generator
   *
   * @note This is NOT guaranteed to provide exactly the fraction of the count
   * of the given [[RDD]].
   */
  def sample(
      withReplacement: Boolean,
      fraction: Double,
      seed: Long = Utils.random.nextLong): RDD[T] = {
    require(fraction >= 0,
      s"Fraction must be nonnegative, but got ${fraction}")

Example:

    sc.parallelize(1 to 20).sample(withReplacement = true, fraction = 0.5).foreach(println)

7 zip

Zips two RDDs together element by element, like a zipper: one element from each.

/**
   * Zips this RDD with another one, returning key-value pairs with the first element in each RDD,
   * second element in each RDD, etc. Assumes that the two RDDs have the *same number of
   * partitions* and the *same number of elements in each partition* (e.g. one was made through
   * a map on the other).
   */
  def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
    zipPartitions(other, preservesPartitioning = false) { (thisIter, otherIter) =>
      new Iterator[(T, U)] {
        def hasNext: Boolean = (thisIter.hasNext, otherIter.hasNext) match {
          case (true, true) => true
          case (false, false) => false
          case _ => throw new SparkException("Can only zip RDDs with " +
            "same number of elements in each partition")
        }
        def next(): (T, U) = (thisIter.next(), otherIter.next())
      }
    }
  }

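A minimal usage sketch; note that both RDDs must have the same number of partitions and the same number of elements per partition:

    val a = sc.parallelize(List(1, 2, 3), 2)
    val b = sc.parallelize(List("x", "y", "z"), 2)
    a.zip(b).collect()   // Array((1,x), (2,y), (3,z))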
zipWithIndex

zipWithIndex pairs each element of the RDD with its index (as a Long).
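A minimal sketch:

    sc.parallelize(List("a", "b", "c")).zipWithIndex().collect()
    // Array((a,0), (b,1), (c,2))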

8 flatMap
  /**
   *  Return a new RDD by first applying a function to all elements of this
   *  RDD, and then flattening the results.
   */
  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
  }

The difference between map and flatMap

flatMap is essentially map followed by an extra flatten step.

map turns an RDD of length N into another RDD of length N, whereas flatMap first turns an RDD of length N into N collections and then flattens those collections into a single result RDD.

For example, consider a data file README.md with the following three lines (the second one empty):

a b c

d
Applying the following transformation:

val textFile = sc.textFile("README.md")
textFile.flatMap(_.split(" ")) 
amounts to the following steps:

["a b c", "", "d"] => [["a","b","c"],[],["d"]] => ["a","b","c","d"]
In this example, flatMap converts an RDD of lines, ["a b c", "", "d"], into a collection of individual words. Compared with map, the extra step flatMap performs is [["a","b","c"],[],["d"]] => ["a","b","c","d"].
9 flatMapValues
  /**
   * Pass each value in the key-value pair RDD through a flatMap function without changing the
   * keys; this also retains the original RDD's partitioning.
   */
  def flatMapValues[U](f: V => TraversableOnce[U]): RDD[(K, U)] = self.withScope {
    val cleanF = self.context.clean(f)
    new MapPartitionsRDD[(K, U), (K, V)](self,
      (context, pid, iter) => iter.flatMap { case (k, v) =>
        cleanF(v).map(x => (k, x))
      },
      preservesPartitioning = true)
  }
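A minimal usage sketch, reusing the sc from above:

    val pairs = sc.parallelize(List(("a", "1,2"), ("b", "3")))
    pairs.flatMapValues(_.split(",")).collect()   // Array((a,1), (a,2), (b,3))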

10 mapValues
  /**
   * Pass each value in the key-value pair RDD through a map function without changing the keys;
   * this also retains the original RDD's partitioning.
   */
  def mapValues[U](f: V => U): RDD[(K, U)] = self.withScope {
    val cleanF = self.context.clean(f)
    new MapPartitionsRDD[(K, U), (K, V)](self,
      (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
      preservesPartitioning = true)
  }

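A minimal usage sketch, reusing the sc from above:

    val scores = sc.parallelize(List(("a", 1), ("b", 2)))
    scores.mapValues(_ * 10).collect()   // Array((a,10), (b,20))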


11 join
  /**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
   */
  def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues( pair =>
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
    )
  }

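A minimal usage sketch; only keys present in both RDDs survive the join:

    val left  = sc.parallelize(List(("a", 1), ("b", 2)))
    val right = sc.parallelize(List(("a", "x"), ("a", "y")))
    left.join(right).collect()   // e.g. Array((a,(1,x)), (a,(1,y)))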

12 cogroup

join itself is implemented on top of cogroup.

cogroup returns the grouped values as CompactBuffer instances.
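A minimal usage sketch with two small pair RDDs:

    val left  = sc.parallelize(List(("a", 1), ("b", 2)))
    val right = sc.parallelize(List(("a", "x")))
    left.cogroup(right).collect()
    // e.g. Array((a,(CompactBuffer(1),CompactBuffer(x))), (b,(CompactBuffer(2),CompactBuffer())))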

13 union

The number of partitions of the result equals the sum of the two inputs' partition counts.
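A minimal sketch of that partition behaviour:

    val a = sc.parallelize(1 to 3, 2)
    val b = sc.parallelize(4 to 6, 3)
    a.union(b).getNumPartitions   // 5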

14 groupBy
  /**
   * Return an RDD of grouped items. Each group consists of a key and a sequence of elements
   * mapping to that key. The ordering of elements within each group is not guaranteed, and
   * may even differ each time the resulting RDD is evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   */
  def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
      : RDD[(K, Iterable[T])] = withScope {
    val cleanF = sc.clean(f)
    this.map(t => (cleanF(t), t)).groupByKey(p)
  }

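A minimal usage sketch, grouping numbers by parity:

    sc.parallelize(1 to 6).groupBy(_ % 2).collect()
    // e.g. Array((0,CompactBuffer(2, 4, 6)), (1,CompactBuffer(1, 3, 5)))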

15 groupByKey

A specialized form of groupBy that can only group by the key of a key-value RDD.
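A minimal usage sketch, reusing the sc from above:

    val pairs = sc.parallelize(List(("a", 1), ("a", 2), ("b", 3)))
    pairs.groupByKey().collect()   // e.g. Array((a,CompactBuffer(1, 2)), (b,CompactBuffer(3)))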
wordCount
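A minimal word-count sketch combining flatMap, map and reduceByKey, using an in-memory list of lines:

    val lines = sc.parallelize(List("a b c", "a b"))
    lines.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()   // e.g. Array((a,2), (b,2), (c,1))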
