makeRDD and parallelize: the two methods for creating an RDD from a local collection are exactly the same.
val conf = new SparkConf().setAppName("MapPartitionsDemo").setMaster("local[*]")
val sc = new SparkContext(conf)
//create an RDD
val rdd1 = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)
//makeRDD source code
/** Distribute a local Scala collection to form an RDD.
*
* This method is identical to `parallelize`.
* @param seq Scala collection to distribute
* @param numSlices number of partitions to divide the collection into
* @return RDD representing distributed collection
*/
def makeRDD[T: ClassTag](
seq: Seq[T],
numSlices: Int = defaultParallelism): RDD[T] = withScope {
parallelize(seq, numSlices)
}
//The two local-collection creation methods are completely equivalent: makeRDD simply calls parallelize under the hood.
//If the number of partitions is not specified, defaultParallelism is used (all available cores under local[*]).
val rdd2 = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)
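A quick check (a minimal sketch, assuming the local[*] setup above): both RDDs report the same partition count, and when numSlices is omitted the default is sc.defaultParallelism.
println(rdd1.getNumPartitions)   // 3
println(rdd2.getNumPartitions)   // 3
//without numSlices, defaultParallelism decides the partition count
println(sc.makeRDD(List(1, 2, 3)).getNumPartitions == sc.defaultParallelism)   // true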
textFile
sc.textFile("hdfs://linux01:9000/data",2)
When this method pulls data from Hadoop, suppose the directory contains four blocks with sizes (99, 99, 99, 799). If the second argument is 2, the larger file is split into two partitions; if it is 3, it is split into three, and so on. If the argument is omitted, defaultMinPartitions (at most 2) is used.
sc.textFile("hdfs://linux01:9000/data",2)
//source code
/**
* Read a text file from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI, and return it as an RDD of Strings.
* The text files must be encoded as UTF-8.
*
* @param path path to the text file on a supported file system
* @param minPartitions suggested minimum number of partitions for the resulting RDD
* @return RDD of lines of the text file
*/
def textFile(
path: String,
minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
assertNotStopped()
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
minPartitions).map(pair => pair._2.toString).setName(path)
}
The number of partitions this method produces actually follows a predictable rule: textFile uses TextInputFormat, which extends FileInputFormat, and the actual splitting is done by the getSplits method of that parent class.
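A simplified sketch of that rule (illustrative only, not the exact Hadoop code): getSplits derives a target split size from the total input size and the requested number of splits, clamps it by the block size, and then cuts each file against that size.
def splitSize(totalSize: Long, numSplits: Int, blockSize: Long, minSize: Long = 1L): Long = {
  val goalSize = totalSize / math.max(numSplits, 1)    // target size per split
  math.max(minSize, math.min(goalSize, blockSize))     // clamp between minSize and blockSize
}
//with the blocks from the example above (99 + 99 + 99 + 799 = 1096 bytes in total):
println(splitSize(totalSize = 1096, numSplits = 2, blockSize = 128 * 1024 * 1024))   // 548
//548 < 799, so the 799-byte file is cut into two splits while the three small files stay whole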
filter
In essence filter operates on each partition; under the hood it creates a MapPartitionsRDD.
/**
* Return a new RDD containing only the elements that satisfy a predicate.
*/
def filter(f: T => Boolean): RDD[T] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[T, T](
this,
(_, _, iter) => iter.filter(cleanF),
preservesPartitioning = true)
}
/** Returns an iterator over all the elements of this iterator that satisfy the predicate `p`.
* The order of the elements is preserved.
*
* @param p the predicate used to test values.
* @return an iterator which produces those values of this iterator which satisfy the predicate `p`.
* @note Reuse: $consumesAndProducesIterator
*/
def filter(p: A => Boolean): Iterator[A] = new AbstractIterator[A] {
// TODO 2.12 - Make a full-fledged FilterImpl that will reverse sense of p
private var hd: A = _
private var hdDefined: Boolean = false
def hasNext: Boolean = hdDefined || {
do {
if (!self.hasNext) return false
hd = self.next()
} while (!p(hd))
hdDefined = true
true
}
def next() = if (hasNext) { hdDefined = false; hd } else empty.next()
}
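A short usage sketch (assuming the rdd1 created above): filter keeps only the elements that satisfy the predicate, and the partition count is unchanged because MapPartitionsRDD reuses the parent's partitions.
val evens = rdd1.filter(_ % 2 == 0)
println(evens.getNumPartitions)     // 3, same as rdd1
println(evens.collect().toBuffer)   // ArrayBuffer(2, 4, 6, 8, 10)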
When filter is called, the parent RDD, the partitions to operate on, and the function are passed into a MapPartitionsRDD; its compute method then applies the supplied function to every partition of the parent RDD.
/**
* An RDD that applies the provided function to every partition of the parent RDD.
*
* @param prev the parent RDD.
* @param f The function used to map a tuple of (TaskContext, partition index, input iterator) to
* an output iterator.
* @param preservesPartitioning Whether the input function preserves the partitioner, which should
* be `false` unless `prev` is a pair RDD and the input function
* doesn't modify the keys.
* @param isFromBarrier Indicates whether this RDD is transformed from an RDDBarrier, a stage
* containing at least one RDDBarrier shall be turned into a barrier stage.
* @param isOrderSensitive whether or not the function is order-sensitive. If it's order
* sensitive, it may return totally different result when the input order
* is changed. Mostly stateful functions are order-sensitive.
*/
private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
var prev: RDD[T],
f: (TaskContext, Int, Iterator[T]) => Iterator[U], // (TaskContext, partition index, iterator)
preservesPartitioning: Boolean = false,
isFromBarrier: Boolean = false,
isOrderSensitive: Boolean = false)
extends RDD[U](prev) {
override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None
override def getPartitions: Array[Partition] = firstParent[T].partitions
override def compute(split: Partition, context: TaskContext): Iterator[U] =
f(context, split.index, firstParent[T].iterator(split, context))
override def clearDependencies(): Unit = {
super.clearDependencies()
prev = null
}
@transient protected lazy override val isBarrier_ : Boolean =
isFromBarrier || dependencies.exists(_.rdd.isBarrier())
override protected def getOutputDeterministicLevel = {
if (isOrderSensitive && prev.outputDeterministicLevel == DeterministicLevel.UNORDERED) {
DeterministicLevel.INDETERMINATE
} else {
super.getOutputDeterministicLevel
}
}
}
map
The map method is essentially the same as filter: both construct a MapPartitionsRDD under the hood. The processing happens on the map side, so both are transformation operations.
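For comparison, the map implementation (paraphrased from the Spark source; it may differ slightly across versions) mirrors filter and only swaps iter.filter for iter.map:
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.map(cleanF))
}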
mapPartitions
The function passed to mapPartitions receives an iterator: data is handed over one partition at a time, and each partition is exposed as an iterator. If the job has three partitions, the mapPartitions function is invoked only three times, whereas the function passed to map is invoked once per record.
package com.doit.spark.restart
import org.apache.spark.{SparkConf, SparkContext, TaskContext}
object MapPartitionsDemo {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("MapPartitionsDemo").setMaster("local[*]")
val sc = new SparkContext(conf)
//create an RDD
val nums = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)
//call map: the function inside map is invoked once for every record processed
val result = nums.map(x=>{
val index = TaskContext.getPartitionId()
(index,x*100)
})
//call mapPartitions: data is taken out one partition at a time; a partition is an iterator
//unlike map, the function passed into mapPartitions takes an iterator and also returns an iterator
val result2 = nums.mapPartitions(it=>{
val index = TaskContext.getPartitionId()
val nIt: Iterator[(Int, Int)] = it.map(x => {
(index, x * 100)
})
nIt
})
println(result2.collect().toBuffer)
result.saveAsTextFile("zzz")
}
}
/**
* Return a new RDD by applying a function to each partition of this RDD.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
*/
def mapPartitions[U: ClassTag](
f: Iterator[T] => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U] = withScope {
val cleanedF = sc.clean(f)
new MapPartitionsRDD(
this,
(_: TaskContext, _: Int, iter: Iterator[T]) => cleanedF(iter),
preservesPartitioning)
}
mapPartitionsWithIndex
Under the hood this method takes a function plus a preservesPartitioning flag.
The first argument is a function whose parameter list is an Int (the partition index) and an iterator.
The second argument, preservesPartitioning, defaults to false. It tells Spark whether the function preserves the partitioner; it should only be set to true when this is a pair RDD and the function does not modify the keys, so that later key-based operations can reuse the existing partitioning instead of shuffling again (see the sketch after the source below).
val res1 = nums.mapPartitionsWithIndex((index: Int, it) => {
it.map(e => {
s"index:$index,element:$e"
})
})
/**
* Return a new RDD by applying a function to each partition of this RDD, while tracking the index
* of the original partition.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
*/
def mapPartitionsWithIndex[U: ClassTag](
f: (Int, Iterator[T]) => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U] = withScope {
val cleanedF = sc.clean(f)
new MapPartitionsRDD(
this,
(_: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),
preservesPartitioning)
}
flatMap
This method is very similar to the previous transformation methods: an iterator goes in and an iterator comes out, and flatMap is applied to every element of the iterator (each element is mapped to a collection and the results are flattened).
package com.doit.spark.restart
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
object FlatMap {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("MapPartitionsDemo").setMaster("local[*]")
val sc = new SparkContext(conf)
//create an RDD
val arr = Array("spark hadoop flink spark", "hadoop flink spark", "spark hadoop flink")
val lines: RDD[String] = sc.makeRDD(arr)
val value = lines.flatMap(x => x)
println(value.collect().toBuffer)
//ArrayBuffer(s, p, a, r, k, , h, a, d, o, o, p, , f, l, i, ....)
sc.stop()
}
}
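To flatten into words instead of single characters, split each line first. A minimal sketch, assuming the same sc and arr as in the example above (placed before sc.stop()):
//split each line into words, then flatten
val words = sc.makeRDD(arr).flatMap(line => line.split(" "))
println(words.collect().toBuffer)
//ArrayBuffer(spark, hadoop, flink, spark, hadoop, flink, spark, spark, hadoop, flink)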
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.flatMap(cleanF))
}