Spark: How Operators Work and How They Differ

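Repost notice

Much of this article is adapted, with edits, from the articles below; some material from other documentation has been added:

Spark: how distinct removes duplicates
Author: 道法—自然
The difference between union and union all in SQL
Author: 道法—自然
The difference between repartition and coalesce
Author: 道法—自然
The difference between makeRDD and parallelize in Spark
Author: 道法—自然
map, flatMap, mapPartitions, flatMapToPair: how the four differ
Author: 道法—自然
Common Spark operators compared (map, mapPartitions, foreach, foreachPartition)
Author: 老子天下最美
Understanding DStream.foreachRDD
Author: Woople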

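0x01 How the Operators Work

1.1 Distinct

distinct removes duplicates in three steps: map every element to a key-value pair keyed by the element, merge the pairs that share a key with reduceByKey (this is the step that shuffles), and then keep only the keys.

The code below illustrates this: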

// Sample data
val rdd = sc.makeRDD(Array(
  "hello",
  "hello",
  "hello",
  "world"
))
// Emulate what distinct does
val distinctRDD = rdd
  .map((_, 1))
  .reduceByKey(_ + _)
  .map(_._1)
distinctRDD.foreach(println)
// distinct yields the same elements as the operator chain above
rdd.distinct().foreach(println)

1.2 checkpoint

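checkpoint persists an RDD to reliable storage (typically HDFS, under the directory set with SparkContext.setCheckpointDir) and cuts the RDD's dependencies on its parents, so later recovery reads the saved data instead of recomputing the lineage. The data is written when the first action runs on the RDD.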

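0x02 Differences

2.1 union vs. union all

union removes duplicate rows after combining the two result sets, so the database has to run an extra deduplication pass (typically a sort or a hash) over the combined result before returning it.
union all simply concatenates the two result sets and returns them. If the two result sets contain the same row, the output contains it more than once.

Because it skips the deduplication pass, union all is usually noticeably faster than union. If you know the two result sets share no rows, prefer union all.

For example, take two student tables, Table1 and Table2, and run the union statement: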

select * from Table1 union select * from Table2

In the result, duplicate rows have been removed.

Run the union all statement:

select * from Table1 union all select * from Table2

In the result, duplicate rows are kept.
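
The two behaviors can be sketched with plain Scala collections standing in for the tables. The row type and the table contents below are invented for illustration; they are not the tables from the original article. Note also that in Spark itself, RDD.union and Dataset.union keep duplicates (they behave like union all), so a distinct() call is needed to get SQL union semantics.

```scala
object UnionSketch {
  // A row of a hypothetical student table: (id, name)
  type Row = (Int, String)

  // UNION ALL: plain concatenation, duplicates are kept
  def unionAll(a: Seq[Row], b: Seq[Row]): Seq[Row] = a ++ b

  // UNION: concatenation followed by a deduplication pass
  def union(a: Seq[Row], b: Seq[Row]): Seq[Row] = (a ++ b).distinct

  def main(args: Array[String]): Unit = {
    // Made-up sample rows; (2, "Bob") appears in both tables
    val table1 = Seq((1, "Ann"), (2, "Bob"))
    val table2 = Seq((2, "Bob"), (3, "Cid"))
    println(unionAll(table1, table2)) // four rows, (2, "Bob") twice
    println(union(table1, table2))    // three rows, (2, "Bob") once
  }
}
```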

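2.2 repartition vs. coalesce

Both are repartitioning functions:

repartition(numPartitions: Int): RDD[T]
coalesce(numPartitions: Int, shuffle: Boolean = false): RDD[T]

Both change how an RDD is partitioned, and repartition is simply coalesce called with shuffle = true. Suppose an RDD has N partitions and needs to end up with M. There are three cases:

N < M
The N partitions are usually unevenly filled, and the data has to be redistributed into M partitions with a HashPartitioner, so shuffle must be set to true.
N > M and the two are fairly close (say N = 1000 and M = 100)
Several of the N partitions can be merged into each new partition until M remain, so shuffle can be left as false. No shuffle takes place, and the parent and child RDDs are connected by a narrow dependency. (With shuffle = false, asking for M > N has no effect.)
N > M and the two differ greatly
With shuffle = false the parent and child RDDs are connected by a narrow dependency and sit in the same stage, so the whole stage may run with too little parallelism and performance suffers. In the extreme case of M = 1, set shuffle to true so that the operations before coalesce keep their parallelism.

In short: with shuffle = false, passing a number larger than the current partition count leaves the partition count unchanged. Without a shuffle, the number of partitions of an RDD cannot be increased.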
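
The rule can be modeled without Spark by treating an RDD as a vector of partitions. This is only a sketch: the names are hypothetical, the contiguous grouping in coalesceNoShuffle is a simplification (Spark also weighs data locality), and the round-robin placement in repartition is a stand-in for the real shuffle.

```scala
object CoalesceSketch {
  type Partitions[T] = Vector[Vector[T]]

  // shuffle = false: whole partitions are merged, elements never move individually.
  // Asking for more partitions than exist leaves the layout unchanged.
  def coalesceNoShuffle[T](parts: Partitions[T], m: Int): Partitions[T] = {
    require(m > 0, "target partition count must be positive")
    val n = parts.size
    if (m >= n) parts
    else Vector.tabulate(m) { i =>
      // parent partitions [i*n/m, (i+1)*n/m) are merged into child partition i
      parts.slice(i * n / m, (i + 1) * n / m).flatten
    }
  }

  // shuffle = true (what repartition does): every element may move,
  // so the partition count can grow as well as shrink.
  def repartition[T](parts: Partitions[T], m: Int): Partitions[T] = {
    require(m > 0, "target partition count must be positive")
    val indexed = parts.flatten.zipWithIndex
    Vector.tabulate(m)(i => indexed.collect { case (x, idx) if idx % m == i => x })
  }

  def main(args: Array[String]): Unit = {
    val parts: Partitions[Int] = Vector(Vector(1, 2), Vector(3), Vector(4, 5, 6))
    println(coalesceNoShuffle(parts, 5).size) // still 3: cannot grow without a shuffle
    println(coalesceNoShuffle(parts, 2))      // two merged partitions
    println(repartition(parts, 5).size)       // 5: a shuffle can increase the count
  }
}
```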

2.3 makeRDD vs. parallelize

2.3.1 Overview

Spark offers roughly three ways to create an RDD:

From a collection, e.g. parallelize and makeRDD
From external storage, e.g. textFile
From another RDD

2.3.2 parallelize and makeRDD

There are two ways to parallelize a collection in the driver program: parallelize() and makeRDD().

parallelize()

def parallelize[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  assertNotStopped()
  new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}

makeRDD() has two overloads:

Overload 1

/** 
 * Distribute a local Scala collection to form an RDD.
 *
 * This method is identical to `parallelize`.
 */
def makeRDD[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  parallelize(seq, numSlices)
}

As you can see, this overload simply calls parallelize().

Overload 2

/**
 * Distribute a local Scala collection to form an RDD, with one or more
 * location preferences (hostnames of Spark nodes) for each object.
 * Create a new partition for each collection item.
 */
 def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): 
 RDD[T] = withScope {
  assertNotStopped()
  val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap
  new ParallelCollectionRDD[T](this, seq.map(_._1), math.max(seq.size, 1), indexToPrefs)
 }

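As the comment says, this overload distributes a local Scala collection to form an RDD, attaches one or more location preferences (hostnames of Spark nodes) to each element, and creates one partition per element.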

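2.3.3 A concrete example

The following example makes the difference easier to see.

First define a collection:

val seq = List(("American Person", List("Tom", "Jim")), ("China Person", List("LiLei", "HanMeiMei")), ("Color Type", List("Red", "Blue")))

Create an RDD with parallelize():

val rdd1 = sc.parallelize(seq)

Check the number of partitions of rdd1:

// 2 in the author's run; the default comes from sc.defaultParallelism
rdd1.partitions.size

Create an RDD with makeRDD():

val rdd2 = sc.makeRDD(seq)

Check the number of partitions of rdd2:

// 3, one partition per element
rdd2.partitions.size

In short:

The first makeRDD overload is identical to parallelize: both take a plain collection, and makeRDD is implemented by calling parallelize.

The second makeRDD overload additionally takes preferred locations for each element and creates one partition per element. That is why rdd2 has 3 partitions in the example above. It also means rdd2 is an RDD[String] whose inner lists were interpreted as location preferences, whereas rdd1 is an RDD[(String, List[String])].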
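
To see where the partition counts come from, here is a Spark-free sketch of the two slicing strategies. The object and method names are hypothetical, and the range arithmetic is modeled on how ParallelCollectionRDD cuts a sequence into contiguous slices; treat it as an approximation rather than the library's exact code.

```scala
object SliceSketch {
  // parallelize / first makeRDD overload: cut the sequence into numSlices contiguous ranges
  def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "need at least one slice")
    (0 until numSlices).map { i =>
      val start = (i.toLong * seq.length / numSlices).toInt
      val end = ((i + 1).toLong * seq.length / numSlices).toInt
      seq.slice(start, end)
    }
  }

  // second makeRDD overload: one partition per element,
  // with the location preferences kept in a map keyed by partition index
  def sliceWithPrefs[T](seq: Seq[(T, Seq[String])]): (Seq[Seq[T]], Map[Int, Seq[String]]) = {
    val prefs = seq.zipWithIndex.map { case ((_, hosts), i) => i -> hosts }.toMap
    (slice(seq.map(_._1), math.max(seq.size, 1)), prefs)
  }

  def main(args: Array[String]): Unit = {
    val seq = List(
      ("American Person", List("Tom", "Jim")),
      ("China Person", List("LiLei", "HanMeiMei")),
      ("Color Type", List("Red", "Blue")))
    println(slice(seq, 2).map(_.size))   // three tuples spread over two slices
    println(sliceWithPrefs(seq)._1.size) // one slice per element
  }
}
```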

2.4 map, flatMap, mapPartitions, flatMapToPair

2.4.1 map

2.4.1.1 Overview

A transformation.
map applies the function f to every element of the RDD and returns a new RDD.

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

2.4.1.2 Example 1

val d = sc.makeRDD(Array(1, 2, 3, 4, 5, 1, 3, 5))
// Build a pair RDD, dd: RDD[(Int, Int)]
val dd = d.map(x => (x, 1))
dd.foreach(println)

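2.4.1.3 Example 2

// AdData is not defined in the original; the field names here are assumed
case class AdData(name: String, age: Int, city: String)
import spark.implicits._ // needed for toDF(); assumes a SparkSession named spark

val d = sc.textFile("/Users/chengc/cc/work/projects/sparkDemo1/src/main/resources/input/dataframe.txt")
val dd = d.map(_.split(" "))
val person = dd.map(datas => AdData(datas(0), datas(1).trim.toInt, datas(2))).toDF()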

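2.4.2 flatMap

2.4.2.1 Overview

A transformation.
Applies the given function to every input element and flattens the results.
Unlike map, whose output always has exactly as many elements as its input, flatMap may produce more or fewer.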
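
The difference in output size is easy to check on an ordinary Scala collection, whose map and flatMap follow the same contract as the RDD versions. The sample lines are made up for illustration.

```scala
object FlatMapSketch {
  // map: exactly one output element per input element
  def lengths(lines: Seq[String]): Seq[Int] = lines.map(_.length)

  // flatMap: each input yields zero or more outputs, which are flattened together
  def words(lines: Seq[String]): Seq[String] =
    lines.flatMap(_.split(" ").filter(_.nonEmpty))

  def main(args: Array[String]): Unit = {
    val lines = Seq("hello big world", "", "spark")
    println(lengths(lines).size) // same as the number of input lines
    println(words(lines).size)   // depends on the content of the lines
  }
}
```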

2.4.2.2 Example 1

val input1 = sc.textFile(inputFileName)
input1.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_).collect().foreach(println)

2.4.2.3 Example 2

val input2 = sc.textFile(inputFileName)
val words = input2.flatMap(line => line.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
// Save the word counts to a text file; this action triggers evaluation
counts.saveAsTextFile(outputFileName)
spark.stop()

2.4.3 mapPartitions

2.4.3.1 Overview

A transformation.
Processes each partition of the RDD as a unit and returns a new RDD.

mapPartitions is a variant of map, and both process partitions in parallel. The main difference is the granularity at which the user function is called:
map applies its function to every element of the RDD, whereas mapPartitions applies its function once per partition and hands it the whole partition's contents as an iterator.

/**
 * Return a new RDD by applying a function to each partition of this RDD.
 *
 * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
 * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
 */
def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U] = withScope {
  val cleanedF = sc.clean(f)
  new MapPartitionsRDD(
    this,
    (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
    preservesPartitioning)
}

2.4.3.2 Example 1

val a = sc.makeRDD(Array(1,2,3,4,5,1,3,5), 2)
val b = a.mapPartitions(partition=>{
  partition.map(_*2)
})
b.foreach(println(_))

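2.4.3.3 Summary

mapPartitions and foreachPartition are often more efficient than map and foreach when each call carries setup cost, and are recommended in that case.

map invokes its function on every element, whereas mapPartitions invokes its function once per partition, on that partition's iterator. If the work needs an expensive helper object (such as a Connection), map would create one per element, while mapPartitions can create one per partition and share it (a single Connection or a connection pool).

The trade-off is memory: if the function materializes a whole partition and the result for that partition is very large, mapPartitions can cause an OOM. How can this be addressed?

Use repartition to increase the number of partitions, so that each partition, and therefore each partition's result, is much smaller.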
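
The setup-cost argument can be checked with a small model that counts how many "connections" get opened. Everything here is hypothetical: FakeConnection stands in for a real client, and nested Scala sequences stand in for an RDD's partitions.

```scala
import java.util.concurrent.atomic.AtomicInteger

object SetupCostSketch {
  // Stand-in for an expensive resource; constructing it bumps the counter
  final class FakeConnection(opened: AtomicInteger) {
    opened.incrementAndGet()
    def lookup(x: Int): Int = x * 2
  }

  // map style: the function sees one element, so the setup runs per element
  def perElement(partitions: Seq[Seq[Int]], opened: AtomicInteger): Seq[Int] =
    partitions.flatMap(_.map(x => new FakeConnection(opened).lookup(x)))

  // mapPartitions style: the function sees the partition's iterator,
  // so the setup runs once per partition and is shared by its elements
  def perPartition(partitions: Seq[Seq[Int]], opened: AtomicInteger): Seq[Int] = {
    val f: Iterator[Int] => Iterator[Int] = iter => {
      val conn = new FakeConnection(opened)
      iter.map(x => conn.lookup(x))
    }
    partitions.flatMap(part => f(part.iterator).toList)
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(1, 2, 3), Seq(4, 5))
    val a = new AtomicInteger(0)
    val b = new AtomicInteger(0)
    perElement(parts, a)
    perPartition(parts, b)
    println(s"per element: ${a.get} connections, per partition: ${b.get} connections")
  }
}
```

Both functions return the same values; only the number of connections opened differs.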

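2.4.4 flatMapToPair

flatMapToPair belongs to the Java API. Like flatMap, it applies the given function to every input element, but the function returns zero or more key-value pairs (Tuple2) per element.
The pairs produced for all elements are then flattened into a single pair RDD (JavaPairRDD).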
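
The Scala RDD API has no separate flatMapToPair; a flatMap whose function returns tuples plays the same role. The sketch below shows that shape on a plain Scala collection, with made-up input and a hypothetical helper name.

```scala
object FlatMapToPairSketch {
  // Each line yields zero or more (word, 1) pairs; all pairs end up in one flat sequence
  def toPairs(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(line => line.split(" ").filter(_.nonEmpty).map(word => (word, 1)))

  def main(args: Array[String]): Unit = {
    println(toPairs(Seq("a b", "", "a")))
  }
}
```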

2.5 foreach, foreachPartition, foreachRDD

2.5.1 foreach

An action.
Returns nothing.
The source comment reads: "Applies a function f to all elements of this RDD." It walks the RDD and applies the function to every element.

foreach vs. foreachPartition
Like foreachPartition, foreach iterates over the iterator of each partition and uses the function supplied by the user to process its contents.

The difference is that the function passed to foreach does not receive an iterator. It receives one element of the RDD at a time (for a pair RDD, one key-value pair), that is, the actual data.

An example:

val d = sc.makeRDD(Array(1, 2, 3, 4, 5, 1, 3, 5))
val dd = d.map(x => (x, 1))              // build a pair RDD, dd: RDD[(Int, Int)]
val dg = dd.reduceByKey((x, y) => x + y) // dg: RDD[(Int, Int)], the count per key
dg.foreach(println(_))
dg.foreach(pair => print("key=" + pair._1 + ",value=" + pair._2))

2.5.2 foreachPartition

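/**
 * Applies a function f to each partition of this RDD.
 */
def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}

An action.
Returns nothing.
Processes each partition of the RDD as a unit.

foreachPartition hands the iterator of each partition to the user-supplied function. As the source shows, the function's parameter is an iterator, so inside foreachPartition the function works on the partition's iterator rather than on individual elements.

foreach vs. foreachPartition
- Same
  Both operate on the iterator of each partition.
- Different
  - foreach calls foreach on the iterator of each partition itself; the user function is only used inside that loop, so every element of the RDD is passed to it one at a time.
  - foreachPartition passes the iterator of each partition to the user function and lets the function decide how to consume it (it can stream through the iterator without materializing the partition, which helps avoid running out of memory).

Summary
As with mapPartitions versus map, foreachPartition is usually the more efficient choice when each call needs expensive setup, such as opening a connection.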

2.5.3 foreachRDD

2.5.3.1 Official description

/**
 * Apply a function to each RDD in this DStream. This is an output operator, so
 * 'this' DStream will be registered as an output stream and therefore materialized.
 */
def foreachRDD(foreachFunc: RDD[T] => Unit): Unit = ssc.withScope {
  val cleanedF = context.sparkContext.clean(foreachFunc, false)
  foreachRDD((r: RDD[T], t: Time) => cleanedF(r), displayInnerRDDOps = true)
}

The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.

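Key points:

foreachRDD is the most generic output operator; it takes a function func and applies it to every RDD generated from the stream.
func should push the data in each RDD to an external system, for example by saving it to files or writing it to a database.
func itself runs in the driver process of the Spark Streaming application.
func usually contains RDD actions, which force the computation of the streaming RDDs' transformations.

2.5.3.2 How many RDDs does foreachRDD see?

Spark Streaming processes a stream in near real time by cutting it into slices of a configured duration (micro-batching). Each batch interval yields exactly one RDD. So how should "each RDD" in the definition be read?

A DStream is organized by time: every interval produces one RDD, so along the time axis a new RDD appears at regular intervals. "Each RDD" therefore means the RDD of each interval, not each of several RDDs within one interval.

The source code supports this reading.

DStream.foreachRDD eventually calls the following code: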

private def foreachRDD(
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean): Unit = {
  new ForEachDStream(this,
    context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
}

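As you can see, there is no Iterator anywhere in this method. Compare it with foreach and foreachPartition on RDD: those two methods do traverse the RDD, which is why they reference an Iterator.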

def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}

def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}

If an interval contained multiple RDDs, foreachRDD on DStream would likewise need to iterate over them, but the code above contains no such iteration.
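
The "one RDD per interval" reading can be pictured with a toy model. FakeDStream, Batch and FakeRDD are hypothetical stand-ins, not Spark classes; the point is only that func is called once per interval with that interval's single RDD.

```scala
object ForeachRDDSketch {
  type FakeRDD[T] = Seq[T] // stand-in for RDD[T]

  // One batch interval: its timestamp and the single "RDD" it produced
  final case class Batch[T](timeMs: Long, rdd: FakeRDD[T])

  // A stream modeled as one batch per interval
  final class FakeDStream[T](batches: Seq[Batch[T]]) {
    // func receives the single RDD of each interval; nothing here
    // iterates over several RDDs within one interval
    def foreachRDD(func: (FakeRDD[T], Long) => Unit): Unit =
      batches.foreach(b => func(b.rdd, b.timeMs))
  }

  // Counts how many times the function passed to foreachRDD is invoked
  def countCalls[T](stream: FakeDStream[T]): Int = {
    var calls = 0
    stream.foreachRDD((rdd, time) => calls += 1)
    calls
  }

  def main(args: Array[String]): Unit = {
    val stream = new FakeDStream(Seq(
      Batch(0L, Seq("a", "b")),
      Batch(1000L, Seq.empty[String]), // an interval with no data still yields one (empty) RDD in this model
      Batch(2000L, Seq("c"))))
    println(countCalls(stream)) // one call per interval
  }
}
```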

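2.5.3.3 Where the foreachRDD function runs

Note that the func passed to foreachRDD runs in the driver process. RDD operations invoked inside it, such as the closure passed to rdd.foreachPartition, still run on the executors.

2.5.3.4 What foreachRDD is for

foreachRDD processes the RDD of each batch interval by applying func to it, for example saving it to files or writing it to a database.

2.5.3.5 Example 1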

val messages = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](
      ssc, kafkaParams, topicsSet)
      
messages.map(_._2).map{ event =>
  NewClickEvent.parseFrom(event)
}.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      val jedis = RedisClient.pool.getResource
      partition.foreach { event =>
        println("NewClickEvent:" + event)
        val userId = event.getUserId
        val itemId = event.getItemId
        val key = "II:" + itemId
        val value = jedis.get(key)
        if (value != null) {
          jedis.set("RUI:" + userId, value)
          print("Finish recommendation to user:" + userId)
        }
      }
      // destroy jedis object, please notice pool.returnResource is deprecated
      jedis.close()
    }
}

2.5.3.6 Example 2

val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.flatMap(_.split(" "))

// Convert RDDs of the words DStream to DataFrame and run SQL query
words.foreachRDD { (rdd: RDD[String], time: Time) =>
  // Get the singleton instance of SparkSession
  val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
  import spark.implicits._

  // Convert RDD[String] to RDD[case class] to DataFrame
  val wordsDataFrame = rdd.map(w => Record(w)).toDF()

  // Creates a temporary view using the DataFrame
  wordsDataFrame.createOrReplaceTempView("words")

  // Do word count on table using SQL and print it
  val wordCountsDataFrame =
    spark.sql("select word, count(*) as total from words group by word")
  println(s"========= $time =========")
  wordCountsDataFrame.show()
}