Spark Core in Depth, Part 3

Empty-cup

Article link: https://blog.csdn.net/qq_17310871/article/details/103283660

WordCount in detail

Sample code in spark-shell:

scala> val result = sc.textFile("file:///home/hadoop/data/wc.data").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
scala> result.toDebugString
res: String =
(2) ShuffledRDD[10] at reduceByKey at <console>:25 [Memory Deserialized 1x Replicated]
 |       CachedPartitions: 2; MemorySize: 416.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
 +-(2) MapPartitionsRDD[9] at map at <console>:25 [Memory Deserialized 1x Replicated]
    |  MapPartitionsRDD[8] at flatMap at <console>:25 [Memory Deserialized 1x Replicated]
    |  file:///home/hadoop/data/wc.data MapPartitionsRDD[1] at textFile at <console>:24 [Memory Deserialized 1x Replicated]
    |  file:///home/hadoop/data/wc.data HadoopRDD[0] at textFile at <console>:24 [Memory Deserialized 1x Replicated]

The toDebugString output above shows that the word count creates five RDDs in total:
textFile: HadoopRDD and MapPartitionsRDD
flatMap: MapPartitionsRDD
map: MapPartitionsRDD
reduceByKey: ShuffledRDD

The source of textFile is:

def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

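As the source shows, textFile delegates to hadoopFile and reads data the same way MapReduce does. Each record is a pair: pair._1 is the byte offset of the line within the file, which is rarely needed, and pair._2 is the line content, so textFile keeps only pair._2. hadoopFile produces the HadoopRDD, and the subsequent map produces the MapPartitionsRDD.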
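To make the (offset, line) pairs visible, the following is a minimal sketch that reads the same file once through hadoopFile and once through textFile. The object name TextFileSketch, the temporary sample file, and the local[1] master are assumptions made for illustration; they are not part of the original example.

```scala
import java.nio.file.Files
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object TextFileSketch {
  // Returns the (offset, line) pairs seen by hadoopFile and the lines seen by textFile.
  def run(): (Seq[(Long, String)], Seq[String]) = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[1]").setAppName("textfile-sketch"))
    try {
      // Hypothetical sample file standing in for wc.data.
      val path = Files.createTempFile("wc", ".data")
      Files.write(path, "hello spark\nhello scala\n".getBytes("UTF-8"))
      val uri = path.toUri.toString

      // Hadoop reuses its Writable instances, so copy them into plain values before collecting.
      val pairs = sc
        .hadoopFile(uri, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
        .map { case (offset, line) => (offset.get, line.toString) }
        .collect()
        .toSeq

      // textFile keeps only the second element of each pair.
      val lines = sc.textFile(uri).collect().toSeq
      (pairs, lines)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val (pairs, lines) = run()
    pairs.foreach(println)
    lines.foreach(println)
  }
}
```

The first line starts at byte 0 and the second at byte 12, because "hello spark" plus its newline occupies 12 bytes.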

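Dependencies

Narrow dependency
Each partition of the parent RDD is used by at most one partition of the child RDD.
OneToOneDependency is the typical example. Narrow dependencies are computed within a single stage.
Wide dependency <= causes a shuffle and starts a new stage
A partition of the parent RDD is used by multiple partitions of the child RDD.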
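The two kinds of dependency can be observed directly through rdd.dependencies. This is a small sketch, assuming a local master and an in-memory word list; the object name DependencySketch is hypothetical.

```scala
import org.apache.spark.{OneToOneDependency, ShuffleDependency, SparkConf, SparkContext}

object DependencySketch {
  // Returns (map has a narrow dependency, reduceByKey has a shuffle dependency).
  def run(): (Boolean, Boolean) = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[2]").setAppName("dependency-sketch"))
    try {
      val words = sc.parallelize(Seq("a", "b", "a"), 2)
      val pairs = words.map((_, 1))         // stays in the same stage
      val counts = pairs.reduceByKey(_ + _) // needs a shuffle, so a new stage begins

      val mapIsNarrow = pairs.dependencies.head.isInstanceOf[OneToOneDependency[_]]
      val reduceIsWide = counts.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]
      (mapIsNarrow, reduceIsWide)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```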

Question: why is shuffle considered an expensive operation?
Answer: it involves disk I/O, data serialization, and network I/O. (Spark also writes shuffle data to disk.)

Caching: cache and persist

cache calls persist under the hood with the MEMORY_ONLY storage level, so cache can be seen as a special case of persist. The source is:

  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def cache(): this.type = persist()
  …………
  
  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

The available storage levels are:

class StorageLevel private(
    private var _useDisk: Boolean,
    private var _useMemory: Boolean,
    private var _useOffHeap: Boolean,
    private var _deserialized: Boolean,
    private var _replication: Int = 1)
…………

object StorageLevel {
  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(true, true, true, false, 1)

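Notes:
cache and persist are both lazy: nothing is stored until the first action runs on the RDD. unpersist is eager and releases the storage as soon as it is called.
With the default deserialized levels, cached data is often considerably larger than the input data. To reduce the footprint, a serialized storage level (for example MEMORY_ONLY_SER) is usually combined with Spark's Kryo serialization.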
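The lazy and eager behaviour described in the notes can be checked with an accumulator that counts how many elements are actually computed. This is a sketch under assumptions: a local master, a tiny data set that fits in memory, and the hypothetical object name CacheSketch.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  // Returns the computed-element count before any action, after the first action,
  // after the second action, and whether unpersist reset the storage level.
  def run(): (Long, Long, Long, Boolean) = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[2]").setAppName("cache-sketch"))
    try {
      val computed = sc.longAccumulator("elements computed")
      val rdd = sc.parallelize(1 to 100, 2).map { x => computed.add(1); x * 2 }.cache()

      val beforeAction = computed.sum // cache() only marks the RDD, nothing has run yet
      rdd.count()                     // first action computes and stores the partitions
      val afterFirst = computed.sum
      rdd.count()                     // second action reads the cached partitions
      val afterSecond = computed.sum

      rdd.unpersist()                 // takes effect immediately, no action needed
      val released = rdd.getStorageLevel == StorageLevel.NONE
      (beforeAction, afterFirst, afterSecond, released)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

If the second count were to recompute the map function, the accumulator would grow past 100; with the partitions held in memory it stays unchanged.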

groupByKey and reduceByKey

Source underlying reduceByKey

  @Experimental
  def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)]

reduceByKey calls combineByKeyWithClassTag under the hood. Its mapSideCombine parameter defaults to true, so reduceByKey pre-aggregates values on the map side (a combine step) before the shuffle.

Source of groupByKey
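The relationship between reduceByKey and the three combiner functions can be illustrated with the public combineByKey method, using the identity function as createCombiner and the same reduce function for both merge steps. This is a sketch with made-up input data and a hypothetical object name CombineSketch.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CombineSketch {
  // Returns the word counts computed by reduceByKey and by an equivalent combineByKey call.
  def run(): (Map[String, Int], Map[String, Int]) = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[2]").setAppName("combine-sketch"))
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)), 2)

      val viaReduce = pairs.reduceByKey(_ + _).collect().toMap

      val viaCombine = pairs.combineByKey[Int](
        (v: Int) => v,                // createCombiner: the first value of a key is the combiner
        (c: Int, v: Int) => c + v,    // mergeValue: fold further values within a partition
        (c1: Int, c2: Int) => c1 + c2 // mergeCombiners: merge partial results across partitions
      ).collect().toMap

      (viaReduce, viaCombine)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```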

  /**
   * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
   * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
   */
  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }

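groupByKey also calls combineByKeyWithClassTag, but it explicitly passes mapSideCombine = false, so no map-side pre-aggregation takes place and every key-value pair is shuffled. In addition, as the @note says, all values for any single key must fit in memory, so a key with too many values can cause an OutOfMemoryError.

repartition and coalesce

Source of repartition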
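The difference in the mapSideCombine flag can be read back from the ShuffleDependency that each operator creates. This is a sketch for inspection only; the helper mapSideCombine and the object name MapSideCombineSketch are hypothetical, and the input data is made up.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MapSideCombineSketch {
  // Reads the mapSideCombine flag from the shuffle dependency directly behind `rdd`.
  def mapSideCombine(rdd: RDD[_]): Boolean = rdd.dependencies.head match {
    case dep: ShuffleDependency[_, _, _] => dep.mapSideCombine
    case other => sys.error(s"expected a shuffle dependency, got $other")
  }

  // Returns (flag for reduceByKey, flag for groupByKey).
  def run(): (Boolean, Boolean) = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[2]").setAppName("map-side-combine-sketch"))
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)), 2)
      (mapSideCombine(pairs.reduceByKey(_ + _)), mapSideCombine(pairs.groupByKey()))
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

For aggregations such as sums or counts, this is why reduceByKey is usually preferred over groupByKey followed by a reduction of the grouped values.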

  /**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   *
   * TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

repartition calls coalesce and explicitly passes shuffle = true, so repartition always performs a shuffle.

Source of coalesce

  /**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   *
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions. If a larger number
   * of partitions is requested, it will stay at the current number of partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * @note With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner. The optional partition coalescer
   * passed in must be serializable.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)

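The shuffle parameter of coalesce defaults to false, so coalesce does not shuffle by default. Reducing the number of partitions with coalesce requires no shuffle. Increasing the number of partitions requires shuffle = true; otherwise the call has no effect and the partition count stays the same.

cogroup and join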
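The partition counts produced by the different calls can be compared side by side. This is a minimal sketch that assumes a local master and an RDD created with 4 partitions; the object name CoalesceSketch is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceSketch {
  // Returns the partition counts after four different repartitioning calls on a 4-partition RDD.
  def run(): (Int, Int, Int, Int) = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[2]").setAppName("coalesce-sketch"))
    try {
      val rdd = sc.parallelize(1 to 100, 4)
      (
        rdd.coalesce(2).getNumPartitions,                 // shrink without a shuffle
        rdd.coalesce(8).getNumPartitions,                 // growing without a shuffle has no effect
        rdd.coalesce(8, shuffle = true).getNumPartitions, // growing with a shuffle
        rdd.repartition(8).getNumPartitions               // same as coalesce(8, shuffle = true)
      )
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run()) // prints (2,4,8,8)
}
```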

  /**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
   */
  def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues( pair =>
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
    )
  }

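join is implemented on top of cogroup. cogroup combines two RDDs into a new RDD whose values hold two Iterables per key: the first contains the values for that key in RDD1, the second the values for that key in RDD2. The inputs have to be repartitioned by the partitioner, which normally requires a shuffle. (When an input RDD already has a partitioner equal to the one used by the join, that input gets a narrow dependency instead of a wide one: the matching partitions are joined directly, without a shuffle.)
join pairs up the values of the two RDDs key by key, whereas cogroup first collects all values with the same key in each RDD into an Iterable. For example:

rdd1: val rdd1 = sc.parallelize(List((1,"A"),(2,"B")))
rdd2: val rdd2 = sc.parallelize(List((1,10),(1,20),(2,30)))
rdd1.join(rdd2) gives: (2,(B,30)), (1,(A,10)), (1,(A,20))
rdd1.cogroup(rdd2) gives: (2,(CompactBuffer(B),CompactBuffer(30))), (1,(CompactBuffer(A),CompactBuffer(10, 20)))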
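The remark about partitioners can be checked by looking at the dependencies of the CoGroupedRDD that join builds internally. The sketch below reuses the rdd1 and rdd2 data from the example above; the helper findCoGrouped and the object name JoinDependencySketch are hypothetical, and the lineage walk relies on join being implemented as cogroup followed by flatMapValues, as in the source shown earlier.

```scala
import scala.annotation.tailrec
import org.apache.spark.{HashPartitioner, OneToOneDependency, ShuffleDependency}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.{CoGroupedRDD, RDD}

object JoinDependencySketch {
  // Walks up the first-parent chain until it reaches the CoGroupedRDD behind a join.
  @tailrec
  def findCoGrouped(rdd: RDD[_]): RDD[_] =
    if (rdd.isInstanceOf[CoGroupedRDD[_]]) rdd
    else findCoGrouped(rdd.dependencies.head.rdd)

  // Returns (plain inputs are all shuffled, pre-partitioned inputs are all narrow).
  def run(): (Boolean, Boolean) = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[2]").setAppName("join-dependency-sketch"))
    try {
      val part = new HashPartitioner(2)
      val rdd1 = sc.parallelize(List((1, "A"), (2, "B")))
      val rdd2 = sc.parallelize(List((1, 10), (1, 20), (2, 30)))

      // Neither input has a partitioner, so both sides must be shuffled.
      val plain = findCoGrouped(rdd1.join(rdd2, part))
      val allWide = plain.dependencies.forall(_.isInstanceOf[ShuffleDependency[_, _, _]])

      // Both inputs are already partitioned by `part`, so the join itself adds no shuffle.
      val coPartitioned = findCoGrouped(rdd1.partitionBy(part).join(rdd2.partitionBy(part), part))
      val allNarrow = coPartitioned.dependencies.forall(_.isInstanceOf[OneToOneDependency[_]])

      (allWide, allNarrow)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Note that partitionBy itself shuffles; the benefit appears when the partitioned RDD is persisted and joined repeatedly.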

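Tips

To inspect and debug an RDD, use rdd.toDebugString
To get the number of partitions of an RDD:
rdd.partitions.size
To see the elements in each partition of an RDD, there are two options:
1) rdd.glom.collect
2) rdd.mapPartitionsWithIndex((index, partition) => {
  partition.map(x => s"partition $index, element $x")
}).collect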
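For reference, here is what these calls return on a small RDD. This is a sketch assuming a local master and the numbers 1 to 6 split into 3 partitions; the object name PartitionInspectSketch is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionInspectSketch {
  // Returns the partition count, the elements grouped per partition, and the tagged elements.
  def run(): (Int, Seq[Seq[Int]], Seq[String]) = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[2]").setAppName("partition-inspect-sketch"))
    try {
      val rdd = sc.parallelize(1 to 6, 3)

      val numPartitions = rdd.partitions.length

      // glom turns every partition into one array.
      val perPartition = rdd.glom().collect().map(_.toSeq).toSeq

      // mapPartitionsWithIndex tags each element with the index of its partition.
      val tagged = rdd
        .mapPartitionsWithIndex((index, partition) =>
          partition.map(x => s"partition $index, element $x"))
        .collect()
        .toSeq

      (numPartitions, perPartition, tagged)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val (n, perPartition, tagged) = run()
    println(n)
    perPartition.foreach(println)
    tagged.foreach(println)
  }
}
```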
