Spark Core In Depth, Part 3

WordCount Explained

Example in spark-shell:

scala> val result = sc.textFile("file:///home/hadoop/data/wc.data").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
scala> result.toDebugString
res: String =
(2) ShuffledRDD[10] at reduceByKey at <console>:25 [Memory Deserialized 1x Replicated]
 |       CachedPartitions: 2; MemorySize: 416.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
 +-(2) MapPartitionsRDD[9] at map at <console>:25 [Memory Deserialized 1x Replicated]
    |  MapPartitionsRDD[8] at flatMap at <console>:25 [Memory Deserialized 1x Replicated]
    |  file:///home/hadoop/data/wc.data MapPartitionsRDD[1] at textFile at <console>:24 [Memory Deserialized 1x Replicated]
    |  file:///home/hadoop/data/wc.data HadoopRDD[0] at textFile at <console>:24 [Memory Deserialized 1x Replicated]

From the toDebugString output above you can see that the word count produces 5 RDDs in total:
textFile: HadoopRDD and MapPartitionsRDD
flatMap: MapPartitionsRDD
map: MapPartitionsRDD
reduceByKey: ShuffledRDD

The source of textFile is as follows:

def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

As you can see, textFile calls hadoopFile under the hood and reads data the same way MapReduce does: only pair._2 (the line content) is kept, while pair._1 is the byte offset of the line within the file and is usually not needed. hadoopFile produces the HadoopRDD, and the subsequent map produces the MapPartitionsRDD.
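
If you are curious about the discarded key, here is a minimal spark-shell sketch (using the same sample file as above) that calls hadoopFile directly and keeps both halves of the pair; the Writable values are converted to plain types before collecting, since Hadoop Writables are not Java-serializable:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// The raw (byte offset, line) pairs whose key textFile discards.
val raw = sc.hadoopFile("file:///home/hadoop/data/wc.data",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
raw.map { case (offset, line) => (offset.get, line.toString) }.collect()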

Dependencies

  • Narrow dependency
    Each partition of the parent RDD is used by at most one partition of the child RDD.
    This corresponds to OneToOneDependency; narrow dependencies are all computed within a single stage.
  • Wide dependency <= triggers a shuffle and starts a new stage
    A partition of the parent RDD is used by multiple partitions of the child RDD (see the sketch after this list).
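
A quick way to see which kind of dependency a transformation creates is to inspect rdd.dependencies; a minimal sketch on a small throwaway RDD:

val pairs = sc.parallelize(1 to 10, 2).map(x => (x % 3, x))

pairs.dependencies                      // OneToOneDependency -> narrow, stays in the same stage
pairs.reduceByKey(_ + _).dependencies   // ShuffleDependency  -> wide, starts a new stage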

Question: why is a shuffle considered an expensive operation?
Answer: it involves disk I/O, data serialization, and network I/O. (Spark's shuffle data is also written to disk.)

Caching: cache and persist

cache calls persist under the hood with storage level MEMORY_ONLY, so cache can be regarded as a special case of persist. Source:

  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def cache(): this.type = persist()
  …………
  
  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

The available storage levels are:

class StorageLevel private(
    private var _useDisk: Boolean,
    private var _useMemory: Boolean,
    private var _useOffHeap: Boolean,
    private var _deserialized: Boolean,
    private var _replication: Int = 1)
…………

object StorageLevel {
  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(true, true, true, false, 1)

Note
cache and persist are both lazy: nothing is stored until the first action runs. unpersist, by contrast, is eager: the cached blocks are released as soon as it is called.
Caching is usually combined with Spark's Kryo serialization; otherwise the cached data can take up considerably more space than the input data.
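
A minimal spark-shell sketch of the lazy/eager behavior described above, using the same sample file as earlier:

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("file:///home/hadoop/data/wc.data")
rdd.persist(StorageLevel.MEMORY_ONLY_SER)   // lazy: nothing is cached yet
rdd.count()                                 // the first action materializes the cache
rdd.count()                                 // now served from the cache
rdd.unpersist()                             // eager: the cached blocks are released immediately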

groupByKey and reduceByKey

  • reduceByKey under the hood (the signature of the combineByKeyWithClassTag it calls)
  @Experimental
  def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] 

As you can see, reduceByKey calls combineByKeyWithClassTag, whose mapSideCombine parameter defaults to true, so a combiner-style pre-aggregation is performed on the map side by default.
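
To make the three function parameters concrete, here is a word-count sketch that calls combineByKeyWithClassTag directly with the kind of arguments reduceByKey supplies internally (same sample file as above):

val pairs = sc.textFile("file:///home/hadoop/data/wc.data").flatMap(_.split(" ")).map((_, 1))

val counts = pairs.combineByKeyWithClassTag[Int](
  (v: Int) => v,                  // createCombiner: first value seen for a key in a partition
  (c: Int, v: Int) => c + v,      // mergeValue: fold further values in on the map side
  (c1: Int, c2: Int) => c1 + c2   // mergeCombiners: merge partial sums after the shuffle
)
counts.collect()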

  • groupByKey under the hood
   * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
   * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
   */
  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }

As you can see, groupByKey also calls combineByKeyWithClassTag, but it explicitly passes mapSideCombine = false, so there is no map-side pre-aggregation. Moreover, as the comment notes, groupByKey shuffles all of the data and must hold every value for a key in memory, which can easily lead to an OutOfMemoryError.
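
The practical difference, sketched against the same sample file: both jobs produce the same word counts, but reduceByKey combines values before the shuffle while groupByKey ships every single (word, 1) pair across the network:

val words = sc.textFile("file:///home/hadoop/data/wc.data").flatMap(_.split(" ")).map((_, 1))

// Pre-aggregates on the map side, then merges the partial sums after the shuffle.
val byReduce = words.reduceByKey(_ + _)

// Shuffles every pair unchanged, then sums the grouped values.
val byGroup = words.groupByKey().mapValues(_.sum)

byReduce.collect()   // same result as byGroup.collect(), with far less shuffle traffic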

repartition and coalesce

  • repartition source
  /**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   *
   * TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

As you can see, repartition calls coalesce with shuffle = true, so repartition is a shuffle operator.
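
A quick confirmation in spark-shell: the lineage of a repartitioned RDD contains a ShuffledRDD (a small throwaway RDD is used here):

val rdd = sc.parallelize(1 to 100, 2)
rdd.repartition(4).toDebugString   // the output includes a ShuffledRDD, i.e. a shuffle happens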

  • coalesce source
 /**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   *
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions. If a larger number
   * of partitions is requested, it will stay at the current number of partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * @note With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner. The optional partition coalescer
   * passed in must be serializable.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)

As you can see, coalesce defaults shuffle to false, so it does not shuffle by default. Reducing the number of partitions with coalesce therefore involves no shuffle; increasing the number of partitions requires shuffle = true, otherwise the call has no effect and the partition count stays the same.
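
A minimal sketch of both behaviors, checking the resulting partition counts on a small throwaway RDD:

val rdd = sc.parallelize(1 to 100, 10)

rdd.coalesce(2).partitions.size                    // 2  -- shrinking: narrow dependency, no shuffle
rdd.coalesce(20).partitions.size                   // 10 -- growing without shuffle has no effect
rdd.coalesce(20, shuffle = true).partitions.size   // 20
rdd.repartition(20).partitions.size                // 20 -- same as coalesce(20, shuffle = true)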

cogroup and join

  /**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
   */
  def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues( pair =>
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
    )
  }

The join operator is built on top of cogroup. cogroup combines two RDDs into a new RDD whose value is a pair of Iterables: the first holds the values from RDD1 for a given key, the second holds the values from RDD2 for the same key. The data is repartitioned with the supplied Partitioner, so the operation normally requires a shuffle. (When the two RDDs being joined share the same partitioner object, the cogroup creates narrow dependencies instead of wide ones: each partition only has to be joined with its corresponding partition, so no shuffle is needed.)
join merges the two datasets by key and emits one (k, (v1, v2)) pair per match, whereas cogroup first gathers all values with the same key into an Iterable. For example:

rdd1: val rdd1 = sc.parallelize(List((1,"A"),(2,"B")))
rdd2: val rdd2 = sc.parallelize(List((1,10),(1,20),(2,30)))
rdd1.join(rdd2) yields: (2,(B,30)), (1,(A,10)), (1,(A,20))
rdd1.cogroup(rdd2) yields: (2,(CompactBuffer(B),CompactBuffer(30))), (1,(CompactBuffer(A),CompactBuffer(10, 20)))
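
And a small sketch of the co-partitioning remark above, assuming both sides are pre-partitioned with the same partitioner object:

import org.apache.spark.HashPartitioner

val p  = new HashPartitioner(4)
val r1 = sc.parallelize(List((1, "A"), (2, "B"))).partitionBy(p)
val r2 = sc.parallelize(List((1, 10), (1, 20), (2, 30))).partitionBy(p)

// cogroup wraps a CoGroupedRDD in a mapValues step, so drill one level down
// to inspect its dependencies on r1 and r2.
val cogrouped = r1.cogroup(r2, p)
cogrouped.dependencies.head.rdd.dependencies
// Both sides share the partitioner p, so both dependencies are OneToOneDependency (narrow).
// Without the shared partitioner, the same check would show a ShuffleDependency.
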
Tips
  1. To inspect and debug an RDD lineage, use: rdd.toDebugString
  2. To check how many partitions an RDD has:
    rdd.partitions.size
  3. To look at the elements in each partition, two options:
    1) rdd.glom.collect
    2) rdd.mapPartitionsWithIndex((index, partition) => {
        partition.map(x => s"partition $index, element $x")
        }).collect