Big Data in Practice, Lesson 16 (Part 1) - Spark-Core04

1. Review of the Previous Lesson

2. Shuffle Deep Dive

3. Shuffle Operations in spark-shell

4. Extension: the aggregateByKey Operator

1. Review of the Previous Lesson

Big Data in Practice, Lesson 15 (Part 1) - Spark-Core03:
https://blog.csdn.net/zhikanjiani/article/details/91045640#id_4.2

  1. Definitions of narrow and wide dependencies, and what they mean for fault tolerance
  2. Spark on YARN (client and cluster modes)
  3. Key-value programming

YARN and HADOOP_CONF_DIR

Question: for YARN mode, do we need to edit the slaves file under $SPARK_HOME/conf and change localhost to hadoop002?

When running on YARN, this machine only needs to act as a client; that is why Spark on YARN is said to need nothing more than a client.

Question: does Spark on YARN require starting any of these?

The scripts under $SPARK_HOME/sbin: start-all.sh, start-master.sh, start-slaves.sh (plus the slaves file).

People often rush to start up all the Spark standalone daemons before running Spark on YARN.

In fact a gateway plus spark-submit is enough; there is no need to start any Spark daemons at all.

2. Shuffle Deep Dive

2.1 Shuffle Overview

  • Recap: an action triggers a job; a job is split into a new stage at every shuffle, and each stage consists of a batch of tasks.

See the official docs: http://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations

Requirement

  1. You are given a pile of call records ==> count how many calls were dialed out this month.
    Open the phone's call screen: contact, call time, call duration, call history.

  2. In Spark this kind of statistic is just the word-count pattern: emit (date + direction, 1), use date + direction as the key, and run reduceByKey() (a sketch follows below).

  3. Records with the same date + direction key must be shuffled to the same reducer ==> could you accumulate them without doing that? No, you could not.

This leads to the shuffle: data sharing a particular characteristic is gathered onto one node for computation, in this case the +1 accumulation.
Note: avoid operations that cause a shuffle whenever you can.
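
The requirement above boils down to the word-count pattern. A minimal sketch, assuming a made-up (date, direction) record layout; the data, field names, and object name are illustrative only:

import org.apache.spark.{SparkConf, SparkContext}

object CallRecordCountApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CallRecordCountApp").setMaster("local[2]"))

    // Hypothetical call records: (date, direction), where "OUT" marks an outgoing call.
    val records = sc.parallelize(Seq(
      ("2019-06-01", "OUT"), ("2019-06-01", "OUT"),
      ("2019-06-02", "IN"),  ("2019-06-02", "OUT")
    ))

    // Word-count pattern: keep outgoing calls, key by (date, direction), emit 1,
    // then reduceByKey -- the reduceByKey step is exactly where the shuffle happens.
    records.filter(_._2 == "OUT")
      .map(r => (r, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)

    sc.stop()
  }
}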

  • Shuffle operations
    Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark's mechanism for re-distributing data so that it's grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.

In other words: the shuffle re-distributes data across partitions, which typically means copying it between executors and machines, incurring disk I/O and network I/O; that is what makes the shuffle a complex and expensive operation.


2.2 Shuffle Background

  1. To understand what happens during the shuffle we can consider the example of the reduceByKey operation.
  • We use reduceByKey as the example to understand what happens during a shuffle.
  2. The reduceByKey operation generates a new RDD where all values for a single key are combined into a tuple -
  • reduceByKey generates a new RDD in which all the values for a single key are combined into a tuple:
  3. the key and the result of executing a reduce function against all values associated with that key.
  • the key together with the result of running the reduce function over all the values associated with that key.
  4. The challenge is that not all values for a single key necessarily reside on the same partition, or even the same machine, but they must be co-located to compute the result.
  • The challenge: the values for a single key are not necessarily stored in the same partition, or even on the same machine, yet they must be co-located to compute the result.
  5. Operations which can cause a shuffle include repartition operations like repartition and coalesce, 'ByKey' operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.
  • So which operations can cause a shuffle? Repartition operations (repartition, coalesce), 'ByKey' operations except counting (groupByKey, reduceByKey), and join operations (cogroup, join); a lineage sketch follows this list.
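
To see for yourself which operations introduce a shuffle, you can print an RDD's lineage in spark-shell; a rough sketch with made-up data (toDebugString marks shuffle boundaries with a ShuffledRDD):

// Inspect the lineage: a ShuffledRDD in the output marks a shuffle, i.e. a stage boundary.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// reduceByKey is a 'ByKey' operation, so its lineage contains a ShuffledRDD.
println(pairs.reduceByKey(_ + _).toDebugString)

// A plain map keeps narrow dependencies, so no ShuffledRDD appears in its lineage.
println(pairs.map(x => (x._1, x._2 * 2)).toDebugString)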

2.3 Shuffle Performance Impact

  1. The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark's map and reduce operations.
  • Spark generates a series of tasks: a job breaks into stages, every shuffle introduces a new stage, and each stage spawns a batch of tasks (map tasks organize the data, reduce tasks aggregate it).
  2. Internally, results from individual map tasks are kept in memory until they can't fit; these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.
  • Internally, each map task keeps its results in memory until they no longer fit, then sorts them by target partition and writes them to a single file; the reduce-side tasks read back the relevant sorted blocks.

3. Shuffle Operations in spark-shell

1. Start spark-shell:

scala> val info = sc.textFile("hdfs://hadoop002:9000/wordcount/input/ruozeinput.txt")
info: org.apache.spark.rdd.RDD[String] = hdfs://hadoop002:9000/wordcount/input/ruozeinput.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> info.partitions.length
res0: Int = 2

scala> val info1 = info.coalesce(1)
info1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[2] at coalesce at <console>:25

scala> info1.partitions.length
res1: Int = 1

scala> val info2 = info.coalesce(4)
info2: org.apache.spark.rdd.RDD[String] = CoalescedRDD[3] at coalesce at <console>:25

scala> info2.partitions.length
res2: Int = 2

scala> val info3 = info.coalesce(4,true)
info3: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at coalesce at <console>:25

scala> info3.partitions.length
res3: Int = 4

scala> info3.collect
res4: Array[String] = Array(hello       world, hello, hello     world   john)   

Explanation of the coalesce method:

  1. def coalesce(numPartitions: Int, shuffle: Boolean = false,
         partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
        (implicit ord: Ordering[T] = null): RDD[T]
  • It takes a target number of partitions plus an optional shuffle flag (true or false); the flag defaults to false.
  2. def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
       coalesce(numPartitions, shuffle = true)
     }
  • repartition simply calls coalesce with shuffle = true, so it always shuffles.
  3. Trigger the computation with collect:
  • scala> info3.collect
    res4: Array[String] = Array(hello world, hello, hello world john)
  4. Use repartition:
  • scala> val info4 = info.repartition(5)
    info4: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at repartition at <console>:25

  • scala> info4.collect
    res6: Array[String] = Array(hello world john, hello world, hello)

  • scala> info.partitions.length
    res7: Int = 2

Going from 2 partitions to 5 means redistributing the data, which requires a shuffle (this is why coalesce(4) without the shuffle flag stayed at 2 partitions above); when you only need to reduce the number of partitions, use coalesce so you can avoid the shuffle.

3.1 Grouping in IDEA:

package spark01

import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ListBuffer


object RepartitionApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    sparkConf.setAppName("LogApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    val students = sc.parallelize(List("黄帆","梅宇豪","秦朗","杨朝珅","王乾","沈兆乘","沈其文","陈思文"),3)
    students.mapPartitionsWithIndex((index, partition) => {
      val stus = new ListBuffer[String]
      while (partition.hasNext) {                        // iterate over this partition
        stus += ("~~~~" + partition.next() + ", group: " + (index + 1))
      }
      stus.iterator
    }).foreach(println)                                  // print each element

    sc.stop()
  }

}
mapPartitionsWithIndex(): iterate partition by partition and tag each record with its group number (the partition index + 1).
The parallelism is set in parallelize, so there are explicitly 3 groups.

Requirement 1:

The department downsizes and three groups become two; make the following change:

  • students.mapPartitionsWithIndex((index,partition) ==>
    change it to:
    students.coalesce(2).mapPartitionsWithIndex((index,partition)

Requirement 2:

Before the layoffs there were three groups; regroup them into 5 groups:
students.repartition(5).mapPartitionsWithIndex((index,partition)

To show the partitions and the repartition operation more directly, you can run the following code:

package Sparkcore04

import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ListBuffer

object RepartitionApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    sparkConf.setAppName("LogApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    val students = sc.parallelize(List("梅宇豪","黄帆","杨超神","薛思雨","朱昱璇","周一虹","王晓岚","沈兆乘","陈思文"), 3)
    // original grouping: 3 partitions, tag each student with the partition index
    students.mapPartitionsWithIndex((index, partition) => {
      val stus = new ListBuffer[String]
      while (partition.hasNext) {
        stus += ("~~~~" + partition.next() + ", group: " + (index + 1))
      }
      stus.iterator
    }).foreach(println)

    println("---------------------------separator---------------------------")
    // after repartition(4): the data is shuffled into 4 new groups
    students.repartition(4).mapPartitionsWithIndex((index, partition) => {
      val stus = new ListBuffer[String]
      while (partition.hasNext) {
        stus += ("~~~" + partition.next() + ", new group: " + (index + 1))
      }
      stus.iterator
    }).foreach(println)
    sc.stop()
  }

}

3.2 Using coalesce and repartition in Production:

  1. Suppose an RDD has 300 partitions and every partition holds only one record, "id=100".

  2. After a filter (id > 99) the result still has 300 partitions, each containing a single record.

Now change the starting conditions:

  1. Originally 300 partitions with 100,000 records each; after a highly selective filter (id > 99), each output file ends up with only one record.
  2. If you then call coalesce(1) to shrink the partitions, the small-file problem is greatly reduced: the number of partitions determines the number of output files (see the sketch after this list).
  • Use case for repartition: scatter the data to raise the degree of parallelism.
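
A rough spark-shell sketch of the small-file scenario above (the numbers and the output path are made up): the filter keeps very few records per partition, and coalesce(1) collapses the partitions before the write without triggering a shuffle.

val big = sc.parallelize(1 to 3000000, 300)      // 300 partitions
val filtered = big.filter(_ > 2999999)           // very few records survive the filter

// Without coalesce this write produces 300 files, most of them empty.
// coalesce(1) merges the partitions (no shuffle), so only one output file is written.
filtered.coalesce(1).saveAsTextFile("hdfs://hadoop002:9000/tmp/filtered_output")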

3.3 Analysis of reduceByKey and groupByKey

1. Hand-write a word count:

In SecureCRT, start spark-shell --master local[2]
Run: sc.textFile("file:///home/hadoop/data/ruozeinput.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).collect
Look at the DAG: the first operator is textFile, the second flatMap, the third map; when reduceByKey is reached the job is split in two, one stage before it and one after.

[Figure: DAG of the word-count job]
Two stages: for reduceByKey, the (_, 1) pairs are first written out on the map side and then read back in on the reduce side.
The data type produced by reduceByKey is RDD[(String, Int)]: each word and the number of times it appears.

2. Data types returned by reduceByKey and groupByKey:

  • scala> sc.textFile("file:///home/hadoop/data/ruozeinput.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_)
    res4: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[9] at reduceByKey at <console>:25

  • scala> sc.textFile("file:///home/hadoop/data/ruozeinput.txt").flatMap(_.split("\t")).map((_,1)).groupByKey()
    res5: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[14] at groupByKey at <console>:25

Word count with reduceByKey:

  1. scala> sc.textFile("file:///home/hadoop/data/ruozeinput.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).collect
    res10: Array[(String, Int)] = Array((hello,3), (world,2), (john,1))

[Figure: Spark UI for the reduceByKey job]

Word count with groupByKey:

  1. scala> sc.textFile("file:///home/hadoop/data/ruozeinput.txt").flatMap(_.split("\t")).map((_,1)).groupByKey().map(x => (x._1, x._2.sum)).collect
    res11: Array[(String, Int)] = Array((hello,3), (world,2), (john,1))

[Figure: Spark UI for the groupByKey job]

Summary:

Compare the two Spark UI screenshots above: reduceByKey reads 53 B of input and shuffles 161 B, while groupByKey also reads 53 B of input but shuffles 172 B.

  1. With groupByKey, all of the data is shuffled without any pre-aggregation.

  2. reduceByKey performs local aggregation first (a map-side combiner); only the combined results go through the shuffle, so less data is shuffled.

3.4 Illustrating the Shuffle in reduceByKey and groupByKey

Suppose three map tasks produce the following data: map 1: (a,1)(b,1); map 2: (a,1)(b,1) (a,1)(b,1); map 3: (a,1)(b,1) (a,1)(b,1) (a,1)(b,1).

The shuffle in groupByKey:

[Figure: groupByKey shuffle, every pair is sent across the network as-is]

The shuffle in reduceByKey:

[Figure: reduceByKey shuffle, values are combined on the map side before the transfer]
Why does reduceByKey shuffle less data? Because it aggregates on the map side first, which reduces the amount of data that has to be shuffled.

4. Extension: the aggregateByKey Operator

Some problems cannot be solved with reduceByKey alone, which motivates a new operator; a sketch follows.
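
For example, computing a per-key average needs an accumulator whose type differs from the value type, which reduceByKey cannot express. A minimal aggregateByKey sketch in spark-shell, with made-up data:

val scores = sc.parallelize(Seq(("a", 3), ("b", 5), ("a", 7), ("b", 1)), 2)

// zero value (0, 0) = (sum, count)
val sumAndCount = scores.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),        // seqOp: merge a value inside a partition (map side)
  (a, b)   => (a._1 + b._1, a._2 + b._2)       // combOp: merge accumulators across partitions (reduce side)
)

sumAndCount.mapValues { case (sum, cnt) => sum.toDouble / cnt }
  .collect()
  .foreach(println)                            // per-key averages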

There are no secrets in front of the source code:

The groupByKey source:

The groupByKey method defined in PairRDDFunctions.scala:

  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }

Note that groupByKey explicitly passes mapSideCombine = false, so it never combines on the map side.

The reduceByKey source (it delegates to combineByKeyWithClassTag):

  def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0

Note that mapSideCombine defaults to true here, so reduceByKey combines on the map side.
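
As a quick illustration of the three functions in that signature, here is a small spark-shell sketch (made-up data) using the public combineByKey wrapper; it builds a per-key list, which is roughly what groupByKey produces:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)

val grouped = pairs.combineByKey(
  (v: Int) => List(v),                         // createCombiner: first value seen for a key in a partition
  (buf: List[Int], v: Int) => v :: buf,        // mergeValue: fold another value into the combiner
  (b1: List[Int], b2: List[Int]) => b1 ::: b2  // mergeCombiners: merge combiners from different partitions
)

grouped.collect().foreach(println)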

4.1 collectAsMap

Note: all of the data is pulled into the driver's memory, so a large result can overwhelm the driver and crash it.

  /**
   * Return the key-value pairs in this RDD to the master as a Map.
   *
   * Warning: this doesn't return a multimap (so if you have multiple values to the same key, only
   *          one value per key is preserved in the map returned)
   *
   * @note this method should only be used if the resulting data is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collectAsMap(): Map[K, V] = self.withScope {
    val data = self.collect()
    val map = new mutable.HashMap[K, V]
    map.sizeHint(data.length)
    data.foreach { pair => map.put(pair._1, pair._2) }
    map
  }
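
A small usage sketch in spark-shell (made-up data), showing the duplicate-key behaviour the Warning above describes:

val kv = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Everything is collected to the driver, so only use this on small results.
val m: scala.collection.Map[String, Int] = kv.collectAsMap()
println(m)   // only one value is kept per key, so "a" maps to a single value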

In RDD.scala:

Remember: whenever you see runJob in the source, that method is an action and will trigger a job.

 /**
   * Return the number of elements in the RDD.
   */
  def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
  /**
   * Return an array that contains all of the elements in this RDD.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

Array.concat(results: _*) ==> the : _* here is the call-site expansion of a sequence into varargs, not the varargs definition itself.
Click into concat to see the next level of source:
def concat[T: ClassTag](xss: Array[T]*): Array[T]   // this is where the varargs parameter is actually declared

This also came up in the Scala04 lesson:
println(sum(1.to(10): _*))
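
A minimal varargs sketch, assuming a hypothetical sum helper like the one used in that lesson:

def sum(xs: Int*): Int = xs.sum        // Int* declares the varargs parameter

println(sum(1, 2, 3))                  // pass individual values
println(sum(1.to(10): _*))             // : _* expands a sequence into varargs at the call site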
