- 3.1 Grouping in IDEA
- 3.2 Using coalesce and repartition in production
- 3.3 Analysis of reduceByKey and groupByKey
- 3.4 The shuffle process of reduceByKey and groupByKey, illustrated
- 3.5 Exploring the combiner in the reduceByKey and groupByKey source code
1. Review of the Previous Class
Big Data in Practice, Lesson 15 (Part 1): Spark-Core03:
https://blog.csdn.net/zhikanjiani/article/details/91045640#id_4.2
- Definition of wide and narrow dependencies, and what they mean for fault tolerance
- Spark on YARN (client and cluster modes)
- Key-value programming
YARN and HADOOP_CONF_DIR
Question: for YARN mode, do we need to edit the slaves file under $SPARK_HOME/conf and change localhost to hadoop002?
No. When running on YARN, this machine only has to act as a client, which is why Spark on YARN is said to need nothing more than a client.
Question: does Spark on YARN require starting any of the following?
$SPARK_HOME/sbin/start-all.sh
start-master.sh, start-slaves.sh, slaves
A common reflex is to rush to start all the Spark standalone daemons before running Spark on YARN.
In fact a gateway machine plus spark-submit is all you need; no Spark daemons have to be started at all.
2. Shuffle Analysis
2.1 Shuffle Overview
- Review: an action triggers a job; when a job hits a shuffle it is split into a new stage, and each stage consists of a set of tasks.
See the official docs: http://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
Requirement:
- Given a pile of call records, count how many outgoing calls were made this month.
- The phone's call screen shows: contact, call time, call duration, and the call record itself.
- Statistical analysis in Spark boils down to word count: emit (day + direction, 1), use day + direction as the key, and run reduceByKey() on it (see the sketch after this list).
- Records with the same day + direction key must be shuffled to the same reducer; without that co-location the accumulation cannot be performed.
- This is exactly what a shuffle is: data sharing a particular characteristic is gathered onto one node for computation, here the +1 accumulation.
- Note: avoid shuffle operations whenever you can.
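A minimal sketch of this requirement (the records RDD, its (day, direction) layout, and the dates are made up for illustration; the lesson does not give the real call-record schema):

val records = sc.parallelize(List(
  ("2019-06-01", "out"), ("2019-06-01", "out"), ("2019-06-02", "in")))
val outgoingPerDay = records
  .filter { case (_, direction) => direction == "out" }   // keep only outgoing calls
  .map { case (day, _) => (day, 1) }                      // (day, 1), the same shape as word count
  .reduceByKey(_ + _)                                     // same key -> same reducer -> accumulate
outgoingPerDay.collect()                                  // Array((2019-06-01,2))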
- Shuffle operations
Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark's mechanism for re-distributing data so that it's grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
In other words: redistributing data across partitions typically means copying it between executors and machines, with disk I/O and network I/O, which is why the shuffle is a complex and expensive operation.
2.2 Shuffle Background
- To understand what happens during the shuffle we can consider the example of the reduceByKey operation.
- The reduceByKey operation generates a new RDD where all values for a single key are combined into a tuple - the key and the result of executing a reduce function against all values associated with that key.
- In other words, values sharing the same key are sent to the same reducer and reduced there.
- The challenge is that not all values for a single key necessarily reside on the same partition, or even the same machine, but they must be co-located to compute the result.
- Not every value for a key is stored in the same partition, or even on the same machine; the challenge is that they must be brought together in one place before the result can be computed.
- Operations which can cause a shuffle include repartition operations like repartition and coalesce, 'ByKey' operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.
- So which operations may produce a shuffle? Repartition operations (repartition, coalesce), 'ByKey' operations (groupByKey, reduceByKey), and join operations (cogroup, join).
2.3 Shuffle Performance Impact
- The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark's map and reduce operations.
- Spark generates sets of tasks ==> Spark produces a series of stages; each shuffle creates a new stage, and each stage produces a set of tasks.
- Internally, results from individual map tasks are kept in memory until they can't fit; these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.
- Essentially, each map task keeps its results in memory until they no longer fit, sorts them by target partition, and writes them to a single file; the reduce side then reads the relevant sorted blocks that the map side wrote.
3. Shuffle Operations in spark-shell
1. Start spark-shell:
scala> val info = sc.textFile("hdfs://hadoop002:9000/wordcount/input/ruozeinput.txt")
info: org.apache.spark.rdd.RDD[String] = hdfs://hadoop002:9000/wordcount/input/ruozeinput.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> info.partitions.length
res0: Int = 2
scala> val info1 = info.coalesce(1)
info1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[2] at coalesce at <console>:25
scala> info1.partitions.length
res1: Int = 1
scala> val info2 = info.coalesce(4)
info2: org.apache.spark.rdd.RDD[String] = CoalescedRDD[3] at coalesce at <console>:25
scala> info2.partitions.length
res2: Int = 2
scala> val info3 = info.coalesce(4,true)
info3: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at coalesce at <console>:25
scala> info3.partitions.length
res3: Int = 4
scala> info3.collect
res4: Array[String] = Array(hello world, hello, hello world john)
Explanation of the coalesce method:
- def coalesce(numPartitions: Int, shuffle: Boolean = false,
partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
(implicit ord: Ordering[T] = null)
- It takes a target number of partitions and an optional shuffle flag (true or false), which defaults to false.
- def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}
- repartition just calls coalesce with shuffle = true, so it always goes through a shuffle.
- Trigger it with a collect action:
- scala> info3.collect
res4: Array[String] = Array(hello world, hello, hello world john)
- Using repartition:
scala> val info4 = info.repartition(5)
info4: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at repartition at <console>:25
scala> info4.collect
res6: Array[String] = Array(hello world john, hello world, hello)
scala> info.partitions.length
res7: Int = 2
Going from 2 partitions to 5 means the data has to be redistributed, i.e. shuffled; when you only need to reduce the number of partitions, use coalesce so that the shuffle can be avoided.
3.1 Grouping in IDEA:
package spark01
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer
object RepartitionApp {
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf()
sparkConf.setAppName("LogApp").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val students = sc.parallelize(List("黄帆","梅宇豪","秦朗","杨朝珅","王乾","沈兆乘","沈其文","陈思文"),3)
students.mapPartitionsWithIndex((index,partition) => {
val stus = new ListBuffer[String]
while(partition.hasNext){ // iterate over this partition
stus += ("~~~~" + partition.next() + ",哪个组:" + (index+1))
}
stus.iterator
}).foreach(println) // print the results
sc.stop()
}
}
mapPartitionsWithIndex(): process the data partition by partition and attach a group number (the partition index).
The parallelism is set in parallelize, so there are explicitly 3 groups.
Requirement 1:
The department downsizes and three groups become two; change the code as follows:
- students.mapPartitionsWithIndex((index,partition) ==>
becomes:
students.coalesce(2).mapPartitionsWithIndex((index,partition)
Requirement 2:
There were three groups before the downsizing; regroup them into 5 groups:
students.repartition(5).mapPartitionsWithIndex((index,partition)
To show the effect of the partitioning and the repartition operation intuitively,
you can run the following code:
package Sparkcore04
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer
object RepartitionApp {
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf();
sparkConf.setAppName("LogApp").setMaster("local[2]");
val sc = new SparkContext(sparkConf);
val students = sc.parallelize(List("梅宇豪","黄帆","杨超神","薛思雨","朱昱璇","周一虹","王晓岚","沈兆乘","陈思文"),3);
students.mapPartitionsWithIndex((index,partition) =>
{
val stus = new ListBuffer[String]
while(partition.hasNext)
{
stus += ("~~~~" + partition.next() + ",哪个组:" + (index+1))
}
stus.iterator
}).foreach(println)
println("---------------------------分割线---------------------------")
students.repartition(4).mapPartitionsWithIndex((index,partition) => {
val stus = new ListBuffer[String]
while(partition.hasNext) {
stus += ("~~~" + partition.next() + ",新组" + (index+1))
}
stus.iterator
}).foreach(println)
sc.stop()
}
}
3.2 Using coalesce and repartition in production:
- Suppose an RDD has 300 partitions and each partition holds only one record, "id=100".
- Apply a filter (id > 99): the result still has 300 partitions, each with a single record.
Change the starting conditions:
- Originally 300 partitions with 100,000 records each; after the same filter (id > 99), each output file still ends up with only one record.
- Calling coalesce(1) at this point shrinks the partitions and greatly relieves the small-file problem, since the number of partitions determines the number of output files (see the sketch after this list).
- repartition use case: spread the data back out to increase parallelism.
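A sketch of the small-file scenario above, assuming hypothetical HDFS paths and a made-up filter condition:

val raw = sc.textFile("hdfs://hadoop002:9000/logs/input")   // suppose this yields ~300 partitions
val filtered = raw.filter(_.contains("id=100"))             // very selective filter: most partitions become nearly empty
filtered
  .coalesce(1)                                              // shrink partitions without a shuffle => 1 output file instead of 300
  .saveAsTextFile("hdfs://hadoop002:9000/logs/output")
// If the goal were more parallelism instead, repartition(n) (which shuffles) spreads the data back out evenly.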
3.3 Analysis of reduceByKey and groupByKey
1. Hand-write a word count:
In SecureCRT, start spark-shell --master local[2]
Run: sc.textFile("file:///home/hadoop/data/ruozeinput.txt").flatMap(_.split("\t")).map((_, 1)).reduceByKey(_ + _).collect
Look at the DAG: the first operator is textFile, the second flatMap, the third map; when reduceByKey is reached, the job is split into one stage before it and one stage after it.
Two stages: when reduceByKey runs, the (_, 1) data is first written out and then read back in.
The result type of reduceByKey is (String, Int): the word and the number of times it occurs.
2. The result types of reduceByKey and groupByKey:
scala> sc.textFile("file:///home/hadoop/data/ruozeinput.txt").flatMap(_.split("\t")).map((_, 1)).reduceByKey(_ + _)
res4: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[9] at reduceByKey at <console>:25

scala> sc.textFile("file:///home/hadoop/data/ruozeinput.txt").flatMap(_.split("\t")).map((_, 1)).groupByKey()
res5: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[14] at groupByKey at <console>:25
Word count with reduceByKey:
- scala> sc.textFile("file:///home/hadoop/data/ruozeinput.txt").flatMap(_.split("\t")).map((_, 1)).reduceByKey(_ + _).collect
res10: Array[(String, Int)] = Array((hello,3), (world,2), (john,1))
Word count with groupByKey:
- scala> sc.textFile("file:///home/hadoop/data/ruozeinput.txt").flatMap(_.split("\t")).map((_, 1)).groupByKey().map(x => (x._1, x._2.sum)).collect
res11: Array[(String, Int)] = Array((hello,3), (world,2), (john,1))
Summary:
Compare the two jobs in the UI: reduceByKey reads 53 B of input and shuffles 161 B of data, while groupByKey also reads 53 B but shuffles 172 B.
- With groupByKey, all of the data crosses the shuffle without any computation.
- reduceByKey performs a local aggregation first, i.e. a map-side combiner; only the combined results go through the shuffle, so the shuffled data volume is smaller.
3.4 The shuffle process of reduceByKey and groupByKey, illustrated
Suppose there are three map tasks: the first holds (a,1)(b,1); the second (a,1)(b,1)(a,1)(b,1); the third (a,1)(b,1)(a,1)(b,1)(a,1)(b,1).
The groupByKey shuffle process:
The reduceByKey shuffle process:
Why does reduceByKey shuffle less data? Because it aggregates on the map side first, which reduces the amount of data that has to be shuffled.
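A small sketch that reproduces the spirit of the three map tasks above (note that parallelize splits the 12 records evenly across the 3 partitions rather than 2/4/6, so the data is only illustrative):

val pairs = sc.parallelize(List(
  ("a",1),("b",1),
  ("a",1),("b",1),("a",1),("b",1),
  ("a",1),("b",1),("a",1),("b",1),("a",1),("b",1)), 3)
pairs.reduceByKey(_ + _).collect()               // map-side combine: each partition ships at most one (a,n) and one (b,n)
pairs.groupByKey().mapValues(_.sum).collect()    // no map-side combine: all twelve (key,1) records cross the shuffle
// Both return Array((a,6), (b,6)); the difference shows up as the Shuffle Write size in the UI.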
Extension: the aggregateByKey operator:
Some problems cannot be solved with reduceByKey alone, which is what leads to this new operator (see the sketch below):
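For example, a hedged sketch of aggregateByKey computing the average value per key, something reduceByKey alone cannot express because the intermediate type (sum, count) differs from the value type (the scores data is made up):

val scores = sc.parallelize(List(("a", 90), ("a", 70), ("b", 80)), 2)
val avg = scores
  .aggregateByKey((0, 0))(                        // zeroValue: (sum, count)
    (acc, v) => (acc._1 + v, acc._2 + 1),         // seqOp: runs inside each partition (map side)
    (x, y) => (x._1 + y._1, x._2 + y._2))         // combOp: merges partition results after the shuffle
  .mapValues { case (sum, cnt) => sum.toDouble / cnt }
avg.collect()                                     // Array((a,80.0), (b,80.0))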
There are no secrets in front of the source code.
The source code of groupByKey:
The groupByKey method defined in PairRDDFunctions.scala:
- def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
// groupByKey shouldn't use map side combine because map side combine does not
// reduce the amount of data shuffled and requires all map side data be inserted
// into a hash table, leading to more objects in the old gen.
val createCombiner = (v: V) => CompactBuffer(v)
val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
Notice that mapSideCombine is explicitly set to false here.
The source code behind reduceByKey:
- def combineByKeyWithClassTag[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
partitioner: Partitioner,
mapSideCombine: Boolean = true,
serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
Notice that mapSideCombine defaults to true here.
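For completeness, reduceByKey itself just delegates to combineByKeyWithClassTag and keeps that default. Roughly, paraphrased from PairRDDFunctions.scala (check your own Spark version for the exact code):

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  // createCombiner just wraps the first value; mergeValue and mergeCombiners are both the user's func,
  // and mapSideCombine stays at its default of true, so the map-side combiner kicks in
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}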
4.1 collectAsMap
Note: all of the data is loaded into the driver's memory; with a large result the driver cannot hold it and will crash.
/**
* Return the key-value pairs in this RDD to the master as a Map.
*
* Warning: this doesn't return a multimap (so if you have multiple values to the same key, only
* one value per key is preserved in the map returned)
*
* @note this method should only be used if the resulting data is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def collectAsMap(): Map[K, V] = self.withScope {
val data = self.collect()
val map = new mutable.HashMap[K, V]
map.sizeHint(data.length)
data.foreach { pair => map.put(pair._1, pair._2) }
map
}
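A quick illustration of the warning in that comment, using toy data:

val pairs = sc.parallelize(List(("a", 1), ("a", 2), ("b", 3)))
pairs.collect()         // Array((a,1), (a,2), (b,3)) -- every pair is kept
pairs.collectAsMap()    // one entry per key, e.g. Map(a -> 2, b -> 3); one of the values for "a" is dropped
// Both pull the entire result into the driver's memory, so only use them on small results.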
In RDD.scala:
Remember: whenever you see runJob in a method's source code, that method is an action and will trigger a job.
/**
* Return the number of elements in the RDD.
*/
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
/**
* Return an array that contains all of the elements in this RDD.
*
* @note This method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def collect(): Array[T] = withScope {
val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
Array.concat(results: _*)
}
Array.concat(results: _*) ==> the : _* here is not the varargs parameter declaration itself; it expands the results sequence into varargs.
Click into concat to go one level down into the source:
- def concat[T: ClassTag](xss: Array[T]*): Array[T] // this is where the varargs parameter is declared
This was touched on in the Scala04 lesson:
println(sum(1.to(10): _*))
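A minimal sketch of the varargs point, assuming a user-defined sum (this sum is not part of Spark or the standard library):

def sum(xs: Int*): Int = xs.foldLeft(0)(_ + _)   // Int* declares a varargs parameter, like Array[T]* in concat
println(sum(1, 2, 3))                            // pass individual arguments
println(sum(1.to(10): _*))                       // : _* expands a sequence into varargs, as in Array.concat(results: _*)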