- 3.1 Grouping in IDEA
- 3.2 Using coalesce and repartition in production
- 3.3 Analysis of reduceByKey and groupByKey
- 3.4 The shuffle process of reduceByKey and groupByKey, illustrated
- 3.5 Exploring the combiner in the reduceByKey and groupByKey source code
1. Review of the Previous Class
Big Data in Practice, Lesson 15 (Part 1): Spark-Core03:
https://blog.csdn.net/zhikanjiani/article/details/91045640#id_4.2
- Definition of wide and narrow dependencies, and what they mean for fault tolerance
- Spark on YARN (client and cluster modes)
- Key-value programming
YARN and HADOOP_CONF_DIR
Question: for YARN mode, do we need to edit the slaves file under $SPARK_HOME/conf and change localhost to hadoop002?
No. When running on YARN, this machine only has to act as a client, which is why Spark on YARN is said to need nothing more than a client.
Question: does Spark on YARN require starting any of the following?
$SPARK_HOME/sbin/start-all.sh
start-master.sh, start-slaves.sh, slaves
A common reflex is to rush to start all the Spark standalone daemons before running Spark on YARN.
In fact a gateway machine plus spark-submit is all you need; no Spark daemons have to be started at all.
2. Shuffle Analysis
2.1 Shuffle Overview
- Review: an action triggers a job; when a job hits a shuffle it is split into a new stage, and each stage consists of a set of tasks.
See the official docs: http://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
Requirement:
- Given a pile of call records, count how many outgoing calls were made this month.
- The phone's call screen shows: contact, call time, call duration, and the call record itself.
- Statistical analysis in Spark boils down to word count: emit (day + direction, 1), use day + direction as the key, and run reduceByKey() on it (see the sketch after this list).
- Records with the same day + direction key must be shuffled to the same reducer; without that co-location the accumulation cannot be performed.
- This is exactly what a shuffle is: data sharing a particular characteristic is gathered onto one node for computation, here the +1 accumulation.
- Note: avoid shuffle operations whenever you can.
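A minimal sketch of this requirement (the records RDD, its (day, direction) layout, and the dates are made up for illustration; the lesson does not give the real call-record schema):

val records = sc.parallelize(List(
  ("2019-06-01", "out"), ("2019-06-01", "out"), ("2019-06-02", "in")))
val outgoingPerDay = records
  .filter { case (_, direction) => direction == "out" }   // keep only outgoing calls
  .map { case (day, _) => (day, 1) }                      // (day, 1), the same shape as word count
  .reduceByKey(_ + _)                                     // same key -> same reducer -> accumulate
outgoingPerDay.collect()                                  // Array((2019-06-01,2))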
- Shuffle operations
Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark's mechanism for re-distributing data so that it's grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
In other words: redistributing data across partitions typically means copying it between executors and machines, with disk I/O and network I/O, which is why the shuffle is a complex and expensive operation.
2.2 Shuffle Background
- To understand what happens during the shuffle we can consider the example of the reduceByKey operation.
- The reduceByKey operation generates a new RDD where all values for a single key are combined into a tuple - the key and the result of executing a reduce function against all values associated with that key.
- In other words, values sharing the same key are sent to the same reducer and reduced there.
- The challenge is that not all values for a single key necessarily reside on the same partition, or even the same machine, but they must be co-located to compute the result.
- Not every value for a key is stored in the same partition, or even on the same machine; the challenge is that they must be brought together in one place before the result can be computed.
- Operations which can cause a shuffle include repartition operations like repartition and coalesce, 'ByKey' operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.
- So which operations may produce a shuffle? Repartition operations (repartition, coalesce), 'ByKey' operations (groupByKey, reduceByKey), and join operations (cogroup, join).
2.3 Shuffle Performance Impact
- The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark's map and reduce operations.
- Spark generates sets of tasks ==> Spark produces a series of stages; each shuffle creates a new stage, and each stage produces a set of tasks.
- Internally, results from individual map tasks are kept in memory until they can't fit; these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.
- Essentially, each map task keeps its results in memory until they no longer fit, sorts them by target partition, and writes them to a single file; the reduce side then reads the relevant sorted blocks that the map side wrote.
3. Shuffle Operations in spark-shell
1. Start spark-shell:
scala> val info = sc.textFile("hdfs://hadoop002:9000/wordcount/input/ruozeinput.txt")
info: org.apache.spark.rdd.RDD[String] = hdfs://hadoop002:9000/wordcount/input/ruozeinput.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> info.partitions.length
res0: Int = 2
scala> val info1 = info.coalesce(1)
info1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[2] at coalesce at <console>:25
scala> info1.partitions.length
res1: Int = 1
scala> val info2 = info.coalesce(4)
info2: org.apache.spark.rdd.RDD[String] = CoalescedRDD[3] at coalesce at <console>:25
scala> info2.partitions.length
res2: Int = 2
scala> val info3 = info.coalesce(4,true)
info3: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at coalesce at <console>:25
scala> info3.partitions.length
res3: Int = 4
scala> info3.collect
res4: Array[String] = Array(hello world, hello, hello world john)
Explanation of the coalesce method:
- def coalesce(numPartitions: Int, shuffle: Boolean = false,
partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
(implicit ord: Ordering[T] = null)
- It takes a target number of partitions and an optional shuffle flag (true or false), which defaults to false.
- def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}
- repartition just calls coalesce with shuffle = true, so it always goes through a shuffle.
- Trigger it with a collect action:
- scala> info3.collect
res4: Array[String] = Array(hello world, hello, hello world john)
- Using repartition:
scala> val info4 = info.repartition(5)
info4: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at repartition at <console>:25
scala> info4.collect
res6: Array[String] = Array(hello world john, hello world, hello)
scala> info.partitions.length
res7: Int = 2
Going from 2 partitions to 5 means the data has to be redistributed, i.e. shuffled; when you only need to reduce the number of partitions, use coalesce so that the shuffle can be avoided.
3.1 Grouping in IDEA:
package spark01
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer
object RepartitionApp {
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf()
sparkConf.setAppName("LogApp").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val students = sc.parallelize(List("黄帆","梅宇豪","秦朗","杨朝珅","王乾","沈兆乘","沈其文","陈思文"),3)
students.mapPartitionsWithIndex((index,partition) => {
val stus = new ListBuffer[String]
while(partition.hasNext){ // iterate over this partition
stus += ("~~~~" + partition.next() + ",哪个组:" + (index+1))
}
stus.iterator
}).foreach(println) // print the results
sc.stop()
}
}
mapPartitionsWithIndex(): process the data partition by partition and attach a group number (the partition index).
The parallelism is set in parallelize, so there are explicitly 3 groups.
Requirement 1:
The department downsizes and three groups become two; change the code as follows:
- students.mapPartitionsWithIndex((index,partition) ==>
becomes:
students.coalesce(2).mapPartitionsWithIndex((index,partition)
Requirement 2:
There were three groups before the downsizing; regroup them into 5 groups:
students.repartition(5).mapPartitionsWithIndex((index,partition)
To show the effect of the partitioning and the repartition operation intuitively,
you can run the following code:
package Sparkcore04
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer
object RepartitionApp {
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf();
sparkConf.setAppName("LogApp").setMaster("local[2]");
val sc = new SparkContext(sparkConf);
val students = sc.parallelize(List("梅宇豪","黄帆","杨超神","薛思雨","朱昱璇","周一虹","王晓岚","沈兆乘","陈思文"),3);
students.mapPartitionsWithIndex((index,partition) =>
{
val stus = new ListBuffer[String]
while(partition.hasNext)
{
stus += ("~~~~" + partition.next() + ",哪个组:" + (index+1))
}
stus.iterator
}).foreach(println)
println("---------------------------分割线---------------------------")
students.repartition(4).mapPartitionsWithIndex((index,partition) => {
val stus = new ListBuffer[String]
while(partition.hasNext) {
stus += ("~~~" + partition.next() + ",新组" + (index+1))
}
stus.iterator
}).foreach(println)
sc.stop()
}
}
3.2 Using coalesce and repartition in production:
- Suppose an RDD has 300 partitions and each partition holds only one record, "id=100".
- Apply a filter (id > 99): the result still has 300 partitions, each with a single record.
Change the starting conditions:
- Originally 300 partitions with 100,000 records each; after the same filter (id > 99), each output file still ends up with only one record.
- Calling coalesce(1) at this point shrinks the partitions and greatly relieves the small-file problem, since the number of partitions determines the number of output files (see the sketch after this list).
- repartition use case: spread the data back out to increase parallelism.
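A sketch of the small-file scenario above, assuming hypothetical HDFS paths and a made-up filter condition:

val raw = sc.textFile("hdfs://hadoop002:9000/logs/input")   // suppose this yields ~300 partitions
val filtered = raw.filter(_.contains("id=100"))             // very selective filter: most partitions become nearly empty
filtered
  .coalesce(1)                                              // shrink partitions without a shuffle => 1 output file instead of 300
  .saveAsTextFile("hdfs://hadoop002:9000/logs/output")
// If the goal were more parallelism instead, repartition(n) (which shuffles) spreads the data back out evenly.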
3.3 Analysis of reduceByKey and groupByKey
1. Hand-write a word count:
In SecureCRT, start spark-shell --master local[2]
Run: sc.textFile("file:///home/hadoop/data/ruozeinput.txt").flatMap(_.split("\t")).map((_, 1)).reduceByKey(_ + _).collect
Look at the DAG: the first operator is textFile, the second flatMap, the third map; when reduceByKey is reached, the job is split into one stage before it and one stage after it.
Two stages: when reduceByKey runs, the (_, 1) data is first written out and then read back in.
The result type of reduceByKey is (String, Int): the word and the number of times it occurs.
2. The result types of reduceByKey and groupByKey:
scala> sc.textFile("file:///home/hadoop/data/ruozeinput.txt").flatMap(_.split("\t")).map((_, 1)).reduceByKey(_ + _)
res4: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[9] at reduceByKey at <console>:25

scala> sc.textFile("file:///home/hadoop/data/ruozeinput.txt").flatMap(_.split("\t")).map((_, 1)).groupByKey()
res5: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[14] at groupByKey at <console>:25
Word count with reduceByKey:
- scala> sc.textFile("file:///home/hadoop/data/ruozeinput.txt").flatMap(_.split("\t")).map((_, 1)).reduceByKey(_ + _).collect
res10: Array[(String, Int)] = Array((hello,3), (world,2), (john,1))
Word count with groupByKey:
- scala> sc.textFile("file:///home/hadoop/data/ruozeinput.txt").flatMap(_.split("\t")).map((_, 1)).groupByKey().map(x => (x._1, x._2.sum)).collect
res11: Array[(String, Int)] = Array((hello,3), (world,2), (john,1))
Summary:
Compare the two jobs in the UI: reduceByKey reads 53 B of input and shuffles 161 B of data, while groupByKey also reads 53 B but shuffles 172 B.
- With groupByKey, all of the data crosses the shuffle without any computation.
- reduceByKey performs a local aggregation first, i.e. a map-side combiner; only the combined results go through the shuffle, so the shuffled data volume is smaller.
3.4 The shuffle process of reduceByKey and groupByKey, illustrated
Suppose there are three map tasks: the first holds (a,1)(b,1); the second (a,1)(b,1)(a,1)(b,1); the third (a,1)(b,1)(a,1)(b,1)(a,1)(b,1).
The groupByKey shuffle process:
The reduceByKey shuffle process:
Why does reduceByKey shuffle less data? Because it aggregates on the map side first, which reduces the amount of data that has to be shuffled.
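A small sketch that reproduces the spirit of the three map tasks above (note that parallelize splits the 12 records evenly across the 3 partitions rather than 2/4/6, so the data is only illustrative):

val pairs = sc.parallelize(List(
  ("a",1),("b",1),
  ("a",1),("b",1),("a",1),("b",1),
  ("a",1),("b",1),("a",1),("b",1),("a",1),("b",1)), 3)
pairs.reduceByKey(_ + _).collect()               // map-side combine: each partition ships at most one (a,n) and one (b,n)
pairs.groupByKey().mapValues(_.sum).collect()    // no map-side combine: all twelve (key,1) records cross the shuffle
// Both return Array((a,6), (b,6)); the difference shows up as the Shuffle Write size in the UI.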
Extension: the aggregateByKey operator:
Some problems cannot be solved with reduceByKey alone, which is what leads to this new operator (see the sketch below):
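For example, a hedged sketch of aggregateByKey computing the average value per key, something reduceByKey alone cannot express because the intermediate type (sum, count) differs from the value type (the scores data is made up):

val scores = sc.parallelize(List(("a", 90), ("a", 70), ("b", 80)), 2)
val avg = scores
  .aggregateByKey((0, 0))(                        // zeroValue: (sum, count)
    (acc, v) => (acc._1 + v, acc._2 + 1),         // seqOp: runs inside each partition (map side)
    (x, y) => (x._1 + y._1, x._2 + y._2))         // combOp: merges partition results after the shuffle
  .mapValues { case (sum, cnt) => sum.toDouble / cnt }
avg.collect()                                     // Array((a,80.0), (b,80.0))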
There are no secrets in front of the source code.
The source code of groupByKey:
The groupByKey method defined in PairRDDFunctions.scala:
- def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
// groupByKey shouldn't use map side combine because map side combine does not
// reduce the amount of data shuffled and requires all map side data be inserted
// into a hash table, leading to more objects in the old gen.
val createCombiner = (v: V) => CompactBuffer(v)
val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
Notice that mapSideCombine is explicitly set to false here.
The source code behind reduceByKey:
- def combineByKeyWithClassTag[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
partitioner: Partitioner,
mapSideCombine: Boolean = true,
serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
Notice that mapSideCombine defaults to true here.
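For completeness, reduceByKey itself just delegates to combineByKeyWithClassTag and keeps that default. Roughly, paraphrased from PairRDDFunctions.scala (check your own Spark version for the exact code):

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  // createCombiner just wraps the first value; mergeValue and mergeCombiners are both the user's func,
  // and mapSideCombine stays at its default of true, so the map-side combiner kicks in
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}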
4.1 collectAsMap
Note: all of the data is loaded into the driver's memory; with a large result the driver cannot hold it and will crash.
/**
* Return the key-value pairs in this RDD to the master as a Map.
*
* Warning: this doesn't return a multimap (so if you have multiple values to the same key, only
* one value per key is preserved in the map returned)
*
* @note this method should only be used if the resulting data is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def collectAsMap(): Map[K, V] = self.withScope {
val data = self.collect()
val map = new mutable.HashMap[K, V]
map.sizeHint(data.length)
data.foreach { pair => map.put(pair._1, pair._2) }
map
}
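A quick illustration of the warning in that comment, using toy data:

val pairs = sc.parallelize(List(("a", 1), ("a", 2), ("b", 3)))
pairs.collect()         // Array((a,1), (a,2), (b,3)) -- every pair is kept
pairs.collectAsMap()    // one entry per key, e.g. Map(a -> 2, b -> 3); one of the values for "a" is dropped
// Both pull the entire result into the driver's memory, so only use them on small results.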
In RDD.scala:
Remember: whenever you see runJob in a method's source code, that method is an action and will trigger a job.
/**
* Return the number of elements in the RDD.
*/
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
/**
* Return an array that contains all of the elements in this RDD.
*
* @note This method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def collect(): Array[T] = withScope {
val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
Array.concat(results: _*)
}
Array.concat(results: _*) ==> the : _* here is not the varargs parameter declaration itself; it expands the results sequence into varargs.
Click into concat to go one level down into the source:
- def concat[T: ClassTag](xss: Array[T]*): Array[T] // this is where the varargs parameter is declared
This was touched on in the Scala04 lesson:
println(sum(1.to(10): _*))
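A minimal sketch of the varargs point, assuming a user-defined sum (this sum is not part of Spark or the standard library):

def sum(xs: Int*): Int = xs.foldLeft(0)(_ + _)   // Int* declares a varargs parameter, like Array[T]* in concat
println(sum(1, 2, 3))                            // pass individual arguments
println(sum(1.to(10): _*))                       // : _* expands a sequence into varargs, as in Array.concat(results: _*)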