- WordCount is, I think, the first program everyone meets when learning big data, much like Hello World in Java. So what actually happens when a WordCount program runs, and how does data get passed between operators?
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf()
.setAppName(this.getClass.getSimpleName)
.setMaster("local[2]")
val sc = new SparkContext(sparkConf)
/** Dataset:
* 张三,李四,王五
* 小林,李四,张三
* Result:
* (张三,2)
* (小林,1)
* (李四,2)
* (王五,1)
*/
val wordRdd = sc.textFile("data/test.txt")
val wordsRdd: RDD[String] = wordRdd.flatMap(x => x.split(","))
val kvRdd: RDD[(String, Int)] = wordsRdd.map((_, 1))
val resultRdd: RDD[(String, Int)] = kvRdd.reduceByKey(_ + _)
resultRdd.foreach(println)
sc.stop()
Step 1: the textFile operator loads the dataset
// We can see that textFile delegates to the hadoopFile method
def textFile(
path: String,
minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
assertNotStopped()
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
minPartitions).map(pair => pair._2.toString).setName(path)
}
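Usage note: the minPartitions parameter in that signature can also be supplied from user code, e.g. (4 here is just an arbitrary example value):
val wordRddWith4Parts = sc.textFile("data/test.txt", minPartitions = 4)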
// hadoopFile in turn constructs a HadoopRDD, which extends the abstract RDD class
def hadoopFile[K, V](
path: String,
inputFormatClass: Class[_ <: InputFormat[K, V]],
keyClass: Class[K],
valueClass: Class[V],
minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
assertNotStopped()
......
new HadoopRDD(
this,
confBroadcast,
Some(setInputPathsFunc),
inputFormatClass,
keyClass,
valueClass,
minPartitions).setName(path)
}
// Notice that the RDD superclass constructor receives two arguments: the sc, and Nil
// @transient private var _sc: SparkContext,
// @transient private var deps: Seq[Dependency[_]]
// This means the current RDD has no upstream dependencies (Dependency)
class HadoopRDD[K, V](
sc: SparkContext,
broadcastedConf: Broadcast[SerializableConfiguration],
initLocalJobConfFuncOpt: Option[JobConf => Unit],
inputFormatClass: Class[_ <: InputFormat[K, V]],
keyClass: Class[K],
valueClass: Class[V],
minPartitions: Int)
extends RDD[(K, V)](sc, Nil) with Logging {
- Key methods in HadoopRDD: getPartitions and compute
// getPartitions: split the input and return the resulting array of partitions
val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
// compute: use a NextIterator (reading records through a LineRecordReader)
// to pull the file's data and return it as an Iterator
val iter = new NextIterator[(K, V)]{......}
// returned wrapped as
new InterruptibleIterator[(K, V)](context, iter)
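To make the (sc, Nil) plus getPartitions/compute pattern concrete, here is a minimal sketch of a custom source RDD; LinesRDD, lines, and numSlices are hypothetical names invented for this illustration:
import org.apache.spark.rdd.RDD
import org.apache.spark.{Partition, SparkContext, TaskContext}

// Hypothetical source RDD: it sits at the head of the lineage, so deps = Nil
class LinesRDD(sc: SparkContext, lines: Seq[String], numSlices: Int)
  extends RDD[String](sc, Nil) {

  private case class LinesPartition(index: Int, slice: Seq[String]) extends Partition

  // Like HadoopRDD.getPartitions: describe how the input is sliced
  override protected def getPartitions: Array[Partition] =
    lines.grouped(math.max(1, lines.size / numSlices)).zipWithIndex
      .map { case (slice, i) => LinesPartition(i, slice) }
      .toArray

  // Like HadoopRDD.compute: return an Iterator over one partition's records
  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    split.asInstanceOf[LinesPartition].slice.iterator
}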
Step 2: the flatMap and map operators
// Both flatMap and map construct a MapPartitionsRDD, which extends RDD and is handed the previous RDD object prev as its parent
private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
var prev: RDD[T],
f: (TaskContext, Int, Iterator[T]) => Iterator[U], // (TaskContext, partition index, iterator)
preservesPartitioning: Boolean = false,
isFromBarrier: Boolean = false,
isOrderSensitive: Boolean = false)
extends RDD[U](prev) {
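For reference, here is how map and flatMap construct that MapPartitionsRDD, condensed from Spark's RDD.scala (exact element types differ slightly across Spark/Scala versions):
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)  // closure-clean the user function before shipping it
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}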
// If a subclass such as HadoopRDD or MapPartitionsRDD does not override getDependencies,
// the dependency between RDDs is produced by the constructor below via OneToOneDependency,
// which extends NarrowDependency, which in turn extends Dependency.
// An RDD's dependency chain is therefore like a singly linked list.
/** Construct an RDD with just a one-to-one dependency on one parent */
def this(@transient oneParent: RDD[_]) =
this(oneParent.context, List(new OneToOneDependency(oneParent)))
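Because each RDD stores its parents this way, the lineage can be walked like a linked list. A small sketch (printLineage is a helper invented here; Spark's built-in resultRdd.toDebugString prints the same information):
import org.apache.spark.rdd.RDD

def printLineage(rdd: RDD[_], depth: Int = 0): Unit = {
  println(("  " * depth) + rdd)  // e.g. MapPartitionsRDD[4], ShuffledRDD[3] ...
  rdd.dependencies.foreach(d => printLineage(d.rdd, depth + 1))
}

printLineage(resultRdd) // resultRdd from the WordCount example above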
- getPartitions and compute in MapPartitionsRDD
// the partition count matches the parent RDD's partitions
override def getPartitions: Array[Partition] = firstParent[T].partitions
// apply the passed-in function f to the parent RDD's iterator
override def compute(split: Partition, context: TaskContext): Iterator[U] =
f(context, split.index, firstParent[T].iterator(split, context))
/** Since HadoopRDD has no iterator method of its own, we can look at the iterator
 * method of its parent class RDD: it checks whether the partition is cached or
 * checkpointed and dispatches accordingly, returning an Iterator either way.
 * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
 * This should ''not'' be called by users directly, but is available for implementors of custom
 * subclasses of RDD.
 */
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
if (storageLevel != StorageLevel.NONE) {
getOrCompute(split, context)
} else {
computeOrReadCheckpoint(split, context)
}
}
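A quick illustrative sketch of how persistence flips that branch, reusing kvRdd from the WordCount example above:
val cached = kvRdd.cache()                   // storageLevel becomes MEMORY_ONLY instead of NONE
cached.count()                               // first action: partitions are computed and stored
cached.reduceByKey(_ + _).foreach(println)   // later actions go through getOrCompute and read
                                             // the cached blocks instead of re-reading the file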
Step 3: the most important operator, reduceByKey
reduceByKey involves a shuffle, which touches a lot of machinery, so here is only a brief outline:
- Some background first: when new SparkContext runs, Spark creates a SparkEnv, which holds the blockManager (which in turn uses the UnifiedMemoryManager for memory management),
the shuffleManager (which drives the shuffle Writer/Reader flow), the mapOutputTracker, and so on. - What we focus on here is the shuffleManager:
shuffleManager.registerShuffle is called when getDependencies builds the ShuffleDependency, and yields the handle that selects the matching ShuffleWriter
shuffleManager.getReader reads back the data the ShuffleWriter wrote to disk
// getDependencies builds the collection of dependencies on the parent RDD; via the ShuffleDependency (which extends Dependency),
// a shuffleHandle is chosen based on a set of conditions,
// e.g. Bypass, Serialized, or Base (a condensed sketch follows the code below)
override def getDependencies: Seq[Dependency[_]] = {
......
List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))
}
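For reference, here is how SortShuffleManager.registerShuffle chooses among those handles, simplified from Spark's source (the exact signature and conditions vary by Spark version):
override def registerShuffle[K, V, C](
    shuffleId: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
  if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
    // no map-side combine and few reduce partitions: write one file per partition, then concatenate
    new BypassMergeSortShuffleHandle[K, V](shuffleId, dependency)
  } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
    // records can be sorted in their serialized form (tungsten sort)
    new SerializedShuffleHandle[K, V](shuffleId, dependency)
  } else {
    // default sort-based shuffle
    new BaseShuffleHandle(shuffleId, dependency)
  }
}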
// compute fetches the aggregated data the ShuffleWriter wrote to disk and returns it as an Iterator
override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
.read()
.asInstanceOf[Iterator[(K, C)]]
}
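How does reduceByKey end up at this ShuffledRDD? Condensed from PairRDDFunctions.combineByKeyWithClassTag (simplified; this also hints at the question at the end of the post):
if (self.partitioner == Some(partitioner)) {
  // the parent is already partitioned the right way: aggregate in place, no shuffle
  self.mapPartitions(iter => {
    val context = TaskContext.get()
    new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
  }, preservesPartitioning = true)
} else {
  // otherwise build the ShuffledRDD whose compute() is shown above
  new ShuffledRDD[K, V, C](self, partitioner)
    .setSerializer(serializer)
    .setAggregator(aggregator)
    .setMapSideCombine(mapSideCombine)
}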
- So when does the ShuffleWriter actually run?
The ShuffleWriter runs when an action triggers runJob: the scheduler divides the job into stages and tasks,
and when a task is a ShuffleMapTask, the executor's call to task.run triggers the ShuffleWriter.
Which concrete Writer implementation is chosen is a topic for another post...
override def runTask(context: TaskContext): MapStatus = {
// Deserialize the RDD using the broadcast variable.
......
val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
_executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
_executorDeserializeCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime
} else 0L
var writer: ShuffleWriter[Any, Any] = null
try {
// obtain the writer from the shuffleManager via getWriter
val manager = SparkEnv.get.shuffleManager
writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
writer.stop(success = true).get
} catch {
......
}
}
Summary:
- Data is handed between RDD operators through Iterators; when a shuffle occurs, the hand-off happens through the shuffleWriter and shuffleReader instead;
- Dependencies between RDDs are built by passing the previous RDD into the next one and storing it in the Dependencies sequence;
- A small question to close (a hint follows below):
Does using reduceByKey always cause a shuffle? Why?
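As a hint, here is a small sketch reusing kvRdd from the example above: when the parent RDD already carries the same partitioner, reduceByKey degenerates into a one-to-one dependency with no new shuffle stage.
import org.apache.spark.HashPartitioner

val pre = kvRdd.partitionBy(new HashPartitioner(2)).cache() // the one shuffle happens here
val agg = pre.reduceByKey(_ + _)  // same partitioner as pre: aggregated in place, no new ShuffledRDD
println(agg.toDebugString)        // the printed lineage shows no extra shuffle for this step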