After the shuffle write has put the map output on local disk, that data has to be read back from disk; this is the job of ShuffledRDD:
override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
  val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
  SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
    .read() // each reduce task reads exactly one partition: [split.index, split.index + 1)
    .asInstanceOf[Iterator[(K, C)]]
}
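To see this path from the user side, here is a minimal sketch (assuming a local Spark 1.x build; the object name ShuffledRddReadDemo and the variable names are made up for illustration). reduceByKey produces a ShuffledRDD, and each reduce task's compute() asks the shuffle reader for exactly one partition:

import org.apache.spark.SparkContext._ // pair-RDD functions on older Spark versions
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver program, not part of Spark itself.
object ShuffledRddReadDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffled-rdd-read").setMaster("local[2]"))
    // reduceByKey puts a ShuffledRDD on top of the map-side output.
    val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)), 2).reduceByKey(_ + _, 2)
    println(counts.getClass.getSimpleName) // ShuffledRDD
    // collect() runs one reduce task per partition; each task goes through the compute() above.
    counts.collect().foreach(println)
    sc.stop()
  }
}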
As you can see, the reader is obtained through ShuffleManager.getReader. In this version of Spark there is only one reader implementation, HashShuffleReader. Here is the getReader source:
override def getReader[K, C](
    handle: ShuffleHandle,
    startPartition: Int,
    endPartition: Int,
    context: TaskContext): ShuffleReader[K, C] = {
  // We currently use the same block store shuffle fetcher as the hash-based shuffle.
  new HashShuffleReader(
    handle.asInstanceOf[BaseShuffleHandle[K, _, C]], startPartition, endPartition, context)
}
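Note that, as the comment in getReader says, the reduce side reuses the same block store shuffle fetcher as the hash-based shuffle, so switching spark.shuffle.manager only changes how the map side writes its files; the read path still goes through HashShuffleReader. A small configuration sketch (the variable name is for illustration only):

import org.apache.spark.SparkConf

// In this Spark version the read path is HashShuffleReader either way;
// the setting below only selects the map-side write strategy.
val conf = new SparkConf()
  .setAppName("shuffle-manager-demo")
  .setMaster("local[2]")
  .set("spark.shuffle.manager", "sort") // or "hash"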
Next, look at its read() method:
override def read(): Iterator[Product2[K, C]] = {
  val ser = Serializer.getSerializer(dep.serializer)
  // This is where the reducer actually fetches the blocks it needs from the map output files.
  val iter = BlockStoreShuffleFetcher.fetch(handle.shuffleId, startPartition, context, ser)

  val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
    if (dep.mapSideCombine) { // the map side already aggregated by key, so merge combiners here
      new InterruptibleIterator(context, dep.aggregator.get.combineCombinersByKey(iter, context))
    } else { // no map-side aggregation was done, so merge the raw values here
      new InterruptibleIterator(context, dep.aggregator.get.combineValuesByKey(iter, context))
    }
  } else if (dep.aggregator.isEmpty && dep.mapSideCombine) {
    throw new IllegalStateException("Aggregator is empty for map-side combine")
  } else { // no aggregator at all
    // Convert the Product2s to pairs since this is what downstream RDDs currently expect
    iter.asInstanceOf[Iterator[Product2[K, C]]].map(pair => (pair._1, pair._2))
  }
  // Sort the output if there is a sort ordering defined.
  dep.keyOrdering match {
    case Some(keyOrd: Ordering[K]) => // a key ordering was requested (e.g. sortByKey)
      // Create an ExternalSorter to sort the data. Note that if spark.shuffle.spill is disabled,
      // the ExternalSorter won't spill to disk.
      val sorter = new ExternalSorter[K, C, C](ordering = Some(keyOrd), serializer = Some(ser))
      sorter.insertAll(aggregatedIter)
      sorter.iterator
    case None => // no ordering requested, return the aggregated iterator as is
      aggregatedIter
  }
}
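To tie the branches in read() back to the user-facing API, the sketch below (again assuming a local Spark 1.x build; the object name is hypothetical) shows which operators exercise which branch:

import org.apache.spark.SparkContext._ // pair-RDD functions on older Spark versions
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver program, not part of Spark itself.
object ReadBranchesDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-branches").setMaster("local[2]"))
    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3)), 2)

    // reduceByKey sets mapSideCombine = true, so read() merges already-combined
    // map output: the combineCombinersByKey branch.
    pairs.reduceByKey(_ + _).collect()

    // groupByKey sets mapSideCombine = false, so read() merges raw values:
    // the combineValuesByKey branch.
    pairs.groupByKey().collect()

    // sortByKey defines no aggregator but sets keyOrdering, so read() falls
    // through to the ExternalSorter branch and sorts the fetched records by key.
    pairs.sortByKey().collect()

    sc.stop()
  }
}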