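1. The shuffle process, illustrated:
In words: when an RDD is aggregated, the intermediate data has to be written to disk. It cannot sit in memory indefinitely while the downstream stage waits for every task of the upstream RDD to finish. This is what the diagram above shows, and it also points to the main direction for shuffle tuning: reduce the amount of data that is written to disk.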
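One way to see why less data on disk matters is to compare what a map task would write with and without map-side pre-aggregation. The sketch below is plain Scala, not Spark code; the object name, the sample records, and both helper functions are made up for illustration.

```scala
object MapSideCombineSketch {
  // Records one map task would emit for a word count, before any aggregation.
  val mapOutput: Seq[(String, Int)] =
    Seq("a" -> 1, "b" -> 1, "a" -> 1, "a" -> 1, "b" -> 1)

  // Without map-side combine, every record is written to the shuffle file.
  def withoutCombine(records: Seq[(String, Int)]): Seq[(String, Int)] = records

  // With map-side combine, values that share a key are merged before the write.
  def withCombine(records: Seq[(String, Int)]): Seq[(String, Int)] =
    records
      .groupBy(_._1)
      .map { case (key, values) => key -> values.map(_._2).sum }
      .toSeq
      .sortBy(_._1)

  def main(args: Array[String]): Unit = {
    println(s"records written without combine: ${withoutCombine(mapOutput).size}")
    println(s"records written with combine: ${withCombine(mapOutput).size}")
  }
}
```

The reduce side ends up with the same totals either way; only the number of records that cross the disk and the network changes.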
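2. Now consider another case: the upstream stage has a single Task (one CPU core, one partition), while the reduce stage has three partitions, that is, three Tasks. What should the files written to disk look like?
In words: if only one plain file is written, the three reduce Tasks cannot tell where their own data starts. If one file is written per reduce partition, the number of files grows with the number of tasks and turns into a small-files problem. The best approach is therefore to write one data file plus one index file, which is what Spark does, as the source code below confirms.
In addition, each downstream Task must be able to fetch all the data for a given key.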
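The "one data file plus one index file" idea can be sketched in memory. This is a simplified model and not Spark's implementation: `MapOutput`, `write`, and `read` are hypothetical names, keys are assumed to be non-negative integers, and the partitioner is simply `key % numPartitions`.

```scala
object DataAndIndexSketch {
  // Result of one map task: a single data "file" plus an index of start positions.
  final case class MapOutput(data: Vector[(Int, String)], index: Vector[Int])

  // Group records by reduce partition, lay the groups out contiguously, and
  // record where each one starts. Partition p occupies index(p) until index(p + 1).
  def write(records: Seq[(Int, String)], numPartitions: Int): MapOutput = {
    val byPartition = records.groupBy { case (key, _) => key % numPartitions }
    val segments = (0 until numPartitions).map(p => byPartition.getOrElse(p, Seq.empty))
    val index = segments.scanLeft(0)(_ + _.size).toVector
    MapOutput(segments.flatten.toVector, index)
  }

  // A reduce task reads only its own slice and never scans the whole file.
  def read(out: MapOutput, partition: Int): Vector[(Int, String)] =
    out.data.slice(out.index(partition), out.index(partition + 1))
}
```

Because all records with the same key land in the same slice, a reduce task that reads its slice from every map output sees every value for its keys.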
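3. This leads to a shuffle concept similar to the one in MapReduce: the upstream Task/RDD writes its output to disk.
Confirming this in the source:
Read DAGScheduler.scala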
val tasks: Seq[Task[_]] = try {
  val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
  stage match {
    case stage: ShuffleMapStage => // pattern match: this is a shuffle (map) stage
      stage.pendingPartitions.clear()
      partitionsToCompute.map { id =>
        val locs = taskIdToLocations(id)
        val part = partitions(id)
        stage.pendingPartitions += id
        // so a ShuffleMapTask is created for each partition to compute
        new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
          taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
          Option(sc.applicationId), sc.applicationAttemptId, stage.rdd.isBarrier())
      }
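Next, read ShuffleMapTask, which extends Task. Task defines run(), which calls the abstract method runTask(); every subclass, including ShuffleMapTask, must implement it.
So ShuffleMapTask contains an override of runTask(), and that is where the shuffle output gets written.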
override def runTask(context: TaskContext): MapStatus = {
  // Deserialize the RDD using the broadcast variable.
  val threadMXBean = ManagementFactory.getThreadMXBean
  val deserializeStartTimeNs = System.nanoTime()
  val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported)
  ........................
  val rdd = rddAndDep._1
  val dep = rddAndDep._2
  // While we use the old shuffle fetch protocol, we use partitionId as mapId in the
  // ShuffleBlockId construction.
  val mapId = if (SparkEnv.get.conf.get(config.SHUFFLE_USE_OLD_FETCH_PROTOCOL)) {
    partitionId
  } else context.taskAttemptId()
  dep.shuffleWriterProcessor.write(rdd, dep, mapId, context, partition) // the write step
}
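The relationship between run() and runTask() is the template-method pattern. A minimal sketch, using made-up `Mini*` classes rather than Spark's real signatures (the real run takes several arguments and sets up the task context, which is omitted here):

```scala
object TaskTemplateSketch {
  // Simplified stand-in for Task: run() is the fixed entry point, runTask() is the hook.
  abstract class MiniTask[T] {
    final def run(): T = {
      // Context and metrics setup would happen here in a real scheduler (omitted).
      runTask()
    }
    protected def runTask(): T
  }

  // Map-side task: its result describes what was written for the next stage.
  final class MiniShuffleMapTask(records: Seq[(String, Int)]) extends MiniTask[String] {
    override protected def runTask(): String = s"wrote ${records.size} records"
  }

  // Final-stage task: its result is the value computed from the partition.
  final class MiniResultTask(records: Seq[(String, Int)]) extends MiniTask[Int] {
    override protected def runTask(): Int = records.map(_._2).sum
  }
}
```

The scheduler only ever calls run(); which behaviour it gets depends on the subclass that was instantiated for the stage.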
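Step into the write call: the writer type, ShuffleWriter, is an abstract class. Pressing Ctrl+H (type hierarchy) shows SortShuffleWriter among its implementations, and its write method contains:
sorter.writePartitionedMapOutput(dep.shuffleId, mapId, mapOutputWriter)
val partitionLengths = mapOutputWriter.commitAllPartitions()
Of these two calls, writePartitionedMapOutput and commitAllPartitions, step into commitAllPartitions and look for the implementation that works on the local disk.
That method calls writeIndexFileAndCommit, which writes the index file and commits the output.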
public long[] commitAllPartitions() throws IOException {
  if (outputFileChannel != null && outputFileChannel.position() != bytesWrittenToMergedFile) {
    throw new IOException(...); // error message elided
  }
  cleanUp();
  File resolvedTmp = outputTempFile != null && outputTempFile.isFile() ? outputTempFile : null;
  // write the index file and commit the data file
  blockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, resolvedTmp);
  return partitionLengths;
}
Stepping into writeIndexFileAndCommit then shows both the index file and the data file:
def writeIndexFileAndCommit(
    shuffleId: Int,
    mapId: Long,
    lengths: Array[Long],
    dataTmp: File): Unit = {
  .....................................
  if (indexFile.exists()) {
    indexFile.delete()
  }
  if (dataFile.exists()) {
    dataFile.delete()
  }
}
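The `lengths` array passed to writeIndexFileAndCommit holds one byte length per reduce partition. A rough sketch of how such lengths can become an on-disk index, assuming the index is a sequence of cumulative offsets stored as 8-byte longs; `IndexFileSketch`, `offsets`, `writeIndex`, and `byteRange` are illustrative names, not Spark APIs:

```scala
import java.io.{DataOutputStream, File, FileOutputStream}
import java.nio.ByteBuffer
import java.nio.file.Files

object IndexFileSketch {
  // n partition lengths become n + 1 cumulative offsets, starting at 0.
  def offsets(lengths: Array[Long]): Array[Long] = lengths.scanLeft(0L)(_ + _)

  // Write the offsets as consecutive big-endian 8-byte longs.
  def writeIndex(file: File, lengths: Array[Long]): Unit = {
    val out = new DataOutputStream(new FileOutputStream(file))
    try offsets(lengths).foreach(out.writeLong) finally out.close()
  }

  // Two adjacent offsets give the slice of the data file that belongs to one
  // reduce partition, returned as (start offset, number of bytes).
  def byteRange(file: File, partition: Int): (Long, Long) = {
    val buf = ByteBuffer.wrap(Files.readAllBytes(file.toPath))
    val start = buf.getLong(partition * 8)
    val end = buf.getLong((partition + 1) * 8)
    (start, end - start)
  }
}
```

With this layout a reader needs only two small reads from the index before seeking straight to its data, no matter how many partitions the data file contains.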
How does the next RDD/Task read this data? Back to the source.
In the same place in DAGScheduler.scala:
val tasks: Seq[Task[_]] = try {
  stage match {
    case stage: ResultStage => // pattern match: for a result stage, create a ResultTask
      partitionsToCompute.map { id =>
        val p: Int = stage.partitions(id)
        val part = partitions(p)
        val locs = taskIdToLocations(id)
        new ResultTask(stage.id, stage.latestInfo.attemptNumber,
          taskBinary, part, locs, id, properties, serializedTaskMetrics,
          Option(jobId), Option(sc.applicationId), sc.applicationAttemptId,
          stage.rdd.isBarrier())
      }
  }
}
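ResultTask also has a runTask method, but there is no reader in it, only a call to rdd.iterator. If the storage level is not NONE, iterator goes through getOrCompute; otherwise it calls computeOrReadCheckpoint directly. Either way, a partition that is not cached ends up in computeOrReadCheckpoint in RDD.scala: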
private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
{
  if (isCheckpointedAndMaterialized) {
    firstParent[T].iterator(split, context)
  } else {
    compute(split, context)
  }
}
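Since the RDD is not checkpointed, execution goes to compute, which is an abstract method of the abstract class RDD and has many implementations. The RDD here is a ShuffledRDD, so the relevant override is in ShuffledRDD.scala. In other words, every kind of RDD supplies its own compute.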
override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
  val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
  val metrics = context.taskMetrics().createTempShuffleReadMetrics()
  SparkEnv.get.shuffleManager.getReader(
    dep.shuffleHandle, split.index, split.index + 1, context, metrics)
    .read() // use the shuffle handle to get a reader, which locates the data via the index and reads it
    .asInstanceOf[Iterator[(K, C)]]
}
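The read path above can be condensed into a small model: iterator() decides between the cache and compute(), and each RDD type implements compute() differently. Everything below is a hypothetical simplification (the `Mini*` classes, the `persisted` flag, the in-memory map outputs); the checkpoint branch is left out.

```scala
import scala.collection.mutable

object ComputeDispatchSketch {
  abstract class MiniRDD[T] {
    // Hypothetical per-partition cache standing in for the block manager.
    private val cache = mutable.Map.empty[Int, Seq[T]]
    var persisted: Boolean = false

    // Each subclass supplies its own way of producing a partition.
    protected def compute(split: Int): Iterator[T]

    // Cached partitions are reused; everything else falls through to compute().
    final def iterator(split: Int): Iterator[T] =
      if (persisted) cache.getOrElseUpdate(split, compute(split).toList).iterator
      else compute(split)
  }

  // An RDD backed by local data simply hands out its own partition.
  final class MiniLocalRDD[T](partitions: Seq[Seq[T]]) extends MiniRDD[T] {
    override protected def compute(split: Int): Iterator[T] = partitions(split).iterator
  }

  // A shuffled RDD pulls its partition out of every upstream map output.
  final class MiniShuffledRDD[T](mapOutputs: Seq[Map[Int, Seq[T]]]) extends MiniRDD[T] {
    // Stand-in for a reader over the partition range [startPartition, endPartition).
    private def read(startPartition: Int, endPartition: Int): Iterator[T] =
      mapOutputs.iterator.flatMap { output =>
        (startPartition until endPartition).iterator.flatMap(p => output.getOrElse(p, Seq.empty))
      }

    override protected def compute(split: Int): Iterator[T] = read(split, split + 1)
  }
}
```

The range `split` to `split + 1` mirrors the `split.index, split.index + 1` arguments in the excerpt: a reduce task asks for exactly one partition.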
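Hands-on: project practice with the unoptimized HashShuffle of Spark 1.0
5. Which Writer is used during the shuffle write, and under what conditions [source walkthrough]
Source:
ShuffleWriterProcessor.scala
Obtain the shuffle manager and the writer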
def write(
    rdd: RDD[_],
    dep: ShuffleDependency[_, _, _],
    mapId: Long,
    context: TaskContext,
    partition: Partition): MapStatus = {
  var writer: ShuffleWriter[Any, Any] = null
  try {
    val manager = SparkEnv.get.shuffleManager
    writer = manager.getWriter[Any, Any](
      dep.shuffleHandle,
      ...................
Click through manager.getWriter
getWriter matches on three kinds of shuffle handle:
SerializedShuffleHandle
BypassMergeSortShuffleHandle
BaseShuffleHandle
override def getWriter[K, V](
    handle: ShuffleHandle,
    mapId: Long,
    context: TaskContext,
    metrics: ShuffleWriteMetricsReporter): ShuffleWriter[K, V] = {
  ...................
  handle match {
    case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
      new UnsafeShuffleWriter(
        env.blockManager,
        context.taskMemoryManager(),
        unsafeShuffleHandle,
        mapId,
        context,
        env.conf,
        metrics,
        shuffleExecutorComponents)
    case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
      new BypassMergeSortShuffleWriter(
        env.blockManager,
        bypassMergeSortHandle,
        mapId,
        env.conf,
        metrics,
        shuffleExecutorComponents)
    case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
      new SortShuffleWriter(
        shuffleBlockResolver, other, mapId, context, shuffleExecutorComponents)
  }
}
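The order of the cases in that match matters, because the two specialised handles are subclasses of the base handle. A small sketch with made-up stand-in classes (`BaseHandle`, `SerializedHandle`, `BypassHandle`) that returns the writer name as a string:

```scala
object HandleMatchSketch {
  // Simplified hierarchy: both specialised handles extend the base handle.
  class BaseHandle(val shuffleId: Int)
  class SerializedHandle(shuffleId: Int) extends BaseHandle(shuffleId)
  class BypassHandle(shuffleId: Int) extends BaseHandle(shuffleId)

  // The most specific types are tested first; the base type is the fallback.
  // If the BaseHandle case came first, it would match every handle.
  def writerFor(handle: BaseHandle): String = handle match {
    case _: SerializedHandle => "UnsafeShuffleWriter"
    case _: BypassHandle => "BypassMergeSortShuffleWriter"
    case _: BaseHandle => "SortShuffleWriter"
  }
}
```

So the writer is never chosen at write time; it follows from the handle type that was fixed when the shuffle was registered.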
5.1 SerializedShuffleHandle
Go back to ShuffleWriterProcessor.scala: shuffleHandle is the first argument passed to getWriter
Click through dep.shuffleHandle
val manager = SparkEnv.get.shuffleManager
writer = manager.getWriter[Any, Any](
  dep.shuffleHandle,
  mapId,
  context,
  createMetricsReporter(context))
writer.write(
  rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
writer.stop(success = true).get
...................
The handle is created when the shuffle is registered; click through registerShuffle()
val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
  shuffleId, this)
The registerShuffle method (in SortShuffleManager.scala):
else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
  // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
  new SerializedShuffleHandle[K, V](
    shuffleId, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
canUseSerializedShuffle decides whether the serialized shuffle path can be used:
def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
  val shufId = dependency.shuffleId
  val numPartitions = dependency.partitioner.numPartitions
  if (!dependency.serializer.supportsRelocationOfSerializedObjects) { // does the serializer support relocation of serialized objects?
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
      s"${dependency.serializer.getClass.getName}, does not support object relocation")
    false
  } else if (dependency.mapSideCombine) { // is map-side combine (pre-aggregation) required?
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because we need to do " +
      s"map-side aggregation")
    false
  } else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) { // more than 16777215 + 1 partitions?
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
      s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
    false
  } else {
    log.debug(s"Can use serialized shuffle for shuffle $shufId")
    true
  }
}
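Summary: 1. The serializer must support relocation of serialized objects (for example Kryo)
2. No map-side combine (pre-aggregation) may be required, which rules out operators such as reduceByKey; groupByKey does not combine on the map side, so it is not excluded
3. The number of partitions must not exceed 16777216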
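The three conditions can be restated as a single predicate. In this sketch, `Dep` is a hypothetical stand-in for ShuffleDependency, and the limit is derived on the assumption that the partition id is packed into 24 bits, which gives 16777215 + 1 = 16777216.

```scala
object SerializedShuffleSketch {
  // Assumption: 24 bits are available for the partition id.
  val MaxPartitionId: Int = (1 << 24) - 1     // 16777215
  val MaxPartitions: Int = MaxPartitionId + 1 // 16777216

  // Hypothetical, reduced view of a shuffle dependency.
  final case class Dep(
      serializerSupportsRelocation: Boolean,
      mapSideCombine: Boolean,
      numPartitions: Int)

  // All three conditions must hold for the serialized path to be usable.
  def canUseSerializedShuffle(dep: Dep): Boolean =
    dep.serializerSupportsRelocation &&
      !dep.mapSideCombine &&
      dep.numPartitions <= MaxPartitions
}
```

Note that the comparison is inclusive: exactly 16777216 partitions still qualifies, and only a larger count is rejected.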
5.2 shouldBypassMergeSort: the writer that bypasses merge sort
SortShuffleManager.scala
if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
  // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
  // need map-side aggregation, then write numPartitions files directly and just concatenate
  // them at the end. This avoids doing serialization and deserialization twice to merge
  // together the spilled files, which would happen with the normal code path. The downside is
  // having multiple files open at a time and thus more memory allocated to buffers.
  new BypassMergeSortShuffleHandle[K, V](
    shuffleId, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
}
Click through shouldBypassMergeSort
def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
  // We cannot bypass sorting if we need to do map-side aggregation.
  if (dep.mapSideCombine) {
    false
  } else {
    val bypassMergeThreshold: Int = conf.get(config.SHUFFLE_SORT_BYPASS_MERGE_THRESHOLD)
    dep.partitioner.numPartitions <= bypassMergeThreshold
  }
}
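Summary: 1. The operator must not use map-side combine (pre-aggregation)
2. The number of partitions must be at most 200 (the default of spark.shuffle.sort.bypassMergeThreshold, configurable)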
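Putting 5.1 and 5.2 together, the handle is picked in a fixed order: the bypass check comes first (the `if` branch), the serialized check second (the `else if` branch), and the base handle is the fallback. A compact sketch of that order, with a hypothetical `Dep` case class in which the outcome of the serialized check is supplied as a flag, and with the threshold defaulting to 200:

```scala
object HandleSelectionSketch {
  // Hypothetical, reduced view of a shuffle dependency.
  final case class Dep(
      mapSideCombine: Boolean,
      numPartitions: Int,
      serializedShuffleOk: Boolean)

  // Bypass is allowed only without map-side combine and with few enough partitions.
  def shouldBypassMergeSort(dep: Dep, threshold: Int): Boolean =
    !dep.mapSideCombine && dep.numPartitions <= threshold

  // Same order as the if / else if chain in the excerpts; returns the handle name.
  def chooseHandle(dep: Dep, threshold: Int = 200): String =
    if (shouldBypassMergeSort(dep, threshold)) "BypassMergeSortShuffleHandle"
    else if (dep.serializedShuffleOk) "SerializedShuffleHandle"
    else "BaseShuffleHandle"
}
```

For example, a shuffle with 100 partitions and no map-side combine takes the bypass path even if it would also qualify for the serialized path, because the bypass check runs first.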