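In the earlier post "Spark's Shuffle Process Framework and Source Code Explained (A Labor of Love) (1)", we covered in detail how Spark Shuffle evolved and which data structures the shuffle uses. All of that lays the groundwork for the detailed shuffle flow and the source walkthrough that follow. This post covers the framework of BypassMergeSortShuffleWriter and walks through its source code. The code shown is from Spark 2.3.2.
1. Spark Shuffle execution flow
The figure above shows the execution flow of Spark Shuffle. The shuffle is generally split into two parts: ShuffleWrite and ShuffleRead. Depending on the scenario, ShuffleWrite has three implementations: BypassMergeSortShuffleWriter, UnsafeShuffleWriter, and SortShuffleWriter. The three differ considerably, and which one is used depends on the workload and on the configured parameters. This post covers only BypassMergeSortShuffleWriter; the other two will be covered in later posts.
2. How Spark's ShuffleManager obtains a Writer
The entry point of the shuffle is runTask in ShuffleMapTask. Here is the code of runTask():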
override def runTask(context: TaskContext): MapStatus = {
// Deserialize the RDD using the broadcast variable.
val threadMXBean = ManagementFactory.getThreadMXBean
val deserializeStartTime = System.currentTimeMillis()
val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
threadMXBean.getCurrentThreadCpuTime
} else 0L
val ser = SparkEnv.get.closureSerializer.newInstance()
val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
_executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
_executorDeserializeCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime
} else 0L
var writer: ShuffleWriter[Any, Any] = null
try {
// Get the ShuffleManager; in version 2.3.2 the only implementation is SortShuffleManager.
val manager = SparkEnv.get.shuffleManager
// Get the Writer that matches the ShuffleHandle; this ShuffleHandle was registered with the ShuffleManager.
writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
// Call the actual write method.
writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
writer.stop(success = true).get
} catch {
case e: Exception =>
try {
if (writer != null) {
writer.stop(success = false)
}
} catch {
case e: Exception =>
log.debug("Could not stop writer", e)
}
throw e
}
}
runTask() first obtains the ShuffleManager and then obtains the Writer that matches the ShuffleHandle registered with that ShuffleManager. Let's look at how the ShuffleHandle is registered with the ShuffleManager:
override def registerShuffle[K, V, C](
shuffleId: Int,
numMaps: Int,
dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
// If the shuffle dependency satisfies shouldBypassMergeSort, create a BypassMergeSortShuffleHandle.
if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
// If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
// need map-side aggregation, then write numPartitions files directly and just concatenate
// them at the end. This avoids doing serialization and deserialization twice to merge
// together the spilled files, which would happen with the normal code path. The downside is
// having multiple files open at a time and thus more memory allocated to buffers.
new BypassMergeSortShuffleHandle[K, V](
shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
} else if (SortShuffleManager.canUseSerializedShuffle(dependency)) { // If canUseSerializedShuffle(dependency) holds, create a SerializedShuffleHandle
// Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
new SerializedShuffleHandle[K, V](
shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
} else {
// Otherwise, buffer map outputs in a deserialized form:
new BaseShuffleHandle(shuffleId, numMaps, dependency) // Create a BaseShuffleHandle
}
}
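The code above registers the ShuffleHandle, and the ShuffleManager decides which Writer to use based on the registered ShuffleHandle. The method that obtains the Writer creates the Writer matching the registered ShuffleHandle: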
override def getWriter[K, V](
handle: ShuffleHandle,
mapId: Int,
context: TaskContext): ShuffleWriter[K, V] = {
numMapsForShuffle.putIfAbsent(
handle.shuffleId, handle.asInstanceOf[BaseShuffleHandle[_, _, _]].numMaps)
val env = SparkEnv.get
handle match {
case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
new UnsafeShuffleWriter(
env.blockManager,
shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
context.taskMemoryManager(),
unsafeShuffleHandle,
mapId,
context,
env.conf)
case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
new BypassMergeSortShuffleWriter(
env.blockManager,
shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
bypassMergeSortHandle,
mapId,
context,
env.conf)
case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
new SortShuffleWriter(shuffleBlockResolver, other, mapId, context)
}
}
That covers how the ShuffleManager obtains the different Writers. Once a Writer has been obtained, it starts writing the data output by the ShuffleMapTask to disk.
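To make the order of the checks easier to see, here is a minimal Java sketch of the selection logic from registerShuffle and getWriter. The class WriterSelectionSketch, the enum HandleKind, and both method names are hypothetical and exist only for illustration; the two boolean parameters stand in for the results of shouldBypassMergeSort and canUseSerializedShuffle.

```java
// Sketch only: mirrors the order of checks in registerShuffle and the match in getWriter.
// WriterSelectionSketch and HandleKind are hypothetical names, not Spark classes.
final class WriterSelectionSketch {

    enum HandleKind { BYPASS_MERGE_SORT, SERIALIZED, BASE }

    // registerShuffle tests the bypass condition first, then serialized shuffle,
    // and falls back to the base handle.
    static HandleKind chooseHandle(boolean shouldBypassMergeSort, boolean canUseSerializedShuffle) {
        if (shouldBypassMergeSort) {
            return HandleKind.BYPASS_MERGE_SORT;
        } else if (canUseSerializedShuffle) {
            return HandleKind.SERIALIZED;
        } else {
            return HandleKind.BASE;
        }
    }

    // getWriter maps each handle type to exactly one writer class.
    static String writerFor(HandleKind kind) {
        switch (kind) {
            case BYPASS_MERGE_SORT:
                return "BypassMergeSortShuffleWriter";
            case SERIALIZED:
                return "UnsafeShuffleWriter";
            default:
                return "SortShuffleWriter";
        }
    }

    public static void main(String[] args) {
        // A shuffle that qualifies for both modes still takes the bypass path,
        // because that condition is checked first.
        System.out.println(writerFor(chooseHandle(true, true)));   // BypassMergeSortShuffleWriter
        System.out.println(writerFor(chooseHandle(false, true)));  // UnsafeShuffleWriter
        System.out.println(writerFor(chooseHandle(false, false))); // SortShuffleWriter
    }
}
```

The point to take away is that the handle type is fixed once, at registration time, and every later getWriter call for that shuffle only dispatches on it.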
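3. BypassMergeSortShuffleWriter execution architecture
4. BypassMergeSortShuffleWriter source code in detail
After the BypassMergeSortShuffleWriter has been obtained as shown above, the code executes as follows:
BypassMergeSortShuffleWriter is used under the conditions below. This shuffle mode is similar to HashShuffle, except that all the data output by a ShuffleMapTask is finally merged into a single file.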
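- no Aggregator is specified; there is no aggregation on the map side
- no Ordering is specified; there is no sorting on the map side
- the number of partitions is less than spark.shuffle.sort.bypassMergeThreshold; the number of reduce partitions must stay below this configured value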
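The conditions above can be sketched as a small predicate. This is only an illustration under stated assumptions: BypassCheckSketch and its method are hypothetical names, the sketch models only the aggregation and partition-count conditions, and the inclusive comparison `<=` is how we recall the Scala check in SortShuffleWriter.shouldBypassMergeSort being written, so please verify it against the Spark source for your version. The default of 200 is the documented default of spark.shuffle.sort.bypassMergeThreshold.

```java
// Sketch only: a simplified model of the bypass decision, not the Spark implementation.
final class BypassCheckSketch {

    // Documented default of spark.shuffle.sort.bypassMergeThreshold.
    static final int DEFAULT_BYPASS_MERGE_THRESHOLD = 200;

    static boolean shouldBypassMergeSort(boolean mapSideCombine, int numPartitions, int threshold) {
        if (mapSideCombine) {
            // Map-side aggregation rules out the bypass path.
            return false;
        }
        // Assumption: the comparison is inclusive; check the Spark source to confirm.
        return numPartitions <= threshold;
    }

    public static void main(String[] args) {
        System.out.println(shouldBypassMergeSort(false, 50, DEFAULT_BYPASS_MERGE_THRESHOLD));  // true
        System.out.println(shouldBypassMergeSort(true, 50, DEFAULT_BYPASS_MERGE_THRESHOLD));   // false
        System.out.println(shouldBypassMergeSort(false, 500, DEFAULT_BYPASS_MERGE_THRESHOLD)); // false
    }
}
```

Raising the threshold trades more simultaneously open files and buffers for skipping the sort, which is the downside the comment in registerShuffle mentions.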
Execution starts with the write method in BypassMergeSortShuffleWriter.java.
@Override
public void write(Iterator<Product2<K, V>> records) throws IOException {
assert (partitionWriters == null);
if (!records.hasNext()) { // If the ShuffleMapTask produced no output, write an index file for the empty output and return a MapStatus for it
partitionLengths = new long[numPartitions];
shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, null);
mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
return;
}
final SerializerInstance serInstance = serializer.newInstance(); // Create a serializer instance
final long openStartTime = System.nanoTime();
// One writer per partition; numPartitions is in effect the number of reduce tasks.
partitionWriters = new DiskBlockObjectWriter[numPartitions];
// One FileSegment per partition, each recording a file, an offset, and a length.
partitionWriterSegments = new FileSegment[numPartitions];
for (int i = 0; i < numPartitions; i++) {
final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
blockManager.diskBlockManager().createTempShuffleBlock(); // Create one temporary file per partition
final File file = tempShuffleBlockIdPlusFile._2();
final BlockId blockId = tempShuffleBlockIdPlusFile._1();
partitionWriters[i] =
blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics); // Create a disk writer for each temporary file to write data into it
}
// Creating the file to write to and creating a disk writer both involve interacting with
// the disk, and can take a long time in aggregate when we open many files, so should be
// included in the shuffle write time.
writeMetrics.incWriteTime(System.nanoTime() - openStartTime);
while (records.hasNext()) {
final Product2<K, V> record = records.next();
final K key = record._1();
// Partition each record by its key and write it to the file of that partition.
partitionWriters[partitioner.getPartition(key)].write(key, record._2());
}
// Commit every partition writer so all records reach its file, and keep the resulting FileSegment.
for (int i = 0; i < numPartitions; i++) {
final DiskBlockObjectWriter writer = partitionWriters[i];
partitionWriterSegments[i] = writer.commitAndGet();
writer.close();
}
// The final output file
File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
// A temporary file for the final output
File tmp = Utils.tempFileWith(output);
try {
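// Merge all the per-partition temporary files into one file
partitionLengths = writePartitionedFile(tmp);
// Write the index file with the offset of each partition, so a reducer can later locate its own part of the data directly when fetching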
shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
} finally {
if (tmp.exists() && !tmp.delete()) {
logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
}
}
// Finally record the output location and the partition lengths in the MapStatus
mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
}
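write is the top-level call. It first creates one temporary file per partition. As the ShuffleMapTask produces records, the partitioner assigns each record to a partition by its key and the record is written straight into that partition's temporary file. Once all records have been written, a FileSegment is produced for each file. A final output file is then created, all temporary files are merged into it, and the index file is written from the resulting partition lengths. Finally, the location of the map output and the partition lengths are wrapped in a MapStatus, which is registered with the MapOutputTrackerMaster.
Next, writePartitionedFile: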
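The first half of write, one file per partition with every record routed by its key, can be sketched in plain Java as follows. This is a simplified illustration, not Spark code: PerPartitionWriteSketch and its methods are hypothetical, records are written as text lines instead of serialized bytes, and the hash-based partitionFor is an assumption standing in for whatever partitioner the shuffle dependency carries.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch only: hypothetical class that imitates the "one temp file per partition" step.
final class PerPartitionWriteSketch {

    // Assumption: a hash partitioner; floorMod keeps the index non-negative.
    static int partitionFor(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Opens one temp file and one writer per partition, then routes each record by key.
    // For brevity, writers are not closed if an exception occurs midway.
    static List<Path> writePerPartition(List<Map.Entry<Integer, String>> records, int numPartitions) {
        List<Path> files = new ArrayList<>();
        BufferedWriter[] writers = new BufferedWriter[numPartitions];
        try {
            for (int i = 0; i < numPartitions; i++) {
                Path file = Files.createTempFile("temp_shuffle_", ".part" + i);
                files.add(file);
                writers[i] = Files.newBufferedWriter(file);
            }
            for (Map.Entry<Integer, String> record : records) {
                BufferedWriter writer = writers[partitionFor(record.getKey(), numPartitions)];
                writer.write(record.getKey() + "=" + record.getValue());
                writer.newLine();
            }
            for (BufferedWriter writer : writers) {
                writer.close(); // flushes the buffered records to the file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    static List<String> linesOf(Path file) {
        try {
            return Files.readAllLines(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> records =
            List.of(Map.entry(0, "a"), Map.entry(1, "b"), Map.entry(4, "c"), Map.entry(2, "d"));
        List<Path> files = writePerPartition(records, 3);
        System.out.println(linesOf(files.get(1))); // [1=b, 4=c]
    }
}
```

Because every partition keeps its own open writer and buffer, the cost grows with the partition count, which is why this path is only taken for a small number of reduce partitions.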
/**
* Concatenate all of the per-partition files into a single combined file.
*
* @return array of lengths, in bytes, of each partition of the file (used by map output tracker).
*/
private long[] writePartitionedFile(File outputFile) throws IOException {
// Track location of the partition starts in the output file
final long[] lengths = new long[numPartitions];
if (partitionWriters == null) {
// We were passed an empty iterator
return lengths;
}
// Open the output stream
final FileOutputStream out = new FileOutputStream(outputFile, true);
final long writeStartTime = System.nanoTime();
boolean threwException = true;
try {
// Loop over the segment array created above and take the file of each segment
for (int i = 0; i < numPartitions; i++) {
final File file = partitionWriterSegments[i].file();
if (file.exists()) {
// Open an input stream on that file
final FileInputStream in = new FileInputStream(file);
boolean copyThrewException = true;
try {
lengths[i] = Utils.copyStream(in, out, false, transferToEnabled);
copyThrewException = false;
} finally {
Closeables.close(in, copyThrewException);
}
if (!file.delete()) {
logger.error("Unable to delete file for partition {}", i);
}
}
}
threwException = false;
} finally {
Closeables.close(out, threwException);
writeMetrics.incWriteTime(System.nanoTime() - writeStartTime);
}
// Reset partitionWriters
partitionWriters = null;
// Return the length, in bytes, of each partition in the merged file
return lengths;
}
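writePartitionedFile merges the per-partition temporary files on disk into one large file and returns an array holding the length in bytes of each partition, not the length of the whole file. After the merge an index file still has to be created, so that each ReduceTask can later fetch the data of its own partition from the single file directly by offset.
Now the creation of the index file: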
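The merge step can be illustrated with a short Java sketch that concatenates per-partition files in partition order and records how many bytes each one contributed. ConcatPartitionsSketch, concatenate, and newTempFile are hypothetical names; the sketch uses java.nio.file.Files.copy rather than Spark's Utils.copyStream and leaves out metrics and the transferTo option.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

// Sketch only: hypothetical class that imitates concatenating per-partition files.
final class ConcatPartitionsSketch {

    // Appends each partition file to the output in order, deletes it,
    // and returns the number of bytes copied per partition.
    static long[] concatenate(List<Path> partitionFiles, Path output) {
        long[] lengths = new long[partitionFiles.size()];
        try (OutputStream out = Files.newOutputStream(
                output, StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (int i = 0; i < partitionFiles.size(); i++) {
                Path part = partitionFiles.get(i);
                if (Files.exists(part)) {
                    lengths[i] = Files.copy(part, out); // returns the bytes copied
                    Files.delete(part);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lengths;
    }

    // Helper for the demo: creates a temp file holding the given bytes.
    static Path newTempFile(byte[] content) {
        try {
            Path file = Files.createTempFile("shuffle_sketch_", ".tmp");
            Files.write(file, content);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<Path> parts = List.of(
            newTempFile(new byte[] {1, 2, 3}),
            newTempFile(new byte[0]),
            newTempFile(new byte[] {4, 5, 6, 7, 8}));
        Path output = newTempFile(new byte[0]);
        long[] lengths = concatenate(parts, output);
        System.out.println(Arrays.toString(lengths)); // [3, 0, 5]
    }
}
```

An empty partition simply contributes a length of 0, so the returned array always has one entry per reduce partition.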
/**
* Write an index file with the offsets of each block, plus a final offset at the end for the
* end of the output file. This will be used by getBlockData to figure out where each block
* begins and ends.
*
* It will commit the data and index file as an atomic operation, use the existing ones, or
* replace them with new ones.
*
* Note: the `lengths` will be updated to match the existing index file if use the existing ones.
*/
def writeIndexFileAndCommit(
shuffleId: Int,
mapId: Int,
lengths: Array[Long],
dataTmp: File): Unit = {
// Resolve the index file
val indexFile = getIndexFile(shuffleId, mapId)
// A temporary file for the index
val indexTmp = Utils.tempFileWith(indexFile)
try {
val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(indexTmp)))
Utils.tryWithSafeFinally {
// We take in lengths of each block, need to convert it to offsets.
var offset = 0L
// The length of each partition is known from the merge step, so a running sum over lengths gives the offset at which every block starts
out.writeLong(offset)
for (length <- lengths) {
offset += length
out.writeLong(offset)
}
} {
out.close()
}
val dataFile = getDataFile(shuffleId, mapId)
// There is only one IndexShuffleBlockResolver per executor, this synchronization make sure
// the following check and rename are atomic.
synchronized {
val existingLengths = checkIndexAndDataFile(indexFile, dataFile, lengths.length)
if (existingLengths != null) {
// Another attempt for the same task has already written our map outputs successfully,
// so just use the existing partition lengths and delete our temporary map outputs.
System.arraycopy(existingLengths, 0, lengths, 0, lengths.length)
if (dataTmp != null && dataTmp.exists()) {
dataTmp.delete()
}
indexTmp.delete()
} else {
// This is the first successful attempt in writing the map outputs for this task,
// so override any existing index and data files with the ones we wrote.
if (indexFile.exists()) {
indexFile.delete()
}
if (dataFile.exists()) {
dataFile.delete()
}
if (!indexTmp.renameTo(indexFile)) {
throw new IOException("fail to rename file " + indexTmp + " to " + indexFile)
}
if (dataTmp != null && dataTmp.exists() && !dataTmp.renameTo(dataFile)) {
throw new IOException("fail to rename file " + dataTmp + " to " + dataFile)
}
}
}
} finally {
if (indexTmp.exists() && !indexTmp.delete()) {
logError(s"Failed to delete temporary index file at ${indexTmp.getAbsolutePath}")
}
}
}
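writeIndexFileAndCommit records where each partition's data sits within the final file, and each reduce partition uses this information to fetch its own data. Finally, the location of the output and the partition lengths are wrapped in mapStatus and registered with the MapOutputTrackerMaster. That completes BypassMergeSortShuffleWriter.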
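The conversion from lengths to offsets is just a running sum, and a reducer's byte range falls out of two neighbouring offsets. The following Java sketch shows that arithmetic; IndexOffsetsSketch, toOffsets, and rangeOf are hypothetical names, and the arrays stand in for the longs that writeIndexFileAndCommit writes to the index file.

```java
import java.util.Arrays;

// Sketch only: hypothetical class showing the offset arithmetic behind the index file.
final class IndexOffsetsSketch {

    // Turns N partition lengths into N + 1 offsets: a leading 0, then the running sum.
    static long[] toOffsets(long[] lengths) {
        long[] offsets = new long[lengths.length + 1];
        for (int i = 0; i < lengths.length; i++) {
            offsets[i + 1] = offsets[i] + lengths[i];
        }
        return offsets;
    }

    // Byte range [start, end) of one reduce partition inside the merged data file.
    static long[] rangeOf(long[] offsets, int reduceId) {
        return new long[] {offsets[reduceId], offsets[reduceId + 1]};
    }

    public static void main(String[] args) {
        long[] offsets = toOffsets(new long[] {3, 0, 5});
        System.out.println(Arrays.toString(offsets));             // [0, 3, 3, 8]
        System.out.println(Arrays.toString(rangeOf(offsets, 2))); // [3, 8]
    }
}
```

Storing the extra final offset means the end of the last block can be read the same way as every other block boundary, without consulting the size of the data file.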