BypassMergeSortShuffleWriter
Conditions for use:
(1) No map-side aggregation (mapSideCombine) is required.
(2) The number of partitions is no more than spark.shuffle.sort.bypassMergeThreshold, which defaults to 200.
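To make the check concrete, here is a minimal Java sketch of this decision. The real check lives in SortShuffleWriter.shouldBypassMergeSort (which is Scala); the class and parameter names below are illustrative stand-ins for the fields of the ShuffleDependency:

```java
final class BypassDecision {
  // Sketch of the check in SortShuffleWriter.shouldBypassMergeSort.
  static boolean shouldBypassMergeSort(boolean mapSideCombine,
                                       int numPartitions,
                                       int bypassMergeThreshold) {
    // Map-side aggregation forces the sort-based path, so bypass is impossible.
    if (mapSideCombine) {
      return false;
    }
    // Otherwise bypass only while the partition count stays at or below
    // spark.shuffle.sort.bypassMergeThreshold (200 by default).
    return numPartitions <= bypassMergeThreshold;
  }
}
```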
Let's start with write():
```java
public void write(Iterator<Product2<K, V>> records) throws IOException {
  assert (partitionWriters == null);
  if (!records.hasNext()) {
    partitionLengths = new long[numPartitions];
    shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, null);
    mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
    return;
  }
  final SerializerInstance serInstance = serializer.newInstance();
  final long openStartTime = System.nanoTime();
  // Create one writer per partition of the downstream stage
  partitionWriters = new DiskBlockObjectWriter[numPartitions];
  // ... and one FileSegment slot per partition
  partitionWriterSegments = new FileSegment[numPartitions];
  for (int i = 0; i < numPartitions; i++) {
    // Create a temporary shuffle file on disk
    final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
      blockManager.diskBlockManager().createTempShuffleBlock();
    final File file = tempShuffleBlockIdPlusFile._2();
    final BlockId blockId = tempShuffleBlockIdPlusFile._1();
    // Each writer is bound to its own file
    partitionWriters[i] =
      blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics);
  }
  // Creating the file to write to and creating a disk writer both involve interacting with
  // the disk, and can take a long time in aggregate when we open many files, so should be
  // included in the shuffle write time.
  writeMetrics.incWriteTime(System.nanoTime() - openStartTime);
  while (records.hasNext()) {
    final Product2<K, V> record = records.next();
    final K key = record._1();
    // Append the record to the file of its target partition; nothing is sorted here
    partitionWriters[partitioner.getPartition(key)].write(key, record._2());
  }
  for (int i = 0; i < numPartitions; i++) {
    final DiskBlockObjectWriter writer = partitionWriters[i];
    // A FileSegment describes a slice of a file: the file itself, a start offset, and a length
    partitionWriterSegments[i] = writer.commitAndGet();
    writer.close();
  }
  // write() handles the data of a single ShuffleMapTask, so there is exactly one
  // shuffleId and one mapId; together they uniquely identify this task's output
  File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
  File tmp = Utils.tempFileWith(output);
  try {
    // Merge the per-partition temp files into a single data file
    partitionLengths = writePartitionedFile(tmp);
    // Write the index file and commit
    shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
  } finally {
    if (tmp.exists() && !tmp.delete()) {
      logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
    }
  }
  mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
}
```
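The merge step is worth a closer look: writePartitionedFile simply concatenates the per-partition temp files in partition order and records each partition's byte length. A simplified, self-contained sketch of that idea (the class and method below are illustrative, not Spark's; the real method also supports NIO-based copying, deletes the temp files, and updates write metrics):

```java
import java.io.*;
import java.util.List;

final class ConcatSketch {
  // Illustrative stand-in for writePartitionedFile: concatenate the temp
  // files (one per partition, in partition order) into a single data file
  // and return the byte length contributed by each partition.
  static long[] concatPartitionFiles(List<File> partitionFiles, File output) throws IOException {
    long[] lengths = new long[partitionFiles.size()];
    try (FileOutputStream out = new FileOutputStream(output)) {
      for (int i = 0; i < partitionFiles.size(); i++) {
        File f = partitionFiles.get(i);
        if (f.exists()) {
          try (FileInputStream in = new FileInputStream(f)) {
            // Copy the whole temp file; its size is partition i's length.
            lengths[i] = in.transferTo(out); // Java 9+; a read/write loop otherwise
          }
        }
      }
    }
    return lengths;
  }
}
```

This also hints at why the bypassMergeThreshold cap exists: each task holds numPartitions open DiskBlockObjectWriters, each with its own file and buffer, so the approach only pays off while the partition count stays small.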
The overall flow:
- Each map task creates as many temp files as there are reduce partitions.
- Each record is appended directly to the file of the partition it belongs to; note that records are not buffered in memory first.
- After all records are written, the per-partition files are merged into a single data file and an index file is generated (see the sketch below).
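As for that index file: writeIndexFileAndCommit stores numPartitions + 1 longs, starting at 0, where each subsequent entry is the cumulative byte offset of a partition inside the data file, so a reducer fetching partition r only needs entries r and r + 1. A sketch of this layout, assuming the format just described (the helper class is illustrative, not Spark's API):

```java
import java.io.*;

final class IndexSketch {
  // Illustrative version of the index layout: numPartitions + 1 longs,
  // starting at 0, each one the cumulative byte offset into the data file.
  static void writeIndex(File indexFile, long[] partitionLengths) throws IOException {
    try (DataOutputStream out =
             new DataOutputStream(new BufferedOutputStream(new FileOutputStream(indexFile)))) {
      long offset = 0L;
      out.writeLong(offset);
      for (long length : partitionLengths) {
        offset += length;
        out.writeLong(offset);
      }
    }
  }

  // A reducer asking for partition r reads entries r and r + 1 to get
  // the [start, end) byte range of its block inside the data file.
  static long[] readRange(File indexFile, int reduceId) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(indexFile, "r")) {
      raf.seek((long) reduceId * 8);
      long start = raf.readLong();
      long end = raf.readLong();
      return new long[] {start, end};
    }
  }
}
```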