ShuffleWriter
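1. Overview
This analysis is based on Spark 2.x built for Scala 2.11 (the code shown corresponds to the 2.4 line).
Shuffle in Spark is a large framework in its own right; this article walks through how ShuffleWriter does its part within it.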
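2. ShuffleHandle registration
2.1. When registration happens
- shuffleHandle is one of the fields of the wide dependency, ShuffleDependency;
- When a ShuffleDependency is instantiated, it registers the shuffle with the shuffleManager, and the returned handle initializes the shuffleHandle field;
- Registering with the shuffleManager creates a ShuffleHandle object;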
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
@transient private val _rdd: RDD[_ <: Product2[K, V]],
val partitioner: Partitioner,
val serializer: Serializer = SparkEnv.get.serializer,
val keyOrdering: Option[Ordering[K]] = None,
val aggregator: Option[Aggregator[K, V, C]] = None,
val mapSideCombine: Boolean = false)
extends Dependency[Product2[K, V]] {
override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]
val shuffleId: Int = _rdd.context.newShuffleId()
// Register the shuffle with the shuffleManager; the returned handle initializes shuffleHandle
val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
shuffleId, _rdd.partitions.length, this)
}
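2.2. Registering the shuffle with the shuffleManager
Notes:
- In Spark 2.x, SortShuffleManager is the only built-in ShuffleManager implementation, so registration always ends up in SortShuffleManager;
- When it happens:
  - While a ShuffleDependency is being instantiated, its shuffleHandle field is initialized, and at that point the dependency registers the shuffle with the shuffleManager;
Summary (the handles are checked in this order):
- BypassMergeSortShuffleHandle:
  - Applies when: there is no map-side aggregation and the number of partitions does not exceed the threshold (200 by default);
  - Effect: writes numPartitions files directly and concatenates them at the end;
  - Advantage: avoids serializing and deserializing twice to merge spilled files;
  - Drawback: many files are open at once, so more memory is allocated to buffers;
- SerializedShuffleHandle:
  - Applies when all of the following hold:
    - the serializer supports relocation of serialized objects;
    - there is no map-side aggregation;
    - the number of partitions does not exceed 16777216;
  - Effect: map output is buffered in serialized form;
- BaseShuffleHandle:
  - Applies when: neither of the other two does;
  - Effect: map output is buffered in deserialized form;
  - Feature: supports map-side aggregation;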
private[spark] class SortShuffleManager(conf: SparkConf) extends ShuffleManager with Logging {
override def registerShuffle[K, V, C](
shuffleId: Int,
numMaps: Int,
dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
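// No map-side aggregation and no more partitions than the bypass threshold (200 by default)
if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
// Write numPartitions files directly and concatenate them at the end.
// This avoids serializing and deserializing twice to merge spilled files, which the normal code path would do. The downside is having many files open at once, so more memory is allocated to buffers.
new BypassMergeSortShuffleHandle[K, V](
shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
}
// Serializer supports relocation, no map-side aggregation, at most 16777216 partitions
else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
// Buffer map outputs in serialized form
new SerializedShuffleHandle[K, V](
shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
} else {
// Buffer map outputs in deserialized form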
new BaseShuffleHandle(shuffleId, numMaps, dependency)
}
}
}
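2.2.1. Check for BypassMergeSortShuffleHandle
Requirements:
- There is no map-side aggregation, and the number of partitions does not exceed the threshold;
- The threshold is set by the spark.shuffle.sort.bypassMergeThreshold parameter; the default is 200;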
private[spark] object SortShuffleWriter {
// True when there is no map-side aggregation and the partition count does not exceed the threshold (200 by default)
def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
// Sorting cannot be bypassed when map-side aggregation is required
if (dep.mapSideCombine) {
false
} else {
// spark.shuffle.sort.bypassMergeThreshold: default 200
val bypassMergeThreshold: Int = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
// The partition count is less than or equal to the threshold
dep.partitioner.numPartitions <= bypassMergeThreshold
}
}
}
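2.2.2. Check for SerializedShuffleHandle
Requirements:
- The serializer supports relocation of serialized objects;
- There is no map-side aggregation;
- The number of partitions does not exceed 16777216;
- All three conditions must hold at the same time;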
private[spark] object SortShuffleManager extends Logging {
//16777216
val MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE =
PackedRecordPointer.MAXIMUM_PARTITION_ID + 1
// Serializer supports relocation, no map-side aggregation, at most 16777216 partitions
def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
val shufId = dependency.shuffleId
val numPartitions = dependency.partitioner.numPartitions
// The serializer does not support relocation of serialized objects
if (!dependency.serializer.supportsRelocationOfSerializedObjects) {
log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
s"${dependency.serializer.getClass.getName}, does not support object relocation")
false
}
// Map-side aggregation is required
else if (dependency.mapSideCombine) {
log.debug(s"Can't use serialized shuffle for shuffle $shufId because we need to do " +
s"map-side aggregation")
false
}
// More than 16777216 partitions
else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
false
} else {
log.debug(s"Can use serialized shuffle for shuffle $shufId")
true
}
}
}
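The three-way choice made by registerShuffle can be condensed into one pure function. The following is a minimal sketch, not Spark code: the class ShuffleHandleChoice, the enum Kind and the method choose are hypothetical names, and the serializer capability and threshold are passed in as plain parameters instead of being read from a ShuffleDependency and SparkConf.

```java
final class ShuffleHandleChoice {
    // Hypothetical labels for the three handle types discussed above.
    enum Kind { BYPASS_MERGE_SORT, SERIALIZED, BASE }

    // 16777216 = 2^24, the partition ceiling quoted in the text for serialized mode.
    static final int MAX_SERIALIZED_PARTITIONS = 1 << 24;

    // Mirrors the order of the checks: bypass first, then serialized, then the fallback.
    static Kind choose(boolean mapSideCombine,
                       boolean serializerSupportsRelocation,
                       int numPartitions,
                       int bypassMergeThreshold) {
        if (!mapSideCombine && numPartitions <= bypassMergeThreshold) {
            return Kind.BYPASS_MERGE_SORT;
        }
        if (serializerSupportsRelocation
                && !mapSideCombine
                && numPartitions <= MAX_SERIALIZED_PARTITIONS) {
            return Kind.SERIALIZED;
        }
        return Kind.BASE;
    }

    public static void main(String[] args) {
        System.out.println(choose(false, true, 100, 200));   // small shuffle, no combine
        System.out.println(choose(false, true, 1000, 200));  // too many partitions for bypass
        System.out.println(choose(true, true, 100, 200));    // map-side combine forces the base path
    }
}
```

Because the bypass check comes first, a shuffle with few partitions and no map-side aggregation takes the bypass path even when its serializer would also qualify for serialized mode.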
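3. ShuffleWriter instantiation
3.1. When instantiation happens
Instantiation and use:
- When an executor runs a shuffle map task, the work is ultimately done by ShuffleMapTask.runTask();
- Inside ShuffleMapTask.runTask(), a ShuffleWriter is instantiated, and ShuffleWriter.write() then writes the data to disk;
3.1.1. Executor#run: the executor runs the task
Notes:
- In the Executor, running a shuffle map task ultimately calls ShuffleMapTask.runTask();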
private[spark] class Executor(
executorId: String,
executorHostname: String,
env: SparkEnv,
userClassPath: Seq[URL] = Nil,
isLocal: Boolean = false,
uncaughtExceptionHandler: UncaughtExceptionHandler = new SparkUncaughtExceptionHandler)
extends Logging {
// For a shuffle task this holds a ShuffleMapTask instance (in the Spark source, this field and run() live in the inner class Executor.TaskRunner)
@volatile var task: Task[Any] = _
override def run(): Unit = {
//--------- other code ---------
try {
//--------- other code ---------
val value = Utils.tryWithSafeFinally {
// For a shuffle task, task.run ends up calling ShuffleMapTask#runTask
val res = task.run(
taskAttemptId = taskId,
attemptNumber = taskDescription.attemptNumber,
metricsSystem = env.metricsSystem)
threwException = false
res
}
//--------- other code ---------
} catch {
//--------- other code ---------
} finally {
runningTasks.remove(taskId)
}
}
}
- Inside ShuffleMapTask.runTask(), a ShuffleWriter is instantiated, and ShuffleWriter.write() then writes the data to disk;
- When the ShuffleWriter is instantiated, the shuffleHandle held by the stage's dependency is passed in;
private[spark] class ShuffleMapTask(
stageId: Int,
stageAttemptId: Int,
taskBinary: Broadcast[Array[Byte]],
partition: Partition,
@transient private var locs: Seq[TaskLocation],
localProperties: Properties,
serializedTaskMetrics: Array[Byte],
jobId: Option[Int] = None,
appId: Option[String] = None,
appAttemptId: Option[String] = None,
isBarrier: Boolean = false)
extends Task[MapStatus](stageId, stageAttemptId, partition.index, localProperties,
serializedTaskMetrics, jobId, appId, appAttemptId, isBarrier)
with Logging {
override def runTask(context: TaskContext): MapStatus = {
//--------- other code ---------
val ser = SparkEnv.get.closureSerializer.newInstance()
val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
//--------- other code ---------
var writer: ShuffleWriter[Any, Any] = null
try {
val manager = SparkEnv.get.shuffleManager
// Get a writer from the shuffleManager, passing the shuffleHandle held by the dependency
writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
// Write the data to disk through the writer
writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
// Stop the writer and return its MapStatus
writer.stop(success = true).get
} catch {
//--------- other code ---------
}
}
}
3.1.2. Instantiating the ShuffleWriter
Notes:
- UnsafeShuffleWriter, BypassMergeSortShuffleWriter and SortShuffleWriter are all subclasses of ShuffleWriter;
- SerializedShuffleHandle, BypassMergeSortShuffleHandle and BaseShuffleHandle are subclasses of ShuffleHandle;
- The concrete ShuffleHandle subclass determines which ShuffleWriter subclass is instantiated;
private[spark] class SortShuffleManager(conf: SparkConf) extends ShuffleManager with Logging {
override def getWriter[K, V](
handle: ShuffleHandle,
mapId: Int,
context: TaskContext): ShuffleWriter[K, V] = {
numMapsForShuffle.putIfAbsent(
handle.shuffleId, handle.asInstanceOf[BaseShuffleHandle[_, _, _]].numMaps)
val env = SparkEnv.get
// Instantiate the writer that matches the type of the ShuffleHandle held by the RDD dependency
handle match {
case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
new UnsafeShuffleWriter(
env.blockManager,
shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
context.taskMemoryManager(),
unsafeShuffleHandle,
mapId,
context,
env.conf)
case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
new BypassMergeSortShuffleWriter(
env.blockManager,
shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
bypassMergeSortHandle,
mapId,
context,
env.conf)
case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
new SortShuffleWriter(shuffleBlockResolver, other, mapId, context)
}
}
}
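4. BypassMergeSortShuffleWriter
4.1. Instantiation
The main fields:
- fileBufferSize: size of the file buffer used when writing data to disk, 32 KB by default;
  - set via the spark.shuffle.file.buffer parameter;
- transferToEnabled: whether file data is copied with NIO (transferTo) when merging the temp files; true by default;
- partitionWriterSegments: a FileSegment array holding the per-partition temp files;
- partitionLengths: a long array holding the size of each partition's data;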
final class BypassMergeSortShuffleWriter<K, V> extends ShuffleWriter<K, V> {
private static final Logger logger = LoggerFactory.getLogger(BypassMergeSortShuffleWriter.class);
// File buffer used when writing to disk, 32 KB by default; set via spark.shuffle.file.buffer
private final int fileBufferSize;
// Whether file data is copied with NIO when merging temp files; true by default; set via spark.file.transferTo
private final boolean transferToEnabled;
// Number of partitions
private final int numPartitions;
private final BlockManager blockManager;
// Partitioner
private final Partitioner partitioner;
private final ShuffleWriteMetrics writeMetrics;
// Unique id of this shuffle
private final int shuffleId;
private final int mapId;
private final Serializer serializer;
// Creates and maintains the mapping between logical shuffle blocks and physical file locations
private final IndexShuffleBlockResolver shuffleBlockResolver;
// One writer per partition
private DiskBlockObjectWriter[] partitionWriters;
// One temp file segment per partition
private FileSegment[] partitionWriterSegments;
// Status of the map output
@Nullable private MapStatus mapStatus;
// Size of each partition's data
private long[] partitionLengths;
private boolean stopping = false;
}
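4.2. write
How it works:
- Create one temp file per partition through the blockManager, and build a disk writer for the partition on top of it;
- Iterate over the records, appending each one to the output stream of its partition's temp file;
- Flush, so that the data in each output stream actually lands in the temp file on disk;
- Concatenate all partitions' temp files into one large data file and generate the matching index file;
- Record the map output status;
Notes:
- One call to write produces as many temp files as there are partitions, one per partition; the merged result is a single data file with a single index file;
- The index file records where each partition's data starts in the data file, from which its length follows;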
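How the per-partition lengths returned by write relate to the contents of the index file can be sketched as a running sum. This is a simplified illustration with hypothetical names (IndexOffsets, fromLengths); it assumes the index holds numPartitions + 1 cumulative offsets, so that partition i occupies the byte range [offsets[i], offsets[i + 1]) of the data file.

```java
import java.util.Arrays;

final class IndexOffsets {
    // Turns per-partition lengths into cumulative offsets.
    // Entry i is where partition i starts; the last entry is the total data size.
    static long[] fromLengths(long[] partitionLengths) {
        long[] offsets = new long[partitionLengths.length + 1];
        for (int i = 0; i < partitionLengths.length; i++) {
            offsets[i + 1] = offsets[i] + partitionLengths[i];
        }
        return offsets;
    }

    public static void main(String[] args) {
        long[] offsets = fromLengths(new long[] {10L, 0L, 25L});
        // An empty partition shows up as two equal neighbouring offsets.
        System.out.println(Arrays.toString(offsets));
    }
}
```

A reducer that wants partition i only needs two neighbouring offsets to know which slice of the single data file to read.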
final class BypassMergeSortShuffleWriter<K, V> extends ShuffleWriter<K, V> {
public void write(Iterator<Product2<K, V>> records) throws IOException {
assert this.partitionWriters == null;
if (!records.hasNext()) { // No records: write an index file with all-zero lengths
this.partitionLengths = new long[this.numPartitions];
this.shuffleBlockResolver.writeIndexFileAndCommit(this.shuffleId, this.mapId, this.partitionLengths, (File)null);
this.mapStatus = MapStatus$.MODULE$.apply(this.blockManager.shuffleServerId(), this.partitionLengths);
} else {
// Create a serializer instance
SerializerInstance serInstance = this.serializer.newInstance();
long openStartTime = System.nanoTime();
// Allocate the array of per-partition writers
this.partitionWriters = new DiskBlockObjectWriter[this.numPartitions];
// Allocate the array of per-partition temp file segments
this.partitionWriterSegments = new FileSegment[this.numPartitions];
// Ask the blockManager for a temp file per partition and build a disk writer on top of it;
// each partition gets exactly one disk writer and one temp file
int i;
for(i = 0; i < this.numPartitions; ++i) {
// Create a temp shuffle block through the blockManager: (blockId, temp file)
Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile = this.blockManager.diskBlockManager().createTempShuffleBlock();
File file = (File)tempShuffleBlockIdPlusFile._2();
BlockId blockId = (BlockId)tempShuffleBlockIdPlusFile._1();
// Build this partition's disk writer, bound to the temp file and its output stream
this.partitionWriters[i] = this.blockManager.getDiskWriter(blockId, file, serInstance, this.fileBufferSize, this.writeMetrics);
}
this.writeMetrics.incWriteTime(System.nanoTime() - openStartTime);
// Append each record to the output stream of its partition's temp file
while(records.hasNext()) {
Product2<K, V> record = (Product2)records.next();
K key = record._1();
// Route the record to its partition's writer
this.partitionWriters[this.partitioner.getPartition(key)].write(key, record._2());
}
// Flush every partition's output stream to its temp file
for(i = 0; i < this.numPartitions; ++i) {
// This partition's disk writer
DiskBlockObjectWriter writer = this.partitionWriters[i];
// Commit the writes; returns a FileSegment (file, offset, length)
this.partitionWriterSegments[i] = writer.commitAndGet();
writer.close();
}
// Resolve the final data file
File output = this.shuffleBlockResolver.getDataFile(this.shuffleId, this.mapId);
// Create a temp file next to the data file
File tmp = Utils.tempFileWith(output);
try {
// Concatenate the per-partition temp files into tmp; returns each partition's length
this.partitionLengths = this.writePartitionedFile(tmp);
// Write the index file and commit tmp as the data file
this.shuffleBlockResolver.writeIndexFileAndCommit(this.shuffleId, this.mapId, this.partitionLengths, tmp);
} finally {
if (tmp.exists() && !tmp.delete()) {
logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
}
}
// Record the map output status
this.mapStatus = MapStatus$.MODULE$.apply(this.blockManager.shuffleServerId(), this.partitionLengths);
}
}
}
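4.2.1. writePartitionedFile: merging the temp files
Notes:
- The partitions' temp files are copied into the data file one by one, in partition order;
- By default the copy is done with NIO;
  - this can be changed with the spark.file.transferTo parameter;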
final class BypassMergeSortShuffleWriter<K, V> extends ShuffleWriter<K, V> {
private long[] writePartitionedFile(File outputFile) throws IOException {
// Allocate the array of per-partition lengths
final long[] lengths = new long[numPartitions];
if (partitionWriters == null) {
// No partition writers means they were never created: nothing was written and no temp files exist,
// so return the all-zero array
return lengths;
}
// Open an output stream on the data file (append mode)
final FileOutputStream out = new FileOutputStream(outputFile, true);
final long writeStartTime = System.nanoTime();
boolean threwException = true;
try {
// Iterate over the partitions
for (int i = 0; i < numPartitions; i++) {
// This partition's temp file
final File file = partitionWriterSegments[i].file();
if (file.exists()) {
// Open an input stream on the temp file
final FileInputStream in = new FileInputStream(file);
boolean copyThrewException = true;
try {
// Copy the partition's temp file into the data file;
// transferToEnabled is true by default, so the copy uses NIO
lengths[i] = Utils.copyStream(in, out, false, transferToEnabled);
copyThrewException = false;
} finally {
Closeables.close(in, copyThrewException);
}
if (!file.delete()) {
logger.error("Unable to delete file for partition {}", i);
}
}
}
threwException = false;
} finally {
Closeables.close(out, threwException);
writeMetrics.incWriteTime(System.nanoTime() - writeStartTime);
}
partitionWriters = null;
return lengths;
}
}
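The NIO copy that writePartitionedFile delegates to can be illustrated with plain JDK classes. This is a standalone sketch, not the Spark implementation: PartitionFileConcat and concat are hypothetical names, and it uses java.nio.channels.FileChannel.transferTo directly where Spark goes through Utils.copyStream.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

final class PartitionFileConcat {
    // Appends each per-partition file to `output` in partition order and
    // returns the number of bytes contributed by each partition.
    static long[] concat(List<Path> partitionFiles, Path output) throws IOException {
        long[] lengths = new long[partitionFiles.size()];
        try (FileChannel out = FileChannel.open(output,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int i = 0; i < partitionFiles.size(); i++) {
                Path part = partitionFiles.get(i);
                if (!Files.exists(part)) {
                    continue; // a missing file contributes a length of 0
                }
                try (FileChannel in = FileChannel.open(part, StandardOpenOption.READ)) {
                    long size = in.size();
                    long copied = 0;
                    // transferTo may copy fewer bytes than requested, so loop until done.
                    while (copied < size) {
                        copied += in.transferTo(copied, size - copied, out);
                    }
                    lengths[i] = copied;
                }
                Files.delete(part); // the temp file is no longer needed
            }
        }
        return lengths;
    }
}
```

The returned array plays the role of partitionLengths: it is all the index file needs, since the data file itself carries no partition boundaries.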
5. SortShuffleWriter
5.1. Instantiation
Instantiating SortShuffleWriter is comparatively simple; there are few fields to initialize;
private[spark] class SortShuffleWriter[K, V, C](
shuffleBlockResolver: IndexShuffleBlockResolver,
handle: BaseShuffleHandle[K, V, C],
mapId: Int,
context: TaskContext)
extends ShuffleWriter[K, V] with Logging {
// The shuffle dependency
private val dep = handle.dependency
private val blockManager = SparkEnv.get.blockManager
// The sorter
private var sorter: ExternalSorter[K, V, _] = null
private var stopping = false
private var mapStatus: MapStatus = null
private val writeMetrics = context.taskMetrics().shuffleWriteMetrics
}
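5.2. write
How it works:
- Build the sorter;
- Insert all records into the sorter;
  - The data is cached in memory as a PartitionedAppendOnlyMap or a PartitionedPairBuffer;
  - When the cached data meets the spill condition, the whole in-memory collection is spilled to disk into one temp file;
- Write the sorter's data out to a temp file of the data file;
- Write the index file for it and commit;
- Record the map output status;
Special notes:
- Data is written to the output file in ascending partition-id order;
- With map-side aggregation, the data is aggregated first and then written to the file;
- One call to SortShuffleWriter.write() produces one data file on disk and one matching index file;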
private[spark] class SortShuffleWriter[K, V, C](
shuffleBlockResolver: IndexShuffleBlockResolver,
handle: BaseShuffleHandle[K, V, C],
mapId: Int,
context: TaskContext)
extends ShuffleWriter[K, V] with Logging {
override def write(records: Iterator[Product2[K, V]]): Unit = {
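// Build the sorter
sorter = if (dep.mapSideCombine) {
// Map-side aggregation: the sorter needs the aggregator and the key ordering
new ExternalSorter[K, V, C](
context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
} else {
// In this case we pass neither an aggregator nor an ordering to the sorter, because we don't care whether the keys get sorted in each partition; if the operation being run is sortByKey, that is done on the reduce side
new ExternalSorter[K, V, V](
context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
}
// Insert the records into the sorter: cached data is spilled to disk when the spill condition is met, otherwise it stays in memory
sorter.insertAll(records)
// Resolve the final data file
val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
// Create a temp file next to the data file
val tmp = Utils.tempFileWith(output)
try {
// Build the blockId
val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
// Merge spilled files and in-memory data into tmp, sorting and aggregating as configured
val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
// Write the index file and commit tmp as the data file
shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
// Record the map output status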
mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
} finally {
if (tmp.exists() && !tmp.delete()) {
logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
}
}
}
}
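5.2.1. ExternalSorter#insertAll: inserting data into the ExternalSorter
In-memory caches:
- PartitionedAppendOnlyMap
  - Used when map-side aggregation is required;
  - Values are aggregated under the key (partitionId, key);
  - One entry is ((partitionId, key), aggregated value);
- PartitionedPairBuffer
  - Used when there is no map-side aggregation;
  - One entry is (partitionId, key, value), appended without merging;
Steps:
- Iterate over the records and cache them one by one, depending on whether map-side aggregation is in effect;
  - With map-side aggregation, records go into the PartitionedAppendOnlyMap, and values are merged under the key (partitionId, key) as they are inserted;
  - Without it, each record is appended to the PartitionedPairBuffer together with its partition id;
- After every record, check whether the cached data needs to be spilled to disk;
Special notes:
- If no spill happens, the data stays in memory in the collection;
- The records held by the ExternalSorter can therefore be in one of three states:
  - all cached in memory in the collection;
  - all on disk as temp files, which the ExternalSorter tracks in its spills field;
  - partly cached in memory and partly on disk as temp files;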
private[spark] class ExternalSorter[K, V, C](
context: TaskContext,
aggregator: Option[Aggregator[K, V, C]] = None,
partitioner: Option[Partitioner] = None,
ordering: Option[Ordering[K]] = None,
serializer: Serializer = SparkEnv.get.serializer)
extends Spillable[WritablePartitionedPairCollection[K, C]](context.taskMemoryManager())
with Logging {
@volatile private var map = new PartitionedAppendOnlyMap[K, C]
@volatile private var buffer = new PartitionedPairBuffer[K, C]
def insertAll(records: Iterator[Product2[K, V]]): Unit = {
// A defined aggregator means map-side aggregation is required
val shouldCombine = aggregator.isDefined
// Map-side pre-aggregation: data is cached in a PartitionedAppendOnlyMap
if (shouldCombine) {
// Combine values in-memory first using our AppendOnlyMap
val mergeValue = aggregator.get.mergeValue
val createCombiner = aggregator.get.createCombiner
var kv: Product2[K, V] = null
// Aggregation: merge into the existing value if there is one, otherwise create a new combiner
val update = (hadValue: Boolean, oldValue: C) => {
if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
}
// Iterate over the records
while (records.hasNext) {
// Increment the count of elements read since the last spill
addElementsRead()
// Read one record
kv = records.next()
// Aggregate under the key (partition, key)
map.changeValue((getPartition(kv._1), kv._1), update)
// Check whether the cached data needs to be spilled to disk
maybeSpillCollection(usingMap = true)
}
} else {
// No map-side aggregation: data is cached in a PartitionedPairBuffer
while (records.hasNext) {
addElementsRead()
val kv = records.next()
buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
maybeSpillCollection(usingMap = false)
}
}
}
}
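The map-side aggregation branch boils down to a map keyed by (partitionId, key) whose values are merged on insert. Below is a minimal sketch for a word-count style aggregation; the class MapSideCombineSketch, the method combineCounts and the hash-based partitioning are illustrative assumptions, and a java.util.HashMap stands in for PartitionedAppendOnlyMap.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MapSideCombineSketch {
    // Counts occurrences per (partitionId, key); every input key carries the value 1.
    static Map<List<Object>, Integer> combineCounts(List<String> keys, int numPartitions) {
        Map<List<Object>, Integer> map = new HashMap<>();
        for (String key : keys) {
            // Assumed partitioning rule: non-negative hash modulo the partition count.
            int partitionId = Math.floorMod(key.hashCode(), numPartitions);
            List<Object> compositeKey = Arrays.<Object>asList(partitionId, key);
            Integer old = map.get(compositeKey);
            if (old == null) {
                map.put(compositeKey, 1);        // plays the role of createCombiner
            } else {
                map.put(compositeKey, old + 1);  // plays the role of mergeValue
            }
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println(combineCounts(Arrays.asList("a", "b", "a"), 4));
    }
}
```

Keeping the partition id in the key is what later allows the collection to be sorted by partition first and by key second without consulting the partitioner again.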
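5.2.1.1. ExternalSorter#maybeSpillCollection: deciding whether to spill the cached data to disk
Spill conditions:
- The number of elements read is a multiple of 32 && the estimated size of the collection in bytes is not below the current threshold && it is still not below the threshold after the attempt to raise it;
- The number of elements read since the last spill exceeds the force-spill threshold, Integer.MAX_VALUE (0x7fffffff) by default;
- A spill happens when either of the two conditions holds;
Spilling:
- The actual spill logic is implemented by ExternalSorter#spill;
- Each spill produces one temp spill file;
- All spill files of the data added to the ExternalSorter are tracked in the ExternalSorter#spills field;
Collection threshold:
- Initial threshold
  - Set by the spark.shuffle.spill.initialMemoryThreshold parameter;
  - 5 MiB by default;
- Raising the threshold
  - Amount requested each time: 2 * current collection size - current threshold; the threshold grows by whatever part of the request is granted;
  - When: whenever the current collection size >= the current threshold (checked every 32 elements);
Special note:
- After the cached data has been spilled to disk, the collection used to cache records is rebuilt from scratch;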
private[spark] class ExternalSorter[K, V, C](
context: TaskContext,
aggregator: Option[Aggregator[K, V, C]] = None,
partitioner: Option[Partitioner] = None,
ordering: Option[Ordering[K]] = None,
serializer: Serializer = SparkEnv.get.serializer)
extends Spillable[WritablePartitionedPairCollection[K, C]](context.taskMemoryManager())
with Logging {
@volatile private var map = new PartitionedAppendOnlyMap[K, C]
@volatile private var buffer = new PartitionedPairBuffer[K, C]
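// Peak size, in bytes, of the in-memory data structure observed so far
private var _peakMemoryUsedBytes: Long = 0L
def peakMemoryUsedBytes: Long = _peakMemoryUsedBytes
private def maybeSpillCollection(usingMap: Boolean): Unit = {
var estimatedSize = 0L
if (usingMap) { // Map-side aggregation
// Estimate the current size of the collection in bytes
estimatedSize = map.estimateSize()
// Spill the in-memory data to disk if needed
if (maybeSpill(map, estimatedSize)) {
// After a spill, start over with a fresh collection
map = new PartitionedAppendOnlyMap[K, C]
}
} else {
// Estimate the current size of the collection in bytes
estimatedSize = buffer.estimateSize()
// Spill the in-memory data to disk if needed
if (maybeSpill(buffer, estimatedSize)) {
// After a spill, start over with a fresh collection
buffer = new PartitionedPairBuffer[K, C]
}
}
// Update the peak size of the in-memory data structure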
if (estimatedSize > _peakMemoryUsedBytes) {
_peakMemoryUsedBytes = estimatedSize
}
}
}
private[spark] abstract class Spillable[C](taskMemoryManager: TaskMemoryManager)
extends MemoryConsumer(taskMemoryManager) with Logging {
// Number of elements read from the input since the last spill
protected def elementsRead: Int = _elementsRead
private[this] var _elementsRead = 0
// Initial threshold for the collection's size: 5 MiB by default
private[this] val initialMemoryThreshold: Long =
SparkEnv.get.conf.getLong("spark.shuffle.spill.initialMemoryThreshold", 5 * 1024 * 1024)
// Threshold for the collection's size in bytes.
// To avoid a large number of small spills, it starts at a value orders of magnitude > 0
@volatile private[this] var myMemoryThreshold = initialMemoryThreshold
// Total number of bytes spilled
@volatile private[this] var _memoryBytesSpilled = 0L
// Number of spills
private[this] var _spillCount = 0
// Spill the current in-memory collection to disk if needed; tries to acquire more memory before spilling
protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
var shouldSpill = false
// Checked only every 32 elements, and only when the collection's memory use has reached the threshold
if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
// Try to raise the threshold: request 2 * current usage - current threshold
val amountToRequest = 2 * currentMemory - myMemoryThreshold
val granted = acquireMemory(amountToRequest)
myMemoryThreshold += granted
// If too little was granted and the estimated size still reaches the threshold, spill
shouldSpill = currentMemory >= myMemoryThreshold
}
// Even if the check above says no, spill when the number of elements read exceeds the force-spill threshold
// numElementsForceSpillThreshold: Integer.MAX_VALUE (0x7fffffff) by default
shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
// Perform the spill
if (shouldSpill) {
// Increment the spill count
_spillCount += 1
logSpillage(currentMemory)
// Spill
spill(collection)
// Reset the element counter
_elementsRead = 0
// Accumulate the total bytes spilled
_memoryBytesSpilled += currentMemory
// Release memory
releaseMemory()
}
// Return whether a spill happened
shouldSpill
}
}
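The memory-based half of the spill decision can be replayed in isolation. The sketch below is a simplification with hypothetical names (SpillDecision, shouldSpill): the memory manager is reduced to a maxGrant parameter stating how many bytes it is willing to hand out, and the element-count force-spill check is left out.

```java
final class SpillDecision {
    // Current size threshold in bytes; starts at the initial threshold.
    long threshold;

    SpillDecision(long initialThreshold) {
        this.threshold = initialThreshold;
    }

    // Returns true when the collection should be spilled.
    // maxGrant models how much extra memory can be obtained (an assumption of this sketch).
    boolean shouldSpill(int elementsRead, long currentMemory, long maxGrant) {
        boolean spill = false;
        if (elementsRead % 32 == 0 && currentMemory >= threshold) {
            long requested = 2 * currentMemory - threshold;
            long granted = Math.min(requested, maxGrant);
            threshold += granted;
            // Spill only if the threshold could not be raised above the current size.
            spill = currentMemory >= threshold;
        }
        return spill;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        SpillDecision d = new SpillDecision(5 * mib);
        // Plenty of memory available: the threshold doubles past the current size, no spill.
        System.out.println(d.shouldSpill(32, 6 * mib, Long.MAX_VALUE) + " threshold=" + d.threshold);
        // Nothing can be granted any more: the collection must spill.
        System.out.println(d.shouldSpill(64, 13 * mib, 0));
    }
}
```

When the full request is granted, the new threshold is exactly twice the current size, which is why a growing collection is re-examined at geometrically spaced sizes instead of at every check.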
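5.2.1.2. ExternalSorter#spill: spilling the cached data to disk
Steps:
- Sort the data in the collection with the comparator and obtain an iterator over the sorted data;
- Spill the sorted data to disk and return the spill file;
  - Create a temp file through the diskBlockManager;
  - Create a writer for the temp file through the blockManager;
  - Iterate over the sorted data, adding records to the writer one by one and counting them;
  - Every 10000 records, flush so that the writer's data lands in the temp file as a batch;
  - After the iteration, flush the remaining records (fewer than 10000) to the temp file;
  - Return the temp file;
- Add the spill file to the collection of spill files;
Special note:
- The collection's data is written to the disk file batch by batch; the batch size is set by the spark.shuffle.spill.batchSize parameter, 10000 by default;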
private[spark] class ExternalSorter[K, V, C](
context: TaskContext,
aggregator: Option[Aggregator[K, V, C]] = None,
partitioner: Option[Partitioner] = None,
ordering: Option[Ordering[K]] = None,
serializer: Serializer = SparkEnv.get.serializer)
extends Spillable[WritablePartitionedPairCollection[K, C]](context.taskMemoryManager())
with Logging {
private val spills = new ArrayBuffer[SpilledFile]
private val serInstance = serializer.newInstance()
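// File buffer size, 32 KB by default
private val fileBufferSize = conf.getSizeAsKb("spark.shuffle.file.buffer", "32k").toInt * 1024
// Number of records per batch, 10000 by default
private val serializerBatchSize = conf.getLong("spark.shuffle.spill.batchSize", 10000)
// Number of elements in each partition
val elementsPerPartition = new Array[Long](numPartitions)
// Spill the collection to disk
override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {
// Sort the collection with the comparator and obtain an iterator over the sorted data:
// with a comparator, sort by partition id and then by key within each partition;
// without one, sort by partition id only
val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
// Spill the data to disk
val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
// Add the file to the list of spill files
spills += spillFile
}
// Key comparator: the user-supplied ordering if there is one, otherwise a comparison by key hash code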
private val keyComparator: Comparator[K] = ordering.getOrElse(new Comparator[K] {
override def compare(a: K, b: K): Int = {
val h1 = if (a == null) 0 else a.hashCode()
val h2 = if (b == null) 0 else b.hashCode()
if (h1 < h2) -1 else if (h1 == h2) 0 else 1
}
})
// A key comparator is used only when an ordering or an aggregator is defined
private def comparator: Option[Comparator[K]] = {
if (ordering.isDefined || aggregator.isDefined) {
Some(keyComparator)
} else {
None
}
}
// Spill the data to disk
private[this] def spillMemoryIteratorToDisk(inMemoryIterator: WritablePartitionedIterator)
: SpilledFile = {
// Because spill files may be read during the shuffle, their compression is not governed by the spill-specific settings
// Create a temp shuffle block
val (blockId, file) = diskBlockManager.createTempShuffleBlock()
// Number of objects written since the last flush
var objectsWritten: Long = 0
val spillMetrics: ShuffleWriteMetrics = new ShuffleWriteMetrics
// Get a disk writer for the temp block
val writer: DiskBlockObjectWriter =
blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, spillMetrics)
// List of batch sizes (bytes) in the order they are written to disk
val batchSizes = new ArrayBuffer[Long]
// How many elements we have in each partition
val elementsPerPartition = new Array[Long](numPartitions)
// Flush the current batch; commitAndGet returns a FileSegment (file, offset, length)
def flush(): Unit = {
// Flush the data in the writer's serialization stream to the temp file
val segment = writer.commitAndGet()
batchSizes += segment.length
_diskBytesSpilled += segment.length
objectsWritten = 0
}
var success = false
try {
// Iterate over the records
while (inMemoryIterator.hasNext) {
val partitionId = inMemoryIterator.nextPartition()
require(partitionId >= 0 && partitionId < numPartitions,
s"partition Id: ${partitionId} should be in the range [0, ${numPartitions})")
// Add the record to the writer's serialization stream
inMemoryIterator.writeNext(writer)
// Count the record against its partition
elementsPerPartition(partitionId) += 1
// Count the record against the current batch
objectsWritten += 1
// Flush to the disk file once per batch (10000 records by default)
if (objectsWritten == serializerBatchSize) {
flush()
}
}
// Iteration done: flush the remaining records of the last, partial batch
if (objectsWritten > 0) {
flush()
} else {
writer.revertPartialWritesAndClose()
}
success = true
} finally {
if (success) {
writer.close()
} else {
// On failure, roll back the partial writes and delete the spill file
writer.revertPartialWritesAndClose()
if (file.exists()) {
if (!file.delete()) {
logWarning(s"Error deleting ${file}")
}
}
}
}
// Return the spill file
SpilledFile(file, blockId, batchSizes.toArray, elementsPerPartition)
}
}
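The batching in spillMemoryIteratorToDisk follows a simple counter pattern: flush whenever the counter reaches the batch size, then flush once more for whatever is left. The sketch below isolates that pattern; SpillBatchSketch and recordsPerFlush are hypothetical names, and a list of record counts stands in for the byte sizes collected in batchSizes.

```java
import java.util.ArrayList;
import java.util.List;

final class SpillBatchSketch {
    // Returns how many records each flush would cover for the given input size.
    static List<Integer> recordsPerFlush(int totalRecords, int batchSize) {
        List<Integer> flushes = new ArrayList<>();
        int objectsWritten = 0;
        for (int i = 0; i < totalRecords; i++) {
            objectsWritten++;
            if (objectsWritten == batchSize) {
                flushes.add(objectsWritten); // a full batch is flushed
                objectsWritten = 0;
            }
        }
        if (objectsWritten > 0) {
            flushes.add(objectsWritten);     // the final, partial batch
        }
        return flushes;
    }

    public static void main(String[] args) {
        System.out.println(recordsPerFlush(25000, 10000));
    }
}
```

An input whose size is an exact multiple of the batch size produces no trailing flush, which matches the branch in the original code that skips flush() when objectsWritten is 0.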
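5.2.2. ExternalSorter#writePartitionedFile: producing the complete data file
Purpose:
- Merge the spilled temp files with the data cached in memory to produce one complete data file;
Details:
- No spill has happened
  - Sort the cached data by partition id and key;
  - Write the sorted data to the output file partition by partition (one commit per partition);
- Spills have happened
  - Merge the data of the spill files on disk with the data cached in memory, partition by partition;
  - Write the merged data to the output file partition by partition (one commit per partition);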
private[spark] class ExternalSorter[K, V, C](
context: TaskContext,
aggregator: Option[Aggregator[K, V, C]] = None,
partitioner: Option[Partitioner] = None,
ordering: Option[Ordering[K]] = None,
serializer: Serializer = SparkEnv.get.serializer)
extends Spillable[WritablePartitionedPairCollection[K, C]](context.taskMemoryManager())
with Logging {
def writePartitionedFile(
blockId: BlockId,
outputFile: File): Array[Long] = {
// Track location of each range in the output file
val lengths = new Array[Long](numPartitions)
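// Get a writer for the output file through the blockManager
val writer = blockManager.getDiskWriter(blockId, outputFile, serInstance, fileBufferSize,
context.taskMetrics().shuffleWriteMetrics)
if (spills.isEmpty) { // No spill has happened
// Pick the collection that holds the in-memory data
val collection = if (aggregator.isDefined) map else buffer
// Sort the data by partition id and key
val it = collection.destructiveSortedWritablePartitionedIterator(comparator)
// Iterate over the sorted data, writing it to the output file partition by partition
while (it.hasNext) {
val partitionId = it.nextPartition()
// Add all records of one partition to the writer's serialization stream, so a partition's data stays together
while (it.hasNext && it.nextPartition() == partitionId) {
it.writeNext(writer)
}
// Commit the stream to the file; returns a FileSegment (file, offset, length)
val segment = writer.commitAndGet()
// Record the partition's length
lengths(partitionId) = segment.length
}
} else { // Spills have happened
// Merge the spill files with the in-memory data, then write to the output file partition by partition
for ((id, elements) <- this.partitionedIterator) {
if (elements.hasNext) {
// Add the partition's records to the writer's serialization stream
for (elem <- elements) {
writer.write(elem._1, elem._2)
}
// Commit the stream to the file; returns a FileSegment (file, offset, length)
val segment = writer.commitAndGet()
// Record the partition's length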
lengths(id) = segment.length
}
}
}
writer.close()
context.taskMetrics().incMemoryBytesSpilled(memoryBytesSpilled)
context.taskMetrics().incDiskBytesSpilled(diskBytesSpilled)
context.taskMetrics().incPeakExecutionMemory(peakMemoryUsedBytes)
lengths
}
}
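5.2.2.1. ExternalSorter#partitionedIterator: merging on-disk and in-memory data
Details:
- Read the temp files on disk and the data cached in memory;
- Iterate over the partitions, combining the on-disk and in-memory data that belong to the same partition;
- Aggregate and sort the combined data:
  - With an aggregator: perform partial aggregation across partitions, merging values as the aggregator defines and finally ordering by key;
  - Without an aggregator but with an ordering: sort the data by the ordering without merging values;
  - With neither: return the combined data as is;
- Return an iterator of (partition, data of that partition) pairs;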
private[spark] class ExternalSorter[K, V, C](
context: TaskContext,
aggregator: Option[Aggregator[K, V, C]] = None,
partitioner: Option[Partitioner] = None,
ordering: Option[Ordering[K]] = None,
serializer: Serializer = SparkEnv.get.serializer)
extends Spillable[WritablePartitionedPairCollection[K, C]](context.taskMemoryManager())
with Logging {
def partitionedIterator: Iterator[(Int, Iterator[Product2[K, C]])] = {
val usingMap = aggregator.isDefined
val collection: WritablePartitionedPairCollection[K, C] = if (usingMap) map else buffer
if (spills.isEmpty) { // No spill has happened
// No ordering defined
if (!ordering.isDefined) {
// Sort by partition ID only, not by key
groupByPartition(destructiveIterator(collection.partitionedDestructiveSortedIterator(None)))
} else { // An ordering is defined
// Sort by partition ID and then by key
groupByPartition(destructiveIterator(
collection.partitionedDestructiveSortedIterator(Some(keyComparator))))
}
} else { // Spills have happened
// Merge the spilled data with the in-memory data
merge(spills, destructiveIterator(
collection.partitionedDestructiveSortedIterator(comparator)))
}
}
// Merge spilled and in-memory data
private def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])
: Iterator[(Int, Iterator[Product2[K, C]])] = {
// Open a reader over each spill file on disk
val readers = spills.map(new SpillReader(_))
// Buffered view of the in-memory data
val inMemBuffered = inMemory.buffered
// Iterate over the partitions
(0 until numPartitions).iterator.map { p =>
// In-memory data belonging to the current partition
val inMemIterator = new IteratorForPartition(p, inMemBuffered)
// Combine the on-disk and in-memory iterators for this partition
val iterators = readers.map(_.readNextPartition()) ++ Seq(inMemIterator)
if (aggregator.isDefined) { // An aggregator is defined
// Perform partial aggregation across partitions: merge values as the aggregator defines, ordered by key
(p, mergeWithAggregation(
iterators, aggregator.get.mergeCombiners, keyComparator, ordering.isDefined))
} else if (ordering.isDefined) { // No aggregator, but an ordering is defined
// Sort the elements without trying to merge them
(p, mergeSort(iterators, ordering.get))
} else { // Neither an aggregator nor an ordering
// Return the concatenated iterators
(p, iterators.iterator.flatten)
}
}
}
}
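The ordering-only branch merges several already-sorted inputs (one per spill file plus the in-memory data) into one sorted stream. A common way to do this is a heap-based k-way merge, sketched below on in-memory lists. SortedRunMerger and merge are hypothetical names, and the code illustrates the general technique, not Spark's mergeSort itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class SortedRunMerger {
    // Merges runs that are each sorted by `cmp` into one sorted list.
    static <T> List<T> merge(List<List<T>> sortedRuns, Comparator<T> cmp) {
        // Each heap entry is {runIndex, positionInRun}, ordered by the element it points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) ->
                cmp.compare(sortedRuns.get(a[0]).get(a[1]), sortedRuns.get(b[0]).get(b[1])));
        for (int r = 0; r < sortedRuns.size(); r++) {
            if (!sortedRuns.get(r).isEmpty()) {
                heap.add(new int[] {r, 0});
            }
        }
        List<T> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            List<T> run = sortedRuns.get(head[0]);
            out.add(run.get(head[1]));
            // Re-insert the run with its next element, if it has one.
            if (head[1] + 1 < run.size()) {
                heap.add(new int[] {head[0], head[1] + 1});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = Arrays.asList(
                Arrays.asList(1, 4, 7), Arrays.asList(2, 5), Arrays.asList(3, 6, 8));
        System.out.println(merge(runs, Comparator.<Integer>naturalOrder()));
    }
}
```

Only one element per run sits in the heap at any time, so the memory needed for the merge grows with the number of runs, not with the amount of data.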
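6. UnsafeShuffleWriter
6.1. Instantiation
Notes:
- An UnsafeShuffleWriter is instantiated by calling its constructor;
- UnsafeShuffleWriter requires that the stage have no more than 16777216 partitions;
- The sorter is initialized with an initial buffer size of 4096;
- The serialization buffer is initialized with a size of 1 MiB;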
public class UnsafeShuffleWriter<K, V> extends ShuffleWriter<K, V> {
private static final Logger logger = LoggerFactory.getLogger(UnsafeShuffleWriter.class);
private static final ClassTag<Object> OBJECT_CLASS_TAG;
@VisibleForTesting
static final int DEFAULT_INITIAL_SORT_BUFFER_SIZE = 4096;
static final int DEFAULT_INITIAL_SER_BUFFER_SIZE = 1048576;
private final BlockManager blockManager;
private final IndexShuffleBlockResolver shuffleBlockResolver;
private final TaskMemoryManager memoryManager;
private final SerializerInstance serializer;
private final Partitioner partitioner;
private final ShuffleWriteMetrics writeMetrics;
private final int shuffleId;
private final int mapId;
private final TaskContext taskContext;
private final SparkConf sparkConf;
private final boolean transferToEnabled;
private final int initialSortBufferSize;
private final int inputBufferSizeInBytes;
private final int outputBufferSizeInBytes;
@Nullable
private MapStatus mapStatus;
@Nullable
private ShuffleExternalSorter sorter;
private long peakMemoryUsedBytes = 0L;
private UnsafeShuffleWriter.MyByteArrayOutputStream serBuffer;
private SerializationStream serOutputStream;
private boolean stopping = false;
public UnsafeShuffleWriter(BlockManager blockManager, IndexShuffleBlockResolver shuffleBlockResolver, TaskMemoryManager memoryManager, SerializedShuffleHandle<K, V> handle, int mapId, TaskContext taskContext, SparkConf sparkConf) throws IOException {
int numPartitions = handle.dependency().partitioner().numPartitions();
// The number of partitions must not exceed 16777216
if (numPartitions > SortShuffleManager.MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE()) {
throw new IllegalArgumentException("UnsafeShuffleWriter can only be used for shuffles with at most " + SortShuffleManager.MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE() + " reduce partitions");
} else {
this.blockManager = blockManager;
this.shuffleBlockResolver = shuffleBlockResolver;
this.memoryManager = memoryManager;
this.mapId = mapId;
ShuffleDependency<K, V, V> dep = handle.dependency();
this.shuffleId = dep.shuffleId();
this.serializer = dep.serializer().newInstance();
this.partitioner = dep.partitioner();
this.writeMetrics = taskContext.taskMetrics().shuffleWriteMetrics();
this.taskContext = taskContext;
this.sparkConf = sparkConf;
this.transferToEnabled = sparkConf.getBoolean("spark.file.transferTo", true);
// Initial buffer size of the sorter: 4096 by default
this.initialSortBufferSize = sparkConf.getInt("spark.shuffle.sort.initialBufferSize", 4096);
// Input buffer size: 32 KB by default
this.inputBufferSizeInBytes = (int)(Long)sparkConf.get(package$.MODULE$.SHUFFLE_FILE_BUFFER_SIZE()) * 1024;
// Output buffer size: 32 KB by default
this.outputBufferSizeInBytes = (int)(Long)sparkConf.get(package$.MODULE$.SHUFFLE_UNSAFE_FILE_OUTPUT_BUFFER_SIZE()) * 1024;
this.open();
}
}
private void open() {
assert this.sorter == null;
// Initialize the sorter with the initial buffer size (4096 by default)
this.sorter = new ShuffleExternalSorter(this.memoryManager, this.blockManager, this.taskContext, this.initialSortBufferSize, this.partitioner.numPartitions(), this.sparkConf, this.writeMetrics);
// Initialize the serialization buffer with a size of 1 MiB
this.serBuffer = new UnsafeShuffleWriter.MyByteArrayOutputStream(1048576);
// Initialize the serialization stream
this.serOutputStream = this.serializer.serializeStream(this.serBuffer);
}
}
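6.2. write
Notes:
- UnsafeShuffleWriter provides two overloaded write methods; both end up in write(scala.collection.Iterator<Product2<K, V>> records);
- First, the records of the iterator are inserted into the sorter one by one;
  - When the data in the sorter meets the spill condition, it is spilled to a temp file;
- Next, the data in the sorter is written out to one output file;
  - This produces one output file plus the index file that goes with it;
- Finally, the sorter's resources are released;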
public class UnsafeShuffleWriter<K, V> extends ShuffleWriter<K, V> {
public void write(Iterator<Product2<K, V>> records) throws IOException {
// Convert the Java iterator into a Scala iterator
write(JavaConverters.asScalaIteratorConverter(records).asScala());
}
public void write(scala.collection.Iterator<Product2<K, V>> records) throws IOException {
boolean success = false;
try {
// Iterate over the records and insert them into the sorter one by one
while (records.hasNext()) {
insertRecordIntoSorter(records.next());
}
// Merge the spill files into a single output file, then delete the spill files
closeAndWriteOutput();
success = true;
} finally {
if (sorter != null) {
try {
// Release the sorter's resources
sorter.cleanupResources();
} catch (Exception e) {
// Only throw this error if we won't be masking another
// error.
if (success) {
throw e;
} else {
logger.error("In addition to a failure during writing, we failed during " +
"cleanup.", e);
}
}
}
}
}
}
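6.2.1.insertRecordIntoSorter - add a record to the sorter
Notes:
- First, the record is written to the serialization stream;
- The serialization stream buffers the bytes in a MyByteArrayOutputStream, 1 MB in size by default;
- The buffer keeps a counter of how many bytes it currently holds;
- Then the serialized bytes are inserted into the sorter;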
public class UnsafeShuffleWriter<K, V> extends ShuffleWriter<K, V> {
private ShuffleExternalSorter sorter;
// Serialization buffer: 1 MB by default
private MyByteArrayOutputStream serBuffer;
// Serialization stream
private SerializationStream serOutputStream;
void insertRecordIntoSorter(Product2<K, V> record) throws IOException {
assert(sorter != null);
final K key = record._1();
final int partitionId = partitioner.getPartition(key);
// Reset the serialization buffer's byte counter to 0
serBuffer.reset();
// Write the key and the value to the serialization stream
serOutputStream.writeKey(key, OBJECT_CLASS_TAG);
serOutputStream.writeValue(record._2(), OBJECT_CLASS_TAG);
serOutputStream.flush();
// Make sure the serialization buffer actually holds data
final int serializedRecordSize = serBuffer.size();
assert (serializedRecordSize > 0);
// Insert the serialized bytes into the sorter
sorter.insertRecord(
serBuffer.getBuf(), Platform.BYTE_ARRAY_OFFSET, serializedRecordSize, partitionId);
}
}
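6.2.1.1.ShuffleExternalSorter#insertRecord
Notes:
- 1. If the spill threshold has been reached, spill the buffered data to disk;
- 2. Grow the in-memory sorter's pointer array if it has no room for another record;
- 3. Acquire a new page if the current one cannot hold the record;
- 4. Compute the address at which the record will be stored;
- 5. Write the record length into the page;
- 6. Copy the record bytes into the page;
- 7. Insert the record address into the in-memory sorter;
Summary:
- Record data lives in a linked list held by the external sorter; each list element is a page (a MemoryBlock);
- Record addresses and partition ids live in the in-memory sorter's pointer array;
- Spill condition shown here: the number of buffered records reaches numElementsForSpillThreshold (Integer.MAX_VALUE by default); a spill can also be requested by the memory manager through the inherited spill(size, trigger) method when execution memory runs short;
- Each spill produces one temporary file;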
final class ShuffleExternalSorter extends MemoryConsumer {
public void insertRecord(Object recordBase, long recordOffset, int length, int partitionId) throws IOException {
assert this.inMemSorter != null;
// Number of records in the in-memory sorter >= spill threshold (Integer.MAX_VALUE by default)
if (this.inMemSorter.numRecords() >= this.numElementsForSpillThreshold) {
logger.info("Spilling data because number of spilledRecords crossed the threshold " + this.numElementsForSpillThreshold);
// Spill the data to disk
this.spill();
}
// Grow the in-memory sorter's pointer array if needed
this.growPointerArrayIfNecessary();
// Record size including the aligned length header
int uaoSize = UnsafeAlignedOffset.getUaoSize();
int required = length + uaoSize;
// Acquire a new page if the current one cannot hold the record
this.acquireNewPageIfNecessary(required);
assert this.currentPage != null;
// Base object of the current page
Object base = this.currentPage.getBaseObject();
// Encode the address at which the record will be stored
long recordAddress = this.taskMemoryManager.encodePageNumberAndOffset(this.currentPage, this.pageCursor);
// Write the record length into the page
UnsafeAlignedOffset.putSize(base, this.pageCursor, length);
// Advance the page cursor past the length header
this.pageCursor += (long)uaoSize;
// Copy the record bytes into the page
Platform.copyMemory(recordBase, recordOffset, base, this.pageCursor, (long)length);
// Advance the page cursor past the record
this.pageCursor += (long)length;
// Record the address and the partition id in the in-memory sorter
this.inMemSorter.insertRecord(recordAddress, partitionId);
}
}
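The length-prefixed layout that insertRecord writes into a page can be sketched in plain Java. This is a minimal illustration, not Spark code: the class LengthPrefixedPage is hypothetical, a heap ByteBuffer stands in for a MemoryBlock, and the header is assumed to be 4 bytes (in Spark the header size comes from UnsafeAlignedOffset.getUaoSize()).

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for one page of the external sorter.
final class LengthPrefixedPage {
    private final ByteBuffer page;

    LengthPrefixedPage(int capacityBytes) {
        this.page = ByteBuffer.allocate(capacityBytes);
    }

    /** True if a record of this length plus its 4-byte header still fits. */
    boolean fits(int length) {
        return page.remaining() >= Integer.BYTES + length;
    }

    /** Appends [length][payload] and returns the offset where the record starts. */
    int append(byte[] payload) {
        if (!fits(payload.length)) {
            // The real sorter would acquire a new page at this point.
            throw new IllegalStateException("page full");
        }
        int recordOffset = page.position(); // plays the role of pageCursor
        page.putInt(payload.length);        // step 5: write the length header
        page.put(payload);                  // step 6: copy the record bytes
        return recordOffset;                // step 7 would store this address
    }

    /** Reads a record back using only its start offset. */
    byte[] read(int recordOffset) {
        int length = page.getInt(recordOffset);
        byte[] out = new byte[length];
        for (int i = 0; i < length; i++) {
            out[i] = page.get(recordOffset + Integer.BYTES + i);
        }
        return out;
    }
}
```

Because each record carries its own length, the sorter only needs to remember where a record starts; that is why a single packed address per record is enough for the later spill.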
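6.2.1.1.1.ShuffleExternalSorter#spill - spill buffered data to disk
Notes:
- Check whether there is anything to spill;
- Write the data to a temporary file; each spill produces one temporary file;
- The data is sorted by ordering the record addresses held in the in-memory sorter by partition id, ascending;
- Within a temporary file, the records of one partition are contiguous;
- Free the memory pages;
- Reset the in-memory sorter;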
final class ShuffleExternalSorter extends MemoryConsumer {
// Inherited from the parent class MemoryConsumer
public void spill() throws IOException {
this.spill(9223372036854775807L, this);
}
public long spill(long size, MemoryConsumer trigger) throws IOException {
// Only spill when this sorter triggered the request and the in-memory sorter holds at least one record
if (trigger == this && this.inMemSorter != null && this.inMemSorter.numRecords() != 0) {
logger.info("Thread {} spilling sort data of {} to disk ({} {} so far)", new Object[]{Thread.currentThread().getId(), Utils.bytesToString(this.getMemoryUsage()), this.spills.size(), this.spills.size() > 1 ? " times" : " time"});
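// Sort and write to a temporary file
this.writeSortedFile(false);
// Free the memory pages
long spillSize = this.freeMemory();
// Reset the in-memory sorter
this.inMemSorter.reset();
// Report the number of in-memory bytes that were spilled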
this.taskContext.taskMetrics().incMemoryBytesSpilled(spillSize);
return spillSize;
} else {
return 0L;
}
}
}
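6.2.1.1.1.1.writeSortedFile - write the data to a temporary file
Notes:
- Sort the record addresses held by the in-memory sorter by the partition id embedded in each address, ascending, using RadixSort;
- The partition id is stored in bytes 5~7 of each record address;
- 3 bytes of 8 bits each can represent 2^24 (3*8=24) partition ids;
- Get the temporary spill file from the blockManager;
- Get a disk writer from the blockManager;
- Iterate over the record addresses and write the record behind each address to the temporary file;
- Read the partition id from the record address;
- When the partition changes, flush the previous partition's data to the temporary file and record its length in the spill info object;
- Use the record address to locate the page that holds the record in the external sorter's page list;
- Copy the record out of the page into the writer's stream;
- Each iteration copies at most one write buffer's worth of bytes from the page into the write buffer;
- The write buffer is then handed to the writer's stream;
- The write buffer size is controlled by spark.shuffle.spill.diskWriteBufferSize; 1 MB by default;
- After each record, tell the writer that one complete record has been written;
- After the last record, flush the remaining data from the stream to the temporary file;
- Close the writer;
- Store the last partition's length in the spill info object and add the spill info object to the list of spills;
- Report the spill metrics;
Summary:
- Sorting is done by sorting the record addresses in the in-memory sorter's pointer array;
- The sorted addresses are used to locate the pages that hold the records in the external sorter's page list;
- Records are written to the temporary file in the order of the sorted addresses;
- Because the addresses are sorted by partition id in ascending order, the records of one partition are contiguous in the temporary file;
- Each record is first copied into the write buffer (1 MB by default) and then handed to the writer's stream; the data of a partition is flushed to the temporary file together when the partition ends;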
final class ShuffleExternalSorter extends MemoryConsumer {
private void writeSortedFile(boolean isLastFile) {
ShuffleWriteMetrics writeMetricsToUse;
if (isLastFile) {// Final write: use the task's own write metrics
// Bytes written here count as shuffle write output
writeMetricsToUse = this.writeMetrics;
} else {// Intermediate spill
// Use separate metrics so that spill bytes are not counted as shuffle output
writeMetricsToUse = new ShuffleWriteMetrics();
}
// Sort the record addresses
ShuffleSorterIterator sortedRecords = this.inMemSorter.getSortedIterator();
// Disk write buffer, sized by spark.shuffle.spill.diskWriteBufferSize
byte[] writeBuffer = new byte[this.diskWriteBufferSize];
// Get the temporary spill file from the blockManager
Tuple2<TempShuffleBlockId, File> spilledFileInfo = this.blockManager.diskBlockManager().createTempShuffleBlock();
File file = (File)spilledFileInfo._2();
TempShuffleBlockId blockId = (TempShuffleBlockId)spilledFileInfo._1();
// Create the spill info object
SpillInfo spillInfo = new SpillInfo(this.numPartitions, file, blockId);
SerializerInstance ser = DummySerializerInstance.INSTANCE;
// Get a disk writer from the blockManager
DiskBlockObjectWriter writer = this.blockManager.getDiskWriter(blockId, file, ser, this.fileBufferSizeBytes, writeMetricsToUse);
// Partition currently being written; -1 means none yet
int currentPartition = -1;
int uaoSize = UnsafeAlignedOffset.getUaoSize();
// Iterate over the sorted record addresses
while(sortedRecords.hasNext()) {
// Load the next address into packedRecordPointer
sortedRecords.loadNext();
// Read the partition id from the address
int partition = sortedRecords.packedRecordPointer.getPartitionId();
assert partition >= currentPartition;
// A different partition id means the previous partition has been written completely
if (partition != currentPartition) {
// currentPartition != -1: this is not the first partition
if (currentPartition != -1) {
// Flush the previous partition's data to the temporary file
FileSegment fileSegment = writer.commitAndGet();
// Record the partition's length in the spill info object
spillInfo.partitionLengths[currentPartition] = fileSegment.length();
}
// Switch to the new partition
currentPartition = partition;
}
// Use the address to locate the page in the external sorter's page list
long recordPointer = sortedRecords.packedRecordPointer.getRecordPointer();
Object recordPage = this.taskMemoryManager.getPage(recordPointer);
// Read the record length from the header stored at the record's offset in the page
long recordOffsetInPage = this.taskMemoryManager.getOffsetInPage(recordPointer);
int dataRemaining = UnsafeAlignedOffset.getSize(recordPage, recordOffsetInPage);
// Copy the record out in chunks
int toTransfer;
for(long recordReadPosition = recordOffsetInPage + (long)uaoSize; dataRemaining > 0; dataRemaining -= toTransfer) {
// Chunk size: the smaller of the write buffer size and the bytes remaining
toTransfer = Math.min(this.diskWriteBufferSize, dataRemaining);
// Copy that many bytes from the page into the write buffer
Platform.copyMemory(recordPage, recordReadPosition, writeBuffer, (long)Platform.BYTE_ARRAY_OFFSET, (long)toTransfer);
// Hand the write buffer to the writer's stream
writer.write(writeBuffer, 0, toTransfer);
// Advance the read position
recordReadPosition += (long)toTransfer;
}
// Tell the writer that one record has been written
writer.recordWritten();
}
// All records done: flush the remaining data to the temporary file
FileSegment committedSegment = writer.commitAndGet();
// Close the writer
writer.close();
if (currentPartition != -1) {
// Record the last partition's length in the spill info object
spillInfo.partitionLengths[currentPartition] = committedSegment.length();
// Add this spill's info object to the list of spills
this.spills.add(spillInfo);
}
if (!isLastFile) {
// Report the spill metrics
this.writeMetrics.incRecordsWritten(writeMetricsToUse.recordsWritten());
this.taskContext.taskMetrics().incDiskBytesSpilled(writeMetricsToUse.bytesWritten());
}
}
}
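The way writeSortedFile derives per-partition lengths from a partition-sorted stream of records can be reduced to a small, self-contained sketch. The class SpillLengths and its inputs are hypothetical; the real code gets each length from writer.commitAndGet() rather than summing record sizes, so this only mirrors the "commit when the partition id changes" control flow.

```java
// Hypothetical helper; mirrors the control flow, not Spark's I/O.
final class SpillLengths {
    /**
     * partitionIds must be ascending (as produced by the sort);
     * recordSizes[i] is the number of bytes written for record i.
     */
    static long[] partitionLengths(int numPartitions, int[] partitionIds, int[] recordSizes) {
        long[] lengths = new long[numPartitions];
        int currentPartition = -1; // -1: no partition started yet
        long bytesInCurrent = 0;
        for (int i = 0; i < partitionIds.length; i++) {
            int partition = partitionIds[i];
            if (partition < currentPartition) {
                throw new IllegalArgumentException("records are not sorted by partition id");
            }
            if (partition != currentPartition) {
                if (currentPartition != -1) {
                    // "Commit" the previous partition and remember its length
                    lengths[currentPartition] = bytesInCurrent;
                }
                currentPartition = partition;
                bytesInCurrent = 0;
            }
            bytesInCurrent += recordSizes[i];
        }
        if (currentPartition != -1) {
            // Commit the last partition after the loop ends
            lengths[currentPartition] = bytesInCurrent;
        }
        return lengths;
    }
}
```

Partitions that receive no records keep a length of 0, which matches the fact that the SpillInfo array is sized by numPartitions and not by the number of partitions actually seen.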
6.2.1.1.1.2.Sorting the data to spill
Notes:
- The record addresses held by the in-memory sorter are sorted;
- The sort key is the partition id;
final class ShuffleInMemorySorter {
public ShuffleInMemorySorter.ShuffleSorterIterator getSortedIterator() {
int offset = 0;
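// Radix sort is the default
if (this.useRadixSort) {
// The partition id occupies bytes 5~7 of each long in the array, so at most 2^24 partition ids can be represented
offset = RadixSort.sort(this.array, (long)this.pos, 5, 7, false, false);
} else {
// Otherwise use TimSort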
MemoryBlock unused = new MemoryBlock(this.array.getBaseObject(), this.array.getBaseOffset() + (long)this.pos * 8L, (this.array.size() - (long)this.pos) * 8L);
LongArray buffer = new LongArray(unused);
Sorter<PackedRecordPointer, LongArray> sorter = new Sorter(new ShuffleSortDataFormat(buffer));
sorter.sort(this.array, 0, this.pos, SORT_COMPARATOR);
}
return new ShuffleInMemorySorter.ShuffleSorterIterator(this.pos, this.array, offset);
}
}
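The statement that the partition id sits in bytes 5~7 of each address can be made concrete with a small packing sketch. It assumes the layout of 24 bits of partition id, 13 bits of page number and 27 bits of in-page offset; the class PackedPointerSketch is hypothetical, and it sorts with an ordinary comparison sort rather than Spark's RadixSort.

```java
import java.util.Arrays;

// Illustrative only; Spark's real class is PackedRecordPointer.
final class PackedPointerSketch {
    static final int PAGE_BITS = 13;
    static final int OFFSET_BITS = 27;
    static final int MAX_PARTITIONS = 1 << 24; // 16777216

    /** Packs the three fields into one long: [partition:24][page:13][offset:27]. */
    static long pack(int partitionId, int pageNumber, long offsetInPage) {
        return ((long) partitionId << (PAGE_BITS + OFFSET_BITS))
            | ((long) pageNumber << OFFSET_BITS)
            | offsetInPage;
    }

    /** The upper 24 bits, i.e. bytes 5..7 counted from the least significant byte. */
    static int partitionId(long packed) {
        return (int) (packed >>> (PAGE_BITS + OFFSET_BITS));
    }

    static int pageNumber(long packed) {
        return (int) ((packed >>> OFFSET_BITS) & ((1L << PAGE_BITS) - 1));
    }

    static long offsetInPage(long packed) {
        return packed & ((1L << OFFSET_BITS) - 1);
    }

    /** Stable sort by partition id only; a stand-in for the radix sort on bytes 5..7. */
    static long[] sortByPartition(long[] pointers) {
        return Arrays.stream(pointers)
            .boxed()
            .sorted((a, b) -> Integer.compare(partitionId(a), partitionId(b)))
            .mapToLong(Long::longValue)
            .toArray();
    }
}
```

Sorting 8-byte values instead of the records themselves is what keeps the sort cheap: the record bytes never move, only their addresses do.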
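6.2.1.1.2.growPointerArrayIfNecessary - grow the in-memory sorter's pointer array
Notes:
- First, check whether the pointer array has room for another record;
- If it does not, grow it;
- Compute the current size of the array;
- Allocate a new array with twice the current capacity;
- Check again whether the existing array has room;
- If it does (allocating the new array can itself trigger a spill, which resets the sorter), free the newly allocated array;
- If it does not, install the new array as the in-memory sorter's pointer array;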
final class ShuffleExternalSorter extends MemoryConsumer {
private ShuffleInMemorySorter inMemSorter;
private void growPointerArrayIfNecessary() throws IOException {
assert this.inMemSorter != null;
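// Check whether the pointer array needs to grow
if (!this.inMemSorter.hasSpaceForAnotherRecord()) {
// Current size of the pointer array in bytes
long used = this.inMemSorter.getMemoryUsage();
// Allocate a new array with twice as many entries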
LongArray array;
try {
array = this.allocateArray(used / 8L * 2L);
} catch (TooLargePageException var5) {
this.spill();
return;
} catch (SparkOutOfMemoryError var6) {
if (!this.inMemSorter.hasSpaceForAnotherRecord()) {
logger.error("Unable to grow the pointer array");
throw var6;
}
return;
}
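// Check again: allocating the array may have triggered a spill, which resets the pointer array
if (this.inMemSorter.hasSpaceForAnotherRecord()) {
// In that case free the array that was just allocated
this.freeArray(array);
} else {
// Still no room: switch to the newly allocated array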
this.inMemSorter.expandPointerArray(array);
}
}
}
}
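6.2.1.1.2.1.Checking whether the array must grow
Notes:
- While the index of the next entry (pos) < usable capacity, the array does not need to grow;
- Conversely, once pos >= usable capacity, it must grow;
- About the usable capacity
- With radix sort (the default) it is half of the total capacity; the other half is kept as scratch space for the sort;
- With useRadixSort disabled it is 2/3 of the total capacity;
- About the array
- A LongArray is used as the buffer;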
final class ShuffleInMemorySorter {
// Array that holds the record addresses
private LongArray array;
// Index at which the next entry will be written
private int pos = 0;
// Number of entries the array may hold before it must grow
private int usableCapacity = 0;
ShuffleInMemorySorter(MemoryConsumer consumer, int initialSize, boolean useRadixSort) {
//---------other code omitted---------
this.usableCapacity = this.getUsableCapacity();
}
// pos < usable capacity: no need to grow
public boolean hasSpaceForAnotherRecord() {
return this.pos < this.usableCapacity;
}
// With radix sort the usable capacity is half of the total capacity;
// otherwise it is 2/3 of the total capacity
private int getUsableCapacity() {
return (int)((double)this.array.size() / (this.useRadixSort ? 2.0D : 1.5D));
}
}
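As a quick worked example of the usable-capacity rule, the arithmetic of getUsableCapacity can be reproduced on its own. The class PointerArrayCapacity is a hypothetical wrapper; the formula is the one shown in the listing above.

```java
// Hypothetical wrapper around the same arithmetic as getUsableCapacity().
final class PointerArrayCapacity {
    /** arraySize is the number of long entries in the pointer array. */
    static int usableCapacity(long arraySize, boolean useRadixSort) {
        return (int) ((double) arraySize / (useRadixSort ? 2.0 : 1.5));
    }

    /** True while the next insert position is still below the usable capacity. */
    static boolean hasSpaceForAnotherRecord(int pos, long arraySize, boolean useRadixSort) {
        return pos < usableCapacity(arraySize, useRadixSort);
    }
}
```

With the default initial size of 4096 entries this gives 2048 usable entries under radix sort and 2730 under TimSort, so the array first grows when the 2049th (or 2731st) record arrives.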
6.2.1.1.2.2.Computing memory usage
Notes:
- The memory usage is the size of the page backing the array, in bytes;
final class ShuffleInMemorySorter {
// Size in bytes of the page backing the array: 8 bytes per entry
public long getMemoryUsage() {
return this.array.size() * 8L;
}
}
public final class LongArray {
// Number of long entries in the array (not bytes)
public long size() {
return this.length;
}
}
6.2.1.1.2.3.allocateArray - allocate the larger array
Notes:
- Allocate a new page (MemoryBlock) for the new capacity
- Wrap the page in a new LongArray
public abstract class MemoryConsumer {
public LongArray allocateArray(long size) {
long required = size * 8L;
// Allocate a new page (MemoryBlock) for the new capacity
MemoryBlock page = this.taskMemoryManager.allocatePage(required, this);
if (page == null || page.size() < required) {
this.throwOom(page, required);
}
// Update the memory usage counter
this.used += required;
// Wrap the page in a new LongArray
return new LongArray(page);
}
}
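6.2.1.1.2.4.expandPointerArray - move the data
Notes:
- First, make sure the new array is larger than the old one;
- Then copy the data from the old array to the new one;
- Free the old array;
- Install the new array as the in-memory sorter's pointer array;
- Finally, recompute the usable capacity;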
final class ShuffleInMemorySorter {
public void expandPointerArray(LongArray newArray) {
// The new array must be larger than the old one
assert newArray.size() > this.array.size();
// Copy the entries from the old array to the new one
Platform.copyMemory(this.array.getBaseObject(), this.array.getBaseOffset(), newArray.getBaseObject(), newArray.getBaseOffset(), (long)this.pos * 8L);
// Free the old array
this.consumer.freeArray(this.array);
// Install the new array as the pointer array
this.array = newArray;
// Recompute the usable capacity
this.usableCapacity = this.getUsableCapacity();
}
}
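The allocate, copy, swap sequence can be illustrated with an ordinary long[]. This is a simplified sketch under stated assumptions: GrowablePointerArray is a hypothetical class, it grows only when the array is completely full (Spark grows at the usable capacity), and the garbage collector replaces the explicit freeArray call.

```java
import java.util.Arrays;

// Hypothetical simplification of the pointer array's growth behaviour.
final class GrowablePointerArray {
    private long[] array;
    private int pos = 0; // index of the next free slot

    GrowablePointerArray(int initialSize) {
        this.array = new long[initialSize];
    }

    void insert(long pointer) {
        if (pos == array.length) {
            // Allocate an array twice as large and copy the first pos entries;
            // the old array becomes garbage (Spark frees it explicitly).
            array = Arrays.copyOf(array, array.length * 2);
        }
        array[pos++] = pointer;
    }

    int capacity() {
        return array.length;
    }

    int size() {
        return pos;
    }

    long get(int index) {
        if (index >= pos) {
            throw new IndexOutOfBoundsException("index " + index + " >= size " + pos);
        }
        return array[index];
    }
}
```

Doubling keeps the amortized cost of an insert constant, at the price of briefly holding both the old and the new array in memory while the copy runs.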
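6.2.1.1.3.acquireNewPageIfNecessary - acquire a new page for the record
A new page is needed when:
- the current page is null;
- the current page does not have enough room left;
- either condition alone is sufficient;
Steps:
- Allocate a new page of at least the required size and make it the current page;
- Point the page cursor at the start of the new page;
- Add the new page to the list of allocated pages;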
final class ShuffleExternalSorter extends MemoryConsumer {
private void acquireNewPageIfNecessary(int required) {
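// The current page is null, or it does not have enough room left
if (this.currentPage == null || this.pageCursor + (long)required > this.currentPage.getBaseOffset() + this.currentPage.size()) {
// Allocate a new page that can hold the record and make it the current page
this.currentPage = this.allocatePage((long)required);
// Point the page cursor at the start of the new page
this.pageCursor = this.currentPage.getBaseOffset();
// Add the new page to the list of allocated pages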
this.allocatedPages.add(this.currentPage);
}
}
}
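6.2.2.closeAndWriteOutput - write the external sorter's buffered data to disk
Notes:
- Update the peak memory usage;
- Write the data still buffered in memory to a temporary file;
- Merge the temporary files (there may be several: one per spill) into one output file;
- In the merged output file the data is ordered by partition id, ascending, and each partition's data is contiguous;
- Create the index file for the output file;
- It stores the offset of each partition's data;
- The offsets appear in the same order as the partitions in the data file, one to one;
- Record the shuffle status (MapStatus);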
public class UnsafeShuffleWriter<K, V> extends ShuffleWriter<K, V> {
void closeAndWriteOutput() throws IOException {
assert(sorter != null);
// Update the peak memory usage
updatePeakMemoryUsed();
serBuffer = null;
serOutputStream = null;
// Write the data still buffered in memory to a temporary file
final SpillInfo[] spills = sorter.closeAndGetSpills();
sorter = null;
final long[] partitionLengths;
// Final output data file
final File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
// Temporary file that the merged data is written to first
final File tmp = Utils.tempFileWith(output);
try {
try {
// Merge the spill files
partitionLengths = mergeSpills(spills, tmp);
} finally {
for (SpillInfo spill : spills) {
if (spill.file.exists() && ! spill.file.delete()) {
logger.error("Error while deleting spill file {}", spill.file.getPath());
}
}
}
// Write the index file for the output file and commit both
shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
} finally {
if (tmp.exists() && !tmp.delete()) {
logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
}
}
// Record the shuffle status
mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
}
}
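6.2.2.1.closeAndGetSpills - force the buffered data to disk
Notes:
- Using the record addresses held by the in-memory sorter, write the corresponding records to a temporary file;
- Free the memory pages and the in-memory sorter;
- Return all spill info objects as an array;
- Each spill info object stores the length of every partition in its file;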
final class ShuffleExternalSorter extends MemoryConsumer {
public SpillInfo[] closeAndGetSpills() throws IOException {
if (this.inMemSorter != null) {
// Write the data still buffered in memory to disk
this.writeSortedFile(true);
// Free the memory pages
this.freeMemory();
// Free the in-memory sorter
this.inMemSorter.free();
this.inMemSorter = null;
}
// Return the spill info objects as an array
return (SpillInfo[])this.spills.toArray(new SpillInfo[this.spills.size()]);
}
}
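6.2.2.2.mergeSpills - merge the temporary files
Notes:
- No temporary file: create an empty output file;
- One temporary file: move it and rename it to the output file;
- Several temporary files: merge them;
- Fast merge
- Requires that fast merge is enabled and supported;
- If spark.file.transferTo is enabled (the default) and encryption is off, merge with transferTo;
- Otherwise merge with file streams, without decompressing;
- Slow merge
- Used whenever fast merge does not apply;
- Return the array of per-partition lengths;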
public class UnsafeShuffleWriter<K, V> extends ShuffleWriter<K, V> {
private long[] mergeSpills(SpillInfo[] spills, File outputFile) throws IOException {
// Whether shuffle output is compressed: true by default, set with spark.shuffle.compress
boolean compressionEnabled = this.sparkConf.getBoolean("spark.shuffle.compress", true);
// Compression codec: LZ4CompressionCodec by default, set with spark.io.compression.codec
CompressionCodec compressionCodec = org.apache.spark.io.CompressionCodec$.MODULE$.createCodec(this.sparkConf);
// Whether fast merge is enabled: true by default
boolean fastMergeEnabled = this.sparkConf.getBoolean("spark.shuffle.unsafe.fastMergeEnabled", true);
// Whether fast merge is supported:
// 1. compression is disabled,
// 2. or the codec is one of SnappyCompressionCodec, LZFCompressionCodec, LZ4CompressionCodec, ZStdCompressionCodec
boolean fastMergeIsSupported = !compressionEnabled || org.apache.spark.io.CompressionCodec$.MODULE$.supportsConcatenationOfSerializedStreams(compressionCodec);
// Whether encryption is enabled: false by default, set with spark.io.encryption.enabled
try {
if (spills.length == 0) {// No spill file
// Create an empty file
(new FileOutputStream(outputFile)).close();
// Return an all-zero array of partition lengths
return new long[this.partitioner.numPartitions()];
} else if (spills.length == 1) {// One spill file
// Move the spill file and rename it to the output file
Files.move(spills[0].file, outputFile);
// Return its partition lengths
return spills[0].partitionLengths;
} else {// Several spill files
long[] partitionLengths;
// Fast merge
if (fastMergeEnabled && fastMergeIsSupported) {
// transferTo enabled and no encryption: the default path
if (this.transferToEnabled && !encryptionEnabled) {
// transferTo-based merge
logger.debug("Using transferTo-based fast merge");
partitionLengths = this.mergeSpillsWithTransferTo(spills, outputFile);
} else {
// File-stream-based merge
logger.debug("Using fileStream-based fast merge");
partitionLengths = this.mergeSpillsWithFileStream(spills, outputFile, (CompressionCodec)null);
}
} else {
// Slow merge
logger.debug("Using slow merge");
partitionLengths = this.mergeSpillsWithFileStream(spills, outputFile, compressionCodec);
}
// Adjust the write metrics: the last spill was already counted, so replace it with the merged file's size
this.writeMetrics.decBytesWritten(spills[spills.length - 1].file.length());
this.writeMetrics.incBytesWritten(outputFile.length());
// Return the partition lengths
return partitionLengths;
}
} catch (IOException var9) {
if (outputFile.exists() && !outputFile.delete()) {
logger.error("Unable to delete output file {}", outputFile.getPath());
}
throw var9;
}
}
}
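The branching in mergeSpills is easier to see when it is separated from the I/O. The sketch below restates the decision as a pure function; MergeStrategySketch and its Strategy enum are hypothetical names, and the boolean parameters stand for the configuration values read at the top of the method.

```java
// Hypothetical restatement of the branch structure; performs no I/O.
final class MergeStrategySketch {
    enum Strategy { EMPTY_FILE, MOVE_SINGLE_SPILL, FAST_TRANSFER_TO, FAST_FILE_STREAM, SLOW_FILE_STREAM }

    static Strategy choose(
            int spillCount,
            boolean fastMergeEnabled,
            boolean compressionEnabled,
            boolean codecSupportsConcatenation,
            boolean transferToEnabled,
            boolean encryptionEnabled) {
        if (spillCount == 0) {
            return Strategy.EMPTY_FILE;
        }
        if (spillCount == 1) {
            return Strategy.MOVE_SINGLE_SPILL;
        }
        // Fast merge concatenates compressed bytes as they are, so the codec
        // must tolerate concatenated streams (or compression must be off).
        boolean fastMergeIsSupported = !compressionEnabled || codecSupportsConcatenation;
        if (fastMergeEnabled && fastMergeIsSupported) {
            return (transferToEnabled && !encryptionEnabled)
                ? Strategy.FAST_TRANSFER_TO
                : Strategy.FAST_FILE_STREAM;
        }
        return Strategy.SLOW_FILE_STREAM;
    }
}
```

With default settings (compression on with LZ4, fast merge on, transferTo on, encryption off) and more than one spill, this resolves to the transferTo-based fast merge.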
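6.3.Sorters
6.3.1.ShuffleExternalSorter - the external sorter
Notes:
- ShuffleExternalSorter is a subclass of MemoryConsumer;
- Constructing a ShuffleExternalSorter also runs the MemoryConsumer constructor;
- File buffer size: 32k by default;
- Spill threshold: Integer.MAX_VALUE by default;
- Disk write buffer: 1 MB by default;
- An in-memory sorter is created and kept inside this sorter;
- Its initial array size is 4096;
- It uses radix sort by default (useRadixSort);
Summary:
- The external sorter stores record data in a linked list;
- The list elements are pages, i.e. MemoryBlock instances;
- All record data lives in these pages;
- currentPage points to the newest page, the one being written;
- pageCursor points to the write position inside the current page;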
final class ShuffleExternalSorter extends MemoryConsumer {
private static final Logger logger = LoggerFactory.getLogger(ShuffleExternalSorter.class);
@VisibleForTesting
static final int DISK_WRITE_BUFFER_SIZE = 1048576;
private final int numPartitions;
private final TaskMemoryManager taskMemoryManager;
private final BlockManager blockManager;
private final TaskContext taskContext;
private final ShuffleWriteMetrics writeMetrics;
// Spill threshold
private final int numElementsForSpillThreshold;
// File buffer size
private final int fileBufferSizeBytes;
// Disk write buffer size
private final int diskWriteBufferSize;
// Pages that hold the record data; at most 2^13 pages can be addressed
private final LinkedList<MemoryBlock> allocatedPages = new LinkedList();
// Metadata of the spill files
private final LinkedList<SpillInfo> spills = new LinkedList();
private long peakMemoryUsedBytes;
@Nullable
private ShuffleInMemorySorter inMemSorter;
// Page currently being written
@Nullable
private MemoryBlock currentPage = null;
// Write position inside the current page
private long pageCursor = -1L;
// Constructor
ShuffleExternalSorter(TaskMemoryManager memoryManager, BlockManager blockManager, TaskContext taskContext, int initialSize, int numPartitions, SparkConf conf, ShuffleWriteMetrics writeMetrics) {
// Run the MemoryConsumer constructor; the page size is capped at 134217728 bytes (128 MB)
super(memoryManager, (long)((int)Math.min(134217728L, memoryManager.pageSizeBytes())), memoryManager.getTungstenMemoryMode());
this.taskMemoryManager = memoryManager;
this.blockManager = blockManager;
this.taskContext = taskContext;
// Number of reduce partitions
this.numPartitions = numPartitions;
// File buffer size: 32k by default
this.fileBufferSizeBytes = (int)(Long)conf.get(package$.MODULE$.SHUFFLE_FILE_BUFFER_SIZE()) * 1024;
// Spill threshold: Integer.MAX_VALUE by default
this.numElementsForSpillThreshold = (Integer)conf.get(package$.MODULE$.SHUFFLE_SPILL_NUM_ELEMENTS_FORCE_SPILL_THRESHOLD());
this.writeMetrics = writeMetrics;
// Create the in-memory sorter and keep it in this sorter
this.inMemSorter = new ShuffleInMemorySorter(this, initialSize, conf.getBoolean("spark.shuffle.sort.useRadixSort", true));
// Current memory usage
this.peakMemoryUsedBytes = this.getMemoryUsage();
// Disk write buffer: 1 MB by default
this.diskWriteBufferSize = (int)(Long)conf.get(package$.MODULE$.SHUFFLE_DISK_WRITE_BUFFER_SIZE());
}
}
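6.3.1.1.ShuffleInMemorySorter - the in-memory sorter
Notes:
- A LongArray is used as the in-memory buffer;
- The array's initial capacity is 4096;
- The usable capacity depends on the sort algorithm:
- With radix sort, the usable capacity is half of the total capacity;
- Without radix sort, the usable capacity is 2/3 of the total capacity;
- Default comparator: ascending by partition id;
Summary:
- The in-memory sorter stores record addresses in a LongArray;
- The LongArray is backed by a MemoryBlock;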
final class ShuffleInMemorySorter {
// Comparator: ascending by partition id
private static final ShuffleInMemorySorter.SortComparator SORT_COMPARATOR = new ShuffleInMemorySorter.SortComparator();
private final MemoryConsumer consumer;
private LongArray array;
private final boolean useRadixSort;
private int pos = 0;
private int usableCapacity = 0;
private final int initialSize;
ShuffleInMemorySorter(MemoryConsumer consumer, int initialSize, boolean useRadixSort) {
this.consumer = consumer;
assert initialSize > 0;
// Initial array size: 4096
this.initialSize = initialSize;
// Radix sort is used by default
this.useRadixSort = useRadixSort;
// Allocate a LongArray with initialSize (4096) entries as the pointer array
this.array = consumer.allocateArray((long)initialSize);
// Usable capacity
this.usableCapacity = this.getUsableCapacity();
}
// Inner class: the sort comparator
private static final class SortComparator implements Comparator<PackedRecordPointer> {
private SortComparator() {
}
// Ascending by partition id
public int compare(PackedRecordPointer left, PackedRecordPointer right) {
int leftId = left.getPartitionId();
int rightId = right.getPartitionId();
return leftId < rightId ? -1 : (leftId > rightId ? 1 : 0);
}
}
}
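6.3.1.1.1.Computing the usable capacity
Notes:
- With radix sort, the usable capacity is half of the total capacity;
- Without radix sort, the usable capacity is 2/3 of the total capacity;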
final class ShuffleInMemorySorter {
private static final ShuffleInMemorySorter.SortComparator SORT_COMPARATOR = new ShuffleInMemorySorter.SortComparator();
private final MemoryConsumer consumer;
private LongArray array;
private final boolean useRadixSort;
private int pos = 0;
private int usableCapacity = 0;
private final int initialSize;
// With radix sort the usable capacity is half of the total capacity; otherwise it is 2/3
private int getUsableCapacity() {
return (int)((double)this.array.size() / (this.useRadixSort ? 2.0D : 1.5D));
}
}
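6.4.Summary
When it applies:
- The serializer supports relocation of serialized objects;
- No map-side aggregation
- At most 16777216 partitions
Write path:
- First, the records in the iterator are inserted into the sorter one by one;
- When the sorter reaches a spill condition, the data buffered in the sorter is spilled to a temporary file;
- Next, the data remaining in the sorter is written out and everything is merged into one output file;
- This produces one output file plus its index file;
- Finally, the sorter's resources are released;
Sorters:
- External sorter
- Buffers the record data
- Implemented as a linked list of pages (MemoryBlock);
- In-memory sorter
- Buffers the record addresses
- Bytes 5~7 of each address store the partition id;
- This allows 2^24 (16777216) distinct partition ids;
- Implemented with a LongArray (backed by a MemoryBlock);
- How sorting works
- The record addresses in the in-memory sorter are sorted by partition id in ascending order, with RadixSort by default;
- Spilling
- A spill happens when the number of record addresses in the in-memory sorter reaches the threshold (Integer.MAX_VALUE by default);
- Each spill produces one temporary file;
Output:
- One write produces one data file + one index file
- Data file
- Data is ordered by partition id, ascending;
- Each partition's data is contiguous;
- Index file
- Stores the offset of each partition in the data file;
- Its entries correspond one to one with the partitions in the data file;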
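7.Creating the index file for the data file
Notes:
- The index file records the offset of each partition's data;
- Each offset in the index file corresponds to one partition in the data file; both files list the partitions in the same ascending order;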
private[spark] class IndexShuffleBlockResolver(
conf: SparkConf,
_blockManager: BlockManager = null)
extends ShuffleBlockResolver
with Logging {
def writeIndexFileAndCommit(
shuffleId: Int,
mapId: Int,
lengths: Array[Long],
dataTmp: File): Unit = {
// Locate the index file and create a temporary file for it
val indexFile = getIndexFile(shuffleId, mapId)
val indexTmp = Utils.tempFileWith(indexFile)
try {
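// Locate the data file
val dataFile = getDataFile(shuffleId, mapId)
// There is only one IndexShuffleBlockResolver per executor; this synchronization makes the check and the rename below atomic.
synchronized {
// Check whether a valid index file and data file already exist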
val existingLengths = checkIndexAndDataFile(indexFile, dataFile, lengths.length)
if (existingLengths != null) {
System.arraycopy(existingLengths, 0, lengths, 0, lengths.length)
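// A matching index file already exists: another attempt of this map task has already written its output,
// so the existing partition lengths are used and the temporary data file is simply deleted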
if (dataTmp != null && dataTmp.exists()) {
dataTmp.delete()
}
} else {
// Open an output stream to the temporary index file
val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(indexTmp)))
Utils.tryWithSafeFinally {
// We take in lengths of each block, need to convert it to offsets.
var offset = 0L
out.writeLong(offset)
// Iterate over the partition lengths
for (length <- lengths) {
// Advance the offset by this partition's length
offset += length
// Write the running offset to the index file
out.writeLong(offset)
}
} {
out.close()
}
// Delete any existing index file
if (indexFile.exists()) {
indexFile.delete()
}
// Delete any existing data file
if (dataFile.exists()) {
dataFile.delete()
}
// Rename the temporary index file to the index file
if (!indexTmp.renameTo(indexFile)) {
throw new IOException("fail to rename file " + indexTmp + " to " + indexFile)
}
if (dataTmp != null && dataTmp.exists() && !dataTmp.renameTo(dataFile)) {
throw new IOException("fail to rename file " + dataTmp + " to " + dataFile)
}
}
}
} finally {
if (indexTmp.exists() && !indexTmp.delete()) {
logError(s"Failed to delete temporary index file at ${indexTmp.getAbsolutePath}")
}
}
}
}
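The conversion from partition lengths to index offsets is a running sum with a leading 0, which can be shown without any file I/O. IndexOffsets is a hypothetical helper; segmentFor illustrates how a reader could locate one partition from two adjacent offsets, which is an assumption about the read side rather than something shown in the listing above.

```java
// Hypothetical helper; the real code streams the offsets to the index file.
final class IndexOffsets {
    /** Returns lengths.length + 1 offsets, starting at 0. */
    static long[] toOffsets(long[] lengths) {
        long[] offsets = new long[lengths.length + 1];
        for (int i = 0; i < lengths.length; i++) {
            offsets[i + 1] = offsets[i] + lengths[i];
        }
        return offsets;
    }

    /** Returns {startOffset, length} of one partition inside the data file. */
    static long[] segmentFor(long[] offsets, int partitionId) {
        long start = offsets[partitionId];
        long end = offsets[partitionId + 1];
        return new long[] {start, end - start};
    }
}
```

For lengths {15, 0, 6, 0} the offsets are {0, 15, 15, 21, 21}: an empty partition shows up as two equal neighbouring offsets, so no special marker is needed.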
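8.Summary
Overall flow:
- When the RDD dependencies are built, a wide dependency initializes its shuffleHandle property
- This registers the shuffle with the shuffleManager, which instantiates a ShuffleHandle of the appropriate type
- When an Executor runs a shuffle map task, the task's result data is written to disk through a ShuffleWriter
- The ShuffleManager picks the ShuffleWriter implementation according to the type of the dependency's ShuffleHandle
- The data is written to disk by calling ShuffleWriter.write()
- Each shuffle map task produces one data file + one index file on disk
ShuffleWriter types:
- BypassMergeSortShuffleWriter
- handle:BypassMergeSortShuffleHandle
- Applies when: no map-side aggregation and at most 200 partitions
- Effect: writes numPartitions files directly and concatenates them at the end
- Advantage: avoids serializing and deserializing twice to merge spilled files
- Drawback: many files are open at once, so more memory is allocated to buffers;
- SortShuffleWriter
- handle:BaseShuffleHandle
- Applies when: neither of the other two applies;
- Effect: buffers map output in deserialized form
- Features: supports map-side aggregation and sorting
- UnsafeShuffleWriter
- handle:SerializedShuffleHandle
- Applies when
- The serializer supports relocation of serialized objects;
- No map-side aggregation
- At most 16777216 partitions
- Effect: buffers map output in serialized form and sorts it by partition id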
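The three sets of conditions can be read as one ordered decision. The sketch below is a hypothetical restatement, not Spark's registerShuffle: HandleChoiceSketch and its parameters are illustrative, the threshold of 200 is the default quoted in these notes (treated here as a fixed constant), and the checks are applied in the order bypass, serialized, base.

```java
// Hypothetical restatement of the handle selection described above.
final class HandleChoiceSketch {
    enum Handle { BYPASS_MERGE_SORT, SERIALIZED, BASE }

    static final int BYPASS_MERGE_THRESHOLD = 200;          // default quoted in the notes
    static final int MAX_SERIALIZED_PARTITIONS = 1 << 24;   // 16777216

    static Handle choose(boolean mapSideCombine, int numPartitions, boolean serializerSupportsRelocation) {
        if (!mapSideCombine && numPartitions <= BYPASS_MERGE_THRESHOLD) {
            // Few partitions and no aggregation: one file per partition, concatenated at the end
            return Handle.BYPASS_MERGE_SORT;
        }
        if (serializerSupportsRelocation && !mapSideCombine && numPartitions <= MAX_SERIALIZED_PARTITIONS) {
            // Records can be sorted as opaque bytes, so buffer them in serialized form
            return Handle.SERIALIZED;
        }
        // Everything else: buffer map output in deserialized form
        return Handle.BASE;
    }
}
```

Because the bypass check comes first, a shuffle with 100 partitions and no map-side aggregation gets the bypass handle even when the serializer would also qualify for the serialized path.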