I. Context
In "Spark Task Launch Flow" we saw that when a task is a ShuffleMapTask, after the ShuffleWriter has written its output to disk the task also checks whether the push-based shuffle mechanism can be used. In this article we continue from there and look at what happens behind push-based shuffle.
II. Conditions for enabling push-based shuffle
1. spark.shuffle.push.enabled is set to true (default: false). Setting it to true enables push-based shuffle on the client side; it works together with the server-side flag spark.shuffle.push.server.mergedShuffleFileManagerImpl, which must be set to the appropriate org.apache.spark.network.shuffle.MergedShuffleFileManager implementation for push-based shuffle to be enabled. (A configuration sketch follows this list.)
2. The application is submitted to run in YARN mode.
3. The external shuffle service is enabled.
4. IO encryption is disabled.
5. The serializer (e.g. KryoSerializer) supports relocation of serialized objects.
6. The RDD is not a barrier RDD.
Note: calling barrier() returns an RDDBarrier and also marks the stage containing that RDD as a barrier stage. In a barrier stage Spark must launch all tasks at the same time; if a task fails, Spark aborts the whole stage and relaunches all of its tasks instead of restarting only the failed one. (A small example follows the code snippet below.)
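For reference, here is a minimal configuration sketch of the client-side settings these conditions correspond to. The RemoteBlockPushResolver name below is, to my knowledge, the MergedShuffleFileManager implementation shipped with Spark's YARN shuffle service; verify the exact names against your Spark version.
// Minimal sketch: client-side SparkConf for push-based shuffle on YARN. It assumes the external
// shuffle service on the NodeManagers is already set up with the merged shuffle file manager.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.push.enabled", "true")                              // client-side switch
  .set("spark.shuffle.service.enabled", "true")                           // external shuffle service
  .set("spark.io.encryption.enabled", "false")                            // IO encryption must be off
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  // supports relocation
// Server side (YARN shuffle service configuration):
// spark.shuffle.push.server.mergedShuffleFileManagerImpl =
//   org.apache.spark.network.shuffle.RemoteBlockPushResolver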
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](...){
private def canShuffleMergeBeEnabled(): Boolean = {
val isPushShuffleEnabled = Utils.isPushBasedShuffleEnabled(rdd.sparkContext.getConf,
// invoked at driver
isDriver = true)
if (isPushShuffleEnabled && rdd.isBarrier()) {
logWarning("Push-based shuffle is currently not supported for barrier stages")
}
isPushShuffleEnabled &&
// TODO: SPARK-35547: Push based shuffle is currently unsupported for Barrier stages
!rdd.isBarrier()
}
}
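To make the barrier restriction concrete, here is a minimal sketch (assuming sc is an existing SparkContext) of how a stage becomes a barrier stage; a shuffle produced by such a stage triggers the warning above and falls back to regular shuffle.
// Minimal sketch: rdd.barrier() returns an RDDBarrier and marks the stage as a barrier stage.
val rdd = sc.parallelize(1 to 100, numSlices = 4)
val doubled = rdd.barrier().mapPartitions { iter =>
  // All 4 tasks of this stage must be launched together; if one fails, the whole stage is retried.
  iter.map(_ * 2)
}
// A downstream shuffle, e.g. doubled.map(x => (x % 10, x)).reduceByKey(_ + _), will not use
// push-based shuffle, because canShuffleMergeBeEnabled() sees rdd.isBarrier() == true.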
III. Pushing the ShuffleWriter output
1. ShuffleWriteProcessor
def write(
rdd: RDD[_],
dep: ShuffleDependency[_, _, _],
mapId: Long,
context: TaskContext,
partition: Partition): MapStatus = {
var writer: ShuffleWriter[Any, Any] = null
try {
// Get the ShuffleManager from SparkEnv
val manager = SparkEnv.get.shuffleManager
// Get the ShuffleWriter from the ShuffleManager
writer = manager.getWriter[Any, Any](
dep.shuffleHandle,
mapId,
context,
createMetricsReporter(context))
// Use the ShuffleWriter to write this stage's output to disk
writer.write(
rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
val mapStatus = writer.stop(success = true)
if (mapStatus.isDefined) {
if (dep.shuffleMergeEnabled && dep.getMergerLocs.nonEmpty && !dep.shuffleMergeFinalized) {
manager.shuffleBlockResolver match {
case resolver: IndexShuffleBlockResolver =>
// Get the shuffle data file produced by this map task
val dataFile = resolver.getDataFile(dep.shuffleId, mapId)
// Push the blocks
new ShuffleBlockPusher(SparkEnv.get.conf)
.initiateBlockPush(dataFile, writer.getPartitionLengths(), dep, partition.index)
case _ =>
}
}
}
mapStatus.get
} catch {
......
}
}
2. ShuffleBlockPusher
Used to push shuffle blocks to remote shuffle services when push-based shuffle is enabled. It is created after the ShuffleWriter has finished writing the shuffle data file.
private[spark] class ShuffleBlockPusher(conf: SparkConf) extends Logging {
// spark.shuffle.push.maxBlockSizeToPush, default 1m
// The maximum size of an individual block to push to the remote external shuffle services. Blocks larger than this threshold are not pushed to be merged remotely; those shuffle blocks are fetched by the executors in the original (pull-based) way.
private[this] val maxBlockSizeToPush = conf.get(SHUFFLE_MAX_BLOCK_SIZE_TO_PUSH)
// spark.shuffle.push.maxBlockBatchSize, default 3m
// The maximum size of a batch of shuffle blocks to be grouped into a single push request.
// The default is 3m because it is larger than 2m, the default value of TransportConf#memoryMapBytes. If this were also 2m, every batch of blocks would very likely be loaded into memory via memory mapping, which has higher overhead for small, MB-sized chunks.
private[this] val maxBlockBatchSize = conf.get(SHUFFLE_MAX_BLOCK_BATCH_SIZE_FOR_PUSH)
// spark.reducer.maxSizeInFlight: the maximum size of map output fetched simultaneously by each reduce task. Since each output requires a buffer to receive it, this represents a fixed memory overhead per reduce task, so keep it small unless you have a large amount of memory.
private[this] val maxBytesInFlight =
conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024
// spark.reducer.maxReqsInFlight: limits the number of remote requests fetching blocks at any given point. As the number of hosts in the cluster grows, it may lead to a very large number of inbound connections to one or more nodes, causing workers to fail under load; limiting the number of fetch requests mitigates this.
// Default is Int.MaxValue, i.e. 2147483647
private[this] val maxReqsInFlight = conf.getInt("spark.reducer.maxReqsInFlight", Int.MaxValue)
// spark.reducer.maxBlocksInFlightPerAddress, default Int.MaxValue, i.e. 2147483647
// Limits the number of remote blocks being fetched per reduce task from a given host:port. Requesting a huge number of blocks from a given address in a single fetch or simultaneously can crash the executor or the Node Manager. This is especially useful to reduce the load on the Node Manager when the external shuffle service is enabled. You can mitigate the issue by setting it to a lower value.
private[this] val maxBlocksInFlightPerAddress = conf.get(REDUCER_MAX_BLOCKS_IN_FLIGHT_PER_ADDRESS)
private[this] var bytesInFlight = 0L
private[this] var reqsInFlight = 0
private[this] val numBlocksInFlightPerAddress = new HashMap[BlockManagerId, Int]()
private[this] val deferredPushRequests = new HashMap[BlockManagerId, Queue[PushRequest]]()
// Queue of push requests
private[this] val pushRequests = new Queue[PushRequest]
private[this] val errorHandler = createErrorHandler()
// VisibleForTesting
private[shuffle] val unreachableBlockMgrs = new HashSet[BlockManagerId]()
//......
// Initiate the block push
private[shuffle] def initiateBlockPush(
dataFile: File, // the shuffle data file generated by the map task
partitionLengths: Array[Long], // array of shuffle block sizes, so the shuffle blocks can be told apart
dep: ShuffleDependency[_, _, _], // used to get the shuffle ID and the locations of the remote shuffle services to push the local shuffle blocks to
mapIndex: Int): Unit = { // index of the shuffle map task
val numPartitions = dep.partitioner.numPartitions
val transportConf = SparkTransportConf.fromSparkConf(conf, "shuffle")
val requests = prepareBlockPushRequests(numPartitions, mapIndex, dep.shuffleId,
dep.shuffleMergeId, dataFile, partitionLengths, dep.getMergerLocs, transportConf)
// Randomize the order of the PushRequests
// If the map-side output is sorted, the blocks for a given downstream partition occupy roughly the same region of every map output file, so at any one moment all mappers would be pushing to the same downstream partition or the same node. Randomizing the order spreads the pushes out, which is better for pushing in parallel.
pushRequests ++= Utils.randomize(requests)
submitTask(() => {
tryPushUpToMax()
})
}
private[shuffle] def tryPushUpToMax(): Unit = {
try {
pushUpToMax()
} catch {
......
}
}
// Since multiple block push threads may call pushUpToMax for the same mapper, we synchronize
// access to this method so that only one thread can push blocks for a given mapper at a time.
// This helps simplify access to the shared state. The downside is that we may unnecessarily
// block pushes for other mappers if all the threads are occupied by block pushes from the same mapper.
private def pushUpToMax(): Unit = synchronized {
// Process any outstanding deferred push requests if possible
if (deferredPushRequests.nonEmpty) {
for ((remoteAddress, defReqQueue) <- deferredPushRequests) {
while (isRemoteBlockPushable(defReqQueue) &&
!isRemoteAddressMaxedOut(remoteAddress, defReqQueue.front)) {
val request = defReqQueue.dequeue()
logDebug(s"Processing deferred push request for $remoteAddress with "
+ s"${request.blocks.length} blocks")
sendRequest(request)
if (defReqQueue.isEmpty) {
deferredPushRequests -= remoteAddress
}
}
}
}
// Process any regular push requests if possible.
while (isRemoteBlockPushable(pushRequests)) {
// Take a request from the queue
val request = pushRequests.dequeue()
val remoteAddress = request.address
// The receiving side also limits how many blocks can be in flight per address (default Int.MaxValue, i.e. 2147483647); if the limit is reached, defer this request instead of sending it now
if (isRemoteAddressMaxedOut(remoteAddress, request)) {
logDebug(s"Deferring push request for $remoteAddress with ${request.blocks.size} blocks")
deferredPushRequests.getOrElseUpdate(remoteAddress, new Queue[PushRequest]())
.enqueue(request)
} else {
sendRequest(request)
}
}
def isRemoteBlockPushable(pushReqQueue: Queue[PushRequest]): Boolean = {
pushReqQueue.nonEmpty &&
(bytesInFlight == 0 ||
(reqsInFlight + 1 <= maxReqsInFlight &&
bytesInFlight + pushReqQueue.front.size <= maxBytesInFlight))
}
// Checks whether sending a new push request would exceed the maximum number of blocks being pushed to a given remote address.
def isRemoteAddressMaxedOut(remoteAddress: BlockManagerId, request: PushRequest): Boolean = {
(numBlocksInFlightPerAddress.getOrElse(remoteAddress, 0)
+ request.blocks.size) > maxBlocksInFlightPerAddress
}
}
// Pushes blocks to the remote shuffle service. Once a block in the current batch finishes transferring, the callback listener calls #pushUpToMax again to trigger pushing the next batch of blocks. This decouples the map task from the block push process, because it is the netty client threads, not the task execution threads, that handle the majority of the block pushes.
private def sendRequest(request: PushRequest): Unit = {
bytesInFlight += request.size
reqsInFlight += 1
numBlocksInFlightPerAddress(request.address) = numBlocksInFlightPerAddress.getOrElseUpdate(
request.address, 0) + request.blocks.length
val sizeMap = request.blocks.map { case (blockId, size) => (blockId.toString, size) }.toMap
val address = request.address
val blockIds = request.blocks.map(_._1.toString)
val remainingBlocks = new HashSet[String]() ++= blockIds
// Block push listener
val blockPushListener = new BlockPushingListener {
// Initiating a connection and pushing blocks to the remote shuffle service is always handled by the block push threads.
// We should not initiate connection creation inside the blockPushListener callbacks, which are invoked by the netty event loop, because:
// 1. TransportClient.createConnection(...) blocks until the connection is established, and it is recommended to avoid any blocking operation in the event loop;
// 2. The actual connection creation is a task added to the task queue of another event loop, and the event loops could end up blocking each other. So once the blockPushListener is notified of a block push success or failure, we simply delegate the handling to the block push threads.
def handleResult(result: PushResult): Unit = {
submitTask(() => {
if (updateStateAndCheckIfPushMore(
sizeMap(result.blockId), address, remainingBlocks, result)) {
// Try to push again
tryPushUpToMax()
}
})
}
override def onBlockPushSuccess(blockId: String, data: ManagedBuffer): Unit = {
logTrace(s"Push for block $blockId to $address successful.")
handleResult(PushResult(blockId, null))
}
override def onBlockPushFailure(blockId: String, exception: Throwable): Unit = {
// Check the message or its cause to see whether it needs to be logged.
if (!errorHandler.shouldLogError(exception)) {
logTrace(s"Pushing block $blockId to $address failed.", exception)
} else {
logWarning(s"Pushing block $blockId to $address failed.", exception)
}
handleResult(PushResult(blockId, exception))
}
}
// In addition to randomizing the order of the push requests, further randomize the order of the blocks within each push request, to further reduce the likelihood of collisions of pushed blocks on the shuffle server side. This does not increase the cost of reading unmerged shuffle files on the executor side, because we are still reading MB-sized chunks and only randomize the in-memory sliced buffers after reading.
// A request contains multiple blocks: the requests are randomized and so are the blocks within a request, and then the request is sent
val (blockPushIds, blockPushBuffers) = Utils.randomize(blockIds.zip(
sliceReqBufferIntoBlockBuffers(request.reqBuffer, request.blocks.map(_._2)))).unzip
// Get the BlockManager from SparkEnv; the BlockManager's blockStoreClient performs the transfer
// Below we look in detail at how it pushes the blocks to the reduce side
SparkEnv.get.blockManager.blockStoreClient.pushBlocks(
address.host, address.port, blockPushIds.toArray,
blockPushBuffers.toArray, blockPushListener)
}
// Submit the push task
protected def submitTask(task: Runnable): Unit = {
if (BLOCK_PUSHER_POOL != null) {
BLOCK_PUSHER_POOL.execute(task)
}
}
private val BLOCK_PUSHER_POOL: ExecutorService = {
val conf = SparkEnv.get.conf
if (Utils.isPushBasedShuffleEnabled(conf,
isDriver = SparkContext.DRIVER_IDENTIFIER == SparkEnv.get.executorId)) {
// spark.shuffle.push.numPushThreads: defaults to the number of cores allocated to the executor in spark-submit
// Specifies the number of threads in the block pusher pool. These threads assist in creating connections and pushing blocks to remote external shuffle services. By default, the thread pool size is equal to the number of Spark executor cores.
val numThreads = conf.get(SHUFFLE_NUM_PUSH_THREADS)
.getOrElse(conf.getInt(SparkLauncher.EXECUTOR_CORES, 1))
ThreadUtils.newDaemonFixedThreadPool(numThreads, "shuffle-block-push-thread")
} else {
null
}
}
// Converts the shuffle data file of the current map task into a list of PushRequests.
// Basically, continuous blocks in the shuffle file are grouped into a single request to allow more efficient reading of the block data.
// Each mapper of a given shuffle receives the same list of BlockManagerIds as the target locations to push the blocks to.
// All mappers in the same shuffle map shuffle partition ranges to individual target locations in a consistent manner, so that each target location receives shuffle blocks belonging to the same set of partition ranges.
// 0-length blocks and blocks that are large enough are skipped. (A small worked example of the merger assignment follows this class.)
private[shuffle] def prepareBlockPushRequests(
numPartitions: Int,
partitionId: Int,
shuffleId: Int,
shuffleMergeId: Int,
dataFile: File,
partitionLengths: Array[Long],
mergerLocs: Seq[BlockManagerId],
transportConf: TransportConf): Seq[PushRequest] = {
var offset = 0L
var currentReqSize = 0
var currentReqOffset = 0L
var currentMergerId = 0
val numMergers = mergerLocs.length
// Buffer of push requests
val requests = new ArrayBuffer[PushRequest]
var blocks = new ArrayBuffer[(BlockId, Int)]
for (reduceId <- 0 until numPartitions) {
val blockSize = partitionLengths(reduceId)
logDebug(
s"Block ${ShufflePushBlockId(shuffleId, shuffleMergeId, partitionId,
reduceId)} is of size $blockSize")
// Skip 0-length blocks here; blocks above the max push size are skipped further below
if (blockSize > 0) {
val mergerId = math.min(math.floor(reduceId * 1.0 / numPartitions * numMergers),
numMergers - 1).asInstanceOf[Int]
// Start a new PushRequest if the current request would go beyond the max batch size,
// or the number of blocks in the current request exceeds the limit per destination,
// or the next block's push location is a different shuffle service,
// or the next block exceeds the maximum block size allowed to be pushed.
// This guarantees that each PushRequest represents continuous blocks in the shuffle file to be pushed to the same shuffle service, and does not go beyond the existing limits.
if (currentReqSize + blockSize <= maxBlockBatchSize
&& blocks.size < maxBlocksInFlightPerAddress
&& mergerId == currentMergerId && blockSize <= maxBlockSizeToPush) {
// Add the current block to the current batch
currentReqSize += blockSize.toInt
} else {
if (blocks.nonEmpty) {
// Convert the previous batch into a PushRequest
requests += PushRequest(mergerLocs(currentMergerId), blocks.toSeq,
createRequestBuffer(transportConf, dataFile, currentReqOffset, currentReqSize))
blocks = new ArrayBuffer[(BlockId, Int)]
}
// Start a new batch
currentReqSize = 0
// Set currentReqOffset to -1 so that we can distinguish between its initial value and the start of a new batch
currentReqOffset = -1
currentMergerId = mergerId
}
// Only push the block if its size is within the limit
// If a block exceeds the limit (1m by default) it cannot be pushed; the downstream stage has to fetch it via the traditional pull-based shuffle instead
// In practice this blockSize should not be very large unless there is data skew, because it is the amount of data one map task produced for a single downstream partition
if (blockSize <= maxBlockSizeToPush) {
val blockSizeInt = blockSize.toInt
blocks += ((ShufflePushBlockId(shuffleId, shuffleMergeId, partitionId,
reduceId), blockSizeInt))
// Only update currentReqOffset if the current block is the first block of the request
if (currentReqOffset == -1) {
currentReqOffset = offset
}
if (currentReqSize == 0) {
currentReqSize += blockSizeInt
}
}
}
offset += blockSize
}
// Add the final request
if (blocks.nonEmpty) {
requests += PushRequest(mergerLocs(currentMergerId), blocks.toSeq,
createRequestBuffer(transportConf, dataFile, currentReqOffset, currentReqSize))
}
requests.toSeq
}
}
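To illustrate how prepareBlockPushRequests maps reduce partitions to merger locations, here is a small self-contained sketch of the mergerId formula used above, with made-up numbers (10 reduce partitions, 3 merger locations). Contiguous ranges of reduce partitions map to the same merger, which is what lets contiguous blocks in the shuffle file be grouped into a single PushRequest.
// Minimal sketch of the merger assignment logic in prepareBlockPushRequests.
object MergerAssignmentDemo {
  def mergerIdFor(reduceId: Int, numPartitions: Int, numMergers: Int): Int =
    math.min(math.floor(reduceId * 1.0 / numPartitions * numMergers), numMergers - 1).toInt

  def main(args: Array[String]): Unit = {
    val numPartitions = 10 // example value
    val numMergers = 3     // example value
    (0 until numPartitions).foreach { reduceId =>
      println(s"reduceId=$reduceId -> mergerId=${mergerIdFor(reduceId, numPartitions, numMergers)}")
    }
    // Prints: reduce partitions 0-3 go to merger 0, 4-6 to merger 1, 7-9 to merger 2,
    // i.e. each merger location receives a contiguous range of shuffle partitions.
  }
}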
3. ExternalBlockStoreClient
A client for reading both RDD blocks and shuffle blocks that point to a server external to the executors.
pushBlocks pushes a sequence of shuffle blocks to a remote node asynchronously and in a best-effort manner. These shuffle blocks, along with blocks pushed by other clients, are merged into per-shuffle-partition merged shuffle files on the destination node.
public class ExternalBlockStoreClient extends BlockStoreClient {
public void pushBlocks(
String host,
int port,
String[] blockIds,
ManagedBuffer[] buffers,
BlockPushingListener listener) {
checkInit();
// Assert that the number of block ids matches the number of buffers
assert blockIds.length == buffers.length : "Number of block ids and buffers do not match.";
// Map each block id to its corresponding buffer
Map<String, ManagedBuffer> buffersWithId = new HashMap<>();
for (int i = 0; i < blockIds.length; i++) {
buffersWithId.put(blockIds[i], buffers[i]);
}
// Log how many shuffle blocks are being pushed to which node
logger.debug("Push {} shuffle blocks to {}:{}", blockIds.length, host, port);
try {
RetryingBlockTransferor.BlockTransferStarter blockPushStarter =
(inputBlockId, inputListener) -> {
if (clientFactory != null) {
assert inputListener instanceof BlockPushingListener :
"Expecting a BlockPushingListener, but got " + inputListener.getClass();
// Create a TransportClient connected to the given remote host/port.
// We maintain an array of clients (size determined by spark.shuffle.io.numConnectionsPerPeer) and randomly pick one to use. If no client was previously created at the randomly selected slot, this function creates a new client there.
// Because push-based shuffle sends many requests to multiple nodes, the created TransportClients are cached in this array; the next time the same remote host is targeted, the existing client is reused instead of creating a new one. (A simplified model of this pooling follows this class.)
// fastFail defaults to false. If fastFail is true, the call fails immediately when the last attempt to the same address failed within the fast-fail time window (95% of the IO retry wait timeout); the caller is assumed to handle retries.
// Prior to creating a new TransportClient, all TransportClientBootstraps registered with this factory are executed; this blocks until the connection is successfully established and fully bootstrapped.
// A TransportClient is essentially a netty client, and a TransportChannelHandler is installed in its pipeline.
TransportClient client = clientFactory.createClient(host, port);
// Build a OneForOneBlockPusher to perform the push
new OneForOneBlockPusher(client, appId, comparableAppAttemptId, inputBlockId,
(BlockPushingListener) inputListener, buffersWithId).start();
} else {
logger.info("This clientFactory was closed. Skipping further block push retries.");
}
};
int maxRetries = transportConf.maxIORetries();
if (maxRetries > 0) {
new RetryingBlockTransferor(
transportConf, blockPushStarter, blockIds, listener, PUSH_ERROR_HANDLER).start();
} else {
blockPushStarter.createAndStart(blockIds, listener);
}
} catch (Exception e) {
logger.error("Exception while beginning pushBlocks", e);
for (String blockId : blockIds) {
listener.onBlockPushFailure(blockId, e);
}
}
}
}
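The comments above describe how TransportClientFactory.createClient reuses connections. The sketch below is a deliberately simplified model of that pooling idea, not Spark's actual internals: the names SimpleClient, SimpleClientFactory and ClientPool are illustrative only. One pool per (host, port), an array of numConnectionsPerPeer slots, and a random slot that is filled lazily and then reused.
// Simplified, illustrative model of per-peer connection pooling (not Spark's real implementation).
import java.util.concurrent.ConcurrentHashMap
import scala.util.Random

final case class SimpleClient(host: String, port: Int) // stand-in for a real TransportClient

class SimpleClientFactory(numConnectionsPerPeer: Int) {
  private class ClientPool { val clients = new Array[SimpleClient](numConnectionsPerPeer) }
  private val connectionPool = new ConcurrentHashMap[(String, Int), ClientPool]()

  def createClient(host: String, port: Int): SimpleClient = {
    val pool = connectionPool.computeIfAbsent((host, port), _ => new ClientPool)
    val slot = Random.nextInt(numConnectionsPerPeer) // pick a random slot to spread the load
    pool.clients.synchronized {
      if (pool.clients(slot) == null) {
        // No client cached at this slot yet: "open a connection" and cache it for reuse
        pool.clients(slot) = SimpleClient(host, port)
      }
      pool.clients(slot)
    }
  }
}
This is only meant to convey why the summary later mentions a ConcurrentHashMap<SocketAddress, ClientPool>; the real factory additionally runs bootstraps, handles timeouts, and supports fast-fail.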
4. OneForOneBlockPusher
Used to push blocks to a remote shuffle service to be merged. Its counterpart is OneForOneBlockFetcher, which fetches blocks from remote shuffle services. (A sketch of the push block id format it expects follows the code below.)
public class OneForOneBlockPusher {
// Start the block push process; the listener is invoked for every pushed block
public void start() {
logger.debug("Start pushing {} blocks", blockIds.length);
for (int i = 0; i < blockIds.length; i++) {
assert buffers.containsKey(blockIds[i]) : "Could not find the block buffer for block "
+ blockIds[i];
String[] blockIdParts = blockIds[i].split("_");
if (blockIdParts.length != 5 || !blockIdParts[0].equals(SHUFFLE_PUSH_BLOCK_PREFIX)) {
throw new IllegalArgumentException(
"Unexpected shuffle push block id format: " + blockIds[i]);
}
// Build the message header: appId, app attempt id, block information, etc.
ByteBuffer header =
new PushBlockStream(appId, appAttemptId, Integer.parseInt(blockIdParts[1]),
Integer.parseInt(blockIdParts[2]), Integer.parseInt(blockIdParts[3]),
Integer.parseInt(blockIdParts[4]), i).toByteBuffer();
// Call the netty-based client to start transferring the data
client.uploadStream(new NioManagedBuffer(header), buffers.get(blockIds[i]),
new BlockPushCallback(i, blockIds[i]));
}
}
}
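The split("_") check above relies on the naming scheme of ShufflePushBlockId, which, as far as I can tell from Spark 3.2+, is shufflePush_<shuffleId>_<shuffleMergeId>_<mapIndex>_<reduceId>. Here is a small sketch of parsing such an id into the fields that go into the PushBlockStream header.
// Minimal sketch: parsing a push block id of the form
// shufflePush_<shuffleId>_<shuffleMergeId>_<mapIndex>_<reduceId> (assumed Spark 3.2+ format).
object PushBlockIdDemo {
  final case class PushBlockId(shuffleId: Int, shuffleMergeId: Int, mapIndex: Int, reduceId: Int)

  def parse(blockId: String): PushBlockId = {
    val parts = blockId.split("_")
    require(parts.length == 5 && parts(0) == "shufflePush",
      s"Unexpected shuffle push block id format: $blockId")
    PushBlockId(parts(1).toInt, parts(2).toInt, parts(3).toInt, parts(4).toInt)
  }

  def main(args: Array[String]): Unit = {
    // e.g. shuffle 0, merge attempt 0, map task index 5, reduce partition 2
    println(parse("shufflePush_0_0_5_2")) // PushBlockId(0,0,5,2)
  }
}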
5. TransportClient
A client for fetching consecutive chunks of a pre-negotiated stream. This API is intended for efficient transfer of large amounts of data, broken up into chunks ranging in size from hundreds of KB to a few MB.
Note that while this client deals with fetching chunks from a stream (i.e. the data plane), the actual setup of the stream is done outside the scope of the transport layer. The convenience method sendRPC is provided for control-plane communication between the client and server to perform this setup.
For example, a typical workflow might be:
client.sendRPC(new OpenFile("/foo")) --> returns StreamId = 100
client.fetchChunk(streamId = 100, chunkIndex = 0, callback)
client.fetchChunk(streamId = 100, chunkIndex = 1, callback)
......
client.sendRPC(new CloseStream(100))
Instances of TransportClient are constructed using a TransportClientFactory. A single TransportClient may be used for multiple streams, but any given stream must be restricted to a single client in order to avoid out-of-order responses.
Note: this class is used to make requests to the server, while TransportResponseHandler is responsible for handling responses from the server.
Concurrency: thread safe and can be called from multiple threads.
public class TransportClient implements Closeable {
// Send data to the remote end as a stream
public long uploadStream(
ManagedBuffer meta,
ManagedBuffer data,
RpcResponseCallback callback) {
if (logger.isTraceEnabled()) {
logger.trace("Sending RPC to {}", getRemoteAddress(channel));
}
long requestId = requestId();
handler.addRpcRequest(requestId, callback);
RpcChannelListener listener = new RpcChannelListener(requestId, callback);
// UploadStream is an RPC whose data is sent outside of the frame, so it can be read as a stream.
// Use netty to transfer the data
channel.writeAndFlush(new UploadStream(requestId, meta, data)).addListener(listener);
return requestId;
}
}
IV. Summary
1. The ShuffleWriter in the ShuffleMapTask finishes writing its output to disk.
2. Check whether the current environment supports push-based shuffle (assume it does).
3. Locate the shuffle data file produced by this task.
4. Build and initialize a ShuffleBlockPusher (max size of a single pushed block, max size of a single push request, limits on how much the remote end may receive at once, and so on).
5. Group the data blocks by partition into PushRequests, put them into the queue, and randomize their order (if a partition's block is too large it is not pushed, and the next stage fetches it by pulling instead).
6. Prepare to push.
7. Before pushing, check whether the remote end has reached its receiving limits, and randomize the order of the blocks within each PushRequest.
8. Get the BlockManager from SparkEnv; the BlockManager uses its BlockStoreClient for the transfer.
9. A connection pool is maintained (ConcurrentHashMap<SocketAddress, ClientPool> connectionPool): if there is no Netty client for the remote socket yet, a new one is created; otherwise the existing one is reused.
10. Build a OneForOneBlockPusher and start pushing the data stream.
11. Finally, the Netty client's channel.writeAndFlush() pushes the data stream to the target host.
12. When the listener is notified of a successful push, pushUpToMax is called again to trigger pushing the next batch of blocks.