Spark 3 on YARN: Dynamic Executor Allocation in Detail

gezooo

Tags: hadoop big data hdfs spark

Article link: https://blog.csdn.net/gezooo/article/details/125129513

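I recently ran into slow ResourceManager (RM) performance at work and dug into how the RM allocates resources. The internal flow is fairly complex, so I am writing it down for future reference. Over the life cycle of a YARN container, from application submission to completion, the allocation and removal of containers is mainly initiated by the ApplicationMaster and managed by the Scheduler inside the RM. This article focuses on how Spark allocates executors. I plan to cover YARN's internal container allocation in a separate article, so I will not go into it here. A brief overview of a few RM components will still help with understanding the executor allocation flow that follows.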

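The RM exposes three main external protocols:

1. ApplicationClientProtocol handles interaction with clients: submitting an application, getApplicationReport, killApplication, and so on. The corresponding RM component is ClientRMService, and the default port is 8032.

2. ResourceTracker is the protocol that NodeManagers use to register and send heartbeats. The corresponding RM component is ResourceTrackerService, and the default port is 8031.

3. ApplicationMasterProtocol is the protocol the ApplicationMaster uses to talk to the RM. It has three methods: registerApplicationMaster, finishApplicationMaster, and allocate. The corresponding RM component is ApplicationMasterService, and the default port is 8030.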
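
As a quick reference, the three endpoints can be written down as a small lookup table. This is only a sketch: the enum `RmEndpointSketch` and its method `forDefaultPort` are hypothetical names invented for illustration and are not part of Hadoop. The configuration keys are the standard yarn-site.xml properties that override each address.

```java
/** Hypothetical lookup table for the three RM-facing protocols; not a Hadoop class. */
enum RmEndpointSketch {
    CLIENT("ApplicationClientProtocol", "ClientRMService",
            "yarn.resourcemanager.address", 8032),
    RESOURCE_TRACKER("ResourceTracker", "ResourceTrackerService",
            "yarn.resourcemanager.resource-tracker.address", 8031),
    SCHEDULER("ApplicationMasterProtocol", "ApplicationMasterService",
            "yarn.resourcemanager.scheduler.address", 8030);

    final String protocol;   // RPC protocol name
    final String rmService;  // RM component that serves the protocol
    final String configKey;  // yarn-site.xml property that overrides the address
    final int defaultPort;

    RmEndpointSketch(String protocol, String rmService, String configKey, int defaultPort) {
        this.protocol = protocol;
        this.rmService = rmService;
        this.configKey = configKey;
        this.defaultPort = defaultPort;
    }

    /** Returns the endpoint that uses the given default port, or null if none does. */
    static RmEndpointSketch forDefaultPort(int port) {
        for (RmEndpointSketch e : values()) {
            if (e.defaultPort == port) {
                return e;
            }
        }
        return null;
    }
}
```

For the rest of this article only the SCHEDULER entry matters, since that is the endpoint the ApplicationMaster talks to.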

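The versions I studied are Spark 3.1.1 and Hadoop CDH 2.6.0.

In Spark 3.1.1, our production jobs are configured to request executors dynamically. The class that manages executor requests is ExecutorAllocationManager. It is created when SparkContext is initialized, so this component lives on the Spark driver.

ExecutorAllocationManager contains two SparkListeners, ExecutorAllocationListener and ExecutorMonitor. They listen to the stage and task events emitted while Spark RDDs execute and aggregate information about tasks and executors.

Once initialized, ExecutorAllocationManager starts an internal thread that periodically runs the schedule method.

Inside schedule, ExecutorMonitor is used to determine which executors have timed out. Any that have are released by calling killExecutors.

Whether an executor times out depends on whether it has running tasks and whether it holds cached RDD blocks or cached shuffle data. If tasks are running, no timeout is computed for that executor. Otherwise the timeouts for cached shuffle data and cached RDD blocks are computed and the smaller of the two is used.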
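
The timeout rule above can be sketched as a single function. This is a simplified illustration, not Spark's ExecutorMonitor code: the class name `ExecutorTimeoutSketch`, the method `timeoutMs`, and all parameter names are hypothetical. The fallback to a plain idle timeout for an executor that holds neither cached blocks nor shuffle data is an assumption added here, since the text above only describes the cached cases.

```java
/** Illustrative sketch only; names are hypothetical and this is not Spark's ExecutorMonitor. */
final class ExecutorTimeoutSketch {
    /** Sentinel meaning "this executor cannot time out right now". */
    static final long NO_TIMEOUT = Long.MAX_VALUE;

    /**
     * Returns the timeout in milliseconds that applies to one executor,
     * or NO_TIMEOUT while it still has running tasks.
     */
    static long timeoutMs(int runningTasks,
                          boolean hasCachedBlocks,
                          boolean hasShuffleData,
                          long idleTimeoutMs,
                          long cachedTimeoutMs,
                          long shuffleTimeoutMs) {
        if (runningTasks > 0) {
            return NO_TIMEOUT; // busy executors are never candidates for removal
        }
        if (hasCachedBlocks || hasShuffleData) {
            long cache = hasCachedBlocks ? cachedTimeoutMs : Long.MAX_VALUE;
            long shuffle = hasShuffleData ? shuffleTimeoutMs : Long.MAX_VALUE;
            return Math.min(cache, shuffle); // the smaller of the two wins
        }
        // Assumption: with nothing cached, a plain idle timeout applies.
        return idleTimeoutMs;
    }
}
```

For example, `ExecutorTimeoutSketch.timeoutMs(0, true, true, 60000, 120000, 300000)` yields 120000, because the cached-block timeout is the smaller of the two.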

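The schedule method also calls updateAndSyncNumExecutorsTarget. This method computes the target number of executors, that is, the total number of executors the Spark application currently needs. The calculation is based on the current number of pending tasks, which is where ExecutorAllocationListener comes in: it listens to events during the Spark job and keeps count of pending tasks. The relevant event handling is as follows:

On stageSubmitted, the stage's task count is stored in stageAttemptToNumTasks.
On stageCompleted, the stage is removed from stageAttemptToNumTasks.
On taskStart, stageAttemptToNumRunningTask is incremented by 1 and, depending on whether the task is speculative, its task index is added to stageAttemptToSpeculativeTaskIndices or stageAttemptToTaskIndices.
On taskEnd, if the task failed and is not speculative, its task index is removed from stageAttemptToTaskIndices.
Other events related to speculative tasks are handled as well; they are not listed here.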

When ExecutorAllocationManager computes maxNumExecutorsNeededPerResourceProfile, maxNeeded = (pending + running) * executorAllocationRatio / tasksPerExecutor, where

pending is the sum over all stage attempts of pendingTask + pendingSpeculativeTasks.
pendingTask = totalNumTask - numRunning (the number of task indices stored in stageAttemptToTaskIndices).
pendingSpeculativeTasks = numTotalTaskOfSpeculative - numRunningOfSpeculative.
All of these values are updated while listening to stage and task events.
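
The bookkeeping described above can be sketched with two maps. This is a reduced illustration, not Spark's ExecutorAllocationListener: the class `PendingTaskSketch` and its method names are hypothetical, stage attempts are identified by a plain string, and speculative tasks are left out entirely.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of the listener's counters (no speculative tasks). */
final class PendingTaskSketch {
    // stage attempt -> total number of tasks in that attempt
    private final Map<String, Integer> stageAttemptToNumTasks = new HashMap<>();
    // stage attempt -> indices of tasks that have been started
    private final Map<String, Set<Integer>> stageAttemptToTaskIndices = new HashMap<>();

    void onStageSubmitted(String stageAttempt, int numTasks) {
        stageAttemptToNumTasks.put(stageAttempt, numTasks);
    }

    void onStageCompleted(String stageAttempt) {
        stageAttemptToNumTasks.remove(stageAttempt);
        stageAttemptToTaskIndices.remove(stageAttempt);
    }

    void onTaskStart(String stageAttempt, int taskIndex) {
        stageAttemptToTaskIndices
                .computeIfAbsent(stageAttempt, k -> new HashSet<>())
                .add(taskIndex);
    }

    /** A failed, non-speculative task goes back to pending, so its index is forgotten. */
    void onTaskFailed(String stageAttempt, int taskIndex) {
        Set<Integer> started = stageAttemptToTaskIndices.get(stageAttempt);
        if (started != null) {
            started.remove(taskIndex);
        }
    }

    /** pendingTask summed over all stage attempts: total minus started indices. */
    int pendingTasks() {
        int pending = 0;
        for (Map.Entry<String, Integer> e : stageAttemptToNumTasks.entrySet()) {
            Set<Integer> started = stageAttemptToTaskIndices.get(e.getKey());
            pending += e.getValue() - (started == null ? 0 : started.size());
        }
        return pending;
    }
}
```

With a 10-task stage and three started tasks, `pendingTasks()` returns 7. If one of those three then fails, it returns 8.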

The second way to compute maxNeeded is unschedulableTaskSets * executorAllocationRatio / tasksPerExecutor + executorMonitor.executorCountWithResourceProfile(rpId).
Here unschedulableTaskSets comes from TaskSchedulerImpl.resourceOffers: if no task could be assigned and dynamic executor allocation is enabled, a SparkListenerUnschedulableTaskSetAdded event is posted. unschedulableTaskSets is the number of such task sets, not the number of tasks inside them.

maxNeeded is the larger of the two results.
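
Putting the two formulas together gives roughly the following. This is a sketch under stated assumptions, not the code of ExecutorAllocationManager: the class `MaxNeededSketch` and its parameter names are hypothetical, and rounding each division up to a whole executor is an assumption made here so that the result is an integer.

```java
/** Hypothetical sketch of the maxNeeded calculation for one ResourceProfile. */
final class MaxNeededSketch {
    static int maxNeeded(int pending,
                         int running,
                         int unschedulableTaskSets,
                         int currentExecutors,
                         double executorAllocationRatio,
                         int tasksPerExecutor) {
        // First formula: executors needed to run every pending and running task.
        int byTasks = (int) Math.ceil(
                (pending + running) * executorAllocationRatio / tasksPerExecutor);
        if (unschedulableTaskSets <= 0) {
            return byTasks;
        }
        // Second formula: extra executors for unschedulable task sets,
        // on top of the executors that already exist for this profile.
        int byUnschedulable = (int) Math.ceil(
                unschedulableTaskSets * executorAllocationRatio / tasksPerExecutor)
                + currentExecutors;
        return Math.max(byTasks, byUnschedulable);
    }
}
```

For example, 90 pending and 10 running tasks with a ratio of 1.0 and 4 tasks per executor give 25 executors. The same workload with a ratio of 0.5 gives 13.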

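When maxNeeded differs from the value computed in the previous round, the target number of executors for each ResourceProfile is increased or decreased, and ExecutorAllocationClient.requestTotalExecutors is called, which triggers the ApplicationMaster to request or release executors on YARN. Note that each round computes the total number of executors needed from the current pending tasks, not an increment.

[Figure: call flow]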

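YarnAllocator is a component inside the ApplicationMaster that helps it request and release containers from the RM. When the ApplicationMaster starts, it launches a background thread that periodically calls YarnAllocator.allocateResources, which in turn calls AMRMClient.allocate to talk to the RM. Next, let's look at how YarnAllocator handles executor requests and releases.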

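YarnAllocator maintains targetNumExecutorsPerResourceProfileId, which stores the desired number of executors for each ResourceProfile. In Spark one executor corresponds to one YARN container, so this is also the number of containers needed. Each call to allocateResources first calls updateResourceRequests, which computes how many more executors each ResourceProfile needs.
The formula is: missing = targetNum - pending - running - starting.

targetNum is the value stored in targetNumExecutorsPerResourceProfileId. It is the parameter carried by the RequestExecutors message that ExecutorAllocationManager periodically sends through the SchedulerBackend.
pending is the number of ContainerRequests submitted to AMRMClient so far that have not yet been allocated. Requests that YarnAllocator passes to AMRMClient.addContainerRequest are not sent to the RM immediately; they are buffered in AMRMClient. They are only sent when AMRMClient.allocate is called, which returns an AllocateResponse containing the containers newly allocated by the RM as well as the completed containers. The client then processes these containers: for each newly allocated container that is suitable, it launches an executor and increments the starting executor count. Once the launch succeeds, it decrements the starting count and adds the executor to the running list.
running is the number of executors that have already been launched.
starting is the number of containers that have just been received in an allocate response and are still starting up. Once startup succeeds, the count is decremented and the executor is added to running.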
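
The way a container moves through pending, starting, and running can be sketched with four counters. The class `ExecutorCountersSketch` and its method names are hypothetical and only mirror the formula above; the real YarnAllocator keeps these numbers per ResourceProfile and derives pending from AMRMClient.

```java
/** Hypothetical counters for a single ResourceProfile; not Spark's YarnAllocator. */
final class ExecutorCountersSketch {
    int target;    // desired total number of executors
    int pending;   // container requests not yet allocated by the RM
    int starting;  // containers allocated, executor still launching
    int running;   // executors launched successfully

    /** A new container request is handed to the AMRMClient. */
    void requestContainer() {
        pending++;
    }

    /** The RM allocated a container for one pending request; the launch begins. */
    void containerAllocated() {
        pending--;
        starting++;
    }

    /** The executor in an allocated container came up. */
    void executorLaunched() {
        starting--;
        running++;
    }

    /** missing = targetNum - pending - running - starting. */
    int missing() {
        return target - pending - running - starting;
    }
}
```

Note that moving a container from pending to starting to running leaves `missing()` unchanged. Only a new target or a new request changes it.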

The per-host task count is used below, so here is a short aside on where it comes from.
On stageSubmitted, the stageInfo carries TaskLocation information. When ExecutorAllocationManager handles the event, it aggregates the task locations into stageAttemptToExecutorPlacementHints, which is later converted into rpIdToHostToLocalTaskCount, holding the task count per host for each ResourceProfile. This is sent to the ApplicationMaster together with the RequestExecutors message.

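After computing how many executors each ResourceProfile is still missing, the pending ContainerRequests are split into three parts according to their locality needs and stored in the following three lists:

localRequests: the request's host is in the host list of rpIdToHostToLocalTaskCount
staleRequests: the request's host is not in the host list of rpIdToHostToLocalTaskCount
anyHostRequests: the request has no host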
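
The three-way split can be sketched as a classification function. The names `RequestSplitSketch`, `Kind`, and `classify` are hypothetical. Treating a request as local when at least one of its hosts is still preferred is an assumption about how a multi-host request is handled.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of splitting pending requests by locality. */
final class RequestSplitSketch {
    enum Kind { LOCAL, STALE, ANY_HOST }

    /**
     * @param nodes          hosts named by the pending request; null or empty means no host
     * @param preferredHosts hosts that currently have local tasks (the key set of the
     *                       per-host task count for this ResourceProfile)
     */
    static Kind classify(List<String> nodes, Set<String> preferredHosts) {
        if (nodes == null || nodes.isEmpty()) {
            return Kind.ANY_HOST;
        }
        for (String node : nodes) {
            if (preferredHosts.contains(node)) {
                return Kind.LOCAL; // assumption: one still-preferred host is enough
            }
        }
        return Kind.STALE; // names hosts, but none of them is preferred any more
    }
}
```

A request becomes stale when the tasks that motivated its host preference have already run elsewhere or finished, so the host no longer appears in the task-count map.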

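Next, AMRMClient.removeContainerRequest is called for each stale request to cancel the staleRequests, and the number cancelled is recorded as cancelledContainers.

The number of containers to request then becomes missing = missing + cancelledContainers.

After that, containerLocalityPreferences is computed using containerPlacementStrategy, which assigns hosts and racks to the ContainerRequests that have locality needs. This is fairly detailed and is not covered here; interested readers can find the code in Spark's LocalityPreferredContainerPlacementStrategy class.

Likewise, if missing is less than 0, meaning more executors have been requested than are currently needed, AMRMClient.removeContainerRequest is called to cancel the surplus pending requests.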
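
The two branches (missing above zero, missing below zero) can be summarized in a small sketch. The class `ResourceRequestPlanSketch` and its methods are hypothetical. Capping the number of cancellations at the number of pending requests is an assumption, based on the observation that removeContainerRequest can only cancel requests that have not been fulfilled yet.

```java
/** Hypothetical sketch of what to request or cancel in one allocation round. */
final class ResourceRequestPlanSketch {
    /**
     * Containers to request when executors are missing. Cancelled stale
     * requests free their slots, so they are added back to the total.
     */
    static int toRequest(int missing, int cancelledStale) {
        return missing > 0 ? missing + cancelledStale : 0;
    }

    /**
     * Pending requests to cancel when there is a surplus (missing < 0).
     * Assumption: no more than the number of pending requests can be cancelled.
     */
    static int toCancel(int missing, int pending) {
        return missing < 0 ? Math.min(pending, -missing) : 0;
    }
}
```

For instance, with 3 executors missing and 2 stale requests cancelled, 5 new requests are made. With a surplus of 4 but only 2 requests still pending, only those 2 are cancelled.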

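AMRMClient.addContainerRequest and AMRMClient.removeContainerRequest only hand requests to AMRMClient, which buffers them; nothing is sent to the RM (ResourceManager) yet. To actually send them, AMRMClient.allocate() must be called. The name is somewhat misleading: besides requesting containers from the RM, it also handles container releases, and it doubles as the AM-to-RM heartbeat. If it is not called in time, the RM considers the AM to have timed out, so the AM must call allocate periodically. The RM's response carries newly allocated containers, the statuses of completed containers, and node health updates.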
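
The buffering behaviour can be illustrated with a toy client. This is deliberately not the Hadoop AMRMClient API: the class `BufferedAmClientSketch` is hypothetical, requests are plain strings, and the "RM" is just a list that records what was sent. It only shows that add, remove, and release calls are local until allocate() runs, and that every allocate() call counts as a heartbeat even when nothing is buffered.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a buffering AM-to-RM client; hypothetical, not Hadoop's AMRMClient. */
final class BufferedAmClientSketch {
    private final List<String> bufferedAsks = new ArrayList<>();
    private final List<String> bufferedReleases = new ArrayList<>();

    /** Everything that has actually been sent to the (simulated) RM. */
    final List<String> sentToRm = new ArrayList<>();
    /** Number of allocate() calls, each of which doubles as a heartbeat. */
    int heartbeats = 0;

    void addContainerRequest(String ask) {
        bufferedAsks.add(ask); // local only
    }

    void removeContainerRequest(String ask) {
        bufferedAsks.remove(ask); // local only
    }

    void releaseAssignedContainer(String containerId) {
        bufferedReleases.add(containerId); // local only
    }

    /** Flushes buffered asks and releases in one call; also serves as the heartbeat. */
    void allocate() {
        heartbeats++;
        for (String ask : bufferedAsks) {
            sentToRm.add("ask:" + ask);
        }
        for (String id : bufferedReleases) {
            sentToRm.add("release:" + id);
        }
        bufferedAsks.clear();
        bufferedReleases.clear();
    }
}
```

One simplification to keep in mind: in this toy an ask disappears from the buffer once sent, whereas the real client keeps outstanding requests in its request table until they are matched or removed.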

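When the AM receives the AllocateResponse returned by allocate, it handles

the newly allocated containers (allocatedContainers)
the completed containers

Newly allocated containers are handled as follows:

Match by host: using the host of the allocated container, look for a matching ContainerRequest. If one is found, remove it with AMRMClient.removeContainerRequest(containerRequest) to mark the request as fulfilled, and put the container in the containersToUse list.
If there is no match, put the container in the remaining list and go on to match by rack. This works like match by host, except that the resource name is the rack of the allocated container's node.
If a match is found, remove the ContainerRequest with AMRMClient.removeContainerRequest(containerRequest) to mark the request as fulfilled, and put the container in the containersToUse list.
If there is no match, put the container in the remaining list and go on to match ANY, where the resource name is ANY. If a match is found, remove the ContainerRequest with AMRMClient.removeContainerRequest(containerRequest) to mark the request as fulfilled, and put the container in the containersToUse list.
If there is still no match, the allocated resource does not fit the current requests, perhaps because it is too small or for some other reason, and it is released by calling AMRMClient.releaseAssignedContainer(container.getId()).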
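
The host, then rack, then ANY cascade can be sketched for a single container as follows. This is a simplified model and not Spark's implementation: `ContainerMatchSketch` and `Request` are hypothetical names, container size is ignored, and it is assumed that every open request can also be satisfied at the ANY level. The steps above also describe the host pass running over all containers before the rack pass starts, whereas this sketch handles one container at a time.

```java
import java.util.List;

/** Hypothetical sketch of matching one allocated container to an open request. */
final class ContainerMatchSketch {
    /** An open container request; a null host or rack means "no preference". */
    static final class Request {
        final String host;
        final String rack;

        Request(String host, String rack) {
            this.host = host;
            this.rack = rack;
        }
    }

    /**
     * Returns the request fulfilled by the container and removes it from {@code open},
     * or returns null when nothing matches and the container should be released.
     */
    static Request match(String containerHost, String containerRack, List<Request> open) {
        Request found = null;
        for (Request r : open) {                 // 1. match by host
            if (containerHost.equals(r.host)) {
                found = r;
                break;
            }
        }
        if (found == null) {
            for (Request r : open) {             // 2. match by rack
                if (containerRack.equals(r.rack)) {
                    found = r;
                    break;
                }
            }
        }
        if (found == null && !open.isEmpty()) {  // 3. match ANY (assumed always allowed)
            found = open.get(0);
        }
        if (found != null) {
            open.remove(found);                  // the request is now fulfilled
        }
        return found;
    }
}
```

A null return corresponds to the final step in the text, where the container is handed back with releaseAssignedContainer.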

Then runAllocatedContainers(containersToUse) is called to update the internal state:

runningExecutorsPerResourceProfileId
numExecutorsStartingPerResourceProfileId
allocatedHostToContainersMapPerRPId

Finally NMClient.startContainer(container, ctx) is called.
The command executed in the container is generated in ExecutorRunnable.prepareCommand(). The class launched is org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.

That essentially completes the flow of dynamic executor allocation in Spark 3. Below is some of the logic in AMRMClientImpl, for a closer look at its internal implementation.

AMRMClient maintains remoteRequestsTable, a multi-level map from Priority -> ResourceName (e.g. nodename, rackname, *) -> Resource Capability -> ResourceRequestInfo.

//Key -> Priority
//Value -> Map
//  Key->ResourceName (e.g., nodename, rackname, *)
//  Value->Map
//    Key->Resource Capability
//    Value->ResourceRequest
protected final
Map<Priority, Map<String, TreeMap<Resource, ResourceRequestInfo>>>
  remoteRequestsTable =
  new TreeMap<Priority, Map<String, TreeMap<Resource, ResourceRequestInfo>>>();

When addContainerRequest(T req) is called, the request is added in turn to the cache entries for the resource names taken from req.getNodes(), req.getRacks(), inferredRacks, and ANY.

@Override
public synchronized void addContainerRequest(T req) {
  Preconditions.checkArgument(req != null,
      "Resource request can not be null.");
  Set<String> dedupedRacks = new HashSet<String>();
  if (req.getRacks() != null) {
    dedupedRacks.addAll(req.getRacks());
    if(req.getRacks().size() != dedupedRacks.size()) {
      Joiner joiner = Joiner.on(',');
      LOG.warn("ContainerRequest has duplicate racks: "
          + joiner.join(req.getRacks()));
    }
  }
  Set<String> inferredRacks = resolveRacks(req.getNodes());
  inferredRacks.removeAll(dedupedRacks);

  // check that specific and non-specific requests cannot be mixed within a
  // priority
  checkLocalityRelaxationConflict(req.getPriority(), ANY_LIST,
      req.getRelaxLocality());
  // check that specific rack cannot be mixed with specific node within a
  // priority. If node and its rack are both specified then they must be
  // in the same request.
  // For explicitly requested racks, we set locality relaxation to true
  checkLocalityRelaxationConflict(req.getPriority(), dedupedRacks, true);
  checkLocalityRelaxationConflict(req.getPriority(), inferredRacks,
      req.getRelaxLocality());

  if (req.getNodes() != null) {
    HashSet<String> dedupedNodes = new HashSet<String>(req.getNodes());
    if(dedupedNodes.size() != req.getNodes().size()) {
      Joiner joiner = Joiner.on(',');
      LOG.warn("ContainerRequest has duplicate nodes: "
          + joiner.join(req.getNodes()));
    }
    for (String node : dedupedNodes) {
      addResourceRequest(req.getPriority(), node, req.getCapability(), req,
          true, req.getNodeLabelExpression());
    }
  }

  for (String rack : dedupedRacks) {
    addResourceRequest(req.getPriority(), rack, req.getCapability(), req,
        true, req.getNodeLabelExpression());
  }

  // Ensure node requests are accompanied by requests for
  // corresponding rack
  for (String rack : inferredRacks) {
    addResourceRequest(req.getPriority(), rack, req.getCapability(), req,
        req.getRelaxLocality(), req.getNodeLabelExpression());
  }

  // Off-switch
  addResourceRequest(req.getPriority(), ResourceRequest.ANY,
      req.getCapability(), req, req.getRelaxLocality(), req.getNodeLabelExpression());
}

When a ContainerRequest is removed, it is removed from the node, rack, and ANY entries.

@Override
public synchronized void removeContainerRequest(T req) {
  Preconditions.checkArgument(req != null,
      "Resource request can not be null.");
  Set<String> allRacks = new HashSet<String>();
  if (req.getRacks() != null) {
    allRacks.addAll(req.getRacks());
  }
  allRacks.addAll(resolveRacks(req.getNodes()));

  // Update resource requests
  if (req.getNodes() != null) {
    for (String node : new HashSet<String>(req.getNodes())) {
      decResourceRequest(req.getPriority(), node, req.getCapability(), req);
    }
  }

  for (String rack : allRacks) {
    decResourceRequest(req.getPriority(), rack, req.getCapability(), req);
  }

  decResourceRequest(req.getPriority(), ResourceRequest.ANY,
      req.getCapability(), req);
}
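
To make the fan-out concrete, here is a reduced model of the request table. It is a sketch, not Hadoop code: `RequestTableSketch` is a hypothetical class, priority is an int, the capability is reduced to memory in MB, and the innermost value is just an outstanding-container count instead of a ResourceRequestInfo. It shows how one request is recorded under its node, its rack, and ANY ("*"), and how a removal decrements all three.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical, reduced model of the priority -> resource name -> capability table. */
final class RequestTableSketch {
    static final String ANY = "*";

    // priority -> resource name -> memory (MB, largest first) -> outstanding count
    final Map<Integer, Map<String, TreeMap<Integer, Integer>>> table = new TreeMap<>();

    /** Records one request under its node, its rack and ANY. */
    void add(int priority, String node, String rack, int memoryMb) {
        for (String name : new String[] {node, rack, ANY}) {
            adjust(priority, name, memoryMb, +1);
        }
    }

    /** Removes one request from its node, its rack and ANY. */
    void remove(int priority, String node, String rack, int memoryMb) {
        for (String name : new String[] {node, rack, ANY}) {
            adjust(priority, name, memoryMb, -1);
        }
    }

    private void adjust(int priority, String name, int memoryMb, int delta) {
        TreeMap<Integer, Integer> byCapability = table
                .computeIfAbsent(priority, p -> new HashMap<>())
                .computeIfAbsent(name,
                        n -> new TreeMap<Integer, Integer>(Comparator.<Integer>reverseOrder()));
        // Never let a count drop below zero.
        int updated = Math.max(0, byCapability.getOrDefault(memoryMb, 0) + delta);
        byCapability.put(memoryMb, updated);
    }

    /** Outstanding containers for one (priority, resource name, capability) cell. */
    int count(int priority, String name, int memoryMb) {
        Map<String, TreeMap<Integer, Integer>> byName = table.get(priority);
        if (byName == null) {
            return 0;
        }
        TreeMap<Integer, Integer> byCapability = byName.get(name);
        return byCapability == null ? 0 : byCapability.getOrDefault(memoryMb, 0);
    }
}
```

Adding two 2048 MB requests for host1 on /rack1 leaves a count of 2 under "host1", "/rack1", and "*". Removing one of them brings all three cells down to 1.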

getMatchingRequests finds, among the requests submitted so far, those that match the query. All of these requests were added through addContainerRequest.

@Override
public synchronized List<? extends Collection<T>> getMatchingRequests(
    Priority priority,
    String resourceName,
    Resource capability) {
  Preconditions.checkArgument(capability != null,
      "The Resource to be requested should not be null ");
  Preconditions.checkArgument(priority != null,
      "The priority at which to request containers should not be null ");
  List<LinkedHashSet<T>> list = new LinkedList<LinkedHashSet<T>>();
  Map<String, TreeMap<Resource, ResourceRequestInfo>> remoteRequests =
      this.remoteRequestsTable.get(priority);
  if (remoteRequests == null) {
    return list;
  }
  TreeMap<Resource, ResourceRequestInfo> reqMap = remoteRequests
      .get(resourceName);
  if (reqMap == null) {
    return list;
  }

  ResourceRequestInfo resourceRequestInfo = reqMap.get(capability);
  if (resourceRequestInfo != null &&
      !resourceRequestInfo.containerRequests.isEmpty()) {
    list.add(resourceRequestInfo.containerRequests);
    return list;
  }

  // no exact match. Container may be larger than what was requested.
  // get all resources <= capability. map is reverse sorted.
  SortedMap<Resource, ResourceRequestInfo> tailMap =
      reqMap.tailMap(capability);
  for(Map.Entry<Resource, ResourceRequestInfo> entry : tailMap.entrySet()) {
    if (canFit(entry.getKey(), capability) &&
        !entry.getValue().containerRequests.isEmpty()) {
      // match found that fits in the larger resource
      list.add(entry.getValue().containerRequests);
    }
  }

  // no match found
  return list;
}
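
The reverse-sorted tailMap trick in the last part of getMatchingRequests is easy to misread, so here is a small sketch of the same idea on plain integers. `CapabilityMatchSketch` and `candidates` are hypothetical names, the capability is reduced to memory in MB, and a TreeSet stands in for the TreeMap. In a collection sorted in descending order, the tail starting at x holds everything that sorts after x, which is every value less than or equal to x.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch of "exact match first, then anything that fits". */
final class CapabilityMatchSketch {
    /**
     * Returns the requested sizes (MB) that a container of {@code allocatedMb}
     * could satisfy: the exact size if it was requested, otherwise every
     * smaller requested size, largest first.
     */
    static List<Integer> candidates(Collection<Integer> requestedMb, int allocatedMb) {
        TreeSet<Integer> reverseSorted = new TreeSet<>(Comparator.<Integer>reverseOrder());
        reverseSorted.addAll(requestedMb);

        if (reverseSorted.contains(allocatedMb)) {
            return Collections.singletonList(allocatedMb); // exact match wins
        }
        // Descending order: tailSet(x) is the set of values <= x.
        return new ArrayList<>(reverseSorted.tailSet(allocatedMb));
    }
}
```

With requested sizes of 1024, 2048, and 4096 MB, a 3072 MB container yields the candidates 2048 and 1024, in that order, while a 512 MB container yields none.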

My knowledge is limited, so there are bound to be mistakes. If you spot one, please leave a comment. Thanks.
