The parameter discussed in this article is: mapreduce.job.reduce.slowstart.completedmaps
About this parameter
In Hadoop 3.1.1, mapred-default.xml describes this parameter as follows:
mapreduce.job.reduce.slowstart.completedmaps | 0.05 (default value) | Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job. (description) |
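The default value means that once 5% of the map tasks have finished (a fraction of 0.05, not 0.05%), the job may start requesting resources for reduce tasks.
If the value is set too low, reduce tasks may compete with map tasks for resources (if reduces hold too many resources and maps wait too long, the maps will preempt reduces), and the reduces end up idling while they wait for map output.
If the value is set too high, the job loses the benefit of running reduces in parallel with maps, which is what shortens the total job time.
Choosing a good slowstart value therefore takes some care.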
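I found the following information in someone else's blog (I do not yet understand how the 33% ratio was derived :), and I will add the explanation once I do)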
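Source code analysis
Now let's walk through the source code.
1. Initialization. The serviceInit function in RMContainerAllocator.java shows how reduceSlowStart is initialized: if the configuration file overrides the parameter, the overridden value is used; otherwise the default value 0.05 is used.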
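The "configured value, else default" lookup can be sketched without a Hadoop dependency. This is only an illustration: the class SlowStartConfigSketch and its getFloat helper are hypothetical stand-ins written for this article, and a plain Map plays the role of the job configuration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the job configuration; NOT Hadoop's Configuration class.
class SlowStartConfigSketch {
    static final String KEY = "mapreduce.job.reduce.slowstart.completedmaps";
    static final float DEFAULT_SLOWSTART = 0.05f;

    // Returns the configured value if the key is present, otherwise the default.
    static float getFloat(Map<String, String> conf, String key, float defaultValue) {
        String raw = conf.get(key);
        return raw == null ? defaultValue : Float.parseFloat(raw.trim());
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Key absent: the default is used.
        System.out.println(getFloat(conf, KEY, DEFAULT_SLOWSTART));
        // Key overridden: the configured value wins.
        conf.put(KEY, "0.8");
        System.out.println(getFloat(conf, KEY, DEFAULT_SLOWSTART));
    }
}
```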
2. The heartbeat function in RMContainerAllocator.java (triggered on every heartbeat) checks how many map tasks have completed and decides whether to start scheduling reduce tasks.
protected synchronized void heartbeat() throws Exception {
  scheduleStats.updateAndLogIfChanged("Before Scheduling: ");
  List<Container> allocatedContainers = getResources();
  if (allocatedContainers != null && allocatedContainers.size() > 0) {
    System.out.println("allocatedContainers:");
    for (int i = 0; i < allocatedContainers.size(); i++) {
      System.out.println(allocatedContainers.get(i).toString());
    }
    scheduledRequests.assign(allocatedContainers);
  }
  int completedMaps = getJob().getCompletedMaps(); // number of completed maps
  int completedTasks = completedMaps + getJob().getCompletedReduces(); // number of completed tasks
  // if the number of completed tasks has changed || there are still pending map requests
  if ((lastCompletedTasks != completedTasks) ||
      (scheduledRequests.maps.size() > 0)) {
    lastCompletedTasks = completedTasks; // remember the new completed-task count
    recalculateReduceSchedule = true; // recompute the reduce schedule
  }
  if (recalculateReduceSchedule) {
    boolean reducerPreempted = preemptReducesIfNeeded(); // decide whether reduces must be preempted
    if (!reducerPreempted) { // no reduce was preempted
      // Only schedule new reducers if no reducer preemption happens for
      // this heartbeat
      scheduleReduces(getJob().getTotalMaps(), completedMaps,
          scheduledRequests.maps.size(), scheduledRequests.reduces.size(),
          assignedRequests.maps.size(), assignedRequests.reduces.size(),
          mapResourceRequest, reduceResourceRequest, pendingReduces.size(),
          maxReduceRampupLimit, reduceSlowStart);
    }
    recalculateReduceSchedule = false;
  }
  scheduleStats.updateAndLogIfChanged("After Scheduling: ");
}
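The condition that triggers a recalculation in heartbeat can be isolated in a small, dependency-free sketch. HeartbeatSketch and shouldRecalculate are hypothetical names used only here, and the sketch models just the check inside heartbeat; it assumes the flag is not set anywhere else.

```java
// Simplified model of the "should we recalculate the reduce schedule?" check.
// Hypothetical class for illustration; not part of Hadoop.
class HeartbeatSketch {
    private int lastCompletedTasks = 0;

    // Returns true when the completed-task count changed since the last call,
    // or when map requests are still pending.
    boolean shouldRecalculate(int completedMaps, int completedReduces, int pendingMapRequests) {
        int completedTasks = completedMaps + completedReduces;
        boolean recalculate =
            (lastCompletedTasks != completedTasks) || (pendingMapRequests > 0);
        if (recalculate) {
            lastCompletedTasks = completedTasks; // remember the new count
        }
        return recalculate;
    }

    public static void main(String[] args) {
        HeartbeatSketch h = new HeartbeatSketch();
        System.out.println(h.shouldRecalculate(0, 0, 0)); // nothing changed, nothing pending
        System.out.println(h.shouldRecalculate(1, 0, 0)); // one map finished
        System.out.println(h.shouldRecalculate(1, 0, 3)); // no progress, but maps are pending
    }
}
```

As long as any map request is pending, every heartbeat re-evaluates the reduce schedule, even if no task finished in between.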
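3. The preemptReducesIfNeeded function decides whether reduces should be preempted: if none of the maps that have requested resources has been allocated any, and they have been waiting too long, reduce preemption kicks in.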
boolean preemptReducesIfNeeded() {
  if (reduceResourceRequest.equals(Resources.none())) {
    return false; // no reduces
  }
  if (assignedRequests.maps.size() > 0) {
    // there are assigned mappers (some map tasks already hold resources and are running)
    return false;
  }
  if (scheduledRequests.maps.size() <= 0) {
    // there are no pending requests for mappers (no map task has asked the
    // ResourceManager for resources and is still waiting for them)
    return false;
  }
  // At this point:
  // we have pending mappers and all assigned resources are taken by reducers
  if (reducerUnconditionalPreemptionDelayMs >= 0) {
    // Unconditional preemption is enabled.
    // If a mapper request has been pending longer than the configured
    // threshold, preempt reduces.
    if (preemptReducersForHangingMapRequests(
        reducerUnconditionalPreemptionDelayMs)) {
      return true;
    }
  }
  // The pending mappers haven't been waiting for too long. Let us see if
  // there are enough resources for a mapper to run. This is calculated by
  // excluding scheduled reducers from headroom and comparing it against
  // resources required to run one mapper.
  // If there is room for a mapper, no reduce needs to be preempted.
  Resource scheduledReducesResource = Resources.multiply(
      reduceResourceRequest, scheduledRequests.reduces.size());
  Resource availableResourceForMap =
      Resources.subtract(getAvailableResources(), scheduledReducesResource);
  if (ResourceCalculatorUtils.computeAvailableContainers(availableResourceForMap,
      mapResourceRequest, getSchedulerResourceTypes()) > 0) {
    // Enough room to run a mapper
    return false;
  }
  // Available resources are not enough to run mapper. See if we should hold
  // off before preempting reducers and preempt if okay.
  return preemptReducersForHangingMapRequests(reducerNoHeadroomPreemptionDelayMs);
}
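The decision order above can be condensed into a dependency-free sketch. Everything here is a simplification made for illustration: resources are reduced to a single memory figure in MB, PreemptSketch is a hypothetical class, and since preemptReducersForHangingMapRequests is not shown in this article, the sketch assumes it returns true when the longest-waiting map request has waited longer than the given delay.

```java
// Simplified model of the preemption decision; hypothetical, not Hadoop code.
class PreemptSketch {
    static boolean shouldPreemptReduces(
            int reduceRequestMb, int assignedMaps, int scheduledMaps, int scheduledReduces,
            int headroomMb, int mapRequestMb,
            long longestMapWaitMs, long unconditionalDelayMs, long noHeadroomDelayMs) {
        if (reduceRequestMb == 0) return false; // the job has no reduces
        if (assignedMaps > 0) return false;     // some maps already hold containers
        if (scheduledMaps <= 0) return false;   // no map is waiting for a container
        // Unconditional preemption (assumed helper behaviour: waited longer than the delay).
        if (unconditionalDelayMs >= 0 && longestMapWaitMs > unconditionalDelayMs) {
            return true;
        }
        // Headroom left after setting aside what the scheduled reduces will take.
        int availableForMapMb = headroomMb - reduceRequestMb * scheduledReduces;
        if (availableForMapMb / mapRequestMb > 0) {
            return false; // enough room to run at least one mapper
        }
        // No room for a mapper: preempt once the maps have waited long enough.
        return longestMapWaitMs > noHeadroomDelayMs;
    }

    public static void main(String[] args) {
        // Headroom of 4096 MB fits a 1024 MB mapper, so no preemption.
        System.out.println(shouldPreemptReduces(1024, 0, 2, 0, 4096, 1024, 10, 1000, 0));
        // No headroom and the map has waited longer than the no-headroom delay.
        System.out.println(shouldPreemptReduces(1024, 0, 2, 0, 0, 1024, 10, 1000, 0));
    }
}
```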
4. If no reduce needs to be preempted, new reduces can be scheduled. Next, look at the scheduleReduces function, which receives reduceSlowStart as an argument.
//check for slow start
if (!getIsReduceStarted()) { // not set yet: reduce scheduling has not started
  int completedMapsForReduceSlowstart = (int) Math.ceil(reduceSlowStart *
      totalMaps); // number of map tasks that must complete before reduces are scheduled
  if (completedMaps < completedMapsForReduceSlowstart) { // threshold not reached yet
    LOG.info("Reduce slow start threshold not met. " +
        "completedMapsForReduceSlowstart " +
        completedMapsForReduceSlowstart);
    return;
  } else {
    LOG.info("Reduce slow start threshold reached. Scheduling reduces.");
    setIsReduceStarted(true); // mark reduce scheduling as started
  }
}
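To make the threshold arithmetic concrete, here is a small standalone sketch of the same ceil computation. SlowStartThresholdSketch and its method names are hypothetical, and it uses double where the excerpt above multiplies reduceSlowStart by totalMaps directly.

```java
// Worked example of the slow-start threshold; hypothetical helper, not Hadoop code.
class SlowStartThresholdSketch {
    // Number of maps that must be complete before reduces may be scheduled.
    static int requiredCompletedMaps(double reduceSlowStart, int totalMaps) {
        return (int) Math.ceil(reduceSlowStart * totalMaps);
    }

    // True once enough maps have completed to pass the slow-start check.
    static boolean canStartReduces(double reduceSlowStart, int totalMaps, int completedMaps) {
        return completedMaps >= requiredCompletedMaps(reduceSlowStart, totalMaps);
    }

    public static void main(String[] args) {
        System.out.println(requiredCompletedMaps(0.05, 30)); // ceil(1.5)
        System.out.println(requiredCompletedMaps(0.5, 7));   // ceil(3.5)
        System.out.println(requiredCompletedMaps(1.0, 7));   // every map must finish
        System.out.println(canStartReduces(0.05, 30, 1));    // below the threshold
    }
}
```

With 30 maps and the default 0.05, the threshold is ceil(1.5) = 2 maps. Because of the ceiling, any positive fraction requires at least one completed map, and a value of 1.0 delays reduces until every map has finished.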
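5. Uber mode. For small jobs, repeatedly creating containers is costly for the cluster, which is why uber mode exists: in uber mode all maps and reduces run serially inside the same container. If the job runs in uber mode, JobImpl sets reduceSlowStart to 1.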
if (isUber) {
  LOG.info("Uberizing job " + jobId + ": " + numMapTasks + "m+"
      + numReduceTasks + "r tasks (" + dataInputLength
      + " input bytes) will run sequentially on single node.");
  // make sure reduces are scheduled only after all map are completed
  conf.setFloat(MRJobConfig.COMPLETED_MAPS_FOR_REDUCE_SLOWSTART,
      1.0f);