Hadoop Scheduler Source Code Analysis: The Task Scheduling Phase

As introduced in the previous article, Hadoop's scheduling consists of two main parts:
1) the initialization part, and
2) the task scheduling part, which is the subject of this article.
  • Task scheduling phase

The task scheduling phase is driven mainly by the heartbeat mechanism. It breaks down into the following steps:

  1. The TaskTracker sends a heartbeat to the JobTracker via an RPC call. The heartbeat typically carries resource information about the TaskTracker node, such as CPU usage, disk usage, and so on (a hedged sketch of this RPC appears right after this list).
  2. Upon receiving the heartbeat, the JobTracker first checks whether there are setup or cleanup tasks to hand out (when a job is initialized, it creates two setup tasks and two cleanup tasks, one of each for the MapTask side and one of each for the ReduceTask side), via: List<Task> tasks = getSetupAndCleanupTasks(taskTrackerStatus);
  3. The JobTracker calls the TaskScheduler's assignTasks method; by default Hadoop uses JobQueueTaskScheduler. This part is especially important, so we will walk through the code below.
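Before diving into assignTasks, here is the shape of the heartbeat RPC from step 1, as it appears in the 0.20/0.21 code line (quoted from memory, so treat the exact signature as approximate):

// Sketch of the JobTracker-side heartbeat RPC; details approximate.
interface InterTrackerProtocol {
  HeartbeatResponse heartbeat(TaskTrackerStatus status,  // slots, running tasks, node resource info
                              boolean restarted,         // did this TaskTracker just restart?
                              boolean initialContact,    // is this the first contact with the JobTracker?
                              boolean acceptNewTasks,    // can the tracker take more tasks?
                              short responseId)          // sequence number to detect lost responses
      throws IOException;
}

With that call shape in mind, here is JobQueueTaskScheduler.assignTasks: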

public synchronized List<Task> assignTasks(TaskTracker taskTracker)
    throws IOException {
  TaskTrackerStatus taskTrackerStatus = taskTracker.getStatus();
  ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus();
  final int numTaskTrackers = clusterStatus.getTaskTrackers();         // number of TaskTrackers in the cluster
  final int clusterMapCapacity = clusterStatus.getMaxMapTasks();       // maximum number of MapTasks the cluster can run
  final int clusterReduceCapacity = clusterStatus.getMaxReduceTasks(); // maximum number of ReduceTasks the cluster can run

  Collection<JobInProgress> jobQueue =
    jobQueueJobInProgressListener.getJobQueue(); // the JobInProgress of each submitted job; jobs and JobInProgress are one-to-one

  //
  // Get map + reduce counts for the current tracker.
  //
  final int trackerMapCapacity = taskTrackerStatus.getMaxMapSlots();       // this TaskTracker's map slot capacity
  final int trackerReduceCapacity = taskTrackerStatus.getMaxReduceSlots(); // this TaskTracker's reduce slot capacity
  final int trackerRunningMaps = taskTrackerStatus.countMapTasks();        // MapTasks currently running on this TaskTracker
  final int trackerRunningReduces = taskTrackerStatus.countReduceTasks();  // ReduceTasks currently running on this TaskTracker

  // Assigned tasks
  List<Task> assignedTasks = new ArrayList<Task>(); // collects the tasks assigned in this heartbeat

  //
  // Compute (running + pending) map and reduce task numbers across pool
  //
  int remainingReduceLoad = 0;
  int remainingMapLoad = 0;
  synchronized (jobQueue) {
    for (JobInProgress job : jobQueue) {
      if (job.getStatus().getRunState() == JobStatus.RUNNING) {
        remainingMapLoad += (job.desiredMaps() - job.finishedMaps());
        if (job.scheduleReduces()) {
          remainingReduceLoad +=
            (job.desiredReduces() - job.finishedReduces());
        }
      }
    }
  }
  // The loop above counts how many MapTasks and ReduceTasks the currently
  // running jobs still need. (Every submitted job gets a JobInProgress,
  // but that does not mean it is actually running.)

  // Compute the 'load factor' for maps and reduces
  double mapLoadFactor = 0.0;
  if (clusterMapCapacity > 0) {
    mapLoadFactor = (double)remainingMapLoad / clusterMapCapacity;
  }
  double reduceLoadFactor = 0.0;
  if (clusterReduceCapacity > 0) {
    reduceLoadFactor = (double)remainingReduceLoad / clusterReduceCapacity;
  }
  // This block shows nicely that the cluster has parameters for keeping job
  // load in check: mapLoadFactor and reduceLoadFactor. They express whether
  // the cluster is currently too busy or too idle, and could later serve as
  // an important input for energy efficiency, e.g. dynamically adjusting the
  // number of TaskTrackers; note that Hadoop 0.21.0 does not implement this.

  //
  // In the below steps, we allocate first map tasks (if appropriate),
  // and then reduce tasks if appropriate. We go through all jobs
  // in order of job arrival; jobs only get serviced if their
  // predecessors are serviced, too.
  //

  //
  // We assign tasks to the current taskTracker if the given machine
  // has a workload that's less than the maximum load of that kind of
  // task.
  // However, if the cluster is close to getting loaded i.e. we don't
  // have enough _padding_ for speculative executions etc., we only
  // schedule the "highest priority" task i.e. the task from the job
  // with the highest priority.
  //

  final int trackerCurrentMapCapacity =
    Math.min((int)Math.ceil(mapLoadFactor * trackerMapCapacity),
             trackerMapCapacity);
  // mapLoadFactor is used here so that every TaskTracker gets roughly the
  // same effective capacity and supply stays balanced with demand. For
  // example, if mapLoadFactor * trackerMapCapacity = 3 while
  // trackerMapCapacity = 4, the tracker's current map capacity becomes 3;
  // the workload can still finish, and no single TaskTracker ends up with
  // task skew.

  int availableMapSlots = trackerCurrentMapCapacity - trackerRunningMaps; // map slots this TaskTracker can still take

  boolean exceededMapPadding = false;
  if (availableMapSlots > 0) {
    exceededMapPadding =
      exceededPadding(true, clusterStatus, trackerMapCapacity);
  }

  int numLocalMaps = 0;
  int numNonLocalMaps = 0;
  scheduleMaps:
  for (int i=0; i < availableMapSlots; ++i) {
    synchronized (jobQueue) {
      for (JobInProgress job : jobQueue) {
        if (job.getStatus().getRunState() != JobStatus.RUNNING) {
          continue; // only serve jobs that are already running and still have map tasks to place
        }

        Task t = null;

        // Try to schedule a node-local or rack-local Map task -- this is
        // where Hadoop's data locality shows up
        t =
          job.obtainNewLocalMapTask(taskTrackerStatus, numTaskTrackers,
                                    taskTrackerManager.getNumberOfUniqueHosts());
        if (t != null) {
          assignedTasks.add(t);
          ++numLocalMaps;

          // Don't assign map tasks to the hilt!
          // Leave some free slots in the cluster for future task-failures,
          // speculative tasks etc. beyond the highest priority job
          if (exceededMapPadding) {
            break scheduleMaps;
          }

          // Try all jobs again for the next Map task
          break;
        }

        // Try to schedule a non-local Map task
        t =
          job.obtainNewNonLocalMapTask(taskTrackerStatus, numTaskTrackers,
                                       taskTrackerManager.getNumberOfUniqueHosts());
        if (t != null) {
          assignedTasks.add(t);
          ++numNonLocalMaps;

          // We assign at most 1 off-switch or speculative task
          // This is to prevent TaskTrackers from stealing local-tasks
          // from other TaskTrackers. Capping non-local tasks at one per
          // heartbeat both preserves parallelism and keeps other trackers'
          // local map tasks from being stolen.
          break scheduleMaps;
        }
      }
    }
  }
  int assignedMaps = assignedTasks.size();

  //
  // Same thing, but for reduce tasks
  // However we _never_ assign more than 1 reduce task per heartbeat
  //
  final int trackerCurrentReduceCapacity =
    Math.min((int)Math.ceil(reduceLoadFactor * trackerReduceCapacity),
             trackerReduceCapacity);
  final int availableReduceSlots =
    Math.min((trackerCurrentReduceCapacity - trackerRunningReduces), 1);
  boolean exceededReducePadding = false;
  if (availableReduceSlots > 0) {
    exceededReducePadding = exceededPadding(false, clusterStatus,
                                            trackerReduceCapacity);
    synchronized (jobQueue) {
      for (JobInProgress job : jobQueue) {
        if (job.getStatus().getRunState() != JobStatus.RUNNING ||
            job.numReduceTasks == 0) {
          continue;
        }

        Task t =
          job.obtainNewReduceTask(taskTrackerStatus, numTaskTrackers,
                                  taskTrackerManager.getNumberOfUniqueHosts());
        if (t != null) {
          assignedTasks.add(t);
          break;
        }

        // Don't assign reduce tasks to the hilt!
        // Leave some free slots in the cluster for future task-failures,
        // speculative tasks etc. beyond the highest priority job
        if (exceededReducePadding) {
          break;
        }
      }
    }
  } // at most one reduce task is assigned per heartbeat

  if (LOG.isDebugEnabled()) {
    LOG.debug("Task assignments for " + taskTrackerStatus.getTrackerName() + " --> " +
              "[" + mapLoadFactor + ", " + trackerMapCapacity + ", " +
              trackerCurrentMapCapacity + ", " + trackerRunningMaps + "] -> [" +
              (trackerCurrentMapCapacity - trackerRunningMaps) + ", " +
              assignedMaps + " (" + numLocalMaps + ", " + numNonLocalMaps +
              ")] [" + reduceLoadFactor + ", " + trackerReduceCapacity + ", " +
              trackerCurrentReduceCapacity + "," + trackerRunningReduces +
              "] -> [" + (trackerCurrentReduceCapacity - trackerRunningReduces) +
              ", " + (assignedTasks.size()-assignedMaps) + "]");
  }

  return assignedTasks;
}
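To put numbers on the capacity throttling above, here is a small self-contained example of the per-tracker computation; all values are hypothetical and chosen to reproduce the 3-of-4 case described in the comment above.

// Standalone illustration of the load-factor capacity computation.
public class LoadFactorDemo {
  public static void main(String[] args) {
    int clusterMapCapacity = 40;  // total map slots in the cluster
    int remainingMapLoad   = 30;  // map tasks still needed by running jobs
    int trackerMapCapacity = 4;   // map slots on this TaskTracker
    int trackerRunningMaps = 1;   // map tasks already running here

    double mapLoadFactor = (double) remainingMapLoad / clusterMapCapacity; // 0.75
    // Scale the tracker's capacity by the cluster-wide load, but never
    // above its physical slot count: ceil(0.75 * 4) = 3.
    int trackerCurrentMapCapacity =
        Math.min((int) Math.ceil(mapLoadFactor * trackerMapCapacity),
                 trackerMapCapacity);                                       // 3
    int availableMapSlots = trackerCurrentMapCapacity - trackerRunningMaps; // 2

    System.out.println("loadFactor=" + mapLoadFactor
        + " currentCapacity=" + trackerCurrentMapCapacity
        + " availableSlots=" + availableMapSlots);
  }
}

With the cluster at 75% remaining load, a 4-slot tracker is throttled to 3 effective slots, and with one map already running it will accept at most 2 new map tasks this heartbeat.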

4. After the heartbeat is sent to the JobTracker, the returned response carries the assigned tasks as LaunchTaskAction objects. The TaskTracker enqueues each one via addToTaskQueue: a map task goes into mapLauncher and a reduce task into reduceLauncher (both of type TaskLauncher), as shown in the sketch and the routing method below.
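For context, the way the TaskTracker consumes the heartbeat response looks roughly like this in the 0.20/0.21 code line; it is reproduced from memory and wrapped in a hypothetical helper for readability, so treat the details as approximate.

// Approximate sketch of the response handling in the TaskTracker's main loop.
void processHeartbeatResponse(HeartbeatResponse heartbeatResponse) throws IOException {
  TaskTrackerAction[] actions = heartbeatResponse.getActions();
  if (actions != null) {
    for (TaskTrackerAction action : actions) {
      if (action instanceof LaunchTaskAction) {
        // a newly assigned map or reduce task: queue it for a TaskLauncher
        addToTaskQueue((LaunchTaskAction) action);
      } else {
        // commit-task and kill-task/kill-job actions take other paths (elided)
      }
    }
  }
}

addToTaskQueue itself just routes on the task type: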

private void addToTaskQueue(LaunchTaskAction action) {
  if (action.getTask().isMapTask()) {
    mapLauncher.addToTaskQueue(action);
  } else {
    reduceLauncher.addToTaskQueue(action);
  }
}

TaskLauncher is a thread: its run method takes a TaskInProgress off the queue populated above and calls startNewTask(TaskInProgress tip) to launch the task. A minimal sketch of this consumer pattern follows.
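The sketch below is self-contained and simplified: the real TaskLauncher also waits for free slots of its task type before launching, and the names here are stand-ins, not the verbatim TaskTracker source.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A consumer thread that pulls queued launch requests and starts them one by one.
class LauncherSketch extends Thread {
  private final BlockingQueue<Runnable> tasksToLaunch =
      new LinkedBlockingQueue<Runnable>();

  void addToTaskQueue(Runnable launchRequest) {
    tasksToLaunch.add(launchRequest);         // called from the heartbeat thread
  }

  @Override
  public void run() {
    try {
      while (true) {
        Runnable tip = tasksToLaunch.take();  // block until a task is queued
        tip.run();                            // stands in for startNewTask(tip)
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();     // shut down cleanly
    }
  }
}

In the real TaskTracker, startNewTask in turn mainly calls localizeJob(TaskInProgress tip):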

private void localizeJob(TaskInProgress tip) throws IOException {
  // The first step is to copy the task's files from HDFS to the
  // TaskTracker's local file system: job.split, job.xml, and job.jar
  Path localJarFile = null;
  Task t = tip.getTask();
  JobID jobId = t.getJobID();
  Path jobFile = new Path(t.getJobFile());
  ……
  Path localJobFile = lDirAlloc.getLocalPathForWrite(
                                  getLocalJobDir(jobId.toString())
                                  + Path.SEPARATOR + "job.xml",
                                  jobFileSize, fConf);
  RunningJob rjob = addTaskToJob(jobId, tip);
  synchronized (rjob) {
    if (!rjob.localized) {
      FileSystem localFs = FileSystem.getLocal(fConf);
      Path jobDir = localJobFile.getParent();
      ……
      // copy the job file from HDFS to the local file system
      systemFS.copyToLocalFile(jobFile, localJobFile);
      JobConf localJobConf = new JobConf(localJobFile);
      Path workDir = lDirAlloc.getLocalPathForWrite(
                       (getLocalJobDir(jobId.toString())
                       + Path.SEPARATOR + "work"), fConf);
      if (!localFs.mkdirs(workDir)) {
        throw new IOException("Mkdirs failed to create "
                    + workDir.toString());
      }
      System.setProperty("job.local.dir", workDir.toString());
      localJobConf.set("job.local.dir", workDir.toString());

      // copy Jar file to the local FS and unjar it
      String jarFile = localJobConf.getJar();
      long jarFileSize = -1; // in the full source this is filled in from the jar's FileStatus (elided here)
      if (jarFile != null) {
        Path jarFilePath = new Path(jarFile);
        localJarFile = new Path(lDirAlloc.getLocalPathForWrite(
                                   getLocalJobDir(jobId.toString())
                                   + Path.SEPARATOR + "jars",
                                   5 * jarFileSize, fConf), "job.jar");
        if (!localFs.mkdirs(localJarFile.getParent())) {
          throw new IOException("Mkdirs failed to create jars directory ");
        }
        // copy job.jar to the local file system
        systemFS.copyToLocalFile(jarFilePath, localJarFile);
        localJobConf.setJar(localJarFile.toString());
        // write the job configuration back out as job.xml
        OutputStream out = localFs.create(localJobFile);
        try {
          localJobConf.writeXml(out);
        } finally {
          out.close();
        }
        // unpack job.jar
        RunJar.unJar(new File(localJarFile.toString()),
                     new File(localJarFile.getParent().toString()));
      }
      rjob.localized = true;
      rjob.jobConf = localJobConf;
    }
  }
  // actually launch the task
  launchTaskForJob(tip, new JobConf(rjob.jobConf));
}
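For orientation, once localizeJob completes, the layout under one of the tracker's mapred.local.dir directories ends up roughly as follows in the 0.20/0.21 code line (reconstructed from memory, so treat the exact names as approximate):

${mapred.local.dir}/taskTracker/jobcache/<job-id>/
    job.xml    -- the localized job configuration
    jars/      -- job.jar copied here and unpacked by RunJar.unJar
    work/      -- the per-job work directory, exposed as job.local.dir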

Once all the resources the task needs have been copied to the local file system, launchTaskForJob is called, which in turn calls TaskInProgress's launchTask method:

public synchronized void launchTask() throws IOException {
  ……
  // create the task's working directory
  localizeTask(task);
  if (this.taskStatus.getRunState() == TaskStatus.State.UNASSIGNED) {
    this.taskStatus.setRunState(TaskStatus.State.RUNNING);
  }
  // create and start a TaskRunner: a MapTaskRunner for a MapTask,
  // a ReduceTaskRunner for a ReduceTask
  this.runner = task.createRunner(TaskTracker.this, this);
  this.runner.start();
  this.taskStatus.setStartTime(System.currentTimeMillis());
}
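createRunner is the polymorphic hook that picks the right runner. Here is a minimal self-contained sketch of that dispatch; the types are deliberately trimmed down (the real signatures take TaskTracker and TaskInProgress arguments, and the runners extend TaskRunner rather than Thread directly).

// Simplified stand-ins: Thread plays the role of TaskRunner.
abstract class TaskSketch {
  abstract Thread createRunner();            // the polymorphic factory hook
}

class MapTaskSketch extends TaskSketch {
  @Override
  Thread createRunner() {
    return new Thread() {                    // stands in for MapTaskRunner
      public void run() { /* spawn a Child JVM and drive the map side */ }
    };
  }
}

class ReduceTaskSketch extends TaskSketch {
  @Override
  Thread createRunner() {
    return new Thread() {                    // stands in for ReduceTaskRunner
      public void run() { /* spawn a Child JVM and drive the reduce side */ }
    };
  }
}

Because launchTask only sees the abstract Task, starting a map task and starting a reduce task share the same code path; the subclass decides which runner actually executes.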

5. The actual map tasks and reduce tasks run inside the Child process. That code lives mainly in MapTask and ReduceTask, which many readers have probably already studied; since it no longer belongs to scheduling, I won't go into it here.
