How Hadoop achieves map task locality

涛侠

Article link: https://blog.csdn.net/dboy1/article/details/6256765

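Hadoop inherits a key property of Google's MapReduce: it tries to run as many map tasks as possible on the nodes that hold their input data. This post walks through how it does that.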

Hadoop ships with several TaskScheduler implementations; this post uses the default one, JobQueueTaskScheduler, as the example. Here is the source of assignTasks:

//@Override
public synchronized List<Task> assignTasks(TaskTrackerStatus taskTracker)
    throws IOException {

  ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus();
  final int numTaskTrackers = clusterStatus.getTaskTrackers();
  final int clusterMapCapacity = clusterStatus.getMaxMapTasks();
  final int clusterReduceCapacity = clusterStatus.getMaxReduceTasks();

  Collection<JobInProgress> jobQueue =
    jobQueueJobInProgressListener.getJobQueue();

  //
  // Get map + reduce counts for the current tracker.
  //
  final int trackerMapCapacity = taskTracker.getMaxMapTasks();
  final int trackerReduceCapacity = taskTracker.getMaxReduceTasks();
  final int trackerRunningMaps = taskTracker.countMapTasks();
  final int trackerRunningReduces = taskTracker.countReduceTasks();

  // Assigned tasks
  List<Task> assignedTasks = new ArrayList<Task>();

  //
  // Compute (running + pending) map and reduce task numbers across pool
  //
  int remainingReduceLoad = 0;
  int remainingMapLoad = 0;
  synchronized (jobQueue) {
    for (JobInProgress job : jobQueue) {
      if (job.getStatus().getRunState() == JobStatus.RUNNING) {
        remainingMapLoad += (job.desiredMaps() - job.finishedMaps());
        if (job.scheduleReduces()) {
          remainingReduceLoad +=
            (job.desiredReduces() - job.finishedReduces());
        }
      }
    }
  }

  // Compute the 'load factor' for maps and reduces
  double mapLoadFactor = 0.0;
  if (clusterMapCapacity > 0) {
    mapLoadFactor = (double)remainingMapLoad / clusterMapCapacity;
  }
  double reduceLoadFactor = 0.0;
  if (clusterReduceCapacity > 0) {
    reduceLoadFactor = (double)remainingReduceLoad / clusterReduceCapacity;
  }

  //
  // In the below steps, we allocate first map tasks (if appropriate),
  // and then reduce tasks if appropriate.  We go through all jobs
  // in order of job arrival; jobs only get serviced if their
  // predecessors are serviced, too.
  //

  //
  // We assign tasks to the current taskTracker if the given machine
  // has a workload that's less than the maximum load of that kind of
  // task.
  // However, if the cluster is close to getting loaded i.e. we don't
  // have enough _padding_ for speculative executions etc., we only
  // schedule the "highest priority" task i.e. the task from the job
  // with the highest priority.
  //

  final int trackerCurrentMapCapacity =
    Math.min((int)Math.ceil(mapLoadFactor * trackerMapCapacity),
             trackerMapCapacity);
  // This calculation keeps trackerCurrentMapCapacity <= trackerMapCapacity
  int availableMapSlots = trackerCurrentMapCapacity - trackerRunningMaps;
  boolean exceededMapPadding = false;
  if (availableMapSlots > 0) {
    exceededMapPadding =
      exceededPadding(true, clusterStatus, trackerMapCapacity);
  }

  int numLocalMaps = 0;
  int numNonLocalMaps = 0;
  scheduleMaps:
  for (int i=0; i < availableMapSlots; ++i) {
    synchronized (jobQueue) {
      // This is where FIFO shows up: jobs that arrived first are served first
      for (JobInProgress job : jobQueue) {
        if (job.getStatus().getRunState() != JobStatus.RUNNING) {
          continue;
        }

        Task t = null;

        // Try to schedule a node-local or rack-local Map task
        t =
          job.obtainNewLocalMapTask(taskTracker, numTaskTrackers,
                                    taskTrackerManager.getNumberOfUniqueHosts());
        if (t != null) {
          assignedTasks.add(t);
          ++numLocalMaps;

          // Don't assign map tasks to the hilt!
          // Leave some free slots in the cluster for future task-failures,
          // speculative tasks etc. beyond the highest priority job
          if (exceededMapPadding) {
            break scheduleMaps;
          }

          // Try all jobs again for the next Map task
          break;
        }

        // Try to schedule an off-switch or speculative Map task
        t =
          job.obtainNewNonLocalMapTask(taskTracker, numTaskTrackers,
                                       taskTrackerManager.getNumberOfUniqueHosts());
        if (t != null) {
          assignedTasks.add(t);
          ++numNonLocalMaps;

          // We assign at most 1 off-switch or speculative task
          // This is to prevent TaskTrackers from stealing local-tasks
          // from other TaskTrackers.
          break scheduleMaps;
        }
      }
    }
  }
  int assignedMaps = assignedTasks.size();

  //
  // Same thing, but for reduce tasks
  // However we _never_ assign more than 1 reduce task per heartbeat
  // (at most one reduce task is handed out per heartbeat)
  //
  final int trackerCurrentReduceCapacity =
    Math.min((int)Math.ceil(reduceLoadFactor * trackerReduceCapacity),
             trackerReduceCapacity);
  // This keeps the tracker's reduce usage in line with the cluster-wide
  // reduce load ratio, and Math.min(..., 1) caps the result at one slot
  final int availableReduceSlots =
    Math.min((trackerCurrentReduceCapacity - trackerRunningReduces), 1);
  boolean exceededReducePadding = false;
  if (availableReduceSlots > 0) {
    exceededReducePadding = exceededPadding(false, clusterStatus,
                                            trackerReduceCapacity);
    synchronized (jobQueue) {
      for (JobInProgress job : jobQueue) {
        if (job.getStatus().getRunState() != JobStatus.RUNNING ||
            job.numReduceTasks == 0) {
          continue;
        }

        Task t =
          job.obtainNewReduceTask(taskTracker, numTaskTrackers,
                                  taskTrackerManager.getNumberOfUniqueHosts()
                                  );
        if (t != null) {
          assignedTasks.add(t);
          // Stop right after the first reduce task is assigned
          break;
        }

        // Don't assign reduce tasks to the hilt!
        // Leave some free slots in the cluster for future task-failures,
        // speculative tasks etc. beyond the highest priority job
        if (exceededReducePadding) {
          break;
        }
      }
    }
  }

  if (LOG.isDebugEnabled()) {
    LOG.debug("Task assignments for " + taskTracker.getTrackerName() + " --> " +
              "[" + mapLoadFactor + ", " + trackerMapCapacity + ", " +
              trackerCurrentMapCapacity + ", " + trackerRunningMaps + "] -> [" +
              (trackerCurrentMapCapacity - trackerRunningMaps) + ", " +
              assignedMaps + " (" + numLocalMaps + ", " + numNonLocalMaps +
              ")] [" + reduceLoadFactor + ", " + trackerReduceCapacity + ", " +
              trackerCurrentReduceCapacity + "," + trackerRunningReduces +
              "] -> [" + (trackerCurrentReduceCapacity - trackerRunningReduces) +
              ", " + (assignedTasks.size()-assignedMaps) + "]");
  }

  return assignedTasks;
}

I won't analyze how JobQueueTaskScheduler works in full here; the point is how it achieves locality.

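The method first sets up some cluster-related constants and then fetches the jobQueue. jobQueueJobInProgressListener is the JobInProgressListener that the scheduler registers with the JobTracker. Next it counts the map and reduce tasks that still have to be run and stores them in int remainingMapLoad and int remainingReduceLoad. From these it computes mapLoadFactor and reduceLoadFactor, and it checks (via exceededPadding) whether the cluster still has spare slots held back for failures and speculative tasks.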
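To make the slot arithmetic concrete, here is a minimal sketch that repeats the load-factor calculation from assignTasks in isolation. The class name MapSlotSketch and the method availableMapSlots are hypothetical names chosen for this illustration; only the formula itself is taken from the source listing above.

```java
// Illustrative sketch only: the class and method names are hypothetical,
// the arithmetic mirrors the map-slot calculation in assignTasks.
class MapSlotSketch {

    /** Number of map slots this tracker may still fill in one heartbeat. */
    static int availableMapSlots(int remainingMapLoad, int clusterMapCapacity,
                                 int trackerMapCapacity, int trackerRunningMaps) {
        // Fraction of the whole cluster's map capacity that is still needed
        double mapLoadFactor = 0.0;
        if (clusterMapCapacity > 0) {
            mapLoadFactor = (double) remainingMapLoad / clusterMapCapacity;
        }
        // Scale the tracker's capacity by that fraction, never exceeding it
        int trackerCurrentMapCapacity = Math.min(
            (int) Math.ceil(mapLoadFactor * trackerMapCapacity),
            trackerMapCapacity);
        return trackerCurrentMapCapacity - trackerRunningMaps;
    }

    public static void main(String[] args) {
        // 10 maps left on a 40-slot cluster, idle 4-slot tracker
        System.out.println(availableMapSlots(10, 40, 4, 0));  // prints 1
        // Overloaded cluster: the tracker may be filled up to its own capacity
        System.out.println(availableMapSlots(100, 40, 4, 1)); // prints 3
        // Tracker already runs more than its share: no slots (negative)
        System.out.println(availableMapSlots(10, 40, 4, 2));  // prints -1
    }
}
```

The effect is that a lightly loaded cluster spreads its few remaining maps across trackers instead of letting the first tracker that sends a heartbeat take them all.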

...

The code that actually assigns tasks starts right after

int numLocalMaps = 0;
int numNonLocalMaps = 0;

Let's take a look:

if (job.getStatus().getRunState() != JobStatus.RUNNING) { continue; }

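Then we reach this:

// Try to schedule a node-local or rack-local Map task
t = job.obtainNewLocalMapTask(taskTracker, numTaskTrackers,
                              taskTrackerManager.getNumberOfUniqueHosts());

The comment tells us that this call tries to assign a node-local or nearby (rack-local) map task, which gives us our way in: the obtainNewLocalMapTask method. Before that, a quick look at getNumberOfUniqueHosts. It is very simple: it returns the size of the JobTracker's set of unique hosts, so we can ignore it for now and step into obtainNewLocalMapTask to see how it is implemented.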

public synchronized Task obtainNewLocalMapTask(TaskTrackerStatus tts,
                                               int clusterSize,
                                               int numUniqueHosts)
    throws IOException {
  if (!tasksInited.get()) {
    LOG.info("Cannot create task split for " + profile.getJobID());
    return null;
  }

  // maxLevel = 2, i.e. NetworkTopology.DEFAULT_HOST_LEVEL
  int target = findNewMapTask(tts, clusterSize, numUniqueHosts, maxLevel,
                              status.mapProgress());
  if (target == -1) {
    return null;
  }

  Task result = maps[target].getTaskToRun(tts.getTrackerName());
  if (result != null) {
    addRunningTaskToTIP(maps[target], result.getTaskID(), tts, true);
  }

  return result;
}

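The most important statement is this one:

int target = findNewMapTask(tts, clusterSize, numUniqueHosts, maxLevel,
                            status.mapProgress());

It is what finds the target, so we need to look at its implementation in depth. The method is very long because it is the core of map task selection, and it deserves a patient read.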

private synchronized int findNewMapTask(final TaskTrackerStatus tts,
                                        final int clusterSize,
                                        final int numUniqueHosts,
                                        final int maxCacheLevel,
                                        final double avgProgress) {
  if (numMapTasks == 0) {
    LOG.info("No maps to schedule for " + profile.getJobID());
    return -1;
  }

  String taskTracker = tts.getTrackerName();
  TaskInProgress tip = null;

  //
  // Update the last-known clusterSize
  //
  this.clusterSize = clusterSize;

  if (!shouldRunOnTaskTracker(taskTracker)) {
    return -1;
  }

  // Check to ensure this TaskTracker has enough resources to
  // run tasks from this job
  long outSize = resourceEstimator.getEstimatedMapOutputSize();
  long availSpace = tts.getResourceStatus().getAvailableSpace();
  if (availSpace < outSize) {
    LOG.warn("No room for map task. Node " + tts.getHost() +
             " has " + availSpace +
             " bytes free; but we expect map to take " + outSize);

    return -1; //see if a different TIP might work better.
  }

  // For scheduling a map task, we have two caches and a list (optional)
  //  I)   one for non-running task
  //  II)  one for running task (this is for handling speculation)
  //  III) a list of TIPs that have empty locations (e.g., dummy splits),
  //       the list is empty if all TIPs have associated locations

  // First a look up is done on the non-running cache and on a miss, a look
  // up is done on the running cache. The order for lookup within the cache:
  //   1. from local node to root [bottom up]
  //   2. breadth wise for all the parent nodes at max level

  // We fall to linear scan of the list (III above) if we have misses in the
  // above caches

  Node node = jobtracker.getNode(tts.getHost());

  //
  // I) Non-running TIP :
  //

  // 1. check from local node to the root [bottom up cache lookup]
  //    i.e if the cache is available and the host has been resolved
  //    (node!=null)
  if (node != null) {
    Node key = node;
    int level = 0;
    // maxCacheLevel might be greater than this.maxLevel if findNewMapTask is
    // called to schedule any task (local, rack-local, off-switch or
    // speculative), or it might be NON_LOCAL_CACHE_LEVEL (i.e. -1) if
    // findNewMapTask is to only schedule off-switch/speculative tasks
    int maxLevelToSchedule = Math.min(maxCacheLevel, maxLevel);
    for (level = 0; level < maxLevelToSchedule; ++level) {
      List<TaskInProgress> cacheForLevel = nonRunningMapCache.get(key);
      if (cacheForLevel != null) {
        tip = findTaskFromList(cacheForLevel, tts,
                               numUniqueHosts, level == 0);
        if (tip != null) {
          // Add to running cache
          scheduleMap(tip);

          // remove the cache if its empty
          if (cacheForLevel.size() == 0) {
            nonRunningMapCache.remove(key);
          }

          return tip.getIdWithinJob();
        }
      }
      key = key.getParent();
    }

    // Check if we need to only schedule a local task (node-local/rack-local)
    if (level == maxCacheLevel) {
      return -1;
    }
  }

  // 2. Search breadth-wise across parents at max level for non-running
  //    TIP if
  //     - cache exists and there is a cache miss
  //     - node information for the tracker is missing (tracker's topology
  //       info not obtained yet)

  // collection of node at max level in the cache structure
  Collection<Node> nodesAtMaxLevel = jobtracker.getNodesAtMaxLevel();

  // get the node parent at max level
  Node nodeParentAtMaxLevel =
    (node == null) ? null : JobTracker.getParentNode(node, maxLevel - 1);

  for (Node parent : nodesAtMaxLevel) {

    // skip the parent that has already been scanned
    if (parent == nodeParentAtMaxLevel) {
      continue;
    }

    List<TaskInProgress> cache = nonRunningMapCache.get(parent);
    if (cache != null) {
      tip = findTaskFromList(cache, tts, numUniqueHosts, false);
      if (tip != null) {
        // Add to the running cache
        scheduleMap(tip);

        // remove the cache if empty
        if (cache.size() == 0) {
          nonRunningMapCache.remove(parent);
        }
        LOG.info("Choosing a non-local task " + tip.getTIPId());
        return tip.getIdWithinJob();
      }
    }
  }

  // 3. Search non-local tips for a new task
  tip = findTaskFromList(nonLocalMaps, tts, numUniqueHosts, false);
  if (tip != null) {
    // Add to the running list
    scheduleMap(tip);

    LOG.info("Choosing a non-local task " + tip.getTIPId());
    return tip.getIdWithinJob();
  }

  //
  // II) Running TIP :
  //

  if (hasSpeculativeMaps) {
    long currentTime = System.currentTimeMillis();

    // 1. Check bottom up for speculative tasks from the running cache
    if (node != null) {
      Node key = node;
      for (int level = 0; level < maxLevel; ++level) {
        Set<TaskInProgress> cacheForLevel = runningMapCache.get(key);
        if (cacheForLevel != null) {
          tip = findSpeculativeTask(cacheForLevel, tts,
                                    avgProgress, currentTime, level == 0);
          if (tip != null) {
            if (cacheForLevel.size() == 0) {
              runningMapCache.remove(key);
            }
            return tip.getIdWithinJob();
          }
        }
        key = key.getParent();
      }
    }

    // 2. Check breadth-wise for speculative tasks
    for (Node parent : nodesAtMaxLevel) {
      // ignore the parent which is already scanned
      if (parent == nodeParentAtMaxLevel) {
        continue;
      }

      Set<TaskInProgress> cache = runningMapCache.get(parent);
      if (cache != null) {
        tip = findSpeculativeTask(cache, tts, avgProgress,
                                  currentTime, false);
        if (tip != null) {
          // remove empty cache entries
          if (cache.size() == 0) {
            runningMapCache.remove(parent);
          }
          LOG.info("Choosing a non-local task " + tip.getTIPId()
                   + " for speculation");
          return tip.getIdWithinJob();
        }
      }
    }

    // 3. Check non-local tips for speculation
    tip = findSpeculativeTask(nonLocalRunningMaps, tts, avgProgress,
                              currentTime, false);
    if (tip != null) {
      LOG.info("Choosing a non-local task " + tip.getTIPId()
               + " for speculation");
      return tip.getIdWithinJob();
    }
  }

  return -1;
}

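findNewMapTask takes a TaskTrackerStatus parameter. What is it for?

assignTasks is invoked when a TaskTracker sends its periodic heartbeat to the JobTracker (every 10 seconds by default). With each heartbeat the TaskTracker reports its current status, and this parameter is that status as seen by the JobTracker. With that settled, let's look at what the method does.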

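It first checks whether this TaskTracker can run a task from the job at all, and run it sensibly (shouldRunOnTaskTracker, plus enough free disk space for the estimated map output); if not, it returns -1. It then searches for a task starting at the nearest level, generally in the order local -> rack-local -> off-switch -> speculative. Reading the code shows that the lookup is keyed by Node objects that were recorded when the job was submitted, so the next thing to read is the job submission code.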
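The lookup order can be sketched with a much smaller model. This is a simplified illustration, not the Hadoop implementation: the class LocalityLookupSketch, its Node type, and the use of plain task-name strings are all assumptions made for this example, and speculative execution is left out. It only shows the bottom-up walk (host, then rack) followed by the breadth-wise scan over the other racks.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model of the non-running cache lookup; all names are hypothetical.
class LocalityLookupSketch {

    /** Stand-in for a topology node: a host, or a rack when parent is null. */
    static final class Node {
        final String name;
        final Node parent;
        Node(String name, Node parent) { this.name = name; this.parent = parent; }
    }

    static final int MAX_LEVEL = 2; // level 0 = host, level 1 = rack

    final Map<Node, List<String>> nonRunningCache = new HashMap<>();
    final Set<String> scheduled = new HashSet<>();
    final List<Node> racks = new ArrayList<>();

    /** Register a pending task under its host and under that host's rack. */
    void addPending(String task, Node host) {
        Node key = host;
        for (int level = 0; level < MAX_LEVEL && key != null; level++) {
            List<String> list = nonRunningCache.computeIfAbsent(key, k -> new ArrayList<>());
            if (!list.contains(task)) {
                list.add(task);
            }
            key = key.parent;
        }
    }

    /** Take the first task from the list that has not been scheduled elsewhere. */
    private String takeFrom(List<String> pending) {
        if (pending == null) {
            return null;
        }
        Iterator<String> it = pending.iterator();
        while (it.hasNext()) {
            String task = it.next();
            it.remove();                 // drop stale or chosen entries lazily
            if (scheduled.add(task)) {   // true only the first time a task is picked
                return task;
            }
        }
        return null;
    }

    /** Bottom-up lookup; if localOnly is false, fall back to the other racks. */
    String pick(Node host, boolean localOnly) {
        Node key = host;
        for (int level = 0; level < MAX_LEVEL && key != null; level++) {
            String task = takeFrom(nonRunningCache.get(key));
            if (task != null) {
                return task;
            }
            key = key.parent;
        }
        if (localOnly) {
            return null;
        }
        for (Node rack : racks) {
            if (rack == host.parent) {
                continue; // own rack was already scanned above
            }
            String task = takeFrom(nonRunningCache.get(rack));
            if (task != null) {
                return task;
            }
        }
        return null;
    }

    /** Two racks, three hosts, one task per host; h1 asks for work four times. */
    static String demo() {
        LocalityLookupSketch s = new LocalityLookupSketch();
        Node rack1 = new Node("/rack1", null);
        Node rack2 = new Node("/rack2", null);
        Node h1 = new Node("h1", rack1);
        Node h2 = new Node("h2", rack1);
        Node h3 = new Node("h3", rack2);
        s.racks.add(rack1);
        s.racks.add(rack2);
        s.addPending("m0", h1);
        s.addPending("m1", h2);
        s.addPending("m2", h3);

        List<String> picks = new ArrayList<>();
        picks.add(String.valueOf(s.pick(h1, true)));  // node-local
        picks.add(String.valueOf(s.pick(h1, true)));  // rack-local
        picks.add(String.valueOf(s.pick(h1, true)));  // nothing local left
        picks.add(String.valueOf(s.pick(h1, false))); // off-rack
        return String.join(",", picks);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints m0,m1,null,m2
    }
}
```

In the demo, h1 first receives its own task, then the task stored on its rack neighbour, then nothing while it is restricted to local levels, and only the unrestricted call reaches the task on the other rack. This matches the split between obtainNewLocalMapTask and obtainNewNonLocalMapTask in the scheduler loop.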

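At this point the hard part of locality has moved over to the job submission side.

To keep the post short I won't paste that code and will just describe it in words.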

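When a job is submitted, a new JobInProgress is created (this part is easy to find, so I won't dwell on it). A scheduler thread then calls the JobInProgress's initTasks method, which calls readSplitFile. How the split file is generated will be covered in the next post; for now, it is simply a file that describes how the job's input is divided. initTasks then builds the RawSplit array and calls createCache to build nonRunningMapCache.

createCache first adds every split whose getLocations() result is empty to nonLocalMaps. For the other splits it creates a Node for each location by calling the JobTracker's resolveAndAddToTopology method, which in turn calls dnsToSwitchMapping.resolve. dnsToSwitchMapping is instantiated by the JobTracker through reflection during initialization, as a ScriptBasedMapping. Its resolve method first calls runResolveCommand, which runs a script whose path comes from the "topology.script.file.name" property. The script returns the rack for a given IP address and has to be written entirely by the user, which is why rack awareness is usually not in effect by default. The first field of the script's output is then passed to NodeBase.normalize. If no script is configured, the result is NetworkTopology.DEFAULT_RACK. Finally addHostToNodeMapping adds the (host, node) pair to hostnameToNodeMap.

Back in createCache, once the rack information is available, the TaskInProgress is added to the cache entry of the node and then, level by level, of its parents. That completes the initialization of the map tasks. After that, jobtracker.initJob creates the TIPs for the cleanup and setup map and reduce tasks and returns.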
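The cache construction described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Hadoop code: CreateCacheSketch is a hypothetical class, a plain Map stands in for the user-written topology script, and cache keys are path strings rather than Node objects. The value "/default-rack" is what NetworkTopology.DEFAULT_RACK holds.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of createCache; all names except DEFAULT_RACK's value are hypothetical.
class CreateCacheSketch {

    static final String DEFAULT_RACK = "/default-rack";

    // Stand-in for the topology script: host name -> rack path
    final Map<String, String> rackOfHost;
    // Topology path (host or rack) -> indexes of pending map tasks
    final Map<String, List<Integer>> nonRunningMapCache = new LinkedHashMap<>();
    // Map tasks whose split reports no locations at all
    final List<Integer> nonLocalMaps = new ArrayList<>();

    CreateCacheSketch(Map<String, String> rackOfHost) {
        this.rackOfHost = rackOfHost;
    }

    /** Hosts unknown to the "script" fall back to the default rack. */
    String resolveRack(String host) {
        return rackOfHost.getOrDefault(host, DEFAULT_RACK);
    }

    /** splitLocations[i] lists the hosts that store the input of map task i. */
    void createCache(String[][] splitLocations) {
        for (int i = 0; i < splitLocations.length; i++) {
            String[] hosts = splitLocations[i];
            if (hosts.length == 0) {
                nonLocalMaps.add(i);
                continue;
            }
            for (String host : hosts) {
                String rack = resolveRack(host);
                addOnce(rack + "/" + host, i); // level 0: the host itself
                addOnce(rack, i);              // level 1: its rack
            }
        }
    }

    private void addOnce(String key, int taskIndex) {
        List<Integer> tasks = nonRunningMapCache.computeIfAbsent(key, k -> new ArrayList<>());
        if (!tasks.contains(taskIndex)) {
            tasks.add(taskIndex);
        }
    }

    /** Three splits: one replicated on two hosts, one on an unmapped host, one without locations. */
    static CreateCacheSketch demo() {
        Map<String, String> racks = new HashMap<>();
        racks.put("h1", "/rack1");
        racks.put("h2", "/rack1");
        CreateCacheSketch sketch = new CreateCacheSketch(racks);
        sketch.createCache(new String[][] { {"h1", "h2"}, {"h3"}, {} });
        return sketch;
    }

    public static void main(String[] args) {
        CreateCacheSketch sketch = demo();
        System.out.println(sketch.nonRunningMapCache);
        System.out.println(sketch.nonLocalMaps);
    }
}
```

Because each task index is stored under both its host and its rack, the scheduler can later answer "is there work near this tracker?" with a couple of map lookups instead of scanning every split.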

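That last description may be hard to follow, so to summarize: when a job is initialized, the file and location information is already recorded, and at scheduling time the scheduler simply hands the TaskTracker the split nearest to it. If anything is still unclear, read the code; the process above is described in quite some detail.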
