Hadoop gets its best performance when it runs each map task on the node that already stores that task's input data (the data in HDFS), because such a task does not consume the cluster's most precious resource: network bandwidth.
This data locality optimization is at the heart of Hadoop's data processing and is what delivers that best performance.
So where exactly does this data locality optimization come into play? [Note: the Hadoop version discussed here is fairly old; since 2.x scheduling is handled by YARN, but this walkthrough still works as a reference.]
1. Is it in the reduce phase? No, it applies to map tasks: each input split corresponds to one map task. Moving the computation is cheaper than moving the data, so moving the computation to the data is the best solution, and that is exactly what data locality achieves (see the small sketch after this list).
2. At the time I read through some of the source code and a number of blog posts; one post in particular explains this very clearly [https://blog.csdn.net/dboy1/article/details/6256765].
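Before walking through the scheduler source quoted below, here is a minimal, self-contained sketch of the client-side fact that makes locality possible in the first place: each input split records the hosts that hold its HDFS blocks, and one map task is created per split. The path, host names and sizes below are made up for illustration; FileSplit and getLocations() are the real old-API (org.apache.hadoop.mapred) classes.
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileSplit;
public class SplitLocationDemo {
  public static void main(String[] args) throws IOException {
    // A FileSplit covers a region of a file and carries the hosts that store
    // the HDFS blocks for that region (hard-coded here for illustration).
    FileSplit split = new FileSplit(new Path("/data/input/part-00000"),
                                    0L, 128L * 1024 * 1024,
                                    new String[] {"datanode1", "datanode2", "datanode3"});
    // One map task is created per split; the scheduler prefers a TaskTracker
    // on one of these hosts (node-local), then one in the same rack (rack-local).
    for (String host : split.getLocations()) {
      System.out.println("preferred host: " + host);
    }
  }
}
With that picture in mind, here is the JobQueueTaskScheduler.assignTasks method quoted from that era of the source: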
//@Override
public synchronized List<Task> assignTasks(TaskTrackerStatus taskTracker)
throws IOException {
ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus();
final int numTaskTrackers = clusterStatus.getTaskTrackers();
final int clusterMapCapacity = clusterStatus.getMaxMapTasks();
final int clusterReduceCapacity = clusterStatus.getMaxReduceTasks();
Collection<JobInProgress> jobQueue =
jobQueueJobInProgressListener.getJobQueue();
//
// Get map + reduce counts for the current tracker.
//
final int trackerMapCapacity = taskTracker.getMaxMapTasks();
final int trackerReduceCapacity = taskTracker.getMaxReduceTasks();
final int trackerRunningMaps = taskTracker.countMapTasks();
final int trackerRunningReduces = taskTracker.countReduceTasks();
// Assigned tasks
List<Task> assignedTasks = new ArrayList<Task>();
//
// Compute (running + pending) map and reduce task numbers across pool
//
int remainingReduceLoad = 0;
int remainingMapLoad = 0;
synchronized (jobQueue) {
for (JobInProgress job : jobQueue) {
if (job.getStatus().getRunState() == JobStatus.RUNNING) {
remainingMapLoad += (job.desiredMaps() - job.finishedMaps());
if (job.scheduleReduces()) {
remainingReduceLoad +=
(job.desiredReduces() - job.finishedReduces());
}
}
}
}
// Compute the 'load factor' for maps and reduces
double mapLoadFactor = 0.0;
if (clusterMapCapacity > 0) {
mapLoadFactor = (double)remainingMapLoad / clusterMapCapacity;
}
double reduceLoadFactor = 0.0;
if (clusterReduceCapacity > 0) {
reduceLoadFactor = (double)remainingReduceLoad / clusterReduceCapacity;
}
//
// In the below steps, we allocate first map tasks (if appropriate),
// and then reduce tasks if appropriate. We go through all jobs
// in order of job arrival; jobs only get serviced if their
// predecessors are serviced, too.
//
//
// We assign tasks to the current taskTracker if the given machine
// has a workload that's less than the maximum load of that kind of
// task.
// However, if the cluster is close to getting loaded i.e. we don't
// have enough _padding_ for speculative executions etc., we only
// schedule the "highest priority" task i.e. the task from the job
// with the highest priority.
//
final int trackerCurrentMapCapacity =
Math.min((int)Math.ceil(mapLoadFactor * trackerMapCapacity),
trackerMapCapacity);
// This Math.min ensures trackerCurrentMapCapacity <= trackerMapCapacity
int availableMapSlots = trackerCurrentMapCapacity - trackerRunningMaps;
boolean exceededMapPadding = false;
if (availableMapSlots > 0) {
exceededMapPadding =
exceededPadding(true, clusterStatus, trackerMapCapacity);
}
int numLocalMaps = 0;
int numNonLocalMaps = 0;
scheduleMaps:
for (int i=0; i < availableMapSlots; ++i) {
synchronized (jobQueue) {
// This is the FIFO part: jobs are scanned in arrival order, earliest first
for (JobInProgress job : jobQueue) {
if (job.getStatus().getRunState() != JobStatus.RUNNING) {
continue;
}
Task t = null;
// Try to schedule a node-local or rack-local Map task
t =
job.obtainNewLocalMapTask(taskTracker, numTaskTrackers,
taskTrackerManager.getNumberOfUniqueHosts());
if (t != null) {
assignedTasks.add(t);
++numLocalMaps;
// Don't assign map tasks to the hilt!
// Leave some free slots in the cluster for future task-failures,
// speculative tasks etc. beyond the highest priority job
if (exceededMapPadding) {
break scheduleMaps;
}
// Try all jobs again for the next Map task
break;
}
// Try to schedule an off-switch or speculative Map task
t =
job.obtainNewNonLocalMapTask(taskTracker, numTaskTrackers,
taskTrackerManager.getNumberOfUniqueHosts());
if (t != null) {
assignedTasks.add(t);
++numNonLocalMaps;
// We assign at most 1 off-switch or speculative task
// This is to prevent TaskTrackers from stealing local-tasks
// from other TaskTrackers.
break scheduleMaps;
}
}
}
}
int assignedMaps = assignedTasks.size();
//
// Same thing, but for reduce tasks
// However we _never_ assign more than 1 reduce task per heartbeat
final int trackerCurrentReduceCapacity =
Math.min((int)Math.ceil(reduceLoadFactor * trackerReduceCapacity),
trackerReduceCapacity);
// This caps availableReduceSlots at 1 and keeps it in proportion to the cluster-wide reduce load
final int availableReduceSlots =
Math.min((trackerCurrentReduceCapacity - trackerRunningReduces), 1);
boolean exceededReducePadding = false;
if (availableReduceSlots > 0) {
exceededReducePadding = exceededPadding(false, clusterStatus,
trackerReduceCapacity);
synchronized (jobQueue) {
for (JobInProgress job : jobQueue) {
if (job.getStatus().getRunState() != JobStatus.RUNNING ||
job.numReduceTasks == 0) {
continue;
}
Task t =
job.obtainNewReduceTask(taskTracker, numTaskTrackers,
taskTrackerManager.getNumberOfUniqueHosts()
);
if (t != null) {
assignedTasks.add(t);
// only one reduce per heartbeat, so break right away
break;
}
// Don't assign reduce tasks to the hilt!
// Leave some free slots in the cluster for future task-failures,
// speculative tasks etc. beyond the highest priority job
if (exceededReducePadding) {
break;
}
}
}
}
if (LOG.isDebugEnabled()) {
LOG.debug("Task assignments for " + taskTracker.getTrackerName() + " --> " +
"[" + mapLoadFactor + ", " + trackerMapCapacity + ", " +
trackerCurrentMapCapacity + ", " + trackerRunningMaps + "] -> [" +
(trackerCurrentMapCapacity - trackerRunningMaps) + ", " +
assignedMaps + " (" + numLocalMaps + ", " + numNonLocalMaps +
")] [" + reduceLoadFactor + ", " + trackerReduceCapacity + ", " +
trackerCurrentReduceCapacity + "," + trackerRunningReduces +
"] -> [" + (trackerCurrentReduceCapacity - trackerRunningReduces) +
", " + (assignedTasks.size()-assignedMaps) + "]");
}
return assignedTasks;
}
In short, JobQueueTaskScheduler first reads some cluster-level figures and fetches the job queue; jobQueueJobInProgressListener is the JobInProgressListener this scheduler registered with the JobTracker. It then computes how many map and reduce tasks still need to run, keeping the totals in remainingMapLoad and remainingReduceLoad, and later checks (via exceededPadding) whether assigning more tasks would eat into the slots kept free as padding for speculative execution and task failures.
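To make the capacity arithmetic concrete, here is a tiny standalone sketch of the same formulas with made-up numbers (it is not Hadoop code, just the math from assignTasks): a lightly loaded cluster scales each tracker's usable capacity down so that free slots are left over as padding.
public class LoadFactorDemo {
  public static void main(String[] args) {
    int clusterMapCapacity = 100;  // total map slots in the cluster (made up)
    int remainingMapLoad   = 40;   // running + pending maps across all jobs (made up)
    int trackerMapCapacity = 8;    // map slots on this TaskTracker (made up)
    int trackerRunningMaps = 2;    // maps already running on it (made up)
    double mapLoadFactor = (double) remainingMapLoad / clusterMapCapacity;   // 0.4
    int trackerCurrentMapCapacity =
        Math.min((int) Math.ceil(mapLoadFactor * trackerMapCapacity),
                 trackerMapCapacity);                                        // min(4, 8) = 4
    int availableMapSlots = trackerCurrentMapCapacity - trackerRunningMaps;  // 4 - 2 = 2
    System.out.println("loadFactor=" + mapLoadFactor
        + " currentCapacity=" + trackerCurrentMapCapacity
        + " slotsToFillThisHeartbeat=" + availableMapSlots);
  }
}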
int numLocalMaps = 0;
int numNonLocalMaps = 0;
if (job.getStatus().getRunState() != JobStatus.RUNNING) {
continue;
}
Reading further down:
// Try to schedule a node-local or rack-local Map task
t =
job.obtainNewLocalMapTask(taskTracker, numTaskTrackers,
taskTrackerManager.getNumberOfUniqueHosts());
This comment, translated, says: try to schedule a node-local or rack-local map task.
Taking this as the entry point, let's look at obtainNewLocalMapTask. But first a quick word on getNumberOfUniqueHosts(): it is trivial and simply returns how many unique hosts the JobTracker knows about, so we can ignore it and step straight into obtainNewLocalMapTask.
public synchronized Task obtainNewLocalMapTask(TaskTrackerStatus tts,
int clusterSize,
int numUniqueHosts)
throws IOException {
if (!tasksInited.get()) {
LOG.info("Cannot create task split for " + profile.getJobID());
return null;
}
// maxLevel = 2 here, i.e. NetworkTopology.DEFAULT_HOST_LEVEL
int target = findNewMapTask(tts, clusterSize, numUniqueHosts, maxLevel,
status.mapProgress());
if (target == -1) {
return null;
}
Task result = maps[target].getTaskToRun(tts.getTrackerName());
if (result != null) {
addRunningTaskToTIP(maps[target], result.getTaskID(), tts, true);
}
return result;
}
The most important line here is:
int target = findNewMapTask(tts, clusterSize, numUniqueHosts, maxLevel,
status.mapProgress());
private synchronized int findNewMapTask(final TaskTrackerStatus tts,
final int clusterSize,
final int numUniqueHosts,
final int maxCacheLevel,
final double avgProgress) {
if (numMapTasks == 0) {
LOG.info("No maps to schedule for " + profile.getJobID());
return -1;
}
String taskTracker = tts.getTrackerName();
TaskInProgress tip = null;
//
// Update the last-known clusterSize
//
this.clusterSize = clusterSize;
if (!shouldRunOnTaskTracker(taskTracker)) {
return -1;
}
// Check to ensure this TaskTracker has enough resources to
// run tasks from this job
long outSize = resourceEstimator.getEstimatedMapOutputSize();
long availSpace = tts.getResourceStatus().getAvailableSpace();
if(availSpace < outSize) {
LOG.warn("No room for map task. Node " + tts.getHost() +
" has " + availSpace +
" bytes free; but we expect map to take " + outSize);
return -1; //see if a different TIP might work better.
}
// For scheduling a map task, we have two caches and a list (optional)
// I) one for non-running task
// II) one for running task (this is for handling speculation)
// III) a list of TIPs that have empty locations (e.g., dummy splits),
// the list is empty if all TIPs have associated locations
// First a look up is done on the non-running cache and on a miss, a look
// up is done on the running cache. The order for lookup within the cache:
// 1. from local node to root [bottom up]
// 2. breadth wise for all the parent nodes at max level
// We fall to linear scan of the list (III above) if we have misses in the
// above caches
Node node = jobtracker.getNode(tts.getHost());
//
// I) Non-running TIP :
//
// 1. check from local node to the root [bottom up cache lookup]
// i.e if the cache is available and the host has been resolved
// (node!=null)
if (node != null) {
Node key = node;
int level = 0;
// maxCacheLevel might be greater than this.maxLevel if findNewMapTask is
// called to schedule any task (local, rack-local, off-switch or speculative),
// or it might be NON_LOCAL_CACHE_LEVEL (i.e. -1) if findNewMapTask is to only
// schedule off-switch/speculative tasks
int maxLevelToSchedule = Math.min(maxCacheLevel, maxLevel);
for (level = 0;level < maxLevelToSchedule; ++level) {
List <TaskInProgress> cacheForLevel = nonRunningMapCache.get(key);
if (cacheForLevel != null) {
tip = findTaskFromList(cacheForLevel, tts,
numUniqueHosts,level == 0);
if (tip != null) {
// Add to running cache
scheduleMap(tip);
// remove the cache if its empty
if (cacheForLevel.size() == 0) {
nonRunningMapCache.remove(key);
}
return tip.getIdWithinJob();
}
}
key = key.getParent();
}
// Check if we need to only schedule a local task (node-local/rack-local)
if (level == maxCacheLevel) {
return -1;
}
}
//2. Search breadth-wise across parents at max level for non-running
// TIP if
// - cache exists and there is a cache miss
// - node information for the tracker is missing (tracker's topology
// info not obtained yet)
// collection of node at max level in the cache structure
Collection<Node> nodesAtMaxLevel = jobtracker.getNodesAtMaxLevel();
// get the node parent at max level
Node nodeParentAtMaxLevel =
(node == null) ? null : JobTracker.getParentNode(node, maxLevel - 1);
for (Node parent : nodesAtMaxLevel) {
// skip the parent that has already been scanned
if (parent == nodeParentAtMaxLevel) {
continue;
}
List<TaskInProgress> cache = nonRunningMapCache.get(parent);
if (cache != null) {
tip = findTaskFromList(cache, tts, numUniqueHosts, false);
if (tip != null) {
// Add to the running cache
scheduleMap(tip);
// remove the cache if empty
if (cache.size() == 0) {
nonRunningMapCache.remove(parent);
}
LOG.info("Choosing a non-local task " + tip.getTIPId());
return tip.getIdWithinJob();
}
}
}
// 3. Search non-local tips for a new task
tip = findTaskFromList(nonLocalMaps, tts, numUniqueHosts, false);
if (tip != null) {
// Add to the running list
scheduleMap(tip);
LOG.info("Choosing a non-local task " + tip.getTIPId());
return tip.getIdWithinJob();
}
//
// II) Running TIP :
//
if (hasSpeculativeMaps) {
long currentTime = System.currentTimeMillis();
// 1. Check bottom up for speculative tasks from the running cache
if (node != null) {
Node key = node;
for (int level = 0; level < maxLevel; ++level) {
Set<TaskInProgress> cacheForLevel = runningMapCache.get(key);
if (cacheForLevel != null) {
tip = findSpeculativeTask(cacheForLevel, tts,
avgProgress, currentTime, level == 0);
if (tip != null) {
if (cacheForLevel.size() == 0) {
runningMapCache.remove(key);
}
return tip.getIdWithinJob();
}
}
key = key.getParent();
}
}
// 2. Check breadth-wise for speculative tasks
for (Node parent : nodesAtMaxLevel) {
// ignore the parent which is already scanned
if (parent == nodeParentAtMaxLevel) {
continue;
}
Set<TaskInProgress> cache = runningMapCache.get(parent);
if (cache != null) {
tip = findSpeculativeTask(cache, tts, avgProgress,
currentTime, false);
if (tip != null) {
// remove empty cache entries
if (cache.size() == 0) {
runningMapCache.remove(parent);
}
LOG.info("Choosing a non-local task " + tip.getTIPId()
+ " for speculation");
return tip.getIdWithinJob();
}
}
}
// 3. Check non-local tips for speculation
tip = findSpeculativeTask(nonLocalRunningMaps, tts, avgProgress,
currentTime, false);
if (tip != null) {
LOG.info("Choosing a non-local task " + tip.getTIPId()
+ " for speculation");
return tip.getIdWithinJob();
}
}
return -1;
}
findNewMapTask takes a TaskTrackerStatus parameter (tts). What is it for?
assignTasks is driven by the heartbeat: every 10 seconds (by default) a TaskTracker sends a heartbeat to the JobTracker, and as part of it the tracker reports its own status; TaskTrackerStatus is exactly that report, which the JobTracker needs in order to decide what to assign. With that in mind, let's look at what the method actually does.
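As a rough mental model of that heartbeat loop (toy classes only; none of these names other than the quoted ones are real Hadoop APIs), something like the following sketch:
import java.util.ArrayList;
import java.util.List;
class HeartbeatDemo {
  // Toy stand-in for TaskTrackerStatus: what the tracker reports each beat.
  record Status(String trackerName, int runningMaps, int maxMapSlots) {}
  // Toy stand-in for the scheduler side: fill however many slots are free.
  static List<String> assignTasks(Status s, List<String> pendingMaps) {
    List<String> assigned = new ArrayList<>();
    int free = s.maxMapSlots() - s.runningMaps();
    while (free-- > 0 && !pendingMaps.isEmpty()) {
      assigned.add(pendingMaps.remove(0));
    }
    return assigned;
  }
  public static void main(String[] args) {
    List<String> pendingMaps = new ArrayList<>(List.of("map_0", "map_1", "map_2"));
    // In real Hadoop the TaskTracker heartbeats roughly every 10 seconds;
    // each beat carries its status and comes back with tasks to launch.
    for (int beat = 0; beat < 3; beat++) {
      Status status = new Status("tracker_host1:50060", 1, 2);
      System.out.println("beat " + beat + " -> launch " + assignTasks(status, pendingMaps));
    }
  }
}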
It first checks whether this tracker can run a task from the job at all, and run it effectively (shouldRunOnTaskTracker, plus enough free disk for the estimated map output); otherwise it returns -1. It then searches for a task starting from the closest topology level, in the order node-local -> rack-local -> off-switch -> speculative. Reading the code, you'll notice that which map task can be handed out depends on the Node information recorded when the job was set up, so next we need to look at the job submission/initialization code.
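The bottom-up part of that search is easier to see in miniature. The sketch below uses my own toy Node and cache types (nothing here is Hadoop's API) to show the walk from the tracker's host up to its rack, falling back to an off-switch task only on a miss:
import java.util.*;
class LocalityLookupDemo {
  // Toy topology node: a host whose parent is its rack.
  record Node(String name, Node parent) {}
  public static void main(String[] args) {
    Node rack1 = new Node("/rack1", null);
    Node host1 = new Node("host1", rack1);
    // Toy equivalent of nonRunningMapCache: pending map tasks keyed by the
    // topology node their input blocks live on.
    Map<Node, Deque<String>> pendingByNode = new HashMap<>();
    pendingByNode.put(rack1, new ArrayDeque<>(List.of("map_0003")));  // rack-local only
    Deque<String> offSwitch = new ArrayDeque<>(List.of("map_0009"));  // no usable location
    System.out.println(pickTask(host1, 2, pendingByNode, offSwitch)); // -> map_0003 (rack-local)
  }
  // Walk from the tracker's node towards the root, up to maxLevel levels,
  // mirroring the "local node to root [bottom up]" lookup in findNewMapTask.
  static String pickTask(Node trackerNode, int maxLevel,
                         Map<Node, Deque<String>> pendingByNode,
                         Deque<String> offSwitch) {
    Node key = trackerNode;
    for (int level = 0; level < maxLevel && key != null; ++level, key = key.parent()) {
      Deque<String> cache = pendingByNode.get(key);
      if (cache != null && !cache.isEmpty()) {
        return cache.poll();          // node-local (level 0) or rack-local (level 1)
      }
    }
    return offSwitch.poll();          // off-switch: last resort
  }
}
The real findNewMapTask layers more on top of this (the running/speculative caches, disk-space checks, and so on), but this level-by-level walk is the core of the locality decision.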
At this point, the hard part of the data locality story moves over to the job submission side.
For the sake of length I won't paste the code below; I'll just describe it.
When a job is submitted, a new JobInProgress is created (this part is easy to find, so I won't dwell on it). A scheduler thread then calls JobInProgress.initTasks, which calls readSplitFile; how that split file is generated will be covered in the next post, but in short it is the file describing how the input has been split. From it an array of RawSplits is built and createCache is called to build nonRunningMapCache: any split whose getLocations() is empty is first added to nonLocalMaps, and for the rest a Node is created. That requires the JobTracker's resolveAndAddToTopology method, which calls dnsToSwitchMapping.resolve; dnsToSwitchMapping was created by reflection as a ScriptBasedMapping when the JobTracker initialized. Its resolve method first calls runResolveCommand, which runs the script named by the "topology.script.file.name" property; that script maps an IP address to its rack information and has to be written entirely by the user, which is why rack awareness is effectively off by default. The first field of the script's output is then passed to NodeBase.normalize; with no script configured, NetworkTopology.DEFAULT_RACK is returned instead. Finally addHostToNodeMapping records the host-to-node pair in hostnameToNodeMap. Back in createCache, with the rack information in hand, each TaskInProgress is added under its node (and, recursively, the node's ancestors), and at that point the map tasks are initialized. JobTracker.initJob then creates the TIPs for the cleanup and setup map/reduce tasks and returns.
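As a condensed illustration of what that initialization produces (toy types of my own; only NetworkTopology.DEFAULT_RACK, i.e. "/default-rack", is a real Hadoop constant), a sketch:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
class CreateCacheDemo {
  // Toy stand-in for a RawSplit: an id plus the hosts holding its blocks.
  record Split(int id, List<String> hosts) {}
  public static void main(String[] args) {
    List<Split> splits = List.of(
        new Split(0, List.of("host1", "host2")),
        new Split(1, List.of()));               // e.g. a dummy split with no location
    List<Integer> nonLocalMaps = new ArrayList<>();
    // key: topology path ("/rack/host" or "/rack"), value: pending map task ids
    Map<String, List<Integer>> nonRunningMapCache = new HashMap<>();
    for (Split s : splits) {
      if (s.hosts().isEmpty()) {
        nonLocalMaps.add(s.id());               // will be scheduled without locality
        continue;
      }
      for (String host : s.hosts()) {
        // With no topology script configured, every host resolves to the single
        // default rack, so rack-locality degenerates to "anywhere in the cluster".
        String rack = "/default-rack";
        nonRunningMapCache.computeIfAbsent(rack + "/" + host, k -> new ArrayList<>()).add(s.id());
        List<Integer> rackList = nonRunningMapCache.computeIfAbsent(rack, k -> new ArrayList<>());
        if (rackList.isEmpty() || !rackList.get(rackList.size() - 1).equals(s.id())) {
          rackList.add(s.id());                 // avoid duplicate rack entries for the same split
        }
      }
    }
    System.out.println("cache=" + nonRunningMapCache + " nonLocal=" + nonLocalMaps);
  }
}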
I'm not sure how easy that narrative is to follow, so to sum up: at job initialization time the split information and its block locations are already recorded, and at scheduling time the JobTracker simply hands each TaskTracker the split that is closest to it.