Hadoop gets its best performance when it runs each map task on the node that already stores that task's input data (the data in HDFS), because such a task does not consume the cluster's most precious resource: network bandwidth.
This data locality optimization is at the heart of Hadoop's data processing and is what delivers that best performance.
So where exactly does this data locality optimization come into play? [Note: the Hadoop version discussed here is fairly old; since 2.x scheduling is handled by YARN, but this walkthrough still works as a reference.]
1. Is it in the reduce phase? No, it applies to map tasks: each input split corresponds to one map task. Moving the computation is cheaper than moving the data, so moving the computation to the data is the best solution, and that is exactly what data locality achieves (see the small sketch after this list).
2. At the time I read through some of the source code and a number of blog posts; one post in particular explains this very clearly [https://blog.csdn.net/dboy1/article/details/6256765].
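Before walking through the scheduler source quoted below, here is a minimal, self-contained sketch of the client-side fact that makes locality possible in the first place: each input split records the hosts that hold its HDFS blocks, and one map task is created per split. The path, host names and sizes below are made up for illustration; FileSplit and getLocations() are the real old-API (org.apache.hadoop.mapred) classes.
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileSplit;
public class SplitLocationDemo {
  public static void main(String[] args) throws IOException {
    // A FileSplit covers a region of a file and carries the hosts that store
    // the HDFS blocks for that region (hard-coded here for illustration).
    FileSplit split = new FileSplit(new Path("/data/input/part-00000"),
                                    0L, 128L * 1024 * 1024,
                                    new String[] {"datanode1", "datanode2", "datanode3"});
    // One map task is created per split; the scheduler prefers a TaskTracker
    // on one of these hosts (node-local), then one in the same rack (rack-local).
    for (String host : split.getLocations()) {
      System.out.println("preferred host: " + host);
    }
  }
}
With that picture in mind, here is the JobQueueTaskScheduler.assignTasks method quoted from that era of the source: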
//@Override
public synchronized List<Task> assignTasks(TaskTrackerStatus taskTracker)
throws IOException {
ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus();
final int numTaskTrackers = clusterStatus.getTaskTrackers();
final int clusterMapCapacity = clusterStatus.getMaxMapTasks();
final int clusterReduceCapacity = clusterStatus.getMaxReduceTasks();
Collection<JobInProgress> jobQueue =
jobQueueJobInProgressListener.getJobQueue();
//
// Get map + reduce counts for the current tracker.
//
final int trackerMapCapacity = taskTracker.getMaxMapTasks();
final int trackerReduceCapacity = taskTracker.getMaxReduceTasks();
final int trackerRunningMaps = taskTracker.countMapTasks();
final int trackerRunningReduces = taskTracker.countReduceTasks();
// Assigned tasks
List<Task> assignedTasks = new ArrayList<Task>();
//
// Compute (running + pending) map and reduce task numbers across pool
//
int remainingReduceLoad = 0;
int remainingMapLoad = 0;
synchronized (jobQueue) {
for (JobInProgress job : jobQueue) {
if (job.getStatus().getRunState() == JobStatus.RUNNING) {
remainingMapLoad += (job.desiredMaps() - job.finishedMaps());
if (job.scheduleReduces()) {
remainingReduceLoad +=
(job.desiredReduces() - job.finishedReduces());
}
}
}
}
// Compute the 'load factor' for maps and reduces
double mapLoadFactor = 0.0;
if (clusterMapCapacity > 0) {
mapLoadFactor = (double)remainingMapLoad / clusterMapCapacity;
}
double reduceLoadFactor = 0.0;
if (clusterReduceCapacity > 0) {
reduceLoadFactor = (double)remainingReduceLoad / clusterReduceCapacity;
}
//
// In the below steps, we allocate first map tasks (if appropriate),
// and then reduce tasks if appropriate. We go through all jobs
// in order of job arrival; jobs only get serviced if their
// predecessors are serviced, too.
//
//
// We assign tasks to the current taskTracker if the given machine
// has a workload that's less than the maximum load of that kind of
// task.
// However, if the cluster is close to getting loaded i.e. we don't
// have enough _padding_ for speculative executions etc., we only
// schedule the "highest priority" task i.e. the task from the job
// with the highest priority.
//
final int trackerCurrentMapCapacity =
Math.min((int)Math.ceil(mapLoadFactor * trackerMapCapacity),
trackerMapCapacity);
// This Math.min ensures trackerCurrentMapCapacity <= trackerMapCapacity
int availableMapSlots = trackerCurrentMapCapacity - trackerRunningMaps;
boolean exceededMapPadding = false;
if (availableMapSlots > 0) {
exceededMapPadding =
exceededPadding(true, clusterStatus, trackerMapCapacity);
}
int numLocalMaps = 0;
int numNonLocalMaps = 0;
scheduleMaps:
for (int i=0; i < availableMapSlots; ++i) {
synchronized (jobQueue) {
// This is the FIFO part: jobs are scanned in arrival order, earliest first
for (JobInProgress job : jobQueue) {
if (job.getStatus().getRunState() != JobStatus.RUNNING) {
continue;
}
Task t = null;
// Try to schedule a node-local or rack-local Map task
t =
job.obtainNewLocalMapTask(taskTracker, numTaskTrackers,
taskTrackerManager.getNumberOfUniqueHosts());
if (t != null) {
assignedTasks.add(t);
++numLocalMaps;
// Don't assign map tasks to the hilt!
// Leave some free slots in the cluster for future task-failures,
// speculative tasks etc. beyond the highest priority job
if (exceededMapPadding) {
break scheduleMaps;
}
// Try all jobs again for the next Map task
break;
}
// Try to schedule an off-switch or speculative Map task
t =
job.obtainNewNonLocalMapTask(taskTracker, numTaskTrackers,
taskTrackerManager.getNumberOfUniqueHosts());
if (t != null) {
assignedTasks.add(t);
++numNonLocalMaps;
// We assign at most 1 off-switch or speculative task
// This is to prevent TaskTrackers from stealing local-tasks
// from other TaskTrackers.
break scheduleMaps;
}
}
}
}
int assignedMaps = assignedTasks.size();
//
// Same thing, but for reduce tasks
// However we _never_ assign more than 1 reduce task per heartbeat
final int trackerCurrentReduceCapacity =
Math.min((int)Math.ceil(reduceLoadFactor * trackerReduceCapacity),
trackerReduceCapacity);
// This caps availableReduceSlots at 1 and keeps it in proportion to the cluster-wide reduce load
final int availableReduceSlots =
Math.min((trackerCurrentReduceCapacity - trackerRunningReduces), 1);
boolean exceededReducePadding = false;
if (availableReduceSlots > 0) {
exceededReducePadding = exceededPadding(false, clusterStatus,
trackerReduceCapacity);
synchronized (jobQueue) {
for (JobInProgress job : jobQueue) {
if (job.getStatus().getRunState() != JobStatus.RUNNING ||
job.numReduceTasks == 0) {
continue;
}
Task t =
job.obtainNewReduceTask(taskTracker, numTaskTrackers,
taskTrackerManager.getNumberOfUniqueHosts()
);
if (t != null) {
assignedTasks.add(t);
// only one reduce per heartbeat, so break right away
break;
}
// Don't assign reduce tasks to the hilt!
// Leave some free slots in the cluster for future task-failures,
// speculative tasks etc. beyond the highest priority job
if (exceededReducePadding) {
break;
}
}
}
}
if (LOG.isDebugEnabled()) {
LOG.debug("Task assignments for " + taskTracker.getTrackerName() + " --> " +
"[" + mapLoadFactor + ", " + trackerMapCapacity + ", " +
trackerCurrentMapCapacity + ", " + trackerRunningMaps + "] -> [" +
(trackerCurrentMapCapacity - trackerRunningMaps) + ", " +
assignedMaps + " (" + numLocalMaps + ", " + numNonLocalMaps +
")] [" + reduceLoadFactor + ", " + trackerReduceCapacity + ", " +
trackerCurrentReduceCapacity + "," + trackerRunningReduces +
"] -> [" + (trackerCurrentReduceCapacity - trackerRunningReduces) +
", " + (assignedTasks.size()-assignedMaps) + "]");
}
return assignedTasks;
}
In short, JobQueueTaskScheduler first reads some cluster-level figures and fetches the job queue; jobQueueJobInProgressListener is the JobInProgressListener this scheduler registered with the JobTracker. It then computes how many map and reduce tasks still need to run, keeping the totals in remainingMapLoad and remainingReduceLoad, and later checks (via exceededPadding) whether assigning more tasks would eat into the slots kept free as padding for speculative execution and task failures.
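To make the capacity arithmetic concrete, here is a tiny standalone sketch of the same formulas with made-up numbers (it is not Hadoop code, just the math from assignTasks): a lightly loaded cluster scales each tracker's usable capacity down so that free slots are left over as padding.
public class LoadFactorDemo {
  public static void main(String[] args) {
    int clusterMapCapacity = 100;  // total map slots in the cluster (made up)
    int remainingMapLoad   = 40;   // running + pending maps across all jobs (made up)
    int trackerMapCapacity = 8;    // map slots on this TaskTracker (made up)
    int trackerRunningMaps = 2;    // maps already running on it (made up)
    double mapLoadFactor = (double) remainingMapLoad / clusterMapCapacity;   // 0.4
    int trackerCurrentMapCapacity =
        Math.min((int) Math.ceil(mapLoadFactor * trackerMapCapacity),
                 trackerMapCapacity);                                        // min(4, 8) = 4
    int availableMapSlots = trackerCurrentMapCapacity - trackerRunningMaps;  // 4 - 2 = 2
    System.out.println("loadFactor=" + mapLoadFactor
        + " currentCapacity=" + trackerCurrentMapCapacity
        + " slotsToFillThisHeartbeat=" + availableMapSlots);
  }
}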
int numLocalMaps = 0;
int numNonLocalMaps = 0;
if (job.getStatus().getRunState() != JobStatus.RUNNING) {
continue;
}
Reading further down:
// Try to schedule a node-local or rack-local Map task
t =
job.obtainNewLocalMapTask(taskTracker, numTaskTrackers,
taskTrackerManager.getNumberOfUniqueHosts());
This comment, translated, says: try to schedule a node-local or rack-local map task.
Taking this as the entry point, let's look at obtainNewLocalMapTask. But first a quick word on getNumberOfUniqueHosts(): it is trivial and simply returns how many unique hosts the JobTracker knows about, so we can ignore it and step straight into obtainNewLocalMapTask.
public synchronized Task obtainNewLocalMapTask(TaskTrackerStatus tts,
int clusterSize,
int numUniqueHosts)
throws IOException {
if (!tasksInited.get()) {
LOG.info("Cannot create task split for " + profile.getJobID());
return null;
}
// maxLevel = 2 here, i.e. NetworkTopology.DEFAULT_HOST_LEVEL
int target = findNewMapTask(tts, clusterSize, numUniqueHosts, maxLevel,
status.mapProgress());
if (target == -1) {
return null;
}
Task result = maps[target].getTaskToRun(tts.getTrackerName());
if (result != null) {
addRunningTaskToTIP(maps[target], result.getTaskID(), tts, true);
}
return result;
}
The most important line here is:
int target = findNewMapTask(tts, clusterSize, numUniqueHosts, maxLevel,
status.mapProgress());
private synchronized int findNewMapTask(final TaskTrackerStatus tts,
final int clusterSize,
final int numUniqueHosts,
final int maxCacheLevel,
final double avgProgress) {
if (numMapTasks == 0) {
LOG.info("No maps to schedule for " + profile.getJobID());
return -1;
}
String taskTracker = tts.getTrackerName();
TaskInProgress tip = null;
//
// Update the last-known clusterSize
//
this.clusterSize = clusterSize;
if (!shouldRunOnTaskTracker(taskTracker)) {
return -1;
}
// Check to ensure this TaskTracker has enough resources to
// run tasks from this job
long outSize = resourceEstimator.getEstimatedMapOutputSize();
long availSpace = tts.getResourceStatus().getAvailableSpace();
if(availSpace < outSize) {
LOG.warn("No room for map task. Node " + tts.getHost() +
" has " + availSpace +
" bytes free; but we expect map to take " + outSize);
return -1; //see if a different TIP might work better.
}
// For scheduling a map task, we have two caches and a list (optional)
// I) one for non-running task
// II) one for running task (this is for handling speculation)
// III) a list of TIPs that have empty locations (e.g., dummy splits),
// the list is empty if all TIPs have associated locations
// First a look up is done on the non-running cache and on a miss, a look
// up is done on the running cache. The order for lookup within the cache:
// 1. from local node to root [bottom up]
// 2. breadth wise for all the parent nodes at max level
// We fall to linear scan of the list (III above) if we have misses in the
// above caches
Node node = jobtracker.getNode(tts.getHost());
//
// I) Non-running TIP :
//
// 1. check from local node to the root [bottom up cache lookup]
// i.e if the cache is available and the host has been resolved
// (node!=null)
if (node != null) {
Node key = node;
int level = 0;
// maxCacheLevel might be greater than this.maxLevel if findNewMapTask is
// called to schedule any task (local, rack-local, off-switch or speculative),
// or it might be NON_LOCAL_CACHE_LEVEL (i.e. -1) if findNewMapTask is to only
// schedule off-switch/speculative tasks
int maxLevelToSchedule = Math.min(maxCacheLevel, maxLevel);
for (level = 0;level < maxLevelToSchedule; ++level) {
List <TaskInProgress> cacheForLevel = nonRunningMapCache.get(key);
if (cacheForLevel != null) {
tip = findTaskFromList(cacheForLevel, tts,
numUniqueHosts,level == 0);
if (tip != null) {
// Add to running cache
scheduleMap(tip);
// remove the cache if its empty
if (cacheForLevel.size() == 0) {
nonRunningMapCache.remove(key);
}
return tip.getIdWithinJob();
}
}
key = key.getParent();
}
// Check if we need to only schedule a local task (node-local/rack-local)
if (level == maxCacheLevel) {
return -1;
}
}
//2. Search breadth-wise across parents at max level for non-running
// TIP if
// - cache exists and there is a cache miss
// - node information for the tracker is missing (tracker's topology
// info not obtained yet)
// collection of node at max level in the cache structure
Collection<Node> nodesAtMaxLevel = jobtracker.getNodesAtMaxLevel();
// get the node parent at max level
Node nodeParentAtMaxLevel =
(node == null) ? null : JobTracker.getParentNode(node, maxLevel - 1);
for (Node parent : nodesAtMaxLevel) {
// skip the parent that has already been scanned
if (parent == nodeParentAtMaxLevel) {
continue;
}
List<TaskInProgress> cache = nonRunningMapCache.get(parent);
if (cache != null) {
tip = findTaskFromList(cache, tts, numUniqueHosts, false);
if (tip != null) {
// Add to the running cache
scheduleMap(tip);
// remove the cache if empty
if (cache.size() == 0) {
nonRunningMapCache.remove(parent);
}
LOG.info("Choosing a non-local task " + tip.getTIPId());
return tip.getIdWithinJob();
}
}
}
// 3. Search non-local tips for a new task
tip = findTaskFromList(nonLocalMaps, tts, numUniqueHosts, false);
if (tip != null) {
// Add to the running list
scheduleMap(tip);
LOG.info("Choosing a non-local task " + tip.getTIPId());
return tip.getIdWithinJob();
}
//
// II) Running TIP :
//
if (hasSpeculativeMaps) {
long currentTime = System.currentTimeMillis();
// 1. Check bottom up for speculative tasks from the running cache
if (node != null) {
Node key = node;
for (int level = 0; level < maxLevel; ++level) {
Set<TaskInProgress> cacheForLevel = runningMapCache.get(key);
if (cacheForLevel != null) {
tip = findSpeculativeTask(cacheForLevel, tts,
avgProgress, currentTime, level == 0);
if (tip != null) {
if (cacheForLevel.size() == 0) {
runningMapCache.remove(key);
}
return tip.getIdWithinJob();
}
}
key = key.getParent();
}
}
// 2. Check breadth-wise for speculative tasks
for (Node parent : nodesAtMaxLevel) {
// ignore the parent which is already scanned
if (parent == nodeParentAtMaxLevel) {
continue;
}
Set<TaskInProgress> cache = runningMapCache.get(parent);
if (cache != null) {
tip = findSpeculativeTask(cache, tts, avgProgress,
currentTime, false);
if (tip != null) {
// remove empty cache entries
if (cache.size() == 0) {
runningMapCache.remove(parent);
}
LOG.info("Choosing a non-local task " + tip.getTIPId()
+ " for speculation");
return tip.getIdWithinJob();
}
}
}
// 3. Check non-local tips for speculation
tip = findSpeculativeTask(nonLocalRunningMaps, tts, avgProgress,
currentTime, false);
if (tip != null) {
LOG.info("Choosing a non-local task " + tip.getTIPId()
+ " for speculation");
return tip.getIdWithinJob();
}
}
return -1;
}
findNewMapTask takes a TaskTrackerStatus parameter (tts). What is it for?
assignTasks is driven by the heartbeat: every 10 seconds (by default) a TaskTracker sends a heartbeat to the JobTracker, and as part of it the tracker reports its own status; TaskTrackerStatus is exactly that report, which the JobTracker needs in order to decide what to assign. With that in mind, let's look at what the method actually does.
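As a rough mental model of that heartbeat loop (toy classes only; none of these names other than the quoted ones are real Hadoop APIs), something like the following sketch:
import java.util.ArrayList;
import java.util.List;
class HeartbeatDemo {
  // Toy stand-in for TaskTrackerStatus: what the tracker reports each beat.
  record Status(String trackerName, int runningMaps, int maxMapSlots) {}
  // Toy stand-in for the scheduler side: fill however many slots are free.
  static List<String> assignTasks(Status s, List<String> pendingMaps) {
    List<String> assigned = new ArrayList<>();
    int free = s.maxMapSlots() - s.runningMaps();
    while (free-- > 0 && !pendingMaps.isEmpty()) {
      assigned.add(pendingMaps.remove(0));
    }
    return assigned;
  }
  public static void main(String[] args) {
    List<String> pendingMaps = new ArrayList<>(List.of("map_0", "map_1", "map_2"));
    // In real Hadoop the TaskTracker heartbeats roughly every 10 seconds;
    // each beat carries its status and comes back with tasks to launch.
    for (int beat = 0; beat < 3; beat++) {
      Status status = new Status("tracker_host1:50060", 1, 2);
      System.out.println("beat " + beat + " -> launch " + assignTasks(status, pendingMaps));
    }
  }
}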
It first checks whether this tracker can run a task from the job at all, and run it effectively (shouldRunOnTaskTracker, plus enough free disk for the estimated map output); otherwise it returns -1. It then searches for a task starting from the closest topology level, in the order node-local -> rack-local -> off-switch -> speculative. Reading the code, you'll notice that which map task can be handed out depends on the Node information recorded when the job was set up, so next we need to look at the job submission/initialization code.
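The bottom-up part of that search is easier to see in miniature. The sketch below uses my own toy Node and cache types (nothing here is Hadoop's API) to show the walk from the tracker's host up to its rack, falling back to an off-switch task only on a miss:
import java.util.*;
class LocalityLookupDemo {
  // Toy topology node: a host whose parent is its rack.
  record Node(String name, Node parent) {}
  public static void main(String[] args) {
    Node rack1 = new Node("/rack1", null);
    Node host1 = new Node("host1", rack1);
    // Toy equivalent of nonRunningMapCache: pending map tasks keyed by the
    // topology node their input blocks live on.
    Map<Node, Deque<String>> pendingByNode = new HashMap<>();
    pendingByNode.put(rack1, new ArrayDeque<>(List.of("map_0003")));  // rack-local only
    Deque<String> offSwitch = new ArrayDeque<>(List.of("map_0009"));  // no usable location
    System.out.println(pickTask(host1, 2, pendingByNode, offSwitch)); // -> map_0003 (rack-local)
  }
  // Walk from the tracker's node towards the root, up to maxLevel levels,
  // mirroring the "local node to root [bottom up]" lookup in findNewMapTask.
  static String pickTask(Node trackerNode, int maxLevel,
                         Map<Node, Deque<String>> pendingByNode,
                         Deque<String> offSwitch) {
    Node key = trackerNode;
    for (int level = 0; level < maxLevel && key != null; ++level, key = key.parent()) {
      Deque<String> cache = pendingByNode.get(key);
      if (cache != null && !cache.isEmpty()) {
        return cache.poll();          // node-local (level 0) or rack-local (level 1)
      }
    }
    return offSwitch.poll();          // off-switch: last resort
  }
}
The real findNewMapTask layers more on top of this (the running/speculative caches, disk-space checks, and so on), but this level-by-level walk is the core of the locality decision.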
At this point, the hard part of the data locality story moves over to the job submission side.
For the sake of length I won't paste the code below; I'll just describe it.
When a job is submitted, a new JobInProgress is created (this part is easy to find, so I won't dwell on it). A scheduler thread then calls JobInProgress.initTasks, which calls readSplitFile; how that split file is generated will be covered in the next post, but in short it is the file describing how the input has been split. From it an array of RawSplits is built and createCache is called to build nonRunningMapCache: any split whose getLocations() is empty is first added to nonLocalMaps, and for the rest a Node is created. That requires the JobTracker's resolveAndAddToTopology method, which calls dnsToSwitchMapping.resolve; dnsToSwitchMapping was created by reflection as a ScriptBasedMapping when the JobTracker initialized. Its resolve method first calls runResolveCommand, which runs the script named by the "topology.script.file.name" property; that script maps an IP address to its rack information and has to be written entirely by the user, which is why rack awareness is effectively off by default. The first field of the script's output is then passed to NodeBase.normalize; with no script configured, NetworkTopology.DEFAULT_RACK is returned instead. Finally addHostToNodeMapping records the host-to-node pair in hostnameToNodeMap. Back in createCache, with the rack information in hand, each TaskInProgress is added under its node (and, recursively, the node's ancestors), and at that point the map tasks are initialized. JobTracker.initJob then creates the TIPs for the cleanup and setup map/reduce tasks and returns.
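As a condensed illustration of what that initialization produces (toy types of my own; only NetworkTopology.DEFAULT_RACK, i.e. "/default-rack", is a real Hadoop constant), a sketch:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
class CreateCacheDemo {
  // Toy stand-in for a RawSplit: an id plus the hosts holding its blocks.
  record Split(int id, List<String> hosts) {}
  public static void main(String[] args) {
    List<Split> splits = List.of(
        new Split(0, List.of("host1", "host2")),
        new Split(1, List.of()));               // e.g. a dummy split with no location
    List<Integer> nonLocalMaps = new ArrayList<>();
    // key: topology path ("/rack/host" or "/rack"), value: pending map task ids
    Map<String, List<Integer>> nonRunningMapCache = new HashMap<>();
    for (Split s : splits) {
      if (s.hosts().isEmpty()) {
        nonLocalMaps.add(s.id());               // will be scheduled without locality
        continue;
      }
      for (String host : s.hosts()) {
        // With no topology script configured, every host resolves to the single
        // default rack, so rack-locality degenerates to "anywhere in the cluster".
        String rack = "/default-rack";
        nonRunningMapCache.computeIfAbsent(rack + "/" + host, k -> new ArrayList<>()).add(s.id());
        List<Integer> rackList = nonRunningMapCache.computeIfAbsent(rack, k -> new ArrayList<>());
        if (rackList.isEmpty() || !rackList.get(rackList.size() - 1).equals(s.id())) {
          rackList.add(s.id());                 // avoid duplicate rack entries for the same split
        }
      }
    }
    System.out.println("cache=" + nonRunningMapCache + " nonLocal=" + nonLocalMaps);
  }
}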
I'm not sure how easy that narrative is to follow, so to sum up: at job initialization time the split information and its block locations are already recorded, and at scheduling time the JobTracker simply hands each TaskTracker the split that is closest to it.