How ReduceTask Runs
A reduce task goes through three phases:
1. copy: fetch the output data from each map task.
2. sort: merge-sort the fetched data.
3. reduce: run the user's reduce logic.
Like MapTask, ReduceTask starts in its run method.
The shuffle plugin is configured via mapreduce.job.reduce.shuffle.consumer.plugin.class;
the default is the Shuffle class, which implements the ShuffleConsumerPlugin interface.
run instantiates the plugin and calls its init function to initialize it:
Class<? extends ShuffleConsumerPlugin> clazz =
    job.getClass(MRConfig.SHUFFLE_CONSUMER_PLUGIN, Shuffle.class, ShuffleConsumerPlugin.class);
shuffleConsumerPlugin = ReflectionUtils.newInstance(clazz, job);
LOG.info("Using ShuffleConsumerPlugin: " + shuffleConsumerPlugin);
ShuffleConsumerPlugin.Context shuffleContext =
    new ShuffleConsumerPlugin.Context(getTaskID(), job, FileSystem.getLocal(job), umbilical,
        super.lDirAlloc, reporter, codec,
        combinerClass, combineCollector,
        spilledRecordsCounter, reduceCombineInputCounter,
        shuffledMapsCounter,
        reduceShuffleBytes, failedShuffleCounter,
        mergedMapOutputsCounter,
        taskStatus, copyPhase, sortPhase, this,
        mapOutputFile, localMapFiles);
shuffleConsumerPlugin.init(shuffleContext);
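The configurable-plugin pattern used here can be sketched in plain Java. The class and method names below are illustrative stand-ins; Hadoop's actual helpers are job.getClass and ReflectionUtils.newInstance, which additionally inject the Configuration.

```java
import java.util.List;

// Minimal sketch of reflective plugin loading: resolve a class by name,
// check it implements the expected interface, and instantiate it.
// Hypothetical stand-in for ReflectionUtils.newInstance, not Hadoop code.
public class PluginLoader {
    public static <T> T newInstance(String className, Class<T> iface) {
        try {
            Class<? extends T> clazz = Class.forName(className).asSubclass(iface);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("cannot load plugin " + className, e);
        }
    }

    public static void main(String[] args) {
        // Smoke test: load a JDK class through the same mechanism.
        List<?> list = newInstance("java.util.ArrayList", List.class);
        System.out.println(list.getClass().getName());
    }
}
```

The asSubclass check is what makes a misconfigured class name fail fast with a clear error instead of a ClassCastException later.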
It then calls the plugin's run function, which returns a RawKeyValueIterator instance:
rIter = shuffleConsumerPlugin.run();
The Shuffle.run function:
.....................................
int eventsPerReducer = Math.max(MIN_EVENTS_TO_FETCH,
    MAX_RPC_OUTSTANDING_EVENTS / jobConf.getNumReduceTasks());
int maxEventsToFetch = Math.min(MAX_EVENTS_TO_FETCH, eventsPerReducer);
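The effect of this clamping is easy to check numerically. Assuming the constant values from the Hadoop 2.x source (MIN_EVENTS_TO_FETCH = 100, MAX_EVENTS_TO_FETCH = 10000, MAX_RPC_OUTSTANDING_EVENTS = 3000000; these may differ across versions), the per-RPC event budget behaves like this:

```java
// Sketch of the per-RPC event budget: with few reducers the cap
// MAX_EVENTS_TO_FETCH wins; with very many reducers the floor
// MIN_EVENTS_TO_FETCH wins. Constants assumed from Hadoop 2.x Shuffle.java.
public class EventFetchBudget {
    static final int MIN_EVENTS_TO_FETCH = 100;
    static final int MAX_EVENTS_TO_FETCH = 10000;
    static final int MAX_RPC_OUTSTANDING_EVENTS = 3000000;

    public static int maxEventsToFetch(int numReduceTasks) {
        int eventsPerReducer = Math.max(MIN_EVENTS_TO_FETCH,
            MAX_RPC_OUTSTANDING_EVENTS / numReduceTasks);
        return Math.min(MAX_EVENTS_TO_FETCH, eventsPerReducer);
    }

    public static void main(String[] args) {
        System.out.println(maxEventsToFetch(30));     // few reducers: capped at 10000
        System.out.println(maxEventsToFetch(100000)); // many reducers: floored at 100
    }
}
```

The idea is to bound the total number of completion events outstanding across all reducers' RPCs, so a job with thousands of reducers does not overwhelm the AM.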
run first creates and starts the map-completion event fetcher thread. This thread polls the AM (once per second) for the completion events of all finished maps in this job, and through the ShuffleSchedulerImpl instance records each finished map's host, map ID, and so on into the mapLocations container.
// Start the map-completion events fetcher thread
final EventFetcher<K,V> eventFetcher =
    new EventFetcher<K,V>(reduceId, umbilical, scheduler, this,
        maxEventsToFetch);
eventFetcher.start();
Let's look at how EventFetcher.run executes; in the code below only the main body is kept.
...................
EventFetcher.run:
public void run() {
  int failures = 0;
  ........................
  int numNewMaps = getMapCompletionEvents();
  ..................................
  }
  ......................
}
EventFetcher.getMapCompletionEvents
..................................
MapTaskCompletionEventsUpdate update =
    umbilical.getMapCompletionEvents(
        (org.apache.hadoop.mapred.JobID) reduce.getJobID(),
        fromEventIdx,
        maxEventsToFetch,
        (org.apache.hadoop.mapred.TaskAttemptID) reduce);
events = update.getMapTaskCompletionEvents();
.....................
for (TaskCompletionEvent event : events) {
  scheduler.resolve(event);
  if (TaskCompletionEvent.Status.SUCCEEDED == event.getTaskStatus()) {
    ++numNewMaps;
  }
}
scheduler is an instance of ShuffleSchedulerImpl.
ShuffleSchedulerImpl.resolve:
case SUCCEEDED:
  URI u = getBaseURI(reduceId, event.getTaskTrackerHttp());
  addKnownMapOutput(u.getHost() + ":" + u.getPort(),
      u.toString(),
      event.getTaskAttemptId());
  maxMapRuntime = Math.max(maxMapRuntime, event.getTaskRunTime());
  break;
.......
The ShuffleSchedulerImpl.addKnownMapOutput function:
It records the map ID and its host in the mapLocations container:
MapHost host = mapLocations.get(hostName);
if (host == null) {
  host = new MapHost(hostName, hostUrl);
  mapLocations.put(hostName, host);
}
Calling addKnownMap sets the host's state to PENDING:
host.addKnownMap(mapId);
The host is then added to the pendingHosts container, and notifyAll wakes the Fetcher copy threads waiting for work:
// Mark the host as pending
if (host.getState() == State.PENDING) {
  pendingHosts.add(host);
  notifyAll();
}
.....................
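The hand-off between the EventFetcher thread (producer) and the Fetcher threads (consumers) is the classic guarded-wait pattern. A minimal standalone sketch with illustrative names, not Hadoop's actual classes:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the pendingHosts hand-off: addKnownMapOutput runs on the
// event-fetcher thread and wakes Fetchers; getHost blocks until a host
// with known map output is available.
public class PendingHosts {
    private final Set<String> pending = new HashSet<>();

    public synchronized void addKnownMapOutput(String host) {
        pending.add(host);
        notifyAll(); // wake any Fetcher blocked in getHost()
    }

    public synchronized String getHost() throws InterruptedException {
        while (pending.isEmpty()) {
            wait(); // releases the lock; condition re-checked on wake-up
        }
        String host = pending.iterator().next();
        pending.remove(host);
        return host;
    }

    public static void main(String[] args) throws InterruptedException {
        PendingHosts p = new PendingHosts();
        p.addKnownMapOutput("node1:13562");
        System.out.println(p.getHost());
    }
}
```

Note the while loop around wait(): a woken thread must re-check the condition, since another Fetcher may have already taken the host (or the wake-up may be spurious).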
Back in Shuffle.run, execution continues:
// Start the map-output fetcher threads
boolean isLocal = localMapFiles != null;
The number of threads that fetch map output is configured via mapreduce.reduce.shuffle.parallelcopies (default 5).
Fetcher thread instances are created and started.
The connection timeout is configured via mapreduce.reduce.shuffle.connect.timeout (default 180000 ms),
and the read timeout via mapreduce.reduce.shuffle.read.timeout (default 180000 ms).
final int numFetchers = isLocal ? 1 :
    jobConf.getInt(MRJobConfig.SHUFFLE_PARALLEL_COPIES, 5);
Fetcher<K,V>[] fetchers = new Fetcher[numFetchers];
if (isLocal) {
  fetchers[0] = new LocalFetcher<K, V>(jobConf, reduceId, scheduler,
      merger, reporter, metrics, this, reduceTask.getShuffleSecret(),
      localMapFiles);
  fetchers[0].start();
} else {
  for (int i = 0; i < numFetchers; ++i) {
    fetchers[i] = new Fetcher<K,V>(jobConf, reduceId, scheduler, merger,
        reporter, metrics, this,
        reduceTask.getShuffleSecret());
    fetchers[i].start();
  }
}
.........................
Now let's step into the Fetcher thread and trace how Fetcher.run executes:
..........................
MapHost host = null;
try {
  // If merge is on, block
  merger.waitForResource();
A MapHost instance is taken from the ShuffleScheduler:
  // Get a host to shuffle from
  host = scheduler.getHost();
  metrics.threadBusy();
Then the shuffle copy is performed:
  // Shuffle
  copyFromHost(host);
} finally {
  if (host != null) {
    scheduler.freeHost(host);
    metrics.threadFree();
  }
}
Next, the getHost function in ShuffleScheduler:
........
If pendingHosts is empty, the thread waits until the EventFetcher thread adds a host and calls notifyAll:
while (pendingHosts.isEmpty()) {
  wait();
}
MapHost host = null;
Iterator<MapHost> iter = pendingHosts.iterator();
A MapHost is picked at random from pendingHosts and returned to the caller:
int numToPick = random.nextInt(pendingHosts.size());
for (int i = 0; i <= numToPick; ++i) {
  host = iter.next();
}
pendingHosts.remove(host);
........................
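Note the random selection: because pendingHosts is a Set, getHost advances an iterator a random number of steps instead of indexing, which spreads concurrent fetchers across hosts rather than hammering one node. The same trick as a standalone sketch:

```java
import java.util.Iterator;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

// Sketch of picking a uniformly random element from a Set by walking its
// iterator, as ShuffleSchedulerImpl.getHost does with pendingHosts.
public class RandomPick {
    public static <T> T pick(Set<T> set, Random random) {
        int numToPick = random.nextInt(set.size());
        Iterator<T> iter = set.iterator();
        T chosen = iter.next();               // element 0
        for (int i = 0; i < numToPick; ++i) { // advance to element numToPick
            chosen = iter.next();
        }
        return chosen;
    }

    public static void main(String[] args) {
        Set<String> hosts = new TreeSet<>();
        hosts.add("node1:13562");
        hosts.add("node2:13562");
        hosts.add("node3:13562");
        String h = pick(hosts, new Random());
        System.out.println(h + " contained: " + hosts.contains(h));
    }
}
```

This is O(n) in the set size per pick; that is acceptable here because the number of pending hosts is small compared to the cost of the network copy that follows.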
Once a MapHost is obtained, copyFromHost performs the actual data copy.
At this point the URL for a task's host looks roughly like:
host:port/mapOutput?job=xxx&reduce=123(the current reduce's partition id)&map=
The body of copyFromHost:
.....
List<TaskAttemptID> maps = scheduler.getMapsForHost(host);
.....
Set<TaskAttemptID> remaining = new HashSet<TaskAttemptID>(maps);
.....
After this step, the map= part of the URL contains multiple map IDs, separated by commas.
URL url = getMapOutputURL(host, maps);
An HTTP connection is then opened to this URL;
if mapreduce.shuffle.ssl.enabled is set to true (default false), an SSL connection is used.
openConnection(url);
.....
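The comma-separated URL assembly can be sketched as follows. The class name and layout are illustrative, not Hadoop's actual getMapOutputURL code; the point is that one HTTP request asks a single node's shuffle service for several map outputs at once.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of assembling the map-output fetch URL: the attempt IDs of all
// maps hosted on one node are joined by commas in the map= parameter,
// so their outputs can be streamed back over a single connection.
public class MapOutputUrl {
    public static String build(String baseUrl, String jobId,
                               int reducePartition, List<String> mapIds) {
        return baseUrl + "/mapOutput?job=" + jobId
             + "&reduce=" + reducePartition
             + "&map=" + String.join(",", mapIds);
    }

    public static void main(String[] args) {
        List<String> maps = Arrays.asList(
            "attempt_001_m_000000_0", "attempt_001_m_000001_0");
        System.out.println(build("http://node1:13562", "job_001", 3, maps));
    }
}
```

Batching map outputs per host is what makes the earlier random host pick worthwhile: each connection drains everything that node has for this reducer.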