ReduceTask的运行

最新推荐文章于 2021-09-18 15:50:48 发布

马小纪

最新推荐文章于 2021-09-18 15:50:48 发布

阅读量133

点赞数

本文链接：https://blog.csdn.net/maxiaoji/article/details/106226794

版权

Class < ?extendsShuffleConsumerPlugin > clazz = job.getClass(MRConfig.SHUFFLE_CONSUMER_PLUGIN, Shuffle.class, ShuffleConsumerPlugin.class);
shuffleConsumerPlugin = ReflectionUtils.newInstance(clazz, job);
LOG.info("UsingShuffleConsumerPlugin: " + shuffleConsumerPlugin);

ShuffleConsumerPlugin.ContextshuffleContext = newShuffleConsumerPlugin.Context(getTaskID(), job, FileSystem.getLocal(job), umbilical, super.lDirAlloc, reporter, codec, combinerClass, combineCollector, spilledRecordsCounter, reduceCombineInputCounter, shuffledMapsCounter, reduceShuffleBytes, failedShuffleCounter, mergedMapOutputsCounter, taskStatus, copyPhase, sortPhase, this, mapOutputFile, localMapFiles);
shuffleConsumerPlugin.init(shuffleContext);执行shuffle的run函数，得到RawKeyValueIterator的实例。rIter = shuffleConsumerPlugin.run();

Shuffle.run函数定义：.....................................

inteventsPerReducer = Math.max(MIN_EVENTS_TO_FETCH, MAX_RPC_OUTSTANDING_EVENTS / jobConf.getNumReduceTasks());
intmaxEventsToFetch = Math.min(MAX_EVENTS_TO_FETCH, eventsPerReducer);生成map的完成状态获取线程，并启动此线程，此线程中从am中获取此job中所有完成的map的event通过ShuffleSchedulerImpl实例把所有的map的完成的map的host,
mapid,
等记录到mapLocations容器中。此线程每一秒执行一个获取操作。
//Start the map-completion events fetcher thread
finalEventFetcher < K,
V > eventFetcher = newEventFetcher < K,
V > (reduceId, umbilical, scheduler, this, maxEventsToFetch);
eventFetcher.start();下面看看EventFetcher.run函数的执行过程：以下代码中我只保留了代码的主体部分。...................EventFetcher.run: publicvoid run() {
    intfailures = 0;........................intnumNewMaps = getMapCompletionEvents();..................................
}......................
}
EventFetcher.getMapCompletionEvents..................................MapTaskCompletionEventsUpdateupdate = umbilical.getMapCompletionEvents((org.apache.hadoop.mapred.JobID) reduce.getJobID(), fromEventIdx, maxEventsToFetch, (org.apache.hadoop.mapred.TaskAttemptID) reduce);
events = update.getMapTaskCompletionEvents();.....................
for (TaskCompletionEvent event: events) {
    scheduler.resolve(event);
    if (TaskCompletionEvent.Status.SUCCEEDED == event.getTaskStatus()) {++numNewMaps;
    }
}
shecduler是ShuffleShedulerImpl的实例。ShuffleShedulerImpl.resolve caseSUCCEEDED: URI u = getBaseURI(reduceId, event.getTaskTrackerHttp());
addKnownMapOutput(u.getHost() + ":" + u.getPort(), u.toString(), event.getTaskAttemptId());
maxMapRuntime = Math.max(maxMapRuntime, event.getTaskRunTime());
break;.......ShuffleShedulerImpl.addKnownMapOutput函数：把mapid与对应的host添加到mapLocations容器中，MapHost host = mapLocations.get(hostName);
if (host == null) {
    host = newMapHost(hostName, hostUrl);
    mapLocations.put(hostName, host);
}此时会把host的状设置为PENDING host.addKnownMap(mapId);同时把host添加到pendingHosts容器中。notify相关的Fetcher文件copy线程。
//Mark the host as pending
if (host.getState() == State.PENDING) {
    pendingHosts.add(host);
    notifyAll();
}.....................

回到ReduceTask.run函数中，接着向下执行
//Start the map-output fetcher threads
booleanisLocal = localMapFiles != null;通过mapreduce.reduce.shuffle.parallelcopies配置的值，默认为5，生成获取map数据的线程数。生成Fetcher线程实例，并启动相关的线程。通过mapreduce.reduce.shuffle.connect.timeout配置连接超时时间。默认180000通过mapreduce.reduce.shuffle.read.timeout配置读取超时时间，默认为180000 finalintnumFetchers = isLocal ? 1 : jobConf.getInt(MRJobConfig.SHUFFLE_PARALLEL_COPIES, 5);
Fetcher < K,
V > [] fetchers = newFetcher[numFetchers];
if (isLocal) {
    fetchers[0] = newLocalFetcher < K,
    V > (jobConf, reduceId, scheduler, merger, reporter, metrics, this, reduceTask.getShuffleSecret(), localMapFiles);
    fetchers[0].start();
} else {
    for (inti = 0; i < numFetchers; ++i) {
        fetchers[i] = newFetcher < K,
        V > (jobConf, reduceId, scheduler, merger, reporter, metrics, this, reduceTask.getShuffleSecret());
        fetchers[i].start();
    }
}.........................

接下来进行Fetcher线程里面，看看Fetcher.run函数运行流程：..........................MapHost host = null;
try {
    //If merge is on, block
    merger.waitForResource();从ShuffleScheduler中取出一个MapHost实例，
    //Get a host to shuffle from
    host = scheduler.getHost();
    metrics.threadBusy();执行shuffle操作。
    //Shuffle
    copyFromHost(host);
} finally {
    if (host != null) {
        scheduler.freeHost(host);
        metrics.threadFree();
    }
}接下来看看ShuffleScheduler中的getHost函数：........如果pendingHosts的值没有，先wait住，等待EventFetcher线程去获取数据来notify此wait
while (pendingHosts.isEmpty()) {
    wait();
}

MapHost host = null;
Iterator < MapHost > iter = pendingHosts.iterator();从pendingHosts中random出一个MapHost，并返回给调用程序。intnumToPick = random.nextInt(pendingHosts.size());
for (inti = 0; i <= numToPick; ++i) {
    host = iter.next();
}

pendingHosts.remove(host);........................当得到一个MapHost后，执行copyFromHost来进行数据的copy操作。此时，一个task的host的url样子基本上是这个样子：host: port / mapOutput ? job = xxx & reduce = 123(当前reduce的partid值) & map = copyFromHost的代码部分：.....List < TaskAttemptID > maps = scheduler.getMapsForHost(host);.....Set < TaskAttemptID > remaining = newHashSet < TaskAttemptID > (maps);.....此部分完成后，url样子中map = 后面会有很多个mapid，多个用英文的”,
”号分开的。URLurl = getMapOutputURL(host, maps);此处根据url打开httpconnection,
如果mapreduce.shuffle.ssl.enabled配置为true时，会打开SSL连接。默认为false.openConnection(url);.....设置连接超时时间，header,
读取超时时间等值。并打开HttpConnection的连接。
// put url hashinto http header
connection.addRequestProperty(SecureShuffleUtils.HTTP_HEADER_URL_HASH, encHash);
//set the read timeout
connection.setReadTimeout(readTimeout);
//put shuffle version into httpheader
connection.addRequestProperty(ShuffleHeader.HTTP_HEADER_NAME, ShuffleHeader.DEFAULT_HTTP_HEADER_NAME);
connection.addRequestProperty(ShuffleHeader.HTTP_HEADER_VERSION, ShuffleHeader.DEFAULT_HTTP_HEADER_VERSION);
connect(connection, connectionTimeout);.....执行文件的copy操作。此处是迭代执行，每一个读取一个map的文件。并把remaining中的值去掉一个。直到remaining的值全部读取完成。TaskAttemptID[] failedTasks = null;
while (!remaining.isEmpty() && failedTasks == null) {在copyMapOutput函数中，每次读取一个mapid,
    根据MergeManagerImpl中的reserve函数，1.检查map的输出是否超过了mapreduce.reduce.memory.totalbytes配置的大小。此配置的默认值是当前Runtime的maxMemory * mapreduce.reduce.shuffle.input.buffer.percent配置的值。Buffer.percent的默认值为0.90;如果mapoutput超过了此配置的大小时,
    生成一个OnDiskMapOutput实例。2.如果没有超过此大小，生成一个InMemoryMapOutput实例。failedTasks = copyMapOutput(host, input, remaining);
}在copyMapOutput函数中首先调用的MergeManagerImpl.reserve函数：
if (!canShuffleToMemory(requestedSize)) {.....returnnewOnDiskMapOutput < K,
    V > (mapId, reduceId, this, requestedSize, jobConf, mapOutputFile, fetcher, true);
}.....
if (usedMemory > memoryLimit) {.....,
    当前使用的memory已经超过了配置的内存使用大小，此时返回null，把host重新添加到shuffleScheduler的pendingHosts队列中。returnnull;
}
returnunconditionalReserve(mapId, requestedSize, true);生成一个InMemoryMapOutput,
并把usedMemory加上此mapoutput的大小。privatesynchronizedInMemoryMapOutput < K,
V > unconditionalReserve(TaskAttemptID mapId, longrequestedSize, booleanprimaryMapOutput) {
    usedMemory += requestedSize;
    returnnewInMemoryMapOutput < K,
    V > (jobConf, mapId, this, (int) requestedSize, codec, primaryMapOutput);
}

下面是当usedMemory使用超过了指定的大小后，的处理部分，重新把host添加到队列中。如下所示：copyMapOutput函数
if (mapOutput == null) {
    LOG.info("fetcher#" + id + "- MergeManager returned status WAIT ...");
    //Notan error but wait to process data.
    returnEMPTY_ATTEMPT_ID_ARRAY;
}此时host中还有没处理完成的mapoutput,
在Fetcher.run中，重新添加到队列中把此host
if (host != null) {
    scheduler.freeHost(host);
    metrics.threadFree();
}.........接下来还是在copyMapOutput函数中，通过mapoutput也就是merge.reserve函数返回的实例的shuffle函数。如果mapoutput是InMemoryMapOutput,
在调用shuffle时，直接把map输出写入到内存。如果是OnDiskMapOutput,
在调用shuffle时，直接把map的输出写入到local临时文件中。....最后，执行ShuffleScheduler.copySucceeded完成文件的copy,
调用mapout.commit函数。scheduler.copySucceeded(mapId, host, compressedLength, endTime - startTime, mapOutput);并从remaining中移出处理过的mapid,

接下来看看MapOutput.commit函数：a.InMemoryMapOutput.commit函数：publicvoidcommit() throwsIOException {
    merger.closeInMemoryFile(this);
}调用MergeManagerImpl.closeInMemoryFile函数: publicsynchronizedvoidcloseInMemoryFile(InMemoryMapOutput < K, V > mapOutput) {把此mapOutput实例添加到inMemoryMapOutputs列表中。inMemoryMapOutputs.add(mapOutput);
    LOG.info("closeInMemoryFile-> map-output of size: " + mapOutput.getSize() + ",inMemoryMapOutputs.size() -> " + inMemoryMapOutputs.size() + ",commitMemory -> " + commitMemory + ", usedMemory ->" + usedMemory);把commitMemory的大小增加当前传入的mapoutput的size大小。commitMemory += mapOutput.getSize();检查是否达到merge的值，此值是mapreduce.reduce.memory.totalbytes配置 * mapreduce.reduce.shuffle.merge.percent配置的值，默认是当前Runtime的memory * 0.90 * 0.90也就是说，只有有新的mapoutput加入，这个检查条件就肯定会达到
    //Can hang if mergeThreshold is really low.
    if (commitMemory >= mergeThreshold) {.......把正在进行merge的mapoutput列表添加到一起发起merge操作。inMemoryMapOutputs.addAll(inMemoryMergedMapOutputs);
        inMemoryMergedMapOutputs.clear();
        inMemoryMerger.startMerge(inMemoryMapOutputs);
        commitMemory = 0L; // Reset commitMemory.
    }如果mapreduce.reduce.merge.memtomem.enabled配置为true,
    默认为false同时inMemoryMapOutputs中的mapoutput个数达到了mapreduce.reduce.merge.memtomem.threshold配置的值，默认值是mapreduce.task.io.sort.factor配置的值，默认为100发起memTomem的merger操作。
    if (memToMemMerger != null) {
        if (inMemoryMapOutputs.size() >= memToMemMergeOutputsThreshold) {
            memToMemMerger.startMerge(inMemoryMapOutputs);
        }
    }
}

MergemanagerImpl.InMemoryMerger.merger函数操作：在执行inMemoryMerger.startMerge(inMemoryMapOutputs);操作后，会notify此线程，同时执行merger函数：publicvoidmerge(List < InMemoryMapOutput < K, V >> inputs) throwsIOException {
    if (inputs == null || inputs.size() == 0) {
        return;
    }....................TaskAttemptID mapId = inputs.get(0).getMapId();
    TaskID mapTaskId = mapId.getTaskID();

    List < Segment < K,
    V >> inMemorySegments = newArrayList < Segment < K,
    V >> ();生成InMemoryReader实例，并把传入的容器清空，把生成好后的segment放到到inmemorysegments中。longmergeOutputSize = createInMemorySegments(inputs, inMemorySegments, 0);
    intnoInMemorySegments = inMemorySegments.size();生成一个输出的文件路径，Path outputPath = mapOutputFile.getInputFileForWrite(mapTaskId, mergeOutputSize).suffix(Task.MERGED_OUTPUT_PREFIX);针对输出的临时文件生成一个Write实例。Writer < K,
    V > writer = newWriter < K,
    V > (jobConf, rfs, outputPath, (Class < K > ) jobConf.getMapOutputKeyClass(), (Class < V > ) jobConf.getMapOutputValueClass(), codec, null);

    RawKeyValueIterator rIter = null;
    CompressAwarePathcompressAwarePath;
    try {
        LOG.info("Initiatingin-memory merge with " + noInMemorySegments + "segments...");此部分与map端的输出没什么区别，得到几个segment的文件的一个iterator,
        此部分是一个优先堆，每一次next都会从所有的segment中读取出最小的一个key与value rIter = Merger.merge(jobConf, rfs, (Class < K > ) jobConf.getMapOutputKeyClass(), (Class < V > ) jobConf.getMapOutputValueClass(), inMemorySegments, inMemorySegments.size(), newPath(reduceId.toString()), (RawComparator < K > ) jobConf.getOutputKeyComparator(), reporter, spilledRecordsCounter, null, null);如果没有combiner程序，直接写入到文件，否则，如果有combiner，先执行combiner处理。
        if (null == combinerClass) {
            Merger.writeFile(rIter, writer, reporter, jobConf);
        } else {
            combineCollector.setWriter(writer);
            combineAndSpill(rIter, reduceCombineInputCounter);
        }
        writer.close();此处与map端的输出不同的地方在这里，这里不写入spillindex文件，而是生成一个CompressAwarePath，把输出路径,
        大小写入到此实例中。compressAwarePath = newCompressAwarePath(outputPath, writer.getRawLength(), writer.getCompressedLength());

        LOG.info(reduceId + "Merge of the " + noInMemorySegments + "files in-memory complete." + "Local file is " + outputPath + "of size " + localFS.getFileStatus(outputPath).getLen());
    } catch(IOException e) {
        //makesure that we delete the ondiskfile that we created
        //earlierwhen we invoked cloneFileAttributes
        localFS.delete(outputPath, true);
        throwe;
    }此处，把生成的文件添加到onDiskMapOutputs属性中，并检查此容器中的文件是否达到了mapreduce.task.io.sort.factor配置的值，如果是，发起disk的merger操作。
    //Note the output of the merge
    closeOnDiskFile(compressAwarePath);
}

}上面最后一行的全部定义在下面这里。publicsynchronizedvoidcloseOnDiskFile(CompressAwarePath file) {
    onDiskMapOutputs.add(file);
    if (onDiskMapOutputs.size() >= (2 * ioSortFactor - 1)) {
        onDiskMerger.startMerge(onDiskMapOutputs);
    }
}

b.OnDiskMapOutput.commit函数：把tmp文件rename到指定的目录下，生成一个CompressAwarePath实例，调用上面提到的处理程序。publicvoidcommit() throwsIOException {
    fs.rename(tmpOutputPath, outputPath);
    CompressAwarePathcompressAwarePath = newCompressAwarePath(outputPath, getSize(), this.compressedSize);
    merger.closeOnDiskFile(compressAwarePath);
}

MergeManagerImpl.OnDiskMerger.merger函数：这个函数到现在基本上没有什么可以解说的东西，注意一点就是，每merge一个文件后，会把这个merge后的文件路径重新添加到onDiskMapOutputs容器中。publicvoidmerge(List < CompressAwarePath > inputs) throwsIOException {
    //sanity check
    if (inputs == null || inputs.isEmpty()) {
        LOG.info("Noondisk files to merge...");
        return;
    }
    longapproxOutputSize = 0;
    intbytesPerSum = jobConf.getInt("io.bytes.per.checksum", 512);
    LOG.info("OnDiskMerger:We have " + inputs.size() + "map outputs on disk. Triggering merge...");
    //1. Prepare the list of files to be merged.
    for (CompressAwarePath file: inputs) {
        approxOutputSize += localFS.getFileStatus(file).getLen();
    }

    //add the checksum length
    approxOutputSize += ChecksumFileSystem.getChecksumLength(approxOutputSize, bytesPerSum);

    //2. Start the on-disk merge process
    Path outputPath = localDirAllocator.getLocalPathForWrite(inputs.get(0).toString(), approxOutputSize, jobConf).suffix(Task.MERGED_OUTPUT_PREFIX);
    Writer < K,
    V > writer = newWriter < K,
    V > (jobConf, rfs, outputPath, (Class < K > ) jobConf.getMapOutputKeyClass(), (Class < V > ) jobConf.getMapOutputValueClass(), codec, null);
    RawKeyValueIterator iter = null;
    CompressAwarePathcompressAwarePath;
    Path tmpDir = newPath(reduceId.toString());
    try {
        iter = Merger.merge(jobConf, rfs, (Class < K > ) jobConf.getMapOutputKeyClass(), (Class < V > ) jobConf.getMapOutputValueClass(), codec, inputs.toArray(newPath[inputs.size()]), true, ioSortFactor, tmpDir, (RawComparator < K > ) jobConf.getOutputKeyComparator(), reporter, spilledRecordsCounter, null, mergedMapOutputsCounter, null);

        Merger.writeFile(iter, writer, reporter, jobConf);
        writer.close();
        compressAwarePath = newCompressAwarePath(outputPath, writer.getRawLength(), writer.getCompressedLength());
    } catch(IOException e) {
        localFS.delete(outputPath, true);
        throwe;
    }

    closeOnDiskFile(compressAwarePath);

    LOG.info(reduceId + "Finished merging " + inputs.size() + "map output files on disk of total-size " + approxOutputSize + "." + "Local output file is " + outputPath + " of size " + localFS.getFileStatus(outputPath).getLen());
}
}

ok，现在map的copy部分执行完成，回到ShuffleConsumerPlugin的run方法中，也就是Shuffle的run方法中，接着上面的代码向下分析：此处等待所有的copy操作完成，
//Wait for shuffle to complete successfully
while (!scheduler.waitUntilDone(PROGRESS_FREQUENCY)) {
    reporter.progress();
    synchronized(this) {
        if (throwable != null) {
            thrownewShuffleError("error in shuffle in " + throwingThreadName, throwable);
        }
    }
}如果执行到这一行时，说明所有的mapcopy操作已经完成，关闭查找map运行状态的线程与执行copy操作的几个线程。
//Stop the event-fetcher thread
eventFetcher.shutDown();
//Stop the map-output fetcher threads
for (Fetcher < K, V > fetcher: fetchers) {
    fetcher.shutDown();
}
//stop the scheduler
scheduler.close();发am发送状态，通知AM，此时要执行排序操作。copyPhase.complete(); // copy is already complete
taskStatus.setPhase(TaskStatus.Phase.SORT);reduceTask.statusUpdate(umbilical);

执行最后的merge, 其实在合并所有文件与memory中的数据时，也同时会进行排序操作。
//Finish the on-going merges...
RawKeyValueIterator kvIter = null;
try {
    kvIter = merger.close();
} catch(Throwable e) {
    thrownewShuffleError("Error while doingfinal merge ", e);
}

//Sanity check
synchronized(this) {
    if (throwable != null) {
        thrownewShuffleError("error in shuffle in " + throwingThreadName, throwable);
    }
}最后返回这个合并后的iterator实例。returnkvIter;

Merger也就是MergeManagerImpl.close函数：publicRawKeyValueIterator close() throwsThrowable {关闭几个merge的线程，在关闭时会等待现有的merge完成。
    //Wait for on-going merges to complete
    if (memToMemMerger != null) {
        memToMemMerger.close();
    }
    inMemoryMerger.close();
    onDiskMerger.close();
    List < InMemoryMapOutput < K,
    V >> memory = newArrayList < InMemoryMapOutput < K,
    V >> (inMemoryMergedMapOutputs);
    inMemoryMergedMapOutputs.clear();
    memory.addAll(inMemoryMapOutputs);
    inMemoryMapOutputs.clear();
    List < CompressAwarePath > disk = newArrayList < CompressAwarePath > (onDiskMapOutputs);
    onDiskMapOutputs.clear();执行最终的merge操作。returnfinalMerge(jobConf, rfs, memory, disk);
}最后的一个merge操作privateRawKeyValueIterator finalMerge(JobConf job, FileSystem fs, List < InMemoryMapOutput < K, V >> inMemoryMapOutputs, List < CompressAwarePath > onDiskMapOutputs) throwsIOException {
    LOG.info("finalMergecalled with " + inMemoryMapOutputs.size() + " in-memory map-outputs and " + onDiskMapOutputs.size() + "on-disk map-outputs");
    finalfloatmaxRedPer = job.getFloat(MRJobConfig.REDUCE_INPUT_BUFFER_PERCENT, 0f);
    if (maxRedPer > 1.0 || maxRedPer < 0.0) {
        thrownewIOException(MRJobConfig.REDUCE_INPUT_BUFFER_PERCENT + maxRedPer);
    }得到可以cache到内存的大小,
    比例通过mapreduce.reduce.input.buffer.percent配置，intmaxInMemReduce = (int) Math.min(Runtime.getRuntime().maxMemory() * maxRedPer, Integer.MAX_VALUE);

    //merge configparams
    Class < K > keyClass = (Class < K > ) job.getMapOutputKeyClass();
    Class < V > valueClass = (Class < V > ) job.getMapOutputValueClass();
    booleankeepInputs = job.getKeepFailedTaskFiles();
    finalPath tmpDir = newPath(reduceId.toString());
    finalRawComparator < K > comparator = (RawComparator < K > ) job.getOutputKeyComparator();

    //segments required to vacate memory
    List < Segment < K,
    V >> memDiskSegments = newArrayList < Segment < K,
    V >> ();
    longinMemToDiskBytes = 0;
    booleanmergePhaseFinished = false;
    if (inMemoryMapOutputs.size() > 0) {
        TaskID mapId = inMemoryMapOutputs.get(0).getMapId().getTaskID();这个地方根据可cache到内存的值，把不能cache到内存的部分生成InMemoryReader实例，并添加到memDiskSegments容器中。inMemToDiskBytes = createInMemorySegments(inMemoryMapOutputs, memDiskSegments, maxInMemReduce);
        finalintnumMemDiskSegments = memDiskSegments.size();把内存中多于部分的mapoutput数据写入到文件中，并把文件路径添加到onDiskMapOutputs容器中。
        if (numMemDiskSegments > 0 && ioSortFactor > onDiskMapOutputs.size()) {...........此部分主要是写入内存中多于的mapoutput到磁盘中去mergePhaseFinished = true;
            //must spill to disk, but can't retain in-memfor intermediate merge
            finalPath outputPath = mapOutputFile.getInputFileForWrite(mapId, inMemToDiskBytes).suffix(Task.MERGED_OUTPUT_PREFIX);
            finalRawKeyValueIterator rIter = Merger.merge(job, fs, keyClass, valueClass, memDiskSegments, numMemDiskSegments, tmpDir, comparator, reporter, spilledRecordsCounter, null, mergePhase);
            Writer < K,
            V > writer = newWriter < K,
            V > (job, fs, outputPath, keyClass, valueClass, codec, null);
            try {
                Merger.writeFile(rIter, writer, reporter, job);
                writer.close();
                onDiskMapOutputs.add(newCompressAwarePath(outputPath, writer.getRawLength(), writer.getCompressedLength()));
                writer = null;
                //add to list of final disk outputs.
            } catch(IOException e) {
                if (null != outputPath) {
                    try {
                        fs.delete(outputPath, true);
                    } catch(IOException ie) {
                        //NOTHING
                    }
                }
                throwe;
            } finally {
                if (null != writer) {
                    writer.close();
                }
            }
            LOG.info("Merged" + numMemDiskSegments + "segments, " + inMemToDiskBytes + "bytes to disk to satisfy " + "reducememory limit");
            inMemToDiskBytes = 0;
            memDiskSegments.clear();
        }
        elseif(inMemToDiskBytes != 0) {
            LOG.info("Keeping" + numMemDiskSegments + "segments, " + inMemToDiskBytes + "bytes in memory for " + "intermediate,on-disk merge");
        }
    }

    //segments on disk
    List < Segment < K,
    V >> diskSegments = newArrayList < Segment < K,
    V >> ();
    longonDiskBytes = inMemToDiskBytes;
    longrawBytes = inMemToDiskBytes;生成目前文件中有的所有的mapoutput路径的onDisk数组CompressAwarePath[] onDisk = onDiskMapOutputs.toArray(newCompressAwarePath[onDiskMapOutputs.size()]);
    for (CompressAwarePath file: onDisk) {
        longfileLength = fs.getFileStatus(file).getLen();
        onDiskBytes += fileLength;
        rawBytes += (file.getRawDataLength() > 0) ? file.getRawDataLength() : fileLength;

        LOG.debug("Diskfile: " + file + "Length is " + fileLength);把现在reduce端接收过来并存储到文件中的mapoutput生成segment并添加到distSegments容器中diskSegments.add(newSegment < K, V > (job, fs, file, codec, keepInputs, (file.toString().endsWith(Task.MERGED_OUTPUT_PREFIX) ? null: mergedMapOutputsCounter), file.getRawDataLength()));
    }
    LOG.info("Merging" + onDisk.length + " files, " + onDiskBytes + "bytes from disk");按内容的大小从小到大排序此distSegments容器Collections.sort(diskSegments, newComparator < Segment < K, V >> () {
        publicintcompare(Segment < K, V > o1, Segment < K, V > o2) {
            if (o1.getLength() == o2.getLength()) {
                return0;
            }
            returno1.getLength() < o2.getLength() ? -1 : 1;
        }
    });把现在memory中所有的mapoutput内容生成segment并添加到finalSegments容器中。
    //build final list of segments from merged backed by disk + in-mem
    List < Segment < K,
    V >> finalSegments = newArrayList < Segment < K,
    V >> ();
    longinMemBytes = createInMemorySegments(inMemoryMapOutputs, finalSegments, 0);
    LOG.info("Merging" + finalSegments.size() + "segments, " + inMemBytes + "bytes from memory into reduce");
    if (0 != onDiskBytes) {
        finalintnumInMemSegments = memDiskSegments.size();
        diskSegments.addAll(0, memDiskSegments);
        memDiskSegments.clear();
        //Pass mergePhase only if there is a going to be intermediate
        //merges. See comment where mergePhaseFinished is being set
        Progress thisPhase = (mergePhaseFinished) ? null: mergePhase;这个部分是把现在磁盘上的mapoutput生成一个iterator,
        RawKeyValueIterator diskMerge = Merger.merge(job, fs, keyClass, valueClass, codec, diskSegments, ioSortFactor, numInMemSegments, tmpDir, comparator, reporter, false, spilledRecordsCounter, null, thisPhase);
        diskSegments.clear();
        if (0 == finalSegments.size()) {
            returndiskMerge;
        }把现在磁盘上的iterator也同样添加到finalSegments容器中，也就是此时，这个容器中有两个优先堆排序的队列，每next一次，要从内存与磁盘中找出最小的一个kv.finalSegments.add(newSegment < K, V > (newRawKVIteratorReader(diskMerge, onDiskBytes), true, rawBytes));
    }
    returnMerger.merge(job, fs, keyClass, valueClass, finalSegments, finalSegments.size(), tmpDir, comparator, reporter, spilledRecordsCounter, null, null);
}

shuffle部分现在全部执行完成，重新加到ReduceTask.run函数中，接着代码向下分析：rIter = shuffleConsumerPlugin.run();............RawComparatorcomparator = job.getOutputValueGroupingComparator();
if (useNewApi) {
    runNewReducer(job, umbilical, reporter, rIter, comparator, keyClass, valueClass);
} else {
    runOldReducer........
}在以上代码中执行runNewReducer主要是执行reduce的run函数，org.apache.hadoop.mapreduce.TaskAttemptContexttaskContext = neworg.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job, getTaskID(), reporter);
//make a reducer
org.apache.hadoop.mapreduce.Reducer < INKEY, INVALUE, OUTKEY, OUTVALUE > reducer = (org.apache.hadoop.mapreduce.Reducer < INKEY, INVALUE, OUTKEY, OUTVALUE > ) ReflectionUtils.newInstance(taskContext.getReducerClass(), job);org.apache.hadoop.mapreduce.RecordWriter < OUTKEY, OUTVALUE > trackedRW = newNewTrackingRecordWriter < OUTKEY, OUTVALUE > (this, taskContext);job.setBoolean("mapred.skip.on", isSkipping());job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());org.apache.hadoop.mapreduce.Reducer.Context reducerContext = createReduceContext(reducer, job, getTaskID(), rIter, reduceInputKeyCounter, reduceInputValueCounter, trackedRW, committer, reporter, comparator, keyClass, valueClass);
try {
    reducer.run(reducerContext);
} finally {
    trackedRW.close(reducerContext);
}

以上代码中创建Reducer运行的Context, 并执行reducer.run函数：createReduceContext函数定义部分代码：org.apache.hadoop.mapreduce.ReduceContext < INKEY, INVALUE, OUTKEY, OUTVALUE > reduceContext = newReduceContextImpl < INKEY, INVALUE, OUTKEY, OUTVALUE > (job, taskId, rIter, inputKeyCounter, inputValueCounter, output, committer, reporter, comparator, keyClass, valueClass);

org.apache.hadoop.mapreduce.Reducer < INKEY, INVALUE, OUTKEY, OUTVALUE > .Context reducerContext = newWrappedReducer < INKEY, INVALUE, OUTKEY, OUTVALUE > ().getReducerContext(reduceContext);ReduceContextImpl主要是执行在RawKeyValueInterator中读取数据的相关操作。Reducer.run函数：publicvoid run(Context context) throwsIOException, InterruptedException {
    setup(context);
    try {
        while (context.nextKey()) {
            reduce(context.getCurrentKey(), context.getValues(), context);
            //If a back up store is used, reset it
            Iterator < VALUEIN > iter = context.getValues().iterator();
            if (iterinstanceofReduceContext.ValueIterator) { ((ReduceContext.ValueIterator < VALUEIN > ) iter).resetBackupStore();
            }
        }
    } finally {
        cleanup(context);
    }
}在run函数中通过context.nextkey来得到下一行的数据，这部分主要在ReduceContextImpl中完成：nextkey调用nextKeyValue函数：publicboolean nextKeyValue() throwsIOException, InterruptedException {
    if (!hasMore) {
        key = null;
        value = null;
        returnfalse;
    }此处用来检查是否是一个key下面的第一个value,
    如果是第一个value时，此值为false,
    也就是说，nextKeyIsSame的值是true时，表示现在next的数据与current的key是一行数据。否则表示已经进行了换行操作。firstValue = !nextKeyIsSame;执行一下RawKeyValueInterator(也就是Merge中的队列)，得到当前最小的key DataInputBuffer nextKey = input.getKey();把key设置到buffer中，设置到buffer中的目的是为了通过keyDeserializer来读取一个key的值。currentRawKey.set(nextKey.getData(), nextKey.getPosition(), nextKey.getLength() - nextKey.getPosition());
    buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());从buffer中读取key的值，并存储到key中，这个地方要注意一下，下面先看看这部分的定义：.........................生成一个key的Deserializer实例，this.keyDeserializer = serializationFactory.getDeserializer(keyClass);把buffer当成keyDeserializer的InputStream。this.keyDeserializer.open(buffer);
    Deserializer中执行deserializer函数的定义：此部分定义可以看出，一个key / value只会生成实例，此部分从性能上考虑主要是为了减少对象的生成。每次生成一个数据时，都是通过readFields重新去生成Writable实例中的内容，因此，很多同学在reduce中使用value时，会出现数据引用不对的情况，因为对象还是同一个对象，但值是最后一个，所以会出现数据不对的情况publicWritable deserialize(Writable w) throwsIOException {
        Writable writable;
        if (w == null) {
            writable = (Writable) ReflectionUtils.newInstance(writableClass, getConf());
        } else {
            writable = w;
        }
        writable.readFields(dataIn);
        returnwritable;
    }.........................读取key的内容key = keyDeserializer.deserialize(key);按key相同的方式，得到当前的value的值，DataInputBuffer nextVal = input.getValue();
    buffer.reset(nextVal.getData(), nextVal.getPosition(), nextVal.getLength() - nextVal.getPosition());
    value = valueDeserializer.deserialize(value);

    currentKeyLength = nextKey.getLength() - nextKey.getPosition();
    currentValueLength = nextVal.getLength() - nextVal.getPosition();

    isMarked的值为false,
    同时backupStore属性为null
    if (isMarked) {
        backupStore.write(nextKey, nextVal);
    }把input执行一次next操作，此处会从所有的文件 / memory中找到最小的一个kv.hasMore = input.next();
    if (hasMore) {比较一下，是否与currentkey是同一个key,
        如果是表示在同一行中。也就是key相同。nextKey = input.getKey();
        nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0, currentRawKey.getLength(), nextKey.getData(), nextKey.getPosition(), nextKey.getLength() - nextKey.getPosition()) == 0;
    } else {
        nextKeyIsSame = false;
    }
    inputValueCounter.increment(1);
    returntrue;
}

接下来是调用reduce函数，此时会通过context.getValues函数把key对应的所有的value传给reduce.此处的context.getValues如下所示：ReduceContextImpl.getValues() public Iterable < VALUEIN > getValues() throwsIOException, InterruptedException {
    returniterable;
}以上代码中直接返回的是iterable的实例，此实例在ReduceContextImpl实例生成时生成。privateValueIterable iterable = newValueIterable();这个类是ReduceContextImpl中的内部类protectedclass ValueIterable implementsIterable < VALUEIN > {
    privateValueIterator iterator = newValueIterator();@Override publicIterator < VALUEIN > iterator() {
        returniterator;
    }
}此实例中引用一个ValueIterator类，这也是一个内部类。每次进行执行时，通过此ValueIterator.next来获取一条数据，publicVALUEIN next() {
    inReset的值默认为false.也就是说inReset检查内部的代码不会执行，其实backupStore本身值就是null如果想使用backupStore,
    需要执行其内部的make函数。
    if (inReset) {.................里面的代码不分析
    }如果是key下面的第一个value,
    把firstValue设置为false,
    因为下一次来时，就不是firstValue了.返回当前的value
    //if this is the first record, we don't need to advance
    if (firstValue) {
        firstValue = false;
        returnvalue;
    }
    //if this isn't the first record and the next key is different, they
    //can't advance it here.
    if (!nextKeyIsSame) {
        thrownewNoSuchElementException("iteratepast last value");
    }
    //otherwise, go to the next key/value pair
    try {这里表示不是第一个value的时候，也就是firstValue的值为false,
        执行一下nextKeyValue函数，得到当前的value.返回。nextKeyValue();
        returnvalue;
    } catch(IOException ie) {
        thrownewRuntimeException("next valueiterator failed", ie);
    } catch(InterruptedException ie) {
        //this is bad, but we can't modify the exception list of java.util
        thrownewRuntimeException("next valueiterator interrupted", ie);
    }
}

当reduce执行完成后的输出，跟map端无reduce时的输出一样。直接输出。

Class < ?extendsShuffleConsumerPlugin > clazz = job.getClass(MRConfig.SHUFFLE_CONSUMER_PLUGIN, Shuffle.class, ShuffleConsumerPlugin.class);shuffleConsumerPlugin = ReflectionUtils.newInstance(clazz, job);LOG.info("UsingShuffleConsumerPlugin: " + shuffleConsumerPlugin);
ShuffleConsumerPlugin.ContextshuffleContext = newShuffleConsumerPlugin.Context(getTaskID(), job, FileSystem.getLocal(job), umbilical, super.lDirAlloc, reporter, codec, combinerClass, combineCollector, spilledRecordsCounter, reduceCombineInputCounter, shuffledMapsCounter, reduceShuffleBytes, failedShuffleCounter, mergedMapOutputsCounter, taskStatus, copyPhase, sortPhase, this, mapOutputFile, localMapFiles);shuffleConsumerPlugin.init(shuffleContext);执行shuffle的run函数，得到RawKeyValueIterator的实例。rIter = shuffleConsumerPlugin.run();
Shuffle.run函数定义：.....................................
inteventsPerReducer = Math.max(MIN_EVENTS_TO_FETCH, MAX_RPC_OUTSTANDING_EVENTS / jobConf.getNumReduceTasks());intmaxEventsToFetch = Math.min(MAX_EVENTS_TO_FETCH, eventsPerReducer);生成map的完成状态获取线程，并启动此线程，此线程中从am中获取此job中所有完成的map的event通过ShuffleSchedulerImpl实例把所有的map的完成的map的host,mapid,等记录到mapLocations容器中。此线程每一秒执行一个获取操作。//Start the map-completion events fetcher threadfinalEventFetcher < K,V > eventFetcher = newEventFetcher < K,V > (reduceId, umbilical, scheduler, this, maxEventsToFetch);eventFetcher.start();下面看看EventFetcher.run函数的执行过程：以下代码中我只保留了代码的主体部分。...................EventFetcher.run: publicvoid run() { intfailures = 0;........................intnumNewMaps = getMapCompletionEvents();..................................}......................}EventFetcher.getMapCompletionEvents..................................MapTaskCompletionEventsUpdateupdate = umbilical.getMapCompletionEvents((org.apache.hadoop.mapred.JobID) reduce.getJobID(), fromEventIdx, maxEventsToFetch, (org.apache.hadoop.mapred.TaskAttemptID) reduce);events = update.getMapTaskCompletionEvents();.....................for (TaskCompletionEvent event: events) { scheduler.resolve(event); if (TaskCompletionEvent.Status.SUCCEEDED == event.getTaskStatus()) {++numNewMaps; }}shecduler是ShuffleShedulerImpl的实例。ShuffleShedulerImpl.resolve caseSUCCEEDED: URI u = getBaseURI(reduceId, event.getTaskTrackerHttp());addKnownMapOutput(u.getHost() + ":" + u.getPort(), u.toString(), event.getTaskAttemptId());maxMapRuntime = Math.max(maxMapRuntime, event.getTaskRunTime());break;.......ShuffleShedulerImpl.addKnownMapOutput函数：把mapid与对应的host添加到mapLocations容器中，MapHost host = mapLocations.get(hostName);if (host == null) { host = newMapHost(hostName, hostUrl); mapLocations.put(hostName, host);}此时会把host的状设置为PENDING host.addKnownMap(mapId);同时把host添加到pendingHosts容器中。notify相关的Fetcher文件copy线程。//Mark the host as pendingif (host.getState() == State.PENDING) { pendingHosts.add(host); notifyAll();}.....................
回到ReduceTask.run函数中，接着向下执行//Start the map-output fetcher threadsbooleanisLocal = localMapFiles != null;通过mapreduce.reduce.shuffle.parallelcopies配置的值，默认为5，生成获取map数据的线程数。生成Fetcher线程实例，并启动相关的线程。通过mapreduce.reduce.shuffle.connect.timeout配置连接超时时间。默认180000通过mapreduce.reduce.shuffle.read.timeout配置读取超时时间，默认为180000 finalintnumFetchers = isLocal ? 1 : jobConf.getInt(MRJobConfig.SHUFFLE_PARALLEL_COPIES, 5);Fetcher < K,V > [] fetchers = newFetcher[numFetchers];if (isLocal) { fetchers[0] = newLocalFetcher < K, V > (jobConf, reduceId, scheduler, merger, reporter, metrics, this, reduceTask.getShuffleSecret(), localMapFiles); fetchers[0].start();} else { for (inti = 0; i < numFetchers; ++i) { fetchers[i] = newFetcher < K, V > (jobConf, reduceId, scheduler, merger, reporter, metrics, this, reduceTask.getShuffleSecret()); fetchers[i].start(); }}.........................
接下来进行Fetcher线程里面，看看Fetcher.run函数运行流程：..........................MapHost host = null;try { //If merge is on, block merger.waitForResource();从ShuffleScheduler中取出一个MapHost实例， //Get a host to shuffle from host = scheduler.getHost(); metrics.threadBusy();执行shuffle操作。 //Shuffle copyFromHost(host);} finally { if (host != null) { scheduler.freeHost(host); metrics.threadFree(); }}接下来看看ShuffleScheduler中的getHost函数：........如果pendingHosts的值没有，先wait住，等待EventFetcher线程去获取数据来notify此waitwhile (pendingHosts.isEmpty()) { wait();}
MapHost host = null;Iterator < MapHost > iter = pendingHosts.iterator();从pendingHosts中random出一个MapHost，并返回给调用程序。intnumToPick = random.nextInt(pendingHosts.size());for (inti = 0; i <= numToPick; ++i) { host = iter.next();}
pendingHosts.remove(host);........................当得到一个MapHost后，执行copyFromHost来进行数据的copy操作。此时，一个task的host的url样子基本上是这个样子：host: port / mapOutput ? job = xxx & reduce = 123(当前reduce的partid值) & map = copyFromHost的代码部分：.....List < TaskAttemptID > maps = scheduler.getMapsForHost(host);.....Set < TaskAttemptID > remaining = newHashSet < TaskAttemptID > (maps);.....此部分完成后，url样子中map = 后面会有很多个mapid，多个用英文的”,”号分开的。URLurl = getMapOutputURL(host, maps);此处根据url打开httpconnection,如果mapreduce.shuffle.ssl.enabled配置为true时，会打开SSL连接。默认为false.openConnection(url);.....设置连接超时时间，header,读取超时时间等值。并打开HttpConnection的连接。// put url hashinto http headerconnection.addRequestProperty(SecureShuffleUtils.HTTP_HEADER_URL_HASH, encHash);//set the read timeoutconnection.setReadTimeout(readTimeout);//put shuffle version into httpheaderconnection.addRequestProperty(ShuffleHeader.HTTP_HEADER_NAME, ShuffleHeader.DEFAULT_HTTP_HEADER_NAME);connection.addRequestProperty(ShuffleHeader.HTTP_HEADER_VERSION, ShuffleHeader.DEFAULT_HTTP_HEADER_VERSION);connect(connection, connectionTimeout);.....执行文件的copy操作。此处是迭代执行，每一个读取一个map的文件。并把remaining中的值去掉一个。直到remaining的值全部读取完成。TaskAttemptID[] failedTasks = null;while (!remaining.isEmpty() && failedTasks == null) {在copyMapOutput函数中，每次读取一个mapid, 根据MergeManagerImpl中的reserve函数，1.检查map的输出是否超过了mapreduce.reduce.memory.totalbytes配置的大小。此配置的默认值是当前Runtime的maxMemory * mapreduce.reduce.shuffle.input.buffer.percent配置的值。Buffer.percent的默认值为0.90;如果mapoutput超过了此配置的大小时, 生成一个OnDiskMapOutput实例。2.如果没有超过此大小，生成一个InMemoryMapOutput实例。failedTasks = copyMapOutput(host, input, remaining);}在copyMapOutput函数中首先调用的MergeManagerImpl.reserve函数：if (!canShuffleToMemory(requestedSize)) {.....returnnewOnDiskMapOutput < K, V > (mapId, reduceId, this, requestedSize, jobConf, mapOutputFile, fetcher, true);}.....if (usedMemory > memoryLimit) {....., 当前使用的memory已经超过了配置的内存使用大小，此时返回null，把host重新添加到shuffleScheduler的pendingHosts队列中。returnnull;}returnunconditionalReserve(mapId, requestedSize, true);生成一个InMemoryMapOutput,并把usedMemory加上此mapoutput的大小。privatesynchronizedInMemoryMapOutput < K,V > unconditionalReserve(TaskAttemptID mapId, longrequestedSize, booleanprimaryMapOutput) { usedMemory += requestedSize; returnnewInMemoryMapOutput < K, V > (jobConf, mapId, this, (int) requestedSize, codec, primaryMapOutput);}
下面是当usedMemory使用超过了指定的大小后，的处理部分，重新把host添加到队列中。如下所示：copyMapOutput函数if (mapOutput == null) { LOG.info("fetcher#" + id + "- MergeManager returned status WAIT ..."); //Notan error but wait to process data. returnEMPTY_ATTEMPT_ID_ARRAY;}此时host中还有没处理完成的mapoutput,在Fetcher.run中，重新添加到队列中把此hostif (host != null) { scheduler.freeHost(host); metrics.threadFree();}.........接下来还是在copyMapOutput函数中，通过mapoutput也就是merge.reserve函数返回的实例的shuffle函数。如果mapoutput是InMemoryMapOutput,在调用shuffle时，直接把map输出写入到内存。如果是OnDiskMapOutput,在调用shuffle时，直接把map的输出写入到local临时文件中。....最后，执行ShuffleScheduler.copySucceeded完成文件的copy,调用mapout.commit函数。scheduler.copySucceeded(mapId, host, compressedLength, endTime - startTime, mapOutput);并从remaining中移出处理过的mapid,
接下来看看MapOutput.commit函数：a.InMemoryMapOutput.commit函数：publicvoidcommit() throwsIOException { merger.closeInMemoryFile(this);}调用MergeManagerImpl.closeInMemoryFile函数: publicsynchronizedvoidcloseInMemoryFile(InMemoryMapOutput < K, V > mapOutput) {把此mapOutput实例添加到inMemoryMapOutputs列表中。inMemoryMapOutputs.add(mapOutput); LOG.info("closeInMemoryFile-> map-output of size: " + mapOutput.getSize() + ",inMemoryMapOutputs.size() -> " + inMemoryMapOutputs.size() + ",commitMemory -> " + commitMemory + ", usedMemory ->" + usedMemory);把commitMemory的大小增加当前传入的mapoutput的size大小。commitMemory += mapOutput.getSize();检查是否达到merge的值，此值是mapreduce.reduce.memory.totalbytes配置 * mapreduce.reduce.shuffle.merge.percent配置的值，默认是当前Runtime的memory * 0.90 * 0.90也就是说，只有有新的mapoutput加入，这个检查条件就肯定会达到 //Can hang if mergeThreshold is really low. if (commitMemory >= mergeThreshold) {.......把正在进行merge的mapoutput列表添加到一起发起merge操作。inMemoryMapOutputs.addAll(inMemoryMergedMapOutputs); inMemoryMergedMapOutputs.clear(); inMemoryMerger.startMerge(inMemoryMapOutputs); commitMemory = 0L; // Reset commitMemory. }如果mapreduce.reduce.merge.memtomem.enabled配置为true, 默认为false同时inMemoryMapOutputs中的mapoutput个数达到了mapreduce.reduce.merge.memtomem.threshold配置的值，默认值是mapreduce.task.io.sort.factor配置的值，默认为100发起memTomem的merger操作。 if (memToMemMerger != null) { if (inMemoryMapOutputs.size() >= memToMemMergeOutputsThreshold) { memToMemMerger.startMerge(inMemoryMapOutputs); } }}
MergemanagerImpl.InMemoryMerger.merger函数操作：在执行inMemoryMerger.startMerge(inMemoryMapOutputs);操作后，会notify此线程，同时执行merger函数：publicvoidmerge(List < InMemoryMapOutput < K, V >> inputs) throwsIOException { if (inputs == null || inputs.size() == 0) { return; }....................TaskAttemptID mapId = inputs.get(0).getMapId(); TaskID mapTaskId = mapId.getTaskID();
List < Segment < K, V >> inMemorySegments = newArrayList < Segment < K, V >> ();生成InMemoryReader实例，并把传入的容器清空，把生成好后的segment放到到inmemorysegments中。longmergeOutputSize = createInMemorySegments(inputs, inMemorySegments, 0); intnoInMemorySegments = inMemorySegments.size();生成一个输出的文件路径，Path outputPath = mapOutputFile.getInputFileForWrite(mapTaskId, mergeOutputSize).suffix(Task.MERGED_OUTPUT_PREFIX);针对输出的临时文件生成一个Write实例。Writer < K, V > writer = newWriter < K, V > (jobConf, rfs, outputPath, (Class < K > ) jobConf.getMapOutputKeyClass(), (Class < V > ) jobConf.getMapOutputValueClass(), codec, null);
RawKeyValueIterator rIter = null; CompressAwarePathcompressAwarePath; try { LOG.info("Initiatingin-memory merge with " + noInMemorySegments + "segments...");此部分与map端的输出没什么区别，得到几个segment的文件的一个iterator, 此部分是一个优先堆，每一次next都会从所有的segment中读取出最小的一个key与value rIter = Merger.merge(jobConf, rfs, (Class < K > ) jobConf.getMapOutputKeyClass(), (Class < V > ) jobConf.getMapOutputValueClass(), inMemorySegments, inMemorySegments.size(), newPath(reduceId.toString()), (RawComparator < K > ) jobConf.getOutputKeyComparator(), reporter, spilledRecordsCounter, null, null);如果没有combiner程序，直接写入到文件，否则，如果有combiner，先执行combiner处理。 if (null == combinerClass) { Merger.writeFile(rIter, writer, reporter, jobConf); } else { combineCollector.setWriter(writer); combineAndSpill(rIter, reduceCombineInputCounter); } writer.close();此处与map端的输出不同的地方在这里，这里不写入spillindex文件，而是生成一个CompressAwarePath，把输出路径, 大小写入到此实例中。compressAwarePath = newCompressAwarePath(outputPath, writer.getRawLength(), writer.getCompressedLength());
LOG.info(reduceId + "Merge of the " + noInMemorySegments + "files in-memory complete." + "Local file is " + outputPath + "of size " + localFS.getFileStatus(outputPath).getLen()); } catch(IOException e) { //makesure that we delete the ondiskfile that we created //earlierwhen we invoked cloneFileAttributes localFS.delete(outputPath, true); throwe; }此处，把生成的文件添加到onDiskMapOutputs属性中，并检查此容器中的文件是否达到了mapreduce.task.io.sort.factor配置的值，如果是，发起disk的merger操作。 //Note the output of the merge closeOnDiskFile(compressAwarePath);}
}上面最后一行的全部定义在下面这里。publicsynchronizedvoidcloseOnDiskFile(CompressAwarePath file) { onDiskMapOutputs.add(file); if (onDiskMapOutputs.size() >= (2 * ioSortFactor - 1)) { onDiskMerger.startMerge(onDiskMapOutputs); }}
b.OnDiskMapOutput.commit函数：把tmp文件rename到指定的目录下，生成一个CompressAwarePath实例，调用上面提到的处理程序。publicvoidcommit() throwsIOException { fs.rename(tmpOutputPath, outputPath); CompressAwarePathcompressAwarePath = newCompressAwarePath(outputPath, getSize(), this.compressedSize); merger.closeOnDiskFile(compressAwarePath);}
MergeManagerImpl.OnDiskMerger.merger函数：这个函数到现在基本上没有什么可以解说的东西，注意一点就是，每merge一个文件后，会把这个merge后的文件路径重新添加到onDiskMapOutputs容器中。publicvoidmerge(List < CompressAwarePath > inputs) throwsIOException { //sanity check if (inputs == null || inputs.isEmpty()) { LOG.info("Noondisk files to merge..."); return; } longapproxOutputSize = 0; intbytesPerSum = jobConf.getInt("io.bytes.per.checksum", 512); LOG.info("OnDiskMerger:We have " + inputs.size() + "map outputs on disk. Triggering merge..."); //1. Prepare the list of files to be merged. for (CompressAwarePath file: inputs) { approxOutputSize += localFS.getFileStatus(file).getLen(); }
//add the checksum length approxOutputSize += ChecksumFileSystem.getChecksumLength(approxOutputSize, bytesPerSum);
//2. Start the on-disk merge process Path outputPath = localDirAllocator.getLocalPathForWrite(inputs.get(0).toString(), approxOutputSize, jobConf).suffix(Task.MERGED_OUTPUT_PREFIX); Writer < K, V > writer = newWriter < K, V > (jobConf, rfs, outputPath, (Class < K > ) jobConf.getMapOutputKeyClass(), (Class < V > ) jobConf.getMapOutputValueClass(), codec, null); RawKeyValueIterator iter = null; CompressAwarePathcompressAwarePath; Path tmpDir = newPath(reduceId.toString()); try { iter = Merger.merge(jobConf, rfs, (Class < K > ) jobConf.getMapOutputKeyClass(), (Class < V > ) jobConf.getMapOutputValueClass(), codec, inputs.toArray(newPath[inputs.size()]), true, ioSortFactor, tmpDir, (RawComparator < K > ) jobConf.getOutputKeyComparator(), reporter, spilledRecordsCounter, null, mergedMapOutputsCounter, null);
Merger.writeFile(iter, writer, reporter, jobConf); writer.close(); compressAwarePath = newCompressAwarePath(outputPath, writer.getRawLength(), writer.getCompressedLength()); } catch(IOException e) { localFS.delete(outputPath, true); throwe; }
closeOnDiskFile(compressAwarePath);
LOG.info(reduceId + "Finished merging " + inputs.size() + "map output files on disk of total-size " + approxOutputSize + "." + "Local output file is " + outputPath + " of size " + localFS.getFileStatus(outputPath).getLen());}}
ok，现在map的copy部分执行完成，回到ShuffleConsumerPlugin的run方法中，也就是Shuffle的run方法中，接着上面的代码向下分析：此处等待所有的copy操作完成，//Wait for shuffle to complete successfullywhile (!scheduler.waitUntilDone(PROGRESS_FREQUENCY)) { reporter.progress(); synchronized(this) { if (throwable != null) { thrownewShuffleError("error in shuffle in " + throwingThreadName, throwable); } }}如果执行到这一行时，说明所有的mapcopy操作已经完成，关闭查找map运行状态的线程与执行copy操作的几个线程。//Stop the event-fetcher threadeventFetcher.shutDown();//Stop the map-output fetcher threadsfor (Fetcher < K, V > fetcher: fetchers) { fetcher.shutDown();}//stop the schedulerscheduler.close();发am发送状态，通知AM，此时要执行排序操作。copyPhase.complete(); // copy is already completetaskStatus.setPhase(TaskStatus.Phase.SORT);reduceTask.statusUpdate(umbilical);
执行最后的merge, 其实在合并所有文件与memory中的数据时，也同时会进行排序操作。//Finish the on-going merges...RawKeyValueIterator kvIter = null;try { kvIter = merger.close();} catch(Throwable e) { thrownewShuffleError("Error while doingfinal merge ", e);}
//Sanity checksynchronized(this) { if (throwable != null) { thrownewShuffleError("error in shuffle in " + throwingThreadName, throwable); }}最后返回这个合并后的iterator实例。returnkvIter;
Merger也就是MergeManagerImpl.close函数：publicRawKeyValueIterator close() throwsThrowable {关闭几个merge的线程，在关闭时会等待现有的merge完成。 //Wait for on-going merges to complete if (memToMemMerger != null) { memToMemMerger.close(); } inMemoryMerger.close(); onDiskMerger.close(); List < InMemoryMapOutput < K, V >> memory = newArrayList < InMemoryMapOutput < K, V >> (inMemoryMergedMapOutputs); inMemoryMergedMapOutputs.clear(); memory.addAll(inMemoryMapOutputs); inMemoryMapOutputs.clear(); List < CompressAwarePath > disk = newArrayList < CompressAwarePath > (onDiskMapOutputs); onDiskMapOutputs.clear();执行最终的merge操作。returnfinalMerge(jobConf, rfs, memory, disk);}最后的一个merge操作privateRawKeyValueIterator finalMerge(JobConf job, FileSystem fs, List < InMemoryMapOutput < K, V >> inMemoryMapOutputs, List < CompressAwarePath > onDiskMapOutputs) throwsIOException { LOG.info("finalMergecalled with " + inMemoryMapOutputs.size() + " in-memory map-outputs and " + onDiskMapOutputs.size() + "on-disk map-outputs"); finalfloatmaxRedPer = job.getFloat(MRJobConfig.REDUCE_INPUT_BUFFER_PERCENT, 0f); if (maxRedPer > 1.0 || maxRedPer < 0.0) { thrownewIOException(MRJobConfig.REDUCE_INPUT_BUFFER_PERCENT + maxRedPer); }得到可以cache到内存的大小, 比例通过mapreduce.reduce.input.buffer.percent配置，intmaxInMemReduce = (int) Math.min(Runtime.getRuntime().maxMemory() * maxRedPer, Integer.MAX_VALUE);
//merge configparams Class < K > keyClass = (Class < K > ) job.getMapOutputKeyClass(); Class < V > valueClass = (Class < V > ) job.getMapOutputValueClass(); booleankeepInputs = job.getKeepFailedTaskFiles(); finalPath tmpDir = newPath(reduceId.toString()); finalRawComparator < K > comparator = (RawComparator < K > ) job.getOutputKeyComparator();
//segments required to vacate memory List < Segment < K, V >> memDiskSegments = newArrayList < Segment < K, V >> (); longinMemToDiskBytes = 0; booleanmergePhaseFinished = false; if (inMemoryMapOutputs.size() > 0) { TaskID mapId = inMemoryMapOutputs.get(0).getMapId().getTaskID();这个地方根据可cache到内存的值，把不能cache到内存的部分生成InMemoryReader实例，并添加到memDiskSegments容器中。inMemToDiskBytes = createInMemorySegments(inMemoryMapOutputs, memDiskSegments, maxInMemReduce); finalintnumMemDiskSegments = memDiskSegments.size();把内存中多于部分的mapoutput数据写入到文件中，并把文件路径添加到onDiskMapOutputs容器中。 if (numMemDiskSegments > 0 && ioSortFactor > onDiskMapOutputs.size()) {...........此部分主要是写入内存中多于的mapoutput到磁盘中去mergePhaseFinished = true; //must spill to disk, but can't retain in-memfor intermediate merge finalPath outputPath = mapOutputFile.getInputFileForWrite(mapId, inMemToDiskBytes).suffix(Task.MERGED_OUTPUT_PREFIX); finalRawKeyValueIterator rIter = Merger.merge(job, fs, keyClass, valueClass, memDiskSegments, numMemDiskSegments, tmpDir, comparator, reporter, spilledRecordsCounter, null, mergePhase); Writer < K, V > writer = newWriter < K, V > (job, fs, outputPath, keyClass, valueClass, codec, null); try { Merger.writeFile(rIter, writer, reporter, job); writer.close(); onDiskMapOutputs.add(newCompressAwarePath(outputPath, writer.getRawLength(), writer.getCompressedLength())); writer = null; //add to list of final disk outputs. } catch(IOException e) { if (null != outputPath) { try { fs.delete(outputPath, true); } catch(IOException ie) { //NOTHING } } throwe; } finally { if (null != writer) { writer.close(); } } LOG.info("Merged" + numMemDiskSegments + "segments, " + inMemToDiskBytes + "bytes to disk to satisfy " + "reducememory limit"); inMemToDiskBytes = 0; memDiskSegments.clear(); } elseif(inMemToDiskBytes != 0) { LOG.info("Keeping" + numMemDiskSegments + "segments, " + inMemToDiskBytes + "bytes in memory for " + "intermediate,on-disk merge"); } }
//segments on disk List < Segment < K, V >> diskSegments = newArrayList < Segment < K, V >> (); longonDiskBytes = inMemToDiskBytes; longrawBytes = inMemToDiskBytes;生成目前文件中有的所有的mapoutput路径的onDisk数组CompressAwarePath[] onDisk = onDiskMapOutputs.toArray(newCompressAwarePath[onDiskMapOutputs.size()]); for (CompressAwarePath file: onDisk) { longfileLength = fs.getFileStatus(file).getLen(); onDiskBytes += fileLength; rawBytes += (file.getRawDataLength() > 0) ? file.getRawDataLength() : fileLength;
LOG.debug("Diskfile: " + file + "Length is " + fileLength);把现在reduce端接收过来并存储到文件中的mapoutput生成segment并添加到distSegments容器中diskSegments.add(newSegment < K, V > (job, fs, file, codec, keepInputs, (file.toString().endsWith(Task.MERGED_OUTPUT_PREFIX) ? null: mergedMapOutputsCounter), file.getRawDataLength())); } LOG.info("Merging" + onDisk.length + " files, " + onDiskBytes + "bytes from disk");按内容的大小从小到大排序此distSegments容器Collections.sort(diskSegments, newComparator < Segment < K, V >> () { publicintcompare(Segment < K, V > o1, Segment < K, V > o2) { if (o1.getLength() == o2.getLength()) { return0; } returno1.getLength() < o2.getLength() ? -1 : 1; } });把现在memory中所有的mapoutput内容生成segment并添加到finalSegments容器中。 //build final list of segments from merged backed by disk + in-mem List < Segment < K, V >> finalSegments = newArrayList < Segment < K, V >> (); longinMemBytes = createInMemorySegments(inMemoryMapOutputs, finalSegments, 0); LOG.info("Merging" + finalSegments.size() + "segments, " + inMemBytes + "bytes from memory into reduce"); if (0 != onDiskBytes) { finalintnumInMemSegments = memDiskSegments.size(); diskSegments.addAll(0, memDiskSegments); memDiskSegments.clear(); //Pass mergePhase only if there is a going to be intermediate //merges. See comment where mergePhaseFinished is being set Progress thisPhase = (mergePhaseFinished) ? null: mergePhase;这个部分是把现在磁盘上的mapoutput生成一个iterator, RawKeyValueIterator diskMerge = Merger.merge(job, fs, keyClass, valueClass, codec, diskSegments, ioSortFactor, numInMemSegments, tmpDir, comparator, reporter, false, spilledRecordsCounter, null, thisPhase); diskSegments.clear(); if (0 == finalSegments.size()) { returndiskMerge; }把现在磁盘上的iterator也同样添加到finalSegments容器中，也就是此时，这个容器中有两个优先堆排序的队列，每next一次，要从内存与磁盘中找出最小的一个kv.finalSegments.add(newSegment < K, V > (newRawKVIteratorReader(diskMerge, onDiskBytes), true, rawBytes)); } returnMerger.merge(job, fs, keyClass, valueClass, finalSegments, finalSegments.size(), tmpDir, comparator, reporter, spilledRecordsCounter, null, null);}
shuffle部分现在全部执行完成，重新加到ReduceTask.run函数中，接着代码向下分析：rIter = shuffleConsumerPlugin.run();............RawComparatorcomparator = job.getOutputValueGroupingComparator();if (useNewApi) { runNewReducer(job, umbilical, reporter, rIter, comparator, keyClass, valueClass);} else { runOldReducer........}在以上代码中执行runNewReducer主要是执行reduce的run函数，org.apache.hadoop.mapreduce.TaskAttemptContexttaskContext = neworg.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job, getTaskID(), reporter);//make a reducerorg.apache.hadoop.mapreduce.Reducer < INKEY, INVALUE, OUTKEY, OUTVALUE > reducer = (org.apache.hadoop.mapreduce.Reducer < INKEY, INVALUE, OUTKEY, OUTVALUE > ) ReflectionUtils.newInstance(taskContext.getReducerClass(), job);org.apache.hadoop.mapreduce.RecordWriter < OUTKEY, OUTVALUE > trackedRW = newNewTrackingRecordWriter < OUTKEY, OUTVALUE > (this, taskContext);job.setBoolean("mapred.skip.on", isSkipping());job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());org.apache.hadoop.mapreduce.Reducer.Context reducerContext = createReduceContext(reducer, job, getTaskID(), rIter, reduceInputKeyCounter, reduceInputValueCounter, trackedRW, committer, reporter, comparator, keyClass, valueClass);try { reducer.run(reducerContext);} finally { trackedRW.close(reducerContext);}
以上代码中创建Reducer运行的Context, 并执行reducer.run函数：createReduceContext函数定义部分代码：org.apache.hadoop.mapreduce.ReduceContext < INKEY, INVALUE, OUTKEY, OUTVALUE > reduceContext = newReduceContextImpl < INKEY, INVALUE, OUTKEY, OUTVALUE > (job, taskId, rIter, inputKeyCounter, inputValueCounter, output, committer, reporter, comparator, keyClass, valueClass);
org.apache.hadoop.mapreduce.Reducer < INKEY, INVALUE, OUTKEY, OUTVALUE > .Context reducerContext = newWrappedReducer < INKEY, INVALUE, OUTKEY, OUTVALUE > ().getReducerContext(reduceContext);ReduceContextImpl主要是执行在RawKeyValueInterator中读取数据的相关操作。Reducer.run函数：publicvoid run(Context context) throwsIOException, InterruptedException { setup(context); try { while (context.nextKey()) { reduce(context.getCurrentKey(), context.getValues(), context); //If a back up store is used, reset it Iterator < VALUEIN > iter = context.getValues().iterator(); if (iterinstanceofReduceContext.ValueIterator) { ((ReduceContext.ValueIterator < VALUEIN > ) iter).resetBackupStore(); } } } finally { cleanup(context); }}在run函数中通过context.nextkey来得到下一行的数据，这部分主要在ReduceContextImpl中完成：nextkey调用nextKeyValue函数：publicboolean nextKeyValue() throwsIOException, InterruptedException { if (!hasMore) { key = null; value = null; returnfalse; }此处用来检查是否是一个key下面的第一个value, 如果是第一个value时，此值为false, 也就是说，nextKeyIsSame的值是true时，表示现在next的数据与current的key是一行数据。否则表示已经进行了换行操作。firstValue = !nextKeyIsSame;执行一下RawKeyValueInterator(也就是Merge中的队列)，得到当前最小的key DataInputBuffer nextKey = input.getKey();把key设置到buffer中，设置到buffer中的目的是为了通过keyDeserializer来读取一个key的值。currentRawKey.set(nextKey.getData(), nextKey.getPosition(), nextKey.getLength() - nextKey.getPosition()); buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());从buffer中读取key的值，并存储到key中，这个地方要注意一下，下面先看看这部分的定义：.........................生成一个key的Deserializer实例，this.keyDeserializer = serializationFactory.getDeserializer(keyClass);把buffer当成keyDeserializer的InputStream。this.keyDeserializer.open(buffer); Deserializer中执行deserializer函数的定义：此部分定义可以看出，一个key / value只会生成实例，此部分从性能上考虑主要是为了减少对象的生成。每次生成一个数据时，都是通过readFields重新去生成Writable实例中的内容，因此，很多同学在reduce中使用value时，会出现数据引用不对的情况，因为对象还是同一个对象，但值是最后一个，所以会出现数据不对的情况publicWritable deserialize(Writable w) throwsIOException { Writable writable; if (w == null) { writable = (Writable) ReflectionUtils.newInstance(writableClass, getConf()); } else { writable = w; } writable.readFields(dataIn); returnwritable; }.........................读取key的内容key = keyDeserializer.deserialize(key);按key相同的方式，得到当前的value的值，DataInputBuffer nextVal = input.getValue(); buffer.reset(nextVal.getData(), nextVal.getPosition(), nextVal.getLength() - nextVal.getPosition()); value = valueDeserializer.deserialize(value);
currentKeyLength = nextKey.getLength() - nextKey.getPosition(); currentValueLength = nextVal.getLength() - nextVal.getPosition();
isMarked的值为false, 同时backupStore属性为null if (isMarked) { backupStore.write(nextKey, nextVal); }把input执行一次next操作，此处会从所有的文件 / memory中找到最小的一个kv.hasMore = input.next(); if (hasMore) {比较一下，是否与currentkey是同一个key, 如果是表示在同一行中。也就是key相同。nextKey = input.getKey(); nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0, currentRawKey.getLength(), nextKey.getData(), nextKey.getPosition(), nextKey.getLength() - nextKey.getPosition()) == 0; } else { nextKeyIsSame = false; } inputValueCounter.increment(1); returntrue;}
接下来是调用reduce函数，此时会通过context.getValues函数把key对应的所有的value传给reduce.此处的context.getValues如下所示：ReduceContextImpl.getValues() public Iterable < VALUEIN > getValues() throwsIOException, InterruptedException { returniterable;}以上代码中直接返回的是iterable的实例，此实例在ReduceContextImpl实例生成时生成。privateValueIterable iterable = newValueIterable();这个类是ReduceContextImpl中的内部类protectedclass ValueIterable implementsIterable < VALUEIN > { privateValueIterator iterator = newValueIterator();@Override publicIterator < VALUEIN > iterator() { returniterator; }}此实例中引用一个ValueIterator类，这也是一个内部类。每次进行执行时，通过此ValueIterator.next来获取一条数据，publicVALUEIN next() { inReset的值默认为false.也就是说inReset检查内部的代码不会执行，其实backupStore本身值就是null如果想使用backupStore, 需要执行其内部的make函数。 if (inReset) {.................里面的代码不分析 }如果是key下面的第一个value, 把firstValue设置为false, 因为下一次来时，就不是firstValue了.返回当前的value //if this is the first record, we don't need to advance if (firstValue) { firstValue = false; returnvalue; } //if this isn't the first record and the next key is different, they //can't advance it here. if (!nextKeyIsSame) { thrownewNoSuchElementException("iteratepast last value"); } //otherwise, go to the next key/value pair try {这里表示不是第一个value的时候，也就是firstValue的值为false, 执行一下nextKeyValue函数，得到当前的value.返回。nextKeyValue(); returnvalue; } catch(IOException ie) { thrownewRuntimeException("next valueiterator failed", ie); } catch(InterruptedException ie) { //this is bad, but we can't modify the exception list of java.util thrownewRuntimeException("next valueiterator interrupted", ie); }}