Hadoop Source Code Walkthrough: The Shuffle Mechanism

This post looks at the relatively complex shuffle mechanism in Hadoop, again by stepping through the source code. Shuffle starts once a map task writes out a key/value pair and the collector picks it up, so the entry point is runNewMapper(job, splitMetaInfo, umbilical, reporter) inside MapTask's run() method.

I will cover two aspects of shuffle here. The first is the preparation work before shuffle: starting the collector and reading in some job configuration. The second, and most important, is the concrete working mechanism and overall flow of shuffle.

The diagram below shows the overall shuffle flow, for readers who are not yet familiar with it:

[Figure: overall shuffle flow diagram (original image not available)]

Entry point:

runNewMapper(job, splitMetaInfo, umbilical, reporter);

Shuffle Preparation

The implementation of this method is as follows:

private <INKEY,INVALUE,OUTKEY,OUTVALUE>
  void runNewMapper(final JobConf job,
                    final TaskSplitIndex splitIndex,
                    final TaskUmbilicalProtocol umbilical,
                    TaskReporter reporter
                    ) throws IOException, ClassNotFoundException,
                             InterruptedException {
    // make a task context so we can get the classes
    org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
      new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job, 
                                                                  getTaskID(),
                                                                  reporter);
    // make a mapper
    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =
      (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)
        ReflectionUtils.newInstance(taskContext.getMapperClass(), job);
    // make the input format
    org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
      (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
        ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
    // rebuild the input split
    org.apache.hadoop.mapreduce.InputSplit split = null;
    split = getSplitDetails(new Path(splitIndex.getSplitLocation()),
        splitIndex.getStartOffset());
    LOG.info("Processing split: " + split);

    org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
      new NewTrackingRecordReader<INKEY,INVALUE>
        (split, inputFormat, reporter, taskContext);
    
    job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
    org.apache.hadoop.mapreduce.RecordWriter output = null;
    
    // get an output object
    if (job.getNumReduceTasks() == 0) {
      output = 
        new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
    } else {
      output = new NewOutputCollector(taskContext, job, umbilical, reporter);
    }

    org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE> 
    mapContext = 
      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(), 
          input, output, 
          committer, 
          reporter, split);

    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context 
        mapperContext = 
          new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>().getMapContext(
              mapContext);

    try {
      input.initialize(split, mapperContext);
      mapper.run(mapperContext);
      mapPhase.complete();
      setPhase(TaskStatus.Phase.SORT);
      statusUpdate(umbilical);
      input.close();
      input = null;
      output.close(mapperContext);
      output = null;
    } finally {
      closeQuietly(input);
      closeQuietly(output, mapperContext);
    }
  }

Most of this code creates various Hadoop components:

  • org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
    new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job,
    getTaskID(), reporter);

    Creates the taskContext.

  • org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =
    (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)
    ReflectionUtils.newInstance(taskContext.getMapperClass(), job);

    Creates the mapper via reflection.

  • org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
    (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
    ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);

    Creates the InputFormat via reflection; it is responsible for reading the input data.

  • split = getSplitDetails(new Path(splitIndex.getSplitLocation()),
    splitIndex.getStartOffset());

    Retrieves the details of the input split.

  • org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
    new NewTrackingRecordReader<INKEY,INVALUE>
    (split, inputFormat, reporter, taskContext);

    Wraps the InputFormat's RecordReader in a NewTrackingRecordReader; the input data is actually read through this RecordReader.

  • Now the key part: creating the collector, i.e. the output collector object.

    if (job.getNumReduceTasks() == 0) {
          output = 
            new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
        } else {
          output = new NewOutputCollector(taskContext, job, umbilical, reporter);
        }
    

    This checks whether the number of reduce tasks is 0. If it is, there is no shuffle at all and the map output is written out directly through the job's OutputFormat (NewDirectOutputCollector), with no sorting or partitioning. If it is not 0, a NewOutputCollector is created for you. Its constructor is shown below, followed by a small illustrative Partitioner sketch:

     NewOutputCollector(org.apache.hadoop.mapreduce.JobContext jobContext,
                           JobConf job,
                           TaskUmbilicalProtocol umbilical,
                           TaskReporter reporter
                           ) throws IOException, ClassNotFoundException {
          collector = createSortingCollector(job, reporter);
          partitions = jobContext.getNumReduceTasks();
          if (partitions > 1) {
            partitioner = (org.apache.hadoop.mapreduce.Partitioner<K,V>)
              ReflectionUtils.newInstance(jobContext.getPartitionerClass(), job);
          } else {
            partitioner = new org.apache.hadoop.mapreduce.Partitioner<K,V>() {
              @Override
              public int getPartition(K key, V value, int numPartitions) {
                return partitions - 1;
              }
            };
          }
        }
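    As the constructor shows, when there is more than one reduce task the job's configured Partitioner (HashPartitioner by default, obtained via jobContext.getPartitionerClass()) decides which reduce partition each key/value pair goes to; with a single reduce task an anonymous partitioner simply returns partitions - 1, i.e. 0. Below is a minimal custom Partitioner sketch, purely for illustration (the class name and routing rule are my own, not part of the Hadoop source):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Illustrative partitioner: keys starting with a-m go to partition 0,
    // all remaining keys are hashed across the other partitions.
    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions == 1) {
          return 0; // single reducer: everything lands in partition 0
        }
        String k = key.toString().toLowerCase();
        if (!k.isEmpty() && k.charAt(0) >= 'a' && k.charAt(0) <= 'm') {
          return 0;
        }
        // the result must stay within [0, numPartitions)
        return 1 + (k.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
      }
    }

    Such a class would be registered with job.setPartitionerClass(FirstLetterPartitioner.class) and, exactly as the branch above shows, only takes effect when the job has more than one reduce task.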
    
    • collector = createSortingCollector(job, reporter); creates the MapOutputCollector. This object is only the high-level collector; the real core is the MapOutputBuffer created inside this method.

      So let's step into this method. Its code is as follows:

      private <KEY, VALUE> MapOutputCollector<KEY, VALUE>
                createSortingCollector(JobConf job, TaskReporter reporter)
          throws IOException, ClassNotFoundException {
          MapOutputCollector.Context context =
            new MapOutputCollector.Context(this, job, reporter);
      
          Class<?>[] collectorClasses = job.getClasses(
            JobContext.MAP_OUTPUT_COLLECTOR_CLASS_ATTR, MapOutputBuffer.class);
          int remainingCollectors = collectorClasses.length;
          Exception lastException = null;
          for (Class clazz : collectorClasses) {
            try {
              if (!MapOutputCollector.class.isAssignableFrom(clazz)) {
                throw new IOException("Invalid output collector class: " + clazz.getName() +
                  " (does not implement MapOutputCollector)");
              }
              Class<? extends MapOutputCollector> subclazz =
                clazz.asSubclass(MapOutputCollector.class);
              LOG.debug("Trying map output collector class: " + subclazz.getName());
              MapOutputCollector<KEY, VALUE> collector =
                ReflectionUtils.newInstance(subclazz, job);
              collector.init(context);
              LOG.info("Map output collector class = " + collector.getClass().getName());
              return collector;
            } catch (Exception e) {
              String msg = "Unable to initialize MapOutputCollector " + clazz.getName();
              if (--remainingCollectors > 0) {
                msg += " (" + remainingCollectors + " more collector(s) to try)";
              }
              lastException = e;
              LOG.warn(msg, e);
            }
          }
          throw new IOException("Initialization of all the collectors failed. " +
            "Error in last collector was :" + lastException.getMessage(), lastException);
        }
      
      • Class<?>[] collectorClasses = job.getClasses(
        JobContext.MAP_OUTPUT_COLLECTOR_CLASS_ATTR, MapOutputBuffer.class);

        Retrieves the concrete collector class to use, which defaults to MapOutputBuffer.

      • MapOutputCollector<KEY, VALUE> collector = ReflectionUtils.newInstance(subclazz, job);

        Creates the MapOutputBuffer collector instance via reflection.

      • collector.init(context); initializes the MapOutputBuffer collector.

        The code is as follows:

        public void init(MapOutputCollector.Context context
                            ) throws IOException, ClassNotFoundException {
              job = context.getJobConf();
              reporter = context.getReporter();
              mapTask = context.getMapTask();
              mapOutputFile = mapTask.getMapOutputFile();
              sortPhase = mapTask.getSortPhase();
              spilledRecordsCounter = reporter.getCounter(TaskCounter.SPILLED_RECORDS);
              partitions = job.getNumReduceTasks();
              rfs = ((LocalFileSystem)FileSystem.getLocal(job)).getRaw();
        
              //sanity checks
              final float spillper =
                job.getFloat(JobContext.MAP_SORT_SPILL_PERCENT, (float)0.8);
              final int sortmb = job.getInt(JobContext.IO_SORT_MB, 100);
              indexCacheMemoryLimit = job.getInt(JobContext.INDEX_CACHE_MEMORY_LIMIT,
                                                 INDEX_CACHE_MEMORY_LIMIT_DEFAULT);
              if (spillper > (float)1.0 || spillper <= (float)0.0) {
                throw new IOException("Invalid \"" + JobContext.MAP_SORT_SPILL_PERCENT +
                    "\": " + spillper);
              }
              if ((sortmb & 0x7FF) != sortmb) {
                throw new IOException(
                    "Invalid \"" + JobContext.IO_SORT_MB + "\": " + sortmb);
              }
              sorter = ReflectionUtils.newInstance(job.getClass("map.sort.class",
                    QuickSort.class, IndexedSorter.class), job);
              // buffers and accounting
              int maxMemUsage = sortmb << 20;
              maxMemUsage -= maxMemUsage % METASIZE;
              kvbuffer = new byte[maxMemUsage];
              bufvoid = kvbuffer.length;
              kvmeta = ByteBuffer.wrap(kvbuffer)
                 .order(ByteOrder.nativeOrder())
                 .asIntBuffer();
              setEquator(0);
              bufstart = bufend = bufindex = equator;
              kvstart = kvend = kvindex;
        
              maxRec = kvmeta.capacity() / NMETA;
              softLimit = (int)(kvbuffer.length * spillper);
              bufferRemaining = softLimit;
              LOG.info(JobContext.IO_SORT_MB + ": " + sortmb);
              LOG.info("soft limit at " + softLimit);
              LOG.info("bufstart = " + bufstart + "; bufvoid = " + bufvoid);
              LOG.info("kvstart = " + kvstart + "; length = " + maxRec);
        
              // k/v serialization
              comparator = job.getOutputKeyComparator();
              keyClass = (Class<K>)job.getMapOutputKeyClass();
              valClass = (Class<V>)job.getMapOutputValueClass();
              serializationFactory = new SerializationFactory(job);
              keySerializer = serializationFactory.getSerializer(keyClass);
              keySerializer.open(bb);
              valSerializer = serializationFactory.getSerializer(valClass);
              valSerializer.open(bb);
        
              // output counters
              mapOutputByteCounter = reporter.getCounter(TaskCounter.MAP_OUTPUT_BYTES);
              mapOutputRecordCounter =
                reporter.getCounter(TaskCounter.MAP_OUTPUT_RECORDS);
              fileOutputByteCounter = reporter
                  .getCounter(TaskCounter.MAP_OUTPUT_MATERIALIZED_BYTES);
        
              // compression
              if (job.getCompressMapOutput()) {
                Class<? extends CompressionCodec> codecClass =
                  job.getMapOutputCompressorClass(DefaultCodec.class);
                codec = ReflectionUtils.newInstance(codecClass, job);
              } else {
                codec = null;
              }
        
              // combiner
              final Counters.Counter combineInputCounter =
                reporter.getCounter(TaskCounter.COMBINE_INPUT_RECORDS);
              combinerRunner = CombinerRunner.create(job, getTaskID(), 
                                                     combineInputCounter,
                                                     reporter, null);
              if (combinerRunner != null) {
                final Counters.Counter combineOutputCounter =
                  reporter.getCounter(TaskCounter.COMBINE_OUTPUT_RECORDS);
                combineCollector= new CombineOutputCollector<K,V>(combineOutputCounter, reporter, job);
              } else {
                combineCollector = null;
              }
              spillInProgress = false;
              minSpillsForCombine = job.getInt(JobContext.MAP_COMBINE_MIN_SPILLS, 3);
              spillThread.setDaemon(true);
              spillThread.setName("SpillThread");
              spillLock.lock();
              try {
                spillThread.start();
                while (!spillThreadRunning) {
                  spillDone.await();
                }
              } catch (InterruptedException e) {
                throw new IOException("Spill thread failed to initialize", e);
              } finally {
                spillLock.unlock();
              }
              if (sortSpillException != null) {
                throw new IOException("Spill thread failed to initialize",
                    sortSpillException);
              }
            }
        

        This init() method mainly performs a series of initializations on the collector. A lot happens here, so I will only pick out the more important parts:

        1. final float spillper = job.getFloat(JobContext.MAP_SORT_SPILL_PERCENT, (float)0.8);

          Sets the spill threshold of the circular buffer to 80%. The collector gathers the key/value pairs emitted by the map phase into a circular (ring) buffer. The buffer is split into two regions: one holds the serialized data itself and the other holds the metadata describing each record; together they form the ring. The buffer cannot be made too large, or it would consume too much memory, so once the data in it reaches 80% of the total capacity, the buffered contents are spilled to disk.

        2. final int sortmb = job.getInt(JobContext.IO_SORT_MB, 100);

          Sets the size of the circular buffer to 100 MB.

        3. sorter = ReflectionUtils.newInstance(job.getClass("map.sort.class",
              QuickSort.class, IndexedSorter.class), job);

          Obtains the sorter. Before the data is spilled from the circular buffer to disk it has to be sorted, so a sorter (QuickSort by default) is needed. The sort works on the per-record indices (metadata), not on the serialized data itself.

        4. comparator = job.getOutputKeyComparator();

          Obtains the key comparator, i.e. the comparator used to order the map output keys before spilling.

        5. Obtain the compression codec

          if (job.getCompressMapOutput()) {
                  Class<? extends CompressionCodec> codecClass =
                    job.getMapOutputCompressorClass(DefaultCodec.class);
                  codec = ReflectionUtils.newInstance(codecClass, job);
                } else {
                  codec = null;
                }
          

          The map output may be compressed during shuffle, which reduces the amount of data spilled and transferred and therefore improves shuffle efficiency.

        6. Obtain the Combiner

          if (combinerRunner != null) {
                  final Counters.Counter combineOutputCounter =
                    reporter.getCounter(TaskCounter.COMBINE_OUTPUT_RECORDS);
                  combineCollector= new CombineOutputCollector<K,V>(combineOutputCounter, reporter, job);
                } else {
                  combineCollector = null;
                }
          

          A combiner may have been configured for the job; if so, a combiner output collector is created here, otherwise it stays null.

        7. spillThread.start();

          Once all of this preparation is done, the spill thread is started and waits for a spill to be triggered. A driver-side configuration sketch for the parameters above follows this list.
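To connect these knobs to an actual job, here is a hedged driver-side sketch of how they could be set. The values are arbitrary examples, not recommendations; the configuration keys are the standard ones behind JobContext.IO_SORT_MB, MAP_SORT_SPILL_PERCENT and map-output compression:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuningDriver {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // circular buffer size (JobContext.IO_SORT_MB); the default is 100 MB
    conf.setInt("mapreduce.task.io.sort.mb", 200);
    // spill threshold (JobContext.MAP_SORT_SPILL_PERCENT); the default is 0.8
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.8f);
    // compress the map output before it is shuffled to the reducers;
    // SnappyCodec assumes the native snappy library is available on the cluster
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "shuffle-tuning-demo");
    // a combiner, if set here, is later picked up by CombinerRunner.create(...)
    // job.setCombinerClass(MyCombiner.class); // hypothetical combiner class
  }
}

Increasing mapreduce.task.io.sort.mb trades memory for fewer spills, since the soft limit scales with the buffer size.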

The Shuffle Workflow

Entry point: the shuffle kicks in when the mapper writes a record out.

Stepping through with the debugger, we eventually arrive at:

public void write(K key, V value) throws IOException, InterruptedException {
      collector.collect(key, value,
                        partitioner.getPartition(key, value, partitions));
    }
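For context, this write() belongs to NewOutputCollector, and it is what a mapper's context.write(...) ultimately reaches through the WrappedMapper and MapContextImpl wrappers created in runNewMapper. A minimal mapper sketch that would drive it (the class itself is illustrative, not from the Hadoop source):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative word-count style mapper: every context.write(...) below ends up
// in NewOutputCollector.write(), i.e. in collector.collect(key, value, partition).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final Text word = new Text();
  private final IntWritable one = new IntWritable(1);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, one); // enters the shuffle via the collector
      }
    }
  }
}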

The collector then starts its work.

The implementation is as follows:

public synchronized void collect(K key, V value, final int partition
                                     ) throws IOException {
      reporter.progress();
      if (key.getClass() != keyClass) {
        throw new IOException("Type mismatch in key from map: expected "
                              + keyClass.getName() + ", received "
                              + key.getClass().getName());
      }
      if (value.getClass() != valClass) {
        throw new IOException("Type mismatch in value from map: expected "
                              + valClass.getName() + ", received "
                              + value.getClass().getName());
      }
      if (partition < 0 || partition >= partitions) {
        throw new IOException("Illegal partition for " + key + " (" +
            partition + ")");
      }
      checkSpillException();
      bufferRemaining -= METASIZE;
      if (bufferRemaining <= 0) {
        // start spill if the thread is not running and the soft limit has been
        // reached
        spillLock.lock();
        try {
          do {
            if (!spillInProgress) {
              final int kvbidx = 4 * kvindex;
              final int kvbend = 4 * kvend;
              // serialized, unspilled bytes always lie between kvindex and
              // bufindex, crossing the equator. Note that any void space
              // created by a reset must be included in "used" bytes
              final int bUsed = distanceTo(kvbidx, bufindex);
              final boolean bufsoftlimit = bUsed >= softLimit;
              if ((kvbend + METASIZE) % kvbuffer.length !=
                  equator - (equator % METASIZE)) {
                // spill finished, reclaim space
                resetSpill();
                bufferRemaining = Math.min(
                    distanceTo(bufindex, kvbidx) - 2 * METASIZE,
                    softLimit - bUsed) - METASIZE;
                continue;
              } else if (bufsoftlimit && kvindex != kvend) {
                // spill records, if any collected; check latter, as it may
                // be possible for metadata alignment to hit spill pcnt
                startSpill();
                final int avgRec = (int)
                  (mapOutputByteCounter.getCounter() /
                  mapOutputRecordCounter.getCounter());
                // leave at least half the split buffer for serialization data
                // ensure that kvindex >= bufindex
                final int distkvi = distanceTo(bufindex, kvbidx);
                final int newPos = (bufindex +
                  Math.max(2 * METASIZE - 1,
                          Math.min(distkvi / 2,
                                   distkvi / (METASIZE + avgRec) * METASIZE)))
                  % kvbuffer.length;
                setEquator(newPos);
                bufmark = bufindex = newPos;
                final int serBound = 4 * kvend;
                // bytes remaining before the lock must be held and limits
                // checked is the minimum of three arcs: the metadata space, the
                // serialization space, and the soft limit
                bufferRemaining = Math.min(
                    // metadata max
                    distanceTo(bufend, newPos),
                    Math.min(
                      // serialization max
                      distanceTo(newPos, serBound),
                      // soft limit
                      softLimit)) - 2 * METASIZE;
              }
            }
          } while (false);
        } finally {
          spillLock.unlock();
        }
      }

      try {
        // serialize key bytes into buffer
        int keystart = bufindex;
        keySerializer.serialize(key);
        if (bufindex < keystart) {
          // wrapped the key; must make contiguous
          bb.shiftBufferedKey();
          keystart = 0;
        }
        // serialize value bytes into buffer
        final int valstart = bufindex;
        valSerializer.serialize(value);
        // It's possible for records to have zero length, i.e. the serializer
        // will perform no writes. To ensure that the boundary conditions are
        // checked and that the kvindex invariant is maintained, perform a
        // zero-length write into the buffer. The logic monitoring this could be
        // moved into collect, but this is cleaner and inexpensive. For now, it
        // is acceptable.
        bb.write(b0, 0, 0);

        // the record must be marked after the preceding write, as the metadata
        // for this record are not yet written
        int valend = bb.markRecord();

        mapOutputRecordCounter.increment(1);
        mapOutputByteCounter.increment(
            distanceTo(keystart, valend, bufvoid));

        // write accounting info
        kvmeta.put(kvindex + PARTITION, partition);
        kvmeta.put(kvindex + KEYSTART, keystart);
        kvmeta.put(kvindex + VALSTART, valstart);
        kvmeta.put(kvindex + VALLEN, distanceTo(valstart, valend));
        // advance kvindex
        kvindex = (kvindex - NMETA + kvmeta.capacity()) % kvmeta.capacity();
      } catch (MapBufferTooSmallException e) {
        LOG.info("Record too large for in-memory buffer: " + e.getMessage());
        spillSingleRecord(key, value, partition);
        mapOutputRecordCounter.increment(1);
        return;
      }
    }

When does the spill start?

When the data in the circular buffer reaches 80% of the buffer's total size, spilling starts. The spill path is protected by a lock: spillLock.lock();
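To make the trigger concrete: with the defaults seen in init() (a 100 MB buffer and a 0.8 spill ratio), softLimit comes out to roughly 80 MB, and every collected record charges its serialized bytes plus one METASIZE (16-byte, 4-int) accounting entry against bufferRemaining; once that counter drops to zero or below, the locked block in collect() runs and startSpill() wakes the SpillThread. A standalone arithmetic sketch, with the default values hard-coded and an assumed average record size purely for illustration:

// Standalone arithmetic sketch of the spill trigger, using the defaults from init().
public class SpillThresholdSketch {
  public static void main(String[] args) {
    int sortmb = 100;                       // JobContext.IO_SORT_MB default
    float spillper = 0.8f;                  // MAP_SORT_SPILL_PERCENT default
    int metasize = 16;                      // 4 ints of accounting per record

    int maxMemUsage = sortmb << 20;         // 100 MB
    maxMemUsage -= maxMemUsage % metasize;  // align to whole metadata entries
    int softLimit = (int) (maxMemUsage * spillper);

    System.out.println("kvbuffer  = " + maxMemUsage + " bytes");
    System.out.println("softLimit = " + softLimit + " bytes (~80 MB)");

    // rough estimate: how many records fit before the first spill, assuming
    // (hypothetically) ~100 bytes of serialized key+value per record
    int avgRecordBytes = 100;
    System.out.println("records before first spill ~ "
        + softLimit / (avgRecordBytes + metasize));
  }
}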

Once a spill is triggered, the order of work is sort first, then spill, so the sortAndSpill() method is executed. Its code is as follows:

private void sortAndSpill() throws IOException, ClassNotFoundException,
                                       InterruptedException {
      //approximate the length of the output file to be the length of the
      //buffer + header lengths for the partitions
      final long size = distanceTo(bufstart, bufend, bufvoid) +
                  partitions * APPROX_HEADER_LENGTH;
      FSDataOutputStream out = null;
      try {
        // create spill file
        final SpillRecord spillRec = new SpillRecord(partitions);
        final Path filename =
            mapOutputFile.getSpillFileForWrite(numSpills, size);
        out = rfs.create(filename);

        final int mstart = kvend / NMETA;
        final int mend = 1 + // kvend is a valid record
          (kvstart >= kvend
          ? kvstart
          : kvmeta.capacity() + kvstart) / NMETA;
        sorter.sort(MapOutputBuffer.this, mstart, mend, reporter);
        int spindex = mstart;
        final IndexRecord rec = new IndexRecord();
        final InMemValBytes value = new InMemValBytes();
        for (int i = 0; i < partitions; ++i) {
          IFile.Writer<K, V> writer = null;
          try {
            long segmentStart = out.getPos();
            FSDataOutputStream partitionOut = CryptoUtils.wrapIfNecessary(job, out);
            writer = new Writer<K, V>(job, partitionOut, keyClass, valClass, codec,
                                      spilledRecordsCounter);
            if (combinerRunner == null) {
              // spill directly
              DataInputBuffer key = new DataInputBuffer();
              while (spindex < mend &&
                  kvmeta.get(offsetFor(spindex % maxRec) + PARTITION) == i) {
                final int kvoff = offsetFor(spindex % maxRec);
                int keystart = kvmeta.get(kvoff + KEYSTART);
                int valstart = kvmeta.get(kvoff + VALSTART);
                key.reset(kvbuffer, keystart, valstart - keystart);
                getVBytesForOffset(kvoff, value);
                writer.append(key, value);
                ++spindex;
              }
            } else {
              int spstart = spindex;
              while (spindex < mend &&
                  kvmeta.get(offsetFor(spindex % maxRec)
                            + PARTITION) == i) {
                ++spindex;
              }
              // Note: we would like to avoid the combiner if we've fewer
              // than some threshold of records for a partition
              if (spstart != spindex) {
                combineCollector.setWriter(writer);
                RawKeyValueIterator kvIter =
                  new MRResultIterator(spstart, spindex);
                combinerRunner.combine(kvIter, combineCollector);
              }
            }

            // close the writer
            writer.close();

            // record offsets
            rec.startOffset = segmentStart;
            rec.rawLength = writer.getRawLength() + CryptoUtils.cryptoPadding(job);
            rec.partLength = writer.getCompressedLength() + CryptoUtils.cryptoPadding(job);
            spillRec.putIndex(rec, i);

            writer = null;
          } finally {
            if (null != writer) writer.close();
          }
        }

        if (totalIndexCacheMemory >= indexCacheMemoryLimit) {
          // create spill index file
          Path indexFilename =
              mapOutputFile.getSpillIndexFileForWrite(numSpills, partitions
                  * MAP_OUTPUT_INDEX_RECORD_LENGTH);
          spillRec.writeToFile(indexFilename, job);
        } else {
          indexCacheList.add(spillRec);
          totalIndexCacheMemory +=
            spillRec.size() * MAP_OUTPUT_INDEX_RECORD_LENGTH;
        }
        LOG.info("Finished spill " + numSpills);
        ++numSpills;
      } finally {
        if (out != null) out.close();
      }
    }
  • final Path filename = mapOutputFile.getSpillFileForWrite(numSpills, size);

    Obtains the name of the spill file.

  • out = rfs.create(filename);

    Creates the spill file.

    The created file:

    [Figure: screenshot of the spill file created on local disk (image not available)]

  • sorter.sort(MapOutputBuffer.this, mstart, mend, reporter);

    Sorts the buffered records. The sort rearranges the per-record indices (metadata), not the serialized data itself.

  • The for loop that follows writes each sorted partition out to the spill file, e.g. spill0.out. Since spilling may happen several times, the spill files end up being named spill0.out, spill1.out, ..., spillN.out.

  • Checks whether the memory occupied by the cached spill index records has reached the threshold; if it has, the index is also spilled to a file on disk.

    The check is:

    if (totalIndexCacheMemory >= indexCacheMemoryLimit) {
              // create spill index file
              Path indexFilename =
                  mapOutputFile.getSpillIndexFileForWrite(numSpills, partitions
                      * MAP_OUTPUT_INDEX_RECORD_LENGTH);
              spillRec.writeToFile(indexFilename, job);
            } else {
              indexCacheList.add(spillRec);
              totalIndexCacheMemory +=
                spillRec.size() * MAP_OUTPUT_INDEX_RECORD_LENGTH;
            }
    
  • After the mapper has written out all of its key/value pairs, output.close(mapperContext); in runNewMapper is executed:

    output.close(mapperContext);
    

After the mapper's final write, whatever data remains in the circular buffer and has not been spilled yet is flushed out to a file.

The flush method is called:

  1. collector.flush();

  2. The remaining data also has to be sorted and spilled, again via the sortAndSpill() method:

    sortAndSpill()

  3. After sortAndSpill() completes, mergeParts(); is called to merge the spill files.

    Merging the spill files is a key step, so let's look at it in detail.

    The implementation is as follows:

    private void mergeParts() throws IOException, InterruptedException, 
                                         ClassNotFoundException {
          // get the approximate size of the final output/index files
          long finalOutFileSize = 0;
          long finalIndexFileSize = 0;
          final Path[] filename = new Path[numSpills];
          final TaskAttemptID mapId = getTaskID();
    
          for(int i = 0; i < numSpills; i++) {
            filename[i] = mapOutputFile.getSpillFile(i);
            finalOutFileSize += rfs.getFileStatus(filename[i]).getLen();
          }
          if (numSpills == 1) { //the spill is the final output
            sameVolRename(filename[0],
                mapOutputFile.getOutputFileForWriteInVolume(filename[0]));
            if (indexCacheList.size() == 0) {
              sameVolRename(mapOutputFile.getSpillIndexFile(0),
                mapOutputFile.getOutputIndexFileForWriteInVolume(filename[0]));
            } else {
              indexCacheList.get(0).writeToFile(
                mapOutputFile.getOutputIndexFileForWriteInVolume(filename[0]), job);
            }
            sortPhase.complete();
            return;
          }
    
          // read in paged indices
          for (int i = indexCacheList.size(); i < numSpills; ++i) {
            Path indexFileName = mapOutputFile.getSpillIndexFile(i);
            indexCacheList.add(new SpillRecord(indexFileName, job));
          }
    
          //make correction in the length to include the sequence file header
          //lengths for each partition
          finalOutFileSize += partitions * APPROX_HEADER_LENGTH;
          finalIndexFileSize = partitions * MAP_OUTPUT_INDEX_RECORD_LENGTH;
          Path finalOutputFile =
              mapOutputFile.getOutputFileForWrite(finalOutFileSize);
          Path finalIndexFile =
              mapOutputFile.getOutputIndexFileForWrite(finalIndexFileSize);
    
          //The output stream for the final single output file
          FSDataOutputStream finalOut = rfs.create(finalOutputFile, true, 4096);
    
          if (numSpills == 0) {
            //create dummy files
            IndexRecord rec = new IndexRecord();
            SpillRecord sr = new SpillRecord(partitions);
            try {
              for (int i = 0; i < partitions; i++) {
                long segmentStart = finalOut.getPos();
                FSDataOutputStream finalPartitionOut = CryptoUtils.wrapIfNecessary(job, finalOut);
                Writer<K, V> writer =
                  new Writer<K, V>(job, finalPartitionOut, keyClass, valClass, codec, null);
                writer.close();
                rec.startOffset = segmentStart;
                rec.rawLength = writer.getRawLength() + CryptoUtils.cryptoPadding(job);
                rec.partLength = writer.getCompressedLength() + CryptoUtils.cryptoPadding(job);
                sr.putIndex(rec, i);
              }
              sr.writeToFile(finalIndexFile, job);
            } finally {
              finalOut.close();
            }
            sortPhase.complete();
            return;
          }
          {
            sortPhase.addPhases(partitions); // Divide sort phase into sub-phases
            
            IndexRecord rec = new IndexRecord();
            final SpillRecord spillRec = new SpillRecord(partitions);
            for (int parts = 0; parts < partitions; parts++) {
              //create the segments to be merged
              List<Segment<K,V>> segmentList =
                new ArrayList<Segment<K, V>>(numSpills);
              for(int i = 0; i < numSpills; i++) {
                IndexRecord indexRecord = indexCacheList.get(i).getIndex(parts);
    
                Segment<K,V> s =
                  new Segment<K,V>(job, rfs, filename[i], indexRecord.startOffset,
                                   indexRecord.partLength, codec, true);
                segmentList.add(i, s);
    
                if (LOG.isDebugEnabled()) {
                  LOG.debug("MapId=" + mapId + " Reducer=" + parts +
                      "Spill =" + i + "(" + indexRecord.startOffset + "," +
                      indexRecord.rawLength + ", " + indexRecord.partLength + ")");
                }
              }
    
              int mergeFactor = job.getInt(JobContext.IO_SORT_FACTOR, 100);
              // sort the segments only if there are intermediate merges
              boolean sortSegments = segmentList.size() > mergeFactor;
              //merge
              @SuppressWarnings("unchecked")
              RawKeyValueIterator kvIter = Merger.merge(job, rfs,
                             keyClass, valClass, codec,
                             segmentList, mergeFactor,
                             new Path(mapId.toString()),
                             job.getOutputKeyComparator(), reporter, sortSegments,
                             null, spilledRecordsCounter, sortPhase.phase(),
                             TaskType.MAP);
    
              //write merged output to disk
              long segmentStart = finalOut.getPos();
              FSDataOutputStream finalPartitionOut = CryptoUtils.wrapIfNecessary(job, finalOut);
              Writer<K, V> writer =
                  new Writer<K, V>(job, finalPartitionOut, keyClass, valClass, codec,
                                   spilledRecordsCounter);
              if (combinerRunner == null || numSpills < minSpillsForCombine) {
                Merger.writeFile(kvIter, writer, reporter, job);
              } else {
                combineCollector.setWriter(writer);
                combinerRunner.combine(kvIter, combineCollector);
              }
    
              //close
              writer.close();
    
              sortPhase.startNextPhase();
              
              // record offsets
              rec.startOffset = segmentStart;
              rec.rawLength = writer.getRawLength() + CryptoUtils.cryptoPadding(job);
              rec.partLength = writer.getCompressedLength() + CryptoUtils.cryptoPadding(job);
              spillRec.putIndex(rec, parts);
            }
            spillRec.writeToFile(finalIndexFile, job);
            finalOut.close();
            for(int i = 0; i < numSpills; i++) {
              rfs.delete(filename[i],true);
            }
          }
        }
    
    • Path finalOutputFile = mapOutputFile.getOutputFileForWrite(finalOutFileSize);
      Obtains the name of the final merged output file, file.out.

    • Path finalIndexFile = mapOutputFile.getOutputIndexFileForWrite(finalIndexFileSize);

      Obtains the name of the final index file written to disk, file.out.index.

      About this on-disk index file: the data of all partitions ends up in the single file.out, stored partition after partition, but how does a reducer know where a given partition starts inside that file? This is where the index comes in. For each partition there is a triple describing that partition's segment in the file: the start offset, the raw (uncompressed) data length, and the compressed data length; one triple per partition. These index records are first kept in memory; once they no longer fit, subsequent index records are written to a disk file: a local directory with enough space is picked and a file like file.out.index is created in it, which stores not only the index data but also a crc32 checksum. This makes it easy for the reducers to fetch their partition's data out of each map's file.out. A small sketch of this per-partition triple follows the list below.

    • FSDataOutputStream finalOut = rfs.create(finalOutputFile, true, 4096);

      Creates the final output file through rfs. Note that rfs is the raw local file system obtained in init() via FileSystem.getLocal(job).getRaw(), so the merged map output is written to the node's local disk rather than to HDFS.

    • Merger.writeFile(kvIter, writer, reporter, job);

      Writes the merged data of the current partition into the final file.

    • spillRec.writeToFile(finalIndexFile, job);
      Writes the final index records to the index file as well.


    • rfs.delete(filename[i],true);
      Deletes the intermediate spill files once the merge is complete.
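As described above, the content of file.out.index boils down to one (startOffset, rawLength, partLength) triple per partition, which is what the IndexRecord filled in by rec and stored via spillRec.putIndex(rec, parts) carries. Below is a hedged, standalone sketch of that idea: a plain value class plus a lookup, not the actual SpillRecord/IndexRecord implementation, with made-up numbers purely for illustration.

import java.util.ArrayList;
import java.util.List;

// Simplified model of the per-partition index triple kept for file.out.
// The real classes are SpillRecord/IndexRecord in the Hadoop source; this only
// illustrates how a reducer-side fetch could locate its partition's segment.
public class PartitionIndexSketch {

  static class IndexEntry {
    final long startOffset;   // where the partition's segment starts in file.out
    final long rawLength;     // uncompressed length of the segment
    final long partLength;    // on-disk (possibly compressed) length

    IndexEntry(long startOffset, long rawLength, long partLength) {
      this.startOffset = startOffset;
      this.rawLength = rawLength;
      this.partLength = partLength;
    }
  }

  public static void main(String[] args) {
    // hypothetical index for a map output with 3 partitions
    List<IndexEntry> index = new ArrayList<>();
    index.add(new IndexEntry(0L, 4096L, 1024L));
    index.add(new IndexEntry(1024L, 8192L, 2048L));
    index.add(new IndexEntry(3072L, 2048L, 512L));

    int reducePartition = 1; // the partition a given reducer is responsible for
    IndexEntry e = index.get(reducePartition);
    System.out.println("read " + e.partLength + " bytes of file.out starting at offset "
        + e.startOffset + " (raw length " + e.rawLength + ")");
  }
}

The real SpillRecord additionally stores a crc32 checksum alongside these triples, as mentioned above.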
