MapReduce Principles and Source Code Walkthrough (Part 1)


MapReduce is an open-source big data computation framework from the Apache Software Foundation (ASF). Although raw MR programs are rarely written by hand these days, MapReduce is still the core idea behind large-scale batch computation. This article focuses on the principles of the MapReduce computation layer (leaving the YARN layer aside for now).

MapReduce involves three participants:

  • Client
  • MapTask
  • ReduceTask

Computation flow: the Client submits the configuration and the split list —> each MapTask calls the map function once per record in its split —> each ReduceTask pulls back the part of every MapTask's output that belongs to its own partition (the shuffle) and calls the reduce function to produce the final result.
[Figure: MapReduce process overview]

Client

The client-side code looks like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // basic configuration
        Configuration conf = new Configuration(true);
        Job job = Job.getInstance(conf);

        job.setJarByClass(WordCount.class);
        job.setJobName("wc");
        // input file location
        Path infile = new Path("/input/wordcount");
        // the input is plain text, so the InputFormat implementation is TextInputFormat
        TextInputFormat.addInputPath(job,infile);

        Path outfile = new Path("/out/wordcount/output");
        if(outfile.getFileSystem(conf).exists(outfile)){
            outfile.getFileSystem(conf).delete(outfile,true);
        }
        TextOutputFormat.setOutputPath(job,outfile);
        // set the mapper, reducer and the output key/value types
        job.setMapperClass(MyMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setReducerClass(MyReducer.class);
        
        // submit the job and wait for it to finish
        job.waitForCompletion(true);
    }
}
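
The driver above references MyMapper and MyReducer without showing them. A minimal word-count sketch of what they could look like (my assumption, matching the Text/IntWritable output types set in the driver) is:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: the original post does not show these classes.
// The mapper emits (word, 1) for every line handed over by LineRecordReader.
public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// The reducer sums the counts of each word pulled back during the shuffle.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}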

Clearly the job only really starts with Job's waitForCompletion() method, so that is where we begin.

/**
   * Submit the job to the cluster and wait for it to finish.
   * @param verbose print the progress to the user
   * @return true if the job succeeded
   * @throws IOException thrown if the communication with the 
   *         <code>JobTracker</code> is lost
   */
  public boolean waitForCompletion(boolean verbose
                                   ) throws IOException, InterruptedException,
                                            ClassNotFoundException {
    if (state == JobState.DEFINE) {
      submit();
    }
    if (verbose) {
      monitorAndPrintJob();
    } else {
      // get the completion poll interval from the client.
      int completionPollIntervalMillis = 
        Job.getCompletionPollInterval(cluster.getConf());
      while (!isComplete()) {
        try {
          Thread.sleep(completionPollIntervalMillis);
        } catch (InterruptedException ie) {
        }
      }
    }
    return isSuccessful();
  }

waitForCompletion() takes a verbose parameter that controls whether progress is printed; the submit() method is what actually submits the job:

/**
   * Submit the job to the cluster and return immediately.
   * @throws IOException
   */
  public void submit() 
         throws IOException, InterruptedException, ClassNotFoundException {
    ensureState(JobState.DEFINE);
    setUseNewAPI();
    connect();
    final JobSubmitter submitter = 
        getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
    status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
      public JobStatus run() throws IOException, InterruptedException, 
      ClassNotFoundException {
        return submitter.submitJobInternal(Job.this, cluster);
      }
    });
    state = JobState.RUNNING;
    LOG.info("The url to track the job: " + getTrackingURL());
   }

Inside submit(), once the connection is established, the job is actually submitted to the cluster through submitJobInternal().
Moving into the JobSubmitter class, submitJobInternal() is its core method; the source lists its main responsibilities as:

  • checking the job's input and output specifications
  • computing the input splits
  • setting up the DistributedCache for the job, if necessary
  • copying the job's jar and configuration to a directory on the distributed file system
  • submitting the job to the JobTracker

Here we mainly want to see how submitJobInternal() computes the splits. Inside submitJobInternal(), the split computation is delegated to writeSplits():

// Create the splits for the job
      LOG.debug("Creating splits at " + jtFs.makeQualified(submitJobDir));
      int maps = writeSplits(job, submitJobDir);
      conf.setInt(MRJobConfig.NUM_MAPS, maps);
      LOG.info("number of splits:" + maps);

and writeSplits() in turn calls writeNewSplits():

private <T extends InputSplit>
  int writeNewSplits(JobContext job, Path jobSubmitDir) throws IOException,
      InterruptedException, ClassNotFoundException {
    Configuration conf = job.getConfiguration();
    InputFormat<?, ?> input =
      ReflectionUtils.newInstance(job.getInputFormatClass(), conf);

    List<InputSplit> splits = input.getSplits(job);
    T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);

    // sort the splits into order based on size, so that the biggest
    // go first
    Arrays.sort(array, new SplitComparator());
    JobSplitWriter.createSplitFiles(jobSubmitDir, conf, 
        jobSubmitDir.getFileSystem(conf), array);
    return array.length;
  }

writeNewSplits() returns the length of an array of InputSplit objects, and those InputSplits come from calling getSplits() on input. Here input is a concrete subclass of the abstract class InputFormat: ReflectionUtils.newInstance() reflectively instantiates the TextInputFormat used by our client code. TextInputFormat extends FileInputFormat (a subclass of InputFormat) and is the InputFormat implementation for plain text input: files are broken into lines, the key is the position of the line within the file, and the value is the line itself.
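
Incidentally, the WordCount driver above never calls setInputFormatClass(), so getInputFormatClass() falls back to its default, which is exactly TextInputFormat. If you want to make the choice explicit (or swap in a different InputFormat), one extra line in the driver is enough; this is my own addition, not part of the original driver:

// in WordCount.main(), next to the other job.set...() calls
// TextInputFormat is already the default; this only makes it explicit,
// and it is what writeNewSplits() reads back via job.getInputFormatClass()
job.setInputFormatClass(TextInputFormat.class);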

Looking inside TextInputFormat, we see that it overrides InputFormat's createRecordReader() method and returns LineRecordReader, a concrete RecordReader implementation. The key and value mentioned above are what this LineRecordReader produces: getCurrentKey() and getCurrentValue() return them, and they are actually assigned inside nextKeyValue(), which also reports whether another key/value pair exists. In other words, LineRecordReader is the class that does the real work: it hands every record of the split to the MapTask as a key/value pair (see the Mapper.run() sketch after the TextInputFormat excerpt below).

/** An {@link InputFormat} for plain text files.  Files are broken into lines.
 * Either linefeed or carriage-return are used to signal end of line.  Keys are
 * the position in the file, and values are the line of text.. */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> 
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
    String delimiter = context.getConfiguration().get(
        "textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
    return new LineRecordReader(recordDelimiterBytes);
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    final CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    if (null == codec) {
      return true;
    }
    return codec instanceof SplittableCompressionCodec;
  }

}
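
createRecordReader() above is what supplies the LineRecordReader. For reference, the run() loop of org.apache.hadoop.mapreduce.Mapper (shown here slightly abridged, from memory) is where those key/value calls end up: the context delegates nextKeyValue(), getCurrentKey() and getCurrentValue() to the RecordReader, and every record triggers exactly one map() call.

// org.apache.hadoop.mapreduce.Mapper#run (abridged): one map() call per record
// served by the RecordReader behind the context.
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
}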

Now look at LineRecordReader's initialize() method. It takes an InputSplit and reads the three essential pieces of information from it: start, length and path.
Scrolling down, we see that (for uncompressed input) initialize() calls fileIn.seek(start), which positions the reader's cursor at the beginning of the split.

public void initialize(InputSplit genericSplit,
                         TaskAttemptContext context) throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
    start = split.getStart();
    end = start + split.getLength();
    final Path file = split.getPath();

    // open the file and seek to the start of the split
    final FileSystem fs = file.getFileSystem(job);
    fileIn = fs.open(file);
    
    CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
    if (null!=codec) {
      isCompressedInput = true;	
      decompressor = CodecPool.getDecompressor(codec);
      if (codec instanceof SplittableCompressionCodec) {
        final SplitCompressionInputStream cIn =
          ((SplittableCompressionCodec)codec).createInputStream(
            fileIn, decompressor, start, end,
            SplittableCompressionCodec.READ_MODE.BYBLOCK);
        in = new CompressedSplitLineReader(cIn, job,
            this.recordDelimiterBytes);
        start = cIn.getAdjustedStart();
        end = cIn.getAdjustedEnd();
        filePosition = cIn;
      } else {
        in = new SplitLineReader(codec.createInputStream(fileIn,
            decompressor), job, this.recordDelimiterBytes);
        filePosition = fileIn;
      }
    } else {
      fileIn.seek(start);
      in = new UncompressedSplitLineReader(
          fileIn, job, this.recordDelimiterBytes, split.getLength());
      filePosition = fileIn;
    }
    // If this is not the first split, we always throw away first record
    // because we always (except the last split) read one extra line in
    // next() method.
    if (start != 0) {
      start += in.readLine(new Text(), 0, maxBytesToConsume(start));
    }
    this.pos = start;
  }

Now recall that when the file was cut into splits, there was a good chance that a single line ended up split across two splits.
Yet if you have run the demo you know the final result is still correct, and that is because the following lines of code put the data back together:

	// If this is not the first split, we always throw away first record
    // because we always (except the last split) read one extra line in
    // next() method.
    if (start != 0) {
      start += in.readLine(new Text(), 0, maxBytesToConsume(start));
    }

The source comment already says it: from the second split onward, each reader gives up the first record of its split, because the previous LineRecordReader (every split except the last) reads one extra line past its own end.
Looking at the code, in.readLine() reads the data into a throwaway new Text() object that will soon be garbage collected, so the intent is clearly not to keep the data but to use the return value to advance start to the beginning of the second record: if start initially landed in the middle of a line, say at "llo", it is moved forward to the next record, "hello". The whole process is automatic: LineRecordReader reads lines delimited by the newline character, so if it has read "he" without hitting a newline, it uses the split information the client submitted to locate the next split and keeps reading that split's first record until it reaches a newline (at which point the next split's start is adjusted as in the code above). In practice this means a small amount of data still moves across the cluster.
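
To make the boundary handling concrete, here is a toy simulation in plain Java. It is my own illustration, not Hadoop code, and the data and split size are made up: every split except the first discards its partial first line, while every split keeps reading lines whose start lies at or before its end, so each line is produced exactly once.

// Toy simulation of LineRecordReader's boundary rule (not Hadoop code).
public class SplitBoundaryDemo {
    public static void main(String[] args) {
        String data = "hello world\nhello hadoop\nbye\n"; // made-up input
        int splitSize = 10;                               // deliberately cuts lines mid-word

        for (int start = 0; start < data.length(); start += splitSize) {
            int end = Math.min(start + splitSize, data.length());
            int pos = start;
            if (start != 0) {
                // not the first split: throw away the partial first line,
                // the previous split is responsible for it
                pos = data.indexOf('\n', start) + 1;
            }
            StringBuilder records = new StringBuilder();
            // keep reading whole lines as long as the line starts at or before 'end'
            while (pos < data.length() && pos <= end) {
                int nl = data.indexOf('\n', pos);
                if (nl < 0) nl = data.length();
                records.append('[').append(data, pos, nl).append("] ");
                pos = nl + 1;
            }
            System.out.println("split [" + start + ", " + end + "): " + records);
        }
    }
}

Each of the three lines comes out of exactly one split, even though the 10-byte boundaries cut through the middle of words.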

Now we know how a split is read and how the broken line is recovered, but where does the split itself come from? Going back to TextInputFormat, we find that it does not implement getSplits() itself, which means it reuses the getSplits() method of its parent class FileInputFormat:

/** 
   * Generate the list of files and make them into FileSplits.
   * @param job the job context
   * @throws IOException
   */
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    Stopwatch sw = new Stopwatch().start();
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);

    // generate splits
    List<InputSplit> splits = new ArrayList<InputSplit>();
    List<FileStatus> files = listStatus(job);
    for (FileStatus file: files) {
      Path path = file.getPath();
      long length = file.getLen();
      if (length != 0) {
        BlockLocation[] blkLocations;
        if (file instanceof LocatedFileStatus) {
          blkLocations = ((LocatedFileStatus) file).getBlockLocations();
        } else {
          FileSystem fs = path.getFileSystem(job.getConfiguration());
          blkLocations = fs.getFileBlockLocations(file, 0, length);
        }
        if (isSplitable(job, path)) {
          long blockSize = file.getBlockSize();
          long splitSize = computeSplitSize(blockSize, minSize, maxSize);

          long bytesRemaining = length;
          while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
            int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
            splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                        blkLocations[blkIndex].getHosts(),
                        blkLocations[blkIndex].getCachedHosts()));
            bytesRemaining -= splitSize;
          }

          if (bytesRemaining != 0) {
            int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
            splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                       blkLocations[blkIndex].getHosts(),
                       blkLocations[blkIndex].getCachedHosts()));
          }
        } else { // not splitable
          splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
                      blkLocations[0].getCachedHosts()));
        }
      } else { 
        //Create empty hosts array for zero length files
        splits.add(makeSplit(path, 0, length, new String[0]));
      }
    }
    // Save the number of input files for metrics/loadgen
    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
    sw.stop();
    if (LOG.isDebugEnabled()) {
      LOG.debug("Total # of splits generated by getSplits: " + splits.size()
          + ", TimeTaken: " + sw.elapsedMillis());
    }
    return splits;
  }

Each input file has a corresponding FileStatus object, which holds the client-side information about that file:

/** Interface that represents the client side information for a file.
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class FileStatus implements Writable, Comparable {

  private Path path;
  private long length;
  private boolean isdir;
  private short block_replication;
  private long blocksize;
  private long modification_time;
  private long access_time;
  private FsPermission permission;
  private String owner;
  private String group;
  private Path symlink;

The listStatus() method in FileInputFormat returns a List of FileStatus objects; getSplits() iterates over this list, fetches each file's blocks and turns them into splits. Inside the loop it maintains a BlockLocation[] array that stores metadata for every block of the file, such as its physical location and host names:

/**
 * Represents the network location of a block, information about the hosts
 * that contain block replicas, and other block metadata (E.g. the file
 * offset associated with the block, length, whether it is corrupt, etc).
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class BlockLocation {
  private String[] hosts; // Datanode hostnames
  private String[] cachedHosts; // Datanode hostnames with a cached replica
  private String[] names; // Datanode IP:xferPort for accessing the block
  private String[] topologyPaths; // Full path name in network topology
  private long offset;  // Offset of the block in the file
  private long length;
  private boolean corrupt;

  private static final String[] EMPTY_STR_ARRAY = new String[0];

getSplits() then calls getBlockIndex() to find the index into BlockLocation[]. The offset passed in is length - bytesRemaining, i.e. the offset of the split rather than of a block, so the index is chosen with the inequality blkLocations[i].offset <= offset < blkLocations[i].offset + blkLocations[i].length; the i that satisfies it is the index of the block containing the start of that split.

protected int getBlockIndex(BlockLocation[] blkLocations, 
                              long offset) {
    for (int i = 0 ; i < blkLocations.length; i++) {
      // is the offset inside this block?
      if ((blkLocations[i].getOffset() <= offset) &&
          (offset < blkLocations[i].getOffset() + blkLocations[i].getLength())){
        return i;
      }
    }
    BlockLocation last = blkLocations[blkLocations.length -1];
    long fileLength = last.getOffset() + last.getLength() -1;
    throw new IllegalArgumentException("Offset " + offset + 
                                       " is outside of file (0.." +
                                       fileLength + ")");
  }
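
A quick worked example with made-up numbers shows how the inequality picks the index. Suppose the file has three 128 MB blocks and the current split's offset (length - bytesRemaining) is 200 MB; the same check as in getBlockIndex(), restated against plain arrays so it can run standalone, picks block 1:

// Toy restatement of the getBlockIndex() inequality with invented numbers.
public class BlockIndexDemo {
    public static void main(String[] args) {
        long blockLen = 128L << 20;                        // 128 MB blocks (made up)
        long[] blockOffsets = {0, 128L << 20, 256L << 20}; // offsets of blocks 0, 1, 2
        long splitOffset = 200L << 20;                     // length - bytesRemaining = 200 MB

        for (int i = 0; i < blockOffsets.length; i++) {
            // same test as getBlockIndex(): offset inside [blockStart, blockStart + blockLen)
            if (blockOffsets[i] <= splitOffset && splitOffset < blockOffsets[i] + blockLen) {
                System.out.println("The split starting at 200 MB lies in block index " + i); // prints 1
                break;
            }
        }
    }
}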

The splitSize is computed by computeSplitSize() as max(minSize, min(maxSize, blockSize)). So to make the splits larger, raise minSize above the block size; to make them smaller, lower maxSize below it.

protected long computeSplitSize(long blockSize, long minSize,
                                  long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }
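
In practice the driver can steer this computation through FileInputFormat's helper methods (which set mapreduce.input.fileinputformat.split.minsize / .maxsize). The sketch below is my own and the sizes are arbitrary; tune only one direction at a time:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeTuning {
    // Call from the driver (e.g. WordCount.main) before submitting the job.
    // splitSize = max(minSize, min(maxSize, blockSize)), so:

    static void makeSplitsLarger(Job job) {
        // raise minSize above the block size (e.g. 256 MB against a 128 MB block)
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
    }

    static void makeSplitsSmaller(Job job) {
        // lower maxSize below the block size (e.g. 64 MB against a 128 MB block)
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}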

Note that all of this assumes the input file is splittable (isSplitable() returns true); for stream-compressed input a different path is taken, which requires a closer reading of the source.

Next comes a loop: as long as the remaining bytes amount to more than SPLIT_SLOP (1.1) split sizes, a new split is appended to the list (see the excerpt below). Each makeSplit() call needs five pieces of information about the split:

  • the path
  • the offset (length - bytesRemaining)
  • the length (splitSize)
  • the hosts
  • the cachedHosts

while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
  int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
  splits.add(makeSplit(path, length-bytesRemaining, splitSize,
              blkLocations[blkIndex].getHosts(),
              blkLocations[blkIndex].getCachedHosts()));
  bytesRemaining -= splitSize;
}

With that, the splits have all been produced. Together with the earlier walk-through of how LineRecordReader reads a split, this wraps up the client's main work. In the next article we will look at the MapTask workflow.
