job.setOutputFormatClass(TextOutputFormat.class);
To study MR's output path we again start from the line above; the default output format is TextOutputFormat. Before digging into TextOutputFormat itself, look first at its top-level parent class, OutputFormat:
/**
* <code>OutputFormat</code> describes the output-specification for a
* Map-Reduce job.
*
* <p>The Map-Reduce framework relies on the <code>OutputFormat</code> of the
* job to:<p>
* <ol>
* <li>
* Validate the output-specification of the job. For e.g. check that the
* output directory doesn't already exist.
* <li>
* Provide the {@link RecordWriter} implementation to be used to write out
* the output files of the job. Output files are stored in a
* {@link FileSystem}.
* </li>
* </ol>
*
* @see RecordWriter
*/
@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class OutputFormat<K, V>
The class comment explains that OutputFormat describes the output specification of a MapReduce job, with two responsibilities: 1. validate the job's output specification, e.g. check that the output directory does not already exist; 2. provide the RecordWriter implementation used to write the job's output files to a FileSystem. Based on experience from earlier posts, this RecordWriter deserves close attention.
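For reference, OutputFormat declares exactly three abstract methods (signatures abridged here from the Hadoop 2.x source, Javadoc omitted):
public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
    throws IOException, InterruptedException;

public abstract void checkOutputSpecs(JobContext context)
    throws IOException, InterruptedException;

public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context)
    throws IOException, InterruptedException;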
Of these, getRecordWriter supplies the RecordWriter for a task; checkOutputSpecs validates the output specification, typically checking whether the output directory already exists; and getOutputCommitter returns the job's OutputCommitter. Saving the most important for last, start with getOutputCommitter, whose concrete implementation lives in FileOutputFormat:
public synchronized
    OutputCommitter getOutputCommitter(TaskAttemptContext context
                                       ) throws IOException {
  if (committer == null) {
    Path output = getOutputPath(context);
    committer = new FileOutputCommitter(output, context);
  }
  return committer;
}
Nothing much happens here: the committer is lazily created as a FileOutputCommitter, and its constructor is not worth dwelling on either, so go straight to the parent class OutputCommitter.
/**
* <code>OutputCommitter</code> describes the commit of task output for a
* Map-Reduce job.
*
* <p>The Map-Reduce framework relies on the <code>OutputCommitter</code> of
* the job to:<p>
* <ol>
* <li>
* Setup the job during initialization. For example, create the temporary
* output directory for the job during the initialization of the job.
* </li>
* <li>
* Cleanup the job after the job completion. For example, remove the
* temporary output directory after the job completion.
* </li>
* <li>
* Setup the task temporary output.
* </li>
* <li>
* Check whether a task needs a commit. This is to avoid the commit
* procedure if a task does not need commit.
* </li>
* <li>
* Commit of the task output.
* </li>
* <li>
* Discard the task commit.
* </li>
* </ol>
* The methods in this class can be called from several different processes and
* from several different contexts. It is important to know which process and
* which context each is called from. Each method should be marked accordingly
* in its documentation. It is also important to note that not all methods are
* guaranteed to be called once and only once. If a method is not guaranteed to
* have this property the output committer needs to handle this appropriately.
* Also note it will only be in rare situations where they may be called
* multiple times for the same task.
*
* @see org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
* @see JobContext
* @see TaskAttemptContext
*/
@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class OutputCommitter
As the comment makes clear, this class describes everything involved in committing task output: setting up and cleaning up the job, setting up the task's temporary output, checking whether a task needs a commit at all, committing the task output, and discarding a task commit.
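Those six responsibilities map directly onto the class's methods (signatures abridged from the Hadoop source; note that in current releases job cleanup is done by commitJob, which superseded the deprecated cleanupJob):
public abstract void setupJob(JobContext jobContext) throws IOException;

public void commitJob(JobContext jobContext) throws IOException { /* ... */ }

public abstract void setupTask(TaskAttemptContext taskContext)
    throws IOException;

public abstract boolean needsTaskCommit(TaskAttemptContext taskContext)
    throws IOException;

public abstract void commitTask(TaskAttemptContext taskContext)
    throws IOException;

public abstract void abortTask(TaskAttemptContext taskContext)
    throws IOException;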
Next comes checkOutputSpecs. Its concrete implementation (again in FileOutputFormat) is very simple and just performs a few checks on the output directory:
public void checkOutputSpecs(JobContext job
                             ) throws FileAlreadyExistsException, IOException{
  // Ensure that the output directory is set and not already there
  Path outDir = getOutputPath(job);
  if (outDir == null) {
    throw new InvalidJobConfException("Output directory not set.");
  }

  // get delegation token for outDir's file system
  TokenCache.obtainTokensForNamenodes(job.getCredentials(),
      new Path[] { outDir }, job.getConfiguration());

  if (outDir.getFileSystem(job.getConfiguration()).exists(outDir)) {
    throw new FileAlreadyExistsException("Output directory " + outDir +
                                         " already exists");
  }
}
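One practical consequence: if the output directory is left over from a previous run, the job fails before it even starts. A common driver-side pattern is therefore to delete the stale directory before submitting. A minimal sketch (the class name and path handling are illustrative, not part of the Hadoop source):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OutputDirCleanup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "demo");
    Path outDir = new Path(args[0]);            // output directory from the command line
    FileSystem fs = outDir.getFileSystem(conf);
    if (fs.exists(outDir)) {
      fs.delete(outDir, true);                  // recursive delete, so checkOutputSpecs passes
    }
    FileOutputFormat.setOutputPath(job, outDir);
    // ... set input path, mapper, reducer, output types, then submit
  }
}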
Finally, look at getRecordWriter; TextOutputFormat provides the concrete implementation:
public RecordWriter<K, V>
       getRecordWriter(TaskAttemptContext job
                       ) throws IOException, InterruptedException {
  Configuration conf = job.getConfiguration();
  boolean isCompressed = getCompressOutput(job);
  String keyValueSeparator = conf.get(SEPERATOR, "\t"); // separator between key and value
  CompressionCodec codec = null;
  String extension = "";
  if (isCompressed) { // compression enabled?
    Class<? extends CompressionCodec> codecClass =
        getOutputCompressorClass(job, GzipCodec.class);
    codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
    extension = codec.getDefaultExtension();
  }
  Path file = getDefaultWorkFile(job, extension);
  FileSystem fs = file.getFileSystem(conf);
  if (!isCompressed) { // uncompressed: write straight to the file
    FSDataOutputStream fileOut = fs.create(file, false);
    return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
  } else { // compressed: wrap the stream in the codec's output stream
    FSDataOutputStream fileOut = fs.create(file, false);
    return new LineRecordWriter<K, V>(new DataOutputStream
                                      (codec.createOutputStream(fileOut)),
                                      keyValueSeparator);
  }
}
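Both branches are driven by driver-side configuration. A sketch of the relevant knobs (the separator key name below matches the SEPERATOR constant in Hadoop 2.x's TextOutputFormat; older releases used "mapred.textoutputformat.separator", so verify against your version):
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TextOutputConfigDemo {
  // Illustrative driver-side configuration feeding getRecordWriter above.
  static void configure(Job job) {
    // read via conf.get(SEPERATOR, "\t"); the default is a tab
    job.getConfiguration()
       .set("mapreduce.output.textoutputformat.separator", ",");
    FileOutputFormat.setCompressOutput(job, true);  // makes getCompressOutput(job) return true
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
  }
}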
</pre><pre name="code" class="java">
</pre><pre name="code" class="java">
Now for the actual writing. Setting up the output stream was already analyzed in the earlier HDFS posts, so skip that and focus on LineRecordWriter. Its parent class is RecordWriter:
/**
* <code>RecordWriter</code> writes the output <key, value> pairs
* to an output file.
* <p><code>RecordWriter</code> implementations write the job outputs to the
* {@link FileSystem}.
*
* @see OutputFormat
*/
@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class RecordWriter<K, V>
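RecordWriter itself declares just two abstract methods, write and close (abridged from the Hadoop source):
public abstract void write(K key, V value)
    throws IOException, InterruptedException;

public abstract void close(TaskAttemptContext context)
    throws IOException, InterruptedException;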
As the comment says, this class is responsible for writing the key/value pairs to the output file; the method to focus on is write. Here is LineRecordWriter's implementation:
public synchronized void write(K key, V value)
    throws IOException {

  boolean nullKey = key == null || key instanceof NullWritable;
  boolean nullValue = value == null || value instanceof NullWritable;
  if (nullKey && nullValue) { // nothing at all to write
    return;
  }
  if (!nullKey) {
    writeObject(key);
  }
  if (!(nullKey || nullValue)) { // separator only when both sides are present
    out.write(keyValueSeparator);
  }
  if (!nullValue) {
    writeObject(value);
  }
  out.write(newline); // every record ends with a newline
}
The method is straightforward: a null or NullWritable key or value is simply skipped, the separator is only written when both sides are present, and every call ends the record with a newline.
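To see those rules in action, a hypothetical harness can drive LineRecordWriter against an in-memory stream. Since LineRecordWriter is a protected nested class of TextOutputFormat, the demo extends it for visibility; this is a sketch for experimentation, not production code:
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LineRecordWriterDemo extends TextOutputFormat<Object, Object> {
  public static void main(String[] args) throws Exception {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    LineRecordWriter<Object, Object> writer =
        new LineRecordWriter<Object, Object>(new DataOutputStream(bytes), "\t");

    writer.write(new Text("hello"), new IntWritable(3));  // emits "hello\t3\n"
    writer.write(new Text("hello"), NullWritable.get());  // emits "hello\n", separator skipped
    writer.write(NullWritable.get(), NullWritable.get()); // emits nothing

    System.out.print(bytes.toString("UTF-8"));
  }
}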