job.setOutputFormatClass(TextOutputFormat.class);
To study MR's output path we again start from the line above; the default output format is TextOutputFormat. Before digging into TextOutputFormat itself, look first at its top-level parent class, OutputFormat:
/**
* <code>OutputFormat</code> describes the output-specification for a
* Map-Reduce job.
*
* <p>The Map-Reduce framework relies on the <code>OutputFormat</code> of the
* job to:<p>
* <ol>
* <li>
* Validate the output-specification of the job. For e.g. check that the
* output directory doesn't already exist.
* <li>
* Provide the {@link RecordWriter} implementation to be used to write out
* the output files of the job. Output files are stored in a
* {@link FileSystem}.
* </li>
* </ol>
*
* @see RecordWriter
*/
@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class OutputFormat<K, V>
The class comment explains that OutputFormat describes the output specification of a MapReduce job, with two responsibilities: 1. validate the job's output specification, e.g. check that the output directory does not already exist; 2. provide the RecordWriter implementation used to write the job's output files to a FileSystem. Based on experience from earlier posts, this RecordWriter deserves close attention.
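For reference, OutputFormat declares exactly three abstract methods (signatures abridged here from the Hadoop 2.x source, Javadoc omitted):
public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
    throws IOException, InterruptedException;

public abstract void checkOutputSpecs(JobContext context)
    throws IOException, InterruptedException;

public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context)
    throws IOException, InterruptedException;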
Of these, getRecordWriter supplies the RecordWriter for a task; checkOutputSpecs validates the output specification, typically checking whether the output directory already exists; and getOutputCommitter returns the job's OutputCommitter. Saving the most important for last, start with getOutputCommitter, whose concrete implementation lives in FileOutputFormat:
public synchronized
    OutputCommitter getOutputCommitter(TaskAttemptContext context
                                       ) throws IOException {
  if (committer == null) {
    Path output = getOutputPath(context);
    committer = new FileOutputCommitter(output, context);
  }
  return committer;
}
Nothing much happens here: the committer is lazily created as a FileOutputCommitter, and its constructor is not worth dwelling on either, so go straight to the parent class OutputCommitter.
/**
* <code>OutputCommitter</code> describes the commit of task output for a
* Map-Reduce job.
*
* <p>The Map-Reduce framework relies on the <code>OutputCommitter</code> of
* the job to:<p>
* <ol>
* <li>
* Setup the job during initialization. For example, create the temporary
* output directory for the job during the initialization of the job.
* </li>
* <li>
* Cleanup the job after the job completion. For example, remove the
* temporary output directory after the job completion.
* </li>
* <li>
* Setup the task temporary output.
* </li>
* <li>
* Check whether a task needs a commit. This is to avoid the commit
* procedure if a task does not need commit.
* </li>
* <li>
* Commit of the task output.
* </li>
* <li>
* Discard the task commit.
* </li>
* </ol>
* The methods in this class can be called from several different processes and
* from several different contexts. It is important to know which process and
* which context each is called from. Each method should be marked accordingly
* in its documentation. It is also important to note that not all methods are
* guaranteed to be called once and only once. If a method is not guaranteed to
* have this property the output committer needs to handle this appropriately.
* Also note it will only be in rare situations where they may be called
* multiple times for the same task.
*
* @see org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
* @see JobContext
* @see TaskAttemptContext
*/
@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class OutputCommitter
As the comment makes clear, this class describes everything involved in committing task output: setting up and cleaning up the job, setting up the task's temporary output, checking whether a task needs a commit at all, committing the task output, and discarding a task commit.
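Those six responsibilities map directly onto the class's methods (signatures abridged from the Hadoop source; note that in current releases job cleanup is done by commitJob, which superseded the deprecated cleanupJob):
public abstract void setupJob(JobContext jobContext) throws IOException;

public void commitJob(JobContext jobContext) throws IOException { /* ... */ }

public abstract void setupTask(TaskAttemptContext taskContext)
    throws IOException;

public abstract boolean needsTaskCommit(TaskAttemptContext taskContext)
    throws IOException;

public abstract void commitTask(TaskAttemptContext taskContext)
    throws IOException;

public abstract void abortTask(TaskAttemptContext taskContext)
    throws IOException;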
Next comes checkOutputSpecs. Its concrete implementation (again in FileOutputFormat) is very simple and just performs a few checks on the output directory:
public void checkOutputSpecs(JobContext job
                             ) throws FileAlreadyExistsException, IOException{
  // Ensure that the output directory is set and not already there
  Path outDir = getOutputPath(job);
  if (outDir == null) {
    throw new InvalidJobConfException("Output directory not set.");
  }

  // get delegation token for outDir's file system
  TokenCache.obtainTokensForNamenodes(job.getCredentials(),
      new Path[] { outDir }, job.getConfiguration());

  if (outDir.getFileSystem(job.getConfiguration()).exists(outDir)) {
    throw new FileAlreadyExistsException("Output directory " + outDir +
                                         " already exists");
  }
}
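One practical consequence: if the output directory is left over from a previous run, the job fails before it even starts. A common driver-side pattern is therefore to delete the stale directory before submitting. A minimal sketch (the class name and path handling are illustrative, not part of the Hadoop source):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OutputDirCleanup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "demo");
    Path outDir = new Path(args[0]);            // output directory from the command line
    FileSystem fs = outDir.getFileSystem(conf);
    if (fs.exists(outDir)) {
      fs.delete(outDir, true);                  // recursive delete, so checkOutputSpecs passes
    }
    FileOutputFormat.setOutputPath(job, outDir);
    // ... set input path, mapper, reducer, output types, then submit
  }
}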
Finally, look at getRecordWriter; TextOutputFormat provides the concrete implementation:
public RecordWriter<K, V>
       getRecordWriter(TaskAttemptContext job
                       ) throws IOException, InterruptedException {
  Configuration conf = job.getConfiguration();
  boolean isCompressed = getCompressOutput(job);
  String keyValueSeparator = conf.get(SEPERATOR, "\t"); // separator between key and value
  CompressionCodec codec = null;
  String extension = "";
  if (isCompressed) { // compression enabled?
    Class<? extends CompressionCodec> codecClass =
        getOutputCompressorClass(job, GzipCodec.class);
    codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
    extension = codec.getDefaultExtension();
  }
  Path file = getDefaultWorkFile(job, extension);
  FileSystem fs = file.getFileSystem(conf);
  if (!isCompressed) { // uncompressed: write straight to the file
    FSDataOutputStream fileOut = fs.create(file, false);
    return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
  } else { // compressed: wrap the stream in the codec's output stream
    FSDataOutputStream fileOut = fs.create(file, false);
    return new LineRecordWriter<K, V>(new DataOutputStream
                                      (codec.createOutputStream(fileOut)),
                                      keyValueSeparator);
  }
}
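Both branches are driven by driver-side configuration. A sketch of the relevant knobs (the separator key name below matches the SEPERATOR constant in Hadoop 2.x's TextOutputFormat; older releases used "mapred.textoutputformat.separator", so verify against your version):
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TextOutputConfigDemo {
  // Illustrative driver-side configuration feeding getRecordWriter above.
  static void configure(Job job) {
    // read via conf.get(SEPERATOR, "\t"); the default is a tab
    job.getConfiguration()
       .set("mapreduce.output.textoutputformat.separator", ",");
    FileOutputFormat.setCompressOutput(job, true);  // makes getCompressOutput(job) return true
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
  }
}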
</pre><pre name="code" class="java">
</pre><pre name="code" class="java">
Now for the actual writing. Setting up the output stream was already analyzed in the earlier HDFS posts, so skip that and focus on LineRecordWriter. Its parent class is RecordWriter:
/**
* <code>RecordWriter</code> writes the output <key, value> pairs
* to an output file.
* <p><code>RecordWriter</code> implementations write the job outputs to the
* {@link FileSystem}.
*
* @see OutputFormat
*/
@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class RecordWriter<K, V>
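RecordWriter itself declares just two abstract methods, write and close (abridged from the Hadoop source):
public abstract void write(K key, V value)
    throws IOException, InterruptedException;

public abstract void close(TaskAttemptContext context)
    throws IOException, InterruptedException;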
As the comment says, this class is responsible for writing the key/value pairs to the output file; the method to focus on is write. Here is LineRecordWriter's implementation:
public synchronized void write(K key, V value)
    throws IOException {

  boolean nullKey = key == null || key instanceof NullWritable;
  boolean nullValue = value == null || value instanceof NullWritable;
  if (nullKey && nullValue) { // nothing at all to write
    return;
  }
  if (!nullKey) {
    writeObject(key);
  }
  if (!(nullKey || nullValue)) { // separator only when both sides are present
    out.write(keyValueSeparator);
  }
  if (!nullValue) {
    writeObject(value);
  }
  out.write(newline); // every record ends with a newline
}
The method is straightforward: a null or NullWritable key or value is simply skipped, the separator is only written when both sides are present, and every call ends the record with a newline.
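To see those rules in action, a hypothetical harness can drive LineRecordWriter against an in-memory stream. Since LineRecordWriter is a protected nested class of TextOutputFormat, the demo extends it for visibility; this is a sketch for experimentation, not production code:
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LineRecordWriterDemo extends TextOutputFormat<Object, Object> {
  public static void main(String[] args) throws Exception {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    LineRecordWriter<Object, Object> writer =
        new LineRecordWriter<Object, Object>(new DataOutputStream(bytes), "\t");

    writer.write(new Text("hello"), new IntWritable(3));  // emits "hello\t3\n"
    writer.write(new Text("hello"), NullWritable.get());  // emits "hello\n", separator skipped
    writer.write(NullWritable.get(), NullWritable.get()); // emits nothing

    System.out.print(bytes.toString("UTF-8"));
  }
}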