MapReduce: Principles and a Brief Look at the Source Code
MapReduce is the open-source big-data computation framework from the Apache Software Foundation (ASF). Although MR programs are rarely written by hand in today's development work, MR is still the core idea behind big-data computation. This article analyzes the principles of the MapReduce computation layer (leaving the YARN layer aside for now).
MapReduce involves three stages:
- Client
- MapTask
- ReduceTask
Computation flow: the Client submits the configuration and the split list -> each MapTask calls the map function once for every record in its split -> each ReduceTask pulls back the part of the MapTasks' output that belongs to its own partition (shuffle) and calls the reduce function to produce the final result.
Client
The client code is shown below:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
public static void main(String[] args) throws Exception {
// Basic configuration
Configuration conf = new Configuration(true);
Job job = Job.getInstance(conf);
job.setJarByClass(WordCount.class);
job.setJobName("wc");
// Input file location
Path infile = new Path("/input/wordcount");
// The input is plain text, so the InputFormat implementation used is TextInputFormat
TextInputFormat.addInputPath(job,infile);
Path outfile = new Path("/out/wordcount/output");
if(outfile.getFileSystem(conf).exists(outfile)){
outfile.getFileSystem(conf).delete(outfile,true);
}
TextOutputFormat.setOutputPath(job,outfile);
// Set the mapper/reducer classes and the output key/value types
job.setMapperClass(MyMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setReducerClass(MyReducer.class);
// Submit the job and wait for it to complete
job.waitForCompletion(true);
}
}
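The driver above references MyMapper and MyReducer but does not show them. A minimal word-count pair, assuming they live in the same package as WordCount (this is a sketch, not code from the original post), might look like this:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
// Mapper: for every line (offset, text) emit (word, 1)
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}
// Reducer: sum the 1s for each word
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}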
As you can see, the job really gets going in Job's waitForCompletion() method, so let us start there.
/**
* Submit the job to the cluster and wait for it to finish.
* @param verbose print the progress to the user
* @return true if the job succeeded
* @throws IOException thrown if the communication with the
* <code>JobTracker</code> is lost
*/
public boolean waitForCompletion(boolean verbose
) throws IOException, InterruptedException,
ClassNotFoundException {
if (state == JobState.DEFINE) {
submit();
}
if (verbose) {
monitorAndPrintJob();
} else {
// get the completion poll interval from the client.
int completionPollIntervalMillis =
Job.getCompletionPollInterval(cluster.getConf());
while (!isComplete()) {
try {
Thread.sleep(completionPollIntervalMillis);
} catch (InterruptedException ie) {
}
}
}
return isSuccessful();
}
waitForCompletion() takes a verbose parameter that controls whether progress is printed to the user; the actual submission is done by submit().
/**
* Submit the job to the cluster and return immediately.
* @throws IOException
*/
public void submit()
throws IOException, InterruptedException, ClassNotFoundException {
ensureState(JobState.DEFINE);
setUseNewAPI();
connect();
final JobSubmitter submitter =
getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
public JobStatus run() throws IOException, InterruptedException,
ClassNotFoundException {
return submitter.submitJobInternal(Job.this, cluster);
}
});
state = JobState.RUNNING;
LOG.info("The url to track the job: " + getTrackingURL());
}
Inside submit(), after the connection is established, the job is actually handed to the cluster by calling submitJobInternal().
Moving into the JobSubmitter class, submitJobInternal() is its core method. According to the comments in the source, its main responsibilities are:
- Checking the input and output specifications of the job
- Computing the InputSplits for the job
- Setting up the DistributedCache for the job, if necessary
- Copying the job's jar and configuration to the MapReduce system directory on the distributed file system
- Submitting the job to the JobTracker
Here we mainly look at how submitJobInternal() computes the splits. Inside submitJobInternal(), split computation is delegated to writeSplits():
// Create the splits for the job
LOG.debug("Creating splits at " + jtFs.makeQualified(submitJobDir));
int maps = writeSplits(job, submitJobDir);
conf.setInt(MRJobConfig.NUM_MAPS, maps);
LOG.info("number of splits:" + maps);
writeSplits() in turn calls writeNewSplits() (the new-API path):
private <T extends InputSplit>
int writeNewSplits(JobContext job, Path jobSubmitDir) throws IOException,
InterruptedException, ClassNotFoundException {
Configuration conf = job.getConfiguration();
InputFormat<?, ?> input =
ReflectionUtils.newInstance(job.getInputFormatClass(), conf);
List<InputSplit> splits = input.getSplits(job);
T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);
// sort the splits into order based on size, so that the biggest
// go first
Arrays.sort(array, new SplitComparator());
JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
jobSubmitDir.getFileSystem(conf), array);
return array.length;
}
writeNewSplits() returns the length of an array of InputSplit objects, and that array is produced by calling getSplits() on input. Here input is a concrete implementation of the abstract class InputFormat, instantiated via ReflectionUtils.newInstance() from the InputFormat class configured for the job; in our client that is TextInputFormat. TextInputFormat extends FileInputFormat (a subclass of InputFormat) and is the InputFormat implementation for plain-text input: files are broken into lines, the key is the byte offset of the line within the file, and the value is the line itself.
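Note that our driver never calls setInputFormatClass(); TextInputFormat simply happens to be the default InputFormat. If a different input format were needed, it would be set explicitly on the Job, for example:
// In the driver, before submission (optional here, since TextInputFormat is the default)
job.setInputFormatClass(TextInputFormat.class);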
Stepping into TextInputFormat, we see that it overrides InputFormat's createRecordReader() method and returns a RecordReader implementation, LineRecordReader. The key and value mentioned above are produced by this LineRecordReader: its getCurrentKey() and getCurrentValue() methods return them, and they are actually filled in by nextKeyValue(), which checks whether another key-value pair exists and advances to it. LineRecordReader, then, is the class that does the real work: it feeds every record of the split to the MapTask as a key-value pair.
/** An {@link InputFormat} for plain text files. Files are broken into lines.
* Either linefeed or carriage-return are used to signal end of line. Keys are
* the position in the file, and values are the line of text.. */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class TextInputFormat extends FileInputFormat<LongWritable, Text> {
@Override
public RecordReader<LongWritable, Text>
createRecordReader(InputSplit split,
TaskAttemptContext context) {
String delimiter = context.getConfiguration().get(
"textinputformat.record.delimiter");
byte[] recordDelimiterBytes = null;
if (null != delimiter)
recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
return new LineRecordReader(recordDelimiterBytes);
}
@Override
protected boolean isSplitable(JobContext context, Path file) {
final CompressionCodec codec =
new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
if (null == codec) {
return true;
}
return codec instanceof SplittableCompressionCodec;
}
}
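To see where these methods fit in, this is roughly how the framework consumes the reader: Mapper.run() loops over the context, and the context delegates nextKeyValue()/getCurrentKey()/getCurrentValue() to the underlying RecordReader. A simplified sketch (not the verbatim MapTask code, which we will look at in the next article):
// What Mapper.run() effectively does; the context delegates each call
// to the underlying RecordReader (a LineRecordReader in our case)
setup(context);
while (context.nextKeyValue()) {
    // key  : LongWritable, the byte offset of the line within the file
    // value: Text, the content of the line
    map(context.getCurrentKey(), context.getCurrentValue(), context);
}
cleanup(context);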
Now let's look at LineRecordReader's initialize() method. It takes an InputSplit and pulls out the split's three essential attributes: start, length and path.
Scrolling down, we see that initialize() calls fileIn.seek(start) (in the uncompressed branch), positioning the reader's cursor at the beginning of the split.
public void initialize(InputSplit genericSplit,
TaskAttemptContext context) throws IOException {
FileSplit split = (FileSplit) genericSplit;
Configuration job = context.getConfiguration();
this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
start = split.getStart();
end = start + split.getLength();
final Path file = split.getPath();
// open the file and seek to the start of the split
final FileSystem fs = file.getFileSystem(job);
fileIn = fs.open(file);
CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
if (null!=codec) {
isCompressedInput = true;
decompressor = CodecPool.getDecompressor(codec);
if (codec instanceof SplittableCompressionCodec) {
final SplitCompressionInputStream cIn =
((SplittableCompressionCodec)codec).createInputStream(
fileIn, decompressor, start, end,
SplittableCompressionCodec.READ_MODE.BYBLOCK);
in = new CompressedSplitLineReader(cIn, job,
this.recordDelimiterBytes);
start = cIn.getAdjustedStart();
end = cIn.getAdjustedEnd();
filePosition = cIn;
} else {
in = new SplitLineReader(codec.createInputStream(fileIn,
decompressor), job, this.recordDelimiterBytes);
filePosition = fileIn;
}
} else {
fileIn.seek(start);
in = new UncompressedSplitLineReader(
fileIn, job, this.recordDelimiterBytes, split.getLength());
filePosition = fileIn;
}
// If this is not the first split, we always throw away first record
// because we always (except the last split) read one extra line in
// next() method.
if (start != 0) {
start += in.readLine(new Text(), 0, maxBytesToConsume(start));
}
this.pos = start;
}
Next, recall that when the file was cut into splits, there is a good chance that a single line was cut across two splits.
Yet if you have run the demo you will have found that the final result is still correct. That is because the following few lines of code put the data back together:
// If this is not the first split, we always throw away first record
// because we always (except the last split) read one extra line in
// next() method.
if (start != 0) {
start += in.readLine(new Text(), 0, maxBytesToConsume(start));
}
The source comment already explains it: for every split except the first, the reader gives up its first record, because the reader of the previous split (except for the last one) always reads one extra line beyond its end.
Looking at the code, in.readLine() reads into a throwaway new Text() object that will soon be garbage-collected, so the intent is not to read data at all but to use the return value to advance start to the beginning of the second record. In our example, if the line 'hello' was cut into 'he' and 'llo', start initially points at 'llo' and is moved forward to the next complete record. The recovery on the other side happens just as automatically: LineRecordReader reads a line up to the line delimiter, so when the previous split's reader has consumed 'he' without hitting a newline, it simply keeps reading past the end of its split through the HDFS input stream until it reaches the delimiter. That extra data usually lives in the next block, possibly on another node, so a small amount of data movement across the cluster does remain.
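A tiny worked example, assuming a hypothetical 19-byte file containing the three lines 'hello\n', 'world\n', 'hadoop\n' and a split boundary at byte 9 (in the middle of 'world\n', which occupies bytes 6-11): the reader of the second split starts at 9, sees start != 0, lets in.readLine() consume the partial 'ld\n' (3 bytes) and moves start to 12, the beginning of 'hadoop\n'. Meanwhile the reader of the first split finishes 'hello\n' at position 6, which is still before its end (9), so it reads the whole line 'world\n', running 3 bytes past its split boundary. Every line is therefore processed exactly once.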
We now know how a split is read and how records at split boundaries are recovered, but how are the splits themselves produced? Going back to the TextInputFormat class, we find that it does not implement getSplits(), which means TextInputFormat reuses getSplits() from its parent class FileInputFormat:
/**
* Generate the list of files and make them into FileSplits.
* @param job the job context
* @throws IOException
*/
public List<InputSplit> getSplits(JobContext job) throws IOException {
Stopwatch sw = new Stopwatch().start();
long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
long maxSize = getMaxSplitSize(job);
// generate splits
List<InputSplit> splits = new ArrayList<InputSplit>();
List<FileStatus> files = listStatus(job);
for (FileStatus file: files) {
Path path = file.getPath();
long length = file.getLen();
if (length != 0) {
BlockLocation[] blkLocations;
if (file instanceof LocatedFileStatus) {
blkLocations = ((LocatedFileStatus) file).getBlockLocations();
} else {
FileSystem fs = path.getFileSystem(job.getConfiguration());
blkLocations = fs.getFileBlockLocations(file, 0, length);
}
if (isSplitable(job, path)) {
long blockSize = file.getBlockSize();
long splitSize = computeSplitSize(blockSize, minSize, maxSize);
long bytesRemaining = length;
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
splits.add(makeSplit(path, length-bytesRemaining, splitSize,
blkLocations[blkIndex].getHosts(),
blkLocations[blkIndex].getCachedHosts()));
bytesRemaining -= splitSize;
}
if (bytesRemaining != 0) {
int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
blkLocations[blkIndex].getHosts(),
blkLocations[blkIndex].getCachedHosts()));
}
} else { // not splitable
splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
blkLocations[0].getCachedHosts()));
}
} else {
//Create empty hosts array for zero length files
splits.add(makeSplit(path, 0, length, new String[0]));
}
}
// Save the number of input files for metrics/loadgen
job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
sw.stop();
if (LOG.isDebugEnabled()) {
LOG.debug("Total # of splits generated by getSplits: " + splits.size()
+ ", TimeTaken: " + sw.elapsedMillis());
}
return splits;
}
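As a concrete illustration of what this loop produces: with a hypothetical 300 MB uncompressed text file, a 128 MB block size and the default min/max split sizes (so splitSize equals blockSize), getSplits() yields splits of roughly 128 MB, 128 MB and 44 MB. The small stand-alone sketch below mirrors the loop with plain numbers (illustration only, not Hadoop code):
// Illustrative only: mirrors the split loop of FileInputFormat.getSplits()
// for a hypothetical 300 MB file and a 128 MB split size.
public class SplitSketch {
    static final double SPLIT_SLOP = 1.1; // same constant as in FileInputFormat
    public static void main(String[] args) {
        long length = 300L * 1024 * 1024;     // total file length
        long splitSize = 128L * 1024 * 1024;  // equals the block size by default
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            System.out.println("split at offset " + (length - bytesRemaining)
                    + ", length " + splitSize);
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {            // the tail split
            System.out.println("split at offset " + (length - bytesRemaining)
                    + ", length " + bytesRemaining);
        }
    }
}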
Every input file has a corresponding FileStatus object, which holds the client-side view of the file's metadata:
/** Interface that represents the client side information for a file.
*/
@InterfaceAudience.Public
@InterfaceStability.Stable
public class FileStatus implements Writable, Comparable {
private Path path;
private long length;
private boolean isdir;
private short block_replication;
private long blocksize;
private long modification_time;
private long access_time;
private FsPermission permission;
private String owner;
private String group;
private Path symlink;
The listStatus() method in FileInputFormat returns a List of FileStatus objects; getSplits() iterates over this list, fetching the blocks of each file and turning them into splits. Inside the loop a BlockLocation[] array is maintained, storing metadata for every block of the file, such as its physical location and the host names holding it:
/**
* Represents the network location of a block, information about the hosts
* that contain block replicas, and other block metadata (E.g. the file
* offset associated with the block, length, whether it is corrupt, etc).
*/
@InterfaceAudience.Public
@InterfaceStability.Stable
public class BlockLocation {
private String[] hosts; // Datanode hostnames
private String[] cachedHosts; // Datanode hostnames with a cached replica
private String[] names; // Datanode IP:xferPort for accessing the block
private String[] topologyPaths; // Full path name in network topology
private long offset; // Offset of the block in the file
private long length;
private boolean corrupt;
private static final String[] EMPTY_STR_ARRAY = new String[0];
getBlockIndex() is then called to find the index into BlockLocation[]. It takes an offset parameter; what getSplits() passes in is length - bytesRemaining, i.e. the offset of the split rather than of a block, so the index is determined by the inequality offset_i <= offset < offset_i + length_i, where offset_i and length_i are the offset and length of the i-th block. The i that satisfies it is the index of the block containing the start of the split.
protected int getBlockIndex(BlockLocation[] blkLocations,
long offset) {
for (int i = 0 ; i < blkLocations.length; i++) {
// is the offset inside this block?
if ((blkLocations[i].getOffset() <= offset) &&
(offset < blkLocations[i].getOffset() + blkLocations[i].getLength())){
return i;
}
}
BlockLocation last = blkLocations[blkLocations.length -1];
long fileLength = last.getOffset() + last.getLength() -1;
throw new IllegalArgumentException("Offset " + offset +
" is outside of file (0.." +
fileLength + ")");
}
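A quick numeric check: with 128 MB blocks (block i covers [i * 134217728, (i + 1) * 134217728)), a split starting at offset 200 MB = 209715200 satisfies 134217728 <= 209715200 < 268435456, so getBlockIndex() returns 1, the second block; that block's host list is then used as the preferred locations of the split, as the makeSplit() calls above show.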
splitSize is computed by computeSplitSize(). To make splitSize larger than the block size, raise minSize; to make it smaller, lower maxSize.
protected long computeSplitSize(long blockSize, long minSize,
long maxSize) {
return Math.max(minSize, Math.min(maxSize, blockSize));
}
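In practice both knobs are exposed through the job configuration; for example, using the standard helpers on org.apache.hadoop.mapreduce.lib.input.FileInputFormat (equivalent to setting mapreduce.input.fileinputformat.split.minsize / split.maxsize):
// Make splits larger than the block size by raising the minimum (here 256 MB)
FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
// Or make them smaller by lowering the maximum (here 64 MB)
FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);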
Note that all of the above assumes the input file is splittable (isSplitable() returns true); when the input is, for example, a stream-compressed file that cannot be split, the else branch is taken instead and a single split covering the whole file is created (see the source for the details).
Next comes a loop: as long as the remaining bytes are more than SPLIT_SLOP times splitSize (SPLIT_SLOP is 1.1, so the last split may be up to 10% larger than splitSize), the list keeps receiving new splits. The makeSplit() call here needs five pieces of information about the split:
- the file path
- offset
- splitSize
- hosts
- cachedHosts
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
splits.add(makeSplit(path, length-bytesRemaining, splitSize,
blkLocations[blkIndex].getHosts(),
blkLocations[blkIndex].getCachedHosts()));
bytesRemaining -= splitSize;
}
At this point all the splits have been obtained. Together with the LineRecordReader reading process described above, the client's main work is now complete. In the next article we will discuss the workflow of the MapTask.