Starting from Scratch Series (3): Hadoop, the MapReduce Programming Model

Author: 千易云Lee

Category: Big Data. Tags: mapreduce, big data programming, hadoop

Article link: https://blog.csdn.net/weixin_42028303/article/details/80081347

Tags (space-separated): Big Data, Starting from Scratch Series

- 1. The Life of a MapReduce Job
- 2. Summary

1. The Life of a MapReduce Job

This article is based on Hadoop 2.7.5. Let's open with a diagram and walk through it piece by piece.
[Figure: image not available]

1.1 `FileBlock`

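Here, FileBlock is a file stored on HDFS. It can also be a local file, which is not recommended for real workloads but is perfectly fine for learning or testing; this article uses local files.
When the FileBlock lives on HDFS, we analyze it one block at a time.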

1.2 `InputFormat`

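InputFormat is the core abstract class for obtaining information about the input files and for reading their contents. It declares the following methods: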

public abstract class InputFormat<K, V> {
    public InputFormat() {// constructor
    }

    public abstract List<InputSplit> getSplits(JobContext var1) throws IOException, InterruptedException;// computes the logical splits of the input

    public abstract RecordReader<K, V> createRecordReader(InputSplit var1, TaskAttemptContext var2) throws IOException, InterruptedException;// returns the class that reads records from a split
}

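Let's look at getSplits first. Its job is to compute the logical splits (InputSplit) of the input. On HDFS a file is physically stored in blocks, 128 MB each by default; a split is a logical byte range of a file that one map task will process. FileInputFormat computes splits file by file, so a split never spans two files. For example, take a file that occupies a single 128 MB block and configure a 10 MB split size: the file is cut into 13 splits, namely 12 splits of 10 MB plus one of 8 MB. The last split is not topped up with 2 MB borrowed from somewhere else. Note that when the split size does not divide the block size evenly, a split inside a multi-block file can straddle a block boundary; its preferred hosts are then taken from the block that contains its start offset. In practice we rarely need to customize getSplits, and with the default settings one physical block corresponds to one logical split, as we can see in the FileInputFormat class: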

 public List<InputSplit> getSplits(JobContext job) throws IOException {
        StopWatch sw = (new StopWatch()).start();
        long minSize = Math.max(this.getFormatMinSplitSize(), getMinSplitSize(job));// minimum split size
        long maxSize = getMaxSplitSize(job);// maximum split size
        List<InputSplit> splits = new ArrayList();
        List<FileStatus> files = this.listStatus(job);// lists the input files of the job; each FileStatus describes one file
        Iterator var9 = files.iterator();// iterate over the input files, one FileStatus at a time

        while(true) {
            while(true) {
                while(var9.hasNext()) {// iterate over var9, i.e. over every input file
                    FileStatus file = (FileStatus)var9.next();
                    Path path = file.getPath();
                    long length = file.getLen();// length of this file in bytes
                    if (length != 0L) {
                        BlockLocation[] blkLocations;
                        if (file instanceof LocatedFileStatus) {
                            blkLocations = ((LocatedFileStatus)file).getBlockLocations();
                        } else {
                            FileSystem fs = path.getFileSystem(job.getConfiguration());
                            blkLocations = fs.getFileBlockLocations(file, 0L, length);
                        }// this step fetches the block locations of the file

                        if (this.isSplitable(job, path)) {
                            long blockSize = file.getBlockSize();
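                            long splitSize = this.computeSplitSize(blockSize, minSize, maxSize);// compute the split size; we look at this method shortly

                            long bytesRemaining;// bytes of the file not yet assigned to a split
                            int blkIndex;
                            for(bytesRemaining = length; (double)bytesRemaining / (double)splitSize > 1.1D; bytesRemaining -= splitSize) {// starts with bytesRemaining equal to the file length and cuts off one split of splitSize bytes per iteration while bytesRemaining / splitSize is greater than 1.1; the 10% slack keeps the last split from being tiny
                                blkIndex = this.getBlockIndex(blkLocations, length - bytesRemaining);
                                splits.add(this.makeSplit(path, length - bytesRemaining, splitSize, blkLocations[blkIndex].getHosts(), blkLocations[blkIndex].getCachedHosts()));// add one full-size split; its hosts come from the block that contains the split's start offset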
                            }

                            if (bytesRemaining != 0L) {
                                blkIndex = this.getBlockIndex(blkLocations, length - bytesRemaining);
                                splits.add(this.makeSplit(path, length - bytesRemaining, bytesRemaining, blkLocations[blkIndex].getHosts(), blkLocations[blkIndex].getCachedHosts()));
                            }// whatever is left becomes the final split
                        } else {
                            splits.add(this.makeSplit(path, 0L, length, blkLocations[0].getHosts(), blkLocations[0].getCachedHosts()));
                        }// not splittable: the whole file becomes a single split
                    } else {
                        splits.add(this.makeSplit(path, 0L, length, new String[0]));
                    }
                }

                job.getConfiguration().setLong("mapreduce.input.fileinputformat.numinputfiles", (long)files.size());
                sw.stop();
                if (LOG.isDebugEnabled()) {
                    LOG.debug("Total # of splits generated by getSplits: " + splits.size() + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
                }

                return splits;
            }
        }
    }

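Having read the FileInputFormat source, a few points are still unclear, so let's go through them one by one.
minSize is the minimum split size. The source provides a setter and a getter for it: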

public static void setMinInputSplitSize(Job job, long size) {
        job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.minsize", size);
    }// setter

    public static long getMinSplitSize(JobContext job) {
        return job.getConfiguration().getLong("mapreduce.input.fileinputformat.split.minsize", 1L);
    }// getter; when no minimum split size has been configured, the default is 1 byte

maxSize is the maximum split size. It also has a setter and a getter:

 public static void setMaxInputSplitSize(Job job, long size) {
        job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.maxsize", size);
    }// setter

    public static long getMaxSplitSize(JobContext context) {
        return context.getConfiguration().getLong("mapreduce.input.fileinputformat.split.maxsize", 9223372036854775807L);
    }// the default is Long.MAX_VALUE, far larger than 128 MB

So how large is splitSize in the code above when neither the minimum nor the maximum split size has been set? Let's look at the code:

  protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }// with neither limit configured this evaluates Math.max(1, Math.min(9223372036854775807L, 134217728L)), so the result is 128 MB

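For a file that fills exactly one block, (double)bytesRemaining / (double)splitSize equals 1 in the source above. That is not greater than 1.1, so the for loop is skipped and the if branch that follows adds the whole block to splits as a single split. So, by default, one physical block corresponds to one logical split!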
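
The split arithmetic can be reproduced outside Hadoop. The following is a minimal sketch in plain Java, not Hadoop code: the class name SplitPlanSketch and the helper planSplits are hypothetical, while computeSplitSize and the 1.1 threshold mirror the source quoted above. It assumes a single file of known length and ignores block locations entirely.

```java
import java.util.ArrayList;
import java.util.List;

class SplitPlanSketch {
    // Same 10% slack as the for loop in getSplits.
    static final double SPLIT_SLOP = 1.1;

    // Same formula as FileInputFormat.computeSplitSize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Returns {offset, length} pairs for one file of the given length (hypothetical helper). */
    static List<long[]> planSplits(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long remaining = fileLength;
        // Cut full-size splits while more than 1.1 split sizes remain.
        while ((double) remaining / (double) splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        // Whatever is left becomes the final split.
        if (remaining != 0L) {
            splits.add(new long[] {fileLength - remaining, remaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long block = 128 * mb;

        long defaultSize = computeSplitSize(block, 1L, Long.MAX_VALUE);
        System.out.println("default split size (MB): " + defaultSize / mb);
        System.out.println("splits, default: " + planSplits(block, defaultSize).size());

        // A 10 MB maximum, as if set through setMaxInputSplitSize.
        long capped = computeSplitSize(block, 1L, 10 * mb);
        List<long[]> plan = planSplits(block, capped);
        System.out.println("splits, 10 MB cap: " + plan.size());
        System.out.println("last split (MB): " + plan.get(plan.size() - 1)[1] / mb);
    }
}
```

For a 128 MB file, main reports a 128 MB split size and a single split with the defaults, and 13 splits with an 8 MB tail once the maximum is capped at 10 MB.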

Now for createRecordReader. It returns a RecordReader, so let's see what the RecordReader class does.

1.3 `RecordReader`

FileInputFormat does not implement createRecordReader, so we turn to the TextInputFormat class to read this method and then dig into RecordReader.
Here is the createRecordReader method as implemented by TextInputFormat:

public RecordReader<LongWritable, Text> 
createRecordReader(InputSplit split, TaskAttemptContext context)
/**
Parameters first: split is the logical split, context is the task attempt context.
**/
{
        String delimiter = context.getConfiguration().get("textinputformat.record.delimiter");
        byte[] recordDelimiterBytes = null;
        if (null != delimiter) {
            recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
        }

        return new LineRecordReader(recordDelimiterBytes);// returns a LineRecordReader
    }

Before we dig into LineRecordReader, let's see which methods RecordReader declares:

public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {
    public RecordReader() {
    }

    public abstract void initialize(InputSplit var1, TaskAttemptContext var2) throws IOException, InterruptedException;// initialization; the parameters are the split and the task attempt context, exactly the same as those of InputFormat's `createRecordReader`

    public abstract boolean nextKeyValue() throws IOException, InterruptedException;// advances to the next key-value pair; returns false when there is none

    public abstract KEYIN getCurrentKey() throws IOException, InterruptedException;// returns the current key

    public abstract VALUEIN getCurrentValue() throws IOException, InterruptedException;// returns the current value

    public abstract float getProgress() throws IOException, InterruptedException;// returns the progress

    public abstract void close() throws IOException;// closes the reader
}

As the source shows, this class is responsible for extracting key-value pairs from the file. Now let's dig into LineRecordReader:

public LineRecordReader(byte[] recordDelimiter) {
        this.recordDelimiterBytes = recordDelimiter;
    }// constructor; takes a byte[] holding the record delimiter. When it is null (the default), a line ends at \n, \r, or \r\n

Next, let's go through its initialize method:

 public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
        FileSplit split = (FileSplit)genericSplit;// cast the generic InputSplit to FileSplit, the split type used for files
        Configuration job = context.getConfiguration();
        this.maxLineLength = job.getInt("mapreduce.input.linerecordreader.line.maxlength", 2147483647);
        this.start = split.getStart();// byte offset in the file where the split starts
        this.end = this.start + split.getLength();// byte offset in the file where the split ends
        Path file = split.getPath();// path of the file
        FileSystem fs = file.getFileSystem(job);// file system that owns the path
        this.fileIn = fs.open(file);// open the file and get an input stream
        CompressionCodec codec = (new CompressionCodecFactory(job)).getCodec(file);
        if (null != codec) {
            this.isCompressedInput = true;
            this.decompressor = CodecPool.getDecompressor(codec);
            if (codec instanceof SplittableCompressionCodec) {
                SplitCompressionInputStream cIn = ((SplittableCompressionCodec)codec).createInputStream(this.fileIn, this.decompressor, this.start, this.end, READ_MODE.BYBLOCK);
                this.in = new CompressedSplitLineReader(cIn, job, this.recordDelimiterBytes);
                this.start = cIn.getAdjustedStart();
                this.end = cIn.getAdjustedEnd();
                this.filePosition = cIn;
            } else {
                this.in = new SplitLineReader(codec.createInputStream(this.fileIn, this.decompressor), job, this.recordDelimiterBytes);
                this.filePosition = this.fileIn;
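            }// the above handles compressed input; we only look at the uncompressed case
        } else {// no compression
            this.fileIn.seek(this.start);// the stream was opened on the whole file, so reading would begin at offset 0; seek moves it to the split's start offset within the file
            this.in = new UncompressedSplitLineReader(this.fileIn, job, this.recordDelimiterBytes, split.getLength());
            this.filePosition = this.fileIn;// the line reader now reads from the split's start offset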
        }

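        if (this.start != 0L) {// not the first split of the file
            this.start += (long)this.in.readLine(new Text(), 0, this.maxBytesToConsume(this.start));// skip the first, possibly partial, line. The second argument (maxLineLength = 0) means no bytes are stored in the Text, but readLine still consumes the line and returns the number of bytes consumed, so start moves to the beginning of the next line. The skipped line is read by the reader of the previous split
        }

        this.pos = this.start;// pos starts at the (adjusted) start offset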
    }

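From the source above we can see that initialize works out where the split starts and ends in the file and positions the stream accordingly.
Let's move on to nextKeyValue: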

/**
This method is called repeatedly until the split has been read completely.
**/
    public boolean nextKeyValue() throws IOException {
        if (this.key == null) {
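            this.key = new LongWritable();// create the key object if it does not exist yet
        }

        this.key.set(this.pos);// the key is the byte offset of the line; on the first call it is the start offset of the split
        if (this.value == null) {
            this.value = new Text();// create the value object if it does not exist yet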
        }

        int newSize = 0;

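        while(this.getFilePosition() <= this.end || this.in.needAdditionalRecordAfterSplit()) // loop while the file position has not passed the end of the split, or the line reader reports that one more record must be read past the split end
        {
            if (this.pos == 0L) {
                newSize = this.skipUtfByteOrderMark();
            } else {
                newSize = this.in.readLine(this.value, this.maxLineLength, this.maxBytesToConsume(this.pos));// read one line into value and get the number of bytes consumed
                this.pos += (long)newSize;// advance the position
            }

            if (newSize == 0 || newSize < this.maxLineLength) {
                break;// a record was read (or the input is exhausted): leave the loop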
            }

            LOG.info("Skipped line of size " + newSize + " at pos " + (this.pos - (long)newSize));
        }

        if (newSize == 0) {// nothing was read: clear key and value and report that there are no more records for the Mapper
            this.key = null;
            this.value = null;
            return false;
        } else {
            return true;// a key-value pair is ready to be handed to the Mapper
        }
    }

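As we can see, this method holds the actual logic for fetching records. If you need custom behavior, this is where to implement it according to your requirements; the details are beyond the scope of this article.
Next, getCurrentKey and getCurrentValue: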

 public LongWritable getCurrentKey() {
        return this.key;
    }

    public Text getCurrentValue() {
        return this.value;
    }

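The key-value pairs read above are handed to the Mapper class. At this point it should be clear where the Mapper's input key-value pairs come from. The remaining RecordReader methods are less important, so we skip them and move on.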
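
The interplay between initialize and nextKeyValue is easier to see on a tiny in-memory input. The following is a simplified sketch in plain Java, not the Hadoop implementation: SplitLineReaderSketch, readSplit, and skipLine are hypothetical names, the input is assumed to be uncompressed with `\n` line endings only, and each record is rendered as "offset:line". It keeps the two rules from the source above: a reader whose split does not start at offset 0 skips its first line, and a reader keeps reading while its position is less than or equal to the split end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SplitLineReaderSketch {
    /** Reads the lines that belong to the split [start, start + length) of data. */
    static List<String> readSplit(byte[] data, long start, long length) {
        long end = start + length;
        int pos = (int) start;
        if (start != 0) {
            // Skip the first, possibly partial, line; the previous split reads it.
            pos = skipLine(data, pos);
        }
        List<String> records = new ArrayList<>();
        // "<=" lets this split read the line that starts exactly at its end offset.
        while (pos <= end && pos < data.length) {
            int next = skipLine(data, pos);
            boolean endsWithNewline = next > pos && data[next - 1] == '\n';
            int lineEnd = endsWithNewline ? next - 1 : next;
            String line = new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8);
            records.add(pos + ":" + line); // key = byte offset, value = line text
            pos = next;
        }
        return records;
    }

    /** Returns the offset just past the next '\n', or data.length if there is none. */
    static int skipLine(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbbb\ncc\ndddd\n".getBytes(StandardCharsets.UTF_8);
        // Three 5-byte splits that cut through the middle of lines.
        for (long start = 0; start < data.length; start += 5) {
            System.out.println("split@" + start + " -> " + readSplit(data, start, 5));
        }
    }
}
```

Even though the split boundaries fall in the middle of lines, every line is produced by exactly one split, and the key of each record is the byte offset at which the line starts.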

1.4 `Mapper<KeyIn,ValueIn,KeyOut,ValueOut>`

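From the analysis above we know that the Mapper's input <Key, Value> pairs are produced by the RecordReader. So what happens to the data inside the Mapper? Let's look at the source of the Mapper class first: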

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    public Mapper() {
    }

    protected void setup(Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
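    // one-time initialization; runs exactly once per MapTask
    }

    protected void map(KEYIN key, VALUEIN value, Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
        context.write(key, value);// map is called once for every key-value pair until the MapTask has consumed its input; the default implementation passes the pair through unchanged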
    }

    protected void cleanup(Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
    //doNothing
    }

    public void run(Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
        this.setup(context);

        try {
            while(context.nextKeyValue()) {
                this.map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            this.cleanup(context);
        }

    }

    public abstract class Context implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        public Context() {
        }
    }
}

Let's see how TokenCounterMapper, an implementation that ships with Hadoop, does its work:

public void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());// splits the input value on the default delimiters: space, \t, \n, \r and \f

        while(itr.hasMoreTokens()) {
            this.word.set(itr.nextToken());
            context.write(this.word, one);// emit every token as a pair of the form <word, 1>
        }

    }

The Mapper logic is quite simple. Now let's see how the Reducer handles the data coming from the Mapper.
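
The setup / map / cleanup life cycle driven by run can be tried without a cluster. The following is a rough sketch in plain Java under stated assumptions: MiniMapper and TokenMapper are hypothetical stand-ins for Mapper and TokenCounterMapper, the input is a list of lines instead of a RecordReader, output pairs are collected as "word<TAB>1" strings, and offsets are computed as if every character were one byte.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class MapperLoopSketch {
    /** Hypothetical stand-in for Mapper: setup once, map per record, cleanup once. */
    abstract static class MiniMapper {
        final List<String> output = new ArrayList<>();

        void setup() {}

        abstract void map(long key, String value);

        void cleanup() {}

        // Same shape as Mapper.run: cleanup runs even if map throws.
        final void run(List<String> lines) {
            setup();
            try {
                long offset = 0;
                for (String line : lines) {
                    map(offset, line);
                    offset += line.length() + 1; // simplification: one byte per char plus '\n'
                }
            } finally {
                cleanup();
            }
        }
    }

    /** Emits one "<word>\t1" record per token, like the map method shown above. */
    static class TokenMapper extends MiniMapper {
        @Override
        void map(long key, String value) {
            StringTokenizer itr = new StringTokenizer(value);
            while (itr.hasMoreTokens()) {
                output.add(itr.nextToken() + "\t1");
            }
        }
    }

    public static void main(String[] args) {
        TokenMapper mapper = new TokenMapper();
        mapper.run(Arrays.asList("hello hadoop", "hello map\treduce"));
        mapper.output.forEach(System.out::println);
    }
}
```

The key (the line offset) is ignored by the token mapper, which matches the `Object key` parameter in the TokenCounterMapper signature: only the value is tokenized.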

1.5 `Reducer<keyIn,Iterable<ValueIn>,keyOut,ValueOut>`

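The Reducer's input looks different: the input value has become an iterable collection of values. Why? This is the result of the Shuffle phase that sits between the Mapper and the Reducer. This article does not analyze the Shuffle; it is left for the next article in the series. For now, let's look at the Reducer source: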

  protected void setup(Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
  // runs exactly once per ReduceTask
    }

    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {// reduce is called once per key
        Iterator var4 = values.iterator();

        while(var4.hasNext()) {
            VALUEIN value = var4.next();
            context.write(key, value);// emit the result
        }

    }

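The source shows that the Reducer's input differs from the Mapper's: during the Shuffle phase, the values that the Map phase emitted under the same key are gathered together, so each record becomes <Key, Iterable<Value>>. On the Reducer side we then process all values of one key together, according to the business logic at hand. That is all for the Reducer; let's move on.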
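
To make the change of shape from <Key, Value> to <Key, Iterable<Value>> concrete, here is a small in-memory sketch in plain Java. It is only a conceptual model: ShuffleReduceSketch and its methods are hypothetical, the real Shuffle also partitions data across reducers and moves it over the network, and the reduce step shown is a word-count style sum rather than the pass-through default in the source above.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleReduceSketch {
    /** Groups map output by key, in sorted key order (conceptual model of the Shuffle). */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Called once per key with all of that key's values, like Reducer.reduce. */
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Map (word -> <word, 1>), shuffle, then reduce. */
    static Map<String, Integer> run(List<String> words) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String w : words) {
            mapOutput.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapOutput).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("hello", "hadoop", "hello")));
    }
}
```

A TreeMap is used so that keys reach reduce in sorted order, which is also what a reducer observes in a real job.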

1.6 `RecordWriter`

We have already covered RecordReader. Its counterpart, RecordWriter, handles the data emitted by the Reducer and decides how it is written to the target file:

public abstract class RecordWriter<K, V> {
    public RecordWriter() {
    }

    public abstract void write(K var1, V var2) throws IOException, InterruptedException;// the main method; its arguments are the key-value pair emitted by the Reducer

    public abstract void close(TaskAttemptContext var1) throws IOException, InterruptedException;
}

LineRecordWriter is a good way to understand this class:

 protected static class LineRecordWriter<K, V> extends RecordWriter<K, V> {
        private static final String utf8 = "UTF-8";
        private static final byte[] newline;
        protected DataOutputStream out;// output stream
        private final byte[] keyValueSeparator;// separator between key and value

        public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {// constructor 1: accepts a custom separator
            this.out = out;

            try {
                this.keyValueSeparator = keyValueSeparator.getBytes("UTF-8");
            } catch (UnsupportedEncodingException var4) {
                throw new IllegalArgumentException("can't find UTF-8 encoding");
            }
        }

        public LineRecordWriter(DataOutputStream out) {
        // constructor 2: uses a tab as the default separator
            this(out, "\t");
        }

        private void writeObject(Object o) throws IOException {
            if (o instanceof Text) {
                Text to = (Text)o;
                this.out.write(to.getBytes(), 0, to.getLength());
            } else {
                this.out.write(o.toString().getBytes("UTF-8"));
            }

        }

        public synchronized void write(K key, V value) throws IOException {// the main method
            boolean nullKey = key == null || key instanceof NullWritable;
            boolean nullValue = value == null || value instanceof NullWritable;
            if (!nullKey || !nullValue) {
                if (!nullKey) {
                    this.writeObject(key);// write the key
                }

                if (!nullKey && !nullValue) {
                    this.out.write(this.keyValueSeparator);// write the separator, `\t` by default
                }

                if (!nullValue) {
                    this.writeObject(value);// write the value
                }

                this.out.write(newline);// write `\n`
            }
        }

        public synchronized void close(TaskAttemptContext context) throws IOException {
            this.out.close();
        }

        static {
            try {
                newline = "\n".getBytes("UTF-8");
            } catch (UnsupportedEncodingException var1) {
                throw new IllegalArgumentException("can't find UTF-8 encoding");
            }
        }
    }
}

That looks simple enough. But where is the output stream actually created? Let's keep digging.
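
The formatting rules of write can be checked against an in-memory stream. The following is a minimal sketch in plain Java, not the Hadoop class: LineWriterSketch and render are hypothetical names, only null is treated as a missing key or value (there is no NullWritable here), and every object is written through toString().

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class LineWriterSketch {
    private static final byte[] NEWLINE = "\n".getBytes(StandardCharsets.UTF_8);

    /** Writes "key<separator>value\n", omitting whichever part is null. */
    static void write(DataOutputStream out, Object key, Object value, String separator)
            throws IOException {
        boolean nullKey = key == null;
        boolean nullValue = value == null;
        if (nullKey && nullValue) {
            return; // nothing to write, not even a newline
        }
        if (!nullKey) {
            out.write(key.toString().getBytes(StandardCharsets.UTF_8));
        }
        if (!nullKey && !nullValue) {
            out.write(separator.getBytes(StandardCharsets.UTF_8));
        }
        if (!nullValue) {
            out.write(value.toString().getBytes(StandardCharsets.UTF_8));
        }
        out.write(NEWLINE);
    }

    /** Renders {key, value} pairs to a String using a tab separator. */
    static String render(Object[][] records) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (Object[] record : records) {
                write(out, record[0], record[1], "\t");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Object[][] records = {{"hello", 2}, {null, 3}, {"world", null}, {null, null}};
        System.out.print(render(records));
    }
}
```

The separator only appears when both key and value are present, and a record whose key and value are both missing produces no line at all.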

1.7 `OutputFormat`

First, the source of OutputFormat:

public abstract class OutputFormat<K, V> {
    public OutputFormat() {
    }

    public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext var1) throws IOException, InterruptedException;// the core method

    public abstract void checkOutputSpecs(JobContext var1) throws IOException, InterruptedException;

    public abstract OutputCommitter getOutputCommitter(TaskAttemptContext var1) throws IOException, InterruptedException;
}

The core method of this class is getRecordWriter. TextOutputFormat shows in detail what it does:

    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
        Configuration conf = job.getConfiguration();
        boolean isCompressed = getCompressOutput(job);
        String keyValueSeparator = conf.get(SEPERATOR, "\t");
        CompressionCodec codec = null;
        String extension = "";
        if (isCompressed) {
            Class<? extends CompressionCodec> codecClass = getOutputCompressorClass(job, GzipCodec.class);
            codec = (CompressionCodec)ReflectionUtils.newInstance(codecClass, conf);
            extension = codec.getDefaultExtension();
        }

        Path file = this.getDefaultWorkFile(job, extension);// derive the output Path from the task context
        FileSystem fs = file.getFileSystem(conf);// get the file system for that Path
        FSDataOutputStream fileOut;
        if (!isCompressed) {
            fileOut = fs.create(file, false);// create the output stream
            return new TextOutputFormat.LineRecordWriter(fileOut, keyValueSeparator);// build a LineRecordWriter on that stream; the else branch does the same, only with a compression codec wrapped around the stream
        } else {
            fileOut = fs.create(file, false);
            return new TextOutputFormat.LineRecordWriter(new DataOutputStream(codec.createOutputStream(fileOut)), keyValueSeparator);
        }
    }

At this point we can see that the data finally ends up in the directory we configured as the OutputPath.

2. Summary

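In short, MapReduce can be summarized as follows:
The core of the MapReduce programming model is Map, Reduce, and the Shuffle between them.
Map: takes input in <Key, Value> form and emits output in <Key, Value> form.
Shuffle: the logical stage between the Map phase and the Reduce phase. It partitions, sorts, and groups the Map output, and delivers it to the Reduce phase in <Key, Iterable<Value>> form.
Reduce: receives the Map output in <Key, Iterable<Value>> form and emits results in <Key, Value> form.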