An Overview of the Hadoop MapReduce Execution Process

I have read quite a few Hadoop tutorials and training videos, and recently stumbled across the English material at https://wiki.apache.org/hadoop/. The English explanations are still the clearest; native English speakers really do have an advantage when learning software technology.


To consolidate my understanding, I am translating some of the better English material here, adding my own interpretation and open questions.

This article is translated from the Hadoop wiki linked above.


=====================================================================

Map
=====================================================================
As the Map operation is parallelized the input file set is first split to several pieces called FileSplits. If an individual file is so large that it will affect seek time it will be split to several Splits. The splitting does not know anything about the input file's internal logical structure, for example line-oriented text files are split on arbitrary byte boundaries. Then a new map task is created per FileSplit.
Because the map operation is parallelized, the input file set is first split into pieces called FileSplits; a file large enough to hurt seek time is split into several of them. The splitting knows nothing about a file's internal logical structure, so the boundaries can fall anywhere: the last line of a split is not necessarily a complete record (how is such a record handled? The next paragraph has the answer).
One map task is created per FileSplit. The number of map tasks can also be suggested in code: conf.setNumMapTasks(10) (note this is only a hint; the InputFormat's split computation decides the actual number).
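
A minimal sketch of the corresponding job setup (old org.apache.hadoop.mapred API; the input path is a placeholder):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class SplitConfig {
    public static JobConf configure() {
        JobConf conf = new JobConf(SplitConfig.class);
        // One map task is created per FileSplit; this call is only a hint,
        // and the InputFormat's split computation has the final say.
        conf.setNumMapTasks(10);
        conf.setInputFormat(TextInputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path("/input")); // hypothetical path
        return conf;
    }
}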


When an individual map task starts it will open a new output writer per configured reduce task. It will then proceed to read its FileSplit using the RecordReader it gets from the specified InputFormat. InputFormat parses the input and generates key-value pairs. InputFormat must also handle records that may be split on the FileSplit boundary. For example TextInputFormat will read the last line of the FileSplit past the split boundary and, when reading other than the first FileSplit, TextInputFormat ignores the content up to the first newline.
When an individual map task starts, it opens one output writer per configured reduce task.
It then reads its FileSplit through the RecordReader obtained from the specified InputFormat; the InputFormat parses the input and generates key-value pairs.
The InputFormat must also handle records that are split across FileSplit boundaries. For example, when TextInputFormat reaches the end of its split, it reads the last line past the split boundary (this requires fetching the continuation from HDFS), and for every split other than the first, it ignores the content up to the first newline.
A concrete example:
FileSplit1 contents:
aaaaaaaa 111
bbbbbbbb 222
cccccccc 333
dddddddd 4 -- when reading reaches this point, the trailing "44" is fetched back from past the boundary (how is this implemented?)


FileSplit2 contents:
44              -- this line is ignored
bbbbbbbb 222
cccccccc 333
dddddddd 4


Open question: in the special case where the split boundary falls exactly on a newline, is the next line still read? Based on the description above, I suspect it is. To be verified.
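
To make the rule concrete, here is a self-contained sketch of the boundary logic (illustrative only: this is not Hadoop's actual LineRecordReader, and all names are made up):

import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch (not Hadoop's real LineRecordReader) of the
 * split-boundary rule: skip the first partial line of every split
 * except the first, and read the last line past the split's end.
 */
public class SplitLineReader {
    public static List<String> readSplit(byte[] file, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Not the first split: skip up to and including the first '\n';
            // the previous split's reader already consumed that partial line.
            while (pos < file.length && file[pos++] != '\n') { }
        }
        List<String> lines = new ArrayList<>();
        // Read whole lines while the line starts at or before 'end'; the
        // final line may extend past the boundary into the next split.
        while (pos < file.length && pos <= end) {
            int lineStart = pos;
            while (pos < file.length && file[pos] != '\n') pos++;
            lines.add(new String(file, lineStart, pos - lineStart));
            pos++; // step over the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "aaaaaaaa 111\nbbbbbbbb 222\ncccccccc 333\ndddddddd 44\n".getBytes();
        System.out.println(readSplit(data, 0, 45));                // last line read past the boundary
        System.out.println(readSplit(data, 45, data.length - 45)); // empty: partial first line skipped
    }
}

Under this illustrative logic, every line comes out exactly once even in the newline-boundary case: the skip-to-first-newline rule in the later split mirrors whatever the earlier split read past its end, so no line is lost or duplicated.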


It is not necessary for the InputFormat to generate both meaningful keys and values. For example the default output from TextInputFormat consists of input lines as values and somewhat meaninglessly line start file offsets as keys - most applications only use the lines and ignore the offsets.
The InputFormat does not have to generate meaningful keys and values. By default, TextInputFormat uses the byte offset of each line's start as the key and the line's contents as the value; in most cases the offset is useless and applications only consume the line.


As key-value pairs are read from the RecordReader they are passed to the configured Mapper. The user supplied Mapper does whatever it wants with the input pair and calls OutputCollector.collect with key-value pairs of its own choosing. The output it generates must use one key class and one value class. This is because the Map output will be written into a SequenceFile which has per-file type information and all the records must have the same type (use subclassing if you want to output different data structures). The Map input and output key-value pairs are not necessarily related typewise or in cardinality.
Key-value pairs read from the RecordReader are passed to the configured Mapper. The user-supplied Mapper can do whatever it wants with each input pair and emits key-value pairs of its own choosing via OutputCollector.collect. All of the map output must use a single key class and a single value class, because the output is written to a SequenceFile, which records type information once per file and therefore requires every record to have the same type (subclass a common base type if you need to output different data structures).
The map input and output key-value pairs need not be related in type or in cardinality: the output key/value classes may differ from the input's, and one input pair may produce zero, one, or many output pairs.
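
As a concrete illustration of this contract, here is a standard word-count mapper sketch against the old org.apache.hadoop.mapred API (class and field names are my own):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    // Input: (byte offset, line) from TextInputFormat; the offset key is ignored.
    // Output: one (word, 1) pair per token -- a single key class (Text) and a
    // single value class (IntWritable), as the SequenceFile output requires.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE); // zero, one, or many collects per input pair
        }
    }
}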


When Mapper output is collected it is partitioned, which means that it will be written to the output specified by the Partitioner. The default HashPartitioner uses the hashcode function on the key's class (which means that this hashcode function must be good in order to achieve an even workload across the reduce tasks). See MapTask for details.
When the Mapper output is collected it is partitioned, meaning it is written to the output selected by the Partitioner. The default HashPartitioner hashes the key; the quality of the key class's hash function therefore determines how evenly the workload is spread across the reduce tasks. See MapTask for details.
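
A minimal custom Partitioner sketch (old API; bucketing by the first character is purely illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstCharPartitioner implements Partitioner<Text, IntWritable> {
    // Routes each key to a reduce task. The default HashPartitioner uses
    // (key.hashCode() & Integer.MAX_VALUE) % numPartitions; bucketing by the
    // first character here is only for illustration.
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) return 0;
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }

    public void configure(JobConf job) { } // required by JobConfigurable; no-op
}

It would be wired in with conf.setPartitionerClass(FirstCharPartitioner.class).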


N input files will generate M map tasks to be run and each map task will generate as many output files as there are reduce tasks configured in the system. Each output file will be targeted at a specific reduce task and the map output pairs from all the map tasks will be routed so that all pairs for a given key end up in files targeted at a specific reduce task.
N input files produce M map tasks, and each map task produces as many output files as there are reduce tasks configured in the system (presumably these can be physical files or in-memory buffers?). Each output file targets exactly one reduce task, and the output pairs from all map tasks are routed so that all pairs for a given key end up at the same reduce task.




Combine
=====================================================================
When the map operation outputs its pairs they are already available in memory. For efficiency reasons, sometimes it makes sense to take advantage of this fact by supplying a combiner class to perform a reduce-type function. If a combiner is used then the map key-value pairs are not immediately written to the output. Instead they will be collected in lists, one list per each key value. When a certain number of key-value pairs have been written, this buffer is flushed by passing all the values of each key to the combiner's reduce method and outputting the key-value pairs of the combine operation as if they were created by the original map operation.
When the map operation outputs its pairs they are already in memory, so for efficiency it sometimes makes sense to supply a combiner class that performs a reduce-type function at this point. With a combiner, the map's key-value pairs are not written to the output immediately; they are collected in lists, one list per key. When a certain number of pairs has accumulated, the buffer is flushed: all the values for each key are passed to the combiner's reduce method, and the combiner's output pairs are treated as if the original map operation had produced them. The combiner is essentially equivalent to the reducer but runs on the map node, which shrinks the map output, lowers the network bandwidth needed, and lightens the reducers' load.


For example, a word count MapReduce application whose map operation outputs (word, 1) pairs as words are encountered in the input can use a combiner to speed up processing. A combine operation will start gathering the output in in-memory lists (instead of on disk), one list per word. Once a certain number of pairs is output, the combine operation will be called once per unique word with the list available as an iterator. The combiner then emits (word, count-in-this-part-of-the-input) pairs. From the viewpoint of the Reduce operation this contains the same information as the original Map output, but there should be far fewer pairs output to disk and read from disk.
For example, word counting is a natural fit for a combiner: the map emits a (word, 1) pair for each word encountered, and the combine operation gathers these in in-memory lists (instead of on disk), one list per word. Once enough pairs accumulate, the combiner is called once per unique word with the list as an iterator and emits (word, count-in-this-part-of-the-input) pairs. From the reducer's viewpoint this carries the same information as the raw map output, but with far fewer pairs written to and read from disk.
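
For word count the combiner can simply be the reducer class itself, since summing partial counts and then summing those partial sums gives the same result as summing everything at once. A configuration sketch (old API; WordCountMapper above and WordCountReducer, sketched in the Reduce section below, are my own names):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class CombinerConfig {
    public static void configure(JobConf conf) {
        conf.setMapperClass(WordCountMapper.class);
        // The combiner runs a reduce-type function on the map side; reusing
        // the reducer is safe here because integer addition is associative
        // and commutative.
        conf.setCombinerClass(WordCountReducer.class);
        conf.setReducerClass(WordCountReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
    }
}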


Reduce
=====================================================================
When a reduce task starts, its input is scattered in many files across all the nodes where map tasks ran. If run in distributed mode these need to be first copied to the local filesystem in a copy phase (see ReduceTaskRunner).
When a reduce task starts, its input is scattered across many files on all the nodes where map tasks ran, so in distributed mode those files must first be copied to the reduce task's local filesystem in a copy phase (see ReduceTaskRunner).


Once all the data is available locally it is appended to one file in an append phase. The file is then merge sorted so that the key-value pairs for a given key are contiguous (sort phase). This makes the actual reduce operation simple: the file is read sequentially and the values are passed to the reduce method with an iterator reading the input file until the next key value is encountered. See ReduceTask for details.
Once all the data is available locally, it is appended into a single file (the append phase), which is then merge-sorted so that all key-value pairs for a given key are physically contiguous (the sort phase). This keeps the actual reduce operation simple: the file is read sequentially, and each key, together with an iterator over all of its values, is passed to the reduce method; the iterator reads the input file until the next key is encountered. See ReduceTask for details.
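
To match, a summing reducer sketch (old API) showing the contract described above, one call per key with an iterator over that key's values:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Called once per key; 'values' iterates over all values for that key,
    // which the sort phase has made contiguous in the merged input file.
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}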


At the end, the output will consist of one output file per executed reduce task. The format of the files can be specified with JobConf.setOutputFormat. If SequentialOutputFormat is used then the output key and value classes must also be specified.
Finally, each executed reduce task produces one output file. The file format can be specified with JobConf.setOutputFormat; if a sequence-file output format is used, the output key and value classes must also be specified.
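
A sketch of the corresponding job configuration (old API; SequenceFileOutputFormat is the class name in the actual API, corresponding to the wiki's "SequentialOutputFormat", and the output path is a placeholder):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class OutputConfig {
    public static void configure(JobConf conf) {
        // One output file will be produced per reduce task, in this format.
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        // Sequence files carry per-file type information, so the output
        // key and value classes must be declared explicitly.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(conf, new Path("/output")); // hypothetical path
    }
}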





