A direct translation from the Apache Document..., compiled purely to familiarize myself with it.
org.apache.hadoop.mapreduce
Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
Maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks which transform input records into intermediate records. The intermediate records need not be the same type as the input records. A given input pair may map to zero or many output pairs, i.e. a 1-to-n input/output relationship.
The Hadoop Map-Reduce framework spawns one map task for each InputSplit generated by the InputFormat. Mapper implementations can access the Job's Configuration via JobContext.getConfiguration().
The framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by map(Object, Object, Context) for each key/value pair in the InputSplit. Finally cleanup(Context) is called.
All intermediate values associated with a given output key are grouped together by the framework and passed to a Reducer. Sorting and grouping can be controlled by specifying two key RawComparator classes.
The Mapper outputs are partitioned per Reducer. Which keys (and hence which records) go to which Reducer can be controlled by implementing a custom Partitioner.
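To see what partitioning does concretely, here is a minimal sketch in plain Java of the hashing rule that Hadoop's default HashPartitioner applies (the class and method names below are mine for the demo; only the formula itself comes from HashPartitioner):

```java
public class PartitionDemo {
    // The same rule Hadoop's HashPartitioner uses: mask off the sign bit
    // of the key's hash, then take the remainder modulo the number of
    // reduce tasks. Equal keys always land on the same reducer.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With 4 reducers, every record sharing a key goes to one reducer.
        System.out.println(partitionFor("hadoop", 4) == partitionFor("hadoop", 4)); // true
        // With a single reducer, everything maps to partition 0.
        System.out.println(partitionFor("anything", 1)); // 0
    }
}
```

A custom Partitioner overrides getPartition with whatever routing logic the job needs, as long as it returns a value in [0, numReduceTasks).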
Users can optionally specify a combiner, via Job.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
Applications can specify if and how the intermediate outputs are to be compressed, and which CompressionCodecs are to be used, via the Configuration.
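As a sketch, map-output compression can be turned on through Configuration properties; the property names below are the Hadoop 2.x names, so verify them against your version's mapred-default.xml:

```java
Configuration conf = new Configuration();
// Compress the intermediate (map-side) output before it is shuffled...
conf.setBoolean("mapreduce.map.output.compress", true);
// ...using Snappy; any installed CompressionCodec class name works here.
conf.set("mapreduce.map.output.compress.codec",
         "org.apache.hadoop.io.compress.SnappyCodec");
```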
If the job has zero reduces, the output of the Mapper is written directly to the OutputFormat without sorting by keys.
Example:
public class TokenCounterMapper extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // the key is not used at all
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
      // writing straight to the context feels a bit like request in JSP
      // each write emits one record, containing a key and a value
    }
  }
}
Applications may override the run(Context) method to exert greater control on map processing, e.g. multi-threaded Mappers etc.
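For orientation, the stock run(Context) is just a setup/loop/cleanup skeleton; an override replaces this loop. A sketch mirroring the default implementation (check your Hadoop version's Mapper source for the exact body):

```java
@Override
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        // A custom run() could, for example, hand each key/value pair
        // to a thread pool here instead of calling map() inline.
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    } finally {
        cleanup(context);
    }
}
```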
org.apache.hadoop.mapreduce
Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
Reduces a set of intermediate values which share a key to a smaller set of values.
Reducer implementations can access the Configuration for the job via the JobContext.getConfiguration() method.
Reducer has 3 primary phases:
- Shuffle
The Reducer copies the sorted output from each Mapper over HTTP.
- Sort
The framework merge-sorts the Reducer inputs by key (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e. outputs are merged as they are fetched.
Secondary Sort
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce. The grouping comparator is specified via Job.setGroupingComparatorClass(Class). The sort order is controlled by Job.setSortComparatorClass(Class).
For example, say that you want to find duplicate web pages and tag them all with the url of the "best" known example. You would set up the job like:
- Map Input Key: url
- Map Input Value: document
- Map Output Key: document checksum, url pagerank
- Map Output Value: url
- Partitioner: by checksum
- OutputKeyComparator: by checksum and then decreasing pagerank
- OutputValueGroupingComparator: by checksum
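Wiring the comparators from that list into a job might look like the sketch below. ChecksumThenPagerankComparator, ChecksumGroupingComparator, and ChecksumPartitioner are hypothetical class names for the comparators and partitioner you would still have to write; only the Job setter methods come from the Hadoop API.

```java
Job job = Job.getInstance(new Configuration(), "dedup-pages");
// Sort by the full composite key: checksum first, then decreasing pagerank,
// so the "best" url for a page arrives first in the value iterator.
job.setSortComparatorClass(ChecksumThenPagerankComparator.class);
// Group reduce() calls by checksum only, so one call sees all urls of a page.
job.setGroupingComparatorClass(ChecksumGroupingComparator.class);
// Partition by checksum so all duplicates of a page reach the same Reducer.
job.setPartitionerClass(ChecksumPartitioner.class);
```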
- Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs. The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
Example:
public class IntSumReducer<Key> extends Reducer<Key, IntWritable, Key, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Key key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
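A driver wiring the two example classes above into a job might look like this sketch; WordCountDriver and the input/output paths are placeholders, and since IntSumReducer's reduce is associative it can double as the combiner:

```java
Job job = Job.getInstance(new Configuration(), "word count");
job.setJarByClass(WordCountDriver.class);    // hypothetical driver class
job.setMapperClass(TokenCounterMapper.class);
// Local aggregation on the map side, reusing the reducer logic.
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path("/input"));     // placeholder path
FileOutputFormat.setOutputPath(job, new Path("/output"));  // placeholder path
System.exit(job.waitForCompletion(true) ? 0 : 1);
```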
org.apache.hadoop.mapreduce.lib.input
Class TextInputFormat
@InterfaceAudience.Public @InterfaceStability.Stable public class TextInputFormat extends FileInputFormat< LongWritable, Text>
An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return is used to signal end of line. Keys are the position in the file, and values are the line of text.
org.apache.hadoop.mapreduce.lib.input
Class FileInputFormat<K,V>
The base class for all file-based InputFormats. Provides a generic implementation of getSplits(JobContext). Subclasses can override the isSplitable(JobContext, Path) method to ensure input files are not split up and are processed as a whole by Mappers.
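For example, a subclass that forces each file onto a single Mapper could look like the sketch below; WholeFileTextInputFormat is a made-up name, while isSplitable is the real protected hook:

```java
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Never split: each input file becomes exactly one InputSplit,
        // and is therefore processed by exactly one Mapper.
        return false;
    }
}
```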