A direct translation from the Apache Document..., compiled purely to familiarize myself with it.
org.apache.hadoop.mapreduce
Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
Maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks which transform input records into intermediate records. The intermediate records need not be the same type as the input records. A given input pair may map to zero or many output pairs, i.e. a 1-to-n input/output relationship.
The Hadoop Map-Reduce framework spawns one map task for each InputSplit generated by the InputFormat. Mapper implementations can access the Job's Configuration via JobContext.getConfiguration().
The framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by map(Object, Object, Context) for each key/value pair in the InputSplit. Finally cleanup(Context) is called.
All intermediate values associated with a given output key are grouped together by the framework and passed to a Reducer. Sorting and grouping can be controlled by specifying two key RawComparator classes.
The Mapper outputs are partitioned per Reducer. Which keys (and hence which records) go to which Reducer can be controlled by implementing a custom Partitioner.
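To see what partitioning does concretely, here is a minimal sketch in plain Java of the hashing rule that Hadoop's default HashPartitioner applies (the class and method names below are mine for the demo; only the formula itself comes from HashPartitioner):

```java
public class PartitionDemo {
    // The same rule Hadoop's HashPartitioner uses: mask off the sign bit
    // of the key's hash, then take the remainder modulo the number of
    // reduce tasks. Equal keys always land on the same reducer.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With 4 reducers, every record sharing a key goes to one reducer.
        System.out.println(partitionFor("hadoop", 4) == partitionFor("hadoop", 4)); // true
        // With a single reducer, everything maps to partition 0.
        System.out.println(partitionFor("anything", 1)); // 0
    }
}
```

A custom Partitioner overrides getPartition with whatever routing logic the job needs, as long as it returns a value in [0, numReduceTasks).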
Users can optionally specify a combiner, via Job.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
Applications can specify if and how the intermediate outputs are to be compressed, and which CompressionCodecs are to be used, via the Configuration.
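As a sketch, map-output compression can be turned on through Configuration properties; the property names below are the Hadoop 2.x names, so verify them against your version's mapred-default.xml:

```java
Configuration conf = new Configuration();
// Compress the intermediate (map-side) output before it is shuffled...
conf.setBoolean("mapreduce.map.output.compress", true);
// ...using Snappy; any installed CompressionCodec class name works here.
conf.set("mapreduce.map.output.compress.codec",
         "org.apache.hadoop.io.compress.SnappyCodec");
```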
If the job has zero reduces, the output of the Mapper is written directly to the OutputFormat without sorting by keys.
Example:
public class TokenCounterMapper extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // the key is not used at all
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
      // writing straight to the context feels a bit like request in JSP
      // each write emits one record, containing a key and a value
    }
  }
}
Applications may override the run(Context) method to exert greater control on map processing, e.g. multi-threaded Mappers etc.
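For orientation, the stock run(Context) is just a setup/loop/cleanup skeleton; an override replaces this loop. A sketch mirroring the default implementation (check your Hadoop version's Mapper source for the exact body):

```java
@Override
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        // A custom run() could, for example, hand each key/value pair
        // to a thread pool here instead of calling map() inline.
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    } finally {
        cleanup(context);
    }
}
```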
org.apache.hadoop.mapreduce
Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
Reduces a set of intermediate values which share a key to a smaller set of values.
Reducer implementations can access the Configuration for the job via the JobContext.getConfiguration() method.
Reducer has 3 primary phases:
- Shuffle
The Reducer copies the sorted output from each Mapper over HTTP.
- Sort
The framework merge-sorts the Reducer inputs by key (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e. outputs are merged as they are fetched.
Secondary Sort
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce. The grouping comparator is specified via Job.setGroupingComparatorClass(Class). The sort order is controlled by Job.setSortComparatorClass(Class).
For example, say that you want to find duplicate web pages and tag them all with the url of the "best" known example. You would set up the job like:
- Map Input Key: url
- Map Input Value: document
- Map Output Key: document checksum, url pagerank
- Map Output Value: url
- Partitioner: by checksum
- OutputKeyComparator: by checksum and then decreasing pagerank
- OutputValueGroupingComparator: by checksum
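Wiring the comparators from that list into a job might look like the sketch below. ChecksumThenPagerankComparator, ChecksumGroupingComparator, and ChecksumPartitioner are hypothetical class names for the comparators and partitioner you would still have to write; only the Job setter methods come from the Hadoop API.

```java
Job job = Job.getInstance(new Configuration(), "dedup-pages");
// Sort by the full composite key: checksum first, then decreasing pagerank,
// so the "best" url for a page arrives first in the value iterator.
job.setSortComparatorClass(ChecksumThenPagerankComparator.class);
// Group reduce() calls by checksum only, so one call sees all urls of a page.
job.setGroupingComparatorClass(ChecksumGroupingComparator.class);
// Partition by checksum so all duplicates of a page reach the same Reducer.
job.setPartitionerClass(ChecksumPartitioner.class);
```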
- Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs. The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
Example:
public class IntSumReducer<Key> extends Reducer<Key, IntWritable, Key, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Key key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
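A driver wiring the two example classes above into a job might look like this sketch; WordCountDriver and the input/output paths are placeholders, and since IntSumReducer's reduce is associative it can double as the combiner:

```java
Job job = Job.getInstance(new Configuration(), "word count");
job.setJarByClass(WordCountDriver.class);    // hypothetical driver class
job.setMapperClass(TokenCounterMapper.class);
// Local aggregation on the map side, reusing the reducer logic.
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path("/input"));     // placeholder path
FileOutputFormat.setOutputPath(job, new Path("/output"));  // placeholder path
System.exit(job.waitForCompletion(true) ? 0 : 1);
```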
org.apache.hadoop.mapreduce.lib.input
Class TextInputFormat
@InterfaceAudience.Public @InterfaceStability.Stable public class TextInputFormat extends FileInputFormat< LongWritable, Text>
An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return is used to signal end of line. Keys are the position in the file, and values are the line of text.
org.apache.hadoop.mapreduce.lib.input
Class FileInputFormat<K,V>
The base class for all file-based InputFormats. Provides a generic implementation of getSplits(JobContext). Subclasses can override the isSplitable(JobContext, Path) method to ensure input files are not split up and are processed as a whole by Mappers.
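For example, a subclass that forces each file onto a single Mapper could look like the sketch below; WholeFileTextInputFormat is a made-up name, while isSplitable is the real protected hook:

```java
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Never split: each input file becomes exactly one InputSplit,
        // and is therefore processed by exactly one Mapper.
        return false;
    }
}
```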