Hadoop Mapper/Reducer Documentation Study Notes

Translated directly from the Apache documentation... compiled purely to familiarize myself with it.

org.apache.hadoop.mapreduce 
Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

Maps input key/value pairs to a set of intermediate key/value pairs.

Maps are the individual tasks which transform input records into intermediate records. The intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs, i.e. a 0-to-n input/output relationship.

The Hadoop Map-Reduce framework spawns one map task for each InputSplit generated by the InputFormat for the job. Mapper implementations can access the Job's Configuration via JobContext.getConfiguration().

The framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by map(Object, Object, Context) for each key/value pair in the InputSplit. Finally cleanup(Context) is called.
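
A minimal sketch of that lifecycle, assuming a word-count-style mapper (the class name and the record counter are hypothetical, purely to show where each hook runs):

 import java.io.IOException;

 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;

 public class LifecycleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
   private long records;  // hypothetical per-task state

   @Override
   protected void setup(Context context) {
     // called once per task, before the first map() call
     records = 0;
   }

   @Override
   protected void map(LongWritable key, Text value, Context context)
       throws IOException, InterruptedException {
     records++;
     context.write(value, new IntWritable(1));  // emit the whole line with count 1
   }

   @Override
   protected void cleanup(Context context) {
     // called once per task, after the last map() call
     System.err.println("records processed: " + records);
   }
 }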

All intermediate values associated with a given output key are subsequently grouped by the framework and passed to a Reducer. Users can control the sorting and grouping by specifying two key RawComparator classes.

The Mapper outputs are partitioned per Reducer. Users can control which keys (and hence which records) go to which Reducer by implementing a custom Partitioner.
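
For instance, a hypothetical Partitioner that routes keys by their first character, so related keys land on the same Reducer (the class name and routing rule are assumptions):

 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Partitioner;

 public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
   @Override
   public int getPartition(Text key, IntWritable value, int numPartitions) {
     String s = key.toString();
     int hash = s.isEmpty() ? 0 : s.charAt(0);
     // mask the sign bit so the partition index is always non-negative
     return (hash & Integer.MAX_VALUE) % numPartitions;
   }
 }

It would be plugged in with job.setPartitionerClass(FirstCharPartitioner.class).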

Users can optionally specify a combiner, via Job.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
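
As a driver-side sketch, here is how the combiner slots into a word-count-style job built from the TokenCounterMapper and IntSumReducer examples shown later in this note (the driver class name and argument handling are assumptions):

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 public class WordCountDriver {
   public static void main(String[] args) throws Exception {
     Job job = Job.getInstance(new Configuration(), "word count");
     job.setJarByClass(WordCountDriver.class);
     job.setMapperClass(TokenCounterMapper.class);
     job.setCombinerClass(IntSumReducer.class);  // local map-side aggregation
     job.setReducerClass(IntSumReducer.class);
     job.setOutputKeyClass(Text.class);
     job.setOutputValueClass(IntWritable.class);
     FileInputFormat.addInputPath(job, new Path(args[0]));
     FileOutputFormat.setOutputPath(job, new Path(args[1]));
     System.exit(job.waitForCompletion(true) ? 0 : 1);
   }
 }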

Applications can specify if and how the intermediate outputs are to be compressed and which CompressionCodecs are to be used via the Configuration.
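
For example, a fragment (belonging in a driver like the sketch above, before the Job is created) that compresses the intermediate map output; the property names are the standard Hadoop 2.x keys, and the choice of SnappyCodec is an assumption:

 Configuration conf = new Configuration();
 // compress the map-to-reduce transfer
 conf.setBoolean("mapreduce.map.output.compress", true);
 // codec for the intermediate data (requires the native snappy library)
 conf.setClass("mapreduce.map.output.compress.codec",
               org.apache.hadoop.io.compress.SnappyCodec.class,
               org.apache.hadoop.io.compress.CompressionCodec.class);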

If the job has zero reduces then the output of the Mapper is written directly to the OutputFormat without sorting by keys.
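
A map-only job is requested on the driver side (fragment, assuming an existing Job instance named job):

 // zero reduce tasks: map output goes straight to the OutputFormat, unsorted
 job.setNumReduceTasks(0);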

Example:

 import java.io.IOException;
 import java.util.StringTokenizer;

 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;

 public class TokenCounterMapper 
     extends Mapper<Object, Text, Text, IntWritable> {

   private final static IntWritable one = new IntWritable(1);
   private Text word = new Text();

   // the key is not used at all
   public void map(Object key, Text value, Context context) 
       throws IOException, InterruptedException {
     StringTokenizer itr = new StringTokenizer(value.toString());
     while (itr.hasMoreTokens()) {
       word.set(itr.nextToken());
       // writing straight to the context feels a bit like the request
       // object in JSP; each write() emits one record with a key and a value
       context.write(word, one);
     }
   }
 }
 

Applications may override the run(Context) method to exert greater control on map processing e.g. multi-threaded Mappers etc.
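
For orientation, the default run() is essentially the loop below; an override just has to honor the same setup/map/cleanup contract (a sketch with a hypothetical class name, not the exact framework source):

 import java.io.IOException;

 import org.apache.hadoop.mapreduce.Mapper;

 public class CustomRunMapper<KI, VI, KO, VO> extends Mapper<KI, VI, KO, VO> {
   @Override
   public void run(Context context) throws IOException, InterruptedException {
     setup(context);
     try {
       // pull records from the split one at a time and hand them to map();
       // a multi-threaded Mapper would dispatch these calls to a pool instead
       while (context.nextKeyValue()) {
         map(context.getCurrentKey(), context.getCurrentValue(), context);
       }
     } finally {
       cleanup(context);
     }
   }
 }

Hadoop ships a ready-made multi-threaded variant as org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper.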

org.apache.hadoop.mapreduce 
Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

Reduces a set of intermediate values which share a key to a smaller set of values.

Reducer implementations can access the Configuration for the job via the JobContext.getConfiguration() method.

The Reducer has 3 primary phases:

  1. Shuffle

    The Reducer copies the sorted output from each Mapper across the network using HTTP.

  2. Sort

    The framework merge-sorts the Reducer inputs by key (since different Mappers may have output the same key).

    The shuffle and sort phases occur simultaneously, i.e. outputs are merged as they are fetched.

    Secondary Sort

    To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce. The grouping comparator is specified via Job.setGroupingComparatorClass(Class). The sort order is controlled by Job.setSortComparatorClass(Class).

    For example, say that you want to find duplicate web pages and tag them all with the url of the "best" known example. You would set up the job like this (a grouping comparator sketch follows the list):
    • Map Input Key: url
    • Map Input Value: document
    • Map Output Key: document checksum, url pagerank
    • Map Output Value: url
    • Partitioner: by checksum
    • OutputKeyComparator: by checksum and then decreasing pagerank
    • OutputValueGroupingComparator: by checksum
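
    As a sketch of the grouping side, assuming the composite map output key is serialized as a Text of the form "checksum\tpagerank" (the class name and key encoding are assumptions, not part of the Javadoc):

     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.io.WritableComparable;
     import org.apache.hadoop.io.WritableComparator;

     public class ChecksumGroupingComparator extends WritableComparator {
       public ChecksumGroupingComparator() {
         super(Text.class, true);  // create Text instances for deserialization
       }

       @Override
       public int compare(WritableComparable a, WritableComparable b) {
         // group on the checksum prefix only; the full (checksum, pagerank)
         // key still drives the sort order set via Job.setSortComparatorClass
         String checksumA = a.toString().split("\t", 2)[0];
         String checksumB = b.toString().split("\t", 2)[0];
         return checksumA.compareTo(checksumB);
       }
     }

    It would be registered with job.setGroupingComparatorClass(ChecksumGroupingComparator.class).
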
  3. Reduce

    In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs.

    The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).

The output of the Reducer is not re-sorted.

Example:

 import java.io.IOException;

 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.mapreduce.Reducer;

 public class IntSumReducer<Key> extends Reducer<Key,IntWritable,
                                                 Key,IntWritable> {
   private IntWritable result = new IntWritable();
 
   public void reduce(Key key, Iterable<IntWritable> values,
                      Context context) throws IOException, InterruptedException {
     // sum every count the framework grouped under this key
     int sum = 0;
     for (IntWritable val : values) {
       sum += val.get();
     }
     result.set(sum);
     context.write(key, result);
   }
 }

org.apache.hadoop.mapreduce.lib.input 
Class TextInputFormat


@InterfaceAudience.Public
@InterfaceStability.Stable
public class TextInputFormat
    extends FileInputFormat<LongWritable, Text>

An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return are used to signal end of line. Keys are the position in the file, and values are the line of text.
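
A driver fragment (assuming an existing Job named job); TextInputFormat is already the default for the new API, but setting it explicitly documents the key/value contract:

 // keys: LongWritable byte offsets; values: Text lines
 job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);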


org.apache.hadoop.mapreduce.lib.input 
Class FileInputFormat<K,V>

The base class for all file-based InputFormats.

Provides a generic implementation of getSplits(JobContext). Subclasses can override the isSplitable(JobContext, Path) method to ensure input files are not split up, so that each is processed as a whole by a Mapper.
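
A sketch of such an override (the subclass name is hypothetical), forcing one split per file so each Mapper sees a whole file:

 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapreduce.JobContext;
 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

 public class WholeFileTextInputFormat extends TextInputFormat {
   @Override
   protected boolean isSplitable(JobContext context, Path file) {
     // never split: each file becomes exactly one InputSplit,
     // processed end-to-end by a single Mapper
     return false;
   }
 }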
