hadoop学习笔记之二：MapReduce基本编程_mapreduce基本编程方法-CSDN博客

引言

在本系列的上篇文章中介绍了Hadoop的基本概念和架构，本文将通过一个实例演示MapReduce基本编程。在继续进行前希望能重温下前面的内容，至少理解这张图是怎么回事。

实践

创建maven工程并加入hadoop依赖

我们选用maven来管理工程，用自己喜爱的m2eclipse插件在eclipse里创建或在命令行里创建一个工程。在pom.xml里加入hadoop依赖。

   
   
<dependency>
   
   
 <groupId>org.apache.hadoop</groupId>
   
   
 <artifactId>hadoop-core</artifactId>
   
   
 <version>0.20.2</version>
   
   
</dependency>
   
   

   
   
<repositories>
   
   
 <repository>
   
   
  <id>cloudera</id>
   
   
  <url>https://repository.cloudera.com/content/groups/public</url>
   
   
 </repository>
   
   
</repositories>

运行mvn eclipse:eclipse命令后，将工程导入eclipse，可以看到以下相关的依赖。

Ok，现在开始我们第一个MapReduce程序，用这个程序实现字数统计功能。

概述

一个简单的MapReduce程序需要三样东西

1.实现Mapper，处理输入的对，输出中间结果

2.实现Reduce，对中间结果进行运算，输出最终结果

3.在main方法里定义运行作业，定义一个job，在这里控制job如何运行等。

编写Map类

   
   
public  class WordCountMapper 
   
   
       extends Mapper <Object, Text, Text, IntWritable>{
   
   
    private final static IntWritable one = new IntWritable(1);
   
   
    private Text word = new Text();
   
   
    public void map(Object key, Text value, Context context
   
   
                    ) throws IOException, InterruptedException {
   
   
      StringTokenizer itr = new StringTokenizer(value.toString());
   
   
      while (itr.hasMoreTokens()) {
   
   
        word.set(itr.nextToken());
   
   
        context.write(word, one);
   
   
      }
   
   
    }
   
   
  }

Mapper接口是一个泛型，有4个形式的参数类型，分别指定map函数的输入键，输入值，输出键，输出值。就上面的示例来说，输入键没有用到（实际代表行在文本中格的位置，没有这方面的需要，所以忽略），输入值是一样文本，输出键为单词，输出值为整数代表单词出现的次数。需要注意的是Hadoop规定了自己的一套可用于网络序列优化的基本类型，而不是使用内置的java类型，这些都在org.apache.hadoop.io包中定义，上面使用的Text类型相当于java的String类型，IntWritable类型相当于java的Integer类型。除此之外，看不到任何分布式编程的细节，一切都是那么的简单。

编写Reduce类

   
   
public class WordCountReducer extends
   
   
		Reducer <Text, IntWritable, Text, IntWritable> {
   
   
	private IntWritable result = new IntWritable();
   
   

   
   
	public void reduce(Text key, Iterable values, Context context)
   
   
			throws IOException, InterruptedException {
   
   
		int sum = 0;
   
   
		for (IntWritable val : values) {
   
   
			sum += val.get();
   
   
		}
   
   
		result.set(sum);
   
   
		context.write(key, result);
   
   
	}
   
   
}

同样，Reducer接口的四个形式参数类型指定了reduce函数的输入和输出类型。在上面的例子中，输入键是单词，输入值是单词出现的次数，将单词出现的次数进行叠加，输出单词和单词总数。

定义job

   
   
public class WordCount {
   
   
  public static void main(String[] args) throws Exception {
   
   
     Configuration conf = new Configuration();
   
   
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
   
   
         if (otherArgs.length != 2) {
   
   
           System.err.println("Usage: wordcount  ");
   
   
           System.exit(2);
   
   
         }
   
   
         /**创建一个job，起个名字以便跟踪查看任务执行情况**/
   
   
         Job job = new Job(conf, "word count");
   
   

   
   
         /**当在hadoop集群上运行作业时，需要把代码打包成一个jar文件（hadoop会在集群分发这个文件），通过job的setJarByClass设置一个类，hadoop根据这个类找到所在的jar文件**/
   
   
        job.setJarByClass(WordCount.class);
   
   
        
   
   
        /**设置要使用的map、combiner、reduce类型**/
   
   
        job.setMapperClass(WordCountMapper.class);   
   
   
        job.setCombinerClass(WordCountReducer.class);
   
   
        job.setReducerClass(WordCountReducer.class);
   
   
        
   
   
       /**设置map和reduce函数的输入类型，这里没有代码是因为我们使用默认的TextInputFormat，针对文本文件，按行将文本文件切割成 InputSplits, 并用 LineRecordReader 将 InputSplit 解析成 <key,value&gt: 对，key 是行在文件中的位置，value 是文件中的一行**/
   
   

   
   

   
   
        /**设置map和reduce函数的输出键和输出值类型**/
   
   
        job.setOutputKeyClass(Text.class);
   
   
        job.setOutputValueClass(IntWritable.class);
   
   

   
   
        /**设置输入和输出路径**/
   
   
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
   
   
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
   
   

   
   
       /**提交作业并等待它完成**/
   
   
        System.exit(job.waitForCompletion(true) ? 0 : 1);
   
   
      }
   
   
}