Hadoop (Part 2): MapReduce

Latest recommended article published on 2024-10-15 20:04:37

带着小板凳学习

Tags: mapreduce, hadoop, parallel computing

Article link: https://blog.csdn.net/u013963380/article/details/61926482

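Next, we introduce the second core component of Hadoop: MapReduce.
1. Introduction to MapReduce
MapReduce is an important technology that originated at Google. It is a programming model for running parallel computations over large volumes of data. MapReduce is also a parallel computing framework: developers can use it to write applications with relatively little effort, and these applications run on a Hadoop cluster and process terabyte-scale data sets in parallel in a reliable, fault-tolerant way.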

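As mentioned above, MapReduce is a computing framework, and a framework implies an agreed-upon way of writing programs. When you write a MapReduce job, the computation is split into two phases: the map phase and the reduce phase. Each phase takes key/value pairs as its input and produces key/value pairs as its output. The developer's job is to define the functions for these two phases: the map function and the reduce function.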
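
To make the two phases concrete, here is a minimal in-memory sketch in plain Java. It does not use Hadoop at all: the class name MapReduceSketch and the shuffle helper are illustrative assumptions, and the grouping step only imitates what the framework does between map and reduce on a real cluster.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the map and reduce phases; no Hadoop classes are involved.
class MapReduceSketch {

    // Map phase: one input line in, zero or more (word, 1) pairs out.
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : line.split("\t")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1L));
            }
        }
        return out;
    }

    // Grouping step (assumption: stands in for the framework's shuffle):
    // collect all values that share a key, with keys kept in sorted order.
    static Map<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: one key and all of its values in, one summed value out.
    static long reduce(String key, Iterable<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs map on every line, groups the pairs, then reduces each group.
    static Map<String, Long> run(List<String> lines) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two tab-separated sample lines (made-up data).
        System.out.println(run(Arrays.asList("hello\tworld", "hello\thadoop")));
        // prints {hadoop=1, hello=2, world=1}
    }
}
```

The point of the sketch is the shape of the data: map turns each record into key/value pairs, equal keys are brought together, and reduce sees one key with all of its values.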

2. A MapReduce Example
Next, let's briefly walk through the WordCount example for MapReduce. The code is as follows:

package hfut.xudong.hadoop.wc;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
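    // Mapper and Reducer are the base classes the framework expects; they correspond to the map and reduce phases.
    // If you only need the result of the map phase, you can leave out the reducer.
    // The nested class must be static so that Hadoop can instantiate it by reflection.
    public static class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable>{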

        @Override
        protected void map(LongWritable key, Text value,Context context)
                throws IOException, InterruptedException {
            // The map step is much like ordinary Java text processing: split each line on tabs,
            // then emit every word as a key/value pair whose value is 1.
            String line=value.toString();
            String[] words=StringUtils.split(line, "\t");

            for(String word:words){
                context.write(new Text(word), new LongWritable(1));
            }

        }
    }
    // Note: the map output types must match the reduce input types.
    public static class WCReducer extends Reducer<Text, LongWritable, Text, LongWritable>{
        // reduce takes the map output as its input and aggregates it: the values under the same key are summed.
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values,Context context)
                throws IOException, InterruptedException {
            long count=0; // long, to match LongWritable and avoid overflow on large inputs
            for(LongWritable value:values){
                 count +=value.get();

            }
            context.write(key, new LongWritable(count));
        }

    }
    // main: configures what the job needs in order to run
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf=new Configuration();
        Job job=Job.getInstance(conf);

        // Tell Hadoop which jar contains the classes used by this job
        job.setJarByClass(WordCount.class); // the jar is located through this class's class loader
        job.setJobName("wordcount");

        // The mapper and reducer classes used by this job
        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReducer.class);

        // Key/value types of the map output and of the final (reduce) output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        //FileInputFormat.addInputPath(job, new Path(args[0]));
        // Input and output paths for the data
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true)?0:1);
    }
}

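For details on how to run it, you can search online. I wrote the program in Eclipse on my local machine, packaged it as a jar, uploaded it to the cluster, and ran it with the command: hadoop jar wordcount_examples.jar WordCount /data/input /data/output. Afterwards the results can be viewed in HDFS, in the output directory under /data. Two things to watch for: because the class above is declared in the package hfut.xudong.hadoop.wc, the main class generally has to be given by its fully qualified name (hfut.xudong.hadoop.wc.WordCount), depending on how the jar was built; and the output directory must not already exist, otherwise the job fails at submission.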
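
With the default output format (TextOutputFormat), each result line consists of the key, a tab, and the value, written to files such as part-r-00000 in the output directory. The following sketch parses lines of that shape in plain Java. The class name OutputLineParser and the sample lines are hypothetical; on a real cluster the lines would come from the output files, for example via hdfs dfs -cat /data/output/part-r-00000.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Parses WordCount result lines of the form "<word>\t<count>".
class OutputLineParser {

    static Map<String, Long> parse(List<String> lines) {
        // LinkedHashMap keeps the order in which the lines were read.
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // skip lines that do not have the expected shape
            }
            String word = line.substring(0, tab);
            long count = Long.parseLong(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for the contents of a result file.
        List<String> sample = Arrays.asList("hadoop\t1", "hello\t2", "world\t1");
        System.out.println(parse(sample));
        // prints {hadoop=1, hello=2, world=1}
    }
}
```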

3. How MapReduce Runs
[Figure: diagram of the MapReduce execution flow; image not preserved in the source]