A First Look at Hadoop MapReduce: The WordCount Example

cjf_wei

Article link: https://blog.csdn.net/cjf_wei/article/details/77606783

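MapReduce is a programming model for processing massive data sets in parallel on large clusters. It makes parallel processing of large data sets easier and more fault-tolerant. This article uses the WordCount program to introduce some MapReduce basics. The code was developed in Eclipse, packaged as a jar, and run on a single-node pseudo-distributed cluster.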

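A MapReduce program consists of a Mapper class, a Reducer class, and some driver code that runs the job. Map breaks the work into many smaller tasks, and reduce gathers the results of those tasks into the final answer. The framework works on key-value pairs: the input to a job is a collection of key-value pairs and so is the output, but the input and output types do not have to be the same. The form of the data at each step of a MapReduce job is as follows: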

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
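The flow above can be imitated in memory to make each step concrete. The following is only a plain-Java sketch with no Hadoop classes: the class name WordCountFlowSketch and its methods are made up for illustration, and the sorted grouping in shuffle() is a simplified stand-in for what the framework does between map and reduce.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of map -> shuffle -> reduce; not Hadoop code.
class WordCountFlowSketch {

    // "map": turn one line <k1, v1> into a list of <word, 1> pairs <k2, v2>.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.split("\\W+")) {
            if (!w.isEmpty()) {
                out.add(new SimpleEntry<String, Integer>(w, 1));
            }
        }
        return out;
    }

    // "shuffle": group the values by key, with keys kept in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // "reduce": sum the list of counts collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run the three steps over all input lines and return <word, total> pairs <k3, v3>.
    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        shuffle(mapped).forEach((word, counts) -> result.put(word, reduce(counts)));
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, mapreduce=1}
        System.out.println(run(Arrays.asList("hello hadoop", "hello mapreduce")));
    }
}
```

In a real job the three steps run in separate tasks, possibly on different machines; the sketch only shows how the data changes shape.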

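WordCount Walkthrough
Map stage: the input files are first divided into splits based on the HDFS block size (128 MB by default in Hadoop 2.x and later; clusters are often configured with a larger value such as 256 MB). Each split is the input to one map task, and the map tasks run in parallel. This is why a large number of small files on HDFS not only burdens the NameNode but also puts pressure on MapReduce resource scheduling. The map input key is a LongWritable by default and holds the byte offset of the start of the line within the file. This article does not use the offset, so the key type is declared as Object and the key is ignored. For example, if the input text is: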

hello hadoop
hello mapreduce

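then the key and value passed to map are:

(0,hello hadoop)  // key is 0, the byte offset of the first line
(13,hello mapreduce) // key is 13: the first line takes 12 characters plus 1 newline character (assuming \n line endings)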
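Because the key is a byte offset into the file, the newline that ends each line is counted as well. The following plain-Java sketch computes those offsets for a small string; it assumes UTF-8 text with "\n" line endings, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineOffsetSketch {

    // Returns the byte offset at which each line of content starts,
    // assuming UTF-8 encoding and "\n" as the line terminator.
    static List<Long> lineOffsets(String content) {
        List<Long> offsets = new ArrayList<>();
        long position = 0;
        for (String line : content.split("\n")) {
            offsets.add(position);
            // advance past the line itself plus its one-byte newline
            position += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // prints [0, 13]
        System.out.println(lineOffsets("hello hadoop\nhello mapreduce\n"));
    }
}
```

With Windows-style "\r\n" line endings each line would take one more byte, so the second offset would be 14.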

The map stage splits each line into words and gives every word a count of 1, producing:

(hello,1)
(hadoop,1)
(hello,1)
(mapreduce,1)
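The word-splitting step can be tried outside Hadoop. The sketch below uses the same "\\W+" regular expression as the mapper in this article; the class TokenizeSketch is hypothetical. It also shows one detail worth knowing: when a line starts with a separator, Pattern.split returns an empty first token, so the sketch filters empty tokens out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class TokenizeSketch {

    // "\\W+" matches one or more characters that are not letters, digits, or underscore.
    private static final Pattern NON_WORD = Pattern.compile("\\W+");

    // Splits a line into words, dropping the empty token that split()
    // produces when the line begins with a separator.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : NON_WORD.split(line)) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hello hadoop"));        // prints [hello, hadoop]
        System.out.println(tokenize("  hello, mapreduce!")); // prints [hello, mapreduce]
    }
}
```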

Combiner stage (optional): merges the map output locally to reduce network transfer.

(hadoop,1)
(hello,2) // without a combiner, (hello,1) would be transferred twice
(mapreduce,1)

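Reduce stage: merges the map results. The input key is a word and the value is the list of counts for that word. On its way from map to reduce the data passes through the shuffle stage, which sorts and groups the map output by key. Based on the keys, the Hadoop framework decides which reduce task receives each group.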
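How a key is assigned to a reduce task can be sketched with a hash-based rule. Hadoop's default HashPartitioner applies this formula to the key's own hashCode(); the sketch below is plain Java, uses String.hashCode() as a stand-in for Text, and the class name is hypothetical, so the partition numbers it prints will not necessarily match a real job.

```java
class HashPartitionSketch {

    // Maps a key to a reduce task index in [0, numReduceTasks).
    // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2; // assumed number of reduce tasks for the demo
        for (String word : new String[] {"hello", "hadoop", "mapreduce"}) {
            System.out.println(word + " -> reduce task " + partition(word, numReduceTasks));
        }
    }
}
```

The important property is that the same key always lands in the same partition, which is what lets one reduce call see every count for a word.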

Code example:

package com.test.wordcount;

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        // Precompiled regex: "\\W+" matches one or more non-word characters,
        // i.e. anything other than letters, digits, and underscore
        private static final Pattern splits = Pattern.compile("\\W+");
        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void setup(Context context) {
            // one-time initialization before the first record, if needed
        }

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Split the input line on runs of non-word characters
            String[] words = splits.split(value.toString());
            for (String w : words) {
                if (w.isEmpty()) {
                    continue; // split() yields an empty first token when the line starts with a separator
                }
                word.set(w);              // reuse the Text instance as the output key
                context.write(word, one); // emit the (word, 1) pair through the context
            }
        }

        @Override
        protected void cleanup(Context context) {
            // one-time cleanup after the last record, if needed
        }
    }

    public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,Context context ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");

        job.setJarByClass(WordCount.class);

        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}

Code Analysis
Map

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    // Precompiled regex: "\\W+" matches one or more non-word characters,
    // i.e. anything other than letters, digits, and underscore
    private static final Pattern splits = Pattern.compile("\\W+");
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void setup(Context context) {
        // one-time initialization before the first record, if needed
    }

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Split the input line on runs of non-word characters
        String[] words = splits.split(value.toString());
        for (String w : words) {
            if (w.isEmpty()) {
                continue; // split() yields an empty first token when the line starts with a separator
            }
            word.set(w);              // reuse the Text instance as the output key
            context.write(word, one); // emit the (word, 1) pair through the context
        }
    }

    @Override
    protected void cleanup(Context context) {
        // one-time cleanup after the last record, if needed
    }
}

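The map function is implemented by a Mapper subclass. The Mapper class provides an overridable map() method, along with run(), setup(), and cleanup(). run() checks the context for remaining input and keeps calling map() on each record while there is any. setup() is called once at the start of the task, before the first record, for any required initialization; cleanup() is called once after the last record for any required cleanup. If there is nothing to initialize or clean up, neither needs to be overridden.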
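The order in which these methods are called can be shown with a small stand-in. The sketch below is plain Java and does not use the Hadoop Mapper class: MiniMapper and MapperLifecycleSketch are hypothetical names, and an Iterator of strings takes the place of the Hadoop context. It only illustrates the call order of setup, map per record, and cleanup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class MapperLifecycleSketch {

    // Runs a tiny mapper over the given lines and returns the recorded call order.
    static List<String> trace(List<String> lines) {
        MiniMapper mapper = new MiniMapper() {
            @Override
            protected void map(String value) {
                log.add("map:" + value);
            }
        };
        mapper.run(lines.iterator());
        return mapper.log;
    }

    public static void main(String[] args) {
        // prints [setup, map:hello hadoop, map:hello mapreduce, cleanup]
        System.out.println(trace(Arrays.asList("hello hadoop", "hello mapreduce")));
    }
}

// Hypothetical stand-in for the Mapper life cycle; not the Hadoop class.
abstract class MiniMapper {
    final List<String> log = new ArrayList<>();

    protected void setup() {
        log.add("setup");
    }

    protected abstract void map(String value);

    protected void cleanup() {
        log.add("cleanup");
    }

    // Template method: setup once, map once per record, cleanup once at the end.
    public void run(Iterator<String> records) {
        setup();
        try {
            while (records.hasNext()) {
                map(records.next());
            }
        } finally {
            cleanup();
        }
    }
}
```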
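The generic parameters of Mapper, [Object, Text, Text, IntWritable], are the key and value types of the input and the output: the map input type is [Object, Text] and the output type is [Text, IntWritable]. Text, IntWritable, and similar types implement the Writable interface, which provides Hadoop's own serialization mechanism; they can be thought of as Hadoop wrappers for types such as int and String.
Text is similar to Java's String, and IntWritable corresponds to int. Other built-in types include ObjectWritable, NullWritable, ByteWritable, BooleanWritable, ShortWritable, FloatWritable, LongWritable, and DoubleWritable. You can also define your own Writable type, as in the following example; readFields() and write() must be implemented: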

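import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class MinMaxCountTuple implements Writable {
    private int min;
    private int max;

    // Hadoop creates Writable instances by reflection, so a no-argument constructor is needed
    public MinMaxCountTuple() {
    }

    public MinMaxCountTuple(int min, int max) {
        this.min = min;
        this.max = max;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Read the fields in exactly the order write() emits them
        min = in.readInt();
        max = in.readInt();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(min);
        out.writeInt(max);
    }

    @Override
    public String toString() {
        return "min:" + min + " , max:" + max;
    }
}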
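The write()/readFields() pair can be exercised without a cluster by serializing to a byte array and reading it back. The sketch below is plain Java: the nested MinMax class is a hypothetical stand-in that has the same two methods but does not implement Hadoop's Writable interface, so it shows only the round-trip idea.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WritableRoundTripSketch {

    // Hypothetical stand-in with the same write/readFields shape as a Writable.
    static class MinMax {
        int min;
        int max;

        MinMax() {
        }

        MinMax(int min, int max) {
            this.min = min;
            this.max = max;
        }

        void write(DataOutput out) throws IOException {
            out.writeInt(min);
            out.writeInt(max);
        }

        void readFields(DataInput in) throws IOException {
            // fields are read in the same order they were written
            min = in.readInt();
            max = in.readInt();
        }
    }

    // Serializes the object to bytes and rebuilds a fresh copy from those bytes.
    static MinMax roundTrip(MinMax original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                original.write(out);
            }
            MinMax copy = new MinMax();
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        MinMax copy = roundTrip(new MinMax(3, 42));
        System.out.println("min:" + copy.min + " , max:" + copy.max); // prints min:3 , max:42
    }
}
```

If readFields() read the fields in a different order than write() emitted them, the copy would silently hold swapped values, which is why the order matters.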

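The map method writes key-value pairs through context, an instance of the inner class Context. It carries the framework's execution context and collects the output records produced by map.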

Reducer
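The reduce function is implemented by a Reducer subclass and further processes the map output. Besides reduce(), Reducer provides run(), setup(), and cleanup() methods like those of Mapper. Its generic parameters [Text,IntWritable,Text,IntWritable] likewise give the key and value types of the Reducer's input and output, and the input types [Text,IntWritable] must match the Mapper's output types.
The input to reduce is the map output after the MapReduce framework has shuffled it. After the shuffle, all values with the same key are routed to the same reduce task, so the values parameter of reduce is the list of values for one key.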

public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,Context context ) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

Combiner
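A combiner is an optional local reducer that aggregates results on the map side, which cuts the cost of moving data from map to reduce over the network and so improves MapReduce performance. For example, if one map task sees the word "hadoop" twice, then without a combiner ("hadoop",1) is sent over the network twice before reduce receives it; with a combiner the two pairs are merged locally into ("hadoop",2) and sent once. The combiner is often the same class as the reducer, but only when the operation allows it. Counting qualifies because addition is commutative and associative, so merging partial sums locally does not change the final result. Averaging does not: (avg1 + avg2)/2 is in general not equal to the true mean (avg1*m + avg2*n)/(m+n), where m and n are the numbers of values behind each partial average.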
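The averaging pitfall is easy to check with numbers. The sketch below is plain Java with hypothetical names; the two arrays stand for the values seen by two different map tasks. It compares the naive average of averages with the result obtained by carrying (sum, count) pairs, which is one common way to make averaging safe for local merging.

```java
class AverageCombinerSketch {

    static double sum(double[] values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    static double mean(double[] values) {
        return sum(values) / values.length;
    }

    // Wrong for unequal group sizes: averages the two partial averages.
    static double averageOfAverages(double[] a, double[] b) {
        return (mean(a) + mean(b)) / 2;
    }

    // Correct: merge partial (sum, count) pairs and divide once at the end.
    static double combinedAverage(double[] a, double[] b) {
        double totalSum = sum(a) + sum(b);
        int totalCount = a.length + b.length;
        return totalSum / totalCount;
    }

    public static void main(String[] args) {
        double[] first = {1, 2, 3}; // values seen by one map task (made-up data)
        double[] second = {10};     // values seen by another map task (made-up data)
        System.out.println(averageOfAverages(first, second)); // prints 6.0
        System.out.println(combinedAverage(first, second));   // prints 4.0
    }
}
```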

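Launching the Job
Load the Hadoop configuration and use it to obtain a Job instance; "wordcount" is the name given to the job. The Job class is used to configure the input and output formats and to track and control the execution of the whole job.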

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "wordcount");

setJarByClass() sets the jar to run: Hadoop uses the class passed in to locate the jar file that contains it.

job.setJarByClass(WordCount.class);

The static method FileInputFormat.addInputPath adds an input path to the MapReduce job.
The static method FileOutputFormat.setOutputPath sets the path where the job's output is saved.

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

Set the map, combine, and reduce classes:

job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);

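Set the output key and value types (these also apply to the map output unless setMapOutputKeyClass() and setMapOutputValueClass() are called):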

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

Submitting the Job to the Cluster
1. Package the code as a jar file; during the export step, select the main class for the jar.
2. Prepare the input data

Copy any file from the local file system and upload it to HDFS with hadoop fs -copyFromLocal or hadoop fs -put.

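3. Submit the MR job
hadoop jar wordcount.jar /import/data/wordcount/ /count
/import/data/wordcount/ is the input directory. If it contains subdirectories, the job throws an exception when it runs.
/count is the output path. If it already exists on HDFS, the job cannot be submitted. In other words, you only need to specify an output path that does not yet exist on HDFS, and the job creates it automatically.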

4. MapReduce job scheduling page: http://{IP}:8099/

5. Console output

# hadoop jar wordcount.jar /import/data/wordcount/ /count
17/08/12 11:37:35 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
...
17/08/12 11:37:36 INFO input.FileInputFormat: Total input paths to process : 2
17/08/12 11:37:36 INFO mapreduce.JobSubmitter: number of splits:2
...
job_1502508379954_0002 running in uber mode : false
17/08/12 11:37:46 INFO mapreduce.Job:  map 0% reduce 0%
17/08/12 11:37:57 INFO mapreduce.Job:  map 100% reduce 0%
17/08/12 11:38:03 INFO mapreduce.Job:  map 100% reduce 100%
17/08/12 11:38:05 INFO mapreduce.Job: Job job_1502508379954_0002 completed successfully

    Map-Reduce Framework
    ...
    Shuffle Errors
    ...
    File Input Format Counters 
        Bytes Read=156
    File Output Format Counters 
        Bytes Written=81
# hadoop fs -cat /count/*
a   2
compute 2
data    2
framework   2
hadoop  2
hello   4
mapreduce   4
mass    2
test    2
to  2