WordCount Source Code Analysis

This article dissects the MapReduce implementation of WordCount, covering the Map phase, the Reduce phase, the Input and Output stages, and the use of the Mapper and Reducer abstract classes. The Map tasks read the input file in parallel and turn each word into a <key, value> pair; Reduce then sorts and merges the Map output to obtain the final word frequencies. By setting the output data types and adjusting the Job configuration, the word-count job is put together and run.

To turn an abstract problem into a concrete one and gain a deeper understanding of how MapReduce works, this article uses WordCount as an example and analyzes in detail how MapReduce executes: which steps the execution passes through and what each step produces. Broadly speaking, the workflow is that Input reads the text content from HDFS in parallel, the data passes through the MapReduce model, and the analysis result is finally wrapped by Output and persisted back to HDFS.

I. The Map Phase of WordCount

1. Three Map tasks read the three lines of the input file in parallel and apply the map operation to the words they read; every word is emitted as a <key, value> pair (the resulting pairs are listed in Section III below).


2. The Map-side source code is shown below:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {

    // reusable output objects: the count is always 1, and the Text is reset for each token
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // split the line on whitespace and emit <word, 1> for every token
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken().toLowerCase());
            context.write(word, one);
        }
    }
}
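
As a quick standalone check of the tokenization that map() performs (plain Java, no Hadoop needed; the class name is illustrative), StringTokenizer splits a line on whitespace and each lower-cased token would be emitted with a count of 1:

import java.util.StringTokenizer;

public class TokenizeDemo {
    public static void main(String[] args) {
        StringTokenizer itr = new StringTokenizer("Hello World Bye World");
        while (itr.hasMoreTokens()) {
            // same per-token handling as in WordMapper.map(), printed instead of written to the context
            System.out.println("<" + itr.nextToken().toLowerCase() + ", 1>");
        }
        // prints <hello, 1>, <world, 1>, <bye, 1>, <world, 1>, one pair per line
    }
}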


II. The Reduce Phase of WordCount

1. The Reduce operation sorts and merges the Map output and finally produces the word frequencies (the intermediate results are listed step by step in Section III below).


2. The Reduce-side source code is shown below:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // reusable output object holding the summed count for the current key
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int sum = 0;
        // add up all counts emitted for this word
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);   // emit <word, total count>
    }
}


III. The Result of Each Stage

1. Input

Hello World Bye World
Hello Hadoop Bye Hadoop
Bye Hadoop Hello Hadoop

2. Map output

<Hello, 1>
<World, 1>
<Bye, 1>
<World, 1>
<Hello, 1>
<Hadoop, 1>
<Bye, 1>
<Hadoop, 1>
<Bye, 1>
<Hadoop, 1>
<Hello, 1>
<Hadoop, 1>

3. Sort output

<Bye, 1>
<Bye, 1>
<Bye, 1>
<Hadoop, 1>
<Hadoop, 1>
<Hadoop, 1>
<Hadoop, 1>
<Hello, 1>
<Hello, 1>
<Hello, 1>
<World, 1>
<World, 1>

4. Combine output

<Bye, 1, 1, 1>
<Hadoop, 1, 1, 1, 1>
<Hello, 1, 1, 1>
<World, 1, 1>

5. Reduce output

<Bye, 3>
<Hadoop, 4>
<Hello, 3>
<World, 2>
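
Note: the "Combine" listing above shows each key together with all of its 1 values, i.e. the <key, (collection of values)> grouping that reduce() receives. Independently of that, the driver in Section VI also registers WordReducer as a Combiner via job.setCombinerClass(WordReducer.class); running the Reducer as a Combiner pre-sums the counts on the map side before the shuffle, which is safe for WordCount because integer addition is associative and commutative.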

Supplement: the MergeSort (merge sort) process works as follows:


(1) First bisection of the Map output, marked with the "|" separator

<Hello, 1><World, 1><Bye, 1><World, 1><Hello, 1><Hadoop, 1> | <Bye, 1><Hadoop, 1><Bye, 1><Hadoop, 1><Hello, 1><Hadoop, 1>
(2) Second bisection of the Map output (building on the first), marked with the "||" separator

<Hello, 1><World, 1><Bye, 1> || <World, 1><Hello, 1><Hadoop, 1> | <Bye, 1><Hadoop, 1><Bye, 1> || <Hadoop, 1><Hello, 1><Hadoop, 1>
(3) Third bisection of the Map output (building on the second), marked with the "|||" separator
<Hello, 1><World, 1> ||| <Bye, 1> || <World, 1><Hello, 1> ||| <Hadoop, 1> | <Bye, 1><Hadoop, 1> ||| <Bye, 1> || <Hadoop, 1><Hello, 1> ||| <Hadoop, 1>
(4) Result of the first MergeArray (merge) pass
<Hello, 1><World, 1> ||| <Bye, 1> || <World, 1><Hello, 1> ||| <Hadoop, 1> | <Bye, 1><Hadoop, 1> ||| <Bye, 1> || <Hadoop, 1><Hello, 1> ||| <Hadoop, 1>
(5) Result of the second MergeArray pass
<Bye, 1><Hello, 1><World, 1> ||<Hadoop, 1><Hello, 1><World, 1> | <Bye, 1><Bye, 1><Hadoop, 1> || <Hadoop, 1><Hadoop, 1><Hello, 1>
(6) Result of the third MergeArray pass
<Bye, 1><Hadoop, 1><Hello, 1><Hello, 1><World, 1><World, 1> | <Bye, 1><Bye, 1><Hadoop, 1><Hadoop, 1><Hadoop, 1><Hello, 1>
(7) Final MergeArray result
<Bye, 1><Bye, 1><Bye, 1><Hadoop, 1><Hadoop, 1><Hadoop, 1><Hadoop, 1><Hello, 1><Hello, 1><Hello, 1><World, 1><World, 1>

Note:

Some intermediate steps of the merge sort above are omitted, so the listing is not exhaustive. A small Java sketch of the same divide-and-merge idea follows.
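
The sketch below is a plain-Java illustration of merge sort on the Map output keys (it mirrors the bisection and MergeArray steps above; it is not Hadoop's actual sort implementation):

import java.util.Arrays;

public class MergeSortDemo {

    static String[] mergeSort(String[] a) {
        if (a.length <= 1) return a;                        // base case: a single element is sorted
        int mid = a.length / 2;                             // bisection, like the "|" steps above
        String[] left = mergeSort(Arrays.copyOfRange(a, 0, mid));
        String[] right = mergeSort(Arrays.copyOfRange(a, mid, a.length));
        return merge(left, right);                          // MergeArray: merge two sorted halves
    }

    static String[] merge(String[] l, String[] r) {
        String[] out = new String[l.length + r.length];
        int i = 0, j = 0, k = 0;
        while (i < l.length && j < r.length)
            out[k++] = l[i].compareTo(r[j]) <= 0 ? l[i++] : r[j++];
        while (i < l.length) out[k++] = l[i++];
        while (j < r.length) out[k++] = r[j++];
        return out;
    }

    public static void main(String[] args) {
        String[] keys = {"Hello", "World", "Bye", "World", "Hello", "Hadoop",
                         "Bye", "Hadoop", "Bye", "Hadoop", "Hello", "Hadoop"};
        // prints [Bye, Bye, Bye, Hadoop, Hadoop, Hadoop, Hadoop, Hello, Hello, Hello, World, World]
        System.out.println(Arrays.toString(mergeSort(keys)));
    }
}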


IV. The Mapper Abstract Class

The Map class we wrote extends the Mapper<Object, Text, Text, IntWritable> abstract class, whose methods (shown here in simplified form) look like this:

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
	public class Context extends MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
		public Context(Configuration conf, TaskAttemptID taskid, 
			   RecordReader<KEYIN, VALUEIN> reader,
			   RecordWriter<KEYOUT, VALUEOUT> writer, 
			   OutputCommitter committer,
			   StatusReporter reporter,
			   InputSplit split) throws IOException, InterruptedException {
			super(conf, taskid, reader, writer, committer, reporter, split);
		}
	}
	
	/**
	 * Called once at the beginning of the task.
	 */
	protected void setup(Context context) throws IOException, InterruptedException {
		// NOTHING
	}

	/**
	 * Called once for each key/value pair in the input split. Most applications should
	 * override this, but the default is the identity function.
	 */
	@SuppressWarnings("unchecked")
	protected void map(KEYIN key, VALUEIN value, Context context) throws IOException, InterruptedException {
		context.write((KEYOUT) key, (VALUEOUT) value);
	}
	
	/**
	 * Called once at the end of the task.
	 */
	protected void cleanup(Context context) throws IOException, InterruptedException {
		// NOTHING
	}
	/**
	 * Expert users can override this method for more complete control over the 
	 * execution of the Mapper.
	 * 
	 * @param context
	 * @throws IOException
	 */
	public void run(Context context) throws IOException, InterruptedException {
		setup(context);
		while(context.nextKeyValue()) {
			map(context.getCurrentKey(), context.getCurrentValue(), context);
		}
		cleanup(context);
	}
}
Analysis:
  1. In MapReduce programming you always extend the Mapper abstract class and usually override map(). Each call to map() receives one key-value pair, processes it, and emits the processed data.
  2. Override setup() to do preprocessing for the Map Task (a sketch of this follows the example below).
  3. Override cleanup() to do post-processing work after the Map Task.
  4. run() acts as the driver of the Map Task; it provides the setup() -> map() -> cleanup() execution template.
  5. In Mapper<Object, Text, Text, IntWritable>, the first two type parameters are the input types of map() and the last two are its output types. Inside map(), the Object key parameter is the byte offset of the current line, the Text value parameter is the text of that line, and the Context context parameter is the Map-side context object of the MapReduce model.
Note:
For example, if the input file test.txt contains:
hello world
easy hadoop
then map() is executed twice. On the first call, the Object key is 0 and the Text value is "hello world"; on the second call, the Object key is 12 (the byte offset of the second line, assuming each line ends with a single \n) and the Text value is "easy hadoop".
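
As a hypothetical illustration of points 2 and 3 (it is not part of the original WordCount, and the class, field, and stop-word choices are made up for the example), a Mapper might override setup() to build a stop-word set once per Map Task and cleanup() to report when the task ends:

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FilteredWordMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private Set<String> stopWords;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // runs once before the first map() call of this Map Task
        stopWords = new HashSet<String>(Arrays.asList("a", "an", "the", "of"));
    }

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            String token = itr.nextToken().toLowerCase();
            if (!stopWords.contains(token)) {   // filter using the data prepared in setup()
                word.set(token);
                context.write(word, one);
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // runs once after the last map() call of this Map Task
        System.out.println("map task finished");
    }
}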

V. The Reducer Abstract Class
The Reducer class we wrote extends the Reducer<Text, IntWritable, Text, IntWritable> abstract class, whose methods (shown here in simplified form) look like this:
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
	public class Context extends ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
		public Context(Configuration conf, TaskAttemptID taskid,
			   RawKeyValueIterator input,
			   Counter inputKeyCounter,
			   Counter inputValueCounter,
			   RecordWriter<KEYOUT, VALUEOUT> output,
			   OutputCommitter committer,
			   StatusReporter reporter,
			   RawComparator<KEYIN> comparator,
			   Class<KEYIN> keyClass,
			   Class<VALUEIN> valueClass
			   ) throws IOException, InterruptedException {
			super(conf, taskid, input, inputKeyCounter, inputValueCounter,
				  output, committer, reporter, comparator, keyClass, valueClass);
		}
	}

	/**
	 * Called once at the start of the task.
	 */
	protected void setup(Context context) throws IOException, InterruptedException {
		// NOTHING
	}

	/**
	 * This method is called once for each key. Most applications will define
	 * their reduce class by overriding this method. The default implementation
	 * is an identity function.
	 */
	@SuppressWarnings("unchecked")
	protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
			throws IOException, InterruptedException {
		for (VALUEIN value : values) {
			context.write((KEYOUT) key, (VALUEOUT) value);
		}
	}
   
	/**
	 * Called once at the end of the task.
	 */
	protected void cleanup(Context context) throws IOException, InterruptedException {
		// NOTHING
	}

	public void run(Context context) throws IOException, InterruptedException {
		setup(context);
		while (context.nextKey()) {
			reduce(context.getCurrentKey(), context.getValues(), context);
		}
		cleanup(context);
	}
}
Analysis:
  1. The methods of the Reducer<Text, IntWritable, Text, IntWritable> abstract class closely mirror those of Mapper<Object, Text, Text, IntWritable>; apart from reduce() taking the place of map(), they are the same and serve the same purposes.
  2. In Reducer<Text, IntWritable, Text, IntWritable>, the first two type parameters are the Reduce input types (i.e. the Map output types) and the last two are the Reduce output types. Inside reduce(), the Text key parameter is a key emitted by the Map side, the Iterable<IntWritable> values parameter is the collection of values that share that key, and the Context context parameter is the Reduce-side context of the MapReduce job.
  3. The Reduce function: it fetches the intermediate results of the map function; it groups the values of the intermediate results by key, with the groups sorted by key, forming a <key, (collection of values)> structure in which each key is unique; it then processes all values in a group, adding up the values of the same key, so that each key ends up with a single value and the final <key, value> pairs are formed.
Note:
For example, if the Map side outputs the following:
<hello, 1>
<hello, 1>
<hello, 1>
<hello, 1>
then inside reduce() the Text key is hello and the Iterable<IntWritable> values is [1, 1, 1, 1]. A plain-Java sketch of this grouping follows.
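
The sketch below (illustrative plain Java, not framework code; the class name is made up) mimics the grouping step: pairs sharing a key are collected into one sorted-by-key bucket, which is exactly what reduce() then receives as its values argument:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupingDemo {
    public static void main(String[] args) {
        // simulated Map output: four <hello, 1> pairs
        String[] keys = {"hello", "hello", "hello", "hello"};
        int[] vals = {1, 1, 1, 1};

        // group the values by key; TreeMap keeps the keys sorted, like the shuffle/sort phase
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (int i = 0; i < keys.length; i++) {
            List<Integer> list = grouped.get(keys[i]);
            if (list == null) {
                list = new ArrayList<Integer>();
                grouped.put(keys[i], list);
            }
            list.add(vals[i]);
        }

        // prints {hello=[1, 1, 1, 1]} -- the key and value collection handed to reduce()
        System.out.println(grouped);
    }
}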

VI. The MapReduce Driver
Now that we have a Mapper and a Reducer, we still need a MapReduce driver, a component usually called the Driver. It plays the role of main() in a Java program: it initializes the Job, tells the Hadoop platform to run your code over the set of input files, and controls where the output files are placed. The source code is shown below:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordMain {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordMain.class);        // main class     
        job.setMapperClass(WordMapper.class);     // mapper
        job.setCombinerClass(WordReducer.class);  // combiner
        job.setReducerClass(WordReducer.class);   // reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));  // file input
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));// file output
        System.exit(job.waitForCompletion(true) ? 0 : 1);           // wait for completion
    }
}
Analysis:
  1. The job of main() is to run the word-count Job over the files in the given input folder (the InputPath argument). The Reduce output is written to the folder specified by OutputPath, and the configuration used to run the Job is held in the Job object. The map and reduce functions are set with the setMapperClass() and setReducerClass() methods.
  2. The data types produced by Reduce are set with setOutputKeyClass() and setOutputValueClass(). By default these are also assumed to be the Map output types; if the Map output types differ, they can be declared separately with the Job methods setMapOutputKeyClass() and setMapOutputValueClass() (a sketch follows this list).
  3. Calling job.waitForCompletion(true) submits the Job to the MapReduce framework; the call blocks until the Job finishes and returns true if it succeeded and false otherwise, which is why the driver exits with 0 or 1 accordingly.
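
As a hypothetical sketch of point 2 (not taken from this article's WordCount), if the Mapper emitted <Text, Text> pairs while the final Reduce output stayed <Text, IntWritable>, the Job would be configured like this:

job.setMapOutputKeyClass(Text.class);        // key type of the Map output
job.setMapOutputValueClass(Text.class);      // value type of the Map output, different from the Reduce output
job.setOutputKeyClass(Text.class);           // key type of the final Reduce output
job.setOutputValueClass(IntWritable.class);  // value type of the final Reduce output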

VII. Complete WordCount Source Code
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static class WordMapper extends
            Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken().toLowerCase());
                context.write(word, one);
            }
        }
    }

    public static class WordReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        job.setCombinerClass(WordReducer.class);
        job.setReducerClass(WordReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
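
To run the job, the class is typically packaged into a jar and submitted with a command along the lines of "hadoop jar wordcount.jar WordCount <input dir> <output dir>" (the jar name and paths here are placeholders); note that the output directory must not exist yet, otherwise the job fails at submission time.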
