WordCount
A code walkthrough of the Hadoop WordCount example.
Note: this is only a study note and includes material gathered from the web or other sources; sources are credited at the end. If any attribution is missing, please point it out so it can be corrected.
Purpose
Counts how many times each word appears in the input files.
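For example (illustrative input, not from the source), given a file containing the single line hello world hello, the job outputs:

hello   2
world   1

(one word per line; with the default TextOutputFormat the key and value are tab-separated).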
Map
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
1. Inherits from org.apache.hadoop.mapreduce.Mapper and overrides the public void map(Object key, Text value, Context context) throws IOException, InterruptedException method.
2. The input key is the byte offset of the line, value is one line of text, and context is the context object used to emit output.
3. The output is a <word, 1> key-value pair for every word, as illustrated by the sketch below.
4. The generic parameters Object, Text, Text, IntWritable are the input key type, input value type, output key type, and output value type, respectively.
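To make the map logic concrete, here is a standalone sketch (plain Java, outside Hadoop; the sample line is made up) of what the tokenizer loop emits for one line of input:

import java.util.StringTokenizer;

public class MapSketch {
    public static void main(String[] args) {
        // Simulates one call to map() with value = "hello world hello"
        StringTokenizer itr = new StringTokenizer("hello world hello");
        while (itr.hasMoreTokens()) {
            // The real Mapper does context.write(word, one) here
            System.out.println("<" + itr.nextToken() + ", 1>");
        }
    }
}

Running it prints <hello, 1>, <world, 1>, <hello, 1>: one pair per token, with duplicates preserved for the reduce step to aggregate.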
Combiner and Reducer
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
1. Inherits from org.apache.hadoop.mapreduce.Reducer and overrides the public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException method.
2. The input key is a word, and values is the list of 1s produced by map, one for each occurrence of that word.
3. The output is <word, count>, where count is the final result; see the sketch after this list.
4. Text, IntWritable, Text, IntWritable are the input key type, input value type, output key type, and output value type of the reduce (or combiner) step, respectively.
The same class can serve as both the Combiner and the Reducer here because addition is associative and commutative, so partial sums computed on the map side do not change the final counts.
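Correspondingly, a standalone sketch of the reduce logic (plain Java, outside Hadoop; the key and value list are made up) for the key "hello" after the framework has grouped the map output:

import java.util.Arrays;
import java.util.List;

public class ReduceSketch {
    public static void main(String[] args) {
        // Simulates reduce("hello", [1, 1, 1], context)
        List<Integer> values = Arrays.asList(1, 1, 1);
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        // The real Reducer does context.write(key, result) here
        System.out.println("<hello, " + sum + ">"); // prints <hello, 3>
    }
}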
Main function
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();                                      //1
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length < 2) {
    System.err.println("Usage: wordcount <in> [<in>...] <out>");
    System.exit(2);
  }
  Job job = Job.getInstance(conf, "word count");                                 //2
  job.setJarByClass(WordCount.class);                                            //3
  job.setMapperClass(TokenizerMapper.class);                                     //4
  job.setCombinerClass(IntSumReducer.class);                                     //5
  job.setReducerClass(IntSumReducer.class);                                      //6
  job.setOutputKeyClass(Text.class);                                             //7
  job.setOutputValueClass(IntWritable.class);                                    //8
  for (int i = 0; i < otherArgs.length - 1; ++i) {
    FileInputFormat.addInputPath(job, new Path(otherArgs[i]));                   //9
  }
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1])); //10
  System.exit(job.waitForCompletion(true) ? 0 : 1);                              //11
}
1. Create a Configuration instance, used to build the Job instance
2. Create the Job instance from the Configuration instance
3. Set the job's jar by the class it contains
4. Set the Mapper class
5. Set the Combiner class
6. Set the Reducer class
7. Set the final output key type
8. Set the final output value type
9. Add the input file paths (every argument except the last)
10. Set the output file path (the last argument)
11. Submit the job and wait for it to complete
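Once packaged, the job is launched with the standard Hadoop launcher; the jar name and paths below are placeholders, not from the source:

hadoop jar wordcount.jar org.apache.hadoop.examples.WordCount /user/hadoop/input /user/hadoop/output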
Summary
A map-reduce application consists of:
1. Deciding the final input and output key and value types, then implementing the Mapper and Reducer
2. In the entry point, configuring the Mapper, Combiner, Reducer, the final output key type, and the final output value type; adding the input paths; setting the output path; and submitting the job
Setting the input and output paths
FileInputFormat.addInputPath can be called multiple times to add several inputs, while FileOutputFormat.setOutputPath sets the single output directory, which must not already exist when the job starts.
FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
Submitting the job
Job.submit()
Job.waitForCompletion(boolean)
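The two methods differ in blocking behavior. A minimal sketch, assuming job is a fully configured org.apache.hadoop.mapreduce.Job as built in main above:

import org.apache.hadoop.mapreduce.Job;

public class SubmitSketch {
  // Blocking submission: submits the job (if not yet submitted) and waits
  // for the result; true enables progress output to the console.
  static void runBlocking(Job job) throws Exception {
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }

  // Asynchronous submission: submit() returns immediately, so completion
  // must be polled by the caller.
  static void runAsync(Job job) throws Exception {
    job.submit();
    while (!job.isComplete()) { // poll until the job finishes
      Thread.sleep(5000);
    }
    System.exit(job.isSuccessful() ? 0 : 1);
  }
}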
References:
1. The code in this article is taken from the example bundled with Hadoop 2.7.0: org.apache.hadoop.examples.WordCount