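Steps for writing a MapReduce program:
- Express the problem in terms of the MapReduce model;
- Set the job's run parameters;
- Write the map class;
- Write the reduce class.
Example: counting words
The map task splits each line on spaces (" ") and pairs every word with the count 1; the reduce task adds up the counts of identical words.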
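1. Map side: the file is read line by line, and the program turns each line into intermediate key/value pairs. The map function is called once for every input key/value pair (here, once per line).
hello you hello me → hello 1, you 1, hello 1, me 1;
2. Reduce side: identical keys are always grouped together. The reduce method processes each group and produces the final key/value pairs.
hello 1, hello 1 → hello 2;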
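To make the two phases concrete before bringing in Hadoop, here is a minimal sketch in plain Java that imitates the model in memory. The class name WordCountModel and its methods map, shuffle, reduce and wordCount are made up for this illustration and are not Hadoop APIs; the grouping step stands in for what the framework does between map and reduce.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the MapReduce word count model (no Hadoop involved).
public class WordCountModel {

    // Map phase: called once per line, emits one (word, 1) pair per word.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle: groups all values that share a key; TreeMap keeps the keys sorted.
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: called once per key, sums the grouped counts.
    public static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs map over every line, then shuffle, then reduce over every group.
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hello=2, me=1, you=1}
        System.out.println(wordCount(Arrays.asList("hello you hello me")));
    }
}
```

In a real job the pairs are not collected into one list like this; the sketch only shows which data each phase sees.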
Write a class MapClass extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> and implement its map method.
Thinking about it in plain Java terms:
// word.txt contains two lines:
//   "hello you hello world"  → hello 1, you 1, hello 1, world 1
//   "hello me"               → hello 1, me 1
String str = "hello you hello world";
String[] strs = str.split(" "); // split on spaces
// strs[0] = "hello" → map(key, 1) → map(hello, 1)
Implemented with MapReduce:
package mapreduce;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
// Type parameters: Object, Text (Hadoop's counterpart of a Java String),
// Text, IntWritable (Hadoop's counterpart of a Java int).
// The map output is the reduce input.
public class MapClass extends Mapper<Object, Text, Text, IntWritable> {

    // Reused output key, roughly like String keyText = "key"
    public Text keyText = new Text("key");
    // Every word is emitted with the count 1
    public IntWritable intValue = new IntWritable(1);

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Get the content of the current line
        String str = value.toString();
        // StringTokenizer splits on whitespace by default
        StringTokenizer stringTokenizer = new StringTokenizer(str);
        while (stringTokenizer.hasMoreTokens()) {
            keyText.set(stringTokenizer.nextToken());
            // Emit (word, 1) through the context, e.g. context.write("My", 1)
            context.write(keyText, intValue);
        }
    }
}
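One detail worth seeing in isolation: the plain Java version splits with split(" "), while the mapper uses StringTokenizer, and the two do not behave the same on repeated spaces or tabs. The small sketch below uses only the JDK; the class name TokenizerDemo and the helper tokenize are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo comparing StringTokenizer with String.split(" ").
public class TokenizerDemo {

    // Collects the tokens the same way the mapper loop walks them.
    public static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        // With no delimiter argument, StringTokenizer splits on space, tab, newline, CR and form feed.
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "hello  you me"; // two spaces after "hello"
        // prints [hello, you, me]
        System.out.println(tokenize(line));
        // prints [hello, , you, me] because split(" ") keeps the empty string between the two spaces
        System.out.println(Arrays.asList(line.split(" ")));
    }
}
```

With split(" ") the empty string would be counted as a word unless it is filtered out, which is one reason the tokenizer is the more forgiving choice here.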
Write a class ReduceClass extends Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> and implement its reduce method:
package mapreduce;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
// The map output is the reduce input.
public class ReduceClass extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Reused output value holding the total for the current key
    public IntWritable intValue = new IntWritable(1);

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) // e.g. name [1, 1]
            throws IOException, InterruptedException {
        int sum = 0;
        // Walk the counts for this key once and add them up
        for (IntWritable value : values) {
            sum += value.get();
        }
        intValue.set(sum);
        // Emit (word, total) through the context
        context.write(key, intValue);
    }
}
Next, write a main to test it. Create a class WordCounter whose main is copied from the main of the example in the Hadoop source, shown here:
hadoop-1.1.2\src\examples\org\apache\hadoop\examples\WordCount.java
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
The WordCounter class:
package mapreduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCounter {
    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCounter.class); // required when packaging as a jar: the class to run, i.e. this class
        job.setMapperClass(MapClass.class); // the MapClass class
        // job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(ReduceClass.class); // the ReduceClass class
        job.setOutputKeyClass(Text.class); // type of the output key
        job.setOutputValueClass(IntWritable.class); // type of the output value
        FileInputFormat.addInputPath(job, new Path(otherArgs[0])); // input path argument
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); // output path argument
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
When this is done, export the project as a jar and run it on Hadoop:
Right-click the package name and choose Export > Java > JAR file to export it, then copy the jar to the Linux desktop;
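Upload the jar to Hadoop (remember to start the Hadoop Java processes first with start-all.sh):
[root@hadoop Desktop]# hadoop fs -put mapreduceTest.jar /
Create a file:
[root@hadoop Desktop]# vi wordTest.txt
Its content:
hello you hello me hello world
Upload the file to Hadoop:
[root@hadoop Desktop]# hadoop fs -put wordTest.txt /
Run the job. Argument 4 is the package name plus the class name (it has to match the package declared in WordCounter), argument 5 is the file uploaded in the previous step, and argument 6 is where the output goes:
[root@hadoop Desktop]# hadoop jar mapreduceTest.jar cn.mapreduce.WordCounter /wordTest.txt /outputTest
View the result (the file name /part-r-00000 is fixed):
[root@hadoop Desktop]# hadoop fs -text /outputTest/part-r-00000
Warning: $HADOOP_HOME is deprecated.
hello 3
me 1
world 1
you 1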
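Why part-r-00000, and why identical keys always meet in the same reduce call: as far as I understand it, each map output key is assigned to a reducer partition, and every reducer writes one file named after its partition number. The sketch below imitates the formula of Hadoop's default HashPartitioner in plain Java. It is an assumption-laden illustration: PartitionSketch, partition and partFileName are made-up names, and it hashes a Java String, whereas Hadoop hashes the Text key, so the actual partition numbers in a multi-reducer job would differ.

```java
import java.util.Locale;

// Hypothetical sketch of hash partitioning and reducer output file naming.
public class PartitionSketch {

    // Same formula as the default HashPartitioner, applied to String.hashCode() (assumption).
    public static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Reducer output files are named part-r- followed by the 5-digit partition number.
    public static String partFileName(int partition) {
        return String.format(Locale.ROOT, "part-r-%05d", partition);
    }

    public static void main(String[] args) {
        // With a single reducer every key lands in partition 0.
        int p = partition("hello", 1);
        // prints part-r-00000
        System.out.println(partFileName(p));
    }
}
```

Because the partition depends only on the key, every "hello" goes to the same reducer no matter which map task produced it. With the single reducer used in this walkthrough, everything ends up in part-r-00000.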
Done.