Project structure
Code
WordCount.java
FileInputFormat.setInputPaths(job, new Path("/input/input.txt"));
Instead of hard-coding the path, this step can take runtime parameters, i.e. the String[] args passed to main.
Change it to:
// Requires: import org.apache.hadoop.util.GenericOptionsParser;
String[] otherArgs = (new GenericOptionsParser(conf, args)).getRemainingArgs();
if (otherArgs.length < 2) {
    System.err.println("Usage: wordcount <in> [<in>...] <out>");
    System.exit(2);
}
// ... (middle part omitted) ...
// Every argument except the last one is an input path
for (int i = 0; i < otherArgs.length - 1; ++i) {
    FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
}
// The last argument is the output path
FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
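This way the long path strings no longer have to be typed into the source code over and over. Another benefit is that when the files in HDFS change, the code does not need to be modified; only the arguments passed at launch change.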
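The argument convention above (every argument except the last is an input path, the last one is the output path) can be tried out without a cluster. The following is only a minimal plain-Java sketch: the class `ArgSplitDemo`, its two helper methods, and the sample paths in `main` are hypothetical and are not part of Hadoop; they simply mirror the loop shown above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only: mirrors how the driver
// interprets its remaining arguments as <in> [<in>...] <out>.
public class ArgSplitDemo {

    // Rejects argument lists that cannot hold at least one input and one output.
    private static void check(String[] otherArgs) {
        if (otherArgs.length < 2) {
            throw new IllegalArgumentException("Usage: wordcount <in> [<in>...] <out>");
        }
    }

    // All arguments except the last one are treated as input paths.
    public static List<String> inputPaths(String[] otherArgs) {
        check(otherArgs);
        return Arrays.asList(Arrays.copyOfRange(otherArgs, 0, otherArgs.length - 1));
    }

    // The last argument is treated as the output path.
    public static String outputPath(String[] otherArgs) {
        check(otherArgs);
        return otherArgs[otherArgs.length - 1];
    }

    public static void main(String[] args) {
        // Made-up sample arguments.
        String[] sample = {"/input/a.txt", "/input/b.txt", "/output"};
        System.out.println("inputs = " + inputPaths(sample));
        System.out.println("output = " + outputPath(sample));
    }
}
```

With the real job, the same convention would apply to whatever follows the class name on the command line, for example `hadoop jar wordcount.jar com.jxufe.xzy.wordcount.WordCount /input/input.txt /output`, where the JAR name `wordcount.jar` is an assumption.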
package com.jxufe.xzy.wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class WordCount {
    /**
     * @param args
     * @throws IOException
     * @throws InterruptedException
     * @throws ClassNotFoundException
     */
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Configuration
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://Master:9000");
        conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
        Job job = Job.getInstance(conf);
        // Set the class used to locate the JAR of the whole program
        job.setJarByClass(WordCount.class);
        job.setMapperClass(MMapper.class);    // register the mapper class
        job.setReducerClass(RRducer.class);   // register the reducer class
        job.setCombinerClass(RRducer.class);  // reuse the reducer as the combiner
        job.setOutputKeyClass(Text.class);            // output key type
        job.setOutputValueClass(IntWritable.class);   // output value type
        // Set the input file and the output directory
        FileInputFormat.setInputPaths(job, new Path("/input/input.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
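MMapper.java
The Mapper's job is to process the <key,value> records of the input file and produce a series of <k1,v1>,<k2,v2>…<kn,vn> pairs. These pairs are passed on to the Reducer through the Context.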
package com.jxufe.xzy.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;

public class MMapper extends Mapper<Object, Text, Text, IntWritable> {
    public static final IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // Text can be thought of, roughly, as Hadoop's counterpart to Java's String

    public void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        // Convert value to a String and split it into individual words.
        // By default StringTokenizer splits on whitespace (space, tab, newline, etc.).
        /*
        while (st.hasMoreElements()) {
            System.out.println(st.nextToken());
        }
        StringTokenizer(String str, String delim, boolean returnDelims)
        The first argument is the string to tokenize; the second is the set of delimiter characters.
        If the returnDelims flag is true, the delimiter characters are also returned as tokens.
        */
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            this.word.set(itr.nextToken());
            context.write(this.word, one);
            // context is loosely comparable to a session in a web application: here it collects
            // the <word, 1> pairs emitted by map, which the framework later groups into
            // <k1,<v1,v2,....vn>> for the reducer. It also gives access to runtime
            // parameters such as the job configuration.
        }
    }
}
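The tokenizing step inside `map` can be explored on its own with plain `java.util.StringTokenizer`, without any Hadoop classes. The sketch below is only an illustration: the class `TokenizeDemo` and its methods are hypothetical names, and the sample line in `main` is made up. It shows the default whitespace splitting used by the mapper as well as the three-argument constructor described in the comment above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class, not part of the WordCount job.
public class TokenizeDemo {

    // Splits a line the same way the mapper does: default delimiters are whitespace.
    public static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(itr.nextToken());
        }
        return out;
    }

    // Splits on custom delimiter characters; when returnDelims is true,
    // each delimiter character is returned as a token of its own.
    public static List<String> tokens(String line, String delim, boolean returnDelims) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line, delim, returnDelims);
        while (itr.hasMoreTokens()) {
            out.add(itr.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Print what the mapper would emit for one made-up input line.
        for (String w : tokens("hello hadoop hello")) {
            System.out.println("<" + w + ", 1>");
        }
    }
}
```

Note that the mapper emits one `<word, 1>` pair per occurrence and does no counting itself; repeated words simply produce repeated pairs.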
RRducer.java
The Reducer's job is to fetch its own share of the data from the Mappers and merge this pile of <k1,v1>,<k2,v2>…<kn,vn> pairs.
package com.jxufe.xzy.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;

public class RRducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> value, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        // Add up all counts received for this word
        int sum = 0;
        for (IntWritable val : value) {
            sum += val.get();
        }
        // Emit <word, total count>
        context.write(key, new IntWritable(sum));
    }
}
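To see what the reducer receives and produces, the grouping done by the framework and the summing done in `reduce` can be imitated in plain Java. This is only a rough sketch under simplifying assumptions: the class `ReduceDemo` and its methods are hypothetical, everything runs in one process, and a `TreeMap` stands in for the shuffle step that Hadoop performs between Mapper and Reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process imitation of shuffle + reduce, for illustration only.
public class ReduceDemo {

    // Imitates the shuffle: every word emitted by the mapper carries the value 1,
    // and all values for the same word are collected into one list.
    public static Map<String, List<Integer>> group(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String w : words) {
            grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Same logic as the body of reduce: add up all values for one key.
    public static int sum(Iterable<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Runs group and sum together to get the final <word, count> pairs.
    public static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : group(words).entrySet()) {
            result.put(e.getKey(), sum(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up mapper output keys.
        System.out.println(count(Arrays.asList("hello", "hadoop", "hello")));
    }
}
```

Because addition is associative and commutative, summing partial results first and then summing those sums gives the same total, which is why the driver can register the same class as both combiner and reducer.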