1: A while ago I studied Hadoop, and job hunting is coming up. My study wasn't very deep, but I want to review a little and prepare by rereading the code and walking through the process.
Problem: count how often each word occurs, subject to two constraints: a stop-word list (words in it are not counted) and a minimum-frequency parameter.
2: Brief walkthrough
1) Write the map class
The thing to note here is the stop-word list: words in it must not be counted. Also note that the way skip.txt is read depends on its format.
public static class skpeMapper extends Mapper&lt;LongWritable, Text, Text, IntWritable&gt; {

    // As a class field this can be private; the modifier is simply not allowed
    // on local variables, which is why it could not be added inside map()
    private final static IntWritable one = new IntWritable(1);
    private Set&lt;String&gt; skipword = new HashSet&lt;String&gt;();

    // Load the stop-word list once per task in setup(), instead of re-reading
    // skip.txt on every single map() call
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        InputStream fstream = Thread.currentThread().getContextClassLoader().getResourceAsStream("skip.txt");
        BufferedReader in = new BufferedReader(new InputStreamReader(fstream, "UTF-8"));
        String temp;
        while ((temp = in.readLine()) != null) {
            skipword.add(temp.trim());
        }
        in.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        line = line.replaceAll("[^\\w]", " "); // replace every non-word character (not letter, digit, underscore) with a space
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            if (!skipword.contains(word))
                context.write(new Text(word), one);
        }
    }
}
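The core of the map logic is just "normalize, tokenize, filter against the skip set", so it can be exercised without Hadoop at all. A minimal plain-Java sketch of that logic (the class name and sample data here are made up for illustration):

```java
import java.util.*;

public class SkipFilterDemo {
    // same normalization + tokenization + skip-word filtering as in the mapper
    static List<String> filterWords(String line, Set<String> skipword) {
        List<String> kept = new ArrayList<String>();
        line = line.replaceAll("[^\\w]", " "); // non-word characters become spaces
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            if (!skipword.contains(word))
                kept.add(word);
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> skip = new HashSet<String>(Arrays.asList("the", "a"));
        System.out.println(filterWords("the quick, a brown fox!", skip)); // prints [quick, brown, fox]
    }
}
```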
2) Write the reducer
What needs attention here is reading a global parameter k, the minimum frequency value.
// it feels like a separate combiner would need to be written
public static class skpeReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; {

    private int frequency;

    @Override
    protected void reduce(Text key, Iterable&lt;IntWritable&gt; value, Context context)
            throws IOException, InterruptedException {
        // its job is a simple merge: sum the counts for this word
        int sum = 0;
        for (IntWritable in : value) {
            sum += in.get();
        }
        // only emit words that reach the minimum frequency
        if (sum >= frequency)
            context.write(key, new IntWritable(sum));
    }

    // read the global parameter "frequency" from the job configuration
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        frequency = conf.getInt("frequency", -1);
    }
}
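The combiner remark above is the key point: this reducer cannot double as a combiner, because a combiner only sees the partial counts from one map task, and applying the frequency threshold to partial sums drops words whose total would have passed. A plain-Java sketch (hypothetical class and data, no Hadoop) showing the difference:

```java
import java.util.*;

public class CombinerPitfallDemo {
    // apply the frequency threshold to a map of counts, like the reducer does
    static Map<String, Integer> applyThreshold(Map<String, Integer> counts, int frequency) {
        Map<String, Integer> out = new HashMap<String, Integer>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() >= frequency)
                out.put(e.getKey(), e.getValue());
        return out;
    }

    public static void main(String[] args) {
        // "hadoop" appears 2 times on each of two map tasks; the threshold is 3
        Map<String, Integer> part1 = Collections.singletonMap("hadoop", 2);
        Map<String, Integer> part2 = Collections.singletonMap("hadoop", 2);

        // correct: combine with a plain sum, threshold only in the final reduce
        int total = part1.get("hadoop") + part2.get("hadoop");
        System.out.println("correct total: " + total); // prints correct total: 4

        // wrong: thresholding inside a combiner drops both partial counts of 2,
        // so the word vanishes even though its true total is 4
        Map<String, Integer> bad1 = applyThreshold(part1, 3);
        Map<String, Integer> bad2 = applyThreshold(part2, 3);
        System.out.println("kept after bad combiner: " + (bad1.size() + bad2.size())); // prints kept after bad combiner: 0
    }
}
```

So a correct combiner for this job would only sum the counts, without checking frequency; the threshold belongs exclusively in the final reduce.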
3) Write the main function
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
    // 1: the minimum frequency has to be set as a global parameter on the Configuration
    Configuration conf = new Configuration();
    conf.setInt("frequency", Integer.parseInt(args[0]));
    // configure the job; Job here is org.apache.hadoop.mapreduce.Job, not the old
    // org.apache.hadoop.mapred one (newer Hadoop versions prefer Job.getInstance(conf, name))
    Job job = new Job(conf, "skpewordcount");
    job.setJarByClass(skpeWordCount.class);
    job.setMapperClass(skpeMapper.class);
    job.setReducerClass(skpeReducer.class);
    // job.setCombinerClass(skpeReducer.class); must NOT be set: the reducer applies the
    // frequency threshold, which would wrongly drop partial counts at the combine stage
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[2])); // note: args[1] is never read
    FileOutputFormat.setOutputPath(job, new Path(args[3]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
3:
There isn't much more to write; the point is the overall flow. Pay attention to how the global parameter is read, and how the stop-word file is read.
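The skip.txt reading pattern (one word per line into a HashSet) can also be tried standalone. This sketch feeds the same BufferedReader loop from an in-memory StringReader instead of a classpath resource, so it is self-contained; the class name and sample words are made up:

```java
import java.io.*;
import java.util.*;

public class SkipFileDemo {
    // read one word per line into a set, the same pattern the mapper uses for skip.txt
    static Set<String> loadSkipWords(Reader source) throws IOException {
        Set<String> skipword = new HashSet<String>();
        BufferedReader in = new BufferedReader(source);
        String temp;
        while ((temp = in.readLine()) != null) {
            skipword.add(temp.trim());
        }
        in.close();
        return skipword;
    }

    public static void main(String[] args) throws IOException {
        // stand-in for the contents of skip.txt: one stop word per line
        Set<String> skip = loadSkipWords(new StringReader("the\na\nan\n"));
        System.out.println(skip.contains("the")); // prints true
        System.out.println(skip.contains("fox")); // prints false
    }
}
```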