Hadoop Study Notes: Word Frequency Count over Shakespeare's Works

linluyisb

Article link: https://blog.csdn.net/buring_/article/details/10149157

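I: I studied Hadoop a while ago, and I will be job hunting soon. I did not go very deep, but it is worth a quick review as preparation: reading through the code and the overall process once more.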

Problem: count how often each word occurs, subject to a stop-word list and a minimum-frequency parameter.
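To pin down what the job is expected to produce, here is a minimal in-memory sketch of the same task in plain Java, with no Hadoop involved. The class name `WordFreqSketch` and the method `count` are made up for illustration. Like the mapper in this post, it is case-sensitive and treats every non-word character as a separator.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory reference for the task; not part of the Hadoop job.
class WordFreqSketch {

    // Counts words in the given lines, ignoring stop words, and keeps
    // only the words that occur at least minFrequency times.
    static Map<String, Integer> count(List<String> lines, Set<String> stopWords, int minFrequency) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : lines) {
            // Same cleanup as the mapper: non-word characters become spaces.
            StringTokenizer tokenizer = new StringTokenizer(line.replaceAll("[^\\w]", " "));
            while (tokenizer.hasMoreTokens()) {
                String word = tokenizer.nextToken();
                if (!stopWords.contains(word)) {
                    Integer old = counts.get(word);
                    counts.put(word, old == null ? 1 : old + 1);
                }
            }
        }
        // Apply the minimum-frequency filter only after all counts are complete.
        Map<String, Integer> result = new TreeMap<String, Integer>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minFrequency) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be, or not to be", "that is the question");
        Set<String> stop = new HashSet<String>(Arrays.asList("the", "is"));
        System.out.println(count(lines, stop, 2)); // prints {be=2, to=2}
    }
}
```

The MapReduce version splits this into two halves: the tokenizing and stop-word check go into the mapper, and the summing and threshold check go into the reducer.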

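II: Brief walkthrough

1) Write the map class

Note that there is a stop-word list, and words in it must not be counted. Also note that skip.txt can come in different formats, and the way it is read has to change accordingly. The code below loads skip.txt from the classpath and expects one word per line.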

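// Imports needed at the top of skpeWordCount.java for this mapper
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Nested inside the skpeWordCount driver class
public static class skpeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

		// As fields these can be private; a local variable inside map() cannot take an access modifier
		private static final IntWritable ONE = new IntWritable(1);
		private final Text outKey = new Text();
		private final Set<String> skipword = new HashSet<String>();

		// Load the stop-word list once per task instead of once per input line
		@Override
		protected void setup(Context context) throws IOException, InterruptedException {
			InputStream fstream = Thread.currentThread().getContextClassLoader().getResourceAsStream("skip.txt");
			if (fstream == null) {
				throw new IOException("skip.txt not found on the classpath");
			}
			BufferedReader in = new BufferedReader(new InputStreamReader(fstream, "UTF-8"));
			try {
				String temp;
				while ((temp = in.readLine()) != null) {
					skipword.add(temp.trim()); // one stop word per line
				}
			} finally {
				in.close();
			}
		}

		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			// Replace every character that is not a letter, digit or underscore with a space
			String line = value.toString().replaceAll("[^\\w]", " ");
			StringTokenizer tokenizer = new StringTokenizer(line);
			while (tokenizer.hasMoreTokens()) {
				String word = tokenizer.nextToken();
				if (!skipword.contains(word)) {
					outKey.set(word);
					context.write(outKey, ONE);
				}
			}
		}
	}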
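On the point that the format of skip.txt determines how it has to be read: the sketch below is one possible loader that accepts either one word per line or several words per line separated by whitespace or commas. It is an assumption about what "different formats" might look like, not the format actually used in this exercise, and the class name `SkipListSketch` is made up. It takes a `Reader`, so it can be tried without a Hadoop cluster.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stop-word loader; the accepted formats are an assumption.
class SkipListSketch {

    // Reads stop words from any Reader. Words may be one per line, or
    // several per line separated by whitespace and/or commas.
    static Set<String> load(Reader source) throws IOException {
        Set<String> words = new TreeSet<String>();
        BufferedReader in = new BufferedReader(source);
        try {
            String line;
            while ((line = in.readLine()) != null) {
                for (String token : line.split("[\\s,]+")) {
                    // split() yields an empty token for blank lines and leading separators
                    if (!token.isEmpty()) {
                        words.add(token);
                    }
                }
            }
        } finally {
            in.close();
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Mixed formats: one per line, comma-separated, blank line, padded with spaces
        System.out.println(load(new StringReader("the\nand, of\n\n  a  \n")));
        // prints [a, and, of, the]
    }
}
```

Inside a mapper, the same method could be fed an `InputStreamReader` wrapped around the classpath stream, so only the source of the characters changes.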

2) Write the reducer

The point to note here is that the reducer has to read a global parameter k, the minimum frequency.

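// Imports needed: java.io.IOException, org.apache.hadoop.conf.Configuration,
// org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Reducer
// A separate combiner would have to be written for this job; this reducer filters, so it cannot be reused as one
	public static class skpeReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

		private int frequency; // minimum count a word needs in order to be written out
		private final IntWritable result = new IntWritable();

		// Read the global parameter "frequency" once per task; the default of -1 keeps every word
		@Override
		protected void setup(Context context) throws IOException, InterruptedException {
			Configuration conf = context.getConfiguration();
			frequency = conf.getInt("frequency", -1);
		}

		@Override
		protected void reduce(Text key, Iterable<IntWritable> values, Context context)
				throws IOException, InterruptedException {
			// The reducer's job is simply to add up the counts emitted for this word
			int sum = 0;
			for (IntWritable in : values) {
				sum += in.get();
			}
			// Drop words that occur fewer than "frequency" times
			if (sum >= frequency) {
				result.set(sum);
				context.write(key, result);
			}
		}
	}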

3) Write the main function

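// Imports needed: java.io.IOException, org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.Path,
// org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Job,
// org.apache.hadoop.mapreduce.lib.input.FileInputFormat, org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

		// args[0] = minimum frequency, args[2] = input path, args[3] = output path (args[1] is not read by this code)
		if (args.length < 4) {
			System.err.println("Usage: skpeWordCount <frequency> <unused> <input> <output>");
			System.exit(2);
		}

		// 1: The threshold is a job-wide setting, so store it in the Configuration before creating the Job
		Configuration conf = new Configuration();
		conf.setInt("frequency", Integer.parseInt(args[0]));

		// Configure the job. Job comes from org.apache.hadoop.mapreduce (the new API), the same package as Mapper and Reducer
		Job job = Job.getInstance(conf, "skpewordcount"); // new Job(conf, name) is deprecated in Hadoop 2 and later
		job.setJarByClass(skpeWordCount.class);
		job.setMapperClass(skpeMapper.class);
		job.setReducerClass(skpeReducer.class);
		// Do not call job.setCombinerClass(skpeReducer.class): the reducer drops sums below the threshold,
		// so as a combiner it would discard partial counts and the final totals would be wrong

		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		FileInputFormat.addInputPath(job, new Path(args[2]));
		FileOutputFormat.setOutputPath(job, new Path(args[3]));

		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}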
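The reason the reducer cannot double as the combiner is easier to see with numbers. The sketch below is a plain-Java simulation, not Hadoop code, and every name in it is made up for illustration. It assumes a word that occurs twice in each of two input splits and a minimum frequency of 3, then compares a sum-only combiner with a combiner that filters the way the reducer does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of combiner behaviour for a single word.
class CombinerSketch {
    static final int DROPPED = -1; // marks "this word was never written out"

    // Sum-only logic: safe to run on partial data.
    static int sum(List<Integer> counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    // Models the thresholding reducer: the sum, or DROPPED if below minFrequency.
    static int sumWithThreshold(List<Integer> counts, int minFrequency) {
        int total = sum(counts);
        return total >= minFrequency ? total : DROPPED;
    }

    // Each split is pre-summed by a sum-only combiner, then reduced with the threshold.
    static int withSumCombiner(List<Integer> splitA, List<Integer> splitB, int minFrequency) {
        List<Integer> partials = Arrays.asList(sum(splitA), sum(splitB));
        return sumWithThreshold(partials, minFrequency);
    }

    // Each split is combined by the thresholding logic itself, then reduced.
    static int withFilteringCombiner(List<Integer> splitA, List<Integer> splitB, int minFrequency) {
        List<Integer> partials = new ArrayList<Integer>();
        for (List<Integer> split : Arrays.asList(splitA, splitB)) {
            int partial = sumWithThreshold(split, minFrequency);
            if (partial != DROPPED) {
                partials.add(partial);
            }
        }
        if (partials.isEmpty()) {
            return DROPPED; // the reducer never sees this word at all
        }
        return sumWithThreshold(partials, minFrequency);
    }

    public static void main(String[] args) {
        List<Integer> splitA = Arrays.asList(1, 1); // word seen twice in split A
        List<Integer> splitB = Arrays.asList(1, 1); // and twice in split B
        System.out.println(withSumCombiner(splitA, splitB, 3));       // prints 4
        System.out.println(withFilteringCombiner(splitA, splitB, 3)); // prints -1
    }
}
```

The word occurs 4 times in total and should pass a threshold of 3, but the filtering combiner discards both partial counts of 2 before the reducer can add them. A combiner for this job would therefore only sum, leaving the threshold check to the reducer.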

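III:

There is not much more to write; just follow the flow. The points to note are how to read a global parameter (set it on the Configuration in main, read it back in the reducer's setup) and how to read a file (skip.txt, loaded from the classpath).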
