[MapReduce] Removing Duplicate Lines

yongh701

Category: Hadoop. Tags: hadoop, Mapreduce, wordcount, merge and deduplicate

Article link: https://blog.csdn.net/yongh701/article/details/50596452

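This post builds on the MapReduce processing flow described in "[Mapreduce] WordCount word-frequency counting with a comma as the delimiter". Between the map phase and the reduce phase, MapReduce groups all records that share the same key, so removing duplicate lines with MapReduce is easy.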
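To make the grouping step concrete, here is a minimal plain-Java sketch that imitates it in memory, without Hadoop. The class name `ShuffleSketch`, its method names, and the sample lines are assumptions made up for this illustration. A `TreeMap` stands in for the framework's sort-and-group step, and the placeholder value `0` plays the role of the empty value the mapper emits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Plain-Java sketch (no Hadoop dependency) of why grouping by key removes duplicates.
class ShuffleSketch {

    // "Map" phase plus grouping: every line is emitted as a key with a placeholder
    // value 0, and records with equal keys are collected into one group.
    static SortedMap<String, List<Integer>> groupByKey(List<String> lines) {
        SortedMap<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines) {
            groups.computeIfAbsent(line, k -> new ArrayList<>()).add(0);
        }
        return groups;
    }

    // "Reduce" phase: runs once per distinct key, emits the key and ignores the values.
    static List<String> reduce(SortedMap<String, List<Integer>> groups) {
        return new ArrayList<>(groups.keySet());
    }

    public static void main(String[] args) {
        // Hypothetical sample input containing repeated lines.
        List<String> input = Arrays.asList("b,1", "a,2", "b,1", "a,2", "c,3");
        SortedMap<String, List<Integer>> groups = groupByKey(input);
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " value(s)");
        }
        System.out.println(reduce(groups)); // prints [a,2, b,1, c,3]
    }
}
```

Each distinct line ends up as exactly one key, no matter how many times it appeared in the input, which is the property the deduplication job relies on.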

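The mapper does no real processing. It writes each input line to the context unchanged, using the line as the key; this is simply the value that map() receives.

The reducer does no processing either. It writes only the key it receives to the output file.

As a result, the code is even simpler than WordCount: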

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class MyMapReduce {

	public static class MyMapper extends
			Mapper<Object, Text, Text, IntWritable> {
		public void map(Object key, Text value, Context context)
				throws IOException, InterruptedException {
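			// The value must not be null here: writing null from the mapper throws a
			// NullPointerException, so an empty IntWritable serves as a placeholder.
			context.write(value, new IntWritable());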
		}
	}

	public static class MyReducer extends
			Reducer<Text, IntWritable, Text, IntWritable> {
		public void reduce(Text key, Iterable<IntWritable> values,
				Context context) throws IOException, InterruptedException {
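			// Here the value may be null: with the default TextOutputFormat nothing is
			// written for the value, so only the key appears in the output file.
			context.write(key, null);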
		}
	}

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();

		String[] otherArgs = new GenericOptionsParser(conf, args)
				.getRemainingArgs();
		if (otherArgs.length != 2) {
			System.err.println("Usage: MyMapReduce <in> <out>");
			System.exit(2);
		}
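		// Job.getInstance replaces the deprecated Job constructor (Hadoop 2.x and later).
		Job job = Job.getInstance(conf, "remove duplicate lines");
		// Tells Hadoop which jar to ship to the cluster.
		job.setJarByClass(MyMapReduce.class);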
		job.setMapperClass(MyMapper.class);
		job.setReducerClass(MyReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
		FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}

}
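For a small test run, one way to confirm the result is to compare the job's output against the input locally. The sketch below is not part of the job. The class name `DedupCheck` and its method are hypothetical, and it assumes that the job ran with a single reducer and that both files were copied to the local file system first. Lines are compared as sets, so the order in which Hadoop sorted the keys does not matter.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Local sanity check for a small test run; not part of the MapReduce job itself.
class DedupCheck {

    // True when `output` has no repeated line and holds exactly the distinct lines of `input`.
    static boolean isDeduplicated(List<String> input, List<String> output) {
        Set<String> distinctOutput = new HashSet<>(output);
        boolean noRepeats = distinctOutput.size() == output.size();
        return noRepeats && distinctOutput.equals(new HashSet<>(input));
    }

    // Usage: java DedupCheck <local copy of the input file> <local copy of the reducer output file>
    public static void main(String[] args) throws IOException {
        List<String> input = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        List<String> output = Files.readAllLines(Paths.get(args[1]), StandardCharsets.UTF_8);
        System.out.println(isDeduplicated(input, output) ? "OK" : "MISMATCH");
    }
}
```

This only suits inputs small enough to fit in memory; it is meant for checking a sample, not for validating a full-size data set.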

The input file is as follows: [figure: input file]

The output file is as follows: [figure: output file]
