Using MapReduce to deduplicate and sort the contents of file1.txt and file2.txt and output the result

小A__

Categories: MapReduce, Hadoop

Article link: https://blog.csdn.net/xiaozelulu/article/details/81072825


Task: use MapReduce to deduplicate and sort the contents of file1.txt and file2.txt, then output the result.

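1. Mapper stage:
The mapper does no sorting itself. It emits each input line (v1) as the output key k2 with an empty value, so that the framework's shuffle can sort and group the <k2,v2> pairs by key.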

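import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DistinctMapper extends Mapper<LongWritable, Text, Text, Text> {
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		// Emit the line as k2 with an empty v2; a bare null cannot be written as map output.
		context.write(value, new Text());
	}
}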
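To make the map step concrete without a Hadoop cluster, here is a minimal plain-Java sketch of what the mapper emits. The class `MapPhaseSketch` and its methods are hypothetical stand-ins written for illustration, not part of Hadoop; they only imitate the shape of the <k2,v2> pairs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class MapPhaseSketch {
    // Imitates DistinctMapper.map(): the whole line becomes the key,
    // and the value is an empty placeholder.
    public static Map.Entry<String, String> mapLine(String line) {
        return new SimpleEntry<>(line, "");
    }

    // Applies mapLine to every input line in input order.
    // Nothing is sorted or dropped at this point.
    public static List<Map.Entry<String, String>> mapAll(List<String> lines) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String line : lines) {
            out.add(mapLine(line));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("2012-3-3 c", "2012-3-1 a", "2012-3-3 c");
        for (Map.Entry<String, String> kv : mapAll(lines)) {
            System.out.println("<" + kv.getKey() + ", \"" + kv.getValue() + "\">");
        }
    }
}
```

The duplicate line is still present after this step, and the pairs are still in input order. Sorting and grouping happen later, between map and reduce.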

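2. Reducer stage: by now the <k2,v2> pairs have been sorted and grouped by key, so reduce() runs once per distinct line and only needs to write the key.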

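import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DistinctReducer extends Reducer<Text, Text, Text, Text> {
	@Override
	protected void reduce(Text key, Iterable<Text> values, Context context)
			throws IOException, InterruptedException {
		// Each distinct k3 arrives here exactly once, so writing it once removes duplicates.
		// v3 may be null: the default text output then writes the key alone.
		context.write(key, null);
	}
}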
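The grouping that makes this reducer work can be imitated in plain Java. The sketch below is an assumption-laden stand-in, not Hadoop code: a `TreeMap` plays the role of the shuffle (sorted keys, values grouped per key), and the hypothetical `reduce` method writes each key once, as DistinctReducer does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleReduceSketch {
    // Stand-in for the shuffle: group the empty values by key,
    // with the keys kept in sorted order.
    public static TreeMap<String, List<String>> shuffle(List<String> mapOutputKeys) {
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (String key : mapOutputKeys) {
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add("");
        }
        return grouped;
    }

    // Imitates DistinctReducer.reduce(): one output line per key,
    // regardless of how many values were grouped under it.
    public static List<String> reduce(TreeMap<String, List<String>> grouped) {
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> group : grouped.entrySet()) {
            output.add(group.getKey());
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("2012-3-3 c", "2012-3-1 a", "2012-3-3 c");
        System.out.println(reduce(shuffle(keys))); // prints [2012-3-1 a, 2012-3-3 c]
    }
}
```

The duplicate key ends up as a single group holding two values, which is why the reducer can ignore `values` entirely.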

3. Driver stage: the main class

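import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DistinctDriver {
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		Configuration conf = new Configuration();

		// Delete the output directory if it already exists; otherwise the job fails on startup.
		Path outfile = new Path("file:///D:/outToDate");
		FileSystem fs = outfile.getFileSystem(conf);
		if (fs.exists(outfile)) {
			fs.delete(outfile, true);
		}

		Job job = Job.getInstance(conf);
		job.setJarByClass(DistinctDriver.class);
		job.setJobName("mysort");
		job.setMapperClass(DistinctMapper.class);   // emits each line as a key
		job.setReducerClass(DistinctReducer.class); // writes each distinct key once

		// Both the map output and the final output use Text keys and Text values.
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);

		// Input directory (presumably the one holding file1.txt and file2.txt).
		FileInputFormat.addInputPath(job, new Path("file:///D:/quchong"));
		FileOutputFormat.setOutputPath(job, outfile);

		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}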
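One caveat about the "sorted" part: the job orders keys by comparing them as text, not as dates. For the single-digit days in this example that makes no difference, but a line such as "2012-3-10 a" would sort before "2012-3-2 b". The sketch below uses plain `String` comparison, which gives the same order as Hadoop's `Text` for ASCII input, and a hypothetical helper `padDate` that zero-pads month and day. The helper and the "yyyy-M-d rest" line format are assumptions for illustration; the original job does not do this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class SortOrderSketch {
    // Hypothetical helper: rewrites "2012-3-2 b" as "2012-03-02 b" so that
    // text order matches date order. Assumes "yyyy-M-d <rest>" input.
    public static String padDate(String line) {
        String[] dateAndRest = line.trim().split("\\s+", 2);
        String[] ymd = dateAndRest[0].split("-");
        String date = String.format(Locale.ROOT, "%s-%02d-%02d",
                ymd[0], Integer.parseInt(ymd[1]), Integer.parseInt(ymd[2]));
        return dateAndRest.length > 1 ? date + " " + dateAndRest[1] : date;
    }

    public static void main(String[] args) {
        List<String> raw = new ArrayList<>(Arrays.asList("2012-3-2 b", "2012-3-10 a"));
        Collections.sort(raw); // text order: "2012-3-10 a" comes first
        System.out.println(raw);

        List<String> padded = new ArrayList<>();
        for (String line : raw) {
            padded.add(padDate(line));
        }
        Collections.sort(padded); // date order: "2012-03-02 b" comes first
        System.out.println(padded);
    }
}
```

If date order matters, a normalization like this could be applied in the mapper before the line is written as the key.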

4. Results

[file1.txt]
    2012-3-1 a
    2012-3-2 b
    2012-3-3 c
    2012-3-4 d
    2012-3-5 a
    2012-3-6 b
    2012-3-7 c
    2012-3-3 c
[file2.txt]
    2012-3-1 b
    2012-3-2 a
    2012-3-3 b
    2012-3-4 d
    2012-3-5 a
    2012-3-6 c
    2012-3-7 d
    2012-3-3 c
part-r-00000 -- the program's output (deduplicated and sorted)
    2012-3-1 a
    2012-3-1 b
    2012-3-2 a
    2012-3-2 b
    2012-3-3 b
    2012-3-3 c
    2012-3-4 d
    2012-3-5 a
    2012-3-6 b
    2012-3-6 c
    2012-3-7 c
    2012-3-7 d
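
As a quick cross-check of the listing above, the same result can be computed locally with a sorted set. This is only a sanity check in plain Java under the assumption that lines carry no stray leading whitespace; the class name `ExpectedOutputCheck` is made up for this sketch and nothing here runs on Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class ExpectedOutputCheck {
    // Sorted, de-duplicated union of both inputs: what the job is expected to write.
    public static List<String> distinctSorted(List<String> file1, List<String> file2) {
        TreeSet<String> merged = new TreeSet<>(file1);
        merged.addAll(file2);
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList(
                "2012-3-1 a", "2012-3-2 b", "2012-3-3 c", "2012-3-4 d",
                "2012-3-5 a", "2012-3-6 b", "2012-3-7 c", "2012-3-3 c");
        List<String> file2 = Arrays.asList(
                "2012-3-1 b", "2012-3-2 a", "2012-3-3 b", "2012-3-4 d",
                "2012-3-5 a", "2012-3-6 c", "2012-3-7 d", "2012-3-3 c");
        // Prints the 12 distinct lines, one per line, in sorted order.
        for (String line : distinctSorted(file1, file2)) {
            System.out.println(line);
        }
    }
}
```

The 16 input lines collapse to 12 because "2012-3-3 c" occurs three times and "2012-3-4 d" and "2012-3-5 a" occur twice each.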