Reposted from: http://blog.csdn.net/kwu_ganymede/article/details/50474763
Classic Hadoop Cases Implemented in Spark (Part 2): Data Deduplication
1. Input Data
1) file1:
- 2012-3-1 a
- 2012-3-2 b
- 2012-3-3 c
- 2012-3-4 d
- 2012-3-5 a
- 2012-3-6 b
- 2012-3-7 c
- 2012-3-3 c
2) file2:
- 2012-3-1 b
- 2012-3-2 a
- 2012-3-3 b
- 2012-3-4 d
- 2012-3-5 a
- 2012-3-6 c
- 2012-3-7 d
- 2012-3-3 c
Expected output:
- 2012-3-1 a
- 2012-3-1 b
- 2012-3-2 a
- 2012-3-2 b
- 2012-3-3 b
- 2012-3-3 c
- 2012-3-4 d
- 2012-3-5 a
- 2012-3-6 b
- 2012-3-6 c
- 2012-3-7 c
- 2012-3-7 d
3) Explanation
The goal of deduplication is that any record appearing more than once in the input shows up exactly once in the output. The natural idea is to route all occurrences of the same record to the same reducer: no matter how many times a record appears, it only needs to be emitted once in the final result. Concretely, the reduce input should use the record itself as the key, with no requirement on the value list. When the reducer receives a <key, value-list> pair, it simply copies the key to the output key and sets the value to an empty string.
2. MapReduce Implementation
Code:
```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Dedup {

    // Mapper: emit each input line as the key with an empty value,
    // so identical lines are grouped together during the shuffle.
    public static class Map extends Mapper<Object, Text, Text, Text> {

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, new Text(""));
        }
    }

    // Reducer: each distinct key arrives exactly once, so writing the key
    // and ignoring the value list yields the deduplicated output.
    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapred.job.tracker", "192.168.1.2:9001");

        String[] ioArgs = new String[]{"dedup_in", "dedup_out"};
        String[] otherArgs = new GenericOptionsParser(conf, ioArgs).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: Data Deduplication <in> <out>");
            System.exit(2);
        }

        Job job = new Job(conf, "Data Deduplication");
        job.setJarByClass(Dedup.class);

        job.setMapperClass(Map.class);
        // The reducer is idempotent, so it also serves as a combiner to
        // drop duplicates on the map side before the shuffle.
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
3. Spark Implementation (Scala)
```scala
val two = sc.textFile("/tmp/spark/two")
two.filter(_.trim().length > 0).map(line => (line.trim, "")).groupByKey().sortByKey().keys.collect.foreach(println _)
```
The snippet above deduplicates with groupByKey and orders the result with sortByKey. Since Hadoop's reduce output is also sorted by key, the two results can be verified to be identical, and the Spark version is considerably more concise.
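For reference, the same job can also be expressed with the RDD's built-in distinct() operation. The sketch below is a minimal, illustrative example assuming a spark-shell session (so `sc` is already available) and the same input path as above; the output path `/tmp/spark/two_dedup` is made up for illustration.

```scala
// Minimal sketch: deduplicate with distinct() instead of groupByKey().
// Assumes a spark-shell session where `sc` is already defined; the output
// path below is only an example.
val lines = sc.textFile("/tmp/spark/two")

val deduped = lines
  .map(_.trim)        // normalize whitespace
  .filter(_.nonEmpty) // drop blank lines
  .distinct()         // keep each record once
  .sortBy(identity)   // match the sorted Hadoop output

deduped.collect.foreach(println)                 // print to the driver
// deduped.saveAsTextFile("/tmp/spark/two_dedup") // or write back to HDFS
```

Under the hood, distinct() reduces by key rather than grouping, so duplicates are already dropped on the map side, similar in spirit to the combiner in the MapReduce version.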