How to solve a deduplication problem in MapReduce

iteye_19679

Category: Hadoop. Tags: big data, java, games

Article link: https://blog.csdn.net/iteye_19679/article/details/82583417


How to solve a deduplication problem in MapReduce? [Question points: 40]


wzl189 (original poster), posted 2014-06-14 19:05:47

john 89
tom 100
mary 100
mary 200
tom 20
———–
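I have just started learning MapReduce and am practicing with the data above. I have been trying for a long time and cannot get it right: I want to deduplicate the first column, and the distinct count should be 3.
If the MapReduce job succeeds, the content of part-00000 should be:
3
How should this MapReduce job be written?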

Replies: 13

#1 tntzbzc (撸大湿), score: 0, replied 2014-06-15 10:03:12

Have the map emit the first column as the key; the value does not matter. In the reducer class, initialize a counter and increment it by one on every call to the reduce method. Then write the counter out in the reducer's cleanup method.
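To illustrate the suggestion in #1 without a cluster, here is a minimal plain-Java sketch. It does not use Hadoop: a TreeMap stands in for the shuffle that groups identical keys, and the class and method names (DistinctCountSketch, mapKey, countDistinct) are hypothetical, chosen only for this example. It also assumes the columns are separated by a single space, as in the sample data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Plain-Java sketch (no Hadoop) of the "count reduce() calls" idea.
// All names here are hypothetical and exist only for illustration.
class DistinctCountSketch {

    // Map step: keep only the first column as the key; the value is ignored.
    static String mapKey(String line) {
        return line.split(" ")[0];
    }

    // The TreeMap stands in for the shuffle: it groups identical keys so that
    // each distinct key is visited exactly once in the second loop.
    static long countDistinct(List<String> lines) {
        TreeMap<String, Integer> grouped = new TreeMap<>();
        for (String line : lines) {
            grouped.merge(mapKey(line), 1, Integer::sum);
        }
        long counter = 0;                    // plays the role of the reducer's field
        for (String key : grouped.keySet()) {
            counter++;                       // one reduce() call per distinct key
        }
        return counter;                      // what cleanup() would write out
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "john 89", "tom 100", "mary 100", "mary 200", "tom 20");
        System.out.println(countDistinct(input)); // prints 3
    }
}
```

The point of the design is that the reducer only needs one long field, however many rows share a key, because the framework has already grouped the keys before reduce is called.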

#2 wzl189, score: 0, replied 2014-06-15 21:40:54

I understand how to write the map now, but how exactly should the reduce be written?

#3 wulinshishen (ONEPIECE_2013), score: 0, replied 2014-06-18 10:37:48

Just use a single Map: define a global HashSet in the Mapper, add each key to it in the map method, and write the result out in the cleanup method.

#4 tjytad1982, score: 0, replied 2014-06-18 15:18:45

Following this thread to learn.

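#5 tntzbzc (撸大湿), score: 0, replied 2014-06-18 17:35:41

Quoting reply #3 by wulinshishen: "Just use a single Map: define a global HashSet in the Mapper, add each key to it in the map method, and write the result out in the cleanup method."

This problem cannot be solved with map alone. What happens if the same key shows up in different map tasks?
You have to use reduce, which receives the keys after they have been grouped, to get the deduplication effect.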

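#6 wulinshishen (ONEPIECE_2013), score: 0, replied 2014-06-19 09:43:31

Quoting reply #5 by tntzbzc:

Quoting reply #3 by wulinshishen: "Just use a single Map: define a global HashSet in the Mapper, add each key to it in the map method, and write the result out in the cleanup method."

This problem cannot be solved with map alone.
What happens if the same key shows up in different map tasks?
You have to use reduce, which receives the keys after they have been grouped, to get the deduplication effect.

Yes, you are right. I had not thought it through that carefully. Thanks for the correction.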
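The objection raised in #5 can be seen on the sample data with a small plain-Java sketch. It does not use Hadoop, the names (MapOnlyPitfallSketch and its methods) are hypothetical, and the way the five lines are divided into two input splits is an assumption made purely for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java sketch (no Hadoop); class and method names are hypothetical.
class MapOnlyPitfallSketch {

    // What one map task with its own private HashSet would report in cleanup().
    static int distinctInSplit(List<String> split) {
        Set<String> seen = new HashSet<>();
        for (String line : split) {
            seen.add(line.split(" ")[0]);
        }
        return seen.size();
    }

    // Sum of the per-task results: keys shared between splits are counted twice.
    static int sumOfPerSplitCounts(List<List<String>> splits) {
        int total = 0;
        for (List<String> split : splits) {
            total += distinctInSplit(split);
        }
        return total;
    }

    // All keys brought together in one place, as the shuffle does before reduce.
    static int distinctAfterGrouping(List<List<String>> splits) {
        Set<String> all = new HashSet<>();
        for (List<String> split : splits) {
            for (String line : split) {
                all.add(line.split(" ")[0]);
            }
        }
        return all.size();
    }

    public static void main(String[] args) {
        // Assumed split boundary: the five input lines land in two map tasks.
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("john 89", "tom 100", "mary 100"),
                Arrays.asList("mary 200", "tom 20"));
        System.out.println(sumOfPerSplitCounts(splits));   // prints 5
        System.out.println(distinctAfterGrouping(splits)); // prints 3
    }
}
```

Each map task only sees its own split, so "mary" and "tom" are counted once per task; only after the keys are grouped across tasks does the count come out as 3.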

#7 wzl189, score: 0, replied 2014-06-20 15:16:20

Quoting reply #4 by tjytad1982: "Following this thread to learn."

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            try {
                // Emit the first tab-separated column as the key.
                String[] lineSplit = line.split("\t");
                context.write(new Text(lineSplit[0]), new Text(""));
            } catch (java.lang.ArrayIndexOutOfBoundsException e) {
                // Counter.LINESKIP is an enum from the rest of the job class (not shown).
                context.getCounter(Counter.LINESKIP).increment(1);
                return;
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        private Set<String> count = new HashSet<String>();

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                count.add(value.toString());
            }
            context.write(key, new Text(""));
        }
    }

I have been stuck on this problem for two weeks, and there is very little learning material on the subject. This is how I wrote my map and reduce, but as soon as the data gets a bit larger it runs out of memory, so I think my approach is wrong.
You said "you have to use reduce, which receives the keys after they have been grouped, to get the deduplication effect". How exactly are the map and reduce written for that?

#8 wzl189, score: 0, replied 2014-06-20 15:22:52

Quoting my own reply #7, with the code corrected:

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            try {
                String[] lineSplit = line.split("\t");
                context.write(new Text(lineSplit[0]), new Text(""));
                // Also send every name to the single key "uniq".
                context.write(new Text("uniq"), new Text(lineSplit[0]));
            } catch (java.lang.ArrayIndexOutOfBoundsException e) {
                // Counter.LINESKIP is an enum from the rest of the job class (not shown).
                context.getCounter(Counter.LINESKIP).increment(1);
                return;
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        // Holds every value seen by this reducer in memory.
        private Set<String> count = new HashSet<String>();

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                count.add(value.toString());
            }
            // The key must be a Text, not a String, to match the declared output type.
            context.write(new Text("uniq"), new Text(count.size() + ""));
        }
    }

The MapReduce code I posted just now was wrong; please go by this version.

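#9 wzl189, score: 0, replied 2014-06-20 15:33:13

Quoting reply #1 by tntzbzc: "Have the map emit the first column as the key; the value does not matter. In the reducer class, initialize a counter and increment it by one on every call to the reduce method. Then write the counter out in the reducer's cleanup method."

Thank you. I understand how to write the map you describe, but how should the reduce be written?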

#10 tntzbzc (撸大湿), score: 0, replied 2014-06-20 17:30:04

I will write up a complete example for you a bit later.

#11 tntzbzc (撸大湿), score: 0, replied 2014-06-20 20:47:41

 
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class wzl189_distinct {

    // Emits the first column as the key; the value carries no information.
    public static class MyMapper extends Mapper<Object, Text, Text, NullWritable> {
        Text outKey = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] tmp = value.toString().split(" ");
            if (tmp.length != 2)
                return; // skip lines that do not have exactly two columns
            outKey.set(tmp[0]);
            context.write(outKey, NullWritable.get());
        }
    }

    // reduce() is called once per distinct key, so counting calls counts distinct keys.
    public static class MyReducer extends Reducer<Text, NullWritable, LongWritable, NullWritable> {
        long myCount = 0L;

        @Override
        public void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            ++myCount;
        }

        // Write the total once, after the last reduce() call.
        @Override
        public void cleanup(Context context) throws IOException, InterruptedException {
            context.write(new LongWritable(myCount), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.err.println("Usage: <in> <out>");
            System.exit(2);
        }
        // Kept as posted; note that -Xmx is given twice.
        conf.set("mapred.child.java.opts", "-Xmx350m -Xmx1024m");

        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "wzl189_distinct");
        job.setNumReduceTasks(1); // one reducer, so one total is written
        job.setInputFormatClass(TextInputFormat.class);
        job.setJarByClass(wzl189_distinct.class);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(LongWritable.class); // matches the reducer's output key type
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A single counter in the reduce phase is all you need.

#12 wzl189, score: 0, replied 2014-06-20 22:45:45

Quoting reply #11 by tntzbzc:


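A single counter in the reduce phase is all you need.

Thank you so much, you really know a lot. I struggled with this for two weeks without getting anywhere. I would like to ask one last question.
Suppose the first column is a name and the second column is a class (leave aside whether this requirement makes sense):
john 100
john 100
mary 100
mary 200
tom 200

I want to produce the following result, that is, the number of distinct people in each class:
100 2
200 2

How should this MapReduce job be written? I hope an expert can answer one last time. Many thanks.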

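#13 tntzbzc (撸大湿), score: 0, replied 2014-06-21 11:41:41

Have the map emit a key made of class + separator + name.
Override the grouping comparator to implement a secondary sort. If the number of reducers is greater than 1, you also need to override the partitioner.
Modify the reducer slightly: add a variable for the name, compare the current name with the previous one, and increment the counter when they differ.
I will not post the code. OP, think it over a bit more; a simple MR job like this is not hard to solve.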
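Since #13 leaves the code out, the following plain-Java sketch shows only the logic it describes, not an actual Hadoop job. A sorted list stands in for the shuffle's sort, the check for a change of class stands in for the grouping comparator, and the tab separator and all names (DistinctPerClassSketch, distinctNamesPerClass) are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch (no Hadoop); names and the tab separator are assumptions.
class DistinctPerClassSketch {

    static Map<String, Integer> distinctNamesPerClass(List<String> lines) {
        // Map step: build the composite key "class<TAB>name".
        List<String> keys = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.split(" ");
            keys.add(cols[1] + "\t" + cols[0]);
        }
        // Stand-in for the shuffle sort: rows of one class become adjacent and,
        // within a class, identical names become adjacent.
        Collections.sort(keys);

        Map<String, Integer> result = new LinkedHashMap<>();
        String prevClass = null;
        String prevName = null;
        for (String key : keys) {
            String[] parts = key.split("\t");
            String cls = parts[0];
            String name = parts[1];
            if (!cls.equals(prevClass)) {
                // A new class starts (the boundary a grouping comparator would define).
                result.put(cls, 0);
                prevName = null;
            }
            if (!name.equals(prevName)) {
                // The name differs from the previous one, so count it once.
                result.merge(cls, 1, Integer::sum);
            }
            prevClass = cls;
            prevName = name;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "john 100", "john 100", "mary 100", "mary 200", "tom 200");
        for (Map.Entry<String, Integer> e : distinctNamesPerClass(input).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
        // prints:
        // 100 2
        // 200 2
    }
}
```

Because the names arrive sorted within each class, remembering just the previous name is enough to skip duplicates, so no set of names has to be held in memory.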


Source thread: http://bbs.csdn.net/topics/390811736?page=1#post-397617777