Grouping in MapReduce

小丫小屁孩

Category: mapreduce

Article link: https://blog.csdn.net/qq_42636010/article/details/90071896

To skip the details, jump straight to the summary at the end.

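Part 1:
Grouping in MapReduce merges records that share the same key.
For example, suppose the map phase outputs
hadoop 1
hadoop 1
hadoop 1
After grouping this becomes hadoop <1, 1, 1>. That is why the second parameter of the reduce method in the Reducer class is an Iterable over the values emitted by map; here it yields <1, 1, 1>.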
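The grouping step described above can be imitated in plain Java without Hadoop. This is only a minimal sketch: the class WordCountGroupingSketch and its shuffle and reduce methods are hypothetical names invented for illustration, and a TreeMap stands in for the sorted, grouped map output. Like the grouping discussed in this article, TreeMap identifies keys through their ordering (compareTo) rather than through equals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the shuffle and reduce steps; this is not Hadoop code.
class WordCountGroupingSketch {

    // Emits (word, 1) for every word and collects the values of identical keys
    // into one list per key, in sorted key order.
    static Map<String, List<Integer>> shuffle(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : words) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Mirrors a word-count reduce call: sums all values that belong to one key.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped =
                shuffle(Arrays.asList("hadoop", "hadoop", "hadoop"));
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            // Prints: hadoop [1, 1, 1] -> 3
            System.out.println(entry.getKey() + " " + entry.getValue()
                    + " -> " + reduce(entry.getKey(), entry.getValue()));
        }
    }
}
```

The three (hadoop, 1) records end up as a single key with the value list [1, 1, 1], which is what the reduce method receives.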
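So how does it decide whether two keys are the same? At first I assumed it used the key's equals method, but when I happened to comment out equals in my custom Writable, grouping still worked. That pointed to compareTo, and that is indeed it: when compareTo returns 0, the two keys are treated as the same and placed in one group. (When no grouping comparator is set, the grouping comparison falls back to the key's sort order, which by default is compareTo.)
The group is also presented under a single key, with the values of the other records appended to its values. I say this because of a test in which I defined the following Writable class: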

public static class GroupWritable implements WritableComparable<GroupWritable>{
		private String word;
		private int count ;
		public void write(DataOutput out) throws IOException {
			out.writeUTF(this.word);
			out.writeInt(this.count);
		}
		public void readFields(DataInput in) throws IOException {
			// TODO Auto-generated method stub
			this.word = in.readUTF();
			this.count = in.readInt();
		}
		@Override
		public int compareTo(GroupWritable o) {
			// Only count takes part in the comparison; word is ignored.
			// Integer.compare avoids the overflow that this.count - o.count can cause.
			return Integer.compare(this.count, o.count);
		}
//		@Override
//		public boolean equals(Object obj) {	
//			GroupWritable gw = (GroupWritable)obj;
//			return (gw.word == this.word) && (this.count == gw.count);
//		}
		public GroupWritable(String word, int count) {
			super();
			this.word = word;
			this.count = count;
		}
		public GroupWritable() {
			super();
			// TODO Auto-generated constructor stub
		}
		public String getWord() {
			return word;
		}
		public void setWord(String word) {
			this.word = word;
		}
		public int getCount() {
			return count;
		}
		public void setCount(int count) {
			this.count = count;
		}
		@Override
		public String toString() {
			// TODO Auto-generated method stub
			return "word:"+word+" count:"+count;
		}
	}
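A side note on compareTo methods that compare int fields: they are often written as a subtraction, which can overflow when the two values are far apart. The sketch below is a standalone illustration; the class CompareOverflowSketch and its two helper methods are hypothetical names, not part of the article's program.

```java
// Hypothetical demo class; it only contrasts two ways of comparing ints.
class CompareOverflowSketch {

    // Subtraction-based comparison: wraps around when the difference exceeds the int range.
    static int bySubtraction(int a, int b) {
        return a - b;
    }

    // Overflow-safe comparison from the standard library.
    static int byIntegerCompare(int a, int b) {
        return Integer.compare(a, b);
    }

    public static void main(String[] args) {
        int a = Integer.MIN_VALUE;
        int b = 1;
        // a is smaller than b, so a correct comparison must be negative.
        System.out.println(bySubtraction(a, b) > 0);    // true: the subtraction wrapped around
        System.out.println(byIntegerCompare(a, b) < 0); // true: the correct sign
    }
}
```

For small counts such as the 11 used in this article both forms agree; the difference only shows up near the limits of int.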

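This class has two fields, a String and an int. Its compareTo method returns 0 whenever the int fields of the two objects are equal. This is where something interesting happened.
Below are the keys and values emitted by map, with the GroupWritable key in parentheses followed by the value:
(hello, 11) 1
(world, 11) 1
(haha, 11) 1
After grouping, the result was (hello, 11) <1, 1, 1>.
I know this because reduce sums the elements of the iterator, and the final output was (hello, 11) 3.
If the records had not been grouped together, the final output would have been
(hello, 11) 1
(world, 11) 1
(haha, 11) 1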

public static class GroupReducer extends Reducer<GroupWritable, IntWritable, GroupWritable, IntWritable>{
		
		@Override
		protected void reduce(GroupWritable key, Iterable<IntWritable> value,Context context)
				throws IOException, InterruptedException {
			int sum = 0 ;
			
			for(IntWritable i : value) {
				sum += i.get();
			}
			context.write(key, new IntWritable(sum));
			

		}
		
		
	}

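Why were they grouped together? Because the compareTo method I defined treats these three keys as equal, so they were handed to a single reduce call as one group, presented under the first key, with their values appended to the iterator. (Hadoop reuses the key object while the values are iterated, so the fields read from the key can depend on when it is read.)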
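The effect can be reproduced without a cluster. The following is a minimal sketch under stated assumptions: Key is a simplified stand-in for GroupWritable with no Hadoop serialization, and the group method is a hypothetical model of the reduce side that sorts the keys and starts a new group whenever compareTo between neighbouring keys is non-zero. It is not Hadoop's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of grouping by compareTo; not Hadoop code.
class CompareToGroupingSketch {

    // Simplified stand-in for the article's GroupWritable.
    static class Key implements Comparable<Key> {
        final String word;
        final int count;

        Key(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public int compareTo(Key other) {
            // Only count is compared; word is ignored, as in the article.
            return Integer.compare(count, other.count);
        }
    }

    // Sorts the keys, then opens a new group whenever two neighbours compare as non-zero.
    static List<List<Key>> group(List<Key> keys) {
        List<Key> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        List<List<Key>> groups = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            if (i == 0 || sorted.get(i - 1).compareTo(sorted.get(i)) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(sorted.get(i));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<List<Key>> groups = group(Arrays.asList(
                new Key("hello", 11), new Key("world", 11), new Key("haha", 11)));
        // Prints: 1 group(s), first group has 3 record(s)
        System.out.println(groups.size() + " group(s), first group has "
                + groups.get(0).size() + " record(s)");
    }
}
```

Because each record carries the value 1, a group of three records sums to 3, matching the (hello, 11) 3 result reported above even though the three words differ.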

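Complete code. The final result is (hello, 11) 3.
You can try it yourself. The mapper splits each input line on a space, so the input file contains lines such as hello 11, which the mapper emits as
(hello, 11) 1
(world, 11) 1
(haha, 11) 1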

package GroupTest;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GroupTestDemo {
	public static class GroupWritable implements WritableComparable<GroupWritable>{
		private String word;
		private int count ;
		public void write(DataOutput out) throws IOException {
			out.writeUTF(this.word);
			out.writeInt(this.count);
			
		}
		public void readFields(DataInput in) throws IOException {
			// TODO Auto-generated method stub
			this.word = in.readUTF();
			this.count = in.readInt();
		}
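		@Override
		public int compareTo(GroupWritable o) {
			// Descending by count; word is ignored, so equal counts compare as 0.
			return Integer.compare(o.count, this.count);
		}
		// Commenting out equals() below is what led to the discovery described above.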
//		@Override
//		public boolean equals(Object obj) {	
//			GroupWritable gw = (GroupWritable)obj;
//			return (gw.word == this.word) && (this.count == gw.count);
//		}
		public GroupWritable(String word, int count) {
			super();
			this.word = word;
			this.count = count;
		}
		public GroupWritable() {
			super();
			// TODO Auto-generated constructor stub
		}
		public String getWord() {
			return word;
		}
		public void setWord(String word) {
			this.word = word;
		}
		public int getCount() {
			return count;
		}
		public void setCount(int count) {
			this.count = count;
		}
		@Override
		public String toString() {
			// TODO Auto-generated method stub
			return "word:"+word+" count:"+count;
		}
	}
	public static class GroupMapper extends Mapper<LongWritable, Text, GroupWritable, IntWritable>{
		@Override
		protected void map(LongWritable key, Text value,Context context)
				throws IOException, InterruptedException {
			
			String [] splited = value.toString().split(" ");
			GroupWritable gw = new GroupWritable(splited[0],Integer.parseInt(splited[1]));
			context.write(gw,new IntWritable(1));
			
		}
	
	}
	public static class GroupReducer extends Reducer<GroupWritable, IntWritable, GroupWritable, IntWritable>{
	
		@Override
		protected void reduce(GroupWritable key, Iterable<IntWritable> value,Context context)
				throws IOException, InterruptedException {
			int sum = 0 ;
			
			for(IntWritable i : value) {
				sum += i.get();
			}
			context.write(key, new IntWritable(sum));
		

		}
		
		
	}
	public static void main(String [] args) throws IOException, ClassNotFoundException, InterruptedException {
		Job job = Job.getInstance(new Configuration ());
		job.setJarByClass(GroupTestDemo.class);
		
		job.setMapperClass(GroupMapper.class);
		job.setMapOutputKeyClass(GroupWritable.class);
		job.setMapOutputValueClass(IntWritable.class);
		
	
		
		job.setReducerClass(GroupReducer.class);
		job.setOutputKeyClass(GroupWritable.class);
		job.setOutputValueClass(IntWritable.class);
		
		
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		// Exit with a non-zero status if the job fails.
		System.exit(job.waitForCompletion(true) ? 0 : 1);
		
		
		
	}
	

}

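Part 2: That leaves another question: what is job.setGroupingComparatorClass for? Other blog posts all say it is used for grouping, so I wondered which takes precedence, it or compareTo.
An experiment:
Method:
Define a comparator class whose compare methods always return -1 and register it with job.setGroupingComparatorClass in the main class. GroupWritable still implements compareTo, which still returns 0 whenever the int fields of the two objects are equal.
The input data is again
(hello, 11) 1
(world, 11) 1
(haha, 11) 1
Expected result:
If compareTo takes precedence over the grouping comparator, the three records will be placed in one group; otherwise they will be split into three groups.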

package GroupTest;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GroupTestDemo {
	public static class GroupWritable implements WritableComparable<GroupWritable>{
		private String word;
		private int count ;
		public void write(DataOutput out) throws IOException {
			out.writeUTF(this.word);
			out.writeInt(this.count);
		}
		public void readFields(DataInput in) throws IOException {
			// TODO Auto-generated method stub
			this.word = in.readUTF();
			this.count = in.readInt();
		}
		@Override
		public int compareTo(GroupWritable o) {
			// Only count takes part in the comparison; word is ignored.
			// Integer.compare avoids the overflow that this.count - o.count can cause.
			return Integer.compare(this.count, o.count);
		}
//		@Override
//		public boolean equals(Object obj) {	
//			GroupWritable gw = (GroupWritable)obj;
//			return (gw.word == this.word) && (this.count == gw.count);
//		}
		public GroupWritable(String word, int count) {
			super();
			this.word = word;
			this.count = count;
		}
		public GroupWritable() {
			super();
			// TODO Auto-generated constructor stub
		}
		public String getWord() {
			return word;
		}
		public void setWord(String word) {
			this.word = word;
		}
		public int getCount() {
			return count;
		}
		public void setCount(int count) {
			this.count = count;
		}
		@Override
		public String toString() {
			// TODO Auto-generated method stub
			return "word:"+word+" count:"+count;
		}
	}
	public static class GroupMapper extends Mapper<LongWritable, Text, GroupWritable, IntWritable>{
		@Override
		protected void map(LongWritable key, Text value,Context context)
				throws IOException, InterruptedException {
			
			String [] splited = value.toString().split(" ");
			GroupWritable gw = new GroupWritable(splited[0],Integer.parseInt(splited[1]));
			context.write(gw,new IntWritable(1));
			
			
		}
	
	}
	public static class GroupReducer extends Reducer<GroupWritable, IntWritable, GroupWritable, IntWritable>{
		
		@Override
		protected void reduce(GroupWritable key, Iterable<IntWritable> value,Context context)
				throws IOException, InterruptedException {
			int sum = 0 ;
			
			for(IntWritable i : value) {
				sum += i.get();
			}
			context.write(key, new IntWritable(sum));
			

		}
		
		
	}
	public static void main(String [] args) throws IOException, ClassNotFoundException, InterruptedException {
		Job job = Job.getInstance(new Configuration ());
		job.setJarByClass(GroupTestDemo.class);
		
		job.setMapperClass(GroupMapper.class);
		job.setMapOutputKeyClass(GroupWritable.class);
		job.setMapOutputValueClass(IntWritable.class);
		
		job.setGroupingComparatorClass(MyGroupingComparator.class);
		
		job.setReducerClass(GroupReducer.class);
		job.setOutputKeyClass(GroupWritable.class);
		job.setOutputValueClass(IntWritable.class);
		
		
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		// Exit with a non-zero status if the job fails.
		System.exit(job.waitForCompletion(true) ? 0 : 1);
		
		
		
	}
	// Grouping comparator for the experiment: it reports every pair of keys as different.
	public static class MyGroupingComparator implements RawComparator<GroupWritable>{
		
		// Compares the serialized keys; always -1, so no two keys are ever grouped.
		@Override
		public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
			return -1;
		}
		// Compares the deserialized keys; also always -1.
		@Override
		public int compare(GroupWritable o1, GroupWritable o2) {
			return -1;
		}
	}

}

The final result was three groups. In other words, for grouping, the comparator registered with setGroupingComparatorClass overrides the compareTo method.
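The experiment can also be modelled in plain Java. This is a sketch under assumptions, not Hadoop's implementation: countGroups is a hypothetical helper that sorts with one comparator and decides group boundaries with another, BY_COUNT plays the role of compareTo, and ALWAYS_DIFFERENT plays the role of MyGroupingComparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of sorting with one comparator and grouping with another.
class GroupingComparatorSketch {

    // Simplified stand-in for the article's GroupWritable.
    static class Key {
        final String word;
        final int count;

        Key(String word, int count) {
            this.word = word;
            this.count = count;
        }
    }

    // Plays the role of compareTo: only count is compared.
    static final Comparator<Key> BY_COUNT = Comparator.comparingInt(k -> k.count);

    // Plays the role of MyGroupingComparator: every pair is reported as different.
    static final Comparator<Key> ALWAYS_DIFFERENT = (a, b) -> -1;

    // Sorts with sortOrder, then counts a new group whenever the grouping
    // comparator returns non-zero for two neighbouring keys.
    static <K> int countGroups(List<K> keys, Comparator<K> sortOrder, Comparator<K> grouping) {
        List<K> sorted = new ArrayList<>(keys);
        sorted.sort(sortOrder);
        int groups = 0;
        for (int i = 0; i < sorted.size(); i++) {
            if (i == 0 || grouping.compare(sorted.get(i - 1), sorted.get(i)) != 0) {
                groups++;
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(
                new Key("hello", 11), new Key("world", 11), new Key("haha", 11));
        // No custom grouping comparator: grouping falls back to the sort order.
        System.out.println(countGroups(keys, BY_COUNT, BY_COUNT));          // prints 1
        // Custom grouping comparator that always returns -1.
        System.out.println(countGroups(keys, BY_COUNT, ALWAYS_DIFFERENT));  // prints 3
    }
}
```

In this model the sort order is untouched by the grouping comparator; only the decision about where one group ends and the next begins changes.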

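Summary:
Grouping gathers the values of identical keys into one group (exposed as an iterator, roughly <value1, value2, value3>). There are two ways to decide whether two keys are the same. The first is the key's compareTo method: when it returns 0, the two keys are considered the same. The second is to register a custom comparator with job.setGroupingComparatorClass (look up how to write one) and implement its compare method; that method then replaces the key's compareTo as the test for whether two keys are the same.

The same conclusion follows from reasoning alone: the grouping comparator must take precedence over the compareTo method of a custom data type, because every custom key type has to implement compareTo anyway, and if the grouping comparator did not win, the class would serve no purpose.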
