MapReduce sorting: secondary sort

Latest recommended article published on 2023-03-07 23:40:20

Charles__D

Category: Hadoop. Tags: Hadoop, MapReduce, secondary sort, GroupingComparator, SortComparator

Article link: https://blog.csdn.net/qq_40929246/article/details/91352183

Secondary sort: Partitioner, SortComparator, GroupingComparator

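Partitioner: assigns each key to a partition (one partition per reducer); override getPartition().
Similarities and differences between the SortComparator and the GroupingComparator:
Same: both extend WritableComparator, pass the bean class to the superclass constructor, and override compare().
Different: the SortComparator does the secondary sort; its compare() defines the order of the bean objects. The GroupingComparator does the grouping; its compare() decides which adjacent bean objects in the sorted stream belong to the same group, that is, to the same reduce() call.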
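
To make the difference concrete, here is a minimal plain-Java sketch that imitates the two roles without any Hadoop classes. The class name SortVsGroupDemo and the use of int[]{int1, int2} as a key are assumptions made for illustration only; the framework itself works on serialized keys and is more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java illustration (no Hadoop classes): a key is int[]{int1, int2}.
class SortVsGroupDemo {

    // Plays the role of the sort comparator: order by int2, then by int1.
    static final Comparator<int[]> SORT =
            Comparator.<int[]>comparingInt(k -> k[1]).thenComparingInt(k -> k[0]);

    // Plays the role of the grouping comparator: only int2 decides group membership.
    static final Comparator<int[]> GROUP = Comparator.comparingInt(k -> k[1]);

    // Sorts the keys, then starts a new group whenever GROUP reports a difference
    // between neighbouring keys. Each group is rendered as "int2:int1,int1,...".
    static List<String> sortAndGroup(List<int[]> keys) {
        List<int[]> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        List<String> groups = new ArrayList<>();
        StringBuilder current = null;
        int[] previous = null;
        for (int[] key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                if (current != null) {
                    groups.add(current.toString());
                }
                current = new StringBuilder().append(key[1]).append(':').append(key[0]);
            } else {
                current.append(',').append(key[0]);
            }
            previous = key;
        }
        if (current != null) {
            groups.add(current.toString());
        }
        return groups;
    }

    public static void main(String[] args) {
        List<int[]> keys = Arrays.asList(
                new int[]{7, 2}, new int[]{3, 1}, new int[]{5, 2}, new int[]{9, 1});
        System.out.println(sortAndGroup(keys)); // prints [1:3,9, 2:5,7]
    }
}
```

The sort step fixes the order inside each group; the grouping step only draws the boundaries between groups, which is why it can ignore int1.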

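Requirements:
1. Each record is a pair of integers (int1,int2); int1 ranges over 1-100000 and int2 over 1-100.
2. Sort by int2 first, then by int1.
3. Use at least five reducers, and the reducers' output must be totally ordered: reading the output files part-r-00000, part-r-00001, ... in sequence must give one sorted result.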

The bean: it must implement WritableComparable. Its compareTo() is overridden here to do the sorting, which serves the same purpose as a SortComparator.

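import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class MyBean implements WritableComparable<MyBean> {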

	private int int1;
	private int int2;

	public MyBean() {
	}
	public MyBean(int int1, int int2) {
		this.int1 = int1;
		this.int2 = int2;
	}
	@Override
	public void write(DataOutput out) throws IOException {
		out.writeInt(int1);
		out.writeInt(int2);
	}
	@Override
	public void readFields(DataInput in) throws IOException {
		this.int1 = in.readInt();
		this.int2 = in.readInt();
	}
	@Override
	public int compareTo(MyBean o) { 	// secondary sort: by int2 first, then by int1
		if (this.int2 == o.getInt2()) {
			return Integer.compare(this.int1, o.getInt1());
		} else {
			return Integer.compare(this.int2, o.getInt2());
		}
	}
	@Override
	public String toString() {
		return "(" + int1 + "," + int2 + ")";
	}
	public int getInt1() {
		return int1;
	}
	public void setInt1(int int1) {
		this.int1 = int1;
	}
	public int getInt2() {
		return int2;
	}
	public void setInt2(int int2) {
		this.int2 = int2;
	}
}
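
write() and readFields() must handle the fields in the same order, otherwise the bean comes back scrambled after the shuffle. The sketch below checks that symmetry with plain java.io streams, using the same writeInt/readInt calls as the bean. The class BeanRoundTripDemo and its method names are hypothetical helpers, not part of the job.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class BeanRoundTripDemo {

    // Writes the two fields in the same order as MyBean.write().
    static byte[] serialize(int int1, int int2) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(int1);
            out.writeInt(int2);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the fields back in the same order as MyBean.readFields().
    static int[] deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int int1 = in.readInt();
            int int2 = in.readInt();
            return new int[]{int1, int2};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize(42, 7);
        int[] fields = deserialize(data);
        // Two 4-byte ints, so the record occupies 8 bytes.
        System.out.println(data.length + " bytes -> (" + fields[0] + "," + fields[1] + ")");
    }
}
```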

map

	public static class MyMapper extends Mapper<LongWritable, Text, MyBean, NullWritable> {

		@Override
		protected void map(LongWritable key, Text value, Context context) 
		throws IOException, InterruptedException {

			String line = value.toString().trim(); // expected form: (int1,int2)
			String[] fields = line.split(",");
			String num1 = fields[0].substring(1);                          // drop the leading "("
			String num2 = fields[1].substring(0, fields[1].length() - 1); // drop the trailing ")"

			MyBean bean = new MyBean(Integer.parseInt(num1), Integer.parseInt(num2));
			context.write(bean, NullWritable.get());
		}
	}
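
The mapper above trusts every input line to have the form (int1,int2). If the input might contain blank or damaged lines, a more defensive parse could look like the sketch below. RecordParserDemo and parse() are hypothetical names, and returning null for a bad line is just one possible convention; a mapper using it would skip the record when null comes back.

```java
class RecordParserDemo {

    // Parses "(int1,int2)" into {int1, int2}; returns null when the line is malformed.
    static int[] parse(String line) {
        String s = line.trim();
        // The shortest valid record is "(1,2)", which is 5 characters long.
        if (s.length() < 5 || s.charAt(0) != '(' || s.charAt(s.length() - 1) != ')') {
            return null;
        }
        String[] fields = s.substring(1, s.length() - 1).split(",");
        if (fields.length != 2) {
            return null;
        }
        try {
            return new int[]{Integer.parseInt(fields[0].trim()), Integer.parseInt(fields[1].trim())};
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        int[] ok = parse("(12345,67)");
        System.out.println(ok[0] + " " + ok[1]);        // prints 12345 67
        System.out.println(parse("not a record") == null); // prints true
    }
}
```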

partitioner

	public static class MyPartitioner extends Partitioner<MyBean, NullWritable> {

		@Override
		public int getPartition(MyBean key, NullWritable value, int numPartitions) {
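			// int2 is in 1-100: 1-20 -> partition 0, 21-40 -> 1, ..., 81-100 -> 4.
			// Splitting by range keeps the partitions ordered by int2, which is what
			// makes the combined reducer output totally ordered. Only partitions 0-4
			// are ever returned, so the job needs at least 5 reduce tasks.
			int int2 = key.getInt2();
			return (int2 - 1) / 20;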
		}
	}
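
The partitioner hard-codes a width of 20, which fits exactly five reducers. A possible generalization, sketched here in plain Java, derives the partition from numPartitions instead. RangePartitionDemo, partition() and MAX_INT2 are assumed names for illustration; the upper bound of 100 comes from the requirements.

```java
class RangePartitionDemo {

    static final int MAX_INT2 = 100; // upper bound of int2, taken from the requirements

    // Maps int2 in [1, MAX_INT2] to a partition in [0, numPartitions - 1].
    // The result never decreases as int2 grows, so partition i only holds
    // int2 values that are no larger than those in partition i + 1.
    static int partition(int int2, int numPartitions) {
        return (int2 - 1) * numPartitions / MAX_INT2;
    }

    public static void main(String[] args) {
        // With 5 partitions this matches (int2 - 1) / 20.
        System.out.println(partition(20, 5) + " " + partition(21, 5) + " " + partition(100, 5)); // prints 0 1 4
        // With 7 partitions the ranges are uneven but still ordered.
        System.out.println(partition(1, 7) + " " + partition(100, 7)); // prints 0 6
    }
}
```

Inside getPartition() the same expression would be applied to key.getInt2() and the numPartitions argument.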

GroupingComparator: groups by int2 only

public static class GroupingComparator extends WritableComparator {

		public GroupingComparator() {
			super(MyBean.class, true);
		}

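		@Override
		@SuppressWarnings("rawtypes")
		public int compare(WritableComparable a, WritableComparable b) {
			// Keys with the same int2 compare as equal, so they share one reduce() call.
			MyBean beanA = (MyBean) a;
			MyBean beanB = (MyBean) b;
			return Integer.compare(beanA.getInt2(), beanB.getInt2());
		}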
	}

reducer
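One point to note: keys that compare as equal go to the same reduce() call. Here the grouping comparator compares the keys (bean objects) by int2 only, so the keys inside one group are not all identical. Two cases arise within a group:
1. int2 is the same and int1 differs.
2. int2 is the same and int1 is the same too.
The values are all NullWritable and carry no data; the data is in the keys. Hadoop reuses a single key object and refreshes its fields each time the values iterator advances, so the only way to see every bean in the group is to read the key while iterating over values. Without iterating, you always get the first key (bean object) of the group.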

public static class MyReducer extends Reducer<MyBean, NullWritable, Text, NullWritable> {

		@Override
		protected void reduce(MyBean key, Iterable<NullWritable> values, Context context)
				throws IOException, InterruptedException {
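			// The key object is refreshed as the iterator advances, so key.getInt1()
			// returns the int1 of the current record on each pass through the loop.
			StringBuilder sb = new StringBuilder();
			sb.append(key.getInt2()).append(':');
			for (NullWritable value : values) {
				sb.append(key.getInt1()).append(',');
			}
			sb.setLength(sb.length() - 1); // drop the trailing comma
			context.write(new Text(sb.toString()), NullWritable.get());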
		}
	}
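
The reducer depends on the key changing while the values are consumed. The following plain-Java simulation imitates that behaviour so it can be observed outside a cluster. It is only a model: KeyReuseDemo, Key and valuesOf() are made-up names, and the real framework deserializes each key from the shuffle data rather than copying from a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Simulation only: none of these types are Hadoop classes.
class KeyReuseDemo {

    // Mutable stand-in for MyBean.
    static class Key {
        int int1;
        int int2;
    }

    // Iterates over one group of sorted records {int1, int2}. Every next() copies the
    // current record into the shared key object, imitating how the reduce key is
    // refreshed while the values are being consumed.
    static Iterable<Object> valuesOf(List<int[]> group, Key sharedKey) {
        return () -> new Iterator<Object>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < group.size();
            }

            @Override
            public Object next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                int[] record = group.get(index++);
                sharedKey.int1 = record[0];
                sharedKey.int2 = record[1];
                return ""; // placeholder for the NullWritable value
            }
        };
    }

    // The key as it looks when reduce() is entered: the first record of the group.
    static Key firstKey(List<int[]> group) {
        Key key = new Key();
        key.int1 = group.get(0)[0];
        key.int2 = group.get(0)[1];
        return key;
    }

    // Reads key.int1 inside the loop and therefore sees every int1 of the group.
    static List<Integer> readInsideLoop(List<int[]> group) {
        Key key = firstKey(group);
        List<Integer> seen = new ArrayList<>();
        for (Object ignored : valuesOf(group, key)) {
            seen.add(key.int1);
        }
        return seen;
    }

    // Reads the key without touching the values: only the first int1 is visible.
    static int readWithoutIterating(List<int[]> group) {
        return firstKey(group).int1;
    }

    public static void main(String[] args) {
        List<int[]> group = Arrays.asList(new int[]{3, 1}, new int[]{9, 1}, new int[]{9, 1});
        System.out.println(readInsideLoop(group));       // prints [3, 9, 9]
        System.out.println(readWithoutIterating(group)); // prints 3
    }
}
```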

driver

public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

		Configuration configuration = new Configuration();
		Job job = Job.getInstance(configuration);

		job.setJarByClass(MyGroupingComparator.class);

		job.setPartitionerClass(MyPartitioner.class);
		job.setGroupingComparatorClass(GroupingComparator.class);
		job.setNumReduceTasks(5);

		job.setMapperClass(MyMapper.class);
		job.setReducerClass(MyReducer.class);

		job.setMapOutputKeyClass(MyBean.class);
		job.setMapOutputValueClass(NullWritable.class);

		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(NullWritable.class);

		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		boolean result = job.waitForCompletion(true);
		System.exit(result ? 0 : 1);
	}

The mapper, reducer, and driver all live in the same class:

public class MyGroupingComparator {}

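The code above covers partitioning, sorting, and grouping. The sorting can be implemented in two ways:

1. The version above uses the compareTo() method of the bean, which implements WritableComparable.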

@Override
	public int compareTo(MyBean o) { 	// secondary sort: by int2 first, then by int1
		if (this.int2 == o.getInt2()) {
			return Integer.compare(this.int1, o.getInt1());
		} else {
			return Integer.compare(this.int2, o.getInt2());
		}
	}

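2. Alternatively, write a sort comparator class (named SortComparator here) that extends WritableComparator, and register it in the driver:

job.setSortComparatorClass(SortComparator.class); 	// in the driver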

public class SortComparator extends WritableComparator {

		public SortComparator() {
			super(MyBean.class, true);
		}

		@Override
		@SuppressWarnings("rawtypes")
		public int compare(WritableComparable a, WritableComparable b) {
			MyBean beanA = (MyBean) a;
			MyBean beanB = (MyBean) b;
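			// Same ordering as MyBean.compareTo(): by int2 first, then by int1.
			if (beanA.getInt2() == beanB.getInt2()) {
				return Integer.compare(beanA.getInt1(), beanB.getInt1());
			} else {
				return Integer.compare(beanA.getInt2(), beanB.getInt2());
			}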
		}
	}
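
A common shortcut for comparing two ints is to return their difference. That is safe for the value ranges in this article (1-100000 and 1-100), but it overflows when the operands are far apart, while Integer.compare() does not. The small sketch below shows the failure; ComparatorOverflowDemo and its method names are hypothetical.

```java
class ComparatorOverflowDemo {

    // Comparison by subtraction: correct only while a - b fits in an int.
    static int bySubtraction(int a, int b) {
        return a - b;
    }

    // Overflow-safe comparison.
    static int byCompare(int a, int b) {
        return Integer.compare(a, b);
    }

    public static void main(String[] args) {
        int a = -2_000_000_000;
        int b = 2_000_000_000;
        // a is smaller than b, yet the difference wraps around to a positive number.
        System.out.println(bySubtraction(a, b) > 0); // prints true (wrong ordering)
        System.out.println(byCompare(a, b) < 0);     // prints true (correct ordering)
        // Within the article's ranges both approaches agree.
        System.out.println(bySubtraction(5, 100000) < 0 && byCompare(5, 100000) < 0); // prints true
    }
}
```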

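Appendix: generator class for the (int1,int2) records

/*
 * Uses random numbers to generate text files whose lines have the form (int1,int2):
 * at least 100 files,
 * at least 100,000 records per file,
 * where int1 is a random number in 1-100000 and int2 a random number in 1-100.
 */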


import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.util.Random;

public class InitRandom {

	public static void main(String[] args) throws IOException {

		int int1 = 100000;			// int1 is drawn from 1-100000
		int int2 = 100;				// int2 is drawn from 1-100
		int numOfFiles = 100;
		int numOfRecords = 100000;

		String path = args[0];		// output directory (the job's input path); must already exist
		Random random = new Random();

		for (int i = 1; i <= numOfFiles; i++) {
			System.out.println("writing file#" + i);
			// try-with-resources flushes and closes the stream even if a write fails
			try (PrintStream pStream = new PrintStream(
					new BufferedOutputStream(new FileOutputStream(new File(path, "file" + i))))) {
				for (int j = 0; j < numOfRecords; j++) {
					// one record per line, e.g. (53012,17)
					pStream.println("(" + (random.nextInt(int1) + 1) + "," + (random.nextInt(int2) + 1) + ")");
				}
			}
		}
	}
}