Lab 6: MapReduce Secondary Sort

Latest recommended article published on 2021-07-03 14:55:39

Avalonist

Category: [Big Data Lab Manual, Liu Peng] Tags: mapreduce secondarysort


6.1 Principles

The first thing to recognize is that MapReduce (MR) sorts by key by default [https://www.cnblogs.com/acSzz/p/6383618.html].

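Spill phase
During the collect phase, once the data in the in-memory circular buffer reaches a certain threshold, a spill is triggered and part of the data is written to local disk. The SpillThread is in effect the consumer of the kvbuffer buffer. Its main code is as follows: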

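spillLock.lock();
while(true){
   // tell the producer that the previous spill has finished
   spillDone.signal();
   // wait until there is data to spill
   while(kvstart == kvend){
      spillReady.await();
   }
   // release the lock so the map side can keep collecting while we spill
   spillLock.unlock();
   // sort the data in kvbuffer and spill it to local disk
   sortAndSpill();
   spillLock.lock();
   // reset the pointers to prepare for the next spill
   if(bufend < bufindex && bufindex < bufstart){
      bufvoid = kvbuffer.length;
   }
   kvstart = kvend;
   bufstart = bufend;
}
spillLock.unlock();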

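The internal flow of sortAndSpill() is as follows:
         Step 1: sort the data in kvbuffer[bufstart,bufend) with quicksort, ordering first by partition number and then by key. After this, the data is clustered by partition, and within each partition it is ordered by key.
         Step 2: in ascending order of partition number, write the data of each partition to a temporary file in the task's working directory. If the user has configured a Combiner, the data of each partition is aggregated once before it is written, for example <key1,val1> and <key1,val2> are combined into <key1,<val1,val2>>. (I am not sure this statement is correct; some sources say that the values of the same key are collected into a list during merge. I still need to look into this.)
         Step 3: write the metadata of each partition into the in-memory index structure SpillRecord. The metadata of a partition consists of its offset in the temporary file, the size of the data before compression, and the size of the data after compression.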
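The ordering in Step 1 can be sketched in plain Java without any Hadoop dependency. This is only an illustration: BufferedRecord and SpillSortSketch are hypothetical names, the records are made up, and List.sort (a stable merge sort) stands in for the quicksort that the text describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for one buffered map output record; not a Hadoop class.
final class BufferedRecord {
    final int partition;
    final String key;
    final String value;

    BufferedRecord(int partition, String key, String value) {
        this.partition = partition;
        this.key = key;
        this.value = value;
    }

    @Override
    public String toString() {
        return partition + ":" + key + "=" + value;
    }
}

class SpillSortSketch {
    // Order by partition number first, then by key inside each partition.
    static List<BufferedRecord> sortForSpill(List<BufferedRecord> buffer) {
        List<BufferedRecord> sorted = new ArrayList<>(buffer);
        sorted.sort(Comparator
                .comparingInt((BufferedRecord r) -> r.partition)
                .thenComparing(r -> r.key));
        return sorted;
    }

    public static void main(String[] args) {
        List<BufferedRecord> buffer = Arrays.asList(
                new BufferedRecord(1, "b", "v1"),
                new BufferedRecord(0, "c", "v2"),
                new BufferedRecord(1, "a", "v3"),
                new BufferedRecord(0, "a", "v4"));
        // prints [0:a=v4, 0:c=v2, 1:a=v3, 1:b=v1]
        System.out.println(sortForSpill(buffer));
    }
}
```

After the sort, all records of partition 0 come before those of partition 1, and each partition is internally ordered by key, which is the layout that Step 2 then writes out partition by partition.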

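Combine (merge) phase
Once all of the task's data has been processed, the MapTask merges all of the task's temporary files into one large file and generates the corresponding index file. The merge is carried out partition by partition.
Having each task end up with a single file avoids the overhead of opening a large number of files at once and of the random reads caused by many small files.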
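Because every temporary file is already sorted, the per-partition merge can be done as a k-way merge. The sketch below shows the idea with in-memory lists standing in for spill files. SpillMergeSketch and mergeSortedRuns are hypothetical names, and this is not Hadoop's actual merge code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

class SpillMergeSketch {
    // Merge several already-sorted runs (one per spill file, for a single
    // partition) into one sorted list using a min-heap of run cursors.
    static List<String> mergeSortedRuns(List<List<String>> runs) {
        // Each heap entry is {runIndex, offsetInRun}, ordered by the key it points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heap.add(new int[] {i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] cursor = heap.poll();
            List<String> run = runs.get(cursor[0]);
            merged.add(run.get(cursor[1]));
            // Advance the cursor of the run we just consumed from.
            if (cursor[1] + 1 < run.size()) {
                heap.add(new int[] {cursor[0], cursor[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> spills = Arrays.asList(
                Arrays.asList("a", "d", "f"),
                Arrays.asList("b", "c"),
                Arrays.asList("e"));
        // prints [a, b, c, d, e, f]
        System.out.println(mergeSortedRuns(spills));
    }
}
```

The heap never holds more than one cursor per run, so memory use depends on the number of spill files rather than on the amount of data.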

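Back to this lab:

Sometimes we also need to sort by value. One way to do this is to sort the collected values in the reduce phase, but a very large number of values can cause problems such as running out of memory. This is where secondary sort applies: the sorting of values is built into the MR computation itself instead of being done separately. Secondary sort means sorting first by the first field, and then sorting the rows that share the same first field by the second field. Note that the second sort must not break the result of the first.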
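The definition of secondary sort can be illustrated outside MapReduce with an ordinary composite comparator. The sketch below is plain Java under stated assumptions: CompositeSortSketch is a hypothetical name, the rows are made-up data, and both columns are sorted ascending purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class CompositeSortSketch {
    // Sort by column 0; rows that tie on column 0 are ordered by column 1.
    // The second comparison only runs on ties, so it cannot disturb the first ordering.
    static List<int[]> sortByFirstThenSecond(List<int[]> rows) {
        List<int[]> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator
                .comparingInt((int[] row) -> row[0])
                .thenComparingInt(row -> row[1]));
        return sorted;
    }

    // Render rows as "first,second" strings so the result is easy to read.
    static List<String> render(List<int[]> rows) {
        List<String> out = new ArrayList<>();
        for (int[] row : rows) {
            out.add(row[0] + "," + row[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> rows = Arrays.asList(
                new int[] {2, 30}, new int[] {1, 20},
                new int[] {2, 10}, new int[] {1, 5});
        // prints [1,5, 1,20, 2,10, 2,30]
        System.out.println(render(sortByFirstThenSecond(rows)));
    }
}
```

In a MapReduce job the same comparison is applied to a composite key during the shuffle, so no reducer has to hold all the values of a key in memory in order to sort them.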

6.2 Experiment

IntPair.java

package lab6;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;

public class  IntPair implements WritableComparable<IntPair> {
	private IntWritable first;
	private IntWritable second;
	public void set(IntWritable first,IntWritable second) {
		this.first=first;
		this.second=second;
		
	}
	public IntPair() {
		set(new IntWritable(),new IntWritable());
		
	}
	public IntPair(int first,int second) {
		set(new IntWritable(first),new IntWritable(second));
	}
	public IntPair(IntWritable first,IntWritable second) {
		set(first,second);
	}
	public IntWritable getFirst() {
		return first;
	}
	public void setFirst(IntWritable first) {
		this.first=first;
	}
	public IntWritable getSecond() {
		return second;
	}
	public void setSecond(IntWritable second) {
		this.second=second;
	}
	public void write(DataOutput out)throws IOException{
		first.write(out);
		second.write(out);
	}
	public void readFields(DataInput in)throws IOException{
		first.readFields(in);
		second.readFields(in);
	}
	public int hashCode() {
		return first.hashCode()*163+second.hashCode();
	}
	public boolean equals(Object o) {
		if(o instanceof IntPair) {
			IntPair tp=(IntPair) o;
			return first.equals(tp.first)&&second.equals(tp.second);
		}
		return false;
	}
	public String toString() {
		return first+"\t"+second;
	}
	public int compareTo(IntPair tp) {
		int cmp=first.compareTo(tp.first);
		if(cmp!=0) {
			return cmp;
		}
		return second.compareTo(tp.second);
	}
}

SecondarySort.java

package lab6;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SecondarySort {
	static class TheMapper extends Mapper<LongWritable,Text,IntPair,NullWritable>{
		@Override
		protected void map(LongWritable key,Text value,Context context) throws IOException,InterruptedException{
			String[] fields=value.toString().split("\t");
			int field1=Integer.parseInt(fields[0]);
			int field2=Integer.parseInt(fields[1]);
			context.write(new IntPair(field1,field2), NullWritable.get());
			
		}
	}
	static class TheReducer extends Reducer<IntPair,NullWritable,IntPair,NullWritable>{
		//private static final Text SEPARATOR=new Text("-------------------------------------");
		@Override
		protected void reduce(IntPair key,Iterable<NullWritable>values,Context context)throws IOException,InterruptedException{
			context.write(key, NullWritable.get());
			
		}
		
	}
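	// Partition on the first field only, so all rows with the same first field go to the same reducer
	public static class FirstPartitioner extends Partitioner<IntPair,NullWritable>{
		@Override
		public int getPartition(IntPair key,NullWritable value,int numPartitions) {
			// Mask off the sign bit instead of using Math.abs, which stays negative for Integer.MIN_VALUE
			return (key.getFirst().get() & Integer.MAX_VALUE)%numPartitions;
		}
	}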
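	// Without this class, both the first and second columns are sorted in ascending order by default. This class sorts the first column ascending and the second column descending.
	public static class KeyComparator extends WritableComparator{
		// The no-argument constructor is required; without it the job fails with an error
		protected KeyComparator() {
			super(IntPair.class,true);
		}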
		public int compare(WritableComparable a,WritableComparable b) {
			IntPair ip1=(IntPair)a;
			IntPair ip2=(IntPair)b;
			// First column: ascending
			int cmp=ip1.getFirst().compareTo(ip2.getFirst());
			if(cmp!=0) {
				return cmp;
			}
			// When the first columns are equal, second column: descending
			return -ip1.getSecond().compareTo(ip2.getSecond());
		}
		
	}
	
	// Entry point. Arguments: <input path> <output path> [number of reduce tasks]
	public static void main(String[] args)throws Exception{
		Configuration conf=new Configuration();
		Job job=Job.getInstance(conf);
		job.setJarByClass(SecondarySort.class); // set the main class; it can also be given on the hadoop command line
		job.setMapperClass(TheMapper.class);
		// When the Mapper's output key and value types are the same as the Reducer's output key and value types, the next two lines can be omitted
		//job.setMapOutputKeyClass(IntPair.class );
		//job.setMapOutputValueClass(NullWritable.class );
		FileInputFormat.setInputPaths(job,new Path(args[0]));
		job.setPartitionerClass(FirstPartitioner.class );
		// Comparator used to sort the map output keys
		job.setSortComparatorClass(KeyComparator.class );
		//job.setGroupingComparatorClass(GroupComparator.class );
		// Reducer settings
		job.setReducerClass(TheReducer.class );
		job.setOutputKeyClass(IntPair.class );
		job.setOutputValueClass(NullWritable.class );
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		// Number of reduce tasks: defaults to 1, overridden by the optional third argument
		int reduceNum=1;
		if(args.length>=3 && args[2]!=null) {
			reduceNum=Integer.parseInt(args[2]);
		}
		job.setNumReduceTasks(reduceNum);
		// Exit with a non-zero status if the job fails
		System.exit(job.waitForCompletion(true)?0:1);
	}

}
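A partitioner must return a value in the range [0, numPartitions). One caveat when deriving that value from an int field is that Math.abs(Integer.MIN_VALUE) is still negative, so a remainder taken after Math.abs can fall outside the range. The sketch below shows this in plain Java and one common way around it. PartitionIndexSketch and partitionFor are hypothetical names, not part of the lab code.

```java
class PartitionIndexSketch {
    // Clear the sign bit before taking the remainder, so the result is never negative.
    static int partitionFor(int first, int numPartitions) {
        return (first & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Math.abs cannot represent +2147483648, so it returns Integer.MIN_VALUE unchanged.
        System.out.println(Math.abs(Integer.MIN_VALUE) % 3 < 0);   // prints true
        System.out.println(partitionFor(Integer.MIN_VALUE, 3));    // prints 0
        System.out.println(partitionFor(7, 3));                    // prints 1
    }
}
```

For the small positive integers in this lab's input the two approaches give the same partition; the difference only matters for the extreme negative value.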

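Code walkthrough:

We need to define an IntPair class to hold the data, together with a custom comparator that compares the first and second fields (IntPair.compareTo provides the default order, and the KeyComparator class in SecondarySort overrides it). The approach here differs slightly from the secondary sort in "Pro Apache Hadoop" (PAH for short). PAH includes a Group step, but here it can be left out, because the first and second sorts happen in the same place, namely where PAH performs its first sort. In essence, what this lab implements is a pseudo secondary sort.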
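To make the role of the omitted Group step more concrete, the sketch below simulates in plain Java what grouping on the first field would do to keys that are already sorted: each distinct first field becomes one group whose second fields arrive in sorted order. This is an illustration under assumptions, not Hadoop code. GroupingSketch and groupByFirst are hypothetical names, and a LinkedHashMap is used only because, on sorted input, it yields the same groups as comparing consecutive keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupingSketch {
    // Input: (first, second) pairs already in the order the sort comparator leaves them.
    // Output: one entry per distinct first field, holding the second fields in arrival order.
    static Map<Integer, List<Integer>> groupByFirst(List<int[]> sortedPairs) {
        Map<Integer, List<Integer>> groups = new LinkedHashMap<>();
        for (int[] pair : sortedPairs) {
            groups.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return groups;
    }

    public static void main(String[] args) {
        // The sorted rows produced by this lab's job (first ascending, second descending).
        List<int[]> sorted = Arrays.asList(
                new int[] {3, 9999}, new int[] {3, 8888}, new int[] {3, 7777},
                new int[] {3, 6666}, new int[] {4, 22}, new int[] {4, 11},
                new int[] {6, 0}, new int[] {7, 555}, new int[] {7, 444},
                new int[] {7, 333});
        // prints {3=[9999, 8888, 7777, 6666], 4=[22, 11], 6=[0], 7=[555, 444, 333]}
        System.out.println(groupByFirst(sorted));
    }
}
```

In this lab every distinct (first, second) pair is its own reduce group and the reducer simply writes the key back out, which is why the job produces correctly ordered output without a grouping comparator.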

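Running the program:

Arguments: ./input ./output 1

Input file ./input/data.txt (the mapper splits each line on a tab character, so the two columns must be tab-separated):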

7   444
3   9999
7   333
4   22
3   7777
7   555
3   6666
6   0
3   8888
4   11

Output:

3   9999
3   8888
3   7777
3   6666
4   22
4   11
6   0
7   555
7   444
7   333