Hadoop MapReduce Partitioning, Grouping, Joins, and Secondary Sort Explained in Detail

XifengHZ

Category: Hadoop & Cloud Computing

Article link: https://blog.csdn.net/xiaocaidexuexibiji/article/details/12125699

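1. Data flow in MapReduce

(1) The simplest flow: map - reduce

(2) With a custom partitioner that routes map output to specific reducers: map - partition - reduce

(3) With an additional local reduce on the map side (an optimization): map - combine (local reduce) - partition - reduce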

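2. Partitioning in MapReduce: concept and usage

(1) How partitioning works and what it is for

Once the map tasks have emitted their records, which reducer should handle each one? By default Hadoop distributes records by hash value, but in practice that is not always efficient, nor does it always match what we need. For example, after partitioning one reducer might receive 20 records while another receives 100,000, which leaves the job badly skewed. Or we may want the output files to follow a particular layout: with two reducers, say, part-00000 should hold the results for keys starting with "h" and part-00001 the results for all other keys. The default partitioner cannot do that, so we write our own to choose the reducer for each record. A custom partitioner is simple: define a class that extends Partitioner, override its getPartition method, and register it with the job through Job's setPartitionerClass.

Map output is distributed to the reducers by the partitioner. Before that, it may pass through a combiner. The combiner has no base class of its own; it uses Reducer as its base class. The two do the same thing from the outside and differ only in where they run and in what context. The <key, value> pairs a mapper finally emits must be sent to reducers for merging, and all pairs with the same key go to the same reducer. Which key goes to which reducer is decided by the Partitioner, which has a single method:

getPartition(Text key, Text value, int numPartitions)

Here Text stands in for the job's map output key and value types; the method is declared generically as getPartition(KEY key, VALUE value, int numPartitions).

The inputs are a <key, value> pair from the map output and the number of reducers; the output is the integer index of the reducer that receives the pair. The default partitioner is HashPartitioner, which takes the key's hash value (with the sign bit cleared) modulo the number of reducers. That guarantees that equal keys always go to the same reducer. With N reducers the indices are 0, 1, 2, ..., N-1.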
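The two partitioning rules described above can be sketched in plain Java without any Hadoop dependency. The class and method names below are hypothetical; hashPartition uses the same arithmetic as HashPartitioner, but on String.hashCode(), which is not the hash that Hadoop's Text produces, so the actual partition numbers in a real job will differ. firstLetterPartition is the "keys starting with h" rule from the text and assumes exactly two reducers.

```java
// Plain-Java sketch (no Hadoop dependency) of two partitioning rules.
class PartitionSketch {

    // Same arithmetic as HashPartitioner: clear the sign bit, then take the remainder.
    static int hashPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Hypothetical custom rule from the text: keys starting with "h" go to
    // reducer 0, everything else to reducer 1 (assumes exactly two reducers).
    static int firstLetterPartition(String key) {
        return key.startsWith("h") ? 0 : 1;
    }

    public static void main(String[] args) {
        String[] keys = {"hadoop", "hive", "spark", "hbase", "flink"};
        for (String key : keys) {
            System.out.println(key
                    + " -> hash partition " + hashPartition(key, 3)
                    + ", first-letter partition " + firstLetterPartition(key));
        }
    }
}
```

In a real job the same logic would live in the getPartition method of a Partitioner subclass registered with setPartitionerClass.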

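(2) Using partitioning

Why is partitioning needed? Consider how to produce a globally sorted file with Hadoop. The simplest approach is to use a single partition, but for large inputs that is extremely inefficient: one machine has to process all of the output, which throws away the parallelism MapReduce provides. Instead we can produce a series of sorted files and concatenate them to get a globally sorted file (no merge step is needed as long as the key ranges of the files do not overlap). The key idea is to use a partitioner that reflects the global order of the output. Suppose we have 1,000 values in the range 1 to 10,000 and run 10 reduce tasks. If the partitioner sends values in 1-1000 to the first reducer, values in 1001-2000 to the second, and so on, then every value in reducer n is greater than every value in reducer n-1. Each reducer's output is sorted, so concatenating (cat) the output files in order yields one large sorted file.

That is the basic idea, but it leaves one question: how do we choose the ranges when the data set is large and its distribution is unknown? A simple answer is sampling. With 100 million records, for example, we can sample 10,000 of them and derive the range boundaries from the sample. In Hadoop we replace the default partitioner with TotalOrderPartitioner and hand it the sampling result, which gives the partitioning we want. For sampling, Hadoop's InputSampler class provides several samplers: RandomSampler, SplitSampler, and IntervalSampler.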
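The sampling idea can be illustrated with a small plain-Java sketch. It is not how TotalOrderPartitioner is implemented; it only shows the principle of deriving split points from a sorted sample and then locating each key's range by binary search. All names and the sample values are made up for illustration.

```java
import java.util.Arrays;

// Plain-Java sketch of range partitioning driven by a sample.
class RangePartitionSketch {

    // Pick numPartitions - 1 split points at evenly spaced positions of the sorted sample.
    static int[] splitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] splits = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return splits;
    }

    // Keys below splits[0] go to partition 0; a key equal to splits[i] goes to partition i + 1.
    static int partitionFor(int key, int[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        // Pretend these 20 values were sampled from a much larger input.
        int[] sample = {9120, 431, 7765, 2210, 5050, 1875, 6603, 3999, 812, 9901,
                        4420, 2999, 7100, 150, 8340, 5678, 1234, 6000, 3300, 8888};
        int[] splits = splitPoints(sample, 4);
        System.out.println("split points: " + Arrays.toString(splits));
        for (int key : new int[] {7, 2500, 5000, 9999}) {
            System.out.println(key + " -> reducer " + partitionFor(key, splits));
        }
    }
}
```

Because the partition index never decreases as the key grows, every key in reducer n is at least as large as every key in reducer n-1, which is the property the concatenation step relies on.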

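This lets us sort very large data sets on a distributed file system. The ordering itself comes from the key's comparison logic (its compareTo method or a custom comparator), not from the Partitioner, which has no compare method. By defining that comparison we can sort strings or other non-numeric types, and implement secondary or even multi-level sorting.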

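3. Grouping in MapReduce: concept and usage

The purpose of partitioning is to decide, from the key, which reducer processes each mapper output record. Grouping is easier to understand. In the author's view, grouping is also tied to the record's key: within one partition, records with the same key belong to the same group.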

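4. Using a combiner in MapReduce

Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the intermediate data transferred between map and reduce tasks. Hadoop lets the user specify a combiner function that runs on the map output; the combiner's output then becomes the input to reduce. Because the combiner is only an optimization, Hadoop gives no guarantee about how many times it is called for a given map output. In other words, the reduce output must be the same no matter how many times the combiner runs.

Take the example from Hadoop: The Definitive Guide. Suppose the 1950 weather readings were processed by two map tasks. The first map emits:
(1950, 0)
(1950, 20)
(1950, 10)

The second map emits:
(1950, 25)
(1950, 15)

The reduce input is then (1950, [0, 20, 10, 25, 15]) and its output is (1950, 25).

Since 25 is the maximum of the whole set, we can use a combiner that, like the reduce function, finds the maximum of each map's output. The reduce input then becomes:
(1950, [20, 25])

The processing of the temperature values can be written as: max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25

Note: not every function has this property (functions that do are called commutative and associative). For example, to compute the mean temperature we cannot use the mean as the combiner, because mean(0, 20, 10, 25, 15) = 14, whereas mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15.

The combiner cannot replace the reduce function (reduce is still needed to process records with the same key that come from different maps). But it can cut the amount of data transferred between map and reduce, and for that reason alone it is worth considering.

Note: a combiner both consumes and produces the map output types. If the job's output key and value types differ from the map output key and value types, the reducer class cannot be reused as the combiner.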
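The max and mean arithmetic above can be checked with a few lines of plain Java. This is only an illustration of why max is safe to use as a combiner and mean is not; the class and method names are hypothetical and nothing here uses the Hadoop API.

```java
// Plain-Java check of the combiner arithmetic from the weather example.
class CombinerSketch {

    static double max(double... values) {
        double best = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            best = Math.max(best, v);
        }
        return best;
    }

    static double mean(double... values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        double[] map1 = {0, 20, 10};   // output of the first map
        double[] map2 = {25, 15};      // output of the second map

        // max: combining per-map results gives the same answer as the full reduce.
        System.out.println(max(0, 20, 10, 25, 15));        // 25.0
        System.out.println(max(max(map1), max(map2)));     // 25.0

        // mean: the mean of per-map means is not the overall mean.
        System.out.println(mean(0, 20, 10, 25, 15));       // 14.0
        System.out.println(mean(mean(map1), mean(map2)));  // 15.0
    }
}
```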

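5. The sort process in the shuffle phase

First, the overall sort flow in MapReduce.

The MapReduce framework guarantees that the input to every reducer is sorted by key. The process of sorting the map output and transferring it to the reducers is generally called the shuffle. Each map task has a circular in-memory buffer, 100 MB by default, and writes its output there first. When the buffer contents reach a threshold (by default 80% of the buffer), a background thread starts writing them to disk; this is called a spill. The map can keep writing to the buffer during a spill, but if the buffer fills up, the map waits.

A spill proceeds as follows. First, the background thread divides the output into partitions according to the number of reducers, one partition per reducer. Next, within each partition it sorts the output by key. If a combiner is defined, it is run on the sorted output. Every spill produces one spill file on disk, so a single map task may produce several. When the map has written its last output, all spill files are merged, with their sort order preserved, into the final output file, and the combiner may be called again during this step. Overall, then, the number of combiner calls is not fixed. The rest of this section looks more closely at sorting in the shuffle:

The sort in the shuffle phase has two parts. The first happens when a spill is partitioned: a partition contains many keys, so the <key, value> pairs in it are sorted by key. Pairs with the same key end up adjacent, and each partition as a whole is ordered by key.

The second part is not a sort but a merge, and it happens twice. The first merge is on the map side, where multiple spills are merged by partition and by key within each partition into a single large file. The second is on the reduce side, where the outputs of the many maps feeding the same reducer are merged. This merge is harder to picture: it does not end in one large file, and along the way the data sits partly in memory and partly on disk. So the merge in the shuffle is not a sort in the strict sense; it only combines several already sorted files into a larger sorted one. Because the outputs of the map tasks can differ from run to run, the merged result is not identical every time, but data is always divided strictly by partition, and within each partition the <key, value> pairs with the same key are adjacent.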
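Merging several already sorted runs into one sorted sequence is a k-way merge. The sketch below shows the idea in plain Java with a priority queue. It is an illustration of the technique, not Hadoop's actual merge code, and the spill contents and all names are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Plain-Java sketch of merging sorted spill runs (k-way merge).
class SpillMergeSketch {

    // Merge several individually sorted runs into one sorted list.
    static List<String> merge(List<List<String>> sortedRuns) {
        // Each heap entry is {run index, position in run}, ordered by the key it points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                (a, b) -> sortedRuns.get(a[0]).get(a[1])
                        .compareTo(sortedRuns.get(b[0]).get(b[1])));
        for (int run = 0; run < sortedRuns.size(); run++) {
            if (!sortedRuns.get(run).isEmpty()) {
                heap.add(new int[] {run, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            List<String> run = sortedRuns.get(head[0]);
            merged.add(run.get(head[1]));
            // Advance within the run that just supplied the smallest key.
            if (head[1] + 1 < run.size()) {
                heap.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> spills = Arrays.asList(
                Arrays.asList("aaa", "ccc", "ddd"),
                Arrays.asList("bbb", "ccc"),
                Arrays.asList("aaa", "bbb", "ddd"));
        System.out.println(merge(spills));
    }
}
```

No run is ever re-sorted; the merge only compares the current head of each run, which is why equal keys from different runs end up adjacent.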

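Summary of the shuffle sort: if a job defines only a map function and no reduce function, then after the shuffle sort the records with the same key are adjacent and smaller keys come first, so the output is ordered by key. This ordering holds within each partition and not necessarily across the whole output: with the default HashPartitioner a key's partition is chosen by its hash value, so IntWritable keys, for example, are sorted inside each partition. The values belonging to one key are not sorted.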

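6. How secondary sort works in MapReduce and how to implement it

(1) The task

We want to turn the sample.txt file below into the target file that follows:

Input file: sample.txt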

bbb 654

ccc 534

ddd 423

aaa 754

bbb 842

ccc 120

ddd 219

aaa 344

bbb 214

ccc 547

ddd 654

aaa 122

bbb 102

ccc 479

ddd 742

aaa 146

Target: part-r-00000

aaa 122

bbb 102

ccc 120

ddd 219

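(2) How it works

Outline:

1. Define a composite key that holds both the natural key and the natural value; in this example it is MyPariWritable.

2. Define a comparator for the composite key that sorts records by natural key and then by natural value (so "aaa 122" forms a single key).

3. The partitioner for the composite key (this example uses the default HashPartitioner) and the grouping comparator must both consider only the natural key. With HashPartitioner this means the composite key's hashCode has to depend on the natural key alone; otherwise the result is only correct with a single reducer.

In detail:

In the map phase, the InputFormat set with job.setInputFormatClass splits the input data set into chunks (splits) and supplies a RecordReader implementation. This example uses TextInputFormat, whose RecordReader uses the byte offset of each line within the file as the key and the text of the line as the value. That is why the custom mapper's input is <LongWritable, Text>. The framework then calls the mapper's map method once for each <LongWritable, Text> pair. The output must match the mapper's declared output types, <MyPariWritable, NullWritable>, and the result is a list of <MyPariWritable, NullWritable> pairs. At the end of the map phase, the partitioner class set with job.setPartitionerClass partitions this list, with each partition mapped to one reducer. Within each partition the records are sorted with the key comparator class set by job.setSortComparatorClass. This is already a secondary sort.

In the reduce phase, once a reducer has received all the map output assigned to it, it again uses the key comparator class set with job.setSortComparatorClass to order all the pairs. It then builds a value iterator for each key, and this is where grouping comes in, using the grouping comparator class set with job.setGroupingComparatorClass. Any two keys that this comparator reports as equal belong to the same group; their values go into one value iterator, and the key presented for that iterator is the first key of the group. In this example we want the minimum value for each natural key, so the grouping comparator compares only the natural key of MyPariWritable. Two keys with the same natural key therefore fall into the same group, and because the data is already sorted by the MyPariWritable comparator, the group's first key carries the minimum value for that natural key. Finally the reducer's reduce method is called with each key and its value iterator. As before, the input and output types must match those declared by the custom Reducer.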
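The sort-then-group mechanism can be simulated in plain Java on the sample data, without Hadoop. The sketch below assumes a single reducer and uses hypothetical names; SORT plays the role of the sort comparator and GROUP the role of the grouping comparator. Taking the first composite key of each group yields the minimum value per natural key, which matches the target file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of secondary sort: sort by (key, value), group by key.
class SecondarySortSketch {

    // Composite key: natural key plus the value that should be sorted.
    static final class Pair {
        final String first;
        final int second;
        Pair(String first, int second) { this.first = first; this.second = second; }
        @Override public String toString() { return first + " " + second; }
    }

    // Sort comparator: natural key first, then value.
    static final Comparator<Pair> SORT =
            Comparator.comparing((Pair p) -> p.first).thenComparingInt(p -> p.second);

    // Grouping comparator: natural key only.
    static final Comparator<Pair> GROUP = Comparator.comparing((Pair p) -> p.first);

    // Returns the first composite key of each group, i.e. the minimum per natural key.
    static List<String> minPerKey(List<Pair> mapOutput) {
        List<Pair> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);
        List<String> result = new ArrayList<>();
        Pair previous = null;
        for (Pair current : sorted) {
            if (previous == null || GROUP.compare(previous, current) != 0) {
                result.add(current.toString()); // first key of a new group
            }
            previous = current;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Pair> input = Arrays.asList(
                new Pair("bbb", 654), new Pair("ccc", 534), new Pair("ddd", 423), new Pair("aaa", 754),
                new Pair("bbb", 842), new Pair("ccc", 120), new Pair("ddd", 219), new Pair("aaa", 344),
                new Pair("bbb", 214), new Pair("ccc", 547), new Pair("ddd", 654), new Pair("aaa", 122),
                new Pair("bbb", 102), new Pair("ccc", 479), new Pair("ddd", 742), new Pair("aaa", 146));
        for (String line : minPerKey(input)) {
            System.out.println(line);
        }
    }
}
```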

(3) Implementation

package com.hadoop;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WritableSample extends Configured implements Tool {
	@Override
	public int run(String[] args) throws Exception {
		Configuration conf=getConf();
		// new Job(conf) is kept from the original; Hadoop 2.x deprecates it in favor of Job.getInstance(conf).
		Job job=new Job(conf);
		job.setJarByClass(WritableSample.class);
		// Remove any previous output directory so the job can be rerun.
		FileSystem fs=FileSystem.get(conf);
		fs.delete(new Path("out"),true);
		FileInputFormat.addInputPath(job, new Path("sample.txt"));
		FileOutputFormat.setOutputPath(job, new Path("out"));
		job.setMapperClass(MyWritableMap.class);
		// The map output types default to these job output types.
		job.setOutputKeyClass(MyPariWritable.class);
		job.setOutputValueClass(NullWritable.class);
		job.setReducerClass(MyWriableReduce.class);
		// Sort by (natural key, value); group by natural key only.
		job.setSortComparatorClass(PairKeyComparator.class);
		job.setGroupingComparatorClass(GroupComparatored.class);
		// Report job failure through the return code instead of always returning 0.
		return job.waitForCompletion(true) ? 0 : 1;
	}
	public static void main(String[] args) throws Exception {
		System.exit(ToolRunner.run(new WritableSample(), args));
	}
}
class MyPariWritable implements WritableComparable<MyPariWritable>{
	Text first;
	IntWritable second;
	public void set(Text first,IntWritable second){
		this.first=first;
		this.second=second;
	}
	public Text getFirst(){
		return this.first;
	}
	public IntWritable getSecond(){
		return this.second;
	}
	@Override
	public void readFields(DataInput in) throws IOException {
		// Deserialize in the same order that write() uses.
		first=new Text(in.readUTF());
		second=new IntWritable(in.readInt());
	}
	@Override
	public void write(DataOutput out) throws IOException {
		// Let I/O errors propagate instead of swallowing them.
		out.writeUTF(first.toString());
		out.writeInt(second.get());
	}
	@Override
	public int compareTo(MyPariWritable o) {
		// Compare by content, not by reference: natural key first, then value.
		int cmp=this.first.toString().compareTo(o.getFirst().toString());
		if(cmp!=0){
			return cmp;
		}
		// Integer.compare avoids the overflow that subtraction can cause.
		return Integer.compare(this.second.get(), o.getSecond().get());
	}
	@Override
	public String toString() {
		return first.toString()+" "+second.get();
	}
	@Override
	public boolean equals(Object obj) {
		if(!(obj instanceof MyPariWritable)){
			return false;
		}
		MyPariWritable temp=(MyPariWritable) obj;
		return first.equals(temp.first)&&second.equals(temp.second);
	}
	@Override
	public int hashCode() {
		// Hash on the natural key only, so the default HashPartitioner sends every
		// record with the same natural key to the same reducer.
		return first.hashCode();
	}
}
class MyWritableMap extends Mapper<LongWritable, Text, MyPariWritable, NullWritable>{
	MyPariWritable pair=new MyPariWritable();
	@Override
	protected void map(LongWritable key, Text value, Context context) throws IOException ,InterruptedException {
		// Each input line is "<natural key> <value>"; skip blank or malformed lines.
		String strs[]=value.toString().trim().split("\\s+");
		if(strs.length<2){
			return;
		}
		Text keyy=new Text(strs[0]);
		IntWritable valuee=new IntWritable(Integer.parseInt(strs[1]));
		pair.set(keyy, valuee);
		context.write(pair, NullWritable.get());
	}
}
class PairKeyComparator extends WritableComparator{
	public PairKeyComparator() {
		// true: let WritableComparator create key instances for deserialization.
		super(MyPariWritable.class,true);
	}
	@SuppressWarnings("rawtypes")
	@Override
	public int compare(WritableComparable a,  WritableComparable b) {
		MyPariWritable p1=(MyPariWritable) a;
		MyPariWritable p2=(MyPariWritable) b;
		// Order by natural key first, then by value.
		int cmp=p1.getFirst().toString().compareTo(p2.getFirst().toString());
		if(cmp!=0){
			return cmp;
		}
		// Integer.compare avoids the overflow that subtraction can cause.
		return Integer.compare(p1.getSecond().get(), p2.getSecond().get());
	}
}
class MyWriableReduce extends Reducer<MyPariWritable, NullWritable, MyPariWritable, NullWritable>{
	@Override
	protected void reduce(MyPariWritable key, Iterable<NullWritable> values, Context context) throws IOException ,InterruptedException {
		// The key passed in is the first (smallest) composite key of its group.
		context.write(key, NullWritable.get());
	}
}
class GroupComparatored extends WritableComparator{
	public GroupComparatored(){
		super(MyPariWritable.class,true);
	}
	@SuppressWarnings("rawtypes")
	@Override
	public int compare(WritableComparable a, WritableComparable b) {
		MyPariWritable p1=(MyPariWritable) a;
		MyPariWritable p2=(MyPariWritable) b;
		// Compare only the natural key: keys with the same natural key fall into one group,
		// whose first key carries the minimum value for that natural key.
		return p1.first.toString().compareTo(p2.first.toString());
	}
}

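The following is the SecondarySort example that ships with MapReduce; I rewrote it with almost no changes.
The map and reduce classes in this example are declared as follows. The important part is the definition of the input and output types (Java generics):
public static class Map extends Mapper<LongWritable, Text, IntPair, IntWritable>
public static class Reduce extends Reducer<IntPair, IntWritable, Text, IntWritable>
1 How it works:
In the map phase, the InputFormat set with job.setInputFormatClass splits the input data set into chunks (splits) and supplies a RecordReader implementation. This example uses TextInputFormat, whose RecordReader uses the byte offset of each line within the file as the key and the text of the line as the value. That is why the custom Map's input is <LongWritable, Text>. The framework then calls Map's map method once for each <LongWritable, Text> pair. The output must match the output types declared by the custom Map, <IntPair, IntWritable>, and the result is a list of <IntPair, IntWritable> pairs. At the end of the map phase, the partitioner class set with job.setPartitionerClass partitions this list, with each partition mapped to one reducer. Within each partition the records are sorted with the key comparator class set by job.setSortComparatorClass. This is already a secondary sort. If no key comparator class is set with job.setSortComparatorClass, the key's own compareTo method is used. The code in this part relies on the compareTo method implemented by IntPair rather than on a dedicated key comparator class.
In the reduce phase, once a reducer has received all the map output assigned to it, it again uses the key comparator class set with job.setSortComparatorClass to order all the pairs. It then builds a value iterator for each key, and this is where grouping comes in, using the grouping comparator class set with job.setGroupingComparatorClass. Any two keys that this comparator reports as equal belong to the same group; their values go into one value iterator, and the key presented for that iterator is the first key of the group. Finally the Reducer's reduce method is called with each key and its value iterator. As before, the input and output types must match those declared by the custom Reducer.
2 Secondary sort means sorting by the first field and then, among rows with the same first field, sorting by the second field, without disturbing the result of the first sort. For example:
Input file: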
20 21
50 51
50 52
50 53
50 54
60 51
60 53
60 52
60 56
60 57
70 58
60 61
70 54
70 55
70 56
70 57
70 58
1 2
3 4
5 6
7 82
203 21
50 512
50 522
50 53
530 54
40 511
20 53
20 522
60 56
60 57
740 58
63 61
730 54
71 55
71 56
73 57
74 58
12 211
31 42
50 62
7 8
Output (note that separator lines are required):
------------------------------------------------
1       2
------------------------------------------------
3       4
------------------------------------------------
5       6
------------------------------------------------
7       8
7       82
------------------------------------------------
12      211
------------------------------------------------
20      21
20      53
20      522
------------------------------------------------
31      42
------------------------------------------------
40      511
------------------------------------------------
50      51
50      52
50      53
50      53
50      54
50      62
50      512
50      522
------------------------------------------------
60      51
60      52
60      53
60      56
60      56
60      57
60      57
60      61
------------------------------------------------
63      61
------------------------------------------------
70      54
70      55
70      56
70      57
70      58
70      58
------------------------------------------------
71      55
71      56
------------------------------------------------
73      57
------------------------------------------------
74      58
------------------------------------------------
203     21
------------------------------------------------
530     54
------------------------------------------------
730     54
------------------------------------------------
740     58 
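3 Steps:
(1) Define a custom key
In MapReduce every key has to be compared and sorted, and two things happen to it: keys are first assigned to partitions, then sorted within each partition. This example needs a two-level order as well: by the first field, then by the second field among records with the same first field. So we build a composite class IntPair with two fields. The partitioner looks only at the first field, so that all records sharing it reach the same reducer, and the comparison within the partition orders records by the first field and then by the second.
Every custom key should implement the WritableComparable interface, because it must be both serializable and comparable, and should implement or override these methods:
// Deserialization: rebuild an IntPair from the binary stream
public void readFields(DataInput in) throws IOException
// Serialization: write the IntPair to the binary stream
public void write(DataOutput out) throws IOException
// Key comparison
public int compareTo(IntPair o)
// Two more methods that a new key class should override
// The hashCode() method is used by the HashPartitioner (the default partitioner in MapReduce)
public int hashCode()
public boolean equals(Object right)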
(2) Because the key is a custom type, the following classes also need to be defined:
(2.1) The partitioner class. This is the first decision made on the key.
public static class FirstPartitioner extends Partitioner<IntPair,IntWritable>
Register it on the job with setPartitionerClass.
(2.2) The key comparator class. This is the second comparison of the key. It is a comparator and extends WritableComparator.
public static class KeyComparator extends WritableComparator
It must have a constructor and override public int compare(WritableComparable w1, WritableComparable w2).
An alternative is to implement the RawComparator interface.
Register it on the job with setSortComparatorClass.
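(2.3) The grouping comparator class. In the reduce phase, when the value iterator for a key is built, all keys with the same first field belong to the same group and share one value iterator. This too is a comparator and extends WritableComparator.
public static class GroupingComparator extends WritableComparator
The grouping comparator class must also have a constructor and override public int compare(WritableComparable w1, WritableComparable w2).
An alternative is again to implement the RawComparator interface.
Register it on the job with setGroupingComparatorClass.

One more note: if the reducer's input and output types differ, do not reuse the reducer class as the combiner, because the combiner's output is the reducer's input. Define a separate combiner if one is needed.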
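The RawComparator alternative mentioned in (2.2) and (2.3) compares keys in their serialized form, without building key objects first. The plain-Java sketch below illustrates that idea for a key serialized as two big-endian ints, which is the layout that DataOutput.writeInt produces. The class and method names are hypothetical; in Hadoop, WritableComparator offers a static readInt helper for the decoding step.

```java
import java.nio.ByteBuffer;

// Plain-Java sketch of comparing serialized keys by their first field only.
class RawCompareSketch {

    // Serialize a (first, second) pair as two big-endian ints (ByteBuffer's default order).
    static byte[] serialize(int first, int second) {
        return ByteBuffer.allocate(8).putInt(first).putInt(second).array();
    }

    // Decode a big-endian int that starts at the given offset.
    static int readInt(byte[] b, int offset) {
        return ((b[offset] & 0xff) << 24)
                | ((b[offset + 1] & 0xff) << 16)
                | ((b[offset + 2] & 0xff) << 8)
                | (b[offset + 3] & 0xff);
    }

    // Grouping-style comparison: look only at the first int of each serialized key.
    static int compareFirst(byte[] b1, int s1, byte[] b2, int s2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    public static void main(String[] args) {
        byte[] a = serialize(50, 51);
        byte[] b = serialize(50, 522);
        byte[] c = serialize(60, 51);
        System.out.println(compareFirst(a, 0, b, 0)); // same first field
        System.out.println(compareFirst(a, 0, c, 0)); // 50 sorts before 60
    }
}
```

Decoding the int before comparing keeps negative values in the right order, which a plain byte-by-byte comparison of the serialized form would not.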
4 Code.
This example does not use a key comparator class; it relies on the key's compareTo method.

package secondarySort;
import java.io.DataInput;  
import java.io.DataOutput;  
import java.io.IOException;  
import java.util.StringTokenizer;  
import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.fs.Path;  
import org.apache.hadoop.io.IntWritable;  
import org.apache.hadoop.io.LongWritable;  
import org.apache.hadoop.io.Text;  
import org.apache.hadoop.io.WritableComparable;  
import org.apache.hadoop.io.WritableComparator;  
import org.apache.hadoop.mapreduce.Job;  
import org.apache.hadoop.mapreduce.Mapper;  
import org.apache.hadoop.mapreduce.Partitioner;  
import org.apache.hadoop.mapreduce.Reducer;  
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;  
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;  
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;  
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;  
  
public class SecondarySort {  
    // A custom key class should implement the WritableComparable interface
    public static class IntPair implements WritableComparable<IntPair> {  
        int first;  
        int second;  
        /** 
         * Set the left and right values. 
         */  
        public void set(int left, int right) {  
            first = left;  
            second = right;  
        }  
        public int getFirst() {  
            return first;  
        }  
        public int getSecond() {  
            return second;  
        }  
        @Override  
        // Deserialization: rebuild the IntPair from the binary stream
        public void readFields(DataInput in) throws IOException {  
            // TODO Auto-generated method stub  
            first = in.readInt();  
            second = in.readInt();  
        }  
        @Override  
        // Serialization: write the IntPair to the binary stream
        public void write(DataOutput out) throws IOException {  
            // TODO Auto-generated method stub  
            out.writeInt(first);  
            out.writeInt(second);  
        }  
        @Override  
        // Key comparison: by first, then by second
        public int compareTo(IntPair o) {  
            // TODO Auto-generated method stub  
            if (first != o.first) {  
                return first < o.first ? -1 : 1;  
            } else if (second != o.second) {  
                return second < o.second ? -1 : 1;  
            } else {  
                return 0;  
            }  
        }  
          
        // Two methods that a new key class should override
        @Override  
        //The hashCode() method is used by the HashPartitioner (the default partitioner in MapReduce)  
        public int hashCode() {  
            return first * 157 + second;  
        }  
        @Override  
        public boolean equals(Object right) {  
            if (right == null)  
                return false;  
            if (this == right)  
                return true;  
            if (right instanceof IntPair) {  
                IntPair r = (IntPair) right;  
                return r.first == first && r.second == second;  
            } else {  
                return false;  
            }  
        }  
    }  
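     /**
       * Partitioner class. Chooses the partition from first only.
       */
      public static class FirstPartitioner extends Partitioner<IntPair,IntWritable>{
        @Override
        public int getPartition(IntPair key, IntWritable value,
                                int numPartitions) {
          // Mask the sign bit instead of using Math.abs, which stays negative for Integer.MIN_VALUE.
          return ((key.getFirst() * 127) & Integer.MAX_VALUE) % numPartitions;
        }
      }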
        
      /**
       * Grouping comparator class. Keys with the same first field belong to the same group.
       */
    /*// Option 1: implement the RawComparator interface
    public static class GroupingComparator implements RawComparator<IntPair> { 
        @Override 
        public int compare(IntPair o1, IntPair o2) { 
            int l = o1.getFirst(); 
            int r = o2.getFirst(); 
            return l == r ? 0 : (l < r ? -1 : 1); 
        } 
        @Override 
        // Compares the first 4 bytes (the serialized first field) byte by byte; the first differing byte decides the result.
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2){ 
            // TODO Auto-generated method stub 
             return WritableComparator.compareBytes(b1, s1, Integer.SIZE/8,  
                     b2, s2, Integer.SIZE/8); 
        } 
    }*/  
    // Option 2: extend WritableComparator
    public static class GroupingComparator extends WritableComparator {  
          protected GroupingComparator() {  
            super(IntPair.class, true);  
          }  
          @Override  
          //Compare two WritableComparables.  
          public int compare(WritableComparable w1, WritableComparable w2) {  
            IntPair ip1 = (IntPair) w1;  
            IntPair ip2 = (IntPair) w2;  
            int l = ip1.getFirst();  
            int r = ip2.getFirst();  
            return l == r ? 0 : (l < r ? -1 : 1);  
          }  
        }  
      
          
    // Custom mapper: emits (first, second) as the composite key and second as the value
    public static class Map extends  
            Mapper<LongWritable, Text, IntPair, IntWritable> {  
        private final IntPair intkey = new IntPair();  
        private final IntWritable intvalue = new IntWritable();  
        public void map(LongWritable key, Text value, Context context)  
                throws IOException, InterruptedException {  
            String line = value.toString();  
            StringTokenizer tokenizer = new StringTokenizer(line);  
            int left = 0;  
            int right = 0;  
            if (tokenizer.hasMoreTokens()) {  
                left = Integer.parseInt(tokenizer.nextToken());  
                if (tokenizer.hasMoreTokens())  
                    right = Integer.parseInt(tokenizer.nextToken());  
                intkey.set(left, right);  
                intvalue.set(right);  
                context.write(intkey, intvalue);  
            }  
        }  
    }  
    // Custom reducer: writes a separator line, then every value of the group
    public static class Reduce extends  
            Reducer<IntPair, IntWritable, Text, IntWritable> {  
        private final Text left = new Text();  
        private static final Text SEPARATOR =   
              new Text("------------------------------------------------");  
        public void reduce(IntPair key, Iterable<IntWritable> values,  
                Context context) throws IOException, InterruptedException {  
            context.write(SEPARATOR, null);  
            left.set(Integer.toString(key.getFirst()));  
            for (IntWritable val : values) {  
                context.write(left, val);  
            }  
        }  
    }  
    /** 
     * @param args 
     */  
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {  
        // TODO Auto-generated method stub  
        // Load the Hadoop configuration
        Configuration conf = new Configuration();
        // Create the job (new Job(...) is kept from the original; Hadoop 2.x prefers Job.getInstance(conf, name))
        Job job = new Job(conf, "secondarysort");
        job.setJarByClass(SecondarySort.class);
        // Mapper class
        job.setMapperClass(Map.class);
        // No combiner: Reduce emits <Text, IntWritable>, which does not match the <IntPair, IntWritable> input that Reduce expects
        //job.setCombinerClass(Reduce.class);
        // Reducer class
        job.setReducerClass(Reduce.class);
        // Partitioner
        job.setPartitionerClass(FirstPartitioner.class);
        // Grouping comparator
        job.setGroupingComparatorClass(GroupingComparator.class);

        // Map output key type
        job.setMapOutputKeyClass(IntPair.class);
        // Map output value type
        job.setMapOutputValueClass(IntWritable.class);
        // Reduce output key type; Text is used because the OutputFormat class is TextOutputFormat
        job.setOutputKeyClass(Text.class);
        // Reduce output value type
        job.setOutputValueClass(IntWritable.class);

        // Splits the input data set into chunks (splits) and provides a RecordReader implementation.
        job.setInputFormatClass(TextInputFormat.class);
        // Provides a RecordWriter implementation that writes the output.
        job.setOutputFormatClass(TextOutputFormat.class);

        // Input path on HDFS
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Output path on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit the job and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);  
    }  
}
