Implementing Secondary Sort in Hadoop (Part 1)

Hadoop's MapReduce model supports sorting by key: after a MapReduce job, the results come out ordered by key. In many applications, however, we also need to sort the set of values mapped to each key, which is what a "secondary sort" does.

In the "Secondary Sort" section on page 227 of Hadoop: The Definitive Guide, the example is <year, temperature>: the map phase distributes temperatures by year, and the reduce phase sorts the temperatures belonging to the same year.

This article uses the <String, int> format as an example: first a secondary sort for these known types, then a secondary sort for a generic <key, value> combination.

Suppose the input is the following sequence of pairs:

str1 3
str2 2
str1 1
str3 9
str2 10

The expected output is:

str1 1,3
str2 2,10
str3 9

1 Secondary sort of the value set

(1) First define a TextInt class that wraps a String and an int into a single object. Since TextInt is used as the key in MapReduce and must be comparable, it has to implement the WritableComparable interface:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class TextInt implements WritableComparable<TextInt> {

    private String text;
    private int value;

    public TextInt() {}

    public TextInt(String text, int value) {
        this.text = text;
        this.value = value;
    }

    public String getFirst() {
        return this.text;
    }

    public int getSecond() {
        return this.value;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        text = in.readUTF();
        value = in.readInt();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(text);
        out.writeInt(value);
    }

    @Override
    public int compareTo(TextInt that) {
        return this.text.compareTo(that.text);
    }
}

(2) We read the input file with KeyValueTextInputFormat, which splits each line into a <key, value> pair on the tab (or other configured) separator; both key and value are of type Text. In the map phase we therefore parse the value back into an int and wrap the String and the int into a TextInt:

@Override
public void map(Text key, Text value,
        OutputCollector<TextInt, IntWritable> output, Reporter reporter) throws IOException {
    int intValue = Integer.parseInt(value.toString());
    TextInt ti = new TextInt(key.toString(), intValue);
    output.collect(ti, new IntWritable(intValue));
}

(3) In the reduce phase, to make the output easy to inspect, we concatenate all int values belonging to the same string:

@Override
public void reduce(TextInt key, Iterator<IntWritable> values,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    StringBuffer combineValue = new StringBuffer();
    while (values.hasNext()) {
        int value = values.next().get();
        if (combineValue.length() > 0) {
            combineValue.append(","); // separator only between values, so no trailing comma
        }
        combineValue.append(value);
    }
    output.collect(new Text(key.getFirst()), new Text(combineValue.toString()));
}

(4) The map and reduce above are no different from an ordinary MapReduce job and are easy to follow, but what actually implements the secondary sort are the following two comparators and one partitioner.

1) TextIntComparator: compares the String part of TextInt first, then the int value;

public static class TextIntComparator extends WritableComparator {
    public TextIntComparator() {
        super(TextInt.class, true); // register the comparator for TextInt
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        TextInt o1 = (TextInt) a;
        TextInt o2 = (TextInt) b;
        if (!o1.getFirst().equals(o2.getFirst())) {
            return o1.getFirst().compareTo(o2.getFirst());
        } else {
            return o1.getSecond() - o2.getSecond();
        }
    }
}

2) TextComparator: compares only the String part of TextInt.

public static class TextComparator extends WritableComparator {
    public TextComparator() {
        super(TextInt.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        TextInt o1 = (TextInt) a;
        TextInt o2 = (TextInt) b;
        return o1.getFirst().compareTo(o2.getFirst());
    }
}

3) PartitionByText: partitions TextInt objects by their String part:

public static class PartitionByText implements Partitioner<TextInt, IntWritable> {
    @Override
    public int getPartition(TextInt key, IntWritable value, int numPartitions) {
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    @Override
    public void configure(JobConf job) {}
}

(5) OK, with the groundwork done, here is the actual job setup:

// ... define input & output paths
// define the job
JobConf conf = new JobConf(Join.class);
conf.setJobName("sort by value");
// add input path
FileInputFormat.addInputPath(conf, new Path(input));
conf.setInputFormat(KeyValueTextInputFormat.class);
conf.setMapperClass(Mapper.class);
conf.setMapOutputKeyClass(TextInt.class);
conf.setMapOutputValueClass(IntWritable.class);
conf.setOutputKeyComparatorClass(TextIntComparator.class);
conf.setOutputValueGroupingComparator(TextComparator.class);
conf.setPartitionerClass(PartitionByText.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setReducerClass(Reduce.class);
conf.setOutputFormat(TextOutputFormat.class);
// output path
FileOutputFormat.setOutputPath(conf, new Path(output));
JobClient.runJob(conf);

After running this job, the output files contain results like the following:

str1 1,3
str2 2,10
str3 9

The above showed how to implement it; beyond the "what", let us also understand the "why". First, all int values sharing the same string must be sent to the same reducer, which is what PartitionByText does. Second, TextComparator controls how values are grouped in the reduce phase, i.e. it gathers all int values with the same string into one group. Finally, TextIntComparator determines the order in which the keys, and therefore the values, are seen in the reduce phase.
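
As a small local illustration (not part of the MapReduce job), the sketch below shows how these pieces interact; it assumes the TextInt class defined above is on the classpath, and the list, map and printing exist only for demonstration.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SecondarySortSketch {
    public static void main(String[] args) {
        List<TextInt> keys = new ArrayList<TextInt>(Arrays.asList(
                new TextInt("str1", 3), new TextInt("str2", 2),
                new TextInt("str1", 1), new TextInt("str3", 9),
                new TextInt("str2", 10)));

        // What setOutputKeyComparatorClass(TextIntComparator.class) achieves:
        // keys reach the reducer sorted by string first, then by int.
        keys.sort(Comparator.comparing(TextInt::getFirst)
                            .thenComparingInt(TextInt::getSecond));

        // What setOutputValueGroupingComparator(TextComparator.class) achieves:
        // keys that compare equal on the string alone fall into the same reduce() call.
        Map<String, List<Integer>> grouped = new LinkedHashMap<String, List<Integer>>();
        for (TextInt k : keys) {
            grouped.computeIfAbsent(k.getFirst(), s -> new ArrayList<Integer>()).add(k.getSecond());
        }
        grouped.forEach((s, vals) -> System.out.println(s + " " + vals));
        // Prints: str1 [1, 3]  /  str2 [2, 10]  /  str3 [9]
    }
}

PartitionByText, in turn, only guarantees that all TextInt keys sharing the same string land on the same reducer, so the two comparators always see the complete set of values for each string.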

2 Secondary sort with generic values

In the previous section, a key and value of known types were wrapped into a TextInt object for the secondary sort. In many cases we need a secondary sort over keys or values of other types, or the value is a pair or tuple and we want multi-level sorting; in those cases a generic key or value can be used.

Define a generic key/value composite object; both the key and the value must implement the WritableComparable interface.

public static class CombinedObject implements WritableComparable {

    private static Configuration conf = new Configuration();
    private Class<? extends WritableComparable> firstClass;
    private Class<? extends WritableComparable> secondClass;
    private WritableComparable first;
    private WritableComparable second;

    public CombinedObject() {}

    public CombinedObject(Class<? extends WritableComparable> keyClass,
            Class<? extends WritableComparable> valueClass) {
        if (keyClass == null || valueClass == null) {
            throw new IllegalArgumentException("null valueClass");
        }
        this.firstClass = keyClass;
        this.secondClass = valueClass;
        first = ReflectionUtils.newInstance(firstClass, conf);
        second = ReflectionUtils.newInstance(secondClass, conf);
    }

    public CombinedObject(WritableComparable f, WritableComparable s) {
        this(f.getClass(), s.getClass());
        setFirst(f);
        setSecond(s);
    }

    public void setFirst(WritableComparable key) {
        this.first = key;
    }

    public void setSecond(WritableComparable value) {
        this.second = value;
    }

    public WritableComparable getFirst() {
        return this.first;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        String firstClassName = in.readUTF();
        String secondClassName = in.readUTF();
        firstClass = (Class<? extends WritableComparable>) WritableName.getClass(firstClassName, conf);
        secondClass = (Class<? extends WritableComparable>) WritableName.getClass(secondClassName, conf);
        first = ReflectionUtils.newInstance(firstClass, conf);
        second = ReflectionUtils.newInstance(secondClass, conf);
        first.readFields(in);
        second.readFields(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(firstClass.getName());
        out.writeUTF(secondClass.getName());
        first.write(out);
        second.write(out);
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof CombinedObject) {
            CombinedObject that = (CombinedObject) o;
            return that.first.equals(this.first);
        }
        return false;
    }

    @Override
    public int hashCode() {
        return first.hashCode();
    }

    @Override
    public int compareTo(Object o) {
        CombinedObject that = (CombinedObject) o;
        return that.first.compareTo(this.first);
    }
}

Note:

In the Writable methods write(DataOutput out) and readFields(DataInput in), the serialization and deserialization of firstClass and secondClass must be handled as well.

In the equals, hashCode and compareTo implementations, only the first field needs to be considered.
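
For example, under these rules two CombinedObject instances that share the same first but differ in second belong to the same reduce group. A small hypothetical illustration using Text and IntWritable parts:

CombinedObject a = new CombinedObject(new Text("str1"), new IntWritable(1));
CombinedObject b = new CombinedObject(new Text("str1"), new IntWritable(3));
System.out.println(a.equals(b));                  // true: only first is compared
System.out.println(a.hashCode() == b.hashCode()); // true: hashCode comes from first
System.out.println(a.compareTo(b));               // 0: the first parts are equal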

KeyComparator: compares only first.

public static class KeyComparator extends WritableComparator {
    public KeyComparator() {
        super(CombinedObject.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        CombinedObject o1 = (CombinedObject) a;
        CombinedObject o2 = (CombinedObject) b;
        return o1.first.compareTo(o2.first);
    }
}

KeyValueComparator: compares first, then second.

public static class KeyValueComparator extends WritableComparator {
    public KeyValueComparator() {
        super(CombinedObject.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        CombinedObject o1 = (CombinedObject) a;
        CombinedObject o2 = (CombinedObject) b;
        if (!o1.first.equals(o2.first)) {
            return o1.first.compareTo(o2.first);
        } else {
            return o2.second.compareTo(o1.second); // descending
        }
    }
}

KeyPartitioner: partitions CombinedObject by first.

public static class KeyPartitioner implements Partitioner<CombinedObject, Writable> {
    @Override
    public int getPartition(CombinedObject key, Writable value, int numPartitions) {
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    @Override
    public void configure(JobConf job) {}
}


The concrete job setup:

job.setMapperClass(ContentMapper.class); 
job.setMapOutputKeyClass(CombinedObject.class); 
// sort value list 
job.setOutputKeyComparatorClass(KeyValueComparator.class); 
job.setOutputValueGroupingComparator(KeyComparator.class); 
job.setPartitionerClass(KeyPartitioner.class); 
job.setReducerClass(SameInfoReducer.class); 
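
The original article does not show ContentMapper or SameInfoReducer. Purely as an illustration, a minimal ContentMapper might look like the sketch below, assuming KeyValueTextInputFormat input whose key becomes the Text first part and whose value parses into an IntWritable second part (the parsing and the emitted value are assumptions, not taken from the article):

public static class ContentMapper extends MapReduceBase
        implements Mapper<Text, Text, CombinedObject, IntWritable> {
    @Override
    public void map(Text key, Text value,
            OutputCollector<CombinedObject, IntWritable> output, Reporter reporter)
            throws IOException {
        // Wrap the generic first/second parts into the composite key; the value
        // emitted alongside it is only for the reducer's convenience.
        IntWritable second = new IntWritable(Integer.parseInt(value.toString()));
        output.collect(new CombinedObject(new Text(key.toString()), second), second);
    }
}

With a mapper like this, job.setMapOutputValueClass(IntWritable.class) would also need to be set alongside the calls shown above.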

This section introduced secondary sorting for generic keys and values. You can design sorts over more <key, value> combinations, or build the value from a custom multi-element tuple, to achieve more flexible orderings.


This article originally appeared in Hadoop Developer (《hadoop开发者》), Issue 2.
