基于mapreduce的 Hadoop join 实现分析(二)

最新推荐文章于 2024-10-04 18:47:42 发布

tylgoodluck

最新推荐文章于 2024-10-04 18:47:42 发布

阅读量655

点赞数

分类专栏： hadoop 文章标签： mapreduce join hadoop 数据结构 import path

hadoop 专栏收录该内容

18 篇文章 0 订阅

订阅专栏

基于mapreduce的Hadoop join实现分析(二)

标签： hadoop mapreduce join

2009-11-22 17:00

上次我们讨论了基于mapreduce的join的实现,在上次讨论的最后,我们对这个实现进行了总结,最主要的问题就是实现的可扩展性,由于在reduce端我们通过一个List数据结构保存了所有的某个外键的对应的所有人员信息,而List的最大值为Integer.MAX_VALUE,所以在数据量巨大的时候,会造成List越界的错误.所以对这个实现的优化显得很有必要.

我们再来看一下这个例子,现在有两组数据:一组为单位人员信息,如下:

人员ID 人员名称地址ID

1 张三 1

2 李四 2

3 王五 1

4 赵六 3

5 马七 3

另外一组为地址信息:

地址ID 地址名称

1 北京

2 上海

3 广州

结合第一种实现方式,我们看到第一种方式最需要改进的地方就是如果对于某个地址ID的迭代器values,如果values的第一个元素是地址信息的话,那么,我们就不需要缓存所有的人员信息了.如果第一个元素是地址信息,我们读出地址信息后,后来就全部是人员信息,那么就可以将人员的地址置为相应的地址.

现在我们回头看看mapreduce的partition和shuffle的过程,partitioner的主要功能是根据reduce的数量将map输出的结果进行分块,将数据送入到相应的reducer,所有的partitioner都必须实现Partitioner接口并实现getPartition方法,该方法的返回值为int类型,并且取值范围在0-numOfReducer-1,从而能够将map的输出输入到相应的reducer中,对于某个mapreduce过程,Hadoop框架定义了默认的partitioner为HashPartition,该Partitioner使用key的hashCode来决定将该key输送到哪个reducer;shuffle将每个partitioner输出的结果根据key进行group以及排序,将具有相同key的value构成一个valeus的迭代器,并根据key进行排序分别调用开发者定义的reduce方法进行归并.从shuffle的过程我们可以看出key之间需要进行比较,通过比较才能知道某两个key是否相等或者进行排序,因此mapduce的所有的key必须实现comparable接口的compareto()方法从而实现两个key对象之间的比较.

回到我们的问题,我们想要的是将地址信息在排序的过程中排到最前面,前面我们只通过locId进行比较的方法就不够用了,因为其无法标识出是地址表中的数据还是人员表中的数据.因此,我们需要实现自己定义的Key数据结构,完成在想共同locId的情况下地址表更小的需求.由于map的中间结果需要写到磁盘上,因此必须实现writable接口.具体实现如下:

import java.io.DataInput;

import java.io.DataOutput;

import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class RecordKey implements WritableComparable<RecordKey>{

int keyId;

boolean isPrimary;

public void readFields(DataInput in) throws IOException {

// TODO Auto-generated method stub

this.keyId = in.readInt();

this.isPrimary = in.readBoolean();

}

public void write(DataOutput out) throws IOException {

// TODO Auto-generated method stub

out.writeInt(keyId);

out.writeBoolean(isPrimary);

}

public int compareTo(RecordKey k) {

// TODO Auto-generated method stub

if(this.keyId == k.keyId){

if(k.isPrimary == this.isPrimary)

return 0;

return this.isPrimary? -1:1;

}else

return this.keyId > k.keyId?1:-1;

}

@Override

public int hashCode() {

return this.keyId;

}

这个key的数据结构中需要解释的方法就是compareTo方法,该方法完成了在keyId相同的情况下,确保地址数据比人员数据小.

有了这个数据结构,我们又发现了一个新的问题------就是shuffle的group过程,shuffle的group过程默认使用的是key的compareTo()方法.刚才我们添加的自定义Key没有办法将具有相同的locId的地址和人员放到同一个group中(因为从compareTo方法中可以看出他们是不相等的).不过hadoop框架提供了OutputValueGoupingComparator可以让使用者自定义key的group信息.我们需要的就是自己定义个groupingComparator就可以啦!看看这个比较器吧!

import org.apache.commons.logging.Log;

import org.apache.commons.logging.LogFactory;

import org.apache.hadoop.io.WritableComparable;

import org.apache.hadoop.io.WritableComparator;

public class PkFkComparator extends WritableComparator {

public PkFkComparator(){

super(RecordKey.class);

}

@Override

public int compare(WritableComparable a, WritableComparable b) {

RecordKey key1 = (RecordKey)a;

RecordKey key2 = (RecordKey)b;

System.out.println("call compare");

if(key1.keyId == key2.keyId){

return 0;

}else

return key1.keyId > key2.keyId?1:-1;

}

这里我们重写了compare方法,将两个具有相同的keyId的数据设为相等.

好了,有了这两个辅助工具,剩下的就比较简单了.写mapper,reducer,以及主程序.

import java.io.IOException;

import org.apache.hadoop.mapred.MapReduceBase;

import org.apache.hadoop.mapred.Mapper;

import org.apache.hadoop.mapred.OutputCollector;

import org.apache.hadoop.mapred.Reporter;

import org.apache.hadoop.io.*;

public class JoinMapper extends MapReduceBase

implements Mapper<LongWritable, Text, RecordKey, Record> {

public void map(LongWritable key, Text value,

OutputCollector<RecordKey, Record> output, Reporter reporter)

throws IOException {

String line = value.toString();

String[] values = line.split(",");

if(values.length == 2){ //这里使用记录的长度来区别地址信息与人员信息,当然可以通过其他方式(如文件名等)来实现

Record reco = new Record();

reco.locId = values[0];

reco.type = 2;

reco.locationName = values[1];

RecordKey recoKey = new RecordKey();

recoKey.keyId = Integer.parseInt(values[0]);

recoKey.isPrimary = true;

output.collect(recoKey, reco);

}else{

Record reco = new Record();

reco.locId = values[2];

reco.empId = values[0];

reco.empName = values[1];

reco.type = 1;

RecordKey recoKey = new RecordKey();

recoKey.keyId = Integer.parseInt(values[2]);

recoKey.isPrimary = false;

output.collect(recoKey, reco);

}

import java.io.IOException;

import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.MapReduceBase;

import org.apache.hadoop.mapred.OutputCollector;

import org.apache.hadoop.mapred.Reducer;

import org.apache.hadoop.mapred.Reporter;

public class JoinReducer extends MapReduceBase implements

Reducer<RecordKey, Record, LongWritable, Text> {

public void reduce(RecordKey key, Iterator<Record> values,

OutputCollector<LongWritable, Text> output,

Reporter reporter) throws IOException {

Record thisLocation= new Record();

while (values.hasNext()){

Record reco = values.next();

if(reco.type == 2){ //2 is the location

thisLocation = new Record(reco);

System.out.println("location is "+ thisLocation.locationName);

}else{ //1 is employee

reco.locationName =thisLocation.locationName;

System.out.println("emp is "+reco.toString());

output.collect(new LongWritable(0), new Text(reco.toString()));

}

import org.apache.hadoop.fs.FileSystem;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.SequenceFile;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.FileInputFormat;

import org.apache.hadoop.mapred.FileOutputFormat;

import org.apache.hadoop.mapred.JobClient;

import org.apache.hadoop.mapred.JobConf;

import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class Join {

/**

* @param args

public static void main(String[] args) throws Exception {

// TODO Auto-generated method stub

JobConf conf = new JobConf(Join.class);

conf.setJobName("Join");

FileSystem fstm = FileSystem.get(conf);

Path outDir = new Path("/Users/outputtest");

fstm.delete(outDir, true);

conf.setOutputFormat(SequenceFileOutputFormat.class);

conf.setMapOutputKeyClass(RecordKey.class);

conf.setMapOutputValueClass(Record.class);

conf.setOutputKeyClass(LongWritable.class);

conf.setOutputValueClass(Text.class);

conf.setMapperClass(JoinMapper.class);

conf.setReducerClass(JoinReducer.class);

conf.setOutputValueGroupingComparator(PkFkComparator.class);

FileInputFormat.setInputPaths(conf, new Path(

"/user/input/join"));

FileOutputFormat.setOutputPath(conf, outDir);

JobClient.runJob(conf);

Path outPutFile = new Path(outDir, "part-00000");

SequenceFile.Reader reader = new SequenceFile.Reader(fstm, outPutFile,

conf);

org.apache.hadoop.io.Text numInside = new Text();

LongWritable numOutside = new LongWritable();

while (reader.next(numOutside, numInside)) {

System.out.println(numInside.toString() + " "

+ numOutside.toString());

}

reader.close();

}

好了,基本的程序就在这里了.这就是一个比较完整的join的实现,这里对数据中的噪声没有进行处理,如果数据中有噪声数据,可能会导致程序的运行错误,还需要进一步提高程序的健壮性.

转自： http://labs.chinamobile.com/mblog/4110_32505

tylgoodluck

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录