Hadoop Example 10, Join Walkthrough 3: Resolving Each Person's Address ID to an Address Name (Output Format: Person ID, Name, Address), Optimized Approach

Latest recommended article published on 2020-03-16 20:51:25

garychenqin

Tags: hadoop, optimization, example, extension, data

Article link: https://blog.csdn.net/garychenqin/article/details/48248051

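1. Raw data
Person records, one per line with tab-separated columns: person ID, person name, address ID

The other data set holds the address information, again tab-separated:
address ID, address name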

    1 北京
    2 上海
    3 广州

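2. Processing notes
This installment continues from the previous one. Reviewing that implementation, its main weakness is scalability: on the reduce side it keeps every person record that shares a foreign key in a List.
A List can hold at most Integer.MAX_VALUE elements, and because the whole list lives in the reducer's heap, memory usually runs out well before that limit once a single address ID has a very large number of people. Optimizing the implementation is therefore worthwhile.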
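3. Optimization notes
Looking back at the first implementation, the point most in need of improvement is this: if the first element of the values iterator for an address ID is the address record, the person records no longer have to be cached.
Once the address record has been read, everything that follows is a person record, so each person's address can be set to the matching address name as soon as the record is read.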
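The idea can be sketched in plain Java, without Hadoop on the classpath. This is only an illustration: the `Row` class, the `joinGroup` helper and the sample rows are hypothetical and do not come from the article, and the sketch assumes the rows of one address ID arrive with the address row first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for one record of a reduce group (not a class from the article).
class Row {
    final boolean isAddress;  // true for an address row, false for a person row
    final String id;          // person ID (unused for address rows)
    final String name;        // person name, or the address name for address rows

    Row(boolean isAddress, String id, String name) {
        this.isAddress = isAddress;
        this.id = id;
        this.name = name;
    }
}

class StreamingJoinSketch {
    // Joins one group in a single pass. Only the address name is kept in memory;
    // each person row is emitted as soon as it is read.
    static List<String> joinGroup(Iterator<Row> values) {
        List<String> out = new ArrayList<>();  // stands in for context.write(...)
        String addressName = null;
        while (values.hasNext()) {
            Row row = values.next();
            if (row.isAddress) {
                addressName = row.name;  // assumed to arrive before any person row
            } else {
                out.add(row.id + "\t" + row.name + "\t" + addressName);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample group: one address row followed by two person rows.
        List<Row> group = Arrays.asList(
                new Row(true, "", "Beijing"),
                new Row(false, "p1", "personA"),
                new Row(false, "p2", "personB"));
        for (String line : joinGroup(group.iterator())) {
            System.out.println(line);
        }
    }
}
```

Memory use no longer grows with the number of people in the group, which is the property the rest of this section works to guarantee by controlling the sort order.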
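Now recall the partition and shuffle phases of MapReduce. The partitioner splits the map output according to the number of reducers and sends each record to the matching reducer.
Every partitioner provides a getPartition method, either by implementing the Partitioner interface of the older mapred API or by extending the abstract Partitioner class of the mapreduce API used in this article. The method returns an int in the range 0 to numReduceTasks - 1,
which selects the reducer that receives the record. For a MapReduce job the framework's default partitioner is HashPartitioner,
which uses the key's hashCode() to decide which reducer a key is sent to. The shuffle then sorts the data of each partition by key and groups it:
values that share a key are gathered into one values iterator, and the developer's reduce method is called once per group, in key order, to merge them.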
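As a small illustration of the default routing rule, the following plain-Java sketch repeats the arithmetic that HashPartitioner applies to a key's hash code. The class and method names (`PartitionSketch`, `partitionFor`) are made up for this example, and the reducer count of 3 is an arbitrary assumption.

```java
class PartitionSketch {
    // Same arithmetic as Hadoop's HashPartitioner.getPartition: clear the sign bit so
    // the result is never negative, then take the remainder by the number of reducers.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;  // arbitrary value chosen for the illustration
        for (int locId = 1; locId <= 5; locId++) {
            // Integer.hashCode() is the int value itself.
            System.out.println("locId " + locId + " -> reducer "
                    + partitionFor(Integer.valueOf(locId), reducers));
        }
    }
}
```

Two keys with the same hash code always land on the same reducer, which is why the custom key later in this article hashes on the address ID alone.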
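The shuffle therefore has to compare keys: only by comparing them can it tell whether two keys are equal, or put them in order.
This is why MapReduce keys implement WritableComparable, whose compareTo() method (inherited from Comparable) compares two key objects.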
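Back to our problem: we want the address record to be sorted to the front. Comparing by locId alone, as before, is no longer enough,
because it cannot tell whether a record comes from the address table or from the person table. We therefore define our own key type, in which the address record compares as smaller when two keys share the same locId.
Because the intermediate map output is serialized and written to disk, the key must also implement the Writable interface. The implementation is as follows: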

4. Building the key used for sorting

package cn.edu.bjut.jointwo;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class UserKey implements WritableComparable<UserKey> {

    private int keyId;
    private boolean isPrimary;

    public void write(DataOutput out) throws IOException {
        out.writeInt(keyId);
        out.writeBoolean(isPrimary);

    }

    public void readFields(DataInput in) throws IOException {
        this.keyId = in.readInt();
        this.isPrimary = in.readBoolean();
    }

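    @Override
    public int compareTo(UserKey o) {
        // Order by keyId first.
        if (this.keyId != o.keyId) {
            return Integer.compare(this.keyId, o.keyId);
        }
        // Same keyId: false sorts before true, so the address record (isPrimary == false
        // in JoinMapper) reaches the reducer ahead of the person records.
        return Boolean.compare(this.isPrimary, o.isPrimary);
    }

    @Override
    public int hashCode() {
        // The default HashPartitioner routes a record by the key's hashCode(), so hashing
        // on keyId alone sends an address and its people to the same reducer.
        return this.getKeyId();
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof UserKey)) {
            return false;
        }
        UserKey other = (UserKey) obj;
        return keyId == other.keyId && isPrimary == other.isPrimary;
    }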

    public int getKeyId() {
        return keyId;
    }

    public void setKeyId(int keyId) {
        this.keyId = keyId;
    }

    public boolean isPrimary() {
        return isPrimary;
    }

    public void setPrimary(boolean isPrimary) {
        this.isPrimary = isPrimary;
    }

}
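To see the effect of this ordering without running a job, here is a minimal plain-Java sketch. `KeySketch` is a hypothetical stand-in that copies only the comparison rule of UserKey (it is not a Writable), and the labels "address" and "person" follow the way JoinMapper in section 6 sets the flag: false for address records, true for person records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in that copies only UserKey's ordering, so it runs without Hadoop.
class KeySketch implements Comparable<KeySketch> {
    final int keyId;
    final boolean isPrimary;  // false = address record, true = person record

    KeySketch(int keyId, boolean isPrimary) {
        this.keyId = keyId;
        this.isPrimary = isPrimary;
    }

    @Override
    public int compareTo(KeySketch o) {
        if (keyId != o.keyId) {
            return Integer.compare(keyId, o.keyId);
        }
        return Boolean.compare(isPrimary, o.isPrimary);  // false sorts before true
    }

    @Override
    public String toString() {
        return keyId + (isPrimary ? ":person" : ":address");
    }
}

class SortOrderSketch {
    // Returns a sorted copy, mimicking the order in which a reducer would see the keys.
    static List<KeySketch> sorted(List<KeySketch> keys) {
        List<KeySketch> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<KeySketch> keys = Arrays.asList(
                new KeySketch(2, true), new KeySketch(1, true),
                new KeySketch(2, false), new KeySketch(1, false));
        System.out.println(sorted(keys));
    }
}
```

Within each keyId the address key comes out first, which is exactly the precondition the streaming reducer relies on.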

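5. Building the comparator used for grouping
With this key type in place, a new problem appears in the grouping step of the shuffle, which by default relies on the key's compareTo() method.
The custom key we just added cannot put an address and the people with the same locId into one group, because compareTo() reports them as unequal.
Hadoop, however, lets the job define how keys are grouped: setOutputValueGroupingComparator on JobConf in the older API, and setGroupingComparatorClass on Job in the API used here.
All we need is to define our own grouping comparator. Here it is: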

package cn.edu.bjut.jointwo;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class MyComparator extends WritableComparator {

    public MyComparator() {
        // true: let WritableComparator create UserKey instances for deserialization
        super(UserKey.class, true);
    }

    // Grouping ignores isPrimary: keys with the same keyId belong to one reduce() call.
    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        UserKey k1 = (UserKey) a;
        UserKey k2 = (UserKey) b;
        return Integer.compare(k1.getKeyId(), k2.getKeyId());
    }
}
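The difference between sorting and grouping can be illustrated with a small plain-Java sketch. It is a simplification of what the framework does, under the assumption that a new reduce group starts whenever the grouping comparator reports two adjacent sorted keys as different. The `GroupingSketch` class and the `{keyId, flag}` array encoding (flag 0 for an address key, 1 for a person key) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class GroupingSketch {
    // Each key is {keyId, flag}; flag 0 = address key, 1 = person key (hypothetical encoding).
    // SORT mirrors UserKey.compareTo, GROUP mirrors MyComparator.compare.
    static final Comparator<int[]> SORT = (a, b) ->
            a[0] != b[0] ? Integer.compare(a[0], b[0]) : Integer.compare(a[1], b[1]);
    static final Comparator<int[]> GROUP = (a, b) -> Integer.compare(a[0], b[0]);

    // Walks the sorted keys and starts a new group whenever the grouping
    // comparator says the current key differs from the previous one.
    static List<List<int[]>> group(List<int[]> sortedKeys, Comparator<int[]> grouping) {
        List<List<int[]>> groups = new ArrayList<>();
        int[] previous = null;
        for (int[] key : sortedKeys) {
            if (previous == null || grouping.compare(previous, key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<int[]> keys = new ArrayList<>(Arrays.asList(
                new int[]{2, 1}, new int[]{1, 1}, new int[]{1, 0}, new int[]{2, 0}));
        keys.sort(SORT);
        System.out.println("groups using the sort comparator: " + group(keys, SORT).size());
        System.out.println("groups using the grouping comparator: " + group(keys, GROUP).size());
    }
}
```

With the sort comparator every key forms its own group, so an address would never meet its people; with the grouping comparator the four keys collapse into two groups, each starting with the address key.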

6. Map program:

package cn.edu.bjut.jointwo;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class JoinMapper extends Mapper<LongWritable, Text, UserKey, Member> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
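        String line = value.toString();
        String[] arr = line.split("\t");
        if (arr.length < 2) {
            return; // skip blank or malformed lines
        }
        if (arr.length >= 3) {
            // Person record: person ID, name, address ID.
            Member m = new Member();
            m.setUserNo(arr[0]);
            m.setUserName(arr[1]);
            m.setCityNo(arr[2]);

            UserKey userKey = new UserKey();
            userKey.setKeyId(Integer.parseInt(arr[2].trim()));
            userKey.setPrimary(true);  // person records sort after the address record
            context.write(userKey, m);
        } else {
            // Address record: address ID, address name.
            Member m = new Member();
            m.setCityNo(arr[0]);
            m.setCityName(arr[1]);

            UserKey userKey = new UserKey();
            userKey.setKeyId(Integer.parseInt(arr[0].trim()));
            userKey.setPrimary(false); // the address record sorts first within its group
            context.write(userKey, m);
        }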
    }

}
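The Member value class used by the mapper and reducer is not shown in this article. The following is a hypothetical sketch of what such a class could look like, inferred only from the methods the mapper and reducer call; the field layout, the tab-separated toString() format and the `MemberSketchDemo` helper are assumptions. To keep the sketch runnable without Hadoop it does not declare the interface, but in a real job the class would be public and declared as implementing org.apache.hadoop.io.Writable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of the Member value class; the original is not shown in the article.
class Member {
    // Empty strings instead of null, because DataOutput.writeUTF rejects null.
    private String userNo = "";
    private String userName = "";
    private String cityNo = "";
    private String cityName = "";

    Member() { }  // a no-argument constructor is needed for deserialization

    Member(Member other) {  // copy constructor, as used by the reducer
        this.userNo = other.userNo;
        this.userName = other.userName;
        this.cityNo = other.cityNo;
        this.cityName = other.cityName;
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(userNo);
        out.writeUTF(userName);
        out.writeUTF(cityNo);
        out.writeUTF(cityName);
    }

    public void readFields(DataInput in) throws IOException {
        userNo = in.readUTF();  // fields are read in the order they were written
        userName = in.readUTF();
        cityNo = in.readUTF();
        cityName = in.readUTF();
    }

    private static String orEmpty(String s) { return s == null ? "" : s; }

    public void setUserNo(String v) { userNo = orEmpty(v); }
    public void setUserName(String v) { userName = orEmpty(v); }
    public void setCityNo(String v) { cityNo = orEmpty(v); }
    public void setCityName(String v) { cityName = orEmpty(v); }
    public String getCityName() { return cityName; }

    @Override
    public String toString() {  // assumed output format: person ID, name, address
        return userNo + "\t" + userName + "\t" + cityName;
    }
}

class MemberSketchDemo {
    // Serializes a Member to bytes and reads it back, as the shuffle would.
    static Member roundTrip(Member m) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            m.write(new DataOutputStream(bytes));
            Member copy = new Member();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Member m = new Member();
        m.setUserNo("7");          // made-up sample values
        m.setUserName("personA");
        m.setCityNo("1");
        m.setCityName("Beijing");
        System.out.println(roundTrip(m));
    }
}
```

The copy constructor matters because Hadoop reuses the value object while iterating over a reduce group, so a record that must outlive the current iteration has to be copied.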

7. Reduce program:

package cn.edu.bjut.jointwo;

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinReducer extends Reducer<UserKey, Member, Text, NullWritable> {

    @Override
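    protected void reduce(UserKey key, Iterable<Member> values, Context context)
            throws IOException, InterruptedException {
        String cityName = null;
        for (Member member : values) {
            // Hadoop reuses the key object and refreshes it for every value, so
            // key.isPrimary() describes the record currently being read.
            if (!key.isPrimary()) {
                // Address record: it sorts first, so only its name has to be kept.
                cityName = member.getCityName();
            } else {
                // Person record: copy it (the value object is reused too) and fill in
                // the address name. The name stays null if the group has no address record.
                Member person = new Member(member);
                person.setCityName(cityName);
                context.write(new Text(person.toString()), NullWritable.get());
            }
        }
    }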

}

8. Main program:

package cn.edu.bjut.jointwo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MainJob {

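    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: MainJob <input path> <output path>");
            System.exit(2);
        }
        Configuration conf = new Configuration();
        // Job.getInstance replaces the deprecated new Job(conf, name) constructor.
        Job job = Job.getInstance(conf, "jointwo");
        job.setJarByClass(MainJob.class);

        job.setMapperClass(JoinMapper.class);
        job.setMapOutputKeyClass(UserKey.class);
        job.setMapOutputValueClass(Member.class);

        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // Group by keyId only, so an address and its people share one reduce() call.
        job.setGroupingComparatorClass(MyComparator.class);

        // The mapper tells the two record types apart by column count, so the input
        // path is expected to contain both the person file and the address file.
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Remove any previous output directory so the job can be rerun.
        Path outPathDir = new Path(args[1]);
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(outPathDir)) {
            fs.delete(outPathDir, true);
        }

        FileOutputFormat.setOutputPath(job, outPathDir);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }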

}