Details Determine Success: MapReduce Task in Practice, the Reduce-Side Join

1. Task Description

There are two files on HDFS.

Student information file: hdfs://***.***.***:8020/user/train/joinjob/student.txt
Comma-separated; the first column is the student ID, which is unique per student, and the second column is the student's name.


  2016001,Join
  2016002,Abigail
  2016003,Abby
  2016004,Alexandra
  2016005,Cathy
  2016006,Katherine
Student score file: hdfs://***.***.***:8020/user/train/joinjob/student_score.txt
Comma-separated; the first column is the student ID, the second column is the course code, and the third column is the score. A student ID may appear on multiple lines, one per course.


  2016001,YY,60
  2016001,SX,88
  2016001,YW,91
  2016002,SX,77
  2016002,YW,33
  ...
The expected result file looks like this:


  2016001,Join,YY,60
  2016001,Join,SX,88
  2016001,Join,YW,91
  2016002,Abigail,SX,77
  2016002,Abigail,YW,33
  2016002,Abigail,YY,56
  2016003,Abby,YY,34
  2016003,Abby,SX,84
  2016003,Abby,YW,69
  2016004,Alexandra,YY,89
  2016004,Alexandra,SX,84
  ...

2. Task Analysis

  1. This is a task that joins two datasets. A join can be performed on the map side or on the reduce side. When one of the datasets is small enough to fit entirely in memory, a map-side join is usually the better choice (a minimal sketch follows this list).
  2. When both datasets are large and neither fits entirely in memory, use a reduce-side join; a Bloom filter can also be used to improve efficiency, by filtering out records whose join key cannot possibly match before they are shuffled.
  3. This task is implemented with a reduce-side join.
  4. The topics involved: a custom key, a custom GroupingComparator, configuring multiple Mapper classes, and configuring multiple input paths.
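For contrast, here is a minimal map-side join sketch. It is not part of the original article: it assumes student.txt is small enough to load into a HashMap in setup(), and the input path read there is illustrative only.

  // A minimal map-side join sketch (for contrast; this article uses a
  // reduce-side join). Assumes the student file fits in memory.
  package join;

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.util.HashMap;
  import java.util.Map;

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class MapJoinMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
          private final Map<String, String> names = new HashMap<String, String>();
          private final Text out = new Text();

          @Override
          protected void setup(Context context) throws IOException {
                  // Load the small dataset (sid -> name) into memory once per mapper.
                  FileSystem fs = FileSystem.get(context.getConfiguration());
                  BufferedReader reader = new BufferedReader(
                                  new InputStreamReader(fs.open(new Path("joinjob/student.txt"))));
                  try {
                          String line;
                          while ((line = reader.readLine()) != null) {
                                  String[] f = line.split(",");
                                  names.put(f[0], f[1]);
                          }
                  } finally {
                          reader.close();
                  }
          }

          @Override
          protected void map(LongWritable key, Text value, Context context)
                          throws IOException, InterruptedException {
                  // Each input record is a score line: sid,course,score.
                  String[] f = value.toString().split(",");
                  String name = names.get(f[0]);
                  if (name != null) { // inner join: drop scores with no matching student
                          out.set(f[0] + "," + name + "," + f[1] + "," + f[2]);
                          context.write(NullWritable.get(), out);
                  }
          }
  }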
3. Implementation Approach

    1. Write two Mappers: StudentMapper reads and processes the contents of student.txt; ScoreMapper reads and processes the contents of student_score.txt.
    2. The output of both mappers must be processed by the reduce function of the same JoinReducer.
    3. Records emitted by the two mappers for the same student must enter the same reduce call, where they are joined; at the same time, each record must be marked as coming either from the student file or from the score file.
    4. The number of score records for a single student ID may be very large, so the reducer should not have to buffer all values before processing them. Instead, the student-information value is ordered ahead of the score values: the student's name can then be read at the start of the values iteration, saved, and used directly while iterating over the score records.
    5. Based on this analysis, the task is completed as follows:
       1) Define a custom key, StudentCustomKey, with two properties: the student ID and the source (marking whether the record comes from the student file or the score file). Implement compareTo so that keys sort ascending by student ID and descending by source ("student" must sort ahead of "score"); a tiny standalone check of this ordering follows the list.
       2) StudentMapper reads and processes student.txt and emits a (StudentCustomKey, Text) pair, setting the key's source to "student".
       3) ScoreMapper reads and processes student_score.txt and emits a (StudentCustomKey, Text) pair, setting the key's source to "score".
       4) To guarantee that records with the same student ID enter the same reduce call, implement a custom GroupingComparator that compares only the sid (pay attention to the comparator's constructor).
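To make the ordering concrete: compareTo negates the comparison on source, so a key tagged "student" sorts ahead of a "score" key with the same sid. A tiny standalone check, using the StudentCustomKey class defined in the next section:

  // Tiny standalone check of the key ordering; relies on the
  // StudentCustomKey class shown in the Implementation Code section.
  package join;

  public class KeyOrderCheck {
          public static void main(String[] args) {
                  StudentCustomKey student = new StudentCustomKey();
                  student.setSid("2016001");
                  student.setSource("student");

                  StudentCustomKey score = new StudentCustomKey();
                  score.setSid("2016001");
                  score.setSource("score");

                  // "student".compareTo("score") > 0, and compareTo negates it,
                  // so the student key sorts ahead of every score key with the same sid.
                  System.out.println(student.compareTo(score) < 0); // prints true
          }
  }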
4. Implementation Code


The custom key:


package join;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class StudentCustomKey implements WritableComparable<StudentCustomKey> {
        private String sid;     // student ID
        private String source;  // "student" or "score": which file the record came from

        @Override
        public void write(DataOutput out) throws IOException {
                out.writeUTF(sid);
                out.writeUTF(source);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
                this.sid = in.readUTF();
                this.source = in.readUTF();
        }

        @Override
        public int compareTo(StudentCustomKey key) {
                // Ascending by student ID ...
                int r = this.sid.compareTo(key.sid);
                if (r == 0)
                        // ... and descending by source, so "student" sorts before "score".
                        r = -this.source.compareTo(key.source);
                return r;
        }

        public String getSid() {
                return sid;
        }

        public void setSid(String sid) {
                this.sid = sid;
        }

        public String getSource() {
                return source;
        }

        public void setSource(String source) {
                this.source = source;
        }

        @Override
        public String toString() {
                return this.sid + "," + this.source;
        }
}

The custom GroupingComparator:


package join;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class GroupComparator extends WritableComparator {

        public GroupComparator() {
                // The second argument (true) tells WritableComparator to create
                // StudentCustomKey instances, so that its byte-level compare() can
                // deserialize the keys and delegate to the compare() below.
                // Constructed without it, grouping fails at runtime.
                super(StudentCustomKey.class, true);
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
                // Group on the student ID only, ignoring the source, so that the
                // student record and all score records for one sid share one reduce call.
                StudentCustomKey key1 = (StudentCustomKey) a;
                StudentCustomKey key2 = (StudentCustomKey) b;
                return key1.getSid().compareTo(key2.getSid());
        }
}

The job class, containing the two Mappers and the Reducer:


package join;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;

public class ReduceJoinJob extends Configured implements Tool {

        // Reads student.txt and tags each record with source "student".
        public static class StudentMapper extends Mapper<LongWritable, Text, StudentCustomKey, Text> {
                private StudentCustomKey newKey = new StudentCustomKey();

                @Override
                protected void map(LongWritable key, Text value, Context context)
                                throws IOException, InterruptedException {
                        String[] words = StringUtils.split(value.toString(), ',');
                        newKey.setSid(words[0]);
                        newKey.setSource("student");
                        context.write(newKey, value);
                }
        }

        // Reads student_score.txt and tags each record with source "score".
        public static class ScoreMapper extends Mapper<LongWritable, Text, StudentCustomKey, Text> {
                private StudentCustomKey newKey = new StudentCustomKey();

                @Override
                protected void map(LongWritable key, Text value, Context context)
                                throws IOException, InterruptedException {
                        String[] words = StringUtils.split(value.toString(), ',');
                        newKey.setSid(words[0]);
                        newKey.setSource("score");
                        context.write(newKey, value);
                }
        }

        public static class JoinReducer extends Reducer<StudentCustomKey, Text, NullWritable, Text> {
                Logger log = Logger.getLogger(getClass());
                private Text newValue = new Text();

                @Override
                protected void reduce(StudentCustomKey key, Iterable<Text> values, Context context)
                                throws IOException, InterruptedException {
                        String name = "";
                        for (Text v : values) {
                                // The framework reuses the key object and refills it for each
                                // value, so key.getSource() reflects the current record even
                                // though all records for one sid share this reduce call.
                                if (key.getSource().equals("student")) {
                                        // The student record sorts first; remember the name.
                                        name = StringUtils.split(v.toString(), ',')[1];
                                        continue;
                                }
                                // Score record: prepend the saved name and emit.
                                newValue.set(name + "," + v.toString());
                                context.write(NullWritable.get(), newValue);
                        }
                }
        }

        @Override
        public int run(String[] args) throws Exception {
                Job job = Job.getInstance(getConf(), "ReduceJoinJob");
                job.setJarByClass(getClass());
                Configuration conf = job.getConfiguration();

                // One input path and one Mapper class per source file.
                MultipleInputs.addInputPath(job, new Path("joinjob/student.txt"), TextInputFormat.class, StudentMapper.class);
                MultipleInputs.addInputPath(job, new Path("joinjob/student_score.txt"), TextInputFormat.class, ScoreMapper.class);
                Path out = new Path("joinjob/output1");
                TextOutputFormat.setOutputPath(job, out);
                FileSystem.get(conf).delete(out, true); // remove output from previous runs

                job.setOutputFormatClass(TextOutputFormat.class);
                job.setReducerClass(JoinReducer.class);

                job.setMapOutputKeyClass(StudentCustomKey.class);
                job.setMapOutputValueClass(Text.class);
                job.setOutputKeyClass(NullWritable.class);
                job.setOutputValueClass(Text.class);

                job.setGroupingComparatorClass(GroupComparator.class);
                return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) {
                int r = 0;
                try {
                        r = ToolRunner.run(new Configuration(), new ReduceJoinJob(), args);
                } catch (Exception e) {
                        e.printStackTrace();
                }
                System.exit(r);
        }
}
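A caveat not covered in the original text: the job above runs with the default single reducer, where partitioning is a non-issue. With more than one reducer, the default HashPartitioner would partition on StudentCustomKey.hashCode(), which the class above does not override, so the student record and the score records for one sid could land on different reducers. A minimal sketch of a fix is a partitioner keyed on sid alone (SidPartitioner is a suggested name, not part of the original code; it would be registered with job.setPartitionerClass(SidPartitioner.class)):

  // Minimal sketch: partition on the student ID only, so that all records
  // for one sid reach the same reducer regardless of their source tag.
  package join;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  public class SidPartitioner extends Partitioner<StudentCustomKey, Text> {
          @Override
          public int getPartition(StudentCustomKey key, Text value, int numPartitions) {
                  return (key.getSid().hashCode() & Integer.MAX_VALUE) % numPartitions;
          }
  }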

5. Package and Run
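Package the classes into a jar and submit it with the standard jar runner. Assuming the jar is named join.jar (the name is illustrative):

  hadoop jar join.jar join.ReduceJoinJob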


The results are as follows. Note that the student name appears first in each line: the reducer prepends the saved name to the original score record, so the column order differs from the expected form shown in the task description.


  Join,2016001,YY,60
  Join,2016001,SX,88
  Join,2016001,YW,91
  Abigail,2016002,SX,77
  Abigail,2016002,YW,33
  Abigail,2016002,YY,56
  Abby,2016003,YW,69
  Abby,2016003,SX,84
  Abby,2016003,YY,34
  Alexandra,2016004,YW,100
  Alexandra,2016004,SX,84
  Alexandra,2016004,YY,89
  Cathy,2016005,YW,63
  Cathy,2016005,YY,53
  Cathy,2016005,SX,43
  Katherine,2016006,SX,90
  Katherine,2016006,YW,90
  Katherine,2016006,YY,90
Related post: How Hadoop's GroupComparator takes effect (a source-code analysis)