1. Task Description
There are two files on HDFS. The student information file: hdfs://***.***.***:8020/user/train/joinjob/student.txt
It is comma-separated; the first column is the student ID (unique per student) and the second column is the student name.
2016001,Join
2016002,Abigail
2016003,Abby
2016004,Alexandra
2016005,Cathy
2016006,Katherine
The student score file: hdfs://***.***.***:8020/user/train/joinjob/student_score.txt
It is comma-separated; the first column is the student ID, the second column is the course code, and the third column is the score.
2016001,YY,60
2016001,SX,88
2016001,YW,91
2016002,SX,77
2016002,YW,33
.............
The expected result file looks like this:
2016001,Join,YY,60
2016001,Join,SX,88
2016001,Join,YW,91
2016002,Abigail,SX,77
2016002,Abigail,YW,33
2016002,Abigail,YY,56
2016003,Abby,YY,34
2016003,Abby,SX,84
2016003,Abby,YW,69
2016004,Alexandra,YY,89
2016004,Alexandra,SX,84
.......
2. Task Analysis
- This is a task that joins two datasets. A join can be done on the map side or on the reduce side. When one dataset is small enough to fit entirely in memory, a map-side join is usually the better choice.
- When both datasets are large and neither fits entirely in memory, use a reduce-side join; in that case a Bloom filter can also be used to improve efficiency.
- Here we implement the task with a reduce-side join.
- Topics covered: a custom key, a custom GroupingComparator, configuring multiple Mapper classes, and configuring multiple input paths.
3. Implementation Approach
- Write two Mappers:
- StudentMapper reads and processes the contents of student.txt.
- ScoreMapper reads and processes the contents of student_score.txt.
- The output of both mappers must be processed by the reduce function of JoinReducer.
- Records emitted by the two mappers for the same student must arrive in the same reduce call, where they are joined; each record must also be tagged with its origin (student information or score information).
- The number of score records for one student ID may be large, so we cannot buffer them all in the reducer before processing. Instead, the student-information value is ordered ahead of the score values.
- That way the student name can be read at the start of the values iteration and saved; the score records that follow can use the saved name directly.
- Based on this analysis, the task is completed as follows:
- 1. Define a custom key, StudentCustomKey, with two fields: the student ID and the source (marking whether the record comes from the student file or the score file). Implement compareTo so keys sort ascending by student ID and descending by source (so "student" sorts before "score").
- 2. StudentMapper reads and processes student.txt and emits (StudentCustomKey, Text), setting the key's source to "student".
- 3. ScoreMapper reads and processes student_score.txt and emits (StudentCustomKey, Text), setting the key's source to "score".
- 4. To ensure records with the same student ID enter the same reduce call, implement a custom GroupingComparator that compares only the sid field. (Note the constructor of the custom GroupingComparator.)
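The sort contract in step 1 can be checked without Hadoop: a plain comparator with the same rule must place the student record ahead of that student's score records. The sketch below is standalone and illustrative (the class name is not part of the job):

```java
import java.util.ArrayList;
import java.util.List;

public class KeyOrderDemo {
    // Same rule as StudentCustomKey.compareTo:
    // ascending by sid, then descending by source.
    static List<String> sortedKeys() {
        List<String[]> keys = new ArrayList<>();
        keys.add(new String[]{"2016001", "score"});
        keys.add(new String[]{"2016002", "score"});
        keys.add(new String[]{"2016001", "student"});
        keys.add(new String[]{"2016002", "student"});
        keys.sort((a, b) -> {
            int r = a[0].compareTo(b[0]);
            return r != 0 ? r : -a[1].compareTo(b[1]);
        });
        List<String> out = new ArrayList<>();
        for (String[] k : keys)
            out.add(k[0] + "," + k[1]);
        return out;
    }

    public static void main(String[] args) {
        // The "student" record precedes the "score" records of the same sid.
        sortedKeys().forEach(System.out::println);
    }
}
```

Because "student".compareTo("score") is positive, negating it makes the student record sort first within each sid.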
4. Implementation Code

The custom key:
package join;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class StudentCustomKey implements WritableComparable<StudentCustomKey> {
    private String sid;
    private String source;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(sid);
        out.writeUTF(source);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.sid = in.readUTF();
        this.source = in.readUTF();
    }

    @Override
    public int compareTo(StudentCustomKey key) {
        // Ascending by sid; descending by source so "student" sorts before "score".
        int r = this.sid.compareTo(key.sid);
        if (r == 0)
            r = -this.source.compareTo(key.source);
        return r;
    }

    @Override
    public int hashCode() {
        // Partition by sid only, so all records of one student reach the
        // same reducer even when more than one reducer is configured.
        return sid.hashCode();
    }

    public String getSid() {
        return sid;
    }

    public void setSid(String sid) {
        this.sid = sid;
    }

    public String getSource() {
        return source;
    }

    public void setSource(String source) {
        this.source = source;
    }

    @Override
    public String toString() {
        return this.sid + "," + this.source;
    }
}
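Since write and readFields use writeUTF/readUTF, the key's wire format can be exercised with the plain java.io stream classes, without Hadoop on the classpath. This standalone sketch (class and method names are illustrative) round-trips the two fields in the same order the key serializes them:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class KeyWireFormatDemo {

    // Serialize sid and source in the same order StudentCustomKey.write does.
    static byte[] write(String sid, String source) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(sid);
            out.writeUTF(source);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Deserialize the fields in the same order readFields does.
    static String[] read(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            return new String[]{in.readUTF(), in.readUTF()};
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String[] fields = read(write("2016001", "student"));
        System.out.println(fields[0] + "," + fields[1]); // prints "2016001,student"
    }
}
```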
The custom GroupingComparator:
package join;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class GroupComparator extends WritableComparator {

    public GroupComparator() {
        // true: let the parent create StudentCustomKey instances
        // so compare() receives deserialized keys.
        super(StudentCustomKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Group by sid only: all records of one student
        // enter the same reduce() call.
        StudentCustomKey key1 = (StudentCustomKey) a;
        StudentCustomKey key2 = (StudentCustomKey) b;
        return key1.getSid().compareTo(key2.getSid());
    }
}
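The effect of grouping by sid alone can be seen in a standalone sketch (illustrative code, not the Hadoop comparator API): keys that differ only in source compare equal under the group rule, so consecutive sorted records with the same sid fall into one reduce group.

```java
public class GroupRuleDemo {
    // Group rule used by GroupComparator: compare sid only,
    // ignoring the source field. Each record is {sid, source}.
    static int groupCompare(String[] a, String[] b) {
        return a[0].compareTo(b[0]);
    }

    public static void main(String[] args) {
        String[] student = {"2016001", "student"};
        String[] score   = {"2016001", "score"};
        String[] other   = {"2016002", "student"};
        System.out.println(groupCompare(student, score)); // 0: same group
        System.out.println(groupCompare(student, other) < 0); // different group
    }
}
```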
The job class, including the two Mappers and the Reducer:
package join;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;

public class ReduceJoinJob extends Configured implements Tool {

    public static class StudentMapper extends Mapper<LongWritable, Text, StudentCustomKey, Text> {
        private StudentCustomKey newKey = new StudentCustomKey();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Line format: sid,name
            String[] words = StringUtils.split(value.toString(), ',');
            newKey.setSid(words[0]);
            newKey.setSource("student");
            context.write(newKey, value);
        }
    }

    public static class ScoreMapper extends Mapper<LongWritable, Text, StudentCustomKey, Text> {
        private StudentCustomKey newKey = new StudentCustomKey();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Line format: sid,course,score
            String[] words = StringUtils.split(value.toString(), ',');
            newKey.setSid(words[0]);
            newKey.setSource("score");
            context.write(newKey, value);
        }
    }

    public static class JoinReducer extends Reducer<StudentCustomKey, Text, NullWritable, Text> {
        Logger log = Logger.getLogger(getClass());
        private Text newValue = new Text();

        @Override
        protected void reduce(StudentCustomKey key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Hadoop refills the key object as the values iterator advances,
            // so key.getSource() reflects the origin of the current value.
            // The student record sorts first, so the name is known before
            // any score record is processed.
            String name = "";
            for (Text v : values) {
                if (key.getSource().equals("student")) {
                    name = StringUtils.split(v.toString(), ',')[1];
                    continue;
                }
                newValue.set(name + "," + v.toString());
                context.write(NullWritable.get(), newValue);
            }
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "ReduceJoinJob");
        job.setJarByClass(getClass());
        Configuration conf = job.getConfiguration();

        // One mapper class per input path.
        MultipleInputs.addInputPath(job, new Path("joinjob/student.txt"), TextInputFormat.class, StudentMapper.class);
        MultipleInputs.addInputPath(job, new Path("joinjob/student_score.txt"), TextInputFormat.class, ScoreMapper.class);
        Path out = new Path("joinjob/output1");
        TextOutputFormat.setOutputPath(job, out);
        FileSystem.get(conf).delete(out, true);

        job.setOutputFormatClass(TextOutputFormat.class);
        job.setReducerClass(JoinReducer.class);

        job.setMapOutputKeyClass(StudentCustomKey.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        job.setGroupingComparatorClass(GroupComparator.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) {
        int r = 0;
        try {
            r = ToolRunner.run(new Configuration(), new ReduceJoinJob(), args);
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.exit(r);
    }
}
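The reduce-side mechanics can be simulated without a cluster. After the shuffle, JoinReducer sees each student's records sorted student-first; the loop below (a standalone sketch with illustrative names, not Hadoop API code) applies the same logic as the reduce method:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReduceLoopDemo {
    // Records arrive sorted by (sid asc, source desc):
    // the student record first, then that student's scores.
    // Each record is {sid, source, original line}.
    static List<String> join(List<String[]> sorted) {
        List<String> out = new ArrayList<>();
        String name = "";
        for (String[] rec : sorted) {
            if (rec[1].equals("student")) {      // leading record: remember the name
                name = rec[2].split(",")[1];
                continue;
            }
            out.add(name + "," + rec[2]);        // score record: prepend the saved name
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> sorted = Arrays.asList(
            new String[]{"2016001", "student", "2016001,Join"},
            new String[]{"2016001", "score", "2016001,YY,60"},
            new String[]{"2016001", "score", "2016001,SX,88"});
        join(sorted).forEach(System.out::println);
    }
}
```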
5. Packaging and Running
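A typical way to package and submit the job is sketched below; the jar name, class output directory, and local file locations are examples, not taken from the original post, so adjust them to your build:

```shell
# Compile against the Hadoop client libraries and build the jar.
javac -cp "$(hadoop classpath)" -d classes join/*.java
jar cf join.jar -C classes .

# Upload the input files, then submit the job.
hdfs dfs -mkdir -p joinjob
hdfs dfs -put -f student.txt student_score.txt joinjob/
hadoop jar join.jar join.ReduceJoinJob

# Inspect the result.
hdfs dfs -cat 'joinjob/output1/part-r-*'
```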
The result is as follows.
Join,2016001,YY,60
Join,2016001,SX,88
Join,2016001,YW,91
Abigail,2016002,SX,77
Abigail,2016002,YW,33
Abigail,2016002,YY,56
Abby,2016003,YW,69
Abby,2016003,SX,84
Abby,2016003,YY,34
Alexandra,2016004,YW,100
Alexandra,2016004,SX,84
Alexandra,2016004,YY,89
Cathy,2016005,YW,63
Cathy,2016005,YY,53
Cathy,2016005,SX,43
Katherine,2016006,SX,90
Katherine,2016006,YW,90
Katherine,2016006,YY,90

Note that each line starts with the name rather than the student ID, because the reducer prepends the name to the original score record; to get the sid-first layout shown in the task description, split the score record and reorder the fields when building newValue.
Original post: http://blog.itpub.net/30066956/viewspace-2120133/