1. Task Description
There are two files on HDFS. The student information file: hdfs://***.***.***:8020/user/train/joinjob/student.txt
It is comma-separated; the first column is the student ID (unique per student) and the second column is the student name.
2016001,Join
2016002,Abigail
2016003,Abby
2016004,Alexandra
2016005,Cathy
2016006,Katherine
The student score file: hdfs://***.***.***:8020/user/train/joinjob/student_score.txt
It is comma-separated; the first column is the student ID, the second column is the course code, and the third column is the score.
2016001,YY,60
2016001,SX,88
2016001,YW,91
2016002,SX,77
2016002,YW,33
.............
The expected result file looks like this:
2016001,Join,YY,60
2016001,Join,SX,88
2016001,Join,YW,91
2016002,Abigail,SX,77
2016002,Abigail,YW,33
2016002,Abigail,YY,56
2016003,Abby,YY,34
2016003,Abby,SX,84
2016003,Abby,YW,69
2016004,Alexandra,YY,89
2016004,Alexandra,SX,84
.......
2. Task Analysis
- This is a task that joins two datasets. A join can be done on the map side or on the reduce side. When one dataset is small enough to fit entirely in memory, a map-side join is usually the better choice.
- When both datasets are large and neither fits entirely in memory, use a reduce-side join; in that case a Bloom filter can also be used to improve efficiency.
- Here we implement the task with a reduce-side join.
- Topics covered: a custom key, a custom GroupingComparator, configuring multiple Mapper classes, and configuring multiple input paths.
3. Implementation Approach
- Write two Mappers:
- StudentMapper reads and processes the contents of student.txt.
- ScoreMapper reads and processes the contents of student_score.txt.
- The output of both mappers must be processed by the reduce function of JoinReducer.
- Records emitted by the two mappers for the same student must arrive in the same reduce call, where they are joined; each record must also be tagged with its origin (student information or score information).
- The number of score records for one student ID may be large, so we cannot buffer them all in the reducer before processing. Instead, the student-information value is ordered ahead of the score values.
- That way the student name can be read at the start of the values iteration and saved; the score records that follow can use the saved name directly.
- Based on this analysis, the task is completed as follows:
- 1. Define a custom key, StudentCustomKey, with two fields: the student ID and the source (marking whether the record comes from the student file or the score file). Implement compareTo so keys sort ascending by student ID and descending by source (so "student" sorts before "score").
- 2. StudentMapper reads and processes student.txt and emits (StudentCustomKey, Text), setting the key's source to "student".
- 3. ScoreMapper reads and processes student_score.txt and emits (StudentCustomKey, Text), setting the key's source to "score".
- 4. To ensure records with the same student ID enter the same reduce call, implement a custom GroupingComparator that compares only the sid field. (Note the constructor of the custom GroupingComparator.)
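The sort contract in step 1 can be checked without Hadoop: a plain comparator with the same rule must place the student record ahead of that student's score records. The sketch below is standalone and illustrative (the class name is not part of the job):

```java
import java.util.ArrayList;
import java.util.List;

public class KeyOrderDemo {
    // Same rule as StudentCustomKey.compareTo:
    // ascending by sid, then descending by source.
    static List<String> sortedKeys() {
        List<String[]> keys = new ArrayList<>();
        keys.add(new String[]{"2016001", "score"});
        keys.add(new String[]{"2016002", "score"});
        keys.add(new String[]{"2016001", "student"});
        keys.add(new String[]{"2016002", "student"});
        keys.sort((a, b) -> {
            int r = a[0].compareTo(b[0]);
            return r != 0 ? r : -a[1].compareTo(b[1]);
        });
        List<String> out = new ArrayList<>();
        for (String[] k : keys)
            out.add(k[0] + "," + k[1]);
        return out;
    }

    public static void main(String[] args) {
        // The "student" record precedes the "score" records of the same sid.
        sortedKeys().forEach(System.out::println);
    }
}
```

Because "student".compareTo("score") is positive, negating it makes the student record sort first within each sid.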
4. Implementation Code

The custom key:
package join;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class StudentCustomKey implements WritableComparable<StudentCustomKey> {
    private String sid;
    private String source;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(sid);
        out.writeUTF(source);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.sid = in.readUTF();
        this.source = in.readUTF();
    }

    @Override
    public int compareTo(StudentCustomKey key) {
        // Ascending by sid; descending by source so "student" sorts before "score".
        int r = this.sid.compareTo(key.sid);
        if (r == 0)
            r = -this.source.compareTo(key.source);
        return r;
    }

    @Override
    public int hashCode() {
        // Partition by sid only, so all records of one student reach the
        // same reducer even when more than one reducer is configured.
        return sid.hashCode();
    }

    public String getSid() {
        return sid;
    }

    public void setSid(String sid) {
        this.sid = sid;
    }

    public String getSource() {
        return source;
    }

    public void setSource(String source) {
        this.source = source;
    }

    @Override
    public String toString() {
        return this.sid + "," + this.source;
    }
}
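Since write and readFields use writeUTF/readUTF, the key's wire format can be exercised with the plain java.io stream classes, without Hadoop on the classpath. This standalone sketch (class and method names are illustrative) round-trips the two fields in the same order the key serializes them:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class KeyWireFormatDemo {

    // Serialize sid and source in the same order StudentCustomKey.write does.
    static byte[] write(String sid, String source) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(sid);
            out.writeUTF(source);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Deserialize the fields in the same order readFields does.
    static String[] read(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            return new String[]{in.readUTF(), in.readUTF()};
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String[] fields = read(write("2016001", "student"));
        System.out.println(fields[0] + "," + fields[1]); // prints "2016001,student"
    }
}
```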
The custom GroupingComparator:
package join;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class GroupComparator extends WritableComparator {

    public GroupComparator() {
        // true: let the parent create StudentCustomKey instances
        // so compare() receives deserialized keys.
        super(StudentCustomKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Group by sid only: all records of one student
        // enter the same reduce() call.
        StudentCustomKey key1 = (StudentCustomKey) a;
        StudentCustomKey key2 = (StudentCustomKey) b;
        return key1.getSid().compareTo(key2.getSid());
    }
}
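The effect of grouping by sid alone can be seen in a standalone sketch (illustrative code, not the Hadoop comparator API): keys that differ only in source compare equal under the group rule, so consecutive sorted records with the same sid fall into one reduce group.

```java
public class GroupRuleDemo {
    // Group rule used by GroupComparator: compare sid only,
    // ignoring the source field. Each record is {sid, source}.
    static int groupCompare(String[] a, String[] b) {
        return a[0].compareTo(b[0]);
    }

    public static void main(String[] args) {
        String[] student = {"2016001", "student"};
        String[] score   = {"2016001", "score"};
        String[] other   = {"2016002", "student"};
        System.out.println(groupCompare(student, score)); // 0: same group
        System.out.println(groupCompare(student, other) < 0); // different group
    }
}
```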
The job class, including the two Mappers and the Reducer:
package join;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;

public class ReduceJoinJob extends Configured implements Tool {

    public static class StudentMapper extends Mapper<LongWritable, Text, StudentCustomKey, Text> {
        private StudentCustomKey newKey = new StudentCustomKey();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Line format: sid,name
            String[] words = StringUtils.split(value.toString(), ',');
            newKey.setSid(words[0]);
            newKey.setSource("student");
            context.write(newKey, value);
        }
    }

    public static class ScoreMapper extends Mapper<LongWritable, Text, StudentCustomKey, Text> {
        private StudentCustomKey newKey = new StudentCustomKey();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Line format: sid,course,score
            String[] words = StringUtils.split(value.toString(), ',');
            newKey.setSid(words[0]);
            newKey.setSource("score");
            context.write(newKey, value);
        }
    }

    public static class JoinReducer extends Reducer<StudentCustomKey, Text, NullWritable, Text> {
        Logger log = Logger.getLogger(getClass());
        private Text newValue = new Text();

        @Override
        protected void reduce(StudentCustomKey key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Hadoop refills the key object as the values iterator advances,
            // so key.getSource() reflects the origin of the current value.
            // The student record sorts first, so the name is known before
            // any score record is processed.
            String name = "";
            for (Text v : values) {
                if (key.getSource().equals("student")) {
                    name = StringUtils.split(v.toString(), ',')[1];
                    continue;
                }
                newValue.set(name + "," + v.toString());
                context.write(NullWritable.get(), newValue);
            }
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "ReduceJoinJob");
        job.setJarByClass(getClass());
        Configuration conf = job.getConfiguration();

        // One mapper class per input path.
        MultipleInputs.addInputPath(job, new Path("joinjob/student.txt"), TextInputFormat.class, StudentMapper.class);
        MultipleInputs.addInputPath(job, new Path("joinjob/student_score.txt"), TextInputFormat.class, ScoreMapper.class);
        Path out = new Path("joinjob/output1");
        TextOutputFormat.setOutputPath(job, out);
        FileSystem.get(conf).delete(out, true);

        job.setOutputFormatClass(TextOutputFormat.class);
        job.setReducerClass(JoinReducer.class);

        job.setMapOutputKeyClass(StudentCustomKey.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        job.setGroupingComparatorClass(GroupComparator.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) {
        int r = 0;
        try {
            r = ToolRunner.run(new Configuration(), new ReduceJoinJob(), args);
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.exit(r);
    }
}
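The reduce-side mechanics can be simulated without a cluster. After the shuffle, JoinReducer sees each student's records sorted student-first; the loop below (a standalone sketch with illustrative names, not Hadoop API code) applies the same logic as the reduce method:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReduceLoopDemo {
    // Records arrive sorted by (sid asc, source desc):
    // the student record first, then that student's scores.
    // Each record is {sid, source, original line}.
    static List<String> join(List<String[]> sorted) {
        List<String> out = new ArrayList<>();
        String name = "";
        for (String[] rec : sorted) {
            if (rec[1].equals("student")) {      // leading record: remember the name
                name = rec[2].split(",")[1];
                continue;
            }
            out.add(name + "," + rec[2]);        // score record: prepend the saved name
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> sorted = Arrays.asList(
            new String[]{"2016001", "student", "2016001,Join"},
            new String[]{"2016001", "score", "2016001,YY,60"},
            new String[]{"2016001", "score", "2016001,SX,88"});
        join(sorted).forEach(System.out::println);
    }
}
```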
5. Packaging and Running
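A typical way to package and submit the job is sketched below; the jar name, class output directory, and local file locations are examples, not taken from the original post, so adjust them to your build:

```shell
# Compile against the Hadoop client libraries and build the jar.
javac -cp "$(hadoop classpath)" -d classes join/*.java
jar cf join.jar -C classes .

# Upload the input files, then submit the job.
hdfs dfs -mkdir -p joinjob
hdfs dfs -put -f student.txt student_score.txt joinjob/
hadoop jar join.jar join.ReduceJoinJob

# Inspect the result.
hdfs dfs -cat 'joinjob/output1/part-r-*'
```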
The result is as follows.
Join,2016001,YY,60
Join,2016001,SX,88
Join,2016001,YW,91
Abigail,2016002,SX,77
Abigail,2016002,YW,33
Abigail,2016002,YY,56
Abby,2016003,YW,69
Abby,2016003,SX,84
Abby,2016003,YY,34
Alexandra,2016004,YW,100
Alexandra,2016004,SX,84
Alexandra,2016004,YY,89
Cathy,2016005,YW,63
Cathy,2016005,YY,53
Cathy,2016005,SX,43
Katherine,2016006,SX,90
Katherine,2016006,YW,90
Katherine,2016006,YY,90

Note that each line starts with the name rather than the student ID, because the reducer prepends the name to the original score record; to get the sid-first layout shown in the task description, split the score record and reorder the fields when building newValue.
Original post: http://blog.itpub.net/30066956/viewspace-2120133/