Processing files from different input directories in Hadoop

Most recent recommended article published on 2023-03-07 16:30:27

dy_252


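When writing mapred jobs you will inevitably have to handle joins.
The simplest kind of join is the one-to-one join.
The small example below shows how to implement a one-to-one join in mapred.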

name.txt
100 tom
101 mary
102 kate

score.txt
100 90
101 85
102 80

We want the following join result:
100 tom 90
101 mary 85
102 kate 80

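Approach:
The mapred job takes two input files, name.txt and score.txt. We need a marker that tells us which file each record came from, so the map step adds a file tag to every value it outputs.
The map output looks like this:
100 name+tom
100 score+90
In the reduce step, the prefix is then used to tell the records apart. Because this is a one-to-one join, it is enough to concatenate the different values that share a key and output the result.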
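The tag-then-join idea described above can be tried without a cluster. The following is a minimal sketch in plain Java, not the Hadoop job itself: the class name TaggedJoinSketch and its helper methods are hypothetical, and a TreeMap stands in for the shuffle that groups values by key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the tag-then-join idea; no Hadoop classes are used.
class TaggedJoinSketch {

    // "Map" step: prefix the value with a tag naming its source file.
    static String tag(String source, String value) {
        return source + "+" + value;
    }

    // "Reduce" step: split the tagged values of one key back into name and score.
    // Returns null when one side is missing (inner join).
    static String joinOne(List<String> taggedValues) {
        String name = "";
        String score = "";
        for (String v : taggedValues) {
            if (v.startsWith("name+")) {
                name = v.substring("name+".length());
            } else if (v.startsWith("score+")) {
                score = v.substring("score+".length());
            }
        }
        return (!name.isEmpty() && !score.isEmpty()) ? name + "\t" + score : null;
    }

    // Stand-in for the shuffle: group tagged values by key, then join each group.
    static List<String> join(String[][] names, String[][] scores) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String[] r : names) {
            groups.computeIfAbsent(r[0], k -> new ArrayList<>()).add(tag("name", r[1]));
        }
        for (String[] r : scores) {
            groups.computeIfAbsent(r[0], k -> new ArrayList<>()).add(tag("score", r[1]));
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            String joined = joinOne(e.getValue());
            if (joined != null) {
                out.add(e.getKey() + "\t" + joined);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] names = {{"100", "tom"}, {"101", "mary"}, {"102", "kate"}};
        String[][] scores = {{"100", "90"}, {"101", "85"}, {"102", "80"}};
        for (String line : join(names, scores)) {
            System.out.println(line);
        }
    }
}
```

Running main on the sample data prints one tab-separated line per key that appears in both inputs. In the real job the grouping is done by the framework, so only the tagging and the per-key join logic carry over.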

The program code is as follows:

package org.myorg;

import java.io.IOException;
import java.util.*;
import java.lang.String;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class OneOnOneJoin {

    // Tags each record with the name of the file it came from.
    public static class Map extends MapReduceBase implements Mapper<Text, Text, Text, Text> {

      public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {

        // Path of the file that the current input split belongs to.
        String path = ((FileSplit) reporter.getInputSplit()).getPath().toString();
        Text kv = new Text();

        // indexOf returns -1 when the name is absent, so test with >= 0.
        if (path.indexOf("name.txt") >= 0) {
          kv.set("name" + "+" + value);
        } else if (path.indexOf("score.txt") >= 0) {
          kv.set("score" + "+" + value);
        }

        output.collect(key, kv);
      }
    }

    // Joins the tagged values that share a key.
    public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {

        String name = "";
        String score = "";

        while (values.hasNext()) {
          String value = values.next().toString();

          // Strip the tag prefix to recover the original field.
          if (value.startsWith("name+")) {
            name = value.substring(5, value.length());
          } else if (value.startsWith("score+")) {
            score = value.substring(6, value.length());
          }
        }

        // Inner join: emit only when both files contributed a record.
        if (!name.equals("") && !score.equals("")) {
          output.collect(key, new Text(name + "\t" + score));
        }

      }
    }

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(OneOnOneJoin.class);
      conf.setJobName("oneononejoin");

      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(Text.class);

      conf.setMapperClass(Map.class);
      conf.setReducerClass(Reduce.class);

      // KeyValueTextInputFormat splits each line into key and value at the first tab by default.
      conf.setInputFormat(KeyValueTextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
    }
}

Compile and run:
javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d OneOnOneJoin OneOnOneJoin.java
jar -cvf OneOnOneJoin.jar -C OneOnOneJoin/ .

hadoop fs -rmr /sunwg/output
hadoop jar OneOnOneJoin.jar org.myorg.OneOnOneJoin /sunwg/input /sunwg/output

Check the result:
[hadoop@hadoop00 sunwg]$ hadoop fs -cat /sunwg/output/part-00000
100     tom     90
101     mary    85
102     kate    80

Next, consider the outer join case, where score may have no matching record.
For example:

name.txt
100 tom
101 mary
102 kate

score.txt
100 90
102 80

We want the following join result:
100 tom 90
101 mary
102 kate 80

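Just change the check on the final output in reduce to:

// Left outer join: emit whenever a name exists, even without a score.
if (!name.equals("")) {
  output.collect(key, new Text(name + "\t" + score));
}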

Result:
[hadoop@hadoop00 sunwg]$ hadoop fs -cat /sunwg/output/part-00000
100     tom     90
101     mary
102     kate    80

Going one step further, consider the full join case, where name may also have no matching record.
For example:
name.txt
100 tom
101 mary

score.txt
100 90
102 80

We want the following join result:
100 tom 90
101 mary
102 80
Just change the check on the final output in reduce to:

// Full join: emit when either side has a record.
if (!name.equals("") || !score.equals("")) {
  output.collect(key, new Text(name + "\t" + score));
}

Result:
[hadoop@hadoop00 sunwg]$ hadoop fs -cat /sunwg/output/part-00000
100     tom     90
101     mary
102             80
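The three variants differ only in the condition that guards the final output. As a rough sketch, that condition can be pulled into a single helper. The JoinCondition class, the JoinType enum and the shouldEmit method are hypothetical names introduced for illustration and are not part of the original job.

```java
// Sketch: one place for the emit condition of each join variant.
class JoinCondition {

    enum JoinType { INNER, LEFT_OUTER, FULL_OUTER }

    // name and score are empty strings when the corresponding file had no record.
    static boolean shouldEmit(JoinType type, String name, String score) {
        boolean hasName = !name.isEmpty();
        boolean hasScore = !score.isEmpty();
        switch (type) {
            case INNER:
                return hasName && hasScore;   // both sides required
            case LEFT_OUTER:
                return hasName;               // name side required, score optional
            case FULL_OUTER:
                return hasName || hasScore;   // either side is enough
            default:
                throw new IllegalArgumentException("unknown join type: " + type);
        }
    }

    public static void main(String[] args) {
        // Key 102 from the full join example: no name, score 80.
        System.out.println(shouldEmit(JoinType.INNER, "", "80"));
        System.out.println(shouldEmit(JoinType.LEFT_OUTER, "", "80"));
        System.out.println(shouldEmit(JoinType.FULL_OUTER, "", "80"));
    }
}
```

In a reducer, the result of shouldEmit would decide whether output.collect is called. How the join type reaches the reducer (for example through the job configuration) is left open here.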

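The above implements a JOIN of two files. The same strategy extends to multiple input directories: give every file from the same directory the same tag, then apply different map and reduce logic per tag, so that each directory is processed in its own way.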
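To tag by directory rather than by file name, the mapper can derive the tag from the parent directory of the split's path. The following is a minimal plain-Java sketch of that step. The class DirectoryTagSketch, the method tagFor and the example paths are all hypothetical, and the sketch assumes paths use "/" as the separator.

```java
// Sketch: derive a record tag from the directory that contains the input file.
class DirectoryTagSketch {

    // Returns the name of the directory directly containing the file,
    // or an empty string when the path has no parent directory name.
    static String tagFor(String filePath) {
        int lastSlash = filePath.lastIndexOf('/');
        if (lastSlash <= 0) {
            return "";
        }
        String parent = filePath.substring(0, lastSlash);
        return parent.substring(parent.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // Hypothetical layout: one directory per data source.
        System.out.println(tagFor("/sunwg/input/names/part-00000"));
        System.out.println(tagFor("/sunwg/input/scores/part-00000"));
    }
}
```

In the mapper, the string passed to tagFor would be the path obtained from the input split, as in the join example. Hadoop's old API also ships org.apache.hadoop.mapred.lib.MultipleInputs, which can bind a separate mapper class to each input path. That is an alternative to tagging inside one mapper and is not what the original example uses.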

Reposted from: http://www.alidw.com/?p=1710