Hadoop如何处理关联计算

最新推荐文章于 2018-05-04 11:43:04 发布

hezhixue

最新推荐文章于 2018-05-04 11:43:04 发布

阅读量1k

点赞数

假设：HDFS上有2个文件，分别是客户信息和订单信息，customerID是它们之间的关联字段。如何进行关联计算，以便将客户名称添加到订单列表中？

一般方法是：输入2个源文件。根据文件名在Map中处理每条数据，如果是Order，则在foreign key上加标记”O”，形成combined key；如果是Customer则做标记”C”。Map之后的数据按照key分区，再按照combined key分组排序。最后在reduce中合并结果再输出。

实现代码：

 
public static class JMapper extends Mapper<LongWritable, Text, TextPair, Text> {
 
    //mark every row with "O" or "C" according to file name
 
    @Override
 
    protected void map(LongWritable key, Text value, Context context)throws IOException, InterruptedException {
 
     String pathName = ((FileSplit) context.getInputSplit()).getPath().toString();
 
     if (pathName.contains("order.txt")) {//identify order by file name
 
            String values[] = value.toString().split("\t");
 
            TextPair tp = new TextPair(new Text(values[1]), newText("O"));//mark with "O"
 
            context.write(tp, new Text(values[0] + "\t" + values[2]));
 
        }
 
    if (pathName.contains("customer.txt")) {//identify customer by file name
 
           String values[] = value.toString().split("\t");
 
           TextPair tp = new TextPair(new Text(values[0]), newText("C"));//mark with "C"
 
           context.write(tp, new Text(values[1]));
 
        }
 
    }
 
}

 
public static class JPartitioner extends Partitioner<TextPair, Text> {
 
    //partition by key, i.e. customerID
 
    @Override
 
    public int getPartition(TextPair key, Text value, int numParititon) {
 
        return Math.abs(key.getFirst().hashCode() * 127) % numParititon;
 
    }
 
}

 
public static class JComparator extends WritableComparator {
 
    //group by muti-key
 
    public JComparator() {
 
        super(TextPair.class, true);
 
    }
 
    @SuppressWarnings("unchecked")
 
    public int compare(WritableComparable a, WritableComparable b) {
 
        TextPair t1 = (TextPair) a;
 
        TextPair t2 = (TextPair) b;
 
        return t1.getFirst().compareTo(t2.getFirst());
 
    }
 
}

 
public static class JReduce extends Reducer<TextPair, Text, Text, Text> {
 
    //merge and output
 
    protected void reduce(TextPair key, Iterable<Text> values, Context context) throws IOException,InterruptedException {
 
     Text pid = key.getFirst();
 
     String desc = values.iterator().next().toString();
 
     while (values.iterator().hasNext()) {
 
         context.write(pid, new Text(values.iterator().next().toString() +"\t" + desc));
 
    }
 
    }
 
}

 
public class TextPair implements WritableComparable<TextPair> {
 
    //make muti-key
 
    private Text first;
 
    private Text second;
 
    public TextPair() {
 
        set(new Text(), new Text());
 
    }
 
    public TextPair(String first, String second) {
 
        set(new Text(first), new Text(second));
 
    }
 
    public TextPair(Text first, Text second) {
 
        set(first, second);
 
    }
 
    public void set(Text first, Text second) {
 
   this.first = first;
 
   this.second = second;
 
    }
 
    public Text getFirst() {
 
   return first;
 
    }
 
    public Text getSecond() {
 
   return second;
 
    }
 
    public void write(DataOutput out) throws IOException {
 
   first.write(out);
 
   second.write(out);
 
    }
 
    public void readFields(DataInput in) throws IOException {
 
   first.readFields(in);
 
   second.readFields(in);
 
    }
 
    public int compareTo(TextPair tp) {
 
   int cmp = first.compareTo(tp.first);
 
   if (cmp != 0) {
 
        return cmp;
 
   }
 
     return second.compareTo(tp.second);
 
    }
 
}

 
public static void main(String agrs[]) throws IOException, InterruptedException, ClassNotFoundException {
 
    //job entrance
 
    Configuration conf = new Configuration();
 
    GenericOptionsParser parser = new GenericOptionsParser(conf, agrs);
 
    String[] otherArgs = parser.getRemainingArgs();
 
    if (agrs.length < 3) {
 
    System.err.println("Usage: J <in_path_one> <in_path_two> <output>");
 
    System.exit(2);
 
    }
 
    Job job = new Job(conf, "J");
 
    job.setJarByClass(J.class);//Join class
 
    job.setMapperClass(JMapper.class);//Map class
 
    job.setMapOutputKeyClass(TextPair.class);//Map output key class
 
    job.setMapOutputValueClass(Text.class);//Map output value class
 
    job.setPartitionerClass(JPartitioner.class);//partition class
 
    job.setGroupingComparatorClass(JComparator.class);//condition group class after partition
 
    job.setReducerClass(Example_Join_01_Reduce.class);//reduce class
 
    job.setOutputKeyClass(Text.class);//reduce output key class
 
    job.setOutputValueClass(Text.class);//reduce ouput value class
 
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));//one of source files
 
    FileInputFormat.addInputPath(job, new Path(otherArgs[1]));//another file
 
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));//output path
 
    System.exit(job.waitForCompletion(true) ? 0 : 1);//run untill job ends
 
}

不能直接使用原始数据，而是要搞一堆代码处理标记，并绕过MapReduce原本的架构，最后从底层设计并计算数据之间的关联关系。这还是最简单的关联计算，如果用MapReduce进行多表关联或逻辑更复杂的关联计算，复杂度会呈几何级数递增。

hezhixue

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Hadoop如何处理关联计算

假设：HDFS上有2个文件，分别是客户信息和订单信息，customerID是它们之间的关联字段。如何进行关联计算，以便将客户名称添加到订单列表中？一般方法是：输入2个源文件。根据文件名在Map中处理每条数据，如果是Order，则在foreign key上加标记”O”，形成combined key；如果是Customer则做标记”C”。Map之后的数据按照key分区，再按照combi
复制链接

扫一扫