Hadoop: multiple outputs from reduce

Adapted from: Hadoop in Action

In Hadoop, there are two ways to have reduce write to multiple outputs. (Both examples below happen to be map-only jobs, but the techniques work the same way on the reduce side.)

The first is to subclass MultipleTextOutputFormat (parameterized with the job's output key/value types) and override its generateFileNameForKeyValue method:

public static class PartitionByCountryMTOF
    extends MultipleTextOutputFormat<NullWritable, Text>
{
    @Override
    protected String generateFileNameForKeyValue(NullWritable key,
            Text value, String filename)
    {
        // The fifth comma-separated field holds a quoted country code,
        // e.g. "US"; strip the leading quote and take the two letters.
        String[] arr = value.toString().split(",", -1);
        String country = arr[4].substring(1, 3);
        // Route the record into a per-country subdirectory, keeping the
        // default leaf filename (e.g. part-00000).
        return country + "/" + filename;
    }
}
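To make the routing concrete, consider a hypothetical input line (an assumption on my part, modeled on the split/substring logic above: the fifth comma-separated field carries a quoted two-letter country code):

3070801,1963,1096,,"BE",...

Here arr[4] is the string "BE" including the opening quote, so arr[4].substring(1, 3) yields BE and the record is written to BE/part-00000 under the job's output directory. The driver below wires this output format into a map-only job: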


public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    JobConf job = new JobConf(conf, MultiFile.class);
    Path in = new Path(args[0]);
    Path out = new Path(args[1]);
    FileInputFormat.setInputPaths(job, in);
    FileOutputFormat.setOutputPath(job, out);
    job.setJobName("MultiFile");
    // MapClass (not shown here) is assumed to emit <NullWritable, Text> pairs.
    job.setMapperClass(MapClass.class);
    job.setInputFormat(TextInputFormat.class);
    job.setOutputFormat(PartitionByCountryMTOF.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(0); // map-only: the output format does the routing
    JobClient.runJob(job);
    return 0;
}

The limitation of this approach is obvious: the destination file is decided per record, and each record can go to exactly one file. If the same record needs to be written to several files at once, this approach cannot do it. In that case we can use the MultipleOutputs class:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleOutputs;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MultiFile extends Configured implements Tool {
    public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, Text> {
            private MultipleOutputs mos;

            private OutputCollector<NullWritable, Text> collector;
            public void configure(JobConf conf) {
                // Create the MultipleOutputs helper once per task.
                mos = new MultipleOutputs(conf);
            }

            public void map(LongWritable key, Text value,
                    OutputCollector<NullWritable, Text> output,
                    Reporter reporter) throws IOException {
                String[] arr = value.toString().split(",", -1);
                // Project two different column subsets out of the same record.
                String chrono = arr[0] + "," + arr[1] + "," + arr[2];
                String geo = arr[0] + "," + arr[4] + "," + arr[5];
                // Write each projection to its own named output.
                collector = mos.getCollector("chrono", reporter);
                collector.collect(NullWritable.get(), new Text(chrono));
                collector = mos.getCollector("geo", reporter);
                collector.collect(NullWritable.get(), new Text(geo));
            }

            public void close() throws IOException {
                mos.close(); // flush and close all named outputs
            }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        JobConf job = new JobConf(conf, MultiFile.class);
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        job.setJobName("MultiFile");
        job.setMapperClass(MapClass.class);
        job.setInputFormat(TextInputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0);
        // Register each named output with its format and key/value types.
        MultipleOutputs.addNamedOutput(job,
                "chrono",
                TextOutputFormat.class,
                NullWritable.class,
                Text.class);
        MultipleOutputs.addNamedOutput(job,
                "geo",
                TextOutputFormat.class,
                NullWritable.class,
                Text.class);
        JobClient.runJob(job);
        return 0;
    }
    public static void main(String[] args) throws Exception {
        // Standard ToolRunner entry point.
        int res = ToolRunner.run(new Configuration(), new MultiFile(), args);
        System.exit(res);
    }
}

MultipleOutputs maintains a map from output name to OutputCollector. We register the named outputs in the job configuration with addNamedOutput, then inside map or reduce fetch the corresponding collector via getCollector and call collect on it.
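The example above uses the named collectors from the map side; the reduce side looks the same. Below is a minimal sketch of a reducer (my own illustration, not from the book), assuming the "geo" named output has been registered with addNamedOutput as in the driver above; it additionally needs java.util.Iterator and org.apache.hadoop.mapred.Reducer imports:

public static class ReduceClass extends MapReduceBase
    implements Reducer<Text, Text, NullWritable, Text> {

    private MultipleOutputs mos;

    public void configure(JobConf conf) {
        mos = new MultipleOutputs(conf);
    }

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<NullWritable, Text> output,
            Reporter reporter) throws IOException {
        while (values.hasNext()) {
            // Look up the collector registered under "geo" and emit to it.
            // Records written this way bypass the standard collector, so
            // they are not counted as reduce output records (see below).
            OutputCollector collector = mos.getCollector("geo", reporter);
            collector.collect(NullWritable.get(), values.next());
        }
    }

    public void close() throws IOException {
        mos.close(); // flush and close all named outputs
    }
}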

Finally, note that if all of the reducer's output goes through named collectors, the framework's counters will report reduce output records = 0 even though output was in fact written, because only records emitted through the standard OutputCollector passed to reduce are counted.
