Adapted from: Hadoop in Action
In Hadoop, there are two ways to make a job write to multiple output files.
The first is to subclass MultipleTextOutputFormat and override its generateFileNameForKeyValue method:
public static class PartitionByCountryMTOF
        extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key,
            Text value, String filename) {
        // the value is a comma-separated record; the fifth field is a quoted country code
        String[] arr = value.toString().split(",", -1);
        String country = arr[4].substring(1, 3);
        // place each record in a subdirectory named after its country
        return country + "/" + filename;
    }
}
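The filename logic above is plain string handling, so it can be sanity-checked outside Hadoop. The sketch below (a hypothetical helper, not part of the job) assumes a sample record whose fifth field is a quoted two-letter country code such as "US":

```java
public class CountryKeyDemo {
    // mirrors the filename logic of generateFileNameForKeyValue above
    static String fileNameFor(String value, String filename) {
        String[] arr = value.split(",", -1);
        // skip the opening quote and keep the two-letter code
        String country = arr[4].substring(1, 3);
        return country + "/" + filename;
    }

    public static void main(String[] args) {
        String record = "5,2008,Alice,42,\"US\",NY";
        System.out.println(fileNameFor(record, "part-00000")); // US/part-00000
    }
}
```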
public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    JobConf job = new JobConf(conf, MultiFile.class);

    Path in = new Path(args[0]);
    Path out = new Path(args[1]);
    FileInputFormat.setInputPaths(job, in);
    FileOutputFormat.setOutputPath(job, out);

    job.setJobName("MultiFile");
    job.setMapperClass(MapClass.class);
    job.setInputFormat(TextInputFormat.class);
    job.setOutputFormat(PartitionByCountryMTOF.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(0);

    JobClient.runJob(job);
    return 0;
}
The limitation of this approach is obvious: the output file is chosen per record, and each record can go to exactly one file. If the same record must be written to several files at once, this approach cannot do it. In that case we can use the MultipleOutputs class:
public class MultiFile extends Configured implements Tool {
    public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, NullWritable, Text> {
        private MultipleOutputs mos;
        private OutputCollector<NullWritable, Text> collector;

        public void configure(JobConf conf) {
            mos = new MultipleOutputs(conf);
        }

        public void map(LongWritable key, Text value,
                OutputCollector<NullWritable, Text> output,
                Reporter reporter) throws IOException {
            String[] arr = value.toString().split(",", -1);
            String chrono = arr[0] + "," + arr[1] + "," + arr[2];
            String geo = arr[0] + "," + arr[4] + "," + arr[5];
            // write the same input record to both named outputs
            collector = mos.getCollector("chrono", reporter);
            collector.collect(NullWritable.get(), new Text(chrono));
            collector = mos.getCollector("geo", reporter);
            collector.collect(NullWritable.get(), new Text(geo));
        }

        public void close() throws IOException {
            mos.close();
        }
    }
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        JobConf job = new JobConf(conf, MultiFile.class);

        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setJobName("MultiFile");
        job.setMapperClass(MapClass.class);
        job.setInputFormat(TextInputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0);

        MultipleOutputs.addNamedOutput(job,
                "chrono",
                TextOutputFormat.class,
                NullWritable.class,
                Text.class);
        MultipleOutputs.addNamedOutput(job,
                "geo",
                TextOutputFormat.class,
                NullWritable.class,
                Text.class);

        JobClient.runJob(job);
        return 0;
    }
}
Internally, this class maintains a map from name to OutputCollector. We register each named output in the job configuration (via addNamedOutput), then in map or reduce fetch the corresponding collector with getCollector and call collect on it.
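To make the name-to-collector bookkeeping concrete, here is a toy model of that idea, not the real Hadoop class: names are registered up front, each name owns its own sink, and asking for an unregistered name fails, which mirrors how getCollector only works for outputs declared with addNamedOutput.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for MultipleOutputs: one collector per registered name.
public class NamedOutputsModel {
    private final Map<String, List<String>> collectors = new HashMap<>();

    // analogous to MultipleOutputs.addNamedOutput in the driver
    public void addNamedOutput(String name) {
        collectors.put(name, new ArrayList<>());
    }

    // analogous to getCollector(name, reporter); unknown names are rejected
    public List<String> getCollector(String name) {
        List<String> c = collectors.get(name);
        if (c == null) {
            throw new IllegalArgumentException("named output not registered: " + name);
        }
        return c;
    }

    public static void main(String[] args) {
        NamedOutputsModel mos = new NamedOutputsModel();
        mos.addNamedOutput("chrono");
        mos.addNamedOutput("geo");
        // one input record fans out to two named outputs
        mos.getCollector("chrono").add("5,2008,Alice");
        mos.getCollector("geo").add("5,\"US\",NY");
        System.out.println(mos.getCollector("chrono").size()); // 1
    }
}
```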
Finally, note that if all of a task's output goes through named collectors, the framework's built-in counters will report the number of output records as 0, even though the job does produce output, because records written to named outputs are not counted there.