Adapted from: Hadoop in Action
In Hadoop, there are two ways to make a job write to multiple output files.
The first is to subclass MultipleTextOutputFormat and override its generateFileNameForKeyValue method:
public static class PartitionByCountryMTOF
        extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key,
            Text value, String filename) {
        // the value is a comma-separated record; the fifth field is a quoted country code
        String[] arr = value.toString().split(",", -1);
        String country = arr[4].substring(1, 3);
        // place each record in a subdirectory named after its country
        return country + "/" + filename;
    }
}
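The filename logic above is plain string handling, so it can be sanity-checked outside Hadoop. The sketch below (a hypothetical helper, not part of the job) assumes a sample record whose fifth field is a quoted two-letter country code such as "US":

```java
public class CountryKeyDemo {
    // mirrors the filename logic of generateFileNameForKeyValue above
    static String fileNameFor(String value, String filename) {
        String[] arr = value.split(",", -1);
        // skip the opening quote and keep the two-letter code
        String country = arr[4].substring(1, 3);
        return country + "/" + filename;
    }

    public static void main(String[] args) {
        String record = "5,2008,Alice,42,\"US\",NY";
        System.out.println(fileNameFor(record, "part-00000")); // US/part-00000
    }
}
```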
public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    JobConf job = new JobConf(conf, MultiFile.class);

    Path in = new Path(args[0]);
    Path out = new Path(args[1]);
    FileInputFormat.setInputPaths(job, in);
    FileOutputFormat.setOutputPath(job, out);

    job.setJobName("MultiFile");
    job.setMapperClass(MapClass.class);
    job.setInputFormat(TextInputFormat.class);
    job.setOutputFormat(PartitionByCountryMTOF.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(0);

    JobClient.runJob(job);
    return 0;
}
The limitation of this approach is obvious: the output file is chosen per record, and each record can go to exactly one file. If the same record must be written to several files at once, this approach cannot do it. In that case we can use the MultipleOutputs class:
public class MultiFile extends Configured implements Tool {
    public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, NullWritable, Text> {
        private MultipleOutputs mos;
        private OutputCollector<NullWritable, Text> collector;

        public void configure(JobConf conf) {
            mos = new MultipleOutputs(conf);
        }

        public void map(LongWritable key, Text value,
                OutputCollector<NullWritable, Text> output,
                Reporter reporter) throws IOException {
            String[] arr = value.toString().split(",", -1);
            String chrono = arr[0] + "," + arr[1] + "," + arr[2];
            String geo = arr[0] + "," + arr[4] + "," + arr[5];
            // write the same input record to both named outputs
            collector = mos.getCollector("chrono", reporter);
            collector.collect(NullWritable.get(), new Text(chrono));
            collector = mos.getCollector("geo", reporter);
            collector.collect(NullWritable.get(), new Text(geo));
        }

        public void close() throws IOException {
            mos.close();
        }
    }
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        JobConf job = new JobConf(conf, MultiFile.class);

        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setJobName("MultiFile");
        job.setMapperClass(MapClass.class);
        job.setInputFormat(TextInputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0);

        MultipleOutputs.addNamedOutput(job,
                "chrono",
                TextOutputFormat.class,
                NullWritable.class,
                Text.class);
        MultipleOutputs.addNamedOutput(job,
                "geo",
                TextOutputFormat.class,
                NullWritable.class,
                Text.class);

        JobClient.runJob(job);
        return 0;
    }
}
Internally, this class maintains a map from name to OutputCollector. We register each named output in the job configuration (via addNamedOutput), then in map or reduce fetch the corresponding collector with getCollector and call collect on it.
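To make the name-to-collector bookkeeping concrete, here is a toy model of that idea, not the real Hadoop class: names are registered up front, each name owns its own sink, and asking for an unregistered name fails, which mirrors how getCollector only works for outputs declared with addNamedOutput.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for MultipleOutputs: one collector per registered name.
public class NamedOutputsModel {
    private final Map<String, List<String>> collectors = new HashMap<>();

    // analogous to MultipleOutputs.addNamedOutput in the driver
    public void addNamedOutput(String name) {
        collectors.put(name, new ArrayList<>());
    }

    // analogous to getCollector(name, reporter); unknown names are rejected
    public List<String> getCollector(String name) {
        List<String> c = collectors.get(name);
        if (c == null) {
            throw new IllegalArgumentException("named output not registered: " + name);
        }
        return c;
    }

    public static void main(String[] args) {
        NamedOutputsModel mos = new NamedOutputsModel();
        mos.addNamedOutput("chrono");
        mos.addNamedOutput("geo");
        // one input record fans out to two named outputs
        mos.getCollector("chrono").add("5,2008,Alice");
        mos.getCollector("geo").add("5,\"US\",NY");
        System.out.println(mos.getCollector("chrono").size()); // 1
    }
}
```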
Finally, note that if all of a task's output goes through named collectors, the framework's built-in counters will report the number of output records as 0, even though the job does produce output, because records written to named outputs are not counted there.