Hadoop: Custom Output File Formats and Writing to Different Directories

Reposted from: Hadoop programming tips (7): custom output file formats and writing to different directories; saved here for study.

Test environment: Hadoop 2.4

Use case: this technique applies whenever you need to customize the output data format, including its presentation, the output path, and the output file names.

Hadoop's built-in output file formats include:

1) FileOutputFormat<K,V>: the commonly used parent class;

2) TextOutputFormat<K,V>: the default text output format;

3) SequenceFileOutputFormat<K,V>: serialized (SequenceFile) output;

4) MultipleOutputs<K,V>: routes output data to different directories (strictly a helper class used inside tasks rather than an OutputFormat);

5) NullOutputFormat<K,V>: sends output to /dev/null, i.e. emits nothing; useful when the job does its own writing inside the map/reduce tasks and the framework's output is not needed;

6) LazyOutputFormat<K,V>: creates an output file only when write is called, so no empty files are produced if write is never invoked;

Steps:

As with custom input formats, a custom output format can follow these steps:

1) Define a class extending OutputFormat; in practice, extending FileOutputFormat is usually enough;

2) Implement its getRecordWriter method so that it returns a RecordWriter;

3) Define a class extending RecordWriter and implement its write method, which writes each <key, value> pair to the file;

Example 1 (change the default output file name and the default key/value separator):

Input data (shown as an image in the original post):

Custom CustomOutputFormat (replaces the default file-name prefix):

package fz.outputformat;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CustomOutputFormat extends FileOutputFormat<LongWritable, Text> {

  private String prefix = "custom_";
  @Override
  public RecordWriter<LongWritable, Text> getRecordWriter(TaskAttemptContext job)
      throws IOException, InterruptedException {
    // Build the output file path: prefix + last five characters of the task ID
    Path outputDir = FileOutputFormat.getOutputPath(job);
    String suffix = job.getTaskAttemptID().getTaskID().toString();
    Path path = new Path(outputDir, prefix + suffix.substring(suffix.length() - 5));
    FSDataOutputStream fileOut = path.getFileSystem(job.getConfiguration()).create(path);
    return new CustomRecordWriter(fileOut);
  }

}
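The file naming above simply appends the last five characters of the stringified task ID (the zero-padded task number) to the prefix. A minimal plain-Java sketch of that naming logic, using a made-up task ID string for illustration:

```java
public class NamingDemo {
    // Mirrors the naming logic in CustomOutputFormat:
    // prefix + last 5 characters of the task ID string.
    static String outputName(String prefix, String taskId) {
        return prefix + taskId.substring(taskId.length() - 5);
    }

    public static void main(String[] args) {
        // A task ID string ends in the zero-padded task number; this one is made up.
        String taskId = "task_201407310000_0001_r_000000";
        System.out.println(outputName("custom_", taskId)); // prints custom_00000
    }
}
```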
Custom CustomRecordWriter (sets the key/value separator):

package fz.outputformat;

import java.io.IOException;
import java.io.PrintWriter;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class CustomRecordWriter extends RecordWriter<LongWritable, Text> {

  private PrintWriter out;
  private String separator = ",";
  public CustomRecordWriter(FSDataOutputStream fileOut) {
    out = new PrintWriter(fileOut);
  }

  @Override
  public void write(LongWritable key, Text value) throws IOException,
      InterruptedException {
    out.println(key.get()+separator+value.toString());
  }

  @Override
  public void close(TaskAttemptContext context) throws IOException,
      InterruptedException {
    out.close();
  }

}
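Each record is written as key, separator, value on its own line, so the line format reduces to simple string concatenation:

```java
public class SeparatorDemo {
    // Same line format CustomRecordWriter produces: key + separator + value.
    static String formatRecord(long key, String value, String separator) {
        return key + separator + value;
    }

    public static void main(String[] args) {
        System.out.println(formatRecord(1L, "hello", ",")); // prints 1,hello
    }
}
```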
Driver class:

package fz.outputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;


public class FileOutputFormatDriver extends Configured implements Tool{

  /**
   * @param args
   * @throws Exception 
   */
  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new FileOutputFormatDriver(), args));
  }

  @Override
  public int run(String[] arg0) throws Exception {
    if(arg0.length!=3){
      System.err.println("Usage:\nfz.outputformat.FileOutputFormatDriver <in> <out> <numReducer>");
      return -1;
    }
    Configuration conf = getConf();
    
    Path in = new Path(arg0[0]);
    Path out= new Path(arg0[1]);
    boolean deleted = out.getFileSystem(conf).delete(out, true);
    System.out.println("deleted " + out + "? " + deleted);
    Job job = Job.getInstance(conf, "fileoutputformat test job");
    job.setJarByClass(getClass());
    
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(CustomOutputFormat.class);
    
    job.setMapperClass(Mapper.class);
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(Integer.parseInt(arg0[2]));
    job.setReducerClass(Reducer.class);
    FileInputFormat.setInputPaths(job, in);
    FileOutputFormat.setOutputPath(job, out);
    
    return job.waitForCompletion(true)?0:-1;
  }

}
Output:

The output shows that both the file format and the file names match what we expected.

Example 2 (route records to different directories based on key and value). Custom driver (only the output configuration changes):

package fz.multipleoutputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;


public class FileOutputFormatDriver extends Configured implements Tool{

  /**
   * @param args
   * @throws Exception 
   */
  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new FileOutputFormatDriver(), args));
  }

  @Override
  public int run(String[] arg0) throws Exception {
    if(arg0.length!=3){
      System.err.println("Usage:\nfz.multipleoutputformat.FileOutputFormatDriver <in> <out> <numReducer>");
      return -1;
    }
    Configuration conf = getConf();
    
    Path in = new Path(arg0[0]);
    Path out= new Path(arg0[1]);
    boolean deleted = out.getFileSystem(conf).delete(out, true);
    System.out.println("deleted " + out + "? " + deleted);
    Job job = Job.getInstance(conf, "fileoutputformat test job");
    job.setJarByClass(getClass());
    
    job.setInputFormatClass(TextInputFormat.class);
    // No custom output format here; instead register two named outputs.
    MultipleOutputs.addNamedOutput(job, "ignore", TextOutputFormat.class,
        LongWritable.class, Text.class);
    MultipleOutputs.addNamedOutput(job, "other", TextOutputFormat.class,
        LongWritable.class, Text.class);

    job.setMapperClass(Mapper.class);
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(Integer.parseInt(arg0[2]));
    job.setReducerClass(MultipleReducer.class);
    FileInputFormat.setInputPaths(job, in);
    FileOutputFormat.setOutputPath(job, out);
    
    return job.waitForCompletion(true)?0:-1;
  }

}
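One detail worth noting (an addition, not from the original post): in the Hadoop releases I'm aware of, MultipleOutputs accepts only named outputs made of letters and digits, so names such as "ignore" and "other" are fine, but a name like "my-output" would be rejected with an IllegalArgumentException. A small check mirroring that rule:

```java
public class NamedOutputCheck {
    // Mirrors MultipleOutputs' naming rule: only letters and digits are allowed.
    static boolean isValidNamedOutput(String name) {
        if (name == null || name.isEmpty()) return false;
        for (char c : name.toCharArray()) {
            if (!Character.isLetterOrDigit(c)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidNamedOutput("ignore"));    // true
        System.out.println(isValidNamedOutput("my-output")); // false: '-' not allowed
    }
}
```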
Custom reducer (custom logic is needed to route records to different directories based on key and value):
package fz.multipleoutputformat;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MultipleReducer extends
    Reducer<LongWritable, Text, LongWritable, Text> {
  private MultipleOutputs<LongWritable,Text> out;
  @Override
  public void setup(Context cxt){
    out = new MultipleOutputs<LongWritable,Text>(cxt);
  }
  @Override
  public void reduce(LongWritable key, Iterable<Text> values, Context cxt)
      throws IOException, InterruptedException {
    for (Text v : values) {
      if (v.toString().startsWith("ignore")) {
        out.write("ignore", key, v, "ign"); // base output path "ign"
      } else {
        out.write("other", key, v, "oth"); // base output path "oth"
      }
    }
  }
  
  @Override
  public void cleanup(Context cxt)throws IOException,InterruptedException{
    out.close();
  }
}
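The routing decision in the reducer is a plain prefix test; extracted on its own (with the base output paths "ign" and "oth" used in the code above):

```java
public class RoutingDemo {
    // Same branch as MultipleReducer.reduce: values starting with "ignore"
    // go to the "ign" directory, everything else to "oth".
    static String baseOutputPath(String value) {
        return value.startsWith("ignore") ? "ign" : "oth";
    }

    public static void main(String[] args) {
        System.out.println(baseOutputPath("ignore this line")); // prints ign
        System.out.println(baseOutputPath("keep this line"));   // prints oth
    }
}
```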
Output:

The output shows that records were indeed written to different directories according to their values. Note, however, that the default part files are still generated, each with size 0: the job's default output format creates them even though nothing is written through it. Wrapping the output format with LazyOutputFormat (item 6 above) avoids these empty files.

Summary: a custom output format lets you meet special requirements, but Hadoop's built-in formats usually suffice, so its practical value is limited. Hadoop's built-in MultipleOutputs, on the other hand, which routes records to different directories based on their characteristics, is genuinely useful in practice.

