java outputformat: Custom OutputFormat Code Implementation

亓卞

Tags: java outputformat

Article link: https://blog.csdn.net/weixin_34442215/article/details/114181631

Custom OutputFormat Code Implementation

Author: 尹正杰

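I. Overview of OutputFormat Implementation Classes

OutputFormat is the base type for MapReduce output: every MapReduce output implementation extends it (or, in the older mapred API, implements it as an interface). The most common OutputFormat implementations are described below.

1>. Text output (TextOutputFormat)

The default output format is TextOutputFormat, which writes each record as a line of text. Its keys and values can be of any type, because TextOutputFormat calls toString() to convert them to strings.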
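To make the toString() conversion concrete, here is a minimal plain-Java sketch with no Hadoop dependency. The class name `TextLineSketch` and the method `formatRecord` are hypothetical, and the tab separator is assumed because it is TextOutputFormat's default. The null handling is a simplification of what the real class does.

```java
/** Plain-Java sketch (no Hadoop dependency) of how one text line is built from a key/value pair. */
final class TextLineSketch {

    // Assumption: a tab separator, which is the default used by TextOutputFormat.
    static final String SEPARATOR = "\t";

    /** Builds one output line by calling toString() on the key and on the value. */
    static String formatRecord(Object key, Object value) {
        StringBuilder line = new StringBuilder();
        if (key != null) {
            line.append(key.toString());
        }
        if (key != null && value != null) {
            line.append(SEPARATOR);
        }
        if (value != null) {
            line.append(value.toString());
        }
        // Each record ends with a newline, so one record becomes one line.
        return line.append('\n').toString();
    }

    public static void main(String[] args) {
        System.out.print(formatRecord(42L, "https://www.jd.com/"));
        System.out.print(formatRecord(null, "value only"));
    }
}
```

Any key or value type works here for the same reason given above: only toString() is required of it.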

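2>. Binary output (SequenceFileOutputFormat)

When the output will serve as the input of a subsequent MapReduce job, SequenceFileOutputFormat is a good choice, because its format is compact and compresses easily.

3>. Custom OutputFormat

Implements whatever output behavior the user requires.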

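Use case:

To control the output path and output format of the final files, you can define a custom OutputFormat.

For example, a MapReduce program may need to write two kinds of results to different locations depending on the data. This kind of flexible output requirement can be met with a custom OutputFormat.

Rough steps for a custom OutputFormat:

(1) Define a class that extends FileOutputFormat;

(2) Implement a custom RecordWriter, overriding write() to control how the data is written out.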
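The shape of step (2) can be sketched without Hadoop. The class below is a hypothetical plain-Java stand-in for a RecordWriter (the name `RoutingWriterSketch` and the helper `routeInMemory` are illustrative only): `write()` is called once per record and picks one of two streams, and `close()` releases both at the end.

```java
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Plain-Java stand-in for a custom RecordWriter: one write() per record, one close() at the end. */
final class RoutingWriterSketch implements Closeable {

    private final String keyword;
    private final OutputStream matched;
    private final OutputStream unmatched;

    RoutingWriterSketch(String keyword, OutputStream matched, OutputStream unmatched) {
        this.keyword = keyword;
        this.matched = matched;
        this.unmatched = unmatched;
    }

    /** Appends a newline and sends the record to the stream chosen by the keyword test. */
    void write(String record) throws IOException {
        byte[] bytes = (record + "\n").getBytes(StandardCharsets.UTF_8);
        (record.contains(keyword) ? matched : unmatched).write(bytes);
    }

    /** Closes both streams, even if closing the first one fails. */
    @Override
    public void close() throws IOException {
        try {
            matched.close();
        } finally {
            unmatched.close();
        }
    }

    /** Demo helper: routes records into two in-memory buffers and returns {matchedText, unmatchedText}. */
    static String[] routeInMemory(String keyword, String... records) {
        ByteArrayOutputStream hit = new ByteArrayOutputStream();
        ByteArrayOutputStream miss = new ByteArrayOutputStream();
        try (RoutingWriterSketch writer = new RoutingWriterSketch(keyword, hit, miss)) {
            for (String record : records) {
                writer.write(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String[] {
            new String(hit.toByteArray(), StandardCharsets.UTF_8),
            new String(miss.toByteArray(), StandardCharsets.UTF_8)
        };
    }

    public static void main(String[] args) {
        String[] out = routeInMemory("yinzhengjie", "https://www.yinzhengjie.com", "https://www.jd.com/");
        System.out.print("matched: " + out[0]);
        System.out.print("other:   " + out[1]);
    }
}
```

In a real job the two streams would come from Hadoop's FileSystem rather than from in-memory buffers; the routing and closing logic stays the same.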

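II. Custom OutputFormat Example

1>. Requirements

Filter the input log website.txt: sites whose URL contains yinzhengjie are written to E:\yinzhengjie\outputFormat\yinzhengjie.log, and sites whose URL does not contain yinzhengjie are written to E:\yinzhengjie\outputFormat\other.log. The contents of website.txt are: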

https://www.yinzhengjie.com

https://www.jd.com/

https://www.taobao.com/

http://www.google.com/

https://www.baidu.com/

https://www.cloudera.com/

https://www.cnblogs.com/yinzhengjie/

https://www.yinzhengjie.org.cn/

website.txt
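As a quick check of what the job should produce for this sample file, the following plain-Java sketch applies the same contains("yinzhengjie") test to the eight lines. The class name `ExpectedSplitSketch` and its `split` method are hypothetical and not part of the job itself.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Applies the filter rule from the requirements to the sample lines of website.txt. */
final class ExpectedSplitSketch {

    static final List<String> WEBSITES = Arrays.asList(
            "https://www.yinzhengjie.com",
            "https://www.jd.com/",
            "https://www.taobao.com/",
            "http://www.google.com/",
            "https://www.baidu.com/",
            "https://www.cloudera.com/",
            "https://www.cnblogs.com/yinzhengjie/",
            "https://www.yinzhengjie.org.cn/");

    /** Partitions the lines: key true holds lines containing the keyword, key false holds the rest. */
    static Map<Boolean, List<String>> split(List<String> lines, String keyword) {
        return lines.stream().collect(Collectors.partitioningBy(line -> line.contains(keyword)));
    }

    public static void main(String[] args) {
        Map<Boolean, List<String>> parts = split(WEBSITES, "yinzhengjie");
        System.out.println("yinzhengjie.log: " + parts.get(true));
        System.out.println("other.log:       " + parts.get(false));
    }
}
```

With this input, three lines fall into the yinzhengjie group and five into the other group.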

2>. MyRecordWriter.java

package cn.org.yinzhengjie.outputformat;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class MyRecordWriter extends RecordWriter<LongWritable, Text> {

    // Use the API that Hadoop provides rather than FileOutputStream,
    // because this code will eventually write to an HDFS cluster.
    private FSDataOutputStream yinzhengjie;
    private FSDataOutputStream other;

    /**
     * Initialization method: opens the two output streams.
     *
     * @param job the task attempt context, used to read the job configuration
     */
    public void initialize(TaskAttemptContext job) throws IOException {
        // Get the output directory. FileOutputFormat.setOutputPath stores it under this key:
        //   job.getConfiguration().set(FileOutputFormat.OUTDIR, outputDir.toString());
        String outputDir = job.getConfiguration().get(FileOutputFormat.OUTDIR);

        // Get the file system
        FileSystem fileSystem = FileSystem.get(job.getConfiguration());

        // Open the two output streams
        yinzhengjie = fileSystem.create(new Path(outputDir + "/yinzhengjie.log"));
        other = fileSystem.create(new Path(outputDir + "/other.log"));
    }

    /**
     * Writes out one key/value pair; called once per pair.
     */
    @Override
    public void write(LongWritable key, Text value) throws IOException, InterruptedException {
        // The incoming value has no line terminator, so add one by hand;
        // otherwise the output would have no line breaks.
        String line = value.toString() + "\n";

        // Route the line to the matching stream depending on whether it contains "yinzhengjie".
        // An explicit charset avoids depending on the platform default encoding.
        if (line.contains("yinzhengjie")) {
            yinzhengjie.write(line.getBytes(StandardCharsets.UTF_8));
        } else {
            other.write(line.getBytes(StandardCharsets.UTF_8));
        }
    }

    /**
     * Releases resources.
     */
    @Override
    public void close(TaskAttemptContext context) throws IOException, InterruptedException {
        IOUtils.closeStream(yinzhengjie);
        IOUtils.closeStream(other);
    }
}

3>. MyOutputFormat.java

package cn.org.yinzhengjie.outputformat;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class MyOutputFormat extends FileOutputFormat<LongWritable, Text> {

    @Override
    public RecordWriter<LongWritable, Text> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
        MyRecordWriter myRecordWriter = new MyRecordWriter();

        // Pass the job context to the custom RecordWriter so that it can read the job configuration
        myRecordWriter.initialize(job);
        return myRecordWriter;
    }
}

4>. OutputFormatDriver.java

package cn.org.yinzhengjie.outputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class OutputFormatDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Get a Job instance. No Mapper or Reducer is set, so the default identity
        // implementations pass the (LongWritable, Text) records straight through.
        Job job = Job.getInstance(new Configuration());

        // Locate the job jar through the Driver class
        job.setJarByClass(OutputFormatDriver.class);

        // Set the OutputFormat class
        job.setOutputFormatClass(MyOutputFormat.class);

        // Set the input path
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        // Set the output directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for it; the return value is a boolean
        boolean result = job.waitForCompletion(true);

        // Report whether the job succeeded
        if (result) {
            System.out.println("Task executed successfully!!!");
        } else {
            System.out.println("Task execution failed...");
        }

        // Exit with 0 if the job ran normally, otherwise with 1
        System.exit(result ? 0 : 1);
    }
}
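The driver reads args[0] and args[1] directly, so running it with too few arguments fails with an ArrayIndexOutOfBoundsException. One possible guard is sketched below. The class `DriverArgsSketch`, its `parse` method, and the usage message are hypothetical additions, not part of the original driver.

```java
/** Hypothetical helper that validates the driver's two command-line arguments. */
final class DriverArgsSketch {

    /** Returns {inputPath, outputDir}, or throws IllegalArgumentException if the arguments are malformed. */
    static String[] parse(String[] args) {
        if (args == null || args.length != 2) {
            throw new IllegalArgumentException("Usage: OutputFormatDriver <input path> <output dir>");
        }
        if (args[0].trim().isEmpty() || args[1].trim().isEmpty()) {
            throw new IllegalArgumentException("Input path and output directory must not be empty");
        }
        return new String[] {args[0], args[1]};
    }

    public static void main(String[] args) {
        // Example values only; a real run would pass the actual command-line arguments.
        String[] paths = parse(new String[] {"website.txt", "outputFormat"});
        System.out.println("input=" + paths[0] + ", output=" + paths[1]);
    }
}
```

A driver could call such a helper at the top of main() and use the returned values when building the two Path objects.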