MapReduce -- OutputFormat Explained and Implementing a Custom OutputFormat
1. OutputFormat Source Code Analysis
- MapReduce OutputFormat generally writes job output to files or to a database; this post analyzes the commonly used OutputFormat classes
- The OutputFormat source code:
/**
* <code>OutputFormat</code> describes the output-specification for a
* Map-Reduce job.
*
* <p>The Map-Reduce framework relies on the <code>OutputFormat</code> of the
* job to:<p>
* <ol>
* <li>
* Validate the output-specification of the job. For e.g. check that the
* output directory doesn't already exist.
* <li>
* Provide the {@link RecordWriter} implementation to be used to write out
* the output files of the job. Output files are stored in a
* {@link FileSystem}.
* </li>
* </ol>
*
* @see RecordWriter
*/
@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class OutputFormat<K, V> {
/**
* Get the {@link RecordWriter} for the given task.
*
* @param context the information about the current task.
* @return a {@link RecordWriter} to write the output for the job.
* @throws IOException
*/
public abstract RecordWriter<K, V>
getRecordWriter(TaskAttemptContext context
) throws IOException, InterruptedException;
/**
* Check for validity of the output-specification for the job.
*
* <p>This is to validate the output specification for the job when it is
* a job is submitted. Typically checks that it does not already exist,
* throwing an exception when it already exists, so that output is not
* overwritten.</p>
*
* @param context information about the job
* @throws IOException when output should not be attempted
*/
public abstract void checkOutputSpecs(JobContext context
) throws IOException,
InterruptedException;
/**
* Get the output committer for this output format. This is responsible
* for ensuring the output is committed correctly.
* @param context the task context
* @return an output committer
* @throws IOException
* @throws InterruptedException
*/
public abstract
OutputCommitter getOutputCommitter(TaskAttemptContext context
) throws IOException, InterruptedException;
}
- Implementing a custom OutputFormat also means providing a RecordWriter implementation
- The checkOutputSpecs method of OutputFormat validates the output specification: it checks that an output path has been set and whether that path already exists (a sketch of this check appears at the end of this section)
- The RecordWriter abstract class:
/**
* <code>RecordWriter</code> writes the output <key, value> pairs
* to an output file.
* <p><code>RecordWriter</code> implementations write the job outputs to the
* {@link FileSystem}.
*
* @see OutputFormat
*/
@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class RecordWriter<K, V> {
/**
* Writes a key/value pair.
*
* @param key the key to write.
* @param value the value to write.
* @throws IOException
*/
public abstract void write(K key, V value
) throws IOException, InterruptedException;
/**
* Close this <code>RecordWriter</code> to future operations.
*
* @param context the context of the task
* @throws IOException
*/
public abstract void close(TaskAttemptContext context
) throws IOException, InterruptedException;
}
- To implement a custom RecordWriter, only write and close need to be implemented
- Below we put this into practice with custom implementations that write output to files and to a database
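- As referenced above, here is a minimal sketch of the kind of validation checkOutputSpecs performs, assuming a file-based output. This is not Hadoop's actual implementation and the class name is hypothetical; FileOutputFormat already ships equivalent logic, which is why we simply extend it in the next section.
package com.xk.bigata.hadoop.mapreduce.outputformat;
import org.apache.hadoop.fs.FileAlreadyExistsException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class CheckOutputSpecsSketch {
    // Validate that an output path is set and does not exist yet
    public static void checkOutputSpecs(JobContext context) throws IOException {
        Path outDir = FileOutputFormat.getOutputPath(context);
        if (outDir == null) {
            throw new IOException("Output directory not set.");
        }
        FileSystem fs = outDir.getFileSystem(context.getConfiguration());
        if (fs.exists(outDir)) {
            // Fail fast so previous results are never silently overwritten
            throw new FileAlreadyExistsException("Output directory " + outDir + " already exists");
        }
    }
}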
2 Custom File Output
2.1 Requirements
Compute the total number of visits for each domain
1. Write the MapReduce results for the domain www.baidu.com to the file www.baidu.com.log
2. Write the MapReduce results for the domain www.qq.com to the file www.qq.com.log
2.2 Data
domain,visits
www.baidu.com,10
www.qq.com,9
www.baidu.com,7
www.qq.com,10
www.qq.com,23
www.baidu.com,6
www.qq.com,12
www.qq.com,24
www.baidu.com,9
2.3 Code
2.3.1 MyFileOutputFormatDriver Code
package com.xk.bigata.hadoop.mapreduce.outputformat;
import com.xk.bigata.hadoop.utils.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class MyFileOutputFormatDriver {
public static void main(String[] args) throws Exception {
String input = "mapreduce-basic/data/domain.data";
String output = "mapreduce-basic/out";
// 1 Create the MapReduce job
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// Delete the output path if it already exists
FileUtils.deleteFile(job.getConfiguration(), output);
// 2 Set the main class
job.setJarByClass(MyFileOutputFormatDriver.class);
// 3 Set the Mapper and Reducer classes
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
// 4 Set the Mapper output KEY and VALUE types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// 5 Set the Reducer output KEY and VALUE types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// 6 Set the input and output paths
FileInputFormat.setInputPaths(job, new Path(input));
FileOutputFormat.setOutputPath(job, new Path(output));
// Set the custom OutputFormat
job.setOutputFormatClass(MyFileOutputFormat.class);
// 7 Submit the job
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
}
public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] splits = value.toString().split(",");
context.write(new Text(splits[0]), new IntWritable(Integer.parseInt(splits[1])));
}
}
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
context.write(key, new IntWritable(sum));
}
}
}
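- The FileUtils.deleteFile helper imported above is the author's own utility and is not shown in this post; below is a minimal sketch, assuming it simply deletes the output path recursively so the job can be re-run.
package com.xk.bigata.hadoop.utils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.IOException;
public class FileUtils {
    // Hypothetical reconstruction: delete the output path recursively if it
    // exists, so the job's output-specification check passes on re-runs
    public static void deleteFile(Configuration conf, String output) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path outputPath = new Path(output);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true); // true = recursive
        }
    }
}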
2.3.2 MyFileOutputFormat Code
- Since we are writing to files, we can simply extend the FileOutputFormat class; there is then no need to implement checkOutputSpecs and getOutputCommitter ourselves
package com.xk.bigata.hadoop.mapreduce.outputformat;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class MyFileOutputFormat extends FileOutputFormat<Text, IntWritable> {
@Override
public RecordWriter<Text, IntWritable> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
return new MyRecordWriter(job);
}
}
2.3.3 MyRecordWriter Code
package com.xk.bigata.hadoop.mapreduce.outputformat;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import java.io.IOException;
public class MyRecordWriter extends RecordWriter<Text, IntWritable> {
FileSystem fs = null;
FSDataOutputStream baiduOut = null;
FSDataOutputStream qqOut = null;
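// One dedicated output stream per domain; the hard-coded paths live under the driver's output directory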
public MyRecordWriter(TaskAttemptContext job) {
try {
fs = FileSystem.get(job.getConfiguration());
baiduOut = fs.create(new Path("mapreduce-basic/out/www.baidu.com.log"));
qqOut = fs.create(new Path("mapreduce-basic/out/www.qq.com.log"));
} catch (IOException e) {
    // Fail fast: without the output streams the task cannot proceed
    throw new RuntimeException(e);
}
}
@Override
public void write(Text key, IntWritable value) throws IOException, InterruptedException {
String domain = key.toString();
if (domain.equals("www.baidu.com")) {
    // Append a newline so each record lands on its own line
    baiduOut.write((domain + "\t" + value.toString() + "\n").getBytes());
} else if (domain.equals("www.qq.com")) {
    qqOut.write((domain + "\t" + value.toString() + "\n").getBytes());
}
}
@Override
public void close(TaskAttemptContext context) throws IOException, InterruptedException {
if (null != baiduOut) {
IOUtils.closeStream(baiduOut);
}
if (null != qqOut) {
IOUtils.closeStream(qqOut);
}
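// Note: FileSystem.get() returns a cached, shared instance, so closing it here
// is fine for this local demo but could affect other tasks sharing it on a cluster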
if (null != fs) {
fs.close();
}
}
}
2.4 Results
2.4.1 www.baidu.com.log
www.baidu.com 32
2.4.2 www.qq.com.log
www.qq.com 78
3 DBOutputFormat
3.1 Requirements
Run a word count with MapReduce
Write the MapReduce results into MySQL
3.2 Data
hadoop,spark,flink
hbase,hadoop,spark,flink
spark
hadoop
hadoop,spark,flink
hbase,hadoop,spark,flink
spark
hadoop
hbase,hadoop,spark,flink
3.3 DDL
CREATE TABLE `wc` (
`word` varchar(100) DEFAULT NULL,
`cnt` int(11) DEFAULT NULL
);
3.4 Code
3.4.1 MysqlWordCountDoamin Code
package com.xk.bigata.hadoop.mapreduce.domain;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
public class MysqlWordCountDoamin implements Writable, DBWritable {
private String word;
private int cnt;
public MysqlWordCountDoamin() {
}
public MysqlWordCountDoamin(String word, int cnt) {
this.word = word;
this.cnt = cnt;
}
@Override
public String toString() {
return "MysqlWordCountDoamin{" +
"word='" + word + '\'' +
", cnt=" + cnt +
'}';
}
public String getWord() {
return word;
}
public void setWord(String word) {
this.word = word;
}
public int getCnt() {
return cnt;
}
public void setCnt(int cnt) {
this.cnt = cnt;
}
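// Writable half: how the MapReduce framework serializes this object (e.g. when shuffled)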
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(word);
out.writeInt(cnt);
}
@Override
public void readFields(DataInput in) throws IOException {
word = in.readUTF();
cnt = in.readInt();
}
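// DBWritable half: binds the fields to the INSERT statement parameters, in the
// column order given to DBOutputFormat.setOutput ("word", "cnt")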
@Override
public void write(PreparedStatement statement) throws SQLException {
statement.setString(1, word);
statement.setInt(2, cnt);
}
@Override
public void readFields(ResultSet resultSet) throws SQLException {
word = resultSet.getString(1);
cnt = resultSet.getInt(2);
}
}
3.4.2 MysqlDBOutputFormat Code
package com.xk.bigata.hadoop.mapreduce.outputformat;
import com.xk.bigata.hadoop.mapreduce.domain.MysqlWordCountDoamin;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import java.io.IOException;
public class MysqlDBOutputFormat {
public static void main(String[] args) throws Exception {
String input = "mapreduce-basic/data/wc.txt";
// 1 Create the MapReduce job
Configuration conf = new Configuration();
// Configure the JDBC connection (the MySQL JDBC driver jar must be on the classpath)
DBConfiguration.configureDB(conf,
"com.mysql.jdbc.Driver",
"jdbc:mysql://bigdatatest01:3306/bigdata",
"root",
"Jgw@31500");
Job job = Job.getInstance(conf);
// 2 Set the main class
job.setJarByClass(MysqlDBOutputFormat.class);
// 3 Set the Mapper and Reducer classes
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
// 4 Set the Mapper output KEY and VALUE types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// 5 Set the Reducer output KEY and VALUE types
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(MysqlWordCountDoamin.class);
// Set the output format class
job.setOutputFormatClass(DBOutputFormat.class);
// 6 Set the input path and the output table
FileInputFormat.setInputPaths(job, new Path(input));
DBOutputFormat.setOutput(job, "wc", "word", "cnt");
// 7 Submit the job
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
}
public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
IntWritable ONE = new IntWritable(1);
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] splits = value.toString().split(",");
for (String word : splits) {
context.write(new Text(word), ONE);
}
}
}
public static class MyReducer extends Reducer<Text, IntWritable, NullWritable, MysqlWordCountDoamin> {
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int count = 0;
for (IntWritable value : values) {
count += value.get();
}
MysqlWordCountDoamin mysqlWordCountDoamin = new MysqlWordCountDoamin(key.toString(), count);
context.write(NullWritable.get(), mysqlWordCountDoamin);
}
}
}
3.5 Error
java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.io.NullWritable cannot be cast to org.apache.hadoop.mapreduce.lib.db.DBWritable
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:556)
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.NullWritable cannot be cast to org.apache.hadoop.mapreduce.lib.db.DBWritable
at org.apache.hadoop.mapreduce.lib.db.DBOutputFormat$DBRecordWriter.write(DBOutputFormat.java:66)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:558)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
at com.xk.bigata.hadoop.mapreduce.outputformat.MysqlDBOutputFormat$MyReducer.reduce(MysqlDBOutputFormat.java:81)
at com.xk.bigata.hadoop.mapreduce.outputformat.MysqlDBOutputFormat$MyReducer.reduce(MysqlDBOutputFormat.java:72)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:346)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Solution
- The Reducer's output KEY was set to NullWritable, but DBOutputFormat casts the key to DBWritable when writing, so the write fails with the ClassCastException above
- The fix is to swap them: make the Reducer's KEY the custom DBWritable type and its VALUE NullWritable, as shown below
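- The abridged source of org.apache.hadoop.mapreduce.lib.db.DBOutputFormat (paraphrased here; imports, fields, and setup omitted) shows why: the RecordWriter binds the KEY, not the VALUE, to the PreparedStatement, so the KEY type must be the DBWritable.
public class DBOutputFormat<K extends DBWritable, V> extends OutputFormat<K, V> {
    public class DBRecordWriter extends RecordWriter<K, V> {
        private PreparedStatement statement;
        @Override
        public void write(K key, V value) throws IOException {
            try {
                key.write(statement); // only the KEY is written to the statement
                statement.addBatch();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
        // close() executes the batch and commits (omitted here)
    }
}
- The corrected driver: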
package com.xk.bigata.hadoop.mapreduce.outputformat;
import com.xk.bigata.hadoop.mapreduce.domain.MysqlWordCountDoamin;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import java.io.IOException;
public class MysqlDBOutputFormat {
public static void main(String[] args) throws Exception {
String input = "mapreduce-basic/data/wc.txt";
// 1 Create the MapReduce job
Configuration conf = new Configuration();
// Configure the JDBC connection
DBConfiguration.configureDB(conf,
"com.mysql.jdbc.Driver",
"jdbc:mysql://bigdatatest01:3306/bigdata",
"root",
"Jgw@31500");
Job job = Job.getInstance(conf);
// 2 Set the main class
job.setJarByClass(MysqlDBOutputFormat.class);
// 3 Set the Mapper and Reducer classes
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
// 4 Set the Mapper output KEY and VALUE types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// 5 Set the Reducer output KEY and VALUE types
job.setOutputKeyClass(MysqlWordCountDoamin.class);
job.setOutputValueClass(NullWritable.class);
// Set the output format class
job.setOutputFormatClass(DBOutputFormat.class);
// 6 Set the input path and the output table
FileInputFormat.setInputPaths(job, new Path(input));
DBOutputFormat.setOutput(job, "wc", "word", "cnt");
// 7 Submit the job
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
}
public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
IntWritable ONE = new IntWritable(1);
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] splits = value.toString().split(",");
for (String word : splits) {
context.write(new Text(word), ONE);
}
}
}
public static class MyReducer extends Reducer<Text, IntWritable, MysqlWordCountDoamin, NullWritable> {
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int count = 0;
for (IntWritable value : values) {
count += value.get();
}
MysqlWordCountDoamin mysqlWordCountDoamin = new MysqlWordCountDoamin(key.toString(), count);
context.write(mysqlWordCountDoamin, NullWritable.get());
}
}
}
3.6 Results
word,cnt
flink,5
hadoop,7
hbase,3
spark,7