Hadoop Big Data Source Code: Input and Output of the Reduce Phase


In the merge phase discussed earlier, the final merge.close (that is, MergeManagerImpl.close) returns a RawKeyValueIterator. Let us revisit a few lines of Shuffle.run:
public RawKeyValueIterator run() throws IOException, InterruptedException {
  ...
  RawKeyValueIterator kvIter = null;
  try {
    kvIter = merger.close();
  } catch (Throwable e) {
    throw new ShuffleError("Error while doing final merge " , e);
  }
  ...
  return kvIter;
}
An iterator is simply a sequence, one that can be processed item by item in a for or while loop. The KV pairs emitted by the Mappers, after going through sorting, combining, and merging, form a linear sequence, namely the RawKeyValueIterator returned by run. In the Hadoop code, classes such as SequenceFile.MergeQueue, MapTask.MRResultIterator, and Merger.MergeQueue implement the RawKeyValueIterator interface; the one used here is Merger.MergeQueue. It is this sequence that serves as the Reducer's input in the reduce phase, except that the input is no longer individual KV pairs: each key now comes with a whole series of values.
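To make the shape of this interface concrete, here is a minimal sketch (not Hadoop source; the class name RawIterDemo, the helper drainIterator, and the omitted deserialization step are placeholders) of how a consumer walks a RawKeyValueIterator: next() advances the sequence, and getKey()/getValue() expose the serialized bytes of the current pair.

import java.io.IOException;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.mapred.RawKeyValueIterator;

public class RawIterDemo {
  // Walk the merged key/value sequence; the bytes in each DataInputBuffer
  // still have to be deserialized with the job's configured key/value classes.
  static void drainIterator(RawKeyValueIterator kvIter) throws IOException {
    while (kvIter.next()) {
      DataInputBuffer keyBytes = kvIter.getKey();
      DataInputBuffer valueBytes = kvIter.getValue();
      // ... deserialize keyBytes/valueBytes here ...
    }
    kvIter.close();
  }
}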
For this reason the Reducer's input is not nearly as varied as the Mapper's InputFormat; its form is very regular, because the input is generated by Hadoop itself. Note that the Reducer's input types may differ from the Mapper's input types, but they must match the Mapper's output types.
The most common destination for the Reducer's output is a file, so setting an OutputFormat is necessary; another possibility is writing to a database. Take a look at this part of OutputFormat:
public abstract class OutputFormat<K, V> {

  public abstract RecordWriter<K, V>
      getRecordWriter(TaskAttemptContext context) throws IOException, InterruptedException;

  public abstract void checkOutputSpecs(JobContext context)
      throws IOException, InterruptedException;

  public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context)
      throws IOException, InterruptedException;
}
This is an abstract class whose three methods are all abstract. On top of it, Hadoop provides a number of concrete output formats:
public abstract class FileOutputFormat<K, V> extends OutputFormat<K, V>
public class SequenceFileOutputFormat<K, V> extends FileOutputFormat<K, V>
public class TextOutputFormat<K, V> extends FileOutputFormat<K, V>
public class FilterOutputFormat<K, V> extends OutputFormat<K, V>
public class DBOutputFormat<K extends DBWritable, V> extends OutputFormat<K, V>
public class NullOutputFormat<K, V> extends OutputFormat<K, V>
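Which of these is used is decided by the job configuration. Below is a minimal sketch, assuming text output to an HDFS path (the path /out, the class name OutputFormatSetup, and the key/value classes are placeholders), of how a job selects its OutputFormat:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputFormatSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "reduce-output-demo");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // TextOutputFormat writes each reduce output record as a "key<TAB>value" line.
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("/out"));
  }
}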

Next, runNewReducer in ReduceTask creates the RecordWriter of DBOutputFormat, namely DBRecordWriter:
private <INKEY,INVALUE,OUTKEY,OUTVALUE>
void runNewReducer(JobConf job,
                   final TaskUmbilicalProtocol umbilical,
                   final TaskReporter reporter,
                   RawKeyValueIterator rIter,
                   RawComparator<INKEY> comparator,
                   Class<INKEY> keyClass,
                   Class<INVALUE> valueClass
                   ) throws IOException, InterruptedException,
                            ClassNotFoundException {
  // wrap value iterator to report progress.
  final RawKeyValueIterator rawIter = rIter;
  rIter = new RawKeyValueIterator() {
    public void close() throws IOException {
      rawIter.close();
    }
    public DataInputBuffer getKey() throws IOException {
      return rawIter.getKey();
    }
    public Progress getProgress() {
      return rawIter.getProgress();
    }
    public DataInputBuffer getValue() throws IOException {
      return rawIter.getValue();
    }
    public boolean next() throws IOException {
      boolean ret = rawIter.next();
      reporter.setProgress(rawIter.getProgress().getProgress());
      return ret;
    }
  };
  // make a task context so we can get the classes
  org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
      new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job,
          getTaskID(), reporter);
  // make a reducer
  org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE> reducer =
      (org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>)
        ReflectionUtils.newInstance(taskContext.getReducerClass(), job);
  org.apache.hadoop.mapreduce.RecordWriter<OUTKEY,OUTVALUE> trackedRW =
      new NewTrackingRecordWriter<OUTKEY, OUTVALUE>(this, taskContext);
Here we need to look at NewTrackingRecordWriter, which is where the RecordWriter is built. Its constructor source is as follows:
NewTrackingRecordWriter(ReduceTask reduce,
    org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
    throws InterruptedException, IOException {
  this.outputRecordCounter = reduce.reduceOutputCounter;
  this.fileOutputByteCounter = reduce.fileOutputByteCounter;

  List<Statistics> matchedStats = null;
  if (reduce.outputFormat instanceof org.apache.hadoop.mapreduce.lib.output.FileOutputFormat) {
    matchedStats = getFsStatistics(org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
        .getOutputPath(taskContext), taskContext.getConfiguration());
  }

  fsStats = matchedStats;

  long bytesOutPrev = getOutputBytes(fsStats);
  this.real = (org.apache.hadoop.mapreduce.RecordWriter<K, V>) reduce.outputFormat
      .getRecordWriter(taskContext);

The getRecordWriter here is called on outputFormat, but the abstract OutputFormat has no concrete implementation, so we have to look at one of its subclasses. We will again take DBOutputFormat as the example; its source reads:
public RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
    throws IOException {
  DBConfiguration dbConf = new DBConfiguration(context.getConfiguration());
  // build a DBConfiguration object from the job configuration
  String tableName = dbConf.getOutputTableName();
  // the name of the output table
  String[] fieldNames = dbConf.getOutputFieldNames();
  // the names of the output fields
  if (fieldNames == null) {
    fieldNames = new String[dbConf.getOutputFieldCount()];
    // only the field count is configured, so allocate an array of that size
  }

  try {
    Connection connection = dbConf.getConnection();
    // establish the database connection
Let us look at the getConnection source here:
public Connection getConnection()
    throws ClassNotFoundException, SQLException {

  Class.forName(conf.get(DBConfiguration.DRIVER_CLASS_PROPERTY));

  if (conf.get(DBConfiguration.USERNAME_PROPERTY) == null) {
    return DriverManager.getConnection(
             conf.get(DBConfiguration.URL_PROPERTY));
  } else {
    return DriverManager.getConnection(
        conf.get(DBConfiguration.URL_PROPERTY),
        conf.get(DBConfiguration.USERNAME_PROPERTY),
        conf.get(DBConfiguration.PASSWORD_PROPERTY));
    // read the URL, username and password and open a connection to the database server
  }
}
Returning to getRecordWriter, the connection is then used to prepare the INSERT statement:

    PreparedStatement statement = null;

    statement = connection.prepareStatement(
                  constructQuery(tableName, fieldNames));
    // constructQuery builds the SQL text, from which the PreparedStatement is created
    return new DBRecordWriter(connection, statement);
    // create the DBRecordWriter; once returned it is wrapped into a
    // NewTrackingRecordWriter, i.e. the trackedRW object shown earlier
  } catch (Exception ex) {
    throw new IOException(ex.getMessage());
  }
}
Back in the NewTrackingRecordWriter constructor, the output byte counter is updated once the real RecordWriter has been obtained:

  long bytesOutCurr = getOutputBytes(fsStats);
  fileOutputByteCounter.increment(bytesOutCurr - bytesOutPrev);
}
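The driver class, URL, username and password that getConnection reads back are normally put into the job configuration beforehand via DBConfiguration.configureDB. A minimal sketch, assuming a MySQL driver on the classpath; the URL, user, password and class name DbConnectionSetup below are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;

public class DbConnectionSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Stores the driver class, URL, username and password under the
    // DBConfiguration property names that getConnection reads back.
    DBConfiguration.configureDB(conf,
        "com.mysql.jdbc.Driver",
        "jdbc:mysql://localhost:3306/demo",
        "demoUser",
        "demoPassword");
    Job job = Job.getInstance(conf, "db-output-demo");
  }
}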
Returning to the runNewReducer source:
job.setBoolean("mapred.skip.on", isSkipping());
job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
org.apache.hadoop.mapreduce.Reducer.Context
    reducerContext = createReduceContext(reducer, job, getTaskID(), rIter,
                         reduceInputKeyCounter, reduceInputValueCounter,
                         trackedRW, committer, reporter, comparator,
                         keyClass, valueClass);
// Here the Reducer's Context is created; the source of createReduceContext is as follows:
protected static <INKEY,INVALUE,OUTKEY,OUTVALUE>
org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context
createReduceContext(org.apache.hadoop.mapreduce.Reducer
                      <INKEY,INVALUE,OUTKEY,OUTVALUE> reducer,
                    Configuration job,
                    org.apache.hadoop.mapreduce.TaskAttemptID taskId,
                    RawKeyValueIterator rIter,
                    org.apache.hadoop.mapreduce.Counter inputKeyCounter,
                    org.apache.hadoop.mapreduce.Counter inputValueCounter,
                    org.apache.hadoop.mapreduce.RecordWriter<OUTKEY,OUTVALUE> output,
                    org.apache.hadoop.mapreduce.OutputCommitter committer,
                    org.apache.hadoop.mapreduce.StatusReporter reporter,
                    RawComparator<INKEY> comparator,
                    Class<INKEY> keyClass, Class<INVALUE> valueClass
                    ) throws IOException, InterruptedException {
  org.apache.hadoop.mapreduce.ReduceContext<INKEY, INVALUE, OUTKEY, OUTVALUE>
    reduceContext =
      new ReduceContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, taskId,
          rIter, inputKeyCounter, inputValueCounter, output, committer,
          reporter, comparator, keyClass, valueClass);
  // the output parameter here is trackedRW

  org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context
    reducerContext =
      new WrappedReducer<INKEY, INVALUE, OUTKEY, OUTVALUE>().getReducerContext(
          reduceContext);

  return reducerContext;
  // the ReduceContextImpl created here is the context used inside reducer.run
}
try {
  reducer.run(reducerContext);
  // this invokes the Reducer's run method, whose source is:
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  try {
    while (context.nextKey()) {
      reduce(context.getCurrentKey(), context.getValues(), context);
      // If a back up store is used, reset it
      Iterator<VALUEIN> iter = context.getValues().iterator();
      // iterate over the values of the current key
      if (iter instanceof ReduceContext.ValueIterator) {
        ((ReduceContext.ValueIterator<VALUEIN>)iter).resetBackupStore();
      }
    }
  } finally {
    cleanup(context);
  }
}
What does the default reduce implementation do?
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context
                      ) throws IOException, InterruptedException {
  for (VALUEIN value : values) {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}
In other words, reduce emits its output by calling context.write. To know which write this resolves to, we need to find out what the context actually is. As noted above, this context is in effect a ReduceContextImpl object, and ReduceContextImpl extends TaskInputOutputContextImpl, so the write being called is TaskInputOutputContextImpl.write, whose source is:
public void write(KEYOUT key, VALUEOUT value
                  ) throws IOException, InterruptedException {
  output.write(key, value);
}
Returning to runNewReducer, the call to reducer.run sits inside a try/finally that closes the RecordWriter when the Reducer is done:
  } finally {
    trackedRW.close(reducerContext);
  }
}
What we care about most is what the output inside the Reducer's context actually is while it runs, because that RecordWriter is where the Reducer's output gets written. Since we are using the OutputFormat subclass DBOutputFormat, this RecordWriter is a NewTrackingRecordWriter wrapping a DBRecordWriter, which is the trackedRW variable in the code. trackedRW is passed as one of the arguments to createReduceContext and ends up as the output field of TaskInputOutputContextImpl, i.e. of ReduceContextImpl. This is the most fundamental preparation for running the Reducer: the last link of the data flow is now in place. The reason DBRecordWriter is wrapped into a NewTrackingRecordWriter is, as the name says, tracking, i.e. monitoring the progress of the program.
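To see the whole chain from the user's side, here is a minimal sketch of a user Reducer (a WordCount-style sum reducer; the class name SumReducer is just an example): every context.write call goes through TaskInputOutputContextImpl.write to output, i.e. to trackedRW, and from there to the RecordWriter of the configured OutputFormat (DBRecordWriter in our example).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    // Goes through TaskInputOutputContextImpl.write to trackedRW's RecordWriter.
    context.write(key, new IntWritable(sum));
  }
}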
Before the Reducer runs, the connection to the database server has to be established, much like opening a file first, except that setting up a database connection is certainly more involved than opening a file. Besides establishing the connection, preparation is also needed for the INSERT statement used later: constructQuery builds the SQL statement text. Although the database operation here is a write, an INSERT, we still habitually call it a query. Let us look at the constructQuery source.
Code path:
runNewReducer -> NewTrackingRecordWriter -> DBOutputFormat.getRecordWriter -> DBOutputFormat.constructQuery
The source is as follows:
public String constructQuery(String table, String[] fieldNames) {
  if (fieldNames == null) {
    // no field names were configured at all
    throw new IllegalArgumentException("Field names may not be null");
  }

  StringBuilder query = new StringBuilder();
  query.append("INSERT INTO ").append(table);
  // so far the statement reads: INSERT INTO tableName

  if (fieldNames.length > 0 && fieldNames[0] != null) {
    // field names are available
    query.append(" (");
    for (int i = 0; i < fieldNames.length; i++) {
      query.append(fieldNames[i]);
      if (i != fieldNames.length - 1) {
        query.append(",");
      }
    }
    query.append(")");
    // so far: INSERT INTO tableName (name,age,address)
  }
  query.append(" VALUES (");

  for (int i = 0; i < fieldNames.length; i++) {
    query.append("?");
    if (i != fieldNames.length - 1) {
      query.append(",");
    }
  }
  query.append(");");
  // final result: INSERT INTO tableName (name,age,address) VALUES (?,?,?);

  return query.toString();
}
The whole process manufactures a statement of the form INSERT INTO tableName (...) VALUES (...); and this string is then used as the basis of a concrete SQL statement for the target database. On top of the database connection and the SQL statement, the DBRecordWriter object is created. Back in runNewReducer, once all of this preparation is done and the pieces are assembled into a reducerContext, which is actually a ReduceContextImpl object, Reducer.run can be called.
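For completeness, here is a minimal sketch of how a job wires up DBOutputFormat so that constructQuery has a table and field names to work with; the table person, the columns name, age, address, and the class name DbOutputSetup are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

public class DbOutputSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "db-output-demo");
    job.setOutputFormatClass(DBOutputFormat.class);
    // Records the output table and field names that getRecordWriter reads back.
    DBOutputFormat.setOutput(job, "person", "name", "age", "address");
    // With these settings constructQuery produces:
    // INSERT INTO person (name,age,address) VALUES (?,?,?);
  }
}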
