Hadoop Big Data Source Code: Input and Output of the Reduce Phase


In the merge phase discussed earlier, the final merge.close (that is, MergeManagerImpl.close) returns a RawKeyValueIterator. Let us revisit a few lines of Shuffle.run:
public RawKeyValueIterator run() throws IOException, InterruptedException {
  ...
  RawKeyValueIterator kvIter = null;
  try {
    kvIter = merger.close();
  } catch (Throwable e) {
    throw new ShuffleError("Error while doing final merge " , e);
  }
  ...
  return kvIter;
}
An iterator is simply a sequence, one that can be processed item by item in a for or while loop. The KV pairs emitted by the Mappers, after going through sorting, combining, and merging, form a linear sequence, namely the RawKeyValueIterator returned by run. In the Hadoop code, classes such as SequenceFile.MergeQueue, MapTask.MRResultIterator, and Merger.MergeQueue implement the RawKeyValueIterator interface; the one used here is Merger.MergeQueue. It is this sequence that serves as the Reducer's input in the reduce phase, except that the input is no longer individual KV pairs: each key now comes with a whole series of values.
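To make the shape of this interface concrete, here is a minimal sketch (not Hadoop source; the class name RawIterDemo, the helper drainIterator, and the omitted deserialization step are placeholders) of how a consumer walks a RawKeyValueIterator: next() advances the sequence, and getKey()/getValue() expose the serialized bytes of the current pair.

import java.io.IOException;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.mapred.RawKeyValueIterator;

public class RawIterDemo {
  // Walk the merged key/value sequence; the bytes in each DataInputBuffer
  // still have to be deserialized with the job's configured key/value classes.
  static void drainIterator(RawKeyValueIterator kvIter) throws IOException {
    while (kvIter.next()) {
      DataInputBuffer keyBytes = kvIter.getKey();
      DataInputBuffer valueBytes = kvIter.getValue();
      // ... deserialize keyBytes/valueBytes here ...
    }
    kvIter.close();
  }
}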
For this reason the Reducer's input is not nearly as varied as the Mapper's InputFormat; its form is very regular, because the input is generated by Hadoop itself. Note that the Reducer's input types may differ from the Mapper's input types, but they must match the Mapper's output types.
The most common destination for the Reducer's output is a file, so setting an OutputFormat is necessary; another possibility is writing to a database. Take a look at this part of OutputFormat:
public abstract class OutputFormat<K, V> {

  public abstract RecordWriter<K, V>
      getRecordWriter(TaskAttemptContext context) throws IOException, InterruptedException;

  public abstract void checkOutputSpecs(JobContext context)
      throws IOException, InterruptedException;

  public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context)
      throws IOException, InterruptedException;
}
This is an abstract class whose three methods are all abstract. On top of it, Hadoop provides a number of concrete output formats:
public abstract class FileOutputFormat<K, V> extends OutputFormat<K, V>
public class SequenceFileOutputFormat<K, V> extends FileOutputFormat<K, V>
public class TextOutputFormat<K, V> extends FileOutputFormat<K, V>
public class FilterOutputFormat<K, V> extends OutputFormat<K, V>
public class DBOutputFormat<K extends DBWritable, V> extends OutputFormat<K, V>
public class NullOutputFormat<K, V> extends OutputFormat<K, V>
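Which of these is used is decided by the job configuration. Below is a minimal sketch, assuming text output to an HDFS path (the path /out, the class name OutputFormatSetup, and the key/value classes are placeholders), of how a job selects its OutputFormat:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputFormatSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "reduce-output-demo");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // TextOutputFormat writes each reduce output record as a "key<TAB>value" line.
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("/out"));
  }
}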

Next, runNewReducer in ReduceTask creates the RecordWriter of DBOutputFormat, namely DBRecordWriter:
private <INKEY,INVALUE,OUTKEY,OUTVALUE>
void runNewReducer(JobConf job,
                   final TaskUmbilicalProtocol umbilical,
                   final TaskReporter reporter,
                   RawKeyValueIterator rIter,
                   RawComparator<INKEY> comparator,
                   Class<INKEY> keyClass,
                   Class<INVALUE> valueClass
                   ) throws IOException, InterruptedException,
                            ClassNotFoundException {
  // wrap value iterator to report progress.
  final RawKeyValueIterator rawIter = rIter;
  rIter = new RawKeyValueIterator() {
    public void close() throws IOException {
      rawIter.close();
    }
    public DataInputBuffer getKey() throws IOException {
      return rawIter.getKey();
    }
    public Progress getProgress() {
      return rawIter.getProgress();
    }
    public DataInputBuffer getValue() throws IOException {
      return rawIter.getValue();
    }
    public boolean next() throws IOException {
      boolean ret = rawIter.next();
      reporter.setProgress(rawIter.getProgress().getProgress());
      return ret;
    }
  };
  // make a task context so we can get the classes
  org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
      new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job,
          getTaskID(), reporter);
  // make a reducer
  org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE> reducer =
      (org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>)
        ReflectionUtils.newInstance(taskContext.getReducerClass(), job);
  org.apache.hadoop.mapreduce.RecordWriter<OUTKEY,OUTVALUE> trackedRW =
      new NewTrackingRecordWriter<OUTKEY, OUTVALUE>(this, taskContext);
Here we need to look at NewTrackingRecordWriter, which is where the RecordWriter is built. Its constructor source is as follows:
NewTrackingRecordWriter(ReduceTask reduce,
    org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
    throws InterruptedException, IOException {
  this.outputRecordCounter = reduce.reduceOutputCounter;
  this.fileOutputByteCounter = reduce.fileOutputByteCounter;

  List<Statistics> matchedStats = null;
  if (reduce.outputFormat instanceof org.apache.hadoop.mapreduce.lib.output.FileOutputFormat) {
    matchedStats = getFsStatistics(org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
        .getOutputPath(taskContext), taskContext.getConfiguration());
  }

  fsStats = matchedStats;

  long bytesOutPrev = getOutputBytes(fsStats);
  this.real = (org.apache.hadoop.mapreduce.RecordWriter<K, V>) reduce.outputFormat
      .getRecordWriter(taskContext);

The getRecordWriter here is called on outputFormat, but the abstract OutputFormat has no concrete implementation, so we have to look at one of its subclasses. We will again take DBOutputFormat as the example; its source reads:
public RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
    throws IOException {
  DBConfiguration dbConf = new DBConfiguration(context.getConfiguration());
  // build a DBConfiguration object from the job configuration
  String tableName = dbConf.getOutputTableName();
  // the name of the output table
  String[] fieldNames = dbConf.getOutputFieldNames();
  // the names of the output fields
  if (fieldNames == null) {
    fieldNames = new String[dbConf.getOutputFieldCount()];
    // only the field count is configured, so allocate an array of that size
  }

  try {
    Connection connection = dbConf.getConnection();
    // establish the database connection
Let us look at the getConnection source here:
public Connection getConnection()
    throws ClassNotFoundException, SQLException {

  Class.forName(conf.get(DBConfiguration.DRIVER_CLASS_PROPERTY));

  if (conf.get(DBConfiguration.USERNAME_PROPERTY) == null) {
    return DriverManager.getConnection(
             conf.get(DBConfiguration.URL_PROPERTY));
  } else {
    return DriverManager.getConnection(
        conf.get(DBConfiguration.URL_PROPERTY),
        conf.get(DBConfiguration.USERNAME_PROPERTY),
        conf.get(DBConfiguration.PASSWORD_PROPERTY));
    // read the URL, username and password and open a connection to the database server
  }
}
Returning to getRecordWriter, the connection is then used to prepare the INSERT statement:

    PreparedStatement statement = null;

    statement = connection.prepareStatement(
                  constructQuery(tableName, fieldNames));
    // constructQuery builds the SQL text, from which the PreparedStatement is created
    return new DBRecordWriter(connection, statement);
    // create the DBRecordWriter; once returned it is wrapped into a
    // NewTrackingRecordWriter, i.e. the trackedRW object shown earlier
  } catch (Exception ex) {
    throw new IOException(ex.getMessage());
  }
}
Back in the NewTrackingRecordWriter constructor, the output byte counter is updated once the real RecordWriter has been obtained:

  long bytesOutCurr = getOutputBytes(fsStats);
  fileOutputByteCounter.increment(bytesOutCurr - bytesOutPrev);
}
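The driver class, URL, username and password that getConnection reads back are normally put into the job configuration beforehand via DBConfiguration.configureDB. A minimal sketch, assuming a MySQL driver on the classpath; the URL, user, password and class name DbConnectionSetup below are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;

public class DbConnectionSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Stores the driver class, URL, username and password under the
    // DBConfiguration property names that getConnection reads back.
    DBConfiguration.configureDB(conf,
        "com.mysql.jdbc.Driver",
        "jdbc:mysql://localhost:3306/demo",
        "demoUser",
        "demoPassword");
    Job job = Job.getInstance(conf, "db-output-demo");
  }
}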
Returning to the runNewReducer source:
job.setBoolean("mapred.skip.on", isSkipping());
job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
org.apache.hadoop.mapreduce.Reducer.Context
    reducerContext = createReduceContext(reducer, job, getTaskID(), rIter,
                         reduceInputKeyCounter, reduceInputValueCounter,
                         trackedRW, committer, reporter, comparator,
                         keyClass, valueClass);
// Here the Reducer's Context is created; the source of createReduceContext is as follows:
protected static <INKEY,INVALUE,OUTKEY,OUTVALUE>
org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context
createReduceContext(org.apache.hadoop.mapreduce.Reducer
                      <INKEY,INVALUE,OUTKEY,OUTVALUE> reducer,
                    Configuration job,
                    org.apache.hadoop.mapreduce.TaskAttemptID taskId,
                    RawKeyValueIterator rIter,
                    org.apache.hadoop.mapreduce.Counter inputKeyCounter,
                    org.apache.hadoop.mapreduce.Counter inputValueCounter,
                    org.apache.hadoop.mapreduce.RecordWriter<OUTKEY,OUTVALUE> output,
                    org.apache.hadoop.mapreduce.OutputCommitter committer,
                    org.apache.hadoop.mapreduce.StatusReporter reporter,
                    RawComparator<INKEY> comparator,
                    Class<INKEY> keyClass, Class<INVALUE> valueClass
                    ) throws IOException, InterruptedException {
  org.apache.hadoop.mapreduce.ReduceContext<INKEY, INVALUE, OUTKEY, OUTVALUE>
    reduceContext =
      new ReduceContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, taskId,
          rIter, inputKeyCounter, inputValueCounter, output, committer,
          reporter, comparator, keyClass, valueClass);
  // the output parameter here is trackedRW

  org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context
    reducerContext =
      new WrappedReducer<INKEY, INVALUE, OUTKEY, OUTVALUE>().getReducerContext(
          reduceContext);

  return reducerContext;
  // the ReduceContextImpl created here is the context used inside reducer.run
}
try {
  reducer.run(reducerContext);
  // this invokes the Reducer's run method, whose source is:
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  try {
    while (context.nextKey()) {
      reduce(context.getCurrentKey(), context.getValues(), context);
      // If a back up store is used, reset it
      Iterator<VALUEIN> iter = context.getValues().iterator();
      // iterate over the values of the current key
      if (iter instanceof ReduceContext.ValueIterator) {
        ((ReduceContext.ValueIterator<VALUEIN>)iter).resetBackupStore();
      }
    }
  } finally {
    cleanup(context);
  }
}
What does the default reduce implementation do?
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context
                      ) throws IOException, InterruptedException {
  for (VALUEIN value : values) {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}
In other words, reduce emits its output by calling context.write. To know which write this resolves to, we need to find out what the context actually is. As noted above, this context is in effect a ReduceContextImpl object, and ReduceContextImpl extends TaskInputOutputContextImpl, so the write being called is TaskInputOutputContextImpl.write, whose source is:
public void write(KEYOUT key, VALUEOUT value
                  ) throws IOException, InterruptedException {
  output.write(key, value);
}
Returning to runNewReducer, the call to reducer.run sits inside a try/finally that closes the RecordWriter when the Reducer is done:
  } finally {
    trackedRW.close(reducerContext);
  }
}
What we care about most is what the output inside the Reducer's context actually is while it runs, because that RecordWriter is where the Reducer's output gets written. Since we are using the OutputFormat subclass DBOutputFormat, this RecordWriter is a NewTrackingRecordWriter wrapping a DBRecordWriter, which is the trackedRW variable in the code. trackedRW is passed as one of the arguments to createReduceContext and ends up as the output field of TaskInputOutputContextImpl, i.e. of ReduceContextImpl. This is the most fundamental preparation for running the Reducer: the last link of the data flow is now in place. The reason DBRecordWriter is wrapped into a NewTrackingRecordWriter is, as the name says, tracking, i.e. monitoring the progress of the program.
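To see the whole chain from the user's side, here is a minimal sketch of a user Reducer (a WordCount-style sum reducer; the class name SumReducer is just an example): every context.write call goes through TaskInputOutputContextImpl.write to output, i.e. to trackedRW, and from there to the RecordWriter of the configured OutputFormat (DBRecordWriter in our example).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    // Goes through TaskInputOutputContextImpl.write to trackedRW's RecordWriter.
    context.write(key, new IntWritable(sum));
  }
}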
Before the Reducer runs, the connection to the database server has to be established, much like opening a file first, except that setting up a database connection is certainly more involved than opening a file. Besides establishing the connection, preparation is also needed for the INSERT statement used later: constructQuery builds the SQL statement text. Although the database operation here is a write, an INSERT, we still habitually call it a query. Let us look at the constructQuery source.
Code path:
runNewReducer -> NewTrackingRecordWriter -> DBOutputFormat.getRecordWriter -> DBOutputFormat.constructQuery
The source is as follows:
public String constructQuery(String table, String[] fieldNames) {
  if (fieldNames == null) {
    // no field names were configured at all
    throw new IllegalArgumentException("Field names may not be null");
  }

  StringBuilder query = new StringBuilder();
  query.append("INSERT INTO ").append(table);
  // so far the statement reads: INSERT INTO tableName

  if (fieldNames.length > 0 && fieldNames[0] != null) {
    // field names are available
    query.append(" (");
    for (int i = 0; i < fieldNames.length; i++) {
      query.append(fieldNames[i]);
      if (i != fieldNames.length - 1) {
        query.append(",");
      }
    }
    query.append(")");
    // so far: INSERT INTO tableName (name,age,address)
  }
  query.append(" VALUES (");

  for (int i = 0; i < fieldNames.length; i++) {
    query.append("?");
    if (i != fieldNames.length - 1) {
      query.append(",");
    }
  }
  query.append(");");
  // final result: INSERT INTO tableName (name,age,address) VALUES (?,?,?);

  return query.toString();
}
The whole process manufactures a statement of the form INSERT INTO tableName (...) VALUES (...); and this string is then used as the basis of a concrete SQL statement for the target database. On top of the database connection and the SQL statement, the DBRecordWriter object is created. Back in runNewReducer, once all of this preparation is done and the pieces are assembled into a reducerContext, which is actually a ReduceContextImpl object, Reducer.run can be called.
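For completeness, here is a minimal sketch of how a job wires up DBOutputFormat so that constructQuery has a table and field names to work with; the table person, the columns name, age, address, and the class name DbOutputSetup are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

public class DbOutputSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "db-output-demo");
    job.setOutputFormatClass(DBOutputFormat.class);
    // Records the output table and field names that getRecordWriter reads back.
    DBOutputFormat.setOutput(job, "person", "name", "age", "address");
    // With these settings constructQuery produces:
    // INSERT INTO person (name,age,address) VALUES (?,?,?);
  }
}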
