A brief look at the Hadoop source: Mapper, Context, InputSplit, and FileSplit

The Hadoop Mapper class

Source:
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  /**
   * The <code>Context</code> passed on to the {@link Mapper} implementations.
   */
  public abstract class Context
      implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  }

  /**
   * Called once at the beginning of the task.
   */
  protected void setup(Context context
                       ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value,
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

  /**
   * Called once at the end of the task.
   */
  protected void cleanup(Context context
                         ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }
}
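The main methods are setup (called once at the beginning of each task), map (called once for every key/value pair in the input split), cleanup (called once at the end of each task), and run (which expert users can override for more complete control over the execution of the Mapper).
There is also an inner class, Context, which implements MapContext.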
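To make the call order in run concrete, here is a minimal plain-Java sketch of the same template-method shape. MiniMapper and its String records are hypothetical stand-ins invented for this illustration; they are not Hadoop classes, and the real Mapper works against a Context rather than an Iterator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class MapperLifecycleSketch {

    /** Hypothetical stand-in for Mapper: the same four hooks, no Hadoop types. */
    static class MiniMapper {
        final List<String> log = new ArrayList<>();

        protected void setup() { log.add("setup"); }

        protected void map(String record) { log.add("map:" + record); }

        protected void cleanup() { log.add("cleanup"); }

        // Same shape as Mapper.run: setup once, map once per record,
        // cleanup in a finally block so it runs even if map throws.
        public void run(Iterator<String> records) {
            setup();
            try {
                while (records.hasNext()) {
                    map(records.next());
                }
            } finally {
                cleanup();
            }
        }
    }

    /** Runs a MiniMapper over the records and returns the order of hook calls. */
    static List<String> trace(List<String> records) {
        MiniMapper mapper = new MiniMapper();
        mapper.run(records.iterator());
        return mapper.log;
    }

    /** Same as trace, but map throws when it meets the record equal to bad. */
    static List<String> traceWithFailure(List<String> records, String bad) {
        MiniMapper mapper = new MiniMapper() {
            @Override
            protected void map(String record) {
                if (record.equals(bad)) {
                    throw new IllegalStateException("bad record: " + record);
                }
                super.map(record);
            }
        };
        try {
            mapper.run(records.iterator());
        } catch (IllegalStateException e) {
            mapper.log.add("failed");
        }
        return mapper.log;
    }

    public static void main(String[] args) {
        // prints [setup, map:a, map:b, cleanup]
        System.out.println(trace(Arrays.asList("a", "b")));
        // prints [setup, map:a, cleanup, failed]
        System.out.println(traceWithFailure(Arrays.asList("a", "x", "b"), "x"));
    }
}
```

The second call shows why cleanup sits in a finally block: it still runs when map fails partway through. Note that setup is outside the try, so in this shape a failure inside setup would skip cleanup.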

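MapContext itself declares only one method (the rest, such as write and nextKeyValue, are inherited from TaskInputOutputContext):
/**
 * Get the input split for this map.
 */
public InputSplit getInputSplit();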


Next, look at InputSplit, which is an abstract class. Start with the class comment:
/**
 * <code>InputSplit</code> represents the data to be processed by an
 * individual {@link Mapper}.
 *
 * <p>Typically, it presents a byte-oriented view on the input and is the
 * responsibility of {@link RecordReader} of the job to process this and present
 * a record-oriented view.
 */
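The byte-oriented versus record-oriented distinction in this comment can be illustrated with a small plain-Java sketch that turns a byte range of newline-delimited text into whole lines. Everything here is hypothetical: linesForSplit is not a Hadoop method, and the boundary rule (skip the partial first line unless the range starts at offset 0, and keep reading any line that begins at or before the end of the range) is an assumption of this sketch about how a line-based reader might avoid losing or duplicating records.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SplitRecordsSketch {

    /**
     * Returns the lines "owned" by the byte range [start, start + length).
     * Assumed rule: a range that does not start at 0 skips up to and
     * including the first newline, and a line is owned by the range in which
     * it begins (a line beginning exactly at the end offset is still read).
     */
    static List<String> linesForSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // The previous range is responsible for the line we landed in.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        List<String> lines = new ArrayList<>();
        while (pos <= end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\ndelta\n".getBytes(StandardCharsets.UTF_8);
        // The boundary at byte 12 falls in the middle of "gamma".
        System.out.println(linesForSplit(data, 0, 12));  // prints [alpha, beta, gamma]
        System.out.println(linesForSplit(data, 12, 11)); // prints [delta]
    }
}
```

Under this rule each line is produced by exactly one range, even though the ranges themselves cut through the middle of a record.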
Source:

public abstract class InputSplit {
  /**
   * Get the size of the split, so that the input splits can be sorted by size.
   * @return the number of bytes in the split
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract long getLength() throws IOException, InterruptedException;

  /**
   * Get the list of nodes by name where the data for the split would be local.
   * The locations do not need to be serialized.
   *
   * @return a new array of the node nodes.
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract
    String[] getLocations() throws IOException, InterruptedException;

  /**
   * Gets info about which nodes the input split is stored on and how it is
   * stored at each location.
   *
   * @return list of <code>SplitLocationInfo</code>s describing how the split
   *    data is stored at each location. A null value indicates that all the
   *    locations have the data stored on disk.
   * @throws IOException
   */
  @Evolving
  public SplitLocationInfo[] getLocationInfo() throws IOException {
    return null;
  }
}
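The main methods are:
    public abstract long getLength [returns the size of the split in bytes, so that splits can be sorted by size],
    public abstract String[] getLocations [returns the names of the nodes where the split's data would be local],
    public SplitLocationInfo[] getLocationInfo [returns information about which nodes store the split and how it is stored at each one; the default implementation returns null]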
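As a rough illustration of why getLength is exposed, the sketch below orders a few splits by size. SimpleSplit is a hypothetical stand-in for an InputSplit subclass with no Hadoop dependency, and the choice to put the largest split first is an assumption of this sketch; the Javadoc quoted above only says the size exists so that splits can be sorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitOrderingSketch {

    /** Hypothetical stand-in for an InputSplit: a size plus host names. */
    static class SimpleSplit {
        private final String name;
        private final long length;
        private final String[] locations;

        SimpleSplit(String name, long length, String... locations) {
            this.name = name;
            this.length = length;
            this.locations = locations;
        }

        long getLength() { return length; }

        // Returns a copy, mirroring the "a new array" wording in the Javadoc.
        String[] getLocations() { return locations.clone(); }

        String getName() { return name; }
    }

    /** Returns the split names ordered by size, largest first (assumed order). */
    static List<String> largestFirst(List<SimpleSplit> splits) {
        List<SimpleSplit> sorted = new ArrayList<>(splits);
        sorted.sort((a, b) -> Long.compare(b.getLength(), a.getLength()));
        List<String> names = new ArrayList<>();
        for (SimpleSplit split : sorted) {
            names.add(split.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        List<SimpleSplit> splits = Arrays.asList(
            new SimpleSplit("a", 10, "node1"),
            new SimpleSplit("b", 30, "node2", "node3"),
            new SimpleSplit("c", 20, "node1"));
        System.out.println(largestFirst(splits)); // prints [b, c, a]
    }
}
```

Scheduling the largest pieces of work first is a common way to keep one long task from finishing well after the rest.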
Press Ctrl+H to view the class hierarchy:


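The hierarchy shows the implementations of InputSplit, one of which is FileSplit (look at the one under the lib package, org.apache.hadoop.mapreduce.lib.input).
As usual, start with the class comment: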
/** A section of an input file.  Returned by {@link
 * InputFormat#getSplits(JobContext)} and passed to
 * {@link InputFormat#createRecordReader(InputSplit,TaskAttemptContext)}. */

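The comment states where instances come from [InputFormat#getSplits(JobContext)] and where they go [InputFormat#createRecordReader(InputSplit,TaskAttemptContext)].
This section covers only FileSplit itself; InputFormat is left for the next section.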
public class FileSplit extends InputSplit implements Writable
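The class has a few fields, a no-argument constructor, and two constructors that take arguments; these can be ignored for now.
It also implements several other methods that we skip here. The ones of interest are: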
/** The file containing this split's data. */
public Path getPath() { return file; }

/** The position of the first byte in the file to process. */
public long getStart() { return start; }

/** The number of bytes in the file to process. */
@Override
public long getLength() { return length; }


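As these getters show, a FileSplit also gives you the Path of the file. From the Path you can obtain the FileSystem, and from the FileSystem the input and output streams FSDataInputStream and FSDataOutputStream.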
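The following is a minimal sketch of what getPath, getStart, and getLength are for: locating a byte range inside a file. It uses a local temporary file and java.io.RandomAccessFile as stand-ins so that it runs without Hadoop; readRange and demo are hypothetical helpers, and the Path here is java.nio.file.Path, not Hadoop's Path. In Hadoop the analogous steps would be path.getFileSystem(conf), fs.open(path), and a seek to the start offset on the returned FSDataInputStream.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class SplitRangeSketch {

    /** Reads exactly length bytes beginning at offset start (local-file stand-in). */
    static byte[] readRange(Path file, long start, int length) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            in.seek(start);             // corresponds to getStart()
            byte[] buffer = new byte[length];
            in.readFully(buffer);       // corresponds to getLength() bytes
            return buffer;
        }
    }

    /** Writes a small temp file, reads bytes 4..9 back, then deletes the file. */
    static String demo() {
        try {
            Path tmp = Files.createTempFile("split-demo", ".txt");
            try {
                Files.write(tmp, "0123456789abcdef".getBytes(StandardCharsets.US_ASCII));
                return new String(readRange(tmp, 4, 6), StandardCharsets.US_ASCII);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 456789
    }
}
```

A real record reader would not stop at the raw bytes; it would go on to turn the range into records.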

That covers most of what there is to see in Mapper.





