Hadoop: A Brief Look at the Mapper, Context, InputSplit, and FileSplit Source Code

Latest recommended article published 2023-02-08 21:00:31

zhengbiubiu

Category: Hadoop. Tags: Hadoop, Mapper, Context, InputSplit, FileSplit

Article link: https://blog.csdn.net/weixin_41967486/article/details/80259786

The Hadoop Mapper class

Source:

 
    public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

      /**
       * The <code>Context</code> passed on to the {@link Mapper} implementations.
       */
      public abstract class Context
          implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
      }

      /**
       * Called once at the beginning of the task.
       */
      protected void setup(Context context
                           ) throws IOException, InterruptedException {
        // NOTHING
      }

      /**
       * Called once for each key/value pair in the input split. Most applications
       * should override this, but the default is the identity function.
       */
      @SuppressWarnings("unchecked")
      protected void map(KEYIN key, VALUEIN value,
                         Context context) throws IOException, InterruptedException {
        context.write((KEYOUT) key, (VALUEOUT) value);
      }

      /**
       * Called once at the end of the task.
       */
      protected void cleanup(Context context
                             ) throws IOException, InterruptedException {
        // NOTHING
      }

      /**
       * Expert users can override this method for more complete control over the
       * execution of the Mapper.
       * @param context
       * @throws IOException
       */
      public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        try {
          while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
          }
        } finally {
          cleanup(context);
        }
      }
    }

The main methods are setup (called once per task), map (called once for each key/value pair), cleanup (called once per task), and run (expert users can override this method for more complete control over the execution of the Mapper).
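The lifecycle described above can be sketched without Hadoop on the classpath. The following is a minimal illustration, not Hadoop code: `SketchContext`, `SketchMapper`, and `LifecycleDemo` are hypothetical stand-ins that only mimic the shape of `Mapper.run()`, and the context handles plain strings instead of typed key/value pairs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for Mapper.Context: walks in-memory records and collects output. */
class SketchContext {
    private final Iterator<String> records;
    private String current;
    final List<String> output = new ArrayList<>();

    SketchContext(List<String> input) {
        this.records = input.iterator();
    }

    /** Advances to the next record; returns false once the input is exhausted. */
    boolean nextKeyValue() {
        if (!records.hasNext()) {
            return false;
        }
        current = records.next();
        return true;
    }

    String getCurrentValue() {
        return current;
    }

    void write(String value) {
        output.add(value);
    }
}

/** Hypothetical simplified mapper whose run() has the same structure as Mapper.run(). */
class SketchMapper {
    protected void setup(SketchContext context) {
        context.write("setup");
    }

    protected void map(String value, SketchContext context) {
        context.write(value.toUpperCase());
    }

    protected void cleanup(SketchContext context) {
        context.write("cleanup");
    }

    /** setup once, map once per record, cleanup once (even if map throws). */
    public void run(SketchContext context) {
        setup(context);
        try {
            while (context.nextKeyValue()) {
                map(context.getCurrentValue(), context);
            }
        } finally {
            cleanup(context);
        }
    }
}

class LifecycleDemo {
    /** Runs the sketch mapper over two records and returns the recorded call order. */
    static List<String> runDemo() {
        SketchContext context = new SketchContext(Arrays.asList("a", "b"));
        new SketchMapper().run(context);
        return context.output;
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // prints [setup, A, B, cleanup]
    }
}
```

Because cleanup sits in a finally block, it still runs when map throws, which is why resources opened in setup are usually released there.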

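There is also an inner class, Context (implements MapContext).

MapContext itself declares only one method (the others, such as nextKeyValue and write, are inherited from TaskInputOutputContext):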

 
    /**
     * Get the input split for this map.
     */
    public InputSplit getInputSplit();

Next, look at the InputSplit class. It is an abstract class; start with the class comment:

 
    /**
     * <code>InputSplit</code> represents the data to be processed by an
     * individual {@link Mapper}.
     *
     * <p>Typically, it presents a byte-oriented view on the input and is the
     * responsibility of {@link RecordReader} of the job to process this and present
     * a record-oriented view.

Source:

 
    public abstract class InputSplit {

      /**
       * Get the size of the split, so that the input splits can be sorted by size.
       * @return the number of bytes in the split
       * @throws IOException
       * @throws InterruptedException
       */
      public abstract long getLength() throws IOException, InterruptedException;

      /**
       * Get the list of nodes by name where the data for the split would be local.
       * The locations do not need to be serialized.
       *
       * @return a new array of the node nodes.
       * @throws IOException
       * @throws InterruptedException
       */
      public abstract
        String[] getLocations() throws IOException, InterruptedException;

      /**
       * Gets info about which nodes the input split is stored on and how it is
       * stored at each location.
       *
       * @return list of <code>SplitLocationInfo</code>s describing how the split
       *    data is stored at each location. A null value indicates that all the
       *    locations have the data stored on disk.
       * @throws IOException
       */
      @Evolving
      public SplitLocationInfo[] getLocationInfo() throws IOException {
        return null;
      }
    }

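The main methods are:

public abstract long getLength [returns the size of the split in bytes, so that splits can be sorted by size],

public abstract String[] getLocations [returns the names of the nodes where the split's data would be local],

public SplitLocationInfo[] getLocationInfo [returns information about which nodes store the split and how it is stored at each location; the default implementation returns null]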
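To make the "sorted by size" remark in the getLength comment concrete, here is a small sketch that does not depend on Hadoop. `SketchSplit`, `FixedSplit`, and `SplitSortDemo` are hypothetical names, and ordering the largest split first is an assumption chosen for illustration; the javadoc quoted above only says that splits can be sorted by size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in with the same two abstract methods as InputSplit. */
abstract class SketchSplit {
    abstract long getLength();
    abstract String[] getLocations();
}

/** Hypothetical concrete split backed by plain fields. */
class FixedSplit extends SketchSplit {
    private final long length;
    private final String[] hosts;

    FixedSplit(long length, String... hosts) {
        this.length = length;
        this.hosts = hosts;
    }

    @Override
    long getLength() {
        return length;
    }

    @Override
    String[] getLocations() {
        // Return a copy, matching the "new array" wording of the javadoc.
        return hosts.clone();
    }
}

class SplitSortDemo {
    /** Returns a new list ordered from the largest split to the smallest (assumed order). */
    static List<SketchSplit> largestFirst(List<SketchSplit> splits) {
        List<SketchSplit> sorted = new ArrayList<>(splits);
        sorted.sort((a, b) -> Long.compare(b.getLength(), a.getLength()));
        return sorted;
    }

    public static void main(String[] args) {
        List<SketchSplit> splits = Arrays.asList(
            new FixedSplit(64L, "node1"),
            new FixedSplit(128L, "node2", "node3"),
            new FixedSplit(32L, "node1"));
        for (SketchSplit split : largestFirst(splits)) {
            System.out.println(split.getLength() + " bytes on "
                + Arrays.toString(split.getLocations()));
        }
    }
}
```

The locations play no part in the ordering here; in a real job they are hints a scheduler can use to place a task near its data.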

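Press Ctrl+H to view the class hierarchy:

This shows the classes that extend InputSplit, including FileSplit (look at the one under the lib package, org.apache.hadoop.mapreduce.lib.input.FileSplit).

As usual, start with the class comment: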

 
    /** A section of an input file. Returned by {@link
     * InputFormat#getSplits(JobContext)} and passed to
     * {@link InputFormat#createRecordReader(InputSplit,TaskAttemptContext)}. */

This states where the class comes from [InputFormat#getSplits(JobContext)] and where it goes [InputFormat#createRecordReader(InputSplit,TaskAttemptContext)].

Only FileSplit itself is covered here; InputFormat is left for the next post.

 
    public class FileSplit extends InputSplit implements Writable
 

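The class has several fields, one no-argument constructor, and two constructors that take arguments; these can be ignored for now.

Setting aside several other implemented methods, the following ones are worth noting: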

 
    /** The file containing this split's data. */
    public Path getPath() { return file; }

    /** The position of the first byte in the file to process. */
    public long getStart() { return start; }

    /** The number of bytes in the file to process. */
    @Override
    public long getLength() { return length; }
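Taken together, these three getters describe a byte range [start, start + length) inside one file. The sketch below shows what consuming such a range could look like using only the JDK. `SketchFileSplit` and `SplitReadDemo` are hypothetical stand-ins (using java.nio.file.Path rather than Hadoop's Path), and the code ignores record boundaries, which a real RecordReader has to handle.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in for FileSplit: a file plus a byte range within it. */
class SketchFileSplit {
    private final Path file;
    private final long start;
    private final long length;

    SketchFileSplit(Path file, long start, long length) {
        this.file = file;
        this.start = start;
        this.length = length;
    }

    Path getPath() { return file; }
    long getStart() { return start; }
    long getLength() { return length; }
}

class SplitReadDemo {
    /**
     * Reads the bytes in [start, start + length) of the split's file.
     * readFully throws EOFException if the range runs past the end of the file.
     */
    static byte[] readSplit(SketchFileSplit split) throws IOException {
        byte[] buffer = new byte[(int) split.getLength()];
        try (RandomAccessFile raf = new RandomAccessFile(split.getPath().toFile(), "r")) {
            raf.seek(split.getStart());
            raf.readFully(buffer);
        }
        return buffer;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("split-demo", ".txt");
        try {
            Files.write(tmp, "0123456789".getBytes(StandardCharsets.US_ASCII));
            SketchFileSplit split = new SketchFileSplit(tmp, 3, 4);
            // prints 3456
            System.out.println(new String(readSplit(split), StandardCharsets.US_ASCII));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The split object itself holds no data, only the path and two numbers, so it is cheap to create and pass around.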