MapReduce Types and Formats (Part 2): Input Formats

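This article takes a closer look at MapReduce input formats. It covers InputSplit, RecordReader, and FileInputFormat; explains how to handle small files, prevent files from being split, and control the split size; and surveys the available input formats, such as TextInputFormat, KeyValueTextInputFormat, and SequenceFileInputFormat. It also discusses reading data from a database and using MultipleInputs.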

Input Splits and Records

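Each map task processes one split. A split is divided into records, each record is a key-value pair, and the map function processes those pairs one at a time.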
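To make the split-to-record relationship concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It assumes line-oriented text, with the byte offset of each line as the key and the line content as the value, which mirrors how TextInputFormat presents records. The class `RecordSketch`, its `Record` type, and the `readRecords` method are hypothetical names used only for illustration; they are not part of the Hadoop API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class RecordSketch {
    /** One record: the byte offset of a line (key) and the line text (value). */
    public static final class Record {
        public final long key;
        public final String value;

        public Record(long key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Breaks the bytes of a single split into line records. */
    public static List<Record> readRecords(byte[] split) {
        List<Record> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= split.length; i++) {
            boolean atEnd = (i == split.length);
            // Emit a record at each newline, and at the end if a final
            // line has no trailing newline.
            if (atEnd ? start < i : split[i] == '\n') {
                String line = new String(split, start, i - start, StandardCharsets.UTF_8);
                records.add(new Record(start, line));
                start = i + 1;
            }
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] split = "hello world\nfoo bar\n".getBytes(StandardCharsets.UTF_8);
        for (Record r : readRecords(split)) {
            // Stand-in for the map function, which is called once per record.
            System.out.println(r.key + "\t" + r.value);
        }
    }
}
```

In a real job this work is done by a RecordReader supplied by the InputFormat, so the map function only ever sees the key-value pairs.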

public abstract class InputSplit {
  /**
   * Get the size of the split, so that the input splits can be sorted by size.
   * @return the number of bytes in the split
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract long getLength() throws IOException, InterruptedException;

  /**
   * Get the list of nodes by name where the data for the split would be local.
   * The locations do not need to be serialized.
   * 
   * @return a new array of the node nodes.
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract 
    String[] getLocations() throws IOException, InterruptedException;
  
  /**
   * Gets info about which nodes the input split is stored on and how it is
   * stored at each location.
   * 
   * @return list of <code>SplitLocationInfo</code>s describing how the split
   *    data is stored at each location. A null value indicates that all the
   *    locations have the data stored on disk.
   * @throws IOException
   */
  @Evolving
  public SplitLocationInfo[] getLocationInfo() throws IOException {
    return null;
  }
}

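1) An InputSplit has a length in bytes and a set of storage locations.

2) An InputSplit does not contain the input data; it is a reference to the data.

3) The storage locations let the framework place each map task as close as possible to the split's data. The length is used to order the splits so that the largest ones are processed first.

4) InputSplits are created by an InputFormat, so application code does not need to work with InputSplit directly.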
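The points above can be sketched in plain Java without a Hadoop dependency. `SimpleSplit` is a hypothetical stand-in for an InputSplit: it holds only a path, a length, and host names, never the data itself. `largestFirst` is likewise a hypothetical helper that illustrates ordering splits by size; it is not the scheduler's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitOrderSketch {
    /** Hypothetical stand-in for an InputSplit: a reference to data, not the data. */
    public static final class SimpleSplit {
        private final String path;
        private final long length;
        private final String[] locations;

        public SimpleSplit(String path, long length, String... locations) {
            this.path = path;
            this.length = length;
            this.locations = locations.clone();
        }

        public String getPath() { return path; }

        /** Size of the split in bytes. */
        public long getLength() { return length; }

        /** Host names where the split's data is local. */
        public String[] getLocations() { return locations.clone(); }
    }

    /** Returns a copy of the splits ordered from largest to smallest. */
    public static List<SimpleSplit> largestFirst(List<SimpleSplit> splits) {
        List<SimpleSplit> sorted = new ArrayList<>(splits);
        sorted.sort((a, b) -> Long.compare(b.getLength(), a.getLength()));
        return sorted;
    }

    public static void main(String[] args) {
        // File names and host names below are made up for the example.
        List<SimpleSplit> splits = Arrays.asList(
            new SimpleSplit("part-a", 64L, "node1", "node2"),
            new SimpleSplit("part-b", 128L, "node2", "node3"),
            new SimpleSplit("part-c", 32L, "node1"));
        for (SimpleSplit s : largestFirst(splits)) {
            System.out.println(s.getPath() + " " + s.getLength()
                + " " + Arrays.toString(s.getLocations()));
        }
    }
}
```

Starting with the largest splits is a greedy way to shorten the overall job: the longest-running tasks begin early and the small ones fill in the remaining capacity.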

public abstract class InputFormat<K, V> {

  /** 
   * Logically split the set of input files for the job.  
   * 
   * <p>Each {@link InputSplit} is then assigned to an individual {@link Mapper}
   * for processing.</p>
   *
   * <p><i>Note</i>: The split is a <i>logical</i> split of the inputs and the
   * input files are not physic