InputFormat.class
getSplits
Note: this is the org.apache.hadoop.mapreduce package
Logically split the set of input files for the job. (a logical split)
List<InputSplit> getSplits
RecordReader
/**
* Logically split the set of input files for the job.
*
* <p>Each {@link InputSplit} is then assigned to an individual {@link Mapper}
* for processing.</p>
*
* <p><i>Note</i>: The split is a <i>logical</i> split of the inputs and the
* input files are not physically split into chunks. For e.g. a split could
* be <i><input-file-path, start, offset></i> tuple. The InputFormat
* also creates the {@link RecordReader} to read the {@link InputSplit}.
*
* @param context job configuration.
* @return an array of {@link InputSplit}s for the job.
*/
InputFormat
List<InputSplit> getSplits
----> each InputSplit is handed to one Mapper for processing
Logically split the set of input files for the job
Logical: an InputSplit is not a real physical entity
Physical: a Block really exists on disk
RecordReader
One RecordReader is created per InputSplit
/**
* Called once at initialization.
* @param split the split that defines the range of records to read
* @param context the information about the task
*/
public abstract void initialize(InputSplit split,
                                TaskAttemptContext context
                                ) throws IOException, InterruptedException;
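The lifecycle above (initialize once per split, then iterate records) can be sketched without any Hadoop dependency. MiniLineRecordReader and its List-based "split" below are illustrative stand-ins for the real RecordReader/InputSplit API, not the actual classes:

```java
import java.util.Arrays;
import java.util.List;

// Toy sketch of the RecordReader lifecycle: initialize(split) once,
// then call nextKeyValue() until it returns false.
public class MiniLineRecordReader {
    private List<String> lines;   // stands in for the bytes of one InputSplit
    private int pos = -1;

    void initialize(List<String> split) {
        this.lines = split;
        this.pos = -1;
    }

    boolean nextKeyValue() {
        return ++pos < lines.size();
    }

    // LineRecordReader's real key is a byte offset; a line index keeps this simple.
    long getCurrentKey() { return pos; }

    String getCurrentValue() { return lines.get(pos); }

    public static void main(String[] args) {
        MiniLineRecordReader reader = new MiniLineRecordReader();
        reader.initialize(Arrays.asList("hello", "world"));
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentKey() + "\t" + reader.getCurrentValue());
        }
    }
}
```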
TextInputFormat.class
getSplits
/**
* Generate the list of files and make them into FileSplits.
* @param job the job context
* @throws IOException
*/
public List<InputSplit> getSplits(JobContext job)
Input data ===> InputFormat
---> how many InputSplits?
---> MapTask
---> RecordReader
---> LineRecordReader
When the files are small:
one .txt file produces exactly one split... two files produce two splits
If no InputFormat is set in the Driver, the default is TextInputFormat
Job ------> JobSubmitter -----> InputFormat ------>
In local mode, blockSize is 32 MB
TextInputFormat has no getSplits method of its own, so the parent class's (abstract class FileInputFormat) implementation is used
By default, splitSize is the same as blockSize
bytesRemaining
long bytesRemaining = length;
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
  int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
  splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                       blkLocations[blkIndex].getHosts(),
                       blkLocations[blkIndex].getCachedHosts()));
  bytesRemaining -= splitSize;
}
SPLIT_SLOP
private static final double SPLIT_SLOP = 1.1; // 10% slop
There is a 1.1x (10%) slack
That means even a 129 MB file (default blockSize 128 MB) produces only 1 split, since 129/128 ≈ 1.008 < 1.1
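The SPLIT_SLOP behavior can be checked with a plain-Java sketch of FileInputFormat's loop. SplitSlopDemo is an illustrative simplification (it tracks only split lengths, not offsets or block hosts), not the real Hadoop code:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSlopDemo {
    private static final double SPLIT_SLOP = 1.1; // same 10% slack as FileInputFormat

    // Returns the split lengths that the getSplits loop would produce for one file.
    static List<Long> computeSplits(long length, long splitSize) {
        List<Long> splits = new ArrayList<>();
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(splitSize);
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits.add(bytesRemaining); // the tail becomes its own (smaller) split
        }
        return splits;
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        // 129 MB with a 128 MB splitSize: 129/128 ≈ 1.008 < 1.1, so only ONE split
        System.out.println(computeSplits(129 * MB, 128 * MB).size()); // 1
        // 200 MB: 200/128 ≈ 1.56 > 1.1, so a 128 MB split plus a 72 MB tail = 2 splits
        System.out.println(computeSplits(200 * MB, 128 * MB).size()); // 2
    }
}
```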
NLineInputFormat.class
default
public List<InputSplit> getSplits(JobContext job)
    throws IOException {
  List<InputSplit> splits = new ArrayList<InputSplit>();
  int numLinesPerSplit = getNumLinesPerSplit(job);
  for (FileStatus status : listStatus(job)) {
    splits.addAll(getSplitsForFile(status,
        job.getConfiguration(), numLinesPerSplit));
  }
  return splits;
}
If the number of lines per split is not set, numLinesPerSplit defaults to 1
Set on one end (the Driver), read back out on the other
setNumLinesPerSplit
// set N lines per split
NLineInputFormat.setNumLinesPerSplit(job,5);
job.setInputFormatClass(NLineInputFormat.class);
//job.setInputFormatClass(TextInputFormat.class);
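With setNumLinesPerSplit(job, 5) as above, the split count for a file follows a simple ceiling rule: ceil(totalLines / N). The helper below is a hypothetical illustration of that arithmetic, not part of the NLineInputFormat API:

```java
public class NLineSplitCount {
    // Hypothetical helper: how many splits NLineInputFormat produces for one file
    // with totalLines lines when each split holds numLinesPerSplit lines.
    static int numSplits(long totalLines, int numLinesPerSplit) {
        // ceiling division: the last split may hold fewer than N lines
        return (int) ((totalLines + numLinesPerSplit - 1) / numLinesPerSplit);
    }

    public static void main(String[] args) {
        System.out.println(numSplits(11, 5)); // 3 splits: lines 1-5, 6-10, 11
        System.out.println(numSplits(10, 1)); // default N=1 -> one split per line
    }
}
```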
On the number of splits:
you can see they are added one at a time; no fixed count is defined up front