Big Data - Hadoop (12): MapReduce 05, Core Framework Principles, Source Code (2): The Split Mechanism

1. Split Source Code

Class: JobSubmitter

1.1 Preparation for Splitting

1.

JobStatus submitJobInternal(Job job, Cluster cluster) 
throws ClassNotFoundException, InterruptedException, IOException {

  ……
 
  int maps = writeSplits(job, submitJobDir);

  ……
}

2. This call produces the split information and thus determines the number of map tasks:

int maps = writeSplits(job, submitJobDir);

3. Choosing the new-API way of splitting:

private int writeSplits(org.apache.hadoop.mapreduce.JobContext job,
  Path jobSubmitDir) throws IOException,
  InterruptedException, ClassNotFoundException {
	JobConf jConf = (JobConf)job.getConfiguration();
	int maps;
	if (jConf.getUseNewMapper()) {
	  maps = writeNewSplits(job, jobSubmitDir);
	} else {
	  maps = writeOldSplits(jConf, jobSubmitDir);
	}
	return maps;
}

4. Step into the split logic: maps = writeNewSplits(job, jobSubmitDir);

private <T extends InputSplit>
  int writeNewSplits(JobContext job, Path jobSubmitDir) throws IOException,
  InterruptedException, ClassNotFoundException {
	Configuration conf = job.getConfiguration();
	InputFormat<?, ?> input =
	  ReflectionUtils.newInstance(job.getInputFormatClass(), conf);
	
	List<InputSplit> splits = input.getSplits(job);
	T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);
	
	// sort the splits into order based on size, so that the biggest
	// go first
	Arrays.sort(array, new SplitComparator());
	JobSplitWriter.createSplitFiles(jobSubmitDir, conf, 
	    jobSubmitDir.getFileSystem(conf), array);
	return array.length;
}

5. List<InputSplit> splits = input.getSplits(job);
Step into the class FileInputFormat.

Implementations of InputFormat:
DBInputFormat (reads input from a database).
FileInputFormat (file input) is the implementation used in local mode; by default the job goes through TextInputFormat, a subclass of FileInputFormat that reads records line by line.

Later we will study CombineTextInputFormat (under FileInputFormat > CombineFileInputFormat), which merges small files before splitting; a minimal driver sketch of how to select it follows.
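As a reference, here is a minimal driver sketch (the class name, paths, and the 4 MB limit are illustrative assumptions, not taken from this post) showing where the InputFormat is chosen. By default a job uses TextInputFormat; switching to CombineTextInputFormat packs many small files into one split.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(SmallFilesDriver.class);
        // Mapper/Reducer and output types omitted; this sketch only shows the InputFormat choice.

        // Default is TextInputFormat (splits computed per file, records read line by line).
        // CombineTextInputFormat instead packs small files together before splitting.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Upper bound for a combined split, in bytes (4 MB, illustrative value).
        CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}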

1.2 Core Split Code

6.

public List<InputSplit> getSplits(JobContext job) throws IOException {
    StopWatch sw = new StopWatch().start();
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);

    // generate splits
    List<InputSplit> splits = new ArrayList<InputSplit>();
    List<FileStatus> files = listStatus(job);

    boolean ignoreDirs = !getInputDirRecursive(job)
      && job.getConfiguration().getBoolean(INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS, false);
    for (FileStatus file: files) {
      if (ignoreDirs && file.isDirectory()) {
        continue;
      }
      Path path = file.getPath();
      long length = file.getLen();
      if (length != 0) {
        BlockLocation[] blkLocations;
        if (file instanceof LocatedFileStatus) {
          blkLocations = ((LocatedFileStatus) file).getBlockLocations();
        } else {
          FileSystem fs = path.getFileSystem(job.getConfiguration());
          blkLocations = fs.getFileBlockLocations(file, 0, length);
        }
        if (isSplitable(job, path)) {
          long blockSize = file.getBlockSize();
          long splitSize = computeSplitSize(blockSize, minSize, maxSize);

          long bytesRemaining = length;
          while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
            int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
            splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                        blkLocations[blkIndex].getHosts(),
                        blkLocations[blkIndex].getCachedHosts()));
            bytesRemaining -= splitSize;
          }

          if (bytesRemaining != 0) {
            int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
            splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                       blkLocations[blkIndex].getHosts(),
                       blkLocations[blkIndex].getCachedHosts()));
          }
        } else { // not splitable
          if (LOG.isDebugEnabled()) {
            // Log only if the file is big enough to be splitted
            if (length > Math.min(file.getBlockSize(), minSize)) {
              LOG.debug("File is not splittable so no parallelization "
                  + "is possible: " + file.getPath());
            }
          }
          splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
                      blkLocations[0].getCachedHosts()));
        }
      } else { 
        //Create empty hosts array for zero length files
        splits.add(makeSplit(path, 0, length, new String[0]));
      }
    }
    // Save the number of input files for metrics/loadgen
    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
    sw.stop();
    if (LOG.isDebugEnabled()) {
      LOG.debug("Total # of splits generated by getSplits: " + splits.size()
          + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
    }
    return splits;
}

1.2.1 Getting minSize (default: 1)

7. Get minSize; the default is 1. It is the maximum of the two values described in steps 8 and 9:

long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));

8. Default value 1:

getFormatMinSplitSize()

9. Default value 0: the value comes from the configuration (mapred-default.xml sets it to 0, see step 12); the 1L in the code is only a fallback used when the key is missing.
Steps 10 to 12 trace this lookup:

getMinSplitSize(job)

10. Read SPLIT_MINSIZE from the configuration; if it is not set, fall back to 1:

/**
 * Get the minimum split size
 * @param job the job
 * @return the minimum number of bytes that can be in a split
 */
public static long getMinSplitSize(JobContext job) {
  return job.getConfiguration().getLong(SPLIT_MINSIZE, 1L);
}

11.

public static final String SPLIT_MINSIZE =
    "mapreduce.input.fileinputformat.split.minsize";

12. In mapred-default.xml the default value is 0:

<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>0</value>
  <description>The minimum size chunk that map input should be split
  into.  Note that some file formats may have minimum split sizes that
  take priority over this setting.</description>
</property>

1.2.2 Getting maxSize (default: Long.MAX_VALUE)

13.

long maxSize = getMaxSplitSize(job);

14.

/**
 * Get the maximum split size.
 * @param context the job to look at.
 * @return the maximum number of bytes a split can include
 */
public static long getMaxSplitSize(JobContext context) {
  return context.getConfiguration().getLong(SPLIT_MAXSIZE, 
                                            Long.MAX_VALUE);
}

SPLIT_MAXSIZE is not set anywhere, so the second argument is used: Long.MAX_VALUE, the largest long value.

15.

public static final String SPLIT_MAXSIZE =
    "mapreduce.input.fileinputformat.split.maxsize";

16.
There is no entry for it in mapred-default.xml, so maxSize stays at Long.MAX_VALUE. A sketch of how these two knobs are usually set from a driver follows.
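A minimal sketch (the helper class and the 256 MB / 64 MB values are assumptions for illustration, not from this post) of how minSize and maxSize can be set from driver code instead of mapred-site.xml; normally you would set only one of the two.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    // Applies the two split-size knobs discussed in 1.2.1 and 1.2.2 to a job.
    public static void tune(Job job) {
        // Raising minSize above the block size makes each split LARGER than a block.
        // Writes mapreduce.input.fileinputformat.split.minsize.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB

        // Lowering maxSize below the block size makes each split SMALLER than a block.
        // Writes mapreduce.input.fileinputformat.split.maxsize.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
    }
}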

1.2.3 Iterating over the Input Files

17.

for (FileStatus file: files) {
  if (ignoreDirs && file.isDirectory()) {
    continue;
  }
  Path path = file.getPath();
  long length = file.getLen();
  if (length != 0) {
    BlockLocation[] blkLocations;
    if (file instanceof LocatedFileStatus) {
      blkLocations = ((LocatedFileStatus) file).getBlockLocations();
    } else {
      FileSystem fs = path.getFileSystem(job.getConfiguration());
      blkLocations = fs.getFileBlockLocations(file, 0, length);
    }
    if (isSplitable(job, path)) {
      long blockSize = file.getBlockSize();
      long splitSize = computeSplitSize(blockSize, minSize, maxSize);

      long bytesRemaining = length;
      while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
        int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
        splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                    blkLocations[blkIndex].getHosts(),
                    blkLocations[blkIndex].getCachedHosts()));
        bytesRemaining -= splitSize;
      }

      if (bytesRemaining != 0) {
        int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
        splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                   blkLocations[blkIndex].getHosts(),
                   blkLocations[blkIndex].getCachedHosts()));
      }
    } else { // not splitable
      if (LOG.isDebugEnabled()) {
        // Log only if the file is big enough to be splitted
        if (length > Math.min(file.getBlockSize(), minSize)) {
          LOG.debug("File is not splittable so no parallelization "
              + "is possible: " + file.getPath());
        }
      }
      splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
                  blkLocations[0].getCachedHosts()));
    }
  } else { 
    //Create empty hosts array for zero length files
    splits.add(makeSplit(path, 0, length, new String[0]));
  }
}

The loop walks through every file in the input directory; each file is visited exactly once.
This also shows that splitting is done per file: each file is handled independently.

18. Get the file path:

Path path = file.getPath();

19. Get the file length:

long length = file.getLen();

20.

if (isSplitable(job, path)) {
  long blockSize = file.getBlockSize();
  long splitSize = computeSplitSize(blockSize, minSize, maxSize);

  long bytesRemaining = length;
  while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
    int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
    splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                blkLocations[blkIndex].getHosts(),
                blkLocations[blkIndex].getCachedHosts()));
    bytesRemaining -= splitSize;
  }

If the file is splittable, the split logic continues. If it is a compressed file in a format that does not support splitting (covered in detail later when we discuss compression), the whole file becomes a single split; there is no point in cutting it into two pieces.

21. Get the block size blockSize: 33554432 B; in local mode the default is 32 MB
(33554432 / 1024 / 1024 = 32).

long blockSize = file.getBlockSize();

22. Logic: take the middle value of blockSize (33554432), minSize (1), and maxSize (Long.MAX_VALUE):

long splitSize = computeSplitSize(blockSize, minSize, maxSize);

The resulting split size splitSize is 32 MB.
To adjust the split size you usually change minSize (raising it makes splits larger); the clamp shown below explains why.
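The clamp inside FileInputFormat#computeSplitSize (comments added here) is what the "middle value" rule refers to:

// Clamp blockSize between minSize and maxSize, i.e. take the middle of the three values.
// With the defaults (minSize = 1, maxSize = Long.MAX_VALUE) this simply returns blockSize,
// which is why split size == block size unless one of the two knobs is changed.
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}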

23. While the remaining bytes / split size > 1.1, the split logic keeps cutting; once the ratio is 1.1 or less, the remainder is kept as a single split.
For example: 33 MB / 32 MB < 1.1, so the file is not cut and is treated as one split.
If it were cut, there would be two pieces of 32 MB and 1 MB, creating a small-file problem that hurts later computation. A small simulation of this rule follows the constant below.

private static final double SPLIT_SLOP = 1.1;
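A standalone simulation (not a Hadoop API; the 33 MB and 150 MB figures are the examples used in this post and in the summary table) that reproduces the while/if logic from getSplits:

public class SplitSlopDemo {
    private static final double SPLIT_SLOP = 1.1;

    // Reproduces the loop from FileInputFormat#getSplits for a single splittable file.
    static int countSplits(long length, long splitSize) {
        int splits = 0;
        long bytesRemaining = length;
        // Cut full-size splits while the remainder is more than 110% of splitSize.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits++;
            bytesRemaining -= splitSize;
        }
        // Whatever is left (at most 1.1 * splitSize) becomes one final split.
        if (bytesRemaining != 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        System.out.println(countSplits(33 * mb, 32 * mb));    // 1: 33/32 = 1.03 < 1.1, no cut
        System.out.println(countSplits(150 * mb, 128 * mb));  // 2: 150/128 = 1.17 > 1.1, cut into 128 MB + 22 MB
    }
}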

2. Summary

  1. Get the directory where the input data is stored.
  2. Iterate over the files in that directory one by one;
    each file is split independently.
  3. Take the first file:
    get the file size,
    compute the split size.
  4. long splitSize = computeSplitSize(blockSize, minSize, maxSize);
    Logic: take the middle value of blockSize (32 MB locally, 128 MB on a cluster), minSize (1), and maxSize (Long.MAX_VALUE).
  5. By default, split size = block size.
    To make splits larger, raise minSize.
    To make splits smaller, lower maxSize.
    In production splits are usually only made larger, never smaller.
  6. Start cutting splits; before each cut, check whether the remaining part is larger than 1.1 × the split size; if not, it stays as one split.

File size   Blocks   Splits
128.1 MB    2        1
150 MB      2        2

  7. Write the split information to a file.
  8. The whole core process happens inside getSplits().
  9. Each InputSplit only records split metadata, such as the start offset and the length; the data itself is never modified.
  10. The split plan is submitted to YARN.
  11. MrAppMaster starts one MapTask per split.