Split mechanism
Core code:
maps = writeNewSplits(job, jobSubmitDir); // write the job's split information to the job submission directory
List<InputSplit> splits = input.getSplits(job); // calls getSplits() in the FileInputFormat class
long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job)); // minSize == 1 by default
long maxSize = getMaxSplitSize(job); // Long.MAX_VALUE by default
List<InputSplit> splits = new ArrayList<InputSplit>(); // list that holds the generated splits
List<FileStatus> files = listStatus(job); // all files under the job's input paths
long blockSize = file.getBlockSize(); // block size of the current file
long splitSize = computeSplitSize(blockSize, minSize, maxSize); // compute the split size
return Math.max(minSize, Math.min(maxSize, blockSize)); // body of computeSplitSize(): minSize and maxSize bound splitSize around blockSize, so with the defaults splitSize == blockSize
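while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) { // SPLIT_SLOP is 1.1: keep cutting splits of splitSize while remaining bytes / splitSize > 1.1, recording each split's offset; once the ratio is 1.1 or less, the remainder becomes a single final split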
int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
splits.add(makeSplit(path, length-bytesRemaining, splitSize,
blkLocations[blkIndex].getHosts(),
blkLocations[blkIndex].getCachedHosts()));
bytesRemaining -= splitSize;
}
if (bytesRemaining != 0) {
int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
blkLocations[blkIndex].getHosts(),
blkLocations[blkIndex].getCachedHosts()));
}
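The split-size formula and the SPLIT_SLOP loop above can be tried out without a cluster. The following is only a minimal sketch in plain Java, not Hadoop code: the class name SplitSketch and the method planSplits are hypothetical, block locations and hosts are left out, and the 128 MB block size is an assumed value chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Same threshold as SPLIT_SLOP in the excerpt above.
    static final double SPLIT_SLOP = 1.1;

    // Mirrors the body of computeSplitSize() shown above.
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Hypothetical helper: returns {offset, length} pairs for one file,
    // following the same while / if structure as the excerpt.
    public static List<long[]> planSplits(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            // Whatever is left (at most 1.1 * splitSize) becomes the last split.
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed 128 MB block size with the default minSize and maxSize.
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        for (long fileMb : new long[] {130, 141, 300}) {
            StringBuilder sb = new StringBuilder(fileMb + " MB ->");
            for (long[] s : planSplits(fileMb * mb, splitSize)) {
                sb.append(" [offset=").append(s[0] / mb)
                  .append(" MB, length=").append(s[1] / mb).append(" MB]");
            }
            System.out.println(sb);
        }
    }
}
```

With these assumed sizes, a 130 MB file stays in one split because 130/128 is below 1.1, a 141 MB file becomes 128 MB + 13 MB, and a 300 MB file becomes 128 MB + 128 MB + 44 MB. The formula also shows how to tune the split size: a minSize larger than blockSize makes splits bigger than a block, and a maxSize smaller than blockSize makes them smaller.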
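InputFormat is an abstract class that defines two methods: getSplits() and createRecordReader().
FileInputFormat implements getSplits(), and its subclass TextInputFormat implements createRecordReader(), which returns a LineRecordReader, so the input is read line by line.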
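To make "read line by line" concrete, here is a rough sketch in plain Java of the kind of records a line-oriented reader produces: one record per line, keyed by the byte offset at which the line starts. It is not the LineRecordReader implementation; the names LineRecordSketch, Record and readLines are hypothetical, only '\n' is treated as a delimiter, and lines that cross split boundaries are not handled.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineRecordSketch {
    // Hypothetical stand-in for one (key, value) record:
    // the starting byte offset of a line and the line's text.
    public static final class Record {
        public final long offset;
        public final String line;

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    // Splits the bytes on '\n' and pairs each line with its starting offset.
    public static List<Record> readLines(byte[] data) {
        List<Record> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                records.add(new Record(start,
                        new String(data, start, i - start, StandardCharsets.UTF_8)));
                start = i + 1;
            }
        }
        if (start < data.length) {
            // Last line without a trailing newline.
            records.add(new Record(start,
                    new String(data, start, data.length - start, StandardCharsets.UTF_8)));
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "hello\nhadoop split\nend".getBytes(StandardCharsets.UTF_8);
        for (Record r : readLines(data)) {
            System.out.println(r.offset + "\t" + r.line);
        }
    }
}
```

For the sample input, the three records start at byte offsets 0, 6 and 19.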