1. Split source code
Class: JobSubmitter
1.1 Split preparation
1. The job submission entry point calls writeSplits():
JobStatus submitJobInternal(Job job, Cluster cluster)
throws ClassNotFoundException, InterruptedException, IOException {
// ...
int maps = writeSplits(job, submitJobDir);
// ...
}
2. This call produces the split information and the number of map tasks:
int maps = writeSplits(job, submitJobDir);
3. Choosing the new-API split path:
private int writeSplits(org.apache.hadoop.mapreduce.JobContext job,
Path jobSubmitDir) throws IOException,
InterruptedException, ClassNotFoundException {
JobConf jConf = (JobConf)job.getConfiguration();
int maps;
if (jConf.getUseNewMapper()) {
maps = writeNewSplits(job, jobSubmitDir);
} else {
maps = writeOldSplits(jConf, jobSubmitDir);
}
return maps;
}
4. Step into maps = writeNewSplits(job, jobSubmitDir);
private <T extends InputSplit>
int writeNewSplits(JobContext job, Path jobSubmitDir) throws IOException,
InterruptedException, ClassNotFoundException {
Configuration conf = job.getConfiguration();
InputFormat<?, ?> input =
ReflectionUtils.newInstance(job.getInputFormatClass(), conf);
List<InputSplit> splits = input.getSplits(job);
T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);
// sort the splits into order based on size, so that the biggest
// go first
Arrays.sort(array, new SplitComparator());
JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
jobSubmitDir.getFileSystem(conf), array);
return array.length;
}
5. List<InputSplit> splits = input.getSplits(job);
Step into the class: FileInputFormat
Notable implementations of InputFormat:
DBInputFormat (reads input from a database)
FileInputFormat (file input) is the implementation used in local mode; the default concrete class is TextInputFormat under FileInputFormat, which reads records line by line
Later we will cover CombineTextInputFormat (under CombineFileInputFormat), which merges small files before splitting
1.2 Core split code
6. FileInputFormat.getSplits():
public List<InputSplit> getSplits(JobContext job) throws IOException {
StopWatch sw = new StopWatch().start();
long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
long maxSize = getMaxSplitSize(job);
// generate splits
List<InputSplit> splits = new ArrayList<InputSplit>();
List<FileStatus> files = listStatus(job);
boolean ignoreDirs = !getInputDirRecursive(job)
&& job.getConfiguration().getBoolean(INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS, false);
for (FileStatus file: files) {
if (ignoreDirs && file.isDirectory()) {
continue;
}
Path path = file.getPath();
long length = file.getLen();
if (length != 0) {
BlockLocation[] blkLocations;
if (file instanceof LocatedFileStatus) {
blkLocations = ((LocatedFileStatus) file).getBlockLocations();
} else {
FileSystem fs = path.getFileSystem(job.getConfiguration());
blkLocations = fs.getFileBlockLocations(file, 0, length);
}
if (isSplitable(job, path)) {
long blockSize = file.getBlockSize();
long splitSize = computeSplitSize(blockSize, minSize, maxSize);
long bytesRemaining = length;
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
splits.add(makeSplit(path, length-bytesRemaining, splitSize,
blkLocations[blkIndex].getHosts(),
blkLocations[blkIndex].getCachedHosts()));
bytesRemaining -= splitSize;
}
if (bytesRemaining != 0) {
int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
blkLocations[blkIndex].getHosts(),
blkLocations[blkIndex].getCachedHosts()));
}
} else { // not splitable
if (LOG.isDebugEnabled()) {
// Log only if the file is big enough to be splitted
if (length > Math.min(file.getBlockSize(), minSize)) {
LOG.debug("File is not splittable so no parallelization "
+ "is possible: " + file.getPath());
}
}
splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
blkLocations[0].getCachedHosts()));
}
} else {
//Create empty hosts array for zero length files
splits.add(makeSplit(path, 0, length, new String[0]));
}
}
// Save the number of input files for metrics/loadgen
job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
sw.stop();
if (LOG.isDebugEnabled()) {
LOG.debug("Total # of splits generated by getSplits: " + splits.size()
+ ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
}
return splits;
}
1.2.1 Computing minSize (default 1)
7. minSize defaults to 1; it is the larger of the two values from steps 8 and 9:
long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
8. getFormatMinSplitSize() returns the constant default 1.
9. getMinSplitSize(job) returns 0 here: mapred-default.xml sets the key to 0, and the fallback value 1 applies only when the key is absent from the configuration.
Steps 10-12 trace this lookup.
getMinSplitSize(job)
10. Read SPLIT_MINSIZE from the configuration; fall back to 1 if it is not set:
/**
* Get the minimum split size
* @param job the job
* @return the minimum number of bytes that can be in a split
*/
public static long getMinSplitSize(JobContext job) {
return job.getConfiguration().getLong(SPLIT_MINSIZE, 1L);
}
11. The configuration key:
public static final String SPLIT_MINSIZE =
"mapreduce.input.fileinputformat.split.minsize";
12. In mapred-default.xml the default value is 0:
<property>
<name>mapreduce.input.fileinputformat.split.minsize</name>
<value>0</value>
<description>The minimum size chunk that map input should be split
into. Note that some file formats may have minimum split sizes that
take priority over this setting.</description>
</property>
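The interplay of the two values can be sketched in a minimal standalone demo (the class and method names below are made up for illustration; this is not Hadoop code):

```java
// Hypothetical demo of how minSize resolves to 1: getSplits() takes the max of
// getFormatMinSplitSize() (always 1) and the configured SPLIT_MINSIZE value
// (0 in mapred-default.xml; the fallback 1 applies only when the key is absent).
public class MinSizeDemo {
    // stands in for getFormatMinSplitSize()
    static long formatMinSplitSize() { return 1L; }

    // stands in for getMinSplitSize(job); `configured` is null when the key is absent
    static long minSplitSize(Long configured) {
        return configured != null ? configured : 1L;
    }

    static long resolveMinSize(Long configured) {
        return Math.max(formatMinSplitSize(), minSplitSize(configured));
    }

    public static void main(String[] args) {
        System.out.println(resolveMinSize(0L));   // default config value 0 -> prints 1
        System.out.println(resolveMinSize(256L)); // user override -> prints 256
    }
}
```

With the shipped defaults the configured 0 loses to the constant 1, which is why minSize ends up as 1.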
1.2.2 Computing maxSize (default Long.MAX_VALUE)
13.
long maxSize = getMaxSplitSize(job);
14.
/**
* Get the maximum split size.
* @param context the job to look at.
* @return the maximum number of bytes a split can include
*/
public static long getMaxSplitSize(JobContext context) {
return context.getConfiguration().getLong(SPLIT_MAXSIZE,
Long.MAX_VALUE);
}
If SPLIT_MAXSIZE is not set, the second argument Long.MAX_VALUE is used, so maxSize defaults to the largest long value.
15. The configuration key:
public static final String SPLIT_MAXSIZE =
"mapreduce.input.fileinputformat.split.maxsize";
16. This key does not appear in mapred-default.xml, so the fallback applies.
1.2.3 Iterating over the input files
17.
for (FileStatus file: files) {
if (ignoreDirs && file.isDirectory()) {
continue;
}
Path path = file.getPath();
long length = file.getLen();
if (length != 0) {
BlockLocation[] blkLocations;
if (file instanceof LocatedFileStatus) {
blkLocations = ((LocatedFileStatus) file).getBlockLocations();
} else {
FileSystem fs = path.getFileSystem(job.getConfiguration());
blkLocations = fs.getFileBlockLocations(file, 0, length);
}
if (isSplitable(job, path)) {
long blockSize = file.getBlockSize();
long splitSize = computeSplitSize(blockSize, minSize, maxSize);
long bytesRemaining = length;
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
splits.add(makeSplit(path, length-bytesRemaining, splitSize,
blkLocations[blkIndex].getHosts(),
blkLocations[blkIndex].getCachedHosts()));
bytesRemaining -= splitSize;
}
if (bytesRemaining != 0) {
int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
blkLocations[blkIndex].getHosts(),
blkLocations[blkIndex].getCachedHosts()));
}
} else { // not splitable
if (LOG.isDebugEnabled()) {
// Log only if the file is big enough to be splitted
if (length > Math.min(file.getBlockSize(), minSize)) {
LOG.debug("File is not splittable so no parallelization "
+ "is possible: " + file.getPath());
}
}
splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
blkLocations[0].getCachedHosts()));
}
} else {
//Create empty hosts array for zero length files
splits.add(makeSplit(path, 0, length, new String[0]));
}
}
The loop visits every file in the input directory exactly once, which also shows that splitting is computed per file, independently.
18. Get the file path:
Path path = file.getPath();
19. Get the file length:
long length = file.getLen();
20.
if (isSplitable(job, path)) {
long blockSize = file.getBlockSize();
long splitSize = computeSplitSize(blockSize, minSize, maxSize);
long bytesRemaining = length;
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
splits.add(makeSplit(path, length-bytesRemaining, splitSize,
blkLocations[blkIndex].getHosts(),
blkLocations[blkIndex].getCachedHosts()));
bytesRemaining -= splitSize;
}
If the file is splittable, the split logic continues. A file compressed with a non-splittable codec (compression is covered in detail later) becomes exactly one split; there is no point cutting it into two pieces.
21. Get the block size blockSize: 33554432 B; in local mode the default is 32 MB (33554432 / 1024 / 1024 = 32):
long blockSize = file.getBlockSize();
22. Logic: take the middle value of blockSize (33554432), minSize (1), and maxSize (Long.MAX_VALUE):
long splitSize = computeSplitSize(blockSize, minSize, maxSize);
The resulting split size splitSize is 32 MB, so by default split size = block size.
To make the split size larger than the block size, raise minSize; to make it smaller, lower maxSize.
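The body of computeSplitSize in FileInputFormat is the one-liner Math.max(minSize, Math.min(maxSize, blockSize)). A short sketch of how the three knobs interact (the wrapper class is hypothetical, the formula matches the Hadoop source):

```java
// The median-of-three logic behind computeSplitSize: with the defaults the
// split size equals the block size; minSize pushes it up, maxSize caps it.
public class SplitSizeDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // defaults -> split size == block size (32 MB in local mode)
        System.out.println(computeSplitSize(32 * mb, 1L, Long.MAX_VALUE) / mb); // 32
        // raising minSize above the block size enlarges the splits
        System.out.println(computeSplitSize(32 * mb, 64 * mb, Long.MAX_VALUE) / mb); // 64
        // lowering maxSize below the block size shrinks the splits
        System.out.println(computeSplitSize(32 * mb, 1L, 16 * mb) / mb); // 16
    }
}
```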
23. While remaining size / splitSize > 1.1, another split is cut; once the ratio drops to 1.1 or below, the remainder becomes a single split.
For example, 33 MB / 32 MB < 1.1, so the file is not cut and forms one split.
Cutting it would produce a 32 MB piece and a 1 MB piece, creating small files that hurt later computation.
private static final double SPLIT_SLOP = 1.1;
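The slop rule can be isolated into a split-counting sketch (a hypothetical helper that mirrors the while/if pair inside getSplits(); not Hadoop code):

```java
// Sketch of the SPLIT_SLOP loop from getSplits(): a split is cut only while
// the remainder is more than 1.1x the split size; the tail becomes one split.
public class SlopDemo {
    static final double SPLIT_SLOP = 1.1;

    static int countSplits(long length, long splitSize) {
        int splits = 0;
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            bytesRemaining -= splitSize;
            splits++;
        }
        if (bytesRemaining != 0) {
            splits++; // the tail (at most 1.1x splitSize) becomes the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(countSplits(33 * mb, 32 * mb));  // 1: 33/32 < 1.1, no cut
        System.out.println(countSplits(100 * mb, 32 * mb)); // 4: 32 + 32 + 32 + 4
    }
}
```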
2. Summary
- Get the input data directory.
- Iterate over the files in the directory; each file is split on its own:
  - Take the first file.
  - Get its size.
  - Compute the split size: long splitSize = computeSplitSize(blockSize, minSize, maxSize);
    Logic: the middle value of blockSize (32 MB locally, 128 MB on a cluster), minSize (1), and maxSize (Long.MAX_VALUE). By default, split size = block size.
    To make the split size larger, raise minSize; to make it smaller, lower maxSize. In production it is usually only increased.
  - Cut the splits: before each cut, check whether the remainder is more than 1.1x the split size; if not, the remainder counts as a single split.

| File size (MB) | HDFS blocks | Splits |
|---|---|---|
| 128.1 | 2 | 1 |
| 150 | 2 | 2 |

- Write the split plan to a file.
- The whole core process happens in getSplits().
- The resulting InputSplit objects record only split metadata, such as start offset and length; the data itself is never modified.
- The split plan is submitted to YARN.
- MRAppMaster starts one MapTask per split.
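The table above can be reproduced end to end with the pieces walked through earlier (a standalone sketch assuming the cluster default block size of 128 MB; the class and helper names are made up, but the formula and the slop loop mirror getSplits()):

```java
// End-to-end sketch: per-file split counting using computeSplitSize and the
// 1.1 SPLIT_SLOP rule, with the cluster default block size of 128 MB.
public class SplitPlanDemo {
    static final double SPLIT_SLOP = 1.1;

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static int splitsFor(long length, long blockSize, long minSize, long maxSize) {
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        int splits = 0;
        long remaining = length;
        while (((double) remaining) / splitSize > SPLIT_SLOP) {
            remaining -= splitSize;
            splits++;
        }
        if (remaining != 0) splits++;
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long block = 128 * mb;
        // 128.1 MB: stored as 2 HDFS blocks, but 128.1 / 128 < 1.1 -> 1 split
        System.out.println(splitsFor((long) (128.1 * mb), block, 1L, Long.MAX_VALUE)); // 1
        // 150 MB: 150 / 128 > 1.1 -> 2 splits (128 MB + 22 MB)
        System.out.println(splitsFor(150 * mb, block, 1L, Long.MAX_VALUE)); // 2
    }
}
```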