InputFormat.class
getSplits
Note: this is the org.apache.hadoop.mapreduce package
Logically split the set of input files for the job. (a logical split)
List<InputSplit> getSplits
RecordReader
/**
* Logically split the set of input files for the job.
*
* <p>Each {@link InputSplit} is then assigned to an individual {@link Mapper}
* for processing.</p>
*
* <p><i>Note</i>: The split is a <i>logical</i> split of the inputs and the
* input files are not physically split into chunks. For e.g. a split could
* be <i><input-file-path, start, offset></i> tuple. The InputFormat
* also creates the {@link RecordReader} to read the {@link InputSplit}.
*
* @param context job configuration.
* @return an array of {@link InputSplit}s for the job.
*/
InputFormat
List<InputSplit> getSplits
----> each InputSplit is handed to one Mapper for processing
Logically split the set of input files for the job
Logical: an InputSplit is not a real physical entity
Physical: a Block really exists on disk
RecordReader
One RecordReader is created per InputSplit
/**
* Called once at initialization.
* @param split the split that defines the range of records to read
* @param context the information about the task
*/
public abstract void initialize(InputSplit split,
                                TaskAttemptContext context
                                ) throws IOException, InterruptedException;
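The lifecycle above (initialize once per split, then iterate records) can be sketched without any Hadoop dependency. MiniLineRecordReader and its List-based "split" below are illustrative stand-ins for the real RecordReader/InputSplit API, not the actual classes:

```java
import java.util.Arrays;
import java.util.List;

// Toy sketch of the RecordReader lifecycle: initialize(split) once,
// then call nextKeyValue() until it returns false.
public class MiniLineRecordReader {
    private List<String> lines;   // stands in for the bytes of one InputSplit
    private int pos = -1;

    void initialize(List<String> split) {
        this.lines = split;
        this.pos = -1;
    }

    boolean nextKeyValue() {
        return ++pos < lines.size();
    }

    // LineRecordReader's real key is a byte offset; a line index keeps this simple.
    long getCurrentKey() { return pos; }

    String getCurrentValue() { return lines.get(pos); }

    public static void main(String[] args) {
        MiniLineRecordReader reader = new MiniLineRecordReader();
        reader.initialize(Arrays.asList("hello", "world"));
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentKey() + "\t" + reader.getCurrentValue());
        }
    }
}
```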
TextInputFormat.class
getSplits
/**
* Generate the list of files and make them into FileSplits.
* @param job the job context
* @throws IOException
*/
public List<InputSplit> getSplits(JobContext job)
Input data ===> InputFormat
---> how many InputSplits?
---> MapTask
---> RecordReader
---> LineRecordReader
When the files are small:
one .txt file produces exactly one split... two files produce two splits
If no InputFormat is set in the Driver, the default is TextInputFormat
Job ------> JobSubmitter -----> InputFormat ------>
In local mode, blockSize is 32 MB
TextInputFormat has no getSplits method of its own, so the parent class's (abstract class FileInputFormat) implementation is used
By default, splitSize is the same as blockSize
bytesRemaining
long bytesRemaining = length;
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
  int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
  splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                       blkLocations[blkIndex].getHosts(),
                       blkLocations[blkIndex].getCachedHosts()));
  bytesRemaining -= splitSize;
}
SPLIT_SLOP
private static final double SPLIT_SLOP = 1.1; // 10% slop
There is a 1.1x (10%) slack
That means even a 129 MB file (default blockSize 128 MB) produces only 1 split, since 129/128 ≈ 1.008 < 1.1
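The SPLIT_SLOP behavior can be checked with a plain-Java sketch of FileInputFormat's loop. SplitSlopDemo is an illustrative simplification (it tracks only split lengths, not offsets or block hosts), not the real Hadoop code:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSlopDemo {
    private static final double SPLIT_SLOP = 1.1; // same 10% slack as FileInputFormat

    // Returns the split lengths that the getSplits loop would produce for one file.
    static List<Long> computeSplits(long length, long splitSize) {
        List<Long> splits = new ArrayList<>();
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(splitSize);
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits.add(bytesRemaining); // the tail becomes its own (smaller) split
        }
        return splits;
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        // 129 MB with a 128 MB splitSize: 129/128 ≈ 1.008 < 1.1, so only ONE split
        System.out.println(computeSplits(129 * MB, 128 * MB).size()); // 1
        // 200 MB: 200/128 ≈ 1.56 > 1.1, so a 128 MB split plus a 72 MB tail = 2 splits
        System.out.println(computeSplits(200 * MB, 128 * MB).size()); // 2
    }
}
```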
NLineInputFormat.class
default
public List<InputSplit> getSplits(JobContext job)
    throws IOException {
  List<InputSplit> splits = new ArrayList<InputSplit>();
  int numLinesPerSplit = getNumLinesPerSplit(job);
  for (FileStatus status : listStatus(job)) {
    splits.addAll(getSplitsForFile(status,
        job.getConfiguration(), numLinesPerSplit));
  }
  return splits;
}
If the number of lines per split is not set, numLinesPerSplit defaults to 1
Set on one end (the Driver), read back out on the other
setNumLinesPerSplit
// set N lines per split
NLineInputFormat.setNumLinesPerSplit(job,5);
job.setInputFormatClass(NLineInputFormat.class);
//job.setInputFormatClass(TextInputFormat.class);
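With setNumLinesPerSplit(job, 5) as above, the split count for a file follows a simple ceiling rule: ceil(totalLines / N). The helper below is a hypothetical illustration of that arithmetic, not part of the NLineInputFormat API:

```java
public class NLineSplitCount {
    // Hypothetical helper: how many splits NLineInputFormat produces for one file
    // with totalLines lines when each split holds numLinesPerSplit lines.
    static int numSplits(long totalLines, int numLinesPerSplit) {
        // ceiling division: the last split may hold fewer than N lines
        return (int) ((totalLines + numLinesPerSplit - 1) / numLinesPerSplit);
    }

    public static void main(String[] args) {
        System.out.println(numSplits(11, 5)); // 3 splits: lines 1-5, 6-10, 11
        System.out.println(numSplits(10, 1)); // default N=1 -> one split per line
    }
}
```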
On the number of splits:
you can see they are added one at a time; no fixed count is defined up front