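Properties
A handful of constants. How does Java get from these constants to the configuration files? Each constant is a property key, and the value is looked up in the job's Configuration under that key.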
public static final String INPUT_DIR =
    "mapreduce.input.fileinputformat.inputdir";
public static final String SPLIT_MAXSIZE =
    "mapreduce.input.fileinputformat.split.maxsize";
public static final String SPLIT_MINSIZE =
    "mapreduce.input.fileinputformat.split.minsize";
public static final String PATHFILTER_CLASS =
    "mapreduce.input.pathFilter.class";
public static final String NUM_INPUT_FILES =
    "mapreduce.input.fileinputformat.numinputfiles";
public static final String INPUT_DIR_RECURSIVE =
    "mapreduce.input.fileinputformat.input.dir.recursive";
public static final String LIST_STATUS_NUM_THREADS =
    "mapreduce.input.fileinputformat.list-status.num-threads";
public static final int DEFAULT_LIST_STATUS_NUM_THREADS = 1;
private static final Log LOG = LogFactory.getLog(FileInputFormat.class);
private static final double SPLIT_SLOP = 1.1;   // 10% slop
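To make the "constant to configuration" question concrete, here is a minimal sketch of a keyed lookup with a fallback default. ConfigLookupSketch is a hypothetical stand-in for Hadoop's Configuration, not the real class; the two key strings are copied from the constants above, and the defaults (Long.MAX_VALUE for the maximum, 1 for the minimum) are assumed to mirror what Hadoop uses when the keys are unset.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Hadoop's Configuration: a plain key/value store.
class ConfigLookupSketch {
    // Same key strings as the FileInputFormat constants above.
    static final String SPLIT_MAXSIZE =
        "mapreduce.input.fileinputformat.split.maxsize";
    static final String SPLIT_MINSIZE =
        "mapreduce.input.fileinputformat.split.minsize";

    private final Map<String, String> props = new HashMap<>();

    // Store a raw string value, as an XML config file or a job setter would.
    void set(String key, String value) {
        props.put(key, value);
    }

    // Parse the value stored under the key as a long, or return the default if unset.
    long getLong(String key, long defaultValue) {
        String raw = props.get(key);
        return raw == null ? defaultValue : Long.parseLong(raw.trim());
    }

    public static void main(String[] args) {
        ConfigLookupSketch conf = new ConfigLookupSketch();
        conf.set(SPLIT_MAXSIZE, "67108864"); // 64 MB, set explicitly
        System.out.println(conf.getLong(SPLIT_MAXSIZE, Long.MAX_VALUE)); // the stored value
        System.out.println(conf.getLong(SPLIT_MINSIZE, 1L));             // unset, so the default
    }
}
```

The constant itself carries no value; it only names the property, so the same key works whether the value came from an XML file, the command line, or code.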
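The getSplits method
Let's look at how the split information is obtained. getSplits returns a List whose elements are InputSplit objects. How InputSplit is defined is covered later.
An InputSplit represents the data to be processed by a single Mapper.
Typically it presents a byte-oriented view of the input, and it is the job's RecordReader that is responsible for turning this into a record-oriented view.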
/**
 * Generate the list of files and make them into FileSplits.
 * @param job the job context
 * @throws IOException
 */
public List<InputSplit> getSplits(JobContext job) throws IOException {
  StopWatch sw = new StopWatch().start();
  long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
  long maxSize = getMaxSplitSize(job);
  // generate splits
  List<InputSplit> splits = new ArrayList<InputSplit>();
  List<FileStatus> files = listStatus(job);
  for (FileStatus file: files) {
    Path path = file.getPath();
    long length = file.getLen();
    if (length != 0) {
      BlockLocation[] blkLocations;
      if (file instanceof LocatedFileStatus) {
        blkLocations = ((LocatedFileStatus) file).getBlockLocations();
      } else {
        FileSystem fs = path.getFileSystem(job.getConfiguration());
        blkLocations = fs.getFileBlockLocations(file, 0, length);
      }
      if (isSplitable(job, path)) {
        long blockSize = file.getBlockSize();
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        long bytesRemaining = length;
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(makeSplit(path, length-bytesRemaining, splitSize,
              blkLocations[blkIndex].getHosts(),
              blkLocations[blkIndex].getCachedHosts()));
          bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
              blkLocations[blkIndex].getHosts(),
              blkLocations[blkIndex].getCachedHosts()));
        }
      } else { // not splitable
        splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
            blkLocations[0].getCachedHosts()));
      }
    } else {
      //Create empty hosts array for zero length files
      splits.add(makeSplit(path, 0, length, new String[0]));
    }
  }
  // Save the number of input files for metrics/loadgen
  job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
  sw.stop();
  if (LOG.isDebugEnabled()) {
    LOG.debug("Total # of splits generated by getSplits: " + splits.size()
        + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
  }
  return splits;
}
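The while loop in getSplits is easier to follow with numbers. The sketch below reproduces only the offset arithmetic, with no Hadoop classes; SplitLoopSketch and splitOffsets are hypothetical names, and SPLIT_SLOP is assumed to be 1.1 as in the Hadoop source. Each result entry is a {start, length} pair.

```java
import java.util.ArrayList;
import java.util.List;

class SplitLoopSketch {
    static final double SPLIT_SLOP = 1.1; // assumed: the 10% slop used by FileInputFormat

    // Returns {start, length} pairs, mirroring the loop in getSplits for a splittable file.
    static List<long[]> splitOffsets(long length, long splitSize) {
        List<long[]> result = new ArrayList<>();
        if (length == 0) {
            result.add(new long[] {0, 0}); // a zero-length file still yields one empty split
            return result;
        }
        long bytesRemaining = length;
        // Keep cutting full-size splits while more than 1.1 split sizes remain.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            result.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        // Whatever is left (up to 1.1 split sizes) becomes the final split.
        if (bytesRemaining != 0) {
            result.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return result;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        for (long fileMb : new long[] {300, 130}) {
            System.out.println(fileMb + " MB file:");
            for (long[] s : splitOffsets(fileMb * mb, 128 * mb)) {
                System.out.println("  start=" + s[0] / mb + " MB, length=" + s[1] / mb + " MB");
            }
        }
    }
}
```

With a 128 MB split size, a 300 MB file becomes three splits (128, 128 and 44 MB), while a 130 MB file stays as a single 130 MB split because 130/128 is below the slop factor. The slop avoids creating a tiny trailing split.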
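What gets split? For that, look at the listStatus method.
It lists the input directories. Subclasses may override it, for example to select only files that match a regular expression.
In other words, it resolves the input paths into the list of files that are candidates for splitting.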
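As an illustration of "select only files that match a regular expression", here is a small sketch that works on plain file names rather than Hadoop's FileStatus objects. ListStatusFilterSketch and select are hypothetical names. The rule that names starting with "_" or "." are skipped is an assumption meant to resemble Hadoop's default hidden-file filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class ListStatusFilterSketch {
    // Keep only visible file names that fully match the given regex.
    static List<String> select(List<String> names, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            // Assumed hidden-file rule: skip names beginning with '_' or '.'.
            boolean hidden = name.startsWith("_") || name.startsWith(".");
            if (!hidden && pattern.matcher(name).matches()) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
            "part-00000", "part-00001", "_SUCCESS", ".part-00000.crc", "notes.txt");
        System.out.println(select(names, "part-\\d+"));
    }
}
```

A real subclass would apply the same kind of predicate to the FileStatus list returned by the parent's listStatus.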
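Next comes the check for whether a file can be split.
isSplitable
Is the given file splittable? Usually yes, but not if the file is stream-compressed.
A FileInputFormat implementation can override this method and return false to ensure that individual input files are never split, so that each Mapper processes a whole file.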
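The decision can be sketched without Hadoop by keying on the file suffix. SplitableSketch is a hypothetical stand-in: the suffix table below is an assumption for illustration (gzip treated as a non-splittable stream codec, bzip2 as a splittable one), whereas real Hadoop resolves a codec object from the file name and checks its type.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SplitableSketch {
    // Suffix -> whether the matching codec is assumed to support splitting.
    private static final Map<String, Boolean> CODEC_SPLITTABLE = new LinkedHashMap<>();
    static {
        CODEC_SPLITTABLE.put(".gz", false);  // assumed: stream compression, not splittable
        CODEC_SPLITTABLE.put(".bz2", true);  // assumed: block-oriented, splittable
    }

    static boolean isSplitable(String fileName) {
        for (Map.Entry<String, Boolean> e : CODEC_SPLITTABLE.entrySet()) {
            if (fileName.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return true; // no codec matched: treat the file as uncompressed
    }

    public static void main(String[] args) {
        for (String name : new String[] {"logs.txt", "logs.txt.gz", "logs.txt.bz2"}) {
            System.out.println(name + " -> " + isSplitable(name));
        }
    }
}
```

A non-splittable file still produces one split, but it covers the whole file, so a single Mapper reads it from start to end.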
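The key to splitting is the split size. How is it determined? By the call below. With default settings the result equals the file's block size, which is commonly 128 MB.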
long splitSize = computeSplitSize(blockSize, minSize, maxSize);
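As far as I can tell from the Hadoop source, computeSplitSize clamps the block size between the configured minimum and maximum. The sketch below shows that arithmetic in isolation; SplitSizeSketch is a hypothetical class, and the minimum of 1 and maximum of Long.MAX_VALUE are the assumed defaults when neither key is set.

```java
class SplitSizeSketch {
    // Clamp the block size into the range [minSize, maxSize].
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long block = 128 * mb;
        // Assumed defaults: split size equals the block size.
        System.out.println(computeSplitSize(block, 1L, Long.MAX_VALUE) / mb);
        // A smaller maximum shrinks the split (more Mappers).
        System.out.println(computeSplitSize(block, 1L, 64 * mb) / mb);
        // A larger minimum grows the split (fewer Mappers).
        System.out.println(computeSplitSize(block, 256 * mb, Long.MAX_VALUE) / mb);
    }
}
```

So to get splits smaller than a block, lower the maximum; to get splits larger than a block, raise the minimum.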
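Now the file can be split. What is the result? Each split is returned as a FileSplit object. What information does a split need?
The file path, the start offset, the length, the hosts holding the block, and the hosts holding the block in memory.
file – the file name
start – the position of the first byte in the file to process
length – the number of bytes in the file to process
hosts – the list of hosts containing the block
inMemoryHosts – the list of hosts containing the block in memory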
/**
 * A factory that makes the split for this class. It can be overridden
 * by sub-classes to make sub-types
 */
protected FileSplit makeSplit(Path file, long start, long length,
                              String[] hosts, String[] inMemoryHosts) {
  return new FileSplit(file, start, length, hosts, inMemoryHosts);
}
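What is FileSplit? See below.
TextInputFormat
FileInputFormat has a subclass called TextInputFormat. It implements one important method, createRecordReader, and overrides isSplitable.
createRecordReader returns a RecordReader object. How is that object built?
One would expect it to be built from the InputSplit produced above, but the code below passes only a delimiter to the constructor and does not touch the split parameter. In Hadoop the split is handed to the reader later, through RecordReader.initialize.
Where does the delimiter come from? context.getConfiguration() provides it, under the key textinputformat.record.delimiter. With that in hand, a new LineRecordReader can be created.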
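The delimiter handling can be tried out in isolation. In the sketch below a plain Map stands in for the Hadoop Configuration, and DelimiterSketch, delimiterBytes and records are hypothetical names. The record splitting is deliberately simplified: when no delimiter is configured it falls back to "\n" only, whereas the real LineRecordReader has more elaborate default line-ending handling.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DelimiterSketch {
    static final String KEY = "textinputformat.record.delimiter";

    // Mirrors createRecordReader: the result stays null when the key is unset.
    static byte[] delimiterBytes(Map<String, String> conf) {
        String delimiter = conf.get(KEY);
        return delimiter == null ? null : delimiter.getBytes(StandardCharsets.UTF_8);
    }

    // Simplified record splitting; a null or empty delimiter falls back to "\n".
    static List<String> records(String text, byte[] delimiterBytes) {
        String delimiter = (delimiterBytes == null || delimiterBytes.length == 0)
            ? "\n"
            : new String(delimiterBytes, StandardCharsets.UTF_8);
        List<String> out = new ArrayList<>();
        int from = 0;
        int at;
        while ((at = text.indexOf(delimiter, from)) >= 0) {
            out.add(text.substring(from, at));
            from = at + delimiter.length();
        }
        if (from < text.length()) {
            out.add(text.substring(from)); // trailing record without a delimiter
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(records("a\nb\nc", delimiterBytes(conf)));   // default line splitting
        conf.put(KEY, "||");
        System.out.println(records("a||b||c", delimiterBytes(conf)));   // custom delimiter
    }
}
```

Setting the key to a multi-character string is what lets a job treat something other than a line as one record.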
public class TextInputFormat extends FileInputFormat<LongWritable, Text> {
  @Override
  public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
    String delimiter = context.getConfiguration().get(
        "textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
    return new LineRecordReader(recordDelimiterBytes);
  }
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    final CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    if (null == codec) {
      return true;
    }
    return codec instanceof SplittableCompressionCodec;
  }