Hadoop is designed to process massive volumes of data, which may be structured, semi-structured, or even unstructured text (stored either in HDFS files or in a database). At its core it processes data with the map-reduce model, and the input and output of both map and reduce always take the form of key-value pairs, which we can regard as structured data. The input of reduce is simply the output of map, and the output of both map and reduce is defined inside the user's own map and reduce functions. That leaves only the input of map: how does Hadoop wrap a job's input files into key-value pairs for map to process, and how does it split those input files so that different TaskTrackers can work on them in parallel? These two questions are the focus of this article: the job's input format (InputFormat).
In Hadoop's Map-Reduce implementation, a job's input format consists of two components: a record reader (RecordReader) and a splitter. The splitter divides all of the job's input data into splits, and each split becomes one map task; the record reader reads the data inside a split and wraps it, according to some format, into key-value pairs. In the concrete implementation this input format is defined as an abstract class, leaving it to the user to decide how the input is split and how the data is read and packaged into key-value pairs, because only the user knows how the input data is organized and what key-value pairs the map function expects as input. This input format corresponds to the class org.apache.hadoop.mapreduce.InputFormat:
public abstract class InputFormat<K, V> {

  /**
   * Logically split the set of input files for the job.
   */
  public abstract List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException;

  /**
   * Create a record reader for a given split. The framework will call
   * RecordReader.initialize(InputSplit, TaskAttemptContext) before the split is used.
   */
  public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException;
}
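To see where these two methods plug in, here is a minimal, hypothetical driver sketch (the class name InputFormatDemo and the path arguments are my own, not from the Hadoop source): the framework calls getSplits() at job-submission time and createRecordReader() inside each map task.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InputFormatDemo {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "input-format-demo");
    job.setJarByClass(InputFormatDemo.class);
    // TextInputFormat inherits getSplits() from FileInputFormat and its
    // createRecordReader() returns the LineRecordReader discussed later.
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // No Mapper/Reducer set: the identity implementations run, so the
    // output is each input line prefixed by its byte offset.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}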
As can be seen from the class above, Hadoop implements a file-based splitter in the abstract class FileInputFormat, so I will first focus on how that splitter works. Let's look at the source:
protected long getFormatMinSplitSize() {
  return 1;
}

public static long getMinSplitSize(JobContext job) {
  return job.getConfiguration().getLong("mapred.min.split.size", 1L);
}

public static long getMaxSplitSize(JobContext context) {
  return context.getConfiguration().getLong("mapred.max.split.size", Long.MAX_VALUE);
}

protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}

public List<InputSplit> getSplits(JobContext job) throws IOException {
  long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job)); // minimum allowed split size
  long maxSize = getMaxSplitSize(job); // maximum allowed split size
  // generate splits
  LOG.debug("start to split all input files for Job[" + job.getJobName() + "]");
  List<InputSplit> splits = new ArrayList<InputSplit>();
  for (FileStatus file : listStatus(job)) {
    Path path = file.getPath();
    FileSystem fs = path.getFileSystem(job.getConfiguration());
    long length = file.getLen();
    BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
    if ((length != 0) && isSplitable(job, path)) {
      long blockSize = file.getBlockSize();
      long splitSize = computeSplitSize(blockSize, minSize, maxSize); // final split size for this file
      long bytesRemaining = length;
      // SPLIT_SLOP is 1.1: a tail smaller than 10% of splitSize is folded into the last split
      while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
        int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
        splits.add(new FileSplit(path, length - bytesRemaining, splitSize, blkLocations[blkIndex].getHosts()));
        bytesRemaining -= splitSize;
      }
      if (bytesRemaining != 0) {
        splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining, blkLocations[blkLocations.length - 1].getHosts()));
      }
    } else if (length != 0) {
      // unsplittable file: one split covering the whole file
      splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
    } else {
      // create an empty hosts array for zero-length files
      splits.add(new FileSplit(path, 0, length, new String[0]));
    }
  }
  LOG.debug("Total # of splits in Job[" + job.getJobName() + "]'s input files: " + splits.size());
  return splits;
}

/* whether a given file may be split at all */
protected boolean isSplitable(JobContext context, Path filename) {
  return true;
}
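To make the computeSplitSize() arithmetic concrete, here is a small standalone sketch. The block and split sizes below are made-up numbers of my own; the config keys are the ones read by getMinSplitSize()/getMaxSplitSize() above.

import org.apache.hadoop.conf.Configuration;

public class SplitSizeDemo {
  // same formula as FileInputFormat.computeSplitSize() above
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long blockSize = 64L * 1024 * 1024; // assume a 64MB HDFS block

    // defaults (minSize=1, maxSize=Long.MAX_VALUE): split size == block size
    System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));                  // 67108864

    // capping the max at 16MB yields four map tasks per block
    System.out.println(computeSplitSize(blockSize, 1L, 16L * 1024 * 1024));               // 16777216

    // raising the min to 128MB makes each split span two blocks (hurting locality)
    System.out.println(computeSplitSize(blockSize, 128L * 1024 * 1024, Long.MAX_VALUE));  // 134217728

    // these knobs are set via the config keys read in getSplits() above
    Configuration conf = new Configuration();
    conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);
  }
}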
Finally, let's take LineRecordReader as an example to briefly explain how a record reader is implemented. This record reader reads a text file line by line; the key of each pair is the byte offset of the line within the file (not, strictly speaking, a line number), and the value is the line's text.
public class LineRecordReader extends RecordReader<LongWritable, Text> {
  private static final Log LOG = LogFactory.getLog(LineRecordReader.class);

  private CompressionCodecFactory compressionCodecs = null;
  private long start;
  private long pos;
  private long end;
  private LineReader in;
  private int maxLineLength;
  private LongWritable key = null;
  private Text value = null;

  public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);
    start = split.getStart();
    end = start + split.getLength();
    final Path file = split.getPath();
    compressionCodecs = new CompressionCodecFactory(job);
    final CompressionCodec codec = compressionCodecs.getCodec(file);
    // open the file and seek to the start of the split
    FileSystem fs = file.getFileSystem(job);
    FSDataInputStream fileIn = fs.open(file);
    boolean skipFirstLine = false;
    if (codec != null) {
      in = new LineReader(codec.createInputStream(fileIn), job);
      end = Long.MAX_VALUE;
    } else {
      if (start != 0) {
        // not the first split: the first (partial) line belongs to the previous split
        skipFirstLine = true;
        --start;
        fileIn.seek(start);
      }
      in = new LineReader(fileIn, job);
    }
    if (skipFirstLine) { // skip first line and re-establish "start"
      start += in.readLine(new Text(), 0, (int) Math.min((long) Integer.MAX_VALUE, end - start));
    }
    this.pos = start;
  }

  public boolean nextKeyValue() throws IOException {
    if (key == null) {
      key = new LongWritable();
    }
    key.set(pos); // the key is the byte offset at which the line starts
    if (value == null) {
      value = new Text();
    }
    int newSize = 0;
    while (pos < end) {
      newSize = in.readLine(value, maxLineLength, Math.max((int) Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));
      if (newSize == 0) {
        break;
      }
      pos += newSize;
      if (newSize < maxLineLength) {
        break;
      }
      // line too long. try again
      LOG.debug("Skipped this line because the line is too long: lineLength[" + newSize + "]>maxLineLength[" + maxLineLength + "] at position[" + (pos - newSize) + "].");
    }
    if (newSize == 0) {
      key = null;
      value = null;
      return false;
    } else {
      return true;
    }
  }

  @Override
  public LongWritable getCurrentKey() {
    return key;
  }

  @Override
  public Text getCurrentValue() {
    return value;
  }

  /**
   * Get the progress within the split
   */
  public float getProgress() {
    if (start == end) {
      return 0.0f;
    } else {
      return Math.min(1.0f, (pos - start) / (float) (end - start));
    }
  }

  public synchronized void close() throws IOException {
    if (in != null) {
      in.close();
    }
  }
}
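Since TextInputFormat returns a LineRecordReader from createRecordReader(), a map function sees exactly the pairs assembled above. Here is a minimal, hypothetical mapper sketch (the class name OffsetMapper and the non-empty-line filter are my own, for illustration only):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrates what LineRecordReader hands to map():
//   key   = byte offset of the line within the file (not a line number)
//   value = the line's text, with the trailing newline stripped
public class OffsetMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    if (line.getLength() > 0) { // e.g. keep only non-empty lines
      context.write(offset, line);
    }
  }
}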