Below is the nextKeyValue() code of LineRecordReader:
public boolean nextKeyValue() throws IOException {
  if (key == null) {
    key = new LongWritable();
  }
  // The key is the byte offset at which the current line starts.
  key.set(pos);
  if (value == null) {
    value = new Text();
  }
  int newSize = 0;
  while (pos < end) {
    // readLine returns the number of bytes consumed, including the line terminator.
    newSize = in.readLine(value, maxLineLength,
                          Math.max((int)Math.min(Integer.MAX_VALUE, end-pos),
                                   maxLineLength));
    if (newSize == 0) {
      // Nothing was read: end of input.
      break;
    }
    pos += newSize;
    if (newSize < maxLineLength) {
      // The line fits within the limit, so it becomes the current value.
      break;
    }
    // line too long. try again
    LOG.info("Skipped line of size " + newSize + " at pos " +
             (pos - newSize));
  }
  if (newSize == 0) {
    key = null;
    value = null;
    return false;
  } else {
    return true;
  }
}
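In key.set(pos), pos is the byte offset at which the line starts in the file, and value holds the content of that line. Hadoop: The Definitive Guide illustrates this with an example: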
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
LineRecordReader turns these four lines into four K/V pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
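The offsets above can be reproduced without Hadoop. The following is only a minimal sketch: the class LineOffsets and its method toPairs are hypothetical names made up for illustration, and the sketch assumes UTF-8 input with "\n" as the only line terminator. It mimics the way pos advances by the number of bytes consumed, newline included.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: computes (byte offset, line) pairs.
class LineOffsets {

    static List<String> toPairs(String input) {
        byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
        List<String> pairs = new ArrayList<>();
        int pos = 0; // byte offset of the next line, like pos in LineRecordReader
        while (pos < bytes.length) {
            int eol = pos;
            while (eol < bytes.length && bytes[eol] != '\n') {
                eol++;
            }
            String line = new String(bytes, pos, eol - pos, StandardCharsets.UTF_8);
            pairs.add("(" + pos + ", " + line + ")");
            // Bytes consumed = line length plus the newline, if one was found.
            int consumed = (eol - pos) + (eol < bytes.length ? 1 : 0);
            pos += consumed;
        }
        return pairs;
    }

    public static void main(String[] args) {
        String text = "On the top of the Crumpetty Tree\n"
                + "The Quangle Wangle sat,\n"
                + "But his face you could not see,\n"
                + "On account of his Beaver Hat.\n";
        for (String pair : toPairs(text)) {
            System.out.println(pair);
        }
    }
}
```

The first line is 32 characters plus one newline, which is why the second key is 33 rather than 32.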
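Tying this to the WordCount example: each time the mapper receives a K/V pair, it processes the value with StringTokenizer itr = new StringTokenizer(value.toString()), which splits the value into individual tokens. After the mapper has processed them, it emits intermediate output of the following form:
(On,1),(the,1),(top,1),(of,1)..... The framework then sorts this intermediate output by key and hands it to the reducer, which sums the counts.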
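The tokenizing step can be tried in isolation. This is a minimal sketch using plain strings instead of Hadoop's Text and Context types; the class TokenizeLine and the method mapLine are hypothetical names, not the WordCount source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical stand-in for the body of a WordCount map() call.
class TokenizeLine {

    // Splits one line on whitespace and pairs every token with the count 1.
    static List<String> mapLine(String value) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(value);
        while (itr.hasMoreTokens()) {
            out.add("(" + itr.nextToken() + ",1)");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(mapLine("On the top of the Crumpetty Tree"));
    }
}
```

Note that "the" is emitted twice for this line; merging duplicates is left to the later stages, not to the mapper.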
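So the path of a job's input, from the input files to the mapper's output, is roughly as follows:
The input path is registered with FileInputFormat.addInputPath(job, path), and FileInputFormat's getSplits() divides the input into splits. In this example, TextInputFormat supplies a LineRecordReader for reading the data line by line, and the LineRecordReader produces the K/V pairs. The LineRecordReader acts as the reader: it is the component that actually pulls data from the input stream, and the K/V pairs it reads are handed to the mapper for processing.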
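The hand-off from reader to mapper is a pull loop: the mapper side keeps calling nextKeyValue() until it returns false. The sketch below imitates that loop with a hypothetical SimpleLineReader that stands in for LineRecordReader. It ignores splits, maxLineLength and Hadoop's Writable types, and assumes "\n" line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class MiniPipeline {

    // Hypothetical, heavily simplified stand-in for LineRecordReader.
    static class SimpleLineReader {
        private final byte[] data;
        private int pos = 0;
        private long key;
        private String value;

        SimpleLineReader(String input) {
            this.data = input.getBytes(StandardCharsets.UTF_8);
        }

        // Advances to the next line; returns false once the input is exhausted.
        boolean nextKeyValue() {
            if (pos >= data.length) {
                return false;
            }
            key = pos;
            int eol = pos;
            while (eol < data.length && data[eol] != '\n') {
                eol++;
            }
            value = new String(data, pos, eol - pos, StandardCharsets.UTF_8);
            pos = (eol < data.length) ? eol + 1 : eol;
            return true;
        }

        long getCurrentKey() { return key; }

        String getCurrentValue() { return value; }
    }

    // Drives the reader and applies a WordCount-style map step to every value.
    static List<String> run(String input) {
        List<String> output = new ArrayList<>();
        SimpleLineReader reader = new SimpleLineReader(input);
        while (reader.nextKeyValue()) {
            StringTokenizer itr = new StringTokenizer(reader.getCurrentValue());
            while (itr.hasMoreTokens()) {
                output.add("(" + itr.nextToken() + ",1)");
            }
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run("On the top\nof the tree\n"));
    }
}
```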