In an earlier post [http://flyfoxs.iteye.com/blog/2110463] I discussed how Hadoop, when splitting a file, could cut a single line of data into two meaningless fragments; without special handling, that would corrupt the data and break processing. After some feedback, it turns out this bug does not actually exist.
After splitting a file, Hadoop applies a few rules at read time to guarantee that no line is ever broken across two records. Below is an analysis of that read-time mechanism (LineRecordReader):
1) The split boundaries are computed by the JobClient, not on the Hadoop cluster. (This is only a coarse partition; the precise per-record boundaries are decided by the Mapper according to the rules below.)
2) Although the JobClient computes the splits, the Mapper does not follow them literally:
every split except the first starts reading from the byte after the first newline in the split,
and every split ends at the first newline inside the *next* split. (Quite aggressive: except for the last one, almost every split reads past its own boundary.)
3) For very long lines there is a theoretical bug: if a line exceeds the configured maximum length, the excess part of that line is silently discarded. It is theoretical only, because the default limit is Integer.MAX_VALUE:
this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);
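The truncation in point 3 can be sketched with a minimal standalone demo (not Hadoop code; `capLine` is a hypothetical helper that mimics how readLine caps the returned line at maxLineLength while still consuming the whole line):

```java
// Minimal sketch of the maxLineLength truncation behavior.
public class MaxLineDemo {
    // Returns the line content that readLine would hand back for one
    // input line, given a maxLineLength cap in bytes.
    static String capLine(String line, int maxLineLength) {
        int appendLength = line.length();   // payload bytes before the newline
        if (appendLength > maxLineLength) {
            appendLength = maxLineLength;   // excess bytes are consumed but dropped
        }
        return line.substring(0, appendLength);
    }

    public static void main(String[] args) {
        System.out.println(capLine("hello world", 5));           // tail is discarded
        System.out.println(capLine("short", Integer.MAX_VALUE)); // default cap never truncates
    }
}
```

With the default cap of Integer.MAX_VALUE the second case applies, which is why the bug never bites in practice.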
The code below shows that when LineRecordReader reads the last line of a split, it does not stop exactly at the split boundary; it keeps reading until it hits the newline inside the next split.
The code is fairly involved, so comments have been added; questions are welcome.
public int readLine(Text str, int maxLineLength,
                    int maxBytesToConsume) throws IOException {
  /* We're reading data from in, but the head of the stream may be
   * already buffered in buffer, so we have several cases:
   * 1. No newline characters are in the buffer, so we need to copy
   *    everything and read another buffer from the stream.
   * 2. An unambiguously terminated line is in buffer, so we just
   *    copy to str.
   * 3. Ambiguously terminated line is in buffer, i.e. buffer ends
   *    in CR. In this case we copy everything up to CR to str, but
   *    we also need to see what follows CR: if it's LF, then we
   *    need consume LF as well, so next call to readLine will read
   *    from after that.
   * We use a flag prevCharCR to signal if previous character was CR
   * and, if it happens to be at the end of the buffer, delay
   * consuming it until we have a chance to look at the char that
   * follows.
   */
  str.clear();
  int txtLength = 0; //tracks str.getLength(), as an optimization
  int newlineLength = 0; //length of terminating newline
  boolean prevCharCR = false; //true if prev char was CR
  long bytesConsumed = 0;
  do {
    // bufferPosn records how far into the buffer we have read, so the
    // next iteration can pick up where this one left off
    int startPosn = bufferPosn; //starting from where we left off the last time
    // if the buffer has been fully consumed, reset it and refill from the stream
    if (bufferPosn >= bufferLength) {
      startPosn = bufferPosn = 0;
      if (prevCharCR)
        ++bytesConsumed; //account for CR from previous read
      // refill from the stream; this only happens once the previous
      // buffer is exhausted (bufferPosn >= bufferLength).
      // bufferLength records how many bytes the read returned
      bufferLength = in.read(buffer);
      if (bufferLength <= 0)
        break; // EOF
    }
    // scan the buffer for a line terminator, accepting the Mac (CR),
    // Windows (CRLF) and Linux (LF) conventions
    for (; bufferPosn < bufferLength; ++bufferPosn) { //search for newline
      // is the current byte '\n'?
      if (buffer[bufferPosn] == LF) {
        // a CRLF terminator gives newlineLength = 2; a bare '\n' gives 1
        newlineLength = (prevCharCR) ? 2 : 1;
        ++bufferPosn; // at next invocation proceed from following byte
        break;
      }
      // a bare '\r' also terminates the line, with newlineLength = 1
      if (prevCharCR) { //CR + notLF, we are at notLF
        newlineLength = 1;
        break;
      }
      // remember whether this byte is '\r'; the next iteration decides
      // whether it is part of a CRLF pair or a bare CR terminator
      prevCharCR = (buffer[bufferPosn] == CR);
    }
    int readLength = bufferPosn - startPosn;
    // the last byte of the buffer is '\r'
    if (prevCharCR && newlineLength == 0)
      --readLength; //CR at the end of the buffer
    bytesConsumed += readLength;
    // appendLength: payload bytes taken from the buffer this iteration,
    // excluding the line terminator
    int appendLength = readLength - newlineLength;
    // txtLength tracks the running length of the returned str
    if (appendLength > maxLineLength - txtLength) {
      // anything beyond the per-line limit is consumed but never appended to str
      appendLength = maxLineLength - txtLength;
    }
    // append the selected slice of the buffer to the result (str)
    if (appendLength > 0) {
      str.append(buffer, startPosn, appendLength);
      txtLength += appendLength;
    }
    // no terminator found in this buffer and we are still under the
    // byte budget, so loop around and refill from the stream
  } while (newlineLength == 0 && bytesConsumed < maxBytesToConsume);
  if (bytesConsumed > (long)Integer.MAX_VALUE)
    throw new IOException("Too many bytes before newline: " + bytesConsumed);
  return (int)bytesConsumed;
}
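The newline scan in the for loop above can be isolated into a small standalone demo (not Hadoop code). This simplified version assumes the whole input fits in one byte[], so it does not need the refill logic or the delayed-CR handling across buffer boundaries, but it applies the same LF / CR / CRLF rules:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, single-buffer version of LineReader's newline scan:
// '\n' (Linux), '\r\n' (Windows) and bare '\r' (old Mac) all end a line.
public class NewlineScanDemo {
    private static final byte CR = '\r';
    private static final byte LF = '\n';

    static List<String> splitLines(byte[] buf) {
        List<String> lines = new ArrayList<>();
        int start = 0;
        boolean prevCharCR = false;
        for (int pos = 0; pos < buf.length; pos++) {
            if (buf[pos] == LF) {
                // '\n' or '\r\n' ends the line; exclude the terminator bytes
                int end = prevCharCR ? pos - 1 : pos;
                lines.add(new String(buf, start, end - start));
                start = pos + 1;
                prevCharCR = false;
            } else if (prevCharCR) {
                // CR followed by a non-LF byte: the bare '\r' ended a line
                lines.add(new String(buf, start, pos - 1 - start));
                start = pos;
                prevCharCR = (buf[pos] == CR);
            } else {
                prevCharCR = (buf[pos] == CR);
            }
        }
        // trailing data without a terminator (a final bare CR also counts)
        if (start < buf.length) {
            int end = prevCharCR ? buf.length - 1 : buf.length;
            lines.add(new String(buf, start, end - start));
        }
        return lines;
    }

    public static void main(String[] args) {
        // mixes all three terminator styles in one buffer
        System.out.println(splitLines("a\nb\r\nc\rd".getBytes()));
    }
}
```

The real LineReader additionally carries prevCharCR across buffer refills, which is exactly the "ambiguously terminated line" case described in the comment block at the top of readLine.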
The code below shows how LineRecordReader decides whether the first line of a split should be skipped:
public void initialize(InputSplit genericSplit,
                       TaskAttemptContext context) throws IOException {
  FileSplit split = (FileSplit) genericSplit;
  Configuration job = context.getConfiguration();
  this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
                                  Integer.MAX_VALUE);
  start = split.getStart();
  end = start + split.getLength();
  final Path file = split.getPath();
  compressionCodecs = new CompressionCodecFactory(job);
  final CompressionCodec codec = compressionCodecs.getCodec(file);
  // open the file and seek to the start of the split
  FileSystem fs = file.getFileSystem(job);
  FSDataInputStream fileIn = fs.open(split.getPath());
  boolean skipFirstLine = false;
  if (codec != null) {
    in = new LineReader(codec.createInputStream(fileIn), job);
    end = Long.MAX_VALUE;
  } else {
    if (start != 0) {
      // every split except the first skips its (possibly partial) first line;
      // backing up one byte makes the skipped line empty when the split
      // happens to start exactly on a line boundary
      skipFirstLine = true;
      --start;
      fileIn.seek(start);
    }
    in = new LineReader(fileIn, job);
  }
  if (skipFirstLine) { // skip first line and re-establish "start".
    start += in.readLine(new Text(), 0,
                         (int)Math.min((long)Integer.MAX_VALUE, end - start));
  }
  this.pos = start;
}
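Putting both halves together, the split-boundary rule can be checked with a toy simulation (not Hadoop code; `readSplit` is a hypothetical stand-in for one reader, assuming '\n'-only line endings): every reader except the first discards bytes up to and including the first newline, and every reader keeps consuming past its split end until the next newline or EOF.

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation of the split-boundary rule for uncompressed text.
public class SplitBoundaryDemo {
    // Emit the lines that the reader for split [start, end) would produce.
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            pos = start - 1;  // the --start trick from initialize()
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++;            // step past the skipped line's newline
        }
        // a line is ours if it *starts* before the split end, even if it
        // finishes inside the next split
        while (pos < end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart));
            pos++;            // step past the newline (or past EOF)
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes();
        List<String> all = new ArrayList<>();
        int splitSize = 7;    // deliberately cuts lines in half
        for (int s = 0; s < data.length; s += splitSize) {
            all.addAll(readSplit(data, s, Math.min(s + splitSize, data.length)));
        }
        System.out.println(all); // every line appears exactly once
    }
}
```

However the split size slices the file, each line is produced by exactly one reader: the reader whose split contains the line's first byte.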
References:
http://blog.csdn.net/bluishglc/article/details/9380087
http://blog.csdn.net/wanghai__/article/details/6583364