Handling data that spans InputSplits in MapReduce

woshiliufeng

Category: Hadoop 2.x source code analysis. Tags: mapreduce, InputSplit, LineRecordReader

Article link: https://blog.csdn.net/woshiliufeng/article/details/39956609

Analysis of the LineRecordReader class

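An InputSplit is only a logical concept: no file is physically cut, and each InputSplit merely records the location of the data to process. InputSplits are generated as follows:

1. bytesRemaining = length, where length is the file size; bytesRemaining therefore starts out as the size of the whole file.

2. Compute the block index: blkIndex = getBlockIndex(blkLocations, length-bytesRemaining), where blkLocations is the array of block locations and length-bytesRemaining is the offset into the file.

3. Create a FileSplit with makeSplit(path, length-bytesRemaining, splitSize, blkLocations[blkIndex].getHosts()); length-bytesRemaining is the start value of the FileSplit.

4. After each split is created, bytesRemaining -= splitSize, and steps 2 to 4 repeat.

5. If bytesRemaining is not 0 once the loop ends, the remaining bytes become the final FileSplit:

makeSplit(path, length-bytesRemaining, bytesRemaining, blkLocations[blkIndex].getHosts())

Summary: each generated split carries the file path, its start offset, its length, and the hosts of the block in which it starts (the block is located through its index among the blocks that make up the file).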
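The five steps above can be sketched without any Hadoop dependency. This is a minimal illustration, not the FileInputFormat source: the class SplitSketch and the method computeSplits are hypothetical names, hosts and paths are left out, and the loop condition with a slop factor of 1.1 is written from my recollection of FileInputFormat.getSplits, so treat it as an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Assumed slop factor: the last split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    /** Returns {start, length} pairs for a file of the given length. */
    static List<long[]> computeSplits(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;                        // step 1
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            // steps 2 and 3: the split starts at length - bytesRemaining
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;                     // step 4
        }
        if (bytesRemaining != 0) {                           // step 5
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        for (long[] s : computeSplits(300, 128)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

With a 300-byte file and a split size of 128 this yields three splits starting at 0, 128 and 256. A 130-byte file yields a single split, because 130/128 does not exceed the slop factor.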

// LineRecordReader.initialize(): opens the file and positions the reader for this split.
public void initialize(InputSplit genericSplit,
                       TaskAttemptContext context) throws IOException {
  FileSplit split = (FileSplit) genericSplit;
  Configuration job = context.getConfiguration();
  this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
  start = split.getStart();
  end = start + split.getLength();
  final Path file = split.getPath();

  // open the file and seek to the start of the split
  final FileSystem fs = file.getFileSystem(job);
  fileIn = fs.open(file);

  CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
  if (null != codec) {
    isCompressedInput = true;
    decompressor = CodecPool.getDecompressor(codec);
    if (codec instanceof SplittableCompressionCodec) {
      final SplitCompressionInputStream cIn =
          ((SplittableCompressionCodec) codec).createInputStream(
              fileIn, decompressor, start, end,
              SplittableCompressionCodec.READ_MODE.BYBLOCK);
      in = new CompressedSplitLineReader(cIn, job,
          this.recordDelimiterBytes);
      start = cIn.getAdjustedStart();
      end = cIn.getAdjustedEnd();
      filePosition = cIn;
    } else {
      in = new SplitLineReader(codec.createInputStream(fileIn,
          decompressor), job, this.recordDelimiterBytes);
      filePosition = fileIn;
    }
  } else {
    fileIn.seek(start);
    in = new SplitLineReader(fileIn, job, this.recordDelimiterBytes);
    filePosition = fileIn;
  }

  // If this is not the first split, we always throw away first record
  // because we always (except the last split) read one extra line in
  // next() method.
  if (start != 0) {
    // maxBytesToConsume(long pos) is implemented as:
    //   return isCompressedInput
    //       ? Integer.MAX_VALUE
    //       : (int) Math.min(Integer.MAX_VALUE, end - pos);
    // readLine returns the number of bytes consumed for one line ending in
    // CR (\r), LF (\n) or CRLF (\r\n); adding it to start skips that line.
    start += in.readLine(new Text(), 0, maxBytesToConsume(start));
  }
  this.pos = start;
}

Example:

The text data has the following format and content:

0,7-24-15:59:59,report,F9.Stock.WEST.ForecastDataStat,62.4984,122.228.176.194,13,9,117

0,7-24-15:59:59,report,F9.Stock.WEST.ExRightAndDivdend,0,122.228.176.194,1,3,3

0,7-24-15:59:59,report,F9.Stock.WEST.ForecastDataStat,78.123,122.228.176.194,13,9,117

0,7-24-15:59:58,report,F9.Stock.WEST.HistoryForecast,15.6246,122.228.176.194,61,16,976

0,7-8-08:59:59,report,F9.Stock.WEST.GetOrganListForProfitForecast,312.492,10.23.193.168,5,1,5

0,7-8-08:59:59,report,all,343.7412,10.23.193.176,14,45,630

0,7-8-08:59:59,report,F9.Stock.WEST.CompareColligationWithRealValue,0,124.207.185.223,8,13,104

0,7-8-08:59:59,report,F9.Stock.WEST.ExRightAndDivdend,0,10.23.193.148,1,3,3

0,7-8-08:59:59,report,F9.Stock.WEST.ForecastDataStat,62.4984,10.23.193.148,13,9,117

0,7-8-08:59:59,report,F9.Stock.WEST.ColligationInformationPicDate,0,10.23.193.168,218,4,872

MapReduce output:

0 1

87 1

252 1

339 1

492 1

663 1

747 1

933 1

1212 1

1299 1

1524 1

2064 1

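The initialize process of LineRecordReader

1. Obtain the start and end values from the FileSplit object.

2. Check whether the file is compressed; if so go to step 3, otherwise go to step 6.

3. Obtain the decompressor that matches the compression codec. If the codec supports splitting go to step 4, otherwise go to step 5.

4. Create a splittable compressed input stream with createInputStream(fileIn, decompressor, start, end, SplittableCompressionCodec.READ_MODE.BYBLOCK), initialize a CompressedSplitLineReader on it, and adjust start and end to the values reported by the stream.

5. Initialize a SplitLineReader directly on the decompressed stream.

6. Seek to the start position and initialize a SplitLineReader.

7. If start is not 0, advance start past the first line terminator found from start onwards (CR is \r, LF is \n, CRLF is \r\n), so that the first, possibly partial, line is discarded.

8. Assign start to pos.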
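Step 7 is the part that matters for records crossing a split boundary, so here is a small sketch of it on an in-memory byte array. It is an illustration under simplifying assumptions: only LF is treated as a terminator, and the names SkipFirstLineSketch and adjustStart are hypothetical rather than taken from Hadoop.

```java
import java.nio.charset.StandardCharsets;

public class SkipFirstLineSketch {
    /**
     * Mirrors step 7: a split that does not start at offset 0 moves its start
     * to just past the first LF found at or after the original start.
     */
    static int adjustStart(byte[] data, int start) {
        if (start == 0) {
            return 0;                       // the first split keeps its first line
        }
        int pos = start;
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    public static void main(String[] args) {
        // Lines start at offsets 0, 4 and 9.
        byte[] data = "aaa\nbbbb\ncc\n".getBytes(StandardCharsets.US_ASCII);
        for (int start : new int[] {0, 4, 5, 8}) {
            System.out.println("start=" + start + " -> " + adjustStart(data, start));
        }
    }
}
```

Note that a split starting exactly at offset 4, the first byte of a line, still skips that line and moves to 9. The skipped line is not lost, because the reader of the preceding split reads one line beyond its own end.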

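readLine(new Text(), 0, maxBytesToConsume(start)) reads one line from the input stream into its first argument, which holds the data read. The second argument is the maximum number of bytes that may be stored; the rest of the line is discarded, so passing 0 here drops the whole line. The third argument, the maximum number of bytes to consume, is only a hint. The return value is the number of bytes consumed. The implementation is as follows: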

public int readLine(Text str, int maxLineLength,
                    int maxBytesToConsume) throws IOException {
  if (this.recordDelimiterBytes != null) {
    // a custom record delimiter was configured
    return readCustomLine(str, maxLineLength, maxBytesToConsume);
  } else {
    // default terminators: CR, LF or CRLF
    return readDefaultLine(str, maxLineLength, maxBytesToConsume);
  }
}

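The two branches differ in the line terminator: readCustomLine ends a line at the user-defined delimiter, while readDefaultLine ends a line at CR, LF or CRLF.

Take readDefaultLine as an example.

All of its work happens inside a single do-while loop:

do {
  ……
} while (newlineLength == 0 && bytesConsumed < maxBytesToConsume);

newlineLength == 0 means that no line terminator has been found yet. maxBytesToConsume defaults to Integer.MAX_VALUE in the readLine overloads that omit it, and bytesConsumed is the number of bytes read so far. If bytesConsumed exceeds Integer.MAX_VALUE, an IOException is thrown: Too many bytes before newline.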
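To make the return value and the maxLineLength truncation concrete, the following sketch reads CR, LF and CRLF terminated lines from a byte array. It is a simplified stand-in, not Hadoop's readDefaultLine: there is no buffering and no maxBytesToConsume limit, the input is assumed to be ASCII, and the class name DefaultLineSketch is hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class DefaultLineSketch {
    final byte[] data;
    int pos = 0;

    DefaultLineSketch(byte[] data) {
        this.data = data;
    }

    /**
     * Reads one line ending in CR, LF or CRLF. At most maxLineLength bytes are
     * stored in out; the rest of the line is dropped. Returns the number of
     * bytes consumed, terminator included, or 0 at end of input.
     */
    int readLine(StringBuilder out, int maxLineLength) {
        out.setLength(0);
        int lineStart = pos;
        int stored = 0;
        while (pos < data.length) {
            byte b = data[pos++];
            if (b == '\n') {
                break;                                  // LF
            }
            if (b == '\r') {
                if (pos < data.length && data[pos] == '\n') {
                    pos++;                              // CRLF counts as one terminator
                }
                break;                                  // CR or CRLF
            }
            if (stored < maxLineLength) {
                out.append((char) b);                   // ASCII assumption
                stored++;
            }
        }
        return pos - lineStart;
    }

    public static void main(String[] args) {
        byte[] bytes = "ab\r\ncd\re\nf".getBytes(StandardCharsets.US_ASCII);
        DefaultLineSketch reader = new DefaultLineSketch(bytes);
        StringBuilder line = new StringBuilder();
        int consumed;
        while ((consumed = reader.readLine(line, Integer.MAX_VALUE)) > 0) {
            System.out.println("consumed=" + consumed + " line=" + line);
        }
    }
}
```

Calling readLine with maxLineLength set to 0 stores nothing but still returns the full number of bytes consumed, which is how initialize uses it to skip a line.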

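The nextKeyValue method assigns the Mapper's key and value and makes sure that data spanning InputSplits is read correctly. Its loop condition is getFilePosition() <= end. The = end case covers a record that ends exactly at the end of an InputSplit: the reader then goes on to read the first record of the next InputSplit, which that split's own reader discards in initialize. All data that spans InputSplits is therefore handled by the reader of the preceding InputSplit.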

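The key is the byte offset at which the line starts, measured from the beginning of the file rather than from the start of the split, which is why it is a LongWritable: key.set(pos)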
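The interplay between "skip the first line unless start is 0" and "keep reading while pos <= end" can be checked with a small simulation. This is a sketch under stated assumptions, not Hadoop code: lines are LF-terminated ASCII, splits are fixed-size with no slop factor, and SplitReadSketch, readSplit, readAll and nextLineStart are hypothetical names. Each record is printed as offset:line, with the offset playing the role of the key.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitReadSketch {
    /** Returns the offset just past the next LF at or after from. */
    static int nextLineStart(byte[] data, int from) {
        int p = from;
        while (p < data.length && data[p] != '\n') {
            p++;
        }
        return Math.min(p + 1, data.length);
    }

    /** Reads the records owned by the split [start, start + length). */
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            pos = nextLineStart(data, pos);   // discard the first, possibly partial, line
        }
        List<String> records = new ArrayList<>();
        // pos <= end: a line that starts exactly at end still belongs to this split.
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            String line = new String(data, pos, lineEnd - pos, StandardCharsets.US_ASCII);
            records.add(pos + ":" + line);    // key is the byte offset in the file
            pos = next;
        }
        return records;
    }

    /** Cuts the data into fixed-size splits and concatenates what each one reads. */
    static List<String> readAll(byte[] data, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            all.addAll(readSplit(data, start, Math.min(splitSize, data.length - start)));
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] data = "aaa\nbbbb\ncc\n".getBytes(StandardCharsets.US_ASCII);
        for (int splitSize : new int[] {4, 5, 12}) {
            System.out.println("splitSize=" + splitSize + " -> " + readAll(data, splitSize));
        }
    }
}
```

Whatever split size is chosen, every line shows up exactly once: a line whose first byte lies after a split's start and at or before its end is read by that split, even when the rest of the line lies in the next one.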

The value is assigned by the following loop:

while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
  newSize = in.readLine(value, maxLineLength,
      Math.max(maxBytesToConsume(pos), maxLineLength));
  pos += newSize;
  if (newSize < maxLineLength) {
    break;
  }
  // line too long. try again
  LOG.info("Skipped line of size " + newSize + " at pos " +
      (pos - newSize));