Preface:
How exactly is the number of map tasks controlled for the MapReduce jobs and Hive queries we run in production? The question had always been fuzzy to me; digging into the code made it clear that MapReduce applies different rules for sizing input files, and for splitting or combining them, depending on the input format. Below we look at the details for two common input formats: FileInputFormat and CombineFileInputFormat.
One thing to note up front: by default the MapReduce framework feeds one file block to one map task, and it is against this baseline that each input format applies its own splitting or combining strategy.
CombineFileInputFormat:
This is the input format Hive uses by default (via CombineHiveInputFormat) when reading TextFile tables:
void createSplits(Map<String, Set<OneBlockInfo>> nodeToBlocks,
                  Map<OneBlockInfo, String[]> blockToNodes,
                  Map<String, List<OneBlockInfo>> rackToBlocks,
                  long totLength,
                  long maxSize,
                  long minSizeNode,
                  long minSizeRack,
                  List<InputSplit> splits) {
  ...
      // if the accumulated split size exceeds the maximum, then
      // create this split.
      // This check shows that blocks are combined into a split up to maxSize.
      if (maxSize != 0 && curSplitSize >= maxSize) {
        // create an input split and add it to the splits array
        addCreatedSplit(splits, Collections.singleton(node), validBlocks);
        totalLength -= curSplitSize;
        curSplitSize = 0;
        splitsPerNode.add(node);
        // Remove entries from blocksInNode so that we don't walk these
        // again.
        blocksInCurrentNode.removeAll(validBlocks);
        validBlocks.clear();
        // Done creating a single split for this node. Move on to the next
        // node so that splits are distributed across nodes.
        break;
        ...
      }
...
      ...
      if (validBlocks.size() != 0) {
        // This implies that the last few blocks (or all in case maxSize=0)
        // were not part of a split. The node is complete.

        // if there were any blocks left over and their combined size is
        // larger than minSplitNode, then combine them into one split.
        // Otherwise add them back to the unprocessed pool. It is likely
        // that they will be combined with other blocks from the
        // same rack later on.
        // This condition also kicks in when max split size is not set. All
        // blocks on a node will be grouped together into a single split.
        // In other words, when maxSize is unset, execution lands here
        // directly and minSizeNode acts as the lower bound for combining
        // blocks.
        if (minSizeNode != 0 && curSplitSize >= minSizeNode
            && splitsPerNode.count(node) == 0) {
          // haven't created any split on this machine. so its ok to add a
          // smaller one for parallelism. Otherwise group it in the rack for
          // balanced size create an input split and add it to the splits
          // array
          addCreatedSplit(splits, Collections.singleton(node), validBlocks);
          totalLength -= curSplitSize;
          splitsPerNode.add(node);
          // Remove entries from blocksInNode so that we don't walk this again.
          blocksInCurrentNode.removeAll(validBlocks);
          // The node is done. This was the last set of blocks for this node.
          ...
The analysis above explains why, when you want to raise the input size per map in a Hive query, setting mapred.min.split.size has no effect; you must raise mapred.max.split.size instead. (In the plain MapReduce framework, setting either the v1 or v2 variant of mapred.min.split.size is sufficient, since the two versions of the parameter are kept in sync.)
When mapred.max.split.size is not set, e.g. with only set mapred.min.split.size=0;, all the files on a node are combined into a single split.
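As a concrete illustration, raising the per-map input for a Hive query over text files comes down to settings like the following (the values are only examples; mapred.min.split.size.per.node shown alongside is an assumption about a typical companion setting, not something stated above):

```sql
-- Example only: seal each split at roughly 256 MB worth of blocks.
-- mapred.max.split.size is the knob that matters for CombineFileInputFormat;
-- mapred.min.split.size alone has no effect here.
set mapred.max.split.size=256000000;
set mapred.min.split.size.per.node=128000000;
```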
A few example cases:
min: 128 MB, max: 256 MB → blocks are combined into splits of up to 256 MB.
min: 128 MB, max: 0 → blocks are combined with 128 MB as the lower bound per split (all blocks on a node end up in one split).
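The two cases above can be sketched with a toy model of the per-node pass. This is a deliberately simplified, hypothetical re-implementation (the class, method, and variable names are illustrative, not the Hadoop API), assuming all blocks live on one node and ignoring the rack-level pass:

```java
import java.util.ArrayList;
import java.util.List;

public class CombineSketch {
    // Accumulate block sizes until maxSize is reached, sealing a split each
    // time; with maxSize == 0 nothing is sealed early, so every block on the
    // node falls through to the leftover branch guarded by minSizeNode.
    public static List<Long> combine(long[] blockSizes, long maxSize, long minSizeNode) {
        List<Long> splitSizes = new ArrayList<>();
        long curSplitSize = 0;
        for (long b : blockSizes) {
            curSplitSize += b;
            if (maxSize != 0 && curSplitSize >= maxSize) {
                splitSizes.add(curSplitSize); // split sealed at ~maxSize
                curSplitSize = 0;
            }
        }
        // leftover blocks form a split only if they reach minSizeNode
        if (curSplitSize > 0 && curSplitSize >= minSizeNode) {
            splitSizes.add(curSplitSize);
        }
        return splitSizes;
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        long[] blocks = {128 * MB, 128 * MB, 128 * MB, 128 * MB};
        // min: 128 MB, max: 256 MB -> two splits of ~256 MB each
        System.out.println(combine(blocks, 256 * MB, 128 * MB).size()); // prints 2
        // min: 128 MB, max: 0 -> the whole node becomes one split
        System.out.println(combine(blocks, 0, 128 * MB).size());        // prints 1
    }
}
```

This mirrors why only mapred.max.split.size can cap the split size: the minSizeNode check never breaks a split apart, it only decides whether leftovers are emitted.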
FileInputFormat:
Text-based formats such as TextInputFormat and KeyValueTextInputFormat all split their map input files this way:
public List<InputSplit> getSplits(JobContext job) throws IOException {
  Stopwatch sw = new Stopwatch().start();
  long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
  long maxSize = getMaxSplitSize(job);

  // generate splits
  List<InputSplit> splits = new ArrayList<InputSplit>();
  List<FileStatus> files = listStatus(job);
  for (FileStatus file : files) {
    Path path = file.getPath();
    long length = file.getLen();
    if (length != 0) {
      BlockLocation[] blkLocations;
      if (file instanceof LocatedFileStatus) {
        blkLocations = ((LocatedFileStatus) file).getBlockLocations();
      } else {
        FileSystem fs = path.getFileSystem(job.getConfiguration());
        blkLocations = fs.getFileBlockLocations(file, 0, length);
      }
      if (isSplitable(job, path)) {
        long blockSize = file.getBlockSize();
        // this is where the split size is computed
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        ...
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}
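A quick way to see the effect of the formula splitSize = max(minSize, min(maxSize, blockSize)): the method body below is reproduced from the excerpt, while the surrounding class and the concrete sizes are purely illustrative.

```java
public class SplitSizeDemo {
    // Same formula as FileInputFormat.computeSplitSize above.
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        long blockSize = 128 * MB;
        // with default-ish bounds (tiny min, huge max) a split equals a block
        System.out.println(computeSplitSize(blockSize, 1, Long.MAX_VALUE) / MB); // prints 128
        // raising minSize above the block size enlarges the split
        System.out.println(computeSplitSize(blockSize, 256 * MB, Long.MAX_VALUE) / MB); // prints 256
        // shrinking a split requires maxSize to drop below the block size
        System.out.println(computeSplitSize(blockSize, 1, 64 * MB) / MB); // prints 64
    }
}
```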
As the formula shows, to shrink the map input you need both minSize and maxSize below the block size (in practice, lowering maxSize is what does the work); to enlarge it, raising minSize above the block size is enough.