Preface:
How exactly is the number of map tasks controlled for the MapReduce jobs and Hive queries we run in production? The question had always been fuzzy to me; digging into the code made it clear that MapReduce applies different rules for sizing input files, and for splitting or combining them, depending on the input format. Below we look at the details for two common input formats: FileInputFormat and CombineFileInputFormat.
One thing to note up front: by default the MapReduce framework feeds one file block to one map task, and it is against this baseline that each input format applies its own splitting or combining strategy.
CombineFileInputFormat:
This is the input format Hive uses by default (via CombineHiveInputFormat) when reading TextFile tables:
void createSplits(Map<String, Set<OneBlockInfo>> nodeToBlocks,
                  Map<OneBlockInfo, String[]> blockToNodes,
                  Map<String, List<OneBlockInfo>> rackToBlocks,
                  long totLength,
                  long maxSize,
                  long minSizeNode,
                  long minSizeRack,
                  List<InputSplit> splits) {
  ...
      // if the accumulated split size exceeds the maximum, then
      // create this split.
      // This check shows that blocks are combined into a split up to maxSize.
      if (maxSize != 0 && curSplitSize >= maxSize) {
        // create an input split and add it to the splits array
        addCreatedSplit(splits, Collections.singleton(node), validBlocks);
        totalLength -= curSplitSize;
        curSplitSize = 0;
        splitsPerNode.add(node);
        // Remove entries from blocksInNode so that we don't walk these
        // again.
        blocksInCurrentNode.removeAll(validBlocks);
        validBlocks.clear();
        // Done creating a single split for this node. Move on to the next
        // node so that splits are distributed across nodes.
        break;
        ...
      }
...
      ...
      if (validBlocks.size() != 0) {
        // This implies that the last few blocks (or all in case maxSize=0)
        // were not part of a split. The node is complete.

        // if there were any blocks left over and their combined size is
        // larger than minSplitNode, then combine them into one split.
        // Otherwise add them back to the unprocessed pool. It is likely
        // that they will be combined with other blocks from the
        // same rack later on.
        // This condition also kicks in when max split size is not set. All
        // blocks on a node will be grouped together into a single split.
        // In other words, when maxSize is unset, execution lands here
        // directly and minSizeNode acts as the lower bound for combining
        // blocks.
        if (minSizeNode != 0 && curSplitSize >= minSizeNode
            && splitsPerNode.count(node) == 0) {
          // haven't created any split on this machine. so its ok to add a
          // smaller one for parallelism. Otherwise group it in the rack for
          // balanced size create an input split and add it to the splits
          // array
          addCreatedSplit(splits, Collections.singleton(node), validBlocks);
          totalLength -= curSplitSize;
          splitsPerNode.add(node);
          // Remove entries from blocksInNode so that we don't walk this again.
          blocksInCurrentNode.removeAll(validBlocks);
          // The node is done. This was the last set of blocks for this node.
          ...
The analysis above explains why, when you want to raise the input size per map in a Hive query, setting mapred.min.split.size has no effect; you must raise mapred.max.split.size instead. (In the plain MapReduce framework, setting either the v1 or v2 variant of mapred.min.split.size is sufficient, since the two versions of the parameter are kept in sync.)
When mapred.max.split.size is not set, e.g. with only set mapred.min.split.size=0;, all the files on a node are combined into a single split.
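As a concrete illustration, raising the per-map input for a Hive query over text files comes down to settings like the following (the values are only examples; mapred.min.split.size.per.node shown alongside is an assumption about a typical companion setting, not something stated above):

```sql
-- Example only: seal each split at roughly 256 MB worth of blocks.
-- mapred.max.split.size is the knob that matters for CombineFileInputFormat;
-- mapred.min.split.size alone has no effect here.
set mapred.max.split.size=256000000;
set mapred.min.split.size.per.node=128000000;
```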
A few example cases:
min: 128 MB, max: 256 MB → blocks are combined into splits of up to 256 MB.
min: 128 MB, max: 0 → blocks are combined with 128 MB as the lower bound per split (all blocks on a node end up in one split).
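The two cases above can be sketched with a toy model of the per-node pass. This is a deliberately simplified, hypothetical re-implementation (the class, method, and variable names are illustrative, not the Hadoop API), assuming all blocks live on one node and ignoring the rack-level pass:

```java
import java.util.ArrayList;
import java.util.List;

public class CombineSketch {
    // Accumulate block sizes until maxSize is reached, sealing a split each
    // time; with maxSize == 0 nothing is sealed early, so every block on the
    // node falls through to the leftover branch guarded by minSizeNode.
    public static List<Long> combine(long[] blockSizes, long maxSize, long minSizeNode) {
        List<Long> splitSizes = new ArrayList<>();
        long curSplitSize = 0;
        for (long b : blockSizes) {
            curSplitSize += b;
            if (maxSize != 0 && curSplitSize >= maxSize) {
                splitSizes.add(curSplitSize); // split sealed at ~maxSize
                curSplitSize = 0;
            }
        }
        // leftover blocks form a split only if they reach minSizeNode
        if (curSplitSize > 0 && curSplitSize >= minSizeNode) {
            splitSizes.add(curSplitSize);
        }
        return splitSizes;
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        long[] blocks = {128 * MB, 128 * MB, 128 * MB, 128 * MB};
        // min: 128 MB, max: 256 MB -> two splits of ~256 MB each
        System.out.println(combine(blocks, 256 * MB, 128 * MB).size()); // prints 2
        // min: 128 MB, max: 0 -> the whole node becomes one split
        System.out.println(combine(blocks, 0, 128 * MB).size());        // prints 1
    }
}
```

This mirrors why only mapred.max.split.size can cap the split size: the minSizeNode check never breaks a split apart, it only decides whether leftovers are emitted.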
FileInputFormat:
Text-based formats such as TextInputFormat and KeyValueTextInputFormat all split their map input files this way:
public List<InputSplit> getSplits(JobContext job) throws IOException {
  Stopwatch sw = new Stopwatch().start();
  long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
  long maxSize = getMaxSplitSize(job);

  // generate splits
  List<InputSplit> splits = new ArrayList<InputSplit>();
  List<FileStatus> files = listStatus(job);
  for (FileStatus file : files) {
    Path path = file.getPath();
    long length = file.getLen();
    if (length != 0) {
      BlockLocation[] blkLocations;
      if (file instanceof LocatedFileStatus) {
        blkLocations = ((LocatedFileStatus) file).getBlockLocations();
      } else {
        FileSystem fs = path.getFileSystem(job.getConfiguration());
        blkLocations = fs.getFileBlockLocations(file, 0, length);
      }
      if (isSplitable(job, path)) {
        long blockSize = file.getBlockSize();
        // this is where the split size is computed
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        ...
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}
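A quick way to see the effect of the formula splitSize = max(minSize, min(maxSize, blockSize)): the method body below is reproduced from the excerpt, while the surrounding class and the concrete sizes are purely illustrative.

```java
public class SplitSizeDemo {
    // Same formula as FileInputFormat.computeSplitSize above.
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        long blockSize = 128 * MB;
        // with default-ish bounds (tiny min, huge max) a split equals a block
        System.out.println(computeSplitSize(blockSize, 1, Long.MAX_VALUE) / MB); // prints 128
        // raising minSize above the block size enlarges the split
        System.out.println(computeSplitSize(blockSize, 256 * MB, Long.MAX_VALUE) / MB); // prints 256
        // shrinking a split requires maxSize to drop below the block size
        System.out.println(computeSplitSize(blockSize, 1, 64 * MB) / MB); // prints 64
    }
}
```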
As the formula shows, to shrink the map input you need both minSize and maxSize below the block size (in practice, lowering maxSize is what does the work); to enlarge it, raising minSize above the block size is enough.