Hadoop's InputFormat, its concrete subclasses, and the RecordReader they create are the key pieces of how a MapTask reads its input data.
一、Obtaining the splits and the number of mappers
1. JobClient's submitJobInternal generates the splits and determines the number of mappers
- public RunningJob submitJobInternal(...) { // parameters elided
- return ugi.doAs(new PrivilegedExceptionAction<RunningJob>() {
- ....
- int maps = writeSplits(context, submitJobDir); // generate the splits and determine the number of mappers
- ....
- }}
- private int writeSplits(org.apache.hadoop.mapreduce.JobContext job,
- Path jobSubmitDir) throws IOException,
- InterruptedException, ClassNotFoundException {
- JobConf jConf = (JobConf)job.getConfiguration();
- int maps;
- if (jConf.getUseNewMapper()) {
- maps = writeNewSplits(job, jobSubmitDir); // the new API takes this path
- } else {
- maps = writeOldSplits(jConf, jobSubmitDir);
- }
- return maps;
- }
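Which branch is taken is controlled by a flag in the job configuration rather than by anything the user calls directly. A minimal sketch of the check, assuming the Hadoop 1.x key mapred.mapper.new-api (the key name is an assumption here; Job flips it internally during submission when the job was written against the new org.apache.hadoop.mapreduce API):
- // Rough sketch of what JobConf.getUseNewMapper() evaluates; the key name
- // "mapred.mapper.new-api" is assumed from Hadoop 1.x, where Job sets it to
- // true on submit() when a new-API Mapper has been configured.
- boolean useNewApi = conf.getBoolean("mapred.mapper.new-api", false);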
- private <T extends InputSplit>
- int writeNewSplits(JobContext job, Path jobSubmitDir) throws IOException,
- InterruptedException, ClassNotFoundException {
- Configuration conf = job.getConfiguration();
- InputFormat<?, ?> input =
- ReflectionUtils.newInstance(job.getInputFormatClass(), conf); // instantiate the configured InputFormat via reflection
- List<InputSplit> splits = input.getSplits(job); // call the getSplits method implemented by the InputFormat subclass
- T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]); // copy the split list into an array (a rather roundabout way to do a simple conversion)
- // sort the splits into order based on size, so that the biggest
- // go first
- Arrays.sort(array, new SplitComparator()); // sort the splits by size, largest first
- JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
- jobSubmitDir.getFileSystem(conf), array);
- return array.length; // the number of splits is the number of mappers
- }
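The InputFormat class instantiated by reflection above is whatever the driver configured on the Job. A minimal driver sketch for reference (the class and path names are just placeholders; mapper, reducer and output settings are omitted):
- import org.apache.hadoop.conf.Configuration;
- import org.apache.hadoop.fs.Path;
- import org.apache.hadoop.mapreduce.Job;
- import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
- import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
- public class InputFormatDriver {
-   public static void main(String[] args) throws Exception {
-     Job job = new Job(new Configuration(), "input-format-demo"); // Job.getInstance(conf, ...) in newer releases
-     job.setInputFormatClass(TextInputFormat.class);       // the class writeNewSplits instantiates by reflection
-     FileInputFormat.addInputPath(job, new Path(args[0])); // input path supplied on the command line
-     // when the job is submitted, the split-generation code shown above
-     // runs against TextInputFormat.getSplits()
-   }
- }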
2. The getSplits method of the most commonly used FileInputFormat
- public List<InputSplit> getSplits(JobContext job
- ) throws IOException {
- long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
- long maxSize = getMaxSplitSize(job);
- // generate splits
- List<InputSplit> splits = new ArrayList<InputSplit>();
- List<FileStatus>files = listStatus(job);
- for (FileStatus file: files) {
- Path path = file.getPath();
- FileSystem fs = path.getFileSystem(job.getConfiguration());
- long length = file.getLen();
- BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
- if ((length != 0) && isSplitable(job, path)) {
- long blockSize = file.getBlockSize();
- long splitSize = computeSplitSize(blockSize, minSize, maxSize); // compute the split size from the block size and the configured min/max
- long bytesRemaining = length;
- while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) { // carve a large file into multiple splits
- int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
- splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
- blkLocations[blkIndex].getHosts()));
- bytesRemaining -= splitSize;
- }
- if (bytesRemaining != 0) {
- splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
- blkLocations[blkLocations.length-1].getHosts()));
- }
- } else if (length != 0) {
- splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
- } else {
- //Create empty hosts array for zero length files
- splits.add(new FileSplit(path, 0, length, new String[0]));
- }
- }
- // Save the number of input files in the job-conf
- job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
- LOG.debug("Total # of splits: " + splits.size());
- return splits;
- }
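To make the split-size arithmetic concrete, here is a small self-contained sketch. The formula below mirrors FileInputFormat's computeSplitSize (Math.max(minSize, Math.min(maxSize, blockSize))); the 300 MB file and 128 MB block size are just illustrative numbers:
- // Sketch: how the split size and split boundaries fall out for one file.
- public class SplitSizeDemo {
-   static final double SPLIT_SLOP = 1.1; // same slack factor used by FileInputFormat
-   static long computeSplitSize(long blockSize, long minSize, long maxSize) {
-     return Math.max(minSize, Math.min(maxSize, blockSize));
-   }
-   public static void main(String[] args) {
-     long blockSize = 128L << 20;          // 128 MB HDFS block
-     long minSize = 1, maxSize = Long.MAX_VALUE;
-     long length = 300L << 20;             // a hypothetical 300 MB input file
-     long splitSize = computeSplitSize(blockSize, minSize, maxSize); // -> 128 MB
-     long remaining = length;
-     while ((double) remaining / splitSize > SPLIT_SLOP) {
-       System.out.println("split: offset=" + (length - remaining) + " size=" + splitSize);
-       remaining -= splitSize;
-     }
-     if (remaining != 0) {                 // the last split keeps the tail (44 MB here)
-       System.out.println("split: offset=" + (length - remaining) + " size=" + remaining);
-     }
-   }
- }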
二、Reading the key/value pairs
1. Instantiate the InputFormat and initialize the RecordReader
In MapTask's runNewMapper method, the InputFormat and the RecordReader are created and initialized, and then the mapper is run.
MapTask$NewTrackingRecordReader wraps a RecordReader; it is a proxy class around the real reader.
- private <INKEY,INVALUE,OUTKEY,OUTVALUE>
- void runNewMapper(...) { // parameters elided
- // instantiate the user-configured InputFormat via reflection
- org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
- (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
- ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
- .....
- // create the RecordReader, wrapped in the NewTrackingRecordReader proxy
- org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
- new NewTrackingRecordReader<INKEY,INVALUE>
- (split, inputFormat, reporter, job, taskContext);
- .....
- // initialize the RecordReader
- input.initialize(split, mapperContext);
- .....
- // run the mapper
- mapper.run(mapperContext);
- }
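The input variable above is the tracking proxy around whatever createRecordReader() returned. For reference, a minimal RecordReader skeleton showing the methods that runNewMapper and Mapper.run end up calling; this dummy reader is purely illustrative and not part of Hadoop:
- import org.apache.hadoop.io.LongWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.mapreduce.InputSplit;
- import org.apache.hadoop.mapreduce.RecordReader;
- import org.apache.hadoop.mapreduce.TaskAttemptContext;
- // Dummy reader: emits the numbers 0..4 as keys with an empty Text value.
- public class DummyRecordReader extends RecordReader<LongWritable, Text> {
-   private long current = -1;
-   private final LongWritable key = new LongWritable();
-   private final Text value = new Text("");
-   @Override
-   public void initialize(InputSplit split, TaskAttemptContext context) { } // called from runNewMapper
-   @Override
-   public boolean nextKeyValue() {          // driven by Mapper.run() via the context
-     if (++current >= 5) return false;
-     key.set(current);
-     return true;
-   }
-   @Override
-   public LongWritable getCurrentKey() { return key; }
-   @Override
-   public Text getCurrentValue() { return value; }
-   @Override
-   public float getProgress() { return current < 0 ? 0f : Math.min(1f, current / 5f); }
-   @Override
-   public void close() { }                  // called by MapTask after mapper.run() returns
- }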
2. While the mapper runs, it asks the context for each key and value; the context delegates to the reader through the proxy class MapTask$NewTrackingRecordReader, which also increments the input counters and pushes progress updates
Mapper code:
- public void run(Context context) throws IOException, InterruptedException {
- setup(context);
- while (context.nextKeyValue()) {
- map(context.getCurrentKey(), context.getCurrentValue(), context);
- }
- cleanup(context);
- }
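Each iteration of the while loop above hands exactly one key/value pair from the RecordReader to map(). As a concrete example, a minimal word-count-style Mapper, assuming the usual TextInputFormat key/value types (byte offset and line of text):
- import java.io.IOException;
- import org.apache.hadoop.io.IntWritable;
- import org.apache.hadoop.io.LongWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.mapreduce.Mapper;
- // key = byte offset of the line, value = the line itself (TextInputFormat's contract)
- public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
-   private static final IntWritable ONE = new IntWritable(1);
-   private final Text word = new Text();
-   @Override
-   protected void map(LongWritable key, Text value, Context context)
-       throws IOException, InterruptedException {
-     for (String token : value.toString().split("\\s+")) { // one call to map() per nextKeyValue()
-       if (token.isEmpty()) continue;
-       word.set(token);
-       context.write(word, ONE);
-     }
-   }
- }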
MapContext code:
- @Override
- public boolean nextKeyValue() throws IOException, InterruptedException {
- return reader.nextKeyValue();
- }
- @Override
- public KEYIN getCurrentKey() throws IOException, InterruptedException {
- return reader.getCurrentKey();
- }
- @Override
- public VALUEIN getCurrentValue() throws IOException, InterruptedException {
- return reader.getCurrentValue();
- }
MapTask$NewTrackingRecordReader code:
- @Override
- public boolean nextKeyValue() throws IOException, InterruptedException {
- boolean result = false;
- try {
- long bytesInPrev = getInputBytes(fsStats);
- result = real.nextKeyValue(); // the real RecordReader reads the next record
- long bytesInCurr = getInputBytes(fsStats);
- if (result) {
- inputRecordCounter.increment(1); // count the record that was just read
- fileInputByteCounter.increment(bytesInCurr - bytesInPrev); // count the bytes that were read
- }
- reporter.setProgress(getProgress()); // sets the reporter's progress flag so the update gets pushed
- } catch (IOException ioe) {
- if (inputSplit instanceof FileSplit) {
- FileSplit fileSplit = (FileSplit) inputSplit;
- LOG.error("IO error in map input file "
- + fileSplit.getPath().toString());
- throw new IOException("IO error in map input file "
- + fileSplit.getPath().toString(), ioe);
- }
- throw ioe;
- }
- return result;
- }
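The counters incremented here surface as the standard map-input counters, so the effect of this proxy is visible from the driver once the job finishes. A hedged sketch, assuming the TaskCounter enum from org.apache.hadoop.mapreduce available in later releases (older versions expose the same counter under a different enum name):
- // After the job completes, read back the record counter that
- // NewTrackingRecordReader.nextKeyValue() was incrementing.
- job.waitForCompletion(true);
- long mapInputRecords = job.getCounters()
-     .findCounter(TaskCounter.MAP_INPUT_RECORDS)
-     .getValue();
- System.out.println("map input records: " + mapInputRecords);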
3. After the mapper method finishes, control returns to the MapTask, which closes the reader
- mapper.run(mapperContext);
- input.close(); // close the RecordReader
- output.close(mapperContext);
Note that the two phases are not carried out in the same thread: after the splits are generated, the client moves into the job-monitoring stage, while the actual key/value reading happens later inside each MapTask.
Taken together, the flow above exercises every abstract method of the InputFormat base class.
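For reference, these are the two abstract methods every InputFormat must provide, matching the two phases above: getSplits() during job submission and createRecordReader() inside the MapTask (signatures from the org.apache.hadoop.mapreduce API):
- // From the org.apache.hadoop.mapreduce package:
- public abstract class InputFormat<K, V> {
-   // phase one: called on the client side by writeNewSplits()
-   public abstract List<InputSplit> getSplits(JobContext context)
-       throws IOException, InterruptedException;
-   // phase two: called in runNewMapper(); the returned reader is wrapped by NewTrackingRecordReader
-   public abstract RecordReader<K, V> createRecordReader(InputSplit split,
-       TaskAttemptContext context) throws IOException, InterruptedException;
- }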