Tracing the MapReduce Client Job-Submission Source Code

Latest recommended article published on 2022-01-12 16:29:29

lemon lime

Category: mapreduce    Tags: mapreduce, client submission

Article link: https://blog.csdn.net/weixin_43270493/article/details/86425164

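We use WordCount as the example.

Its driver first sets the configuration needed to connect to the Hadoop cluster, then sets the job-related classes and other job settings.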

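1. Open the Job class

Job extends JobContextImpl and implements the JobContext interface. Open JobContext.

JobContext in turn extends MRJobConfig. As the name suggests, this interface holds the configuration parameters that a MapReduce program uses at run time. Open it.

It defines many parameter keys together with their default values, for example: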

 public static final String MAP_MEMORY_MB = "mapreduce.map.memory.mb";
 public static final int DEFAULT_MAP_MEMORY_MB = 1024;

 public static final String REDUCE_MEMORY_MB = "mapreduce.reduce.memory.mb";
 public static final int DEFAULT_REDUCE_MEMORY_MB = 1024;

These are the default memory sizes, in MB, for map and reduce tasks.
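
To make the key/default pairing concrete, here is a minimal sketch of how such a pair is typically consumed: look the key up and fall back to the default when it is absent. The class MemoryConfigSketch and its getInt helper are hypothetical stand-ins written for this article (a plain Map replaces Hadoop's Configuration); only the key string and the default value come from the excerpt above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a configuration lookup; this is not Hadoop code. */
final class MemoryConfigSketch {
  // Key and default copied from the MRJobConfig excerpt above.
  static final String MAP_MEMORY_MB = "mapreduce.map.memory.mb";
  static final int DEFAULT_MAP_MEMORY_MB = 1024;

  /** Returns the configured integer for key, or defaultValue when the key is absent. */
  static int getInt(Map<String, String> conf, String key, int defaultValue) {
    String raw = conf.get(key);
    return raw == null ? defaultValue : Integer.parseInt(raw.trim());
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    // Nothing configured yet, so the default applies.
    System.out.println(getInt(conf, MAP_MEMORY_MB, DEFAULT_MAP_MEMORY_MB));
    // An explicit setting overrides the default.
    conf.put(MAP_MEMORY_MB, "2048");
    System.out.println(getInt(conf, MAP_MEMORY_MB, DEFAULT_MAP_MEMORY_MB));
  }
}
```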

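2. Step into waitForCompletion()

If the job is still in the DEFINE state, meaning it has not been submitted yet, waitForCompletion() calls submit(). When the boolean parameter verbose is true, monitorAndPrintJob() keeps polling the job and prints its progress. When it is false, the method only waits for the job to finish and prints nothing.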

  public boolean waitForCompletion(boolean verbose
                                   ) throws IOException, InterruptedException,
                                            ClassNotFoundException {
    // Submit the job only while it is still in the DEFINE state
    if (state == JobState.DEFINE) {
      submit();
    }
    // If verbose is true, monitor the job and print its progress
    if (verbose) {
      monitorAndPrintJob();
    } else {
      // get the completion poll interval from the client.
      int completionPollIntervalMillis = 
        Job.getCompletionPollInterval(cluster.getConf());
      while (!isComplete()) {
        try {
          Thread.sleep(completionPollIntervalMillis);
        } catch (InterruptedException ie) {
        }
      }
    }
    // Returns whether the job finished successfully
    return isSuccessful();
  }
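
The control flow above (submit once, then poll until done) can be tried without a cluster. The following is a minimal sketch under stated assumptions: PollingJobSketch is a hypothetical class, the "cluster" is just a counter of remaining polls, and every completed job counts as successful. Unlike the excerpt, which swallows InterruptedException, the sketch restores the interrupt flag and gives up.

```java
/** Hypothetical model of the submit-then-poll flow; names mirror Job, but this is not Hadoop code. */
final class PollingJobSketch {
  enum JobState { DEFINE, RUNNING }

  private JobState state = JobState.DEFINE;
  private int pollsRemaining;   // pretend the cluster needs this many polls to finish
  private int submitCalls = 0;

  PollingJobSketch(int pollsUntilDone) {
    this.pollsRemaining = pollsUntilDone;
  }

  /** Refuses to run unless the job is still in DEFINE, then moves it to RUNNING. */
  void submit() {
    if (state != JobState.DEFINE) {
      throw new IllegalStateException("Job in state " + state + " instead of DEFINE");
    }
    submitCalls++;
    state = JobState.RUNNING;
  }

  /** Reports completion once the simulated poll budget is used up. */
  boolean isComplete() {
    if (pollsRemaining > 0) {
      pollsRemaining--;
      return false;
    }
    return true;
  }

  int submitCalls() { return submitCalls; }

  JobState state() { return state; }

  /** Submits once if needed, then sleeps between polls until the job reports completion. */
  boolean waitForCompletion(long pollIntervalMillis) {
    if (state == JobState.DEFINE) {
      submit();
    }
    while (!isComplete()) {
      try {
        Thread.sleep(pollIntervalMillis);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
        return false;
      }
    }
    return true; // the sketch treats every completed job as successful
  }

  public static void main(String[] args) {
    PollingJobSketch job = new PollingJobSketch(3);
    System.out.println("finished: " + job.waitForCompletion(10L));
    System.out.println("submit calls: " + job.submitCalls());
  }
}
```

Calling waitForCompletion() a second time on the same object does not submit again, because the state is no longer DEFINE.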

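3. Step into submit()

submit() first makes sure the job is still in the DEFINE state; if it is not, ensureState() throws an IllegalStateException and nothing is submitted. It then switches the job to the new API with setUseNewAPI(). Two steps matter most: calling connect(), which creates the Cluster instance whose ClientProtocol client talks to the ResourceManager, and obtaining a JobSubmitter instance and calling its submitJobInternal() method to submit the job.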

  public void submit() 
         throws IOException, InterruptedException, ClassNotFoundException {
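    // Check the job state again
    ensureState(JobState.DEFINE);
    // Switch to the new API; this only adjusts some configuration settings
    setUseNewAPI();
    connect();
    // connect() has assigned cluster; its client is the actual submitter,
    // either the local one or the YARN one, as decided by the configuration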
    final JobSubmitter submitter = 
        getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
    status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
      public JobStatus run() throws IOException, InterruptedException, 
      ClassNotFoundException {
        return submitter.submitJobInternal(Job.this, cluster);
      }
    });
    state = JobState.RUNNING;
    LOG.info("The url to track the job: " + getTrackingURL());
   }

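4. Step into connect()

When a MapReduce job is submitted, the connection to the cluster is made by Job's connect() method. All it really does is construct the Cluster instance cluster.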

private synchronized void connect()
          throws IOException, InterruptedException, ClassNotFoundException {
    if (cluster == null) {
      cluster = 
        ugi.doAs(new PrivilegedExceptionAction<Cluster>() {
                   public Cluster run()
                          throws IOException, InterruptedException, 
                                 ClassNotFoundException {
                     // Return a new Cluster instance
                     return new Cluster(getConfiguration());
                   }
                 });
    }
  }

Step in to see how the object is constructed:

  
  public Cluster(Configuration conf) throws IOException {
    // Delegate to the two-argument constructor
    this(null, conf);
  }

  public Cluster(InetSocketAddress jobTrackAddr, Configuration conf) 
      throws IOException {
    this.conf = conf;
    this.ugi = UserGroupInformation.getCurrentUser();
    // Initialize the Cluster object
    initialize(jobTrackAddr, conf);
  }

Step into initialize():

private void initialize(InetSocketAddress jobTrackAddr, Configuration conf)
      throws IOException {

    synchronized (frameworkLoader) {
      // Try each ClientProtocolProvider in turn and build a ClientProtocol with its create() method
      for (ClientProtocolProvider provider : frameworkLoader) {
        LOG.debug("Trying ClientProtocolProvider : "
            + provider.getClass().getName());
        ClientProtocol clientProtocol = null; 
        try {
            
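          // If the configuration does not select YARN, a LocalJobRunner is built and the MR job runs locally
          // If the configuration selects YARN, a YARNRunner is built and the MR job runs on the YARN cluster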
          if (jobTrackAddr == null) {
            clientProtocol = provider.create(conf);
          } else {
            clientProtocol = provider.create(jobTrackAddr, conf);
          }

          if (clientProtocol != null) {
            clientProtocolProvider = provider;
            client = clientProtocol;
            LOG.debug("Picked " + provider.getClass().getName()
                + " as the ClientProtocolProvider");
            break;
          }
          else {
            LOG.debug("Cannot pick " + provider.getClass().getName()
                + " as the ClientProtocolProvider - returned null protocol");
          }
        } 
        catch (Exception e) {
          LOG.info("Failed to use " + provider.getClass().getName()
              + " due to error: " + e.getMessage());
        }
      }
    }

    if (null == clientProtocolProvider || null == client) {
      throw new IOException(
          "Cannot initialize Cluster. Please check your configuration for "
              + MRConfig.FRAMEWORK_NAME
              + " and the correspond server addresses.");
    }
  }

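Step into create() and you will find two implementations: one for local mode and one for YARN.

This tells us something important: the ClientProtocol instance held by Cluster is either a YARNRunner in YARN mode or a LocalJobRunner in local mode.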
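
The selection loop in initialize() boils down to "ask each provider in turn and keep the first non-null client". Below is a minimal sketch of that idea. The interfaces are simplified stand-ins declared inside the sketch, not Hadoop's own types, and the key mapreduce.framework.name with a default of local is my understanding of what MRConfig.FRAMEWORK_NAME refers to; treat both as assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of Cluster.initialize(); all types here are simplified stand-ins. */
final class ProviderSelectionSketch {
  interface ClientProtocol {
    String name();
  }

  interface ClientProtocolProvider {
    /** Returns a client when this provider matches the configuration, otherwise null. */
    ClientProtocol create(Map<String, String> conf);
  }

  // Assumed to be the key behind MRConfig.FRAMEWORK_NAME.
  static final String FRAMEWORK_NAME = "mapreduce.framework.name";

  /** Builds a provider that only answers for one framework name. */
  static ClientProtocolProvider providerFor(String framework, String runnerName) {
    return conf -> {
      String configured = conf.getOrDefault(FRAMEWORK_NAME, "local");
      if (!framework.equals(configured)) {
        return null; // not our framework, let the next provider try
      }
      ClientProtocol client = () -> runnerName;
      return client;
    };
  }

  static final List<ClientProtocolProvider> PROVIDERS = Arrays.asList(
      providerFor("local", "LocalJobRunner"),
      providerFor("yarn", "YARNRunner"));

  /** Returns the client of the first provider that accepts the configuration. */
  static ClientProtocol initialize(List<ClientProtocolProvider> providers,
                                   Map<String, String> conf) {
    for (ClientProtocolProvider provider : providers) {
      ClientProtocol client = provider.create(conf);
      if (client != null) {
        return client;
      }
    }
    throw new IllegalStateException(
        "Cannot initialize Cluster. Please check your configuration for " + FRAMEWORK_NAME);
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    System.out.println(initialize(PROVIDERS, conf).name());
    conf.put(FRAMEWORK_NAME, "yarn");
    System.out.println(initialize(PROVIDERS, conf).name());
  }
}
```

The real code throws an IOException when no provider matches; the sketch uses an unchecked exception only to stay short.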

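5. YARNRunner

We follow the YARN mode to see how the cluster connection works, starting with the YARNRunner implementation.

The most important field is resMgrDelegate, a ResourceMgrDelegate that acts as a proxy for the ResourceManager. In YARN mode the whole MapReduce client talks to the YARN cluster through it: submitting jobs, querying job status, and fetching cluster information. Internally it holds a YarnClient instance that does the actual communication with YARN, plus application-specific fields such as ApplicationId and ApplicationSubmissionContext. The other important field is clientCache, a ClientCache instance that serves as the client-side cache.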

@SuppressWarnings("unchecked")
public class YARNRunner implements ClientProtocol {

  private static final Log LOG = LogFactory.getLog(YARNRunner.class);
  // RecordFactory instance
  private final RecordFactory recordFactory = RecordFactoryProvider.getRecordFactory(null);
  // Proxy for the ResourceManager
  private ResourceMgrDelegate resMgrDelegate;
  // Client-side cache
  private ClientCache clientCache;
  // Configuration
  private Configuration conf;
  // Default FileContext instance
  private final FileContext defaultFileContext;
  
  /**
   * Yarn runner incapsulates the client interface of
   * yarn
   * @param conf the configuration object for the client
   */
  public YARNRunner(Configuration conf) {
   this(conf, new ResourceMgrDelegate(new YarnConfiguration(conf)));
  }

  /**
   * Similar to {@link #YARNRunner(Configuration)} but allowing injecting
   * {@link ResourceMgrDelegate}. Enables mocking and testing.
   * @param conf the configuration object for the client
   * @param resMgrDelegate the resourcemanager client handle.
   */
  public YARNRunner(Configuration conf, ResourceMgrDelegate resMgrDelegate) {
   this(conf, resMgrDelegate, new ClientCache(conf, resMgrDelegate));
  }

  /**
   * Similar to {@link YARNRunner#YARNRunner(Configuration, ResourceMgrDelegate)}
   * but allowing injecting {@link ClientCache}. Enable mocking and testing.
   * @param conf the configuration object
   * @param resMgrDelegate the resource manager delegate
   * @param clientCache the client cache object.
   */
  public YARNRunner(Configuration conf, ResourceMgrDelegate resMgrDelegate,
      ClientCache clientCache) {
    this.conf = conf;
    try {
      this.resMgrDelegate = resMgrDelegate;
      this.clientCache = clientCache;
      this.defaultFileContext = FileContext.getFileContext(this.conf);
    } catch (UnsupportedFileSystemException ufe) {
      throw new RuntimeException("Error in instantiating YarnClient", ufe);
    }
  }

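6. Summary of connect()

When a MapReduce job is submitted, the connection to the cluster is made by Job's connect() method, which really just constructs the Cluster instance cluster. Cluster is the tool for connecting to a MapReduce cluster and offers methods for fetching cluster information. Inside Cluster there is a ClientProtocol instance, client, that communicates with the cluster. Hadoop 2.0 provides two ClientProtocol implementations: one for YARN mode and one for local mode. In YARN mode the ClientProtocol instance is a YARNRunner, which holds a ResourceManager proxy object. The whole MapReduce client uses that proxy to talk to the YARN cluster for job submission, job status queries, and similar operations.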

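7. submitJobInternal()

Back to step 3. With connect() covered, we turn to the other important method, submitJobInternal().

It belongs to the JobSubmitter class, which, as the name suggests, is the job submitter in MapReduce. Apart from its constructor, submitJobInternal() is the only non-private member that JobSubmitter exposes. It is the internal method that submits a job and implements all of the submission logic.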

@InterfaceAudience.Private
@InterfaceStability.Unstable
class JobSubmitter {
  protected static final Log LOG = LogFactory.getLog(JobSubmitter.class);
  private static final String SHUFFLE_KEYGEN_ALGORITHM = "HmacSHA1";
  private static final int SHUFFLE_KEY_LENGTH = 64;
  private FileSystem jtFs;   // FileSystem instance
  private ClientProtocol submitClient;   // client protocol instance
  private String submitHostName;    // name of the submitting host
  private String submitHostAddress;  // address of the submitting host
  .............

submitJobInternal() is the single core method that JobSubmitter exposes; it submits the job to the cluster.

 JobStatus submitJobInternal(Job job, Cluster cluster) 
  throws ClassNotFoundException, InterruptedException, IOException {

    // validate the jobs output specs
    // The output path (mapreduce.output.fileoutputformat.outputdir) must be
    // configured and must not exist yet
    checkSpecs(job);

    Configuration conf = job.getConfiguration();

    // Add the MapReduce framework path to the distributed cache
    addMRFrameworkToDistributedCache(conf);

    // getStagingDir() returns the directory that holds the job's resources;
    // when not configured it defaults to /tmp/hadoop-yarn/staging/<submitting user>/.staging
    Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);
    //configure the command line options correctly on the submitting dfs
    InetAddress ip = InetAddress.getLocalHost();   // address of the submitting host
    if (ip != null) {
      // Record the submitting host's address and name, and store both in conf
      submitHostAddress = ip.getHostAddress();
      submitHostName = ip.getHostName();
      conf.set(MRJobConfig.JOB_SUBMITHOST,submitHostName);
      conf.set(MRJobConfig.JOB_SUBMITHOSTADDR,submitHostAddress);
    }
    JobID jobId = submitClient.getNewJobID(); // obtain a new job ID
    job.setJobID(jobId);	// store the job ID in the job
    // Build the submit directory: jobStagingArea/<jobId>
    Path submitJobDir = new Path(jobStagingArea, jobId.toString());
    JobStatus status = null;
    try {
      // Set some job parameters
      conf.set(MRJobConfig.USER_NAME,
          UserGroupInformation.getCurrentUser().getShortUserName());
      conf.set("hadoop.http.filter.initializers", 
          "org.apache.hadoop.yarn.server.webproxy.amfilter.AmFilterInitializer");
      conf.set(MRJobConfig.MAPREDUCE_JOB_DIR, submitJobDir.toString());
      LOG.debug("Configuring job " + jobId + " with " + submitJobDir 
          + " as the submit dir");
      // get delegation token for the dir
      TokenCache.obtainTokensForNamenodes(job.getCredentials(),
          new Path[] { submitJobDir }, conf);
      // Obtain secret keys and tokens and store them in the token cache
      populateTokenCache(conf, job.getCredentials());

      // generate a secret to authenticate shuffle transfers
      if (TokenCache.getShuffleSecretKey(job.getCredentials()) == null) {
        KeyGenerator keyGen;
        try {
         
          int keyLen = CryptoUtils.isShuffleEncrypted(conf) 
              ? conf.getInt(MRJobConfig.MR_ENCRYPTED_INTERMEDIATE_DATA_KEY_SIZE_BITS, 
                  MRJobConfig.DEFAULT_MR_ENCRYPTED_INTERMEDIATE_DATA_KEY_SIZE_BITS)
              : SHUFFLE_KEY_LENGTH;
          keyGen = KeyGenerator.getInstance(SHUFFLE_KEYGEN_ALGORITHM);
          keyGen.init(keyLen);
        } catch (NoSuchAlgorithmException e) {
          throw new IOException("Error generating shuffle secret key", e);
        }
        SecretKey shuffleKey = keyGen.generateKey();
        TokenCache.setShuffleSecretKey(shuffleKey.getEncoded(),
            job.getCredentials());
      }
      // Copy the job's files to the submit dir and configure them
      copyAndConfigureFiles(job, submitJobDir);
      // Path of the job configuration file
      Path submitJobFile = JobSubmissionFiles.getJobConfPath(submitJobDir);
      
      // Create the splits for the job
      LOG.debug("Creating splits at " + jtFs.makeQualified(submitJobDir));
        
      // writeSplits() writes the split file job.split and the split metadata
      // file job.splitmetainfo, and returns the number of map tasks
      int maps = writeSplits(job, submitJobDir);
      conf.setInt(MRJobConfig.NUM_MAPS, maps);
      LOG.info("number of splits:" + maps);

      // write "queue admins of the queue to which job is being submitted"
      // to job file.
      // Queue name from mapreduce.job.queuename; defaults to "default"
      String queue = conf.get(MRJobConfig.QUEUE_NAME,
          JobConf.DEFAULT_QUEUE_NAME);
      AccessControlList acl = submitClient.getQueueAdmins(queue);
      conf.set(toFullPropertyName(queue,
          QueueACL.ADMINISTER_JOBS.getAclName()), acl.getAclString());

      // removing jobtoken referrals before copying the jobconf to HDFS
      // as the tasks don't need this setting, actually they may break
      // because of it if present as the referral will point to a
      // different job.
      TokenCache.cleanUpTokenReferral(conf); // drop the job-token referral from conf
      // Track token IDs only if the configuration enables it
      if (conf.getBoolean(
          MRJobConfig.JOB_TOKEN_TRACKING_IDS_ENABLED,
          MRJobConfig.DEFAULT_JOB_TOKEN_TRACKING_IDS_ENABLED)) {
        // Add HDFS tracking ids
        ArrayList<String> trackingIds = new ArrayList<String>();
        for (Token<? extends TokenIdentifier> t :
            job.getCredentials().getAllTokens()) {
          trackingIds.add(t.decodeIdentifier().getTrackingId());
        }
        conf.setStrings(MRJobConfig.JOB_TOKEN_TRACKING_IDS,
            trackingIds.toArray(new String[trackingIds.size()]));
      }

      // Set reservation info if it exists
      ReservationId reservationId = job.getReservationId();
      if (reservationId != null) {
        conf.set(MRJobConfig.RESERVATION_ID, reservationId.toString());
      }

      // Write job file to submit dir
      writeConf(conf, submitJobFile);
      
      //
      // Now, actually submit the job (using the submit name)
      //
      printTokens(jobId, job.getCredentials());
        
      // Submit through the ClientProtocol instance submitClient and get back the
      // job status; as shown earlier, submitClient is a YARNRunner or a LocalJobRunner
      status = submitClient.submitJob(
          jobId, submitJobDir.toString(), job.getCredentials());
        
      if (status != null) {
        return status;
      } else {
        throw new IOException("Could not launch job");
      }
    } finally {
      if (status == null) {
        LOG.info("Cleaning up the staging area " + submitJobDir);
        if (jtFs != null && submitJobDir != null)
          jtFs.delete(submitJobDir, true);

      }
    }
  }
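
The shuffle-secret step inside submitJobInternal() uses only standard JDK cryptography, so it can be reproduced on its own. The sketch below assumes the unencrypted-shuffle branch, that is, the fixed SHUFFLE_KEY_LENGTH of 64 bits; the class ShuffleKeySketch and the method newShuffleSecret are hypothetical names, and the checked NoSuchAlgorithmException is wrapped in an unchecked exception for brevity where the original throws IOException.

```java
import java.security.NoSuchAlgorithmException;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

/** Hypothetical helper that mirrors the shuffle-secret generation step. */
final class ShuffleKeySketch {
  // Values copied from the JobSubmitter excerpt above.
  static final String SHUFFLE_KEYGEN_ALGORITHM = "HmacSHA1";
  static final int SHUFFLE_KEY_LENGTH = 64; // bits

  /** Generates a fresh random key of the given length and returns its raw bytes. */
  static byte[] newShuffleSecret(int keyLengthBits) {
    try {
      KeyGenerator keyGen = KeyGenerator.getInstance(SHUFFLE_KEYGEN_ALGORITHM);
      keyGen.init(keyLengthBits);
      SecretKey shuffleKey = keyGen.generateKey();
      return shuffleKey.getEncoded();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("Error generating shuffle secret key", e);
    }
  }

  public static void main(String[] args) {
    byte[] secret = newShuffleSecret(SHUFFLE_KEY_LENGTH);
    System.out.println("secret length in bytes: " + secret.length);
  }
}
```

In the real method the bytes are stored in the job's credentials through TokenCache.setShuffleSecretKey(), and a key is only generated when the credentials do not already contain one.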

8. Step into writeSplits()

 private int writeSplits(org.apache.hadoop.mapreduce.JobContext job,
      Path jobSubmitDir) throws IOException,
      InterruptedException, ClassNotFoundException {
    JobConf jConf = (JobConf)job.getConfiguration();
    int maps;
    // The new API is in use
    if (jConf.getUseNewMapper()) {
      maps = writeNewSplits(job, jobSubmitDir);
    } else {
      maps = writeOldSplits(jConf, jobSubmitDir);
    }
    return maps;
  }

Step into writeNewSplits():

 @SuppressWarnings("unchecked")
  private <T extends InputSplit>
  int writeNewSplits(JobContext job, Path jobSubmitDir) throws IOException,
      InterruptedException, ClassNotFoundException {
    Configuration conf = job.getConfiguration();
    InputFormat<?, ?> input =
      ReflectionUtils.newInstance(job.getInputFormatClass(), conf);

    List<InputSplit> splits = input.getSplits(job);
    T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);

    // sort the splits into order based on size, so that the biggest
    // go first
    Arrays.sort(array, new SplitComparator());
    JobSplitWriter.createSplitFiles(jobSubmitDir, conf, 
        jobSubmitDir.getFileSystem(conf), array);
    return array.length;
  }
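
writeNewSplits() sorts the splits so that the biggest go first, as its own comment says. A minimal sketch of that ordering follows. Split is a hypothetical stand-in for InputSplit that carries only a label and a length, and BIGGEST_FIRST is my own comparator written to match the comment, not a copy of Hadoop's SplitComparator.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical illustration of the "biggest split first" ordering. */
final class SplitSortSketch {
  /** Minimal stand-in for an InputSplit: a label and a length in bytes. */
  static final class Split {
    final String label;
    final long length;

    Split(String label, long length) {
      this.label = label;
      this.length = length;
    }
  }

  /** Orders splits from largest to smallest length. */
  static final Comparator<Split> BIGGEST_FIRST =
      (a, b) -> Long.compare(b.length, a.length);

  /** Returns a sorted copy and leaves the input array untouched. */
  static Split[] sortBiggestFirst(Split[] splits) {
    Split[] copy = Arrays.copyOf(splits, splits.length);
    Arrays.sort(copy, BIGGEST_FIRST);
    return copy;
  }

  public static void main(String[] args) {
    Split[] splits = {
        new Split("a", 100L), new Split("b", 300L), new Split("c", 200L)
    };
    for (Split s : sortBiggestFirst(splits)) {
      System.out.println(s.label + " " + s.length);
    }
  }
}
```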

Step into job.getInputFormatClass():

@SuppressWarnings("unchecked")
  public Class<? extends InputFormat<?,?>> getInputFormatClass() 
     throws ClassNotFoundException {
    return (Class<? extends InputFormat<?,?>>) 
      // Input format; defaults to TextInputFormat when none is set
      conf.getClass(INPUT_FORMAT_CLASS_ATTR, TextInputFormat.class);
  }

Step into input.getSplits():

public List<InputSplit> getSplits(JobContext job) throws IOException {
    Stopwatch sw = new Stopwatch().start();
    // Larger of the format's minimum and the configured minimum; 1 byte by default
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    // Configured maximum split size; Long.MAX_VALUE by default
    long maxSize = getMaxSplitSize(job);

    // generate splits
    List<InputSplit> splits = new ArrayList<InputSplit>();
    // List the input files
    List<FileStatus> files = listStatus(job);
    // For each file, look up its block locations and cut it into splits
    for (FileStatus file: files) {
      Path path = file.getPath();
      long length = file.getLen();
      if (length != 0) {
        BlockLocation[] blkLocations;
        if (file instanceof LocatedFileStatus) {
          blkLocations = ((LocatedFileStatus) file).getBlockLocations();
        } else {
          FileSystem fs = path.getFileSystem(job.getConfiguration());
          blkLocations = fs.getFileBlockLocations(file, 0, length);
        }
        if (isSplitable(job, path)) {
          long blockSize = file.getBlockSize();
          // splitSize = max(minSize, min(maxSize, blockSize))
          long splitSize = computeSplitSize(blockSize, minSize, maxSize);

          long bytesRemaining = length;
          // Keep cutting full-size splits while the remainder exceeds 1.1 * splitSize,
          // so the last split ends up between 0 and 1.1 * splitSize bytes long
          while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
            int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
            splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                        blkLocations[blkIndex].getHosts(),
                        blkLocations[blkIndex].getCachedHosts()));
            bytesRemaining -= splitSize;
          }

          if (bytesRemaining != 0) {
            int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
            splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                       blkLocations[blkIndex].getHosts(),
                       blkLocations[blkIndex].getCachedHosts()));
          }
        } else { // not splitable
          splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
                      blkLocations[0].getCachedHosts()));
        }
      } else { 
        //Create empty hosts array for zero length files
        splits.add(makeSplit(path, 0, length, new String[0]));
      }
    }
    // Save the number of input files for metrics/loadgen
    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
    sw.stop();
    if (LOG.isDebugEnabled()) {
      LOG.debug("Total # of splits generated by getSplits: " + splits.size()
          + ", TimeTaken: " + sw.elapsedMillis());
    }
    return splits;
  }

Source of computeSplitSize():

 protected long computeSplitSize(long blockSize, long minSize,
                                  long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }
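
A worked example makes the split arithmetic easier to follow. The sketch below reuses the computeSplitSize() formula and the 1.1 slop loop from getSplits() on plain numbers. SplitSizeSketch and splitLengths are hypothetical names, and the 128 MB block size and 260 MB file are values chosen for illustration. Zero-length files, which the real method turns into a single empty split, are not modelled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical re-implementation of the split arithmetic on plain numbers. */
final class SplitSizeSketch {
  static final double SPLIT_SLOP = 1.1; // same slop factor as in the excerpt

  /** Same formula as computeSplitSize() above. */
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  /** Returns the length of every split that the slop loop would produce for one file. */
  static List<Long> splitLengths(long fileLength, long splitSize) {
    List<Long> lengths = new ArrayList<>();
    long bytesRemaining = fileLength;
    // Cut full-size splits while more than 1.1 * splitSize is left.
    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
      lengths.add(splitSize);
      bytesRemaining -= splitSize;
    }
    // Whatever is left becomes the final split.
    if (bytesRemaining != 0) {
      lengths.add(bytesRemaining);
    }
    return lengths;
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
    for (long length : splitLengths(260 * mb, splitSize)) {
      System.out.println(length / mb + " MB");
    }
  }
}
```

With default limits the split size equals the block size. For the 260 MB file the loop cuts one 128 MB split, then stops because the remaining 132 MB is less than 1.1 times the split size, so the file yields two splits (128 MB and 132 MB) rather than three.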

Who actually submits the job?

Step into submitClient.submitJob().

This is the YARN-mode implementation:

@Override
  public JobStatus submitJob(JobID jobId, String jobSubmitDir, Credentials ts)
  throws IOException, InterruptedException {
    
    addHistoryToken(ts);
    
    // Construct necessary information to start the MR AM
    ApplicationSubmissionContext appContext =
      createApplicationSubmissionContext(conf, jobSubmitDir, ts);

    // Submit to ResourceManager
    // After this call the client side of the job submission is essentially done
    try {
      ApplicationId applicationId =
          resMgrDelegate.submitApplication(appContext);

      ApplicationReport appMaster = resMgrDelegate
          .getApplicationReport(applicationId);
      String diagnostics =
          (appMaster == null ?
              "application report is null" : appMaster.getDiagnostics());
      if (appMaster == null
          || appMaster.getYarnApplicationState() == YarnApplicationState.FAILED
          || appMaster.getYarnApplicationState() == YarnApplicationState.KILLED) {
        throw new IOException("Failed to run job : " +
            diagnostics);
      }
      return clientCache.getClient(jobId).getJobStatus(jobId);
    } catch (YarnException e) {
      throw new IOException(e);
    }
  }

That completes the walkthrough of the overall MapReduce job-submission process.
