Hadoop 2.x WordCount Execution Flow: The Client Side

The following shows the driver portion of the WordCount program:

WordCount


    public static void main(String[] args) throws Exception{
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if(otherArgs.length < 2){
            System.err.println("Usage: wordcount <in> [<in> ...] <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        for(int i = 0; i < otherArgs.length - 1; ++i){
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
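The TokenizerMapper and IntSumReducer classes referenced above are not shown. Conceptually, the mapper tokenizes each input line and emits (word, 1) pairs, and the combiner/reducer sums the counts per word. That data flow can be mimicked in plain Java with no Hadoop dependency (the class and method names below are illustrative stand-ins, not Hadoop APIs):

```java
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Mimics TokenizerMapper + IntSumReducer: split into tokens
    // (the map step), then sum a count of 1 per occurrence of each
    // distinct word (the reduce step, grouped by key).
    public static Map<String, Integer> wordCount(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("hello world hello hadoop"));
        // {hadoop=1, hello=2, world=1}
    }
}
```

In real Hadoop the grouping by key happens in the shuffle phase between map and reduce tasks; here a single map plays both roles.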

Execution really begins with the call to job.waitForCompletion(true).

Job#waitForCompletion


  /**
   * Submit the job to the cluster and wait for it to finish.
   * @param verbose print the progress to the user
   * @return true if the job succeeded
   * @throws IOException thrown if the communication with the 
   *         <code>JobTracker</code> is lost
   */
  public boolean waitForCompletion(boolean verbose
                                   ) throws IOException, InterruptedException,
                                            ClassNotFoundException {
    if (state == JobState.DEFINE) {
      submit();
    }
    if (verbose) {
      monitorAndPrintJob();
    } else {
      // get the completion poll interval from the client.
      int completionPollIntervalMillis = 
        Job.getCompletionPollInterval(cluster.getConf());
      while (!isComplete()) {
        try {
          Thread.sleep(completionPollIntervalMillis);
        } catch (InterruptedException ie) {
        }
      }
    }
    return isSuccessful();
  }

Inside job.waitForCompletion(true), the submit() method is invoked:

Job#submit


  /**
   * Submit the job to the cluster and return immediately.
   * @throws IOException
   */
  public void submit() 
         throws IOException, InterruptedException, ClassNotFoundException {
    ensureState(JobState.DEFINE);
    setUseNewAPI();
    connect();
    final JobSubmitter submitter = 
        getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
    status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
      public JobStatus run() throws IOException, InterruptedException, 
      ClassNotFoundException {
        return submitter.submitJobInternal(Job.this, cluster);
      }
    });
    state = JobState.RUNNING;
    LOG.info("The url to track the job: " + getTrackingURL());
   }

Job.submit() calls connect(), whose main task is to initialize the Cluster class.

Job#connect


  private synchronized void connect()
          throws IOException, InterruptedException, ClassNotFoundException {
    if (cluster == null) {
      cluster = 
        ugi.doAs(new PrivilegedExceptionAction<Cluster>() {
                   public Cluster run()
                          throws IOException, InterruptedException, 
                                 ClassNotFoundException {
                     return new Cluster(getConfiguration());
                   }
                 });
    }
  }

Cluster provides a way to connect to a MapReduce cluster; it maintains a handle to a ClientProtocol.

Cluster


public class Cluster {
  private ClientProtocolProvider clientProtocolProvider;
  private ClientProtocol client;

  private static ServiceLoader<ClientProtocolProvider> frameworkLoader =
      ServiceLoader.load(ClientProtocolProvider.class);

  static {
    ConfigUtil.loadResources();
  }

  public Cluster(InetSocketAddress jobTrackAddr, Configuration conf) 
      throws IOException {
    this.conf = conf;
    this.ugi = UserGroupInformation.getCurrentUser();
    initialize(jobTrackAddr, conf);
  }

  private void initialize(InetSocketAddress jobTrackAddr, Configuration conf)
      throws IOException {

    synchronized (frameworkLoader) {
      for (ClientProtocolProvider provider : frameworkLoader) {
        LOG.debug("Trying ClientProtocolProvider : "
            + provider.getClass().getName());
        ClientProtocol clientProtocol = null; 
        try {
          if (jobTrackAddr == null) {
            clientProtocol = provider.create(conf);
          } else {
            clientProtocol = provider.create(jobTrackAddr, conf);
          }

          if (clientProtocol != null) {
            clientProtocolProvider = provider;
            client = clientProtocol;
            LOG.debug("Picked " + provider.getClass().getName()
                + " as the ClientProtocolProvider");
            break;
          }
          else {
            LOG.debug("Cannot pick " + provider.getClass().getName()
                + " as the ClientProtocolProvider - returned null protocol");
          }
        } 
        catch (Exception e) {
          LOG.info("Failed to use " + provider.getClass().getName()
              + " due to error: " + e.getMessage());
        }
      }
    }

    if (null == clientProtocolProvider || null == client) {
      throw new IOException(
          "Cannot initialize Cluster. Please check your configuration for "
              + MRConfig.FRAMEWORK_NAME
              + " and the correspond server addresses.");
    }
  }
}

Cluster uses a ServiceLoader to dynamically load every available ClientProtocolProvider implementation. By default, two implementations are provided: LocalClientProtocolProvider and YarnClientProtocolProvider. If the user sets mapreduce.framework.name to yarn in the configuration, the client picks YarnClientProtocolProvider, which creates a YARNRunner object as the actual client. The rest of this article assumes yarn is configured.
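The selection logic in initialize — iterate over all loaded providers and keep the first one whose create() returns a non-null protocol — can be sketched in plain Java. The Provider interface and the two provider classes below are simplified stand-ins for the Hadoop types, not the real ones:

```java
import java.util.List;

public class ProviderPick {
    public interface Provider { Object create(String framework); }

    // Stand-in for LocalClientProtocolProvider: only answers for "local".
    public static class LocalProvider implements Provider {
        public Object create(String framework) {
            return "local".equals(framework) ? "LocalJobRunner" : null;
        }
    }

    // Stand-in for YarnClientProtocolProvider: only answers for "yarn".
    public static class YarnProvider implements Provider {
        public Object create(String framework) {
            return "yarn".equals(framework) ? "YARNRunner" : null;
        }
    }

    // Same shape as Cluster.initialize: the first non-null protocol wins.
    public static Object pick(List<Provider> providers, String framework) {
        for (Provider p : providers) {
            Object client = p.create(framework);
            if (client != null) {
                return client;
            }
        }
        return null; // Cluster throws an IOException in this case
    }

    public static void main(String[] args) {
        List<Provider> providers = List.of(new LocalProvider(), new YarnProvider());
        System.out.println(pick(providers, "yarn"));   // YARNRunner
        System.out.println(pick(providers, "local"));  // LocalJobRunner
    }
}
```

In the real code the provider list comes from java.util.ServiceLoader, which discovers implementations via META-INF/services entries on the classpath; each provider decides for itself whether the configured framework name is one it can serve.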

When Cluster is constructed, it calls Cluster#initialize, which assigns the Cluster's client field. ClientProtocol is an interface with two implementations, YARNRunner and LocalJobRunner; with yarn configured, the runtime type of client is YARNRunner. At this point Job.connect() has completed, so we return to Job.submit().
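For YarnClientProtocolProvider to be picked, mapreduce.framework.name must be set to yarn, typically in mapred-site.xml:

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```

Without this setting, LocalClientProtocolProvider matches instead and the job runs in-process via LocalJobRunner.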

Job#submit()


  /**
   * Submit the job to the cluster and return immediately.
   * @throws IOException
   */
  public void submit() 
         throws IOException, InterruptedException, ClassNotFoundException {
    ensureState(JobState.DEFINE);
    setUseNewAPI();
    connect();
    final JobSubmitter submitter = 
        getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
    status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
      public JobStatus run() throws IOException, InterruptedException, 
      ClassNotFoundException {
        return submitter.submitJobInternal(Job.this, cluster);
      }
    });
    state = JobState.RUNNING;
    LOG.info("The url to track the job: " + getTrackingURL());
   }

As we can see, submit() creates a JobSubmitter; the ClientProtocol held by the JobSubmitter is exactly the YARNRunner from the Cluster. Job.submit() then calls submitter.submitJobInternal(Job.this, cluster), so let us look at JobSubmitter.submitJobInternal:

JobSubmitter.submitJobInternal


  JobStatus submitJobInternal(Job job, Cluster cluster) 
  throws ClassNotFoundException, InterruptedException, IOException {
    .......
      //
      // Now, actually submit the job (using the submit name)
      //
      printTokens(jobId, job.getCredentials());
      status = submitClient.submitJob(
          jobId, submitJobDir.toString(), job.getCredentials());
      if (status != null) {
        return status;
      } else {
        throw new IOException("Could not launch job");
      }
    } finally {
      if (status == null) {
        LOG.info("Cleaning up the staging area " + submitJobDir);
        if (jtFs != null && submitJobDir != null)
          jtFs.delete(submitJobDir, true);
      }
    }
  }

The call submitClient.submitJob(jobId, submitJobDir.toString(), job.getCredentials()) above is where the job is actually submitted. Because submitClient is a YARNRunner instance, we should look at YARNRunner.submitJob(JobID jobId, String jobSubmitDir, Credentials ts):

YARNRunner#submitJob


  public JobStatus submitJob(JobID jobId, String jobSubmitDir, Credentials ts)
  throws IOException, InterruptedException {

    addHistoryToken(ts);

    // Construct necessary information to start the MR AM
    ApplicationSubmissionContext appContext =
      createApplicationSubmissionContext(conf, jobSubmitDir, ts);

    // Submit to ResourceManager
    try {
      ApplicationId applicationId =
          resMgrDelegate.submitApplication(appContext);

      ApplicationReport appMaster = resMgrDelegate
          .getApplicationReport(applicationId);
      String diagnostics =
          (appMaster == null ?
              "application report is null" : appMaster.getDiagnostics());
      if (appMaster == null
          || appMaster.getYarnApplicationState() == YarnApplicationState.FAILED
          || appMaster.getYarnApplicationState() == YarnApplicationState.KILLED) {
        throw new IOException("Failed to run job : " +
            diagnostics);
      }
      return clientCache.getClient(jobId).getJobStatus(jobId);
    } catch (YarnException e) {
      throw new IOException(e);
    }
  }

Reading down to the line


ApplicationId applicationId = resMgrDelegate.submitApplication(appContext);

we can see that this is the statement that actually submits the application request. But what is resMgrDelegate? It is an instance of ResourceMgrDelegate, which extends the abstract class YarnClient. Another class, YarnClientImpl, also extends YarnClient, and interestingly the client held inside ResourceMgrDelegate is an instance of YarnClientImpl. In other words, ResourceMgrDelegate is, as its name suggests, a delegate class. In YarnClientImpl we can see how the client establishes its RPC connection:
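The relationship described above — ResourceMgrDelegate extends YarnClient but forwards the real work to an internal YarnClientImpl — is a classic delegation pattern. A minimal sketch, using simplified stand-in names rather than the actual Hadoop class bodies:

```java
public class DelegateSketch {
    // Stand-in for the abstract YarnClient API.
    public abstract static class YarnClientApi {
        public abstract String submitApplication(String appContext);
    }

    // Stand-in for YarnClientImpl: does the actual work
    // (the real class talks to the ResourceManager over RPC).
    public static class YarnClientImplSketch extends YarnClientApi {
        public String submitApplication(String appContext) {
            return "submitted:" + appContext;
        }
    }

    // Stand-in for ResourceMgrDelegate: exposes the same API,
    // but every call is forwarded to the wrapped implementation.
    public static class ResourceMgrDelegateSketch extends YarnClientApi {
        private final YarnClientApi client = new YarnClientImplSketch();
        public String submitApplication(String appContext) {
            return client.submitApplication(appContext); // pure delegation
        }
    }

    public static void main(String[] args) {
        YarnClientApi delegate = new ResourceMgrDelegateSketch();
        System.out.println(delegate.submitApplication("wordcount"));
        // submitted:wordcount
    }
}
```

The delegate lets MapReduce-specific client code (YARNRunner) program against a stable YarnClient surface while the wrapped YarnClientImpl handles protocol details.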

YarnClientImpl#serviceStart


  @Override
  protected void serviceStart() throws Exception {
    try {
      rmClient = ClientRMProxy.createRMProxy(getConfig(),
          ApplicationClientProtocol.class);
      if (historyServiceEnabled) {
        historyClient.start();
      }
      if (timelineServiceEnabled) {
        timelineClient.start();
      }
    } catch (IOException e) {
      throw new YarnRuntimeException(e);
    }
    super.serviceStart();
  }

That concludes the client-side part of WordCount: Job.waitForCompletion(true) calls Job.submit(), which connects to the cluster (initializing Cluster and picking YARNRunner via YarnClientProtocolProvider), then JobSubmitter.submitJobInternal hands the job to YARNRunner.submitJob, which finally submits the application to the ResourceManager through ResourceMgrDelegate and YarnClientImpl over RPC.
