MapReduce Local Job Submission: A Debugger Walkthrough of the Source Code

ldd儆儆

Category: Big Data

Article link: https://blog.csdn.net/qq_38304320/article/details/88690288

First, step into job.waitForCompletion.

Because the job is already in the DEFINE state, execution goes straight into the submit method.

1. Step into the submit method:

   public void submit() throws IOException, InterruptedException, ClassNotFoundException {
        // Confirm once more that the job is still in the DEFINE state
        this.ensureState(Job.JobState.DEFINE);

        /*
         * Decide whether the new API is in use. The old and new APIs differ in naming:
         *   old: mapred.mapper.class / mapred.reducer.class
         *   new: mapreduce.job.map.class / mapreduce.job.reduce.class
         */
        this.setUseNewAPI();

        // Key step: create the Cluster object (see 1.1)
        this.connect();
        final JobSubmitter submitter = this.getJobSubmitter(this.cluster.getFileSystem(), this.cluster.getClient());
        this.status = (JobStatus)this.ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
            public JobStatus run() throws IOException, InterruptedException, ClassNotFoundException {
                return submitter.submitJobInternal(Job.this, Job.this.cluster);
            }
        });
        this.state = Job.JobState.RUNNING;
        LOG.info("The url to track the job: " + this.getTrackingURL());
    }
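The state check at the top of submit can be illustrated with a small stand-alone sketch. The class JobStateSketch below is hypothetical and is not Hadoop's Job class; it only assumes the behaviour visible above: the job must be in DEFINE when submit starts and is marked RUNNING afterwards.

```java
// Hypothetical stand-in for the state handling in Job.submit(); not Hadoop code.
class JobStateSketch {
    enum JobState { DEFINE, RUNNING }

    private JobState state = JobState.DEFINE;

    // Fail fast when the job is not in the expected state.
    void ensureState(JobState expected) {
        if (state != expected) {
            throw new IllegalStateException(
                "Job in state " + state + " instead of " + expected);
        }
    }

    // Same outline as submit(): check the state, do the work, mark RUNNING.
    void submit() {
        ensureState(JobState.DEFINE);
        // ... connecting to the cluster and submitting would happen here ...
        state = JobState.RUNNING;
    }

    JobState getState() {
        return state;
    }

    public static void main(String[] args) {
        JobStateSketch job = new JobStateSketch();
        job.submit();
        System.out.println("state after submit: " + job.getState());
        try {
            job.submit(); // a second submit is rejected by ensureState
        } catch (IllegalStateException e) {
            System.out.println("second submit rejected: " + e.getMessage());
        }
    }
}
```

In this sketch a second call to submit fails, because the state is no longer DEFINE.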

1.1 Step into the connect method called from submit:

Purpose: create the Cluster object.

  private synchronized void connect() throws IOException, InterruptedException, ClassNotFoundException {
        // No Cluster object has been created yet, so cluster is null at this point
        if (this.cluster == null) {
            this.cluster = (Cluster)this.ugi.doAs(new PrivilegedExceptionAction<Cluster>() {
                public Cluster run() throws IOException, InterruptedException, ClassNotFoundException {
                    // Execution continues into this constructor
                    return new Cluster(Job.this.getConfiguration());
                }
            });
        }

    }

 public Cluster(InetSocketAddress jobTrackAddr, Configuration conf) throws IOException {
      // ...
        // jobTrackAddr is null here: new Cluster(conf) delegates to this(null, conf)
        this.initialize(jobTrackAddr, conf);
    }

1.1.1 Step into the initialize method

Purpose of initialize: obtain a client protocol provider, then use it to create the ClientProtocol object.

private void initialize(InetSocketAddress jobTrackAddr, Configuration conf) throws IOException {
        synchronized(frameworkLoader) {
            /*
             * frameworkLoader is a static field of Cluster:
             *   private static ServiceLoader<ClientProtocolProvider> frameworkLoader =
             *       ServiceLoader.load(ClientProtocolProvider.class);
             * Iterating over it yields each registered provider in turn.
             */
            Iterator var4 = frameworkLoader.iterator();
            while(var4.hasNext()) {
                ClientProtocolProvider provider = (ClientProtocolProvider)var4.next();
                LOG.debug("Trying ClientProtocolProvider : " + provider.getClass().getName());
                ClientProtocol clientProtocol = null;

                try {
                    if (jobTrackAddr == null) {
                        // Execution enters this branch
                        clientProtocol = provider.create(conf);

                        /*
                         * LocalClientProtocolProvider.create, for reference:
                         *
                         * public ClientProtocol create(Configuration conf) throws IOException {
                         *     String framework = conf.get("mapreduce.framework.name", "local");
                         *     if (!"local".equals(framework)) { // framework == "local" in this run
                         *         return null;
                         *     } else {
                         *         conf.setInt("mapreduce.job.maps", 1);
                         *         return new LocalJobRunner(conf);
                         *     }
                         * }
                         */
                    } else {
                        clientProtocol = provider.create(jobTrackAddr, conf);
                    }
                }
                // ... (the catch block and the rest of the loop are not shown)
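The provider lookup in initialize can be sketched without Hadoop on the classpath. Everything in the block below is hypothetical: the Provider interface stands in for ClientProtocolProvider, a String stands in for the ClientProtocol object, a plain List replaces the ServiceLoader, and the "yarn" check is an assumption about how a YARN provider would decide. The point is only the pattern: try each provider in turn and keep the first one that does not return null.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "first provider that accepts the config wins" loop.
class ProviderSelectionSketch {
    // Stand-in for ClientProtocolProvider; returns null when it does not apply.
    interface Provider {
        String create(Map<String, String> conf);
    }

    static class LocalProvider implements Provider {
        public String create(Map<String, String> conf) {
            // Same default as in the walkthrough: a missing key means "local".
            String framework = conf.getOrDefault("mapreduce.framework.name", "local");
            return "local".equals(framework) ? "LocalJobRunner" : null;
        }
    }

    static class YarnProvider implements Provider {
        public String create(Map<String, String> conf) {
            // Assumption: only an explicit "yarn" value selects this provider.
            return "yarn".equals(conf.get("mapreduce.framework.name")) ? "YARNRunner" : null;
        }
    }

    // Iterate over the providers and return the first non-null client.
    static String select(List<Provider> providers, Map<String, String> conf) {
        for (Provider provider : providers) {
            String client = provider.create(conf);
            if (client != null) {
                return client;
            }
        }
        throw new IllegalStateException("no provider accepted the configuration");
    }

    public static void main(String[] args) {
        List<Provider> providers = Arrays.asList(new YarnProvider(), new LocalProvider());
        Map<String, String> empty = Collections.emptyMap();
        Map<String, String> yarn = Collections.singletonMap("mapreduce.framework.name", "yarn");
        System.out.println(select(providers, empty));
        System.out.println(select(providers, yarn));
    }
}
```

With an empty configuration the local provider is chosen, which matches the local-debug run traced in this article.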

1.2 Next, focus on the statement that follows connect

1. The Cluster object created earlier in connect is used to create the submitter object.

/*
Directories and files created under /tmp at this point:
[root@l4 tmp]# ll
total 12
-rw-r--r--. 1 root root 9442 Mar 20 14:04 debugger-agent-storage6349796582743942141jar
drwxr-xr-x. 3 root root   20 Mar 20 11:21 hadoop-root
drwxr-xr-x. 2 root root   32 Mar 20 14:04 hsperfdata_root
*/
 final JobSubmitter submitter = this.getJobSubmitter(this.cluster.getFileSystem(), this.cluster.getClient());
        this.status = (JobStatus)this.ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
            public JobStatus run() throws IOException, InterruptedException, ClassNotFoundException {
                // Step into submitJobInternal
                return submitter.submitJobInternal(Job.this, Job.this.cluster);
            }
        });

Step into submitJobInternal()

 JobStatus submitJobInternal(Job job, Cluster cluster) throws ClassNotFoundException, InterruptedException, IOException {
        // Validate the output path; this calls output.checkOutputSpecs(job)
        this.checkSpecs(job);
        /*
         * Press F7 to step into output.checkOutputSpecs(job):
         *
         * Path outDir = getOutputPath(job);
         * if (outDir == null) {
         *     throw new InvalidJobConfException("Output directory not set.");
         * } else {
         *     TokenCache.obtainTokensForNamenodes(job.getCredentials(), new Path[]{outDir}, job.getConfiguration());
         *     // The job is rejected if the output directory already exists
         *     if (outDir.getFileSystem(job.getConfiguration()).exists(outDir)) {
         *         throw new FileAlreadyExistsException("Output directory " + outDir + " already exists");
         *     }
         * }
         */

        // ===========================================================
        // Obtain a new job ID
        JobID jobId = this.submitClient.getNewJobID();
        // jobId has the form job_local<randid>_<jobid>, with jobid zero-padded to four digits
        /*
         * LocalJobRunner.getNewJobID, for reference:
         *
         * // this.randid = this.rand.nextInt(2147483647);
         * private static int jobid = 0;
         *
         * public synchronized org.apache.hadoop.mapreduce.JobID getNewJobID() {
         *     return new org.apache.hadoop.mapreduce.JobID("local" + this.randid, ++jobid);
         * }
         *
         * JobID constructor:
         *
         * public JobID(String jtIdentifier, int id) {
         *     super(id);
         *     this.jtIdentifier = new Text(jtIdentifier);
         * }
         */
        job.setJobID(jobId);
        Path submitJobDir = new Path(jobStagingArea, jobId.toString());
        JobStatus status = null;

        // =======================================================
        // Step into the input-split logic:
        int maps = this.writeSplits(job, submitJobDir);
        // ... (the rest of submitJobInternal is not shown)
    }

    // Parameter list abbreviated, as in the original notes
    private int writeSplits(Job ... ) {
        JobConf jConf = (JobConf)job.getConfiguration();
        int maps;
        if (jConf.getUseNewMapper()) {
            // The new API is in use, so execution enters this branch
            maps = this.writeNewSplits(job, jobSubmitDir);
        } else {
            maps = this.writeOldSplits(jConf, jobSubmitDir);
        }
        return maps;
    }
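The local job ID scheme described above can be reproduced in a few lines. This is a sketch under stated assumptions, not the LocalJobRunner code: the class LocalJobIdSketch and its format helper are hypothetical, and the four-digit zero padding of the sequence number is how I understand JobID.toString to render it.

```java
import java.util.Locale;
import java.util.Random;

// Hypothetical sketch of how a local job ID string is put together.
class LocalJobIdSketch {
    // Shared counter, incremented for every new job ID (mirrors the static jobid field).
    private static int jobid = 0;

    // Random identifier chosen once per runner instance (mirrors randid).
    private final int randid = new Random().nextInt(Integer.MAX_VALUE);

    // Assumed rendering: "job_" + identifier + "_" + sequence number padded to 4 digits.
    static String format(String jtIdentifier, int id) {
        return "job_" + jtIdentifier + "_" + String.format(Locale.ROOT, "%04d", id);
    }

    // Pre-increment, so the first ID handed out carries sequence number 1.
    synchronized String getNewJobId() {
        return format("local" + randid, ++jobid);
    }

    public static void main(String[] args) {
        LocalJobIdSketch runner = new LocalJobIdSketch();
        System.out.println(runner.getNewJobId());
        System.out.println(runner.getNewJobId());
    }
}
```

The string produced this way is also what ends up as the last path component of submitJobDir, since that path is built from jobId.toString().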

Step into maps = this.writeNewSplits(job, jobSubmitDir):

        Configuration conf = job.getConfiguration();
        InputFormat<?, ?> input = (InputFormat)ReflectionUtils.newInstance(job.getInputFormatClass(), conf);
        List<InputSplit> splits = input.getSplits(job); // step into this method

        // ---- Inside getSplits ----
        StopWatch sw = (new StopWatch()).start();
        // minSize defaults to 1L
        long minSize = Math.max(this.getFormatMinSplitSize(), getMinSplitSize(job));
        // maxSize defaults to 9223372036854775807L (Long.MAX_VALUE)
        long maxSize = getMaxSplitSize(job);
        List<InputSplit> splits = new ArrayList();
        // List all input files
        List<FileStatus> files = this.listStatus(job);
        // Loop over the files one at a time
        while(true) {
            while(true) {
                while(var9.hasNext()) {
                    FileStatus file = (FileStatus)var9.next();
                    // Get the file's path and length
                    Path path = file.getPath();
                    long length = file.getLen();
                    // . . . . . .
                        if (this.isSplitable(job, path)) {
                            long blockSize = file.getBlockSize(); // e.g. 128 MB, the HDFS default
                            // Compute the split size; by default it equals blockSize
                            long splitSize = this.computeSplitSize(blockSize, minSize, maxSize);
                            /*
                             * protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
                             *     // returns blockSize when minSize and maxSize keep their defaults
                             *     return Math.max(minSize, Math.min(maxSize, blockSize));
                             * }
                             */
                            // Split planning:
                            // the loop produces splits s1 = [0, 128 MB), s2 = [128 MB, 256 MB), ...
                            // Before each cut it checks whether the remaining bytes exceed 1.1 times
                            // the split size; if they do not, the loop stops and the remainder
                            // becomes a single split.
                            for(bytesRemaining = length; (double)bytesRemaining / (double)splitSize > 1.1D; bytesRemaining -= splitSize) {
                                blkIndex = this.getBlockIndex(blkLocations, length - bytesRemaining);
                                splits.add(this.makeSplit(path, length - bytesRemaining, splitSize, blkLocations[blkIndex].getHosts(), blkLocations[blkIndex].getCachedHosts()));
                            }
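The split-planning rule is easier to see with concrete numbers. The sketch below is plain Java with no Hadoop dependency; the class SplitPlanSketch and the plan method are hypothetical names. It reuses the computeSplitSize formula and the 1.1 threshold from the loop above. The final step, which turns the leftover bytes into one last split, does not appear in the excerpt and is my understanding of what follows the loop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of split planning for a single file.
class SplitPlanSketch {
    // Threshold from the loop condition: keep cutting only while remaining/splitSize > 1.1.
    static final double SPLIT_SLOP = 1.1;

    // Same formula as computeSplitSize in the walkthrough.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns {offset, length} pairs describing the planned splits.
    static List<long[]> plan(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = fileLength;
        while ((double) bytesRemaining / (double) splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileLength - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        // Assumed tail handling: whatever is left becomes one final split.
        if (bytesRemaining != 0) {
            splits.add(new long[] {fileLength - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        for (long fileMb : new long[] {300, 140}) {
            System.out.println(fileMb + " MB file:");
            for (long[] split : plan(fileMb * mb, splitSize)) {
                System.out.println("  offset=" + split[0] / mb + " MB, length=" + split[1] / mb + " MB");
            }
        }
    }
}
```

A 140 MB file is about 1.09 times the 128 MB split size, so it stays in one split rather than being cut into 128 MB plus 12 MB. A 300 MB file is cut twice and leaves a 44 MB tail.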

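Summary of the steps:

1. The job is submitted first, which enters the submit method.
2. One of the most important calls inside submit is connect, whose job is to create a Cluster object.

3. Constructing the Cluster object runs initialize, which iterates over Cluster's static member frameworkLoader to obtain ClientProtocolProvider instances. A provider can then create clientProtocol, the client protocol instance used to communicate with the cluster.

4. Hadoop creates the static member frameworkLoader through dynamic loading with ServiceLoader:

private static ServiceLoader<ClientProtocolProvider> frameworkLoader =ServiceLoader.load(ClientProtocolProvider.class);

5. ServiceLoader finds the implementation classes listed in the files under META-INF/services inside each jar and instantiates them. Internally this also relies on reflection.

6. Both the ClientProtocolProvider and the ClientProtocol are private member variables of the Cluster class.
7. In step 4, ServiceLoader is given ClientProtocolProvider as the service type, and which client gets created depends on the concrete ClientProtocolProvider subclass. That is how local mode and YARN mode are told apart.
ClientProtocolProvider has two subclasses: YarnClientProtocolProvider, which is used for YARN submission, and LocalClientProtocolProvider, which is used for local submission.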
8. What the LocalClientProtocolProvider source does:
------- reads mapreduce.framework.name from the configuration; the default is local
------- sets the number of map tasks for the job to 1
------- creates a LocalJobRunner object

9. What the YarnClientProtocolProvider source does:
------- reads mapreduce.framework.name from the configuration and returns null unless the value is yarn
------- creates a YARNRunner object
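10. Next, look at the YARNRunner source:
----- Its member variables
The most important member is resMgrDelegate, a ResourceMgrDelegate instance that acts as the delegate for the ResourceManager. In YARN mode the whole MapReduce client relies on it to communicate with the YARN cluster: it handles job submission, job status queries and similar operations, and cluster information is obtained through it. Internally it holds a YarnClient instance that does the actual communication with YARN, along with application-specific members such as ApplicationId and ApplicationSubmissionContext.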

   private static final Log LOG = LogFactory.getLog(YARNRunner.class);
    private static final RecordFactory recordFactory = RecordFactoryProvider.getRecordFactory((Configuration)null);
    public static final Priority AM_CONTAINER_PRIORITY;
    private ResourceMgrDelegate resMgrDelegate;
    private ClientCache clientCache;
    private Configuration conf;
    private final FileContext defaultFileContext;

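----- There are three constructors:
YARNRunner provides three constructors. When a job such as WordCount is submitted, the one-argument constructor is called. It first builds the ResourceManager delegate, a ResourceMgrDelegate instance, and then calls the two-argument constructor. That one builds the client cache, a ClientCache instance, and calls the three-argument constructor. The final constructor only assigns the member variables and then obtains the file context, the FileContext instance defaultFileContext, through the static FileContext.getFileContext() method.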
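The constructor chaining described above follows a common Java pattern, sketched here with hypothetical classes. RunnerChainSketch, Delegate and Cache are placeholders for YARNRunner, ResourceMgrDelegate and ClientCache, and a String stands in for the Configuration object. Each shorter constructor builds one more collaborator and delegates to the next longer one.

```java
// Hypothetical sketch of one-arg -> two-arg -> three-arg constructor chaining.
class RunnerChainSketch {
    // Placeholder for ResourceMgrDelegate.
    static class Delegate {
        final String conf;
        Delegate(String conf) { this.conf = conf; }
    }

    // Placeholder for ClientCache, which is built from the conf and the delegate.
    static class Cache {
        final String conf;
        final Delegate delegate;
        Cache(String conf, Delegate delegate) {
            this.conf = conf;
            this.delegate = delegate;
        }
    }

    final String conf;
    final Delegate delegate;
    final Cache cache;

    // One argument: build the delegate, then chain to the two-argument form.
    RunnerChainSketch(String conf) {
        this(conf, new Delegate(conf));
    }

    // Two arguments: build the cache, then chain to the three-argument form.
    RunnerChainSketch(String conf, Delegate delegate) {
        this(conf, delegate, new Cache(conf, delegate));
    }

    // Three arguments: plain field assignment only.
    RunnerChainSketch(String conf, Delegate delegate, Cache cache) {
        this.conf = conf;
        this.delegate = delegate;
        this.cache = cache;
    }

    public static void main(String[] args) {
        RunnerChainSketch runner = new RunnerChainSketch("demo-conf");
        System.out.println(runner.cache.delegate == runner.delegate);
    }
}
```

One benefit of this layout is that a test can call the two- or three-argument form directly and pass in its own collaborators.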
