What we analyzed earlier is really just the prelude to Hadoop job submission; the actual submission code lives in the MR program's main method, which RunJar invokes dynamically at the end, as explained before. What we want to do next is go one step further than RunJar and make job submission achievable directly from code, much like the Hadoop Eclipse Plugin lets you take an MR class containing a Mapper and Reducer and "Run on Hadoop" directly.
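To give a rough idea of what "submitting from code" looks like, the fragment below sets a few well-known cluster properties on the Configuration and points the Job at a pre-built jar, so the program can be launched from an IDE instead of through hadoop jar/RunJar. This is only a sketch: the host names, port numbers and jar path are placeholders of my own, not values from the cluster discussed here.

// Minimal sketch of submitting to a remote cluster from an IDE (hosts, ports and paths are placeholders).
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://namenode-host:8020");       // where HDFS lives
conf.set("mapreduce.framework.name", "yarn");                 // submit to YARN instead of the local runner
conf.set("yarn.resourcemanager.address", "rm-host:8032");     // ResourceManager address

Job job = Job.getInstance(conf, "word count");
// Outside RunJar there is no jar on the classpath to discover, so point at a built jar explicitly.
job.setJar("/path/to/wordcount.jar");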
Generally speaking, every MR program contains a similar block of job submission code. Here is the one from WordCount as an example:
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
  System.err.println("Usage: wordcount <in> <out>");
  System.exit(2);
}
Job job = new Job(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
The first step is to build a Configuration object and parse the command-line arguments. Next, a Job object is constructed for the submission, and the job jar, the Mapper and Reducer classes, the output key and value classes, and the job's input and output paths are set. Finally the job is submitted and the program waits for it to finish. These are only the most basic settings; many more are supported, which I will not go through one by one here. See the API documentation for details.
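To give a feel for the "many more settings", the fragment below shows a few other commonly used Job options. It is only an illustrative sketch and not part of the original WordCount code; the values are arbitrary examples.

// A few additional, commonly used settings (illustrative only).
job.setNumReduceTasks(4);                        // number of reduce tasks
job.setInputFormatClass(TextInputFormat.class);  // how input files are read and split
job.setMapOutputKeyClass(Text.class);            // map output types, if they differ
job.setMapOutputValueClass(IntWritable.class);   //   from the final output types
job.setPartitionerClass(HashPartitioner.class);  // how map output is partitioned across reducers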
Code analysis usually proceeds step by step from the beginning, but our focus is on what happens during submission, so for now we will not worry about how the earlier settings affect the job and jump straight to the submission step. Whenever a question forces us to look back at the earlier code, I will analyze it then.
public boolean waitForCompletion(boolean verbose
                                 ) throws IOException, InterruptedException,
                                          ClassNotFoundException {
  if (state == JobState.DEFINE) {
    submit();
  }
  if (verbose) {
    monitorAndPrintJob();
  } else {
    // get the completion poll interval from the client.
    int completionPollIntervalMillis =
      Job.getCompletionPollInterval(cluster.getConf());
    while (!isComplete()) {
      try {
        Thread.sleep(completionPollIntervalMillis);
      } catch (InterruptedException ie) {
      }
    }
  }
  return isSuccessful();
}
When job.waitForCompletion is called, if the Job is still in the DEFINE state it immediately calls submit(). After that it either calls monitorAndPrintJob() to track the progress of the Job and its Tasks, or enters a loop of its own, polling at a fixed interval to check whether the submitted Job has finished. Once it has, the loop exits and isSuccessful() returns the final status.
In other words, the actual submission is done by the submit() method; if the argument passed in is true, job progress is printed as the job runs, otherwise the method simply waits for the job to end.
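For comparison, here is a small sketch of the two ways client code can drive this: the blocking waitForCompletion(true) call used by WordCount, and a manual submit() followed by polling isComplete(). The 5-second interval is an arbitrary choice for the example, not Hadoop's default, and exception handling is omitted.

// Variant 1: submit and block, printing progress (what WordCount does).
boolean ok = job.waitForCompletion(true);

// Variant 2 (sketch): submit asynchronously and poll by hand.
job.submit();
while (!job.isComplete()) {
  Thread.sleep(5000);  // arbitrary poll interval for this example
  System.out.println("map " + (job.mapProgress() * 100) + "% "
      + "reduce " + (job.reduceProgress() * 100) + "%");
}
boolean succeeded = job.isSuccessful();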
/**
 * Submit the job to the cluster and return immediately.
 * @throws IOException
 */
public void submit()
       throws IOException, InterruptedException, ClassNotFoundException {
  ensureState(JobState.DEFINE);
  setUseNewAPI();
  connect();
  final JobSubmitter submitter =
      getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
  status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
    public JobStatus run() throws IOException, InterruptedException,
    ClassNotFoundException {
      return submitter.submitJobInternal(Job.this, cluster);
    }
  });
  state = JobState.RUNNING;
  LOG.info("The url to track the job: " + getTrackingURL());
}
Inside submit() there is one more layer: it delegates to the submitJobInternal method of the Job's internal JobSubmitter object, and that is where the real work begins.
/**
 * Internal method for submitting jobs to the system.
 *
 * <p>The job submission process involves:
 * <ol>
 *   <li>
 *   Checking the input and output specifications of the job.
 *   </li>
 *   <li>
 *   Computing the {@link InputSplit}s for the job.
 *   </li>
 *   <li>
 *   Setup the requisite accounting information for the
 *   {@link DistributedCache} of the job, if necessary.
 *   </li>
 *   <li>
 *   Copying the job's jar and configuration to the map-reduce system
 *   directory on the distributed file-system.
 *   </li>
 *   <li>
 *   Submitting the job to the <code>JobTracker</code> and optionally
 *   monitoring it's status.
 *   </li>
 * </ol></p>
 * @param job the configuration to submit
 * @param cluster the handle to the Cluster
 * @throws ClassNotFoundException
 * @throws InterruptedException
 * @throws IOException
 */
JobStatus submitJobInternal(Job job, Cluster cluster)
    throws ClassNotFoundException, InterruptedException, IOException {

  // validate the jobs output specs
  checkSpecs(job);

  Configuration conf = job.getConfiguration();
  addMRFrameworkToDistributedCache(conf);

  Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);
  // configure the command line options correctly on the submitting dfs
  InetAddress ip = InetAddress.getLocalHost();
  if (ip != null) {
    submitHostAddress = ip.getHostAddress();
    submitHostName = ip.getHostName();
    conf.set(MRJobConfig.JOB_SUBMITHOST, submitHostName);
    conf.set(MRJobConfig.JOB_SUBMITHOSTADDR, submitHostAddress);
  }
  JobID jobId = submitClient.getNewJobID();
  job.setJobID(jobId);
  Path submitJobDir = new Path(jobStagingArea, jobId.toString());
  JobStatus status = null;
  try {
    conf.set(MRJobConfig.USER_NAME,
        UserGroupInformation.getCurrentUser().getShortUserName());
    conf.set("hadoop.http.filter.initializers",
        "org.apache.hadoop.yarn.server.webproxy.amfilter.AmFilterInitializer");
    conf.set(MRJobConfig.MAPREDUCE_JOB_DIR, submitJobDir.toString());
    LOG.debug("Configuring job " + jobId + " with " + submitJobDir
        + " as the submit dir");
    // get delegation token for the dir
    TokenCache.obtainTokensForNamenodes(job.getCredentials(),
        new Path[] { submitJobDir }, conf);

    populateTokenCache(conf, job.getCredentials());

    // generate a secret to authenticate shuffle transfers
    if (TokenCache.getShuffleSecretKey(job.getCredentials()) == null) {
      KeyGenerator keyGen;
      try {
        keyGen = KeyGenerator.getInstance(SHUFFLE_KEYGEN_ALGORITHM);
        keyGen.init(SHUFFLE_KEY_LENGTH);
      } catch (NoSuchAlgorithmException e) {
        throw new IOException("Error generating shuffle secret key", e);
      }
      SecretKey shuffleKey = keyGen.generateKey();
      TokenCache.setShuffleSecretKey(shuffleKey.getEncoded(),
          job.getCredentials());
    }

    copyAndConfigureFiles(job, submitJobDir);

    Path submitJobFile = JobSubmissionFiles.getJobConfPath(submitJobDir);

    // Create the splits for the job
    LOG.debug("Creating splits at " + jtFs.makeQualified(submitJobDir));
    int maps = writeSplits(job, submitJobDir);
    conf.setInt(MRJobConfig.NUM_MAPS, maps);
    LOG.info("number of splits:" + maps);

    // write "queue admins of the queue to which job is being submitted"
    // to job file.
    String queue = conf.get(MRJobConfig.QUEUE_NAME,
        JobConf.DEFAULT_QUEUE_NAME);
    AccessControlList acl = submitClient.getQueueAdmins(queue);
    conf.set(toFullPropertyName(queue,
        QueueACL.ADMINISTER_JOBS.getAclName()), acl.getAclString());

    // removing jobtoken referrals before copying the jobconf to HDFS
    // as the tasks don't need this setting, actually they may break
    // because of it if present as the referral will point to a
    // different job.
    TokenCache.cleanUpTokenReferral(conf);

    if (conf.getBoolean(
        MRJobConfig.JOB_TOKEN_TRACKING_IDS_ENABLED,
        MRJobConfig.DEFAULT_JOB_TOKEN_TRACKING_IDS_ENABLED)) {
      // Add HDFS tracking ids
      ArrayList<String> trackingIds = new ArrayList<String>();
      for (Token<? extends TokenIdentifier> t :
          job.getCredentials().getAllTokens()) {
        trackingIds.add(t.decodeIdentifier().getTrackingId());
      }
      conf.setStrings(MRJobConfig.JOB_TOKEN_TRACKING_IDS,
          trackingIds.toArray(new String[trackingIds.size()]));
    }

    // Write job file to submit dir
    writeConf(conf, submitJobFile);

    //
    // Now, actually submit the job (using the submit name)
    //
    printTokens(jobId, job.getCredentials());
    status = submitClient.submitJob(
        jobId, submitJobDir.toString(), job.getCredentials());
    if (status != null) {
      return status;
    } else {
      throw new IOException("Could not launch job");
    }
  } finally {
    if (status == null) {
      LOG.info("Cleaning up the staging area " + submitJobDir);
      if (jtFs != null && submitJobDir != null)
        jtFs.delete(submitJobDir, true);
    }
  }
}
The submitJobInternal() method performs the following main steps:
· Check the job's input and output specifications, obtain the configuration and the address of the submitting host, generate the JobID, determine the required working (staging) directory (which is also where the MRAppMaster later picks up the job files), and set the information needed during execution;
· Copy the required jar and configuration files to the designated working directory on HDFS so that every node can access them (the sketch after this list shows the typical layout);
· Compute the number of input splits, which determines the number of map tasks;
· Call submitJob() of the YARNRunner class to submit the job, passing along the required parameters (e.g. the JobID);
· Wait for submitJob() to return the job's status and, if the submission did not succeed, clean up the working directory.
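To make the "working directory" concrete, the listing below shows what the submit directory typically looks like on HDFS under a default YARN setup (staging root /tmp/hadoop-yarn/staging). The exact paths depend on yarn.app.mapreduce.am.staging-dir and the submitting user, and the job ID shown is made up, so treat this as an assumed example rather than output from this cluster.

/tmp/hadoop-yarn/staging/<user>/.staging/job_1400000000000_0001/
    job.jar            // the job's jar, copied by copyAndConfigureFiles()
    job.xml            // the serialized Configuration, written by writeConf()
    job.split          // the serialized InputSplits, written by writeSplits()
    job.splitmetainfo  // split meta info (locations/offsets) read by the MRAppMaster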
Now let us go through this process step by step. The first thing submitJobInternal does is obtain the staging area, generate the jobId and derive the submitJobDir directory from them; this directory holds everything the job submission needs. It then sets the paths of dependent libraries, configuration files and so on, and performs the security-related setup. After that the job's input files are split, i.e. the split step, which is carried out mainly through
int maps = writeSplits(job, submitJobDir);
This splitting is purely logical; the stored data is not physically cut. Once splitting is done, each split corresponds to one map task, and the split metadata (job.split and its meta info) is written into the submitJobDir directory, to be handed to the ResourceManager together with the job and run on YARN.
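As a rough sketch of how the number of logical splits, and hence of map tasks, can be influenced from the client side: for FileInputFormat-based jobs the split size is derived from the HDFS block size together with the configured minimum and maximum split sizes, which can be set as below. The values are arbitrary examples, not recommendations from this article.

// Influence the logical split size for FileInputFormat-based inputs (values are examples only).
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // at least 64 MB per split
FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // at most 256 MB per split
// Roughly: splitSize = max(minSize, min(maxSize, blockSize)), so a 1 GB input file
// would produce on the order of 1 GB / splitSize map tasks.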
In the next post I will focus on how writeSplits(job, submitJobDir) executes, that is, how the job's input is split and how the split files are generated.