Collecting Hadoop Job Execution Status Information

小桥

A recent project required me to collect execution status information for Hadoop jobs. I came up with the following approaches:

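1. Get the required information from the jobtracker.jsp page that Hadoop provides. The problem here is that the page reads the JobTracker from the application scope:

    JobTracker tracker = (JobTracker) application.getAttribute("job.tracker");

while the Jetty server that sets this attribute is embedded inside Hadoop itself. From org.apache.hadoop.mapred.JobTracker (JobTracker.java):

    InetSocketAddress infoSocAddr = NetUtils.createSocketAddr(
        conf.get(JT_HTTP_ADDRESS, "0.0.0.0:50030"));
    // (excerpt; the lines that set infoBindAddress and tmpInfoPort are omitted)
    infoServer = new HttpServer("job", infoBindAddress, tmpInfoPort,
        tmpInfoPort == 0, conf);
    infoServer.setAttribute("job.tracker", this);

So, to obtain the statistics through a JSP page of your own, you must either bypass the Jetty server, or modify JobTracker so that it returns a reference to infoServer and then use that reference in your code. This clearly means changing Hadoop's core code, so it is not very flexible.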
2. Parse the JSP output with a script.
Running wget http://localhost:50030/jobtracker.jsp returns a page that contains:

-----------------------------------------------------------------------------------------------
State: RUNNING
Started: Tue Dec 28 09:43:40 CST 2010
Version: 0.21.0, 985326
Compiled: Tue Aug 17 01:02:28 EDT 2010 by tomwhite from branches/branch-0.21
Identifier: 201012280943
...............................................................................................
All of this information can be extracted with Python's Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/).
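The same idea can be sketched in Java using only the standard library, with a regular expression swapped in for Beautiful Soup. This is a rough sketch rather than a robust HTML parser: the class name JobTrackerPageParser is hypothetical, the list of field labels is taken from the output quoted above, and the actual markup of jobtracker.jsp is an assumption (the code only assumes that each label is followed by a colon once the tags are stripped).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the summary fields out of the jobtracker.jsp page text.
public class JobTrackerPageParser {
    // Labels as they appear in the quoted output; extend this list for other fields.
    private static final String LABELS = "State|Started|Version|Compiled|Identifier";

    // A label, a colon, then the shortest value that runs up to the next label or end of text.
    private static final Pattern FIELD = Pattern.compile(
        "(" + LABELS + "):\\s*(.*?)\\s*(?=(?:" + LABELS + "):|$)",
        Pattern.DOTALL);

    /** Strips tags, collapses whitespace, and returns a label -> value map. */
    public static Map<String, String> parse(String html) {
        String text = html.replaceAll("<[^>]*>", " ")  // drop HTML tags
                          .replaceAll("\\s+", " ")      // join values wrapped across lines
                          .trim();
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }
}
```

Note that the last matched field runs to the end of the input, so on a full page you would likely want to cut the text down to the summary section first, or add the following label to the list.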

3. Upgrade your Hadoop version to Hadoop-0.21.0, whose Cluster class provides a rich API:

| Return type | Method | Description |
| --- | --- | --- |
| `void` | `cancelDelegationToken(org.apache.hadoop.security.token.Token<org.apache.hadoop.mapreduce.security.token.delegation.DelegationTokenIdentifier> token)` | Cancel a delegation token from the JobTracker. |
| `void` | `close()` | Close the `Cluster`. |
| `TaskTrackerInfo[]` | `getActiveTaskTrackers()` | Get all active trackers in the cluster. |
| `Job[]` | `getAllJobs()` | Get all the jobs in cluster. |
| `TaskTrackerInfo[]` | `getBlackListedTaskTrackers()` | Get blacklisted trackers. |
| `QueueInfo[]` | `getChildQueues(String queueName)` | Returns immediate children of queueName. |
| `ClusterMetrics` | `getClusterStatus()` | Get current cluster status. |
| `org.apache.hadoop.security.token.Token<org.apache.hadoop.mapreduce.security.token.delegation.DelegationTokenIdentifier>` | `getDelegationToken(org.apache.hadoop.io.Text renewer)` | Get a delegation token for the user from the JobTracker. |
| `org.apache.hadoop.fs.FileSystem` | `getFileSystem()` | Get the file system where job-specific files are stored. |
| `Job` | `getJob(JobID jobId)` | Get job corresponding to jobid. |
| `String` | `getJobHistoryUrl(JobID jobId)` | Get the job history file path for a given job id. |
| `State` | `getJobTrackerState()` | Get JobTracker's state. |
| `QueueInfo` | `getQueue(String name)` | Get queue information for the specified name. |
| `QueueAclsInfo[]` | `getQueueAclsForCurrentUser()` | Gets the Queue ACLs for current user. |
| `QueueInfo[]` | `getQueues()` | Get all the queues in cluster. |
| `QueueInfo[]` | `getRootQueues()` | Gets the root level queues. |
| `org.apache.hadoop.fs.Path` | `getStagingAreaDir()` | Grab the jobtracker's view of the staging directory path where job-specific files will be placed. |
| `org.apache.hadoop.fs.Path` | `getSystemDir()` | Grab the jobtracker system directory path where job-specific files will be placed. |

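For example, to print information about the jobs, all we need is:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Cluster;
    import org.apache.hadoop.mapreduce.Job;

    // ListJobs is just an illustrative class name.
    public class ListJobs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Cluster cluster = new Cluster(conf);  // uses the cluster settings found in conf
            Job[] jobs = cluster.getAllJobs();
            if (jobs != null) {
                for (Job job : jobs) {
                    System.out.println(job.getJobID());
                    System.out.println(job.getJobName());
                    System.out.println(job.getStartTime());
                    System.out.println(job.getFinishTime());
                }
            }
            cluster.close();
        }
    }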
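The start and finish times printed by the loop above are long values, which are hard to read as they stand. The following is a small sketch for turning them into readable text, assuming that they are milliseconds since the Unix epoch and that a finish time of 0 means the job has not finished yet (both are assumptions worth checking against your Hadoop version). The class JobTimes and its methods are hypothetical helpers, not part of Hadoop.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for displaying job timestamps; not part of the Hadoop API.
public class JobTimes {
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC);

    /** Formats an epoch-millisecond timestamp as a UTC date and time. */
    public static String format(long epochMillis) {
        return FMT.format(Instant.ofEpochMilli(epochMillis));
    }

    /**
     * Returns the elapsed time as HH:mm:ss. A finish time of 0 is treated as
     * "still running" (an assumption), in which case nowMillis is used as the end.
     */
    public static String elapsed(long startMillis, long finishMillis, long nowMillis) {
        long end = finishMillis > 0 ? finishMillis : nowMillis;
        Duration d = Duration.ofMillis(Math.max(0, end - startMillis));
        return String.format("%02d:%02d:%02d",
            d.toHours(), d.toMinutes() % 60, d.getSeconds() % 60);
    }
}
```

Inside the loop this could be used as, for instance, System.out.println(JobTimes.format(job.getStartTime())), with System.currentTimeMillis() passed as nowMillis when computing the elapsed time.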
:-)...

Klose: We should keep improving together with Hadoop.