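Take CDH 6.0.1 as an example. Its resource management framework is YARN, and the running state of every job (or app) on YARN can be queried through the RESTful API exposed by YARN's ResourceManager (RM for short). The GET request has the following format: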
GET http(or https)://rm-http(or https)-address:port/ws/v1/cluster/apps
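The response is JSON describing every job that YARN has on record. Filter parameters can be appended to the URL as a query string to narrow the result. The supported filters are: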
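states: job state as recorded by the RM; separate multiple values with commas. Valid values: NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED
finalStatus: final status as reported by the job itself. Valid values: UNDEFINED, SUCCEEDED, FAILED, KILLED
user: the user who started the job
queue: the YARN resource pool (queue) the job ran in
limit: maximum number of jobs to return
startedTimeBegin: return jobs that started at or after this time (ms since epoch)
startedTimeEnd: return jobs that started at or before this time (ms since epoch)
finishedTimeBegin: return jobs that finished at or after this time (ms since epoch)
finishedTimeEnd: return jobs that finished at or before this time (ms since epoch)
applicationTypes: job type; separate multiple values with commas
applicationTags: job tags; separate multiple values with commas
deSelects: fields to omit from the response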
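As a rough illustration of how these filters combine into a query string, here is a minimal Python 3 sketch. The helper name build_apps_url is made up for this example, and the RM address is the one used by the scripts in this post; adjust both to your environment.

```python
from urllib.parse import urlencode

# RM address taken from the scripts in this post; replace with your own.
RM_BASE = "https://192.168.0.39:8090"


def build_apps_url(base, **filters):
    """Return the cluster apps URL with the given filters appended.

    Hypothetical helper for illustration; it is not part of any YARN client.
    """
    # Lists and tuples become the comma-separated form the API expects;
    # filters set to None are dropped.
    params = {
        key: ",".join(value) if isinstance(value, (list, tuple)) else value
        for key, value in filters.items()
        if value is not None
    }
    url = base.rstrip("/") + "/ws/v1/cluster/apps"
    # safe="," keeps commas literal instead of percent-encoding them.
    return url + "?" + urlencode(params, safe=",") if params else url


print(build_apps_url(
    RM_BASE,
    states=["RUNNING", "ACCEPTED"],
    applicationTypes="SPARK",
    limit=20,
))
```

Calling the helper without any filters returns the bare /ws/v1/cluster/apps URL, which lists every job the RM knows about.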
1. Monitoring failed jobs
#!/usr/bin/env python3
# Report YARN jobs that ended with finalStatus=FAILED during the last <interval> seconds.
# Requires Python 3.
import sys
import json
import time
import subprocess

if len(sys.argv) < 2:
    print("FAILED Error! Need a time interval param (unit: s) to run!")
    sys.exit(1)

# Fetch the jobs that finished between (now - interval) and now.
# The API expects timestamps in milliseconds.
interval = int(sys.argv[1]) * 1000
now = int(time.time() * 1000)
before = now - interval
cmd = ('curl -k --compressed -H "Accept: application/json" -X GET '
       '"https://192.168.0.39:8090/ws/v1/cluster/apps'
       '?finalStatus=FAILED&finishedTimeBegin={0}&finishedTimeEnd={1}"').format(before, now)
getAppsProc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE, shell=True)
outPut, err = getAppsProc.communicate()
# curl exits with 0 on success
if not getAppsProc.returncode:
    out = json.loads(outPut.decode('utf-8'))
    # 'apps' is empty when no job matches the filters
    if out['apps']:
        for app in out['apps']['app']:
            print('appId: {0}'.format(app['id']))
            print('name: {0}'.format(app['name']))
            print('finalStatus: {0}'.format(app['finalStatus']))
            print('appType: {0}'.format(app['applicationType']))
            print('startedTime: {0}'.format(time.ctime(app['startedTime'] / 1000)))
            print('finishedTime: {0}'.format(time.ctime(app['finishedTime'] / 1000)))
            print('elapsedTime: {0}s'.format(app['elapsedTime'] // 1000))
            print('user: {0}'.format(app['user']))
            print('queue: {0}'.format(app['queue']))
            print('-------------------------------------')
        print('{0} jobs failed'.format(len(out['apps']['app'])))
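The script above checks out['apps'] before iterating because that value can be empty when nothing matches the filters. The following Python 3 sketch isolates that unwrapping step so it can be tried without a cluster. The function name extract_apps and both sample responses are invented for illustration; they are not real RM output.

```python
import json


def extract_apps(raw):
    """Return the list of application records from a raw response body.

    Hypothetical helper: assumes the body is either {"apps": null} or
    {"apps": {"app": [...]}}, which is the shape the script above relies on.
    """
    payload = json.loads(raw)
    # Treat a missing or null "apps" value as "no applications".
    apps = payload.get("apps") or {}
    return apps.get("app", [])


# Made-up sample responses for illustration only.
empty = '{"apps": null}'
one_failed = json.dumps({"apps": {"app": [{
    "id": "application_0000000000000_0001",
    "name": "demo-job",
    "finalStatus": "FAILED",
}]}})

print(len(extract_apps(empty)))
print([app["name"] for app in extract_apps(one_failed)])
```

Returning an empty list in the no-match case lets the caller loop and count without a separate null check.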
2. Monitoring jobs that exceed their expected run time
#!/usr/bin/env python3
# Report a named Spark job that has been running longer than expected.
# Requires Python 3.
import re
import sys
import json
import time
import subprocess

# Jobs whose name matches this pattern (streaming jobs) are never reported
JOB_TYPE_PATTERN = ".*stream.*"
pattern = re.compile(JOB_TYPE_PATTERN, re.I)

if len(sys.argv) < 3:
    print("FAILED Error! Need a job name and an expected duration param (unit: min) to run!")
    sys.exit(1)

# Name of the job to monitor
job_name = str(sys.argv[1])
# Expected run time of the job, in minutes
duration = int(sys.argv[2])
cmd = ('curl -k --compressed -H "Accept: application/json" -X GET '
       '"https://192.168.0.39:8090/ws/v1/cluster/apps?states=RUNNING&applicationTypes=SPARK"')
getAppsProc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE, shell=True)
outPut, err = getAppsProc.communicate()
# curl exits with 0 on success
if not getAppsProc.returncode:
    out = json.loads(outPut.decode('utf-8'))
    # 'apps' is empty when no job matches the filters
    if out['apps']:
        for app in out['apps']['app']:
            app_name = app['name']
            # elapsedTime is reported in milliseconds; convert to minutes
            elapsed_time = app['elapsedTime'] / 60000.0
            if (not pattern.match(app_name)) and elapsed_time > duration and job_name == app_name:
                print('startedTime: {0}'.format(time.ctime(app['startedTime'] / 1000)))
                print('elapsedTime: %.1fmin' % (elapsed_time))
                print('The job [{0}] not finished yet'.format(app_name))
                print('-------------------------------------')
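The timeout check itself is independent of how the job list is fetched. As a sketch, it can be pulled into a small Python 3 function and exercised against hand-written records. The function name find_overdue and the sample records are assumptions made for this example; only the name and elapsedTime fields from the script above are used.

```python
import re

# Same exclusion rule as the script above: skip jobs whose name contains "stream".
STREAM_PATTERN = re.compile(r".*stream.*", re.I)


def find_overdue(apps, job_name, max_minutes):
    """Return (app, elapsed_minutes) pairs for job_name running longer than max_minutes."""
    overdue = []
    for app in apps:
        elapsed_min = app["elapsedTime"] / 60000.0  # elapsedTime is in milliseconds
        if (app["name"] == job_name
                and not STREAM_PATTERN.match(app["name"])
                and elapsed_min > max_minutes):
            overdue.append((app, elapsed_min))
    return overdue


# Made-up records for illustration only.
sample = [
    {"name": "daily-etl", "elapsedTime": 45 * 60000},
    {"name": "daily-etl", "elapsedTime": 10 * 60000},
    {"name": "click-stream", "elapsedTime": 600 * 60000},
]

for app, minutes in find_overdue(sample, "daily-etl", 30):
    print("%s has been running for %.1f min" % (app["name"], minutes))
```

With the sample data, only the 45-minute daily-etl record exceeds the 30-minute limit. The click-stream record is skipped even when asked for by name, because its name matches the streaming pattern.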
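Note: 192.168.0.39:8090 is the address and web (HTTPS) port on which the ResourceManager listens. If accessing the https address with curl fails with the error curl: (35) Cannot communicate securely with peer: no common encryption algorithm(s), update curl and its dependencies with yum -y install curl, or add the -k option. Keep in mind that -k disables certificate verification, so use it only on a network you trust.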
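If shelling out to curl is not an option, the same request can be made from Python 3's standard library. This is only a sketch under stated assumptions: the function names are invented, and certificate checking is switched off to mirror curl's -k option, which is acceptable only on a trusted network.

```python
import json
import ssl
import urllib.request


def insecure_context():
    """Build an SSL context that skips certificate checks, like curl -k."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must be disabled before verify_mode
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch_apps(url, timeout=10):
    """GET the given RM URL and return the decoded JSON body."""
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout,
                                context=insecure_context()) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A call such as fetch_apps("https://192.168.0.39:8090/ws/v1/cluster/apps?states=RUNNING") would then return the same structure that the scripts parse from curl's output.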
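Finally, ZABBIX can be used to raise alerts for failed and overdue jobs. All that is needed is to configure the corresponding items, triggers, and alert media.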