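Architecture diagram:
Notes:
Monitoring for all EMR Flink jobs (including WAFStreamDW): developed and deployed. DONE (2023-02-28)
1. Development
2. Log format definition
3. Alert configuration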
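Problem: when fetching the session or job status fails, the response format differs from the normal one.
1. Session running && job failed: the job response is a JSON string containing errors, so the job-name and job-status fields cannot be read from it.
Solution: handle this in the code, and handle a null status in the SQL, so that alerting still works.
2. Session failed && job failed: the session state can still be fetched, but the request for the job state under it returns an XML-format "Content has moved" response.
Solution: set the job's name and status manually in the code.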
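Both failure cases can be handled in one place. The following is a minimal sketch, not the production code: `normalize_job` is a hypothetical helper, the sample response bodies are made up for illustration, and treating every non-JSON body as a FAILED job is an assumption that mirrors case 2 above.

```shell
# normalize_job JNAME: read a raw job response on stdin, print one compact JSON object.
normalize_job() {
  jname="$1"
  raw=$(cat)
  if printf '%s' "$raw" | jq -e 'type == "object"' >/dev/null 2>&1; then
    # Valid JSON (a normal response or an errors object): tag it with the job name.
    # A missing state is left absent so the null check in the SQL still applies.
    printf '%s' "$raw" | jq -c --arg n "$jname" '.jname = $n'
  else
    # Not JSON (for example the XML "Content has moved" body): synthesize a record.
    jq -c -n --arg n "$jname" '{state: "FAILED", jname: $n}'
  fi
}

# Hypothetical sample bodies for the two failure cases
printf '%s' '<html>Content has moved</html>' | normalize_job demo-job
printf '%s' '{"errors":["Job not found"]}' | normalize_job demo-job
```

With this split, the code only invents a state when no JSON is available at all, and the SQL remains responsible for the null-state case.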
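Using jq in the shell:
Add a key-value pair:
cat json1.json | jq '.key="value"'
Update a key-value pair (same syntax; an existing key is overwritten):
cat json1.json | jq '.key="value2"'
Add a key with a dynamic value (quote the variable so values containing spaces stay intact):
xargs=auto-value
cat json1.json | jq --arg foo "${xargs}" '.key=$foo'
Write pretty-printed JSON to a file in compact (single-line) form:
cat json1.json | jq -c '.' > target.json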
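The individual jq steps above can also be combined into a single invocation. A small sketch using inline sample JSON in place of json1.json; the key names and the value are placeholders:

```shell
value="auto value"   # contains a space on purpose; the quotes below keep it as one argument

result=$(printf '%s\n' '{
  "key": "old"
}' | jq -c --arg foo "$value" '.key = $foo | .added = "value"')

echo "$result"
```

Chaining assignments with `|` inside one filter avoids spawning a separate jq process per key, and `-c` produces the single-line form in the same step.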
Command to get the running state of a YARN session:
curl -s http://192.168.0.1:8088/ws/v1/cluster/apps/application_123456_0246 | jq -r '.app.state'
Command to get the running state of a Flink job:
curl -s http://192.168.0.1:8088/proxy/application_123456_0246/jobs/4b3bcb1babcdef4ebd577efd48cbfa07 | jq -r '.state'
Combined script:
# Write the JSON logs to the target dir (/mnt/disk1/log/flink-jobs)
# so that they can be collected by SLS Logtail.
# NOTE: $year and $day are not set in this script; they are assumed to be supplied by the caller (e.g. as scheduler parameters).
# LISTS=($(cat /home/hadoop/flink_job_list.txt))
cp ossref://oss-file-tools/EMR_Project/EMRAlert/flink_job_list.txt /home/hadoop/
LISTS=($(cat /home/hadoop/flink_job_list.txt))
for string in "${LISTS[@]}"
do
  # Each entry: <yarn application id>,<flink job id>,<flink job name>
  array=(${string//,/ })
  # If the YARN session is FAILED, treat the Flink job as FAILED too.
  curl -s "http://192.168.0.1:8088/ws/v1/cluster/apps/${array[0]}" | jq -r '.app.state' > "/mnt/disk1/log/flink-jobs/$year/tmp_session.state"
  # YARN session tmp file
  curl -s "http://192.168.0.1:8088/ws/v1/cluster/apps/${array[0]}" | jq -c '.emr_cluster_id="C-95927ABC297E8648" | .emr_cluster_name="oss-prod-emr-flink"' > "/mnt/disk1/log/flink-jobs/$year/tmp_session.json"
  # Flink job tmp file
  if [ "FAILED" = "$(cat "/mnt/disk1/log/flink-jobs/$year/tmp_session.state")" ]; then
    echo '{"state":"FAILED"}' | jq -c --arg session "${array[0]}" --arg jid "${array[1]}" --arg jname "${array[2]}" '.emr_cluster_id="C-95927ABC297E8648" | .emr_cluster_name="oss-prod-emr-flink" | .yarn_session=$session | .jid=$jid | .jname=$jname' > "/mnt/disk1/log/flink-jobs/$year/tmp_job.json"
  else
    curl -s "http://192.168.0.1:8088/proxy/${array[0]}/jobs/${array[1]}" | jq -c --arg session "${array[0]}" --arg jname "${array[2]}" '.emr_cluster_id="C-95927ABC297E8648" | .emr_cluster_name="oss-prod-emr-flink" | .yarn_session=$session | .jname=$jname' > "/mnt/disk1/log/flink-jobs/$year/tmp_job.json"
  fi
  # Merge the job and session details into one log line
  jq -c -s '.[0] * .[1]' "/mnt/disk1/log/flink-jobs/$year/tmp_job.json" "/mnt/disk1/log/flink-jobs/$year/tmp_session.json" >> "/mnt/disk1/log/flink-jobs/$year/flink_job_$day.log"
done
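To make the list format and the merge step concrete, here is a small offline sketch with no curl calls. The list line, the session name `demo-session`, and the job name `demo-job` are hypothetical sample values; the three-field layout is inferred from how the script uses the array elements.

```shell
# One hypothetical line of flink_job_list.txt: <application id>,<job id>,<job name>
line="application_123456_0246,4b3bcb1babcdef4ebd577efd48cbfa07,demo-job"

# Split on commas with POSIX parameter expansion
app_id=${line%%,*}
rest=${line#*,}
job_id=${rest%%,*}
job_name=${rest#*,}

# Job record as built in the "session FAILED" branch (cluster fields omitted for brevity)
job_json=$(jq -c -n --arg s "$app_id" --arg j "$job_id" --arg n "$job_name" \
  '{state: "FAILED", yarn_session: $s, jid: $j, jname: $n}')

# Minimal stand-in for the YARN session response
session_json='{"app":{"name":"demo-session","state":"FAILED"}}'

# Slurp both objects into an array and merge them recursively with `*`
merged=$(printf '%s\n%s\n' "$job_json" "$session_json" | jq -c -s '.[0] * .[1]')
echo "$merged"
```

The merged object keeps the job fields (`state`, `jid`, `jname`) at the top level and nests the session under `app`, which is the shape the SLS queries read.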
The task is uploaded to EMR and also alerts on itself (note: emr_cluster_id and emr_cluster_name were added for this).
WorkFlow:
FlinkClusterLogCollection
Monitors the running status of the YARN session & Flink job every 5 minutes
Job:
FlinkClusterLogCollection
Log examples:
SessionRunning&JobRunning:
(omitted)
SessionRunning&JobFailed:
(omitted)
SessionFailed&JobFailed:
(omitted)
SLS-based Flink job monitoring and alert settings:
EMRFlinkJobMonitor_YarnSessionAlert
* | select json_extract_scalar(app, '$.name') as session_name, json_extract_scalar(app, '$.state') as session_state, jname, if(state is null,'JobNotFound',state) as jstate
Group by: session_name
Trigger when data matches: session_state != 'RUNNING'
Title: 【Reminder】FlinkYarnSession: ${session_name} is currently in state ${session_state}! Please handle it!
Content: FlinkYarnSession: ${session_name} is currently in state ${session_state}! Please handle it!
Action policy: EMRFlinkJobMonitor
EMRFlinkJobMonitor_FlinkJobAlert
* | select json_extract_scalar(app, '$.name') as session_name, json_extract_scalar(app, '$.state') as session_state, jname, if(state is null,'JobNotFound',state) as jstate
Group by: jname
Trigger when data matches: jstate != 'RUNNING'
Title: 【Reminder】FlinkJob: ${jname} in FlinkYarnSession: ${session_name} is currently in state ${jstate}! Please handle it!
Content: FlinkJob: ${jname} in FlinkYarnSession: ${session_name} is currently in state ${jstate}! Please handle it!
Action policy: EMRFlinkJobMonitor
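The alert conditions can be checked locally against a log line before relying on the SLS rules. The sketch below is a rough jq equivalent of the query logic, not SLS syntax; `alert_fields` and `needs_alert` are hypothetical helper names and the sample lines are made up.

```shell
# Print session_name, session_state, jname, jstate (tab-separated) for one log line on stdin.
# A missing job state becomes "JobNotFound", like if(state is null, ...) in the query.
alert_fields() {
  jq -r '[.app.name, .app.state, .jname, (.state // "JobNotFound")] | @tsv'
}

# Exit status 0 when either the session or the job is not RUNNING.
needs_alert() {
  jq -e '(.app.state != "RUNNING") or ((.state // "JobNotFound") != "RUNNING")' >/dev/null
}

# Hypothetical log lines: one with a missing job state, one healthy
broken='{"errors":["Job not found"],"jname":"demo-job","app":{"name":"demo-session","state":"RUNNING"}}'
healthy='{"state":"RUNNING","jname":"demo-job","app":{"name":"demo-session","state":"RUNNING"}}'

fields=$(printf '%s\n' "$broken" | alert_fields)
echo "$fields"
if printf '%s\n' "$broken" | needs_alert; then echo "alert: broken line"; fi
if ! printf '%s\n' "$healthy" | needs_alert; then echo "ok: healthy line"; fi
```

Running this over a few lines of the collected log is a quick way to confirm that the null-state fallback behaves as intended before an alert fires.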
YarnSessionAlertConfig:
FlinkJobAlertConfig:
Last updated: 2023-02-28 16:41:29