Use Spark's built-in listener callbacks to monitor both offline and real-time jobs, giving fine-grained task monitoring and failure alerting.
- Extend the abstract class SparkListener and override the callbacks you need.
The following callbacks let you collect metrics for every phase of a Spark application (a minimal example follows the list):
- onApplicationStart: called when the whole application starts
- onApplicationEnd: called when the whole application ends
- onJobStart: called when a job starts
- onJobEnd: called when a job ends
- onStageSubmitted: called when a stage is submitted
- onStageCompleted: called when a stage completes
- onTaskStart: called when a task starts
- onTaskGettingResult: called when the driver starts fetching a task's result
- onTaskEnd: called when a task finishes
- onBlockManagerAdded: called when a new BlockManager is added
- onBlockUpdated: called when memory or disk managed by a BlockManager changes
- onBlockManagerRemoved: called when a BlockManager is removed
- onEnvironmentUpdate: called when the execution environment changes
- onUnpersistRDD: called when an RDD is unpersisted
- onExecutorAdded: called when a new executor is added
- onExecutorMetricsUpdate: called when executor metrics are updated
- onExecutorRemoved: called when an executor is removed
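As a minimal sketch of the pattern (the class name JobBoundaryListener is illustrative, not part of the project), a listener that only tracks job boundaries looks like this:

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// Minimal sketch: log job boundaries; the full listener later in this section overrides onTaskEnd instead
class JobBoundaryListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"Job ${jobStart.jobId} started with ${jobStart.stageIds.length} stage(s)")

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished with result ${jobEnd.jobResult}")
}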
We extend SparkListener and override its onTaskEnd method. From this callback we obtain:
- 1. taskMetrics: performance metrics for the running framework, available from the task-end event
- 2. shuffle read/write metrics
- 3. the task's input/output metrics
- 4. taskInfo: the task's status information
- 5. email alerting: capture the various failure reasons
- Finally, the collected results are written to Redis for storage.
Jumping into TaskMetrics in the IDE (right-click, go to source), you will find a whole set of methods for reading the current runtime state, among them:
- executor deserialization time
- executor deserialization CPU time
- executor run time
- executor CPU time
- result size
- time the JVM spent on garbage collection
- ...and many more accessors for runtime-state data
The values returned by these methods are what we store in Redis.
Note: a HashMap cannot be passed to Redis's set directly; doing so fails. Serialize it first with json4s:
use Json(DefaultFormats).write(map) to turn the collection into a JSON string, as shown below.
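For example, a minimal sketch of just the serialization step (the map contents are made up for illustration):

import org.json4s.DefaultFormats
import org.json4s.jackson.Json

import scala.collection.mutable

val metricsMap = mutable.HashMap("executorRunTime" -> 1234L, "jvmGCTime" -> 56L)
// write() renders the map as a JSON string such as {"executorRunTime":1234,"jvmGCTime":56}
// (key order may vary for a HashMap)
val jsonString: String = Json(DefaultFormats).write(metricsMap)
// jedis.set(key, jsonString) then stores a plain string that any client can parse back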
Listener implementation:
package com.cartravel.spark

import com.cartravel.loggings.Logging
import com.cartravel.mailAlarm.MailUtil
import com.cartravel.readApplicationconfUtil.readApplicatuinFileUtil
import com.cartravel.redis.JedisUtil
import com.cartravel.tools.DataStruct
import org.apache.spark.{ExceptionFailure, Resubmitted, SparkConf, TaskEndReason, TaskFailedReason}
import org.apache.spark.executor.{ShuffleReadMetrics, TaskMetrics}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.json4s.DefaultFormats
import org.json4s.jackson.Json
import redis.clients.jedis.Jedis

import scala.collection.mutable

class offlineMonitoring(conf: SparkConf) extends SparkListener with Logging {

  private val jedisUtil: JedisUtil = JedisUtil.getInstance()
  private val jedis: Jedis = jedisUtil.getJedis

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    /** Offline (static SQL) job monitoring:
     * 1. taskMetrics: performance metrics for the running framework,
     *    available from the task-end event
     * 2. shuffle metrics
     * 3. the task's input/output metrics
     * 4. taskInfo: the task's status information
     * The collected results are finally written to Redis.
     */
    jedis.select(2) // Redis ships with 16 logical databases by default; store the monitoring data in DB 2
    // This callback fires once per finished task, so append the current timestamp
    // to each key to keep keys from colliding
    val currentTime = System.currentTimeMillis()

    //######################### 1. TaskMetrics monitoring ###########################
    val metrics: TaskMetrics = taskEnd.taskMetrics
    // Fields exposed by TaskMetrics (from the Spark source):
    // private val _executorDeserializeTime = new LongAccumulator
    // private val _executorDeserializeCpuTime = new LongAccumulator
    // private val _executorRunTime = new LongAccumulator
    // private val _executorCpuTime = new LongAccumulator
    // private val _resultSize = new LongAccumulator
    // private val _jvmGCTime = new LongAccumulator
    // private val _resultSerializationTime = new LongAccumulator
    // private val _memoryBytesSpilled = new LongAccumulator
    // private val _diskBytesSpilled = new LongAccumulator
    // private val _peakExecutionMemory = new LongAccumulator
    // private val _updatedBlockStatuses = new CollectionAccumulator[(BlockId, BlockStatus)]
    val taskMetricsMap: mutable.HashMap[String, Any] = scala.collection.mutable.HashMap(
      "executorDeserializeTime" -> metrics.executorDeserializeTime,
      "executorDeserializeCpuTime" -> metrics.executorDeserializeCpuTime,
      "executorRunTime" -> metrics.executorRunTime,
      "executorCpuTime" -> metrics.executorCpuTime,
      "resultSize" -> metrics.resultSize,
      "jvmGCTime" -> metrics.jvmGCTime,
      "resultSerializationTime" -> metrics.resultSerializationTime,
      "memoryBytesSpilled" -> metrics.memoryBytesSpilled,
      "diskBytesSpilled" -> metrics.diskBytesSpilled,
      "peakExecutionMemory" -> metrics.peakExecutionMemory, // peak execution memory used by the task
      // a Seq[(BlockId, BlockStatus)], so stringify it before JSON serialization
      "updatedBlockStatuses" -> metrics.updatedBlockStatuses.mkString("[", ",", "]")
    )
    val taskMetricsKey = s"taskMetrics_$currentTime"
    // Json(DefaultFormats).write(map) renders the collection as a JSON string
    jedis.set(taskMetricsKey, Json(DefaultFormats).write(taskMetricsMap))
    // Redis lives in memory and space is limited, so give each key a TTL
    jedis.expire(taskMetricsKey, 3600) // 3600 s = 1 hour, then the key is evicted

    //######################### 2. Shuffle monitoring ###########################
    // private[executor] val _remoteBlocksFetched = new LongAccumulator
    // private[executor] val _localBlocksFetched = new LongAccumulator
    // private[executor] val _remoteBytesRead = new LongAccumulator
    // private[executor] val _localBytesRead = new LongAccumulator
    // private[executor] val _fetchWaitTime = new LongAccumulator
    // private[executor] val _recordsRead = new LongAccumulator
    val shuffleReadMetrics: ShuffleReadMetrics = metrics.shuffleReadMetrics
    val shuffleReadMetricsMap = collection.mutable.HashMap(
      "remoteBlocksFetched" -> shuffleReadMetrics.remoteBlocksFetched, // blocks fetched from remote executors during shuffle
      "localBlocksFetched" -> shuffleReadMetrics.localBlocksFetched,
      "remoteBytesRead" -> shuffleReadMetrics.remoteBytesRead,
      "localBytesRead" -> shuffleReadMetrics.localBytesRead,
      "fetchWaitTime" -> shuffleReadMetrics.fetchWaitTime,
      "recordsRead" -> shuffleReadMetrics.recordsRead
    )
    val shuffleReadMetricsKey = s"shuffleReadMetrics_$currentTime"
    jedis.set(shuffleReadMetricsKey, Json(DefaultFormats).write(shuffleReadMetricsMap))
    jedis.expire(shuffleReadMetricsKey, 3600)

    // private[executor] val _bytesWritten = new LongAccumulator
    // private[executor] val _recordsWritten = new LongAccumulator
    // private[executor] val _writeTime = new LongAccumulator
    val shuffleWriteMetrics = metrics.shuffleWriteMetrics
    val shuffleWriteMetricsMap = collection.mutable.HashMap(
      "bytesWritten" -> shuffleWriteMetrics.bytesWritten, // total bytes written during shuffle
      "recordsWritten" -> shuffleWriteMetrics.recordsWritten,
      "writeTime" -> shuffleWriteMetrics.writeTime
    )
    val shuffleWriteMetricsKey = s"shuffleWriteMetrics_$currentTime"
    jedis.set(shuffleWriteMetricsKey, Json(DefaultFormats).write(shuffleWriteMetricsMap))
    jedis.expire(shuffleWriteMetricsKey, 3600)

    //######################### 3. Task input/output monitoring ###########################
    // private[executor] val _bytesRead = new LongAccumulator
    // private[executor] val _recordsRead = new LongAccumulator
    val inputMetrics = metrics.inputMetrics
    val inputMetricsMap = collection.mutable.HashMap(
      "bytesRead" -> inputMetrics.bytesRead,
      "recordsRead" -> inputMetrics.recordsRead
    )
    val inputMetricsKey = s"inputMetrics_$currentTime"
    jedis.set(inputMetricsKey, Json(DefaultFormats).write(inputMetricsMap))
    jedis.expire(inputMetricsKey, 3600)

    // private[executor] val _bytesWritten = new LongAccumulator
    // private[executor] val _recordsWritten = new LongAccumulator
    val outputMetrics = metrics.outputMetrics
    val outputMetricsMap = collection.mutable.HashMap(
      "bytesWritten" -> outputMetrics.bytesWritten,
      "recordsWritten" -> outputMetrics.recordsWritten
    )
    val outputMetricsKey = s"outputMetrics_$currentTime"
    jedis.set(outputMetricsKey, Json(DefaultFormats).write(outputMetricsMap))
    jedis.expire(outputMetricsKey, 3600)

    //######################### 4. taskInfo: task status ###########################
    // val taskId: Long,
    // val index: Int,
    // val attemptNumber: Int,
    // val launchTime: Long,
    // val executorId: String,
    // val host: String,
    // val taskLocality: TaskLocality.TaskLocality,
    // val speculative: Boolean
    val taskInfo = taskEnd.taskInfo
    val taskInfoMap = collection.mutable.HashMap(
      "taskId" -> taskInfo.taskId,
      "index" -> taskInfo.index,
      "attemptNumber" -> taskInfo.attemptNumber,
      "launchTime" -> taskInfo.launchTime,
      "executorId" -> taskInfo.executorId,
      "host" -> taskInfo.host,
      // TaskLocality is an enumeration value, so stringify it before JSON serialization
      "taskLocality" -> taskInfo.taskLocality.toString,
      "speculative" -> taskInfo.speculative // whether this attempt is a speculative copy of the task
    )
    val taskInfoKey = s"taskInfo_$currentTime"
    jedis.set(taskInfoKey, Json(DefaultFormats).write(taskInfoMap))
    jedis.expire(taskInfoKey, 3600)

    //######################### 5. Email alerting: capture the various failure reasons ###########################
    // First check that the task actually ran
    if (taskInfo != null && taskEnd.stageAttemptId != -1) { // stageAttemptId is never -1 for a stage that has run
      val reason: TaskEndReason = taskEnd.reason
      // Pattern-match on the end reason: did the task succeed, and if not, which kind of failure was it?
      val errmsg: Option[String] = reason match {
        case e: ExceptionFailure => Some(e.toErrorString)
        case Resubmitted => Some(Resubmitted.toErrorString)
        case e: TaskFailedReason => Some(e.toErrorString)
        case _ => None // Success does not trigger an alert
      }
      if (errmsg.nonEmpty) {
        if (conf.getBoolean("enableSendEmailOnTaskFail", defaultValue = false)) {
          val args = Array(readApplicatuinFileUtil.getConf("main.host"), s"Spark monitoring: task failed ($reason)", errmsg.get)
          // properties carries the mail settings: host, protocol, auth flag, ...
          val properties = DataStruct.convertProp(
            ("mail.host", readApplicatuinFileUtil.getConf("mail.host")),
            ("mail.transport.protocol", readApplicatuinFileUtil.getConf("mail.transport.protocol")),
            ("mail.smtp.auth", readApplicatuinFileUtil.getConf("mail.smtp.auth"))
          )
          MailUtil.sendMail(properties, args)
        }
      }
    }
  }
}
Finally, how do we get this offline-monitoring class to actually run?
When building the SparkSession, set the configuration parameter spark.extraListeners to the fully qualified class name of the listener, as sketched below.
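For example, a sketch of the wiring (the appName is arbitrary; spark.extraListeners is the real Spark setting):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("monitored-app")
  // Spark instantiates the listener reflectively; a constructor taking a single
  // SparkConf argument, like offlineMonitoring's, is supported
  .config("spark.extraListeners", "com.cartravel.spark.offlineMonitoring")
  .getOrCreate()

The same setting can also be passed at submit time with --conf spark.extraListeners=com.cartravel.spark.offlineMonitoring.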