Log Analysis
Scenario:
- Log records contain: client IP, URL, response time
- For each URL, report the access count and average response time over the last minute
Solution
Feed the log data into Kafka, pull it out with Spark Streaming, compute each URL's access count and average response time over a sliding one-minute window in real time, and print the results.
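Before adding the streaming machinery, the per-URL count and average-latency computation at the heart of this job can be illustrated on plain Scala collections, with no Spark dependency. The sample lines and field layout ("IP method URL elapsed-seconds") match the simulated data used in the implementation below; the object and method names here are just for illustration:

```scala
object LogAggSketch {
  // For each URL (query string stripped), return (access count, average elapsed time)
  def aggregate(lines: Seq[String]): Map[String, (Int, BigDecimal)] = {
    lines
      .map { line =>
        val arr = line.split("\\s+")            // ip, method, url, elapsed
        (arr(2).split("\\?")(0), BigDecimal(arr(3)))
      }
      .groupBy(_._1)                            // group records by URL
      .map { case (url, pairs) =>
        val total = pairs.map(_._2).sum
        (url, (pairs.size, total / pairs.size)) // count and average
      }
  }

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      "10.1.96.221 GET /mobile/mobileStat?Imei=123&Para=0&Type=15 0.003",
      "10.2.81.231 GET /mobile/monitoringStat?Imei=223&Para=0&Type=1018 0.005",
      "20.1.61.211 GET /mobile/mobileStat?Imei=333&Para=0&Type=12 0.012")
    aggregate(sample).foreach(println)
  }
}
```

The streaming version below performs the same aggregation, but over a 60-second window that slides forward with each batch.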
Implementation
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.SynchronizedQueue

val conf = new SparkConf().setAppName("log").setMaster("local[4]")
val sc = new SparkContext(conf)
// 1-second batch interval
val ssc = new StreamingContext(sc, Seconds(1))

// Simulated log records: "IP method URL elapsed-seconds"
val seqData = Seq(
  "10.1.96.221 GET /mobile/mobileStat?Imei=123&Para=0&Type=15 0.003",
  "10.2.81.231 GET /mobile/monitoringStat?Imei=223&Para=0&Type=1018 0.005",
  "20.1.61.211 GET /mobile/mobileStat?Imei=333&Para=0&Type=12 0.012")

// Feed one RDD per batch to simulate a continuous log stream
// (SynchronizedQueue is deprecated in newer Scala versions, but queueStream
// only requires a mutable.Queue[RDD[String]])
val queue = new SynchronizedQueue[RDD[String]]()
for (i <- 0 to 100) {
  queue += sc.parallelize(seqData)
}
val logStream = ssc.queueStream(queue, oneAtATime = true)

// Extract (URL, elapsed time), then strip the query string from the URL
val logs = logStream.map { log =>
  val arr = log.split("\\s+")
  (arr(2), arr(3))
}.map { case (url, elapsed) =>
  (url.split("\\?")(0), elapsed)
}

// Keep the most recent 60 seconds of data
val win60 = logs.window(Seconds(60))

// Group by URL; compute total elapsed time, access count, and average
win60.groupByKey().map { case (api, elapsedTimes) =>
  var total = BigDecimal(0)
  var count = BigInt(0)
  for (t <- elapsedTimes) {
    total += BigDecimal(t)
    count += 1
  }
  (api, total, count, total / BigDecimal(count.toLong))
}.foreachRDD { rdd =>
  rdd.collect().foreach(t4 => println("\t---last 60s---" + t4))
}

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
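In production, the queueStream simulation above would be replaced by a real Kafka source, as described in the solution. A minimal sketch using the spark-streaming-kafka-0-10 direct-stream integration; the broker address, topic name ("access-log"), and group id are placeholder assumptions, and running it requires a live Kafka broker:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Consumer configuration (broker and group id are placeholders)
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "log-analysis",
  "auto.offset.reset"  -> "latest")

// Each Kafka record's value is one raw log line; the rest of the
// pipeline (map, window, groupByKey, foreachRDD) stays unchanged.
val kafkaStream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("access-log"), kafkaParams))
val logStream = kafkaStream.map(_.value())
```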