1. The data received from Kafka looks like this:
124.29.167.30 2019-03-23 21:58:01 "GET www/1 HTTP/1.0" https://search.yahoo.com/search?p=猎场 404
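- 124.29.167.30: source IP of the request
- 2019-03-23 21:58:01: access time
- "GET www/1 HTTP/1.0" https://search.yahoo.com/search?p=猎场: request details (the request line followed by the referring URL); the 1 in www/1 is the data type
- 404: HTTP status code of the request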
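To make the field layout concrete, here is a minimal sketch that splits the sample line into its five fields. It assumes the fields are separated by tab characters, as the `split("\t")` call in step 3 implies, even though the sample above is displayed with spaces. The `LogFields` object and its `split` method are hypothetical names used only for illustration.

```scala
object LogFields {
  // Assumption: the raw line uses tab characters between fields
  def split(line: String): Array[String] = line.split("\t")

  def main(args: Array[String]): Unit = {
    val sample =
      "124.29.167.30\t2019-03-23 21:58:01\t\"GET www/1 HTTP/1.0\"\thttps://search.yahoo.com/search?p=猎场\t404"
    // Destructure the five fields in the order described above
    val Array(ip, time, request, referer, status) = split(sample)
    println(s"ip=$ip time=$time request=$request referer=$referer status=$status")
  }
}
```

Note that the request field keeps its surrounding quotes, so the URL `www/1` is its second space-separated token.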
2. Define the format of the log records to be stored
case class ClickLog(ip: String, time: String, categoryId: Int, statusCode: Int, referer: String)
- ip: IP address of the visitor
- time: access time
- categoryId: category of the accessed content
- statusCode: status code of the request
- referer: URL the request came from
3. Clean the access logs
// Sample: 124.29.167.30 2019-03-23 21:58:01 "GET www/1 HTTP/1.0" https://search.yahoo.com/search?p=猎场 404
val cleanData = logs.map(line => {
  // Split the log line into its tab-separated fields
  val infos = line.split("\t")
  // Get the requested URL. The request field is "GET www/1 HTTP/1.0",
  // so the URL is the second space-separated token (index 1, not 2)
  val url = infos(2).split(" ")(1)
  // Get the category of the accessed content; 0 means "no category"
  val categoryId = if (url.startsWith("www/")) url.split("/")(1).toInt else 0
  // Return a ClickLog; records whose categoryId is 0 are filtered out below
  ClickLog(infos(0), DateUtils.parseToMinute(infos(1)), categoryId, infos(4).trim.toInt, infos(3))
}).filter(clickLog => clickLog.categoryId != 0)
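The cleaning logic can also be tried without a running stream. The sketch below applies the same steps to a single line and is only an illustration under several assumptions: `DateUtils` is not shown in these notes, so the version here is a hypothetical stand-in that assumes the input pattern `yyyy-MM-dd HH:mm:ss` and a minute-level output pattern `yyyyMMddHHmm`; `ClickLogCleaner` and `parseLine` are likewise made-up names. The URL is taken as the second space-separated token of the request field, and malformed lines are dropped instead of throwing.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

case class ClickLog(ip: String, time: String, categoryId: Int, statusCode: Int, referer: String)

// Hypothetical stand-in for the DateUtils helper; both patterns are assumptions
object DateUtils {
  private val inputFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  private val outputFormat = DateTimeFormatter.ofPattern("yyyyMMddHHmm")

  def parseToMinute(time: String): String =
    LocalDateTime.parse(time, inputFormat).format(outputFormat)
}

object ClickLogCleaner {
  // Returns Some(ClickLog) for a well-formed line with a non-zero category,
  // and None for malformed lines or lines without a "www/<id>" URL
  def parseLine(line: String): Option[ClickLog] = {
    val parsed = Try {
      val infos = line.split("\t")
      val url = infos(2).split(" ")(1)
      val categoryId = if (url.startsWith("www/")) url.split("/")(1).toInt else 0
      ClickLog(infos(0), DateUtils.parseToMinute(infos(1)), categoryId, infos(4).trim.toInt, infos(3))
    }
    parsed.toOption.filter(_.categoryId != 0)
  }
}
```

On a stream, the same function could replace the `map` and `filter` pair with `logs.flatMap(line => ClickLogCleaner.parseLine(line))`, so a single bad record would not fail the whole batch.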