1. Data Cleaning
- Read the log file and turn it into an RDD[Row]
- Clean the data:
  - deduplicate on the first and second columns (event_time and url)
  - filter out records whose status code is not 200
  - filter out records whose event_time is empty
  - split the url on "&" and "="
- Save the data:
  - write the result into a MySQL table (the cleaning and saving steps are sketched at the end of the code below)
Log file path: D:\test\t\test.log. Each record consists of eight tab-separated fields (event_time, url, method, status, sip, user_uip, action_prepend, action_client); for example:
2018-09-04T20:27:31+08:00 http://datacenter.bdqn.cn/logs/user?actionBegin=1536150451540&actionClient=Mozilla%2F5.0+%28Windows+NT+10.0%3B+WOW64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F58.0.3029.110+Safari%2F537.36+SE+2.X+MetaSr+1.0&actionEnd=1536150451668&actionName=startEval&actionTest=0&actionType=3&actionValue=272090&clientType=001_kgc&examType=001&ifEquipment=web&isFromContinue=false&skillIdCount=0&skillLevel=0&testType=jineng&userSID=B842B843AE317425D53D0C567A903EF7.exam-tomcat-node3.exam-tomcat-node3&userUID=272090&userUIP=1.180.18.157 GET 200 192.168.168.64 - - Apache-HttpClient/4.1.2 (java 1.5)
import java.util.Properties
import org.apache.commons.lang.StringUtils
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
object DataClear {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("clearDemo")
      .getOrCreate()
    import spark.implicits._
    val sc = spark.sparkContext

    // Read the raw log and keep only well-formed lines that split into
    // exactly eight tab-separated fields.
    val linesRdd = sc.textFile("D:\\test\\t\\test.log")
    val rdd = linesRdd.map(x => x.split("\t"))
      .filter(x => x.length == 8)
      .map(x => Row(x(0).trim, x(1).trim, x(2).trim, x(3).trim,
        x(4).trim, x(5).trim, x(6).trim, x(7).trim))

    // Column names for the eight fields, all kept as strings.
    val schema = StructType(
      Array(
        StructField("event_time", StringType),
        StructField("url", StringType),
        StructField("method", StringType),
        StructField("status", StringType),
        StructField("sip", StringType),
        StructField("user_uip", StringType),
        StructField("action_prepend", StringType),
        StructField("action_client", StringType)
      )
    )
    val orgDF = spark.createDataFrame(rdd, schema)
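    // ------------------------------------------------------------------
    // The listing above stops at orgDF; what follows is a minimal
    // continuation sketch of the remaining steps from the task list
    // (dedupe, filter, split the url, write to MySQL), not the author's
    // verbatim code. The JDBC URL, the credentials, and the table name
    // "full_log" are assumptions; adjust them to your environment.
    // ------------------------------------------------------------------
    val filteredDF = orgDF
      .dropDuplicates("event_time", "url")  // dedupe on the first two columns
      .filter($"status" === "200")          // keep only status code 200
      .filter(row => StringUtils.isNotEmpty(row.getAs[String]("event_time"))) // drop empty event_time

    // Split each url on "?", then break the query string on "&" and "=";
    // the parameter names come from the sample record above.
    val detailRdd = filteredDF.rdd.map { row =>
      val urlParts = row.getAs[String]("url").split("\\?")
      val params: Map[String, String] =
        if (urlParts.length == 2)
          urlParts(1).split("&")
            .map(_.split("="))
            .filter(_.length == 2)
            .map(kv => kv(0) -> kv(1))
            .toMap
        else Map.empty
      Row(
        row.getAs[String]("event_time"),
        params.getOrElse("userUID", ""),
        params.getOrElse("userSID", ""),
        params.getOrElse("actionBegin", ""),
        params.getOrElse("actionEnd", ""),
        params.getOrElse("actionName", ""),
        params.getOrElse("actionType", ""),
        params.getOrElse("actionValue", "")
      )
    }
    val detailSchema = StructType(Array(
      StructField("event_time", StringType),
      StructField("userUID", StringType),
      StructField("userSID", StringType),
      StructField("actionBegin", StringType),
      StructField("actionEnd", StringType),
      StructField("actionName", StringType),
      StructField("actionType", StringType),
      StructField("actionValue", StringType)
    ))
    val detailDF = spark.createDataFrame(detailRdd, detailSchema)

    // Write the cleaned records to MySQL over JDBC.
    val props = new Properties()
    props.setProperty("user", "root")        // assumption: replace with your user
    props.setProperty("password", "123456")  // assumption: replace with your password
    props.setProperty("driver", "com.mysql.jdbc.Driver")
    detailDF.write
      .mode("overwrite")
      .jdbc("jdbc:mysql://localhost:3306/test_db?useSSL=false", "full_log", props)

    spark.stop()
  }
}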