After changing the __DATA__ record separator to a pipe ("|"), the snippet below produces the desired output. Note that I am on a Windows platform, so I also strip the "\r" left over from "\r\n" line endings. Please check:
val spark = SparkSession.builder().appName("Spark_test").master("local[*]").getOrCreate()
import spark.implicits._
// Set the record delimiter before calling textFile, so each "|"-terminated block becomes one record
spark.sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter","|")
val file1 = spark.sparkContext.textFile("./in/machine_logs.txt")
val file2 = file1.filter( line => { val x = line.split("""\n"""); x.length > 5 } )
.map( line => { val x = line.split("""\n""")
val p = x(2).replaceAll("\\r","") // not needed if Unix platform
val q = x(3).split(" ")(1)
val r = x(4).split(",")(2)
(p + "," + q + "," + r)
} )
file2.collect.foreach(println)
//file2.saveAsTextFile("./in/machine_logs.out") --> comment the line above and uncomment this line to save to a file

Output:

2018-11-16T06:39:37,hortonworks, 2 users
2018-11-16T06:40:37,cloudera, 28 users

UPDATE1:
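For a quick way to see what the per-record parsing does without running Spark, the same logic can be exercised on a plain string, with "|" playing the role of `textinputformat.record.delimiter`. This is a minimal sketch; the sample log text is invented for illustration and the real `machine_logs.txt` layout may differ:

```scala
// Standalone sketch (no Spark) of the per-record parsing above.
// The sample text is invented; "|" stands in for the record delimiter.
object DelimiterSketch extends App {
  val raw =
    "host1\nkernel\n2018-11-16T06:39:37\n" +
    "Linux hortonworks 2.6.32-358.el6.x86_64 #1 SMP GNU/Linux\n" +
    "06:39:37 up 100 days, 1:00, 2 users, load average: 0.10, 0.05, 0.01\n" +
    "tail\n|" +
    "host2\nkernel\n2018-11-16T06:40:37\n" +
    "Linux cloudera 2.6.32-358.el6.x86_64 #1 SMP GNU/Linux\n" +
    "06:40:37 up 200 days, 2:00, 28 users, load average: 0.20, 0.10, 0.05\n" +
    "tail\n|"

  // Split into records on "|", keep only records with more than 5 lines
  val records = raw.split("\\|").filter(_.split("\n").length > 5)

  val parsed = records.map { rec =>
    val x = rec.split("\n")
    val p = x(2).replaceAll("\\r", "")  // strip CR on Windows line endings
    val q = x(3).split(" ")(1)          // second token of the uname line
    val r = x(4).split(",")(2)          // third comma field of the uptime line
    p + "," + q + "," + r
  }
  parsed.foreach(println)
}
```

Running this prints one CSV line per record, matching the shape of the output above.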
Using regex matching:
val date_pattern = "[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+".r
val uname_pattern = "(Linux) (.*?) [0-9a-zA-z-#() . : _ /]+(GNU/Linux)".r
val cpu_regex = """(.+),(.*?),\s+(load average):+""".r
val file2 = file1.filter( line => { val x = line.split("""\n"""); x.length > 5 } )
.map( line => {
var q = ""; var r = "";
val p = date_pattern.findFirstIn(line).mkString
uname_pattern.findAllIn(line).matchData.foreach(m=> {q = m.group(2).mkString} )
cpu_regex.findAllIn(line).matchData.foreach(m=> {r = m.group(2).mkString} )
(p + "," + q + "," + r)
} )
file2.collect.foreach(println)
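The three patterns can also be checked standalone against a single record, using plain Scala `Regex` and no Spark. The sample record below is an assumption about the log layout, made up to show which group each pattern extracts:

```scala
// Standalone check of the three extraction patterns on one invented record.
object RegexSketch extends App {
  val date_pattern = "[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+".r
  val uname_pattern = "(Linux) (.*?) [0-9a-zA-z-#() . : _ /]+(GNU/Linux)".r
  val cpu_regex = """(.+),(.*?),\s+(load average):+""".r

  // Invented sample record, for illustration only
  val line =
    "2018-11-16T06:39:37\n" +
    "Linux hortonworks 2.6.32-358.el6.x86_64 #1 SMP GNU/Linux\n" +
    "06:39:37 up 100 days, 1:00, 2 users, load average: 0.10, 0.05, 0.01"

  var q = ""; var r = ""
  val p = date_pattern.findFirstIn(line).mkString      // full timestamp
  uname_pattern.findAllIn(line).matchData.foreach(m => q = m.group(2)) // hostname
  cpu_regex.findAllIn(line).matchData.foreach(m => r = m.group(2))     // users field
  println(p + "," + q + "," + r)
}
```

Because `.` does not match newlines by default, each pattern only matches within its own line of the record, so no extra anchoring is needed.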