How can I parse the log line below into a DataFrame / Spark SQL table that can be queried later?
66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] "GET /071300/242153 HTTP/1.1" 404 514 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Solution
You can split the string on a known delimiter and keep adding new columns from the pieces, like this (assuming this is the kind of result you are looking for):
scala> val DF = spark.sparkContext.textFile("/Users/goldie/code/files/sampleStackOverflow.txt").toDF
DF: org.apache.spark.sql.DataFrame = [value: string]
scala> DF.show
+--------------------+
|               value|
+--------------------+
|66.249.69.97 - - ...|
|71.19.157.174 - -...|
|71.19.157.174 - -...|
|71.19.157.174 - -...|
|71.19.157.174 - -...|
+--------------------+
scala> DF.withColumn("IP Address",split(col("value")," - - ")(0)).show
+--------------------+-------------+
|               value|   IP Address|
+--------------------+-------------+
|66.249.69.97 - - ...| 66.249.69.97|
|71.19.157.174 - -...|71.19.157.174|
|71.19.157.174 - -...|71.19.157.174|
|71.19.157.174 - -...|71.19.157.174|
|71.19.157.174 - -...|71.19.157.174|
+--------------------+-------------+
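Edit:
Added four columns (based on your previous file). Note that substring(..., 2, 3) keeps only three characters after the opening quote, so it fits GET but would truncate longer methods such as POST.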
scala> :paste
// Entering paste mode (ctrl-D to finish)
DF.withColumn("IP Address",split(col("value")," - - ")(0)).
withColumn("temp1",split(col("value")," - - ")(1)).
withColumn("Time",concat(split(col("temp1")," ")(0),split(col("temp1")," ")(1))).
withColumn("col3",substring(split(col("temp1")," ")(2),2,3)).
withColumn("Col4",split(col("temp1")," ")(3)).
select(col("IP Address"),col("Time"),col("col3"),col("col4")).show
// Exiting paste mode, now interpreting.
+-------------+--------------------+----+-------------------+
|   IP Address|                Time|col3|               col4|
+-------------+--------------------+----+-------------------+
| 66.249.69.97|[24/Sep/2014:22:2...| GET|     /071300/242153|
|71.19.157.174|[24/Sep/2014:22:2...| GET|             /error|
|71.19.157.174|[24/Sep/2014:22:2...| GET|       /favicon.ico|
|71.19.157.174|[24/Sep/2014:22:2...| GET|                  /|
|71.19.157.174|[24/Sep/2014:22:2...| GET|/jobmineimg.php?q=m|
+-------------+--------------------+----+-------------------+
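Chained splits like the ones above depend on every line having the same spacing, so one alternative is to match the whole line against a single regular expression. The following is only a minimal sketch in plain Scala, assuming the lines follow the combined log format of the sample in the question. The names LogEntry, AccessLog, LogPattern and parseLine are hypothetical and do not come from Spark or from the original answer.

```scala
// Hypothetical container for the fields of one access-log line.
final case class LogEntry(
  ip: String,
  time: String,
  method: String,
  path: String,
  status: Int,
  bytes: String, // kept as a String because some servers log "-" here
  userAgent: String
)

object AccessLog {
  // Assumed layout (combined log format):
  // ip ident user [time] "method path protocol" status bytes "referer" "agent"
  private val LogPattern =
    """^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+) "[^"]*" "([^"]*)"$""".r

  // Returns None for lines that do not match, instead of throwing.
  def parseLine(line: String): Option[LogEntry] = line match {
    case LogPattern(ip, time, method, path, status, bytes, agent) =>
      Some(LogEntry(ip, time, method, path, status.toInt, bytes, agent))
    case _ => None
  }
}
```

In spark-shell, a function like this could be mapped over the lines of the text file, the resulting entries converted with toDF, and the DataFrame registered with createOrReplaceTempView so that it can be queried later through spark.sql. Returning an Option means malformed lines can be dropped or counted rather than failing the job.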