I am trying to use from_json()
to convert the JSON to a DataFrame.
import org.apache.spark.sql.functions._
val schemaExample2 = new StructType()
.add("", ArrayType(new StructType()
.add("FirstName", StringType)
.add("Surname", StringType)
)
)
val dfExample2= spark.sql("""select "[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }" as theJson""")
val dfICanWorkWith = dfExample2.select(from_json($"theJson", schemaExample2))
dfICanWorkWith.collect()
// Result \\
res22: Array[org.apache.spark.sql.Row] = Array([null])
The problem is that you don’t have a fully qualified json. Your json is missing a couple of things:
- First you are missing the surrounding
{}
in which the json is done - Second you are missing the variable value (you set it as
""
but did not add it) - Lastly you are missing the closing
]
Try replacing it with:
val dfExample2= spark.sql("""
select "{\"\":[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }]}" as theJson
""")
and you will get:
scala> dfICanWorkWith.collect()
res12: Array[org.apache.spark.sql.Row] = Array([[WrappedArray([Johnny,Boy], [Franky,Man])]])
Reference: Spark from_json - StructType and ArrayType