We know Spark can read a file straight into a DataFrame with sparkSession.read.format, but what if the file is plain text whose lines mix ordinary fields with embedded JSON, like the abc.log file below? How do we read it and convert it?
12334 hehehe {"name":"zhangsan","age":"32"} 1995-6-7
123423 xixi {"name":"lisi","age":"32"} 2000-9-8
234435 cici {"name":"wangwu","age":"34"} 2020-9-7
23432 cici {"name":"zhaoliu","age":"23"} 1997-4-3
Because the file contains JSON, we first need SparkSession itself plus the built-in JSON helper functions:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
The steps are as follows:
def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder().master("local[1]").appName("mytest").getOrCreate()
  import spark.implicits._ // required, otherwise toDF and the $"col" syntax below won't compile
  // Read the file with sparkContext as a plain RDD of lines
  val txt = spark.sparkContext.textFile("file:///f:/abc.log")
  // Split each line on spaces, pick out the four fields, and convert to a DataFrame
  val dd = txt.map(_.split(" ")).map(x => (x(0), x(1), x(2), x(3))).toDF("no", "action", "info", "times")
  // At this point the table looks like the following; all that remains
  // is to parse the JSON column
  // +------+------+--------------------+--------+
  // |    no|action|                info|   times|
  // +------+------+--------------------+--------+
  // | 12334|hehehe|{"name":"zhangsan...|1995-6-7|
  // |123423|  xixi|{"name":"lisi","a...|2000-9-8|
  // |234435|  cici|{"name":"wangwu",...|2020-9-7|
  // | 23432|  cici|{"name":"zhaoliu"...|1997-4-3|
  // +------+------+--------------------+--------+
  // Select the plain columns and extract each JSON field with get_json_object
  dd.select($"no", $"action",
    get_json_object($"info", "$.name").as("name"),
    get_json_object($"info", "$.age").as("age"),
    $"times").show()
}
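As an alternative to get_json_object (not what the walkthrough above uses), Spark also provides from_json, which parses the whole JSON column into a struct in one pass given an explicit schema. A minimal sketch, assuming the same dd DataFrame:

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Schema matching the info column; age is a quoted string in the data,
// so it is declared StringType here and can be cast to int afterwards if needed
val infoSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", StringType)
))

dd.select($"no", $"action",
    from_json($"info", infoSchema).as("info"),
    $"times")
  .select($"no", $"action", $"info.name", $"info.age", $"times")
  .show()
```

With a declared schema, malformed JSON yields nulls instead of silently wrong strings, which can make downstream debugging easier.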
//The final console output looks like this:
+------+------+--------+---+--------+
| no|action| name|age| times|
+------+------+--------+---+--------+
| 12334|hehehe|zhangsan| 32|1995-6-7|
|123423| xixi| lisi| 32|2000-9-8|
|234435| cici| wangwu| 34|2020-9-7|
| 23432| cici| zhaoliu| 23|1997-4-3|
+------+------+--------+---+--------+
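One caveat: split(" ") only works because the JSON in this sample contains no spaces. If a real log line had a space inside the JSON value (e.g. {"name": "zhang san"}), the split would break the info column apart. A more robust sketch uses a regex extractor that captures the braces as a single group (the pattern below is an assumption about the line format, not from the original post):

```scala
// Hypothetical pattern: id, word, {json}, date, separated by single spaces
val LogLine = """(\S+) (\S+) (\{.*\}) (\S+)""".r

val dd2 = txt.flatMap {
  case LogLine(no, action, info, times) => Some((no, action, info, times))
  case _                                => None // skip malformed lines
}.toDF("no", "action", "info", "times")
```

From here the same get_json_object select as above applies unchanged.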