Reading JSON Data in Databricks (2)

地老鼠PN_1

Last modified 2024-07-27 19:46:52

Tags: azure

First published 2024-07-25 22:25:36

Article link: https://blog.csdn.net/dilaoshuPN/article/details/140700888

Consider multi-line JSON data like the following, where each record spans several lines.

[
  {
    "raceId":841,
    "driverId":153,
    "stop":1,
    "lap":1,
    "time":"17:05:23",
    "duration":26.898,
    "milliseconds":26898
  },
  {
    "raceId":841,
    "driverId":30,
    "stop":1,
    "lap":1,
    "time":"17:05:52",
    "duration":25.021,
    "milliseconds":25021
  }
]

Data in this shape can be read by setting the multiLine option of Spark's JSON reader to True.
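To see why the option matters: without multiLine, Spark's JSON reader expects one complete JSON record per line, so a pretty-printed array cannot be parsed one line at a time. The sketch below illustrates the difference using only Python's standard json module, not Spark itself. The sample string is a shortened copy of the data above, and the helper name parse_line_by_line is made up for this illustration.

```python
import json

# A pretty-printed JSON array, shortened from the sample data above.
multi_line_json = """[
  {
    "raceId": 841,
    "driverId": 153,
    "stop": 1
  },
  {
    "raceId": 841,
    "driverId": 30,
    "stop": 1
  }
]"""


def parse_line_by_line(text):
    """Try to parse every non-empty line as its own JSON document.

    Returns a tuple (parsed_records, failed_line_count).
    """
    records, failures = [], 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            failures += 1
    return records, failures


# Line-oriented parsing: no single line is a complete record, so nothing parses.
records, failures = parse_line_by_line(multi_line_json)
print("records parsed line by line:", len(records))
print("lines that failed to parse:", failures)

# Whole-document parsing, which is what multiLine=True asks Spark to do per file.
whole = json.loads(multi_line_json)
print("records parsed as one document:", len(whole))
```

If the file were instead written with one object per line (JSON Lines), the default reader settings would be enough and multiLine would not be needed.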

1. First, import the required types.

from pyspark.sql.types import StructType,StructField,IntegerType,StringType

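2. Define the table structure with StructType. Here raceId is declared as non-nullable (nullable=False). This does not make Spark reject bad rows: when reading files, Spark generally treats the supplied schema as nullable, so a record whose raceId is missing in the source is read into the DataFrame with a null raceId.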

pit_stops_schema=StructType([StructField("raceId", IntegerType(), False),
                             StructField("driverId", IntegerType(), True),
                             StructField("stop", IntegerType(), True),
                             StructField("lap", IntegerType(), True),
                             StructField("time", StringType(), True),
                             StructField("duration", StringType(), True),
                             StructField("milliseconds", IntegerType(), True)
                             ])

3. Set multiLine to True and read the JSON data with the schema defined above.

pit_stops_df=spark.read.option('multiLine',True).schema(pit_stops_schema).json("/mnt/formula189dl/raw/pit_stops.json")

4. View the result.
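The four steps can be put together in one runnable sketch. It makes several assumptions that are not in the original post: it creates a local SparkSession (on Databricks the notebook already provides spark), and it writes the two sample records to a temporary file that stands in for /mnt/formula189dl/raw/pit_stops.json. In a Databricks notebook, display(pit_stops_df) is the usual way to view the result; printSchema() and show() are used here because they also work outside a notebook.

```python
import json
import os
import tempfile

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Assumption: a local session, only so the sketch runs outside Databricks.
spark = (SparkSession.builder
         .master("local[1]")
         .appName("pit_stops_multiline_demo")
         .getOrCreate())

# Assumption: a temporary file standing in for the mounted pit_stops.json.
sample_records = [
    {"raceId": 841, "driverId": 153, "stop": 1, "lap": 1,
     "time": "17:05:23", "duration": 26.898, "milliseconds": 26898},
    {"raceId": 841, "driverId": 30, "stop": 1, "lap": 1,
     "time": "17:05:52", "duration": 25.021, "milliseconds": 25021},
]
sample_path = os.path.join(tempfile.mkdtemp(), "pit_stops.json")
with open(sample_path, "w") as f:
    json.dump(sample_records, f, indent=2)  # indent=2 writes multi-line JSON

# Same schema as in step 2.
pit_stops_schema = StructType([
    StructField("raceId", IntegerType(), False),
    StructField("driverId", IntegerType(), True),
    StructField("stop", IntegerType(), True),
    StructField("lap", IntegerType(), True),
    StructField("time", StringType(), True),
    StructField("duration", StringType(), True),
    StructField("milliseconds", IntegerType(), True),
])

# Step 3: read the multi-line file with the explicit schema.
pit_stops_df = (spark.read
                .option("multiLine", True)
                .schema(pit_stops_schema)
                .json(sample_path))

# Step 4: inspect the schema and the rows.
pit_stops_df.printSchema()
pit_stops_df.show(truncate=False)

# Collect in a fixed order so the rows can be checked programmatically.
rows = pit_stops_df.orderBy("driverId").collect()
```

Comparing the printSchema() output with the declared schema is a quick way to check how the nullable flag on raceId was treated on a given Spark version.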
