When every line of a file is a single JSON record, as in the sample below, the data can be read as follows:
{"constructorId":1,"constructorRef":"mclaren","name":"McLaren","nationality":"British","url":"http://en.wikipedia.org/wiki/McLaren"}
{"constructorId":2,"constructorRef":"bmw_sauber","name":"BMW Sauber","nationality":"German","url":"http://en.wikipedia.org/wiki/BMW_Sauber"}
{"constructorId":3,"constructorRef":"williams","name":"Williams","nationality":"British","url":"http://en.wikipedia.org/wiki/Williams_Grand_Prix_Engineering"}
{"constructorId":4,"constructorRef":"renault","name":"Renault","nationality":"French","url":"http://en.wikipedia.org/wiki/Renault_in_Formula_One"}
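To make the format concrete: each line in such a file parses on its own as a JSON object (a layout often called JSON Lines). The following is a minimal sketch in plain Python, without Spark, that uses two of the sample records inline instead of reading the real file. The names raw and records are illustrative only.

```python
import json

# Two of the sample records, one JSON object per line.
raw = (
    '{"constructorId":1,"constructorRef":"mclaren","name":"McLaren",'
    '"nationality":"British","url":"http://en.wikipedia.org/wiki/McLaren"}\n'
    '{"constructorId":2,"constructorRef":"bmw_sauber","name":"BMW Sauber",'
    '"nationality":"German","url":"http://en.wikipedia.org/wiki/BMW_Sauber"}\n'
)

# Each non-empty line is parsed independently into a dict.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]

for record in records:
    print(record["constructorId"], record["name"], record["nationality"])
```

This is the same line-by-line interpretation that spark.read.json applies by default, which is why no extra options are needed for this file.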
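1. First define the schema of the table. The schema must be either a pyspark.sql.types.StructType or a DDL-formatted string such as the one below.
If no schema is defined, Spark scans the source data and infers the schema automatically.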
constructors_schema="constructorId int,constructorRef string,name string,nationality string,url string"
2. Then pass the schema and the path of the data to spark.read:
constructors_df=spark.read.schema(constructors_schema).json("/mnt/formula189dl/raw/constructors.json")
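3. Use display (for example, display(constructors_df)) to view the resulting data.
4. Use printSchema() to view the schema. In its output, nullable = true means that the field may contain null values.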