[Coding] Reading and Writing JSON Files with Spark SQL

zxfBdd

Published 2024-03-06 23:03:21

Original link: https://zhuanlan.zhihu.com/p/485465242

Spark SQL provides spark.read.json("path") to read JSON files into a DataFrame, and dataframe.write.json("path") to save a DataFrame as JSON files. This article shows how to do both in Scala.

Creating a SparkSession

import org.apache.spark.sql.SparkSession

// Local session that uses all available cores
val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("Read JSON file data")
    .getOrCreate()

Since Spark 2.0, SparkSession has been the entry point for DataFrame programming, so we create one before reading any data.

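spark.read.json() reads JSON files and converts them into a DataFrame. When a file contains a JSON array, each element of the array becomes one row, so the elements should share the same structure: Spark infers a single schema for the whole DataFrame, and a field missing from a record is read as null.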
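To make the point about structure concrete, here is a minimal sketch that parses two in-memory records, one of which lacks the age key. The object name InferSchemaSketch and the helper readInline are made up for this illustration, and it relies on the spark.read.json overload that takes a Dataset[String] (available since Spark 2.2), so no file is needed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, not part of the original article's code.
object InferSchemaSketch {
  // Parses in-memory JSON strings, one JSON object per element.
  def readInline(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val records = Seq(
      """{"name": "suwenjin", "age": 12}""",
      """{"name": "fumingming"}""" // this record has no "age" key
    ).toDS()
    spark.read.json(records)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[1]")
      .appName("InferSchemaSketch")
      .getOrCreate()

    val df = readInline(spark)
    df.printSchema() // a single schema that covers both records
    df.show(false)   // "age" is null for the record that lacks it

    spark.stop()
  }
}
```

The inferred schema is the union of the keys seen across all records, which is why inconsistent records do not fail but can quietly produce null columns.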

Reading a single JSON file

// read json file into dataframe
  val singleDF: DataFrame =
    spark.read
      .option("multiline", "true")
      .json("src/main/resources/json_file_1.json")
  singleDF.printSchema()
  singleDF.show(false)

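By default, Spark SQL treats each line of a JSON file as one complete JSON document. JSON files met in real development often spread a document across several lines, so option("multiline", "true") is used here to handle them.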

The JSON file that Spark SQL expects by default:

[{"name": "suwenjin","age": 12},{"name": "fumingming","age": 25}]

A JSON file as it usually looks in production:

[
    {
        "name": "suwenjin",
        "age": 12
    },
    {
        "name": "fumingming",
        "age": 25
    }
]

Related Stack Overflow question:

https://stackoverflow.com/questions/57451719/since-spark-2-3-the-queries-from-raw-json-csv-files-are-disallowed-when-the-ref
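The difference between the two modes can be checked with a small sketch. It writes the pretty-printed sample above to a temporary file (the object name MultilineSketch, the helper writeSample and the temporary file are assumptions made for this illustration) and compares the schema inferred with and without the multiline option. In the default mode every physical line is treated as its own document, so Spark should infer only the internal _corrupt_record column. As far as I know, calling show() on such a DataFrame on Spark 2.3 or later is what raises the error discussed in the Stack Overflow question, so the sketch only prints its schema.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of the original article's code.
object MultilineSketch {
  // Writes a pretty-printed JSON array to a temporary file and returns its path.
  def writeSample(): Path = {
    val json =
      """[
        |    {
        |        "name": "suwenjin",
        |        "age": 12
        |    },
        |    {
        |        "name": "fumingming",
        |        "age": 25
        |    }
        |]""".stripMargin
    val path = Files.createTempFile("multiline_sample", ".json")
    Files.write(path, json.getBytes(StandardCharsets.UTF_8))
    path
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[1]")
      .appName("MultilineSketch")
      .getOrCreate()

    val path = writeSample().toString

    // Default mode: each physical line is parsed as a separate JSON document.
    val lineMode = spark.read.json(path)
    lineMode.printSchema() // expected: only the _corrupt_record column

    // Multiline mode: the whole file is parsed as one JSON document.
    val multiMode = spark.read.option("multiline", "true").json(path)
    multiMode.printSchema()
    multiMode.show(false)

    spark.stop()
  }
}
```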

Spark SQL actually supports many options when reading from a data source. For details, see the official documentation:

JSON Files - Spark 3.2.1 Documentation: https://spark.apache.org/docs/latest/sql-data-sources-json.html

Reading multiple JSON files

// read multiple files into dataframe
val multipleDF: DataFrame = spark.read
    .option("multiline", "true")
    .json(
      "src/main/resources/json_file_1.json",
      "src/main/resources/json_file_2.json"
    )
multipleDF.show(false)
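Because json() takes a variable number of paths, a list of paths built at runtime can be passed with the `: _*` syntax. The sketch below assumes two small files created in a temporary directory; the names MultiPathSketch, writeSamples and readAll are hypothetical. Each file holds one JSON object on a single line, so the multiline option is not needed here.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, not part of the original article's code.
object MultiPathSketch {
  // Writes one single-line JSON object per file into a fresh temporary directory.
  def writeSamples(): Seq[String] = {
    val dir = Files.createTempDirectory("json_inputs")
    Seq(
      "a.json" -> """{"name": "suwenjin", "age": 12}""",
      "b.json" -> """{"name": "fumingming", "age": 25}"""
    ).map { case (fileName, content) =>
      val path = dir.resolve(fileName)
      Files.write(path, content.getBytes(StandardCharsets.UTF_8))
      path.toString
    }
  }

  // Expands the sequence into the varargs parameter of json(paths: String*).
  def readAll(spark: SparkSession, paths: Seq[String]): DataFrame =
    spark.read.json(paths: _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[1]")
      .appName("MultiPathSketch")
      .getOrCreate()

    readAll(spark, writeSamples()).show(false)

    spark.stop()
  }
}
```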

Reading all JSON files under a path

// read all files from a folder
val allDF: DataFrame = spark.read
    .option("multiline", "true")
    .json("src/main/resources/*")
allDF.show(false)

Reading JSON files with a custom schema

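When reading a JSON file, we can apply a custom schema to the DataFrame instead of letting Spark infer one. The field names in the schema must match the keys in the JSON file; a schema field with no matching key is read as null.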

// Define custom schema
  val schema = new StructType()
    .add("FriendAge", LongType, true)
    .add("FriendName", StringType, true)
  val singleDFwithSchema: DataFrame =
    spark.read
      .schema(schema)
      .option("multiline", "true")
      .json("src/main/resources/json_file_1.json")
  singleDFwithSchema.show(false)
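Here is a small sketch of how schema field names interact with the JSON keys. It assumes records with name and age keys, like the sample shown earlier, and uses in-memory strings rather than json_file_1.json, whose contents are not shown in this article. The names SchemaMatchSketch, matching and renamed are made up for the illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Hypothetical helper object, not part of the original article's code.
object SchemaMatchSketch {
  // Field names equal to the JSON keys.
  val matching: StructType = new StructType()
    .add("name", StringType, true)
    .add("age", LongType, true)

  // Field names that do not occur as keys in the sample records.
  val renamed: StructType = new StructType()
    .add("FriendName", StringType, true)
    .add("FriendAge", LongType, true)

  // Parses two in-memory records with the given schema instead of inferring one.
  def read(spark: SparkSession, schema: StructType): DataFrame = {
    import spark.implicits._
    val records = Seq(
      """{"name": "suwenjin", "age": 12}""",
      """{"name": "fumingming", "age": 25}"""
    ).toDS()
    spark.read.schema(schema).json(records)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[1]")
      .appName("SchemaMatchSketch")
      .getOrCreate()

    read(spark, matching).show(false) // columns filled from the JSON keys
    read(spark, renamed).show(false)  // both columns are null in every row

    spark.stop()
  }
}
```

If a custom schema yields nothing but nulls, comparing its field names against the keys in the file is a reasonable first check.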

Reading a JSON file as a temporary view

If you are more comfortable solving problems with SQL, you can load the JSON file as a temporary view.

spark.sql(
    "CREATE TEMPORARY VIEW people USING json OPTIONS (path 'src/main/resources/json_file_1.json', multiline true)"
)
spark.sql("select * from people").show()
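A temporary view can also be registered from a DataFrame that has already been loaded, using createOrReplaceTempView. The sketch below is an alternative to the CREATE TEMPORARY VIEW statement, not the approach used in this article; the object name TempViewSketch, the helper adults and the in-memory sample records are assumptions for the illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, not part of the original article's code.
object TempViewSketch {
  // Registers the parsed records as the view "people" and queries it with SQL.
  def adults(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val people = spark.read.json(
      Seq(
        """{"name": "suwenjin", "age": 12}""",
        """{"name": "fumingming", "age": 25}"""
      ).toDS()
    )
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age >= 18")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[1]")
      .appName("TempViewSketch")
      .getOrCreate()

    adults(spark).show(false)

    spark.stop()
  }
}
```

This form is convenient when the DataFrame needs read options or a custom schema that are easier to express in Scala than in the OPTIONS clause.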

Saving a DataFrame to JSON files

Spark SQL lets you choose a SaveMode through .mode(), which takes one of the constants of the SaveMode class.

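SaveMode.Overwrite: overwrite the data that already exists at the target path.
SaveMode.Append: append to the data that already exists at the target path.
SaveMode.Ignore: if data already exists, skip this save without raising an error.
SaveMode.ErrorIfExists: if data already exists, raise an error (this is the default).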

import org.apache.spark.sql.SaveMode

// write df to json
allDF.write.mode(SaveMode.Overwrite).json("src/main/other_resources/all_json_file.json")
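The effect of the modes can be observed by writing the same two rows several times and counting what is read back. This is a minimal sketch under a few assumptions: it writes to a temporary directory rather than src/main/other_resources, and the names SaveModeSketch and rowCounts are hypothetical. Note that the path passed to json() becomes a directory of part files with one JSON object per line, so reading it back does not need the multiline option.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hypothetical helper object, not part of the original article's code.
object SaveModeSketch {
  // Writes the same two rows with three modes in turn and returns
  // the number of rows found at the target path after each write.
  def rowCounts(spark: SparkSession): Seq[Long] = {
    import spark.implicits._
    val df = Seq(("suwenjin", 12), ("fumingming", 25)).toDF("name", "age")

    // The target does not exist yet; Spark creates it as a directory.
    val out = Files.createTempDirectory("json_out").resolve("people").toString

    def writeAndCount(mode: SaveMode): Long = {
      df.write.mode(mode).json(out)
      spark.read.json(out).count()
    }

    // Overwrite creates the data, Append adds the rows again,
    // Ignore leaves the existing data untouched.
    Seq(SaveMode.Overwrite, SaveMode.Append, SaveMode.Ignore).map(writeAndCount)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[1]")
      .appName("SaveModeSketch")
      .getOrCreate()

    println(rowCounts(spark)) // List(2, 4, 4)

    spark.stop()
  }
}
```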

Complete code

Gitee:

苏文进/SparkNotes: gitee.com/jwsmai/spark-notes

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.types._

object FromJsonFile extends App {
  val spark = SparkSession
    .builder()
    .master("local[1]")
    .appName("Read JSON file data")
    .getOrCreate()

  spark.sparkContext.setLogLevel("WARN")

  // read json file into dataframe
  val singleDF: DataFrame =
    spark.read
      .option("multiline", "true")
      .json("src/main/resources/json_file_1.json")
  singleDF.printSchema()
  singleDF.show(false)

  // read multiple files into dataframe
  val multipleDF: DataFrame = spark.read
    .option("multiline", "true")
    .json(
      "src/main/resources/json_file_1.json",
      "src/main/resources/json_file_2.json"
    )
  multipleDF.show(false)

  // read all files from a folder
  val allDF: DataFrame =
    spark.read.option("multiline", "true").json("src/main/resources/*")
  allDF.show(false)

  // Define custom schema
  val schema = new StructType()
    .add("FriendAge", LongType, true)
    .add("FriendName", StringType, true)
  val singleDFwithSchema: DataFrame =
    spark.read
      .schema(schema)
      .option("multiline", "true")
      .json("src/main/resources/json_file_1.json")
  singleDFwithSchema.show(false)

  // read the JSON file as a temporary view and query it with SQL
  spark.sql(
    "CREATE TEMPORARY VIEW people USING json OPTIONS (path 'src/main/resources/json_file_1.json', multiline true)"
  )
  spark.sql("select * from people").show()
  
  // write df to json
  allDF.write.mode(SaveMode.Overwrite).json("src/main/other_resources/all_json_file.json")

}