Notes on reading JSON and CSV files with Spark SQL


Reading JSON data and files

    val spark = SparkSession
      .builder()
      .master("local")
      .appName("JsonFileTest")
      .getOrCreate()
    import spark.implicits._  // Spark's implicit conversions, needed for toDS/toDF below

    // read JSON files from a directory
    val jsonDF = spark.read.json(s"${BASE_PATH}/json")

    // build JSON data in code
    val jsonDataset = spark.createDataset(
      """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
    val otherJsonDF = spark.read.json(jsonDataset)

primitivesAsString and prefersDecimal

primitivesAsString (default false): infer all primitive values as string type; the primitive types here are boolean, int, long, float and double.
prefersDecimal (default false): when primitivesAsString is false, infer floating-point values (float, double) as DecimalType.

    val jsonDataset_1 = spark.createDataset(
      """{"name":"Yin","address":{"is_old":true,"area":23000.34}}""" :: Nil)
    var otherJsonDF_1 = spark.read.json(jsonDataset_1)
    otherJsonDF_1.printSchema()  // default inference: is_old is boolean, area is double

    // with primitivesAsString=true every primitive becomes a string,
    // so prefersDecimal has no effect here
    var optsMap = Map("primitivesAsString" -> "true", "prefersDecimal" -> "true")
    otherJsonDF_1 = spark.read.options(optsMap).json(jsonDataset_1)
    otherJsonDF_1.printSchema()

    // prefersDecimal only takes effect when primitivesAsString is false
    optsMap = Map("primitivesAsString" -> "false", "prefersDecimal" -> "true")
    otherJsonDF_1 = spark.read.options(optsMap).json(jsonDataset_1)
    otherJsonDF_1.printSchema()

allowComments

allowComments (default false): whether Java/C-style comments are allowed inside JSON records.

spark.read.option("allowComments", "true").json(Seq("""{"name":/* hello */"Yin","address":{"is_old":true,"area":23000.34}}""").toDS()).show()

allowUnquotedFieldNames

allowUnquotedFieldNames (default false): whether field names without quotes are allowed.

spark.read.option("allowUnquotedFieldNames", "true").json(Seq("""{name:"Yin","address":{"is_old":true,"area":23000.34}}""").toDS()).show()

allowSingleQuotes

allowSingleQuotes (default true): whether single-quoted field names and values are allowed.

spark.read.option("allowSingleQuotes", "true").json(Seq("""{'name':'Yin',"address":{"is_old":true,"area":23000.34}}""").toDS()).show()

allowNumericLeadingZeros

allowNumericLeadingZeros (default false): whether numbers with leading zeros (e.g. 0023000.34) are allowed.

spark.read.option("allowNumericLeadingZeros", "true").json(Seq("""{'name':'Yin',"address":{"is_old":true,"area":0023000.34}}""").toDS()).show()

allowNonNumericNumbers

allowNonNumericNumbers (default false): whether non-numeric tokens such as NaN (not a number) are allowed as numbers.

spark.read.option("allowNonNumericNumbers", "true").json(Seq("""{'name':'Yin',"address":{"is_old":true,"area":NaN}}""").toDS()).show()

allowBackslashEscapingAnyCharacter

allowBackslashEscapingAnyCharacter (default false): whether a backslash may be used to escape any character, not just the standard JSON escapes.

spark.read.option("allowBackslashEscapingAnyCharacter", "true").json(Seq("""{'name':'Yin',"address":{"is_old":true,"area":"\$23000"}}""").toDS()).show()

mode

mode (default PERMISSIVE): how records that fail to parse are handled:
PERMISSIVE: the tolerant mode. For a malformed record, a column named by columnNameOfCorruptRecord is added to hold the raw malformed JSON string, and all other columns are set to null.

spark.read.option("mode", "PERMISSIVE").json(Seq("""{'name':'Yin',"address":{"is_old":true,"area":3000}}""",
      """{'name':'Yin',"address":{"is_old":true,"area":\3000}}""").toDS()).show()

spark.read.option("mode", "PERMISSIVE").option("columnNameOfCorruptRecord", "customer_column").json(
      Seq("""{'name':'Yin',"address":{"is_old":true,"area":3000}}""",
      """{'name':'Yin',"address":{"is_old":true,"area":\3000}}""").toDS()).show()

DROPMALFORMED: drop the malformed records.

spark.read.option("mode", "DROPMALFORMED").json(Seq("""{'name':'Yin',"address":{"is_old":true,"area":3000}}""",
      """{'name':'Yin',"address":{"is_old":true,"area":\3000}}""").toDS()).show()

FAILFAST: throw an error as soon as a malformed record is encountered.

spark.read.option("mode", "FAILFAST").json(Seq("""{'name':'Yin',"address":{"is_old":true,"area":3000}}""",
      """{'name':'Yin',"address":{"is_old":true,"area":\3000}}""").toDS()).show()

dateFormat

dateFormat (default yyyy-MM-dd): the string format of date values in the JSON (for fields of DateType).

val customSchema = new StructType(Array(StructField("name", StringType, true),
      StructField("date", DateType, true)))
    val dataFormatDF =
      spark.read.schema(customSchema).option("dateFormat", "dd/MM/yyyy HH:mm").json(Seq(
        """{'name':'Yin',"date":"26/08/2015 18:00"}""").toDS())
    dataFormatDF.write.mode(SaveMode.Overwrite).option("dateFormat", "yyyy/MM/dd").json("testjson")
    spark.read.json("testjson").show()

timestampFormat

timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSZZ): the string format of timestamp values in the JSON (for fields of TimestampType).

val customSchema_1 = new StructType(Array(StructField("name", StringType, true),
      StructField("date", TimestampType, true)))
    val timestampFormatDf =
      spark.read.schema(customSchema_1).option("timestampFormat", "dd/MM/yyyy HH:mm").json(Seq(
        """{'name':'Yin',"date":"26/08/2015 18:00"}""").toDS())

    val optMap = Map("timestampFormat" -> "yyyy/MM/dd HH:mm", DateTimeUtils.TIMEZONE_OPTION -> "GMT")
    timestampFormatDf.write.mode(SaveMode.Overwrite).format("json").options(optMap).save("test.json")
    spark.read.json("test.json").show()
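Note: DateTimeUtils.TIMEZONE_OPTION used above comes from Spark's internal org.apache.spark.sql.catalyst.util package; as far as I can tell it is simply the constant for the option key "timeZone", so passing "timeZone" directly works as well and avoids depending on an internal class.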

compression

compression: the compression codec used when writing. Supported values are:
none and uncompressed (no compression)
bzip2, deflate, gzip, lz4, snappy

timestampFormatDf.write.mode(SaveMode.Overwrite).option("compression", "gzip").json("test.json")

multiLine

multiLine (default false): whether a single JSON record may be split across multiple lines.

    val primitiveFieldAndType: Dataset[String] = spark.createDataset(spark.sparkContext.parallelize(
      """{"string":"this is a simple string.",
          "integer":10,
          "long":21474836470,
          "bigInteger":92233720368547758070,
          "double":1.7976931348623157E308,
          "boolean":true,
          "null":null
      }""" ::
        """{"string":"this is a simple string.",
          "integer":10,
          "long":21474836470,
          "bigInteger":92233720368547758070,
          "double":1.7976931348623157E308,
          "boolean":true,
          "null":null
      }""" :: Nil))(Encoders.STRING)
    primitiveFieldAndType.toDF("value").write.mode(SaveMode.Overwrite).option("compression", "GzIp").text(s"${BASE_PATH}/primitiveFieldAndType")

    // with multiLine = false each physical line is parsed as a separate record,
    // so records that span several lines end up as corrupt records
    val multiLineDF = spark.read.option("multiLine", false).json(s"${BASE_PATH}/primitiveFieldAndType")

Reading CSV files

val csvDF = spark.read.csv(s"${BASE_PATH}/csv").toDF("age", "name")
// create a DataFrame from a Dataset[String] holding CSV lines
val csvDS = spark.createDataset(Seq("23,jeffy", "34,katy"))
val ds = spark.read.csv(csvDS)

sep and delimiter

sep and delimiter do the same thing: they set the CSV field separator (default is ,). (read/write option)

	spark.read.csv(Seq("23,jeffy", "34,katy").toDS()).show()
    spark.read.option("sep", " ").csv(Seq("23 jeffy", "34 katy").toDS()).show()
    spark.read.option("delimiter", " ").csv(Seq("23 jeffy", "34 katy").toDS()).show()
    ds.write.mode(SaveMode.Overwrite).option("sep", "|").csv(s"${BASE_PATH}/delimiter")

header

header (default false): whether to use the first line of the CSV file as the schema (column names). (read/write option)

	val headerDF = spark.read.option("header", true).csv(s"${BASE_PATH}/cars.csv")
    headerDF.printSchema()
    headerDF.write.mode(SaveMode.Overwrite).option("header", true).csv(s"${BASE_PATH}/headerDF")

inferSchema

inferSchema: whether to infer the schema (column types) from the data. (read-only option)

val inferSchemaDF =
      spark.read.option("header", true).option("inferSchema", true).csv(s"${BASE_PATH}/cars.csv")

charset and encoding

charset and encoding (default UTF-8): decode the CSV file with the given character set. (read-only option)

spark.read.option("header", "true").option("encoding", "iso-8859-1").option("sep", "þ").csv(s"${BASE_PATH}/cars_iso-8859-1.csv").show()

quote

quote (default "): the character used to wrap field values that should not be split on the delimiter. (read/write option)

var optMap = Map("quote" -> "\'", "delimiter" -> " ")
    spark.read.options(optMap).csv(Seq("23 'jeffy tang'", "34 katy").toDS()).show()

escape

escape (default \): the character used to escape a quote character that appears inside a quoted value. (read/write option)

optMap = Map("quote" -> "\'", "delimiter" -> " ", "escape" -> "\"")
    spark.read.options(optMap).csv(Seq("23 'jeffy \"'tang'", "34 katy").toDS()).show()

comment

comment (default empty string, which disables the feature): the marker for comment lines in the CSV. (read/write option)

optMap = Map("comment" -> "~", "header" -> "false")
    spark.read.options(optMap).csv(s"${BASE_PATH}/comments.csv").show()

ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace

ignoreLeadingWhiteSpace (default false): whether to trim leading whitespace from field values. (read/write option)
ignoreTrailingWhiteSpace (default false): whether to trim trailing whitespace from field values. (read/write option)

	optMap = Map("ignoreLeadingWhiteSpace" -> "true", "ignoreTrailingWhiteSpace" -> "true")
    spark.read.options(optMap).csv(Seq(" a,b  , c ").toDS()).show()

multiLine

multiLine (default false): whether a single CSV record that has been split across multiple lines can be read and parsed. (read-only option)

val primitiveFieldAndType = Seq(
      """"
        |string","integer
        |
        |
        |","long
        |
        |","bigInteger",double,boolean,null""".stripMargin,
      """"this is a
        |simple
        |string.","
        |
        |10","
        |21474836470","92233720368547758070","
        |
        |1.7976931348623157E308",true,""".stripMargin)

      primitiveFieldAndType.toDF("value").coalesce(1).write.mode(SaveMode.Overwrite).text(s"csv_multiLine_test")

      spark.read.option("header", true).option("multiLine", true).csv("csv_multiLine_test").show()

mode

mode (default PERMISSIVE): works like the JSON option of the same name. (read-only option)
PERMISSIVE: when a record fails to parse, its fields are set to null
DROPMALFORMED: malformed records are dropped
FAILFAST: an exception is thrown as soon as a malformed record is encountered
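
The original post has no example for this one, so here is a minimal sketch under assumed data: the column names (id, day, amount) and the deliberately malformed second row are made up for illustration.

    import org.apache.spark.sql.types._

    // two well-formed rows plus one row ("oops") that does not fit the 3-column schema
    val csvLines = Seq("0,2013-11-11,100", "oops", "1,1983-08-04,3").toDS()
    val csvSchema = StructType(List(
      StructField("id", IntegerType, true),
      StructField("day", StringType, true),
      StructField("amount", IntegerType, true)))

    // PERMISSIVE (default): the malformed row is kept, its fields become null
    spark.read.schema(csvSchema).option("mode", "PERMISSIVE").csv(csvLines).show()
    // DROPMALFORMED: the malformed row is silently dropped
    spark.read.schema(csvSchema).option("mode", "DROPMALFORMED").csv(csvLines).show()
    // FAILFAST: throws as soon as the malformed row is reached
    // spark.read.schema(csvSchema).option("mode", "FAILFAST").csv(csvLines).show()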

nullValue

nullValue (default empty string): the string that should be parsed as null. (read/write option)

spark.read.option("nullValue", "--").csv(Seq("0,2013-11-11,--", "1,1983-08-04,3").toDS()).show()

nanValue

nanValue (default NaN): the string that represents a not-a-number value. (read-only option)
positiveInf (default Inf): the string that represents positive infinity. (read-only option)
negativeInf (default -Inf): the string that represents negative infinity. (read-only option)

val numbers = spark.read.format("csv").schema(StructType(List(
        StructField("int", IntegerType, true),
        StructField("long", LongType, true),
        StructField("float", FloatType, true),
        StructField("double", DoubleType, true)
      ))).options(Map(
        "header" -> "true",
        "mode" -> "DROPMALFORMED",
        "nullValue" -> "--",
        "nanValue" -> "NAN",
        "negativeInf" -> "-INF",
        "positiveInf" -> "INF")).load(s"${BASE_PATH}/numbers.csv")
    numbers.show()

codec and compression, dateFormat, timestampFormat

These behave the same as their JSON counterparts described above.
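
As a quick illustration (not from the original post), the sketch below reuses dataFormatDF from the JSON dateFormat section and writes it as gzip-compressed CSV; the output directory ${BASE_PATH}/csv_compressed is a made-up path.

    // compression and dateFormat are passed the same way for CSV as for JSON when writing
    dataFormatDF.write.mode(SaveMode.Overwrite)
      .option("header", "true")
      .option("compression", "gzip")
      .option("dateFormat", "yyyy/MM/dd")
      .csv(s"${BASE_PATH}/csv_compressed")
    spark.read.option("header", "true").csv(s"${BASE_PATH}/csv_compressed").show()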

maxColumns

maxColumns (default 20480): the maximum number of columns a single CSV record may contain. (read-only option)

spark.read.option("maxColumns", "2").csv(Seq("test,as,g", "h,bm,s").toDS()).show() // throws an error: the records have 3 columns but maxColumns is 2