Common options for reading and writing CSV in Spark

Before Spark 2.0, reading and writing CSV files with Spark SQL required the spark-csv library published by Databricks. From Spark 2.0 on, Spark SQL supports reading and writing CSV natively.
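For reference, the only surface difference is the format name; a minimal sketch (the sqlContext value and file path here are hypothetical):

// Spark 1.x with the spark-csv package:
val before = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .load("test-in/csv/csv_with_header.csv")

// Spark 2.0 and later, built in:
val after = spark.read.format("csv")
  .option("header", "true")
  .load("test-in/csv/csv_with_header.csv")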
A test file with a header:

id|name|age
1| darren |18
2|anne|18
3|"test"|18
4|'test2'|\N

The code to read it:

package com.darren.spark.sql.csv

import org.apache.spark.sql.{SaveMode, SparkSession}

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description Read a CSV file that has a header line
  **/
object CSVReader {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV Reader")
      .master("local")
      .getOrCreate()

    val result = spark.read.format("csv")
      .option("delimiter", "|")
      .option("header", "true")
      .option("quote", "'")
      .option("nullValue", "\\N")
      .option("inferSchema", "true")
      .load("test-in/csv/csv_with_header.csv")

    result.show()
    result.printSchema()

  }
}

The output:

+---+--------+----+
| id|    name| age|
+---+--------+----+
|  1| darren |  18|
|  2|    anne|  18|
|  3|  "test"|  18|
|  4|   test2|null|
+---+--------+----+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

Option reference:

  • delimiter — the field separator; defaults to a comma (,)
  • nullValue — a string that represents null; note that "\\N" in the Scala source is the two-character sequence \N as it appears in the file
  • quote — the quote character; defaults to a double quote (")
  • header — treat the first line as column names instead of data
  • inferSchema — infer column types automatically

Schema inference is only a compromise; the better approach is to specify the field types explicitly:

package com.darren.spark.sql.csv

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description A typed representation of one CSV row
  **/
case class User(id: Int, name: String, age: Int)
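As a preview of how this case class is used below: Spark can derive a StructType from it through the ScalaReflection API. A minimal sketch:

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

// Derives a schema with columns id (integer), name (string), age (integer).
val userSchema = ScalaReflection.schemaFor[User].dataType.asInstanceOf[StructType]
userSchema.printTreeString()

Note that ScalaReflection lives in org.apache.spark.sql.catalyst and is not a public API, so it may change between Spark versions; a hand-built alternative that avoids it is sketched further down.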

A test file without a header:

1| darren |18
2|anne|18
3|"test"|18
4|'test2'|\N

The code to read it (the header option is commented out, and column names are supplied with toDF):

package com.darren.spark.sql.csv

import org.apache.spark.sql.{SaveMode, SparkSession}

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description Read a CSV file that has no header line
  **/
object CSVReader {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV Reader")
      .master("local")
      .getOrCreate()

    val result = spark.read.format("csv")
      .option("delimiter", "|")
      //.option("header", "true")
      .option("quote", "'")
      .option("nullValue", "\\N")
      .option("inferSchema", "true")
      .load("test-in/csv/csv_without_header.csv")
      .toDF("id", "name", "age")
    
    result.show()
    result.printSchema()

  }
}

The output:

+---+--------+----+
| id|    name| age|
+---+--------+----+
|  1| darren |  18|
|  2|    anne|  18|
|  3|  "test"|  18|
|  4|   test2|null|
+---+--------+----+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

Specifying the schema explicitly:

package com.darren.spark.sql.csv

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{SaveMode, SparkSession}

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description Read a CSV file with an explicit schema
  **/
object CSVReader {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV Reader")
      .master("local")
      .getOrCreate()

    val result = spark.read.format("csv")
      .option("delimiter", "|")
      //.option("header", "true")
      .option("quote", "'")
      .option("nullValue", "\\N")
      .option("inferSchema", "true")
      .schema(ScalaReflection.schemaFor[User].dataType.asInstanceOf[StructType])
      .load("test-in/csv/csv_without_header.csv")
      //.toDF("id", "name", "age")

    result.show()
    result.printSchema()


  }
}

The result is the same as above.
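If you would rather not depend on the internal ScalaReflection API, an equivalent sketch builds the same schema by hand, reusing the spark session from the examples above (StructField defaults to nullable = true):

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val userSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("age", IntegerType)
))

val result = spark.read.format("csv")
  .option("delimiter", "|")
  .option("quote", "'")
  .option("nullValue", "\\N")
  .schema(userSchema)
  .load("test-in/csv/csv_without_header.csv")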

Writing a CSV file:

    result.write
      .mode(SaveMode.Overwrite)
      .option("delimiter", "|")
     // .option("quote", "")
      .format("csv")
      .save("test-out/csv/")
1|darren|18
2|anne|18
3|"\"test\""|18
4|test2|

The file contents differ from the read output in two ways.

First: the writer adds escaped double quotes (\"), doubling up wherever a quote character already appears, because the double quote is the default quote character for writing. If you don't want this, uncomment the quote option above and set it to an empty string.

Second: the spaces around darren are gone. As of Spark 2.1.1, Spark SQL trims leading and trailing whitespace from strings by default when saving CSV.

This default behavior is not always what we want. Since Spark 2.2.0 it can be turned off with options:

    result.write
      .mode(SaveMode.Overwrite)
      .option("delimiter", "|")
     // .option("quote", "")
      .option("ignoreLeadingWhiteSpace", false)
      .option("ignoreTrailingWhiteSpace", false)
      .option("nullValue", null)
      .format("csv")
      .save("test-out/csv/")

Option reference (a round-trip sketch follows this list):

  • ignoreLeadingWhiteSpace — whether to trim leading whitespace on write; set to false to keep it
  • ignoreTrailingWhiteSpace — whether to trim trailing whitespace on write; set to false to keep it
  • nullValue — the string written for null values; pass null if you don't want any placeholder symbol at all
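As a quick round-trip check (a sketch reusing the spark session and result DataFrame from above): whitespace trimming on read defaults to false, as the source further down confirms, so values written with trimming disabled come back intact.

// Write with trimming disabled, then read the same directory back.
result.write
  .mode(SaveMode.Overwrite)
  .option("delimiter", "|")
  .option("ignoreLeadingWhiteSpace", false)
  .option("ignoreTrailingWhiteSpace", false)
  .format("csv")
  .save("test-out/csv/")

val roundTrip = spark.read.format("csv")
  .option("delimiter", "|")
  .load("test-out/csv/")

roundTrip.show() // " darren " keeps its surrounding spaces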

Question: so how many options does Spark actually support for reading and writing CSV?

Answer: I found no documentation that lists them all, but the source code settles it: every supported option appears in the CSVOptions class.

The source:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.datasources.csv

import java.nio.charset.StandardCharsets
import java.util.{Locale, TimeZone}

import com.univocity.parsers.csv.{CsvParserSettings, CsvWriterSettings, UnescapedQuoteHandling}
import org.apache.commons.lang3.time.FastDateFormat

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.util._

class CSVOptions(
    @transient private val parameters: CaseInsensitiveMap[String],
    defaultTimeZoneId: String,
    defaultColumnNameOfCorruptRecord: String)
  extends Logging with Serializable {

  def this(
    parameters: Map[String, String],
    defaultTimeZoneId: String,
    defaultColumnNameOfCorruptRecord: String = "") = {
      this(
        CaseInsensitiveMap(parameters),
        defaultTimeZoneId,
        defaultColumnNameOfCorruptRecord)
  }

  private def getChar(paramName: String, default: Char): Char = {
    val paramValue = parameters.get(paramName)
    paramValue match {
      case None => default
      case Some(null) => default
      case Some(value) if value.length == 0 => '\u0000'
      case Some(value) if value.length == 1 => value.charAt(0)
      case _ => throw new RuntimeException(s"$paramName cannot be more than one character")
    }
  }

  private def getInt(paramName: String, default: Int): Int = {
    val paramValue = parameters.get(paramName)
    paramValue match {
      case None => default
      case Some(null) => default
      case Some(value) => try {
        value.toInt
      } catch {
        case e: NumberFormatException =>
          throw new RuntimeException(s"$paramName should be an integer. Found $value")
      }
    }
  }

  private def getBool(paramName: String, default: Boolean = false): Boolean = {
    val param = parameters.getOrElse(paramName, default.toString)
    if (param == null) {
      default
    } else if (param.toLowerCase(Locale.ROOT) == "true") {
      true
    } else if (param.toLowerCase(Locale.ROOT) == "false") {
      false
    } else {
      throw new Exception(s"$paramName flag can be true or false")
    }
  }

  val delimiter = CSVUtils.toChar(
    parameters.getOrElse("sep", parameters.getOrElse("delimiter", ",")))
  val parseMode: ParseMode =
    parameters.get("mode").map(ParseMode.fromString).getOrElse(PermissiveMode)
  val charset = parameters.getOrElse("encoding",
    parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))

  val quote = getChar("quote", '\"')
  val escape = getChar("escape", '\\')
  val comment = getChar("comment", '\u0000')

  val headerFlag = getBool("header")
  val inferSchemaFlag = getBool("inferSchema")
  val ignoreLeadingWhiteSpaceInRead = getBool("ignoreLeadingWhiteSpace", default = false)
  val ignoreTrailingWhiteSpaceInRead = getBool("ignoreTrailingWhiteSpace", default = false)

  // For write, both options were `true` by default. We leave it as `true` for
  // backwards compatibility.
  val ignoreLeadingWhiteSpaceFlagInWrite = getBool("ignoreLeadingWhiteSpace", default = true)
  val ignoreTrailingWhiteSpaceFlagInWrite = getBool("ignoreTrailingWhiteSpace", default = true)

  val columnNameOfCorruptRecord =
    parameters.getOrElse("columnNameOfCorruptRecord", defaultColumnNameOfCorruptRecord)

  val nullValue = parameters.getOrElse("nullValue", "")

  val nanValue = parameters.getOrElse("nanValue", "NaN")

  val positiveInf = parameters.getOrElse("positiveInf", "Inf")
  val negativeInf = parameters.getOrElse("negativeInf", "-Inf")


  val compressionCodec: Option[String] = {
    val name = parameters.get("compression").orElse(parameters.get("codec"))
    name.map(CompressionCodecs.getCodecClassName)
  }

  val timeZone: TimeZone = DateTimeUtils.getTimeZone(
    parameters.getOrElse(DateTimeUtils.TIMEZONE_OPTION, defaultTimeZoneId))

  // Uses `FastDateFormat` which can be direct replacement for `SimpleDateFormat` and thread-safe.
  val dateFormat: FastDateFormat =
    FastDateFormat.getInstance(parameters.getOrElse("dateFormat", "yyyy-MM-dd"), Locale.US)

  val timestampFormat: FastDateFormat =
    FastDateFormat.getInstance(
      parameters.getOrElse("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"), timeZone, Locale.US)

  val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)

  val maxColumns = getInt("maxColumns", 20480)

  val maxCharsPerColumn = getInt("maxCharsPerColumn", -1)

  val escapeQuotes = getBool("escapeQuotes", true)

  val quoteAll = getBool("quoteAll", false)

  val inputBufferSize = 128

  val isCommentSet = this.comment != '\u0000'

  def asWriterSettings: CsvWriterSettings = {
    val writerSettings = new CsvWriterSettings()
    val format = writerSettings.getFormat
    format.setDelimiter(delimiter)
    format.setQuote(quote)
    format.setQuoteEscape(escape)
    format.setComment(comment)
    writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
    writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
    writerSettings.setNullValue(nullValue)
    writerSettings.setEmptyValue(nullValue)
    writerSettings.setSkipEmptyLines(true)
    writerSettings.setQuoteAllFields(quoteAll)
    writerSettings.setQuoteEscapingEnabled(escapeQuotes)
    writerSettings
  }

  def asParserSettings: CsvParserSettings = {
    val settings = new CsvParserSettings()
    val format = settings.getFormat
    format.setDelimiter(delimiter)
    format.setQuote(quote)
    format.setQuoteEscape(escape)
    format.setComment(comment)
    settings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceInRead)
    settings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceInRead)
    settings.setReadInputOnSeparateThread(false)
    settings.setInputBufferSize(inputBufferSize)
    settings.setMaxColumns(maxColumns)
    settings.setNullValue(nullValue)
    settings.setMaxCharsPerColumn(maxCharsPerColumn)
    settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
    settings
  }
}
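Reading through CSVOptions, several options beyond the ones used above stand out: sep (an alias for delimiter), comment, mode, multiLine, dateFormat and timestampFormat, compression, escapeQuotes, and quoteAll. A sketch exercising a few of them, with hypothetical file paths:

val df = spark.read.format("csv")
  .option("sep", "|")                  // alias for delimiter
  .option("comment", "#")              // skip lines starting with '#'
  .option("mode", "DROPMALFORMED")     // or PERMISSIVE (default) / FAILFAST
  .option("dateFormat", "yyyy-MM-dd")  // format for parsing date columns
  .option("multiLine", "true")         // quoted fields may span multiple lines
  .load("test-in/csv/some_file.csv")   // hypothetical path

df.write
  .option("compression", "gzip")       // becomes compressionCodec above
  .option("quoteAll", "true")          // quote every field on write
  .format("csv")
  .save("test-out/csv-gzip/")          // hypothetical path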

Original post: http://dyingbleed.com/spark-ii-5/
