Before Spark 2.0, reading and writing CSV files with Spark SQL required the spark-csv package from Databricks. Since Spark 2.0, Spark SQL supports reading and writing CSV files natively.
The test file with a header row:

id|name|age
1| darren |18
2|anne|18
3|"test"|18
4|'test2'|18

package com.darren.spark.sql.csv

import org.apache.spark.sql.{SaveMode, SparkSession}

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description TODO
  **/
object CSVReader {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV Reader")
      .master("local")
      .getOrCreate()

    val result = spark.read.format("csv")
      .option("delimiter", "|")
      .option("header", "true")
      .option("quote", "'")
      .option("nullValue", "\\N")
      .option("inferSchema", "true")
      .load("test-in/csv/csv_with_header.csv")

    result.show()
    result.printSchema()
  }
}
The output:

+---+--------+----+
| id|    name| age|
+---+--------+----+
|  1| darren |  18|
|  2|    anne|  18|
|  3|  "test"|  18|
|  4|   test2|null|
+---+--------+----+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
Option descriptions:
- delimiter — the field separator; defaults to a comma (,)
- nullValue — a string that represents a null value
- quote — the quote character; defaults to a double quote (")
- header — treat the first line as column names rather than data
- inferSchema — infer each column's type automatically
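To make the interaction between quote and nullValue concrete, here is a minimal pure-Scala sketch applied to the sample fields above. This is an illustration only, not Spark's actual parser; `parseField` is a hypothetical helper, with the quote set to a single quote and nullValue set to \N as in the reader code.

```scala
// Illustrative only: roughly how a single field is interpreted under
// quote = ''' and nullValue = "\N" (not Spark's real implementation).
def parseField(raw: String,
               quote: Char = '\'',
               nullValue: String = "\\N"): Option[String] = {
  // Strip the configured quote character if the field is wrapped in it.
  val unquoted =
    if (raw.length >= 2 && raw.head == quote && raw.last == quote)
      raw.substring(1, raw.length - 1)
    else raw
  // A field equal to nullValue becomes SQL NULL.
  if (unquoted == nullValue) None else Some(unquoted)
}

println(parseField("'test2'"))   // Some(test2): single quotes stripped
println(parseField("\"test\"")) // Some("test"): double quotes kept, quote is '
println(parseField("\\N"))       // None: matches nullValue
```

This matches the output above: 'test2' loses its quotes, "test" keeps them, and \N becomes null.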
Schema inference is only a compromise; a better approach is to specify the column types explicitly:

package com.darren.spark.sql.csv

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description TODO
  **/
case class User(id: Int, name: String, age: Int)
The test file without a header row:

1| darren |18
2|anne|18
3|"test"|18
4|'test2'|\N

package com.darren.spark.sql.csv

import org.apache.spark.sql.{SaveMode, SparkSession}

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description TODO
  **/
object CSVReader {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV Reader")
      .master("local")
      .getOrCreate()

    val result = spark.read.format("csv")
      .option("delimiter", "|")
      //.option("header", "true")
      .option("quote", "'")
      .option("nullValue", "\\N")
      .option("inferSchema", "true")
      .load("test-in/csv/csv_without_header.csv")
      .toDF("id", "name", "age")

    result.show()
    result.printSchema()
  }
}

+---+--------+----+
| id|    name| age|
+---+--------+----+
|  1| darren |  18|
|  2|    anne|  18|
|  3|  "test"|  18|
|  4|   test2|null|
+---+--------+----+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
Specifying the types explicitly:

package com.darren.spark.sql.csv

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{SaveMode, SparkSession}

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description TODO
  **/
object CSVReader {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV Reader")
      .master("local")
      .getOrCreate()

    val result = spark.read.format("csv")
      .option("delimiter", "|")
      //.option("header", "true")
      .option("quote", "'")
      .option("nullValue", "\\N")
      .option("inferSchema", "true")
      .schema(ScalaReflection.schemaFor[User].dataType.asInstanceOf[StructType])
      .load("test-in/csv/csv_without_header.csv")
      //.toDF("id", "name", "age")

    result.show()
    result.printSchema()
  }
}
The result is the same as above.

Writing a CSV file:

result.write
  .mode(SaveMode.Overwrite)
  .option("delimiter", "|")
  // .option("quote", "")
  .format("csv")
  .save("test-out/csv/")

The resulting file contains:

1|darren|18
2|anne|18
3|"\"test\""|18
4|test2|
The file's contents differ from the read output in two ways.

First, the writer adds double quotes (\") and escapes any quote characters already present in a field, because the double quote is the default quote character for writing. If you don't want this, uncomment the line above and set the quote character to the empty string.

Second, the spaces around darren are gone. When saving CSV with Spark SQL (observed on Spark 2.1.1), leading and trailing whitespace is trimmed by default. This default is not always what you want; since Spark 2.2.0 it can be disabled via options:

result.write
  .mode(SaveMode.Overwrite)
  .option("delimiter", "|")
  // .option("quote", "")
  .option("ignoreLeadingWhiteSpace", false)
  .option("ignoreTrailingWhiteSpace", false)
  .option("nullValue", null)
  .format("csv")
  .save("test-out/csv/")
Option descriptions:
- ignoreLeadingWhiteSpace — whether to trim leading whitespace
- ignoreTrailingWhiteSpace — whether to trim trailing whitespace
- nullValue — the string written for null values; pass null if you don't want any marker at all
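The two write-side behaviours described above — whitespace trimming and quote escaping — can be sketched in plain Scala. This is an illustration under assumptions, not Spark's actual writer; `writeField` is a hypothetical helper using Spark's write-side defaults (trimming on, quote `"` with escape `\`).

```scala
// Illustrative only: roughly how one field is rendered on write.
def writeField(v: String,
               ignoreWhiteSpace: Boolean = true, // write-side default
               quote: Char = '"',
               escape: Char = '\\'): String = {
  // ignoreLeading/TrailingWhiteSpace default to true on write.
  val trimmed = if (ignoreWhiteSpace) v.trim else v
  if (trimmed.contains(quote))
    // A field containing the quote char gets wrapped, with inner quotes escaped.
    s"$quote${trimmed.replace(quote.toString, s"$escape$quote")}$quote"
  else trimmed
}

println(writeField(" darren "))  // darren (surrounding spaces trimmed)
println(writeField("\"test\"")) // "\"test\"" (wrapped, inner quotes escaped)
println(writeField(" darren ", ignoreWhiteSpace = false)) // spaces preserved
```

This reproduces both changes seen in the written file: ` darren ` becomes `darren`, and `"test"` becomes `"\"test\""`.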
Question: so how many options can actually be set when Spark reads and writes CSV?

Answer: I found no documentation listing them all, but the source code lets us count them. Here it is:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.datasources.csv

import java.nio.charset.StandardCharsets
import java.util.{Locale, TimeZone}

import com.univocity.parsers.csv.{CsvParserSettings, CsvWriterSettings, UnescapedQuoteHandling}
import org.apache.commons.lang3.time.FastDateFormat

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.util._

class CSVOptions(
    @transient private val parameters: CaseInsensitiveMap[String],
    defaultTimeZoneId: String,
    defaultColumnNameOfCorruptRecord: String)
  extends Logging with Serializable {

  def this(
      parameters: Map[String, String],
      defaultTimeZoneId: String,
      defaultColumnNameOfCorruptRecord: String = "") = {
    this(
      CaseInsensitiveMap(parameters),
      defaultTimeZoneId,
      defaultColumnNameOfCorruptRecord)
  }

  private def getChar(paramName: String, default: Char): Char = {
    val paramValue = parameters.get(paramName)
    paramValue match {
      case None => default
      case Some(null) => default
      case Some(value) if value.length == 0 => '\u0000'
      case Some(value) if value.length == 1 => value.charAt(0)
      case _ => throw new RuntimeException(s"$paramName cannot be more than one character")
    }
  }

  private def getInt(paramName: String, default: Int): Int = {
    val paramValue = parameters.get(paramName)
    paramValue match {
      case None => default
      case Some(null) => default
      case Some(value) => try {
        value.toInt
      } catch {
        case e: NumberFormatException =>
          throw new RuntimeException(s"$paramName should be an integer. Found $value")
      }
    }
  }

  private def getBool(paramName: String, default: Boolean = false): Boolean = {
    val param = parameters.getOrElse(paramName, default.toString)
    if (param == null) {
      default
    } else if (param.toLowerCase(Locale.ROOT) == "true") {
      true
    } else if (param.toLowerCase(Locale.ROOT) == "false") {
      false
    } else {
      throw new Exception(s"$paramName flag can be true or false")
    }
  }

  val delimiter = CSVUtils.toChar(
    parameters.getOrElse("sep", parameters.getOrElse("delimiter", ",")))
  val parseMode: ParseMode =
    parameters.get("mode").map(ParseMode.fromString).getOrElse(PermissiveMode)
  val charset = parameters.getOrElse("encoding",
    parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))

  val quote = getChar("quote", '\"')
  val escape = getChar("escape", '\\')
  val comment = getChar("comment", '\u0000')

  val headerFlag = getBool("header")
  val inferSchemaFlag = getBool("inferSchema")
  val ignoreLeadingWhiteSpaceInRead = getBool("ignoreLeadingWhiteSpace", default = false)
  val ignoreTrailingWhiteSpaceInRead = getBool("ignoreTrailingWhiteSpace", default = false)

  // For write, both options were `true` by default. We leave it as `true` for
  // backwards compatibility.
  val ignoreLeadingWhiteSpaceFlagInWrite = getBool("ignoreLeadingWhiteSpace", default = true)
  val ignoreTrailingWhiteSpaceFlagInWrite = getBool("ignoreTrailingWhiteSpace", default = true)

  val columnNameOfCorruptRecord =
    parameters.getOrElse("columnNameOfCorruptRecord", defaultColumnNameOfCorruptRecord)

  val nullValue = parameters.getOrElse("nullValue", "")

  val nanValue = parameters.getOrElse("nanValue", "NaN")

  val positiveInf = parameters.getOrElse("positiveInf", "Inf")
  val negativeInf = parameters.getOrElse("negativeInf", "-Inf")

  val compressionCodec: Option[String] = {
    val name = parameters.get("compression").orElse(parameters.get("codec"))
    name.map(CompressionCodecs.getCodecClassName)
  }

  val timeZone: TimeZone = DateTimeUtils.getTimeZone(
    parameters.getOrElse(DateTimeUtils.TIMEZONE_OPTION, defaultTimeZoneId))

  // Uses `FastDateFormat` which can be direct replacement for `SimpleDateFormat` and thread-safe.
  val dateFormat: FastDateFormat =
    FastDateFormat.getInstance(parameters.getOrElse("dateFormat", "yyyy-MM-dd"), Locale.US)

  val timestampFormat: FastDateFormat =
    FastDateFormat.getInstance(
      parameters.getOrElse("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"), timeZone, Locale.US)

  val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)

  val maxColumns = getInt("maxColumns", 20480)

  val maxCharsPerColumn = getInt("maxCharsPerColumn", -1)

  val escapeQuotes = getBool("escapeQuotes", true)

  val quoteAll = getBool("quoteAll", false)

  val inputBufferSize = 128

  val isCommentSet = this.comment != '\u0000'

  def asWriterSettings: CsvWriterSettings = {
    val writerSettings = new CsvWriterSettings()
    val format = writerSettings.getFormat
    format.setDelimiter(delimiter)
    format.setQuote(quote)
    format.setQuoteEscape(escape)
    format.setComment(comment)
    writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
    writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
    writerSettings.setNullValue(nullValue)
    writerSettings.setEmptyValue(nullValue)
    writerSettings.setSkipEmptyLines(true)
    writerSettings.setQuoteAllFields(quoteAll)
    writerSettings.setQuoteEscapingEnabled(escapeQuotes)
    writerSettings
  }

  def asParserSettings: CsvParserSettings = {
    val settings = new CsvParserSettings()
    val format = settings.getFormat
    format.setDelimiter(delimiter)
    format.setQuote(quote)
    format.setQuoteEscape(escape)
    format.setComment(comment)
    settings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceInRead)
    settings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceInRead)
    settings.setReadInputOnSeparateThread(false)
    settings.setInputBufferSize(inputBufferSize)
    settings.setMaxColumns(maxColumns)
    settings.setNullValue(nullValue)
    settings.setMaxCharsPerColumn(maxCharsPerColumn)
    settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
    settings
  }
}
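For quick reference, here are the user-settable options read by the source above, with their defaults, collected into a plain Scala map. This is just a summary of that code (aliases shown with a slash), not a Spark API, and it reflects this particular Spark version only.

```scala
// Every option CSVOptions reads from `parameters`, with its default value.
val csvOptionDefaults: Map[String, String] = Map(
  "sep/delimiter"             -> ",",
  "mode"                      -> "PERMISSIVE",
  "encoding/charset"          -> "UTF-8",
  "quote"                     -> "\"",
  "escape"                    -> "\\",
  "comment"                   -> "\\u0000 (disabled)",
  "header"                    -> "false",
  "inferSchema"               -> "false",
  "ignoreLeadingWhiteSpace"   -> "false (read) / true (write)",
  "ignoreTrailingWhiteSpace"  -> "false (read) / true (write)",
  "columnNameOfCorruptRecord" -> "(session default)",
  "nullValue"                 -> "",
  "nanValue"                  -> "NaN",
  "positiveInf"               -> "Inf",
  "negativeInf"               -> "-Inf",
  "compression/codec"         -> "(none)",
  "timeZone"                  -> "(session default)",
  "dateFormat"                -> "yyyy-MM-dd",
  "timestampFormat"           -> "yyyy-MM-dd'T'HH:mm:ss.SSSXXX",
  "multiLine"                 -> "false",
  "maxColumns"                -> "20480",
  "maxCharsPerColumn"         -> "-1 (unlimited)",
  "escapeQuotes"              -> "true",
  "quoteAll"                  -> "false"
)

println(s"${csvOptionDefaults.size} options") // 24 options
```

So counting aliases as one option each, this version exposes 24 CSV options.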