Before Spark 2.0, reading and writing CSV files with Spark SQL required the spark-csv package from Databricks. Since Spark 2.0, Spark SQL supports reading and writing CSV files natively.
The test file with a header row:

id|name|age
1| darren |18
2|anne|18
3|"test"|18
4|'test2'|18

package com.darren.spark.sql.csv

import org.apache.spark.sql.{SaveMode, SparkSession}

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description TODO
  **/
object CSVReader {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV Reader")
      .master("local")
      .getOrCreate()

    val result = spark.read.format("csv")
      .option("delimiter", "|")
      .option("header", "true")
      .option("quote", "'")
      .option("nullValue", "\\N")
      .option("inferSchema", "true")
      .load("test-in/csv/csv_with_header.csv")

    result.show()
    result.printSchema()
  }
}
The output:

+---+--------+----+
| id|    name| age|
+---+--------+----+
|  1| darren |  18|
|  2|    anne|  18|
|  3|  "test"|  18|
|  4|   test2|null|
+---+--------+----+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
Option descriptions:
- delimiter — the field separator; defaults to a comma (,)
- nullValue — a string that represents a null value
- quote — the quote character; defaults to a double quote (")
- header — treat the first line as column names rather than data
- inferSchema — infer each column's type automatically
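To make the interaction between quote and nullValue concrete, here is a minimal pure-Scala sketch applied to the sample fields above. This is an illustration only, not Spark's actual parser; `parseField` is a hypothetical helper, with the quote set to a single quote and nullValue set to \N as in the reader code.

```scala
// Illustrative only: roughly how a single field is interpreted under
// quote = ''' and nullValue = "\N" (not Spark's real implementation).
def parseField(raw: String,
               quote: Char = '\'',
               nullValue: String = "\\N"): Option[String] = {
  // Strip the configured quote character if the field is wrapped in it.
  val unquoted =
    if (raw.length >= 2 && raw.head == quote && raw.last == quote)
      raw.substring(1, raw.length - 1)
    else raw
  // A field equal to nullValue becomes SQL NULL.
  if (unquoted == nullValue) None else Some(unquoted)
}

println(parseField("'test2'"))   // Some(test2): single quotes stripped
println(parseField("\"test\"")) // Some("test"): double quotes kept, quote is '
println(parseField("\\N"))       // None: matches nullValue
```

This matches the output above: 'test2' loses its quotes, "test" keeps them, and \N becomes null.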
Schema inference is only a compromise; a better approach is to specify the column types explicitly:

package com.darren.spark.sql.csv

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description TODO
  **/
case class User(id: Int, name: String, age: Int)
The test file without a header row:

1| darren |18
2|anne|18
3|"test"|18
4|'test2'|\N

package com.darren.spark.sql.csv

import org.apache.spark.sql.{SaveMode, SparkSession}

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description TODO
  **/
object CSVReader {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV Reader")
      .master("local")
      .getOrCreate()

    val result = spark.read.format("csv")
      .option("delimiter", "|")
      //.option("header", "true")
      .option("quote", "'")
      .option("nullValue", "\\N")
      .option("inferSchema", "true")
      .load("test-in/csv/csv_without_header.csv")
      .toDF("id", "name", "age")

    result.show()
    result.printSchema()
  }
}

+---+--------+----+
| id|    name| age|
+---+--------+----+
|  1| darren |  18|
|  2|    anne|  18|
|  3|  "test"|  18|
|  4|   test2|null|
+---+--------+----+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
Specifying the types explicitly:

package com.darren.spark.sql.csv

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{SaveMode, SparkSession}

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description TODO
  **/
object CSVReader {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV Reader")
      .master("local")
      .getOrCreate()

    val result = spark.read.format("csv")
      .option("delimiter", "|")
      //.option("header", "true")
      .option("quote", "'")
      .option("nullValue", "\\N")
      .option("inferSchema", "true")
      .schema(ScalaReflection.schemaFor[User].dataType.asInstanceOf[StructType])
      .load("test-in/csv/csv_without_header.csv")
      //.toDF("id", "name", "age")

    result.show()
    result.printSchema()
  }
}
The result is the same as above.

Writing a CSV file:

result.write
  .mode(SaveMode.Overwrite)
  .option("delimiter", "|")
  // .option("quote", "")
  .format("csv")
  .save("test-out/csv/")

The resulting file contains:

1|darren|18
2|anne|18
3|"\"test\""|18
4|test2|
The file's contents differ from the read output in two ways.

First, the writer adds double quotes (\") and escapes any quote characters already present in a field, because the double quote is the default quote character for writing. If you don't want this, uncomment the line above and set the quote character to the empty string.

Second, the spaces around darren are gone. When saving CSV with Spark SQL (observed on Spark 2.1.1), leading and trailing whitespace is trimmed by default. This default is not always what you want; since Spark 2.2.0 it can be disabled via options:

result.write
  .mode(SaveMode.Overwrite)
  .option("delimiter", "|")
  // .option("quote", "")
  .option("ignoreLeadingWhiteSpace", false)
  .option("ignoreTrailingWhiteSpace", false)
  .option("nullValue", null)
  .format("csv")
  .save("test-out/csv/")
Option descriptions:
- ignoreLeadingWhiteSpace — whether to trim leading whitespace
- ignoreTrailingWhiteSpace — whether to trim trailing whitespace
- nullValue — the string written for null values; pass null if you don't want any marker at all
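The two write-side behaviours described above — whitespace trimming and quote escaping — can be sketched in plain Scala. This is an illustration under assumptions, not Spark's actual writer; `writeField` is a hypothetical helper using Spark's write-side defaults (trimming on, quote `"` with escape `\`).

```scala
// Illustrative only: roughly how one field is rendered on write.
def writeField(v: String,
               ignoreWhiteSpace: Boolean = true, // write-side default
               quote: Char = '"',
               escape: Char = '\\'): String = {
  // ignoreLeading/TrailingWhiteSpace default to true on write.
  val trimmed = if (ignoreWhiteSpace) v.trim else v
  if (trimmed.contains(quote))
    // A field containing the quote char gets wrapped, with inner quotes escaped.
    s"$quote${trimmed.replace(quote.toString, s"$escape$quote")}$quote"
  else trimmed
}

println(writeField(" darren "))  // darren (surrounding spaces trimmed)
println(writeField("\"test\"")) // "\"test\"" (wrapped, inner quotes escaped)
println(writeField(" darren ", ignoreWhiteSpace = false)) // spaces preserved
```

This reproduces both changes seen in the written file: ` darren ` becomes `darren`, and `"test"` becomes `"\"test\""`.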
Question: so how many options can actually be set when Spark reads and writes CSV?

Answer: I found no documentation listing them all, but the source code lets us count them. Here it is:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.datasources.csv

import java.nio.charset.StandardCharsets
import java.util.{Locale, TimeZone}

import com.univocity.parsers.csv.{CsvParserSettings, CsvWriterSettings, UnescapedQuoteHandling}
import org.apache.commons.lang3.time.FastDateFormat

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.util._

class CSVOptions(
    @transient private val parameters: CaseInsensitiveMap[String],
    defaultTimeZoneId: String,
    defaultColumnNameOfCorruptRecord: String)
  extends Logging with Serializable {

  def this(
      parameters: Map[String, String],
      defaultTimeZoneId: String,
      defaultColumnNameOfCorruptRecord: String = "") = {
    this(
      CaseInsensitiveMap(parameters),
      defaultTimeZoneId,
      defaultColumnNameOfCorruptRecord)
  }

  private def getChar(paramName: String, default: Char): Char = {
    val paramValue = parameters.get(paramName)
    paramValue match {
      case None => default
      case Some(null) => default
      case Some(value) if value.length == 0 => '\u0000'
      case Some(value) if value.length == 1 => value.charAt(0)
      case _ => throw new RuntimeException(s"$paramName cannot be more than one character")
    }
  }

  private def getInt(paramName: String, default: Int): Int = {
    val paramValue = parameters.get(paramName)
    paramValue match {
      case None => default
      case Some(null) => default
      case Some(value) => try {
        value.toInt
      } catch {
        case e: NumberFormatException =>
          throw new RuntimeException(s"$paramName should be an integer. Found $value")
      }
    }
  }

  private def getBool(paramName: String, default: Boolean = false): Boolean = {
    val param = parameters.getOrElse(paramName, default.toString)
    if (param == null) {
      default
    } else if (param.toLowerCase(Locale.ROOT) == "true") {
      true
    } else if (param.toLowerCase(Locale.ROOT) == "false") {
      false
    } else {
      throw new Exception(s"$paramName flag can be true or false")
    }
  }

  val delimiter = CSVUtils.toChar(
    parameters.getOrElse("sep", parameters.getOrElse("delimiter", ",")))
  val parseMode: ParseMode =
    parameters.get("mode").map(ParseMode.fromString).getOrElse(PermissiveMode)
  val charset = parameters.getOrElse("encoding",
    parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))

  val quote = getChar("quote", '\"')
  val escape = getChar("escape", '\\')
  val comment = getChar("comment", '\u0000')

  val headerFlag = getBool("header")
  val inferSchemaFlag = getBool("inferSchema")
  val ignoreLeadingWhiteSpaceInRead = getBool("ignoreLeadingWhiteSpace", default = false)
  val ignoreTrailingWhiteSpaceInRead = getBool("ignoreTrailingWhiteSpace", default = false)

  // For write, both options were `true` by default. We leave it as `true` for
  // backwards compatibility.
  val ignoreLeadingWhiteSpaceFlagInWrite = getBool("ignoreLeadingWhiteSpace", default = true)
  val ignoreTrailingWhiteSpaceFlagInWrite = getBool("ignoreTrailingWhiteSpace", default = true)

  val columnNameOfCorruptRecord =
    parameters.getOrElse("columnNameOfCorruptRecord", defaultColumnNameOfCorruptRecord)

  val nullValue = parameters.getOrElse("nullValue", "")

  val nanValue = parameters.getOrElse("nanValue", "NaN")

  val positiveInf = parameters.getOrElse("positiveInf", "Inf")
  val negativeInf = parameters.getOrElse("negativeInf", "-Inf")

  val compressionCodec: Option[String] = {
    val name = parameters.get("compression").orElse(parameters.get("codec"))
    name.map(CompressionCodecs.getCodecClassName)
  }

  val timeZone: TimeZone = DateTimeUtils.getTimeZone(
    parameters.getOrElse(DateTimeUtils.TIMEZONE_OPTION, defaultTimeZoneId))

  // Uses `FastDateFormat` which can be direct replacement for `SimpleDateFormat` and thread-safe.
  val dateFormat: FastDateFormat =
    FastDateFormat.getInstance(parameters.getOrElse("dateFormat", "yyyy-MM-dd"), Locale.US)

  val timestampFormat: FastDateFormat =
    FastDateFormat.getInstance(
      parameters.getOrElse("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"), timeZone, Locale.US)

  val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)

  val maxColumns = getInt("maxColumns", 20480)

  val maxCharsPerColumn = getInt("maxCharsPerColumn", -1)

  val escapeQuotes = getBool("escapeQuotes", true)

  val quoteAll = getBool("quoteAll", false)

  val inputBufferSize = 128

  val isCommentSet = this.comment != '\u0000'

  def asWriterSettings: CsvWriterSettings = {
    val writerSettings = new CsvWriterSettings()
    val format = writerSettings.getFormat
    format.setDelimiter(delimiter)
    format.setQuote(quote)
    format.setQuoteEscape(escape)
    format.setComment(comment)
    writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
    writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
    writerSettings.setNullValue(nullValue)
    writerSettings.setEmptyValue(nullValue)
    writerSettings.setSkipEmptyLines(true)
    writerSettings.setQuoteAllFields(quoteAll)
    writerSettings.setQuoteEscapingEnabled(escapeQuotes)
    writerSettings
  }

  def asParserSettings: CsvParserSettings = {
    val settings = new CsvParserSettings()
    val format = settings.getFormat
    format.setDelimiter(delimiter)
    format.setQuote(quote)
    format.setQuoteEscape(escape)
    format.setComment(comment)
    settings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceInRead)
    settings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceInRead)
    settings.setReadInputOnSeparateThread(false)
    settings.setInputBufferSize(inputBufferSize)
    settings.setMaxColumns(maxColumns)
    settings.setNullValue(nullValue)
    settings.setMaxCharsPerColumn(maxCharsPerColumn)
    settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
    settings
  }
}
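For quick reference, here are the user-settable options read by the source above, with their defaults, collected into a plain Scala map. This is just a summary of that code (aliases shown with a slash), not a Spark API, and it reflects this particular Spark version only.

```scala
// Every option CSVOptions reads from `parameters`, with its default value.
val csvOptionDefaults: Map[String, String] = Map(
  "sep/delimiter"             -> ",",
  "mode"                      -> "PERMISSIVE",
  "encoding/charset"          -> "UTF-8",
  "quote"                     -> "\"",
  "escape"                    -> "\\",
  "comment"                   -> "\\u0000 (disabled)",
  "header"                    -> "false",
  "inferSchema"               -> "false",
  "ignoreLeadingWhiteSpace"   -> "false (read) / true (write)",
  "ignoreTrailingWhiteSpace"  -> "false (read) / true (write)",
  "columnNameOfCorruptRecord" -> "(session default)",
  "nullValue"                 -> "",
  "nanValue"                  -> "NaN",
  "positiveInf"               -> "Inf",
  "negativeInf"               -> "-Inf",
  "compression/codec"         -> "(none)",
  "timeZone"                  -> "(session default)",
  "dateFormat"                -> "yyyy-MM-dd",
  "timestampFormat"           -> "yyyy-MM-dd'T'HH:mm:ss.SSSXXX",
  "multiLine"                 -> "false",
  "maxColumns"                -> "20480",
  "maxCharsPerColumn"         -> "-1 (unlimited)",
  "escapeQuotes"              -> "true",
  "quoteAll"                  -> "false"
)

println(s"${csvOptionDefaults.size} options") // 24 options
```

So counting aliases as one option each, this version exposes 24 CSV options.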