Spark Read/Write CSV: Common Options, with Examples


Before Spark 2.0, reading and writing CSV files from Spark SQL required the spark-csv library published by Databricks. Since Spark 2.0, Spark SQL supports CSV natively.
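As a quick illustration of the difference (a minimal sketch; the file path is a placeholder and `spark` is a SparkSession as in the examples below):

// Before Spark 2.0: the external Databricks package had to be on the classpath
val oldStyle = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("data/people.csv")

// Spark 2.0+: the built-in csv source, no extra dependency needed
val newStyle = spark.read
  .option("header", "true")
  .csv("data/people.csv")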
The test file, with a header row, is as follows (note that the last row's age is \N, which matches the null in the output below):

id|name|age
1| darren |18
2|anne|18
3|"test"|18
4|'test2'|\N

package com.darren.spark.sql.csv

import org.apache.spark.sql.{SaveMode, SparkSession}

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description Reads a pipe-delimited CSV file that has a header row
  **/
object CSVReader {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV Reader")
      .master("local")
      .getOrCreate()

    val result = spark.read.format("csv")
      .option("delimiter", "|")
      .option("header", "true")
      .option("quote", "'")
      .option("nullValue", "\\N")
      .option("inferSchema", "true")
      .load("test-in/csv/csv_with_header.csv")

    result.show()
    result.printSchema()
  }
}

The output is as follows:

+---+--------+----+
| id|    name| age|
+---+--------+----+
|  1| darren |  18|
|  2|    anne|  18|
|  3|  "test"|  18|
|  4|   test2|null|
+---+--------+----+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

Option descriptions:

  • delimiter — the field separator; defaults to a comma ,
  • nullValue — a string to be treated as a null value
  • quote — the quote character; defaults to a double quote " (see the sketch after this list)
  • header — treat the first line as column names instead of data
  • inferSchema — infer each column's type automatically
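To see why quote matters here, compare the read above (quote set to ') with a read that keeps the default double quote. A minimal sketch; the behavior described in the comments is what these settings should produce:

// With the default quote " : the value "test" is unquoted to test,
// while 'test2' keeps its single quotes as literal characters —
// the opposite of the output above.
val defaultQuote = spark.read.format("csv")
  .option("delimiter", "|")
  .option("header", "true")
  .option("nullValue", "\\N")
  .option("inferSchema", "true")
  .load("test-in/csv/csv_with_header.csv")

defaultQuote.show()
// Expected: row 3's name becomes test, row 4's name stays 'test2'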

Automatic type inference is only a compromise; a better approach is to specify the field types explicitly:

package com.darren.spark.sql.csv

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description Schema definition for the user records in the test files
  **/
case class User(id: Int, name: String, age: Int)
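If you'd rather not go through Scala reflection, a schema with the same column names and types can be written out by hand with Spark's StructType API (a minimal sketch, using nullable = true to match the printed schema above):

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hand-written counterpart of ScalaReflection.schemaFor[User].dataType
val userSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

// Usage: spark.read.format("csv").schema(userSchema).load(...)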

The test file without a header row:

1| darren |18
2|anne|18
3|"test"|18
4|'test2'|\N

package com.darren.spark.sql.csv

import org.apache.spark.sql.{SaveMode, SparkSession}

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description Reads a pipe-delimited CSV file that has no header row
  **/
object CSVReader {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV Reader")
      .master("local")
      .getOrCreate()

    val result = spark.read.format("csv")
      .option("delimiter", "|")
      //.option("header", "true")
      .option("quote", "'")
      .option("nullValue", "\\N")
      .option("inferSchema", "true")
      .load("test-in/csv/csv_without_header.csv")
      .toDF("id", "name", "age")

    result.show()
    result.printSchema()
  }
}

+---+--------+----+
| id|    name| age|
+---+--------+----+
|  1| darren |  18|
|  2|    anne|  18|
|  3|  "test"|  18|
|  4|   test2|null|
+---+--------+----+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

Specifying the types explicitly:

package com.darren.spark.sql.csv

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{SaveMode, SparkSession}

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description Reads a headerless CSV file with a schema derived from the User case class
  **/
object CSVReader {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV Reader")
      .master("local")
      .getOrCreate()

    val result = spark.read.format("csv")
      .option("delimiter", "|")
      //.option("header", "true")
      .option("quote", "'")
      .option("nullValue", "\\N")
      .option("inferSchema", "true")
      .schema(ScalaReflection.schemaFor[User].dataType.asInstanceOf[StructType])
      .load("test-in/csv/csv_without_header.csv")
      //.toDF("id", "name", "age")

    result.show()
    result.printSchema()
  }
}

The result is the same as above.
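Once the schema is pinned down, the DataFrame can also be viewed as a typed Dataset. Note that row 4's age is null, which a primitive Int field cannot hold when the row is deserialized, so this sketch uses an Option field (SafeUser is an illustrative name, not part of the original code):

// Illustrative top-level case class beside User; Option[Int] absorbs null ages.
case class SafeUser(id: Int, name: String, age: Option[Int])

// Inside main, after `result` has been built:
import spark.implicits._
val users = result.as[SafeUser]
users.filter(_.age.isEmpty).show()  // shows the row whose age was \N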

Writing a CSV file:

result.write
  .mode(SaveMode.Overwrite)
  .option("delimiter", "|")
  //.option("quote", "")
  .format("csv")
  .save("test-out/csv/")

The written file looks like this:

1|darren|18
2|anne|18
3|"\"test\""|18
4|test2|

The written content differs from the read output in two ways.

First change: the writer wraps quoted values in double quotes and escapes quotes already present in a value (as \"), because the double quote is the default quote character. If you don't want this, uncomment the quote line above and set the quote character to the empty string.

Second change: the spaces around darren are gone. When Spark SQL saves CSV files (observed on Spark 2.1.1), it trims leading and trailing whitespace from strings by default.

This default is not always what we want; since Spark 2.2.0 it can be switched off through options:

result.write
  .mode(SaveMode.Overwrite)
  .option("delimiter", "|")
  //.option("quote", "")
  .option("ignoreLeadingWhiteSpace", false)
  .option("ignoreTrailingWhiteSpace", false)
  .option("nullValue", null)
  .format("csv")
  .save("test-out/csv/")

Option descriptions:

  • ignoreLeadingWhiteSpace — whether leading whitespace is trimmed
  • ignoreTrailingWhiteSpace — whether trailing whitespace is trimmed
  • nullValue — the marker written for null values; pass null if you don't want any marker at all (a round-trip check is sketched below)
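As a quick sanity check (a sketch reusing the paths from the examples above), read the directory back without trimming and confirm the spaces around darren survive:

// Read back the files just written and verify whitespace is preserved.
val roundTrip = spark.read.format("csv")
  .option("delimiter", "|")
  .option("ignoreLeadingWhiteSpace", false)
  .option("ignoreTrailingWhiteSpace", false)
  .load("test-out/csv/")
  .toDF("id", "name", "age")

roundTrip.show()  // " darren " should keep its leading and trailing space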

Question: so how many options can actually be set when Spark reads and writes CSV?

Answer: I found no documentation that lists them all, but the source code does, so we can count them there.

The source is as follows:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.datasources.csv

import java.nio.charset.StandardCharsets
import java.util.{Locale, TimeZone}

import com.univocity.parsers.csv.{CsvParserSettings, CsvWriterSettings, UnescapedQuoteHandling}
import org.apache.commons.lang3.time.FastDateFormat

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.util._

class CSVOptions(
    @transient private val parameters: CaseInsensitiveMap[String],
    defaultTimeZoneId: String,
    defaultColumnNameOfCorruptRecord: String)
  extends Logging with Serializable {

  def this(
      parameters: Map[String, String],
      defaultTimeZoneId: String,
      defaultColumnNameOfCorruptRecord: String = "") = {
    this(
      CaseInsensitiveMap(parameters),
      defaultTimeZoneId,
      defaultColumnNameOfCorruptRecord)
  }

  private def getChar(paramName: String, default: Char): Char = {
    val paramValue = parameters.get(paramName)
    paramValue match {
      case None => default
      case Some(null) => default
      case Some(value) if value.length == 0 => '\u0000'
      case Some(value) if value.length == 1 => value.charAt(0)
      case _ => throw new RuntimeException(s"$paramName cannot be more than one character")
    }
  }

  private def getInt(paramName: String, default: Int): Int = {
    val paramValue = parameters.get(paramName)
    paramValue match {
      case None => default
      case Some(null) => default
      case Some(value) => try {
        value.toInt
      } catch {
        case e: NumberFormatException =>
          throw new RuntimeException(s"$paramName should be an integer. Found $value")
      }
    }
  }

  private def getBool(paramName: String, default: Boolean = false): Boolean = {
    val param = parameters.getOrElse(paramName, default.toString)
    if (param == null) {
      default
    } else if (param.toLowerCase(Locale.ROOT) == "true") {
      true
    } else if (param.toLowerCase(Locale.ROOT) == "false") {
      false
    } else {
      throw new Exception(s"$paramName flag can be true or false")
    }
  }

  val delimiter = CSVUtils.toChar(
    parameters.getOrElse("sep", parameters.getOrElse("delimiter", ",")))
  val parseMode: ParseMode =
    parameters.get("mode").map(ParseMode.fromString).getOrElse(PermissiveMode)
  val charset = parameters.getOrElse("encoding",
    parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))

  val quote = getChar("quote", '\"')
  val escape = getChar("escape", '\\')
  val comment = getChar("comment", '\u0000')

  val headerFlag = getBool("header")
  val inferSchemaFlag = getBool("inferSchema")
  val ignoreLeadingWhiteSpaceInRead = getBool("ignoreLeadingWhiteSpace", default = false)
  val ignoreTrailingWhiteSpaceInRead = getBool("ignoreTrailingWhiteSpace", default = false)

  // For write, both options were `true` by default. We leave it as `true` for
  // backwards compatibility.
  val ignoreLeadingWhiteSpaceFlagInWrite = getBool("ignoreLeadingWhiteSpace", default = true)
  val ignoreTrailingWhiteSpaceFlagInWrite = getBool("ignoreTrailingWhiteSpace", default = true)

  val columnNameOfCorruptRecord =
    parameters.getOrElse("columnNameOfCorruptRecord", defaultColumnNameOfCorruptRecord)

  val nullValue = parameters.getOrElse("nullValue", "")

  val nanValue = parameters.getOrElse("nanValue", "NaN")

  val positiveInf = parameters.getOrElse("positiveInf", "Inf")
  val negativeInf = parameters.getOrElse("negativeInf", "-Inf")

  val compressionCodec: Option[String] = {
    val name = parameters.get("compression").orElse(parameters.get("codec"))
    name.map(CompressionCodecs.getCodecClassName)
  }

  val timeZone: TimeZone = DateTimeUtils.getTimeZone(
    parameters.getOrElse(DateTimeUtils.TIMEZONE_OPTION, defaultTimeZoneId))

  // Uses `FastDateFormat` which can be direct replacement for `SimpleDateFormat` and thread-safe.
  val dateFormat: FastDateFormat =
    FastDateFormat.getInstance(parameters.getOrElse("dateFormat", "yyyy-MM-dd"), Locale.US)

  val timestampFormat: FastDateFormat =
    FastDateFormat.getInstance(
      parameters.getOrElse("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"), timeZone, Locale.US)

  val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)

  val maxColumns = getInt("maxColumns", 20480)

  val maxCharsPerColumn = getInt("maxCharsPerColumn", -1)

  val escapeQuotes = getBool("escapeQuotes", true)

  val quoteAll = getBool("quoteAll", false)

  val inputBufferSize = 128

  val isCommentSet = this.comment != '\u0000'

  def asWriterSettings: CsvWriterSettings = {
    val writerSettings = new CsvWriterSettings()
    val format = writerSettings.getFormat
    format.setDelimiter(delimiter)
    format.setQuote(quote)
    format.setQuoteEscape(escape)
    format.setComment(comment)
    writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
    writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
    writerSettings.setNullValue(nullValue)
    writerSettings.setEmptyValue(nullValue)
    writerSettings.setSkipEmptyLines(true)
    writerSettings.setQuoteAllFields(quoteAll)
    writerSettings.setQuoteEscapingEnabled(escapeQuotes)
    writerSettings
  }

  def asParserSettings: CsvParserSettings = {
    val settings = new CsvParserSettings()
    val format = settings.getFormat
    format.setDelimiter(delimiter)
    format.setQuote(quote)
    format.setQuoteEscape(escape)
    format.setComment(comment)
    settings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceInRead)
    settings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceInRead)
    settings.setReadInputOnSeparateThread(false)
    settings.setInputBufferSize(inputBufferSize)
    settings.setMaxColumns(maxColumns)
    settings.setNullValue(nullValue)
    settings.setMaxCharsPerColumn(maxCharsPerColumn)
    settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
    settings
  }
}
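Counting the parameters consumed above, this version recognizes about 24 distinct option names (counting aliases such as sep/delimiter and encoding/charset once). A short sketch exercising a few of the less common ones, all taken from the source above (the values are illustrative):

val df = spark.read.format("csv")
  .option("sep", "|")              // alias for delimiter
  .option("mode", "PERMISSIVE")    // the default parse mode; keeps malformed rows
  .option("comment", "#")          // skip lines beginning with #
  .option("multiLine", "true")     // let quoted fields span line breaks
  .option("maxColumns", "20480")   // parser safety limit (the default)
  .load("test-in/csv/csv_with_header.csv")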
