Spark Read/Write CSV: Common Options, with Examples


Before Spark 2.0, reading and writing CSV files from Spark SQL required the spark-csv library published by Databricks. Since Spark 2.0, Spark SQL supports CSV natively.
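As a quick illustration of the difference (a minimal sketch; the file path is a placeholder and `spark` is a SparkSession as in the examples below):

// Before Spark 2.0: the external Databricks package had to be on the classpath
val oldStyle = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("data/people.csv")

// Spark 2.0+: the built-in csv source, no extra dependency needed
val newStyle = spark.read
  .option("header", "true")
  .csv("data/people.csv")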
The test file, with a header row, is as follows (note that the last row's age is \N, which matches the null in the output below):

id|name|age
1| darren |18
2|anne|18
3|"test"|18
4|'test2'|\N

package com.darren.spark.sql.csv

import org.apache.spark.sql.{SaveMode, SparkSession}

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description Reads a pipe-delimited CSV file that has a header row
  **/
object CSVReader {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV Reader")
      .master("local")
      .getOrCreate()

    val result = spark.read.format("csv")
      .option("delimiter", "|")
      .option("header", "true")
      .option("quote", "'")
      .option("nullValue", "\\N")
      .option("inferSchema", "true")
      .load("test-in/csv/csv_with_header.csv")

    result.show()
    result.printSchema()
  }
}

The output is as follows:

+---+--------+----+
| id|    name| age|
+---+--------+----+
|  1| darren |  18|
|  2|    anne|  18|
|  3|  "test"|  18|
|  4|   test2|null|
+---+--------+----+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

Option descriptions:

  • delimiter — the field separator; defaults to a comma ,
  • nullValue — a string to be treated as a null value
  • quote — the quote character; defaults to a double quote " (see the sketch after this list)
  • header — treat the first line as column names instead of data
  • inferSchema — infer each column's type automatically
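To see why quote matters here, compare the read above (quote set to ') with a read that keeps the default double quote. A minimal sketch; the behavior described in the comments is what these settings should produce:

// With the default quote " : the value "test" is unquoted to test,
// while 'test2' keeps its single quotes as literal characters —
// the opposite of the output above.
val defaultQuote = spark.read.format("csv")
  .option("delimiter", "|")
  .option("header", "true")
  .option("nullValue", "\\N")
  .option("inferSchema", "true")
  .load("test-in/csv/csv_with_header.csv")

defaultQuote.show()
// Expected: row 3's name becomes test, row 4's name stays 'test2'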

Automatic type inference is only a compromise; a better approach is to specify the field types explicitly:

package com.darren.spark.sql.csv

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description Schema definition for the user records in the test files
  **/
case class User(id: Int, name: String, age: Int)
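If you'd rather not go through Scala reflection, a schema with the same column names and types can be written out by hand with Spark's StructType API (a minimal sketch, using nullable = true to match the printed schema above):

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hand-written counterpart of ScalaReflection.schemaFor[User].dataType
val userSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

// Usage: spark.read.format("csv").schema(userSchema).load(...)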

The test file without a header row:

1| darren |18
2|anne|18
3|"test"|18
4|'test2'|\N

package com.darren.spark.sql.csv

import org.apache.spark.sql.{SaveMode, SparkSession}

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description Reads a pipe-delimited CSV file that has no header row
  **/
object CSVReader {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV Reader")
      .master("local")
      .getOrCreate()

    val result = spark.read.format("csv")
      .option("delimiter", "|")
      //.option("header", "true")
      .option("quote", "'")
      .option("nullValue", "\\N")
      .option("inferSchema", "true")
      .load("test-in/csv/csv_without_header.csv")
      .toDF("id", "name", "age")

    result.show()
    result.printSchema()
  }
}

+---+--------+----+
| id|    name| age|
+---+--------+----+
|  1| darren |  18|
|  2|    anne|  18|
|  3|  "test"|  18|
|  4|   test2|null|
+---+--------+----+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

Specifying the types explicitly:

package com.darren.spark.sql.csv

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{SaveMode, SparkSession}

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description Reads a headerless CSV file with a schema derived from the User case class
  **/
object CSVReader {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV Reader")
      .master("local")
      .getOrCreate()

    val result = spark.read.format("csv")
      .option("delimiter", "|")
      //.option("header", "true")
      .option("quote", "'")
      .option("nullValue", "\\N")
      .option("inferSchema", "true")
      .schema(ScalaReflection.schemaFor[User].dataType.asInstanceOf[StructType])
      .load("test-in/csv/csv_without_header.csv")
      //.toDF("id", "name", "age")

    result.show()
    result.printSchema()
  }
}

The result is the same as above.
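Once the schema is pinned down, the DataFrame can also be viewed as a typed Dataset. Note that row 4's age is null, which a primitive Int field cannot hold when the row is deserialized, so this sketch uses an Option field (SafeUser is an illustrative name, not part of the original code):

// Illustrative top-level case class beside User; Option[Int] absorbs null ages.
case class SafeUser(id: Int, name: String, age: Option[Int])

// Inside main, after `result` has been built:
import spark.implicits._
val users = result.as[SafeUser]
users.filter(_.age.isEmpty).show()  // shows the row whose age was \N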

Writing a CSV file:

result.write
  .mode(SaveMode.Overwrite)
  .option("delimiter", "|")
  //.option("quote", "")
  .format("csv")
  .save("test-out/csv/")

The written file looks like this:

1|darren|18
2|anne|18
3|"\"test\""|18
4|test2|

The written content differs from the read output in two ways.

First change: the writer wraps quoted values in double quotes and escapes quotes already present in a value (as \"), because the double quote is the default quote character. If you don't want this, uncomment the quote line above and set the quote character to the empty string.

Second change: the spaces around darren are gone. When Spark SQL saves CSV files (observed on Spark 2.1.1), it trims leading and trailing whitespace from strings by default.

This default is not always what we want; since Spark 2.2.0 it can be switched off through options:

result.write
  .mode(SaveMode.Overwrite)
  .option("delimiter", "|")
  //.option("quote", "")
  .option("ignoreLeadingWhiteSpace", false)
  .option("ignoreTrailingWhiteSpace", false)
  .option("nullValue", null)
  .format("csv")
  .save("test-out/csv/")

Option descriptions:

  • ignoreLeadingWhiteSpace — whether leading whitespace is trimmed
  • ignoreTrailingWhiteSpace — whether trailing whitespace is trimmed
  • nullValue — the marker written for null values; pass null if you don't want any marker at all (a round-trip check is sketched below)
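As a quick sanity check (a sketch reusing the paths from the examples above), read the directory back without trimming and confirm the spaces around darren survive:

// Read back the files just written and verify whitespace is preserved.
val roundTrip = spark.read.format("csv")
  .option("delimiter", "|")
  .option("ignoreLeadingWhiteSpace", false)
  .option("ignoreTrailingWhiteSpace", false)
  .load("test-out/csv/")
  .toDF("id", "name", "age")

roundTrip.show()  // " darren " should keep its leading and trailing space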

Question: so how many options can actually be set when Spark reads and writes CSV?

Answer: I found no documentation that lists them all, but the source code does, so we can count them there.

The source is as follows:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.datasources.csv

import java.nio.charset.StandardCharsets
import java.util.{Locale, TimeZone}

import com.univocity.parsers.csv.{CsvParserSettings, CsvWriterSettings, UnescapedQuoteHandling}
import org.apache.commons.lang3.time.FastDateFormat

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.util._

class CSVOptions(
    @transient private val parameters: CaseInsensitiveMap[String],
    defaultTimeZoneId: String,
    defaultColumnNameOfCorruptRecord: String)
  extends Logging with Serializable {

  def this(
      parameters: Map[String, String],
      defaultTimeZoneId: String,
      defaultColumnNameOfCorruptRecord: String = "") = {
    this(
      CaseInsensitiveMap(parameters),
      defaultTimeZoneId,
      defaultColumnNameOfCorruptRecord)
  }

  private def getChar(paramName: String, default: Char): Char = {
    val paramValue = parameters.get(paramName)
    paramValue match {
      case None => default
      case Some(null) => default
      case Some(value) if value.length == 0 => '\u0000'
      case Some(value) if value.length == 1 => value.charAt(0)
      case _ => throw new RuntimeException(s"$paramName cannot be more than one character")
    }
  }

  private def getInt(paramName: String, default: Int): Int = {
    val paramValue = parameters.get(paramName)
    paramValue match {
      case None => default
      case Some(null) => default
      case Some(value) => try {
        value.toInt
      } catch {
        case e: NumberFormatException =>
          throw new RuntimeException(s"$paramName should be an integer. Found $value")
      }
    }
  }

  private def getBool(paramName: String, default: Boolean = false): Boolean = {
    val param = parameters.getOrElse(paramName, default.toString)
    if (param == null) {
      default
    } else if (param.toLowerCase(Locale.ROOT) == "true") {
      true
    } else if (param.toLowerCase(Locale.ROOT) == "false") {
      false
    } else {
      throw new Exception(s"$paramName flag can be true or false")
    }
  }

  val delimiter = CSVUtils.toChar(
    parameters.getOrElse("sep", parameters.getOrElse("delimiter", ",")))
  val parseMode: ParseMode =
    parameters.get("mode").map(ParseMode.fromString).getOrElse(PermissiveMode)
  val charset = parameters.getOrElse("encoding",
    parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))

  val quote = getChar("quote", '\"')
  val escape = getChar("escape", '\\')
  val comment = getChar("comment", '\u0000')

  val headerFlag = getBool("header")
  val inferSchemaFlag = getBool("inferSchema")
  val ignoreLeadingWhiteSpaceInRead = getBool("ignoreLeadingWhiteSpace", default = false)
  val ignoreTrailingWhiteSpaceInRead = getBool("ignoreTrailingWhiteSpace", default = false)

  // For write, both options were `true` by default. We leave it as `true` for
  // backwards compatibility.
  val ignoreLeadingWhiteSpaceFlagInWrite = getBool("ignoreLeadingWhiteSpace", default = true)
  val ignoreTrailingWhiteSpaceFlagInWrite = getBool("ignoreTrailingWhiteSpace", default = true)

  val columnNameOfCorruptRecord =
    parameters.getOrElse("columnNameOfCorruptRecord", defaultColumnNameOfCorruptRecord)

  val nullValue = parameters.getOrElse("nullValue", "")

  val nanValue = parameters.getOrElse("nanValue", "NaN")

  val positiveInf = parameters.getOrElse("positiveInf", "Inf")
  val negativeInf = parameters.getOrElse("negativeInf", "-Inf")

  val compressionCodec: Option[String] = {
    val name = parameters.get("compression").orElse(parameters.get("codec"))
    name.map(CompressionCodecs.getCodecClassName)
  }

  val timeZone: TimeZone = DateTimeUtils.getTimeZone(
    parameters.getOrElse(DateTimeUtils.TIMEZONE_OPTION, defaultTimeZoneId))

  // Uses `FastDateFormat` which can be direct replacement for `SimpleDateFormat` and thread-safe.
  val dateFormat: FastDateFormat =
    FastDateFormat.getInstance(parameters.getOrElse("dateFormat", "yyyy-MM-dd"), Locale.US)

  val timestampFormat: FastDateFormat =
    FastDateFormat.getInstance(
      parameters.getOrElse("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"), timeZone, Locale.US)

  val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)

  val maxColumns = getInt("maxColumns", 20480)

  val maxCharsPerColumn = getInt("maxCharsPerColumn", -1)

  val escapeQuotes = getBool("escapeQuotes", true)

  val quoteAll = getBool("quoteAll", false)

  val inputBufferSize = 128

  val isCommentSet = this.comment != '\u0000'

  def asWriterSettings: CsvWriterSettings = {
    val writerSettings = new CsvWriterSettings()
    val format = writerSettings.getFormat
    format.setDelimiter(delimiter)
    format.setQuote(quote)
    format.setQuoteEscape(escape)
    format.setComment(comment)
    writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
    writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
    writerSettings.setNullValue(nullValue)
    writerSettings.setEmptyValue(nullValue)
    writerSettings.setSkipEmptyLines(true)
    writerSettings.setQuoteAllFields(quoteAll)
    writerSettings.setQuoteEscapingEnabled(escapeQuotes)
    writerSettings
  }

  def asParserSettings: CsvParserSettings = {
    val settings = new CsvParserSettings()
    val format = settings.getFormat
    format.setDelimiter(delimiter)
    format.setQuote(quote)
    format.setQuoteEscape(escape)
    format.setComment(comment)
    settings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceInRead)
    settings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceInRead)
    settings.setReadInputOnSeparateThread(false)
    settings.setInputBufferSize(inputBufferSize)
    settings.setMaxColumns(maxColumns)
    settings.setNullValue(nullValue)
    settings.setMaxCharsPerColumn(maxCharsPerColumn)
    settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
    settings
  }
}
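Counting the parameters consumed above, this version recognizes about 24 distinct option names (counting aliases such as sep/delimiter and encoding/charset once). A short sketch exercising a few of the less common ones, all taken from the source above (the values are illustrative):

val df = spark.read.format("csv")
  .option("sep", "|")              // alias for delimiter
  .option("mode", "PERMISSIVE")    // the default parse mode; keeps malformed rows
  .option("comment", "#")          // skip lines beginning with #
  .option("multiLine", "true")     // let quoted fields span line breaks
  .option("maxColumns", "20480")   // parser safety limit (the default)
  .load("test-in/csv/csv_with_header.csv")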
