Common Spark Options for Reading and Writing CSV

Before Spark 2.0, reading and writing CSV files with Spark SQL required the spark-csv library provided by Databricks. Since Spark 2.0, Spark SQL supports reading and writing CSV files natively.

A test file with a header:

 
id|name|age
1| darren |18
2|anne|18
3|"test"|18
4|'test2'|\N

 
package com.darren.spark.sql.csv

import org.apache.spark.sql.{SaveMode, SparkSession}

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description TODO
  **/
object CSVReader {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV Reader")
      .master("local")
      .getOrCreate()

    val result = spark.read.format("csv")
      .option("delimiter", "|")
      .option("header", "true")
      .option("quote", "'")
      .option("nullValue", "\\N")
      .option("inferSchema", "true")
      .load("test-in/csv/csv_with_header.csv")

    result.show()
    result.printSchema()
  }
}

The output is as follows:

 
+---+--------+----+
| id|    name| age|
+---+--------+----+
|  1| darren |  18|
|  2|    anne|  18|
|  3|  "test"|  18|
|  4|   test2|null|
+---+--------+----+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

Option descriptions (an equivalent way to pass them is sketched after this list):

  • delimiter: the field separator; defaults to a comma ,
  • nullValue: a string that represents a null value
  • quote: the quote character; defaults to a double quote "
  • header: treat the first line as a header instead of data
  • inferSchema: infer column types automatically
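
For reference, the same read can also be written with an options map and the csv() shorthand. A minimal sketch, assuming the same SparkSession and test file as above:

// Equivalent to format("csv")...load(...): pass options as a Map, then csv()
val result = spark.read
  .options(Map(
    "delimiter"   -> "|",
    "header"      -> "true",
    "quote"       -> "'",
    "nullValue"   -> "\\N",
    "inferSchema" -> "true"))
  .csv("test-in/csv/csv_with_header.csv")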

Schema inference is only a compromise; a better approach is to specify the column types explicitly:
 

 
package com.darren.spark.sql.csv

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description TODO
  **/
case class User(id: Int, name: String, age: Int)
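
For comparison, the same schema can also be built by hand instead of being derived from the case class by reflection. A sketch; the field names, types, and nullability below are assumptions matching the test data:

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hand-built equivalent of ScalaReflection.schemaFor[User].dataType
val userSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

Either form can be passed to spark.read.schema(...), as done further below.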

A test file without a header:

 
1| darren |18
2|anne|18
3|"test"|18
4|'test2'|\N

 
package com.darren.spark.sql.csv

import org.apache.spark.sql.{SaveMode, SparkSession}

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description TODO
  **/
object CSVReader {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV Reader")
      .master("local")
      .getOrCreate()

    val result = spark.read.format("csv")
      .option("delimiter", "|")
      //.option("header", "true")
      .option("quote", "'")
      .option("nullValue", "\\N")
      .option("inferSchema", "true")
      .load("test-in/csv/csv_without_header.csv")
      .toDF("id", "name", "age")

    result.show()
    result.printSchema()
  }
}

 
+---+--------+----+
| id|    name| age|
+---+--------+----+
|  1| darren |  18|
|  2|    anne|  18|
|  3|  "test"|  18|
|  4|   test2|null|
+---+--------+----+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

Specifying the schema:

 
package com.darren.spark.sql.csv

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{SaveMode, SparkSession}

/**
  * @Author Darren Zhang
  * @Date 2019-05-30
  * @Description TODO
  **/
object CSVReader {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV Reader")
      .master("local")
      .getOrCreate()

    val result = spark.read.format("csv")
      .option("delimiter", "|")
      //.option("header", "true")
      .option("quote", "'")
      .option("nullValue", "\\N")
      .option("inferSchema", "true")
      .schema(ScalaReflection.schemaFor[User].dataType.asInstanceOf[StructType])
      .load("test-in/csv/csv_without_header.csv")
      //.toDF("id", "name", "age")

    result.show()
    result.printSchema()
  }
}

The result is the same as above.
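
With a case-class schema in hand, the DataFrame can also be turned into a typed Dataset. A hedged sketch, reusing the imports from the example above: note that a null age cannot be deserialized into a plain Int, so the variant below (SafeUser is a hypothetical name, defined at the top level next to User) uses Option[Int]:

// Hypothetical variant of User: Option[Int] tolerates the null age in row 4.
case class SafeUser(id: Int, name: String, age: Option[Int])

import spark.implicits._

val users = spark.read.format("csv")
  .option("delimiter", "|")
  .option("quote", "'")
  .option("nullValue", "\\N")
  .schema(ScalaReflection.schemaFor[SafeUser].dataType.asInstanceOf[StructType])
  .load("test-in/csv/csv_without_header.csv")
  .as[SafeUser]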

Writing a CSV file:

 
result.write
  .mode(SaveMode.Overwrite)
  .option("delimiter", "|")
  // .option("quote", "")
  .format("csv")
  .save("test-out/csv/")
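
Note that save() writes a directory of part files, one per partition; the lines below come from such a part file. If a single output file is wanted, a common trick (a sketch, not required for this example) is to collapse to one partition first:

// Optional: one partition in, one part file out.
result.coalesce(1).write
  .mode(SaveMode.Overwrite)
  .option("delimiter", "|")
  .format("csv")
  .save("test-out/csv/")

Either way, the written content looks like this: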

 
1|darren|18
2|anne|18
3|"\"test\""|18
4|test2|

The file content differs from the read output in two ways.

First change: the writer wraps values that contain quotes in double quotes and escapes the quotes already in the data (\"), because the double quote is the default quote character on write. If you don't want this, uncomment the line above and set the quote character to an empty string.

Second change: the spaces around darren are gone. As of Spark 2.1.1, Spark SQL trims leading and trailing whitespace from strings by default when writing CSV.

This default behavior is not always what we want. Since Spark 2.2.0, it can be turned off via options:

 
result.write
  .mode(SaveMode.Overwrite)
  .option("delimiter", "|")
  // .option("quote", "")
  .option("ignoreLeadingWhiteSpace", false)
  .option("ignoreTrailingWhiteSpace", false)
  .option("nullValue", null)
  .format("csv")
  .save("test-out/csv/")

Option descriptions (a quick way to inspect the raw output follows this list):

  • ignoreLeadingWhiteSpace: whether to trim leading whitespace
  • ignoreTrailingWhiteSpace: whether to trim trailing whitespace
  • nullValue: the string written for null values; pass null if you don't want any placeholder at all
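
To verify the effect, the written files can be read back as plain text. A minimal sketch, assuming the write above has already run:

// Read the raw lines back, bypassing all CSV parsing.
spark.read.textFile("test-out/csv/").show(truncate = false)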

Question: so how many options does Spark's CSV reader and writer actually accept?

Answer: I found no documentation that enumerates them all, but the source code does, so the full set can be read off directly.

The source:

 
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.datasources.csv

import java.nio.charset.StandardCharsets
import java.util.{Locale, TimeZone}

import com.univocity.parsers.csv.{CsvParserSettings, CsvWriterSettings, UnescapedQuoteHandling}
import org.apache.commons.lang3.time.FastDateFormat

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.util._

class CSVOptions(
    @transient private val parameters: CaseInsensitiveMap[String],
    defaultTimeZoneId: String,
    defaultColumnNameOfCorruptRecord: String)
  extends Logging with Serializable {

  def this(
      parameters: Map[String, String],
      defaultTimeZoneId: String,
      defaultColumnNameOfCorruptRecord: String = "") = {
    this(
      CaseInsensitiveMap(parameters),
      defaultTimeZoneId,
      defaultColumnNameOfCorruptRecord)
  }

  private def getChar(paramName: String, default: Char): Char = {
    val paramValue = parameters.get(paramName)
    paramValue match {
      case None => default
      case Some(null) => default
      case Some(value) if value.length == 0 => '\u0000'
      case Some(value) if value.length == 1 => value.charAt(0)
      case _ => throw new RuntimeException(s"$paramName cannot be more than one character")
    }
  }

  private def getInt(paramName: String, default: Int): Int = {
    val paramValue = parameters.get(paramName)
    paramValue match {
      case None => default
      case Some(null) => default
      case Some(value) => try {
        value.toInt
      } catch {
        case e: NumberFormatException =>
          throw new RuntimeException(s"$paramName should be an integer. Found $value")
      }
    }
  }

  private def getBool(paramName: String, default: Boolean = false): Boolean = {
    val param = parameters.getOrElse(paramName, default.toString)
    if (param == null) {
      default
    } else if (param.toLowerCase(Locale.ROOT) == "true") {
      true
    } else if (param.toLowerCase(Locale.ROOT) == "false") {
      false
    } else {
      throw new Exception(s"$paramName flag can be true or false")
    }
  }

  val delimiter = CSVUtils.toChar(
    parameters.getOrElse("sep", parameters.getOrElse("delimiter", ",")))
  val parseMode: ParseMode =
    parameters.get("mode").map(ParseMode.fromString).getOrElse(PermissiveMode)
  val charset = parameters.getOrElse("encoding",
    parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))

  val quote = getChar("quote", '\"')
  val escape = getChar("escape", '\\')
  val comment = getChar("comment", '\u0000')

  val headerFlag = getBool("header")
  val inferSchemaFlag = getBool("inferSchema")
  val ignoreLeadingWhiteSpaceInRead = getBool("ignoreLeadingWhiteSpace", default = false)
  val ignoreTrailingWhiteSpaceInRead = getBool("ignoreTrailingWhiteSpace", default = false)

  // For write, both options were `true` by default. We leave it as `true` for
  // backwards compatibility.
  val ignoreLeadingWhiteSpaceFlagInWrite = getBool("ignoreLeadingWhiteSpace", default = true)
  val ignoreTrailingWhiteSpaceFlagInWrite = getBool("ignoreTrailingWhiteSpace", default = true)

  val columnNameOfCorruptRecord =
    parameters.getOrElse("columnNameOfCorruptRecord", defaultColumnNameOfCorruptRecord)

  val nullValue = parameters.getOrElse("nullValue", "")

  val nanValue = parameters.getOrElse("nanValue", "NaN")

  val positiveInf = parameters.getOrElse("positiveInf", "Inf")
  val negativeInf = parameters.getOrElse("negativeInf", "-Inf")

  val compressionCodec: Option[String] = {
    val name = parameters.get("compression").orElse(parameters.get("codec"))
    name.map(CompressionCodecs.getCodecClassName)
  }

  val timeZone: TimeZone = DateTimeUtils.getTimeZone(
    parameters.getOrElse(DateTimeUtils.TIMEZONE_OPTION, defaultTimeZoneId))

  // Uses `FastDateFormat` which can be direct replacement for `SimpleDateFormat` and thread-safe.
  val dateFormat: FastDateFormat =
    FastDateFormat.getInstance(parameters.getOrElse("dateFormat", "yyyy-MM-dd"), Locale.US)

  val timestampFormat: FastDateFormat =
    FastDateFormat.getInstance(
      parameters.getOrElse("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"), timeZone, Locale.US)

  val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)

  val maxColumns = getInt("maxColumns", 20480)

  val maxCharsPerColumn = getInt("maxCharsPerColumn", -1)

  val escapeQuotes = getBool("escapeQuotes", true)

  val quoteAll = getBool("quoteAll", false)

  val inputBufferSize = 128

  val isCommentSet = this.comment != '\u0000'

  def asWriterSettings: CsvWriterSettings = {
    val writerSettings = new CsvWriterSettings()
    val format = writerSettings.getFormat
    format.setDelimiter(delimiter)
    format.setQuote(quote)
    format.setQuoteEscape(escape)
    format.setComment(comment)
    writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
    writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
    writerSettings.setNullValue(nullValue)
    writerSettings.setEmptyValue(nullValue)
    writerSettings.setSkipEmptyLines(true)
    writerSettings.setQuoteAllFields(quoteAll)
    writerSettings.setQuoteEscapingEnabled(escapeQuotes)
    writerSettings
  }

  def asParserSettings: CsvParserSettings = {
    val settings = new CsvParserSettings()
    val format = settings.getFormat
    format.setDelimiter(delimiter)
    format.setQuote(quote)
    format.setQuoteEscape(escape)
    format.setComment(comment)
    settings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceInRead)
    settings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceInRead)
    settings.setReadInputOnSeparateThread(false)
    settings.setInputBufferSize(inputBufferSize)
    settings.setMaxColumns(maxColumns)
    settings.setNullValue(nullValue)
    settings.setMaxCharsPerColumn(maxCharsPerColumn)
    settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
    settings
  }
}
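
Reading the option names off the source above, CSVOptions recognizes: sep/delimiter, mode, encoding/charset, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, columnNameOfCorruptRecord, nullValue, nanValue, positiveInf, negativeInf, compression/codec, timeZone, dateFormat, timestampFormat, multiLine, maxColumns, maxCharsPerColumn, escapeQuotes, and quoteAll.

As one last sketch, here is a hedged example of two options visible above that were not covered earlier: the parse mode and the corrupt-record column. The schema must declare the extra string column that receives the raw record; spark and the test file are assumed to be the same as in the examples above:

import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// In PERMISSIVE mode, malformed rows land in _corrupt_record instead of
// failing the read.
val withCorrupt = spark.read
  .option("delimiter", "|")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(new StructType()
    .add("id", IntegerType)
    .add("name", StringType)
    .add("age", IntegerType)
    .add("_corrupt_record", StringType))
  .csv("test-in/csv/csv_without_header.csv")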
