[Minimal Spark Tutorial] Loading and Saving Common Data Sources

1. Generic load method:

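spark.read.load()

Without an explicit format(...), load() uses the default data source (parquet, unless spark.sql.sources.default is changed).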

1) avro

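	// Spark 2.4+: the "avro" format (needs the external spark-avro module on the classpath)
	val usersDF = spark.read.format("avro").load("users.avro")
	// Alternative for older Spark versions: the Databricks spark-avro package (use one or the other)
	val usersDF = spark.read.format("com.databricks.spark.avro").load("users.avro")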

2) parquet

	val usersDF = spark.read.format("parquet").load("users.parquet")

3) json

	val peopleDF = spark.read.format("json").load("people.json")

4) csv

	val peopleDFCsv = spark.read.format("csv").option("sep", ";").option("inferSchema", "true").option("header", "true").load("people.csv")

5) Reading with Spark SQL

	val sqlDF = spark.sql("SELECT * FROM parquet.`users.parquet`")
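
To try the CSV options from 4) without an existing file, the sketch below writes a small semicolon-separated file to a temporary location and loads it back. The object name CsvLoadDemo, the sample rows, and the local[*] master are assumptions made for illustration; they are not part of the original examples.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object CsvLoadDemo {
  // Writes a tiny sample file (hypothetical data) and loads it with
  // the same sep / inferSchema / header options used in the csv example.
  def load(spark: SparkSession): DataFrame = {
    val file = Files.createTempFile("people", ".csv")
    val content = "name;age;job\nJorge;30;Developer\nBob;32;Developer\n"
    Files.write(file, content.getBytes(StandardCharsets.UTF_8))

    spark.read.format("csv")
      .option("sep", ";")             // fields are separated by semicolons
      .option("inferSchema", "true")  // sample the data to guess column types
      .option("header", "true")       // first line holds the column names
      .load(file.toUri.toString)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("CsvLoadDemo").getOrCreate()
    try {
      val df = load(spark)
      df.printSchema()
      df.show()
    } finally spark.stop()
  }
}
```

With inferSchema enabled, age should come back as an integer column; without it, every CSV column is read as a string.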

2. Save method:

df.select(<columns>).write.format(<output format>).mode(<save mode>).save(<output path>)

1) error (also spelled errorifexists): the default mode. If data already exists at the target path, the write fails with an error

	df.select("name").write.format("parquet").save("file:///root/data/overwrite")

2) append: if data already exists at the target path, the DataFrame's contents are added to it (Spark writes new files alongside the existing ones)

	df.select("name").write.format("parquet").mode("append").save("file:///root/data/overwrite") 

3) overwrite: if data already exists at the target path, it is replaced with the DataFrame's contents

	df.select("name").write.format("parquet").mode("overwrite").save("file:///root/data/overwrite")

4) ignore: if data already exists at the target path, the DataFrame's contents are discarded and the existing data is left untouched, similar to CREATE TABLE IF NOT EXISTS

	df.select("name").write.format("parquet").mode("ignore").save("file:///root/data/overwrite")
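
The four modes can be compared side by side by writing the same two-row DataFrame to one path and counting the rows after each write. This is a minimal sketch: the object name SaveModeDemo, the sample data, and the temporary output path are assumptions for illustration. It uses the SaveMode enum, which corresponds to the string names used above.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

object SaveModeDemo {
  // Returns the row count read back after each write:
  // (first write, append, overwrite, ignore).
  def run(spark: SparkSession): Seq[Long] = {
    import spark.implicits._
    val df = Seq("Alice", "Bob").toDF("name") // hypothetical sample data

    // A child of a fresh temp directory, so the path does not exist yet.
    val path = Files.createTempDirectory("save-mode-demo").resolve("out").toUri.toString
    def count(): Long = spark.read.parquet(path).count()

    df.write.mode(SaveMode.ErrorIfExists).parquet(path) // succeeds: nothing there yet
    val afterFirst = count()

    df.write.mode(SaveMode.Append).parquet(path)        // adds the rows a second time
    val afterAppend = count()

    df.write.mode(SaveMode.Overwrite).parquet(path)     // replaces everything
    val afterOverwrite = count()

    df.write.mode(SaveMode.Ignore).parquet(path)        // data exists, so nothing is written
    val afterIgnore = count()

    Seq(afterFirst, afterAppend, afterOverwrite, afterIgnore)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SaveModeDemo").getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

Under these assumptions the counts should come out as 2, 4, 2, 2. Repeating the first write with the default mode would fail, because the path now exists.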

3. Option parameters:

Commonly used Spark DataSource options

1) parquet

	val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
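
As a rough illustration of what mergeSchema does, the sketch below writes two Parquet directories with different columns under one base directory and reads them back together. The object name MergeSchemaDemo, the column names, and the key=1 / key=2 partition directories are assumptions chosen for the example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object MergeSchemaDemo {
  def run(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val dir = Files.createTempDirectory("merge-schema-demo")

    // First partition directory: columns (value, square)
    (1 to 3).map(i => (i, i * i)).toDF("value", "square")
      .write.parquet(dir.resolve("key=1").toUri.toString)

    // Second partition directory: columns (value, cube)
    (4 to 6).map(i => (i, i * i * i)).toDF("value", "cube")
      .write.parquet(dir.resolve("key=2").toUri.toString)

    // Reading the base directory with mergeSchema combines both file schemas;
    // "key" is added as a partition column discovered from the directory names.
    spark.read.option("mergeSchema", "true").parquet(dir.toUri.toString)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("MergeSchemaDemo").getOrCreate()
    try run(spark).printSchema() finally spark.stop()
  }
}
```

Rows from the first directory have null in cube, and rows from the second have null in square.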

2) json

	val mdf = spark.read.option("multiLine", "true").json("multi.json") // a single JSON record may span several lines
	val mdf = spark.read.option("charset", "UTF-16BE").json("fileInUTF16.json") // "charset" is an alias of "encoding"

3) csv

	val peopleDFCsv = spark.read.option("sep", "|").csv("examples/src/main/resources/people.csv")
	val peopleDFCsv = spark.read.option("encoding", "UTF-8").csv("examples/src/main/resources/people.csv")
	val peopleDFCsv = spark.read.option("quote", "'").csv("examples/src/main/resources/people.csv")
	val peopleDFCsv = spark.read.option("escape", "\\").csv("examples/src/main/resources/people.csv") // a backslash must be written as "\\" in a Scala string
	val peopleDFCsv = spark.read.option("charToEscapeQuoteEscaping", "\\").csv("examples/src/main/resources/people.csv") // must be a single character
	val peopleDFCsv = spark.read.option("comment", "").csv("examples/src/main/resources/people.csv")
	val peopleDFCsv = spark.read.option("header", "true").csv("examples/src/main/resources/people.csv")
	val peopleDFCsv = spark.read.option("ignoreLeadingWhiteSpace", "false").csv("examples/src/main/resources/people.csv")
	val peopleDFCsv = spark.read.option("ignoreTrailingWhiteSpace", "false").csv("examples/src/main/resources/people.csv")
	val peopleDFCsv = spark.read.option("nullValue", "").csv("examples/src/main/resources/people.csv")
	val peopleDFCsv = spark.read.option("emptyValue", "").csv("examples/src/main/resources/people.csv")
	val peopleDFCsv = spark.read.option("nanValue", "NaN").csv("examples/src/main/resources/people.csv")
	val peopleDFCsv = spark.read.option("positiveInf", "Inf").csv("examples/src/main/resources/people.csv")
	val peopleDFCsv = spark.read.option("negativeInf", "-Inf").csv("examples/src/main/resources/people.csv")
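	val peopleDFCsv = spark.read.option("dateFormat", "yyyy-MM-dd").csv("examples/src/main/resources/people.csv")
	val peopleDFCsv = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX").csv("examples/src/main/resources/people.csv")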
	val peopleDFCsv = spark.read.option("maxColumns", "20480").csv("examples/src/main/resources/people.csv")
	val peopleDFCsv = spark.read.option("maxCharsPerColumn", "-1").csv("examples/src/main/resources/people.csv")
	val peopleDFCsv = spark.read.option("mode", "PERMISSIVE").csv("examples/src/main/resources/people.csv")
	val peopleDFCsv = spark.read.option("multiLine", "false").csv("examples/src/main/resources/people.csv")

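Additional notes

  1. When writing a CSV file you sometimes need to drop the quotes around fields. In theory option("quote", "") sets the quote character to an empty string, but in the written file the quotes then show up as \00, because Spark falls back to the null character (\u0000) when quote is empty on write.
  2. To drop the quotes around fields you can therefore use option("quote", "\b"). Note that "\b" in a Scala string literal is the backspace control character (U+0008), not a regex word boundary. It serves as a workaround because it is a single character that passes Spark's check on the quote argument and almost never occurs in real data.
  3. If you only want empty values not to show up as quotes, use option("emptyValue", ""), which changes the output from …,"",… to …,,…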
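
The emptyValue behaviour from note 3 can be checked by writing the same rows twice and reading the raw output lines back as text. This is a minimal sketch; the object name CsvEmptyValueDemo, the helper writtenLines, and the sample rows are assumptions for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object CsvEmptyValueDemo {
  // Writes two sample rows as CSV with the given writer options and
  // returns the raw lines of the output files.
  def writtenLines(spark: SparkSession, options: Map[String, String]): Set[String] = {
    import spark.implicits._
    val out = Files.createTempDirectory("csv-empty-demo").resolve("out").toUri.toString

    Seq(("a", ""), ("b", "x")).toDF("k", "v") // the first row holds an empty string
      .coalesce(1)
      .write.options(options).csv(out)

    // Read the part files as plain text to see exactly what was written.
    spark.read.textFile(out).collect().toSet
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("CsvEmptyValueDemo").getOrCreate()
    try {
      println(writtenLines(spark, Map.empty[String, String]))
      println(writtenLines(spark, Map("emptyValue" -> "")))
    } finally spark.stop()
  }
}
```

On Spark 2.4 or later, the default call should produce the line a,"" for the first row, while the call with emptyValue set to an empty string should produce a, instead, matching the note above.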

4) text

	val df = spark.read.option("wholetext", "false").text("/path/to/spark/README.md") // false (default): one row per line; true: each file is read as a single row
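
The effect of wholetext is easiest to see by counting rows for the same file under both settings. The sketch below assumes a three-line temporary file; the object name WholeTextDemo and the file contents are made up for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object WholeTextDemo {
  // Returns (rows with wholetext=false, rows with wholetext=true) for one file.
  def rowCounts(spark: SparkSession): (Long, Long) = {
    val file = Files.createTempFile("readme", ".txt")
    Files.write(file, "line one\nline two\nline three\n".getBytes(StandardCharsets.UTF_8))
    val path = file.toUri.toString

    val perLine = spark.read.option("wholetext", "false").text(path).count() // one row per line
    val perFile = spark.read.option("wholetext", "true").text(path).count()  // one row per file
    (perLine, perFile)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("WholeTextDemo").getOrCreate()
    try println(rowCounts(spark)) finally spark.stop()
  }
}
```

For this three-line file the result is (3, 1): three rows when reading line by line, one row when the whole file is read as a single record.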