Spark Study Notes 09: Spark SQL Data Sources - Files and JDBC

Overview

Spark SQL can operate on a wide variety of data sources through the DataFrame interface. A DataFrame can be transformed with the usual operators, or registered as a temporary view and queried with SQL. This article surveys Spark's file-based and JDBC data sources and shows how to write data back out to files or to a traditional relational database.
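
All of the examples that follow assume an existing SparkSession named spark. A minimal sketch of creating one for local experimentation (the app name and master setting are assumptions, not part of the original examples):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Minimal local session for running the examples in this article.
val spark: SparkSession = SparkSession.builder()
  .appName("DataSourceExamples") // hypothetical application name
  .master("local[*]")            // run locally with all available cores
  .getOrCreate()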

Loading Data from JSON Files

Example

val sensorDf:DataFrame = spark.read
  .format("json")
  .option("prefersDecimal",true)           // floating-point values become decimals
  .option("multiLine",true)                // a record may span multiple lines
  .option("allowComments",true)            // tolerate comments in the file
  .option("allowUnquotedFieldNames",true)  // tolerate unquoted field names
  .load("data/sensor.json")
sensorDf.show()
sensorDf.printSchema()

Options

timeZone

  • Description: sets the time zone used when parsing dates and timestamps in the JSON file. Two formats are supported (a usage sketch follows the list):
  1. Region-based zone IDs of the form 'area/city', e.g. 'America/Los_Angeles'
  2. Zone offsets of the form '(+|-)HH:mm', e.g. '-08:00' or '+01:00'
  • Default: the value of spark.sql.session.timeZone
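
A minimal usage sketch in the same style as the other options (the zone ID here is only an illustration):

val sensorDf:DataFrame = spark.read
  .format("json")
  .option("timeZone","America/Los_Angeles") // interpret dates/timestamps in this zone
  .load("data/sensor.json")
sensorDf.printSchema()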

primitivesAsString

  • Description: read every value in the JSON file as the String type.
  • Default: false
  • Usage:
val sensorDf:DataFrame = spark.read
  .format("json")
  .option("primitivesAsString",true)
  .load("data/sensor.json")
sensorDf.printSchema()

Output:

root
 |-- id: string (nullable = true)
 |-- temperature: string (nullable = true)
 |-- timestamp: string (nullable = true)

prefersDecimal

  • Description: infer floating-point values as the decimal type.
  • Default: false
  • Usage:

File contents:

{"id": "sensor1","timestamp": 1639968630, "temperature": 12.1}
{"id": "sensor2", "timestamp": 1639968620,"temperature": 13.1}
val sensorDf:DataFrame = spark.read
  .format("json")
  .option("prefersDecimal",true)
  .load("data/sensor.json")
sensorDf.printSchema()

Output:

root
 |-- id: string (nullable = true)
 |-- temperature: decimal(3,1) (nullable = true)
 |-- timestamp: long (nullable = true)

multiLine

  • Description: whether a single JSON record may span multiple lines (needed for files formatted like the one below).
  • Default: false
  • Usage:

File contents:

[
    {"id": "sensor1","timestamp": 1639968630, "temperature": 12.1},
    {"id": "sensor2", "timestamp": 1639968620,"temperature": 13.1}
]
val sensorDf:DataFrame = spark.read
  .format("json")
  .option("multiLine",true)
  .load("data/sensor.json")
sensorDf.show()

Output:

+-------+-----------+----------+
|     id|temperature| timestamp|
+-------+-----------+----------+
|sensor1|       12.1|1639968630|
|sensor2|       13.1|1639968620|
+-------+-----------+----------+

allowComments

  • Description: whether comments are allowed in the JSON file.
  • Default: false
  • Usage:

File contents:

{"id": "sensor1","timestamp": 1639968630, "temperature": 12.1}
//this is a JSON comment
{"id": "sensor2", "timestamp": 1639968620,"temperature": 13.1}
val sensorDf:DataFrame = spark.read
  .format("json")
  .option("allowComments",true)
  .load("data/sensor.json")
sensorDf.show()

Output:

+-------+-----------+----------+
|     id|temperature| timestamp|
+-------+-----------+----------+
|sensor1|       12.1|1639968630|
|sensor2|       13.1|1639968620|
+-------+-----------+----------+

If this option is false, parsing fails with an error.

allowUnquotedFieldNames

  • Description: whether field names (keys) in the JSON data may appear without quotes.
  • Default: false; JSON keys must be quoted.
  • Usage:

File contents:

{"id": "sensor1","timestamp": 1639968630, "temperature": 12.1}
{id: "sensor2", timestamp: 1639968620,"temperature": 13.1}
val sensorDf:DataFrame = spark.read
  .format("json")
  .option("allowUnquotedFieldNames",true)
  .load("data/sensor.json")
sensorDf.show()

Output:

+-------+-----------+----------+
|     id|temperature| timestamp|
+-------+-----------+----------+
|sensor1|       12.1|1639968630|
|sensor2|       13.1|1639968620|
+-------+-----------+----------+

If this option is false, parsing fails with an error.

allowSingleQuotes

  • Description: whether keys in the JSON data may be wrapped in single quotes.
  • Default: true; single-quoted keys are allowed.
  • Usage:

File contents:

{"id": "sensor1","timestamp": 1639968630, "temperature": 12.1}
{'id': "sensor2", 'timestamp': 1639968620,"temperature": 13.1}
val sensorDf:DataFrame = spark.read
  .format("json")
  .option("allowSingleQuotes",true)
  .load("data/sensor.json")
sensorDf.show()

Output:

+-------+-----------+----------+
|     id|temperature| timestamp|
+-------+-----------+----------+
|sensor1|       12.1|1639968630|
|sensor2|       13.1|1639968620|
+-------+-----------+----------+

If this option is false, parsing fails with an error.

dropFieldIfAllNull

  • Description: if every value of a key is null, drop that column (it will not appear in the DataFrame).
  • Default: false
  • Usage:

File contents:

{"id": "sensor1","timestamp": 1639968630, "temperature": null}
{'id': "sensor2", 'timestamp': 1639968620,"temperature": null}
val sensorDf:DataFrame = spark.read
  .format("json")
  .option("dropFieldIfAllNull",true)
  .load("data/sensor.json")
sensorDf.printSchema()

Output:

root
 |-- id: string (nullable = true)
 |-- timestamp: long (nullable = true)

Every value of temperature is null, so the field is dropped from the schema.

dateFormat

  • Description: the pattern used to parse date-typed columns.
  • Default: yyyy-MM-dd
  • Usage:

File contents:

{"id": null,"timestamp": null, "temperature": null,"enterdate": "20220701"}
{"id": "sensor6","timestamp": 1639968530,"temperature": null,"enterdate": "20220701"}
val sensorDf:DataFrame = spark.read
  .format("json")
  .option("dateFormat","yyyyMMdd")
  .schema("enterdate date,id string,temperature double,timestamp long")
  .load("data/sensor.json")
sensorDf.show()
sensorDf.printSchema()

Output:

+----------+-------+-----------+----------+
| enterdate|     id|temperature| timestamp|
+----------+-------+-----------+----------+
|2022-07-01|   null|       null|      null|
|2022-07-01|sensor6|       null|1639968530|
+----------+-------+-----------+----------+

Loading Data from CSV Files

Example

val sensorDf:DataFrame = spark.read
  .format("csv")
  .schema("id string,timestamp long,temperature double")
  .option("sep",";")
  .load("data/sensor.csv")
sensorDf.show()
sensorDf.printSchema()

Options

sep

  • Description: the column separator for the CSV file.
  • Default: ,
  • Usage:

File contents:

sensor_1;1639968630;13.4
sensor_1;1639968631;42.8
val sensorDf:DataFrame = spark.read
  .format("csv")
  .schema("id string,timestamp long,temperature double")
  .option("sep",";")
  .load("data/sensor.csv")
sensorDf.show()

Output:

+--------+----------+-----------+
|      id| timestamp|temperature|
+--------+----------+-----------+
|sensor_1|1639968630|       13.4|
|sensor_1|1639968631|       42.8|
+--------+----------+-----------+

comment

  • Description: the character that marks a comment; any line starting with this character is skipped.
  • Default: disabled (no comment character).
  • Usage:

File contents:

sensor_1;1639968630;13.4
sensor_1;1639968631;42.8
#this is a comment line; please ignore
sensor_1;1639968630;13.4
sensor_1;1639968631;42.8
val sensorDf:DataFrame = spark.read
  .format("csv")
  .schema("id string,timestamp long,temperature double")
  .option("sep",";")
  .option("comment","#")
  .load("data/sensor.csv")
sensorDf.show()

Output:

+--------+----------+-----------+
|      id| timestamp|temperature|
+--------+----------+-----------+
|sensor_1|1639968630|       13.4|
|sensor_1|1639968631|       42.8|
|sensor_1|1639968630|       13.4|
|sensor_1|1639968631|       42.8|
+--------+----------+-----------+

header

  • Description: whether to use the first line of the file as the DataFrame's column names.
  • Default: false
  • Usage:

File contents:

id;timestamp;temperature
sensor_1;1639968630;13.4
sensor_1;1639968631;42.8
val sensorDf:DataFrame = spark.read
  .format("csv")
  .option("sep",";")
  .option("header",true)
  .load("data/sensor.csv")
sensorDf.show()
sensorDf.printSchema()

Output:

+--------+----------+-----------+
|      id| timestamp|temperature|
+--------+----------+-----------+
|sensor_1|1639968630|       13.4|
|sensor_1|1639968631|       42.8|
+--------+----------+-----------+

root
 |-- id: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- temperature: string (nullable = true)

lineSep

  • Description: the line separator.
  • Default: '\r\n', '\n', or '\r' when reading; '\n' when writing.
  • Usage:

File contents:

id;timestamp;temperature
sensor_1;1639968630;13.4
sensor_1;1639968631;42.8
val sensorDf:DataFrame = spark.read
  .format("csv")
  .option("lineSep","\r\n")
  .option("header",true)
  .load("data/sensor.csv")
sensorDf.show()
sensorDf.printSchema()

Output:

+--------+----------+-----------+
|      id| timestamp|temperature|
+--------+----------+-----------+
|sensor_1|1639968630|       13.4|
|sensor_1|1639968631|       42.8|
+--------+----------+-----------+

root
 |-- id: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- temperature: string (nullable = true)

Loading Data from ORC Files

Example

val sensorDf:DataFrame = spark.read
  .format("orc")
  .load("data/save/sensor/orc")
sensorDf.show()
sensorDf.printSchema()
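
The directory read above presumably holds the output of an earlier write; a minimal sketch of producing it, assuming sensorDf holds data loaded from one of the sources above:

// Persist sensorDf as ORC so the read example above can load it back.
sensorDf.write
  .format("orc")
  .mode("overwrite") // replace the directory if it already exists
  .save("data/save/sensor/orc")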

Loading Data from Parquet Files

Example

val sensorDf:DataFrame = spark.read
  .format("parquet")
  .load("data/save/sensor/parquet")
sensorDf.show()
sensorDf.printSchema()

Loading Data from Text Files

Example

val sensorDf:DataFrame = spark.read
  .option("lineSep",";")
  .text("data/sensor.csv")
sensorDf.show()
sensorDf.printSchema()

Options

lineSep

  • Description: the line separator.
  • Default: '\r\n', '\n', or '\r' when reading; '\n' when writing (the example below overrides it with ';').
  • Usage:

File contents:

sensor_1;1639968630;13.4;sensor_1;1639968631;42.8
val sensorDf:DataFrame = spark.read
  .option("lineSep",";")
  .text("data/sensor.csv")
sensorDf.show()
sensorDf.printSchema()

Output:

+----------+
|     value|
+----------+
|  sensor_1|
|1639968630|
|      13.4|
|  sensor_1|
|1639968631|
|      42.8|
+----------+

root
 |-- value: string (nullable = true)

Loading Data from a Relational Database via JDBC

Example

val df = spark.read.format("jdbc")
  .option("url","jdbc:mysql://127.0.0.1:3307/news?user=root&password=root")
  .option("query","SELECT ID,TIMESTAMP FROM sensor")
  .option("driver","com.mysql.cj.jdbc.Driver")
  .option("queryTimeout",30)
  .load()
df.show()

Relational databases with built-in support include:

  • DB2
  • MariaDB
  • MS SQL Server
  • Oracle
  • PostgreSQL

Options

url

  • Description: the JDBC connection URL. If the user name and password are not embedded in the URL, supply them via the "user" and "password" options.
  • Usage:
val df = spark.read
  .format("jdbc")
  .option("url","jdbc:mysql://127.0.0.1:3307/news?user=root&password=root")
  .option("dbtable","sensor") // a JDBC read also needs dbtable or query
  .load()

query

  • Description: the query used to read data; mutually exclusive with dbtable (only one of the two may be set).
  • Usage:
val df = spark.read
  .format("jdbc")
  .option("url","jdbc:mysql://127.0.0.1:3307/news?user=root&password=root")
  .option("query","SELECT ID,TIMESTAMP FROM sensor")
  .load()

dbtable

  • Description: the table to read from; mutually exclusive with query (only one of the two may be set).
  • Usage:
val df = spark.read
  .format("jdbc")
  .option("url","jdbc:mysql://127.0.0.1:3307/news?user=root&password=root")
  .option("dbtable","sensor")
  .load()

driver

  • Description: the JDBC driver class to use; if not specified, a suitable driver is resolved from the URL.
  • Usage:
val df = spark.read
  .format("jdbc")
  .option("url","jdbc:mysql://127.0.0.1:3307/news?user=root&password=root")
  .option("driver","com.mysql.cj.jdbc.Driver")
  .option("dbtable","sensor")
  .load()

customSchema

  • Description: a custom schema applied to the query result, overriding the column types reported by the database.
  • Usage:
val df = spark.read.format("jdbc")
  .option("url","jdbc:mysql://127.0.0.1:3307/news?user=root&password=root")
  .option("query","SELECT ID,TIMESTAMP FROM sensor")
  .option("customSchema","ID STRING,TIMESTAMP double")
  .load()
df.show()
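
The overview also promises to show how data is written back out to files and to a relational database; a minimal sketch of both, reusing df from the example above (the output path and the target table name are assumptions):

// Write the DataFrame out as Parquet files, replacing any existing output.
df.write
  .format("parquet")
  .mode("overwrite")
  .save("data/save/sensor/parquet") // path matching the Parquet read example

// Write the DataFrame into a MySQL table over JDBC.
df.write
  .format("jdbc")
  .option("url","jdbc:mysql://127.0.0.1:3307/news?user=root&password=root")
  .option("dbtable","sensor_copy") // hypothetical target table
  .mode("append")                  // append rows to the table
  .save()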

Official documentation:
https://spark.apache.org/docs/latest/sql-data-sources-json.html
