Load/Save
Parquet files
- save
Parquet is just a storage format: it is language- and platform-neutral and is not tied to any particular data-processing framework. It is also Spark SQL's default data source, which is why the save/load calls below work without an explicit format.
val sql =
"""
|select * from t_user
""".stripMargin
val result = spark.sql(sql)
result.write.save("hdfs://train:9000/results/parquet")
- load
val dataFrame = spark.read.load("hdfs://train:9000/results/parquet")
dataFrame.printSchema()
dataFrame.show()
root
|-- name: string (nullable = true)
|-- sex: boolean (nullable = true)
|-- salary: double (nullable = true)
|-- deptNo: integer (nullable = true)
+-------+-----+-------+------+
| name| sex| salary|deptNo|
+-------+-----+-------+------+
|Michael| true|20000.0| 1|
| Andy| true|15000.0| 1|
| Justin| true| 8000.0| 1|
| Kaine| true|20000.0| 2|
| Lisa|false|18000.0| 2|
+-------+-----+-------+------+
Equivalent shorthand: spark.read.parquet("hdfs://train:9000/results/parquet")
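The writer can also be configured explicitly. A minimal sketch using the same result DataFrame; the output path and the choice of deptNo as partition column are only for illustration:
result.write
  .format("parquet")                 // explicit instead of relying on the default format
  .mode("overwrite")
  .partitionBy("deptNo")             // one sub-directory per department
  .save("hdfs://train:9000/results/parquet_partitioned")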
JSON format
- save
import org.apache.spark.sql.SaveMode

val sql =
"""
|select * from t_user
""".stripMargin
val result = spark.sql(sql)
result.write
.format("json")
.mode(SaveMode.Overwrite)
.save("hdfs://train:9000/results/json")
- load
val dataFrame = spark.read
.format("json")
.load("hdfs://train:9000/results/json")
dataFrame.printSchema()
dataFrame.show()
root
|-- age: long (nullable = true)
|-- dept: string (nullable = true)
|-- deptNo: long (nullable = true)
|-- name: string (nullable = true)
|-- salary: double (nullable = true)
|-- sex: boolean (nullable = true)
+---+--------+------+-------+-------+-----+
|age| dept|deptNo| name| salary| sex|
+---+--------+------+-------+-------+-----+
| 29| MANAGER| 1|Michael|20000.0| true|
| 30|SALESMAN| 1| Andy|15000.0| true|
| 19| CLERK| 1| Justin| 8000.0| true|
| 20| MANAGER| 2| Kaine|20000.0| true|
| 19|SALESMAN| 2| Lisa|18000.0|false|
+---+--------+------+-------+-------+-----+
Shorthand: spark.read.json("hdfs://train:9000/results/json")
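Reading JSON infers the schema by scanning the data, which can be slow on large inputs. A sketch of supplying the schema up front, mirroring the printed schema above (userSchema and jsonDF are illustrative names):
import org.apache.spark.sql.types._

val userSchema = StructType(Seq(
  StructField("age", LongType),
  StructField("dept", StringType),
  StructField("deptNo", LongType),
  StructField("name", StringType),
  StructField("salary", DoubleType),
  StructField("sex", BooleanType)
))

val jsonDF = spark.read
  .schema(userSchema)                // skip schema inference
  .json("hdfs://train:9000/results/json")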
CSV format
- save
val sql =
"""
|select * from t_user
""".stripMargin
val result = spark.sql(sql)
result.write
.format("csv")
.mode(SaveMode.Overwrite)
.option("sep", ",") //指定分隔符
.option("inferSchema", "true") //参照表schema信息
.option("header", "true") //是否产生表头信息
.save("hdfs://train:9000/results/csv")
- load
val dataFrame = spark.read
.format("json")
.option("sep",",") //指定分隔符
.option("inferSchema", "true") //参照表schema信息
.option("header", "true") //是否产生表头信息
.load("hdfs://train:9000/results/csv")
dataFrame.printSchema()
dataFrame.show()
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- salary: double (nullable = true)
|-- sex: boolean (nullable = true)
|-- dept: string (nullable = true)
|-- deptNo: integer (nullable = true)
+-------+---+-------+-----+--------+------+
| name|age| salary| sex| dept|deptNo|
+-------+---+-------+-----+--------+------+
|Michael| 29|20000.0| true| MANAGER| 1|
| Andy| 30|15000.0| true|SALESMAN| 1|
| Justin| 19| 8000.0| true| CLERK| 1|
| Kaine| 20|20000.0| true| MANAGER| 2|
| Lisa| 19|18000.0|false|SALESMAN| 2|
+-------+---+-------+-----+--------+------+
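CSV is parsed line by line, so malformed rows can spoil a load. The reader exposes standard options to control this; a sketch against the files written above (csvDF is an illustrative name):
val csvDF = spark.read
  .option("sep", ",")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("mode", "DROPMALFORMED")   // drop rows that fail to parse (default is PERMISSIVE)
  .option("nullValue", "NULL")       // treat the literal string NULL as null
  .csv("hdfs://train:9000/results/csv")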
ORC format
ORC (Optimized Row Columnar) is a columnar storage format from the Hadoop ecosystem. It appeared in early 2013, originating in Apache Hive, where it was introduced to reduce storage footprint on Hadoop and speed up Hive queries.
- save
val sql =
"""
|select * from t_user
""".stripMargin
val result = spark.sql(sql)
result.write
.format("orc")
.mode(SaveMode.Overwrite)
.save("hdfs://train:9000/results/orc")
- load
val dataFrame = spark.read
.orc("hdfs://train:9000/results/orc")
dataFrame.printSchema()
dataFrame.show()
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- salary: double (nullable = true)
|-- sex: boolean (nullable = true)
|-- dept: string (nullable = true)
|-- deptNo: integer (nullable = true)
+-------+---+-------+-----+--------+------+
| name|age| salary| sex| dept|deptNo|
+-------+---+-------+-----+--------+------+
|Michael| 29|20000.0| true| MANAGER| 1|
| Andy| 30|15000.0| true|SALESMAN| 1|
| Justin| 19| 8000.0| true| CLERK| 1|
| Kaine| 20|20000.0| true| MANAGER| 2|
| Lisa| 19|18000.0|false|SALESMAN| 2|
+-------+---+-------+-----+--------+------+
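ORC output is compressed by default (snappy in recent Spark versions); the codec can also be chosen explicitly when writing. A sketch; the output path is only for illustration:
result.write
  .format("orc")
  .mode("overwrite")
  .option("compression", "zlib")     // supported values include none, snappy, zlib, lzo
  .save("hdfs://train:9000/results/orc_zlib")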
Running SQL directly on files
val parquetDF = spark.sql("select * from parquet.`hdfs://train:9000/results/parquet/`")
val jsonDF = spark.sql("select * from json.`hdfs://train:9000/results/json/`")
val orcDF = spark.sql("select * from orc.`hdfs://train:9000/results/orc/`")
parquetDF.show()
jsonDF.show()
orcDF.show()
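The same queries can also be run by loading the files into a DataFrame and registering a temporary view, which is handy when the query is reused. A sketch; the view name t_user_parquet is only illustrative:
val parquetView = spark.read.parquet("hdfs://train:9000/results/parquet")
parquetView.createOrReplaceTempView("t_user_parquet")
spark.sql("select name, salary from t_user_parquet where deptNo = 1").show()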
Reading and writing data via JDBC
- load
Add the MySQL JDBC driver to the project first:
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.47</version>
</dependency>
val dataFrame = spark.read
.format("jdbc")
.option("url","jdbc:mysql://train:3306/test")
.option("dbtable","t_user")
.option("user","root")
.option("password","root")
.load()
dataFrame.show()
+---+-------+-----+---+----------+
| id| name| sex|age| birthDay|
+---+-------+-----+---+----------+
| 1| lisi|false| 25|2020-01-10|
| 3| wangwu| true| 30|2020-01-17|
| 4|zhaoliu|false| 50|1990-02-08|
| 5| win7| true| 20|1991-02-11|
+---+-------+-----+---+----------+
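By default the whole table is read through a single JDBC connection. For larger tables the read can be split across partitions on a numeric column; a sketch using the id values shown above as bounds (partitionedDF is an illustrative name):
val partitionedDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://train:3306/test")
  .option("dbtable", "t_user")
  .option("user", "root")
  .option("password", "root")
  .option("partitionColumn", "id")   // must be a numeric column
  .option("lowerBound", "1")
  .option("upperBound", "5")
  .option("numPartitions", "2")
  .load()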
- save
import java.util.Properties

val props = new Properties()
props.put("user", "root")
props.put("password", "root")
userDS.select("name", "sex", "salary", "deptNo") // userDS: an existing Dataset (e.g. built from t_user)
.toDF()
.write
.mode("append")
.jdbc("jdbc:mysql://train:3306/test","t_user1",props)
If the target table does not exist, Spark creates it automatically.
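When Spark creates the table it maps Spark types to default JDBC types (e.g. strings become TEXT on MySQL). The createTableColumnTypes option can override this at creation time; a sketch, with t_user2 as a hypothetical target table:
userDS.select("name", "sex", "salary", "deptNo")
  .toDF()
  .write
  .mode("append")
  .option("createTableColumnTypes", "name VARCHAR(64), salary DECIMAL(10,2)")
  .jdbc("jdbc:mysql://train:3306/test", "t_user2", props)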