Spark SQL: Load/Save

Load/Save
Parquet files
  • save
    Parquet is just a storage format: it is language- and platform-agnostic and is not tied to any particular data-processing framework.
    val sql =
      """
        |select * from t_user
      """.stripMargin

    val result = spark.sql(sql)

    result.write.save("hdfs://train:9000/results/parquet")
  • load
val dataFrame = spark.read.load("hdfs://train:9000/results/parquet")

    dataFrame.printSchema()
    dataFrame.show()
root
 |-- name: string (nullable = true)
 |-- sex: boolean (nullable = true)
 |-- salary: double (nullable = true)
 |-- deptNo: integer (nullable = true)

+-------+-----+-------+------+
|   name|  sex| salary|deptNo|
+-------+-----+-------+------+
|Michael| true|20000.0|     1|
|   Andy| true|15000.0|     1|
| Justin| true| 8000.0|     1|
|  Kaine| true|20000.0|     2|
|   Lisa|false|18000.0|     2|
+-------+-----+-------+------+

Equivalent form: spark.read.parquet("hdfs://train:9000/results/parquet")
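The generic `load`/`save` API and the format-specific shorthands are interchangeable; a minimal sketch reusing the path above (note that when no format is given, Spark falls back to the default source, which is Parquet unless `spark.sql.sources.default` has been changed):

```scala
// Generic API: no format specified, so the default source (parquet) is used
val df1 = spark.read.load("hdfs://train:9000/results/parquet")

// Generic API with an explicit format
val df2 = spark.read.format("parquet").load("hdfs://train:9000/results/parquet")

// Format-specific shorthand
val df3 = spark.read.parquet("hdfs://train:9000/results/parquet")
```

All three calls produce the same DataFrame; the same symmetry holds on the write side (`df.write.save(path)` vs. `df.write.parquet(path)`).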

JSON format
  • save
import org.apache.spark.sql.SaveMode // needed for SaveMode.Overwrite

    val sql =
      """
        |select * from t_user
      """.stripMargin

    val result = spark.sql(sql)

    result.write
      .format("json")
      .mode(SaveMode.Overwrite)
      .save("hdfs://train:9000/results/json")
  • load
val dataFrame = spark.read
        .format("json")
        .load("hdfs://train:9000/results/json")


    dataFrame.printSchema()
    dataFrame.show()
root
 |-- age: long (nullable = true)
 |-- dept: string (nullable = true)
 |-- deptNo: long (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: double (nullable = true)
 |-- sex: boolean (nullable = true)

+---+--------+------+-------+-------+-----+
|age|    dept|deptNo|   name| salary|  sex|
+---+--------+------+-------+-------+-----+
| 29| MANAGER|     1|Michael|20000.0| true|
| 30|SALESMAN|     1|   Andy|15000.0| true|
| 19|   CLERK|     1| Justin| 8000.0| true|
| 20| MANAGER|     2|  Kaine|20000.0| true|
| 19|SALESMAN|     2|   Lisa|18000.0|false|
+---+--------+------+-------+-------+-----+

Shorthand: spark.read.json("hdfs://train:9000/results/json")

CSV format
  • save
    val sql =
      """
        |select * from t_user
      """.stripMargin

    val result = spark.sql(sql)

    result.write
      .format("csv")
      .mode(SaveMode.Overwrite)
      .option("sep", ",")       // field delimiter
      .option("header", "true") // write a header row
      // note: "inferSchema" only applies when reading, so it is omitted here
      .save("hdfs://train:9000/results/csv")
  • load
val dataFrame = spark.read
      .format("csv")                 // must be "csv" here, not "json"
      .option("sep", ",")            // field delimiter
      .option("inferSchema", "true") // infer column types from the data
      .option("header", "true")      // treat the first line as a header row
      .load("hdfs://train:9000/results/csv")


    dataFrame.printSchema()
    dataFrame.show()
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- salary: double (nullable = true)
 |-- sex: boolean (nullable = true)
 |-- dept: string (nullable = true)
 |-- deptNo: integer (nullable = true)

+-------+---+-------+-----+--------+------+
|   name|age| salary|  sex|    dept|deptNo|
+-------+---+-------+-----+--------+------+
|Michael| 29|20000.0| true| MANAGER|     1|
|   Andy| 30|15000.0| true|SALESMAN|     1|
| Justin| 19| 8000.0| true|   CLERK|     1|
|  Kaine| 20|20000.0| true| MANAGER|     2|
|   Lisa| 19|18000.0|false|SALESMAN|     2|
+-------+---+-------+-----+--------+------+

ORC format

ORC (Optimized Row Columnar) is a columnar storage format from the Hadoop ecosystem. It dates back to early 2013 and originated in Apache Hive, where it was designed to reduce storage footprint in Hadoop and to speed up Hive queries.

  • save
    val sql =
      """
        |select * from t_user
      """.stripMargin

    val result = spark.sql(sql)

    result.write
      .format("orc")
      .mode(SaveMode.Overwrite)
      .save("hdfs://train:9000/results/orc")
  • load
    val dataFrame = spark.read
      .orc("hdfs://train:9000/results/orc")

    dataFrame.printSchema()
    dataFrame.show()
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- salary: double (nullable = true)
 |-- sex: boolean (nullable = true)
 |-- dept: string (nullable = true)
 |-- deptNo: integer (nullable = true)

+-------+---+-------+-----+--------+------+
|   name|age| salary|  sex|    dept|deptNo|
+-------+---+-------+-----+--------+------+
|Michael| 29|20000.0| true| MANAGER|     1|
|   Andy| 30|15000.0| true|SALESMAN|     1|
| Justin| 19| 8000.0| true|   CLERK|     1|
|  Kaine| 20|20000.0| true| MANAGER|     2|
|   Lisa| 19|18000.0|false|SALESMAN|     2|
+-------+---+-------+-----+--------+------+
Reading files with SQL

Spark SQL can also query files in place, without registering a table first, by prefixing the backtick-quoted path with the source format:
val parquetDF = spark.sql("select * from parquet.`hdfs://train:9000/results/parquet/`")
    val jsonDF = spark.sql("select * from json.`hdfs://train:9000/results/json/`")
    val orcDF = spark.sql("select * from orc.`hdfs://train:9000/results/orc/`")

    parquetDF.show()
    jsonDF.show()
    orcDF.show()
Reading data over JDBC
  • load
First, add the MySQL JDBC driver to the project's dependencies:
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.47</version>
</dependency>

 val dataFrame = spark.read
        .format("jdbc")
        .option("url","jdbc:mysql://train:3306/test")
        .option("dbtable","t_user")
        .option("user","root")
        .option("password","root")
        .load()
    dataFrame.show()
+---+-------+-----+---+----------+
| id|   name|  sex|age|  birthDay|
+---+-------+-----+---+----------+
|  1|   lisi|false| 25|2020-01-10|
|  3| wangwu| true| 30|2020-01-17|
|  4|zhaoliu|false| 50|1990-02-08|
|  5|   win7| true| 20|1991-02-11|
+---+-------+-----+---+----------+
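The single-connection read above pulls the whole table through one executor. For larger tables, Spark's JDBC source can read in parallel when you supply its partitioning options; a hedged sketch reusing the connection settings above (the column name and bound values here are illustrative, not taken from the original example):

```scala
val partitionedDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://train:3306/test")
  .option("dbtable", "t_user")
  .option("user", "root")
  .option("password", "root")
  // split the scan into 4 parallel range queries on the numeric column `id`;
  // lowerBound/upperBound only control how the ranges are cut, not a filter
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "100")
  .option("numPartitions", "4")
  .load()
```

All four partitioning options must be supplied together, and `partitionColumn` must be a numeric, date, or timestamp column.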

  • save
    import java.util.Properties

    val props = new Properties()
    props.put("user", "root")
    props.put("password", "root")

    userDS.select("name", "sex", "salary", "deptNo")
      .toDF()
      .write
      .mode("append")
      .jdbc("jdbc:mysql://train:3306/test", "t_user1", props)

If the target table does not exist, Spark creates it automatically.
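When it creates the table, Spark maps each Spark SQL type to a default JDBC column type. The `createTableColumnTypes` write option lets you override that mapping per column; a sketch building on the save example above (the chosen types are illustrative):

```scala
userDS.select("name", "sex", "salary", "deptNo")
  .toDF()
  .write
  // override the default DDL types for these columns; only consulted
  // when Spark actually creates the table
  .option("createTableColumnTypes", "name VARCHAR(64), salary DECIMAL(10,2)")
  .mode("append")
  .jdbc("jdbc:mysql://train:3306/test", "t_user1", props)
```

If the table already exists, the option has no effect; columns not listed keep their default mapping.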
