SparkSQL Data Sources

1. Data Sources

SparkSQL data sources include structured files (JSON, Parquet), Hive tables, external databases (e.g. MySQL), and existing RDDs (see the sketch below).
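For the last case, an existing RDD can be converted into a DataFrame. A minimal sketch, assuming an existing SparkContext sc, a Spark 1.6 SQLContext, and an illustrative Person case class (none of which appear in the original examples):

// Case class whose fields become the DataFrame's columns.
case class Person(name: String, age: Int)

// toDF() comes from the implicits of an existing SQLContext.
import sqlContext.implicits._

// Turn an in-memory RDD into a DataFrame and use it like any other data source.
val peopleDF = sc.parallelize(Seq(Person("Andy", 30), Person("Justin", 19))).toDF()
peopleDF.show()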

2. load and save

SparkSQL's default data source format is Parquet.

load reads a file and loads its data into a DataFrame.

save writes the DataFrame's data out to a file.

val df = sqlContext.read.load("examples/src/main/resources/users.parquet")

df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")


3. format

The default file format for SparkSQL data sources is Parquet, but we can specify the data source format manually, for example json or jdbc, by using the format function.

val df = sqlContext.read.format("json").load("examples/src/main/resources/people.json")

df.select("name", "age").write.format("parquet").save("namesAndAges.parquet")

4. save mode

When saving a DataFrame, you can set a save mode, which determines how existing data at the destination is handled. The modes are as follows:

Scala/Java | Any Language | Meaning

SaveMode.ErrorIfExists (default) | "error" (default) | When saving a DataFrame to a data source, if data already exists, an exception is expected to be thrown.

SaveMode.Append | "append" | When saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.

SaveMode.Overwrite | "overwrite" | Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame.

SaveMode.Ignore | "ignore" | Ignore mode means that when saving a DataFrame to a data source, if data already exists, the save operation is expected to not save the contents of the DataFrame and to not change the existing data. This is similar to a CREATE TABLE IF NOT EXISTS in SQL.

val df = sqlContext.sql("select *from parquet.`examples\\src\\main\\resources\\users.parquet`")

 df.show()

df.write.mode(saveMode=SaveMode.Append).format("json").save("c:\\saveMode")
df.write.format("json").mode(saveMode = SaveMode.Overwrite).save("c:\\saveMode1")

5. Spark Data Source: JSON

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using SQLContext.read.json(path) or SQLContext.read.format("json").load(path) on either an RDD of String or a JSON file.

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files.
val path = "examples/src/main/resources/people.json"
val people = sqlContext.read.json(path)
// The inferred schema can be visualized using the printSchema() method.
people.printSchema()
// root
//  |-- age: integer (nullable = true)
//  |-- name: string (nullable = true)
// Register this DataFrame as a table.
people.registerTempTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
// Alternatively, a DataFrame can be created for a JSON dataset represented by
// an RDD[String] storing one JSON object per string.
// Create the JSON data by parallelizing an in-memory collection
val anotherPeopleRDD = sc.parallelize(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.read.json(anotherPeopleRDD)

6. SparkSQL Data Source: JDBC

MySQL and PostgreSQL are both relational databases. PostgreSQL has also become quite popular recently (it is often used as the metastore database for Hive).
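
Reading from MySQL over JDBC requires the MySQL driver on the classpath; one possible Maven dependency (the version here is only an example):

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.38</version>
</dependency>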

package sparksql

import java.util.Properties

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

import scala.collection.mutable
object JdbcTest {
  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("test").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // TODO Approach 1:
    val map = new mutable.HashMap[String,String]()
    map.put("url","jdbc:mysql://hadoop1:3306/aura")
    map.put("user","aura")
    map.put("password","aura")
    map.put("dbtable","people")
    val df = sqlContext.read.format("jdbc").options(
      map
    ).load()
    // TODO Approach 2:
    val properties = new Properties()
    properties.put("user","aura")
    properties.put("password","aura")
    /**
      * def jdbc(url: String, table: String, properties: Properties): DataFrame = {
       jdbc(url, table, JDBCRelation.columnPartition(null), properties)
  }
      */
    val df2 = sqlContext.read.jdbc(
     "jdbc:mysql://hadoop1:3306/aura","people", properties
    )


    df.registerTempTable("people")
    sqlContext.sql("select * from people").show()

  }

}
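
Writing a DataFrame back to MySQL goes through the same JDBC path. A minimal sketch that reuses df2 and properties from the example above (the target table people_copy is just an assumption):

// Append the DataFrame's rows to a MySQL table over JDBC;
// "append" adds to the table and creates it first if it does not exist.
df2.write.mode("append").jdbc("jdbc:mysql://hadoop1:3306/aura", "people_copy", properties)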


7. SparkSQL Data Source: Hive

Hive is our most important data source; in fact, our SparkSQL data warehouse is implemented on top of Hive.

Note that the Hive here is not Hive on Spark; Hive is just one of SparkSQL's data sources.

(1) When writing SparkSQL code against a Hive data source in an IDEA project, add the following dependency:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>


val conf = new SparkConf().setAppName("sparkSQL").setMaster("local")
val sc = new SparkContext(conf)
val hiveSQLcontext = new HiveContext(sc)
hiveSQLcontext.sql("create table student(id int,name string) row format delimited fields terminated by ','")
hiveSQLcontext.sql("load data local inpath  'root/hadoop/student.txt' into table student")
hiveSQLcontext.sql("select *from student")


In the spark-shell, type :paste to enter multi-line edit mode.

Press Ctrl+D to exit multi-line edit mode.
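
A short illustration of what this looks like in the spark-shell (the pasted statements are just an example):

scala> :paste
// Entering paste mode (ctrl-D to finish)

val df = sqlContext.read.json("examples/src/main/resources/people.json")
df.printSchema()

// Exiting paste mode, now interpreting.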



Summary code:

package lesson02

import java.util.Properties

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

import scala.collection.mutable

object DataSource {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sparkSQL").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    /*
    Generic Load/Save Functions
    In the simplest form, the default data source
    (parquet unless otherwise configured by spark.sql.sources.default) will be used for all operations.
    SparkSQL's default file format is parquet
     */
    val df = sqlContext.read.load("examples\\src\\main\\resources\\users.parquet")
    df.select("name","favorite_color").write.save("c:\\namesAndFavColors.parquet")
    /*
    Manually Specifying Options
    You can also manually specify the data source that will be used along with any extra options that you
    would like to pass to the data source. Data sources are specified by their fully qualified name
     (i.e., org.apache.spark.sql.parquet), but for built-in sources you can also use their
    short names (json, parquet, jdbc). DataFrames of any type can be converted into other types using this syntax.
     */
    val df = sqlContext.read.format("json").load("examples\\src\\main\\resources\\people.json")
    df.select("name","age").write.format("parquet").save("c:\\namesAndages.parquet")

    /*
    Run SQL on files directly
    Instead of using read API to load a file into DataFrame and query it,
     you can also query that file directly with SQL.
     */
    val df = sqlContext.sql("select *from parquet.`examples\\src\\main\\resources\\users.parquet`")
    df.show()
    /*
    Save Modes
    Save operations can optionally take a SaveMode, that specifies how to handle existing data if present.
    It is important to realize that these save modes do not utilize any locking and are not atomic. Additionally,
    when performing a Overwrite, the data will be deleted before writing out the new data.
     */
    df.write.mode(SaveMode.Append).format("json").save("c:\\saveMode")
    df.write.mode(SaveMode.Overwrite).format("json").save("c:\\saveMode1")

    //jdbc
    //style1
    val map = new mutable.HashMap[String,String]()
    map.put("url","jdbc:mysql://hadoop03:3306/test")
    map.put("user","root")
    map.put("password","root")
    map.put("dbtable","student")
    val df1 = sqlContext.read.format("jdbc").options(
      map
    ).load()
    df1.registerTempTable("student")
    sqlContext.sql("select *from student").show()

    //style2
    val properties = new Properties()
    properties.put("user","root")
    properties.put("password","root")
    val df2 = sqlContext.read.jdbc(
      "jdbc:mysql://hadoop03:3306/test", "student", properties
    )
    df2.registerTempTable("student")
    sqlContext.sql("select *from student")


    //Hive
    val hiveSQLcontext = new HiveContext(sc)
    hiveSQLcontext.sql("create table student(id int,name string) row format delimited fields terminated by ','")
    hiveSQLcontext.sql("load data local inpath  'root/hadoop/student.txt' into table student")
    hiveSQLcontext.sql("select *from student")

  }




}


The code above uses Spark 1.6.0.

As of version 2.2.0, SparkSQL provides a unified entry point, SparkSession, for initialization; only when the data source is Hive do you need to configure extra parameters.

Non-Hive:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()




Hive:

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

Registering a DataFrame as a temporary table has also changed:

// Register the DataFrame as a global temporary view
df.createGlobalTempView("people")

// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
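
The session-scoped replacement for registerTempTable is createOrReplaceTempView. A minimal sketch, assuming df is loaded from the sample people.json as in the earlier examples:

// In Spark 2.x, registerTempTable is deprecated in favor of createOrReplaceTempView.
// The resulting view is visible only within this SparkSession (no global_temp prefix).
val df = spark.read.json("examples/src/main/resources/people.json")
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19").show()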

There are not many other changes; for details, see:

http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started


