Spark SQL Essentials and Integration with Hive
SQL on Hadoop
--SQL is the traditional standard for data analysis
Hive: the original SQL-on-Hadoop solution
Impala: like Hive, provides a way to write SQL queries against existing Hadoop data
Presto: similar to Impala; not backed by a major vendor
Shark: the predecessor of Spark SQL, designed to complement Hive
Phoenix: an open-source SQL query engine on top of HBase
Spark SQL Architecture
Spark SQL is one of Spark's core components (introduced with Spark 1.0 in April 2014)
Can directly access existing Hive data
Provides JDBC/ODBC interfaces so that third-party tools can process data through Spark
Offers higher-level interfaces that make data processing convenient
Supports multiple ways of working: SQL and programmatic APIs
Supports a variety of external data sources: Parquet, JSON, RDBMS, and more (see the sketch below)
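For example, in spark-shell the same JSON data can be queried either with SQL or with the DataFrame API. A minimal sketch; the path and the name/age columns are hypothetical:
scala> val df = spark.read.json("/data/people.json")
scala> df.createOrReplaceTempView("people")
scala> spark.sql("SELECT name FROM people WHERE age > 20").show
scala> df.filter($"age" > 20).select("name").show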
How Spark SQL Works
The Catalyst optimizer is the core of Spark SQL
![Spark SQL architecture diagram](https://i-blog.csdnimg.cn/blog_migrate/b95b152492a1a5f98e10d044881f300a.png)
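Catalyst turns each query into a parsed logical plan, an analyzed logical plan, an optimized logical plan, and finally a physical plan. All of these stages can be inspected with explain(true), as in this spark-shell sketch:
scala> spark.range(10).filter("id > 5").explain(true)
This prints the parsed, analyzed, and optimized logical plans followed by the physical plan that Spark will execute.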
Entry points to Spark SQL:
--SparkContext
--SQLContext
the programming entry point of Spark SQL in Spark 1.x
--HiveContext
a superset of SQLContext with additional functionality, such as HiveQL and access to Hive tables
--SparkSession (recommended since Spark 2.x)
SparkSession merges SQLContext and HiveContext
It provides a single entry point for interacting with Spark functionality and allows programming Spark with the DataFrame and Dataset APIs (see the sketch below)
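A minimal sketch of creating a SparkSession; enableHiveSupport() is only needed when Hive access is required:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("demo")
  .enableHiveSupport() // optional: enables HiveQL and access to the Hive metastore
  .getOrCreate()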
Dataset (Spark 1.6+)
1. createDataset() accepts a Seq, an Array, or an RDD
2. Dataset = RDD + schema, so a Dataset shares most of its functions with RDD, such as map and filter (demonstrated after the shell examples below)
scala> spark.createDataset(1 to 3).show
+-----+
|value|
+-----+
| 1|
| 2|
| 3|
+-----+
scala> spark.createDataset(List("a","b")).show
+-----+
|value|
+-----+
| a|
| b|
+-----+
scala> spark.createDataset(List(("a",1),("b",2))).show
+---+---+
| _1| _2|
+---+---+
| a| 1|
| b| 2|
+---+---+
scala> spark.createDataset(List("a","b")).show
+-----+
|value|
+-----+
| a|
| b|
+-----+
scala> spark.createDataset(List(("a",1),("b",2))).show
+---+---+
| _1| _2|
+---+---+
| a| 1|
| b| 2|
+---+---+
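As point 2 above notes, a Dataset supports the familiar RDD operators, so map and filter can be chained directly. A sketch in spark-shell:
scala> spark.createDataset(1 to 5).filter(_ % 2 == 1).map(_ * 10).show
+-----+
|value|
+-----+
|   10|
|   30|
|   50|
+-----+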
Creating a Dataset with a case class
Typical use case 1:
scala> case class Student(name:String,age:Int)
defined class Student
scala> Seq(Student("zhangsan",15),Student("lisi",20))
res0: Seq[Student] = List(Student(zhangsan,15), Student(lisi,20))
scala> Seq(Student("zhangsan",15),Student("lisi",20)).toDS
res1: org.apache.spark.sql.Dataset[Student] = [name: string, age: int]
scala> Seq(Student("zhangsan",15),Student("lisi",20)).toDS.show
+--------+---+
| name|age|
+--------+---+
|zhangsan| 15|
| lisi| 20|
+--------+---+
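Because the result is typed as Dataset[Student], its fields are available in plain Scala lambdas. A sketch continuing the same session:
scala> Seq(Student("zhangsan",15),Student("lisi",20)).toDS.filter(_.age > 18).show
+----+---+
|name|age|
+----+---+
|lisi| 20|
+----+---+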
Typical use case 2:
package cn.kgc.day0812.test02

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

object CreateDataSetDemo {
  case class Point(label: String, x: Double, y: Double)
  case class Category(id: Long, name: String)

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[2]").appName("test02").getOrCreate()
    val sc: SparkContext = spark.sparkContext
    import spark.implicits._

    val pointRDD: RDD[(String, Double, Double)] = sc.makeRDD(List(("bar", 3.0, 4.0), ("foo", 2.0, 2.5)))
    val categoryRDD: RDD[(Int, String)] = sc.makeRDD(List((1, "foo"), (2, "bar")))

    // Without a case class the columns get the default tuple names _1, _2, _3
    val ds1: Dataset[(String, Double, Double)] = pointRDD.toDS()
    ds1.show()

    // Mapping the tuples to case classes yields meaningful column names
    val ds2: Dataset[Point] = pointRDD.map(x => Point(x._1, x._2, x._3)).toDS()
    ds2.show()

    val ca1: Dataset[Category] = categoryRDD.map(line => Category(line._1, line._2)).toDS()

    // Join the two Datasets on point label == category name
    ds2.join(ca1, ds2("label") === ca1("name")).show()
  }
}
Output:
+---+---+---+
| _1| _2| _3|
+---+---+---+
|bar|3.0|4.0|
|foo|2.0|2.5|
+---+---+---+
+-----+---+---+
|label| x| y|
+-----+---+---+
| bar|3.0|4.0|
| foo|2.0|2.5|
+-----+---+---+
+-----+---+---+---+----+
|label| x| y| id|name|
+-----+---+---+---+----+
| bar|3.0|4.0| 2| bar|
| foo|2.0|2.5| 1| foo|
+-----+---+---+---+----+
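Note that join above returns an untyped DataFrame. To keep the case-class types, joinWith can be used instead, which yields a Dataset of pairs. A sketch using the same ds2 and ca1:
val joined: Dataset[(Point, Category)] =
  ds2.joinWith(ca1, ds2("label") === ca1("name"))
joined.show()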