SQL on Hadoop
Engines built on Hadoop that integrate SQL include the following:
- Hive
- Impala
- Presto
- Shark
- Phoenix
The Predecessor of Spark SQL
The original goal of Shark: run Hive on top of Spark
- Shark was a retrofit of Hive and inherited a large amount of Hive code, which made optimization and maintenance very troublesome
Spark SQL Architecture
- Spark SQL is one of Spark's core components
- it can directly access existing Hive data
- it provides JDBC/ODBC interfaces so that third parties can use Spark for data processing
- it offers higher-level interfaces that make data processing easier
- it supports multiple ways of working: SQL and programmatic APIs
- it supports multiple external data sources: Parquet, JSON, RDBMS, etc.
How Spark SQL Works
The Catalyst optimizer is the core of Spark SQL
Catalyst Optimizer: converts the logical plan into a physical plan
- Logical plan
e.g.
select name from
(
  select id, name from people
) p
where p.id = 1
- Optimization
  - the filter initially sits above the projection
  - check whether the filter can be pushed down below the projection (predicate pushdown)
- Physical optimization (generate the physical plan)
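To see what Catalyst does with this query, here is a minimal sketch in spark-shell style (assuming a SparkSession named spark; the sample rows and view are made up for illustration). explain(true) prints the parsed, analyzed and optimized logical plans plus the physical plan, so you can check whether the filter on id was pushed below the projection:

import spark.implicits._

// Register a small hypothetical `people` view with id and name columns
Seq((1, "Michael"), (2, "Andy")).toDF("id", "name").createOrReplaceTempView("people")

val query = spark.sql(
  """select name from
    |( select id, name from people ) p
    |where p.id = 1""".stripMargin)

// extended = true prints parsed/analyzed/optimized logical plans and the physical plan
query.explain(true)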
Spark SQL API
- SparkContext
- SQLContext
  - the programming entry point of Spark SQL
- HiveContext
  - a superset of SQLContext that adds Hive support and more functionality
- SparkSession
  - SparkSession: merges SQLContext and HiveContext
  - provides a single entry point for interacting with Spark functionality, and allows programming Spark with the DataFrame and Dataset APIs
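A minimal sketch of creating a SparkSession as that single entry point (the object name here is made up; enableHiveSupport() is optional and only needed when you want to access existing Hive data):

package SparkSQL

import org.apache.spark.sql.SparkSession

object DemoSparkSession {   // hypothetical name, not from the original post
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("DemoSparkSession")
      .enableHiveSupport()   // optional: enables access to existing Hive tables
      .getOrCreate()

    // What used to live on SQLContext/HiveContext is now available on spark directly,
    // e.g. spark.sql(...), spark.read, spark.createDataFrame(...)
    spark.sql("show databases").show()

    spark.stop()
  }
}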
DataSet
- a strongly typed collection of domain-specific objects
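As a minimal sketch of what "strongly typed" means in practice (the object and case class names below are made up, not from the original post), field access on a Dataset is checked at compile time rather than by column-name strings:

package SparkSQL

import org.apache.spark.sql.SparkSession

object DemoTypedDataset {   // hypothetical name
  case class Person(name: String, age: Long)   // hypothetical domain object

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("DemoTypedDataset")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Person("Andy", 30), Person("Justin", 19)).toDS()
    // Field access is checked at compile time; a typo such as _.agee would not compile
    ds.map(_.age + 1).show()
  }
}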
Creating a DataSet
- create a DataSet from a Seq
- create a DataSet from a List
- create a DataSet from an RDD
e.g.
package SparkSQL

import org.apache.spark.sql.SparkSession

object Demo1_CreateDataSet {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val spark = SparkSession.builder()
      .master("local[4]")               // set the master
      .appName(this.getClass.getName)   // set the appName
      .getOrCreate()
    // Get the SparkContext from the SparkSession
    val sc = spark.sparkContext
    // The implicit conversions must be imported
    import spark.implicits._
    // Create a DataSet from a Seq
    val seqDs = spark.createDataset(1 to 10)
    seqDs.show()
    // Create a DataSet from a List
    val listDs = spark.createDataset(List(("a", 1), ("b", 2), ("c", 3)))
    listDs.show()
    // Create a DataSet from an RDD
    val rddDs = spark.createDataset(sc.parallelize(List(("a", 1), ("b", 2), ("c", 3))))
    rddDs.show()
  }
}
result:
"C:\Program Files\Java\jdk1.8.0_231\bin\java.exe" ...
+-----+
|value|
+-----+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
+-----+
+---+---+
| _1| _2|
+---+---+
| a| 1|
| b| 2|
| c| 3|
+---+---+
+---+---+
| _1| _2|
+---+---+
| a| 1|
| b| 2|
| c| 3|
+---+---+
Creating a DataSet with a Case Class
package SparkSQL

import org.apache.spark.sql.SparkSession

object Demo2_CreateDataSetByCaseClass {
  case class Point(label: String, x: Double, y: Double)
  case class Category(id: Long, name: String)

  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val spark = SparkSession.builder()
      .master("local[4]")               // set the master
      .appName(this.getClass.getName)   // set the appName
      .getOrCreate()
    // The implicit conversions must be imported
    import spark.implicits._
    // Create a DataSet from the Point case class
    val points = Seq(Point("bar", 2.6, 3.5), Point("foo", 4.0, 3.7)).toDS()
    // Create a DataSet from the Category case class
    val categories = Seq(Category(1, "bar"), Category(2, "foo")).toDS()
    // Join them; note that === (three equals signs) is a Column method, not Scala's ==
    points.join(categories, points("label") === categories("name")).show()
  }
}
"C:\Program Files\Java\jdk1.8.0_231\bin\java.exe" ...
+-----+---+---+---+----+
|label| x| y| id|name|
+-----+---+---+---+----+
| bar|2.6|3.5| 1| bar|
| foo|4.0|3.7| 2| foo|
+-----+---+---+---+----+
RDD->DataSet
package SparkSQL

import org.apache.spark.sql.SparkSession

object Demo2_CreateDataSetByCaseClass {
  case class Point(label: String, x: Double, y: Double)
  case class Category(id: Long, name: String)

  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val spark = SparkSession.builder()
      .master("local[4]")               // set the master
      .appName(this.getClass.getName)   // set the appName
      .getOrCreate()
    import spark.implicits._
    val sc = spark.sparkContext
    // Method 2, the usual approach in practice: map an RDD onto a case class, then convert it to a DataSet
    // Create an RDD with the Point data
    val pointsRDD = sc.parallelize(List(("bar", 2.6, 3.5), ("foo", 4.0, 3.7)))
    // Create an RDD with the Category data
    val categoriesRDD = sc.parallelize(List((1, "bar"), (2, "foo")))
    // Associate the two RDDs with the case classes
    val pointDs = pointsRDD.map(x => Point(x._1, x._2, x._3)).toDS()
    val categoriesDs = categoriesRDD.map(x => Category(x._1, x._2)).toDS()
    // Join the two DataSets and print the result
    pointDs.join(categoriesDs, pointDs("label") === categoriesDs("name")).show()
  }
}
result:
"C:\Program Files\Java\jdk1.8.0_231\bin\java.exe" ...
+-----+---+---+---+----+
|label| x| y| id|name|
+-----+---+---+---+----+
| bar|2.6|3.5| 1| bar|
| foo|4.0|3.7| 2| foo|
+-----+---+---+---+----+
DataFrame
- DataFrame = Dataset[Row]
- similar to a two-dimensional table in a traditional database
- adds a Schema (information about the data's structure) on top of an RDD
- the DataFrame schema supports nested data types (see the sketch after this list)
  - struct
  - map
  - array
- provides more SQL-like operations in its API
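For illustration, a minimal sketch of such a nested schema built with StructType (the field names here are made up, not from the original post):

import org.apache.spark.sql.types._

// Hypothetical nested schema with a struct, an array and a map field
val nestedSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("address", StructType(Seq(                        // struct
    StructField("city", StringType, nullable = true),
    StructField("street", StringType, nullable = true)
  )), nullable = true),
  StructField("tags", ArrayType(StringType), nullable = true),  // array
  StructField("scores", MapType(StringType, DoubleType), nullable = true)  // map
))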
Creating a DataFrame
- load a JSON file as a DataFrame (the assumed file content is shown below)
- common DataFrame API operations
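The example below reads data/people.json. The file itself is not shown in the original post; judging from the output, it presumably matches the standard Spark sample file, one JSON object per line:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}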
e.g.
package SparkSQL

import org.apache.spark.sql.SparkSession

object Demo5_DataFrameCommonOps {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val spark = SparkSession.builder()
      .master("local[4]")               // set the master
      .appName(this.getClass.getName)   // set the appName
      .getOrCreate()
    import spark.implicits._
    val sc = spark.sparkContext
    val df = spark.read.json("data/people.json")
    df.printSchema()
    // Like SELECT in SQL: project individual columns
    df.select("name").show()
    // Apply a simple calculation to a column
    df.select(df("name"), df("age") + 1).show()
    // Like WHERE in SQL: filter rows (DataFrame also has a where method that behaves like filter)
    df.filter(df("age") > 21).show()
    // groupBy: aggregate the data, e.g. count, max, min, ...
    df.groupBy("age").count().show()
    // Create a temporary view
    df.createTempView("people")
    // Run an ordinary SQL query against it
    spark.sql("select * from people").show()
  }
}
result:
"C:\Program Files\Java\jdk1.8.0_231\bin\java.exe" ...
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
+-------+
| name|
+-------+
|Michael|
| Andy|
| Justin|
+-------+
+-------+---------+
| name|(age + 1)|
+-------+---------+
|Michael| null|
| Andy| 31|
| Justin| 20|
+-------+---------+
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
+----+-----+
| age|count|
+----+-----+
| 19| 1|
|null| 1|
| 30| 1|
+----+-----+
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
RDD->DataFrame
- Method 1: convert an RDD to a DataFrame via a case class
- Method 2: associate an RDD with schema information to obtain a DataFrame
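The example below reads data2/people.txt, whose content is not shown in the post; judging from the output it presumably looks like:

zhangsan,14
lisi,20
wangwu,30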
e.g.
package SparkSQL

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object Demo6_RDDToDataFrame {
  // Define a People case class
  case class People(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val spark = SparkSession.builder()
      .master("local[4]")               // set the master
      .appName(this.getClass.getName)   // set the appName
      .getOrCreate()
    val sc = spark.sparkContext
    import spark.implicits._
    val peopleRDD = sc.textFile("data2/people.txt")

    // Method 1: convert the RDD to a DataFrame via the case class
    val peopleDF = peopleRDD.map(_.split(","))
      .map(x => People(x(0), x(1).toInt))
      .toDF()
    // peopleDF.show()
    // Create a temporary view
    peopleDF.createTempView("people")
    spark.sql("select * from people where name = 'zhangsan'").show()

    // Method 2: associate the RDD with schema information to obtain a DataFrame
    // Build the schema with StructType; each StructField describes one field:
    // the first argument is the field name, the second its type, the third whether it is nullable
    val schema = StructType(Array(
      StructField("name", StringType, true),
      StructField("age", IntegerType, true)
    ))
    // Split each line into an Array and convert the RDD to RDD[Row]
    val peopleRowRDD = peopleRDD.map(_.split(",")).map(x => Row(x(0), x(1).toInt))
    // Associate the Row RDD with the schema to create a DataFrame
    val df = spark.createDataFrame(peopleRowRDD, schema)
    df.createOrReplaceTempView("people2")
    spark.sql("select * from people2").show()
  }
}
result:
"C:\Program Files\Java\jdk1.8.0_231\bin\java.exe" ...
+--------+---+
| name|age|
+--------+---+
|zhangsan| 14|
+--------+---+
+--------+---+
| name|age|
+--------+---+
|zhangsan| 14|
| lisi| 20|
| wangwu| 30|
+--------+---+
DataFrame->RDD
- use the .rdd method to convert a DataFrame into an RDD
e.g.
package SparkSQL

import org.apache.spark.sql.SparkSession

object Demo7_DateFramToRDD {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val spark = SparkSession.builder()
      .master("local[4]")               // set the master
      .appName(this.getClass.getName)   // set the appName
      .getOrCreate()
    import spark.implicits._
    val sc = spark.sparkContext
    val df = spark.read.json("data/people.json")
    // Use the .rdd method to convert the DataFrame into an RDD
    // The resulting RDD has type RDD[Row], i.e. it still carries some structure
    df.rdd.foreach(println)
  }
}
result:
"C:\Program Files\Java\jdk1.8.0_231\bin\java.exe" ...
[null,Michael]
[30,Andy]
[19,Justin]
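Since the converted RDD has type RDD[Row], individual fields can be pulled out of each Row by name or position. A minimal sketch, assuming the same df as in the example above:

// Extract the name field from each Row; getAs returns the value with the requested type
val names = df.rdd.map(row => row.getAs[String]("name"))
names.foreach(println)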
Spark SQL and External Data Sources
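The original post stops at this heading. As a hedged sketch of what reading and writing external sources typically looks like with the DataFrameReader/DataFrameWriter API (assuming a SparkSession named spark; all paths and connection settings below are made-up placeholders):

// Parquet (assumed path)
val parquetDF = spark.read.parquet("data/people.parquet")
parquetDF.write.mode("overwrite").parquet("out/people.parquet")

// JSON (assumed path)
val jsonDF = spark.read.json("data/people.json")

// CSV with a header row (assumed path)
val csvDF = spark.read.option("header", "true").csv("data/people.csv")

// RDBMS over JDBC (URL, table and credentials are placeholders)
val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/test")
  .option("dbtable", "people")
  .option("user", "root")
  .option("password", "******")
  .load()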
PS: If anything here is wrong or could be written better, please leave your valuable comments or suggestions in the comment section, much appreciated! And if this post helped you, I'd be grateful if you gave it a like!
Original author: wsjslient