1. Introduction
1.1 What is Spark SQL
Spark SQL is the Spark module for processing structured data. Structured data means data that carries schema information: every record shares a known set of fields (an RDD, by contrast, has no schema).
Because Spark SQL optimizes the execution plan, it generally runs more efficiently than the equivalent code written directly against RDDs.
1.2 What is a DataFrame
Like an RDD, a DataFrame is a distributed data container. But a DataFrame is closer to a two-dimensional table in a traditional database: besides the data itself, it also records the schema of the data.
1.3 What is a DataSet
- It is an extension of DataFrame and Spark's newest data abstraction. A DataSet is object-oriented: each row of data is an object.
- A case class can be used to define the schema of a DataSet; each field name in the case class maps directly to a field name in the DataSet.
- A DataSet is strongly typed, e.g. DataSet[Person] or DataSet[Car]. A DataFrame is a special case of DataSet: DataFrame = DataSet[Row], so a DataFrame can be converted to a DataSet with the as method. Row is itself a type that can represent any table schema.
- A DataFrame only knows the field names, not their types, so type errors cannot be caught at compile time; a DataSet knows both the field names and their types, which gives stricter error checking (see the sketch after this list).
- Converting a DataFrame to an RDD is lossy: you only get an RDD[Row] and cannot recover the original element type, whereas a DataSet converts back to an RDD of its element type.
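A minimal sketch of the compile-time difference, assuming a Person case class and a local SparkSession (the names here are illustrative, not taken from the sections below):

import org.apache.spark.sql.SparkSession

object TypeSafetyDemo {
  case class Person(name: String, age: Int, sex: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("typeSafety").getOrCreate()
    import spark.implicits._

    val df = Seq(Person("zhangsan", 20, "nan")).toDF()   // DataFrame = Dataset[Row]
    val ds = df.as[Person]                                // Dataset[Person]

    // df.select("aeg")   // typo in a column name: compiles, fails only at runtime
    // ds.map(_.aeg)      // typo in a field name: rejected at compile time
    ds.map(_.age + 1).show()                              // field access is type-checked

    spark.stop()
  }
}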
2. Spark SQL Programming
2.1 Creating a DataFrame from a JSON file
//Other data sources can be read the same way; just change the method called after read to match the file type
scala> val df = spark.read.json("file:///home/datas/person.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.show
+---+--------+
|age|    name|
+---+--------+
| 20|zhangsan|
| 30|    lisi|
| 40|  wangwu|
+---+--------+
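For reference, the person.json read above presumably contains one JSON object per line, along these lines (the exact file is not shown here, so this is an assumption based on the output above):

{"name": "zhangsan", "age": 20}
{"name": "lisi", "age": 30}
{"name": "wangwu", "age": 40}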
2.2 Creating a view and running SQL
//Create a global temporary view
scala> df.createOrReplaceGlobalTempView("person2")
//A new session can only access global views, not session-scoped temp views
scala> spark.newSession().sql("select * from global_temp.person2")
res19: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
//Global views are referenced as global_temp.<view name>
scala> val df2 = spark.newSession().sql("select * from global_temp.person2")
df2: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df2.show
+---+--------+
|age|    name|
+---+--------+
| 20|zhangsan|
| 30|    lisi|
| 40|  wangwu|
+---+--------+
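For comparison, createOrReplaceTempView (used again in 2.8) creates a session-scoped view; a minimal sketch on the same df, with outputs omitted:

scala> df.createOrReplaceTempView("person")
//The view is visible in the current session:
scala> spark.sql("select * from person where age > 25").show()
//but a new session cannot see it, so this would fail with an AnalysisException:
scala> spark.newSession().sql("select * from person")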
2.3 Converting an RDD to a DataFrame
2.3.1 Method 1: convert the tuples directly, specifying the column names in toDF
scala> val rdd = spark.sparkContext.makeRDD( List( ("zhangsan", 20, "nan"), ("lisi", 30, "nv"), ("wangwu", 40, "nan") ) )
rdd: org.apache.spark.rdd.RDD[(String, Int, String)] = ParallelCollectionRDD[5] at makeRDD at <console>:23
scala> val df = rdd.toDF("name", "age", "sex")
df: org.apache.spark.sql.DataFrame = [name: string, age: int ... 1 more field]
scala> df.show
+--------+---+---+
|    name|age|sex|
+--------+---+---+
|zhangsan| 20|nan|
|    lisi| 30| nv|
|  wangwu| 40|nan|
+--------+---+---+
2.3.2 Method 2: wrap the RDD records in a case class
//Define the case class
scala> case class Person(name: String, age: Int, sex: String)
defined class Person
//Wrap each record in the case class; the RDD element type is now Person
scala> val personRDD = rdd.map( info => Person(info._1, info._2, info._3) )
personRDD: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[10] at map at <console>:27
//Convert the RDD to a DataFrame
scala> val df3 = personRDD.toDF
//Note: the schema still shows [name: string, age: int, sex: string]; a DataSet would instead be typed as Dataset[Person]
df3: org.apache.spark.sql.DataFrame = [name: string, age: int ... 1 more field]
//Converting the DataFrame back to an RDD loses the original element type; the elements become Row
scala> val rdd2 = df3.rdd
rdd2: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[5] at rdd at <console>:25
scala> df3.show
+--------+---+---+
|    name|age|sex|
+--------+---+---+
|zhangsan| 20|nan|
|    lisi| 30| nv|
|  wangwu| 40|nan|
+--------+---+---+
2.4 Converting a DataFrame to an RDD
scala> val df = spark.read.json("file:///home/datas/person.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.show
+---+--------+
|age|    name|
+---+--------+
| 20|zhangsan|
| 30|    lisi|
| 40|  wangwu|
+---+--------+
scala> val dfRDD = df.rdd
dfRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[52] at rdd at <console>:25
scala> dfRDD.collect
res14: Array[org.apache.spark.sql.Row] = Array([20,zhangsan], [30,lisi], [40,wangwu])
//Fields of a Row are accessed by index
scala> dfRDD.foreach(row => println( row(1) ) )
zhangsan
lisi
wangwu
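Fields of a Row can also be accessed by name with getAs, which reads better than positional indexing; a small sketch on the same dfRDD:

scala> dfRDD.map(row => row.getAs[String]("name")).collect
//expected result: Array(zhangsan, lisi, wangwu)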
2.5 Converting an RDD to a DataSet
2.5.1 Going through a DataFrame as an intermediate step
scala> case class Person(name: String, age: Int, sex: String)
defined class Person
scala> val rdd = spark.sparkContext.makeRDD( List( ("zhangsan", 20, "nan"), ("lisi", 30, "nv"), ("wangwu", 40, "nan") ) )
rdd: org.apache.spark.rdd.RDD[(String, Int, String)] = ParallelCollectionRDD[0] at makeRDD at <console>:23
//1. First convert to a DataFrame, which adds a schema on top of the RDD
scala> val df = rdd.toDF("name", "age", "sex")
df: org.apache.spark.sql.DataFrame = [name: string, age: int ... 1 more field]
//2. Then convert the DataFrame to a DataSet, which adds a type on top of the DataFrame
scala> val ds = df.as[Person]
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: int ... 1 more field]
2.5.2 Converting the RDD to a DataSet directly
//1. Define a case class, which defines the type
scala> case class Person(name: String, age: Int, sex: String)
defined class Person
scala> val rdd = spark.sparkContext.makeRDD( List( ("zhangsan", 20, "nan"), ("lisi", 30, "nv"), ("wangwu", 40, "nan") ) )
rdd: org.apache.spark.rdd.RDD[(String, Int, String)] = ParallelCollectionRDD[35] at makeRDD at <console>:23
//2. Give the RDD records structure by wrapping them in the case class
scala> val personRDD = rdd.map( info => Person(info._1, info._2, info._3) )
personRDD: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[36] at map at <console>:27
//3. Convert
scala> val ds = personRDD.toDS
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: int ... 1 more field]
scala> ds.show
+--------+---+---+
|    name|age|sex|
+--------+---+---+
|zhangsan| 20|nan|
|    lisi| 30| nv|
|  wangwu| 40|nan|
+--------+---+---+
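Because the DataSet is typed, ordinary lambdas over Person work directly; a small sketch on the ds just built (the spark shell already has spark.implicits._ in scope):

scala> ds.filter(_.age > 25).map(_.name).show()
//keeps lisi and wangwu, whose ages are greater than 25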
2.6 Converting a DataSet to an RDD
//Note: the resulting RDD still has element type Person, unlike the Row elements you get from a DataFrame
scala> val dsRDD = ds.rdd
dsRDD: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[56] at rdd at <console>:25
scala> dsRDD.collect
res16: Array[Person] = Array(Person(zhangsan,20,nan), Person(lisi,30,nv), Person(wangwu,40,nan))
//The object's fields can be accessed directly
scala> dsRDD.foreach(person => println(person.name) )
zhangsan
wangwu
lisi
2.7 Doing the conversions in an IDE
//The approach is the same as above; the key point is that the implicit conversions must be imported
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object TransformTest {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder()
      .master("local[*]")
      .appName("transform")
      .getOrCreate()

    // When developing in an IDE the implicit conversions must be imported explicitly.
    // Here spark is not a package name but the name of the SparkSession object created above.
    import spark.implicits._

    val rdd: RDD[(String, Int, String)] =
      spark.sparkContext.makeRDD(List(("zhangsan", 30, "nan"), ("lisi", 40, "nv"), ("wangwu", 50, "nan")))

    // rdd -> df
    val df = rdd.toDF("name", "age", "sex")
    df.show()

    // df -> ds
    val ds = df.as[Person]
    ds.show()
  }

  case class Person(name: String, age: Int, sex: String)
}
2.8 UDF
import org.apache.spark.sql.SparkSession

object UDF {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("udf")
      .getOrCreate()

    // 1. Register the user-defined function
    spark.udf.register("addName", (x: String) => "name: " + x)

    // Read the file
    val df = spark.read.json("spark-review/input/person.json")

    // Create a view
    df.createOrReplaceTempView("person")

    // 2. Run SQL that uses the UDF
    spark.sql("select addName(name), age from person").show()
    /**
     * +-----------------+---+
     * |UDF:addName(name)|age|
     * +-----------------+---+
     * |   name: zhangsan| 20|
     * |       name: lisi| 30|
     * |     name: wangwu| 40|
     * +-----------------+---+
     */
  }
}
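A UDF can also be applied through the DataFrame API instead of SQL. A minimal sketch assuming the same spark and df as above (the column name withName is made up for illustration):

import org.apache.spark.sql.functions.udf

val addName = udf((x: String) => "name: " + x)
df.select(addName(df("name")).as("withName"), df("age")).show()
//or add it as an extra column:
df.withColumn("withName", addName(df("name"))).show()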
2.9 UDAF
2.9.1 Weakly typed
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, DoubleType, IntegerType, LongType, StructType}

object UDAF {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("udaf")
      .getOrCreate()

    // 1. Register the user-defined aggregate function
    spark.udf.register("myAverage", MyAverage)

    // Read the file; for the weakly-typed version a DataFrame is all we need
    val df = spark.read.json("spark-review/input/person.json")

    // Create a view
    df.createOrReplaceTempView("person")

    // 2. Run the SQL
    spark.sql("select myAverage(age) as avgAge from person").show()
    /**
     * +------+
     * |avgAge|
     * +------+
     * |  30.0|
     * +------+
     */
  }
}

// Extend the abstract class UserDefinedAggregateFunction
object MyAverage extends UserDefinedAggregateFunction {
  // Schema of the input
  override def inputSchema: StructType = {
    new StructType()
      .add("age", IntegerType)
  }
  // Schema of the aggregation buffer; these fields act like local variables
  override def bufferSchema: StructType = {
    new StructType()
      .add("sum", LongType)
      .add("count", IntegerType)
  }
  // Type of the return value
  override def dataType: DataType = {
    DoubleType
  }
  // Whether the same input always produces the same output
  override def deterministic: Boolean = {
    true
  }
  // Initialize the buffer
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0
  }
  // Update the buffer with one input row
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getLong(0) + input.getInt(0)
    buffer(1) = buffer.getInt(1) + 1
  }
  // Merge two buffers
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getInt(1) + buffer2.getInt(1)
  }
  // Compute the final result
  override def evaluate(buffer: Row): Any = {
    buffer.getLong(0).toDouble / buffer.getInt(1)
  }
}
2.9.2 Strongly typed
In the weakly-typed version the buffer fields can only be accessed by index, which hurts readability and is error-prone; the strongly-typed Aggregator works with case classes instead.
import org.apache.spark.sql.{Encoder, Encoders, SparkSession, TypedColumn}
import org.apache.spark.sql.expressions.Aggregator

object udaf_class {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("udaf")
      .getOrCreate()
    import spark.implicits._

    // Note: read.json returns a DataFrame, so convert it to a Dataset with as
    val ds = spark.read.json("spark-review/input/person.json").as[Person]

    // The typed column takes a Person as input and produces a Double
    val avgAge: TypedColumn[Person, Double] = MyAverage.toColumn.name("avgAge")
    val result = ds.select(avgAge)
    result.show()
    /**
     * +------+
     * |avgAge|
     * +------+
     * |  30.0|
     * +------+
     */
  }
}

// Input type (age is BigInt because JSON integers are read as bigint)
case class Person(name: String, age: BigInt)
// Buffer type
case class Average(var sum: BigInt, var count: Int)

// Aggregator[IN, BUF, OUT]
object MyAverage extends Aggregator[Person, Average, Double] {
  // Initialize the buffer
  override def zero: Average = {
    Average(0L, 0)
  }
  // Update the buffer with one input value
  override def reduce(b: Average, a: Person): Average = {
    b.sum = b.sum + a.age
    b.count = b.count + 1
    b
  }
  // Merge two buffers
  override def merge(b1: Average, b2: Average): Average = {
    b1.sum = b1.sum + b2.sum
    b1.count = b1.count + b2.count
    b1
  }
  // Compute the final result
  override def finish(reduction: Average): Double = {
    reduction.sum.toDouble / reduction.count
  }
  // Encoders for the buffer and output types
  override def bufferEncoder: Encoder[Average] = {
    Encoders.product
  }
  override def outputEncoder: Encoder[Double] = {
    Encoders.scalaDouble
  }
}
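Note: on Spark 3.0+ the UserDefinedAggregateFunction API from 2.9.1 is deprecated, and an Aggregator can also be registered for use in SQL via functions.udaf. A minimal sketch under that assumption (MyAverageSql and AvgBuffer are made-up names; the Aggregator is written over the raw column value rather than over Person so it can be called as myAverage(age) in SQL):

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions

case class AvgBuffer(var sum: Long, var count: Long)

object MyAverageSql extends Aggregator[Long, AvgBuffer, Double] {
  override def zero: AvgBuffer = AvgBuffer(0L, 0L)
  override def reduce(b: AvgBuffer, age: Long): AvgBuffer = { b.sum += age; b.count += 1; b }
  override def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer = {
    b1.sum += b2.sum; b1.count += b2.count; b1
  }
  override def finish(b: AvgBuffer): Double = b.sum.toDouble / b.count
  override def bufferEncoder: Encoder[AvgBuffer] = Encoders.product
  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

//Registration and the SQL look the same as in 2.9.1
//(import spark.implicits._ must be in scope to supply the Encoder[Long] that udaf needs):
//spark.udf.register("myAverage", functions.udaf(MyAverageSql))
//spark.sql("select myAverage(age) as avgAge from person").show()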
2.10 Reading and writing over JDBC with a DataFrame
import java.util.Properties
import org.apache.spark.sql.SparkSession

object sparkJDBC {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("sparkJDBC")
      .getOrCreate()

    // Connection properties
    val properties: Properties = new Properties()
    properties.put("user", "root")
    properties.put("password", "123456")

    // Read the user table into a DataFrame
    val jdbcDF = spark.read.jdbc("jdbc:mysql://localhost:3306/sparkTest", "user", properties)
    jdbcDF.show()

    // Write it out to the user2 table
    jdbcDF.write.jdbc("jdbc:mysql://localhost:3306/sparkTest", "user2", properties)
  }
}
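The same read and write can be expressed with the option-based API, which also lets you choose a save mode so re-running the write does not fail when user2 already exists; a minimal sketch with the same URL, tables, and credentials as above:

val jdbcDF2 = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/sparkTest")
  .option("dbtable", "user")
  .option("user", "root")
  .option("password", "123456")
  .load()

jdbcDF2.write.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/sparkTest")
  .option("dbtable", "user2")
  .option("user", "root")
  .option("password", "123456")
  .mode("append")   // the default mode, errorIfExists, would fail on an existing table
  .save()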