Spark SQL
Spark SQL is an ETL (Extract, Transform, Load) tool built on top of Spark RDDs, much as Hive 1.x was built on top of MapReduce. It differs from the plain RDD API in that the Spark SQL API gives the Spark engine extra information about the computation (the structure of the data and the transformation operators), which the engine can use to optimize the underlying job. So far, Spark SQL offers two styles of interactive API: the Dataset API and SQL scripts.
Dataset API: an enhanced version of the RDD API. It supports RDD operators such as map, flatMap and filter, as well as SQL-style operations such as select, where, groupBy and cube. Because the caller must know the type of the data being operated on, this style is usually called the strongly-typed (strong-type) API.
val wordPairRDD = spark.sparkContext.textFile("file:///D:/demo/words")
.flatMap(_.split("\\s+"))
.map((_, 1))
// Convert the RDD into a Dataset
wordPairRDD.toDS()
.groupByKey(pair=>pair._1)
.agg(typed.sum[(String,Int)](t=>t._2).name("total"))
.show()
SQL scripts: the user does not need to touch the underlying Dataset API; every operation works on named columns, so there is no need to care about the concrete type of the data being processed. This style is therefore usually called untyped.
word num
a 1
b 1
a 1
select word, sum(num) from t_words group by word
Reference: http://spark.apache.org/docs/latest/sql-programming-guide.html
The central abstractions in Spark SQL are the Dataset and the DataFrame (enhanced RDDs). A Dataset is a distributed collection of data, added as a new interface in Spark 1.6. It combines the benefits of RDDs (strong typing, powerful lambda functions) with the advantages of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated with functional transformations (map, flatMap, filter and so on). The Dataset API is currently well supported in Scala and Java; Python support is still incomplete.
A DataFrame is a Dataset organized into named columns, a special Dataset that is conceptually equivalent to a table in a relational database. DataFrames can be constructed from many sources, such as structured data files (JSON, CSV, etc.), Hive tables (for integrating with legacy systems) or external databases. A DataFrame is a dataset of Row values, i.e. a DataFrame is simply a Dataset[Row].
Quick Start
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.3</version>
</dependency>
SQL script
//1. Create the SparkSession
val spark = SparkSession.builder().appName("word count")
.master("local[6]")
.getOrCreate()
//2. Import the common implicit conversions/enrichments that turn RDDs into DataFrames/Datasets
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL") // suppress log output
// Create the RDD
val wordPairRDD = spark.sparkContext.textFile("file:///D:/demo/words")
.flatMap(_.split("\\s+"))
.map((_, 1))
// Convert the RDD into a DataFrame
val wordDataframe = wordPairRDD.toDF("word", "num")
// Register wordDataframe as a temporary view t_words
wordDataframe.createOrReplaceTempView("t_words")
// Query t_words with a SQL script
spark.sql("select word,sum(num) total from t_words group by word order by total desc")
.show()
//3. Stop the SparkSession
spark.stop()//close
Named-column API
//1. Create the SparkSession
val spark = SparkSession.builder().appName("word count")
.master("local[6]")
.getOrCreate()
//2. Import the common implicit conversions/enrichments that turn RDDs into DataFrames
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL") // suppress log output
// Create the RDD
val wordPairRDD = spark.sparkContext.textFile("file:///D:/demo/words")
.flatMap(_.split("\\s+"))
.map((_, 1))
// Convert the RDD into a DataFrame
import org.apache.spark.sql.functions._
wordPairRDD.toDF("word","num")
.groupBy($"word")
.agg(sum($"num") as "total")
.orderBy($"total" desc)
.show()
//3. Stop the SparkSession
spark.stop()//close
Strongly typed API (typed): not recommended
//1. Create the SparkSession
val spark = SparkSession.builder().appName("word count")
.master("local[6]")
.getOrCreate()
//2. Import the common implicit conversions/enrichments
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL") // suppress log output
// Create the RDD
val wordPairRDD = spark.sparkContext.textFile("file:///D:/demo/words")
.flatMap(_.split("\\s+"))
.map((_, 1))
// Convert the RDD into a Dataset
wordPairRDD.toDS()
.groupByKey(pair=>pair._1)
.agg(typed.sum[(String,Int)](t=>t._2).name("total"))
.rdd
.sortBy(_._2,false,3)
.toDF("word","total")
.show()
//3. Stop the SparkSession
spark.stop()//close
Building Datasets and DataFrames
Dataset √
A Dataset is a special, enhanced RDD. The difference from a plain RDD is that Spark SQL maintains its own serialization/deserialization machinery, which avoids the performance cost that repeated serialization imposes on compute nodes. Because Spark SQL favors untyped operations, the user does not need to know the type of the data being processed and only has to supply operators over named columns, which further improves per-node performance.
rdd
> sc.textFile("hdfs://...").flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_)
dataset
> val wordPairRDD = spark.sparkContext.textFile("file:///D:/demo/words")
.flatMap(_.split("\\s+"))
.map((_, 1))
> wordPairRDD.toDF("word","num") // no need to know the element type; just supply column names
.groupBy($"word")
.agg(sum($"num") as "total")
.orderBy($"total" desc)
.show()
- case-class
case class Person(id:Int,name:String,age:Int,dept:Int)
List(Person(1,"zs",18,1),Person(2,"ls",18,1)).toDS().show()
+---+----+---+----+
| id|name|age|dept|
+---+----+---+----+
| 1| zs| 18| 1|
| 2| ls| 18| 1|
+---+----+---+----+
- tuple
List((1,"zs",18,1),(2,"ls",18,1)).toDS().show()
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| zs| 18| 1|
| 2| ls| 18| 1|
+---+---+---+---+
- JSON data
case class User(id:BigInt,name:String,age:Long,dept:Long)
spark.read.json("file:///D:/demo/json").as[User]
.show()
- RDD (omitted)
The elements of the RDD must be tuples or case classes before a Dataset can be created from it directly via toDS()
DataFrame √
A DataFrame is a Dataset organized into named columns, conceptually equivalent to a table in a relational database. DataFrames can be constructed from many sources, such as structured data files, Hive tables or external databases. A DataFrame is a dataset of Row values, i.e. a DataFrame is simply a Dataset[Row].
- case-class
List(Person(1,"zs",18,1),Person(2,"ls",18,1)).toDF().show()
- tuple
List((1,"zs",18,1),(2,"ls",18,1)).toDF("id","name","age","dept").show()
- JSON data
spark.read.json("file:///D:/demo/json")
.show()
- Creating from an RDD
- RDD[Row]
val rowRDD:RDD[Row] = spark.sparkContext.makeRDD(List((1, "zs", 18, 1), (2, "ls", 18, 1)))
.map(t => Row(t._1, t._2, t._3, t._4))
val schema = new StructType()
.add("id",IntegerType)
.add("name",StringType)
.add("age",IntegerType)
.add("dept",IntegerType)
spark.createDataFrame(rowRDD,schema)
.show()
- RDD[Javabean]
public class Student implements Serializable {
    private Integer id;
    private String name;
    private Integer age;
    private Integer dept;
    public Student() {
    }
    public Student(Integer id, String name, Integer age, Integer dept) {
        this.id = id;
        this.name = name;
        this.age = age;
        this.dept = dept;
    }
    // getters/setters omitted
}
val userRDD:RDD[Student] = spark.sparkContext.makeRDD(List((1, "zs", 18, 1), (2, "ls", 18, 1)))
.map(t => new Student(t._1, t._2, t._3, t._4))
spark.createDataFrame(userRDD,classOf[Student])
.show()
- RDD[case-class]
val userRDD:RDD[User] = spark.sparkContext.makeRDD(List((1, "zs", 18, 1), (2, "ls", 18, 1)))
.map(t => User(t._1, t._2, t._3, t._4))
spark.createDataFrame(userRDD)
.show()
+---+----+---+----+
| id|name|age|dept|
+---+----+---+----+
|  1|  zs| 18|   1|
|  2|  ls| 18|   1|
+---+----+---+----+
toDF() can be called on it directly, with no need to specify column aliases
- RDD[Tuple]
val userRDD:RDD[(Int,String,Int,Int)] = spark.sparkContext.makeRDD(List((1, "zs", 18, 1), (2, "ls", 18, 1)))
spark.createDataFrame(userRDD)
.show()
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
|  1| zs| 18|  1|
|  2| ls| 18|  1|
+---+---+---+---+
toDF() can also be called directly; here you usually want to pass column aliases, e.g. toDF("id","name","age","dept")
Summary
RDD[tuple/case-class] --toDS()/toDF()--> Dataset / DataFrame
Dataset[U] --rdd--> the corresponding RDD[U]
DataFrame --rdd--> RDD[Row]
val userRDD:RDD[Student] = spark.sparkContext.makeRDD(List((1, "zs", 18, 1), (2, "ls", 18, 1)))
.map(t => new Student(t._1, t._2, t._3, t._4))
spark.createDataFrame(userRDD,classOf[Student])
.rdd
.map(row=>(row.getAs[Int]("id"),row.getAs[String]("name")))// work with Row values
.collect()
.foreach(println)
Dataset[U] --toDF()--> DataFrame, where U must be a case class or a tuple
DataFrame --as[U]--> Dataset[U], where U must be a case class or a tuple
Untyped Dataset Operations (aka DataFrame Operations)
id name sex age salary dept
---------------------------
1,Michael,false,29,2000,001
2,Andy,true,30,5000,001
3,Justin,true,19,1000,002
4,Kaine,false,20,5000,003
5,Lisa,false,19,1000,002
case class Employee(id:Int,name:String,sex:Boolean,age:Int,salary:Double,dept:String)
val employeeRDD:RDD[Employee] = spark.sparkContext.textFile("file:///D:/demo/users")
.map(line => line.split(","))
.map(ts => Employee(ts(0).toInt, ts(1), ts(2).toBoolean, ts(3).toInt, ts(4).toDouble, ts(5)))
val employeeDF = employeeRDD.toDF()
printSchema()
- Prints the schema of the current table
employeeDF.printSchema()
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- sex: boolean (nullable = false)
|-- age: integer (nullable = false)
|-- salary: double (nullable = false)
|-- dept: string (nullable = true)
select
- Projection: selects only the listed columns
employeeDF.select("id","name","age")
.show()
+---+-------+---+
| id| name|age|
+---+-------+---+
| 1|Michael| 29|
| 2| Andy| 30|
| 3| Justin| 19|
| 4| Kaine| 20|
| 5| Lisa| 19|
+---+-------+---+
If you want to compute with a column's values, use the following form:
employeeDF.select($"id",$"name",$"salary",$"salary" * 12 as "annual_salary")
.show()
+---+-------+------+-------------+
| id| name|salary|annual_salary|
+---+-------+------+-------------+
| 1|Michael|2000.0| 24000.0|
| 2| Andy|5000.0| 60000.0|
| 3| Justin|1000.0| 12000.0|
| 4| Kaine|5000.0| 60000.0|
| 5| Lisa|1000.0| 12000.0|
+---+-------+------+-------------+
selectExpr
- Allows SQL fragments/expressions in the column list
employeeDF.selectExpr("id as uid","name","salary","salary * 12 as annual_salary")
.show()
+---+-------+------+-------------+
|uid| name|salary|annual_salary|
+---+-------+------+-------------+
| 1|Michael|2000.0| 24000.0|
| 2| Andy|5000.0| 60000.0|
| 3| Justin|1000.0| 12000.0|
| 4| Kaine|5000.0| 60000.0|
| 5| Lisa|1000.0| 12000.0|
+---+-------+------+-------------+
Note: in Spark, do not use Chinese characters in column aliases.
where
- Equivalent to the filter operator: keeps the records that satisfy the predicate
employeeDF.selectExpr("id as uid","age","name","salary","salary * 12 as annual_salary")
.where("(name = 'Andy' or age >= 20) and (annual_salary > 24000) ")
.show()
Besides expression strings, where can also filter the result with Column predicates:
employeeDF.selectExpr("id as uid","age","name","salary","salary * 12 as annual_salary")
//.where("(name = 'Andy' or age >= 20) and (annual_salary > 24000) ")
.where(($"name" ==="Andy" or $"age" >= 20) and($"annual_salary" > 24000) )
.show()
+---+---+-----+------+-------------+
|uid|age| name|salary|annual_salary|
+---+---+-----+------+-------------+
| 2| 30| Andy|5000.0| 60000.0|
| 4| 20|Kaine|5000.0| 60000.0|
+---+---+-----+------+-------------+
withColumn√
- Adds a new column to the result
employeeDF.select($"id" ,$"age",$"name",$"salary",$"dept")
.withColumn("annual_salary",$"salary" * 12)
.where(($"name" ==="Andy" or $"age" >= 20) and($"annual_salary" > 24000) )
.withColumn("dept_as",$"dept")
.show()
+---+---+-----+------+----+-------------+-------+
| id|age| name|salary|dept|annual_salary|dept_as|
+---+---+-----+------+----+-------------+-------+
| 2| 30| Andy|5000.0| 001| 60000.0| 001|
| 4| 20|Kaine|5000.0| 003| 60000.0| 003|
+---+---+-----+------+----+-------------+-------+
withColumnRenamed
- Renames a column
employeeDF.select($"id" ,$"age",$"name",$"salary",$"dept")
.withColumnRenamed("id","uid")
.show()
+---+---+-------+------+----+
|uid|age| name|salary|dept|
+---+---+-------+------+----+
| 1| 29|Michael|2000.0| 001|
| 2| 30| Andy|5000.0| 001|
| 3| 19| Justin|1000.0| 002|
| 4| 20| Kaine|5000.0| 003|
| 5| 19| Lisa|1000.0| 002|
+---+---+-------+------+----+
groupBy
- Equivalent to the GROUP BY clause in SQL; must be used together with aggregate functions
employeeDF.select($"id" ,$"age",$"name",$"salary",$"dept")
.groupBy("dept")
.sum("salary")
.show()
+----+-----------+
|dept|sum(salary)|
+----+-----------+
| 003| 5000.0|
| 001| 7000.0|
| 002| 2000.0|
+----+-----------+
agg√
- Combines multiple aggregate operators, such as max, min, avg and count, to express common composite computations
import org.apache.spark.sql.functions._
employeeDF.select($"id" ,$"age",$"name",$"salary",$"dept")
.groupBy("dept")
.agg(sum("salary") as "total",avg("age") as "avg_age",max("salary") as "max_salary")
.show()
employeeDF.select($"id" ,$"age",$"name",$"salary",$"dept")
.groupBy("dept")
.agg(("salary","max"),("age","avg"),("salary","sum"))
//.agg("salary"->"max","age"->"avg","salary"->"sum")
.show()
employeeDF.select($"id" ,$"age",$"name",$"salary",$"dept")
.groupBy("dept")
.agg(Map("age"->"avg","salary"->"sum")) // note: a duplicate key would overwrite the earlier entry
.show()
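The caveat about duplicate keys is plain Scala Map semantics, not anything Spark-specific. A minimal sketch (no Spark required) shows why `agg(Map(...))` cannot apply two aggregates to the same column:

```scala
// A Scala Map literal keeps only the last value bound to a duplicate key,
// so Map("salary"->"max", ..., "salary"->"sum") silently drops "max".
object DuplicateKeyDemo extends App {
  val specs = Map("salary" -> "max", "age" -> "avg", "salary" -> "sum")
  assert(specs("salary") == "sum") // "max" was overwritten
  assert(specs.size == 2)          // only two distinct keys survive
  println(specs)
}
```

This is why the tuple form agg(("salary","max"),("salary","sum")) shown above is the one to use when the same column needs two aggregates.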
pivot√
- Turns rows into columns (pivot)
stuno name course score
-----------------------
001 zhangsan math 80
001 zhangsan chinese 70
001 zhangsan english 70
001 zhangsan history 90
002 wangw math 90
002 wangw chinese 80
case class CourseScore(stuno:String,name:String,course:String,score:Double)
val courseRDD:RDD[CourseScore] = spark.sparkContext.textFile("file:///D:/demo/students")
.map(line => line.split("\\s+"))
.map(ts => CourseScore(ts(0), ts(1), ts(2), ts(3).toDouble))
val courses= spark.sparkContext.textFile("file:///D:/demo/students")
.map(line => line.split("\\s+")(2))
.distinct()
.collect()
val courseDF = courseRDD.toDF()
import org.apache.spark.sql.functions._
courseDF.groupBy("stuno","name")
.pivot($"course",courses)// the column to pivot on, plus its distinct values (duplicates are not allowed)
.agg(sum($"score"))
.show()
+-----+--------+----+-------+-------+-------+
|stuno| name|math|history|english|chinese|
+-----+--------+----+-------+-------+-------+
| 001|zhangsan|80.0| 90.0| 70.0| 70.0|
| 002| wangw|90.0| null| null| 80.0|
+-----+--------+----+-------+-------+-------+
na
- Dedicated null handling; used together with the drop or fill operators
courseDF.groupBy("stuno","name")
.pivot($"course",courses)
.agg(sum($"score"))
.na.drop(5) // keep only rows with at least 5 non-null fields
.show()
+-----+--------+----+-------+-------+-------+
|stuno| name|math|history|english|chinese|
+-----+--------+----+-------+-------+-------+
| 001|zhangsan|80.0| 90.0| 70.0| 70.0|
+-----+--------+----+-------+-------+-------+
courseDF.groupBy("stuno","name")
.pivot($"course",courses)
.agg(sum($"score"))
.na.fill(Map("history"-> -1,"english" -> 0)) // fill nulls with default values
.show()
+-----+--------+----+-------+-------+-------+
|stuno| name|math|history|english|chinese|
+-----+--------+----+-------+-------+-------+
| 001|zhangsan|80.0| 90.0| 70.0| 70.0|
| 002| wangw|90.0| -1.0| 0.0| 80.0|
+-----+--------+----+-------+-------+-------+
cube√
- A multi-dimensional aggregation tool, commonly used in data analysis; more flexible than the groupBy operator
height weight eq iq
80 23 70 85
80 25 70 85
82 30 80 70
case class KindScore(height:Int,weight:Int,eq:Double,iq:Double)
val kindScoreRDD:RDD[KindScore]=spark.sparkContext.textFile("file:///D:/demo/kinds")
.map(line=>line.split("\\s+"))
.map(ts=>KindScore(ts(0).toInt,ts(1).toInt,ts(2).toDouble,ts(3).toDouble))
import org.apache.spark.sql.functions._
kindScoreRDD.toDF()
.cube("height","weight")
.agg(avg($"eq") as "EQ",avg($"iq") as "IQ")
.show()
+------+------+-----------------+----+
|height|weight| EQ| IQ|
+------+------+-----------------+----+
| null| 23| 70.0|85.0|
| null| null|73.33333333333333|80.0|
| null| 30| 80.0|70.0|
| 80| null| 70.0|85.0|
| null| 25| 70.0|85.0|
| 80| 25| 70.0|85.0|
| 82| 30| 80.0|70.0|
| 82| null| 80.0|70.0|
| 80| 23| 70.0|85.0|
+------+------+-----------------+----+
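The nine rows above are exactly the grouping sets that cube("height","weight") expands to: (height, weight), (height) only, (weight) only, and the grand total. The mechanics can be sketched in plain Scala (no Spark required); here None plays the role of the null placeholder and avg(eq) is the only aggregate:

```scala
// cube over two columns aggregates every row into 4 grouping sets;
// None marks a column that has been "rolled up" (shown as null by Spark).
object CubeSketch extends App {
  // the sample rows: (height, weight, eq)
  val rows = Seq((80, 23, 70.0), (80, 25, 70.0), (82, 30, 80.0))

  val groups = rows
    .flatMap { case (h, w, eq) =>
      Seq((Some(h), Some(w)), (Some(h), None), (None, Some(w)), (None, None))
        .map(key => key -> eq)
    }
    .groupBy(_._1)
    .map { case (key, vs) => key -> vs.map(_._2).sum / vs.size } // avg(eq)

  assert(groups.size == 9)                     // matches the 9 output rows
  assert(groups((Some(80), Some(23))) == 70.0) // the (80, 23) row
  assert(math.abs(groups((None, None)) - 73.3333) < 0.001) // the grand total
  groups.foreach(println)
}
```

With n grouping columns, cube produces 2^n grouping sets per row, which is why its output grows quickly as columns are added.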
over√
- A window specification used together with aggregate operators for per-group (partition-local) analysis
id name job salary dept
------------------------
1 zhangsan sale 8000.00 1
2 wangw sale 8500.00 1
3 lis clicker 6000.00 2
4 zhaol manager 10000.00 1
val empRDD:RDD[(Int,String,String,Double,Int)] = spark.sparkContext.textFile("file:///D:/demo/employee")
.map(_.split("\\s+"))
.map(ts=>(ts(0).toInt,ts(1),ts(2),ts(3).toDouble,ts(4).toInt))
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val w=Window.partitionBy("dept") // all records in the dept of the current record
.orderBy($"salary" desc) // sort by salary descending; frame offsets: negative <-- currentRow(0) --> positive
.rowsBetween(Window.unboundedPreceding,Window.currentRow) // frame: from the start of the partition up to the current row
val empDF= empRDD.toDF("id","name","job","salary","dept")
empDF.select("id","name","salary","dept")
.withColumn("rank",count($"salary") over(w))
.withColumn("avg_salary",avg($"salary") over(Window.partitionBy("dept")))
.withColumn("max_salary",max($"salary") over(Window.partitionBy("dept")))
.withColumn("min_salary",min($"salary") over(Window.partitionBy("dept")))
.show()
spark.stop()
+---+--------+-------+----+----+-----------------+----------+----------+
| id| name| salary|dept|rank| avg_salary|max_salary|min_salary|
+---+--------+-------+----+----+-----------------+----------+----------+
| 4| zhaol|10000.0| 1| 1|8833.333333333334| 10000.0| 8000.0|
| 2| wangw| 8500.0| 1| 2|8833.333333333334| 10000.0| 8000.0|
| 1|zhangsan| 8000.0| 1| 3|8833.333333333334| 10000.0| 8000.0|
| 3| lis| 6000.0| 2| 1| 6000.0| 6000.0| 6000.0|
+---+--------+-------+----+----+-----------------+----------+----------+
join
- Table joins
val empDF= empRDD.toDF("id","name","job","salary","dept").select("id","name","salary","dept")
val deptDF=List((1,"销售部"),(2,"运营部")).toDF("deptid","deptname")
empDF.join(deptDF,$"dept" === $"deptid","left_outer")
//.na.drop(List("deptname"))
.show()
+---+--------+-------+----+------+--------+
| id| name| salary|dept|deptid|deptname|
+---+--------+-------+----+------+--------+
| 1|zhangsan| 8000.0| 1| 1| 销售部|
| 2| wangw| 8500.0| 1| 1| 销售部|
| 3| lis| 6000.0| 2| 2| 运营部|
| 4| zhaol|10000.0| 1| 1| 销售部|
| 5| win7|10000.0| 3| null| null|
+---+--------+-------+----+------+--------+
If the join column has the same name in both tables, you can pass just the column name:
val empDF= empRDD.toDF("id","name","job","salary","dept").select("id","name","salary","dept")
val deptDF=List((1,"销售部"),(2,"运营部")).toDF("dept","deptname")
empDF.join(deptDF,"dept")
//.na.drop(List("deptname"))
.show()
+----+---+--------+-------+--------+
|dept| id| name| salary|deptname|
+----+---+--------+-------+--------+
| 1| 1|zhangsan| 8000.0| 销售部|
| 1| 2| wangw| 8500.0| 销售部|
| 2| 3| lis| 6000.0| 运营部|
| 1| 4| zhaol|10000.0| 销售部|
+----+---+--------+-------+--------+
drop
- Drops the given columns
val empDF= empRDD.toDF("id","name","job","salary","dept").select("id","name","salary","dept")
val deptDF=List((1,"销售部"),(2,"运营部")).toDF("deptid","deptname")
empDF.join(deptDF,$"dept" === $"deptid","left_outer")
.drop("deptid")
.show()
+---+--------+-------+----+--------+
| id| name| salary|dept|deptname|
+---+--------+-------+----+--------+
| 1|zhangsan| 8000.0| 1| 销售部|
| 2| wangw| 8500.0| 1| 销售部|
| 3| lis| 6000.0| 2| 运营部|
| 4| zhaol|10000.0| 1| 销售部|
| 5| win7|10000.0| 3| null|
+---+--------+-------+----+--------+
dropDuplicates√
- Removes duplicate records (by the given columns)
val empDF= empRDD.toDF("id","name","job","salary","dept").select("id","name","salary","dept")
val deptDF=List((1,"销售部"),(2,"运营部")).toDF("deptid","deptname")
empDF.join(deptDF,$"dept" === $"deptid","left_outer")
.drop("deptid")
.dropDuplicates("deptname")
.show()
+---+--------+-------+----+--------+
| id| name| salary|dept|deptname|
+---+--------+-------+----+--------+
| 1|zhangsan| 8000.0| 1| 销售部|
| 5| win7|10000.0| 3| null|
| 3| lis| 6000.0| 2| 运营部|
+---+--------+-------+----+--------+
orderBy
- Sorting
val empDF= empRDD.toDF("id","name","job","salary","dept").select("id","name","salary","dept")
val deptDF=List((1,"销售部"),(2,"运营部")).toDF("deptid","deptname")
empDF.join(deptDF,$"dept" === $"deptid","left_outer")
.orderBy($"dept" desc,$"salary" desc)
.show()
+---+--------+-------+----+------+--------+
| id| name| salary|dept|deptid|deptname|
+---+--------+-------+----+------+--------+
| 5| win7|10000.0| 3| null| null|
| 3| lis| 6000.0| 2| 2| 运营部|
| 4| zhaol|10000.0| 1| 1| 销售部|
| 2| wangw| 8500.0| 1| 1| 销售部|
| 1|zhangsan| 8000.0| 1| 1| 销售部|
+---+--------+-------+----+------+--------+
limit
- Equivalent to the take(n) operator: returns the first n records; useful for paging
val empDF= empRDD.toDF("id","name","job","salary","dept").select("id","name","salary","dept")
val deptDF=List((1,"销售部"),(2,"运营部")).toDF("deptid","deptname")
empDF.join(deptDF,$"dept" === $"deptid","left_outer")
.orderBy($"dept" desc,$"salary" desc)
.limit(3) // first 3 records
.show()
+---+-----+-------+----+------+--------+
| id| name| salary|dept|deptid|deptname|
+---+-----+-------+----+------+--------+
| 5| win7|10000.0| 3| null| null|
| 3| lis| 6000.0| 2| 2| 运营部|
| 4|zhaol|10000.0| 1| 1| 销售部|
+---+-----+-------+----+------+--------+
Dataset/DataFrame SQL
+---+-------+-----+---+------+----+
| id| name| sex|age|salary|dept|
+---+-------+-----+---+------+----+
| 1|Michael|false| 29|2000.0| 001|
| 2| Andy| true| 30|5000.0| 001|
| 3| Justin| true| 19|1000.0| 002|
| 4| Kaine|false| 20|5000.0| 003|
| 5| Lisa|false| 19|1000.0| 002|
+---+-------+-----+---+------+----+
+------+--------+
|deptid|deptname|
+------+--------+
| 001| 销售部|
| 002| 运营部|
| 003| 行政部|
+------+--------+
+-----+--------+-------+-----+
|stuno| name| course|score|
+-----+--------+-------+-----+
| 001|zhangsan| math| 80.0|
| 001|zhangsan|chinese| 70.0|
| 001|zhangsan|english| 70.0|
| 001|zhangsan|history| 90.0|
| 002| wangw| math| 90.0|
| 002| wangw|chinese| 80.0|
+-----+--------+-------+-----+
//1. Create the SparkSession
val spark = SparkSession.builder()
.appName("sparksql")
.master("local[6]")
.getOrCreate()
import spark.implicits._ // import the common implicit conversions/enrichments
//2. Build the RDDs
val employeeRDD:RDD[Employee] = spark.sparkContext.textFile("file:///D:/demo/users")
.map(line => line.split(","))
.map(ts => Employee(ts(0).toInt, ts(1), ts(2).toBoolean, ts(3).toInt, ts(4).toDouble, ts(5)))
val courseRDD:RDD[(String,String,String,Double)] = spark.sparkContext.textFile("file:///D:/demo/students")
.map(line => line.split("\\s+"))
.map(ts => (ts(0), ts(1), ts(2), ts(3).toDouble))
//3. Build the DataFrames and register views
val employeeDF = employeeRDD.toDF()
val courseDF= courseRDD.toDF("stuno" ,"name" ,"course" ,"score")
val deptDF=List(("001","销售部"),("002","运营部"),("003","行政部")).toDF("deptid","deptname")
employeeDF.createOrReplaceTempView("t_employee")
deptDF.createOrReplaceTempView("t_dept")
courseDF.createOrReplaceTempView("t_course")
//4. Run the SQL
val sql="" // the query under study goes here
spark.sql(sql).show()
//5. Stop the SparkSession
spark.stop()
Single-table query
select * from t_employee
+---+-------+-----+---+------+----+
| id| name| sex|age|salary|dept|
+---+-------+-----+---+------+----+
| 1|Michael|false| 29|2000.0| 001|
| 2| Andy| true| 30|5000.0| 001|
| 3| Justin| true| 19|1000.0| 002|
| 4| Kaine|false| 20|5000.0| 003|
| 5| Lisa|false| 19|1000.0| 002|
+---+-------+-----+---+------+----+
Filtering with a where clause
select * from t_employee where name = 'Michael' or age >=29
+---+-------+-----+---+------+----+
| id| name| sex|age|salary|dept|
+---+-------+-----+---+------+----+
| 1|Michael|false| 29|2000.0| 001|
| 2| Andy| true| 30|5000.0| 001|
+---+-------+-----+---+------+----+
Fuzzy query (LIKE)
select * from t_employee where name like '%ich%'
+---+-------+-----+---+------+----+
| id| name| sex|age|salary|dept|
+---+-------+-----+---+------+----+
| 1|Michael|false| 29|2000.0| 001|
+---+-------+-----+---+------+----+
Grouped aggregation
select dept,avg(salary) as avg_salary from t_employee group by dept
+----+----------+
|dept|avg_salary|
+----+----------+
| 003| 5000.0|
| 001| 3500.0|
| 002| 1000.0|
+----+----------+
Filtering groups (having)
select dept,avg(salary) as avg_salary from t_employee group by dept having avg_salary > 3500
+----+----------+
|dept|avg_salary|
+----+----------+
| 003| 5000.0|
+----+----------+
case when
select stuno,name,
max(case course when 'math' then score else 0 end ) math,
max(case course when 'english' then score else 0 end) english,
max(case course when 'chinese' then score else 0 end) chinese,
max(case course when 'history' then score else 0 end) history
from t_course
group by stuno,name
+-----+--------+----+-------+-------+-------+
|stuno| name|math|english|chinese|history|
+-----+--------+----+-------+-------+-------+
| 001|zhangsan|80.0| 70.0| 70.0| 90.0|
| 002| wangw|90.0| 0.0| 80.0| 0.0|
+-----+--------+----+-------+-------+-------+
pivot
select * from t_course pivot(max(score) for course in('math','chinese','english','history'))
+-----+--------+----+-------+-------+-------+
|stuno| name|math|chinese|english|history|
+-----+--------+----+-------+-------+-------+
| 001|zhangsan|80.0| 70.0| 70.0| 90.0|
| 002| wangw|90.0| 80.0| null| null|
+-----+--------+----+-------+-------+-------+
cube
select height,weight,avg(eq),avg(iq) from t_kindscore group by height,weight with cube
or
select height,weight,avg(eq),avg(iq) from t_kindscore group by cube(height,weight)
+------+------+-----------------+-------+
|height|weight| avg(eq)|avg(iq)|
+------+------+-----------------+-------+
| null| 23| 70.0| 85.0|
| null| null|73.33333333333333| 80.0|
| null| 30| 80.0| 70.0|
| 80| null| 70.0| 85.0|
| null| 25| 70.0| 85.0|
| 80| 25| 70.0| 85.0|
| 82| 30| 80.0| 70.0|
| 82| null| 80.0| 70.0|
| 80| 23| 70.0| 85.0|
+------+------+-----------------+-------+
Table joins
select e.*,d.* from t_employee e left join t_dept d on d.deptid = e.dept
+---+-------+-----+---+------+----+------+--------+
| id| name| sex|age|salary|dept|deptid|deptname|
+---+-------+-----+---+------+----+------+--------+
| 1|Michael|false| 29|2000.0| 001| 001| 销售部|
| 2| Andy| true| 30|5000.0| 001| 001| 销售部|
| 3| Justin| true| 19|1000.0| 002| 002| 运营部|
| 4| Kaine|false| 20|5000.0| 003| 003| 行政部|
| 5| Lisa|false| 19|1000.0| 002| 002| 运营部|
+---+-------+-----+---+------+----+------+--------+
Subqueries
- Query each employee's information together with the headcount of that employee's department
select id,name,salary,dept,(select count(*) from t_employee e2 where e1.dept=e2.dept) total from t_employee e1
+---+-------+------+----+-----+
| id| name|salary|dept|total|
+---+-------+------+----+-----+
| 4| Kaine|5000.0| 003| 1|
| 1|Michael|2000.0| 001| 2|
| 2| Andy|5000.0| 001| 2|
| 3| Justin|1000.0| 002| 2|
| 5| Lisa|1000.0| 002| 2|
+---+-------+------+----+-----+
Window functions
- Rank department employees by salary, using the count function
select id,name,salary,dept,
count(*) over(partition by dept order by salary desc rows between unbounded preceding and current row) rank
from t_employee
order by dept desc
+---+-------+------+----+-----+
| id| name|salary|dept| rank|
+---+-------+------+----+-----+
| 4| Kaine|5000.0| 003| 1|
| 3| Justin|1000.0| 002| 1|
| 5| Lisa|1000.0| 002| 2|
| 2| Andy|5000.0| 001| 1|
| 1|Michael|2000.0| 001| 2|
| 6| Lily|2000.0| 001| 3|
| 7| Lucy|1000.0| 001| 4|
+---+-------+------+----+-----+
If several employees have the same salary, this row-count ranking is unreliable: tied salaries still get different numbers.
- Rank department employees by salary with the rank() function; tied rows share a rank and the numbering is not continuous (it skips after ties)
select id,name,salary,dept,
rank() over(partition by dept order by salary desc) count
from t_employee
order by dept desc
+---+-------+------+----+-----+
| id| name|salary|dept|count|
+---+-------+------+----+-----+
| 4| Kaine|5000.0| 003| 1|
| 3| Justin|1000.0| 002| 1|
| 5| Lisa|1000.0| 002| 1|
| 2| Andy|5000.0| 001| 1|
| 6| Lily|2000.0| 001| 2|
| 1|Michael|2000.0| 001| 2|
| 7| Lucy|1000.0| 001| 4|
+---+-------+------+----+-----+
- Rank department employees by salary with the dense_rank() function; tied rows share a rank and the numbering stays continuous
select id,name,salary,dept,
dense_rank() over(partition by dept order by salary desc) count
from t_employee
order by dept desc
+---+-------+------+----+-----+
| id| name|salary|dept|count|
+---+-------+------+----+-----+
| 4| Kaine|5000.0| 003| 1|
| 3| Justin|1000.0| 002| 1|
| 5| Lisa|1000.0| 002| 1|
| 2| Andy|5000.0| 001| 1|
| 6| Lily|2000.0| 001| 2|
| 1|Michael|2000.0| 001| 2|
| 7| Lucy|1000.0| 001| 3|
+---+-------+------+----+-----+
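For reference, the difference between the three numbering schemes used above can be sketched in plain Scala over one department's salaries, already sorted in descending order (no Spark required):

```scala
// count(*) over a rows frame: plain row position          1,2,3,4
// rank():       ties share a rank, numbering then skips   1,2,2,4
// dense_rank(): ties share a rank, numbering stays dense  1,2,2,3
object RankSketch extends App {
  val salaries = Seq(5000.0, 2000.0, 2000.0, 1000.0) // dept 001, descending

  val rowNumber = salaries.indices.map(_ + 1)
  val rank      = salaries.map(s => salaries.indexOf(s) + 1) // index of the first tied row
  val distinct  = salaries.distinct
  val denseRank = salaries.map(s => distinct.indexOf(s) + 1) // index among distinct values

  assert(rowNumber == Seq(1, 2, 3, 4))
  assert(rank == Seq(1, 2, 2, 4))
  assert(denseRank == Seq(1, 2, 2, 3))
  println(s"row_number=$rowNumber rank=$rank dense_rank=$denseRank")
}
```

The two tied 2000.0 salaries are what separate the three schemes, matching the dept 001 rows in the output tables above.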
A note: the equivalent DataFrame API:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val w=Window.partitionBy("dept")
.orderBy($"salary" desc)
employeeDF.select("id","name","salary","dept")
.withColumn("rank",dense_rank() over(w))
.orderBy($"dept" desc)
.show()
Paged query
select id,name,salary,dept,
dense_rank() over(partition by dept order by salary desc) count
from t_employee
order by dept desc limit 3
+---+------+------+----+-----+
| id| name|salary|dept|count|
+---+------+------+----+-----+
| 4| Kaine|5000.0| 003| 1|
| 3|Justin|1000.0| 002| 1|
| 5| Lisa|1000.0| 002| 1|
+---+------+------+----+-----+
User-defined functions
Scalar (single-row) functions √
spark.udf.register("annual_salary",(salary:Double,age:Int)=>{
if(age>25){
salary*12+5000
}else{
salary*12
}
})
select id,name,salary,annual_salary(salary,age) annual_salary,age from t_employee
Untyped user-defined aggregate functions
- Extend the UserDefinedAggregateFunction abstract class; for example, a custom averaging aggregate:
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, DoubleType, IntegerType, StructType}
object MyAverage extends UserDefinedAggregateFunction {
//schema of the input arguments
override def inputSchema: StructType = {
//the field name here is arbitrary, but the type must be correct
new StructType().add("salary",DoubleType)
}
//intermediate state of the computation: a running total and a counter
override def bufferSchema: StructType = {
new StructType().add("total",DoubleType)
.add("count",IntegerType)
}
//type of the final aggregated result
override def dataType: DataType = DoubleType
//whether the function always returns the same result for the same input; usually true
override def deterministic: Boolean = true
//initialize the aggregation buffer
override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = 0.0 //running total
buffer(1) = 0 //count
}
//fold one input row into the buffer
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
if(!input.isNullAt(0)){
val salary = input.getAs[Double](0)
buffer(0)=buffer.getDouble(0)+salary
buffer(1)=buffer.getInt(1)+1
}
}
//merge two partial buffers; the result must be written into buffer1
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
buffer1(0)=buffer1.getDouble(0)+buffer2.getDouble(0) //running total
buffer1(1)=buffer1.getInt(1)+buffer2.getInt(1) //count
}
//compute the final result from the buffer
override def evaluate(buffer: Row): Any = {
buffer.getDouble(0)/buffer.getInt(1)
}
}
- Register the user-defined aggregate function
spark.udf.register("myavg",MyAverage)
- Use the aggregate
select dept,myavg(salary) from t_employee group by dept
+----+------------------+
|dept|myaverage$(salary)|
+----+------------------+
| 003| 5000.0|
| 001| 2500.0|
| 002| 1000.0|
+----+------------------+
Data Import/Export
File systems
- JSON
- Write
spark.sql(sql).write.json("file:///D:/write/json")

spark.sql(sql)
.write
.format("json")
.save("file:///D:/write/json1")
- Read
spark.read.format("json").load("file:///D:/demo/json")
.show()
- CSV
- Write
spark.sql(sql)
.write
.option("sep", ",") // field separator
.option("header", "true") // whether to write a header row
.format("csv")
.save("file:///D:/write/csv1")
- Read
spark.read
.option("sep", ",")
.option("header", "true")
.format("csv")
.load("file:///D:/demo/csv")
.show()
Relational databases (RDBMS)
- JDBC
- Write √
spark.sql(sql).write
.format("jdbc")
.option("url", "jdbc:mysql://localhost:3306/test")
.option("dbtable", "t_emp")
.option("user", "root")
.option("password", "root")
.mode(SaveMode.Overwrite)
.save()
- Read
spark.read
.format("jdbc")
.option("url", "jdbc:mysql://localhost:3306/test")
//.option("dbtable", "t_emp")
.option("query","select id,name from t_emp") // cannot be combined with dbtable
.option("user", "root")
.option("password", "root")
.load()
.show()
To avoid tying up database connections, it is usually better not to run heavy analytics directly against the database: export the data to CSV files with a tool, upload the CSV files to HDFS, and let Spark compute over those files instead.
Arbitrary external systems
If you need to write data out to a third-party system, use foreachPartition:
spark.read
.format("jdbc")
.option("url", "jdbc:mysql://localhost:3306/test")
.option("dbtable", "t_emp")
//.option("query","select id,name from t_emp") // cannot be combined with dbtable
.option("user", "root")
.option("password", "root")
.load()
.foreachPartition(rows=>{
rows.foreach(row=>{
val id = row.getAs[Int]("id")
val name = row.getAs[String]("name")
val salary = row.getAs[Double]("salary")
println(id+" "+name+" "+salary) // custom sink: write the result to the external system
})
})
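The point of foreachPartition over a plain foreach is connection reuse. A plain-Scala sketch (the SinkConnection class is hypothetical, standing in for a real JDBC/HTTP client) shows one connection being opened per partition rather than per row:

```scala
// Sketch of the foreachPartition pattern: open one (expensive) connection
// per partition and reuse it for every row, instead of one per row.
object ForeachPartitionSketch extends App {
  var opened = 0 // how many connections were created

  class SinkConnection { // hypothetical stand-in for a JDBC/HTTP client
    opened += 1
    def write(record: (Int, String)): Unit = () // pretend to send the record
    def close(): Unit = ()
  }

  // pretend these are the partitions handed to foreachPartition
  val partitions: Seq[Seq[(Int, String)]] =
    Seq(Seq((1, "zhangsan"), (2, "wangw")), Seq((3, "lis")))

  partitions.foreach { rows =>
    val conn = new SinkConnection // one connection per partition
    rows.foreach(conn.write)      // reuse it for every row
    conn.close()
  }

  assert(opened == 2) // 2 partitions -> 2 connections, not 3 (one per row)
  println(s"connections opened: $opened")
}
```

The same structure applies inside the Spark job above: create the client at the top of the foreachPartition closure, write each Row through it, and close it before the closure returns.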