Spark SQL
Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give the engine more information about the structure of the data and the computation being performed, and Spark uses this extra information internally to optimize and adjust the job. There are several ways to interact with Spark SQL, including the Dataset API and SQL, and the two can be mixed freely. One use of Spark SQL is to execute SQL queries; it can also read data from an existing Hive installation. When SQL is run from another programming language the results are returned as a Dataset/DataFrame, and you can also interact with the SQL interface through the command line or over JDBC/ODBC.
A Dataset is a distributed collection of data and a new interface added in Spark 1.6. It combines the benefits of RDDs (strong typing, the ability to use powerful lambda functions) with the optimized execution engine of Spark SQL. A Dataset can be constructed from JVM objects and then manipulated with functional transformations (map, flatMap, filter, and so on). The Dataset API is currently available in Scala and Java; Python support for it is still incomplete.
A DataFrame is a Dataset organized into named columns; it is conceptually equivalent to a table in a relational database. DataFrames can be constructed from a wide range of sources, such as structured data files, tables in Hive, or external databases. A DataFrame is simply a Dataset of Rows, i.e. a Dataset[Row].
SparkSession
- Dependency
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.3</version>
</dependency>
The entry point to all functionality in Spark is the SparkSession class. To create a basic SparkSession, just use SparkSession.builder():
val spark = SparkSession.builder()
.appName("hellosql")
.master("local[10]")
.getOrCreate()
//This implicit import is usually required; it mainly enables converting RDDs to DataFrames/Datasets
import spark.implicits._
//Suppress Spark log output
spark.sparkContext.setLogLevel("FATAL")
spark.stop()
Dataset
Datasets are similar to RDDs; however, instead of Java serialization or Kryo they use a dedicated Encoder to serialize objects for processing or transmission over the network. While both Encoders and standard serialization are responsible for turning objects into bytes, Encoders are code-generated dynamically and use a format that allows Spark to perform many operations such as filtering, sorting and hashing without deserializing the bytes back into objects.
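Implicit Encoders for case classes and common primitive types come with import spark.implicits._; they can also be obtained explicitly from the Encoders factory. A minimal sketch, assuming the SparkSession and the Person case class used in the examples below:
import org.apache.spark.sql.{Dataset, Encoder, Encoders}
// Code-generated encoder for a case class (normally supplied implicitly by spark.implicits._)
val personEncoder: Encoder[Person] = Encoders.product[Person]
val ds: Dataset[Person] = spark.createDataset(Seq(Person(1,"zhangsan",18,true)))(personEncoder)
ds.show()
// For types without a built-in encoder, a Kryo-based encoder can be declared,
// at the cost of storing the value as a single opaque binary column
val kryoEncoder = Encoders.kryo[Map[String, Any]]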
- Case-class collection
case class Person(id:Int,name:String,age:Int,sex:Boolean)
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("hellosql")
.master("local[10]")
.getOrCreate()
import spark.implicits._
val dataset: Dataset[Person] = List(Person(1,"zhangsan",18,true),Person(2,"wangwu",28,true)).toDS()
dataset.select($"id",$"name").show()
//Suppress Spark log output
spark.sparkContext.setLogLevel("FATAL")
spark.stop()
}
- Tuples
case class Person(id:Int,name:String,age:Int,sex:Boolean)
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("hellosql")
.master("local[10]")
.getOrCreate()
import spark.implicits._
val dataset: Dataset[(Int,String,Int,Boolean)] = List((1,"zhangsan",18,true),(2,"wangwu",28,true)).toDS()
dataset.select($"_1",$"_2").show()//元组没有具体类做支撑,因此引入列指定以tuple._下标的形式
//Suppress Spark log output
spark.sparkContext.setLogLevel("FATAL")
spark.stop()
}
- Loading JSON data
{"name":"张三","age":18}
{"name":"lisi","age":28}
{"name":"wangwu","age":38}
case class Person(name: String, age: Long)
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.master("local[5]")
.appName("spark session")
.getOrCreate()
spark.sparkContext.setLogLevel("FATAL")
import spark.implicits._
val dataset = spark.read.json("D:///Persion.json").as[Person]
dataset.show()
spark.stop()
}
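The typed operators mentioned at the beginning of this section (map, filter, and so on) can now be applied to the loaded Dataset[Person]. A small sketch; these lines would sit inside the main method above, after dataset.show():
// The lambdas receive Person objects rather than Rows (relies on import spark.implicits._, already present above)
dataset.filter(p => p.age >= 20)
  .map(p => (p.name, p.age + 1)) // yields a Dataset[(String, Long)]
  .show()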
DataFrame
As noted above, a DataFrame is a Dataset organized into named columns, conceptually a relational table. It can be built from structured data files, Hive tables, or external databases, and it is simply a Dataset[Row].
- Loading a JSON file
val spark = SparkSession.builder()
.appName("hellosql")
.master("local[10]")
.getOrCreate()
import spark.implicits._
val frame = spark.read.json("file:///f:/person.json")
frame.show()
//Suppress Spark log output
spark.sparkContext.setLogLevel("FATAL")
spark.stop()
- case-class
case class Person(name:String,age:Long)
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("hellosql")
.master("local[10]")
.getOrCreate()
import spark.implicits._
List(Person("zhangsan",18),Person("王五",20)).toDF("uname","uage").show()
//Suppress Spark log output
spark.sparkContext.setLogLevel("FATAL")
spark.stop()
}
- Tuples
case class Person(name:String,age:Long)
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("hellosql")
.master("local[10]")
.getOrCreate()
import spark.implicits._
List(("zhangsan",18),("王五",20)).toDF("name","age").show()
//Suppress Spark log output
spark.sparkContext.setLogLevel("FATAL")
spark.stop()
}
- Converting from an RDD (flexible)
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("hellosql")
.master("local[10]")
.getOrCreate()
val lines = spark.sparkContext.parallelize(List("zhangsan,20", "lisi,30"))
//Split each string to get an RDD[Row], e.g. Row(zhangsan, 20)
.map(line => Row(line.split(",")(0), line.split(",")(1).toInt))
//Define the schema of the target DataFrame (each field: name, type, nullable)
val structType = new StructType(Array(StructField("name",StringType,true),StructField("age",IntegerType,true)))
//Create the DataFrame from the RDD[Row] and the schema
val frame = spark.createDataFrame(lines,structType)
frame.show()
//Suppress Spark log output
spark.sparkContext.setLogLevel("FATAL")
spark.stop()
}
DataFrame operators
Data in the following format:
Michael,29,2000,true
Andy,30,5000,true
Justin,19,1000,true
Kaine,20,5000,true
Lisa,19,1000,false
select
//Data format: Michael,29,2000,true
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("hellosql")
.master("local[10]")
.getOrCreate()
import spark.implicits._ //The implicit conversions must be imported, otherwise the $"column" syntax and related features are unavailable
var rdd=spark.sparkContext.textFile("file:///D:/WorkSpace02/people.txt")
//Split each line to get an RDD[Row], e.g. Row(Michael, 29, 2000.0, true)
.map(_.split(","))
.map(arr=>Row(arr(0),arr(1).trim().toInt,arr(2).trim().toDouble,arr(3).trim().toBoolean))
//Define the schema fields of the target DataFrame (name, type, nullable)
var fields=new StructField("name",StringType,true)::
new StructField("age",IntegerType,true)::
new StructField("salary",DoubleType,true)::
//:: prepends each element to the list terminated by Nil, equivalent to Array(StructField("name",StringType,true),StructField("age",IntegerType,true)...)
new StructField("sex",BooleanType,true)::Nil
//Create the DataFrame from the RDD[Row] and the schema
val frame = spark.createDataFrame(rdd, StructType(fields)).as("t_user")
//Run the select operator ($"xxx" references a schema field)
frame.select($"name",$"age",$"sex",$"salary",$"salary"*12 as "年薪")
.show()
//Suppress Spark log output
spark.sparkContext.setLogLevel("FATAL")
spark.stop()
}
+-------+---+-----+------+-------+
| name|age| sex|salary| 年薪|
+-------+---+-----+------+-------+
|Michael| 29| true|2000.0|24000.0|
| Andy| 30| true|5000.0|60000.0|
| Justin| 19| true|1000.0|12000.0|
| Kaine| 20| true|5000.0|60000.0|
| Lisa| 19|false|1000.0|12000.0|
+-------+---+-----+------+-------+
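The same projection can also be written with selectExpr, which parses SQL expression strings. A sketch against the frame DataFrame built above (an ASCII alias is used here; see the remark about Chinese aliases in the where example below):
frame.selectExpr("name", "age", "sex", "salary", "salary * 12 as year_salary").show()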
filter
var rdd= spark.sparkContext.textFile("file:///D:/people.txt")
.map(_.split(","))
.map(arr=>Row(arr(0),arr(1).trim().toInt,arr(2).trim().toDouble,arr(3).trim().toBoolean))
var fields=new StructField("name",StringType,true)::new StructField("age",IntegerType,true)::new StructField("salary",DoubleType,true):: new StructField("sex",BooleanType,true)::Nil
spark.createDataFrame(rdd,StructType(fields))
.select($"name",$"age",$"sex",$"salary",$"salary" * 12 as "年薪")
.filter($"name" === "Michael" or $"年薪" < 60000)
.show()
+-------+---+-----+------+-------+
| name|age| sex|salary| 年薪|
+-------+---+-----+------+-------+
|Michael| 29| true|2000.0|24000.0|
| Justin| 19| true|1000.0|12000.0|
| Lisa| 19|false|1000.0|12000.0|
+-------+---+-----+------+-------+
where
//Michael,29,2000,true
var rdd= spark.sparkContext.textFile("file:///D:/people.txt")
.map(_.split(","))
.map(arr=>Row(arr(0),arr(1).trim().toInt,arr(2).trim().toDouble,arr(3).trim().toBoolean))
var fields=new StructField("name",StringType,true)::new StructField("age",IntegerType,true)::new StructField("salary",DoubleType,true):: new StructField("sex",BooleanType,true)::Nil
spark.createDataFrame(rdd,StructType(fields))
.select($"name",$"age",$"sex",$"salary",$"salary" * 12 as "year_salary") //不允许别名中有 中文 bug
.where("(name = 'Michael') or ( year_salary <= 24000) ")
.show()
var rdd= spark.sparkContext.textFile("file:///D:/people.txt")
.map(_.split(","))
.map(arr=>Row(arr(0),arr(1).trim().toInt,arr(2).trim().toDouble,arr(3).trim().toBoolean))
var fields=new StructField("name",StringType,true)::new StructField("age",IntegerType,true)::new StructField("salary",DoubleType,true):: new StructField("sex",BooleanType,true)::Nil
spark.createDataFrame(rdd,StructType(fields))
.select($"name",$"age",$"sex",$"salary",$"salary" * 12 as "年薪")
.where($"name" === "Michael" or $"年薪" <= 24000)
.show()
withColumn
//Michael,29,2000,true
var rdd= spark.sparkContext.textFile("file:///D:/people.txt")
.map(_.split(","))
.map(arr=>Row(arr(0),arr(1).trim().toInt,arr(2).trim().toDouble,arr(3).trim().toBoolean))
var fields=new StructField("name",StringType,true)::new StructField("age",IntegerType,true)::new StructField("salary",DoubleType,true):: new StructField("sex",BooleanType,true)::Nil
spark.createDataFrame(rdd,StructType(fields))
.select($"name",$"age",$"sex",$"salary",$"salary" * 12 as "年薪")
.where($"name" === "Michael" or $"年薪" <= 24000)
.withColumn("年终奖",$"年薪" * 0.8)
.show()
+-------+---+-----+------+-------+-------+
| name|age| sex|salary| 年薪| 年终奖|
+-------+---+-----+------+-------+-------+
|Michael| 29| true|2000.0|24000.0|19200.0|
| Justin| 19| true|1000.0|12000.0| 9600.0|
| Lisa| 19|false|1000.0|12000.0| 9600.0|
+-------+---+-----+------+-------+-------+
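A related operator is withColumnRenamed, which renames an existing column without recomputing it. A sketch, assuming the chained result above is bound to a DataFrame value named df (a hypothetical name):
// Rename the derived column to an ASCII name
df.withColumnRenamed("年薪", "year_salary").show()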
groupBy
var rdd= spark.sparkContext.textFile("file:///D:/people.txt")
.map(_.split(","))
.map(arr=>Row(arr(0),arr(1).trim().toInt,arr(2).trim().toDouble,arr(3).trim().toBoolean))
var fields=new StructField("name",StringType,true)::new StructField("age",IntegerType,true)::new StructField("salary",DoubleType,true):: new StructField("sex",BooleanType,true)::Nil
spark.createDataFrame(rdd,StructType(fields))
.select($"age",$"sex")
.groupBy($"sex")
.avg("age")
.show()
+-----+--------+
| sex|avg(age)|
+-----+--------+
| true| 24.5|
|false| 19.0|
+-----+--------+
agg
var rdd= spark.sparkContext.textFile("file:///D:/people.txt")
.map(_.split(","))
.map(arr=>Row(arr(0),arr(1).trim().toInt,arr(2).trim().toDouble,arr(3).trim().toBoolean))
var fields=new StructField("name",StringType,true)::new StructField("age",IntegerType,true)::new StructField("salary",DoubleType,true):: new StructField("sex",BooleanType,true)::Nil
import org.apache.spark.sql.functions._
spark.createDataFrame(rdd,StructType(fields))
.select($"age",$"sex",$"salary")
.groupBy($"sex")
//The agg operator accepts aggregate functions such as sum, avg, max, min, count
.agg(sum($"salary") as "toatalSalary",avg("age") as "avgAge",max($"salary"))
.show()
+-----+------------+------+-----------+
| sex|toatalSalary|avgAge|max(salary)|
+-----+------------+------+-----------+
| true| 13000.0| 24.5| 5000.0|
|false| 1000.0| 19.0| 1000.0|
+-----+------------+------+-----------+
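agg also accepts a Map from column name to aggregate-function name, convenient for simple cases; the result columns are then named like sum(salary). A sketch over the same grouped data:
// Same aggregation expressed with a Map of column -> function name
spark.createDataFrame(rdd,StructType(fields))
  .groupBy($"sex")
  .agg(Map("salary" -> "sum", "age" -> "avg"))
  .show()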
join
Prepare the following data, dept.txt:
1,销售部门
2,研发部门
3,媒体运营
4,后勤部门
people.txt
Michael,29,2000,true,1
Andy,30,5000,true,1
Justin,19,1000,true,2
Kaine,20,5000,true,2
Lisa,19,1000,false,3
//Michael,29,2000,true,1
var rdd= spark.sparkContext.textFile("file:///D:/people.txt")
.map(_.split(","))
.map(arr=>Row(arr(0),arr(1).trim().toInt,arr(2).trim().toDouble,arr(3).trim().toBoolean,arr(4).trim().toInt))
var fields=new StructField("name",StringType,true)::new StructField("age",IntegerType,true)::new StructField("salary",DoubleType,true):: new StructField("sex",BooleanType,true)::
new StructField("deptno",IntegerType,true)::Nil
val user = spark.createDataFrame(rdd,StructType(fields)).as("user")
var dept = spark.sparkContext.textFile("file:///D:/dept.txt")
.map(line =>(line.split(",")(0).toInt,line.split(",")(1)))
.toDF("deptno","deptname").as("dept")
user.select($"name",$"user.deptno")
.join(dept,$"dept.deptno" === $"user.deptno")
.show()
+-------+------+------+--------+
| name|deptno|deptno|deptname|
+-------+------+------+--------+
|Michael| 1| 1|销售部门|
| Andy| 1| 1|销售部门|
| Lisa| 3| 3|媒体运营|
| Justin| 2| 2|研发部门|
| Kaine| 2| 2|研发部门|
+-------+------+------+--------+
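join takes an optional third argument selecting the join type (inner by default). A sketch of a left outer join that keeps every user row even when no department matches, reusing the user and dept DataFrames above:
user.select($"name",$"user.deptno")
  .join(dept, $"dept.deptno" === $"user.deptno", "left_outer")
  .show()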
drop
userDF.select($"deptno",$"salary" )
.groupBy($"deptno")
.agg(sum($"salary") as "总薪资",avg($"salary") as "平均值",max($"salary") as "最大值")
.join(deptDF,$"dept.deptno" === $"user.deptno")
.drop($"dept.deptno")
.show()
+------+-------+------------------+-------+--------+
|deptno| 总薪资| 平均值| 最大值|deptname|
+------+-------+------------------+-------+--------+
| 1|43000.0|14333.333333333334|20000.0|销售部门|
| 2|38000.0| 19000.0|20000.0|研发部门|
+------+-------+------------------+-------+--------+
orderBy
userDF.select($"deptno",$"salary" )
.groupBy($"deptno")
.agg(sum($"salary") as "总薪资",avg($"salary") as "平均值",max($"salary") as "最大值")
.join(deptDF,$"dept.deptno" === $"user.deptno")
.drop($"dept.deptno")
.orderBy($"总薪资" asc)
.show()
+------+-------+------------------+-------+--------+
|deptno| 总薪资| 平均值| 最大值|deptname|
+------+-------+------------------+-------+--------+
| 2|38000.0| 19000.0|20000.0|研发部门|
| 1|43000.0|14333.333333333334|20000.0|销售部门|
+------+-------+------------------+-------+--------+
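orderBy sorts ascending by default; .desc on a column sorts the other way, and several sort keys can be combined. A sketch, assuming the userDF and the functions._ import used above:
userDF.select($"deptno",$"salary")
  .groupBy($"deptno")
  .agg(sum($"salary") as "总薪资")
  // highest total salary first, ties broken by deptno ascending
  .orderBy($"总薪资".desc, $"deptno".asc)
  .show()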
map
userDF.map(row => (row.getString(0),row.getInt(1))).show()
+--------+---+
| name|age|
+--------+---+
|zhangsan| 28|
+--------+---+
By default Spark SQL serializes the values that flow through a query using Encoders. Encoders for common types are provided out of the box; when a type has no built-in Encoder, the user must declare an implicit Encoder for it.
//Declare an implicit Encoder
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
userDF.map(row => row.getValuesMap[Any](List("name","age","salary")))
.foreach(map=>{
var name=map.getOrElse("name","")
var age=map.getOrElse("age",0)
var salary=map.getOrElse("salary",0.0)
println(name+" "+age+" "+salary)
})
flatMap
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
userDF.flatMap(row => row.getValuesMap(List("name","age")))
.map(item => item._1 +" -> "+item._2)
.show()
+---------------+
| value|
+---------------+
|name -> Michael|
| age -> 29|
| name -> Andy|
| age -> 30|
| name -> Justin|
| age -> 19|
| name -> Kaine|
| age -> 20|
| name -> Lisa|
| age -> 19|
+---------------+
- limit (take(n))
//Create the SparkSession
val spark = SparkSession.builder().master("local").appName("testlimit").getOrCreate()
//Import the implicit conversions
import spark.implicits._
//Read the dept file and convert it into a DataFrame
val dept = spark.sparkContext.textFile("D://suns/dept.txt")
.map(line => (line.split(",")(0).toInt, line.split(",")(1)))
.toDF("deptno", "deptname")
.as("dept")
//Read the people file into an RDD
val personRDD = spark.sparkContext.textFile("D://suns/people.txt")
.map(_.split(","))
.map(arr => Row(arr(0), arr(1).toInt, arr(2).toDouble, arr(3).toBoolean, arr(4).toInt))
// Specify each field's schema directly via StructType
val fields=new StructField("name",StringType,true)::new StructField("age",IntegerType,true)::new StructField("salary",DoubleType)::new StructField("sex",BooleanType)::new StructField("deptno",IntegerType)::Nil
//Build the person DataFrame
val person = spark.createDataFrame(personRDD,StructType(fields)).as("person")
//Query
person.select($"name",$"age",$"salary",$"deptno" as "p_deptno")
//Join the tables
.join(dept,$"dept.deptno"===$"p_deptno")
//Drop a column
.drop("person.deptno")
//Sort
.orderBy($"deptno")
//Pagination (take only the first three rows here)
.limit(3)
.show()
spark.sparkContext.setLogLevel("FATAL") //Suppress log output
spark.stop()
+------+------+--------+
| name|deptno|deptname|
+------+------+--------+
| Lisa| 3|媒体运营|
|Justin| 2|研发部门|
| Kaine| 2|研发部门|
+------+------+--------+
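The heading above also mentions take(n): limit(n) stays distributed and returns a new DataFrame that can be transformed further, whereas take(n) (equivalent to limit(n).collect()) brings the first n rows back to the driver as a local Array[Row]. A sketch using the person DataFrame built above:
val firstThreeDF = person.limit(3)               // still a distributed DataFrame
val firstThreeLocal: Array[Row] = person.take(3) // a local Array[Row] on the driver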
Obtaining a DataFrame with SQL
Michael,29,20000,true,MANAGER,1
Andy,30,15000,true,SALESMAN,1
Justin,19,8000,true,CLERK,1
Kaine,20,20000,true,MANAGER,2
Lisa,19,18000,false,SALESMAN,2
SQL query
//Create the SparkSession
val spark = SparkSession.builder().master("local[3]").appName("testsparksql").getOrCreate()
//Read the file contents and create an RDD
val personRDD = spark.sparkContext.textFile("D://suns/people.txt")
.map(_.split(","))
.map(arr => Row(arr(0), arr(1).toInt, arr(2).toDouble, arr(3).toBoolean, arr(4), arr(5).toInt))
// Specify each field's schema directly via StructType
var fields=new StructField("name",StringType,true)::new StructField("age",IntegerType,true)::new StructField("salary",DoubleType,true)::new StructField("sex",BooleanType,true)::new StructField("job",StringType,true)::new StructField("deptno",IntegerType,true)::Nil
//Convert personRDD into a DataFrame
val personFrame = spark.createDataFrame(personRDD,StructType(fields)).as("person")
//Create a temporary view
personFrame.createTempView("t_person")
//Write the SQL
spark.sql("select * from t_person").show()
//Suppress log output
spark.sparkContext.setLogLevel("FATAL")
//Stop Spark
spark.stop()
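createTempView throws an exception if a view with the same name already exists; createOrReplaceTempView is the re-runnable variant, and createGlobalTempView registers the view under the reserved global_temp database so it can be shared across SparkSessions. A sketch using the personFrame above:
// Replace the view if it already exists
personFrame.createOrReplaceTempView("t_person")
// Global temporary views are qualified with the global_temp database
personFrame.createGlobalTempView("g_person")
spark.sql("select name, salary from global_temp.g_person").show()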
group by
val rdd = spark.sparkContext.textFile("file:///D:/people.txt")
.map(line => {
val tokens = line.split(",")
Row(tokens(0), tokens(1).toInt, tokens(2).toDouble, tokens(3).toBoolean, tokens(4), tokens(5).toInt)
})
var fields=new StructField("name",StringType,true)::
new StructField("age",IntegerType,true)::
new StructField("salary",DoubleType,true)::
new StructField("sex",BooleanType,true)::
new StructField("job",StringType,true)::
new StructField("deptno",IntegerType,true)::Nil
val userDF = spark.createDataFrame(rdd,StructType(fields))
//Create a temporary view
userDF.createTempView("t_user")
spark.sql("select deptno,max(salary),avg(salary),sum(salary),count(1) from t_user group by deptno").show()
//Suppress Spark log output
spark.sparkContext.setLogLevel("FATAL")
spark.stop()
- having filter
val rdd = spark.sparkContext.textFile("file:///D:/people.txt")
.map(line => {
val tokens = line.split(",")
Row(tokens(0), tokens(1).toInt, tokens(2).toDouble, tokens(3).toBoolean, tokens(4), tokens(5).toInt)
})
var fields=new StructField("name",StringType,true)::
new StructField("age",IntegerType,true)::
new StructField("salary",DoubleType,true)::
new StructField("sex",BooleanType,true)::
new StructField("job",StringType,true)::
new StructField("deptno",IntegerType,true)::Nil
val userDF = spark.createDataFrame(rdd,StructType(fields))
//Create a temporary view
userDF.createTempView("t_user")
spark.sql("select deptno,max(salary),avg(salary),sum(salary),count(1) total from t_user group by deptno having total > 2 ")
.show()
//Suppress Spark log output
spark.sparkContext.setLogLevel("FATAL")
spark.stop()
- Table join
val userRdd = spark.sparkContext.textFile("file:///D:/people.txt")
.map(line => {
val tokens = line.split(",")
Row(tokens(0), tokens(1).toInt, tokens(2).toDouble, tokens(3).toBoolean, tokens(4), tokens(5).toInt)
})
val deptRdd = spark.sparkContext.textFile("file:///D:/dept.txt")
.map(line => {
val tokens = line.split(",")
Row(tokens(0).toInt, tokens(1))
})
var userFields=new StructField("name",StringType,true)::
new StructField("age",IntegerType,true)::
new StructField("salary",DoubleType,true)::
new StructField("sex",BooleanType,true)::
new StructField("job",StringType,true)::
new StructField("deptno",IntegerType,true)::Nil
var deptFields=new StructField("deptno",IntegerType,true)::
new StructField("name",StringType,true)::Nil
spark.createDataFrame(userRdd,StructType(userFields)).createTempView("t_user")
spark.createDataFrame(deptRdd,StructType(deptFields)).createTempView("t_dept")
spark.sql("select u.*,d.name from t_user u left join t_dept d on u.deptno=d.deptno")
.show()
//Suppress Spark log output
spark.sparkContext.setLogLevel("FATAL")
spark.stop()
- limit
val userRdd = spark.sparkContext.textFile("file:///D:/people.txt")
.map(line => {
val tokens = line.split(",")
Row(tokens(0), tokens(1).toInt, tokens(2).toDouble, tokens(3).toBoolean, tokens(4), tokens(5).toInt)
})
val deptRdd = spark.sparkContext.textFile("file:///D:/dept.txt")
.map(line => {
val tokens = line.split(",")
Row(tokens(0).toInt, tokens(1))
})
var userFields=new StructField("name",StringType,true)::
new StructField("age",IntegerType,true)::
new StructField("salary",DoubleType,true)::
new StructField("sex",BooleanType,true)::
new StructField("job",StringType,true)::
new StructField("deptno",IntegerType,true)::Nil
var deptFields=new StructField("deptno",IntegerType,true)::
new StructField("name",StringType,true)::Nil
spark.createDataFrame(userRdd,StructType(userFields)).createTempView("t_user")
spark.createDataFrame(deptRdd,StructType(deptFields)).createTempView("t_dept")
spark.sql("select u.*,d.name from t_user u left join t_dept d on u.deptno=d.deptno order by u.age asc limit 8")
.show()
//Suppress Spark log output
spark.sparkContext.setLogLevel("FATAL")
spark.stop()
- Subquery
val userRdd = spark.sparkContext.textFile("file:///D:/people.txt")
.map(line => {
val tokens = line.split(",")
Row(tokens(0), tokens(1).toInt, tokens(2).toDouble, tokens(3).toBoolean, tokens(4), tokens(5).toInt)
})
val deptRdd = spark.sparkContext.textFile("file:///D:/dept.txt")
.map(line => {
val tokens = line.split(",")
Row(tokens(0).toInt, tokens(1))
})
var userFields=new StructField("name",StringType,true)::
new StructField("age",IntegerType,true)::
new StructField("salary",DoubleType,true)::
new StructField("sex",BooleanType,true)::
new StructField("job",StringType,true)::
new StructField("deptno",IntegerType,true)::Nil
var deptFields=new StructField("deptno",IntegerType,true)::
new StructField("name",StringType,true)::Nil
spark.createDataFrame(userRdd,StructType(userFields)).createTempView("t_user")
spark.createDataFrame(deptRdd,StructType(deptFields)).createTempView("t_dept")
spark.sql("select * from (select name,age,salary from t_user)")
.show()
//Suppress Spark log output
spark.sparkContext.setLogLevel("FATAL")
spark.stop()
Window functions
Ordinary statistical analysis usually relies on aggregate functions, whose defining trait is that they collapse n rows into a single row. Databases also support another kind of analysis, windowed statistics: a window function computes a value over a group of rows (the window) and attaches it to every row, without collapsing the rows. If each record returned by a query is pictured as one floor of a building, a window function opens a window on every floor, letting each floor see the whole building or some part of it.
Query every employee's information together with the average salary of the employee's department.
Michael,29,20000,true,MANAGER,1
Andy,30,15000,true,SALESMAN,1
Justin,19,8000,true,CLERK,1
Kaine,20,20000,true,MANAGER,2
Lisa,19,18000,false,SALESMAN,2
+-------+---+-------+-----+--------+------+
| name|age| salary| sex| job|deptno|
+-------+---+-------+-----+--------+------+
|Michael| 29|20000.0| true| MANAGER| 1|
| Jimi| 25|20000.0| true|SALESMAN| 1|
| Andy| 30|15000.0| true|SALESMAN| 1|
| Justin| 19| 8000.0| true| CLERK| 1|
| Kaine| 20|20000.0| true| MANAGER| 2|
| Lisa| 19|18000.0|false|SALESMAN| 2|
+-------+---+-------+-----+--------+------+
val userRdd = spark.sparkContext.textFile("file:///D:/people.txt")
.map(line => {
val tokens = line.split(",")
Row(tokens(0), tokens(1).toInt, tokens(2).toDouble, tokens(3).toBoolean, tokens(4), tokens(5).toInt)
})
var userFields=new StructField("name",StringType,true)::
new StructField("age",IntegerType,true)::
new StructField("salary",DoubleType,true)::
new StructField("sex",BooleanType,true)::
new StructField("job",StringType,true)::
new StructField("deptno",IntegerType,true)::Nil
spark.createDataFrame(userRdd,StructType(userFields)).createTempView("t_user")
spark.sql("select *, avg(salary) over(partition by deptno) as avgSalary from t_user")
.show()
//Suppress Spark log output
spark.sparkContext.setLogLevel("FATAL")
spark.stop()
+-------+---+-------+-----+--------+------+------------------+
| name|age| salary| sex| job|deptno| avgSalary|
+-------+---+-------+-----+--------+------+------------------+
|Michael| 29|20000.0| true| MANAGER| 1|14333.333333333334|
| Andy| 30|15000.0| true|SALESMAN| 1|14333.333333333334|
| Justin| 19| 8000.0| true| CLERK| 1|14333.333333333334|
| Kaine| 20|20000.0| true| MANAGER| 2| 19000.0|
| Lisa| 19|18000.0|false|SALESMAN| 2| 19000.0|
+-------+---+-------+-----+--------+------+------------------+
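The same windowed average can be expressed with the DataFrame API instead of SQL, using Window from org.apache.spark.sql.expressions. A sketch that reuses the t_user view registered above:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col}
// Define the window: one partition per department
val byDept = Window.partitionBy(col("deptno"))
spark.table("t_user")
  .withColumn("avgSalary", avg(col("salary")).over(byDept))
  .show()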
ROW_NUMBER()
Rank employees' salaries within their department:
spark.sql("select * , ROW_NUMBER() over(partition by deptno order by salary DESC) as rank from t_user")
.show()
+-------+---+-------+-----+--------+------+----+
| name|age| salary| sex| job|deptno|rank|
+-------+---+-------+-----+--------+------+----+
|Michael| 29|20000.0| true| MANAGER| 1| 1|
| Andy| 30|15000.0| true|SALESMAN| 1| 2|
| Justin| 19| 8000.0| true| CLERK| 1| 3|
| Kaine| 20|20000.0| true| MANAGER| 2| 1|
| Lisa| 19|18000.0|false|SALESMAN| 2| 2|
+-------+---+-------+-----+--------+------+----+
Rank employees' salaries across all employees of the company:
spark.sql("select * , ROW_NUMBER() over(order by salary DESC) as rank from t_user")
.show()
+-------+---+-------+-----+--------+------+----+
| name|age| salary| sex| job|deptno|rank|
+-------+---+-------+-----+--------+------+----+
|Michael| 29|20000.0| true| MANAGER| 1| 1|
| Kaine| 20|20000.0| true| MANAGER| 2| 2|
| Lisa| 19|18000.0|false|SALESMAN| 2| 3|
| Andy| 30|15000.0| true|SALESMAN| 1| 4|
| Justin| 19| 8000.0| true| CLERK| 1| 5|
+-------+---+-------+-----+--------+------+----+
As the result shows, ROW_NUMBER() simply numbers the rows in the order produced by the window; rows with equal salaries still get different numbers, so it does not produce a real ranking.
DENSE_RANK()
Rank employees' salaries company-wide:
val sql="select * , DENSE_RANK() over(order by salary DESC) rank from t_emp"
spark.sql(sql).show()
+-------+---+-------+-----+--------+------+----+
| name|age| salary| sex| job|deptno|rank|
+-------+---+-------+-----+--------+------+----+
|Michael| 29|20000.0| true| MANAGER| 1| 1|
| Jimi| 25|20000.0| true|SALESMAN| 1| 1|
| Kaine| 20|20000.0| true| MANAGER| 2| 1|
| Lisa| 19|18000.0|false|SALESMAN| 2| 2|
| Andy| 30|15000.0| true|SALESMAN| 1| 3|
| Justin| 19| 8000.0| true| CLERK| 1| 4|
+-------+---+-------+-----+--------+------+----+
Rank employees' salaries within their department:
val sql="select * , DENSE_RANK() over(partition by deptno order by salary DESC) rank from t_emp"
spark.sql(sql).show()
+-------+---+-------+-----+--------+------+----+
| name|age| salary| sex| job|deptno|rank|
+-------+---+-------+-----+--------+------+----+
|Michael| 29|20000.0| true| MANAGER| 1| 1|
| Kaine| 20|20000.0| true| MANAGER| 2| 1|
| Lisa| 19|18000.0|false|SALESMAN| 2| 2|
| Andy| 30|15000.0| true|SALESMAN| 1| 3|
| Justin| 19| 8000.0| true| CLERK| 1| 4|
+-------+---+-------+-----+--------+------+----+
RANK()
This function is similar to DENSE_RANK(); the difference is that RANK() leaves gaps in the ranking after ties. For the salaries 20000, 20000, 18000, DENSE_RANK() yields 1, 1, 2 while RANK() yields 1, 1, 3.
Rank employees' salaries within their department:
val sql="select * , RANK() over(partition by deptno order by salary DESC) rank from t_user"
spark.sql(sql).show()
+-------+---+-------+-----+--------+------+----+
| name|age| salary| sex| job|deptno|rank|
+-------+---+-------+-----+--------+------+----+
|Michael| 29|20000.0| true| MANAGER| 1| 1|
| Kaine| 20|20000.0| true| MANAGER| 2| 1|
| Lisa| 19|18000.0|false|SALESMAN| 2| 3|
| Andy| 30|15000.0| true|SALESMAN| 1| 4|
| Justin| 19| 8000.0| true| CLERK| 1| 5|
+-------+---+-------+-----+--------+------+----+
User-defined functions
Scalar (single-row) functions
spark.udf.register("yearSalary",(job:String,salary:Double)=> {
job match {
case "MANAGER" => salary * 14
case "SALESMAN" => salary * 16
case "CLERK" => salary * 13
case _ => salary*12
}
})
spark.sql("select name,salary, yearSalary(job,salary) as yearSalary from t_user")
.show()
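The registered function can also be used from the DataFrame API, either by calling it by name with callUDF or by wrapping a plain Scala function with udf. A sketch against the t_user view used above:
import org.apache.spark.sql.functions.{callUDF, col, udf}
// 1) call the SQL-registered function by name
spark.table("t_user")
  .withColumn("yearSalary", callUDF("yearSalary", col("job"), col("salary")))
  .show()
// 2) wrap a Scala function directly, without registering it in the SQL catalog
val bonus = udf((salary: Double) => salary * 0.1)
spark.table("t_user").withColumn("bonus", bonus(col("salary"))).show()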
Aggregate functions
Untyped aggregation (Spark SQL)
- order.txt
1,苹果,4.5,2,001
2,橘子,2.5,5,001
3,机械键盘,800,1,002
- MySumAggregateFunction
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, DoubleType, IntegerType, StructType}
class MySumAggregateFunction extends UserDefinedAggregateFunction{
//Schema of the input arguments; the field names themselves do not matter
override def inputSchema: StructType = {
new StructType().add("price",DoubleType).add("count",IntegerType)
}
//Schema of the intermediate aggregation buffer
override def bufferSchema: StructType = {
new StructType().add("totalCost",DoubleType)
}
// Type of the final result value
override def dataType: DataType = DoubleType
//Usually nothing special is needed; returning true declares the function deterministic
override def deterministic: Boolean = true
//Set the initial value of the aggregation buffer
override def initialize(buffer: MutableAggregationBuffer): Unit = {
//The first (and only) buffer field starts at 0.0
buffer.update(0,0.0)
}
//Partial aggregation: fold one input row into the buffer
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val price = input.getAs[Double](0)
val count = input.getAs[Int](1)
val historyCost = buffer.getDouble(0)
buffer.update(0,historyCost+(price*count))
}
//Merge two partial buffers; the merged value must be written back to buffer1
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
var totalCost=buffer1.getDouble(0)+buffer2.getDouble(0)
buffer1.update(0,totalCost)
}
//Produce the final result
override def evaluate(buffer: Row): Any = {
buffer.getDouble(0)
}
}
- Total spending per userid
case class OrderLog(price: Double, count: Int,userid:String)
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("hellosql")
.master("local[10]")
.getOrCreate()
import spark.implicits._
var orderDF= spark.sparkContext.textFile("file:///D:/order.txt")
.map(_.split(","))
.map(arr=> OrderLog(arr(2).toDouble,arr(3).toInt,arr(4)))
.toDF().createTempView("t_order")
spark.udf.register("customsum",new MySumAggregateFunction())
spark.sql("select userid , customsum(price,count) as totalCost from t_order group by userid").show()
//Suppress Spark log output
spark.sparkContext.setLogLevel("FATAL")
spark.stop()
}
- Result
+------+---------+
|userid|totalCost|
+------+---------+
| 001| 21.5|
| 002| 800.0|
+------+---------+
Typed (strongly typed) aggregation (DataFrame API)
- AverageState
case class AverageState(var sum: Double, var total: Int)
- MyAggregator
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.expressions.Aggregator
class MyAggregator extends Aggregator[GenericRowWithSchema,AverageState,Double] {
//Zero value of the aggregation
override def zero: AverageState = AverageState(0.0,0)
//Fold one input row into the intermediate value
override def reduce(b: AverageState, a: GenericRowWithSchema): AverageState ={
var sum=b.sum + a.getAs[Int]("count") * a.getAs[Double]("price")
var count=b.total+1
b.copy(sum,count)
}
//Merge two intermediate values
override def merge(b1: AverageState, b2: AverageState): AverageState = {
b1.copy(b1.sum+b2.sum,b1.total+b2.total)
}
//Turn the final intermediate value into the output
override def finish(reduction: AverageState): Double = {
reduction.sum/reduction.total
}
//Encoder for the intermediate value type
override def bufferEncoder: Encoder[AverageState] = {
Encoders.product[AverageState]
}
//Encoder for the output value type
override def outputEncoder: Encoder[Double] = {
Encoders.scalaDouble
}
}
- Usage
var orderDF= spark.sparkContext.textFile("file:///D:/order.txt")
.map(_.split(","))
.map(arr=> OrderLog(arr(2).toDouble,arr(3).toInt,arr(4)))
.toDF()
val avg = new MyAggregator().toColumn.name("avgCost")
orderDF.groupBy($"userid").agg(avg).show()
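The Aggregator above is written against GenericRowWithSchema so that it can be applied to an untyped DataFrame. When the orders are kept as a typed Dataset[OrderLog], the more common pattern is an Aggregator parameterized on the case class itself, used through groupByKey. A sketch under that assumption (reuses the OrderLog and AverageState case classes above and assumes import spark.implicits._):
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
object AvgCost extends Aggregator[OrderLog, AverageState, Double] {
  // The aggregator consumes OrderLog objects directly, so no Row handling is needed
  override def zero: AverageState = AverageState(0.0, 0)
  override def reduce(b: AverageState, o: OrderLog): AverageState =
    AverageState(b.sum + o.price * o.count, b.total + 1)
  override def merge(b1: AverageState, b2: AverageState): AverageState =
    AverageState(b1.sum + b2.sum, b1.total + b2.total)
  override def finish(r: AverageState): Double = r.sum / r.total
  override def bufferEncoder: Encoder[AverageState] = Encoders.product[AverageState]
  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
val orderDS = spark.sparkContext.textFile("file:///D:/order.txt")
  .map(_.split(","))
  .map(arr => OrderLog(arr(2).toDouble, arr(3).toInt, arr(4)))
  .toDS()
orderDS.groupByKey(_.userid)
  .agg(AvgCost.toColumn.name("avgCost"))
  .show()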