Creating Temporary Views
First, be clear about the difference between a SparkSession and a Spark application: one Spark application can contain multiple SparkSessions. (The SparkContext and SQLContext are wrapped inside the session.)
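As a quick sketch of this relationship (assuming an existing local SparkSession named spark), sessions created with newSession() belong to the same application and share one SparkContext, but each has its own temporary-view catalog:

```scala
// Sketch: two sessions within one Spark application.
// Both share the same underlying SparkContext.
val spark2 = spark.newSession()
assert(spark.sparkContext eq spark2.sparkContext) // same application

// But their temporary-view catalogs are independent:
spark.range(1).createTempView("t")
println(spark.catalog.tableExists("t"))  // true in this session
println(spark2.catalog.tableExists("t")) // false in the other session
```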
createOrReplaceTempView() creates a temporary view tied to the current SparkSession; the view is dropped when the session ends.
df.createOrReplaceTempView("tempViewName")
It can also be dropped manually, or by stopping the session:
spark.catalog.dropTempView("tempViewName")
createGlobalTempView()
Creates a global temporary view, which is tied to the Spark application (it is registered under the global_temp database).
spark.catalog.dropGlobalTempView("tempViewName")
The following example illustrates how both kinds of views are used; a global temporary view is suitable for sharing across multiple sessions.
import org.apache.spark.sql.SparkSession

object NewSessionApp {
  def main(args: Array[String]): Unit = {
    val logFile = "data/README.md" // Should be some file on your system
    val spark = SparkSession.builder
      .appName("Simple Application")
      .master("local")
      .getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    logData.createGlobalTempView("logdata")
    spark.range(1).createTempView("foo")

    // Within the same session the foo table exists
    println("""spark.catalog.tableExists("foo") = """ + spark.catalog.tableExists("foo"))
    // spark.catalog.tableExists("foo") = true

    // For a new session the foo table does not exist
    val newSpark = spark.newSession()
    println("""newSpark.catalog.tableExists("foo") = """ + newSpark.catalog.tableExists("foo"))
    // newSpark.catalog.tableExists("foo") = false

    // Both sessions can access the logdata table
    spark.sql("SELECT * FROM global_temp.logdata").show()
    newSpark.sql("SELECT * FROM global_temp.logdata").show()

    spark.stop()
  }
}
Datasets
Datasets are similar to RDDs, but instead of Java or Kryo serialization they are serialized with specialized encoders. Although both encoders and standard serializers are responsible for turning objects into bytes, encoder-based serialization lets Spark perform many operations, such as filtering, sorting, and hashing, without deserializing the bytes back into objects.
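A small sketch of what an encoder carries (the Person case class here mirrors the example that follows): the encoder knows the schema of the class it serializes, which is what allows Spark to operate on the binary form directly.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

case class Person(name: String, age: Long)

// Encoders.product derives an encoder for a case class.
// It exposes the schema Spark uses for the serialized format,
// enabling filter/sort/hash without full deserialization.
val enc: Encoder[Person] = Encoders.product[Person]
println(enc.schema) // prints the StructType inferred from the case class
```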
case class Person(name:String, age:Long)
val caseClassDS = Seq(Person("Andy",32)).toDS()
caseClassDS: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
caseClassDS.show()
+----+---+
|name|age|
+----+---+
|Andy| 32|
+----+---+
val primitiveDS = Seq(1,2,3).toDS()
org.apache.spark.sql.Dataset[Int] = [value: int]
primitiveDS.map(_ + 1).collect()
Array[Int] = Array(2, 3, 4)
// A DataFrame can be converted to a Dataset by providing a class
val path = "example/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
Creating Datasets from RDDs
There are two ways: the first infers the schema by reflection; the second constructs a schema programmatically through an interface.
The following creates an RDD of Person objects and converts it to a DataFrame:
case class Person(name: String, age: Long)
val peopleDF = spark.sparkContext
  .textFile("file:///home/hadoop/software/spark/spark-2.4.4-bin-hadoop2.7/examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()
peopleDF.createOrReplaceTempView("people")
// Run SQL with the sql method supported by SparkSession
val teenagersDF = spark.sql("SELECT name,age FROM people WHERE age BETWEEN 13 AND 19")
// Access a column by index
teenagersDF.map(teenager => "Name:" + teenager(0)).show()
// Access a column by field name
teenagersDF.map(teenager => "Name:" + teenager.getAs[String]("name")).show()
// Dataset[Map[K,V]] has no predefined encoder; define an explicit one here
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
// Primitive types and case classes can also be given an encoder like this
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()
// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name","age"))).collect()
Programmatically Specifying the Schema
If case classes cannot be defined ahead of time, a DataFrame can be created in three steps:
1. Create an RDD of Rows from the original RDD
2. Create the schema represented by a StructType matching the structure of the Rows in the RDD from step 1
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession
// StructField StructType
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val peopleRDD = spark.sparkContext.textFile("file:///home/hadoop/software/spark/spark-2.4.4-bin-hadoop2.7/examples/src/main/resources/people.txt")
val schemaString = "name age"
// Build a StructField(name, StringType, nullable) for each field name
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val rowRDD = peopleRDD.map(_.split(",")).map(attributes => Row(attributes(0),attributes(1).trim))
val peopleDF = spark.createDataFrame(rowRDD,schema)
peopleDF.createOrReplaceTempView("people")
val results = spark.sql("SELECT name FROM people")
// The result of a SQL query is a DataFrame and supports all the normal RDD operations
// The columns of a row can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
Aggregations
The built-in DataFrame functions provide common aggregations such as count(), countDistinct(), avg(), max(), and min(). Although these functions are designed for DataFrames, Spark SQL also has type-safe versions of some of them in Scala and Java for working with strongly typed Datasets. In addition, users can create their own aggregation functions.
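For example, the built-in aggregates can be used directly through the DataFrame API (a sketch assuming a DataFrame df with a salary column; the DataFrame and column name are illustrative):

```scala
import org.apache.spark.sql.functions.{avg, countDistinct, max, min}

// Built-in aggregate functions over a hypothetical "salary" column
df.agg(
  avg("salary"),
  max("salary"),
  min("salary"),
  countDistinct("salary")
).show()
```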
Untyped User-Defined Aggregate Functions
Users implement a custom untyped aggregate function by extending the UserDefinedAggregateFunction abstract class. For example, a user-defined average could look like:
import org.apache.spark.sql.{Row,SparkSession}
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types._
object MyAverage extends UserDefinedAggregateFunction {
  // Data types of the input arguments of this aggregate function
  def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
  // Data types of the values in the aggregation buffer
  def bufferSchema: StructType =
    StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)
  // The data type of the returned value
  def dataType: DataType = DoubleType
  // Whether this function always returns the same output for the same input
  def deterministic: Boolean = true
  // Initialize the aggregation buffer. The buffer itself is a Row whose values
  // can be retrieved by index with the standard methods; note that arrays and
  // maps inside the buffer are still immutable.
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0L
  }
  // Update the given aggregation buffer with new input data
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + input.getLong(0)
      buffer(1) = buffer.getLong(1) + 1
    }
  }
  // Merge two aggregation buffers and store the result into buffer1
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // Calculate the final result
  def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}
spark.udf.register("myAverage",MyAverage)
val df = spark.read.json("file:///home/hadoop/software/spark/spark-2.4.4-bin-hadoop2.7/examples/src/main/resources/employees.json")
df.createOrReplaceTempView("employees")
df.show()
val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")
result.show()
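Besides SQL, the registered function can also be invoked through the DataFrame API; a sketch using callUDF with the name registered above:

```scala
import org.apache.spark.sql.functions.callUDF

// Same aggregation via the DataFrame API instead of a SQL string
df.agg(callUDF("myAverage", df("salary")).as("average_salary")).show()
```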
Strongly Typed User-Defined Aggregate Functions
import org.apache.spark.sql.{Encoder,Encoders,SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import spark.implicits._
case class Employee(name:String, salary:Long)
case class Average(var sum:Long, var count:Long)
object MyAverage extends Aggregator[Employee,Average,Double]{
// zero should satisfy b + zero = b for any b
def zero: Average = Average(0L,0L)
// Combine two values; modify the buffer and return it
def reduce(buffer:Average,employee: Employee): Average = {
buffer.sum += employee.salary
buffer.count += 1
buffer
}
// Merge intermediate values across partitions
def merge(b1:Average,b2:Average): Average={
b1.sum += b2.sum
b1.count += b2.count
b1
}
// Compute the final result
def finish(reduction: Average): Double = reduction.sum.toDouble/reduction.count
// Specify the Encoder for the intermediate buffer value
def bufferEncoder: Encoder[Average] = Encoders.product
// Specify the Encoder for the final output value
def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
import spark.implicits._
val ds = spark.read.json("file:///home/hadoop/software/spark/spark-2.4.4-bin-hadoop2.7/examples/src/main/resources/employees.json").as[Employee]
ds.show()
// In spark-shell this line may throw java.io.NotSerializableException: org.apache.spark.sql.TypedColumn.
// The official example wraps the whole flow in an object, which is one way around the serialization problem:
// https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/sql/UserDefinedTypedAggregation.scala
// No better workaround has been found yet; leaving this issue open for now.
val averageSalary = MyAverage.toColumn.name("average_salary")
val result = ds.select(averageSalary)
result.show()
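Following the linked official example, one workaround for the NotSerializableException noted above is to run the aggregation from a compiled application wrapped in a top-level object instead of line by line in spark-shell. A minimal sketch, assuming the Employee and MyAverage definitions above (the object and app names are illustrative):

```scala
// Sketch modeled on the linked UserDefinedTypedAggregation example:
// wrapping everything in one object avoids capturing the spark-shell
// line wrapper in the closure, which is what triggered the exception.
object TypedAverageApp {
  def main(args: Array[String]): Unit = {
    val spark = org.apache.spark.sql.SparkSession.builder
      .appName("TypedAverage")
      .master("local")
      .getOrCreate()
    import spark.implicits._
    val ds = spark.read
      .json("examples/src/main/resources/employees.json") // path is illustrative
      .as[Employee]
    ds.select(MyAverage.toColumn.name("average_salary")).show()
    spark.stop()
  }
}
```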