02_SparkSQL
1. Spark SQL Overview
1.1 Definition of Spark SQL
Spark SQL is a module (library) built on top of Spark Core for processing structured data.
It provides a programming abstraction called DataFrame/Dataset, which can be understood as a higher-level data model built on the RDD data model, carrying structured metadata (a schema). A DataFrame is in fact just Dataset[Row]. Spark SQL translates SQL-style operations on a DataFrame/Dataset into execution plans made up of RDD operators, which greatly simplifies data-processing code (compare with Hive).
DataFrame = RDD + Schema
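To make "DataFrame = RDD + Schema" concrete, a schema can also be attached explicitly with a StructType rather than inferred from a case class. A minimal sketch; the sample data and object name are illustrative, not from the original example:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object SchemaDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SchemaDemo")
      .master("local[*]")
      .getOrCreate()
    // An RDD of untyped rows (hypothetical sample data)
    val rowRDD: RDD[Row] = spark.sparkContext.parallelize(Seq(
      Row("Tom", 18, 99.9),
      Row("Jerry", 20, 88.8)
    ))
    // The schema: field names, types, and nullability
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = false),
      StructField("fv", DoubleType, nullable = false)
    ))
    // DataFrame = RDD + Schema
    val df: DataFrame = spark.createDataFrame(rowRDD, schema)
    df.printSchema()
    spark.stop()
  }
}
```

This explicit form is useful when the row structure is only known at runtime and a case class cannot be defined ahead of time.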
```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object SQLDemo01 {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession
    val spark: SparkSession = SparkSession.builder()
      .appName(this.getClass.getSimpleName)
      .master("local[*]")
      .getOrCreate()
    // Create an RDD
    val lines: RDD[String] = spark.sparkContext.textFile("data/boy.txt")
    // Associate the RDD with a schema; the result is still an RDD
    val boyRDD: RDD[Boy] = lines.map(line => {
      val fields = line.split(",")
      Boy(fields(0), fields(1).toInt, fields(2).toDouble)
    })
    // Convert the schema-associated RDD into a DataFrame (Dataset)
    import spark.implicits._
    val df: DataFrame = boyRDD.toDF
    // Print the schema
    df.printSchema()
    // Register a temporary view
    df.createTempView("v_boy")
    // Write SQL
    val res: DataFrame = spark.sql("select name, age, fv from v_boy order by fv desc, age asc")
    // Print the execution plan
    res.explain(true)
    // Trigger an action
    res.show()
    spark.stop()
  }
}

// The case class's members define the field names and types
case class Boy(name: String, age: Int, fv: Double)
```
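The same query can also be expressed with the DataFrame DSL instead of a SQL string. A minimal sketch, assuming the `spark` session and `df` from the example above are in scope:

```scala
import org.apache.spark.sql.functions.col

// Equivalent to: select name, age, fv from v_boy order by fv desc, age asc
val res2: DataFrame = df
  .select("name", "age", "fv")
  .orderBy(col("fv").desc, col("age").asc)
res2.show()
```

Both forms are analyzed into the same logical plan by Catalyst, so the choice between SQL and the DSL is largely a matter of style and compile-time checking.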
The result after execution:
Printing the DataFrame's schema:

```scala
// Print the schema
df.printSchema()
```
The schema output:

```
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)
 |-- fv: double (nullable = false)
```

The printed execution plan is as follows:
```
== Parsed Logical Plan ==
'Sort ['fv DESC NULLS LAST, 'age ASC NULLS FIRST], true
+- 'Project ['name, 'age, 'fv]
   +- 'UnresolvedRelation [v_boy], [], false

== Analyzed Logical Plan ==
name: string, age: int, fv: double
Sort [fv#6 DESC NULLS LAST, age#5 ASC NULLS FIRST], true
+- Project [name#4, age#5, fv#6]
   +- SubqueryAlias v_boy
      +- View (`v_boy`, [name#4,age
```
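Since a DataFrame is just Dataset[Row], it can also be converted back into a strongly typed Dataset. A minimal sketch, assuming the `spark` session, `df`, and the `Boy` case class from the example above:

```scala
import org.apache.spark.sql.Dataset
import spark.implicits._

// DataFrame (Dataset[Row]) -> typed Dataset[Boy]
val ds: Dataset[Boy] = df.as[Boy]
// Typed transformations are checked at compile time
val adults: Dataset[Boy] = ds.filter(b => b.age >= 18)
adults.show()
```

With the typed API, referring to a nonexistent field (e.g. `b.agee`) fails at compile time, whereas in a SQL string it would only fail during analysis at runtime.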