Getting Started with Spark (6): A Complete Guide to Spark SQL Operators (Part 1)

Spark SQL is an ETL (Extract, Transform, Load) tool built on top of RDDs. On top of the RDD abstraction, Spark SQL provides two classes, Dataset and DataFrame, which offer RDD-like functionality: users can still apply higher-order operators such as map, flatMap, and filter, and can also run column-based named queries. In other words, Dataset/DataFrame exposes two sets of data-manipulation APIs. These APIs give the Spark engine more information about the computation, which the system can use to optimize execution. Spark SQL currently offers two ways to interact with it:

  1. SQL scripts
  2. The Dataset API (strongly typed and untyped operations)

Introduction to Datasets & DataFrames

A Dataset is a distributed collection of data. The Dataset API was introduced in Spark 1.6; it is built on top of RDDs (strongly typed, usable with lambda expressions) while also benefiting from Spark SQL's optimized execution engine, so data transformations written against a Dataset improve on raw RDD operators in both functionality and performance. You can therefore think of a Dataset as an enhanced RDD. Besides being created from JVM arrays and collections, a Dataset can also be obtained by converting any RDD.
Python does not have support for the Dataset API.

A DataFrame is a special case of a Dataset. Whereas a Dataset can hold elements of any object type, a DataFrame's elements are all of a single type, Row, and Row-based querying closely resembles working with a ResultSet in a traditional database. A Row represents one element of the DataFrame, analogous to a row in a database table, and its fields can be accessed by index or by column name. The Dataset API's language support is less complete, while the DataFrame API is fully supported across Scala, Java, R, and Python.
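
The two access styles described above (by ordinal and by column name) can be sketched with a hypothetical miniature Row in plain Scala. This is not Spark's actual `Row` class (which lives in `org.apache.spark.sql`); it only illustrates the access semantics:

```scala
// Hypothetical stand-in for Spark's Row: a value sequence plus a schema of column names.
case class MiniRow(schema: Seq[String], values: Seq[Any]) {
  // access by 0-based ordinal, like row.get(i) / row(i) in Spark
  def apply(i: Int): Any = values(i)
  // access by column name, like row.getAs[T]("name") in Spark
  def getAs[T](col: String): T = values(schema.indexOf(col)).asInstanceOf[T]
}

val row = MiniRow(Seq("id", "name", "age"), Seq(1, "zhangsan", 18))
println(row(1))                // zhangsan
println(row.getAs[Int]("age")) // 18
```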

Quick Start

  • Add the dependencies
<dependency>
	<groupId>org.apache.spark</groupId>
	<artifactId>spark-core_2.11</artifactId>
	<version>2.4.3</version>
</dependency>
<dependency>
	<groupId>org.apache.spark</groupId>
	<artifactId>spark-sql_2.11</artifactId>
	<version>2.4.3</version>
</dependency>
  • Word count (untyped)
import org.apache.spark.sql.{Dataset, SparkSession}

def main(args: Array[String]): Unit = {
		//1. create the SparkSession
		val spark = SparkSession
			.builder()
			.appName("wordcount")
			.master("local[6]")
			.getOrCreate()

		//2. import Spark's implicit conversions and enrichments
		import spark.implicits._

		//3. create the Dataset
		val lines = Array("this is a demo", "hello spark")
		val wordRDD = spark.sparkContext
			.makeRDD(lines)
			.flatMap(_.split("\\s+"))
			.map((_, 1))

		val ds: Dataset[(String, Int)] = wordRDD.toDS()

		//4. run SQL operators on the Dataset
		ds.groupBy($"_1")
			//untyped operations
			.sum("_2")
			.withColumnRenamed("_1", "word")
			.withColumnRenamed("sum(_2)", "total")
			.show()
		//5. stop spark
		spark.stop()
}
  • Word count (strongly typed)
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.expressions.scalalang.typed

def main(args: Array[String]): Unit = {
		//1. create the SparkSession
		val spark = SparkSession
			.builder()
			.appName("wordcount")
			.master("local[6]")
			.getOrCreate()

		//2. import Spark's implicit conversions and enrichments
		import spark.implicits._

		//3. create the Dataset
		val lines = Array("this is a demo", "hello spark")
		val wordRDD = spark.sparkContext
			.makeRDD(lines)
			.flatMap(_.split("\\s+"))
			.map((_, 1))
		val ds: Dataset[(String, Int)] = wordRDD.toDS()

		//4. run SQL operators on the Dataset
		ds.groupByKey(t => t._1)
			.agg(typed.sum[(String, Int)](tuple => tuple._2).name("total"))
			.show()

		//5. stop spark
		spark.stop()
}

Dataset & DataFrame in Practice

Creating a Dataset

A Dataset is similar to an RDD, except that Spark SQL has its own serialization framework, called Encoders, independent of Spark RDD serialization (Java/Kryo). Unlike RDD serializers, Encoders are aware of the data's structure: since Datasets support untyped operations, users do not need the element type to operate on the data and refer only to column names, so Spark SQL can skip the deserialization step when executing such operators, which improves execution efficiency.

case-class

// case-class
case class Person(id: Int, name: String, age: Int, sex: Boolean)
/**
* case-class
*/
val person: Dataset[Person] = List(
   Person(1, "zhangsan", 18, true),
   Person(2, "lisi", 28, true)
)
   .toDS()
person.select($"id", $"name")
   .show()

Notes:

  • The implicit conversions import, import spark.implicits._, must be placed after the SparkSession has been created
  • The definition of case class Person(id: Int, name: String, age: Int, sex: Boolean) must be placed outside the method's scope (i.e. at the member level of the Scala/Java class)
  • If the class is written in Java, the Java bean must implement Serializable and provide getters and setters

Result:

+---+--------+
| id|    name|
+---+--------+
|  1|zhangsan|
|  2|    lisi|
+---+--------+

Tuple

/**
 * tuple
 */
val person: Dataset[(Int, String, Int, Boolean)] = List(
	(1, "zhangsan", 18, true),
	(2, "lisi", 28, true)
)
	.toDS()
person.select($"_1", $"_2").show()

Result:

+---+--------+
| _1|      _2|
+---+--------+
|  1|zhangsan|
|  2|    lisi|
+---+--------+

JSON data

Data:

{"name":"张三","age":18}
{"name":"李四","age":28}
{"name":"王五","age":38}
//numeric JSON values are inferred as Long by default
case class User(name: String, age: Long)
/**
 * json格式
 */
spark.read
	.json("file:///Users/mashikang/IdeaProjects/spark_sql/src/main/resources/json/")
	.as[User]
	.show()

Result:

+---+----+
|age|name|
+---+----+
| 18|张三|
| 28|李四|
| 38|王五|
+---+----+

RDD

  • Tuple
/**
 * RDD of tuples
 */
val userRDD = spark.sparkContext.makeRDD(List((1,"张三",true,18,15000.0)))
userRDD.toDS().show()

Result:

+---+----+----+---+-------+
| _1|  _2|  _3| _4|     _5|
+---+----+----+---+-------+
|  1|张三|true| 18|15000.0|
+---+----+----+---+-------+
  • case-class
// case-class
case class Person(id: Int, name: String, age: Int, sex: Boolean)
/**
 * RDD case-class
 */
val userRDD = spark.sparkContext.makeRDD(List(Person(1,"张三",18,true)))
userRDD.toDS().show()

Result:

+---+----+---+----+
| id|name|age| sex|
+---+----+---+----+
|  1|张三| 18|true|
+---+----+---+----+

Creating a DataFrame

A DataFrame is a dataset organized into named columns. Users can operate on columns directly, so almost all recommended DataFrame operations are untyped. A DataFrame can also be viewed as a dataset of type Dataset[Row].

JSON file

/**
 * json
 */
val dataFrame: DataFrame = spark.read
	.json("file:///Users/mashikang/IdeaProjects/spark_sql/src/main/resources/json/")
dataFrame.printSchema()
dataFrame.show()

Result:

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

+---+----+
|age|name|
+---+----+
| 18|张三|
| 28|李四|
| 38|王五|
+---+----+

case-class

case class User(id:Int,name:String,sex:Boolean)
/**
 * case-class
 */
var userDF=List(User(1,"张三",true))
	.toDF()
userDF.show()

Result:

+---+----+----+
| id|name| sex|
+---+----+----+
|  1|张三|true|
+---+----+----+

Tuple

/**
 * Tuple
 */
var userDF=List((1,"张三",true))
	.toDF("id","name","sex") 
userDF.show()

Result:

+---+----+----+
| id|name| sex|
+---+----+----+
|  1|张三|true|
+---+----+----+

RDD conversions

  • Tuple
/**
 * RDD Tuple
 */
var userDF = spark.sparkContext.parallelize(List((1, "张三", true)))
	.toDF("id", "name", "sex") // column names can be specified
userDF.show()

Result:

+---+----+----+
| id|name| sex|
+---+----+----+
|  1|张三|true|
+---+----+----+
  • case-class
/**
 * RDD case-class
 */
var userDF = spark.sparkContext.parallelize(List(User(1, "张三", true)))
	.toDF()
userDF.show()

Result:

+---+----+----+
| id|name| sex|
+---+----+----+
|  1|张三|true|
+---+----+----+
  • Converting an RDD[Row] to a DataFrame
/**
 * RDD[Row] conversion
 */
var userRDD: RDD[Row] = spark.sparkContext.parallelize(List(User(1, "张三", true)))
	.map(u => Row(u.id, u.name, u.sex))
var schema = new StructType()
	.add("id", IntegerType)
	.add("name", StringType)
	.add("sex", BooleanType)
var userDF = spark.createDataFrame(userRDD, schema)
userDF.show()

Result:

+---+----+----+
| id|name| sex|
+---+----+----+
|  1|张三|true|
+---+----+----+
  • RDD case-class
/**
 * RDD case-class
 */
var userRDD: RDD[User] = spark.sparkContext
	.makeRDD(List(User(1, "张三", true)))
var userDF = spark.createDataFrame(userRDD)
userDF.show()

Result:

+---+----+----+
| id|name| sex|
+---+----+----+
|  1|张三|true|
+---+----+----+
  • RDD Tuple
/**
 * RDD Tuple
 */
var userRDD:RDD[(Int,String,Boolean)]=spark.sparkContext
	.makeRDD(List((1,"张三",true)))
var userDF=spark.createDataFrame(userRDD)
userDF.show()

Result:

+---+----+----+
| _1|  _2|  _3|
+---+----+----+
|  1|张三|true|
+---+----+----+

DataFrame Operations (Untyped)

printSchema prints the DataFrame's schema (column names and types)

var df = List((1, "张三", true))
	.toDF("id", "name", "sex")
df.printSchema()
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- sex: boolean (nullable = false)

show

var df = List(
	(1, "zs", true, 1, 15000),
	(2, "ls", false, 1, 15000)
)
	.toDF("id", "name", "sex", "dept", "salary")
	
df.select($"id", $"name", $"salary")
	.show()
+---+----+------+
| id|name|salary|
+---+----+------+
|  1|  zs| 15000|
|  2|  ls| 15000|
+---+----+------+

select

var df = List(
	(1, "zs", true, 1, 15000),
	(2, "ls", false, 1, 18000)
)
	.toDF("id", "name", "sex", "dept", "salary")
df.select($"id",$"name",$"sex",$"dept",$"salary" * 12 as "annual_salary")
	.show()
+---+----+-----+----+-------------+
| id|name|  sex|dept|annual_salary|
+---+----+-----+----+-------------+
|  1|  zs| true|   1|       180000|
|  2|  ls|false|   1|       216000|
+---+----+-----+----+-------------+

selectExpr

var df = List(
	(1, "zs", true, 1, 15000),
	(2, "ls", false, 1, 18000)
)
	.toDF("id", "name", "sex", "dept", "salary")
//equivalent to df.select($"id",$"name",$"sex",$"dept",$"salary" * 12 as "annual_salary")
df.selectExpr("id", "name", "sex", "dept", "salary * 12 as annual_salary")
	.show()
+---+----+-----+----+-------------+
| id|name|  sex|dept|annual_salary|
+---+----+-----+----+-------------+
|  1|  zs| true|   1|       180000|
|  2|  ls|false|   1|       216000|
+---+----+-----+----+-------------+

withColumn

var df = List(
	(1, "zs", true, 1, 15000),
	(2, "ls", false, 1, 18000)
)
	.toDF("id", "name", "sex", "dept", "salary")
df.select($"id", $"name", $"sex", $"dept", $"salary")
	.withColumn("annual_salary", $"salary" * 12)
	.show()
+---+----+-----+----+------+-------------+
| id|name|  sex|dept|salary|annual_salary|
+---+----+-----+----+------+-------------+
|  1|  zs| true|   1| 15000|       180000|
|  2|  ls|false|   1| 18000|       216000|
+---+----+-----+----+------+-------------+

withColumnRenamed

var df = List(
	(1, "zs", true, 1, 15000),
	(2, "ls", false, 1, 18000)
)
	.toDF("id", "name", "sex", "dept", "salary")

df.select($"id", $"name", $"sex", $"dept", $"salary")
	.withColumn("annual_salary", $"salary" * 12)
	.withColumnRenamed("dept", "department")
	.withColumnRenamed("name", "username")
	.show()
+---+--------+-----+----------+------+-------------+
| id|username|  sex|department|salary|annual_salary|
+---+--------+-----+----------+------+-------------+
|  1|      zs| true|         1| 15000|       180000|
|  2|      ls|false|         1| 18000|       216000|
+---+--------+-----+----------+------+-------------+

drop

var df = List(
	(1, "zs", true, 1, 15000),
	(2, "ls", false, 1, 18000)
)
	.toDF("id", "name", "sex", "dept", "salary")

df.select($"id",$"name",$"sex",$"dept",$"salary")
	.withColumn("annual_salary",$"salary" * 12)
	.withColumnRenamed("dept","department")
	.withColumnRenamed("name","username")
	.drop("sex")
	.show()
+---+--------+----------+------+-------------+
| id|username|department|salary|annual_salary|
+---+--------+----------+------+-------------+
|  1|      zs|         1| 15000|       180000|
|  2|      ls|         1| 18000|       216000|
+---+--------+----------+------+-------------+

dropDuplicates

var df = List(
	(1, "zs", true, 1, 15000),
	(2, "ls", false, 1, 18000),
	(3, "ww", false, 1, 19000),
	(4, "zl", false, 1, 18000)
)
	.toDF("id", "name", "sex", "dept", "salary")
df.select($"id", $"name", $"sex", $"dept", $"salary")
	.dropDuplicates("sex", "salary")
	.show()
+---+----+-----+----+------+
| id|name|  sex|dept|salary|
+---+----+-----+----+------+
|  3|  ww|false|   1| 19000|
|  1|  zs| true|   1| 15000|
|  2|  ls|false|   1| 18000|
+---+----+-----+----+------+
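
The semantics of dropDuplicates over a subset of columns can be mimicked with plain Scala collections: keep one row per (sex, salary) key. This is only a conceptual analogue, not Spark code; note that on a multi-partition DataFrame, which duplicate survives is not guaranteed:

```scala
// same sample data as the DataFrame above: (id, name, sex, dept, salary)
val rows = List(
  (1, "zs", true, 1, 15000),
  (2, "ls", false, 1, 18000),
  (3, "ww", false, 1, 19000),
  (4, "zl", false, 1, 18000)   // duplicate (sex, salary) key -> dropped
)

val deduped = rows
  .groupBy(r => (r._3, r._5))  // key on (sex, salary), like dropDuplicates("sex", "salary")
  .values
  .map(_.head)                 // keep the first occurrence per key
  .toList
  .sortBy(_._1)
// ids 1, 2, 3 survive; id 4 repeats the (false, 18000) key
```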

orderBy|sort

var df = List(
	(1, "zs", true, 1, 15000),
	(2, "ls", false, 2, 18000),
	(3, "ww", false, 2, 14000),
	(4, "zl", false, 1, 18000),
	(5, "zl", false, 1, 16000)
)
	.toDF("id", "name", "sex", "dept", "salary")
	
df.select($"id", $"name", $"sex", $"dept", $"salary")
	.orderBy($"salary".desc, $"id".asc)
	//.sort($"salary".desc, $"id".asc)
	.show()
+---+----+-----+----+------+
| id|name|  sex|dept|salary|
+---+----+-----+----+------+
|  2|  ls|false|   2| 18000|
|  4|  zl|false|   1| 18000|
|  5|  zl|false|   1| 16000|
|  1|  zs| true|   1| 15000|
|  3|  ww|false|   2| 14000|
+---+----+-----+----+------+

groupBy

var df = List(
	(1, "zs", true, 1, 15000),
	(2, "ls", false, 2, 18000),
	(3, "ww", false, 2, 14000),
	(4, "zl", false, 1, 18000),
	(5, "zl", false, 1, 16000)
)
	.toDF("id", "name", "sex", "dept", "salary")
	
df.select($"id", $"name", $"sex", $"dept", $"salary")
	.groupBy($"dept")
	.max("salary")
	.show()
+----+-----------+
|dept|max(salary)|
+----+-----------+
|   1|      18000|
|   2|      18000|
+----+-----------+

Similar aggregate operators include max, min, avg|mean, sum, and count.
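
To make the per-group semantics of these aggregates concrete, here is a plain-Scala analogue (ordinary collections, not Spark) of grouping the sample data by dept and computing max, avg, and count in one pass:

```scala
// (id, name, dept, salary) -- same sample data as above, minus the sex column
val employees = List(
  (1, "zs", 1, 15000.0),
  (2, "ls", 2, 18000.0),
  (3, "ww", 2, 14000.0),
  (4, "zl", 1, 18000.0),
  (5, "zl", 1, 16000.0)
)

// mirrors df.groupBy($"dept").agg(max("salary"), avg("salary"), count("id"))
val stats = employees
  .groupBy(_._3)                       // key: dept
  .map { case (dept, rows) =>
    val salaries = rows.map(_._4)
    dept -> (salaries.max, salaries.sum / salaries.size, rows.size)
  }
// dept 2 -> (max 18000.0, avg 16000.0, count 2)
```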

agg

var df = List(
	(1, "zs", true, 1, 15000),
	(2, "ls", false, 2, 18000),
	(3, "ww", false, 2, 14000),
	(4, "zl", false, 1, 18000),
	(5, "zl", false, 1, 16000)
)
	.toDF("id", "name", "sex", "dept", "salary")
	
import org.apache.spark.sql.functions._

df.select($"id", $"name", $"sex", $"dept", $"salary")
	.groupBy($"dept")
	.agg(max("salary") as "max_salary", avg("salary") as "avg_salary")
	.show()
+----+----------+------------------+
|dept|max_salary|        avg_salary|
+----+----------+------------------+
|   1|     18000|16333.333333333334|
|   2|     18000|           16000.0|
+----+----------+------------------+

agg can also be passed a Map of column-to-aggregate expressions

var df = List(
	(1, "zs", true, 1, 15000),
	(2, "ls", false, 2, 18000),
	(3, "ww", false, 2, 14000),
	(4, "zl", false, 1, 18000),
	(5, "zl", false, 1, 16000)
)
	.toDF("id", "name", "sex", "dept", "salary")
	
import org.apache.spark.sql.functions._

df.select($"id", $"name", $"sex", $"dept", $"salary")
	.groupBy($"dept")
	.agg(Map("salary"->"max","id"->"count"))
	.show()
+----+-----------+---------+
|dept|max(salary)|count(id)|
+----+-----------+---------+
|   1|      18000|        3|
|   2|      18000|        2|
+----+-----------+---------+

limit

var df = List(
	(1, "zs", true, 1, 15000),
	(2, "ls", false, 2, 18000),
	(3, "ww", false, 2, 14000),
	(4, "zl", false, 1, 18000),
	(5, "zl", false, 1, 16000)
)
	.toDF("id", "name", "sex", "dept", "salary")

df.select($"id", $"name", $"sex", $"dept", $"salary")
	.orderBy($"id".desc)
	.limit(4)
	.show()
+---+----+-----+----+------+
| id|name|  sex|dept|salary|
+---+----+-----+----+------+
|  5|  zl|false|   1| 16000|
|  4|  zl|false|   1| 18000|
|  3|  ww|false|   2| 14000|
|  2|  ls|false|   2| 18000|
+---+----+-----+----+------+

where

var df = List(
	(1, "zs", true, 1, 15000),
	(2, "ls", false, 2, 18000),
	(3, "ww", false, 2, 14000),
	(4, "zl", false, 1, 18000),
	(5, "win7", false, 1, 16000)
)
	.toDF("id", "name", "sex", "dept", "salary")
	
df.select($"id", $"name", $"sex", $"dept", $"salary")
	//where("(name like '%s%' and salary > 15000) or name = 'win7'") 
	.where(($"name" like "%s%" and $"salary" > 15000) or $"name" === "win7")
	.show()
+---+----+-----+----+------+
| id|name|  sex|dept|salary|
+---+----+-----+----+------+
|  2|  ls|false|   2| 18000|
|  5|win7|false|   1| 16000|
+---+----+-----+----+------+

pivot (rows to columns)

var scoreDF = List(
	(1, "math", 85),
	(1, "chinese", 80),
	(1, "english", 90),
	(2, "math", 90),
	(2, "chinese", 80)
)
	.toDF("id", "course", "score")
	
import org.apache.spark.sql.functions._

//select id,max(case course when 'math' then score else 0 end )as math ,max(case course when 'chinese' then score else 0 end) as chinese from t_course group by id;
scoreDF.selectExpr("id", "case course when 'math' then score else 0 end as math", "case course when 'chinese' then score else 0 end as chinese", "case course when 'english' then score else 0 end as english")
	.groupBy("id")
	.agg(max($"math"), max($"chinese"), max($"english"))
	.show()
+---+---------+------------+------------+
| id|max(math)|max(chinese)|max(english)|
+---+---------+------------+------------+
|  1|       85|          80|          90|
|  2|       90|          80|           0|
+---+---------+------------+------------+

Simpler form

var scoreRDD = List(
	(1, "math", 85),
	(1, "chinese", 80),
	(1, "english", 90),
	(2, "math", 90),
	(2, "chinese", 80)
)
scoreRDD.toDF("id", "course", "score")
	.groupBy("id")
	//     pivot column      candidate values
	.pivot("course", scoreRDD.map(t => t._2).distinct)
	.max("score")
	.show()
+---+----+-------+-------+
| id|math|chinese|english|
+---+----+-------+-------+
|  1|  85|     80|     90|
|  2|  90|     80|   null|
+---+----+-------+-------+
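
The pivot above can also be mimicked with plain Scala collections, which makes the row-to-column mechanics visible: group by id, then look up each candidate course's score, with None standing in for the nulls Spark produces. An illustrative analogue only, not Spark internals:

```scala
// (id, course, score) -- same sample data as above
val scores = List(
  (1, "math", 85), (1, "chinese", 80), (1, "english", 90),
  (2, "math", 90), (2, "chinese", 80)
)
// the pivot's candidate values, in first-seen order: math, chinese, english
val courses = scores.map(_._2).distinct

// one wide row per id; a missing (id, course) pair becomes None (null in Spark)
val wide = scores.groupBy(_._1).map { case (id, rows) =>
  val byCourse = rows.map(r => r._2 -> r._3).toMap
  (id, courses.map(c => byCourse.get(c)))
}
// id 2 has no english score -> None in the english column
```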

na (replacing null values)

var scoreRDD = List(
	(1, "math", 85),
	(1, "chinese", 80),
	(1, "english", 90),
	(2, "math", 90),
	(2, "chinese", 80),
	(3, "math", 100)
)

scoreRDD.toDF("id", "course", "score")
	.groupBy("id")
	//     pivot column      candidate values
	.pivot("course", scoreRDD.map(t => t._2).distinct)
	.max("score")
	.na.fill(Map("english" -> -1, "chinese" -> 0))
	.show()
+---+----+-------+-------+
| id|math|chinese|english|
+---+----+-------+-------+
|  1|  85|     80|     90|
|  3| 100|      0|     -1|
|  2|  90|     80|     -1|
+---+----+-------+-------+
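
The per-column defaults of na.fill can be sketched in plain Scala with getOrElse over Option-valued cells (illustrative only; hypothetical cell values for student id 3 above):

```scala
// per-column defaults, mirroring na.fill(Map("english" -> -1, "chinese" -> 0))
val defaults = Map("english" -> -1, "chinese" -> 0)

// one pivoted row with missing scores, as in id 3 above
val cells: Map[String, Option[Int]] =
  Map("math" -> Some(100), "chinese" -> None, "english" -> None)

val filled = cells.map { case (col, v) =>
  // fill a missing cell with its column's default (0 if no default is given)
  col -> v.getOrElse(defaults.getOrElse(col, 0))
}
// math -> 100, chinese -> 0, english -> -1
```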

join

case class UserCost(id: Int, category: String, totalCost: Double)
case class User(id: Int, name: String, sex: Boolean, age: Int, salary: Double)
var userCostDF = spark.sparkContext
	.parallelize(List(
		UserCost(1, "电脑配件", 100),
		UserCost(1, "母婴用品", 100),
		UserCost(1, "生活用品", 100),
		UserCost(2, "居家美食", 79),
		UserCost(2, "消费电子", 80),
		UserCost(2, "生活用品", 100)
	))
	.toDF()
	.withColumnRenamed("id", "uid")
val categories = userCostDF
	.select("category")
	.as[(String)]
	.rdd
	.distinct
	.collect()
var userDF = spark.sparkContext
	.parallelize(List(
		User(1, "张晓三", true, 18, 15000),
		User(2, "李晓四", true, 18, 18000),
		User(3, "王晓五", false, 18, 10000)
	))
	.toDF()
userDF.join(userCostDF, $"id" === $"uid", "left_outer")
	.drop("uid")
	.groupBy("id", "name")
	.pivot($"category", categories)
	.sum("totalCost")
	.na.fill(0.0)
	.show()
+---+------+--------+--------+--------+--------+--------+
| id|  name| 电脑配件| 生活用品| 母婴用品| 居家美食|  消费电子|
+---+------+--------+--------+--------+--------+--------+
|  1|张晓三|   100.0|   100.0|   100.0|     0.0|     0.0|
|  3|王晓五|     0.0|     0.0|     0.0|     0.0|     0.0|
|  2|李晓四|     0.0|   100.0|     0.0|    79.0|    80.0|
+---+------+--------+--------+--------+--------+--------+

cube (multi-dimensional aggregation)

import org.apache.spark.sql.functions._
List(
	(110, 50, 80, 80),
	(120, 60, 95, 75),
	(120, 50, 96, 70)
)
	.toDF("height", "weight", "IQ", "EQ")
	.cube($"height", $"weight")
	.agg(avg("IQ"), avg("EQ"))
	.show()
+------+------+-----------------+-------+
|height|weight|          avg(IQ)|avg(EQ)|
+------+------+-----------------+-------+
|   110|    50|             80.0|   80.0|
|   120|  null|             95.5|   72.5|
|   120|    60|             95.0|   75.0|
|  null|    60|             95.0|   75.0|
|  null|  null|90.33333333333333|   75.0|
|   120|    50|             96.0|   70.0|
|   110|  null|             80.0|   80.0|
|  null|    50|             88.0|   75.0|
+------+------+-----------------+-------+
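
cube($"height", $"weight") aggregates over every subset of the grouping columns: (height, weight), (height), (weight), and the grand total, which is why the output above has eight rows including null keys. A plain-Scala sketch of the grouping-set idea, with None playing the role of null:

```scala
// (height, weight, IQ) -- same sample data as above
val rows = List((110, 50, 80.0), (120, 60, 95.0), (120, 50, 96.0))

// the four grouping sets cube generates for two columns
val groupingSets: List[((Int, Int)) => (Option[Int], Option[Int])] = List(
  { case (h, w) => (Some(h), Some(w)) },  // GROUP BY height, weight
  { case (h, _) => (Some(h), None) },     // GROUP BY height
  { case (_, w) => (None, Some(w)) },     // GROUP BY weight
  { case (_, _) => (None, None) }         // grand total
)

// average IQ per grouping-set key, merged into one result table
val avgIQ = groupingSets.flatMap { key =>
  rows.groupBy(r => key((r._1, r._2))).map { case (k, rs) =>
    (k, rs.map(_._3).sum / rs.size)
  }
}.toMap
// e.g. (Some(120), None) -> 95.5, matching the |120|null|95.5| row above
```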

Dataset Operations (Strongly typed)

Strongly typed operations work on whole objects, whereas Spark SQL recommends column-based DataFrame operations, so the strongly typed operators are generally not recommended.

val lines = Array("this is a demo", "hello spark")
val wordRDD = spark.sparkContext.makeRDD(lines)
	.flatMap(_.split("\\s+"))
	.map((_, 1))
	
import org.apache.spark.sql.expressions.scalalang.typed

val ds: Dataset[(String, Int)] = wordRDD.toDS()
ds.groupByKey(t => t._1)
	.agg(typed.sum[(String, Int)](tuple => tuple._2).name("total"))
	.filter(tuple => tuple._1.contains("o"))
	.show()
+-----+-----+
|value|total|
+-----+-----+
|hello|  1.0|
| demo|  1.0|
+-----+-----+