Overview
Projects frequently involve SQL-style operations. A common approach is to register a temporary view with createTempView and implement everything in plain SQL, but it is often unclear how to achieve the same operations with the native DataFrame operators. This article works through common SQL operations using native operators, based on Spark 2.1.1.
Building the data
Raw data
The raw data has five fields. We rename the columns with toDF, and finally register a temporary view that will serve as the SQL-side reference for the operator-based examples later in the article.
scala> val data = List(("James ","","Smith","M",60000),
| ("Michael ","Rose","","M",70000),
| ("Robert ","","Williams","",400000),
| ("Maria ","Anne","Jones","F",500000),
| ("Jen","Mary","Brown","F",0))
data: List[(String, String, String, String, Int)] = List((James ,,Smith,M,60000), (Michael ,Rose,,M,70000), (Robert ,,Williams,,400000), (Maria ,Anne,Jones,F,500000), (Jen,Mary,Brown,F,0))
scala> val cols = Seq("first_name","middle_name","last_name","gender","salary")
cols: Seq[String] = List(first_name, middle_name, last_name, gender, salary)
// def toDF(colNames: String*): DataFrame — renames the fields of a DataFrame
scala> val df = spark.createDataFrame(data).toDF(cols:_*)
df: org.apache.spark.sql.DataFrame = [first_name: string, middle_name: string ... 3 more fields]
scala> df.printSchema
root
|-- first_name: string (nullable = true)
|-- middle_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: integer (nullable = false)
scala> df.createOrReplaceTempView("user_table")
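For reference, the same query can be written against the temp view or with native operators; the lines below are a minimal sketch reusing the df and user_table defined above (output omitted):
// pure SQL against the temp view
spark.sql("SELECT first_name, salary FROM user_table WHERE salary > 60000").show
// the same query with native operators
df.filter($"salary" > 60000).select("first_name", "salary").show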
Several ways to reference a column
Most SQL operations essentially act on columns, so we first introduce the ways to reference a column.
scala> df("first_name")
res64: org.apache.spark.sql.Column = first_name
scala> $"first_name"
res65: org.apache.spark.sql.ColumnName = first_name
scala> df.col("first_name")
res66: org.apache.spark.sql.Column = first_name
scala> df.col("first_name").alias("xx")
res67: org.apache.spark.sql.Column = first_name AS `xx`
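Any of these column references can be passed to select or combined into expressions; a minimal sketch reusing df (col comes from org.apache.spark.sql.functions):
import org.apache.spark.sql.functions.col
// all three reference styles work inside select
df.select($"first_name", df("salary"), col("gender")).show
// columns also support expressions and renaming
df.select(($"salary" * 2).alias("double_salary")).show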
Complex data types
array
scala> val arrayDf = Seq((Seq(1,1,2),2),(Seq(1,2,3),3)).toDF("item","id")
arrayDf: org.apache.spark.sql.DataFrame = [item: array<int>, id: int]
scala> arrayDf.printSchema
root
|-- item: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- id: integer (nullable = false)
Checking whether an array contains an element
The functions package provides array_contains to check whether an array contains a given element; it is often used in filter conditions.
import org.apache.spark.sql.functions.array_contains
scala> arrayDf.filter(array_contains($"item", 3)).show
+---------+---+
| item| id|
+---------+---+
|[1, 2, 3]| 3|
+---------+---+
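Besides filtering, array_contains can also be projected as a boolean column; a minimal sketch reusing arrayDf:
// true for rows whose item array contains 3
arrayDf.select($"id", array_contains($"item", 3).alias("has_3")).show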
Exploding, with naming
explode turns each array element (or map entry) into its own row, and alias names the resulting column; explode_outer behaves the same, except that an empty or null array/map still produces a row with null.
import org.apache.spark.sql.functions.explode
scala> arrayDf.select($"id", explode($"item").alias("item_ele")).show
+---+--------+
| id|item_ele|
+---+--------+
| 2| 1|
| 2| 1|
| 2| 2|
| 3| 1|
| 3| 2|
| 3| 3|
+---+--------+
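For the explode_outer variant mentioned above, a minimal sketch follows; note that explode_outer only became available in Spark 2.2, and the row with a missing array is constructed purely for illustration:
import org.apache.spark.sql.functions.explode_outer
// a DataFrame where one row has no array
val withNull = Seq((Option(Seq(1, 2)), 2), (Option.empty[Seq[Int]], 3)).toDF("item", "id")
// explode would drop id = 3; explode_outer keeps it with a null item_ele
withNull.select($"id", explode_outer($"item").alias("item_ele")).show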
Size
The size function returns the length of an array.
import org.apache.spark.sql.functions.size
scala> arrayDf.filter(size($"item") > 2).show
+---------+---+
| item| id|
+---------+---+
|[1, 1, 2]| 2|
|[1, 2, 3]| 3|
+---------+---+
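size can also be projected as a regular column, e.g. to keep the array length next to the original data; a minimal sketch reusing arrayDf:
arrayDf.select($"id", $"item", size($"item").alias("item_len")).show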
map
import org.apache.spark.sql.functions.map
scala> var mapDf = df.select(map(df("first_name"), df("middle_name")).alias("map"), $"salary")
mapDf: org.apache.spark.sql.DataFrame = [map: map<string,string>, salary: int]
scala> mapDf.printSchema
root
|-- map: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- salary: integer (nullable = false)
Exploding
Exploding a map with explode yields two columns, key and value.
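A minimal sketch reusing the mapDf built above; the result has the columns salary, key and value:
// each map entry becomes one row with key and value columns
mapDf.select($"salary", explode($"map")).show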