A practical guide to SQL-style operations with DataSet

Overview

SQL-style operations come up constantly in real projects. Registering a temporary view with createTempView and writing plain SQL is one common approach, but it is often far less obvious how to achieve the same operations with the native operators. This article walks through common SQL operations using native operators, based on Spark 2.1.1.

Building the data

Raw data

The raw data has five fields. We rename the columns by passing the column names to toDF, and finally register a temporary view that later sections use to contrast the SQL version of each operation with the operator version.

scala> val data = List(("James ","","Smith","M",60000),
     |         ("Michael ","Rose","","M",70000),
     |         ("Robert ","","Williams","",400000),
     |         ("Maria ","Anne","Jones","F",500000),
     |         ("Jen","Mary","Brown","F",0))
data: List[(String, String, String, String, Int)] = List(("James ","",Smith,M,60000), ("Michael ",Rose,"",M,70000), ("Robert ","",Williams,"",400000), ("Maria ",Anne,Jones,F,500000), (Jen,Mary,Brown,F,0))

scala> val cols = Seq("first_name","middle_name","last_name","gender","salary")
cols: Seq[String] = List(first_name, middle_name, last_name, gender, salary)

// def toDF(colNames: String*): DataFrame — renames the fields of the DataFrame
scala> val df = spark.createDataFrame(data).toDF(cols:_*)
df: org.apache.spark.sql.DataFrame = [first_name: string, middle_name: string ... 3 more fields]

scala> df.printSchema
root
 |-- first_name: string (nullable = true)
 |-- middle_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = false)

scala> df.createOrReplaceTempView("user_table")
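
As a quick illustration of the comparison the temp view enables, the same query can be written both ways (a minimal sketch; the salary threshold is picked arbitrarily):

// Pure SQL against the temp view
spark.sql("SELECT first_name, salary FROM user_table WHERE salary > 60000").show

// The equivalent native-operator version
df.filter($"salary" > 60000).select("first_name", "salary").show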

Several ways to select a column

Most SQL operations ultimately work on columns, so let's first look at the ways to obtain a column reference.

scala> df("first_name")
res64: org.apache.spark.sql.Column = first_name

scala> $"first_name"
res65: org.apache.spark.sql.ColumnName = first_name

scala> df.col("first_name")
res66: org.apache.spark.sql.Column = first_name

scala> df.col("first_name").alias("xx")
res67: org.apache.spark.sql.Column = first_name AS `xx`
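
These column objects are what the operator-style API consumes; a minimal sketch of using them inside select and filter (the derived column name double_salary is made up for illustration):

// Column expressions can be combined, renamed, and then filtered on
df.select($"first_name", (df("salary") * 2).alias("double_salary"))
  .filter($"double_salary" > 100000)
  .show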

Complex data types

array

scala> val arrayDf = Seq((Seq(1,1,2),2),(Seq(1,2,3),3)).toDF("item","id")
arrayDf: org.apache.spark.sql.DataFrame = [item: array<int>, id: int]

scala> arrayDf.printSchema
root
 |-- item: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- id: integer (nullable = false)

Checking whether an element is contained

The functions package provides array_contains to test whether an array contains a given element; it is commonly used as a filter condition.

import org.apache.spark.sql.functions.array_contains
scala> arrayDf.filter(array_contains($"item", 3)).show
+---------+---+
|     item| id|
+---------+---+
|[1, 2, 3]|  3|
+---------+---+

Exploding with an alias

explode expands each array element into its own row; explode_outer behaves the same, except that when the array or map is empty or null it still produces a row with a null value instead of dropping the row (a sketch of the difference follows the explode example below).

import org.apache.spark.sql.functions.explode

scala> arrayDf.select($"id", explode($"item").alias("item_ele")).show
+---+--------+
| id|item_ele|
+---+--------+
|  2|       1|
|  2|       1|
|  2|       2|
|  3|       1|
|  3|       2|
|  3|       3|
+---+--------+
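
To see how explode_outer differs from the explode call above, here is a minimal sketch; note that explode_outer was only added in Spark 2.2, so this assumes a slightly newer version than the 2.1.1 used elsewhere, and the sample data is made up:

import org.apache.spark.sql.functions.explode_outer

// The row with an empty array would be dropped by explode,
// but explode_outer keeps it and emits a null element
val withEmpty = Seq((Seq(1, 2), 1), (Seq.empty[Int], 2)).toDF("item", "id")
withEmpty.select($"id", explode_outer($"item").alias("item_ele")).show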

Size

The size function returns the number of elements in an array.

import org.apache.spark.sql.functions.size

scala> arrayDf.filter(size($"item") > 2).show
+---------+---+
|     item| id|
+---------+---+
|[1, 1, 2]|  2|
|[1, 2, 3]|  3|
+---------+---+

map

import org.apache.spark.sql.functions.map

scala> var mapDf = df.select(map(df("first_name"), df("middle_name")).alias("map"), $"salary")
mapDf: org.apache.spark.sql.DataFrame = [map: map<string,string>, salary: int]

scala> mapDf.printSchema
root
 |-- map: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- salary: integer (nullable = false)

Exploding

Using explode on a map expands it into two columns, key and value.
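
A minimal sketch of what that looks like against the mapDf built above (key and value are the default column names explode produces for a map):

import org.apache.spark.sql.functions.explode

// Each map entry becomes a row with two columns: key and value
mapDf.select($"salary", explode($"map")).show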


                