Introduction to Higher-Order Functions in Spark SQL


Background

An Introduction to Higher Order Functions in Spark SQL

Nested data types offer Apache Spark users powerful ways to manipulate structured data. In particular, they allow you to put complex objects like arrays, maps and structures inside of columns. This can help you model your data in a more natural way.

While this feature is certainly useful, it can be quite cumbersome to manipulate data inside of complex objects because SQL (and Spark) do not have primitives for working with such data. In addition, it is time-consuming, non-performant, and non-trivial. During this talk we will discuss some of the commonly used techniques for working with complex objects, and we will introduce new ones based on higher-order functions. Higher-order functions are part of Spark 2.4 and are a simple and performant extension to SQL that allows a user to manipulate complex data such as arrays.

Video link

transform

Applies the same operation to every element of an array.

  1. Example: the initial data lives in a table named data

data =====> createTempView(data, "data")

| id | sum | reduce |
|----|-----|--------|
| 1  | 2   | 1      |
| 2  | 5   | 3      |
  2. Merge the two columns into an array

result <- sql("select *, array(sum, reduce) as merge from data")
createTempView(result, "result")

| id | sum | reduce | merge |
|----|-----|--------|-------|
| 1  | 2   | 1      | 2,1   |
| 2  | 5   | 3      | 5,3   |
  3. Use the higher-order function transform to add 1 to every element of merge (an array column)

result <- sql("select *, transform(merge, x -> x + 1) as final from result")
createTempView(result, "final")

| id | sum | reduce | merge | final |
|----|-----|--------|-------|-------|
| 1  | 2   | 1      | 2,1   | 3,2   |
| 2  | 5   | 3      | 5,3   | 6,4   |
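To make the semantics of transform concrete, the whole pipeline above can be sketched in plain Python (the `rows` data structure is illustrative, not a Spark API; transform is just an element-wise map over each array):

```python
# Plain-Python sketch of the array() + transform() pipeline above.
rows = [
    {"id": 1, "sum": 2, "reduce": 1},
    {"id": 2, "sum": 5, "reduce": 3},
]

for row in rows:
    # array(sum, reduce) as merge
    row["merge"] = [row["sum"], row["reduce"]]
    # transform(merge, x -> x + 1) as final
    row["final"] = [x + 1 for x in row["merge"]]

print([row["final"] for row in rows])  # [[3, 2], [6, 4]]
```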

Higher-order functions can significantly improve performance. With the old approach you first had to explode the array into individual rows, then group by a unique key and reassemble the array with collect_list. The group by triggers a shuffle, which is expensive:

SELECT id,
       collect_list(val + 1) AS vals
FROM (SELECT id,
             explode(vals) AS val
      FROM input_tbl) x
GROUP BY id

You can also use a UDF, but that serializes and deserializes the data, which is also an expensive operation:

def addOne(values: Seq[Int]): Seq[Int] = {
  values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt", addOne(_: Seq[Int]): Seq[Int])

SELECT id, plusOneInt(vals) as vals FROM input_tbl

Nested transform

Use this when arrays are nested inside arrays.

SELECT key,
       nested_values,
       TRANSFORM(nested_values,
                 values -> TRANSFORM(values,
                                     value -> value + key + SIZE(values))) AS new_nested_values
FROM nested_data
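The nested TRANSFORM above corresponds to a nested list comprehension. A plain-Python sketch, with assumed sample values for key and nested_values (not from the source):

```python
# Nested transform: inner lambda sees the outer array `values` and `key`.
key = 1
nested_values = [[1, 2, 3], [4, 5]]

new_nested_values = [
    [value + key + len(values) for value in values]  # value + key + SIZE(values)
    for values in nested_values
]

print(new_nested_values)  # [[5, 6, 7], [7, 8]]
```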

exists

Tests whether any element of an array satisfies a predicate.

We reuse the result table from above:

createTempView(result, "result")

| id | sum | reduce | merge |
|----|-----|--------|-------|
| 1  | 2   | 1      | 2,1   |
| 2  | 5   | 3      | 5,3   |

Check whether merge contains the element 1:

sql("select *, exists(merge, merge_value -> merge_value == 1) as exists from result")

| id | sum | reduce | merge | exists |
|----|-----|--------|-------|--------|
| 1  | 2   | 1      | 2,1   | TRUE   |
| 2  | 5   | 3      | 5,3   | FALSE  |
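The exists function is equivalent to Python's any() over the predicate, as this plain-Python sketch of the query above shows (the `rows` structure is illustrative):

```python
# exists(merge, v -> v == 1) is True if any element matches the predicate.
rows = [
    {"id": 1, "merge": [2, 1]},
    {"id": 2, "merge": [5, 3]},
]

for row in rows:
    row["exists"] = any(v == 1 for v in row["merge"])

print([row["exists"] for row in rows])  # [True, False]
```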

aggregate

Let's go straight to an advanced aggregation example.

The second argument of aggregate initializes the accumulator; it can be a scalar or a struct (a tuple of named fields). The function then works like an ordinary reduce: a merge lambda folds each element into the accumulator, and an optional finish lambda transforms the final accumulator into the result.

SELECT key,
       values,
       AGGREGATE(values,
                 (1.0 AS product, 0 AS N),
                 (buffer, value) -> (value * buffer.product, buffer.N + 1),
                 buffer -> Power(buffer.product, 1.0 / buffer.N)) geomean
FROM nested_data

The query above computes the geometric mean of each array.
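The three pieces of AGGREGATE (initial buffer, merge step, finish step) map directly onto functools.reduce plus a final transform. A plain-Python sketch, with the sample array [2.0, 8.0] chosen for illustration:

```python
from functools import reduce

def geomean(values):
    # Initial buffer: (product = 1.0, N = 0)
    # Merge step: (buffer, value) -> (value * buffer.product, buffer.N + 1)
    buffer = reduce(
        lambda buf, value: (buf[0] * value, buf[1] + 1),
        values,
        (1.0, 0),
    )
    # Finish step: Power(buffer.product, 1.0 / buffer.N)
    return buffer[0] ** (1.0 / buffer[1])

print(geomean([2.0, 8.0]))  # 4.0
```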

Databricks resources

spark-2.4 notebook
