Introduction to Higher-Order Functions in Spark SQL


Background

An Introduction to Higher Order Functions in Spark SQL

Nested data types offer Apache Spark users powerful ways to manipulate structured data. In particular, they allow you to put complex objects like arrays, maps and structures inside of columns. This can help you model your data in a more natural way.

While this feature is certainly useful, it can be quite cumbersome to manipulate data inside of complex objects because SQL (and Spark) do not have primitives for working with such data. In addition, it is time-consuming, non-performant, and non-trivial. During this talk we will discuss some of the commonly used techniques for working with complex objects, and we will introduce new ones based on higher-order functions. Higher-order functions are part of Spark 2.4 and are a simple and performant extension to SQL that allows a user to manipulate complex data such as arrays.

Video link

transform

Applies the same operation to every element of an array.

  1. Example: the initial data lives in a table named data

data =====> createTempView(data, "data")

| id | sum | reduce |
|----|-----|--------|
| 1  | 2   | 1      |
| 2  | 5   | 3      |
  2. Merge the two columns into an array

result <- sql("select *, array(sum, reduce) as merge from data")
createTempView(result, "result")

| id | sum | reduce | merge |
|----|-----|--------|-------|
| 1  | 2   | 1      | 2,1   |
| 2  | 5   | 3      | 5,3   |
  3. Use the higher-order function transform to add 1 to every element of merge (an array column)

result <- sql("select *, transform(merge, x -> x + 1) as final from result")
createTempView(result, "final")

| id | sum | reduce | merge | final |
|----|-----|--------|-------|-------|
| 1  | 2   | 1      | 2,1   | 3,2   |
| 2  | 5   | 3      | 5,3   | 6,4   |
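To make the semantics of transform concrete, the whole pipeline above can be sketched in plain Python (the `rows` data structure is illustrative, not a Spark API; transform is just an element-wise map over each array):

```python
# Plain-Python sketch of the array() + transform() pipeline above.
rows = [
    {"id": 1, "sum": 2, "reduce": 1},
    {"id": 2, "sum": 5, "reduce": 3},
]

for row in rows:
    # array(sum, reduce) as merge
    row["merge"] = [row["sum"], row["reduce"]]
    # transform(merge, x -> x + 1) as final
    row["final"] = [x + 1 for x in row["merge"]]

print([row["final"] for row in rows])  # [[3, 2], [6, 4]]
```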

Higher-order functions can significantly improve performance. With the old approach you first had to explode the array into individual rows, then group by a unique key and reassemble the array with collect_list. The group by triggers a shuffle, which is expensive:

SELECT id,
       collect_list(val + 1) AS vals
FROM (SELECT id,
             explode(vals) AS val
      FROM input_tbl) x
GROUP BY id

You can also use a UDF, but that serializes and deserializes the data, which is also an expensive operation:

def addOne(values: Seq[Int]): Seq[Int] = {
  values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt", addOne(_: Seq[Int]): Seq[Int])

SELECT id, plusOneInt(vals) as vals FROM input_tbl

Nested transform

Use this when arrays are nested inside arrays.

SELECT key,
       nested_values,
       TRANSFORM(nested_values,
                 values -> TRANSFORM(values,
                                     value -> value + key + SIZE(values))) AS new_nested_values
FROM nested_data
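The nested TRANSFORM above corresponds to a nested list comprehension. A plain-Python sketch, with assumed sample values for key and nested_values (not from the source):

```python
# Nested transform: inner lambda sees the outer array `values` and `key`.
key = 1
nested_values = [[1, 2, 3], [4, 5]]

new_nested_values = [
    [value + key + len(values) for value in values]  # value + key + SIZE(values)
    for values in nested_values
]

print(new_nested_values)  # [[5, 6, 7], [7, 8]]
```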

exists

Tests whether any element of an array satisfies a predicate.

We reuse the result table from above:

createTempView(result, "result")

| id | sum | reduce | merge |
|----|-----|--------|-------|
| 1  | 2   | 1      | 2,1   |
| 2  | 5   | 3      | 5,3   |

Check whether merge contains the element 1:

sql("select *, exists(merge, merge_value -> merge_value == 1) as exists from result")

| id | sum | reduce | merge | exists |
|----|-----|--------|-------|--------|
| 1  | 2   | 1      | 2,1   | TRUE   |
| 2  | 5   | 3      | 5,3   | FALSE  |
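The exists function is equivalent to Python's any() over the predicate, as this plain-Python sketch of the query above shows (the `rows` structure is illustrative):

```python
# exists(merge, v -> v == 1) is True if any element matches the predicate.
rows = [
    {"id": 1, "merge": [2, 1]},
    {"id": 2, "merge": [5, 3]},
]

for row in rows:
    row["exists"] = any(v == 1 for v in row["merge"])

print([row["exists"] for row in rows])  # [True, False]
```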

aggregate

Let's go straight to an advanced aggregation example.

The second argument of aggregate initializes the accumulator; it can be a scalar or a struct (a tuple of named fields). The function then works like an ordinary reduce: a merge lambda folds each element into the accumulator, and an optional finish lambda transforms the final accumulator into the result.

SELECT key,
       values,
       AGGREGATE(values,
                 (1.0 AS product, 0 AS N),
                 (buffer, value) -> (value * buffer.product, buffer.N + 1),
                 buffer -> Power(buffer.product, 1.0 / buffer.N)) geomean
FROM nested_data

The query above computes the geometric mean of each array.
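The three pieces of AGGREGATE (initial buffer, merge step, finish step) map directly onto functools.reduce plus a final transform. A plain-Python sketch, with the sample array [2.0, 8.0] chosen for illustration:

```python
from functools import reduce

def geomean(values):
    # Initial buffer: (product = 1.0, N = 0)
    # Merge step: (buffer, value) -> (value * buffer.product, buffer.N + 1)
    buffer = reduce(
        lambda buf, value: (buf[0] * value, buf[1] + 1),
        values,
        (1.0, 0),
    )
    # Finish step: Power(buffer.product, 1.0 / buffer.N)
    return buffer[0] ** (1.0 / buffer[1])

print(geomean([2.0, 8.0]))  # 4.0
```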

Databricks resources

spark-2.4 notebook
