The problem of Spark UDFs being executed multiple times
When we apply a UDF to a DataFrame, we usually expect the function to run exactly once per row. In practice Spark does not guarantee this: in scenarios where the UDF's return value may be referenced multiple times, Spark may internally re-invoke the UDF rather than reuse the previously computed result.
We should therefore design UDFs as pure functions, so that invoking the UDF multiple times on the same row still produces the expected result. If your logic cannot be made pure, consider implementing it with map/mapPartitions instead.
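To see why purity matters, here is a toy simulation in plain Python (no Spark involved; the function names and the doubled-reference plan are made up for illustration). When an engine inlines a UDF at every place its column is referenced instead of caching the result, an impure UDF observes more calls than there are rows:

```python
# Toy simulation (plain Python, NOT Spark) of an optimizer that
# re-evaluates a UDF at every reference site instead of caching it.

call_count = 0

def impure_udf(x):
    """Counts its own invocations -- a side effect, so NOT pure."""
    global call_count
    call_count += 1
    return x * 2

rows = [1, 2, 3]

# Referencing the UDF's result twice per row (think of a plan like
# SELECT f(x), f(x) + 1) can invoke the UDF twice per row:
results = [(impure_udf(r), impure_udf(r) + 1) for r in rows]

print(call_count)  # 6 invocations for 3 rows, not 3
```

Because `impure_udf` returns the same value for the same input, the *results* are still correct here; only the side effect (the counter) reveals the extra calls. A UDF whose side effects matter, or whose output depends on hidden state, would silently break under the same re-evaluation.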
The following is quoted from the issue description on the Spark JIRA:
Spark assumes UDFs are pure functions; we do not guarantee that a function is only executed once. This is due to the way the optimizer works, and the fact that stages are sometimes retried. We could add a flag to UDFs to prevent this, but it would be a considerable engineering effort.
The example you give is not really a pure function, as its side effect makes the thread stop (changes state).
If you are connecting to an external service, then I would suggest using Dataset.mapPartitions(…) (similar to a generator). This will allow you to setup one connection per partition, and you can call a method as much or as little as you like.
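The per-partition pattern suggested above can be sketched in plain Python (again no Spark; `FakeConnection` and the partition lists are hypothetical stand-ins for a real external service and a real Dataset). The point is that one connection is opened per partition, not per row:

```python
# Sketch of the mapPartitions pattern: the handler receives an
# iterator of rows and yields transformed rows, sharing one
# connection for the whole batch. `FakeConnection` is a made-up
# stand-in for a real external service client.

connections_opened = 0

class FakeConnection:
    def __init__(self):
        global connections_opened
        connections_opened += 1  # track how many connections we open

    def lookup(self, x):
        return x * 10

    def close(self):
        pass

def handle_partition(rows):
    """Mirrors the shape of Dataset.mapPartitions: iterator in,
    iterator out, with setup/teardown done once per partition."""
    conn = FakeConnection()
    try:
        for row in rows:
            yield conn.lookup(row)
    finally:
        conn.close()

partitions = [[1, 2], [3, 4, 5]]  # two simulated partitions
output = [list(handle_partition(p)) for p in partitions]

print(output)              # [[10, 20], [30, 40, 50]]
print(connections_opened)  # 2 connections for 5 rows, not 5
```

Writing the handler as a generator keeps it lazy, matching the iterator-in/iterator-out contract of mapPartitions, and the try/finally ensures the connection is closed once the partition is fully consumed.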
JIRA link: UDFs are run too many times