The Problem of Spark UDFs Being Executed Multiple Times

When we call a UDF on a DataFrame, we usually expect it to be applied exactly once per row. In practice Spark does not guarantee this: in scenarios where the UDF's return value is referenced more than once, the optimizer may simply re-invoke the UDF rather than re-run the job (and retried stages can execute it again as well).
Therefore a UDF should be designed as a pure function, so that even if it is called multiple times on the same row, the expected result is unaffected. If your logic has side effects, consider implementing it with map/mapPartitions instead.
Below is the problem description quoted from the Spark JIRA:
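A minimal, Spark-free sketch of why this matters (assumption: plain Python stands in for Spark's optimizer here; `impure_udf` and `pure_udf` are hypothetical names). The "optimizer" below references the UDF result in two places, a filter and a projection, and re-evaluates the UDF each time, just as Spark's plan may:

```python
calls = 0

def impure_udf(x):
    """Impure: mutates external state, so repeated evaluation is visible."""
    global calls
    calls += 1
    return x * 2

def pure_udf(x):
    """Pure: same input always yields the same output, no side effects."""
    return x * 2

rows = [1, 2, 3]

# Mimic a query plan that references the UDF result twice
# (e.g. once in a filter, once in a projection) and recomputes it.
filtered = [x for x in rows if impure_udf(x) > 2]
projected = [impure_udf(x) for x in filtered]

print(calls)      # 5 calls for only 3 rows: more than once per row
print(projected)  # [4, 6] -- the computed values are still correct
```

For a pure function the extra calls are harmless; for an impure one (a counter, an external write, a connection), every extra evaluation changes observable state.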

Spark assumes UDF’s are pure function; we do not guarantee that a function is only executed once. This is due to the way the optimizer works, and the fact that sometimes retry stages. We could add a flag to UDF to prevent this from working, but this would be a considerable engineering effort.
The example you give is not really a pure function, as its side effects makes the thread stop (changes state).
If you are connecting to an external service, then I would suggest using Dataset.mapPartitions(…) (similar to a generator). This will allow you to setup one connection per partition, and you can call a method as much or as little as you like.

JIRA link: UDFs are run too many times
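The mapPartitions pattern recommended in the JIRA comment can be sketched as follows. This is a plain-Python simulation (assumption: nested lists stand in for Spark partitions, and `FakeConnection` is a hypothetical stand-in for an external-service client); in real code the generator body would be passed to `Dataset.mapPartitions` (Scala) or `rdd.mapPartitions` (PySpark):

```python
connections_opened = 0

class FakeConnection:
    """Hypothetical stand-in for an external-service client."""
    def __init__(self):
        global connections_opened
        connections_opened += 1

    def lookup(self, x):
        return x * 10

    def close(self):
        pass

def process_partition(rows):
    """One connection per partition, reused for every row in it."""
    conn = FakeConnection()
    try:
        for x in rows:
            yield conn.lookup(x)
    finally:
        conn.close()

# Two simulated partitions; Spark would call process_partition once each.
partitions = [[1, 2], [3, 4, 5]]
result = [y for part in partitions for y in process_partition(iter(part))]

print(result)              # [10, 20, 30, 40, 50]
print(connections_opened)  # 2 -- one per partition, not one per row
```

Because you control the iteration yourself, the external call happens exactly as many times as you invoke it, sidestepping the re-evaluation behavior described above.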
