The problem of Spark UDFs being executed multiple times
When we apply a UDF to a DataFrame, we usually expect the function to run exactly once per row. In practice Spark does not guarantee this: in scenarios where the UDF's return value may be referenced multiple times, Spark may internally re-invoke the UDF rather than reuse the previously computed result.
We should therefore design UDFs as pure functions, so that invoking the UDF multiple times on the same row still produces the expected result. If your logic cannot be made pure, consider implementing it with map/mapPartitions instead.
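To see why purity matters, here is a toy simulation in plain Python (no Spark involved; the function names and the doubled-reference plan are made up for illustration). When an engine inlines a UDF at every place its column is referenced instead of caching the result, an impure UDF observes more calls than there are rows:

```python
# Toy simulation (plain Python, NOT Spark) of an optimizer that
# re-evaluates a UDF at every reference site instead of caching it.

call_count = 0

def impure_udf(x):
    """Counts its own invocations -- a side effect, so NOT pure."""
    global call_count
    call_count += 1
    return x * 2

rows = [1, 2, 3]

# Referencing the UDF's result twice per row (think of a plan like
# SELECT f(x), f(x) + 1) can invoke the UDF twice per row:
results = [(impure_udf(r), impure_udf(r) + 1) for r in rows]

print(call_count)  # 6 invocations for 3 rows, not 3
```

Because `impure_udf` returns the same value for the same input, the *results* are still correct here; only the side effect (the counter) reveals the extra calls. A UDF whose side effects matter, or whose output depends on hidden state, would silently break under the same re-evaluation.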
The following is quoted from the issue description on the Spark JIRA:
Spark assumes UDFs are pure functions; we do not guarantee that a function is only executed once. This is due to the way the optimizer works, and the fact that stages are sometimes retried. We could add a flag to UDFs to prevent this, but it would be a considerable engineering effort.
The example you give is not really a pure function, as its side effect makes the thread stop (changes state).
If you are connecting to an external service, then I would suggest using Dataset.mapPartitions(…) (similar to a generator). This will allow you to setup one connection per partition, and you can call a method as much or as little as you like.
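The per-partition pattern suggested above can be sketched in plain Python (again no Spark; `FakeConnection` and the partition lists are hypothetical stand-ins for a real external service and a real Dataset). The point is that one connection is opened per partition, not per row:

```python
# Sketch of the mapPartitions pattern: the handler receives an
# iterator of rows and yields transformed rows, sharing one
# connection for the whole batch. `FakeConnection` is a made-up
# stand-in for a real external service client.

connections_opened = 0

class FakeConnection:
    def __init__(self):
        global connections_opened
        connections_opened += 1  # track how many connections we open

    def lookup(self, x):
        return x * 10

    def close(self):
        pass

def handle_partition(rows):
    """Mirrors the shape of Dataset.mapPartitions: iterator in,
    iterator out, with setup/teardown done once per partition."""
    conn = FakeConnection()
    try:
        for row in rows:
            yield conn.lookup(row)
    finally:
        conn.close()

partitions = [[1, 2], [3, 4, 5]]  # two simulated partitions
output = [list(handle_partition(p)) for p in partitions]

print(output)              # [[10, 20], [30, 40, 50]]
print(connections_opened)  # 2 connections for 5 rows, not 5
```

Writing the handler as a generator keeps it lazy, matching the iterator-in/iterator-out contract of mapPartitions, and the try/finally ensures the connection is closed once the partition is fully consumed.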
JIRA link: UDFs are run too many times