UDFs in PySpark
PySpark has a great set of aggregate functions (e.g., count, countDistinct, min, max, avg, sum), but these are not enough for all cases (particularly if you’re trying to avoid costly Shuffle operations).
PySpark currently has pandas_udfs, which can create custom aggregators, but you can only “apply” one pandas_udf at a time. If you want to use more than one, you’ll have to perform multiple groupBys… and there goes avoiding those shuffles.
In this post I describe a little hack which enables you to create simple Python UDFs which act on aggregated data (this functionality is only supposed to exist in Scala!).
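To make the examples concrete, here is a minimal sketch of how the starting dataframe below could be built (it assumes an existing SparkSession bound to the name spark):

```python
# Minimal sketch: build the example dataframe shown below.
# Assumes an existing SparkSession bound to `spark`.
df = spark.createDataFrame(
    [(1, 'a'), (1, 'b'), (1, 'b'), (2, 'c')],
    ['id', 'value'],
)
df.show()
```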
| id | value |
|---|---|
| 1 | 'a' |
| 1 | 'b' |
| 1 | 'b' |
| 2 | 'c' |
I use collect_list to bring all data from a given group into a single row. I print the output of this operation below.
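A sketch of that step (the alias value_list is my own naming):

```python
import pyspark.sql.functions as F

# Collapse each group's values into a single list-valued column.
df_grouped = df.groupby('id').agg(F.collect_list('value').alias('value_list'))
df_grouped.show()
```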
| id | value_list |
|---|---|
| 1 | ['a', 'b', 'b'] |
| 2 | ['c'] |
I then create a UDF which will count all the occurrences of the letter ‘a’ in these lists (this can be easily done without a UDF, but you get the point). This UDF wraps around collect_list, so it acts on the output of collect_list.
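Roughly, the trick could look like this (count_a and a_count are my own names; the key point is that the UDF call wraps F.collect_list inside agg):

```python
from pyspark.sql.types import IntegerType

def count_a(letters):
    # Plain Python: count occurrences of 'a' in the collected list.
    return letters.count('a')

count_a_udf = F.udf(count_a, IntegerType())

# The UDF receives each group's full list as a single value.
df.groupby('id').agg(
    count_a_udf(F.collect_list('value')).alias('a_count')
).show()
```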
| id | a_count |
|---|---|
| 1 | 1 |
| 2 | 0 |
There we go! A UDF that acts on aggregated data! Next, I show the power of this approach when combined with when, which lets us control which data enters F.collect_list.
First, let’s create a dataframe with an extra column.
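A sketch of the second dataframe, continuing the same session:

```python
# Same idea as before, with an extra value1 column.
df = spark.createDataFrame(
    [(1, 1, 'a'), (1, 2, 'a'), (1, 1, 'b'), (1, 2, 'b'), (2, 1, 'c')],
    ['id', 'value1', 'value2'],
)
df.show()
```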
| id | value1 | value2 |
|---|---|---|
| 1 | 1 | 'a' |
| 1 | 2 | 'a' |
| 1 | 1 | 'b' |
| 1 | 2 | 'b' |
| 2 | 1 | 'c' |
Notice how I included a when inside the collect_list. Note that the UDF still wraps around collect_list.
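Reusing count_a_udf from above, the combination could look like this; rows that fail the when condition become nulls, which collect_list silently drops, so the UDF only sees the filtered values:

```python
# Only rows with value1 == 1 contribute to the collected list;
# F.when with no otherwise() yields null for the rest, and
# collect_list skips nulls before the UDF runs.
df.groupby('id').agg(
    count_a_udf(
        F.collect_list(F.when(F.col('value1') == 1, F.col('value2')))
    ).alias('a_count')
).show()
```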
| id | a_count |
|---|---|
| 1 | 1 |
| 2 | 0 |
Translated from: https://www.pybloggers.com/2018/09/python-aggregate-udfs-in-pyspark/