Python Aggregate UDFs in PySpark


Pyspark has a great set of aggregate functions (e.g., count, countDistinct, min, max, avg, sum), but these are not enough for all cases (particularly if you’re trying to avoid costly Shuffle operations).


Pyspark currently has pandas_udfs, which can create custom aggregators, but you can only “apply” one pandas_udf at a time. If you want to use more than one, you’ll have to perform multiple groupBys…and there goes avoiding those shuffles.


In this post I describe a little hack which enables you to create simple python UDFs which act on aggregated data (this functionality is only supposed to exist in Scala!).

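To make the hack concrete, I’ll work with a small example dataframe. The original code isn’t preserved here, so below is a minimal sketch (assuming an active SparkSession bound to the name `spark`; the code is an illustration, not the post’s original) that builds the dataframe shown in the table that follows:

```python
# Minimal sketch (assumed, not the original post's code): build the example
# dataframe, given an active SparkSession named `spark`.
df = spark.createDataFrame(
    [(1, 'a'), (1, 'b'), (1, 'b'), (2, 'c')],
    ['id', 'value'],
)
df.show()
```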

id  value
1   'a'
1   'b'
1   'b'
2   'c'

I use collect_list to bring all data from a given group into a single row. I print the output of this operation below.

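A sketch of that aggregation (continuing from the `df` built above; the exact code is an assumption, not the original):

```python
# Sketch: collect all values for each id into a single list per group.
import pyspark.sql.functions as F

df.groupBy('id') \
  .agg(F.collect_list('value').alias('value_list')) \
  .show()
```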

id  value_list
1   ['a', 'b', 'b']
2   ['c']

I then create a UDF which will count all the occurrences of the letter ‘a’ in these lists (this could easily be done without a UDF, but you get the point). This UDF wraps around collect_list, so it acts on the output of collect_list.

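A sketch of such a UDF (the name `find_a` and the exact implementation are illustrative assumptions):

```python
# Sketch: a plain Python UDF applied to the output of collect_list.
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

def find_a(letters):
    """Count occurrences of 'a' in a collected list of letters."""
    return sum(1 for letter in letters if letter == 'a')

find_a_udf = F.udf(find_a, IntegerType())

# The UDF wraps collect_list, so it runs on each aggregated list.
df.groupBy('id') \
  .agg(find_a_udf(F.collect_list('value')).alias('a_count')) \
  .show()
```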

id  a_count
1   1
2   0

There we go! A UDF that acts on aggregated data! Next, I show the power of this approach when combined with when, which lets us control which data enters F.collect_list.


First, let’s create a dataframe with an extra column.

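A sketch of the new dataframe (again assuming a SparkSession named `spark`; the column names follow the table below):

```python
# Sketch: the same idea plus a numeric value1 column we can filter on later.
df = spark.createDataFrame(
    [(1, 1, 'a'), (1, 2, 'a'), (1, 1, 'b'), (1, 2, 'b'), (2, 1, 'c')],
    ['id', 'value1', 'value2'],
)
df.show()
```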

id  value1  value2
1   1       'a'
1   2       'a'
1   1       'b'
1   2       'b'
2   1       'c'

Notice how I included a when inside the collect_list. Note that the UDF still wraps around collect_list.

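A sketch of that combination (reusing the hypothetical `find_a_udf` from above): rows where value1 is not 1 produce null, and collect_list drops nulls, so only the matching rows reach the UDF.

```python
# Sketch: when() emits value2 only where value1 == 1 (null otherwise);
# collect_list skips the nulls, so the UDF only sees the matching rows.
import pyspark.sql.functions as F

df.groupBy('id') \
  .agg(
      find_a_udf(
          F.collect_list(F.when(F.col('value1') == 1, F.col('value2')))
      ).alias('a_count')
  ) \
  .show()
```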

id  a_count
1   1
2   0

Translated from: https://www.pybloggers.com/2018/09/python-aggregate-udfs-in-pyspark/
