UDFs in PySpark
PySpark has a great set of aggregate functions (e.g., count, countDistinct, min, max, avg, sum), but these are not enough for all cases (particularly if you’re trying to avoid costly Shuffle operations).
PySpark currently has pandas_udfs, which can create custom aggregators, but you can only “apply” one pandas_udf at a time. If you want to use more than one, you’ll have to perform multiple groupBys… and there goes avoiding those shuffles.
In this post I describe a little hack which enables you to create simple Python UDFs which act on aggregated data (this functionality is only supposed to exist in Scala!).
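To make the examples concrete, here is a minimal sketch of how the starting dataframe below could be built (it assumes an existing SparkSession bound to the name spark):

```python
# Minimal sketch: build the example dataframe shown below.
# Assumes an existing SparkSession bound to `spark`.
df = spark.createDataFrame(
    [(1, 'a'), (1, 'b'), (1, 'b'), (2, 'c')],
    ['id', 'value'],
)
df.show()
```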
| id | value |
|---|---|
| 1 | 'a' |
| 1 | 'b' |
| 1 | 'b' |
| 2 | 'c' |
I use collect_list to bring all data from a given group into a single row. I print the output of this operation below.
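A sketch of that step (the alias value_list is my own naming):

```python
import pyspark.sql.functions as F

# Collapse each group's values into a single list-valued column.
df_grouped = df.groupby('id').agg(F.collect_list('value').alias('value_list'))
df_grouped.show()
```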
| id | value_list |
|---|---|
| 1 | ['a', 'b', 'b'] |
| 2 | ['c'] |
I then create a UDF which will count all the occurrences of the letter ‘a’ in these lists (this can be easily done without a UDF, but you get the point). This UDF wraps around collect_list, so it acts on the output of collect_list.
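Roughly, the trick could look like this (count_a and a_count are my own names; the key point is that the UDF call wraps F.collect_list inside agg):

```python
from pyspark.sql.types import IntegerType

def count_a(letters):
    # Plain Python: count occurrences of 'a' in the collected list.
    return letters.count('a')

count_a_udf = F.udf(count_a, IntegerType())

# The UDF receives each group's full list as a single value.
df.groupby('id').agg(
    count_a_udf(F.collect_list('value')).alias('a_count')
).show()
```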
| id | a_count |
|---|---|
| 1 | 1 |
| 2 | 0 |
There we go! A UDF that acts on aggregated data! Next, I show the power of this approach when combined with when, which lets us control which data enters F.collect_list.
First, let’s create a dataframe with an extra column.
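A sketch of the second dataframe, continuing the same session:

```python
# Same idea as before, with an extra value1 column.
df = spark.createDataFrame(
    [(1, 1, 'a'), (1, 2, 'a'), (1, 1, 'b'), (1, 2, 'b'), (2, 1, 'c')],
    ['id', 'value1', 'value2'],
)
df.show()
```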
| id | value1 | value2 |
|---|---|---|
| 1 | 1 | 'a' |
| 1 | 2 | 'a' |
| 1 | 1 | 'b' |
| 1 | 2 | 'b' |
| 2 | 1 | 'c' |
Notice how I included a when inside the collect_list. Note that the UDF still wraps around collect_list.
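Reusing count_a_udf from above, the combination could look like this; rows that fail the when condition become nulls, which collect_list silently drops, so the UDF only sees the filtered values:

```python
# Only rows with value1 == 1 contribute to the collected list;
# F.when with no otherwise() yields null for the rest, and
# collect_list skips nulls before the UDF runs.
df.groupby('id').agg(
    count_a_udf(
        F.collect_list(F.when(F.col('value1') == 1, F.col('value2')))
    ).alias('a_count')
).show()
```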
| id | a_count |
|---|---|
| 1 | 1 |
| 2 | 0 |
Translated from: https://www.pybloggers.com/2018/09/python-aggregate-udfs-in-pyspark/