User-Defined Functions
- Define a function
- Create and apply UDF
- Register UDF to use in SQL
- Use Decorator Syntax (Python Only)
- Use Vectorized UDF (Python Only)
Methods
- UDF Registration (spark.udf): register
- Built-In Functions: udf
- Python UDF Decorator: @udf
- Pandas UDF Decorator: @pandas_udf
Define a function
Define a function in local Python/Scala that returns the first letter of the string in the email field.
def firstLetterFunction(email):
    return email[0]
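Before involving Spark, the function can be checked on ordinary Python strings. The sketch below also shows a null-safe variant. The name first_letter_safe and the assumption that the email field may contain nulls or empty strings are illustrative and do not come from the original.

```python
def firstLetterFunction(email):
    return email[0]

# Hypothetical null-safe variant (not in the original): returns None
# instead of raising when the value is None or an empty string.
def first_letter_safe(email):
    if not email:
        return None
    return email[0]

print(firstLetterFunction("annagray@example.com"))
print(first_letter_safe(None))
print(first_letter_safe(""))
```

Spark passes a SQL NULL to a Python UDF as None, so the original function would raise a TypeError on such rows. The guarded version returns None for them instead.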
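A plain Python function such as firstLetterFunction cannot be applied directly to a Spark DataFrame column, so the following call fails.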
from pyspark.sql.functions import col
display(salesDF.select(firstLetterFunction(col("email"))))
Create and apply UDF
Wrapping the function with udf turns it into a UDF that can be applied to DataFrame columns.
from pyspark.sql.functions import udf
firstLetterUDF = udf(firstLetterFunction)
display(salesDF.select(firstLetterUDF(col("email"))))
Register UDF to use in SQL
Register the UDF with spark.udf.register to make it available in the SQL namespace.
salesDF.createOrReplaceTempView("sales")
spark.udf.register("sql_udf", firstLetterFunction)
SELECT email,sql_udf(email) AS firstLetter FROM sales
Use Decorator Syntax (Python Only)
Alternatively, define a UDF with Python decorator syntax, passing the data type the function returns to the decorator.
# Our input/output is a string
@udf("string")
def decoratorUDF(email: str) -> str:
    return email[0]
from pyspark.sql.functions import col
salesDF = spark.read.parquet("/mnt/dbswarehouse/raw/sales.parquet")
display(salesDF.select(decoratorUDF(col("email"))))
Use Vectorized UDF (Python Only)
import pandas as pd
from pyspark.sql.functions import pandas_udf
# We have a string input/output
@pandas_udf("string")
def vectorizedUDF(email: pd.Series) -> pd.Series:
    return email.str[0]
# Alternatively
vectorizedUDF = pandas_udf(lambda s: s.str[0], "string")
display(salesDF.select(vectorizedUDF(col("email"))))
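A pandas UDF receives batches of column values as pd.Series, so the body of the vectorized function can be tried on a plain pandas Series outside Spark. A small sketch with made-up sample values; the name first_letter_batch is hypothetical:

```python
import pandas as pd

def first_letter_batch(email: pd.Series) -> pd.Series:
    # Same body as vectorizedUDF, without the @pandas_udf decorator.
    return email.str[0]

# Made-up batch including a missing value and an empty string.
batch = pd.Series(["annagray@example.com", "bob@example.com", None, ""])
out = first_letter_batch(batch)
print(out.tolist())
```

Missing and empty values come back as missing values here rather than raising, which differs from the row-at-a-time email[0], where None raises a TypeError and an empty string raises an IndexError.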
Vectorized UDFs can also be registered in the SQL namespace.
spark.udf.register("sql_vectorized_udf", vectorizedUDF)