Custom UDF in Apache Spark


Apache Spark has become a very widely used framework for building Big Data applications. Spark SQL has made ad hoc analysis of structured data very easy, so it is popular among users who deal with huge amounts of structured data. However, you will often find that some functionality is not available in Spark SQL. For example, in Spark 1.3.1, basic functions like variance, standard deviation, and percentile are not available when you are using sqlContext (you can access them through HiveContext). Similarly, you may want to perform some string operation on an input column, but the functionality may not be available in sqlContext. In such situations, we can use a Spark UDF to add new constructs to sqlContext. I have found UDFs very handy, especially while processing string-type columns. In this post we will discuss how to create UDFs in Spark and how to use them.

Use Case : Let us keep this example very simple. We will create a UDF that takes a string as input and converts it to upper case. We will then walk through an example of using that UDF with a DataFrame.

Create a UDF : First of all, let us see how to declare a UDF. Following is the syntax to create a UDF in Spark, using Scala as the language.

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val toUpper = udf[String, String]( x => x.toUpperCase() )

As you can see, it is very simple to define a UDF.
Here the first two lines are just imports; we are importing the packages required by the code we are writing. The third line is where the UDF is created. Note the following things in the code.
1. toUpper is the name of the UDF.
2. Notice the udf[String, String] part. The first type argument is the type of the value that will be returned by the UDF; the second is the type of the input argument. If there are multiple input arguments, we have to list their types in the same way. For example, if we are creating a sum UDF that takes two Ints as input and returns an Int, its definition will be as follows.

val getSum = udf[Int, Int, Int]( (x, y) => x + y )
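As a quick sketch of how such a multi-argument UDF would be applied (the column names a and b are hypothetical, just for illustration):

// apply the two-argument UDF to two integer columns
val df_sum = df.withColumn("total", getSum(df.col("a"), df.col("b")))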

Using UDF : Once the UDF is created, we can use it in our code. Suppose we have a Spark DataFrame named df with a column named name, of type String, which contains names. We want to add one more column to our DataFrame, named name_upper, where the name will be in upper case. In this case we can use the UDF we defined earlier.

val df_new = df.withColumn("name_upper", toUpper(df.col("name")))

As you can see, once we create a UDF, using it is very convenient.
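Since we mentioned adding new constructs to sqlContext, it is worth noting that a function can also be registered for use directly in SQL queries. Here is a minimal sketch, assuming df has been registered as a temporary table (the table name people is just for illustration):

// register df as a temporary table, and the function under the name "toUpper"
df.registerTempTable("people")
sqlContext.udf.register("toUpper", (s: String) => s.toUpperCase)
val result = sqlContext.sql("SELECT name, toUpper(name) AS name_upper FROM people")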

Add on : Sometimes when we are working on a DataFrame, we want to add a new column with some dummy value. UDFs can be very handy in this case; you can do it as follows.
First, create a dummy (identity) UDF.

val dummy = udf[String, String]( x => x )

Now this UDF can be called to add a dummy-value column to the DataFrame.

val df_new = df.withColumn("dummy_col", dummy(lit("Dummy_value")))


I hope this post is useful for you. Please feel free to add your comments and thoughts.
 
