Before starting on the actual data processing, I think it is worth taking the time to understand UDFs.
UDF
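UDF stands for User-Defined Function. It is a Spark SQL feature for defining new Column-based functions that extend the vocabulary of Spark SQL's DSL for transforming Datasets.
I found a simple introductory example on Databricks: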
Register the function as a UDF
// Long matches the id column produced by spark.range
val squared = (s: Long) => {
  s * s
}
spark.udf.register("square", squared)
Call the UDF in Spark SQL
// createOrReplaceTempView replaces registerTempTable, which is deprecated since Spark 2.0
spark.range(1, 20).createOrReplaceTempView("test")
%sql select id, square(id) as id_squared from test
As I understand it, you first define a function squared that returns the square of its input, then register it under the name square, after which square can be called directly in Spark SQL.
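Outside a notebook, the same two steps can be tried in a small standalone program. The following is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable; the object name `SquareUdfSketch` and the `local[*]` master are illustrative choices, not part of the Databricks example.

```scala
import org.apache.spark.sql.SparkSession

object SquareUdfSketch {
  // Plain Scala function; Long matches the id column produced by spark.range.
  val squared: Long => Long = (s: Long) => s * s

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for experimenting.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("square UDF sketch")
      .getOrCreate()

    // Make the function callable from SQL under the name "square".
    spark.udf.register("square", squared)

    // Expose the ids 1 to 19 as a temporary view named "test".
    spark.range(1, 20).createOrReplaceTempView("test")

    // The registered name is used like any built-in SQL function.
    spark.sql("SELECT id, square(id) AS id_squared FROM test").show(5)

    spark.stop()
  }
}
```

Keeping `squared` as an ordinary value means it can be exercised without Spark at all, which is handy when the function logic grows.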
Example 1: Temperature conversion
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

object ScalaUDFExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Scala UDF Example")
    val spark = SparkSession.builder().enableHiveSupport().config(conf).getOrCreate()

    // Load the city temperatures and expose them as a SQL view
    val ds = spark.read.json("temperatures.json")
    ds.createOrReplaceTempView("citytemps")

    // Register the UDF with our SparkSession
    spark.udf.register("CTOF", (degreesCelsius: Double) => (degreesCelsius * 9.0 / 5.0) + 32.0)

    spark.sql("SELECT city, CTOF(avgLow) AS avgLowF, CTOF(avgHigh) AS avgHighF FROM citytemps").show()
  }
}
The UDF defined above converts the temperatures in the following JSON data (temperatures.json) from degrees Celsius to degrees Fahrenheit:
{"city":"St. John's","avgHigh":8.7,"avgLow":0.6}
{"city":"Charlottetown","avgHigh":9.7,"avgLow":0.9}
{"city":"Halifax","avgHigh":11.0,"avgLow":1.6}
{"city":"Fredericton","avgHigh":11.2,"avgLow":-0.5}
{"city":"Quebec","avgHigh":9.0,"avgLow":-1.0}
{"city":"Montreal","avgHigh":11.1,"avgLow":1.4}
...
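Because a UDF is just a Scala function, the conversion formula can be checked against well-known reference points before it is registered. This is a small sketch with no Spark dependency; the object name `CtofCheck` and the function name `ctof` are made up for illustration.

```scala
object CtofCheck {
  // Same formula as the CTOF UDF, kept as a plain function so it can be tested without Spark.
  def ctof(degreesCelsius: Double): Double = (degreesCelsius * 9.0 / 5.0) + 32.0

  def main(args: Array[String]): Unit = {
    assert(ctof(0.0) == 32.0)    // freezing point of water
    assert(ctof(100.0) == 212.0) // boiling point of water
    assert(ctof(-40.0) == -40.0) // the two scales meet at -40

    // Halifax average high from the sample data, converted to Fahrenheit.
    println(ctof(11.0))
  }
}
```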
Example 2: Time conversion
case class Purchase(customer_id: Int, purchase_id: Int, date: String, time: String, tz: String, amount: Double)

// Build a small RDD of sample purchases
val x = sc.parallelize(Array(
  Purchase(123, 234, "2007-12-12", "20:50", "UTC", 500.99),
  Purchase(123, 247, "2007-12-12", "15:30", "PST", 300.22),
  Purchase(189, 254, "2007-12-13", "00:50", "EST", 122.19),
  Purchase(187, 299, "2007-12-12", "07:30", "UTC", 524.37)
))

val df = sqlContext.createDataFrame(x)
// registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is its replacement
df.registerTempTable("df")
Custom function
def makeDT(date: String, time: String, tz: String) = s"$date $time $tz"
sqlContext.udf.register("makeDt", makeDT(_:String,_:String,_:String))

// Now we can use our function directly in SparkSQL.
sqlContext.sql("SELECT amount, makeDt(date, time, tz) from df").take(2)
// but not outside: makeDt is only a SQL name, so the next line does not compile
// df.select($"customer_id", makeDt($"date", $"time", $"tz"), $"amount").take(2)
To use it outside SQL, in the DataFrame API, wrap the function with org.apache.spark.sql.functions.udf:
import org.apache.spark.sql.functions.udf
val makeDt = udf(makeDT(_:String,_:String,_:String))
// now this works
df.select($"customer_id", makeDt($"date", $"time", $"tz"), $"amount").take(2)
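In Spark 2.x and later, `udf.register` also returns a `UserDefinedFunction`, so a single registration can serve both SQL and the DataFrame API. The sketch below assumes a `SparkSession` instead of the older `sqlContext`; the object name `MakeDtSketch`, the `local[*]` master, and the two sample rows are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object MakeDtSketch {
  def makeDT(date: String, time: String, tz: String): String = s"$date $time $tz"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("makeDt sketch")
      .getOrCreate()
    import spark.implicits._ // enables toDF and the $"col" syntax

    // Hypothetical two-row sample shaped like the Purchase data.
    val df = Seq(
      (123, "2007-12-12", "20:50", "UTC", 500.99),
      (189, "2007-12-13", "00:50", "EST", 122.19)
    ).toDF("customer_id", "date", "time", "tz", "amount")
    df.createOrReplaceTempView("df")

    // register makes the function visible to SQL and also returns a UserDefinedFunction.
    val makeDt = spark.udf.register("makeDt", makeDT(_: String, _: String, _: String))

    // SQL usage via the registered name.
    spark.sql("SELECT amount, makeDt(date, time, tz) AS dt FROM df").show()

    // DataFrame usage via the returned value.
    df.select($"customer_id", makeDt($"date", $"time", $"tz").as("dt"), $"amount").show()

    spark.stop()
  }
}
```

Whether to use `functions.udf` or the value returned by `register` is mostly a matter of whether the function also needs a SQL name.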
Hands-on practice
Write a UDF that buckets Int values into ranges:
val formatDistribution = (view: Int) => {
  if (view < 10) {
    "<10"
  } else if (view <= 100) {
    "10~100"
  } else if (view <= 1000) {
    "100~1K"
  } else if (view <= 10000) {
    "1K~10K"
  } else if (view <= 100000) {
    "10K~100K"
  } else {
    ">100K"
  }
}
Register (here formatDistribution lives in an object named UDF):
session.udf.register("formatDistribution", UDF.formatDistribution)
SQL:
session.sql("select user_id, formatDistribution(variance_digg_count) as variance from video")
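The bucket boundaries are easy to get wrong by one, so it helps to check the edges with plain Scala before running the query. The sketch below restates the same rules as a pattern match; the object name `FormatDistributionCheck` is made up for illustration, and no Spark is needed.

```scala
object FormatDistributionCheck {
  // Same bucketing rules as the UDF, written as a pattern match; cases are tried top to bottom.
  val formatDistribution: Int => String = {
    case v if v < 10      => "<10"
    case v if v <= 100    => "10~100"
    case v if v <= 1000   => "100~1K"
    case v if v <= 10000  => "1K~10K"
    case v if v <= 100000 => "10K~100K"
    case _                => ">100K"
  }

  def main(args: Array[String]): Unit = {
    // Upper bounds are inclusive: 100 still belongs to "10~100".
    assert(formatDistribution(9) == "<10")
    assert(formatDistribution(10) == "10~100")
    assert(formatDistribution(100) == "10~100")
    assert(formatDistribution(101) == "100~1K")
    assert(formatDistribution(100000) == "10K~100K")
    assert(formatDistribution(100001) == ">100K")
    // Negative inputs also land in the first bucket.
    assert(formatDistribution(-5) == "<10")
  }
}
```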
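Having written this far and looking back at UDFs, they feel like a convenient way to do things such as classification and conversion, much like functions in Python, except that "UDF" here usually refers specifically to functions used in Spark SQL. I then noticed that they are quite similar to user-defined functions in SQL (SQL Server syntax below):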
CREATE FUNCTION [owner.]<function_name>
(
    -- parameters the function needs; there may be none
    [<@param1> <parameter_type>]
    [,<@param2> <parameter_type>]…
)
RETURNS TABLE
AS
RETURN
(
    -- the query whose result is returned
    SELECT statement
)
/*
 * Create a multi-statement table-valued function that returns the personal
 * details of account holders whose total transaction amount exceeds 10,000
 */
create function getCustInfo()
returns @CustInfo table -- returns a table variable
(
    -- account ID
    CustID int,
    -- account name
    CustName varchar(20) not null,
    -- ID card number
    IDCard varchar(18),
    -- phone number
    TelePhone varchar(13) not null,
    -- address; the default value means "address unknown"
    Address varchar(50) default('地址不详')
)
as
begin
    -- populate the table variable
    insert into @CustInfo
    select CustID, CustName, IDCard, TelePhone, Address from AccountInfo
    where CustID in (select CustID from CardInfo
        -- group by card only, so that sum() is the total per card
        where CardID in (select CardID from TransInfo group by CardID having sum(TransMoney) > 10000))
    return
end
go
-- call the table-valued function
select * from getCustInfo()
go
The two approaches reach much the same result by different means.