Under Spark 2.4.4 there are two ways to write a UDF (Scala).

Method 1: register the UDF and call it from a SQL string
import scala.collection.mutable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.GenericRow

val sparkSession = SparkSession.builder()
  .appName("PKPMBimAnalyse")
  .config("spark.mongodb.input.uri", "mongodb://10.100.140.35/mydb.netflows")
  .master("local")
  .getOrCreate()

// Register the function under the name "TotalVolume" so SQL text can call it.
// Elements of an array<struct> column arrive at the UDF as Row values.
sparkSession.udf.register("TotalVolume", (HighPts: mutable.WrappedArray[GenericRow]) => {
  println(HighPts)
  HighPts.size
})

val resultDataFrame = sparkSession.sql("select RootNode.ChildNode.HighPt, RootNode.ChildNode.LowPt, TotalVolume(RootNode.ChildNode.HighPt) from netflows")
Note that TotalVolume(RootNode.ChildNode.HighPt) inside the SQL statement invokes the UDF registered above by name.
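To make the register-then-SQL pattern concrete, here is a minimal self-contained sketch. It assumes only Spark is on the classpath; the DataFrame, view name, and UDF here are hypothetical stand-ins for the MongoDB data above.

```scala
import org.apache.spark.sql.SparkSession

object RegisterUdfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RegisterUdfExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Register a UDF by name so it can be referenced from SQL text.
    spark.udf.register("strLen", (s: String) => s.length)

    // Hypothetical in-memory data exposed as a temp view for SQL.
    Seq("alpha", "beta").toDF("word").createOrReplaceTempView("words")

    // The registered name is used directly inside the SQL string.
    spark.sql("select word, strLen(word) as len from words").show()

    spark.stop()
  }
}
```

The registered name lives in the session's function registry, so it is visible to any SQL executed through that SparkSession.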
Method 2: apply the UDF through select or withColumn
import org.apache.spark.sql.functions.{col, udf}

// Wrap the function as a Column-level UDF; no name registration is needed.
val totalVolume = udf((HighPts: mutable.WrappedArray[GenericRow]) => {
  println(HighPts)
  HighPts.size
})

val resultDataFrame = sparkSession.sql("select RootNode.ChildNode.HighPt, RootNode.ChildNode.LowPt from netflows")
val testDataFrame = resultDataFrame.withColumn("name_len", totalVolume(col("HighPt")))
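The Column-based style can likewise be shown in a runnable sketch. As before, the data and names are hypothetical; only the `udf`/`withColumn` pattern mirrors the code above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object ColumnUdfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ColumnUdfExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Wrap a plain Scala function as a Column-level UDF.
    val strLen = udf((s: String) => s.length)

    val df = Seq("alpha", "beta").toDF("word")

    // Apply it via withColumn (it works the same inside select).
    df.withColumn("len", strLen(col("word"))).show()

    spark.stop()
  }
}
```

The difference from Method 1 is scope: a `udf(...)` value is an ordinary Scala object usable only through the DataFrame API, while `spark.udf.register` makes the function callable from SQL text as well.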