When you need to concatenate fields after an aggregation in Spark SQL, use CONCAT_WS to join multiple fields into one string, then COLLECT_SET to collect those strings, which returns an Array column. For example:
val imo_type_sql =
"""
|SELECT IMO, MMSI, COLLECT_SET(CONCAT_WS("~",ShipType, count)) as type_count
|FROM agg_table
|GROUP BY IMO, MMSI
""".stripMargin
val imo_type_df: DataFrame = session.sql(imo_type_sql)
imo_type_df.createOrReplaceTempView("all_type_table")
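The shape of the resulting type_count column can be sketched in plain Scala, without a Spark session, by grouping hypothetical sample rows the same way the SQL does (the IMO/MMSI values below are made up for illustration):

```scala
// Hypothetical sample rows: (IMO, MMSI, ShipType, count)
val rows = Seq(
  ("9351000", "413000001", "Cargo", 12L),
  ("9351000", "413000001", "Tanker", 3L),
  ("9351001", "413000002", "Fishing", 7L)
)

// Group by (IMO, MMSI) and collect "ShipType~count" strings,
// mirroring COLLECT_SET(CONCAT_WS("~", ShipType, count))
val typeCount: Map[(String, String), Set[String]] =
  rows.groupBy(r => (r._1, r._2))
      .mapValues(_.map(r => s"${r._3}~${r._4}").toSet)
      .toMap

// typeCount(("9351000", "413000001")) == Set("Cargo~12", "Tanker~3")
```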
Downstream processing needs to pick the ShipType with the highest count out of this array, so we implement it as a custom Spark UDF.
session.udf.register("filterType", (arr: Seq[String]) => {
  // Parse each "ShipType~count" entry into a (shiptype, count) pair;
  // a malformed count falls back to 0
  val pairs: Seq[(String, Long)] = arr.map { type_count_str =>
    val type_count = type_count_str.split("~")
    val shiptype = type_count(0)
    val count =
      try type_count(1).toLong
      catch { case e: Exception => 0L }
    (shiptype, count)
  }
  // Sort by count descending and take the top entry.
  // Note: do NOT call toMap before taking head -- a Map does not
  // preserve insertion order, so the sort would be lost.
  pairs.sortBy(-_._2).head.toString()
})
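The parsing-and-pick-max logic inside the UDF can be tested as a plain function. A sketch using `maxBy`, which sidesteps the ordering pitfall entirely (returning just the ShipType string is an assumption about what downstream code needs):

```scala
// Pure version of the UDF body: pick the ShipType with the highest
// count from entries like "Cargo~12". Malformed counts default to 0.
def filterType(arr: Seq[String]): String = {
  val parsed: Seq[(String, Long)] = arr.map { s =>
    val parts = s.split("~")
    val count = try parts(1).toLong catch { case _: Exception => 0L }
    (parts(0), count)
  }
  parsed.maxBy(_._2)._1
}

filterType(Seq("Cargo~12", "Tanker~3"))  // "Cargo"
```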
Note: when registering the UDF, the input parameter type must not be Array, e.g.:
session.udf.register("filterType", (arr: Array[String]) => {...})
With Array, the following exception is thrown:
Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
The fix is to declare a parent type of Array: changing Array to Seq still lets an Array value be passed in. The reason is that the value Spark hands the UDF is not Scala's native Array but a wrapper around one (WrappedArray). A more detailed explanation (quoted from Stack Overflow):
So it looks like the ArrayType on Dataframe "idDF" is really a WrappedArray and not an Array - so the function call to "filterMapKeysWithSet" failed as it expected an Array but got a WrappedArray/Seq instead (which doesn't implicitly convert to Array in Scala 2.8 and above).
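A minimal sketch of why Seq works where Array does not: assigning an Array to a Seq wraps it, and the wrapper cannot be cast back to a native Array[String], which is exactly the ClassCastException Spark's UDF call hits:

```scala
import scala.util.Try

// Assigning an Array to a Seq wraps it (WrappedArray / ArraySeq
// depending on the Scala version); it is no longer a native Array
val wrapped: Seq[String] = Array("Cargo~12", "Tanker~3")

// Casting the wrapper back to a native Array fails at runtime
// with a ClassCastException, so cast.isFailure is true
val cast = Try(wrapped.asInstanceOf[Array[String]])

// A parameter declared as Seq[String] accepts both forms
def head(arr: Seq[String]): String = arr.head

head(wrapped)          // the wrapper already is a Seq
head(Array("a", "b"))  // a native Array converts implicitly to Seq
```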