When processing data with Spark SQL, we want to save data as a Map and then read the Map back out.
After aggregating the grouped column values, they are concatenated into a single string, e.g.: BSV ANGLIA_1~BSV ANGLIA---_2
SELECT MMSI, IMO, concat_ws("~",collect_set(concat_ws("_",ShipName,name_num))) as shipNum, max(time) as last_time, sum(name_num) as all_num
FROM ship_num
GROUP BY MMSI, IMO
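Under the hood, concat_ws plus collect_set simply builds one delimited string per group; a minimal plain-Scala sketch of that string construction (the sample rows are hypothetical, not the real ship_num data):

```scala
// Hypothetical rows for one (MMSI, IMO) group: (ShipName, name_num)
val rows = Seq(("BSV ANGLIA", 1L), ("BSV ANGLIA---", 2L))

// concat_ws("_", ShipName, name_num) per row,
// then concat_ws("~", collect_set(...)) over the group
val shipNum = rows.map { case (name, num) => s"${name}_$num" }.mkString("~")
// shipNum: "BSV ANGLIA_1~BSV ANGLIA---_2"
```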
To compute each name's share of the total count, with the higher percentages shown first, define a custom UDF and store the result in a TreeMap:
import java.util
import scala.collection.mutable

/**
 * Compute each name's share of the total.
 * @param shipName e.g. BSV ANGLIA_1~BSV ANGLIA---_2
 * @param all total count, e.g. 3
 * @return Map(0.67 -> BSV ANGLIA---, 0.33 -> BSV ANGLIA)
 */
def showPercent(shipName: String, all: String): mutable.Map[String, String] = {
  // TreeMap sorts by key, i.e. by the formatted percentage string
  val map = new util.TreeMap[String, String]()
  val shipNums = shipName.split("~")
  for (ship_num <- shipNums) {
    val name_num = ship_num.split("_")
    val percent = (name_num(1).toDouble / all.toDouble).formatted("%.2f")
    // Caveat: two names with the same percentage share one key,
    // so the later entry overwrites the earlier one
    map.put(percent, name_num(0))
  }
  import scala.collection.JavaConverters._
  // descendingMap() returns the view with the highest percentage first
  map.descendingMap().asScala
}
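The UDF body can be exercised without Spark; a quick check of showPercent on the sample input from the doc comment:

```scala
import java.util
import scala.collection.mutable
import scala.collection.JavaConverters._

def showPercent(shipName: String, all: String): mutable.Map[String, String] = {
  val map = new util.TreeMap[String, String]()
  for (shipNum <- shipName.split("~")) {
    val nameNum = shipNum.split("_")
    // key: percentage formatted to two decimals; value: the ship name
    map.put((nameNum(1).toDouble / all.toDouble).formatted("%.2f"), nameNum(0))
  }
  map.descendingMap().asScala // highest percentage first
}

val result = showPercent("BSV ANGLIA_1~BSV ANGLIA---_2", "3")
// e.g. Map(0.67 -> BSV ANGLIA---, 0.33 -> BSV ANGLIA)
```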
Register the function on the SparkSession:
// Custom UDF: compute each name's percentage share
session.udf.register[mutable.Map[String, String], String, String]("percent", showPercent _)
Call it in SQL:
SELECT t.MMSI, t.IMO, t.shipNum, percent(t.shipNum, t.all_num) as percent, s.name as last_name, t.last_time as last_time, t.all_num
FROM (
SELECT MMSI, IMO, concat_ws("~",collect_set(concat_ws("_",ShipName,name_num))) as shipNum, max(time) as last_time, sum(name_num) as all_num
FROM ship_num
GROUP BY MMSI, IMO
) t
LEFT JOIN ship_num s ON t.MMSI = s.MMSI AND t.IMO = s.IMO and t.last_time = s.time
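The self-join above recovers the ship name that was in effect at last_time. The same lookup can be sketched in plain Scala (the rows and identifiers here are hypothetical):

```scala
// Hypothetical ship_num rows: (MMSI, IMO, ShipName, time)
case class ShipRow(mmsi: String, imo: String, name: String, time: Long)
val rows = Seq(
  ShipRow("123", "IMO1", "BSV ANGLIA", 100L),
  ShipRow("123", "IMO1", "BSV ANGLIA---", 200L)
)

// Per (MMSI, IMO) group, the name on the row with max(time) is last_name
val lastName = rows
  .groupBy(r => (r.mmsi, r.imo))
  .map { case (key, rs) => key -> rs.maxBy(_.time).name }
// lastName(("123", "IMO1")): "BSV ANGLIA---"
```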
Save the query result:
res.write.option("header","true").text("D:/loong/ship_num")
res.write.option("header","true").csv("D:/loong/ship_num")
When saving, it turned out that Spark SQL's .text method only supports a single column, and the .csv method does not support the Map type when writing. So before saving, the Map has to be converted to a String:
session.udf.register("mapToString", (map: Map[String, String]) => {
  // Join each percent_name pair with "~" so adjacent entries stay separated
  map.map { case (k, v) => k + "_" + v }.mkString("~")
})
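A plain-Scala check of this conversion, joining the entries with "~" so the pairs remain distinguishable:

```scala
// Same map-to-string conversion, runnable without Spark
val mapToString = (map: Map[String, String]) =>
  map.map { case (k, v) => k + "_" + v }.mkString("~")

val s = mapToString(Map("0.67" -> "BSV ANGLIA---", "0.33" -> "BSV ANGLIA"))
// s: "0.67_BSV ANGLIA---~0.33_BSV ANGLIA"
```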
After registering it, call it in the query:
val sql =
"""
select MMSI, IMO, mapToString(percent) as percent, last_name, last_time, all_num
from table
""".stripMargin
Finally, save the result:
val res: DataFrame = session.sql(sql).coalesce(10)
res.write.option("header","true").csv("D:/loong/ship_num_01")