Preface:
In earlier posts we covered how to write a UDAF, explained the difference between map and flatMap, and used flatMap to achieve the effect of a UDTF. In this post we show how to use mapPartitions() to achieve the effect of a UDAF.
1. Jumping straight into a code example of the implementation
//in scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum
import scala.collection.mutable.ArrayBuffer
object MapPartitionsUDAF {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mapPartitions-UDAF")
      .master("local[2]")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._
    // Build fake data: employee ID, employee info (hire year, job level), and company
    val df = spark.createDataFrame(
      Seq(
        (123335, Seq(2020, 4), "HuaWei"),
        (124353, Seq(2020, 4), "HuaWei"),
        (142367, Seq(2020, 4), "HuaWei"),
        (133346, Seq(2021, 7), "HuaWei"),
        (137654, Seq(2021, 7), "HuaWei"),
        (142373, Seq(2021, 5), "HuaWei"),
        (424546, Seq(2021, 8), "Apple"),
        (427789, Seq(2020, 4), "Apple"),
        (422456, Seq(2020, 4), "Apple"),
        (427432, Seq(2021, 7), "Apple"),
        (427854, Seq(2021, 5), "Apple"),
        (424765, Seq(2021, 7), "Apple")
      )).toDF("userId", "info", "company")
    df.show()
    // Use mapPartitions to count employees per (company, hire year, level)
    val df1: DataFrame = df
      .mapPartitions(iter => {
        // Partial aggregation within this partition: key = (hire year, level, company)
        val userTypeMap = scala.collection.mutable.Map[(Int, Int, String), Long]()
        val resultList = ArrayBuffer[(Int, Int, String, Long)]()
        while (iter.hasNext) {
          val row = iter.next()
          val company = row.getAs[String]("company")
          val userInfo = row.getAs[Seq[Int]]("info")
          val inYear = userInfo.head
          val level = userInfo(1)
          // Accumulate the head count for each (hire year, level, company)
          userTypeMap((inYear, level, company)) = userTypeMap.getOrElse((inYear, level, company), 0L) + 1L
        }
        // Emit this partition's partial counts
        for ((k, v) <- userTypeMap) {
          resultList.append((k._1, k._2, k._3, v))
        }
        resultList.iterator
      }).toDF("inYear", "level", "company", "UserCnt")
      // The same key can appear in several partitions, so merge the
      // partial counts with a final groupBy + sum
      .groupBy("inYear", "level", "company")
      .agg(sum("UserCnt").as("UserCnt"))
    df1.show()
  }
  // A case class can be used to define the output schema and types, but it is
  // optional; the example above does not use it (a variant that does is
  // sketched after the full example).
  case class EmployeeCounts(inYear: Int,
                            level: Int,
                            company: String,
                            UserCnt: Long)
}
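For reference, here is a minimal sketch of the case-class variant mentioned in the comment above. It assumes the same spark session, df, sum import, and EmployeeCounts definition from the example are in scope; returning EmployeeCounts instances from mapPartitions yields a typed Dataset[EmployeeCounts], so the column names come from the case class fields instead of a toDF(...) call:
//in scala
val ds1 = df
  .mapPartitions(iter => {
    // Same partial aggregation per partition as above
    val counts = scala.collection.mutable.Map[(Int, Int, String), Long]()
    iter.foreach { row =>
      val info = row.getAs[Seq[Int]]("info")
      val key = (info.head, info(1), row.getAs[String]("company"))
      counts(key) = counts.getOrElse(key, 0L) + 1L
    }
    // Emit typed records; the encoder comes from spark.implicits._
    counts.iterator.map { case ((y, l, c), n) => EmployeeCounts(y, l, c, n) }
  })
  .groupBy("inYear", "level", "company")
  .agg(sum("UserCnt").as("UserCnt"))
ds1.show()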
Results:
The input DF is:
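(Reconstructed from the fake data built above; this is what df.show() prints.)
+------+---------+-------+
|userId|     info|company|
+------+---------+-------+
|123335|[2020, 4]| HuaWei|
|124353|[2020, 4]| HuaWei|
|142367|[2020, 4]| HuaWei|
|133346|[2021, 7]| HuaWei|
|137654|[2021, 7]| HuaWei|
|142373|[2021, 5]| HuaWei|
|424546|[2021, 8]|  Apple|
|427789|[2020, 4]|  Apple|
|422456|[2020, 4]|  Apple|
|427432|[2021, 7]|  Apple|
|427854|[2021, 5]|  Apple|
|424765|[2021, 7]|  Apple|
+------+---------+-------+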
The output DF is:
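(Reconstructed from the counts implied by the data; row order after the shuffle may vary.)
+------+-----+-------+-------+
|inYear|level|company|UserCnt|
+------+-----+-------+-------+
|  2020|    4| HuaWei|      3|
|  2021|    7| HuaWei|      2|
|  2021|    5| HuaWei|      1|
|  2021|    8|  Apple|      1|
|  2020|    4|  Apple|      2|
|  2021|    7|  Apple|      2|
|  2021|    5|  Apple|      1|
+------+-----+-------+-------+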
This matches the expected output, confirming that mapPartitions() + groupBy() + agg() can reproduce the aggregation behavior of a UDAF: the mapPartitions step computes partial counts within each partition, and the final groupBy + sum merges those partials across partitions, mirroring the update/merge phases of a UDAF.
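As a quick cross-check (not part of the flow above), the same counts can be computed with built-in DataFrame operations alone. A minimal sketch, assuming df and spark.implicits._ from the example are in scope; note the count column produced by groupBy().count() is named count rather than UserCnt:
//in scala
val check = df
  .select($"info"(0).as("inYear"), $"info"(1).as("level"), $"company")
  .groupBy("inYear", "level", "company")
  .count()
check.show()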
2. Comparing the implementation flows of a UDAF versus mapPartitions() + groupBy() + agg()