Spark DataFrame / Dataset reduceByKey usage

case class Record(ts: Long, id: Int, value: Int)
With an RDD, we often use reduceByKey to keep the record with the latest timestamp for each key, like this:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

def findLatest(records: RDD[Record])(implicit spark: SparkSession) = {
  records.keyBy(_.id).reduceByKey {
    // Keep whichever record carries the newer timestamp.
    (x, y) => if (x.ts > y.ts) x else y
  }.values
}
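A quick usage sketch, assuming a SparkSession named spark is in scope (the sample data and the printed result are illustrative):

val records = spark.sparkContext.parallelize(Seq(
  Record(ts = 100L, id = 1, value = 10),
  Record(ts = 200L, id = 1, value = 20),  // newer timestamp for id 1, so this row wins
  Record(ts = 150L, id = 2, value = 30)))

findLatest(records)(spark).collect()
// => Array(Record(200,1,20), Record(150,2,30))  -- order may vary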
With a Dataset/DataFrame, the same thing can be done as follows:
import org.apache.spark.sql.functions._
val newDF = df.groupBy('id).agg(max(struct('ts, 'value)) as 'tmp).select($"id", $"tmp.*")
Why does this work? For struct (and tuple) columns, max compares values field by field in declaration order: the first field decides, and later fields only break ties. So grouping by id and taking max(struct('ts, 'value)) keeps, for each id, the row with the largest ts.
A more detailed example:
import org.apache.spark.sql.functions._
import spark.implicits._  // brings in toDF; already imported automatically in spark-shell

val data = Seq(
  ("michael", 1, "event 1"),
  ("michael", 2, "event 2"),
  ("reynold", 1, "event 3"),
  ("reynold", 3, "event 4")).toDF("user", "time", "event")

val newestEventPerUser = 
  data
    .groupBy('user)
    .agg(max(struct('time, 'event)) as 'event)
    .select($"user", $"event.*") // Unnest the struct into top-level columns.
scala> newestEventPerUser.show()
+-------+----+-------+                                                          
|   user|time|  event|
+-------+----+-------+
|reynold|   3|event 4|
|michael|   2|event 2|
+-------+----+-------+
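When the first struct field ties, the comparison falls through to the next field. A minimal sketch of that case, with hypothetical data reusing the same pattern:

val tied = Seq(
  ("michael", 2, "event a"),
  ("michael", 2, "event b")).toDF("user", "time", "event")

tied.groupBy('user)
  .agg(max(struct('time, 'event)) as 'event)
  .select($"user", $"event.*")
  .show()
// +-------+----+-------+
// |   user|time|  event|
// +-------+----+-------+
// |michael|   2|event b|
// +-------+----+-------+
// "event b" wins because the time fields tie and "event b" > "event a" as strings.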
For a more complex case, refer to the following:
case class AggregateResultModel(id: String,
                                mtype: String,
                                healthScore: Int,
                                mortality: Float,
                                reimbursement: Float)
// assume that rawScores was loaded beforehand from JSON/CSV files

val groupedResultSet = rawScores.as[AggregateResultModel]
  .groupByKey(item => (item.id, item.mtype))
  .reduceGroups((x, y) => getMinHealthScore(x, y))
  .map(_._2)

// The binary function used in reduceGroups: keep the row with the lower
// healthScore; on ties, prefer the higher mortality, then the lower reimbursement.
def getMinHealthScore(x: AggregateResultModel, y: AggregateResultModel): AggregateResultModel = {
  if (x.healthScore > y.healthScore) y
  else if (x.healthScore < y.healthScore) x
  else if (x.mortality < y.mortality) y
  else if (x.mortality > y.mortality) x
  else if (x.reimbursement < y.reimbursement) x
  else y
}
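Putting it together, a hedged end-to-end sketch that substitutes inline sample data for the JSON/CSV load (the values are made up for illustration):

import spark.implicits._

val rawScores = Seq(
  AggregateResultModel("a", "m1", healthScore = 5, mortality = 0.1f, reimbursement = 100f),
  AggregateResultModel("a", "m1", healthScore = 3, mortality = 0.2f, reimbursement = 80f),
  AggregateResultModel("b", "m2", healthScore = 7, mortality = 0.3f, reimbursement = 60f)).toDS()

val groupedResultSet = rawScores
  .groupByKey(item => (item.id, item.mtype))
  .reduceGroups((x, y) => getMinHealthScore(x, y))
  .map(_._2)

groupedResultSet.show()
// For (a, m1) the healthScore = 3 row survives; (b, m2) keeps its only row.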


Ref: https://stackoverflow.com/questions/41236804/spark-dataframes-reducing-by-key
