Big Data with Spark: Custom Sorting

Custom Sorting

Use the RDD sortBy or sortByKey method to sort data with flexible, requirement-specific rules. The given data has three fields: name, age, and fv (face value). Sort first by fv in descending order; when fv values are equal, sort by age in ascending order. If no ordering is specified and the Bean is sorted directly with sortBy(x => x), an error is reported.

Data

name,age,fv
taoge,30,99.99
xingge,18,9999.99
hangge,29,99.99
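
Applying the rule above (fv descending, then age ascending when fv ties) to this data gives:

xingge,18,9999.99
hangge,29,99.99
taoge,30,99.99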

Incorrect example

class BoyBean(val name: String, var age: Int, val fv: Double)
package com.doit.spark.day08

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortDemo {
  def main(args: Array[String]): Unit = {

    val conf: SparkConf = new SparkConf().setAppName("ContinuousLogin")

    val setMaster = args(0).toBoolean

    if (setMaster){
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    val textFile = args(1)
    val line: RDD[String] = sc.textFile(textFile)
    
    val maped: RDD[BoyBean] = line.map(line => {
      val sp: Array[String] = line.split(",")
      val name = sp(0)
      var age = sp(1).toInt
      val fv = sp(2).toDouble
      new BoyBean(name, age, fv)
    })
    
    //sortBy(x => x) needs an implicit Ordering[BoyBean] in scope; none exists yet, so this reports an error
    val sorted: RDD[BoyBean] = maped.sortBy(x => x)
    println(sorted.collect().toBuffer)
    
  }
}

[Screenshot: the error reported when sorting BoyBean directly, since no implicit Ordering[BoyBean] is available]

Sorting approach 1:

Have the Bean implement the Comparable or Ordered interface (Ordered extends Comparable under the hood) and override its comparison method; note that both interfaces take a type parameter. Because the Bean wraps data that will be shuffled over the network and written to disk, it must also implement the Serializable interface (by default Spark uses Java serialization, which is simple to use but produces a relatively bulky serialized form).

package com.doit.spark.day08

class BoyBean(val name: String, var age: Int, val fv: Double) extends Comparable[BoyBean] with Serializable {

  override def compareTo(o: BoyBean): Int = {
    if (this.fv == o.fv) {
      this.age - o.age                        //same fv: ascending by age
    } else {
      java.lang.Double.compare(o.fv, this.fv) //descending by fv
    }
  }

  override def toString: String = s"Boy($name, $age, $fv)"
}
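
The paragraph above also mentions Ordered; as a minimal sketch (not part of the original example; the class name BoyBeanOrdered is made up for illustration), the same comparison logic can be written with Scala's Ordered trait instead of Comparable:

package com.doit.spark.day08

//Sketch: the same rule written with Ordered (which extends Comparable) instead of Comparable
class BoyBeanOrdered(val name: String, var age: Int, val fv: Double)
  extends Ordered[BoyBeanOrdered] with Serializable {

  override def compare(that: BoyBeanOrdered): Int = {
    if (this.fv == that.fv) {
      this.age - that.age                          //same fv: ascending by age
    } else {
      java.lang.Double.compare(that.fv, this.fv)   //descending by fv
    }
  }

  override def toString: String = s"Boy($name, $age, $fv)"
}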

We can also switch the serializer and register the Bean with it to shrink the serialized data: configure Spark to use the Kryo serializer, whose output is more compact than Java's default serialization (and which is generally faster as well), although it works best when the classes to be serialized are registered:

package com.doit.spark.day08
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortDemo {
  def main(args: Array[String]): Unit = {

    val conf: SparkConf = new SparkConf().setAppName("ContinuousLogin")

    //Use the Kryo serializer: its output is more compact (and generally faster) than Java's default serialization
    conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")

    //Register BoyBean with the Kryo serializer
    conf.registerKryoClasses(Array(classOf[BoyBean]))

    val setMaster = args(0).toBoolean

    if (setMaster){
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    val textFile = args(1)
    val line: RDD[String] = sc.textFile(textFile)
    val maped: RDD[BoyBean] = line.map(line => {
      val sp: Array[String] = line.split(",")
      val name = sp(0)
      var age = sp(1).toInt
      val fv = sp(2).toDouble
      new BoyBean(name, age, fv)
    })
    val sorted: RDD[BoyBean] = maped.sortBy(x => x)
    println(sorted.collect().toBuffer)
  }
}

If you don't want to implement the Serializable interface and toString yourself, define the Bean as a case class: case classes already mix in Serializable and automatically generate toString along with field accessors.
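
For example, a minimal sketch of such a case class that also carries the ordering (the name CaseBoy is hypothetical, used only for illustration):

package com.doit.spark.day08

//A case class already mixes in Serializable, generates toString,
//and exposes its constructor parameters as fields.
case class CaseBoy(name: String, age: Int, fv: Double) extends Ordered[CaseBoy] {
  override def compare(that: CaseBoy): Int = {
    if (this.fv == that.fv) this.age - that.age       //same fv: ascending by age
    else java.lang.Double.compare(that.fv, this.fv)   //descending by fv
  }
}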

Sorting approach 2:

Implement the sorting rule as an implicit Ordering (via an implicit object).
Import it into scope before calling the RDD's sortBy method.

package com.doit.spark.day08

case class BoyBean(val name: String, var age: Int, val fv: Double) extends Serializable {

  override def toString: String = s"Boy($name, $age, $fv)"
}

object BoySortBy {
  //Define an implicit object BoyOrdering that supplies the Ordering[BoyBean]
  implicit object BoyOrdering extends Ordering[BoyBean] {

    override def compare(x: BoyBean, y: BoyBean): Int = {
      if (x.fv == y.fv) {
        x.age - y.age                        //same fv: ascending by age
      } else {
        java.lang.Double.compare(y.fv, x.fv) //descending by fv
      }
    }
  }
}
package com.doit.spark.day08

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortDemo {
  def main(args: Array[String]): Unit = {
   

    val conf: SparkConf = new SparkConf().setAppName("ContinuousLogin")

    //Use the Kryo serializer: its output is more compact (and generally faster) than Java's default serialization
    conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")

    //Register BoyBean with the Kryo serializer
    conf.registerKryoClasses(Array(classOf[BoyBean]))

    val setMaster = args(0).toBoolean

    if (setMaster){
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    val textFile = args(1)
    val line: RDD[String] = sc.textFile(textFile)

    val maped: RDD[BoyBean] = line.map(line => {
      val sp: Array[String] = line.split(",")
      val name = sp(0)
      var age = sp(1).toInt
      val fv = sp(2).toDouble
      new BoyBean(name, age, fv)
    })
    //Import the implicit Ordering so that sortBy(x => x) can resolve it
    import BoySortBy.BoyOrdering
    val sorted: RDD[BoyBean] = maped.sortBy(x => x)
    println(sorted.collect().toBuffer)
  }
}

Sorting approach 3:

When the data is stored in a tuple or a Bean, you can specify the sort key directly in the sortBy operator:

package com.doit.spark.day08

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortDemo {
  def main(args: Array[String]): Unit = {

    val conf: SparkConf = new SparkConf().setAppName("ContinuousLogin")

    //Use the Kryo serializer: its output is more compact (and generally faster) than Java's default serialization
    conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")

    //Register BoyBean with the Kryo serializer
    conf.registerKryoClasses(Array(classOf[BoyBean]))

    val setMaster = args(0).toBoolean

    if (setMaster){
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    val textFile = args(1)
    val line: RDD[String] = sc.textFile(textFile)

    val maped: RDD[BoyBean] = line.map(line => {
      val sp: Array[String] = line.split(",")
      val name = sp(0)
      var age = sp(1).toInt
      val fv = sp(2).toDouble
      new BoyBean(name, age, fv)
    })

    //Specify the sort key directly: fv descending (negated), then age ascending; tuples work the same way (see the sketch after this code)
    val sorted: RDD[BoyBean] = maped.sortBy(x => (-x.fv,x.age))
    println(sorted.collect().toBuffer)
  }
}
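
For completeness, a minimal sketch of the tuple case mentioned in the comment above, assuming the same line: RDD[String] as in the example (the names tuples and sortedTuples are made up for illustration):

    //Sketch: keep the data in tuples instead of a Bean and sort the same way
    val tuples: RDD[(String, Int, Double)] = line.map(l => {
      val sp = l.split(",")
      (sp(0), sp(1).toInt, sp(2).toDouble)
    })
    //fv descending (negated), then age ascending
    val sortedTuples: RDD[(String, Int, Double)] = tuples.sortBy(t => (-t._3, t._2))
    println(sortedTuples.collect().toBuffer)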

Sorting approach 4:

Define the custom-ordering implicit value right above the sortBy call:

package com.doit.spark.day08

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortDemo {
  def main(args: Array[String]): Unit = {


    val conf: SparkConf = new SparkConf().setAppName("ContinuousLogin")

    //Use the Kryo serializer: its output is more compact (and generally faster) than Java's default serialization
    conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")

    //Register BoyBean with the Kryo serializer
    conf.registerKryoClasses(Array(classOf[BoyBean]))

    val setMaster = args(0).toBoolean

    if (setMaster){
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    val textFile = args(1)
    val line: RDD[String] = sc.textFile(textFile)

    val maped: RDD[BoyBean] = line.map(line => {
      val sp: Array[String] = line.split(",")
      val name = sp(0)
      var age = sp(1).toInt
      val fv = sp(2).toDouble
      new BoyBean(name, age, fv)
    })
    
    //Define the implicit Ordering inline: compare by (-fv, age), i.e. fv descending, then age ascending
    implicit val ord: Ordering[BoyBean] = Ordering[(Double, Int)].on[BoyBean](x => (-x.fv, x.age))
    
    val sorted: RDD[BoyBean] = maped.sortBy(x => x)
    println(sorted.collect().toBuffer)

  }
}
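
The opening paragraph also mentions sortByKey; as a minimal sketch (assuming the maped: RDD[BoyBean] from the examples above; the names keyed and sortedByKey are made up for illustration), the same ordering can be expressed by keying each Bean and sorting by key:

    //Sketch: sortByKey variant, keying each Bean by (-fv, age) and sorting by that key
    val keyed: RDD[((Double, Int), BoyBean)] = maped.map(b => ((-b.fv, b.age), b))
    val sortedByKey: RDD[BoyBean] = keyed.sortByKey().values
    println(sortedByKey.collect().toBuffer)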
