Spark RDD Grouping and Aggregation Case Study

Problems
1. Across all teachers, find the Top 3 most popular teachers.
2. For each subject, find the Top 3 most popular teachers (implement this in at least two or three ways).

Sample data
http://bigdata.edu360.cn/laozhang
http://bigdata.edu360.cn/laozhang
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laoduan
http://bigdata.edu360.cn/laoduan
http://javaee.edu360.cn/xiaoxu
http://javaee.edu360.cn/xiaoxu
http://javaee.edu360.cn/laoyang
http://javaee.edu360.cn/laoyang
http://javaee.edu360.cn/laoyang
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laoduan
http://bigdata.edu360.cn/laoduan
http://javaee.edu360.cn/xiaoxu
http://javaee.edu360.cn/xiaoxu
http://javaee.edu360.cn/laoyang
http://javaee.edu360.cn/laoyang
http://javaee.edu360.cn/laoyang
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laoduan
http://bigdata.edu360.cn/laoduan
http://javaee.edu360.cn/xiaoxu
http://javaee.edu360.cn/xiaoxu
http://javaee.edu360.cn/laoyang
http://javaee.edu360.cn/laoyang
http://javaee.edu360.cn/laoyang
http://php.edu360.cn/laoli
http://php.edu360.cn/laoliu
http://php.edu360.cn/laoli
http://php.edu360.cn/laoli

1. Across all teachers, find the Top 3 most popular teachers

Approach: standard word-count logic. First parse the data, map each line to a (teacher, 1) tuple, then aggregate this RDD by key, sum the counts, and sort in descending order.

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object TeacherTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("teacherWork").setMaster("local[3]")
    val sparkContext = new SparkContext(conf)
    val lines: RDD[String] = sparkContext.textFile(args(0))
    // The teacher name is the last path segment of each URL
    val teacherAndOne: RDD[(String, Int)] = lines.map(line => {
      val teacherName: String = line.split("/")(3)
      (teacherName, 1)
    })
    // Sum the counts per teacher, then sort by count in descending order
    val result: RDD[(String, Int)] = teacherAndOne.reduceByKey(_ + _).sortBy(_._2, false)
    println(result.collect().toBuffer)
    sparkContext.stop()
  }
}
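
With the sample data above, collecting the sorted RDD should print something like the following (the relative order of the two teachers tied at 6 is not guaranteed):

ArrayBuffer((laozhao,15), (laoyang,9), (laoduan,6), (xiaoxu,6), (laoli,3), (laozhang,2), (laoliu,1))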

2. For each subject, find the Top 3 most popular teachers (implement this in at least two or three ways)

Solution 1:
Approach: build ((project, teacherName), 1) tuples and aggregate them with key = (project, teacherName) and value = 1; then group by project (the subject) and sort each group's values, keeping the top 3. Note that the sort here is Scala's own in-memory sort, not Spark's.

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object TeacherGroupT1 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("teacherWork").setMaster("local[3]")
    val sparkContext = new SparkContext(conf)
    val lines: RDD[String] = sparkContext.textFile(args(0))
    // Parse each URL into ((subject, teacher), 1)
    val projectAndTeacher: RDD[((String, String), Int)] = lines.map(line => {
      val teacherName: String = line.split("/")(3)
      val project: String = line.split("/")(2).split("[.]")(0)
      ((project, teacherName), 1)
    })
    // Aggregate the counts per (subject, teacher) pair
    val reduced: RDD[((String, String), Int)] = projectAndTeacher.reduceByKey(_ + _)
    // Group by subject, then sort each group in memory and keep the top 3
    val projectReduced: RDD[(String, Iterable[((String, String), Int)])] = reduced.groupBy(_._1._1)
    val result: RDD[(String, List[((String, String), Int)])] = projectReduced.mapValues(_.toList.sortBy(_._2).reverse.take(3))
    println(result.collect().toBuffer)
    sparkContext.stop()
  }
}

The problem:

 val projectReduced: RDD[(String, Iterable[((String, String), Int)])] = reduced.groupBy(_._1._1)
 val result: RDD[(String, List[((String, String), Int)])] = projectReduced.mapValues(_.toList.sortBy(_._2).reverse.take(3))

First, groupBy is inefficient. Second, mapValues calls toList, which materializes the entire group in memory before sorting it. With a large data set this causes performance problems and can throw OutOfMemoryError, so this usage is not recommended.
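
One way to keep the per-subject state bounded without leaving this approach entirely is to fold each group into a top-3 buffer instead of materializing it with toList. A minimal sketch (my own variation, not from the original post), reusing projectReduced from the code above:

// Fold each subject's records into a buffer that never holds more than 4 elements
val result: RDD[(String, List[((String, String), Int)])] =
  projectReduced.mapValues { it =>
    it.foldLeft(List.empty[((String, String), Int)]) { (top, x) =>
      (x :: top).sortBy(-_._2).take(3)
    }
  }

This avoids the extra toList copy and the full in-memory sort, but groupBy itself still materializes each group, which is why the next two solutions move away from it.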

Solution 2:
Approach: avoid the in-memory toList sort by using Spark's own sortBy. Under the hood it uses a RangePartitioner (through a ShuffledRDD) to repartition the data by key range, so each partition only has to sort its own slice of the range; for example, values 1-100 land in the first partition and 100-200 in the second.
The code first applies a filter to pull out the records of a single subject, then sorts just that subset and takes the top 3. Without the filter, a single global sort would rank all subjects together, and the per-subject result would be wrong.

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object TeacherGroupT2 {
  def main(args: Array[String]): Unit = {
    val subjects = Array("javaee", "bigdata", "php")
    val conf = new SparkConf().setAppName("teacherWork").setMaster("local[3]")
    val sparkContext = new SparkContext(conf)
    val lines: RDD[String] = sparkContext.textFile(args(0))
    val projectAndTeacher: RDD[((String, String), Int)] = lines.map(line => {
      val teacherName: String = line.split("/")(3)
      val project: String = line.split("/")(2).split("[.]")(0)
      ((project, teacherName), 1)
    })
    val reduced: RDD[((String, String), Int)] = projectAndTeacher.reduceByKey(_ + _)
    // Handle one subject at a time: filter its records, then let Spark sort them
    for (sb <- subjects) {
      val filtered: RDD[((String, String), Int)] = reduced.filter(_._1._1 == sb)
      val result = filtered.sortBy(_._2, false).take(3)
      println(result.toBuffer)
    }
    sparkContext.stop()
  }
}
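
With the sample data, the three loop iterations should print something like:

ArrayBuffer(((javaee,laoyang),9), ((javaee,xiaoxu),6))
ArrayBuffer(((bigdata,laozhao),15), ((bigdata,laoduan),6), ((bigdata,laozhang),2))
ArrayBuffer(((php,laoli),3), ((php,laoliu),1))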

The problem:
Multiple shuffles: reduceByKey triggers one shuffle, and sortBy then triggers one additional shuffle per subject, so the number of shuffles grows with the number of subjects.
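
A small mitigation worth noting (my addition, not in the original code): since reduced is reused in every iteration of the loop, caching it avoids recomputing the whole lineage up to reduceByKey once per subject:

// Cache the aggregated RDD so the per-subject filter/sortBy passes reuse it
val reduced: RDD[((String, String), Int)] = projectAndTeacher.reduceByKey(_ + _).cache()

This does not remove the per-subject sortBy shuffles, though; for that, see solution 3.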

Solution 3:
Approach: val reduced: RDD[((String, String), Int)] = projectAndTeacher.reduceByKey(sbPartitioner, _ + _)
To address the shuffle problem, use a custom partitioner together with a custom ordering. The custom partitioner routes each subject to its own partition, and inside each partition the records are fed one at a time into a bounded sorted set, insertion-sort style, which avoids the problems caused by pulling all the data into memory at once to sort it.

import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

object TeacherGroupT3 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("teacherWork").setMaster("local[3]")
    val sparkContext = new SparkContext(conf)
    val lines: RDD[String] = sparkContext.textFile(args(0))
    val projectAndTeacher: RDD[((String, String), Int)] = lines.map(line => {
      val teacherName: String = line.split("/")(3)
      val project: String = line.split("/")(2).split("[.]")(0)
      ((project, teacherName), 1)
    })

    // Collect the distinct subjects to the driver and build the custom partitioner
    val subjects: Array[String] = projectAndTeacher.map(_._1._1).distinct().collect()
    val sbPartitioner = new SubjectPartitioner(subjects)

    // Aggregate and partition in a single shuffle: one partition per subject
    val reduced: RDD[((String, String), Int)] = projectAndTeacher.reduceByKey(sbPartitioner, _ + _)

    val result: RDD[((String, String), Int)] = reduced.mapPartitions(it => {
      // naive alternative: it.toList.sortBy(_._2).reverse.take(3).iterator
      // A bounded TreeSet keeps only the current top 3, so memory stays constant
      var set = new mutable.TreeSet[((String, String), Int)]()(new MyComparator)
      val topN = 3
      it.foreach(x => {
        set += x
        if (set.size > topN) {
          set = set.dropRight(1) // evict the smallest of the four
        }
      })
      set.iterator
    })
    println(result.collect().toBuffer)
    sparkContext.stop()
  }
}

// Routes each (subject, teacher) key to the partition assigned to its subject
class SubjectPartitioner(val subjects: Array[String]) extends Partitioner {

  // Build a subject -> partition-index lookup table once, on the driver
  private val rules: mutable.HashMap[String, Int] = new mutable.HashMap[String, Int]()
  for ((sb, i) <- subjects.zipWithIndex) {
    rules += ((sb, i))
  }

  // One partition per subject
  override def numPartitions: Int = subjects.length

  override def getPartition(key: Any): Int = {
    val subject = key.asInstanceOf[(String, String)]._1
    rules(subject)
  }
}
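
A quick driver-side sanity check of the partitioner (the subject order here is hypothetical; at runtime it depends on what distinct().collect() returns):

val p = new SubjectPartitioner(Array("javaee", "bigdata", "php"))
p.getPartition(("bigdata", "laozhao"))  // 1
p.getPartition(("php", "laoli"))        // 2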

// Descending by count; ties are broken by teacher name so that distinct
// entries with equal counts are not collapsed by the TreeSet
class MyComparator extends Ordering[((String, String), Int)] {
  override def compare(x: ((String, String), Int), y: ((String, String), Int)): Int = {
    val byCount = y._2 - x._2
    if (byCount != 0) byCount else x._1._2.compareTo(y._1._2)
  }
}
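
For reference, the same descending-by-count, tie-break-by-name ordering can be written more compactly; a sketch using Scala's Ordering.by:

val byCountDesc: Ordering[((String, String), Int)] =
  Ordering.by(t => (-t._2, t._1._2))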