Spark: Top 3 per Subject (Without Custom Partitioning)

I. The source data is as follows:

20161123101523 http://h5.learn.com/h5/teacher.shtml
20161123101523 http://h5.learn.com/h5/course.shtml
20161123101523 http://bigdatalearn.com/bigdata/teacher.shtml
20161123101523 http://java.learn.com/java/video.shtml
20161123101523 http://bigdata.learn.com/bigdata/teacher.shtml
20161123101523 http://ui.learn.com/ui/course.shtml
20161123101523 http://bigdata.learn.com/bigdata/teacher.shtml
20161123101523 http://h5.learn.com/h5/course.shtml
20161123101523 http://java.learn.com/java/video.shtml
20161123101523 http://ui.learn.com/ui/video.shtml
20161123101523 http://h5.learn.com/h5/course.shtml
20161123101523 http://h5.learn.com/h5/teacher.shtml
20161123101523 http://bigdatalearn.com/bigdata/teacher.shtml
20161123101523 http://bigdata.learn.com/bigdata/video.shtml
20161123101523 http://ui.learn.com/ui/teacher.shtml
20161123101523 http://java.learn.com/java/video.shtml
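
Each record above is a timestamp and a URL separated by a single space, and the subject can be read off the URL's host name. A minimal sketch of that parsing step in plain Scala (no Spark required) follows. The names `LogLineSketch` and `parseLine` are hypothetical and used only for illustration, and using `java.net.URI` to extract the host is an assumption of this sketch.

```scala
import java.net.URI

object LogLineSketch {
  // Split "timestamp url" on the single space and return (subject, url).
  // Assumption: the subject is identified by the URL's host name.
  def parseLine(line: String): (String, String) = {
    val fields = line.split(" ")
    val url = fields(1)
    val subject = new URI(url).getHost
    (subject, url)
  }

  def main(args: Array[String]): Unit = {
    println(parseLine("20161123101523 http://h5.learn.com/h5/teacher.shtml"))
  }
}
```

Under this rule `bigdatalearn.com` and `bigdata.learn.com` are different host names, so the two spellings that appear in the data would be counted as separate subjects.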

 

II. Goal:

For each subject, find the top 3 URLs by number of visits.

 

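III. Approach:

1. First merge all records that share the same key by summing their counts, which gives a result like the following: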

java-jdbc-8
java-js2e-18
java-high concurrency-100
java-middleware-15
h5-canvas-18
h5-js-184
h5-event-driven-20
h5-css-192
h5-easyui-3
bigdata-kudu-38
bigdata-Spark-300
bigdata-Sqoop-9
bigdata-Hbase-150
bigdata-Hadoop-280

 

2. Group by key (the subject), then sort each group by count and take the top 3.
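
The two steps can be tried out on ordinary Scala collections before moving to RDDs. The sketch below is not the Spark job itself: `Top3Sketch`, `mergeCounts`, and `top3PerSubject` are hypothetical names, and the input records are made-up partial counts chosen so that one key appears twice.

```scala
object Top3Sketch {
  // Step 1: merge records that share a key by summing their counts.
  def mergeCounts(records: Seq[((String, String), Int)]): Map[(String, String), Int] =
    records.groupBy(_._1).map { case (key, group) => key -> group.map(_._2).sum }

  // Step 2: group by subject, sort each group by count (descending), keep the first 3.
  def top3PerSubject(counts: Map[(String, String), Int]): Map[String, List[(String, Int)]] =
    counts.toList
      .map { case ((subject, topic), n) => (subject, (topic, n)) }
      .groupBy(_._1)
      .map { case (subject, rows) => subject -> rows.map(_._2).sortBy(-_._2).take(3) }

  def main(args: Array[String]): Unit = {
    // Hypothetical partial counts; ("bigdata", "Spark") appears twice and is merged to 300.
    val records = Seq(
      (("bigdata", "Spark"), 100), (("bigdata", "Spark"), 200),
      (("bigdata", "kudu"), 38), (("bigdata", "Sqoop"), 9),
      (("bigdata", "Hbase"), 150), (("bigdata", "Hadoop"), 280)
    )
    top3PerSubject(mergeCounts(records)).foreach(println)
  }
}
```

The same shape carries over to Spark, where summing per key is typically done with `reduceByKey` and the per-subject grouping with `groupBy`.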

 

IV. The code is as follows:

package scalapackage.testspark

import java.net.URL

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by Germmy on 2018/5/12.
  */
object TestTop3 {


  def main(args: Array[String]): Unit = {

    val sparkConf: SparkConf = new SparkConf().setAppName("SparkTop3").setMaster("local[*]")

    val sc: SparkContext = new SparkContext(sparkConf)

    val file: RDD[String] = sc.textFile("D:\\temp\\course.txt")

    // 1. Count visits per URL: map each line to (url, 1), then reduceByKey
    val res=file.map(x=>{
      val split: Array[String] = x.split(" ")
      val url: String = split(1)
      (url,1)
    })

    val sumedUrls: RDD[(String, Int)] = res.reduceByKey(_+_)

    // 2. Extract the subject (the URL's host name)
    val res2=sumedUrls.map(x=>{
      val url=x._1
      val count=x._2
      val project=new URL(url).getHost
      (project,url,count)
    })

    // 3. Group by subject, sort each group by count in descending order, and keep the top 3
    val values: RDD[(String, List[(String, String, Int)])] = res2.groupBy(_._1).mapValues(_.toList.sortBy(-_._3).take(3))
    println(values.collect().toBuffer)

    sc.stop()
  }



}
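
One property of the job above is that `groupBy` collects every URL of a subject into a single in-memory list before sorting. A possible alternative, sketched here in plain Scala rather than taken from the original post, is to keep a buffer of at most 3 entries per subject and merge buffers pairwise. `BoundedTop3Sketch`, `insert`, and `merge` are hypothetical names; the two functions have the `(U, V) => U` and `(U, U) => U` shapes that Spark's `aggregateByKey(zeroValue)(seqOp, combOp)` expects, so they could be passed to it on an RDD of `(subject, (url, count))` pairs.

```scala
object BoundedTop3Sketch {
  type Entry = (String, Int) // (url, count)

  // Fold one entry into a buffer that never holds more than n entries.
  def insert(n: Int): (List[Entry], Entry) => List[Entry] =
    (top, e) => (e :: top).sortBy(-_._2).take(n)

  // Merge two partial buffers (for example, from two partitions) into one of at most n entries.
  def merge(n: Int): (List[Entry], List[Entry]) => List[Entry] =
    (a, b) => (a ++ b).sortBy(-_._2).take(n)

  def main(args: Array[String]): Unit = {
    // Hypothetical (url, count) entries for one subject, split as if across two partitions.
    val entries = List(("u1", 4), ("u2", 9), ("u3", 1), ("u4", 7))
    val (left, right) = entries.splitAt(2)
    val a = left.foldLeft(List.empty[Entry])(insert(3))
    val b = right.foldLeft(List.empty[Entry])(insert(3))
    println(merge(3)(a, b))
  }
}
```

Because each buffer is capped at 3 entries, the amount of data held per subject stays constant no matter how many URLs a subject has.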

 

V. The output is as follows:

 

Reposted from: https://my.oschina.net/windows20/blog/1811388
