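Spark-style wordcount (word counting) with Scala collections

Requirement:

Count the words: tally how often each distinct word occurs in the collection, weighting each string by the number in its tuple, and keep the three words with the highest counts.

Code: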

    val tupleList1 = List(("Hello Scala Spark World ", 4), ("Hello Scala Spark", 3), ("Hello Scala", 2), ("Hello", 1))
    // 0) Expand each (string, count) tuple into one long string by repeating the string `count` times
    val newList: List[String] = tupleList1.map(kv => (kv._1.trim + " ") * kv._2)
    println("Step0 (to strings):   "  + newList)
    // 1) Flat-map: split every string into words and flatten into a single list
    val wordList: List[String] = newList.flatMap(_.split(" "))
    println("Step1 (flatten):      "  + wordList)

    // 2) Group identical words together, e.g. Map(Hello -> List(Hello, Hello, Hello, Hello))
    val groupList: Map[String, List[String]] = wordList.groupBy(elem => elem)
    println("Step2 (group):        "  + groupList)

    // 3) Transform each entry of the grouped map into (word, count), e.g. Map(Hello -> 4)
    // Note: the function passed to map takes ONE argument (a key-value tuple), not two separate arguments
    val countList: Map[String, Int] = groupList.map(kv => {(kv._1,kv._2.size)})
    println("Step3 (count words):  "  + countList)

    // 4) Convert to a list, e.g. List((Hello,10), (Spark,7), (Scala,9), (World,4))
    val tupleList: List[(String, Int)] = countList.toList // a Map cannot be sorted, so convert to a List
    println("Step4 (to tuples):    "  + tupleList)

    // 5) Sort by count in descending order and take the top 3
    //val sortList: List[(String, Int)] = tupleList.sortBy(_._2).reverse.take(3)
    val sortList: List[(String, Int)] = tupleList.sortWith(_._2 > _._2).take(3)
    println("Step5 (sort, top 3):  "  + sortList)
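
Step 1 splits on a single space, which works here only because Step 0 trims each string and appends exactly one space. If the input could contain consecutive spaces or tabs, `split(" ")` would produce empty tokens that then get counted as words. A minimal sketch of a more tolerant tokenizer follows; the helper name `tokenize` and the sample string are assumptions made for illustration and are not part of the original code.

```scala
object TokenizeSketch {
  // Hypothetical helper: trim, split on runs of whitespace, drop any empty tokens
  def tokenize(line: String): List[String] =
    line.trim.split("\\s+").toList.filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    // Made-up input with repeated spaces and a tab
    val messy = "  Hello   Scala\tSpark "
    // Splitting on a single space leaves empty strings in the result
    println(messy.split(" ").toList)
    // Splitting on whitespace runs yields only the three words
    println(tokenize(messy))
  }
}
```

Swapping `_.split(" ")` for such a helper in Step 1 would leave the rest of the pipeline unchanged.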

Shorthand:

    val wordCountList: List[(String, Int)] = tupleList1
      .map(tup => (tup._1.trim + " ") * tup._2)
      .flatMap(_.split(" "))
      .groupBy(elem => elem)
      .map(tup => (tup._1, tup._2.size))
      .toList
      .sortBy(tup => tup._2)
      .reverse
      .take(3)
    println(wordCountList)
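
The two sorting variants used in this post, `sortBy(...).reverse` and `sortWith(_._2 > _._2)`, give the same result for this data because no two words share a count. Scala's sorts are stable, so the variants can differ when counts tie: reversing an ascending sort also reverses the order of tied entries. The sketch below uses made-up sample data to show this. Since the order of entries coming out of `groupBy` is itself unspecified, tie order in the word count should probably not be relied on either way.

```scala
object SortTieSketch {
  // Made-up sample with a tie between "a" and "b"
  val counts: List[(String, Int)] = List(("a", 1), ("b", 1), ("c", 2))

  // Ascending stable sort, then reverse: tied entries come out in reversed input order
  val viaReverse: List[(String, Int)] = counts.sortBy(_._2).reverse

  // Descending stable sort: tied entries keep their input order
  val viaSortWith: List[(String, Int)] = counts.sortWith(_._2 > _._2)

  // Same ordering as sortWith, expressed with a reversed Ordering
  val viaOrdering: List[(String, Int)] = counts.sortBy(_._2)(Ordering[Int].reverse)

  def main(args: Array[String]): Unit = {
    println(viaReverse)  // List((c,2), (b,1), (a,1))
    println(viaSortWith) // List((c,2), (a,1), (b,1))
    println(viaOrdering) // List((c,2), (a,1), (b,1))
  }
}
```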

Printed output:

Step0 (to strings):   List(Hello Scala Spark World Hello Scala Spark World Hello Scala Spark World Hello Scala Spark World , Hello Scala Spark Hello Scala Spark Hello Scala Spark , Hello Scala Hello Scala , Hello )
Step1 (flatten):      List(Hello, Scala, Spark, World, Hello, Scala, Spark, World, Hello, Scala, Spark, World, Hello, Scala, Spark, World, Hello, Scala, Spark, Hello, Scala, Spark, Hello, Scala, Spark, Hello, Scala, Hello, Scala, Hello)
Step2 (group):        Map(Hello -> List(Hello, Hello, Hello, Hello, Hello, Hello, Hello, Hello, Hello, Hello), Spark -> List(Spark, Spark, Spark, Spark, Spark, Spark, Spark), Scala -> List(Scala, Scala, Scala, Scala, Scala, Scala, Scala, Scala, Scala), World -> List(World, World, World, World))
Step3 (count words):  Map(Hello -> 10, Spark -> 7, Scala -> 9, World -> 4)
Step4 (to tuples):    List((Hello,10), (Spark,7), (Scala,9), (World,4))
Step5 (sort, top 3):  List((Hello,10), (Scala,9), (Spark,7))
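
The counts follow directly from the weights in the tuples: Hello occurs in all four strings (4 + 3 + 2 + 1 = 10), Scala in the first three (4 + 3 + 2 = 9), Spark in the first two (4 + 3 = 7), and World only in the first (4). The same totals can be reached without building the repeated strings of Step 0, by pairing each word with its string's weight and summing the weights per word. This is an alternative sketch, not the approach used in the post; the object and value names are chosen here for illustration.

```scala
object WeightedCountSketch {
  val tupleList1 = List(("Hello Scala Spark World ", 4), ("Hello Scala Spark", 3), ("Hello Scala", 2), ("Hello", 1))

  // Pair every word with the weight of the string it came from, instead of repeating the string
  val weighted: List[(String, Int)] =
    tupleList1.flatMap { case (line, n) => line.trim.split(" ").toList.map(word => (word, n)) }

  // Group by word and sum the weights in each group
  val counts: Map[String, Int] =
    weighted.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  // Sort by count, descending, and keep the top 3
  val top3: List[(String, Int)] = counts.toList.sortWith(_._2 > _._2).take(3)

  def main(args: Array[String]): Unit =
    println(top3) // List((Hello,10), (Scala,9), (Spark,7))
}
```

With large weights this avoids materializing long intermediate strings, at the cost of a slightly less literal mapping to the five steps shown earlier.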
