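Basic WordCount example
Word count: count how many times each word occurs in a collection of strings, then take the three words with the highest counts.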
// Define the input data
val stringList: List[String] = List("hello spark", "scala yes", "hello java", "java yes", "hello scala", "scala yes", "java yes", "hello java")
// 1. Split each string on spaces and flatten the result into a single list of words
// Long form:
// val wordList1: List[Array[String]] = stringList.map(_.split(" ")) // split
// val wordList2: List[String] = wordList1.flatten // flatten
// Short form: flatMap does both in one step
val wordList = stringList.flatMap(_.split(" "))
println(wordList)
// 2. Group identical words together
val gp: Map[String, List[String]] = wordList.groupBy(w => w)
println(gp)
// 3. The length of each group is the number of occurrences of that word
val count: Map[String, Int] = gp.map(kv => (kv._1, kv._2.length))
// 4. Convert the Map to a List, sort by count in descending order, and take the first three elements
val sortList: List[(String, Int)] = count.toList
  .sortWith(_._2 > _._2)
  .take(3) // keep the top three
// Print the result
println(sortList)
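The four steps above can also be packaged as one function. The following is a minimal sketch, not part of the original example: the names `WordCountSketch` and `topWords` are made up for illustration, and the secondary sort on the word itself is an added assumption. In the sample data, hello, yes and java each occur four times, and sorting on the count alone leaves their relative order up to the Map's iteration order. Breaking ties alphabetically makes the output reproducible.

```scala
object WordCountSketch {
  // Hypothetical helper: returns the n most frequent words with their counts.
  def topWords(lines: List[String], n: Int): List[(String, Int)] =
    lines
      .flatMap(_.split(" "))   // split each line and flatten into one word list
      .groupBy(identity)       // word -> list of all its occurrences
      .map { case (word, occurrences) => (word, occurrences.length) }
      .toList
      .sortBy { case (word, count) => (-count, word) } // count descending, then word ascending
      .take(n)

  def main(args: Array[String]): Unit = {
    val stringList = List("hello spark", "scala yes", "hello java", "java yes",
      "hello scala", "scala yes", "java yes", "hello java")
    println(topWords(stringList, 3)) // List((hello,4), (java,4), (yes,4))
  }
}
```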
Complex WordCount example
Method 1:
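// Define the input data: each tuple is (text, number of times the text occurs)
val stringList: List[(String, Int)] = List(("hello spark", 1), ("scala java", 2), ("hello scala", 1), ("java spark", 2))
// Repeat each text according to its count, e.g. ("scala java", 2) becomes "scala java scala java "
val stringList1: List[String] = stringList.map(kv => {
  (kv._1.trim + " ") * kv._2 // trim first so that exactly one separating space is appended
})
println(stringList1)
// 1. Split each string on spaces and flatten the result into a single list of words
val wordList = stringList1.flatMap(_.split(" "))
println(wordList)
// 2. Group identical words together
val gp: Map[String, List[String]] = wordList.groupBy(w => w)
println(gp)
// 3. The length of each group is the number of occurrences of that word
val count: Map[String, Int] = gp.map(kv => (kv._1, kv._2.length))
// 4. Convert the Map to a List, sort by count in descending order, and take the first three elements
val sortList: List[(String, Int)] = count.toList
  .sortWith(_._2 > _._2)
  .take(3) // keep the top three
// Print the result
println(sortList)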
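Method 1 relies on `trim` plus one appended space to keep the repeated string splittable. A text with several consecutive spaces inside it would still yield empty strings after `split(" ")`. The sketch below shows one possible way to harden that step. `TokenizeSketch`, `tokenize` and `expand` are hypothetical names, and splitting on the regex `\s+` is an assumption about what should count as a separator. It also repeats the list of words rather than the string itself, which avoids building one long intermediate string.

```scala
object TokenizeSketch {
  // Split on any run of whitespace and drop empty tokens.
  def tokenize(text: String): List[String] =
    text.trim.split("\\s+").filter(_.nonEmpty).toList

  // Repeat the words of `text` `times` times without concatenating strings.
  def expand(text: String, times: Int): List[String] =
    List.fill(times)(tokenize(text)).flatten

  def main(args: Array[String]): Unit = {
    val data = List(("hello spark", 1), ("scala java", 2), ("hello scala", 1), ("java spark", 2))
    // Plays the role of wordList in Method 1.
    val wordList = data.flatMap { case (text, times) => expand(text, times) }
    println(wordList)
  }
}
```

The resulting list can be fed into the same groupBy, count and sort steps as before.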
Method 2:
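// Each tuple is (text, number of times the text occurs)
val tuples = List(("Hello Scala Spark World", 4), ("Hello Scala Spark", 3), ("Hello Scala", 2), ("Hello", 1))
// 1. Split each text into words and pair every word with the count of its text
val wordToCountList: List[(String, Int)] = tuples.flatMap { t =>
  val strings: Array[String] = t._1.split(" ")
  strings.map(word => (word, t._2))
}
// 2. Group the (word, count) pairs by word
val wordToTupleMap: Map[String, List[(String, Int)]] =
  wordToCountList.groupBy(t => t._1)
// 3. Keep only the counts in each group. Two equivalent ways:
// Option A: mapValues. In Scala 2.13 mapValues on a Map is deprecated and returns a lazy MapView,
// so .toMap is needed to get a Map back (it is harmless on 2.12).
val stringToInts: Map[String, List[Int]] =
  wordToTupleMap.mapValues { datas => datas.map(t => t._2) }.toMap
// Option B: map over the (key, value) pairs
val wordToCountMap: Map[String, List[Int]] =
  wordToTupleMap.map { t => (t._1, t._2.map(t1 => t1._2)) }
// 4. Sum the counts for each word
val wordToTotalCountMap: Map[String, Int] =
  wordToCountMap.map(t => (t._1, t._2.sum))
// Contains Hello -> 10, Scala -> 9, Spark -> 7, World -> 4 (entry order may vary)
println(wordToTotalCountMap)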
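Method 2 builds several intermediate collections before summing. As a sketch of an alternative, the same totals can be accumulated in a single pass with `foldLeft`. `WeightedWordCountSketch` and `weightedCounts` are illustrative names, not part of the original example.

```scala
object WeightedWordCountSketch {
  // Hypothetical helper: adds each text's count to the running total of every word it contains.
  def weightedCounts(pairs: List[(String, Int)]): Map[String, Int] =
    pairs.foldLeft(Map.empty[String, Int]) { case (acc, (text, weight)) =>
      text.split(" ").foldLeft(acc) { (totals, word) =>
        totals.updated(word, totals.getOrElse(word, 0) + weight)
      }
    }

  def main(args: Array[String]): Unit = {
    val tuples = List(("Hello Scala Spark World", 4), ("Hello Scala Spark", 3), ("Hello Scala", 2), ("Hello", 1))
    // Contains Hello -> 10, Scala -> 9, Spark -> 7, World -> 4
    println(weightedCounts(tuples))
  }
}
```

On Scala 2.13 or later, `wordToCountList.groupMapReduce(_._1)(_._2)(_ + _)` should give the same totals in one call.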