Scala: The Basics You Need to Implement wordCount (Not as Simple as It Looks)

唐樽

Article link: https://blog.csdn.net/weixin_44775255/article/details/121375643


1. Reading the file

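// 1. Import the package
import scala.io.Source
// 2. Read the file line by line
val lines = Source.fromFile("/usr/hadoop/badou/The_Man_of_Property.txt").getLines

// The REPL prints something like:
// lines: Iterator[String] = non-empty iterator
// i.e. the result is a non-empty iterator, not yet a collection held in memory
// (Source.fromFile(...) on its own is a scala.io.BufferedSource).

// it.next(): returns the next element of the iterator
// it.hasNext: tells whether the iterator has any elements left

// 3. toList: copy the elements of the iterator into a List and return it
val lines = Source.fromFile("/usr/hadoop/badou/The_Man_of_Property.txt").getLines.toList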
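The calls above never close the underlying file. A minimal sketch of reading every line and then closing the source, assuming Scala 2.13+ (needed for `scala.util.Using`); the helper name `readLines` and the temporary file standing in for the novel are illustrative, not part of the original walkthrough:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import scala.io.Source
import scala.util.Using

// Hypothetical helper: read all lines into a List, then close the file.
// toList runs inside the Using block, so the iterator is fully consumed
// before the source is closed.
def readLines(path: String): List[String] =
  Using.resource(Source.fromFile(path, "UTF-8"))(_.getLines().toList)

// Stand-in for The_Man_of_Property.txt so the sketch runs anywhere.
val tmp = Files.createTempFile("wordcount-demo", ".txt")
Files.write(tmp, "first line\nsecond line\n".getBytes(StandardCharsets.UTF_8))

val demoLines = readLines(tmp.toString)
println(demoLines.length)
```

For a one-off REPL session the unclosed handle is harmless; closing matters once the same code runs inside a long-lived program.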


1.1 Checking that the line count matches

lines.length
// res4: Int = 2866
// In the shell, for comparison:
wc -l The_Man_of_Property.txt


2. Sequences and Range types

val a = Range(0,5) // [0,5)   step 1
val b = 0 until 5  // [0,5)   step 1

val c = 1 to 5     // [1,5]   step 1
val d = 1.to(5)    // [1,5]   step 1

How do you convert a Range to a List?
a.toList
val list1 = (1 to 10).toList
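As a quick check of the bounds described above, here is a small sketch; the `by` step is an addition not covered in the text, and the value names are illustrative:

```scala
// Half-open range: `until` excludes the upper bound.
val halfOpen = (0 until 5).toList        // 0, 1, 2, 3, 4

// Closed range: `to` includes the upper bound.
val closed = (1 to 5).toList             // 1, 2, 3, 4, 5

// A step other than 1 is set with `by`.
val evens = (0 to 10 by 2).toList        // 0, 2, 4, 6, 8, 10

// Range(start, end) describes the same range as `start until end`.
val sameAsUntil = Range(0, 5) == (0 until 5)
```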


2.1 Understanding map and Vector

val a = Range(0,5)
// ...immutable.Range = Range(0, 1, 2, 3, 4)

a.map(x=>x*2) // each element of a is bound to x and then multiplied by 2
// Vector(0, 2, 4, 6, 8)
// Vector: an immutable indexed sequence, i.e. a collection that holds the data

// 1. Create a Vector
val v1 = Vector(1,2,3)
// Access an element; indices start at 0
println(v1(0))
// 2. Iterate over a Vector
for (i <- v1) print(s"$i ")
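The result type of `map` depends on the collection it is called on, which is why a Range turns into a Vector above. A short sketch with illustrative names (the `updated` call is an addition, shown only to make the immutability visible):

```scala
// Mapping over a Range yields an indexed sequence, printed as Vector(...).
val fromRange = Range(0, 5).map(_ * 2)

// Mapping over a List yields a List.
val fromList = List(0, 1, 2, 3, 4).map(_ * 2)

// Vectors are immutable: `updated` returns a new Vector and leaves the old one alone.
val vec = Vector(1, 2, 3)
val changed = vec.updated(0, 10)
```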


2.2 Understanding _ , the wildcard

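(1) _ stands for each element of the collection
a.map(_*2) <==> a.map(x=>x*2)
(2) Accessing the elements of a tuple
val s = ("hello","badou")
s._1
s._2

(3) Importing everything in a package
import scala.collection.immutable.xxx  // imports one specific member (xxx is a placeholder name)
import scala.collection.immutable._    // imports every member of immutable

(4) Initializing a variable to its default value (works for fields and in the REPL, not for local variables)
val a=1   // a val cannot be reassigned, whereas a var can
var name:String=_   // name: String = null
var score:Int=_     // score: Int = 0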
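The first two uses can be sketched in a few lines. The last line, a wildcard inside a pattern, is an extra use that the list above does not mention; all names are illustrative:

```scala
val nums = List(1, 2, 3)

// One underscore: placeholder for the single argument of the function.
val doubled = nums.map(_ * 2)

// Two underscores: first and second argument, in that order.
val total = nums.reduce(_ + _)

// Tuple accessors are ordinary members named _1, _2, ...
val pair = ("hello", "badou")
val swapped = (pair._2, pair._1)

// Wildcard in a pattern: bind the first element, ignore the second.
val (greeting, _) = pair
```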


2.3 Understanding split: what is the delimiter?

val s = "The Man of Property"
s.split(" ") // the result is stored as an Array[String]
// res7: Array[String] = Array(The, Man, of, Property)

// Combining map and split
lines.map(x=>x.split(" "))
lines.map(_.split(" "))
// res8: List[Array[String]]

The result is a List(Array(...), Array(...), …).
We do not want the nested Arrays, so we flatten them with the flatten function.

val s1 = List(Range(0,5), Range(0,5), Range(0,5))

val s2 = s1.flatten // flatten the nested Ranges into a single List
// s2: List[Int] = List(0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4)

// Apply an operation to every element
s2.map(x=>x*2) // same as s2.map(_*2)

// Work on s1 directly
s1.map(x=>x.map(x=>x*2))
s1.map(_.map(_*2)) // the outer map visits each Range; the inner map doubles each of its elements
// List(Vector(0, 2, 4, 6, 8), Vector(0, 2, 4, 6, 8), Vector(0, 2, 4, 6, 8))

// Flatten the Vectors above with flatten or flatMap
s1.map(_.map(_*2)).flatten
s1.flatMap(_.map(_*2))

// Applied to lines:
// map + flatten  <==> flatMap
lines.map(x=>x.split(" ")).flatten
lines.flatMap(_.split(" "))
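A small in-memory sketch of the equivalence, using made-up sample lines instead of the novel. It also shows a side effect worth knowing: splitting on a single space produces empty strings wherever two spaces are adjacent, and splitting on the regex `\\s+` is one way around that (a suggestion, not something the original code does):

```scala
// Note the double space between "man" and "of".
val sampleLines = List("the man  of property", "the man")

// map + flatten and flatMap give the same flat list of tokens.
val viaFlatten = sampleLines.map(_.split(" ")).flatten
val viaFlatMap = sampleLines.flatMap(_.split(" "))

// Splitting on one space leaves an empty token for the doubled space.
val bySingleSpace = "the man  of property".split(" ").toList

// Splitting on runs of whitespace does not.
val byWhitespace = "the man  of property".split("\\s+").toList
```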


3. The Map step of MapReduce

val r1 = lines.flatMap(x=>x.split(" "))
r1.map((_,1)) // each element of r1 takes the place of _, producing a (word, 1) tuple
// The Map step, written three equivalent ways
lines.flatMap(x=>x.split(" ")).map(x=>(x,1))
lines.flatMap(_.split(" ")).map(x=>(x,1))
lines.flatMap(_.split(" ")).map((_,1))

Use the word as the key and group the tuples by that key with groupBy.

lines.flatMap(_.split(" ")).map((_,1)).groupBy(_._1)

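// Map(forgotten -> List((forgotten,1), (forgotten,1), (forgotten,1)))
groupBy takes the first element of each tuple, e.g. forgotten from (forgotten,1), as the key,
and collects every whole tuple with that key into a List, which becomes the value.
Each entry of the resulting Map therefore has:
_1: forgotten
_2: List((forgotten,1), (forgotten,1), (forgotten,1))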

The word frequency is the number of times a word occurs, which is the size of that List: here, the number of occurrences of forgotten.

lines.flatMap(_.split(" ")).map((_,1)).groupBy(_._1).map(x=>(x._1, x._2.length))
is equivalent to:
lines.flatMap(_.split(" ")).map((_,1)).groupBy(_._1).map(x=>(x._1, x._2.size))
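To make the shape of the grouped data concrete, here is a sketch on a handful of made-up words; the names `sampleWords`, `grouped` and `counts` are illustrative:

```scala
val sampleWords = List("forgotten", "man", "forgotten", "forgotten", "man")

// Map step followed by grouping on the word (the first tuple element).
val grouped = sampleWords.map((_, 1)).groupBy(_._1)

// Each value is the list of tuples for that word, so its size is the count.
val counts = grouped.map { case (word, pairs) => (word, pairs.size) }
```

Writing the function as `{ case (word, pairs) => ... }` does the same job as `x => (x._1, x._2.size)` but names the two parts of the tuple.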


3.1 Ways to sum the elements of a collection

val a1 = List((1,2), (3,4),(5,6))
a1.map(_._2).sum
a1.map(_._2).reduce(_+_)


val l = lines.flatMap(_.split(" ")).map((_,1)).groupBy(_._1)

l.map(x=>(x._1,x._2.map(_._2).sum))
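is equivalent to
l.map(x=>(x._1,x._2.map(_._2).reduce(_+_)))
is equivalent to
l.mapValues(_.size) // applies the function to each value of the Map; here it counts the tuples
// (mapValues on a Map is deprecated since Scala 2.13; there, use l.view.mapValues(_.size).toMap)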
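On Scala 2.13 or later, the same counts can be obtained without the deprecated `mapValues`. A sketch under that version assumption, on made-up data; `groupMapReduce` is a standard-library alternative that the original does not use:

```scala
val letters = List("a", "b", "a", "c", "a", "b")

// Scala 2.13+: view.mapValues is lazy, toMap builds the final Map.
val viaMapValues = letters.groupBy(identity).view.mapValues(_.size).toMap

// Scala 2.13+: group, map each element to 1, and sum the ones in a single pass.
val viaGroupMapReduce = letters.groupMapReduce(identity)(_ => 1)(_ + _)
```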

The result is a word count: a Map from each word to the number of times it occurs.

3.2 Sorting the wordCount result and taking the top 10

scala> val a1=List((3,2),(1,0),(4,4),(5,1))
a1: List[(Int, Int)] = List((3,2), (1,0), (4,4), (5,1))

// _._2 sorts by the second element of each tuple
// sortBy() sorts in ascending order by default
scala> a1.sortBy(_._2) 
res1: List[(Int, Int)] = List((1,0), (5,1), (3,2), (4,4))

// reverse flips the order of the result
scala> a1.sortBy(_._2).reverse
res2: List[(Int, Int)] = List((4,4), (3,2), (5,1), (1,0))

// sortWith(second element of one tuple > second element of the other): descending order
scala> a1.sortWith(_._2 > _._2)
res3: List[(Int, Int)] = List((4,4), (3,2), (5,1), (1,0))

// slice(0,2) takes the elements at indices [0,2)
scala> a1.sortWith(_._2 > _._2).slice(0,2)
res4: List[(Int, Int)] = List((4,4), (3,2))

Note: sorting and slicing are performed on a List, so a Map has to be converted first.

import scala.io.Source
val lines = Source.fromFile("/usr/hadoop/badou/The_Man_of_Property.txt").getLines.toList
val l = lines.flatMap(_.split(" ")).map((_,1)).groupBy(_._1).map((x=>(x._1,x._2.size)))
// l: scala.collection.immutable.Map[String,Int]

// We want to sort the Map[String,Int] held in l by its Int values and then take a slice:
// 1. Convert to a List, then sort and slice
l.toList.sortBy(_._2)
l.toList.sortBy(_._2).reverse
// Option 1:
l.toList.sortBy(_._2).reverse.slice(0,10)
// Option 2:
l.toList.sortWith(_._2 > _._2).slice(0,10)
// Option 3:
lines.flatMap(_.split(" ")).map((_,1)).groupBy(_._1).mapValues(_.size).toArray.sortWith(_._2 > _._2).slice(0,10)
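The three options can be wrapped in one small function. This is a sketch with a hypothetical helper `topN`; sorting by `(-count, word)` is a choice made here so that words with equal counts come out in a predictable order, since a Map does not guarantee any iteration order:

```scala
// Hypothetical helper: the n most frequent words, ties broken alphabetically.
def topN(counts: Map[String, Int], n: Int): List[(String, Int)] =
  counts.toList
    .sortBy { case (word, count) => (-count, word) } // descending count, then word
    .take(n)                                         // same as slice(0, n)

val sampleCounts = Map("the" -> 5, "of" -> 3, "man" -> 3, "property" -> 1)
val top3 = topN(sampleCounts, 3)
```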


4. Filtering out special characters (with a regex)

In the results above, the keys produced by split and map still contain special characters such as punctuation, so we need to filter those out.

// Define the regex
val p = "[0-9]+".r // matches runs of digits only
val s = "a12!@#$a^&*vbdd12309++9"
p.findAllIn(s).toArray
// res13: Array[String] = Array(12, 12309, 9)
p.findAllIn(s).foreach(x=>println(x))

mkString() joins the matches into a single String.

p.findAllIn(s).mkString(" ")
// res18: String = 12 12309 9

p.findAllIn(s).mkString("[","  ","]")
// res20: String = [12  12309  9]
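The same `findAllIn` plus `mkString` combination can strip punctuation from a single token. A sketch with made-up tokens; the helper name `clean` is illustrative:

```scala
// Keep only ASCII letters and digits.
val wordPattern = "[0-9a-zA-Z]+".r

// Hypothetical helper: join all alphanumeric runs of a token, dropping everything else.
def clean(token: String): String = wordPattern.findAllIn(token).mkString("")

val cleaned = List("\"Hello,", "world!", "don't", "--").map(clean)
```

A token with no letters or digits at all, such as `--`, becomes the empty string, so an empty key can show up in the counts unless it is filtered out.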

Next, we apply this to lines.

val p = "[0-9a-zA-Z]+".r

// 1. Split, then map(x => (filtered word, 1))
val ll = lines.flatMap(_.split(" ")).map(x=>(p.findAllIn(x).mkString(""),1))

// 2. Group and count the occurrences
ll.groupBy(_._1).mapValues(_.size)

// 3. Convert the type, sort, and slice
ll.groupBy(_._1).mapValues(_.size).toList.sortWith(_._2 > _._2).slice(0,10)
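Putting the steps together, the whole pipeline fits in one function. This is a sketch rather than the article's exact code: the function name `wordCount`, the lower-casing, the removal of empty keys, the `\\s+` split, and the alphabetical tie-break are all assumptions added here, and the input is a pair of made-up lines:

```scala
val alnum = "[0-9a-zA-Z]+".r

// Hypothetical end-to-end helper: the n most frequent cleaned words in the given lines.
def wordCount(lines: List[String], n: Int): List[(String, Int)] =
  lines
    .flatMap(_.split("\\s+"))                                   // split on runs of whitespace
    .map(token => alnum.findAllIn(token).mkString("").toLowerCase) // strip punctuation, fold case
    .filter(_.nonEmpty)                                         // drop tokens that had no letters or digits
    .groupBy(identity)                                          // word -> all its occurrences
    .map { case (word, occurrences) => (word, occurrences.size) }
    .toList
    .sortBy { case (word, count) => (-count, word) }            // descending count, then word
    .take(n)

val sample = List("The man, the Man!", "-- of property; the END")
val top = wordCount(sample, 2)
```

With the real file, `lines` from section 1 can be passed in place of `sample`.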