RDD Programming Basics, Study Notes 1: An Elegant Spark wordCount

单林敏

Article link: https://blog.csdn.net/neve_give_up_dan/article/details/104104462

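The default HDFS home directory is `/user/<username>`.

(You may need to create it in HDFS yourself first.)

So when working with files in HDFS, a relative path such as 1.txt is equivalent to /user/<username>/1.txt; for the user hadoop, that is hdfs://localhost:9000/user/hadoop/1.txt.

The struck-through sentence below came from the mistake behind the question that follows: I had typed /usr instead of /user.
~~Running hadoop fs -cat 1.txt directly in the master's terminal does not work; it has to be done inside Spark, perhaps because the local terminal lacks support.~~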

Question

Teacher Ziyu says the home directory is /user/<username>, so why does textFile seem to work for both /usr and /user?

# 1.txt is under hdfs://localhost:9000/usr/root
scala> val lines = sc.textFile("1.txt")
lines: org.apache.spark.rdd.RDD[String] = 1.txt MapPartitionsRDD[1] at textFile at <console>:24

# output2 is under hdfs://localhost:9000/user/root
scala> val lines = sc.textFile("output2")
lines: org.apache.spark.rdd.RDD[String] = output2 MapPartitionsRDD[3] at textFile at <console>:24

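Answer

It turned out that loading a file is lazy, just like a transformation, so the path is only checked when an action runs.
Sure enough, the first load failed once an action was called. After uploading the file to the default destination and reading it back with an explicit path, I confirmed that relative paths resolve to the current user's home directory, /user/<username>.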

# Terminal 1
scala> lines.foreach(elem=>println(elem))
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://master:9000/user/root/1.txt

# Terminal 2
[root@master ~]# hadoop fs -put 1.txt 
19/12/04 17:24:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[root@master ~]# hadoop fs -cat /user/root/1.txt 
19/12/04 17:25:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hadoop	hello
bianhao	shan
nihao
hello	shan
hello	bianhao
	nihao
lizhao	hello

# Terminal 1
scala> val lines = sc.textFile("1.txt")
lines: org.apache.spark.rdd.RDD[String] = 1.txt MapPartitionsRDD[10] at textFile at <console>:24

scala> lines.foreach(elem=>println(elem))
hadoop	hello
bianhao	shan
nihao
hello	shan
hello	bianhao
	nihao
lizhao	hello
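The lazy path check can be reproduced without HDFS. The following is a minimal sketch meant to be pasted into spark-shell (`:paste` mode); the `file:///tmp/...` path is a hypothetical one that is assumed not to exist, and `SparkContext.getOrCreate` simply reuses the shell's existing context if there is one.

```scala
import scala.util.Try
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("lazy-load-sketch"))

// Hypothetical path, assumed not to exist. Defining the RDD does not touch the file system.
val missing = sc.textFile("file:///tmp/no-such-dir-for-this-sketch/1.txt")

// Only the action forces Spark to list the input path, so the error surfaces here.
val result = Try(missing.count())
println(result.isFailure)
```

Wrapping the action in `Try` is just a way to observe the failure without a stack trace filling the console; the exception it captures should be the same kind of "Input path does not exist" error shown in the transcript above.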

Basic transformations

Overview of the operations

scala> val array = Array(1,2,3,4,5)
array: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val rdd = sc.parallelize(array)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:26

scala> val lines = sc.textFile("1.txt")
lines: org.apache.spark.rdd.RDD[String] = 1.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> val linesWith_hello = lines.filter(line => line.contains("hello"))
linesWith_hello: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at filter at <console>:26

scala> val rdd2 = rdd.map(x => x+10)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[8] at map at <console>:28

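# Likewise, the lines of a file can be split with map(line => line.split(" "))
# flatMap applies the function like map, then flattens the resulting arrays into a single collection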
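The difference between map and flatMap is easiest to see by counting elements. A small sketch for spark-shell, using two in-memory lines (made up for illustration) instead of a file:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("flatmap-sketch"))

// Two in-memory lines stand in for a text file.
val lines = sc.parallelize(Seq("hello spark", "hello hadoop hdfs"))

// map: exactly one output element per line, here an Array[String].
val perLine = lines.map(line => line.split(" "))
// flatMap: the per-line arrays are flattened into one RDD[String].
val words = lines.flatMap(line => line.split(" "))

println(perLine.count())  // 2 (one array per line)
println(words.count())    // 5 (one element per word)
```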

reduceByKey

How reduce iterates
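As a rough sketch of the two operations (sample data made up for illustration): reduceByKey is a transformation that merges the values of each key pairwise, while reduce is an action that folds the whole RDD down to a single value. In both cases the function should be associative and commutative, because Spark combines partial results from different partitions in no fixed order.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("reduce-sketch"))

// reduceByKey: values that share a key are merged two at a time with (a, b) => a + b.
val pairs = sc.parallelize(Seq(("hello", 1), ("spark", 1), ("hello", 1)))
val counts = pairs.reduceByKey((a, b) => a + b).collect().toMap
println(counts("hello"))  // 2
println(counts("spark"))  // 1

// reduce: the same pairwise merging, applied to every element of the RDD.
// For 1..5 this amounts to ((((1 + 2) + 3) + 4) + 5), in some grouping.
val total = sc.parallelize(Seq(1, 2, 3, 4, 5)).reduce((a, b) => a + b)
println(total)  // 15
```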

Basic actions

Action
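A few commonly used actions, sketched for spark-shell on the same five-element array used earlier. Each call below triggers a computation and returns a result to the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("actions-sketch"))

val rdd = sc.parallelize(Array(1, 2, 3, 4, 5))

println(rdd.count())                    // 5: number of elements
println(rdd.first())                    // 1: first element
println(rdd.take(3).mkString(","))      // 1,2,3: first three elements as an Array
println(rdd.collect().mkString(","))    // 1,2,3,4,5: every element as an Array on the driver

// foreach runs on the executors; the print order across partitions is not guaranteed.
rdd.foreach(elem => println(elem))
```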

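Persistence

Each action launches a job, and without persistence every job recomputes the RDD from its lineage.
RDD.persist(MEMORY_AND_DISK) keeps partitions in memory and spills those that do not fit to disk.
RDD.persist(MEMORY_ONLY) is equivalent to RDD.cache(). When memory runs short, older partitions are evicted (which ones depends on the implementation) and recomputed when they are needed again; they are not written to disk.
persist is lazy: it only marks the RDD, and the data is actually persisted when the first action runs, just as transformations do nothing until an action.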
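A minimal sketch of the persist lifecycle for spark-shell; the three sample words are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Reuses the shell's context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("persist-sketch"))

val words = sc.parallelize(Seq("Hadoop", "Spark", "Hive"))

// Only marks the RDD for persistence; nothing is computed or stored yet.
words.persist(StorageLevel.MEMORY_AND_DISK)

println(words.count())                  // 3: first action computes the RDD and caches it
println(words.collect().mkString(","))  // Hadoop,Spark,Hive: second action reads the cached copy

// Release the cached partitions once they are no longer needed.
words.unpersist()
```

Calling `words.cache()` instead of `persist(...)` would use the MEMORY_ONLY level.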

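Partitioning

Purpose

Parallel computation

Reducing network communication overhead

UserData is a table of user information.
Events is a table of the visits made during the past 5 minutes.
We want to know what each user visited in the past 5 minutes, which topics, and so on.
To answer that, we have to join part of UserData with part of Events.

Without partitioning

If the data is not partitioned, the records are scattered across machines (HDFS), so the join is essentially an O(n^2) enumeration. In the original figure, the machines on the left and on the right each send their records with keys 1-10 to an intermediate variable j1, their records with keys 11-20 to an intermediate variable j2, and so on.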

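With partitioning

This amounts to preprocessing: the UserData table on the left is partitioned up front so that keys 1-10 live on machine u1, keys 11-20 on machine u2, and so on. Every later join is then an O(n) operation: the records of e1 with keys 1-10 just need to be sent to u1.

Counting the preprocessing, the total work is still comparable to O(n^2). The difference is that no intermediate variables are needed, or rather, the intermediate results are themselves stored in a sensible partitioned layout.

Because the preprocessing fixes where the data lives, the many operations that follow no longer need much network communication.

That is how partitioning reduces network overhead.

Note: if few operations follow, the benefit is probably small.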
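The idea can be sketched in spark-shell as follows. The sample records, the choice of `HashPartitioner(4)`, and the variable names are assumptions made for illustration; the notes above only describe the approach.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("partitioned-join-sketch"))

// Hypothetical (userId, userInfo) table: partitioned once by key and kept in memory,
// so that later joins do not have to move it across the network again.
val userData = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
  .partitionBy(new HashPartitioner(4))
  .persist()

// Hypothetical (userId, visitedPage) events from the last 5 minutes.
val events = sc.parallelize(Seq((1, "/spark"), (3, "/hdfs"), (1, "/rdd")), 2)

// The join can reuse userData's partitioner, so only the small events RDD is shuffled.
val joined = userData.join(events)
println(joined.count())  // 3: two events for user 1, one for user 3
println(joined.partitioner == userData.partitioner)
```

Without the `persist()` call, the partitioned layout would be rebuilt for every job, which would throw away the benefit of partitioning in advance.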

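Principle

As a rule of thumb, make the number of partitions equal to the number of CPU cores.
With too many partitions, some tasks have to wait for a free core.
With too few partitions, the CPUs are not fully used.

Default number of partitions in each deployment mode

The default can be changed through spark.default.parallelism.
local: defaults to the number of local cores; local[N] sets it explicitly.
Mesos: defaults to 8.
Standalone and YARN: max(total number of cores in the cluster, 2).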

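Setting the number of partitions

Syntax

sc.textFile(path, partitionNum)
Here partitionNum is passed as the minPartitions argument, so Spark may create more partitions than requested.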
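A short sketch for checking partition counts in spark-shell. The temporary file is created only so that the example does not depend on anything in HDFS; its name and contents are made up.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("partition-count-sketch"))

// Default used when no partition count is given; the value depends on the deployment mode.
println(sc.defaultParallelism)

// parallelize takes the exact number of partitions as its second argument.
val rdd = sc.parallelize(Array(1, 2, 3, 4, 5), 2)
println(rdd.partitions.length)  // 2

// textFile takes a minimum: write a small local file, then ask for at least 3 partitions.
val tmp = Files.createTempFile("partition-sketch", ".txt")
Files.write(tmp, "a\nb\nc\nd\n".getBytes("UTF-8"))
val lines = sc.textFile(tmp.toUri.toString, 3)
println(lines.partitions.length)
```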

Repartitioning
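The notes give no example here, so the following is a sketch using the two standard RDD methods for changing the partition count of an existing RDD: repartition, which always shuffles, and coalesce, which can shrink the count without a shuffle. The sizes are arbitrary.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("repartition-sketch"))

val rdd = sc.parallelize(1 to 100, 8)

// coalesce merges existing partitions and avoids a shuffle when reducing the count.
val fewer = rdd.coalesce(2)
// repartition redistributes all records with a full shuffle; it can grow or shrink the count.
val more = rdd.repartition(16)

println(rdd.partitions.length)    // 8
println(fewer.partitions.length)  // 2
println(more.partitions.length)   // 16
```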

Custom partitioner

Extend org.apache.spark.Partitioner and override its methods.

[Figure: source listing of the custom partitioner example; image not preserved]

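The listing above is a singleton object. It works as follows:
1 to 10 is a Range collection that starts out in 5 partitions, so that we can verify our custom partitioner class.
data.map((_, 1)) is needed because partitionBy only works on key-value pairs, so each element must first be turned into a pair (see the figure below).
map(_._1) then extracts the key, dropping the 1 that was added earlier; the pair existed only to satisfy partitionBy.
In the same way, map(_._2) would extract the value.
[Figure: illustration of data.map((_, 1)); image not preserved]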
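Since the listing itself is not preserved, here is a minimal sketch that matches the description: a Range of 1 to 10 in 5 partitions, turned into pairs, repartitioned with a custom partitioner, then mapped back to the keys. The names MyPartitioner and TestPartitioner and the rule "key modulo number of partitions" are assumptions for illustration, not necessarily what the original code used. Unlike the spark-shell snippets, this one is written as a standalone application.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical partitioner: sends a non-negative integer key to partition (key % numParts).
class MyPartitioner(numParts: Int) extends Partitioner {
  // Total number of partitions this partitioner produces.
  override def numPartitions: Int = numParts
  // Partition index for a key; assumes the key prints as a non-negative integer.
  override def getPartition(key: Any): Int = key.toString.toInt % numParts
}

object TestPartitioner {
  def main(args: Array[String]): Unit = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[2]").setAppName("custom-partitioner-sketch"))

    // Start with 5 partitions so the effect of the custom partitioner is visible.
    val data = sc.parallelize(1 to 10, 5)

    val repartitioned = data
      .map((_, 1))                        // partitionBy needs key-value pairs
      .partitionBy(new MyPartitioner(10)) // 10 partitions, chosen by key % 10
      .map(_._1)                          // drop the dummy value, keep the key

    // glom() turns each partition into an array so the layout can be inspected.
    repartitioned.glom().collect().zipWithIndex.foreach { case (keys, idx) =>
      println(s"partition $idx: ${keys.mkString(",")}")
    }
  }
}
```

With this rule the key 10 lands in partition 0 and each key from 1 to 9 lands in the partition with its own number.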

Comprehensive example: an elegant Spark wordCount

# Remember to upload your own 1.txt to HDFS first (the answer to the question near the top of this post shows the file's contents and how to upload it)
val lines = sc.textFile("1.txt")

################## Version 1: split on spaces and tabs  #######################
scala> val wordCount = lines.flatMap(line => line.split("\\s+"))
wordCount: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[25] at flatMap at <console>:26

scala> val ans = wordCount.map(word => (word,1)).reduceByKey((a,b) => a+b).collect()
ans: Array[(String, Int)] = Array((hadoop,1), (shan,2), ("",1), (lizhao,1), (hello,4), (bianhao,2), (nihao,2))

scala> ans.foreach(println)
(hadoop,1)
(shan,2)
(,1)
(lizhao,1)
(hello,4)
(bianhao,2)
(nihao,2)

###################  Version 2: split on non-word characters  ####################
scala> val wordCount = lines.flatMap(line => line.split("\\W+"))
wordCount: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[28] at flatMap at <console>:26

scala> val ans = wordCount.map(word => (word,1)).reduceByKey((a,b) => a+b).collect()
ans: Array[(String, Int)] = Array((hadoop,1), (shan,2), ("",1), (lizhao,1), (hello,4), (bianhao,2), (nihao,2))

scala> ans.foreach(println)
(hadoop,1)
(shan,2)
(,1)
(lizhao,1)
(hello,4)
(bianhao,2)
(nihao,2)


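####################  Final version (split on whitespace + drop the empty key) ########################
# The line that starts with a tab yields a leading empty string after split, hence the filter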
scala> val wordCount = lines.flatMap(line => line.split("\\s+")).filter(_ != "")
wordCount: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[32] at filter at <console>:26

scala> val ans = wordCount.map(word => (word,1)).reduceByKey((a,b) => a+b).collect()
ans: Array[(String, Int)] = Array((hadoop,1), (shan,2), (lizhao,1), (hello,4), (bianhao,2), (nihao,2))

scala> ans.foreach(println)
(hadoop,1)
(shan,2)
(lizhao,1)
(hello,4)
(bianhao,2)
(nihao,2)