5. Detailed Guide to subtract & intersection & cartesian on RDDs
1. Review of the Previous Lesson
若泽数据 (RuoZe Data) Bilibili video: Spark06 - Basic Operations on Spark RDDs (Part 1)
- https://blog.csdn.net/zhikanjiani/article/details/97833470
Note:
- When writing code, check whether there is an action; without an action, the job will not execute even if it contains 100 transformations.
The overall execution flow when programming a Spark application:
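The laziness behind that note can be mimicked with a plain Scala view, where `map` only sets up work and nothing runs until the collection is forced (an analogy on local collections, not actual Spark code):

```scala
var mapCalls = 0

// "Transformation": building the lazy view executes nothing yet.
val pipeline = (1 to 3).view.map { x => mapCalls += 1; x * 2 }
val callsBeforeAction = mapCalls   // still 0: no work has happened

// "Action": forcing the view to a List finally runs the map.
val result = pipeline.toList       // List(2, 4, 6); mapCalls is now 3
```

Spark RDDs behave the same way at cluster scale: `map`, `flatMap`, `join` and the like only build a lineage graph, and only an action such as `collect` or `reduce` submits a job.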
2. Revisiting Common RDD Operators
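The transcript below uses `nums` without showing how it was created; from the outputs it clearly holds the integers 1 through 9 (presumably something like `sc.parallelize(1 to 9)`, though that line is not in the source). A local-collection stand-in reproduces the same results:

```scala
// Local stand-in for the RDD used in the transcript (assumed definition:
// in spark-shell it would be val nums = sc.parallelize(1 to 9)).
val nums = (1 to 9).toList

val squares  = nums.map(x => x * x)        // List(1, 4, 9, 16, 25, 36, 49, 64, 81)
val expanded = nums.flatMap(x => 1 to x)   // 1, 1,2, 1,2,3, ... as in res9
val total    = expanded.reduce(_ + _)      // 165, matching res10
```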
scala> nums.map(x =>(x*x)).collect
res8: Array[Int] = Array(1, 4, 9, 16, 25, 36, 49, 64, 81)
scala> nums.flatMap(x =>(1 to x)).collect
res9: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> nums.flatMap(x =>(1 to x)).reduce(_ + _)
res10: Int = 165
Homework 1: reading SequenceFile files with Spark Core
**For historical reasons:** some Hive tables are stored as SequenceFile, and now you want to use Spark Core as your distributed compute framework. (Spark SQL can read these directly.)
3. In-depth Guide to join on RDDs
1、scala> val a = sc.parallelize(Array(("A","a1"),("C","c1"),("D","d1"),("F","f1")))
a: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[13] at parallelize at <console>:24
2、scala> val b = sc.parallelize(Array(("A","a2"),("C","c2"),("C","c3"),("E","e1")))
b: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[15] at parallelize at <console>:24
3、a.join(b).collect //this is effectively an inner join: only keys that match on both sides are returned.
scala> a.join(b).collect
res16: Array[(String, (String, String))] = Array((A,(a1,a2)), (C,(c1,c2)), (C,(c1,c3)))
4、a.leftOuterJoin(b).collect //note the return type: a is the driving (left) table, matched against b; every row of the left table is returned
res30: Array[(String, (String, Option[String]))] = Array((F,(f1,None)), (D,(d1,None)), (A,(a1,Some(a2))), (C,(c1,Some(c2))), (C,(c1,Some(c3))))
5、a.rightOuterJoin(b).collect //by symmetry, every row of the right table is returned
res31: Array[(String, (Option[String], String))] = Array((A,(Some(a1),a2)), (C,(Some(c1),c2)), (C,(Some(c1),c3)), (E,(None,e1)))
6、a.fullOuterJoin(b).collect //full outer join: all rows from both sides
res32: Array[(String, (Option[String], Option[String]))] = Array((F,(Some(f1),None)), (D,(Some(d1),None)), (A,(Some(a1),Some(a2))), (C,(Some(c1),Some(c2))), (C,(Some(c1),Some(c3))), (E,(None,Some(e1))))
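The join variants can be emulated on local pair sequences, which makes the `Option` wrappers in the return types explicit (a sketch of the semantics only, not how Spark implements joins):

```scala
val a = Seq(("A", "a1"), ("C", "c1"), ("D", "d1"), ("F", "f1"))
val b = Seq(("A", "a2"), ("C", "c2"), ("C", "c3"), ("E", "e1"))

// Inner join: only keys present on both sides survive.
val inner = for ((k, v) <- a; (k2, w) <- b if k == k2) yield (k, (v, w))

// Left outer join: every left row survives; unmatched rows pair with None.
val bByKey = b.groupBy { case (k, _) => k }
val leftOuter = a.flatMap { case (k, v) =>
  bByKey.get(k) match {
    case Some(rows) => rows.map { case (_, w) => (k, (v, Option(w))) }
    case None       => Seq((k, (v, Option.empty[String])))
  }
}
```

`rightOuterJoin` is the mirror image (wrap the left value in `Option` instead), and `fullOuterJoin` wraps both sides.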
4. Word Count with Spark Core, Step by Step
1、scala> val log = sc.textFile("file:///home/hadoop/data/ruozeinput.txt")
log: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/data/ruozeinput.txt MapPartitionsRDD[62] at textFile at <console>:24
First printed output: Array[String] = Array(hello world john, hello world, hello)
2、scala> log.map( x => x.split("\t"))
res33: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[63] at map at <console>:26
Second printed output: Array(Array(hello, world, john), Array(hello, world), Array(hello))
3、scala> val splits = log.flatMap( x => x.split("\t"))
splits: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[66] at flatMap at <console>:25
Output: Array[String] = Array(hello, world, john, hello, world, hello)
4、scala> splits.map(x =>(x,1)).reduceByKey(_+_).collect
res41: Array[(String, Int)] = Array((hello,3), (world,2), (john,1))
//This operation involves a shuffle: records with the same key are sent to the same reducer, where their values are summed.
(hello,1) (hello,1) (hello,1) ==> (hello,3)
(world,1) (world,1) ==> (world,2)
(john,1) ==> (john,1)
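The same pipeline can be emulated on local Scala collections, with `groupBy` playing the role of the shuffle that `reduceByKey` performs (a local sketch, not Spark itself):

```scala
// Emulates: textFile -> flatMap(split) -> map((_, 1)) -> reduceByKey(_ + _)
val lines = Seq("hello\tworld\tjohn", "hello\tworld", "hello")

val counts = lines
  .flatMap(_.split("\t"))                // one element per word
  .map(word => (word, 1))                // pair each word with a 1
  .groupBy { case (word, _) => word }    // the "shuffle": gather pairs by key
  .map { case (word, ones) => (word, ones.map(_._2).sum) }  // sum values per key
// counts: Map(hello -> 3, world -> 2, john -> 1)
```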
Homework 2: sort the words by occurrence count, in descending / ascending order
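One possible approach to this homework (a hint, not the official solution): on an RDD you could append `.sortBy(_._2, ascending = false)` after the `reduceByKey`. The same idea on a local collection:

```scala
// Word counts as produced by the previous step (hard-coded here for illustration).
val counts = Seq(("hello", 3), ("world", 2), ("john", 1))

val descending = counts.sortBy { case (_, n) => -n }  // most frequent first
val ascending  = counts.sortBy { case (_, n) => n }   // least frequent first
```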
5. Detailed Guide to subtract & intersection & cartesian on RDDs
1、subtract (remove the elements that also appear in another RDD)
Doc comment from the source: Return an RDD with the elements from `this` that are not in `other`.
Taking the difference of two RDDs is a very common operation:
val a = sc.parallelize(1 to 5)
val b = sc.parallelize(2 to 3)
a.subtract(b).collect
输出:Array[Int] = Array(4, 1, 5)
2、intersection (common elements)
Doc comment from the source: Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did.
a.intersection(b).collect
输出: Array[Int] = Array(2, 3)
3、cartesian (Cartesian product)
Doc comment from the source: Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in `this` and b is in `other`.
a.cartesian(b).collect
Array[(Int, Int)] = Array((1,2), (2,2), (1,3), (2,3), (3,2), (4,2), (5,2), (3,3), (4,3), (5,3))
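All three operators have close local-collection analogues (a sketch of the semantics only; note that RDD `subtract` removes every occurrence of an element found in the other RDD, so `filterNot` over a set is a closer match than `diff`, which removes occurrences one-for-one):

```scala
val a = (1 to 5).toList
val b = (2 to 3).toList

val subtracted  = a.filterNot(b.toSet)               // 1, 4, 5 (Spark may return another order)
val intersected = a.intersect(b).distinct            // 2, 3 -- no duplicates, as documented
val cartesian   = for (x <- a; y <- b) yield (x, y)  // all (x, y) pairs: 5 * 2 = 10
```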
spark-shell is suitable for quick testing; for real development use IDEA + Maven + Scala.