5. Detailed Guide to subtract & intersection & cartesian on RDDs
1. Review of the Previous Lesson
若泽数据 (RuoZe Data) Bilibili video: Spark06 - Basic Operations on Spark RDDs (Part 1)
- https://blog.csdn.net/zhikanjiani/article/details/97833470
Note:
- When writing code, check whether there is an action; without an action, the job will not execute even if it contains 100 transformations.
The overall execution flow when programming a Spark application:
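The laziness behind that note can be mimicked with a plain Scala view, where `map` only sets up work and nothing runs until the collection is forced (an analogy on local collections, not actual Spark code):

```scala
var mapCalls = 0

// "Transformation": building the lazy view executes nothing yet.
val pipeline = (1 to 3).view.map { x => mapCalls += 1; x * 2 }
val callsBeforeAction = mapCalls   // still 0: no work has happened

// "Action": forcing the view to a List finally runs the map.
val result = pipeline.toList       // List(2, 4, 6); mapCalls is now 3
```

Spark RDDs behave the same way at cluster scale: `map`, `flatMap`, `join` and the like only build a lineage graph, and only an action such as `collect` or `reduce` submits a job.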
2. Revisiting Common RDD Operators
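The transcript below uses `nums` without showing how it was created; from the outputs it clearly holds the integers 1 through 9 (presumably something like `sc.parallelize(1 to 9)`, though that line is not in the source). A local-collection stand-in reproduces the same results:

```scala
// Local stand-in for the RDD used in the transcript (assumed definition:
// in spark-shell it would be val nums = sc.parallelize(1 to 9)).
val nums = (1 to 9).toList

val squares  = nums.map(x => x * x)        // List(1, 4, 9, 16, 25, 36, 49, 64, 81)
val expanded = nums.flatMap(x => 1 to x)   // 1, 1,2, 1,2,3, ... as in res9
val total    = expanded.reduce(_ + _)      // 165, matching res10
```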
scala> nums.map(x =>(x*x)).collect
res8: Array[Int] = Array(1, 4, 9, 16, 25, 36, 49, 64, 81)
scala> nums.flatMap(x =>(1 to x)).collect
res9: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> nums.flatMap(x =>(1 to x)).reduce(_ + _)
res10: Int = 165
Homework 1: reading SequenceFile files with Spark Core
**For historical reasons:** some Hive tables are stored as SequenceFile, and now you want to use Spark Core as your distributed compute framework. (Spark SQL can read these directly.)
3. In-depth Guide to join on RDDs
1、scala> val a = sc.parallelize(Array(("A","a1"),("C","c1"),("D","d1"),("F","f1")))
a: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[13] at parallelize at <console>:24
2、scala> val b = sc.parallelize(Array(("A","a2"),("C","c2"),("C","c3"),("E","e1")))
b: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[15] at parallelize at <console>:24
3、a.join(b).collect //this is effectively an inner join: only keys that match on both sides are returned.
scala> a.join(b).collect
res16: Array[(String, (String, String))] = Array((A,(a1,a2)), (C,(c1,c2)), (C,(c1,c3)))
4、a.leftOuterJoin(b).collect //note the return type: a is the driving (left) table, matched against b; every row of the left table is returned
res30: Array[(String, (String, Option[String]))] = Array((F,(f1,None)), (D,(d1,None)), (A,(a1,Some(a2))), (C,(c1,Some(c2))), (C,(c1,Some(c3))))
5、a.rightOuterJoin(b).collect //by symmetry, every row of the right table is returned
res31: Array[(String, (Option[String], String))] = Array((A,(Some(a1),a2)), (C,(Some(c1),c2)), (C,(Some(c1),c3)), (E,(None,e1)))
6、a.fullOuterJoin(b).collect //full outer join: all rows from both sides
res32: Array[(String, (Option[String], Option[String]))] = Array((F,(Some(f1),None)), (D,(Some(d1),None)), (A,(Some(a1),Some(a2))), (C,(Some(c1),Some(c2))), (C,(Some(c1),Some(c3))), (E,(None,Some(e1))))
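The join variants can be emulated on local pair sequences, which makes the `Option` wrappers in the return types explicit (a sketch of the semantics only, not how Spark implements joins):

```scala
val a = Seq(("A", "a1"), ("C", "c1"), ("D", "d1"), ("F", "f1"))
val b = Seq(("A", "a2"), ("C", "c2"), ("C", "c3"), ("E", "e1"))

// Inner join: only keys present on both sides survive.
val inner = for ((k, v) <- a; (k2, w) <- b if k == k2) yield (k, (v, w))

// Left outer join: every left row survives; unmatched rows pair with None.
val bByKey = b.groupBy { case (k, _) => k }
val leftOuter = a.flatMap { case (k, v) =>
  bByKey.get(k) match {
    case Some(rows) => rows.map { case (_, w) => (k, (v, Option(w))) }
    case None       => Seq((k, (v, Option.empty[String])))
  }
}
```

`rightOuterJoin` is the mirror image (wrap the left value in `Option` instead), and `fullOuterJoin` wraps both sides.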
4. Word Count with Spark Core, Step by Step
1、scala> val log = sc.textFile("file:///home/hadoop/data/ruozeinput.txt")
log: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/data/ruozeinput.txt MapPartitionsRDD[62] at textFile at <console>:24
First printed output: Array[String] = Array(hello world john, hello world, hello)
2、scala> log.map( x => x.split("\t"))
res33: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[63] at map at <console>:26
Second printed output: Array(Array(hello, world, john), Array(hello, world), Array(hello))
3、scala> val splits = log.flatMap( x => x.split("\t"))
splits: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[66] at flatMap at <console>:25
Output: Array[String] = Array(hello, world, john, hello, world, hello)
4、scala> splits.map(x =>(x,1)).reduceByKey(_+_).collect
res41: Array[(String, Int)] = Array((hello,3), (world,2), (john,1))
//This operation involves a shuffle: records with the same key are sent to the same reducer, where their values are summed.
(hello,1) (hello,1) (hello,1) ==> (hello,3)
(world,1) (world,1) ==> (world,2)
(john,1) ==> (john,1)
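The same pipeline can be emulated on local Scala collections, with `groupBy` playing the role of the shuffle that `reduceByKey` performs (a local sketch, not Spark itself):

```scala
// Emulates: textFile -> flatMap(split) -> map((_, 1)) -> reduceByKey(_ + _)
val lines = Seq("hello\tworld\tjohn", "hello\tworld", "hello")

val counts = lines
  .flatMap(_.split("\t"))                // one element per word
  .map(word => (word, 1))                // pair each word with a 1
  .groupBy { case (word, _) => word }    // the "shuffle": gather pairs by key
  .map { case (word, ones) => (word, ones.map(_._2).sum) }  // sum values per key
// counts: Map(hello -> 3, world -> 2, john -> 1)
```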
Homework 2: sort the words by occurrence count, in descending / ascending order
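One possible approach to this homework (a hint, not the official solution): on an RDD you could append `.sortBy(_._2, ascending = false)` after the `reduceByKey`. The same idea on a local collection:

```scala
// Word counts as produced by the previous step (hard-coded here for illustration).
val counts = Seq(("hello", 3), ("world", 2), ("john", 1))

val descending = counts.sortBy { case (_, n) => -n }  // most frequent first
val ascending  = counts.sortBy { case (_, n) => n }   // least frequent first
```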
5. Detailed Guide to subtract & intersection & cartesian on RDDs
1、subtract (remove the elements that also appear in another RDD)
Doc comment from the source: Return an RDD with the elements from `this` that are not in `other`.
Taking the difference of two RDDs is a very common operation:
val a = sc.parallelize(1 to 5)
val b = sc.parallelize(2 to 3)
a.subtract(b).collect
输出:Array[Int] = Array(4, 1, 5)
2、intersection (common elements)
Doc comment from the source: Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did.
a.intersection(b).collect
输出: Array[Int] = Array(2, 3)
3、cartesian (Cartesian product)
Doc comment from the source: Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in `this` and b is in `other`.
a.cartesian(b).collect
Array[(Int, Int)] = Array((1,2), (2,2), (1,3), (2,3), (3,2), (4,2), (5,2), (3,3), (4,3), (5,3))
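All three operators have close local-collection analogues (a sketch of the semantics only; note that RDD `subtract` removes every occurrence of an element found in the other RDD, so `filterNot` over a set is a closer match than `diff`, which removes occurrences one-for-one):

```scala
val a = (1 to 5).toList
val b = (2 to 3).toList

val subtracted  = a.filterNot(b.toSet)               // 1, 4, 5 (Spark may return another order)
val intersected = a.intersect(b).distinct            // 2, 3 -- no duplicates, as documented
val cartesian   = for (x <- a; y <- b) yield (x, y)  // all (x, y) pairs: 5 * 2 = 10
```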
spark-shell is suitable for quick testing; for real development use IDEA + Maven + Scala.