Ruozedata Bilibili Video Spark06 - Basic RDD Operations in Spark (Part 2)

1. Review of the Last Lesson

2. Hands-On Review of Common RDD Operators

3. A Deep Dive into join on RDDs

4. Dissecting Word Count with Spark Core

5. subtract & intersection & cartesian on RDDs, in Detail

1. Review of the Last Lesson

Ruozedata Bilibili Video Spark06 - Basic RDD Operations in Spark (Part 1):

  • https://blog.csdn.net/zhikanjiani/article/details/97833470

Key point:

  • When writing code, check that there is an action: without one, even 100 transformations will not execute the job (lazy evaluation; see the sketch below).
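
A minimal sketch of this lazy behavior, assuming a running spark-shell (so `sc` is available):

scala> val rdd = sc.parallelize(1 to 100)
scala> val doubled = rdd.map(_ * 2)            // transformation: no job runs yet
scala> val filtered = doubled.filter(_ > 50)   // still no job
scala> filtered.count()                        // action: only now does a job execute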

The overall execution flow when programming a Spark application:
[Figure: Spark application execution flow]

2. Hands-On Review of Common RDD Operators
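
The examples below reuse `nums` from the last lesson; a minimal setup in spark-shell would be:

scala> val nums = sc.parallelize(1 to 9)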

scala> nums.map(x =>(x*x)).collect
res8: Array[Int] = Array(1, 4, 9, 16, 25, 36, 49, 64, 81)

scala> nums.flatMap(x =>(1 to x)).collect
res9: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9)

scala> nums.flatMap(x => (1 to x)).reduce(_ + _)    // sums every flattened element: 1 + (1+2) + ... + (1+...+9)
res10: Int = 165

Homework 1: read a SequenceFile with Spark Core

For historical reasons: some Hive tables are stored as SequenceFile, and now you want to use Spark Core as the distributed compute framework; Spark SQL can read such tables directly. A sketch of the Spark Core route follows.
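
A minimal sketch, assuming a hypothetical warehouse path and that (as is common for Hive SequenceFile tables) each row is stored in the Text value, with an ignorable key:

scala> import org.apache.hadoop.io.{BytesWritable, Text}
scala> val rows = sc.sequenceFile("/user/hive/warehouse/some_table", classOf[BytesWritable], classOf[Text]).map { case (_, v) => v.toString }   // keep only the row text
scala> rows.take(5).foreach(println)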

3. A Deep Dive into join on RDDs

1. scala> val a = sc.parallelize(Array(("A","a1"),("C","c1"),("D","d1"),("F","f1")))
a: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[13] at parallelize at <console>:24

2. scala> val b = sc.parallelize(Array(("A","a2"),("C","c2"),("C","c3"),("E","e1")))
b: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[15] at parallelize at <console>:24

3. scala> a.join(b).collect			// effectively an inner join: only keys present on both sides are returned
res16: Array[(String, (String, String))] = Array((A,(a1,a2)), (C,(c1,c2)), (C,(c1,c3)))

4. a.leftOuterJoin(b).collect		// note the return type: a drives the join and is matched against b; every row of the left side comes back
res30: Array[(String, (String, Option[String]))] = Array((F,(f1,None)), (D,(d1,None)), (A,(a1,Some(a2))), (C,(c1,Some(c2))), (C,(c1,Some(c3))))

5. a.rightOuterJoin(b).collect			// symmetrically, every row of the right side comes back
res31: Array[(String, (Option[String], String))] = Array((A,(Some(a1),a2)), (C,(Some(c1),c2)), (C,(Some(c1),c3)), (E,(None,e1)))

6. a.fullOuterJoin(b).collect			// full outer join: all keys from both sides
res32: Array[(String, (Option[String], Option[String]))] = Array((F,(Some(f1),None)), (D,(Some(d1),None)), (A,(Some(a1),Some(a2))), (C,(Some(c1),Some(c2))), (C,(Some(c1),Some(c3))), (E,(None,Some(e1))))
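
Because the unmatched side of an outer join comes back wrapped in Option, a common follow-up (a sketch, using a placeholder default) is to flatten it:

scala> a.leftOuterJoin(b).map { case (k, (v, opt)) => (k, v, opt.getOrElse("NULL")) }.collect
// e.g. (F,(f1,None)) becomes (F,f1,NULL)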

4. Dissecting Word Count with Spark Core

1. scala> val log = sc.textFile("file:///home/hadoop/data/ruozeinput.txt")
log: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/data/ruozeinput.txt MapPartitionsRDD[62] at textFile at <console>:24
First printout (log.collect): Array[String] = Array(hello      world   john, hello     world, hello)

2. scala> log.map(x => x.split("\t"))
res33: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[63] at map at <console>:26
Second printout (after .collect): Array(Array(hello, world, john), Array(hello, world), Array(hello))

3. scala> val splits = log.flatMap(x => x.split("\t"))
splits: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[66] at flatMap at <console>:25
Printout: Array[String] = Array(hello, world, john, hello, world, hello)

4. scala> splits.map(x => (x, 1)).reduceByKey(_ + _).collect
res41: Array[(String, Int)] = Array((hello,3), (world,2), (john,1))
// this step involves a shuffle: pairs with the same key are routed to the same reducer, where their values (the 1s) are summed

(hello,1) (hello,1) (hello,1)	==> (hello,3)
(world,1) (world,1)		==> (world,2)
(john,1)			==> (john,1)
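
The whole pipeline can be chained in one line (same logic as steps 1–4 above):

scala> sc.textFile("file:///home/hadoop/data/ruozeinput.txt").flatMap(_.split("\t")).map((_, 1)).reduceByKey(_ + _).collect
// Array((hello,3), (world,2), (john,1))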

Homework 2: sort the words by occurrence count, descending/ascending (a sketch follows).
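
One possible approach, using sortBy on the count (flip `ascending` for the other order):

scala> splits.map(x => (x, 1)).reduceByKey(_ + _).sortBy(_._2, ascending = false).collect
// Array((hello,3), (world,2), (john,1))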

5. subtract & intersection & cartesian on RDDs, in Detail

1. subtract

From the scaladoc: Return an RDD with the elements from `this` that are not in `other`.
Subtracting one RDD from another is a very common operation:
val a = sc.parallelize(1 to 5)
val b = sc.parallelize(2 to 3)
a.subtract(b).collect
Output: Array[Int] = Array(4, 1, 5)
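
Note that subtract is not symmetric: swapping the operands here yields an empty result, since every element of b also appears in a:

b.subtract(a).collect
// Array[Int] = Array()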

2. intersection
From the scaladoc: Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did.

a.intersection(b).collect
Output: Array[Int] = Array(2, 3)

3. cartesian
From the scaladoc: Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in `this` and b is in `other`.

a.cartesian(b).collect
Output: Array[(Int, Int)] = Array((1,2), (2,2), (1,3), (2,3), (3,2), (4,2), (5,2), (3,3), (4,3), (5,3))
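
As a sanity check on the definition, the output size is the product of the two input sizes (5 × 2 = 10 here), which is why cartesian gets expensive very quickly on large RDDs:

assert(a.cartesian(b).count == a.count * b.count)   // 10 == 5 * 2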

spark-shell is fine for quick tests; for real development, use IDEA + Maven + Scala.
