若泽数据 Bilibili Video Spark 05 - Basic RDD Operations in Spark (Part 1)

一、Review of the Previous Lesson

二、RDDs from a Macro Perspective

三、The RDD map Operator in Detail

四、The RDD filter Operator Combined with map

五、The RDD mapValues Operator in Detail

六、Common RDD Action Operators

一、Review of the Previous Lesson

1、若泽数据 Bilibili video Spark Basics 05 - Creating Spark RDDs

  • https://blog.csdn.net/zhikanjiani/article/details/90613976

二、RDDs from a Macro Perspective

RDD operations from a macro perspective.
Description from the official documentation:

  • RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.

How to understand this: RDDs support two types of operations:

  1. transformations ==> create a new RDD from an existing RDD, which also reflects the immutability of RDDs.

  2. actions ==> e.g. start a shell in the console, run a computation over the dataset, and return the result to the driver/console.

  3. map ==> applies a function (e.g. y = f(x) + 1) to every element of the dataset and returns a new RDD.

In practice:

  1): transformation example: map, with y = f(x) + 1

              map: y = f(x) + 1               reduce(a + b)
     RDDA ==========================> RDDB ==================> result
     (1,2,3,4,5)     map( _ + 1 )      (2,3,4,5,6)      sum
     // the +1 inside map is applied to every element of RDDA, and a new RDD (RDDB) is returned

map is a transformation that passes each dataset element through a function and returns a new RDD representing the results.
  2): action example: reduce
    Concept: reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program.

We then apply reduce(a+b) to RDDB: pairwise addition, which is equivalent to ==> sum
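
A minimal spark-shell sketch of the RDDA -> map -> RDDB -> reduce chain above (the variable names are my own, and the res numbering will differ in your session):

scala> val rddA = sc.parallelize(List(1, 2, 3, 4, 5))   // RDDA
scala> val rddB = rddA.map(_ + 1)                       // RDDB: a new RDD holding (2,3,4,5,6)
scala> rddB.reduce(_ + _)                               // pairwise aggregation, same as rddB.sum
res0: Int = 20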

3): All transformations in Spark are lazy ***** meaning the work is only done when an action operator is encountered.
Concept: in that they do not compute their results right away. The transformations are only computed when an action requires a result to be returned to the driver program.

This design enables Spark to run more efficiently.

eg1: rdda.map().filter().map().filter() ==> this does not actually trigger execution; it merely records the chain of transformations.

eg2: rdda.map().reduce(_ + _) ==> this will definitely trigger execution and return a result.
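
A small sketch, not from the video, that makes the laziness visible: the println inside map does not run when the transformation is defined, only when the action is called (in local mode the output shows up in the same console).

scala> val rdd = sc.parallelize(1 to 3)
scala> val mapped = rdd.map { x => println("computing " + x); x * 2 }   // lazy: nothing is printed yet
scala> mapped.collect                                                   // only now do the "computing ..." lines appear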

4): cache
by default, each transformed RDD may be recomputed each time you run an action on it.
==> each transformed RDD may be recomputed; scenario: if an RDD's partitions are lost, they are recomputed from the parent RDD by following the lineage.

however, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.
==> if the data will be needed again later, cache it so that future access is much faster.

There is also support for persisting RDDs on disk, or replicated across multiple nodes
==> RDDs can also be persisted to disk, or replicated across multiple nodes.
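
A minimal sketch of cache / persist, assuming local mode; MEMORY_AND_DISK_2 is one of the built-in levels in org.apache.spark.storage.StorageLevel (memory plus disk, replicated on 2 nodes).

scala> import org.apache.spark.storage.StorageLevel
scala> val doubled = sc.parallelize(1 to 1000000).map(_ * 2)
scala> doubled.cache()       // shorthand for persist(StorageLevel.MEMORY_ONLY)
scala> doubled.count         // first action: computes the RDD and fills the cache
scala> doubled.count         // later actions read the cached partitions instead of recomputing
scala> val onDisk = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_AND_DISK_2)   // disk + 2 replicas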

Homework 1: think about why the design where transformations do not trigger execution until an action is met makes Spark run more efficiently.

三、The RDD map Operator in Detail

Operations in the Spark shell:

Example 1:
1、scala> val a = sc.parallelize(1 to 9)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

2、scala> val b = a.map(x => (x*2))
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:25

3、scala> b.collect
res0: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18) // since the result is not assigned to a name, the REPL numbers it starting from res0

Example 2:
1、scala> val a = sc.parallelize(List("dog","lion","cat","tiger","panda"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[6] at parallelize at <console>:24

2、scala> val b = a.map(x => (x,1))
b: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[7] at map at <console>:25

3、scala> b.collect
res3: Array[(String, Int)] = Array((dog,1), (lion,1), (cat,1), (tiger,1), (panda,1))

The same operations in the plain Scala REPL:
1、scala> val a = List("dog","cat","tiger")

2、scala> a.map(x => (x,1))

Summary:

	The map operator shows that programming with the RDD API is exactly the same as programming with Scala collections
		===> Scala collection operations absolutely must be mastered
		Spark can run standalone or distributed, while Scala collections are single-machine only. Looking at val b = a.map(x => (x,1)) alone, you cannot tell whether it is Scala or Spark code: developing a distributed application looks just like developing a single-machine program, so moving from single-machine development to distributed big-data development is seamless.

四、The RDD filter Operator Combined with map

filter: filters elements, keeping those for which the predicate is true

Operations in the Spark shell:

1、scala> val a = sc.parallelize( 1 to 10)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24

2、Find the numbers from 1 to 10 that are divisible by 2
scala> a.filter(_%2==0).collect
res11: Array[Int] = Array(2, 4, 6, 8, 10)	// the result is returned to the console

3、Find the numbers less than 5
scala> a.filter(_<5).collect
res12: Array[Int] = Array(1, 2, 3, 4)

**Example 1:**
1、scala> val a = sc.parallelize(1 to 6)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:24

2、scala> val mapRDD = a.map(_*2)
mapRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[13] at map at <console>:25

3、scala> val filterRDD = mapRDD.filter(_>5)
filterRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[14] at filter at <console>:25

4、scala> filterRDD.collect
res8: Array[Int] = Array(6, 8, 10, 12)

Step 1: map(_ * 2)		=>	Step 2: filter(_ > 5)

Writing it out step by step is verbose, so from now on we consistently use chained (fluent) calls ***** this is very important
scala> sc.parallelize(1 to 6).map(_*2).filter(_>5).collect
res9: Array[Int] = Array(6, 8, 10, 12)

For example, typical chained calls in jQuery (and Hibernate):
	$("#p1").css("color","red").slideUp(200).slideDown(200)
	getSession().createQuery(hql).setFirstResult(........).list

五、The mapValues Operator in Detail

Concept: as the name suggests, mapValues operates on the values of a (key, value) RDD.

Testing:
start spark-shell --master local[2]

1、scala> val a = sc.parallelize(List("cat","lion","tiger","dog","panda"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

2、scala> val b = a.map(x =>(x.length,x))
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[1] at map at <console>:25
// pair each string with its length: (length, string)

3、scala> b.collect
res: Array[(Int, String)] = Array((3,cat), (4,lion), (5,tiger), (3,dog), (5,panda))

4、scala> b.mapValues("x" + _ + "x").collect
res5: Array[(Int, String)] = Array((3,xcatx), (4,xlionx), (5,xtigerx), (3,xdogx), (5,xpandax))

	 Note: mapValues leaves the key alone and transforms only the value.
	mapValues acts only on the values, never on the keys.
	****A very useful little feature in practice: in production there are plenty of cases where you only want to change the values and keep the keys untouched.
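
For comparison, a sketch of the same operation written with plain map (my own illustration, not from the video): you get the same result, but you have to rebuild the key/value pair yourself, which is exactly what mapValues saves you from.

scala> b.mapValues("x" + _ + "x").collect                    // key untouched, only the value changes
scala> b.map { case (k, v) => (k, "x" + v + "x") }.collect   // same output, pair rebuilt by hand

(mapValues on a pair RDD also preserves the partitioner, while a plain map does not.)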

Homework 2: what is the difference between flatMap and map? (A hint sketch follows below.)
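
A hint for Homework 2, sketched in spark-shell: map produces exactly one output element per input element, while flatMap can produce zero or more per element and flattens them into a single RDD.

scala> val lines = sc.parallelize(List("hello world", "hi"))
scala> lines.map(_.split(" ")).collect       // Array(Array(hello, world), Array(hi))  -- one array per line
scala> lines.flatMap(_.split(" ")).collect   // Array(hello, world, hi)                -- flattened into words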

六、Common RDD Action Operators

1、count(): Return the number of elements in the dataset.

Operations in the Spark shell:

1、scala> val a = sc.parallelize(List("dog","lion","tiger","cat","panda"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at parallelize at <console>:24

2、scala> a.count
res13: Long = 5

2、reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative.

Operations in the Spark shell:

1、scala> val a = sc.parallelize(1 to 100)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:24

2、scala> a.reduce(_+_)     ==> a.reduce((x,y) => (x+y))		// these two expressions are equivalent
res15: Int = 5050

3、scala> a.reduce((x,y) => x+y)
res16: Int = 5050

4、scala> val b = sc.parallelize(1 to 100)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:24

5、scala> b.sum
res18: Double = 5050.0

6、scala> b.sum.toInt
res19: Int = 5050
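
Since the documentation says the reduce function should be commutative and associative, here is a small sketch of what can go wrong otherwise: subtraction is neither, so the result depends on how the elements are grouped across partitions.

scala> val nums = sc.parallelize(1 to 100)
scala> nums.reduce(_ + _)   // addition is commutative and associative: always 5050
scala> nums.reduce(_ - _)   // subtraction is not: the result can change with the number of partitions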

3、first

Operations in the Spark shell:

1、scala> val a = sc.parallelize(List("dog","lion","tiger","cat","panda"))

2、scala> a.first
res30: String = dog

3、scala> a.take(1)
res31: Array[String] = Array(dog)

4、scala> a.take(10)		// 10 is larger than the number of elements; why doesn't this throw an error?
res32: Array[String] = Array(dog, lion, tiger, cat, panda)

4、top

Operations in the Spark shell:

1、scala> val a = sc.parallelize(Array(6,9,4,7,19,16,8))
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24

2、scala> a.top(2)
res31: Array[Int] = Array(19, 16)

scala> val a = sc.parallelize(List("dog","cat","lion","tiger","ron","panda"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[14] at parallelize at <console>:24

scala> a.top(2)
res33: Array[String] = Array(tiger, panda)

For both Int and String, top returns the elements in descending order.

Extension: implicit ordering

Requirement: top sorts in descending order by default; how can we get ascending order?

Define an implicit reversed ordering:
implicit val myOrder = implicitly[Ordering[Int]].reverse

scala> val a = sc.parallelize(1 to 10)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:26

scala> a.top(2)
res25: Array[Int] = Array(1, 2)

scala> a.max
res26: Int = 1			// do not overuse implicits: in the current spark-shell session a.max is reversed and returns 1; open a new spark-shell and a.max correctly returns 10.
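
To avoid leaving a reversed implicit ordering lying around in the session, you can also pass the ordering explicitly to top (top takes an implicit Ordering parameter); in a fresh spark-shell without the implicit above:

scala> val a = sc.parallelize(1 to 10)
scala> a.top(2)                          // default descending order: Array(10, 9)
scala> a.top(2)(Ordering[Int].reverse)   // ascending for this one call only: Array(1, 2)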

Homework: read a SequenceFile with Spark Core.
Historical background: some Hive tables are stored as SequenceFile, and now we want to use Spark Core as the distributed compute framework to process them.
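
A hedged starting point for this homework: SparkContext has a sequenceFile method that takes the path plus the Hadoop Writable key/value classes. The path and the Text/IntWritable types below are placeholders; use whatever your Hive table actually stores.

scala> import org.apache.hadoop.io.{IntWritable, Text}
scala> val seq = sc.sequenceFile("/path/to/hive/table", classOf[Text], classOf[IntWritable])
scala> seq.map { case (k, v) => (k.toString, v.get) }.take(10)   // convert the Writables before collecting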

Homework 1: think about why the design where transformations do not trigger execution until an action is met makes Spark run more efficiently.

Homework 2: what is the difference between flatMap and map?

Homework 3: test takeSample, takeOrdered, saveAsTextFile, and saveAsSequenceFile on your own.
