若泽数据 Bilibili Video Spark 05 - Basic RDD Operations in Spark (Part 1)

一、Review of the Previous Lesson

二、RDDs from a Macro Perspective

三、The RDD map Operator in Detail

四、The RDD filter Operator Combined with map

五、The RDD mapValues Operator in Detail

六、Common RDD Action Operators

一、Review of the Previous Lesson

1、若泽数据 Bilibili video Spark Basics 05 - Creating Spark RDDs

  • https://blog.csdn.net/zhikanjiani/article/details/90613976

二、RDDs from a Macro Perspective

RDD operations from a macro perspective.
Description from the official documentation:

  • RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.

How to understand this: RDDs support two types of operations:

  1. transformations ==> create a new RDD from an existing RDD, which also reflects the immutability of RDDs.

  2. actions ==> e.g. start a shell in the console, run a computation over the dataset, and return the result to the driver/console.

  3. map ==> applies a function (e.g. y = f(x) + 1) to every element of the dataset and returns a new RDD.

In practice:

  1): transformation example: map, with y = f(x) + 1

              map: y = f(x) + 1               reduce(a + b)
     RDDA ==========================> RDDB ==================> result
     (1,2,3,4,5)     map( _ + 1 )      (2,3,4,5,6)      sum
     // the +1 inside map is applied to every element of RDDA, and a new RDD (RDDB) is returned

map is a transformation that passes each dataset element through a function and returns a new RDD representing the results.
  2): action example: reduce
    Concept: reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program.

We then apply reduce(a+b) to RDDB: pairwise addition, which is equivalent to ==> sum
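
A minimal spark-shell sketch of the RDDA -> map -> RDDB -> reduce chain above (the variable names are my own, and the res numbering will differ in your session):

scala> val rddA = sc.parallelize(List(1, 2, 3, 4, 5))   // RDDA
scala> val rddB = rddA.map(_ + 1)                       // RDDB: a new RDD holding (2,3,4,5,6)
scala> rddB.reduce(_ + _)                               // pairwise aggregation, same as rddB.sum
res0: Int = 20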

3): All transformations in Spark are lazy ***** meaning the work is only done when an action operator is encountered.
Concept: in that they do not compute their results right away. The transformations are only computed when an action requires a result to be returned to the driver program.

This design enables Spark to run more efficiently.

eg1: rdda.map().filter().map().filter() ==> this does not actually trigger execution; it merely records the chain of transformations.

eg2: rdda.map().reduce(_ + _) ==> this will definitely trigger execution and return a result.
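
A small sketch, not from the video, that makes the laziness visible: the println inside map does not run when the transformation is defined, only when the action is called (in local mode the output shows up in the same console).

scala> val rdd = sc.parallelize(1 to 3)
scala> val mapped = rdd.map { x => println("computing " + x); x * 2 }   // lazy: nothing is printed yet
scala> mapped.collect                                                   // only now do the "computing ..." lines appear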

4): cache
by default, each transformed RDD may be recomputed each time you run an action on it.
==> each transformed RDD may be recomputed; scenario: if an RDD's partitions are lost, they are recomputed from the parent RDD by following the lineage.

however, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.
==> if the data will be needed again later, cache it so that future access is much faster.

There is also support for persisting RDDs on disk, or replicated across multiple nodes
==> RDDs can also be persisted to disk, or replicated across multiple nodes.
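
A minimal sketch of cache / persist, assuming local mode; MEMORY_AND_DISK_2 is one of the built-in levels in org.apache.spark.storage.StorageLevel (memory plus disk, replicated on 2 nodes).

scala> import org.apache.spark.storage.StorageLevel
scala> val doubled = sc.parallelize(1 to 1000000).map(_ * 2)
scala> doubled.cache()       // shorthand for persist(StorageLevel.MEMORY_ONLY)
scala> doubled.count         // first action: computes the RDD and fills the cache
scala> doubled.count         // later actions read the cached partitions instead of recomputing
scala> val onDisk = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_AND_DISK_2)   // disk + 2 replicas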

Homework 1: think about why the design where transformations do not trigger execution until an action is met makes Spark run more efficiently.

三、The RDD map Operator in Detail

Operations in the Spark shell:

Example 1:
1、scala> val a = sc.parallelize(1 to 9)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

2、scala> val b = a.map(x => (x*2))
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:25

3、scala> b.collect
res0: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18) // since the result is not assigned to a name, the REPL numbers it starting from res0

Example 2:
1、scala> val a = sc.parallelize(List("dog","lion","cat","tiger","panda"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[6] at parallelize at <console>:24

2、scala> val b = a.map(x => (x,1))
b: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[7] at map at <console>:25

3、scala> b.collect
res3: Array[(String, Int)] = Array((dog,1), (lion,1), (cat,1), (tiger,1), (panda,1))

The same operations in the plain Scala REPL:
1、scala> val a = List("dog","cat","tiger")

2、scala> a.map(x => (x,1))

Summary:

	The map operator shows that programming with the RDD API is exactly the same as programming with Scala collections
		===> Scala collection operations absolutely must be mastered
		Spark can run standalone or distributed, while Scala collections are single-machine only. Looking at val b = a.map(x => (x,1)) alone, you cannot tell whether it is Scala or Spark code: developing a distributed application looks just like developing a single-machine program, so moving from single-machine development to distributed big-data development is seamless.

四、The RDD filter Operator Combined with map

filter: filters elements, keeping those for which the predicate is true

Operations in the Spark shell:

1、scala> val a = sc.parallelize( 1 to 10)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24

2、Find the numbers from 1 to 10 that are divisible by 2
scala> a.filter(_%2==0).collect
res11: Array[Int] = Array(2, 4, 6, 8, 10)	// the result is returned to the console

3、Find the numbers less than 5
scala> a.filter(_<5).collect
res12: Array[Int] = Array(1, 2, 3, 4)

**Example 1:**
1、scala> val a = sc.parallelize(1 to 6)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:24

2、scala> val mapRDD = a.map(_*2)
mapRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[13] at map at <console>:25

3、scala> val filterRDD = mapRDD.filter(_>5)
filterRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[14] at filter at <console>:25

4、scala> filterRDD.collect
res8: Array[Int] = Array(6, 8, 10, 12)

Step 1: map(_ * 2)		=>	Step 2: filter(_ > 5)

Writing it out step by step is verbose, so from now on we consistently use chained (fluent) calls ***** this is very important
scala> sc.parallelize(1 to 6).map(_*2).filter(_>5).collect
res9: Array[Int] = Array(6, 8, 10, 12)

For example, typical chained calls in jQuery (and Hibernate):
	$("#p1").css("color","red").slideUp(200).slideDown(200)
	getSession().createQuery(hql).setFirstResult(........).list

五、The mapValues Operator in Detail

Concept: as the name suggests, mapValues operates on the values of a (key, value) RDD.

Testing:
start spark-shell --master local[2]

1、scala> val a = sc.parallelize(List("cat","lion","tiger","dog","panda"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

2、scala> val b = a.map(x =>(x.length,x))
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[1] at map at <console>:25
// pair each string with its length: (length, string)

3、scala> b.collect
res: Array[(Int, String)] = Array((3,cat), (4,lion), (5,tiger), (3,dog), (5,panda))

4、scala> b.mapValues("x" + _ + "x").collect
res5: Array[(Int, String)] = Array((3,xcatx), (4,xlionx), (5,xtigerx), (3,xdogx), (5,xpandax))

	 Note: mapValues leaves the key alone and transforms only the value.
	mapValues acts only on the values, never on the keys.
	****A very useful little feature in practice: in production there are plenty of cases where you only want to change the values and keep the keys untouched.
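
For comparison, a sketch of the same operation written with plain map (my own illustration, not from the video): you get the same result, but you have to rebuild the key/value pair yourself, which is exactly what mapValues saves you from.

scala> b.mapValues("x" + _ + "x").collect                    // key untouched, only the value changes
scala> b.map { case (k, v) => (k, "x" + v + "x") }.collect   // same output, pair rebuilt by hand

(mapValues on a pair RDD also preserves the partitioner, while a plain map does not.)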

Homework 2: what is the difference between flatMap and map? (A hint sketch follows below.)
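
A hint for Homework 2, sketched in spark-shell: map produces exactly one output element per input element, while flatMap can produce zero or more per element and flattens them into a single RDD.

scala> val lines = sc.parallelize(List("hello world", "hi"))
scala> lines.map(_.split(" ")).collect       // Array(Array(hello, world), Array(hi))  -- one array per line
scala> lines.flatMap(_.split(" ")).collect   // Array(hello, world, hi)                -- flattened into words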

六、Common RDD Action Operators

1、count(): Return the number of elements in the dataset.

Operations in the Spark shell:

1、scala> val a = sc.parallelize(List("dog","lion","tiger","cat","panda"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at parallelize at <console>:24

2、scala> a.count
res13: Long = 5

2、reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative.

Operations in the Spark shell:

1、scala> val a = sc.parallelize(1 to 100)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:24

2、scala> a.reduce(_+_)     ==> a.reduce((x,y) => (x+y))		// these two expressions are equivalent
res15: Int = 5050

3、scala> a.reduce((x,y) => x+y)
res16: Int = 5050

4、scala> val b = sc.parallelize(1 to 100)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:24

5、scala> b.sum
res18: Double = 5050.0

6、scala> b.sum.toInt
res19: Int = 5050
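
Since the documentation says the reduce function should be commutative and associative, here is a small sketch of what can go wrong otherwise: subtraction is neither, so the result depends on how the elements are grouped across partitions.

scala> val nums = sc.parallelize(1 to 100)
scala> nums.reduce(_ + _)   // addition is commutative and associative: always 5050
scala> nums.reduce(_ - _)   // subtraction is not: the result can change with the number of partitions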

3、first

Operations in the Spark shell:

1、scala> val a = sc.parallelize(List("dog","lion","tiger","cat","panda"))

2、scala> a.first
res30: String = dog

3、scala> a.take(1)
res31: Array[String] = Array(dog)

4、scala> a.take(10)		// 10 is larger than the number of elements; why doesn't this throw an error?
res32: Array[String] = Array(dog, lion, tiger, cat, panda)

4、top

Operations in the Spark shell:

1、scala> val a = sc.parallelize(Array(6,9,4,7,19,16,8))
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24

2、scala> a.top(2)
res31: Array[Int] = Array(19, 16)

scala> val a = sc.parallelize(List("dog","cat","lion","tiger","ron","panda"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[14] at parallelize at <console>:24

scala> a.top(2)
res33: Array[String] = Array(tiger, panda)

For both Int and String, top returns the elements in descending order.

Extension: implicit ordering

Requirement: top sorts in descending order by default; how can we get ascending order?

Define an implicit reversed ordering:
implicit val myOrder = implicitly[Ordering[Int]].reverse

scala> val a = sc.parallelize(1 to 10)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:26

scala> a.top(2)
res25: Array[Int] = Array(1, 2)

scala> a.max
res26: Int = 1			// do not overuse implicits: in the current spark-shell session a.max is reversed and returns 1; open a new spark-shell and a.max correctly returns 10.
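
To avoid leaving a reversed implicit ordering lying around in the session, you can also pass the ordering explicitly to top (top takes an implicit Ordering parameter); in a fresh spark-shell without the implicit above:

scala> val a = sc.parallelize(1 to 10)
scala> a.top(2)                          // default descending order: Array(10, 9)
scala> a.top(2)(Ordering[Int].reverse)   // ascending for this one call only: Array(1, 2)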

Homework: read a SequenceFile with Spark Core.
Historical background: some Hive tables are stored as SequenceFile, and now we want to use Spark Core as the distributed compute framework to process them.
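
A hedged starting point for this homework: SparkContext has a sequenceFile method that takes the path plus the Hadoop Writable key/value classes. The path and the Text/IntWritable types below are placeholders; use whatever your Hive table actually stores.

scala> import org.apache.hadoop.io.{IntWritable, Text}
scala> val seq = sc.sequenceFile("/path/to/hive/table", classOf[Text], classOf[IntWritable])
scala> seq.map { case (k, v) => (k.toString, v.get) }.take(10)   // convert the Writables before collecting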

Homework 1: think about why the design where transformations do not trigger execution until an action is met makes Spark run more efficiently.

Homework 2: what is the difference between flatMap and map?

Homework 3: test takeSample, takeOrdered, saveAsTextFile, and saveAsSequenceFile on your own.
