Value-Type Transformation Operators on Spark RDDs (Part 1)

Spark operators fall roughly into three categories:
1) Value-type Transformation operators. These transformations do not trigger job submission; the data items they process are Value-type.
2) Key-Value-type Transformation operators. These transformations do not trigger job submission; the data items they process are Key-Value pairs.
3) Action operators. These trigger the SparkContext to submit a job.

Value-type Transformation operators can be divided into the following types according to the relationship between the input and output partitions of the RDD transformation:
1) One input partition to one output partition: map, flatMap, mapPartitions, glom
2) Many input partitions to one output partition: union, cartesian
3) Many input partitions to many output partitions: groupBy
4) Output partitions are a subset of the input partitions: filter, distinct, subtract, sample, takeSample
5) A special one-to-one type: the Cache operators, which cache RDD partitions.


1. map
Each element of the dataset is passed through a user-defined function to form a new RDD, called a MappedRDD.
val a = sc.parallelize(List("dog","cat","hippopotamus","sheep","pig"),3);
val b = a.map(_.length);
val c = a.zip(b);
c.collect


The zip function combines two RDDs into an RDD of key/value pairs.
Result:
res3: Array[(String, Int)] = Array((dog,3), (cat,3), (hippopotamus,12), (sheep,5), (pig,3))

val a = sc.parallelize(List("dog","cat","hippopotamus","sheep","pig"),3);
val b = a.map(_.split(","));
b.collect;

Result:
res4: Array[Array[String]] = Array(Array(dog), Array(cat), Array(hippopotamus), Array(sheep), Array(pig))


2. flatMap
Similar to map, but each input element can be mapped to zero or more output elements; the results are then flattened into a single output.
val a = sc.parallelize(1 to 10, 5);
a.flatMap(1 to _).collect;

Result:
res7: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
sc.parallelize(List(1,2,3,2)).flatMap(x => List(x,x,x)).collect;

Result:
res7: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3, 2, 2, 2)

3. mapPartitions
Similar to map, but where map is applied to each element of each partition, mapPartitions is applied to each partition as a whole; its function has the type Iterator[T] => Iterator[U]. Given N elements in M partitions, map's function is called N times, while mapPartitions' function is called only M times. When the mapping repeatedly creates expensive objects, mapPartitions can be much more efficient than map. For example, when writing data to a database, map would create a connection object for every element, whereas mapPartitions creates one connection per partition (see the sketch after the example below).
val name = List(("name","zhangsan"),("sex","man"),("address","pek"),("username","zhangsan"));
val rdd = sc.parallelize(name,2);
rdd.mapPartitions(x => x.filter(_._2 == "zhangsan")).foreachPartition(p => {
  println(p.toList)
  println("===== partition boundary =====")
})

Result:
List((name,zhangsan))
===== partition boundary =====
List((username,zhangsan))
===== partition boundary =====
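
The database example above can be sketched as follows; createConnection and writeRecord are hypothetical placeholders for a real database client, not Spark APIs. The iterator is materialized with toList so the connection is not closed before the lazy map runs:
val data = sc.parallelize(1 to 100, 4);
data.mapPartitions { iter =>
  val conn = createConnection()                              // hypothetical helper: one connection per partition
  val results = iter.map(r => writeRecord(conn, r)).toList   // hypothetical write call; forced before closing
  conn.close()                                               // closed once per partition, not per element
  results.iterator
}.count;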


4. glom
glom converts the elements of type T in each partition of the RDD into an array Array[T].
val number = sc.parallelize(1 to 100, 3);
number.glom.collect;


Result:
res12: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33), Array(34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66), Array(67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100))

5. union
union merges the data of two RDDs and returns their union; if the same element exists in both RDDs, it is not deduplicated.
val a = sc.parallelize(1 to 3, 1);
val b = sc.parallelize(1 to 7,1);
(a ++ b).collect;

Result:
res15: Array[Int] = Array(1, 2, 3, 1, 2, 3, 4, 5, 6, 7)

6. cartesian
cartesian performs a Cartesian product over all elements of the two RDDs.
val x = sc.parallelize(List(1,2,3,4,5));
val y = sc.parallelize(List(6,7,8,9,10));
x.cartesian(y).collect;

Result:
res16: Array[(Int, Int)] = Array((1,6), (1,7), (2,6), (2,7), (1,8), (1,9), (1,10), (2,8), (2,9), (2,10), (3,6), (3,7), (4,6), (4,7), (5,6), (5,7), (3,8), (3,9), (3,10), (4,8), (4,9), (4,10), (5,8), (5,9), (5,10))


7. groupBy
groupBy applies a user-defined function to each element to generate a key, and groups elements with the same key together, e.g. the even numbers (2, 4, 6, 8) under the key "even".
val a = sc.parallelize(1 to 9, 3);
a.groupBy(x => { if (x%2 ==0) "even" else "odd"}).collect;

Result:
res17: Array[(String, Iterable[Int])] = Array((even,CompactBuffer(2, 4, 6, 8)), (odd,CompactBuffer(1, 3, 5, 7, 9)))


8. filter
filter applies the function f to each element and keeps the elements for which f returns true; elements for which it returns false are filtered out.
val a = sc.parallelize(1 to 10 ,3);
val b = a.filter(_ % 2 ==0);
b.collect;

Result:
res18: Array[Int] = Array(2, 4, 6, 8, 10)


9. distinct
distinct removes duplicate elements.
val str = sc.parallelize(List("abc","adc","qwe","aaa","adc"),2);
str.distinct.collect;

Result:
res19: Array[String] = Array(adc, abc, qwe, aaa)


10. subtract
subtract returns the set difference: the elements of one RDD, with any elements that also appear in the other RDD removed.
val a = sc.parallelize(1 to 6, 3);
val b = sc.parallelize(1 to 3, 3);
val c = a.subtract(b);
c.collect;

Result:
res21: Array[Int] = Array(6, 4, 5)


11. sample
sample randomly draws approximately the given fraction of the data using the specified random seed; withReplacement indicates whether drawn elements are put back: true samples with replacement, false samples without replacement.
val a = sc.parallelize(1 to 10000,3);
a.sample(false ,0.1,0).count;


Result:

res22: Long = 1032
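
For comparison, a minimal sketch of sampling with replacement (withReplacement = true), where the same element may be drawn more than once; the exact output depends on the seed:
val small = sc.parallelize(1 to 10);
small.sample(true, 1.0, 7).collect;  // with replacement, duplicates may appear in the result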



12. takeSample
takeSample works on the same principle as sample, but instead of sampling by a relative fraction it samples a fixed number of elements, and the result is no longer an RDD: it is as if collect() were applied to the sampled data, so the result is a local array on the driver.
val x = sc.parallelize(1 to 1000,3);
x.takeSample(true,100,1);

Result:
res25: Array[Int] = Array(764, 815, 274, 452, 39, 538, 238, 544, 475, 480, 416, 868, 517, 363, 39, 316, 37, 90, 210, 202, 335, 773, 572, 243, 354, 305, 584, 820, 528, 749, 188, 366, 913, 667, 214, 540, 807, 738, 204, 968, 39, 863, 541, 703, 397, 489, 172, 29, 211, 542, 600, 977, 941, 923, 900, 485, 575, 650, 258, 31, 737, 155, 685, 562, 223, 675, 330, 864, 291, 536, 392, 108, 188, 408, 475, 565, 873, 504, 34, 343, 79, 493, 868, 974, 973, 110, 587, 457, 739, 745, 977, 800, 783, 59, 276, 987, 160, 351, 515, 901)



13. cache, persist
cache and persist both cache an RDD so that later uses do not have to recompute it, which can greatly reduce a program's running time.
val c = sc.parallelize(List("aaa","bbb","ccc","ddd","aaa"),2);
c.getStorageLevel;


Result:
org.apache.spark.storage.StorageLevel = StorageLevel(1 replicas)
c.cache;
c.getStorageLevel;

Result:
res28: org.apache.spark.storage.StorageLevel = StorageLevel(memory, deserialized, 1 replicas)
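
persist is the general form: cache is equivalent to persist(StorageLevel.MEMORY_ONLY). A short sketch of choosing an explicit storage level, using Spark's StorageLevel constants:
import org.apache.spark.storage.StorageLevel;
val d = sc.parallelize(List("aaa","bbb","ccc"), 2);
d.persist(StorageLevel.MEMORY_AND_DISK);  // spill partitions to disk when memory is insufficient
d.getStorageLevel;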