Value-Type Transformation Operators on Spark RDDs (Part 1)

Spark operators fall roughly into three categories:
1) Value-type Transformation operators. These transformations do not trigger job submission; the data items they process are Value-type.
2) Key-Value-type Transformation operators. These transformations do not trigger job submission; the data items they process are Key-Value pairs.
3) Action operators. These trigger the SparkContext to submit a job.

Value-type Transformation operators can be divided into the following types according to the relationship between the input and output partitions of the RDD transformation:
1) One input partition to one output partition: map, flatMap, mapPartitions, glom
2) Many input partitions to one output partition: union, cartesian
3) Many input partitions to many output partitions: groupBy
4) Output partitions are a subset of the input partitions: filter, distinct, subtract, sample, takeSample
5) A special one-to-one type: the Cache operators, which cache RDD partitions.


1. map
Each element of the dataset is passed through a user-defined function to form a new RDD, called a MappedRDD.
val a = sc.parallelize(List("dog","cat","hippopotamus","sheep","pig"),3);
val b = a.map(_.length);
val c = a.zip(b);
c.collect


The zip function combines two RDDs into an RDD of key/value pairs.
Result:
res3: Array[(String, Int)] = Array((dog,3), (cat,3), (hippopotamus,12), (sheep,5), (pig,3))

val a = sc.parallelize(List("dog","cat","hippopotamus","sheep","pig"),3);
val b = a.map(_.split(","));
b.collect;

Result:
res4: Array[Array[String]] = Array(Array(dog), Array(cat), Array(hippopotamus), Array(sheep), Array(pig))


2. flatMap
Similar to map, but each input element can be mapped to zero or more output elements; the results are then flattened into a single output.
val a = sc.parallelize(1 to 10, 5);
a.flatMap(1 to _).collect;

Result:
res7: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
sc.parallelize(List(1,2,3,2)).flatMap(x => List(x,x,x)).collect;

Result:
res7: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3, 2, 2, 2)

3. mapPartitions
Similar to map, but where map is applied to each element of each partition, mapPartitions is applied to each partition as a whole; its function has the type Iterator[T] => Iterator[U]. Given N elements in M partitions, map's function is called N times, while mapPartitions' function is called only M times. When the mapping repeatedly creates expensive objects, mapPartitions can be much more efficient than map. For example, when writing data to a database, map would create a connection object for every element, whereas mapPartitions creates one connection per partition (see the sketch after the example below).
val name = List(("name","zhangsan"),("sex","man"),("address","pek"),("username","zhangsan"));
val rdd = sc.parallelize(name,2);
rdd.mapPartitions(x => x.filter(_._2 == "zhangsan")).foreachPartition(p => {
  println(p.toList)
  println("===== partition boundary =====")
})

Result:
List((name,zhangsan))
===== partition boundary =====
List((username,zhangsan))
===== partition boundary =====
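
The database example above can be sketched as follows; createConnection and writeRecord are hypothetical placeholders for a real database client, not Spark APIs. The iterator is materialized with toList so the connection is not closed before the lazy map runs:
val data = sc.parallelize(1 to 100, 4);
data.mapPartitions { iter =>
  val conn = createConnection()                              // hypothetical helper: one connection per partition
  val results = iter.map(r => writeRecord(conn, r)).toList   // hypothetical write call; forced before closing
  conn.close()                                               // closed once per partition, not per element
  results.iterator
}.count;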


4. glom
glom converts the elements of type T in each partition of the RDD into an array Array[T].
val number = sc.parallelize(1 to 100, 3);
number.glom.collect;


Result:
res12: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33), Array(34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66), Array(67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100))

5. union
union merges the data of two RDDs and returns their union; if the same element exists in both RDDs, it is not deduplicated.
val a = sc.parallelize(1 to 3, 1);
val b = sc.parallelize(1 to 7,1);
(a ++ b).collect;

Result:
res15: Array[Int] = Array(1, 2, 3, 1, 2, 3, 4, 5, 6, 7)

6. cartesian
cartesian performs a Cartesian product over all elements of the two RDDs.
val x = sc.parallelize(List(1,2,3,4,5));
val y = sc.parallelize(List(6,7,8,9,10));
x.cartesian(y).collect;

Result:
res16: Array[(Int, Int)] = Array((1,6), (1,7), (2,6), (2,7), (1,8), (1,9), (1,10), (2,8), (2,9), (2,10), (3,6), (3,7), (4,6), (4,7), (5,6), (5,7), (3,8), (3,9), (3,10), (4,8), (4,9), (4,10), (5,8), (5,9), (5,10))


7. groupBy
groupBy applies a user-defined function to each element to generate a key, and groups elements with the same key together, e.g. the even numbers (2, 4, 6, 8) under the key "even".
val a = sc.parallelize(1 to 9, 3);
a.groupBy(x => { if (x%2 ==0) "even" else "odd"}).collect;

Result:
res17: Array[(String, Iterable[Int])] = Array((even,CompactBuffer(2, 4, 6, 8)), (odd,CompactBuffer(1, 3, 5, 7, 9)))


8. filter
filter applies the function f to each element and keeps the elements for which f returns true; elements for which it returns false are filtered out.
val a = sc.parallelize(1 to 10 ,3);
val b = a.filter(_ % 2 ==0);
b.collect;

Result:
res18: Array[Int] = Array(2, 4, 6, 8, 10)


9. distinct
distinct removes duplicate elements.
val str = sc.parallelize(List("abc","adc","qwe","aaa","adc"),2);
str.distinct.collect;

Result:
res19: Array[String] = Array(adc, abc, qwe, aaa)


10. subtract
subtract returns the set difference: the elements of one RDD, with any elements that also appear in the other RDD removed.
val a = sc.parallelize(1 to 6, 3);
val b = sc.parallelize(1 to 3, 3);
val c = a.subtract(b);
c.collect;

Result:
res21: Array[Int] = Array(6, 4, 5)


11. sample
sample randomly draws approximately the given fraction of the data using the specified random seed; withReplacement indicates whether drawn elements are put back: true samples with replacement, false samples without replacement.
val a = sc.parallelize(1 to 10000,3);
a.sample(false ,0.1,0).count;


Result:

res22: Long = 1032
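
For comparison, a minimal sketch of sampling with replacement (withReplacement = true), where the same element may be drawn more than once; the exact output depends on the seed:
val small = sc.parallelize(1 to 10);
small.sample(true, 1.0, 7).collect;  // with replacement, duplicates may appear in the result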



12. takeSample
takeSample works on the same principle as sample, but instead of sampling by a relative fraction it samples a fixed number of elements, and the result is no longer an RDD: it is as if collect() were applied to the sampled data, so the result is a local array on the driver.
val x = sc.parallelize(1 to 1000,3);
x.takeSample(true,100,1);

Result:
res25: Array[Int] = Array(764, 815, 274, 452, 39, 538, 238, 544, 475, 480, 416, 868, 517, 363, 39, 316, 37, 90, 210, 202, 335, 773, 572, 243, 354, 305, 584, 820, 528, 749, 188, 366, 913, 667, 214, 540, 807, 738, 204, 968, 39, 863, 541, 703, 397, 489, 172, 29, 211, 542, 600, 977, 941, 923, 900, 485, 575, 650, 258, 31, 737, 155, 685, 562, 223, 675, 330, 864, 291, 536, 392, 108, 188, 408, 475, 565, 873, 504, 34, 343, 79, 493, 868, 974, 973, 110, 587, 457, 739, 745, 977, 800, 783, 59, 276, 987, 160, 351, 515, 901)



13. cache, persist
cache and persist both cache an RDD so that later uses do not have to recompute it, which can greatly reduce a program's running time.
val c = sc.parallelize(List("aaa","bbb","ccc","ddd","aaa"),2);
c.getStorageLevel;


Result:
org.apache.spark.storage.StorageLevel = StorageLevel(1 replicas)
c.cache;
c.getStorageLevel;

Result:
res28: org.apache.spark.storage.StorageLevel = StorageLevel(memory, deserialized, 1 replicas)
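
persist is the general form: cache is equivalent to persist(StorageLevel.MEMORY_ONLY). A short sketch of choosing an explicit storage level, using Spark's StorageLevel constants:
import org.apache.spark.storage.StorageLevel;
val d = sc.parallelize(List("aaa","bbb","ccc"), 2);
d.persist(StorageLevel.MEMORY_AND_DISK);  // spill partitions to disk when memory is insufficient
d.getStorageLevel;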