Common Spark Operators

Common Spark operators, with descriptions and examples

  1. map: passes every element of the RDD through the given function, producing one new element per input element. Input and output partitions correspond one to one: the output RDD has exactly as many partitions as the input.
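    A minimal sketch (values are illustrative):
        val nums = sc.parallelize(List(1, 2, 3, 4), 2)
        nums.map(_ * 2).collect   // Array(2, 4, 6, 8)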
  2. flatMap: like map, but each input element may produce zero or more output elements, and all results are flattened into a single collection.
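    A minimal sketch contrasting it with map (values are illustrative):
        val lines = sc.parallelize(List("a b", "c d"))
        lines.flatMap(_.split(" ")).collect   // Array(a, b, c, d)
        lines.map(_.split(" ")).collect       // Array(Array(a, b), Array(c, d))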
  3. distinct: removes duplicate elements from the RDD; for an RDD built from an Array[String], each String object is compared as a complete string.
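    A minimal sketch (result order may vary, since distinct shuffles):
        sc.parallelize(List(1, 2, 2, 3, 3, 3)).distinct.collect   // Array(1, 2, 3)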
  4. coalesce: changes the number of partitions of the RDD, returning a new RDD. It takes two parameters: the target partition count, and a Boolean shuffle flag that defaults to false. Shrinking the partition count works with shuffle left false; growing it requires shuffle = true. Typical use: after filter or a similar narrowing operation leaves each partition nearly empty, consider coalescing to fewer partitions (a combined sketch follows the next item).
  5. repartition: changes the number of partitions, always reshuffling the data; it is equivalent to coalesce(numPartitions, shuffle = true), so it can both grow and shrink the partition count.
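    A sketch of the shuffle-flag behavior described above (partition counts are illustrative):
        val rdd = sc.parallelize(1 to 8, 4)
        rdd.coalesce(2).partitions.length         // 2 -- shrinking needs no shuffle
        rdd.coalesce(8).partitions.length         // still 4 -- growing without shuffle = true is a no-op
        rdd.coalesce(8, true).partitions.length   // 8
        rdd.repartition(8).partitions.length      // 8 -- same as coalesce(8, true)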
  6. randomSplit:
    def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]
    Description: randomly splits the RDD according to the given weights, returning one RDD per weight.
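    A minimal sketch (the 70/30 weights and the seed are illustrative; weights that do not sum to 1 are normalized):
        val rdd = sc.parallelize(1 to 100)
        val Array(train, test) = rdd.randomSplit(Array(0.7, 0.3), seed = 42L)
        train.count + test.count   // 100 -- every element lands in exactly one split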
  7. glom: gathers the elements of each partition into an array, returning an RDD[Array[T]] with one array per partition.
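    A minimal sketch (the partitioning is illustrative):
        sc.parallelize(1 to 6, 2).glom.collect   // Array(Array(1, 2, 3), Array(4, 5, 6))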
  8. union: set union; merges two RDDs without deduplicating.
  9. subtract: set difference; returns the elements of the first RDD that do not appear in the second.
  10. intersection: set intersection; the result is deduplicated (a combined sketch of items 8-10 follows).
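    A combined sketch of the three set operators (shuffled results may come back in any order):
        val a = sc.parallelize(List(1, 2, 3, 3))
        val b = sc.parallelize(List(3, 4))
        a.union(b).collect          // Array(1, 2, 3, 3, 3, 4) -- duplicates kept
        a.subtract(b).collect       // Array(1, 2)
        a.intersection(b).collect   // Array(3) -- deduplicated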
  11. mapPartitions: operates on one whole partition at a time. Typical use: for database work, apply mapPartitions so that a connection object (conn) is instantiated once per partition rather than once per element, as in the sketch below.
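    A per-partition-connection sketch; createConnection and runQuery are hypothetical stand-ins for a real database client:
        rdd.mapPartitions { iter =>                           // rdd: an existing RDD of query inputs
          val conn = createConnection()                       // hypothetical: opened once per partition
          val out = iter.map(x => runQuery(conn, x)).toList   // toList forces evaluation before the connection closes
          conn.close()
          out.iterator
        }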
  12. mapPartitionsWithIndex: like mapPartitions, but the function also receives the index of the partition it is processing.
    val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
    def myfunc(index: Int, iter: Iterator[Int]): Iterator[String] = {
      iter.map(x => index + "," + x)
    }
    Note: iter: Iterator[Int] -- the Iterator[T] element type must match the RDD's element type
    x.mapPartitionsWithIndex(myfunc).collect()
    res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9, 2,10)

  13. zip: pairs up two RDDs element by element. 1. The two RDDs may have different element types; 2. both RDDs must have the same number of partitions; 3. corresponding partitions must contain the same number of elements.
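    A minimal sketch (both RDDs have 3 partitions with matching element counts):
        val a = sc.parallelize(List(1, 2, 3), 3)
        val b = sc.parallelize(List("a", "b", "c"), 3)
        a.zip(b).collect   // Array((1,a), (2,b), (3,c))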
  14. zipPartitions: zips RDDs partition by partition; requires only that the RDDs have the same number of partitions.
  15. zipWithIndex: pairs each element of the RDD with its index, producing a new RDD[(T, Long)].
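    A minimal sketch (indices run in partition order; Spark triggers a job to compute partition offsets when there is more than one partition):
        sc.parallelize(List("a", "b", "c", "d"), 2).zipWithIndex.collect
        // Array((a,0), (b,1), (c,2), (d,3))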
  16. zipWithUniqueId:
    def zipWithUniqueId(): RDD[(T, Long)]
    
    	scala> val rdd = sc.parallelize(List(1,2,3,4,5),2)
    	rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[21] at parallelize at <console>:24
    
    	scala> rdd.glom.collect
    	res25: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4, 5))
    	scala> val rdd2 = rdd.zipWithUniqueId()
    	rdd2: org.apache.spark.rdd.RDD[(Int, Long)] = MapPartitionsRDD[23] at zipWithUniqueId at <console>:26
    
    	scala> rdd2.collect
    	res26: Array[(Int, Long)] = Array((1,0), (2,2), (3,1), (4,3), (5,5))
        Rule: in an RDD with n partitions, the k-th element (k = 0, 1, ...) of partition i receives the id i + k * n; here n = 2:
             step 1: the first element of partition 0 gets id 0; the first element of partition 1 gets id 1
             step 2: each subsequent element adds the partition count to the previous id in its partition:
                     partition 0: 0, 0+2 = 2; partition 1: 1, 1+2 = 3, 3+2 = 5

  17. reduceByKey:
    def reduceByKey(func: (V, V) => V): RDD[(K, V)]
            Description: merges the values of each key using the given function
    	 val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
    	 val b = a.map(x => (x.length, x))
    	 b.reduceByKey(_ + _).collect
    	 res86: Array[(Int, String)] = Array((3,dogcatowlgnuant))
    
    	 val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
    	 val b = a.map(x => (x.length, x))
    	 b.reduceByKey(_ + _).collect
    	 res87: Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle))


  18. groupByKey():
    def groupByKey(): RDD[(K, Iterable[V])]
    	 Description: groups the values with the same key together; returns RDD[(K, Iterable[V])]
    	  val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
    	  val b = a.keyBy(_.length)
    	  b.groupByKey.collect
    	  res11: Array[(Int, Seq[String])] = Array((4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)), (3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger, eagle)))


  19. keyBy:
    def keyBy[K](f: T => K): RDD[(K, T)]
    	Description: uses the return value of f as the key and pairs it with each element, forming a pair RDD: RDD[(K, T)]
    	val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
    	val b = a.keyBy(_.length)
    	b.collect
    	res26: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon), (3,rat), (8,elephant))


  20. keys:
    def keys: RDD[K]
           Description: returns an RDD containing only the keys
    	val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
    	val b = a.map(x => (x.length, x))
    	b.keys.collect
    	res2: Array[Int] = Array(3, 5, 4, 3, 7, 5)


  21. values:
    def values: RDD[V]
            Description: returns an RDD containing only the values
    	val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
    	val b = a.map(x => (x.length, x))
    	b.values.collect
    	res3: Array[String] = Array(dog, tiger, lion, cat, panther, eagle)


  22. sortByKey:
    def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.size): RDD[P]
            Description: sorts the RDD by key; ascending defaults to true (ascending order)
    	val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
    	val b = sc.parallelize(1 to a.count.toInt, 2)
    	val c = a.zip(b)
    	c.sortByKey(true).collect
    	res74: Array[(String, Int)] = Array((ant,5), (cat,2), (dog,1), (gnu,4), (owl,3))
    	c.sortByKey(false).collect
    	res75: Array[(String, Int)] = Array((owl,3), (gnu,4), (dog,1), (cat,2), (ant,5))


  23. partitionBy:
    def partitionBy(partitioner: Partitioner): RDD[(K, V)]
            Description: repartitions the RDD using the given Partitioner
    	scala> val rdd = sc.parallelize(List((1,"a"),(2,"b"),(3,"c"),(4,"d")),2)
    	rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[26] at parallelize at <console>:24
    
    	scala> rdd.glom.collect
    	res28: Array[Array[(Int, String)]] = Array(Array((1,a), (2,b)), Array((3,c), (4,d)))
    	
    	scala> val rdd1=rdd.partitionBy(new org.apache.spark.HashPartitioner(2))
    	rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[28] at partitionBy at <console>:26
    
    	scala> rdd1.glom.collect
    	res29: Array[Array[(Int, String)]] = Array(Array((4,d), (2,b)), Array((1,a), (3,c)))
  24. join:
    def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
           Description: inner-joins two pair RDDs by key
    	val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
    	val b = a.keyBy(_.length)
    	val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
    	val d = c.keyBy(_.length)
    	b.join(d).collect


  25. rightOuterJoin: joins two pair RDDs, keeping every key of the second (right-hand) RDD; missing values from the first RDD become None (right outer join)

  26. leftOuterJoin: joins two pair RDDs, keeping every key of the first (left-hand) RDD; missing values from the second RDD become None (left outer join)

  27. cogroup: groups the values with the same key from both RDDs together, keeping every key from either side (a full-outer-join-style grouping); a combined sketch of items 25-27 follows
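    A combined sketch of items 25-27 (keys are illustrative; result order may vary):
        val b = sc.parallelize(List((3, "dog"), (4, "lion")))
        val d = sc.parallelize(List((3, "cat"), (5, "bee")))
        b.leftOuterJoin(d).collect
        // Array((3,(dog,Some(cat))), (4,(lion,None)))   -- every key of b kept
        b.rightOuterJoin(d).collect
        // Array((3,(Some(dog),cat)), (5,(None,bee)))    -- every key of d kept
        b.cogroup(d).collect
        // Array((3,(CompactBuffer(dog),CompactBuffer(cat))),
        //       (4,(CompactBuffer(lion),CompactBuffer())),
        //       (5,(CompactBuffer(),CompactBuffer(bee))))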
