RDD -- Analysis of Transformation Operators

RDD

RDD (Resilient Distributed Dataset) is an abstraction of distributed memory. RDDs provide a highly restricted shared-memory model: an RDD is a read-only, partitioned collection of records that can only be created through deterministic transformations (such as map, join, and groupBy) on other RDDs. These restrictions, however, make fault tolerance cheap to implement. To a developer, an RDD can be seen as a Spark object that lives in memory: reading a file yields an RDD, a computation over that file yields an RDD, and the result set is again an RDD. Partitions, dependencies between datasets, and key-value map data can all be viewed as RDDs. (Note: adapted from Baidu Baike.)
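
A minimal sketch of that idea (assuming an existing SparkContext named sc, as in the examples below; the file path is only a placeholder):

    val lines = sc.textFile("path/to/input.txt")  // reading a file gives an RDD
    val lengths = lines.map(_.length)             // each transformation produces a new RDD
    val longLines = lengths.filter(_ > 80)        // the result set is again an RDD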


Classification of RDD Operations

RDD operations fall into two kinds of operators: Transformations and Actions. The essential difference between them is whether they trigger job submission.
Transformation: only records the dependency and transformation relationships in the lineage; it does not trigger job submission.
Action: when such an operator is encountered, a job is submitted and the result is returned.
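
A minimal sketch of this difference (assuming an existing SparkContext named sc):

    val nums = sc.parallelize(1 to 5)
    val doubled = nums.map(_ * 2)      // Transformation: only the lineage is recorded, no job runs
    println(doubled.toDebugString)     // prints the recorded lineage
    val sum = doubled.reduce(_ + _)    // Action: a job is submitted and 30 is returned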


Transformation:

map(func)
filter(func)
flatMap(func)
mapPartitions(func)
mapPartitionsWithIndex(func)
sample(withReplacement, fraction, seed)
union(otherDataset)
intersection(otherDataset)
distinct([numPartitions])
groupByKey([numPartitions])
reduceByKey(func, [numPartitions])
aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])
sortByKey([ascending], [numPartitions])
join(otherDataset, [numPartitions])
cogroup(otherDataset, [numPartitions])
cartesian(otherDataset)
pipe(command, [envVars])
coalesce(numPartitions)
repartition(numPartitions)
repartitionAndSortWithinPartitions(partitioner)


RDD Inheritance Hierarchy

(Figure: the RDD class inheritance hierarchy)

map


Official API description

map(func) Return a new distributed dataset formed by passing each element of the source through a function func.

Source code
  /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

mapPartitions


Official API description

mapPartitions(func) Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

Source code (the snippet originally shown here was flatMap's implementation; the mapPartitions implementation is)
  /**
   * Return a new RDD by applying a function to each partition of this RDD.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner,
   * which should be `false` unless this is a pair RDD and the input function doesn't modify
   * the keys.
   */
  def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
      preservesPartitioning)
  }

All Transformation operators behave this way: they create a new RDD and do not submit a computation job.


Examples


map
map: applies a function to every element of the dataset.

def map[U: ClassTag](f: T => U): RDD[U]

    val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
    val b = a.map(x => (x.length, x))
    b.collect.foreach(println)

    //    (3,dog)
    //    (5,tiger)
    //    (4,lion)
    //    (3,cat)
    //    (7,panther)
    //    (5,eagle)

filter
filter: keeps only the elements for which the predicate returns true.

def filter(f: T => Boolean): RDD[T]

    val a = sc.parallelize(1 to 10, 3)
    val b = a.filter(_ % 2 == 0)
    b.collect.foreach(println)
    //    2
    //    4
    //    6
    //    8
    //    10

flatMap
flatMap is like map, but with an additional flattening step.

def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]

    val a = sc.parallelize(1 to 10, 5)
    a.flatMap(1 to _).collect.foreach(println)
    //    1
    //    1
    //    2
    //    1
    //    2
    //    3
    //    1
    //    2
    //    3
    //    4
    //    1
    //    2
    //    3
    //    4
    //    5
    //    1
    //    2
    //    3
    //    4
    //    5
    //    6
    //    1
    //    2
    //    3
    //    4
    //    5
    //    6
    //    7
    //    1
    //    2
    //    3
    //    4
    //    5
    //    6
    //    7
    //    8
    //    1
    //    2
    //    3
    //    4
    //    5
    //    6
    //    7
    //    8
    //    9
    //    1
    //    2
    //    3
    //    4
    //    5
    //    6
    //    7
    //    8
    //    9
    //    10

mapPartitions
mapPartitions: runs the transformation once per partition. Where map operates on individual elements, mapPartitions operates on whole partitions, so it is a useful optimization when the per-element work involves expensive resources such as database connections (a sketch of that pattern follows the pair example below).

def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

    val a = sc.parallelize(1 to 9, 3)

    // For each partition, emit pairs of consecutive elements.
    def myfunc[T](iter: Iterator[T]): Iterator[(T, T)] = {
      var res = List[(T, T)]()
      var pre = iter.next
      while (iter.hasNext) {
        val cur = iter.next
        res = (pre, cur) :: res // prepend, so pairs come out in reverse order within a partition
        pre = cur
      }
      res.iterator
    }

    a.mapPartitions(myfunc).collect.foreach(println)
    //    (2,3)
    //    (1,2)
    //    (5,6)
    //    (4,5)
    //    (8,9)
    //    (7,8)
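
As mentioned above, the typical win for mapPartitions is paying an expensive setup cost once per partition instead of once per element. A hedged sketch of that pattern, reusing the RDD a from the example above (Connection, createConnection and lookup are hypothetical stand-ins for whatever client you actually use):

    // Hypothetical client; stand-in for a real database or HTTP connection.
    class Connection {
      def lookup(x: Int): String = s"row-$x"
      def close(): Unit = ()
    }
    def createConnection(): Connection = new Connection

    val enriched = a.mapPartitions { iter =>
      val conn = createConnection()                   // opened once per partition, not once per element
      val out = iter.map(x => conn.lookup(x)).toList  // materialize before closing (iterators are lazy)
      conn.close()
      out.iterator
    }
    enriched.collect().foreach(println)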

mapPartitionsWithIndex
mapPartitionsWithIndex: works like mapPartitions, except that the function takes two parameters, the first being the partition index.
  def main(args: Array[String]): Unit = {
    //first()
    //second()
    third()
  }

  def first(): Unit = {
    val x = sc.parallelize(List(1, 2, 3, 4, 5, 7, 8, 9, 10), 3)

    def myfunc1(index: Int, iter: Iterator[Int]): Iterator[String] = {
      iter.map(x => index + ", " + x)
    }

    x.mapPartitionsWithIndex(myfunc1).collect().foreach(println)
    //    0, 1
    //    0, 2
    //    0, 3
    //    1, 4
    //    1, 5
    //    1, 7
    //    2, 8
    //    2, 9
    //    2, 10
  }

  def second(): Unit = {
    val randRDD = sc.parallelize(List((2, "cat"), (6, "mouse"), (7, "cup"), (3, "book"), (4, "tv"), (1, "screen"), (5, "heater")), 3)
    val rPartitioner = new RangePartitioner(3, randRDD)
    val partitioned = randRDD.partitionBy(rPartitioner)

    def myfunc2(index: Int, iter: Iterator[(Int, String)]): Iterator[String] = {
      iter.map(x => "[partID: " + index + ", val:" + x + "]")
    }

    partitioned.mapPartitionsWithIndex(myfunc2).collect().foreach(println)
    //    [partID: 0, val:(2,cat)]
    //    [partID: 0, val:(3,book)]
    //    [partID: 0, val:(1,screen)]
    //    [partID: 1, val:(4,tv)]
    //    [partID: 1, val:(5,heater)]
    //    [partID: 2, val:(6,mouse)]
    //    [partID: 2, val:(7,cup)]
  }

  def third(): Unit = {
    val z = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)

    def myfunc3(index: Int, iter: Iterator[Int]): Iterator[String] = {
      iter.map(x => "[partID:" + index + ", val:" + x + "]")
    }

    z.mapPartitionsWithIndex(myfunc3).collect().foreach(println)
    //    [partID:0, val:1]
    //    [partID:0, val:2]
    //    [partID:0, val:3]
    //    [partID:1, val:4]
    //    [partID:1, val:5]
    //    [partID:1, val:6]
  }

sample
sample: randomly samples a fraction of the elements of the original RDD into a new RDD.

def sample(withReplacement: Boolean, fraction: Double, seed: Long): RDD[T]

  def main(args: Array[String]): Unit = {
    first()
  }
  def first(): Unit ={
    val a = sc.parallelize(1 to 10000,3)
    a.sample(false,0.001,444).collect().foreach(println)
  }
    //    120
    //    424
    //    477
    //    2349
    //    2691
    //    2773
    //    2988
    //    5143
    //    6449
    //    6659
    //    9820

union, ++
union: merges two datasets (duplicate elements are not removed).

def ++(other: RDD[T]): RDD[T]
def union(other: RDD[T]): RDD[T]

    val a = sc.parallelize(1 to 7,1)
    val b = sc.parallelize(5 to 10,2)
    a.union(b).collect().foreach(println)
    a.++(b).collect().foreach(println)
    //    1
    //    2
    //    3
    //    4
    //    5
    //    6
    //    7
    //    5
    //    6
    //    7
    //    8
    //    9
    //    10

intersection
intersection: returns the intersection of the two datasets (duplicate elements are removed).

def intersection(other: RDD[T], numPartitions: Int): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
def intersection(other: RDD[T]): RDD[T]

    val x = sc.parallelize(1 to 20)
    val y = sc.parallelize(5 to 25)
    x.intersection(y).sortBy(x => x,true).collect().foreach(println)
    //    5
    //    6
    //    7
    //    8
    //    9
    //    10
    //    11
    //    12
    //    13
    //    14
    //    15
    //    16
    //    17
    //    18
    //    19
    //    20

distinct
distinct: removes duplicate elements.

def distinct(): RDD[T]
def distinct(numPartitions: Int): RDD[T]

    val x = sc.parallelize(1 to 10)
    x.union(x).distinct().collect().foreach(println)
    //    8
    //    1
    //    9
    //    10
    //    2
    //    3
    //    4
    //    5
    //    6
    //    7

groupByKey
Although groupByKey and reduceByKey both produce the correct result, reduceByKey is better suited to large datasets, because Spark knows it can combine values that share a key within each partition before the data is shuffled (a map-side combine).
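
A small word-count sketch of the difference (assuming the same SparkContext sc); both give the same result, but reduceByKey shuffles only one partial sum per key per partition:

    val words = sc.parallelize(List("dog", "cat", "dog", "owl", "cat", "dog"), 2)
    val pairs = words.map(w => (w, 1))

    // groupByKey ships every (word, 1) pair across the network, then sums on the reduce side
    pairs.groupByKey().mapValues(_.sum).collect().foreach(println)
    // reduceByKey sums within each partition first, then shuffles only the partial sums
    pairs.reduceByKey(_ + _).collect().foreach(println)
    //    (dog,3)
    //    (cat,2)
    //    (owl,1)   (output order may vary)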

reduceByKey
reduceByKey: analogous to the reduce phase in MapReduce.

def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]

    val a = sc.parallelize(List("dog","cat","owl","gnu","ant",""))
    val animal = sc.parallelize(List("Lion","Deer","Leopard","Monkey","Elephant","Chimpanzees","Horse","Bear","Donkey","Kangaroo","Ox","Hedgehog","Sheep","Rhinoceros"))
    val b = a.union(animal).map(x => (x.length,x))
    b.reduceByKey((x,y)=> x+",\t"+y).collect().foreach(println)
    //    (0,)
    //    (8,Elephant,	Kangaroo,	Hedgehog)
    //    (10,Rhinoceros)
    //    (2,Ox)
    //    (11,Chimpanzees)
    //    (3,dog,	cat,	owl,	gnu,	ant)
    //    (4,Lion,	Deer,	Bear)
    //    (5,Horse,	Sheep)
    //    (6,Monkey,	Donkey)
    //    (7,Leopard)

aggregateByKey
aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U): aggregates locally within each partition first, then globally across partitions.

zeroValue: the initial value for the per-partition aggregation
seqOp: how values are combined within a partition
combOp: how per-partition results are combined globally

def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, numPartitions: Int)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]

    val pairRDD = sc.parallelize(List(("cat", 2), ("cat", 5), ("mouse", 4), ("cat", 12), ("dog", 12), ("mouse", 2)))
    println(pairRDD.partitions.length)

    def func(index: Int, iter: Iterator[(String, Int)]): Iterator[String] = {
      iter.map(x => "partID:" + index + ",val:" + x)
    }

    pairRDD.mapPartitionsWithIndex(func).collect().foreach(println)
    //    partID:0,val:(cat,2)
    //    partID:1,val:(cat,5)
    //    partID:1,val:(mouse,4)
    //    partID:2,val:(cat,12)
    //    partID:3,val:(dog,12)
    //    partID:3,val:(mouse,2)
    pairRDD.aggregateByKey(0)(math.max(_, _), math.max(_, _)).collect().foreach(println)
    //    (dog,12)
    //    (mouse,4)
    //    (cat,12)

    pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect().foreach(println)
    //    (dog,12)
    //    (mouse,6)
    //    (cat,19)
    pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect().foreach(println)
    //    (dog,100)
    //    (mouse,200)
    //    (cat,300)
    pairRDD.aggregateByKey(100)(_ + _, _ + _).collect().foreach(println)
    //    (dog,112)
    //    (mouse,206)
    //    (cat,319)

join
join: joins elements that share the same key.

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

    val a = sc.parallelize(List("dog","salmon","salmon","rat","elephant"))
    val b = a.keyBy(_.length)
    b.collect().foreach(println)

    val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"))
    val d = c.keyBy(_.length)
    d.collect().foreach(println)
    b.join(d).collect().foreach(println)
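    //    Only keys that appear in both RDDs survive the join (lengths 3 and 6 here); elephant (8), wolf and bear (4) are dropped.
    //    e.g. (6,(salmon,salmon)), (6,(salmon,rabbit)), (3,(dog,cat)), ... (exact output order may vary)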

cogroup, groupWith
cogroup / groupWith: groups the values that share a key across up to three RDDs, producing one collection of values per source RDD.

def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def groupWith[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
def groupWith[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

    val a = sc.parallelize(List(1,2,1,3,4,5,1,2,3,1,2,3))
    val b = a.map(x=>(x,"b"))
    b.collect().foreach(println)
    val c = a.map((_,"c"))
    val d = a.map(x=>(x,"d"))
    c.collect().foreach(println)
    d.collect().foreach(println)
    b.cogroup(c).collect().foreach(println)
    b.groupWith(c).collect().foreach(println)
    b.cogroup(c,d).collect().foreach(println)
    val x = sc.parallelize(List((1,"apple"),(2,"banana"),(3,"orange"),(4,"kiwi")))
    val y = sc.parallelize(List((5,"computer"),(1,"laptop"),(1,"desktop"),(4,"iPad")))
    x.cogroup(y).collect().foreach(println)

repartitionAndSortWithinPartitions
repartitionAndSortWithinPartitions: repartitions the RDD according to the given partitioner and, within each resulting partition, sorts the records by key.

def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)]

    val randRDD = sc.parallelize(List((2,"cat"),(6,"mouse"),(7,"cup"),(3,"book"),(4,"tv"),(1,"screen"),(5,"heater")),3)
    val rPartitioner = new RangePartitioner(3,randRDD)
    val partitioned2 = randRDD.repartitionAndSortWithinPartitions(rPartitioner)
    def myfunc2(index: Int, iter: Iterator[(Int, String)]): Iterator[String] = {
      iter.map(x => "partID:" + index + ", val:" + x)
    }

    partitioned2.mapPartitionsWithIndex(myfunc2).collect().foreach(println)

The carriage rolls ever onward; the horses trot far into the distance.

You journey east of the eastern hills, and then farther east still; would that I could take wing and chase the west wind.

Let me be a star and you the moon, night after night our light shining clear together.

The moon may darken for a while, but the stars stay ever bright.

I will keep my light and wait for the moon's return, when on the fifteenth night we will be full and radiant together.

