Spark RDD Programming: Common Transformation Operators

浊酒倾壶

Article link: https://blog.csdn.net/qq_41519227/article/details/84638186

Common Transformation Operators in Spark RDD Programming

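What Is an RDD

An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitionable collection whose elements can be computed in parallel. In Spark, every operation on data comes down to one of three things: creating an RDD, transforming an existing RDD, or calling an action on an RDD to evaluate it.

In Spark an RDD is represented as an object, and you transform it by calling methods on that object. After defining an RDD through a series of transformations, you call an action to trigger its computation. In other words, until an action such as collect() runs, the preceding transformations only record what should be done; nothing is actually executed. The transformations run only when an action is invoked, whether that action returns a result to the driver (count, collect, and so on) or saves data to a storage system (saveAsTextFile and so on).

In short, Spark computes an RDD only when it encounters an action (lazy evaluation), which lets it pipeline multiple transformations together at run time.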
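The lazy behaviour described above can be observed directly. The following is a minimal sketch, not part of the original article: it assumes Spark 2.x or later (for `longAccumulator`) and local mode, and the object name `LazyEvalSketch` and the accumulator name are made up for illustration. It counts how many times the function passed to map has run, before and after an action.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; names are illustrative only.
object LazyEvalSketch {
  // Returns (map calls counted before the action, map calls counted after it).
  def run(sc: SparkContext): (Long, Long) = {
    val calls = sc.longAccumulator("mapCalls")
    // Defining the transformation only records it; the function is not run yet.
    val mapped = sc.parallelize(1 to 4, 2).map { x => calls.add(1); x + 1 }
    val before = calls.sum
    mapped.count() // the action triggers the actual computation
    val after = calls.sum
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lazy-eval-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(run(sc)) // expected in local mode without task retries: (0,4)
    finally sc.stop()
  }
}
```

The accumulator is used here only as a counter for the demonstration; in a real job, accumulator updates made inside transformations can be applied more than once if a task is re-executed.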

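To use Spark, a developer writes a Driver program, which is submitted to the cluster and schedules work on the Workers. The Driver defines one or more RDDs and calls actions on them, while the Workers carry out the computation tasks for the RDD partitions.

Creating RDDs

There are roughly three ways to create an RDD in Spark: (1) from a collection (parallelize and makeRDD); (2) from external storage (textFile); (3) from another RDD (applying an operator to an existing RDD yields another RDD).

(1) Creating an RDD from a collection (parallelize and makeRDD)

Creating with parallelize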

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CreateRDDOps1 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("scala-trans") // application name
    conf.setMaster("local[2]") // run locally with 2 cores

    val context = new SparkContext(conf)

    // Approach 1, first function: build an RDD from a Scala/Java collection with context.parallelize
    val array = Array(1, 2, 3)
    // An RDD is a dataset, much like a collection, so its element type matches that of the source collection
    val arrayRdd: RDD[Int] = context.parallelize(array) // turn the collection into an RDD
    val mapRDD: RDD[Int] = arrayRdd.map(_ + 1) // add one to every element of the RDD
    mapRDD.collect().foreach(println)
    context.stop()
  }
}

Creating with makeRDD

import org.apache.spark.{SparkConf, SparkContext}

object CreateRDDOps1 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("scala-trans")
    conf.setMaster("local[2]")

    val context = new SparkContext(conf)
    // Approach 1, second function: makeRDD, which calls parallelize under the hood
    val list = List(
      (1, List(1, 2, 3)),
      (2, List(4, 5, 6)),
      (3, List(7, 8, 9))
    )
    val listRDD = context.makeRDD(list, 3)
    // Label the contents of each partition with its partition index
    val part = listRDD.mapPartitionsWithIndex((i, items) => {
      Iterator(i + "=====" + items.mkString("[", ",", "]"))
    })

    val array1: Array[String] = part.collect()
    println(array1.toList)

    context.stop()
  }
}

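(2) Creating an RDD from external storage

import org.apache.spark.{SparkConf, SparkContext}

object ScalaWordCountOps {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("scalajob")
    conf.setMaster("local[2]")
    val context = new SparkContext(conf)
    context.textFile("d:/a.txt") // read the file
      .flatMap(_.split(" ")) // split each line on spaces
      .map((_, 1)) // turn each word into a (word, 1) key-value pair
      .reduceByKey(_ + _) // count occurrences, keyed by word
      .collect() // pull the data back to the driver (mainly so it can be printed to the console)
      .foreach(println) // print every element

    context.stop() // release resources
  }
}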

Common Transformation Operators on RDDs

(1) map

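Meaning: returns a new RDD formed by passing each input element through the function func.
For example, applying a function that adds 1 to each element of List(1,2,3) yields List(2,3,4).

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RDDProgram {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("scala-trans")
    conf.setMaster("local[1]")

    val context = new SparkContext(conf)
    // map operator (add one to every element of the given collection)
    mapOps(context)
    context.stop()


  }
  // map operator
  def mapOps(context: SparkContext): Unit = {

    // rangRDD
    val rangRDD: RDD[Int] = context.makeRDD(1 to 10)

    // The verbose way, with a named function
    //    def add(a: Int): Int = {
    //      println("element of rangRDD: " + a)
    //      a + 1
    //    }
    //   val mapRDD: RDD[Int] = rangRDD.map(add)

    // The concise way, with a placeholder function
    val mapRDD: RDD[Int] = rangRDD.map(_ + 1) // nothing has run yet; it runs only once an action is called

    /**
      * Use collect() with care: if the computed data is very large, pulling all of it
      * back to the driver may cause an OOM on the driver node
      * (java.lang.OutOfMemoryError, heap exhausted).
      * Know the size of the data before pulling it back,
      * or filter it first.
      */
    val array: Array[Int] = mapRDD.collect()
    array.foreach(println)
  }
}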

(2) filter

Meaning: returns a new RDD made up of the input elements for which func returns true.
[Note] The function passed to filter must return a Boolean.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RDDProgram {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("scala-trans")
    conf.setMaster("local[1]")

    val context = new SparkContext(conf)
    // filter operator (from a given group of people, keep those older than 30)
    filterOps(context)
    context.stop()
  }
  // filter operator (keeps the elements for which the function returns true and returns a new RDD)
  def filterOps(context: SparkContext): Unit = {
    val personRDD: RDD[(String, Int)] = context.parallelize(List(("zs", 20), ("ls", 31), ("ww", 25), ("ml", 35)))

    //    // Option 1: the verbose way, with a named function
    //    def filterPer(person: (String, Int)): Boolean = {
    //      // keep the person if the age is over 30
    //      if (person._2 > 30) {
    //        return true
    //      } else {
    //        return false
    //      }
    //    }
    //val filterRDD: RDD[(String, Int)] = personRDD.filter(filterPer)

    // Option 2: an anonymous function
    //    val filterRDD: RDD[(String, Int)] = personRDD.filter((tuple: (String, Int)) => {
    //      if (tuple._2 > 30)
    //        true
    //      else false
    //    })
    // Option 3: the concise way, with a placeholder function
    val filterRDD: RDD[(String, Int)] = personRDD.filter(_._2 > 30)
    val array: Array[(String, Int)] = filterRDD.collect()

    for (elem <- array) {
      println(elem)
    }

  }
}

(3) flatMap

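Meaning: similar to map, but each input element can be mapped to zero or more output elements (so func should return a sequence rather than a single element).
For example, take List("aaa","bbb","ccc") and a function that converts lowercase letters to uppercase. map yields List("AAA","BBB","CCC"), whereas flatMap flattens each resulting string into its characters and yields List('A','A','A','B','B','B','C','C','C').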
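The difference between one output per input (map) and zero or more outputs per input (flatMap) is easiest to see side by side. This is a small sketch rather than code from the original article; it assumes local mode, and the object name `MapVsFlatMapSketch` and the sample lines are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; names and data are illustrative only.
object MapVsFlatMapSketch {
  // Returns (result of map, result of flatMap) for the same split function.
  def run(sc: SparkContext): (List[List[String]], List[String]) = {
    val lines = sc.parallelize(List("a b", "c"), 1)
    // map: exactly one output element per line, here a list of words
    val mapped = lines.map(_.split(" ").toList).collect().toList
    // flatMap: the per-line word sequences are flattened into one RDD of words
    val flattened = lines.flatMap(_.split(" ")).collect().toList
    (mapped, flattened)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("map-vs-flatmap-sketch").setMaster("local[1]")
    val sc = new SparkContext(conf)
    try {
      val (mapped, flattened) = run(sc)
      println(mapped)    // List(List(a, b), List(c))
      println(flattened) // List(a, b, c)
    } finally sc.stop()
  }
}
```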

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RDDProgram {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("scala-trans")
    conf.setMaster("local[1]")

    val context = new SparkContext(conf)
    // flatMap operator (split the text, line by line, into individual words)
    flatmapOps(context)
    context.stop()
  }
  // flatMap operator (maps each element to a sequence and flattens the results)
  def flatmapOps(context: SparkContext): Unit = {
    // RDD of lines of the text file, hence RDD[String]
    val sourceRDD: RDD[String] = context.textFile("d:/a.txt")

    // The verbose way, with a named function
    //    def filterword(a:String): Array[String] ={
    //      a.split(" ")
    //    }
    //    sourceRDD.flatMap(filterword).foreach(println)

    // An anonymous function
    //sourceRDD.flatMap((line:String) => line.split(" ")).foreach(println)

    // The concise way, with a placeholder function
    sourceRDD.flatMap(_.split(" ")).foreach(println)

  }
}

(4) mapPartitions

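Meaning: similar to map, but runs separately on each partition of the RDD, so on an RDD of type T
     the function func must have type Iterator[T] => Iterator[U]. Given N elements in M partitions, the function passed to map is called N times while the function passed to mapPartitions is called M times; each call processes one entire partition.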
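The N-calls-versus-M-calls claim can be checked with two counters. The sketch below is an illustration under assumptions, not the article's code: it assumes Spark 2.x or later and local mode, and `CallCountSketch` and the accumulator names are hypothetical. It uses 10 elements in 3 partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; names are illustrative only.
object CallCountSketch {
  // Returns (number of calls made to the map function,
  //          number of calls made to the mapPartitions function).
  def run(sc: SparkContext): (Long, Long) = {
    val rdd = sc.parallelize(1 to 10, 3)
    val mapCalls = sc.longAccumulator("mapCalls")
    val partitionCalls = sc.longAccumulator("mapPartitionsCalls")

    // The map function is invoked once per element.
    rdd.map { x => mapCalls.add(1); x + 2 }.count()
    // The mapPartitions function is invoked once per partition and
    // receives an iterator over that partition's elements.
    rdd.mapPartitions { iter => partitionCalls.add(1); iter.map(_ + 2) }.count()

    (mapCalls.sum, partitionCalls.sum)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("call-count-sketch").setMaster("local[1]")
    val sc = new SparkContext(conf)
    try println(run(sc)) // expected in local mode without task retries: (10,3)
    finally sc.stop()
  }
}
```

Returning `iter.map(_ + 2)` keeps the per-partition work lazy, so the partition is never materialised as a whole in memory. Building an intermediate list inside the function works too, but it holds the entire partition in memory at once.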

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RDDProgram {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("scala-trans")
    conf.setMaster("local[1]")

    val context = new SparkContext(conf)
    // mapPartitions operator: similar to map but better in some situations; the function runs once per partition. Here it adds 2 to every element of the given collection.
    mappartitionOps(context)
    context.stop()
  }
  // mapPartitions operator
  def mappartitionOps(context: SparkContext): Unit = {
    val rangRDD: RDD[Int] = context.makeRDD(1 to 10, 3)

    // A List is immutable by default, so the list itself never changes; each operation returns a new list object
    def mappartitionDef(iterator: Iterator[Int]): Iterator[Int] = {
      // This must be a var, otherwise it cannot be reassigned below
      var ints = List[Int]()
      while (iterator.hasNext) {
        val newNum = iterator.next() + 2
        // Pitfall: `:+` returns a new list and leaves the original untouched, so the result
        // must be assigned back to ints; otherwise ints stays the initial empty list
        ints = ints :+ newNum
        print(ints)
      }
      ints.iterator

// An alternative that gives the same result
//      // Collect the processed elements into a new iterator with the yield keyword
//      val res = for (e <- iterator) yield e + 2
//
//      res
    }

    val fields = rangRDD.mapPartitions(mappartitionDef, false).collect()
    for (elem <- fields) {
      println(elem)
    }
  }
}

(5) mapPartitionsWithIndex

Meaning: similar to mapPartitions, but func takes an extra integer argument giving the partition index, so on an RDD of type T the function func must have type (Int, Iterator[T]) => Iterator[U].

import org.apache.spark.{SparkConf, SparkContext}

object RDDProgram {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("scala-trans")
    conf.setMaster("local[1]")

    val context = new SparkContext(conf)
    // mapPartitionsWithIndex: report which data each partition holds
    mappartitionWithIndexOps(context)
    context.stop()
  }
  // mapPartitionsWithIndex operator (fetch all the data of each partition together with its partition index)
  def mappartitionWithIndexOps(context: SparkContext): Unit = {
    val partRDD = context.parallelize(1 to 10, 3)
    partRDD.mapPartitionsWithIndex((i, iter) => {
      Iterator("partition " + i + " data: " + iter.mkString("[", ",", "]"))
    }).collect().foreach(println)
  }
}

(6) Seven operators: sample, takeSample, union, intersection, partitionBy, reduceByKey, groupByKey

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RDDTransformation1 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("RDDName").setMaster("local[2]")
    val context = new SparkContext(conf)
    // sample: randomly sample about 10% of 100 records (not necessarily exactly 10, but around 10)
    sampleOps(context)

    // takeSample: sample a fixed number of elements
    takesakpleOps(context)

    // union: return a new RDD that is the union of two RDDs
    unionOps(context)

    // intersection: return the intersection of two RDDs
    intersectionOps(context)

    // partitionBy
    partitionByOps(context)

    // reduceByKey: word count
    reduceByKeyOps(context)

    // groupByKey: count the number of male and female students
    groupByKeyOps(context)


    context.stop()
  }


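  /** 11111
    *
    * sample operator (samples the dataset, either with or without replacement)
    * Typically used on large datasets to estimate the whole from a sample
    * Often used to estimate which keys are likely to cause data skew
    * On small datasets the estimate is not necessarily accurate
    *
    * @param context
    */
  def sampleOps(context: SparkContext): Unit = {
    val dataRDD = context.parallelize(1 to 100)

    // Sample 10% of the data without replacement
    //val sampleRDD = dataRDD.sample(false, 0.1)

    // Sample with replacement
    val sampleRDD = dataRDD.sample(true, 0.1)
    val count = sampleRDD.count()
    println("========number of sampled elements: " + count)
    sampleRDD.collect().foreach(println)
  }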


  /** 22222
    *
    * takeSample
    * Differences from sample:
    * 1 sample returns a new RDD, whereas takeSample returns an Array to the driver
    *   (so takeSample is strictly an action rather than a transformation)
    * 2 sample draws by fraction, whereas takeSample draws a fixed number of elements
    *
    * @param context
    */
  def takesakpleOps(context: SparkContext): Unit = {
    val dataRDD = context.parallelize(1 to 100)

    // Sample with replacement
    //    val dataArray: Array[Int] = dataRDD.takeSample(true, 20)

    // Sample without replacement
    val dataArray: Array[Int] = dataRDD.takeSample(false, 20)
    println("number of sampled elements: " + dataArray.length)
    println(dataArray.toList)
  }


  /** 33333
    *
    * union operator (similar to UNION ALL in SQL)
    * Returns a new RDD that is the union of the source RDD and the argument RDD (duplicates are not removed)
    * The combined dataset may contain duplicates; call distinct() to remove them
    * After distinct() the original order of the data is no longer preserved
    *
    * @param context
    */
  def unionOps(context: SparkContext): Unit = {
    val RDD1 = context.parallelize(1 to 10)
    val RDD2 = context.parallelize(5 to 20)

    val unionRDD: RDD[Int] = RDD1.union(RDD2)

    // Remove duplicates (skip this step if deduplication is not needed)
    val distinctRdd: RDD[Int] = unionRDD.distinct()
    println(distinctRdd.collect().toList)
  }


  /** 44444
    *
    * intersection operator
    * Returns a new RDD that is the intersection of the source RDD and the argument RDD
    *
    * @param context
    */
  def intersectionOps(context: SparkContext): Unit = {
    val RDD1 = context.parallelize(1 to 10)
    val RDD2 = context.parallelize(5 to 20)

    val interRdd: RDD[Int] = RDD1.intersection(RDD2)
    interRdd.collect().foreach(println(_))
  }

  /** 55555
    *
    * partitionBy operator
    * Note: it works on [key-value RDDs]
    * Repartitions the RDD. If the RDD's existing partitioner equals the one passed in, nothing is repartitioned; otherwise a ShuffledRDD is produced.
    * For example, suppose the RDD was created with 3 partitions and partitionBy is given a hash partitioner with 3 partitions: because the data
    * was not laid out by hash partitioning when the RDD was created, it is repartitioned by hash. A different partition count also triggers a hash repartition.
    *
    * @param context
    */
  def partitionByOps(context: SparkContext): Unit = {
    val RDD1: RDD[(Int, Int)] = context.parallelize(List((1, 1), (2, 2), (3, 3)))
    // Layout before partitionBy
    RDD1.mapPartitionsWithIndex((i, iter) => {
      val partData = iter.mkString("[", ",", "]")
      Iterator("partition " + i + "=====" + "data: " + partData)
    }).collect().foreach(println)


    // Layout after hash partitioning into 3 partitions
    val hashRdd: RDD[(Int, Int)] = RDD1.partitionBy(new HashPartitioner(3))
    hashRdd.mapPartitionsWithIndex((i, iter) => {
      Iterator("partition " + i + "=====" + "data: " + iter.mkString("[", ",", "]"))
    }).collect().foreach(println)
  }

  /** 66666
    *
    * reduceByKey operator
    * [key-value]
    * Called on an RDD of (K, V) pairs; returns an RDD of (K, V) pairs in which the values of each key
    * are aggregated with the given reduce function.
    * The number of reduce tasks can be set through an optional second argument
    *
    * @param context
    */
  def reduceByKeyOps(context: SparkContext): Unit = {
    val lines = List("zs li hadoop", "hbase hadoop zs", "li hbase hadoop")

    val lineRdd = context.makeRDD(lines)
    // Split the lines into individual words
    val wordRdd = lineRdd.flatMap(_.split(" "))
    // Turn each word into a (word, 1) key-value pair
    val wordtupleRDD: RDD[(String, Int)] = wordRdd.map((_, 1))
    wordtupleRDD.mapPartitionsWithIndex((i, iter) => {
      Iterator("partition " + i + "=====" + "data: " + iter.mkString("[", ",", "]"))
    }).collect().foreach(println(_))

    // Count with reduceByKey
    def sum(a: Int, b: Int): Int = {
      a + b

    }

    val wordcountRDD: RDD[(String, Int)] = wordtupleRDD.reduceByKey(sum)
    wordcountRDD.mapPartitionsWithIndex((i, iter) => {
      Iterator("partition " + i + "=====" + "data: " + iter.mkString("[", ",", "]"))
    }).collect().foreach(println(_))

    wordcountRDD.collect().foreach(println(_))

    /* Concise Scala word count

   context.makeRDD(lines)
      .flatMap(_.split(" "))
      .map((_,1))
      .reduceByKey(_+_)
      .collect()
      .foreach(println(_))*/

  }

  /** 77777
    *
    * groupByKey operator
    * [key-value]
    * groupByKey also works per key, but it only gathers each key's values into a single sequence (an Iterable) without reducing them.
    *
    * @param context
    */
  def groupByKeyOps(context: SparkContext): Unit = {
    // The input file is expected to hold tab-separated student records with the gender in the third field
    val lineRDD: RDD[String] = context.textFile("d:/a.txt")

    val tupleRDD: RDD[(String, String)] = lineRDD.map(line => {
      val fields = line.split("\t")
      // Use the gender as the key
      (fields(2), line)
    })

    // Group by gender
    val byKeyRDD: RDD[(String, Iterable[String])] = tupleRDD.groupByKey()
    // Count the people in each gender group
    val sexRDD: RDD[(String, Int)] = byKeyRDD.map(tuple => {

      (tuple._1, tuple._2.size)
    })
    sexRDD.collect().foreach(println(_))
  }

}
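reduceByKey and groupByKey, the last two operators in the object above, can both be used for counting, and comparing them makes the difference concrete. The sketch below is illustrative rather than the author's code: it reuses the three sample lines from reduceByKeyOps, assumes local mode, and the object name `ReduceVsGroupSketch` is made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; the name is illustrative only.
object ReduceVsGroupSketch {
  // Returns the word counts computed via reduceByKey and via groupByKey.
  def run(sc: SparkContext): (Map[String, Int], Map[String, Int]) = {
    val lines = List("zs li hadoop", "hbase hadoop zs", "li hbase hadoop")
    val pairs = sc.parallelize(lines, 2).flatMap(_.split(" ")).map((_, 1))

    // reduceByKey merges the values of each key with the given function.
    val viaReduce = pairs.reduceByKey(_ + _).collect().toMap
    // groupByKey only collects each key's values; the counting happens afterwards.
    val viaGroup = pairs.groupByKey().mapValues(_.size).collect().toMap
    (viaReduce, viaGroup)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("reduce-vs-group-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (viaReduce, viaGroup) = run(sc)
      println(viaReduce) // counts: zs -> 2, li -> 2, hadoop -> 3, hbase -> 2
      println(viaReduce == viaGroup) // true
    } finally sc.stop()
  }
}
```

Both routes give the same counts. reduceByKey can combine values inside each partition before the shuffle, while groupByKey ships every value across the network, so reduceByKey is usually the better fit for plain aggregations.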

(7) combineByKey, aggregateByKey, foldByKey, sortByKey, sortBy, join, cogroup, coalesce, repartition
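Of the operators in this group, aggregateByKey is probably the easiest to misread, because its result depends on how the data is split across partitions. The sketch below walks through the "maximum within each partition, then sum across partitions" case on six pairs in three partitions. It is an illustration under assumptions: local mode, an invented object name `AggregateByKeySketch`, and the expectation that the list is split into three contiguous slices of two elements each (the sketch prints the layout with glom() so this can be checked).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; the name is illustrative only.
object AggregateByKeySketch {
  // Returns (contents of each partition, aggregated result per key).
  def run(sc: SparkContext): (List[List[(Int, Int)]], Map[Int, Int]) = {
    val rdd = sc.makeRDD(List((1, 3), (1, 2), (1, 4), (2, 3), (3, 6), (3, 8)), 3)
    // glom() turns every partition into an array, exposing the layout.
    val layout = rdd.glom().collect().map(_.toList).toList

    val result = rdd.aggregateByKey(0)(
      (acc, v) => math.max(acc, v), // seqOp: running maximum inside one partition
      _ + _                         // combOp: add up the per-partition maxima
    ).collect().toMap
    (layout, result)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("aggregate-by-key-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (layout, result) = run(sc)
      // Assumed layout: [(1,3),(1,2)]  [(1,4),(2,3)]  [(3,6),(3,8)]
      layout.foreach(println)
      // Under that layout: key 1 -> max(3,2) + max(4) = 7, key 2 -> 3, key 3 -> max(6,8) = 8
      println(result)
    } finally sc.stop()
  }
}
```

If all three values of key 1 landed in the same partition, the result for key 1 would be 4 instead of 7, which is why inspecting the layout is worthwhile when experimenting with this operator.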

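import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RDDTransformation2 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("transformation3").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // combineByKey: compute each student's average score
    //    combinebykeyOps(sc)

    // aggregateByKey: for each key, sum the maxima found in each partition
    //    aggregateByKeyOps(sc)

    // foldByKey: given an initial value, add up the values of identical keys starting from that initial value in each partition
    //    foldByKeyOps(sc)

    // sortByKey: sort the students of a class by height
    //    sortByKeyOps(sc)

    // sortBy: choose what to sort by
    //    sortByOps(sc)

    // join: join two tables
    //    joinOps(sc)

    // cogroup
    //    cogroupOps(sc)

    // coalesce: repartition after filtering the data
    //    coalesceOps(sc)

    // repartition
    repartitionOps(sc)

    sc.stop()
  }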


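  /** 11111
    *
    * combineByKey
    * For each key K, combines its values V into a combined value of type C.
    * createCombiner: V => C, called the first time a key is seen within a partition
    * mergeValue: (C, V) => C, called when the same key is seen again within that partition, to merge in the new value
    * mergeCombiners: (C, C) => C, merges the combiners of the same key across all partitions
    *
    * @param sc
    */
  def combinebykeyOps(sc: SparkContext): Unit = {
    val rdd1 = sc.makeRDD(List(("Fred", 88.0), ("Fred", 95.0), ("Fred", 91.0), ("Wilma", 93.0), ("Wilma", 95.0), ("Wilma", 98.0)))

    // The combiner is a (sum of scores, number of scores) pair
    val comRDD: RDD[(String, (Double, Int))] = rdd1.combineByKey(
      v => (v, 1),
      (c: (Double, Int), v) => (c._1 + v, c._2 + 1),
      (c1: (Double, Int), c2: (Double, Int)) => (c1._1 + c2._1, c1._2 + c2._2)
    )

    // average = sum / count
    val resRDD: RDD[(String, Double)] = comRDD.map(comrdd => (comrdd._1, comrdd._2._1 / comrdd._2._2))

    // collect() first so the output is printed on the driver rather than on the executors
    resRDD.collect().foreach(println(_))
  }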


  /** 22222
    *
    * aggregateByKey
    * (zeroValue: U)
    * (seqOp: (U, V) => U,    merges values within a partition
    * combOp: (U, U) => U)    merges the per-partition results across partitions
    * Here: for each key, take the maximum value within each partition, then sum those maxima
    *
    * @param sc
    */
  def aggregateByKeyOps(sc: SparkContext): Unit = {
    val rdd1 = sc.makeRDD(List((1, 3), (1, 2), (1, 4), (2, 3), (3, 6), (3, 8)), 3)

    rdd1.mapPartitionsWithIndex((i, item) => {
      Iterator("partition " + i + "=======" + item.mkString("[", ",", "]"))
    }).collect().foreach(println(_))
    val resRDD: RDD[(Int, Int)] = rdd1.aggregateByKey(0)((a: Int, u: Int) => math.max(a, u), (u1, u2) => u1 + u2)
    // collect() first so the output is printed on the driver rather than on the executors
    resRDD.collect().foreach(println(_))

  }


  /** 33333
    *
    * foldByKey
    * A simplified aggregateByKey in which seqOp and combOp are the same function
    *
    * @param sc
    */
  def foldByKeyOps(sc: SparkContext): Unit = {

    val rdd1 = sc.makeRDD(List(("Mobin", 2), ("Mobin", 1), ("Lucy", 2), ("Amy", 1), ("Lucy", 3)), 3)
    rdd1.mapPartitionsWithIndex((i, item) => {
      Iterator("partition " + i + "=======" + item.mkString("[", ",", "]"))
    }).collect().foreach(println(_))

    val resRDD = rdd1.foldByKey(0)((a: Int, b: Int) => a + b)
    // collect() first so the output is printed on the driver rather than on the executors
    resRDD.collect().foreach(println(_))
  }

  /** 44444
    *
    * sortByKey
    * Called on an RDD of (K, V) pairs for which an implicit Ordering[K] is available; returns an RDD of (K, V) pairs sorted by key
    *
    * @param sc
    */
  def sortByKeyOps(sc: SparkContext): Unit = {
    val stuList = sc.makeRDD(List("张三,165", "李四,188", "王五,158", "周六,175", "孙七,173", "王八,175"))

    // Put the height first so that it becomes the key
    val mapRDD: RDD[(Int, String)] = stuList.map(s => {
      val fields = s.split(",")
      (fields(1).trim.toInt, fields(0).trim)
    })

    // The Boolean argument says whether to sort in ascending order; the default is true (ascending)
    val resRDD = mapRDD.sortByKey(false).map(tuple => (tuple._2, tuple._1))
    resRDD.collect().foreach(println(_))


  }

  /** 55555
    *
    * sortBy
    * Similar to sortByKey but more flexible: func can first transform each element, and the elements are sorted by comparing the transformed values.
    *
    * @param sc
    */
  def sortByOps(sc: SparkContext): Unit = {
    val sourRDD: RDD[String] = sc.makeRDD(List(
      "张三, 165",
      "李四, 188",
      "王五, 158",
      "周六, 175",
      "孙七, 173",
      "王八, 175"
    ))
    // Parse the height as an Int so the sort is numeric rather than lexicographic on the raw string
    sourRDD.sortBy(line => line.split(",")(1).trim.toInt, false).collect().foreach(println(_))

    //    val mapRDD: RDD[(String, Int)] = sourRDD.map(s => {
    //      val fields = s.split(",")
    //      (fields(0).trim, fields(1).trim.toInt)
    //    })
    //
    //    mapRDD.sortBy(tuple => tuple._2, false).collect().foreach(println(_))


  }


  /** 66666
    *
    * join
    * Called on RDDs of type (K, V) and (K, W); returns an RDD of (K, (V, W)) containing every pair of elements that share a key
    *
    * @param sc
    */
  def joinOps(sc: SparkContext): Unit = {
    // Student table
    val rdd1 = sc.parallelize(List(
      "001 : zhangsan",
      "002 : lisi",
      "003 : wangwu"
    ))
    // Score table
    val rdd2 = sc.parallelize(List(
      "003 : 90.0",
      "002 : 60.0",
      "001 : 70.0"
    ))

    // Parse both tables into (id, value) pairs
    val maprddd1 = rdd1.map(line => {
      val fields = line.split(":")
      (fields(0).trim, fields(1).trim)
    })

    val maprddd2: RDD[(String, String)] = rdd2.map(line => {
      val fields = line.split(":")
      (fields(0).trim, fields(1).trim)
    })

    val rdd3: RDD[(String, (String, String))] = maprddd1.join(maprddd2)

    rdd3.collect().foreach(println(_))

  }


  /** 77777
    *
    * cogroup
    * Called on RDDs of type (K, V) and (K, W); returns an RDD of type (K, (Iterable[V], Iterable[W]))
    *
    * @param sc
    */
  def cogroupOps(sc: SparkContext): Unit = {
    val list1 = List((1, 2), (1, 3), (1, 4), (2, 3), (2, 3), (2, 5), (2, 6), (3, 4))
    val list2 = List((1, 5), (1, 6), (1, 7), (2, 4), (2, 5), (3, 4), (3, 5))
    val rdd1 = sc.makeRDD(list1)
    val rdd2 = sc.makeRDD(list2)

    val rdd3: RDD[(Int, (Iterable[Int], Iterable[Int]))] = rdd1.cogroup(rdd2)

    rdd3.collect().foreach(println(_))
  }

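  /** 88888
    *
    * coalesce
    * Reduces the number of partitions; useful after filtering a large dataset, to make work on the resulting smaller dataset more efficient
    *
    * @param sc
    */
  def coalesceOps(sc: SparkContext): Unit = {
    val rdd1 = sc.parallelize(1 to 100, 4)
    val rdd2 = rdd1.filter(_ % 3 == 0)
    println(rdd2.getNumPartitions) // partition count before coalesce
    val rdd3 = rdd2.coalesce(2)
    println(rdd3.getNumPartitions) // partition count after coalesce
    rdd3.collect()
  }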


  /** 99999
    *
    * repartition
    * Reshuffles all the data over the network into the given number of partitions.
    *
    * @param sc
    */
  def repartitionOps(sc: SparkContext): Unit = {
    val rdd1 = sc.parallelize(1 to 1000, 4)
    println(rdd1.getNumPartitions) // partition count before repartition
    val rdd2 = rdd1.repartition(2)
    println(rdd2.getNumPartitions) // partition count after repartition
    rdd2.collect()
  }




}
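The last two operators in the object above, coalesce and repartition, differ mainly in whether a shuffle takes place. The following sketch is illustrative and not from the original article: it assumes local mode and uses an invented object name, `CoalesceVsRepartitionSketch`. It compares the partition counts each call produces on an RDD that starts with 4 partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; the name is illustrative only.
object CoalesceVsRepartitionSketch {
  // Returns the partition counts after four different calls on a 4-partition RDD.
  def run(sc: SparkContext): (Int, Int, Int, Int) = {
    val rdd = sc.parallelize(1 to 100, 4)
    (
      rdd.coalesce(2).getNumPartitions,                 // shrinking works without a shuffle
      rdd.coalesce(8).getNumPartitions,                 // growing without a shuffle has no effect
      rdd.coalesce(8, shuffle = true).getNumPartitions, // growing needs a shuffle
      rdd.repartition(8).getNumPartitions               // repartition always shuffles
    )
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("coalesce-vs-repartition-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(run(sc)) // expected: (2,4,8,8)
    finally sc.stop()
  }
}
```

A reasonable rule of thumb that follows from this: use coalesce to cut the partition count after a filter, and repartition (or coalesce with shuffle = true) when the count has to grow or the data has to be rebalanced.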
