[Big Data Development] SparkCore: Advanced Operators, Action Operators, and Three Ways to Check the Partition Count

Author: 这个妹妹我见过

Article link: https://blog.csdn.net/weixin_37090394/article/details/108862821

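In the Spark source code, the capital V type parameter stands for value. rdd.getNumPartitions returns the number of partitions.
Every Transformation operator returns a new RDD (RDD[T], or RDD[(K, V)] for key-value data).
An Action operator generally does not return an RDD; it returns a concrete value to the driver or writes output.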

1. Advanced Operators

Setup

    val sc: SparkContext = new SparkContext(new SparkConf().setMaster("local").setAppName("RDD-complicated"))

1.1 aggregateByKey

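 * Transformation operator: aggregateByKey
 * An aggregation, similar to reduceByKey and foldByKey
 * aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U, combOp: (U, U) => U)
 * Values in the RDD that share a key are treated as one group and aggregated
 * - Within each partition, the values of each key are folded with seqOp, starting from the initial value zeroValue
 * - Once every partition is done, the per-partition results of the same key are merged with combOp; zeroValue is not used in this step
 *
 * zeroValue: initial value of the in-partition computation, like zeroValue in foldByKey (its type must match the first parameter of seqOp)
 * seqOp: aggregation within a partition; folds the values of each key and uses zeroValue
 * combOp: aggregation across partitions; merges the results of the same key from different partitions and does not use zeroValue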

    @Test def aggregateByKeyTest(): Unit = {
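        // 1. Prepare the data (2 partitions)
        val rdd: RDD[(String, Int)] = sc.parallelize(Array(("贝吉塔", 3), ("贝吉塔", 4), ("卡卡罗特", 5), ("樱木花道", 6), ("卡卡罗特", 7), ("贝吉塔", 8)), 2)
        // 2. Inside a partition take the max of zeroValue (5) and the values; across partitions add the results
        val res0: RDD[(String, Int)] = rdd.aggregateByKey(5)(Math.max, _ + _)
        res0.foreach(println)

        // Partition 0: 贝吉塔 -> 5   卡卡罗特 -> 5
        // Partition 1: 贝吉塔 -> 8   卡卡罗特 -> 7   樱木花道 -> 6
        //
        // Result:      贝吉塔 -> 13  卡卡罗特 -> 12  樱木花道 -> 6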
    }
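
The two-step behaviour described above can be reproduced without Spark. The following is only a sketch on plain Scala collections: aggregateByKeyLocal and the explicit list of partitions are hypothetical stand-ins for illustration, not Spark APIs, and the split into two partitions assumes the even slicing that sc.parallelize(..., 2) applies to six elements.

```scala
object AggregateByKeySketch {
  // Hypothetical helper: mimics aggregateByKey on plain collections.
  // Each inner Seq stands for one partition of the pair RDD.
  def aggregateByKeyLocal[K, V, U](partitions: Seq[Seq[(K, V)]], zeroValue: U)(
      seqOp: (U, V) => U, combOp: (U, U) => U): Map[K, U] = {
    // Step 1: inside each partition, fold every key's values starting from zeroValue.
    val perPartition: Seq[Map[K, U]] = partitions.map { part =>
      part.groupBy(_._1).map { case (k, kvs) =>
        k -> kvs.map(_._2).foldLeft(zeroValue)(seqOp)
      }
    }
    // Step 2: merge the per-partition results of each key with combOp (zeroValue is not used).
    perPartition.flatten.groupBy(_._1).map { case (k, kus) =>
      k -> kus.map(_._2).reduce(combOp)
    }
  }

  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq(("贝吉塔", 3), ("贝吉塔", 4), ("卡卡罗特", 5)),
    Seq(("樱木花道", 6), ("卡卡罗特", 7), ("贝吉塔", 8))
  )

  val result: Map[String, Int] =
    aggregateByKeyLocal[String, Int, Int](partitions, 5)(
      (u: Int, v: Int) => math.max(u, v),
      (a: Int, b: Int) => a + b)

  def main(args: Array[String]): Unit = result.foreach(println)
}
```

Because zeroValue (5) takes part in every partition where a key appears, changing the number of partitions can change the result.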


1.2 combineByKey

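 * Transformation operator: combineByKey
 * Applies to key-value RDDs (pair RDDs)
 *
 * combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]
 *
 * createCombiner:
 *      combineByKey visits every element of a partition in turn;
 *      the first time a key is seen in a partition, this function is called to create the initial value of a custom accumulator
 * mergeValue:
 *      if the key has already been seen in this partition, its accumulator exists, and this function merges the new value into it
 * mergeCombiners:
 *      the same key can appear in several partitions; at the end the accumulators (combiners) of the same key from different partitions are merged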

    @Test def combineByKeyTest(): Unit = {
        // 1. Prepare the data (a single partition, so mergeCombiners is never called in this run)
        val rdd: RDD[(String, Int)] = sc.parallelize(Array(("chinese", 98), ("math", 97),
            ("math", 94), ("english", 82), ("math", 88), ("chinese", 98)), 1)
        // chinese 98 98 => 196 2
        // math 97 94 88 => 279 3
        // english 82    => 82  1

        // 2. Goal: compute the total score and the number of scores for each subject
        val res: RDD[(String, (Int, Int))] = rdd.combineByKey(createCombiner => {
            println(s"First time this key is seen, value = $createCombiner; creating an accumulator")
            // Return the initial accumulator tuple: _1 is the running total, _2 is the running count
            (createCombiner, 1)
        }, (mergeValue: (Int, Int), v) => {
            // mergeValue._1 holds the subject's running total, mergeValue._2 the number of scores seen so far
            println(s"Key seen before, v = $v, mergeValue._1 = ${mergeValue._1}, mergeValue._2 = ${mergeValue._2}")
            (mergeValue._1 + v, mergeValue._2 + 1)
        }, (mergeCombiners1: (Int, Int), mergeCombiners2: (Int, Int)) => {
            println(s"Merging across partitions, mergeCombiners1 = $mergeCombiners1, mergeCombiners2 = $mergeCombiners2")
            (mergeCombiners1._1 + mergeCombiners2._1, mergeCombiners1._2 + mergeCombiners2._2)
        })

        // Equivalent short form:
        // rdd.combineByKey((_, 1), (c: (Int, Int), v) => (c._1 + v, c._2 + 1), (c1: (Int, Int), c2: (Int, Int)) => (c1._1 + c2._1, c1._2 + c2._2))

        res.foreach(println)
    }
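
To see all three callbacks fire, including mergeCombiners, the same data can be split over two partitions. The sketch below simulates this on plain Scala collections: combineByKeyLocal and the hand-made partition split are assumptions for illustration, not Spark code. It also derives the average per subject from the (total, count) pairs.

```scala
object CombineByKeySketch {
  // Hypothetical helper: mimics combineByKey; each inner Seq stands for one partition.
  def combineByKeyLocal[K, V, C](partitions: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.get(k) match {
          case None    => acc + (k -> createCombiner(v)) // first time the key is seen in this partition
          case Some(c) => acc + (k -> mergeValue(c, v))  // the key already has an accumulator
        }
      }
    }
    // Merge the accumulators of the same key coming from different partitions.
    perPartition.flatten.groupBy(_._1).map { case (k, kcs) =>
      k -> kcs.map(_._2).reduce(mergeCombiners)
    }
  }

  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq(("chinese", 98), ("math", 97), ("math", 94)),
    Seq(("english", 82), ("math", 88), ("chinese", 98))
  )

  val sumAndCount: Map[String, (Int, Int)] =
    combineByKeyLocal[String, Int, (Int, Int)](partitions)(
      (v: Int) => (v, 1),
      (c: (Int, Int), v: Int) => (c._1 + v, c._2 + 1),
      (c1: (Int, Int), c2: (Int, Int)) => (c1._1 + c2._1, c1._2 + c2._2))

  // The average follows directly from the (total, count) accumulator.
  val average: Map[String, Double] =
    sumAndCount.map { case (k, (sum, count)) => k -> sum.toDouble / count }
}
```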


1.3 sortByKey

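 * Transformation operator: sortByKey
 * Applies to pair RDDs and sorts them by key
 * ascending: the sort order; true means ascending, false means descending
 * numPartitions: how many partitions the sorted data is stored in; defaults to the partition count of the original RDD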

    @Test def sortByKeyTest(): Unit = {
        val rdd: RDD[String] = sc.parallelize(Array("Li Lei", "Han Meimei", "Lucy", "Lily", "Uncle Wang", "Polly"), 3)
        val rdd1: RDD[(Int, String)] = rdd.keyBy(_.length)
        val res: RDD[(Int, String)] = rdd1.sortByKey(ascending = false, 1)
        res.foreach(println)
        println(res.getNumPartitions)
    }


1.4 sortBy

 * Transformation operator: sortBy
 * Sorts an RDD of any element type; it does not have to be a pair RDD
 * A function that extracts the sort key must be provided

    @Test def sortByTest(): Unit = {
        val rdd: RDD[String] = sc.parallelize(Array("Li Lei", "Han Meimei", "Lucy", "Lily", "Uncle Wang", "Polly"))
        // Sort the elements of the RDD by their length
        val res: RDD[String] = rdd.sortBy(_.length, ascending = true, 1)
        res.foreach(println)
    }


1.5 union, ++

 * Transformation operator: union
 * Merges two RDDs; duplicates are kept
 * The merged RDD has as many partitions as the two input RDDs combined
 * ++ has the same effect as union

    @Test def unionTest(): Unit = {
        val rdd1: RDD[Int] = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 2)
        val rdd2: RDD[Int] = sc.parallelize(Array(7, 8, 9, 10, 11, 12, 13, 14, 15), 2)
        // Merge the two RDDs
        val res0: RDD[Int] = rdd1.union(rdd2)
        res0.foreach(println)
        println(res0.getNumPartitions)

        val res1: RDD[Int] = rdd1.++(rdd2)
        res1.foreach(println)
        println(res1.getNumPartitions)
    }


1.6 intersection

 * Transformation operator: intersection
 * Computes the intersection of two RDDs
 * Triggers a shuffle internally

    @Test def intersectTest(): Unit = {
        val rdd1: RDD[Int] = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 2)
        val rdd2: RDD[Int] = sc.parallelize(Array(7, 8, 9, 10, 11, 12, 13, 14, 15), 5)
        val res0: RDD[Int] = rdd1.intersection(rdd2)
        res0.foreach(println)
        println(res0.getNumPartitions)
    }


1.7 distinct

 * Transformation operator: distinct
 * Removes duplicate elements from the RDD

    @Test def distinctTest(): Unit = {
        val rdd1: RDD[Int] = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 2)
        val rdd2: RDD[Int] = sc.parallelize(Array(7, 8, 9, 10, 11, 12, 13, 14, 15), 2)
        val res0: RDD[Int] = rdd1.union(rdd2).distinct(1)
        res0.foreach(println)
    }


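1.8 join, leftOuterJoin, rightOuterJoin, fullOuterJoin

 * Transformation operator: join
 * Applies to pair RDDs
 * Works like a SQL join: elements of the two RDDs are matched on their key, and for each key every value on the left is paired with every value on the right (a per-key Cartesian product)
 * Note the result types: join returns plain values, while the outer joins wrap the side that may be missing in an Option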

    @Test def joinTest(): Unit = {
        val rdd1: RDD[(Int, String)] = sc.parallelize(Array("Tom", "LIN", "Jerry", "诸葛大力")).keyBy(_.length)
        val rdd2: RDD[(Int, String)] = sc.parallelize(Array("MIT", "math", "nature", "ABC")).keyBy(_.length)

        val res0: RDD[(Int, (String, String))] = rdd1.join(rdd2)
        res0.foreach(println)

        // rdd1: (3, Tom)   (3, LIN)   (5, Jerry)   (4, 诸葛大力)
        // rdd2: (3, MIT)   (4, math)  (6, nature)  (3, ABC)

        val res1: RDD[(Int, (String, Option[String]))] = rdd1.leftOuterJoin(rdd2)
        res1.foreach(println)

        val res2: RDD[(Int, (Option[String], String))] = rdd1.rightOuterJoin(rdd2)
        res2.foreach(println)

        val res3: RDD[(Int, (Option[String], Option[String]))] = rdd1.fullOuterJoin(rdd2)
        res3.foreach(println)

    }
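
The per-key pairing and the role of Option in the outer joins can be checked on plain Scala collections. This is only a sketch of the semantics using the same sample strings; the names left, right, inner and leftOuter are illustrative and nothing here calls Spark.

```scala
object JoinSketch {
  // Key each string by its length, as keyBy(_.length) does in the test above.
  val left: Seq[(Int, String)]  = Seq("Tom", "LIN", "Jerry", "诸葛大力").map(s => s.length -> s)
  val right: Seq[(Int, String)] = Seq("MIT", "math", "nature", "ABC").map(s => s.length -> s)

  // Inner join: every left value is paired with every right value that has the same key.
  val inner: Seq[(Int, (String, String))] =
    for ((k1, v1) <- left; (k2, v2) <- right if k1 == k2) yield (k1, (v1, v2))

  // Left outer join: every left value is kept; the right side becomes an Option.
  val leftOuter: Seq[(Int, (String, Option[String]))] = left.flatMap { case (k, v) =>
    val matches = right.filter(_._1 == k).map(_._2)
    if (matches.isEmpty) Seq((k, (v, Option.empty[String])))
    else matches.map(w => (k, (v, Option(w))))
  }
}
```

Key 3 appears twice on each side, so it contributes 2 x 2 = 4 rows to the inner join; "Jerry" (key 5) has no partner and only shows up in the left outer join, paired with None.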


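1.9 repartition, coalesce

repartition

 * Transformation operator: repartition
 * Redistributes the data of an RDD across a new number of partitions; best suited to increasing the partition count
 * Increasing the partition count means re-splitting the existing partitions, which requires a shuffle
 * repartition(numPartitions) is simply coalesce(numPartitions, shuffle = true), so it always shuffles, even when the partition count goes down

The Scaladoc in the Spark source says the same:

   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.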

    @Test def repartitionTest(): Unit = {
        val rdd: RDD[Int] = sc.parallelize(1 to 20, 2)
        showDataWithPartition(rdd)

        val rdd1: RDD[Int] = rdd.repartition(4)
        showDataWithPartition(rdd1)
    }

    def showDataWithPartition(rdd: RDD[Int]): Unit = {
        // Prefix each element with the index of the partition it lives in
        rdd.mapPartitionsWithIndex((index, iter) => iter.map(x => s"$index: $x")).foreach(println)
    }


coalesce

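 * Transformation operator: coalesce
 * Re-plans the partitions; best suited to reducing the partition count
 * Does not shuffle by default, and without a shuffle it cannot increase the partition count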

  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T]

   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.

    @Test def coalesceTest(): Unit = {
        val rdd: RDD[Int] = sc.parallelize(1 to 20, 4)
        // Growing from 4 to 8 partitions needs shuffle = true; coalesce(8) alone would leave 4 partitions
        val rdd2: RDD[Int] = rdd.coalesce(8, shuffle = true)
        // showDataWithPartition is the helper defined in the repartition example
        showDataWithPartition(rdd2)
        println(rdd2.getNumPartitions)
    }


1.10 cogroup

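The overloads accept up to three other RDDs.

 * Transformation operator: cogroup
 * Applies to pair RDDs; groups the data of several RDDs by key, and for each key collects the values from each RDD into its own collection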

    @Test def cogroupTest(): Unit = {
        val rdd1: RDD[(String, Int)] = sc.parallelize(Array(("chinese", 90), ("chinese", 87), ("chinese", 78), ("math", 88), ("math", 98)))
        val rdd2: RDD[(String, Int)] = sc.parallelize(Array(("chinese", 88), ("chinese", 89), ("math", 88), ("math", 68), ("math", 100)))

        val res0: RDD[(String, (Iterable[Int], Iterable[Int]))] = rdd1.cogroup(rdd2)
        res0.foreach(println)
        val res1: RDD[(String, (Iterable[Int], Iterable[Int], Iterable[Int]))] = rdd1.cogroup(rdd2, rdd2)

        // One Iterable per input RDD, e.g. chinese -> ([90, 87, 78], [88, 89], [88, 89]); element order may vary
        res1.foreach(println)
    }


1.11 sample

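Use case: detecting data skew.

Returns a sampled subset of the RDD.
withReplacement: true samples with replacement, so an element can be picked more than once; false samples without replacement.
fraction: the sampling fraction.
seed: the random seed, for example the current timestamp.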

    @Test def sampleTest(): Unit = {
        val rdd: RDD[Int] = sc.parallelize(1 to 1000)
        val res: RDD[Int] = rdd.sample(withReplacement = true, 0.1)
        res.foreach(println)
    }

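With fraction = 0.1 this returns roughly 100 of the 1000 values; the fraction is an expected rate, not an exact count.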

1.12 Comparing reduceByKey, groupByKey and aggregateByKey

reduceByKey(func, numPartitions=None)
Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions, or the default parallelism level if numPartitions is not specified.
In other words, reduceByKey merges the values of each key with a user-defined function and, most importantly, applies that merge locally on each mapper first.

groupByKey(numPartitions=None)
Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with numPartitions partitions. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will provide much better performance.
In other words, groupByKey also works per key, but it only collects the values into a single sequence. Pay attention to the Note: if the goal is to aggregate that sequence, reduceByKey or aggregateByKey is the better choice. groupByKey takes no aggregation function, so the grouped RDD has to be built first and the custom function applied afterwards with map, which means every value is shuffled before any aggregation happens.
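
The difference in shuffle volume can be made concrete with a small simulation on plain Scala collections. This is a sketch with made-up data, not a measurement of Spark itself: the two inner sequences stand for two map-side partitions, and the "shuffled" sequences stand for the records that would cross the network.

```scala
object ShuffleVolumeSketch {
  // Two map-side partitions of (word, 1) pairs; the data is invented for illustration.
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq("spark" -> 1, "spark" -> 1, "hdfs" -> 1, "spark" -> 1),
    Seq("hdfs" -> 1, "spark" -> 1, "hdfs" -> 1, "hdfs" -> 1)
  )

  // Sum the values of each key.
  private def sumByKey(records: Seq[(String, Int)]): Map[String, Int] =
    records.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

  // groupByKey style: every record is shuffled, aggregation happens afterwards.
  val shuffledByGroup: Seq[(String, Int)] = partitions.flatten
  val resultByGroup: Map[String, Int] = sumByKey(shuffledByGroup)

  // reduceByKey style: combine inside each partition first,
  // then shuffle one record per key per partition.
  val shuffledByReduce: Seq[(String, Int)] = partitions.flatMap(part => sumByKey(part).toSeq)
  val resultByReduce: Map[String, Int] = sumByKey(shuffledByReduce)
}
```

Both paths give the same counts, but the grouped path moves 8 records while the locally combined path moves only 4.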

2. Action Operators

2.1 foreach

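Runs the function func on every element of the dataset, typically for its side effects.
Notes:
1. foreach runs on the executors, partition by partition, so any output is produced where the partition lives, not on the driver.
2. Because partitions are processed in parallel, the order of the printed output can differ from run to run.
3. When Spark runs on a cluster, rdd.foreach(println) prints nothing on the driver console: the data lives on the worker nodes, so the output goes to the executors' stdout. In local mode everything runs in one JVM, so the output is visible.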

2.1 collect

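collect brings a distributed RDD back to the driver as a local Scala Array, on which ordinary Scala collection operations can then be used.
The result is stored as an array on the node that runs the driver program.

In production, use collect with caution: a distributed dataset can be far larger than the driver's memory, and pulling it all back can easily cause an out-of-memory error on the driver.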

2.2 reduce

 * Action operator: reduce
 * Aggregates the elements of the RDD with the given function

    @Test def reduceTest(): Unit = {
        val rdd: RDD[Int] = sc.parallelize(1 to 100, 2)
        val res: Int = rdd.reduce(_ + _)
        println(res)
    }


2.3 fold

 * Action operator: fold
 * Aggregates the data of each partition, starting from zeroValue
 * zeroValue is applied once in every partition and once more when the partition results are merged

    @Test def foldTest(): Unit = {
        val rdd: RDD[Int] = sc.parallelize(1 to 100, 2)
        val res: Int = rdd.fold(10)(_ + _)
        println(res)
    }
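
The effect of zeroValue can be worked out by hand. The sketch below mimics fold on plain Scala collections; foldLocal is a hypothetical helper, and the split into 1 to 50 and 51 to 100 assumes the even slicing that sc.parallelize(1 to 100, 2) performs.

```scala
object FoldSketch {
  // Hypothetical helper: each inner Seq stands for one partition.
  def foldLocal(partitions: Seq[Seq[Int]], zeroValue: Int)(op: (Int, Int) => Int): Int = {
    // zeroValue is used once inside every partition ...
    val perPartition: Seq[Int] = partitions.map(_.foldLeft(zeroValue)(op))
    // ... and once more when the partition results are merged.
    perPartition.foldLeft(zeroValue)(op)
  }

  val partitions: Seq[Seq[Int]] = Seq(1 to 50, 51 to 100)

  // (10 + 1275) + (10 + 3775) + 10
  val result: Int = foldLocal(partitions, 10)(_ + _)
}
```

With two partitions the zeroValue of 10 is added three times, so the result is 5050 + 30 = 5080 rather than 5060.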


2.4 aggregate

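 * Action operator: aggregate
 * Lets you define separately how to aggregate within a partition and how to merge across partitions
 * The in-partition step starts from zeroValue
 * The cross-partition step also starts from zeroValue
 * zeroValue is the initial value for both seqOp and combOp
 *
 * In the example: inside each partition, take the maximum of zeroValue and the elements with Math.max
 * Then add up the per-partition maxima, together with zeroValue, using _ + _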

    @Test def aggregateTest(): Unit = {
        val rdd: RDD[Int] = sc.parallelize(1 to 100, 2)
        val res: Int = rdd.aggregate(80)(Math.max, _ + _)
        println(res)
    }
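
The same kind of hand calculation works for aggregate. This sketch uses plain Scala collections; aggregateLocal is a hypothetical helper, and the two partitions again assume the even split of sc.parallelize(1 to 100, 2).

```scala
object AggregateSketch {
  // Hypothetical helper: each inner Seq stands for one partition.
  def aggregateLocal[T, U](partitions: Seq[Seq[T]], zeroValue: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    // In-partition step: start from zeroValue and apply seqOp.
    val perPartition: Seq[U] = partitions.map(_.foldLeft(zeroValue)(seqOp))
    // Cross-partition step: start from zeroValue again and apply combOp.
    perPartition.foldLeft(zeroValue)(combOp)
  }

  val partitions: Seq[Seq[Int]] = Seq(1 to 50, 51 to 100)

  // Partition maxima are max(80, 1..50) = 80 and max(80, 51..100) = 100.
  val result: Int = aggregateLocal[Int, Int](partitions, 80)(
    (u: Int, v: Int) => math.max(u, v),
    (a: Int, b: Int) => a + b)
}
```

Under these assumptions the result is 80 + 80 + 100 = 260: the initial 80 of the merge step plus the two partition maxima.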

2.5 collectAsMap

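 * Action operator: collectAsMap
 * Collects a pair RDD to the driver as a Map of key-value pairs

@Test def collectAsMapTest(): Unit = {
    val rdd: RDD[(Int, String)] = sc.parallelize(Array("Lily", "Uncle Wang")).keyBy(_.length)
    // If a key occurs more than once, only one of its values is kept in the Map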
    val map: collection.Map[Int, String] = rdd.collectAsMap()

    println(map)
}


2.6 count

 * Action operator: count
 * Returns the number of elements in the RDD

2.7 countByKey

 * Action operator: countByKey
 * Counts the elements for each key and collects the result to the driver as a local Map

   * @note This method should only be used if the resulting map is expected to be small, as
   * the whole thing is loaded into the driver's memory.
   * To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which
   * returns an RDD[T, Long] instead of a map.
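   *
   * In short: countByKey loads the whole result map into the driver's memory, so use it only when the number of distinct keys is small.
   * For large results, rdd.mapValues(_ => 1L).reduceByKey(_ + _) keeps the counts distributed as an RDD[(K, Long)] instead of a local Map.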

@Test def countByKey(): Unit = {
    val rdd: RDD[(Int, String)] = sc.parallelize(Array("Lily", "Lucy", "Uncle Wang", "Polly", "Han Meimei", "Li Lei")).keyBy(_.length)
    val count: collection.Map[Int, Long] = rdd.countByKey()
    println(count)
}

countByKey also gives a short way to write word count:

    @Test def wordcount(): Unit = {
        val rdd: RDD[String] = sc.parallelize(Array("java mysql hadoop spark flink  hdfs spark java scala", "java linux spark hadoop hdfs hive hbase hbase hive hdfs"))
        val res: collection.Map[String, Long] = rdd.flatMap(_.split(" +")).map((_, 1)).countByKey()
        println(res)
    }

2.8 take

 * Action operator: take
 * Returns an array with the first N elements of the RDD

    @Test def takeTest(): Unit = {
        val rdd: RDD[Int] = sc.parallelize(1 to 20, 2)
        val res: Array[Int] = rdd.take(15)
        res.foreach(println)
    }


2.9 takeSample

 * Action operator: takeSample
 * Randomly picks a fixed number of elements from the RDD and returns them as an array on the driver

    @Test def takeSampleTest(): Unit = {
        val rdd: RDD[Int] = sc.parallelize(1 to 100)
        val res: Array[Int] = rdd.takeSample(true, 10)
        res.foreach(println)
    }


2.10 takeOrdered

 * Action operator: takeOrdered
 * Returns the N smallest elements of the RDD, in ascending order

    @Test def takeOrderedTest(): Unit = {
        val rdd: RDD[Int] = sc.parallelize(Array(1, 3, 5, 7, 9, 0, 8, 6, 4, 2))
        val res: Array[Int] = rdd.takeOrdered(5)
        res.foreach(println)
    }


2.11 top

 * Action operator: top
 * Returns the N largest elements of the RDD, in descending order

    @Test def topTest(): Unit = {
        val rdd: RDD[Int] = sc.parallelize(Array(1, 3, 5, 7, 9, 0, 8, 6, 4, 2))
        val res: Array[Int] = rdd.top(5)
        res.foreach(println)
    }
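
On a local collection, takeOrdered and top correspond to sorting and taking a prefix. The sketch below shows that correspondence with plain Scala, using the same ten numbers; it illustrates the expected ordering only and says nothing about how Spark computes the result internally.

```scala
object TopSketch {
  val data: Seq[Int] = Seq(1, 3, 5, 7, 9, 0, 8, 6, 4, 2)

  // Same elements and order as rdd.takeOrdered(5): the five smallest, ascending.
  val smallestFive: Seq[Int] = data.sorted.take(5)

  // Same elements and order as rdd.top(5): the five largest, descending.
  val largestFive: Seq[Int] = data.sorted(Ordering[Int].reverse).take(5)
}
```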


2.12 first

 * Action operator: first
 * Returns the first element of the RDD; similar to take(1), but it returns the element itself rather than an array

    @Test def first(): Unit = {
        val rdd: RDD[Int] = sc.parallelize(1 to 10)
        val res: Int = rdd.first()
        println(res)
    }

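2.13 saveAsTextFile

    @Test def saveTest(): Unit = {
        val rdd: RDD[Int] = sc.parallelize(1 to 100, 4)
        // Writes one part file per partition (4 here) into the output directory, which must not already exist
        rdd.saveAsTextFile("C:\\Users\\mgs\\Desktop\\output")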
    }

3. Three Ways to Check the Partition Count

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

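/**
 * In local mode:
 *      If textFile is not given minPartitions, the minimum partition count is min(defaultParallelism, 2),
 *      where defaultParallelism is the number of local threads
 *      local: a single thread by default
 *      local[N]: N threads
 *      local[*]: as many threads as the machine has logical cores
 */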
object TextFileTest {
    def main(args: Array[String]): Unit = {
        // Create the SparkContext
        val sc: SparkContext = new SparkContext(new SparkConf().setMaster("local[3]").setAppName("textFile"))
        // Create an RDD from a file, asking for at least 4 partitions
        val filePath: String = "C:\\Users\\luds\\Desktop\\file\\file1"
        val rdd: RDD[String] = sc.textFile(filePath, 4)

        // Three ways to get the partition count
        // 1. getNumPartitions
        println(rdd.getNumPartitions)
        // 2. The length of the partitions array
        println(rdd.partitions.length)
        // 3. Save the RDD and count the part files in the output directory
        rdd.saveAsTextFile("C:\\Users\\luds\\Desktop\\output")
    }
}
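
The default described in the comment above, used when textFile gets no minPartitions argument, comes down to one line of arithmetic. The function below is a hypothetical restatement of that rule in plain Scala, not a call into Spark, and the number of partitions actually created can still be higher depending on how the input file is split.

```scala
object MinPartitionsSketch {
  // Restates the rule from the comment above: min(number of local threads, 2).
  def defaultMinPartitions(localThreads: Int): Int = math.min(localThreads, 2)

  val forLocal: Int  = defaultMinPartitions(1) // master = "local"
  val forLocal3: Int = defaultMinPartitions(3) // master = "local[3]"
}
```

In the TextFileTest example the value 4 is passed explicitly, so this default does not apply there.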