Spark

Joe-Stalin

Last modified: 2022-08-10 15:11:46

Category: Spark Basics. Tags: spark, big data

First published: 2022-04-19 13:57:56

Article link: https://blog.csdn.net/qq_44378386/article/details/124272080

Spark Tutorial

I: Spark Runtime Architecture

II: Transformation Operators

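1: `groupBy`

Regroups the elements of an RDD by the key that a user-supplied function returns for each element. It places no requirement on the element format, so it works on any RDD.

import org.apache.spark.{SparkConf, SparkContext}

object GroupByExample {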
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark_GroupBy")
    val sparkContext = new SparkContext(sparkConf)
    val data = sparkContext.parallelize(1 to 16,4)
    val value = data.groupBy(_ % 2 == 0)
    value.collect().foreach(println)
    sparkContext.stop()
  }
}

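2: `groupByKey`

Regroups the data by key. The elements must be key-value pairs (two-element tuples).

import org.apache.spark.{SparkConf, SparkContext}

object GroupByKeyExample {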
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark_GroupBykey")
    val sparkContext = new SparkContext(sparkConf)
    val person = sparkContext.makeRDD(List(("name", "dzh"), ("name", "ysy"), ("age", 12), ("age", 13)), 1)
    val value = person.groupByKey()
    value.collect().foreach(println)
    sparkContext.stop()
  }
}

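4: `join`

The RDD counterpart of SQL joins, matching elements by key. `join` is an inner join (the Cartesian product is a separate operator, `cartesian`), `leftOuterJoin` is a left outer join, and `rightOuterJoin` is a right outer join.

import org.apache.spark.{SparkConf, SparkContext}

object JoinExample {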
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark_Join")
    val sparkContext = new SparkContext(sparkConf)
    val rdd1 = sparkContext.makeRDD(List(
      ("a", 1), ("b", 2), ("c", 3), ("e", 7)
    ), 2)
    val rdd2 = sparkContext.makeRDD(List(
      ("a", 11), ("b", 12), ("c", 13), ("d", 4)
    ), 2)
    val value = rdd1.join(rdd2)
    value.collect().foreach(println)
    println(" ---- ")
    val value2 = rdd1.leftOuterJoin(rdd2)
    value2.collect().foreach(println)
    println(" ---- ")
    val value3 = rdd1.rightOuterJoin(rdd2)
    value3.collect().foreach(println)
    sparkContext.stop()
  }
}

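Output (the order of the lines may vary between runs):

# Inner join: only keys present in both RDDs are joined
(b,(2,12))
(a,(1,11))
(c,(3,13))
# Left outer join: every key of the left RDD is kept
(b,(2,Some(12)))
(e,(7,None))
(a,(1,Some(11)))
(c,(3,Some(13)))
# Right outer join: every key of the right RDD is kept
(d,(None,4))
(b,(Some(2),12))
(a,(Some(1),11))
(c,(Some(3),13))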

5: `partitionBy`

Repartitions key-value data according to the given partitioner.
Function signature:

  /**
   * Return a copy of the RDD partitioned using the specified partitioner.
   */
  def partitionBy(partitioner: Partitioner): RDD[(K, V)]

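Example:

import org.apache.spark.{HashPartitioner, Partitioner, SparkConf, SparkContext}

object KeyPartitionExample {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark_KeyPartitionExample")
    val sparkContext = new SparkContext(sparkConf)
    val rdd = sparkContext.parallelize(List((1, "aaa"),(2, "bbb"),(3, "ccc"),(4, "ddd"), (1, "eee")),4)
    // Built-in hash partitioner with 2 partitions
    val rdd2 = rdd.partitionBy(new HashPartitioner(2))
    // Custom partitioner, defined below
    val rdd3 = rdd.partitionBy(new MyPartition)
    // glom() turns each partition into an array, so the placement of the keys can be inspected
    rdd2.glom().collect().foreach(p => println(p.mkString("HashPartitioner: [", ", ", "]")))
    rdd3.glom().collect().foreach(p => println(p.mkString("MyPartition: [", ", ", "]")))
    sparkContext.stop()
  }
}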

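/**
 * Custom partitioner class.
 * numPartitions specifies the number of partitions.
 * getPartition returns the index of the partition a key belongs to; indices start at 0 and must be less than numPartitions.
 */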
class MyPartition extends Partitioner {

  override def numPartitions: Int = 2

  override def getPartition(key: Any): Int = {
    if(key.toString.toInt % 2 == 0) {
      0
    } else {
      1
    }
  }
}

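6: `reduceByKey`

Aggregates the values that share the same key. A partitioner or a number of partitions can also be passed in, which determines how the output data is partitioned.

Function signatures: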

  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce.
   */
  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }

  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
   */
  def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
    reduceByKey(new HashPartitioner(numPartitions), func)
  }

Example:

import org.apache.spark.{SparkConf, SparkContext}

object ReduceByKeyExample {

  def reduceFunc(a: String, b: String) : String = {
    a + b
  }

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark_KeyPartitionExample")
    val sparkContext = new SparkContext(sparkConf)
    val rdd = sparkContext.parallelize(List((1, "aaa"),(2, "bbb"),(3, "ccc"),(4, "ddd"), (1, "eee")),4)
    val value = rdd.reduceByKey((x, y) => reduceFunc(x, y))
    value.collect().foreach(println)
    sparkContext.stop()
  }

}

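7: `repartition`

Redistributes the data over a new number of partitions.

import org.apache.spark.{SparkConf, SparkContext}

object PartitionExample {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark_Partition")
    val sparkContext = new SparkContext(sparkConf)
    val data = sparkContext.makeRDD(List(("name", "dzh"), ("age", 12), ("age", 13), ("name", "ysy"), ("age", 13), ("name", 16)))
    /**
     * coalesce reduces the number of partitions. It is typically used after filtering a large
     * dataset, to make processing of the remaining small dataset more efficient.
     * It can be thought of as merging partitions. By default there is no shuffle, which may lead to data skew.
     */
    val coalesced = data.coalesce(2)
    // repartition calls coalesce under the hood with shuffle = true, so it always shuffles
    val repartitioned = data.repartition(3)
    println(coalesced.getNumPartitions)
    println(repartitioned.getNumPartitions)
    sparkContext.stop()
  }
}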

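8: `sample`

Sampling function: returns a random sample of the data, with or without replacement.

import org.apache.spark.{SparkConf, SparkContext}

object SampleExample {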
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark_SampleExample")
    val sparkContext = new SparkContext(sparkConf)
    val data = sparkContext.makeRDD(List(1, 2, 3, 4, 5, 6))
    val sampleRDD = data.sample(withReplacement = true, fraction = 0.5, seed = System.currentTimeMillis())
    sampleRDD.collect().foreach(println)
    sparkContext.stop()
  }
}

9: `distinct`

Removes duplicate elements.

import org.apache.spark.{SparkConf, SparkContext}

object DistinctExample {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark_distinctExample")
    val sparkContext = new SparkContext(sparkConf)
    val data = sparkContext.makeRDD(List(1, 2, 3, 4, 5, 6))
    val result = data.distinct()
    result.collect().foreach(println)
    sparkContext.stop()
  }
}

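10: `cogroup`

`cogroup` groups together two to four RDDs (`this` RDD plus one to three `other` RDDs). All of them must be RDDs of key-value pairs, because the key is what the grouping is based on. It returns a new RDD of two-element tuples:
The first element is a key that occurs in any of the RDDs.
The second element is itself a tuple (a pair when two RDDs are cogrouped): its first value holds the values found for that key in `this` RDD, and its second value holds the values found for that key in the `other` RDD.

import org.apache.spark.{SparkConf, SparkContext}

object CogroupExample {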

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark_Cogroup")
    val sparkContext = new SparkContext(sparkConf)
    val names = sparkContext.makeRDD(List((1, "Spark"), (2, "Hadoop"), (3, "Kylin"), (4, "Flink"), (7, "rdd")))
    val types = sparkContext.makeRDD(List((1, "String"), (2, "int"), (3, "byte"), (4, "bollean"),
      Tuple2(5, "float"), Tuple2(1, "34"), Tuple2(2, "45"), Tuple2(3, "75")))
    val nameAndType = names.cogroup(types)
    // or
    // val nameAndType = names.cogroup(types, types, types)
    nameAndType.collect().foreach(println)
    sparkContext.stop()
  }

}

Output:

(1,(CompactBuffer(Spark),CompactBuffer(String, 34)))
(2,(CompactBuffer(Hadoop),CompactBuffer(int, 45)))
(3,(CompactBuffer(Kylin),CompactBuffer(byte, 75)))
(4,(CompactBuffer(Flink),CompactBuffer(bollean)))
(5,(CompactBuffer(),CompactBuffer(float)))
(7,(CompactBuffer(rdd),CompactBuffer()))

Two of the function signatures:

  /**
   * For each key k in `this` or `other1` or `other2` or `other3`,
   * return a resulting RDD that contains a tuple with the list of values
   * for that key in `this`, `other1`, `other2` and `other3`.
   */
  def cogroup[W1, W2, W3](other1: RDD[(K, W1)],
      other2: RDD[(K, W2)],
      other3: RDD[(K, W3)],
      partitioner: Partitioner)
      : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

  /**
   * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
   * list of values for that key in `this` as well as `other`.
   */
  def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
      : RDD[(K, (Iterable[V], Iterable[W]))]

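11: `glom`

Collects the data inside each partition into an array and returns a new RDD of those arrays.

import org.apache.spark.{SparkConf, SparkContext}

object GlomExample {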
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark_GlomExample")
    val sparkContext = new SparkContext(sparkConf)
    val data = sparkContext.parallelize(1 to 16, 4)
    val value = data.glom()
    value.collect().foreach(array => println(array.mkString("Array(", ", ", ")")))
    sparkContext.stop()
  }
}

Output:

Array(1, 2, 3, 4)
Array(5, 6, 7, 8)
Array(9, 10, 11, 12)
Array(13, 14, 15, 16)

12: `aggregateByKey`

Merges the values of each key, with separate rules for merging within a partition and merging between partitions.
Function signatures:

  /**
   * Aggregate the values of each key, using given combine functions and a neutral "zero value".
   * This function can return a different result type, U, than the type of the values in this RDD,
   * V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
   * as in scala.TraversableOnce. The former operation is used for merging values within a
   * partition, and the latter is used for merging values between partitions. To avoid memory
   * allocation, both of these functions are allowed to modify and return their first argument
   * instead of creating a new U.
   */
  def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)]

  /**
   * Aggregate the values of each key, using given combine functions and a neutral "zero value".
   * This function can return a different result type, U, than the type of the values in this RDD,
   * V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
   * as in scala.TraversableOnce. The former operation is used for merging values within a
   * partition, and the latter is used for merging values between partitions. To avoid memory
   * allocation, both of these functions are allowed to modify and return their first argument
   * instead of creating a new U.
   */
  def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)]

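`aggregateByKey` returns values of a new type U, which is determined by the type of the zero value passed in. Within a partition, the value type V may differ from U, but the result of the within-partition aggregation is always of type U, and the between-partition aggregation works on values of type U only.

For example:
Take the maximum value of each key within each partition, then sum those maxima across partitions.

import org.apache.spark.{SparkConf, SparkContext}

object AggregateByKeyExample {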
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark_AggregateByKey")
    val sparkContext = new SparkContext(sparkConf)
    val rdd = sparkContext.makeRDD(List(
      ("a", 1), ("a", 2), ("a", 3), ("a", 4), ("a", 2), ("a", 3), ("a", 4), ("b", 1), ("b", 2), ("b", 3), ("c", 4), ("d", 10)
    ), 2)
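    // Inspect the data in each partition (each partition is processed by its own thread here)
    rdd.foreach(value => {
       println("ThreadId is " + Thread.currentThread().getId + ", value is " + value)
    })
    // Use 0 as the initial (zero) value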
    rdd.aggregateByKey(0)(
      (x, y) => { math.max(x, y) },
      (x, y) => x + y
    ).collect().foreach(println)
    sparkContext.stop()
  }
}

Output:

ThreadId is 56, value is (a,1)
ThreadId is 57, value is (a,4)
ThreadId is 56, value is (a,2)
ThreadId is 56, value is (a,3)
ThreadId is 57, value is (b,1)
ThreadId is 56, value is (a,4)
ThreadId is 56, value is (a,2)
ThreadId is 56, value is (a,3)
ThreadId is 57, value is (b,2)
ThreadId is 57, value is (b,3)
ThreadId is 57, value is (c,4)
ThreadId is 57, value is (d,10)
(a,8)
(c,4)
(d,10)
(b,3)

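Execution flow:
Within each partition, the values of each key are folded into the zero value with the first function (here `math.max`). The per-partition results that share a key are then merged with the second function (here `+`).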
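
The flow can also be traced without a cluster. The following is a minimal sketch in plain Scala collections, not Spark's implementation: the object `AggregateByKeyFlow` and the helper `aggregateByKeySim` are hypothetical names, and the two hard-coded partitions assume the split shown by the thread IDs in the output above (first six elements in one partition, last six in the other).

```scala
object AggregateByKeyFlow {

  // Hypothetical helper: mimics aggregateByKey on data that is already split into partitions.
  def aggregateByKeySim[K, V, U](partitions: Seq[Seq[(K, V)]], zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] = {
    // Step 1: inside each partition, fold the values of every key into the zero value.
    val perPartition: Seq[Map[K, U]] = partitions.map { part =>
      part.groupBy(_._1).map { case (key, pairs) =>
        key -> pairs.map(_._2).foldLeft(zeroValue)(seqOp)
      }
    }
    // Step 2: across partitions, merge the partial results that share a key.
    perPartition.flatten.groupBy(_._1).map { case (key, partials) =>
      key -> partials.map(_._2).reduce(combOp)
    }
  }

  def main(args: Array[String]): Unit = {
    val partition0 = Seq(("a", 1), ("a", 2), ("a", 3), ("a", 4), ("a", 2), ("a", 3))
    val partition1 = Seq(("a", 4), ("b", 1), ("b", 2), ("b", 3), ("c", 4), ("d", 10))
    val result = aggregateByKeySim(Seq(partition0, partition1), 0)(
      (x, y) => math.max(x, y),
      (x, y) => x + y)
    // Key "a" has a maximum of 4 in each partition, so its final value is 4 + 4 = 8.
    result.toSeq.sortBy(_._1).foreach(println)
  }
}
```

Key "a" is the only key that appears in both partitions, which is why it is the only one whose result (8) differs from its overall maximum (4).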

13: `foldByKey`

A simplified version of `aggregateByKey`, for the case where the within-partition rule and the between-partition rule are the same.
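
As a small illustration, here is a sketch that sums the values of each key with `foldByKey` and, for comparison, with the equivalent `aggregateByKey` call. The object name `FoldByKeyExample` and the sample data are made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FoldByKeyExample {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark_FoldByKey")
    val sparkContext = new SparkContext(sparkConf)
    val rdd = sparkContext.makeRDD(List(("a", 1), ("a", 2), ("b", 3), ("b", 4), ("a", 5)), 2)
    // The same function (addition) is applied within and between partitions.
    val sums = rdd.foldByKey(0)(_ + _)
    // Equivalent aggregateByKey call, with the function written twice.
    val sameSums = rdd.aggregateByKey(0)(_ + _, _ + _)
    println(sums.collect().sortBy(_._1).mkString(", "))     // (a,8), (b,7)
    println(sameSums.collect().sortBy(_._1).mkString(", ")) // (a,8), (b,7)
    sparkContext.stop()
  }
}
```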

14: `combineByKey`

Merges data by key, with separate steps for merging within a partition and merging between partitions.
Function signature:

  /**
   * Simplified version of combineByKeyWithClassTag that hash-partitions the output RDD.
   * This method is here for backward compatibility. It does not provide combiner
   * classtag information to the shuffle.
   *
   * @see `combineByKeyWithClassTag`
   */
  def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      numPartitions: Int): RDD[(K, C)]

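createCombiner: called the first time a key is seen within a partition; it turns that first value into the initial combiner.
mergeValue: merges the further values of that key into the combiner within the partition.
mergeCombiners: merges the combiners of the same key that come from different partitions.

For example:

import org.apache.spark.{SparkConf, SparkContext}

object CombineByKeyExample {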
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark_CombineByKey")
    val sparkContext = new SparkContext(sparkConf)
    val rdd = sparkContext.makeRDD(List(("b", 4), ("b", 5), ("b", 6),("c", 14), ("c", 5), ("c", 6)), 2)
    rdd.combineByKey(
      v => {
        println("v = " + v)
        (v, 1)
      },
      (t: (Int, Int), v) => {
        println("t = " + t + " , v = " + v)
        (t._1 + v, t._2 + 1)
      },
      (t: (Int, Int), v: (Int, Int)) => {
        println("t2 = " + t + " , v2 = " + v)
        (t._1 + v._1, t._2 + v._2)
      }
    ).collect().foreach(println)
    sparkContext.stop()
  }
}
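
A common use of the (sum, count) pairs built above is a per-key average. The sketch below reuses the same data; the object name `CombineByKeyAverage` is made up for this example, and the final `mapValues` step is an addition that the example above does not contain.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CombineByKeyAverage {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark_CombineByKeyAverage")
    val sparkContext = new SparkContext(sparkConf)
    val rdd = sparkContext.makeRDD(List(("b", 4), ("b", 5), ("b", 6), ("c", 14), ("c", 5), ("c", 6)), 2)
    val sumAndCount = rdd.combineByKey(
      // createCombiner: first value of a key within a partition
      (v: Int) => (v, 1),
      // mergeValue: later values of the key within the same partition
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),
      // mergeCombiners: partial results of the key from different partitions
      (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)
    )
    // Divide the sum by the count to get the average of each key.
    val averages = sumAndCount.mapValues { case (sum, count) => sum.toDouble / count }
    averages.collect().sortBy(_._1).foreach(println)
    sparkContext.stop()
  }
}
```

For key "b" the pair is (15, 3), so its average is 5.0.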


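III: Action Operators

1: `aggregate`

An aggregating action operator: calling it triggers the job, so the computation is actually executed.

import org.apache.spark.{SparkConf, SparkContext}

object AggregateExample {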

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("Spark_AggregateExample")
    val sparkContext = new SparkContext(sparkConf)
    val data = sparkContext.makeRDD(Array(Tuple2(1, "Spark"), Tuple2(5, "Spark"),
      Tuple2(2, "Hadoop"), Tuple2(3, "Kylin"), Tuple2(4, "Flink"), Tuple2(11, "Spark"), Tuple2(15, "Spark"),
      Tuple2(12, "Hadoop"), Tuple2(13, "Kylin"), Tuple2(14, "Flink"),Tuple2(21, "Spark"), Tuple2(25, "Spark"),
      Tuple2(22, "Hadoop"), Tuple2(23, "Kylin"), Tuple2(24, "Flink")))
    val result = data.aggregate("0")(
      (x, y) => {
        x + y
      },
      (x, y) => {
        x + y
      }
    )
    println(result)
    sparkContext.stop()
  }

}

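`aggregate` is an action operator suited to cases where different rules are used within and between partitions. The zero value is used as the starting value both for the aggregation inside each partition and for the merge of the partition results.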
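
The effect of the zero value is easiest to see with numbers. The following sketch uses a deliberately non-neutral zero value of 10; the object name `AggregateZeroValueExample` and the data are made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AggregateZeroValueExample {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("Spark_AggregateZeroValue")
    val sparkContext = new SparkContext(sparkConf)
    // Two partitions: (1, 2) and (3, 4).
    val data = sparkContext.makeRDD(List(1, 2, 3, 4), 2)
    // Within partitions: 10 + 1 + 2 = 13 and 10 + 3 + 4 = 17.
    // Between partitions: 10 + 13 + 17 = 40, because the zero value is applied once more.
    val result = data.aggregate(10)(_ + _, _ + _)
    println(result) // 40
    sparkContext.stop()
  }
}
```

With a neutral zero value such as 0 the result would simply be 10, the sum of the elements, which is why a neutral value is the usual choice.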

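2: `countByKey` & `countByValue`

countByKey: for key-value pairs, counts the number of elements for each key.
countByValue: counts the number of occurrences of each distinct element.
count: counts the number of elements in the RDD.

import org.apache.spark.{SparkConf, SparkContext}

object CountByExample {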
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[3]").setAppName("Spark_CountByExample")
    val sparkContext = new SparkContext(sparkConf)
    val elements = sparkContext.makeRDD(List(1, 2, 3, 3, 5, 1))
    val data = sparkContext.makeRDD(List(("c1", "cai"), ("c2", "niao"), ("c1", "feng"), ("c2", "jin"), ("c2", "niao")))
    val result01 = elements.countByValue()
    println(result01)
    val result02 = data.countByKey()
    println(result02)
    val result03 = elements.count()
    println(result03)
    val result04 = data.count()
    println(result04)
    sparkContext.stop()
  }
}

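IV: Accumulators

Accumulators aggregate variable information from the Executor side back to the Driver. For a variable defined in the Driver program, every Task on the Executor side gets a fresh copy of it; each task updates its own copy, and the copies are sent back to the Driver and merged.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.LongAccumulator

object AccumulatorExample {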
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark_Accumulator")
    val sparkContext = new SparkContext(sparkConf)
    val data = sparkContext.makeRDD(List(("a", 4), ("a", 2), ("b", 3), ("c", 4)))
    val sum: LongAccumulator = sparkContext.longAccumulator("sum")
    data.foreach {
      item => sum.add(item._2)
    }
    println(sum.value)
    sparkContext.stop()
  }
}

Custom accumulator:

import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

class myAccumulator extends AccumulatorV2[String, mutable.Map[String, Long]] {

  var map: mutable.Map[String, Long] = mutable.Map[String, Long]()

  // Whether the accumulator is in its initial (zero) state
  override def isZero: Boolean = map.isEmpty

  // Copy the accumulator, including its current contents
  override def copy(): AccumulatorV2[String, mutable.Map[String, Long]] = {
    val acc = new myAccumulator()
    acc.map ++= map
    acc
  }

  // Reset the accumulator
  override def reset(): Unit = map.clear()

  // Add a value: only words starting with "H" are counted
  override def add(v: String): Unit = {
    if (v.startsWith("H")) {
      map(v) = map.getOrElse(v, 0L) + 1
    }
  }

  // Merge another accumulator into this one
  override def merge(other: AccumulatorV2[String, mutable.Map[String, Long]]): Unit = {
    val map2: mutable.Map[String, Long] = other.value
    map2.foreach {
      case (word, count) => {
        map(word) = map.getOrElse(word, 0L) + count
      }
    }
  }

  // The current value of the accumulator
  override def value: mutable.Map[String, Long] = map
}
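
A custom accumulator has to be registered with the SparkContext before tasks can use it. The sketch below shows that step. To keep it self-contained it defines its own compact accumulator, `HWordAccumulator`, which is a hypothetical stand-in with the same counting rule (words starting with "H"); the object name and the sample words are also made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

// Hypothetical compact accumulator: counts words that start with "H".
class HWordAccumulator extends AccumulatorV2[String, mutable.Map[String, Long]] {
  private val counts = mutable.Map[String, Long]()

  override def isZero: Boolean = counts.isEmpty
  override def copy(): HWordAccumulator = {
    val acc = new HWordAccumulator
    acc.counts ++= counts
    acc
  }
  override def reset(): Unit = counts.clear()
  override def add(v: String): Unit =
    if (v.startsWith("H")) counts(v) = counts.getOrElse(v, 0L) + 1
  override def merge(other: AccumulatorV2[String, mutable.Map[String, Long]]): Unit =
    other.value.foreach { case (word, n) => counts(word) = counts.getOrElse(word, 0L) + n }
  override def value: mutable.Map[String, Long] = counts
}

object CustomAccumulatorExample {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark_CustomAccumulator")
    val sparkContext = new SparkContext(sparkConf)
    val acc = new HWordAccumulator
    // Register the accumulator so that Spark ships it to the tasks and merges the copies.
    sparkContext.register(acc, "hWords")
    val words = sparkContext.makeRDD(List("Hello", "Hadoop", "Spark", "Hello", "Hive"))
    words.foreach(word => acc.add(word))
    // Read the merged value on the Driver; "Spark" is not counted.
    println(acc.value)
    sparkContext.stop()
  }
}
```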

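A look at the AccumulatorV2 source:

/**
 * The first type parameter (IN) is the input data type.
 * The second type parameter (OUT) is the output data type.
 * OUT should be a type that can be read atomically (e.g. Int, Long) or thread-safely, because it is read from other threads.
 */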
abstract class AccumulatorV2[IN, OUT] extends Serializable {

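V: Broadcast Variables

Broadcast variables efficiently distribute a large read-only value to all worker nodes, for use by one or more Spark operations. They come in handy, for example, when the application needs to send a large read-only lookup table to every node. Without broadcasting, a variable that is used in several parallel operations is shipped by Spark separately with every task; a broadcast variable is sent to each executor only once.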
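
A minimal sketch of the lookup-table case follows. The object name `BroadcastExample`, the table contents and the IDs are made up for this example; a real table would be much larger.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark_Broadcast")
    val sparkContext = new SparkContext(sparkConf)
    // A small stand-in for a large read-only lookup table.
    val lookup = Map(1 -> "Spark", 2 -> "Hadoop", 3 -> "Flink")
    // broadcast() distributes the table; tasks read it through .value.
    val lookupBroadcast = sparkContext.broadcast(lookup)
    val ids = sparkContext.makeRDD(List(1, 2, 3, 4), 2)
    val names = ids.map(id => (id, lookupBroadcast.value.getOrElse(id, "unknown")))
    // Prints (1,Spark), (2,Hadoop), (3,Flink) and (4,unknown), one per line.
    names.collect().foreach(println)
    sparkContext.stop()
  }
}
```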

VI: RDD Dependencies

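A dependency between two RDDs takes one of two forms:
Narrow dependency (for example one-to-one, `OneToOneDependency`): the data of each parent partition is consumed by at most one child partition.
Wide dependency (shuffle dependency, `ShuffleDependency`): the data of one parent partition is consumed by multiple child partitions, so a shuffle is required.
Object dependencies and data dependencies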

@DeveloperApi
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}
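
The dependency type of an RDD can be checked through its `dependencies` field. The following sketch assumes a simple lineage of `map` followed by `reduceByKey`; the object name `DependencyExample` and the data are made up for this example.

```scala
import org.apache.spark.{OneToOneDependency, ShuffleDependency, SparkConf, SparkContext}

object DependencyExample {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark_Dependency")
    val sparkContext = new SparkContext(sparkConf)
    val words = sparkContext.makeRDD(List("a", "b", "a", "c"), 2)
    val pairs = words.map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)
    // map keeps partitions aligned one-to-one with the parent: a narrow dependency.
    println(pairs.dependencies.head.isInstanceOf[OneToOneDependency[_]])
    // reduceByKey redistributes the data by key: a shuffle (wide) dependency.
    println(counts.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]])
    sparkContext.stop()
  }
}
```

Both lines print `true` for this lineage.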

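Stages & Partitions & Tasks

A narrow dependency does not add tasks, and no stage boundary is needed.
A wide dependency adds tasks, and a stage boundary is needed.

Stage division (as seen in the source code):
Each shuffle dependency in the RDD lineage adds one stage.
Number of stages = number of shuffle dependencies + 1
Every job has exactly one ResultStage.

Task division:
Number of tasks in a stage = number of partitions of the last RDD in that stage.
The task name is the name of the current stage.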
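
These two rules can be checked on a small lineage. The sketch below is an illustration rather than Spark's scheduler logic: the helper `countShuffles` is hypothetical, it simply walks the lineage and counts shuffle dependencies, and it assumes a linear lineage (a shuffle shared by several branches would be counted more than once).

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object StageCountExample {

  // Hypothetical helper: counts the shuffle dependencies in a linear lineage.
  def countShuffles(rdd: RDD[_]): Int =
    rdd.dependencies.map { dep =>
      val own = if (dep.isInstanceOf[ShuffleDependency[_, _, _]]) 1 else 0
      own + countShuffles(dep.rdd)
    }.sum

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark_StageCount")
    val sparkContext = new SparkContext(sparkConf)
    val counts = sparkContext.makeRDD(1 to 16, 4)
      .map(n => (n % 3, n))
      .reduceByKey(_ + _, 2)
    // One shuffle dependency, so two stages: a ShuffleMapStage and the ResultStage.
    println("stages = " + (countShuffles(counts) + 1))
    // The ResultStage ends with `counts`, so it runs one task per partition of `counts`.
    println("tasks in final stage = " + counts.getNumPartitions)
    counts.collect()
    sparkContext.stop()
  }
}
```

Here the first stage ends with the 4-partition `map` result and therefore runs 4 tasks, while the final stage runs 2.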

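Failure Retry and the Blacklist Mechanism:

Besides choosing suitable Tasks to schedule, the scheduler also has to monitor the execution status of each Task. The component that deals with the outside world is the SchedulerBackend. After a Task has been submitted to an Executor and started, the Executor reports its status to the SchedulerBackend, which passes it on to the TaskScheduler. The TaskScheduler finds the TaskSetManager that owns the Task and notifies it, so the TaskSetManager knows whether the Task succeeded or failed. For a failed Task it records the number of failures: if that number has not yet exceeded the maximum number of retries, the Task is put back into the pool of Tasks waiting to be scheduled; otherwise the whole Application fails.
While recording the failures of a Task, the TaskSetManager also records the Executor Id and Host on which it last failed. When the Task is scheduled again, the blacklist mechanism keeps it from being scheduled onto the node where it last failed, which provides a degree of fault tolerance. The blacklist stores the Executor Id and Host of the last failure together with a "blacklisted" period, during which this Task should not be scheduled onto that node again.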
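
Both behaviours are controlled through configuration. The sketch below only builds a `SparkConf`; the object name `RetryConfigExample` is made up. One assumption to check against your Spark version: the key `spark.excludeOnFailure.enabled` is the Spark 3.1+ name of the blacklist switch, while older releases use `spark.blacklist.enabled`.

```scala
import org.apache.spark.SparkConf

object RetryConfigExample {

  // Builds a configuration with the retry limit and the exclusion (blacklist) switch set.
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("Spark_RetryConfig")
      // Number of failures of a single task tolerated before the job is given up (Spark's default is 4).
      .set("spark.task.maxFailures", "4")
      // Assumption: Spark 3.1+ key; older releases use spark.blacklist.enabled instead.
      .set("spark.excludeOnFailure.enabled", "true")

  def main(args: Array[String]): Unit = {
    val conf = buildConf()
    println(conf.get("spark.task.maxFailures"))
    println(conf.get("spark.excludeOnFailure.enabled"))
  }
}
```

No master is set here on purpose: these settings are meant for cluster deployments, so the master would normally be supplied at submit time.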