Spark Study Notes (3)

I. Action Operators

Executing an action operator triggers the execution of the entire job and collects the data from each partition into the driver's memory.

1. Common action operators

val data: RDD[Int] = context.makeRDD(List(1,2,3,4), 2)

// number of elements in the data source
val count: Long = data.count()

// first element of the data source
val first = data.first()

// the first n elements, collected into an array
val take = data.take(4)

// gather all partitions back to the driver as an array
val result: Array[Int] = data.collect()

// save as serialized objects
data.saveAsObjectFile("output")

// saveAsSequenceFile requires an RDD of key-value pairs
data.map(n => (n, 1)).saveAsSequenceFile("output")

2. aggregate

val data: RDD[Int] = context.makeRDD(List(1,2,3,4), 2)

// first function: intra-partition computation; second: inter-partition computation
val i: Int = data.aggregate(0)(_ + _, _ + _)

The difference between aggregateByKey and aggregate (see the sketch after this list):

1. aggregateByKey's initial value only participates in the intra-partition computation
2. aggregate's initial value participates in both the intra-partition and the inter-partition computation
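
A minimal sketch of the difference, assuming the four elements split into partitions [1,2] and [3,4]:

// aggregate: the initial value 10 joins each partition's computation AND the
// merge across partitions: (10+1+2) + (10+3+4) + 10 = 40
val sum: Int = data.aggregate(10)(_ + _, _ + _)

// aggregateByKey: the initial value only joins the intra-partition computation
// partition 0: 10+1+2 = 13, partition 1: 10+3+4 = 17, merged: 13 + 17 = 30
val kv: RDD[(String, Int)] = context.makeRDD(List(("a", 1), ("a", 2), ("a", 3), ("a", 4)), 2)
val byKey: RDD[(String, Int)] = kv.aggregateByKey(10)(_ + _, _ + _)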

3. fold

// use fold when the intra-partition and inter-partition rules are the same;
// here too the initial value participates in both computations
val i: Int = data.fold(1)(_ + _)
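
For example, with partitions [1,2] and [3,4], data.fold(1)(_ + _) computes (1 + 1 + 2) = 4 and (1 + 3 + 4) = 8 inside the partitions, then 1 + 4 + 8 = 13 across them.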

4. countByKey / countByValue

// count occurrences of each value
val countsByValue: collection.Map[Int, Long] = data.countByValue()

// count occurrences of each key (requires an RDD of key-value pairs)
val kvData: RDD[(Int, Int)] = data.map(n => (n, 1))
val countsByKey: collection.Map[Int, Long] = kvData.countByKey()

5. foreach

The rdd.foreach() operator runs on the executors, so its output comes back in no guaranteed order (see the sketch below).
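
A small sketch of the contrast; the exact interleaving on the executor side depends on scheduling:

val nums: RDD[Int] = context.makeRDD(List(1, 2, 3, 4), 2)

// collect() brings the partitions back to the driver in partition order
nums.collect().foreach(println)   // always prints 1 2 3 4

// foreach runs on the executors; partitions are processed concurrently, so order varies
nums.foreach(println)             // e.g. 3 1 4 2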

II. Lineage and Dependencies

The relationship between two adjacent RDDs is called a dependency; the chain of dependencies across multiple consecutive RDDs is called lineage.

val data: RDD[Int] = context.makeRDD(List(1,2,3,4), 2)

data.toDebugString // the RDD's lineage
data.dependencies  // the RDD's dependencies
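
A minimal sketch of how the lineage grows through a shuffle (the exact output format varies across Spark versions):

val pairs: RDD[(String, Int)] = data.map(n => (n.toString, 1))
println(pairs.toDebugString)    // MapPartitionsRDD <- ParallelCollectionRDD

val reduced: RDD[(String, Int)] = pairs.reduceByKey(_ + _)
println(reduced.toDebugString)  // a ShuffledRDD now heads the lineage
println(reduced.dependencies)   // contains a ShuffleDependency
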
--------------------------------------------------
1. When a partition of the new RDD depends on exactly one partition of the old RDD, the dependency is one-to-one: a narrow dependency.

// OneToOneDependency extends NarrowDependency
@DeveloperApi
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}

2. When a partition of the new RDD depends on multiple partitions of the old RDD, it is a shuffle dependency.

// ShuffleDependency extends the plain Dependency class
@DeveloperApi
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](...)
  extends Dependency[Product2[K, V]] {...}

Stage division, reading the source: total number of stages = 1 + number of shuffles.
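
For example, a job containing exactly one reduceByKey runs as two stages (a sketch; the boundary is visible in the Spark UI):

// 1 shuffle (reduceByKey) => 1 ShuffleMapStage + 1 ResultStage = 2 stages
context.makeRDD(List("a", "b", "a"), 2)
  .map((_, 1))        // narrow dependency: stays inside the current stage
  .reduceByKey(_ + _) // shuffle dependency: cuts a stage boundary
  .collect()          // action: triggers the 2-stage job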

rdd.collect()  // start from collect, which triggers the job
---------------------------------------------------
def collect(): Array[T] = withScope {
  // pass the current RDD object into runJob
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
---------------------------------------------------
Step into runJob:
---------------------------------------------------
def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  // submit the job
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
}
---------------------------------------------------
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  val waiter = new JobWaiter[U](this, jobId, partitions.size, resultHandler)
  // eventProcessLoop.post sends the JobSubmitted message; it eventually reaches handleJobSubmitted
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    Utils.cloneProperties(properties)))
  waiter
}
---------------------------------------------------
def post(event: E): Unit = {
  if (!stopped.get) {
    if (eventThread.isAlive) {  // the running eventThread consumes this queue and calls onReceive
      eventQueue.put(event)
    }
  }
}
---------------------------------------------------
private[spark] val eventThread = new Thread(name) {
  override def run(): Unit = {
    try {
      while (!stopped.get) {
        val event = eventQueue.take()  // take the next event from the queue
        try {
          onReceive(event)
        }
        ...
      }
    }
    ...
  }
}
---------------------------------------------------
override def onReceive(event: DAGSchedulerEvent): Unit = {
  doOnReceive(event)
}
---------------------------------------------------
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  // pattern match on the event type
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
      dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
   ...
}
---------------------------------------------------
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties): Unit = {
  var finalStage: ResultStage = null
  try {
    // create the stages, starting from the final ResultStage
    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  } 
}
---------------------------------------------------
private def createResultStage(
    rdd: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    jobId: Int,
    callSite: CallSite): ResultStage = {

  // collect the shuffle dependencies of the current rdd
  val (shuffleDeps, resourceProfiles) = getShuffleDependenciesAndResourceProfiles(rdd)
  val resourceProfile = mergeResourceProfilesForStage(resourceProfiles)
  checkBarrierStageWithDynamicAllocation(rdd)
  checkBarrierStageWithNumSlots(rdd, resourceProfile)
  checkBarrierStageWithRDDChainPattern(rdd, partitions.toSet.size)

  // get or create the parent stages
  val parents = getOrCreateParentStages(shuffleDeps, jobId)
  val id = nextStageId.getAndIncrement()

  // a ResultStage is always created, no matter what
  val stage = new ResultStage(id, rdd, func, partitions, parents, jobId,
    callSite, resourceProfile.id)
  stageIdToStage(id) = stage
  updateJobIdStageIdMaps(jobId, stage)
  stage
}
---------------------------------------------------
private[scheduler] def getShuffleDependenciesAndResourceProfiles(
      rdd: RDD[_]): (HashSet[ShuffleDependency[_, _, _]], HashSet[ResourceProfile]) = {
    val parents = new HashSet[ShuffleDependency[_, _, _]]
    val resourceProfiles = new HashSet[ResourceProfile]
    val visited = new HashSet[RDD[_]]
    val waitingForVisit = new ListBuffer[RDD[_]]
    waitingForVisit += rdd  // seed the traversal with the final rdd
    while (waitingForVisit.nonEmpty) {
      // pop the next RDD to visit
      val toVisit = waitingForVisit.remove(0)
      if (!visited(toVisit)) {
        visited += toVisit
        Option(toVisit.getResourceProfile).foreach(resourceProfiles += _)
        toVisit.dependencies.foreach {
          // pattern match: shuffle dependencies go into parents; narrow ones keep being traversed
          case shuffleDep: ShuffleDependency[_, _, _] => parents += shuffleDep
          case dependency => waitingForVisit.prepend(dependency.rdd)
        }
      }
    }
    (parents, resourceProfiles)
  }
---------------------------------------------------
private def getOrCreateParentStages(shuffleDeps: HashSet[ShuffleDependency[_, _, _]],
    firstJobId: Int): List[Stage] = {
  // create (or reuse) a ShuffleMapStage for every shuffle dependency
  shuffleDeps.map { shuffleDep =>
    getOrCreateShuffleMapStage(shuffleDep, firstJobId)
  }.toList
}

Task division, reading the source. Conclusion first: the number of tasks in a stage = the number of partitions of that stage's last RDD.
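
For example (a sketch, assuming the default partitioner keeps the parent's 2 partitions):

val source = context.makeRDD(List("a", "b", "a", "b"), 2)
val reduced = source.map((_, 1)).reduceByKey(_ + _)
// ShuffleMapStage: the last RDD before the shuffle has 2 partitions -> 2 tasks
// ResultStage:     reduced also has 2 partitions                    -> 2 tasks
reduced.collect()  // 4 tasks in total for this job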

private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties): Unit = {
  var finalStage: ResultStage = null
  try {
    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    ...
  }
  ...
  // task division starts here
  submitStage(finalStage)
}
-------------------------------------------------------
private def submitStage(stage: Stage): Unit = {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      // check whether any parent stages are still missing
      val missing = getMissingParentStages(stage).sortBy(_.id)
      if (missing.isEmpty) {
        // all parents are ready: split this stage into tasks
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          submitStage(parent)
        }
      }
    }
  } 
}
-------------------------------------------------------
private def submitMissingTasks(stage: Stage, jobId: Int): Unit = {
  // the partitions still missing decide how many tasks are created
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()

  val tasks: Seq[Task[_]] = try {
      stage match {
        case stage: ShuffleMapStage =>
          stage.pendingPartitions.clear()
          partitionsToCompute.map { id =>
            val locs = taskIdToLocations(id)
            val part = partitions(id)
            stage.pendingPartitions += id
            new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
              taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
              Option(sc.applicationId), sc.applicationAttemptId, stage.rdd.isBarrier())
          }
        
        // the number of tasks created here equals the number of elements in partitionsToCompute
        case stage: ResultStage =>
          partitionsToCompute.map { id =>
            val p: Int = stage.partitions(id)
            val part = partitions(p)
            val locs = taskIdToLocations(id)
            new ResultTask(stage.id, stage.latestInfo.attemptNumber,
              taskBinary, part, locs, id, properties, serializedTaskMetrics,
              Option(jobId), Option(sc.applicationId), sc.applicationAttemptId,
              stage.rdd.isBarrier())
          }
      }
    } 
}
---------------------------------------------
When stage is a ResultStage:

override def findMissingPartitions(): Seq[Int] = {
  val job = activeJob.get
  // numPartitions is the number of partitions of the job's final RDD
  (0 until job.numPartitions).filter(id => !job.finished(id))
}
---------------------------------------------
When stage is a ShuffleMapStage:

override def findMissingPartitions(): Seq[Int] = {
  mapOutputTrackerMaster
    .findMissingPartitions(shuffleDep.shuffleId)
    .getOrElse(0 until numPartitions)
}

III. Persistence

It may look as if an RDD object can simply be reused, but in reality it cannot: an RDD does not store its data, so each reuse recomputes the whole lineage. That is why we persist the data.

1. The cache method

val value: RDD[(String, Int)] = data.map {
  case (a, b) => (a, b + 1)
}
value.cache()  // cache the RDD's data in memory
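
A sketch of what cache buys us, using the same pair RDD as above; the println side effect shows when partitions are actually recomputed (assuming the cached data fits in memory):

val mapped: RDD[(String, Int)] = data.map { case (a, b) =>
  println(s"computing $a")  // marks a recomputation
  (a, b + 1)
}
mapped.cache()

mapped.collect()  // "computing ..." is printed: partitions are computed and cached
mapped.collect()  // nothing printed: results come straight from the cache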

2. persist: lets you choose whether the data is persisted in memory or on disk

val value: RDD[(String, Int)] = data.map {
  case (a, b) => (a, b + 1)
}
value.persist(StorageLevel.DISK_ONLY)  // persist the RDD to disk

3. checkpoint: sets a checkpoint; generally used together with cache

// the checkpoint save path must be set first, usually on HDFS
context.setCheckpointDir("cp")

val value: RDD[(String, Int)] = data.map(word => {
 (word, 1)
})
value.checkpoint()   // mark the RDD for checkpointing; files are written when an action runs
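
A sketch showing that the checkpoint cuts the lineage once an action has materialized it:

value.cache()       // pair with cache so the RDD is not computed a second time for the checkpoint
value.checkpoint()
println(value.toDebugString)  // full lineage back to the data source

value.collect()               // the checkpoint files are written during this job
println(value.toDebugString)  // lineage now starts from a ReliableCheckpointRDD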

※ Differences between the three modes

cache: stores the data in memory; it adds a step to the lineage but does not change the original lineage.

persist: stores the data in memory and/or on disk depending on the StorageLevel; the temporary files are deleted once the job finishes.

checkpoint: writes the data to disk (usually HDFS), cuts the lineage, and the files remain after the job finishes.

IV. Custom Partitioners

val base: RDD[String] = context.makeRDD(List("hello world", "hello scala"), numSlices = 1)

// partitionBy needs a key-value RDD, so split into (word, 1) pairs first,
// then apply the custom partitioner
val pairs: RDD[(String, Int)] = base.flatMap(_.split(" ")).map((_, 1))
val partitionData: RDD[(String, Int)] = pairs.partitionBy(new MyPartitioner)
partitionData.saveAsTextFile("outPut")
context.stop()

class MyPartitioner extends Partitioner {
  // the number of partitions
  override def numPartitions: Int = 2

  // return the partition index (starting from 0) for the given key
  override def getPartition(key: Any): Int = {
    if (key == "hello") {
      0
    } else {
      1
    }
  }
}
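
To verify where each key landed, a quick sketch with mapPartitionsWithIndex:

// print (partition index, key) pairs to confirm the routing
partitionData
  .mapPartitionsWithIndex((idx, iter) => iter.map { case (k, _) => (idx, k) })
  .collect()
  .foreach(println)  // "hello" lands in partition 0, everything else in partition 1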

V. Accumulators: Distributed Shared Write-Only Variables

If we write the accumulation in the ordinary way, the result will be 0, because the addition happens distributed: each executor adds into its own copy of sum, and those copies never come back to the driver.

val data: RDD[Int] = context.makeRDD(List(1,2,3,4))
var sum = 0
data.foreach(num => sum += num)  // runs on the executors; the driver's sum is untouched
println(sum)                     // prints 0

So we need the accumulator Spark provides:

val data: RDD[Int] = context.makeRDD(List(1,2,3,4))

// declare an accumulator; each call to an action operator runs a job,
// so accumulator updates are generally placed inside action operators
val counter: LongAccumulator = context.longAccumulator("counter")
data.foreach(num => counter.add(num))
println(counter.value)
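
The comment above matters: if add is called inside a transformation instead, every action that re-runs the lineage repeats the addition (a sketch):

val acc: LongAccumulator = context.longAccumulator("sum")
val mapped: RDD[Int] = data.map { num => acc.add(num); num }

mapped.collect()  // job 1: acc.value == 10
mapped.collect()  // job 2 recomputes the map: acc.value == 20, double-counted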

Defining a custom accumulator:

object Test02 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("RDD")
    val context = new SparkContext(conf)
    val data: RDD[String] = context.makeRDD(List("hello","world", "hello","scala"))

    val counter = new MyCounter()
    context.register(counter, "counter")
    data.foreach(word => counter.add(word))
    println(counter.value)

    context.stop()
  }

  // a custom accumulator extends AccumulatorV2 with two type parameters:
  // IN (the type passed to add) and OUT (the type of the stored result)
  class MyCounter extends AccumulatorV2[String, mutable.Map[String, Long]]{
    private var wCount = mutable.Map[String, Long]()

    // whether the accumulator is in its zero state (an empty map is the zero state)
    override def isZero: Boolean = {
      wCount.isEmpty
    }
    // create a copy of this accumulator
    override def copy(): AccumulatorV2[String, mutable.Map[String, Long]] = {
      new MyCounter
    }
    // reset the counter
    override def reset(): Unit = {
      wCount.clear()
    }

    // accumulate one input value
    override def add(v: String): Unit = {
      val newCnt: Long = wCount.getOrElse(v, 0L) + 1
      wCount.update(v, newCnt)
    }

    // merge another (partial) accumulator into this one
    override def merge(other: AccumulatorV2[String, mutable.Map[String, Long]]): Unit = {
      val map1 = this.wCount
      val map2 = other.value
      map2.foreach(word=> {
        val newCont: Long = map1.getOrElse(word._1, 0L) + word._2
        wCount.update(word._1, newCont)
      })
    }

    // return the accumulated value
    override def value: mutable.Map[String, Long] = {
      wCount
    }
  }
}

VI. Broadcast Variables: Distributed Shared Read-Only Variables

val data: RDD[String] = context.makeRDD(List("hello","world", "hello","scala"))

// declare a broadcast variable; Spark refuses to broadcast an RDD directly,
// so collect it to a local array first
val value: Broadcast[Array[String]] = context.broadcast(data.collect())
println(value.value.mkString(","))
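
A typical use is shipping a small lookup table to every executor once, instead of once per task; the dictionary here is a hypothetical example:

// hypothetical small lookup table, broadcast once per executor
val dict: Broadcast[Map[String, Int]] = context.broadcast(Map("hello" -> 1, "scala" -> 2))

val looked: RDD[(String, Int)] = data.map(word => (word, dict.value.getOrElse(word, 0)))
looked.collect().foreach(println)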
