Spark Persistence Explained

Using persistence

package org.example

import org.apache.spark.sql.SparkSession

object CacheTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[2]")
      .appName("spark-cache-test")
      .getOrCreate()

    val sc = spark.sparkContext
    val l = Seq("hello spark", "hello scala")
    val rdd = sc.parallelize(l)
    rdd.collect()

    val flatmapRdd = rdd.flatMap(_.split(" "))
    // Prints on every invocation, so the console output counts how many times map runs.
    def fun(x: String) = {
      println("fun is call")
      x -> 1
    }

    val mapRdd = flatmapRdd.map(fun)
    println(mapRdd.toDebugString)
//    mapRdd.cache()
    println(mapRdd.toDebugString)

    val reduceRdd = mapRdd.reduceByKey((x, y) => x + y)
    reduceRdd.collect()

    println(mapRdd.toDebugString)
    val groupRdd = mapRdd.groupByKey()
    groupRdd.collect()

    Thread.sleep(10000000) // keep the application alive so the Spark UI can be inspected
  }
}


/**
(2) MapPartitionsRDD[2] at map at CacheTest.scala:25 []
 |  MapPartitionsRDD[1] at flatMap at CacheTest.scala:18 []
 |  ParallelCollectionRDD[0] at parallelize at CacheTest.scala:15 []
(2) MapPartitionsRDD[2] at map at CacheTest.scala:25 []
 |  MapPartitionsRDD[1] at flatMap at CacheTest.scala:18 []
 |  ParallelCollectionRDD[0] at parallelize at CacheTest.scala:15 []
fun is call
fun is call
fun is call
fun is call
(2) MapPartitionsRDD[2] at map at CacheTest.scala:25 []
 |  MapPartitionsRDD[1] at flatMap at CacheTest.scala:18 []
 |  ParallelCollectionRDD[0] at parallelize at CacheTest.scala:15 []
fun is call
fun is call
fun is call
fun is call
Without persistence, fun runs 8 times (4 during reduceRdd.collect() and 4 more during groupRdd.collect()), and mapRdd's debug string never changes.
*/

package org.example

import org.apache.spark.sql.SparkSession

object CacheTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[2]")
      .appName("spark-cache-test")
      .getOrCreate()

    val sc = spark.sparkContext
    val l = Seq("hello spark", "hello scala")
    val rdd = sc.parallelize(l)
    rdd.collect()

    val flatmapRdd = rdd.flatMap(_.split(" "))
    // Prints on every invocation, so the console output counts how many times map runs.
    def fun(x: String) = {
      println("fun is call")
      x -> 1
    }

    val mapRdd = flatmapRdd.map(fun)
    println(mapRdd.toDebugString)
    mapRdd.cache()
    println(mapRdd.toDebugString)

    val reduceRdd = mapRdd.reduceByKey((x, y) => x + y)
    reduceRdd.collect()

    println(mapRdd.toDebugString)
    val groupRdd = mapRdd.groupByKey()
    groupRdd.collect()

    Thread.sleep(10000000) // keep the application alive so the Spark UI can be inspected
  }
}
/**
(2) MapPartitionsRDD[2] at map at CacheTest.scala:25 []
 |  MapPartitionsRDD[1] at flatMap at CacheTest.scala:18 []
 |  ParallelCollectionRDD[0] at parallelize at CacheTest.scala:15 []
(2) MapPartitionsRDD[2] at map at CacheTest.scala:25 [Memory Deserialized 1x Replicated]
 |  MapPartitionsRDD[1] at flatMap at CacheTest.scala:18 [Memory Deserialized 1x Replicated]
 |  ParallelCollectionRDD[0] at parallelize at CacheTest.scala:15 [Memory Deserialized 1x Replicated]
fun is call
fun is call
fun is call
fun is call
(2) MapPartitionsRDD[2] at map at CacheTest.scala:25 [Memory Deserialized 1x Replicated]
 |       CachedPartitions: 2; MemorySize: 400.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
 |  MapPartitionsRDD[1] at flatMap at CacheTest.scala:18 [Memory Deserialized 1x Replicated]
 |  ParallelCollectionRDD[0] at parallelize at CacheTest.scala:15 [Memory Deserialized 1x Replicated]
With cache(), fun runs only 4 times.
After cache(), each line of the debug string gains [Memory Deserialized 1x Replicated]: this is the storage level, i.e. the RDD has merely been tagged for persistence.
After the first action, the debug string also shows 'CachedPartitions: 2; MemorySize: 400.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B', meaning later jobs read the cached partitions instead of recomputing them.

*/

package org.example

import org.apache.spark.sql.SparkSession

object CacheTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[2]")
      .appName("spark-cache-test")
      .getOrCreate()

    val sc = spark.sparkContext
    val l = Seq("hello spark", "hello scala")
    val rdd = sc.parallelize(l)
    rdd.collect()

    val flatmapRdd = rdd.flatMap(_.split(" "))
    // Prints on every invocation, so the console output counts how many times map runs.
    def fun(x: String) = {
      println("fun is call")
      x -> 1
    }

    val mapRdd = flatmapRdd.map(fun)
    println(mapRdd.toDebugString)
    sc.setCheckpointDir("./checkpoint")
    mapRdd.checkpoint()
    println(mapRdd.toDebugString)

    val reduceRdd = mapRdd.reduceByKey((x, y) => x + y)
    reduceRdd.collect()

    println(mapRdd.toDebugString)
    val groupRdd = mapRdd.groupByKey()
    groupRdd.collect()

    Thread.sleep(10000000) // keep the application alive so the Spark UI can be inspected
  }
}

/**
(2) MapPartitionsRDD[2] at map at CacheTest.scala:25 []
 |  MapPartitionsRDD[1] at flatMap at CacheTest.scala:18 []
 |  ParallelCollectionRDD[0] at parallelize at CacheTest.scala:15 []
(2) MapPartitionsRDD[2] at map at CacheTest.scala:25 []
 |  MapPartitionsRDD[1] at flatMap at CacheTest.scala:18 []
 |  ParallelCollectionRDD[0] at parallelize at CacheTest.scala:15 []
fun is call
fun is call
fun is call
fun is call
fun is call
fun is call
fun is call
fun is call
(2) MapPartitionsRDD[2] at map at CacheTest.scala:25 []
 |  ReliableCheckpointRDD[4] at collect at CacheTest.scala:32 []
With checkpoint() alone, the debug string printed before the action is unchanged.
fun runs 8 times: 4 during reduceRdd.collect() and 4 more in the extra job that writes the checkpoint.
So checkpointing recomputes the RDD; combine it with cache()/persist() to avoid the double computation.
Afterwards the lineage changes: it is replaced by 'ReliableCheckpointRDD[4] at collect at CacheTest.scala:32 []',
which shows that checkpoint truncates the lineage and later computation reads from the checkpoint data.
*/

package org.example

import org.apache.spark.sql.SparkSession

object CacheTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[2]")
      .appName("spark-cache-test")
      .getOrCreate()

    val sc = spark.sparkContext
    val l = Seq("hello spark", "hello scala")
    val rdd = sc.parallelize(l)
    rdd.collect()

    val flatmapRdd = rdd.flatMap(_.split(" "))
    // Prints on every invocation, so the console output counts how many times map runs.
    def fun(x: String) = {
      println("fun is call")
      x -> 1
    }

    val mapRdd = flatmapRdd.map(fun)
    println(mapRdd.toDebugString)
    mapRdd.cache()
    sc.setCheckpointDir("./checkpoint")
    mapRdd.checkpoint()
    println(mapRdd.toDebugString)

    val reduceRdd = mapRdd.reduceByKey((x, y) => x + y)
    reduceRdd.collect()

    println(mapRdd.toDebugString)
    val groupRdd = mapRdd.groupByKey()
    groupRdd.collect()

    Thread.sleep(10000000) // keep the application alive so the Spark UI can be inspected
  }
}
/**
(2) MapPartitionsRDD[2] at map at CacheTest.scala:25 []
 |  MapPartitionsRDD[1] at flatMap at CacheTest.scala:18 []
 |  ParallelCollectionRDD[0] at parallelize at CacheTest.scala:15 []
(2) MapPartitionsRDD[2] at map at CacheTest.scala:25 [Memory Deserialized 1x Replicated]
 |  MapPartitionsRDD[1] at flatMap at CacheTest.scala:18 [Memory Deserialized 1x Replicated]
 |  ParallelCollectionRDD[0] at parallelize at CacheTest.scala:15 [Memory Deserialized 1x Replicated]
fun is call
fun is call
fun is call
fun is call
(2) MapPartitionsRDD[2] at map at CacheTest.scala:25 [Memory Deserialized 1x Replicated]
 |       CachedPartitions: 2; MemorySize: 400.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
 |  ReliableCheckpointRDD[4] at collect at CacheTest.scala:33 [Memory Deserialized 1x Replicated]
With cache()/persist() before checkpoint(), fun runs only 4 times: the checkpoint job reads the data from the cache instead of recomputing it.
*/

checkpoint produced files on the local filesystem; since the parallelism is 2, there are two data files, part-00000 and part-00001.
The files still exist after the application stops, so delete them manually once they are no longer needed (a cleanup sketch follows).
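A minimal cleanup sketch, assuming the checkpoint directory sits on the filesystem configured for the SparkContext (the path matches the setCheckpointDir call above):

import org.apache.hadoop.fs.{FileSystem, Path}

// Recursively delete the local checkpoint directory once its data is no longer needed.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path("./checkpoint"), true) // true = recursive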

Comparing the persistence operators

cache, persist, and checkpoint are the three persistence operators.
cache: essentially persist, caching the data in memory.
persist: lets you choose a storage level (memory, disk, or a combination); it does not truncate lineage, and the cached data is removed automatically when the application finishes.
checkpoint: writes the data to disk, truncates the lineage, and its data files are not deleted when the application finishes.
Like transformations, all three are lazy: calling them only marks the RDD, and persistence actually starts once an action triggers computation.
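A minimal sketch of choosing a storage level explicitly (the words RDD below is illustrative; StorageLevel comes from org.apache.spark.storage):

import org.apache.spark.storage.StorageLevel

val words = sc.parallelize(Seq("hello spark", "hello scala")).flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_AND_DISK) // spill partitions to disk when memory runs short
words.count()     // first action computes and caches the data
words.count()     // served from the cache, no recomputation
words.unpersist() // drop the cached blocks explicitly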

The cache operator

/**
 * Persist this RDD with the default storage level (`MEMORY_ONLY`).
 */
def cache(): this.type = persist()

/**
 * Persist this RDD with the default storage level (`MEMORY_ONLY`).
 */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)


As the source shows, cache is simply persist with the default storage level, MEMORY_ONLY.

The persist operator

persist first checks the RDD's own storage level: if it is StorageLevel.NONE (the default), the RDD has never been cached, so it calls sc.persistRDD() to register the RDD and then updates the RDD's own storageLevel.
sc.persistRDD() adds the RDD to persistentRdds, a map in SparkContext that records every persisted RDD so it can be looked up directly later.
When persist() returns, nothing has been computed; the RDD has merely been tagged with a storage level.
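A condensed sketch of the persist helper just described (simplified from the Spark source; the overwrite check is trimmed):

private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
  if (storageLevel == StorageLevel.NONE) {
    // First persist call: register the RDD with the SparkContext.
    sc.cleaner.foreach(_.registerRDDForCleanup(this))
    sc.persistRDD(this)
  }
  storageLevel = newLevel // from now on iterator() will try the cache first
  this
}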

When an action fires, the RDD's iterator is invoked to produce data, and only then does the real persistence happen.
Since storageLevel is now a real level rather than NONE, iterator() goes through getOrCompute(split, context).
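A condensed sketch of the dispatch in RDD.iterator (simplified from the Spark source):

final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    getOrCompute(split, context)            // persisted: go through the block manager
  } else {
    computeOrReadCheckpoint(split, context) // not persisted: compute, or read a checkpoint
  }
}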
getOrCompute builds a blockId for the partition and calls SparkEnv.get.blockManager.getOrElseUpdate.
getOrElseUpdate first tries get[T](blockId) to read the block from the cache; on a miss it calls doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) to compute and store the iterator.
doPutIterator delegates to doPut, passing the block-writing logic as the function literal {info => …}; inside doPut, val res = putBody(putBlockInfo) invokes exactly that literal.
Since we are persisting to memory, the body calls memoryStore.putIteratorAsValues(blockId, iterator(), classTag) to cache the data in the MemoryStore.
doPutIterator tracks the outcome in var iteratorFromFailedMemoryStorePut: Option[PartiallyUnrolledIterator[T]] = None.
If everything fits in memory it stays None; otherwise it ends up holding an iterator over the data that could not be cached.
Back in getOrElseUpdate: if doPutIterator returns None, the block was fully cached, so getLocalValues(blockId).getOrElse(...) reads the values back and the result is Left(blockResult).
If it returns Some(iter), caching did not fully succeed and the remaining iterator is returned as Right(iter).
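A condensed sketch of how getOrCompute consumes that Left/Right result (simplified; makeIterator recomputes the partition on a cache miss):

SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, makeIterator) match {
  case Left(blockResult) =>
    // Fully cached: serve the values from the block manager.
    blockResult.data.asInstanceOf[Iterator[T]]
  case Right(iter) =>
    // Caching failed (e.g. not enough memory): fall back to the plain iterator.
    iter
}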

The checkpoint operator

1. sc.setCheckpointDir("./checkpoint")

This creates the directory (here checkpoint/5544d86b-5bd7-492c-8986-5bd0b671f7eb) and stores its path in the checkpointDir variable.
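A condensed sketch of setCheckpointDir (simplified from the Spark source; Path is Hadoop's, UUID is java.util.UUID):

def setCheckpointDir(directory: String): Unit = {
  checkpointDir = Option(directory).map { dir =>
    // A random UUID subdirectory keeps different applications from colliding.
    val path = new Path(dir, UUID.randomUUID().toString)
    val fs = path.getFileSystem(hadoopConfiguration)
    fs.mkdirs(path)
    fs.getFileStatus(path).getPath.toString
  }
}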

2. mapRdd.checkpoint()

This creates a ReliableRDDCheckpointData object and assigns it to the RDD's checkpointData field. Its most important member is cpDir, which here points at the rdd-2 subdirectory.
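A condensed sketch of RDD.checkpoint (simplified from the Spark source):

def checkpoint(): Unit = RDDCheckpointData.synchronized {
  if (context.checkpointDir.isEmpty) {
    throw new SparkException("Checkpoint directory has not been set in the SparkContext")
  } else if (checkpointData.isEmpty) {
    // Only a marker at this point; nothing is written until an action runs.
    checkpointData = Some(new ReliableRDDCheckpointData(this))
  }
}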

3. reduceRdd.collect()

The collect action is what actually triggers the checkpoint; the core is runJob().
Inside runJob, dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get) schedules the job and starts computing the RDD.
rdd.doCheckpoint() is the checkpoint part, and it is what we examine next.
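A condensed sketch of the tail of SparkContext.runJob (simplified):

// Run the job first, then let every RDD marked for checkpointing materialize it.
dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
rdd.doCheckpoint()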
doCheckpoint walks the lineage upward from the final RDD, finds the RDDs whose checkpointData field was assigned earlier, and checkpoints them.
It flips cpState to CheckpointingInProgress to mark a checkpoint in flight, and to Checkpointed once it finishes; the cpState field prevents the same RDD from being checkpointed twice.
val newRDD = doCheckpoint() is the concrete checkpointing step.
The concrete implementation lives in ReliableRDDCheckpointData:
val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)

sc.runJob(originalRDD, writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)
submits a second job, which is why the RDD is computed one extra time during checkpointing.
writePartitionToCheckpointFile writes each partition's data file, and writePartitionerToCheckpointDir writes the partitioner into the checkpoint directory.
val newRDD = new ReliableCheckpointRDD[T](sc, checkpointDirPath.toString, originalRDD.partitioner)
creates a new RDD that reads straight from the checkpoint directory (no recomputation) and replaces the RDD that was checkpointed.
newRDD is saved in the cpRDD field and cpState is updated to Checkpointed.
rdd.markCheckpointed() then truncates the lineage: it clears the RDD's dependencies and partitions.
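A condensed sketch of markCheckpointed (simplified from the Spark source):

private[spark] def markCheckpointed(): Unit = {
  // Once the data lives in checkpoint files, the old lineage is no longer
  // needed for recovery, so forget dependencies and cached partition metadata.
  clearDependencies()
  partitions_ = null
  deps = null
}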

4. Reusing the checkpoint

computeOrReadCheckpoint checks whether the RDD has already been checkpointed; if so, it reads from the checkpointed RDD instead of computing.
firstParent returns that checkpointed RDD, because truncating the lineage left the new ReliableCheckpointRDD at the head of the dependency chain.
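A condensed sketch of computeOrReadCheckpoint (simplified from the Spark source):

private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] = {
  if (isCheckpointedAndMaterialized) {
    // firstParent is now the ReliableCheckpointRDD, so this reads the saved files.
    firstParent[T].iterator(split, context)
  } else {
    compute(split, context)
  }
}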
