RDD -- Other Operations

RDD Caching

Level                   Description
MEMORY_ONLY             The default. Data is cached in memory as deserialized Java objects.
MEMORY_AND_DISK         Stored in memory first; partitions that do not fit in memory spill to disk.
MEMORY_ONLY_SER         Like MEMORY_ONLY, data stays in memory, but as serialized Java objects (more compact, at the cost of extra CPU to deserialize) rather than deserialized objects.
MEMORY_AND_DISK_SER     Same placement as MEMORY_AND_DISK, but the data is stored in serialized form.
DISK_ONLY               Data is stored on disk only.
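
As a quick, illustrative sketch of how a level from this table is applied (the RDD below is made up for demonstration): the level is passed to persist(), can be read back with getStorageLevel, and only takes effect when the first action runs.

    import org.apache.spark.storage.StorageLevel

    val data = sc.parallelize(1 to 1000000)
    // MEMORY_AND_DISK_SER: smaller memory footprint (serialized), spills to disk if needed
    data.persist(StorageLevel.MEMORY_AND_DISK_SER)
    println(data.getStorageLevel)   // inspect the level that was assigned
    data.count()                    // caching is lazy; the first action materializes the cache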

persist
persist() source:
  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
persist(newLevel: StorageLevel) source:
  /**
   * Set this RDD's storage level to persist its values across operations after the first time
   * it is computed. This can only be used to assign a new storage level if the RDD does not
   * have a storage level set yet. Local checkpointing is an exception.
   */
  def persist(newLevel: StorageLevel): this.type = {
    if (isLocallyCheckpointed) {
      // This means the user previously called localCheckpoint(), which should have already
      // marked this RDD for persisting. Here we should override the old storage level with
      // one that is explicitly requested by the user (after adapting it to use disk).
      persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
    } else {
      persist(newLevel, allowOverride = false)
    }
  }
StorageLevel source (the constructor flags are, in order: useDisk, useMemory, useOffHeap, deserialized, plus an optional replication count):

/**
 * Various [[org.apache.spark.storage.StorageLevel]] defined and utility functions for creating
 * new storage levels.
 */
object StorageLevel {
  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(true, true, true, false, 1)
}
Example:
    val a = sc.parallelize(1 to 100)
    a.cache()
    a.persist()
    a.persist(StorageLevel.MEMORY_ONLY)
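
One behavior worth noting (a small sketch; the exact exception message may differ across Spark versions): once a storage level has been assigned it cannot be changed, so the RDD has to be unpersisted before a different level can be set.

    val b = sc.parallelize(1 to 100)
    b.persist(StorageLevel.MEMORY_ONLY)
    // b.persist(StorageLevel.MEMORY_AND_DISK)   // would throw UnsupportedOperationException
    b.unpersist()                                // drop the cached blocks first
    b.persist(StorageLevel.MEMORY_AND_DISK)      // now a new level can be assigned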

cache() is essentially persist():

  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def cache(): this.type = persist()

Checkpoint fault-tolerance mechanism

Checkpointing complements lineage-based fault tolerance. When the lineage becomes very long, recovering through it becomes expensive; once an RDD is checkpointed, partitions lost downstream can be recomputed from the checkpointed data instead of replaying the full lineage, which reduces recovery cost. Official guidance: checkpoint() must be called before any job has been executed on this RDD, and the RDD should be persisted in memory, otherwise writing the checkpoint file will require recomputation.
Source:
  /**
   * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
   * directory set with `SparkContext#setCheckpointDir` and all references to its parent
   * RDDs will be removed. This function must be called before any job has been
   * executed on this RDD. It is strongly recommended that this RDD is persisted in
   * memory, otherwise saving it on a file will require recomputation.
   */
  def checkpoint(): Unit = RDDCheckpointData.synchronized {
    // NOTE: we use a global lock here due to complexities downstream with ensuring
    // children RDD partitions point to the correct parent partitions. In the future
    // we should revisit this consideration.
    if (context.checkpointDir.isEmpty) {
      throw new SparkException("Checkpoint directory has not been set in the SparkContext")
    } else if (checkpointData.isEmpty) {
      checkpointData = Some(new ReliableRDDCheckpointData(this))
    }
  }
Example:
    val a = sc.parallelize(1 to 100)
    sc.setCheckpointDir("hdfs://192.168.72.2:8020/checkpoint/20190521")
    a.persist(StorageLevel.MEMORY_ONLY)
    a.checkpoint()
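
A short follow-up sketch (illustrative): checkpoint() only marks the RDD, so an action is needed to actually write the data, after which the checkpoint can be verified.

    a.count()                     // the first action after checkpoint() writes the data to HDFS
    println(a.isCheckpointed)     // true once the checkpoint has been materialized
    println(a.getCheckpointFile)  // Some(path of the checkpointed data under the checkpoint dir)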

Broadcast variables (broadcast)

For shared lookup data, a broadcast variable caches one copy of the data per machine rather than shipping a copy with every task, which reduces network and memory overhead. Typical use case: joining a large table with a small table, where the small table's data is distributed once to every machine.
Source:
  /**
   * Broadcast a read-only variable to the cluster, returning a
   * [[org.apache.spark.broadcast.Broadcast]] object for reading it in distributed functions.
   * The variable will be sent to each cluster only once.
   *
   * @param value value to broadcast to the Spark nodes
   * @return `Broadcast` object, a read-only variable cached on each machine
   */
  def broadcast[T: ClassTag](value: T): Broadcast[T] = {
    assertNotStopped()
    require(!classOf[RDD[_]].isAssignableFrom(classTag[T].runtimeClass),
      "Can not directly broadcast RDDs; instead, call collect() and broadcast the result.")
    val bc = env.broadcastManager.newBroadcast[T](value, isLocal)
    val callSite = getCallSite
    logInfo("Created broadcast " + bc.id + " from " + callSite.shortForm)
    cleaner.foreach(_.registerBroadcastForCleanup(bc))
    bc
  }
Example:
Suppose a region has 100,000 free WiFi hotspots, 1,000,000 users, and 1 billion connection records.
The 100,000 hotspots form table A: WIFI_ID, POS
The 1 billion connection records form table B: USER_ID, WIFI_ID, TIME, MESSAGE
The 1,000,000 users form table C: USER_ID, USER_NAME
The goal is a result table D: USER_NAME, POS, TIME, MESSAGE
    val A = sc.textFile("hdfs://192.168.72.2:8020/wifi").map(line => {
      val fields = line.split("\\|")
      val wifi = fields(0)
      val pos = fields(1)
      (wifi, pos)
    })
    val B = sc.textFile("hdfs://192.168.72.2:8020/connectionInfo").map(line => {
      val fields = line.split("\\|")
      val userID = fields(0)
      val wifiID = fields(1)
      val time = fields(2)
      val message = fields(3)
      (userID, wifiID, time, message)
    })
    val C = sc.textFile("hdfs://192.168.72.2:8020/user").map(line => {
      val fields = line.split("\\|")
      val userID = fields(0)
      val userName = fields(1)
      (userID, userName)
    })
    // Collect the small tables into immutable local arrays and broadcast them to the tasks
    val wifiPosBroadcast = A.collect()
    val wifiPos = sc.broadcast(wifiPosBroadcast)

    val userBroadcast = C.collect()
    val user = sc.broadcast(userBroadcast)
    
    def mapPartitionFunc(iter: Iterator[(String, String, String, String)]): Iterator[(String, String, String, String)] = {
      // Build lookup maps from the broadcast values once per partition
      val wifiMap = wifiPos.value.toMap  // WIFI_ID -> POS
      val userMap = user.value.toMap     // USER_ID -> USER_NAME
      // Map-side join: enrich each connection record with POS and USER_NAME
      for {
        (userID, wifiID, time, message) <- iter
        pos <- wifiMap.get(wifiID)
        userName <- userMap.get(userID)
      } yield (userName, pos, time, message)
    }
    val D = B.mapPartitions(mapPartitionFunc)
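
Because the lookup maps come from broadcast variables, this is a map-side join: the billion-row table B is never shuffled; only the two small tables are shipped, once per machine. A small sketch of the follow-up steps (the output path is hypothetical):

    D.saveAsTextFile("hdfs://192.168.72.2:8020/result/D")  // hypothetical output path
    wifiPos.unpersist()  // remove the broadcast data from executors; it is re-sent if used again
    user.destroy()       // permanently release the broadcast data on driver and executors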

Accumulators

longAccumulator creates and registers an accumulator that starts at 0 and accumulates inputs via `add`; its value is read via `value`.
Source:
  /**
   * Create and register a long accumulator, which starts with 0 and accumulates inputs by `add`.
   */
  def longAccumulator(name: String): LongAccumulator = {
    val acc = new LongAccumulator
    register(acc, name)
    acc
  }
  // From LongAccumulator:
  override def add(v: jl.Long): Unit = { _sum += v; _count += 1 }
  override def value: jl.Long = _sum
Example:
    val accum = sc.longAccumulator("My Accumulator")
    sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
    accum.value  // 10
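
Beyond longAccumulator, custom accumulators can be built on the AccumulatorV2 API. Below is a minimal sketch (the class name and use case are made up for illustration) that collects problem records such as unparsable lines. Note that accumulator updates are only guaranteed to be applied exactly once when performed inside actions; inside transformations they may be re-applied if a task is re-executed.

    import org.apache.spark.util.AccumulatorV2
    import scala.collection.mutable.ListBuffer

    // Collects strings (e.g. malformed input lines) across tasks
    class StringListAccumulator extends AccumulatorV2[String, List[String]] {
      private val buffer = ListBuffer[String]()
      override def isZero: Boolean = buffer.isEmpty
      override def copy(): StringListAccumulator = {
        val acc = new StringListAccumulator
        acc.buffer ++= buffer
        acc
      }
      override def reset(): Unit = buffer.clear()
      override def add(v: String): Unit = buffer += v
      override def merge(other: AccumulatorV2[String, List[String]]): Unit = buffer ++= other.value
      override def value: List[String] = buffer.toList
    }

    val badLines = new StringListAccumulator
    sc.register(badLines, "bad lines")
    sc.parallelize(Seq("1", "2", "x", "4")).foreach { s =>
      if (scala.util.Try(s.toInt).isFailure) badLines.add(s)  // record values that fail to parse
    }
    badLines.value  // List("x")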

In the south stand tall trees, yet beneath them there is no rest. By the Han roam wandering maidens, yet they cannot be sought. The Han is too wide to swim across; the Jiang is too long to raft across.
High grows the tangled brushwood; I would cut its thorns. When that girl goes to her new home, I would feed her horses. The Han is too wide to swim across; the Jiang is too long to raft across.
High grows the tangled brushwood; I would cut its wormwood. When that girl goes to her new home, I would feed her colts. The Han is too wide to swim across; the Jiang is too long to raft across.
