RDD -- Other Operations

RDD Caching

Level                   Description
MEMORY_ONLY             The default. Data is cached in memory as deserialized Java objects.
MEMORY_AND_DISK         Stored in memory first; partitions that do not fit in memory spill to disk.
MEMORY_ONLY_SER         Like MEMORY_ONLY, data stays in memory, but as serialized Java objects (more compact, at the cost of extra CPU to deserialize) rather than deserialized objects.
MEMORY_AND_DISK_SER     Same placement as MEMORY_AND_DISK, but the data is stored in serialized form.
DISK_ONLY               Data is stored on disk only.
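
As a quick, illustrative sketch of how a level from this table is applied (the RDD below is made up for demonstration): the level is passed to persist(), can be read back with getStorageLevel, and only takes effect when the first action runs.

    import org.apache.spark.storage.StorageLevel

    val data = sc.parallelize(1 to 1000000)
    // MEMORY_AND_DISK_SER: smaller memory footprint (serialized), spills to disk if needed
    data.persist(StorageLevel.MEMORY_AND_DISK_SER)
    println(data.getStorageLevel)   // inspect the level that was assigned
    data.count()                    // caching is lazy; the first action materializes the cache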

persist
persist() source:
  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
persist(newLevel: StorageLevel) source:
  /**
   * Set this RDD's storage level to persist its values across operations after the first time
   * it is computed. This can only be used to assign a new storage level if the RDD does not
   * have a storage level set yet. Local checkpointing is an exception.
   */
  def persist(newLevel: StorageLevel): this.type = {
    if (isLocallyCheckpointed) {
      // This means the user previously called localCheckpoint(), which should have already
      // marked this RDD for persisting. Here we should override the old storage level with
      // one that is explicitly requested by the user (after adapting it to use disk).
      persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
    } else {
      persist(newLevel, allowOverride = false)
    }
  }
StorageLevel source (the constructor flags are, in order: useDisk, useMemory, useOffHeap, deserialized, plus an optional replication count):

/**
 * Various [[org.apache.spark.storage.StorageLevel]] defined and utility functions for creating
 * new storage levels.
 */
object StorageLevel {
  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(true, true, true, false, 1)
}
Example:
    val a = sc.parallelize(1 to 100)
    a.cache()
    a.persist()
    a.persist(StorageLevel.MEMORY_ONLY)
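
One behavior worth noting (a small sketch; the exact exception message may differ across Spark versions): once a storage level has been assigned it cannot be changed, so the RDD has to be unpersisted before a different level can be set.

    val b = sc.parallelize(1 to 100)
    b.persist(StorageLevel.MEMORY_ONLY)
    // b.persist(StorageLevel.MEMORY_AND_DISK)   // would throw UnsupportedOperationException
    b.unpersist()                                // drop the cached blocks first
    b.persist(StorageLevel.MEMORY_AND_DISK)      // now a new level can be assigned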

cache() is essentially persist():

  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def cache(): this.type = persist()

Checkpoint fault-tolerance mechanism

Checkpointing complements lineage-based fault tolerance. When the lineage becomes very long, recovering through it becomes expensive; once an RDD is checkpointed, partitions lost downstream can be recomputed from the checkpointed data instead of replaying the full lineage, which reduces recovery cost. Official guidance: checkpoint() must be called before any job has been executed on this RDD, and the RDD should be persisted in memory, otherwise writing the checkpoint file will require recomputation.
Source:
  /**
   * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
   * directory set with `SparkContext#setCheckpointDir` and all references to its parent
   * RDDs will be removed. This function must be called before any job has been
   * executed on this RDD. It is strongly recommended that this RDD is persisted in
   * memory, otherwise saving it on a file will require recomputation.
   */
  def checkpoint(): Unit = RDDCheckpointData.synchronized {
    // NOTE: we use a global lock here due to complexities downstream with ensuring
    // children RDD partitions point to the correct parent partitions. In the future
    // we should revisit this consideration.
    if (context.checkpointDir.isEmpty) {
      throw new SparkException("Checkpoint directory has not been set in the SparkContext")
    } else if (checkpointData.isEmpty) {
      checkpointData = Some(new ReliableRDDCheckpointData(this))
    }
  }
Example:
    val a = sc.parallelize(1 to 100)
    sc.setCheckpointDir("hdfs://192.168.72.2:8020/checkpoint/20190521")
    a.persist(StorageLevel.MEMORY_ONLY)
    a.checkpoint()
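
A short follow-up sketch (illustrative): checkpoint() only marks the RDD, so an action is needed to actually write the data, after which the checkpoint can be verified.

    a.count()                     // the first action after checkpoint() writes the data to HDFS
    println(a.isCheckpointed)     // true once the checkpoint has been materialized
    println(a.getCheckpointFile)  // Some(path of the checkpointed data under the checkpoint dir)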

Broadcast variables (broadcast)

For shared lookup data, a broadcast variable caches one copy of the data per machine rather than shipping a copy with every task, which reduces network and memory overhead. Typical use case: joining a large table with a small table, where the small table's data is distributed once to every machine.
Source:
  /**
   * Broadcast a read-only variable to the cluster, returning a
   * [[org.apache.spark.broadcast.Broadcast]] object for reading it in distributed functions.
   * The variable will be sent to each cluster only once.
   *
   * @param value value to broadcast to the Spark nodes
   * @return `Broadcast` object, a read-only variable cached on each machine
   */
  def broadcast[T: ClassTag](value: T): Broadcast[T] = {
    assertNotStopped()
    require(!classOf[RDD[_]].isAssignableFrom(classTag[T].runtimeClass),
      "Can not directly broadcast RDDs; instead, call collect() and broadcast the result.")
    val bc = env.broadcastManager.newBroadcast[T](value, isLocal)
    val callSite = getCallSite
    logInfo("Created broadcast " + bc.id + " from " + callSite.shortForm)
    cleaner.foreach(_.registerBroadcastForCleanup(bc))
    bc
  }
Example:
Suppose a region has 100,000 free WiFi hotspots, 1,000,000 users, and 1 billion connection records.
The 100,000 hotspots form table A: WIFI_ID, POS
The 1 billion connection records form table B: USER_ID, WIFI_ID, TIME, MESSAGE
The 1,000,000 users form table C: USER_ID, USER_NAME
The goal is a result table D: USER_NAME, POS, TIME, MESSAGE
    val A = sc.textFile("hdfs://192.168.72.2:8020/wifi").map(line => {
      val fields = line.split("\\|")
      val wifi = fields(0)
      val pos = fields(1)
      (wifi, pos)
    })
    val B = sc.textFile("hdfs://192.168.72.2:8020/connectionInfo").map(line => {
      val fields = line.split("\\|")
      val userID = fields(0)
      val wifiID = fields(1)
      val time = fields(2)
      val message = fields(3)
      (userID, wifiID, time, message)
    })
    val C = sc.textFile("hdfs://192.168.72.2:8020/user").map(line => {
      val fields = line.split("\\|")
      val userID = fields(0)
      val userName = fields(1)
      (userID, userName)
    })
    // Collect the small tables into immutable local arrays and broadcast them to the tasks
    val wifiPosBroadcast = A.collect()
    val wifiPos = sc.broadcast(wifiPosBroadcast)

    val userBroadcast = C.collect()
    val user = sc.broadcast(userBroadcast)
    
    def mapPartitionFunc(iter: Iterator[(String, String, String, String)]): Iterator[(String, String, String, String)] = {
      // Build lookup maps from the broadcast values once per partition
      val wifiMap = wifiPos.value.toMap  // WIFI_ID -> POS
      val userMap = user.value.toMap     // USER_ID -> USER_NAME
      // Map-side join: enrich each connection record with POS and USER_NAME
      for {
        (userID, wifiID, time, message) <- iter
        pos <- wifiMap.get(wifiID)
        userName <- userMap.get(userID)
      } yield (userName, pos, time, message)
    }
    val D = B.mapPartitions(mapPartitionFunc)
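
Because the lookup maps come from broadcast variables, this is a map-side join: the billion-row table B is never shuffled; only the two small tables are shipped, once per machine. A small sketch of the follow-up steps (the output path is hypothetical):

    D.saveAsTextFile("hdfs://192.168.72.2:8020/result/D")  // hypothetical output path
    wifiPos.unpersist()  // remove the broadcast data from executors; it is re-sent if used again
    user.destroy()       // permanently release the broadcast data on driver and executors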

Accumulators

longAccumulator creates and registers an accumulator that starts at 0 and accumulates inputs via `add`; its value is read via `value`.
Source:
  /**
   * Create and register a long accumulator, which starts with 0 and accumulates inputs by `add`.
   */
  def longAccumulator(name: String): LongAccumulator = {
    val acc = new LongAccumulator
    register(acc, name)
    acc
  }
  // From LongAccumulator:
  override def add(v: jl.Long): Unit = { _sum += v; _count += 1 }
  override def value: jl.Long = _sum
Example:
    val accum = sc.longAccumulator("My Accumulator")
    sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
    accum.value  // 10
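
Beyond longAccumulator, custom accumulators can be built on the AccumulatorV2 API. Below is a minimal sketch (the class name and use case are made up for illustration) that collects problem records such as unparsable lines. Note that accumulator updates are only guaranteed to be applied exactly once when performed inside actions; inside transformations they may be re-applied if a task is re-executed.

    import org.apache.spark.util.AccumulatorV2
    import scala.collection.mutable.ListBuffer

    // Collects strings (e.g. malformed input lines) across tasks
    class StringListAccumulator extends AccumulatorV2[String, List[String]] {
      private val buffer = ListBuffer[String]()
      override def isZero: Boolean = buffer.isEmpty
      override def copy(): StringListAccumulator = {
        val acc = new StringListAccumulator
        acc.buffer ++= buffer
        acc
      }
      override def reset(): Unit = buffer.clear()
      override def add(v: String): Unit = buffer += v
      override def merge(other: AccumulatorV2[String, List[String]]): Unit = buffer ++= other.value
      override def value: List[String] = buffer.toList
    }

    val badLines = new StringListAccumulator
    sc.register(badLines, "bad lines")
    sc.parallelize(Seq("1", "2", "x", "4")).foreach { s =>
      if (scala.util.Try(s.toInt).isFailure) badLines.add(s)  // record values that fail to parse
    }
    badLines.value  // List("x")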

In the south stand tall trees, yet beneath them there is no rest. By the Han roam wandering maidens, yet they cannot be sought. The Han is too wide to swim across; the Jiang is too long to raft across.
High grows the tangled brushwood; I would cut its thorns. When that girl goes to her new home, I would feed her horses. The Han is too wide to swim across; the Jiang is too long to raft across.
High grows the tangled brushwood; I would cut its wormwood. When that girl goes to her new home, I would feed her colts. The Han is too wide to swim across; the Jiang is too long to raft across.
