Spark Caching and Memory Management

云舒向晚

Article link: https://blog.csdn.net/qq_51549686/article/details/111866752

1 persist and unpersist

cache() calls persist(), and the default storage level is MEMORY_ONLY.
persist() sets the storage level of an RDD.

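    Whether to serialize and whether to write to disk depends on the memory available and on how much computation time is acceptable. Serialization reduces the memory footprint, but deserialization adds time. Writing to disk also adds time, but it reduces memory usage and may still speed up the overall computation. Also keep in mind that a cached RDD occupies memory until it is released: once later computations no longer reuse it, call unpersist to free it.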

import org.apache.spark.storage.StorageLevel
val rdd1 = sc.textFile("hdfs://ns1/user/panniu/spark/input").flatMap(_.split("\t")).map((_, 1))
val rdd2 = rdd1.persist(StorageLevel.MEMORY_ONLY)
rdd2.count()

unpersist() removes the RDD's persisted blocks from both memory and disk. After the cache is removed, the executors no longer hold those blocks.
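To make the persist/unpersist life cycle concrete, here is a minimal sketch. It assumes a local Spark installation; the `local[2]` master, the app name, and the in-memory sample data are illustrative choices and are not from the original example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode with 2 threads, only for illustration
    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-sketch")
    val sc = new SparkContext(conf)
    try {
      // Small in-memory data set standing in for the HDFS input
      val words = sc.parallelize(Seq("a\tb", "b\tc", "c\ta"))
        .flatMap(_.split("\t"))
        .map((_, 1))

      // Serialized in memory, spilling to disk when memory runs short
      words.persist(StorageLevel.MEMORY_AND_DISK_SER)
      words.count() // the first action materializes the cache
      println(words.getStorageLevel.description)
      println(sc.getPersistentRDDs.keys) // ids of RDDs currently marked persistent

      // Release the cached blocks once the RDD is no longer reused
      words.unpersist(blocking = true)
      println(words.getStorageLevel == StorageLevel.NONE)
    } finally {
      sc.stop()
    }
  }
}
```

Choosing MEMORY_AND_DISK_SER here trades CPU time (serialization) for a smaller memory footprint, which is the trade-off described above.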

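2 Spark memory management

    Spark's main strength is in-memory computation, so memory management is one of its most important modules. Spark memory falls broadly into two categories: execution memory and storage memory. Execution memory is used by shuffles, joins, sorts, and aggregations; storage memory is used for caching and for transferring data between nodes.
    In Spark 1.5 and earlier, execution memory and storage memory were configured statically and could not borrow from each other. Spark 1.6 introduced unified memory management: storage and execution memory share a single region and each can dynamically use the other's free space, which improves memory utilization.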

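2.1 Static memory management: Spark 1.5
Spark 1.6 and later remain compatible with the Spark 1.5 memory management. With spark.memory.useLegacyMode=true, the Spark 1.5 scheme is used; with spark.memory.useLegacyMode=false (the default), the Spark 1.6+ scheme is used.
Implementation class for Spark 1.5 memory management: StaticMemoryManager
spark.storage.memoryFraction:
Fraction of system memory given to Spark storage in total; default 0.6.
spark.shuffle.memoryFraction:
Fraction of system memory given to Spark shuffle execution; default 0.2.

spark.storage.safetyFraction:
Fraction of total storage memory that is actually usable for storage; default 0.9.
spark.shuffle.safetyFraction:
Fraction of total execution memory that is actually usable for shuffle execution; default 0.8.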

private def getMaxExecutionMemory(conf: SparkConf): Long = {
  val systemMaxMemory = conf.getLong("spark.testing.memory", Runtime.getRuntime.maxMemory)
  // Fail if the maximum memory obtained is below 32 MB
  if (systemMaxMemory < MIN_MEMORY_BYTES) {
    throw new IllegalArgumentException(s"System memory $systemMaxMemory must " +
      s"be at least $MIN_MEMORY_BYTES. Please increase heap size using the --driver-memory " +
      s"option or spark.driver.memory in Spark configuration.")
  }
  if (conf.contains("spark.executor.memory")) {
    val executorMemory = conf.getSizeAsBytes("spark.executor.memory")
    if (executorMemory < MIN_MEMORY_BYTES) {
      throw new IllegalArgumentException(s"Executor memory $executorMemory must be at least " +
        s"$MIN_MEMORY_BYTES. Please increase executor memory using the " +
        s"--executor-memory option or spark.executor.memory in Spark configuration.")
    }
  }
  val memoryFraction = conf.getDouble("spark.shuffle.memoryFraction", 0.2)
  val safetyFraction = conf.getDouble("spark.shuffle.safetyFraction", 0.8)
  (systemMaxMemory * memoryFraction * safetyFraction).toLong
}

private def getMaxStorageMemory(conf: SparkConf): Long = {
  val systemMaxMemory = conf.getLong("spark.testing.memory", Runtime.getRuntime.maxMemory)
  val memoryFraction = conf.getDouble("spark.storage.memoryFraction", 0.6)
  val safetyFraction = conf.getDouble("spark.storage.safetyFraction", 0.9)
  (systemMaxMemory * memoryFraction * safetyFraction).toLong
}

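Example: the executor's maximum available memory is 1000 MB
Total storage memory = 1000 MB * 0.6 = 600 MB
Total execution memory = 1000 MB * 0.2 = 200 MB
other = 1000 MB - 600 MB - 200 MB = 200 MB

Total storage memory = safe storage memory + reserved memory (guards against OOM)
Safe storage memory = total storage memory * 0.9 = 600 MB * 0.9 = 540 MB
Reserved memory = total storage memory * (1 - 0.9) = 60 MB

Total execution memory = safe execution memory + reserved memory (guards against OOM)
Safe execution memory = total execution memory * 0.8 = 200 MB * 0.8 = 160 MB
Reserved memory = total execution memory * (1 - 0.8) = 40 MB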
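
The arithmetic above can be reproduced with a small helper. This is only a sketch: the object, the `Regions` case class, and its field names are made up for illustration, and it rounds to whole megabytes, whereas StaticMemoryManager works in bytes and truncates with toLong.

```scala
object StaticMemorySketch {
  // Sizes in MB; the field names are illustrative, not Spark identifiers
  final case class Regions(
      storageTotal: Long,
      storageSafe: Long,
      executionTotal: Long,
      executionSafe: Long,
      other: Long)

  def regions(
      systemMb: Long,
      storageFraction: Double = 0.6, // spark.storage.memoryFraction
      storageSafety: Double = 0.9,   // spark.storage.safetyFraction
      shuffleFraction: Double = 0.2, // spark.shuffle.memoryFraction
      shuffleSafety: Double = 0.8    // spark.shuffle.safetyFraction
  ): Regions = {
    val storageTotal = math.round(systemMb * storageFraction)
    val executionTotal = math.round(systemMb * shuffleFraction)
    Regions(
      storageTotal = storageTotal,
      storageSafe = math.round(storageTotal * storageSafety),
      executionTotal = executionTotal,
      executionSafe = math.round(executionTotal * shuffleSafety),
      other = systemMb - storageTotal - executionTotal)
  }

  def main(args: Array[String]): Unit =
    println(regions(1000))
}
```

Changing `storageFraction` to 0.2 models the effect of passing --conf spark.storage.memoryFraction=0.2 in legacy mode.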

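Drawback:
Execution memory and storage memory are allocated separately and cannot be shared. Even when one side runs short while the other sits idle, the idle memory cannot be used, so memory is wasted.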

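2.2 Unified memory management: Spark 1.6 and later
With spark.memory.useLegacyMode=false, the memory management of Spark 1.6 and later is used.
Implementation class for Spark 1.6+ memory management: UnifiedMemoryManager
The Spark version used here is 2.1.1. Some of its parameter settings differ from Spark 1.6, so the parameters below are explained as they are in Spark 2.1.1.
spark.memory.fraction:
Fraction of usable memory (system memory - 300 MB) given to Spark memory; default 0.6.
spark.memory.storageFraction:
Fraction of Spark memory given to storage; default 0.5.
Under unified memory management, spark.memory.storageFraction sets the boundary between storage memory and execution memory; the two regions can then borrow from each other across that boundary, which is how memory is shared.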

private def getMaxMemory(conf: SparkConf): Long = {
  val systemMemory = conf.getLong("spark.testing.memory", Runtime.getRuntime.maxMemory)
  // Reserved memory: 300 MB unless running under spark.testing
  val reservedMemory = conf.getLong("spark.testing.reservedMemory",
    if (conf.contains("spark.testing")) 0 else RESERVED_SYSTEM_MEMORY_BYTES)
  // Minimum system memory: 450 MB (1.5 x reserved memory)
  val minSystemMemory = (reservedMemory * 1.5).ceil.toLong
  if (systemMemory < minSystemMemory) {
    throw new IllegalArgumentException(s"System memory $systemMemory must " +
      s"be at least $minSystemMemory. Please increase heap size using the --driver-memory " +
      s"option or spark.driver.memory in Spark configuration.")
  }
  // SPARK-12759 Check executor memory to fail fast if memory is insufficient
  if (conf.contains("spark.executor.memory")) {
    val executorMemory = conf.getSizeAsBytes("spark.executor.memory")
    if (executorMemory < minSystemMemory) {
      throw new IllegalArgumentException(s"Executor memory $executorMemory must be at least " +
        s"$minSystemMemory. Please increase executor memory using the " +
        s"--executor-memory option or spark.executor.memory in Spark configuration.")
    }
  }
  val usableMemory = systemMemory - reservedMemory
  val memoryFraction = conf.getDouble("spark.memory.fraction", 0.6)
  (usableMemory * memoryFraction).toLong
}

def apply(conf: SparkConf, numCores: Int): UnifiedMemoryManager = {
  // Maximum memory available to Spark
  val maxMemory = getMaxMemory(conf)
  new UnifiedMemoryManager(
    conf,
    maxHeapMemory = maxMemory,
    // Storage region = maximum available memory * 0.5
    onHeapStorageRegionSize =
      (maxMemory * conf.getDouble("spark.memory.storageFraction", 0.5)).toLong,
    numCores = numCores)
}

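Example: system memory is 1000 MB
System reserved memory = 300 MB
Usable memory = system memory - system reserved memory = 1000 MB - 300 MB = 700 MB
Spark memory = usable memory * 0.6 = 700 MB * 0.6 = 420 MB
Storage memory and execution memory each take half: 210 MB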
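
The same calculation can be written as a small helper. This is a sketch only: the object and field names are hypothetical, sizes are in whole megabytes rather than bytes, and the `require` check stands in for the minimum-memory check in getMaxMemory.

```scala
object UnifiedMemorySketch {
  // Sizes in MB; the field names are illustrative, not Spark identifiers
  final case class Regions(
      reserved: Long,
      usable: Long,
      sparkMemory: Long,
      storage: Long,
      execution: Long)

  def regions(
      systemMb: Long,
      reservedMb: Long = 300,
      memoryFraction: Double = 0.6,  // spark.memory.fraction
      storageFraction: Double = 0.5  // spark.memory.storageFraction
  ): Regions = {
    // System memory must be at least 1.5x the reserved memory
    val minSystemMb = math.ceil(reservedMb * 1.5).toLong
    require(systemMb >= minSystemMb, s"System memory must be at least $minSystemMb MB")
    val usable = systemMb - reservedMb
    val sparkMemory = math.round(usable * memoryFraction)
    val storage = math.round(sparkMemory * storageFraction)
    Regions(reservedMb, usable, sparkMemory, storage, sparkMemory - storage)
  }

  def main(args: Array[String]): Unit =
    println(regions(1000))
}
```

Passing `memoryFraction = 0.2` models --conf spark.memory.fraction=0.2: Spark memory shrinks, while storage still starts at half of it.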

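To improve memory utilization, Spark applies the following policies to Storage Memory and Execution Memory:
1) When one side has free memory and the other is short, the side that is short can borrow from the idle side.
2) Only Execution Memory can forcibly reclaim memory: it can take back the part of its region that Storage Memory borrowed while Execution Memory was idle. (If cached data is lost through this eviction, it is simply recomputed.)
3) Storage Memory can only wait for Execution Memory to release, on its own, the storage-region memory it borrowed while Storage Memory was idle. (It is not reclaimed forcibly, because losing data in the middle of a running task would make the task fail.)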
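
The three rules can be illustrated with a toy pool. This is a deliberately simplified model, not Spark's UnifiedMemoryManager: the class and method names are invented, sizes are plain megabytes, and eviction is reduced to a counter update.

```scala
object BorrowingSketch {
  /** Toy model of one unified pool; sizes are in MB. */
  final class ToyUnifiedPool(val total: Long, val storageRegion: Long) {
    private var storageUsed = 0L
    private var executionUsed = 0L

    def storage: Long = storageUsed
    def execution: Long = executionUsed
    private def free: Long = total - storageUsed - executionUsed

    /** Storage may use any free memory (rule 1) but never evicts execution (rule 3). */
    def acquireStorage(mb: Long): Boolean =
      if (mb <= free) { storageUsed += mb; true } else false

    /** Execution takes free memory first (rule 1), then evicts cached blocks that
      * storage holds beyond its own region (rule 2). Returns the amount granted. */
    def acquireExecution(mb: Long): Long = {
      val evictable = math.max(0L, storageUsed - storageRegion)
      val granted = math.min(mb, free + evictable)
      val toEvict = math.max(0L, granted - free)
      storageUsed -= toEvict // evicted blocks would be recomputed on next use
      executionUsed += granted
      granted
    }
  }

  def main(args: Array[String]): Unit = {
    val pool = new ToyUnifiedPool(total = 420, storageRegion = 210)
    pool.acquireStorage(300)            // storage borrows 90 MB from execution
    println(pool.acquireExecution(200)) // execution evicts part of the borrowed memory
    println(pool.acquireStorage(50))    // no free memory left, storage has to wait
  }
}
```

The asymmetry is the point of the design: cached blocks can be recomputed, while memory held by a running task cannot be taken away safely.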

Submitting with the Spark 1.5 scheme:
spark-shell --master spark://nn1.hadoop:7077 --executor-memory 1G --total-executor-cores 5 --conf spark.memory.useLegacyMode=true

spark-shell --master spark://nn1.hadoop:7077 --executor-memory 1G --total-executor-cores 5 --conf spark.memory.useLegacyMode=true --conf spark.storage.memoryFraction=0.2

With Spark 1.6 and later:
spark-shell --master spark://nn1.hadoop:7077 --executor-memory 1G --total-executor-cores 5
Storage memory is half of Spark memory. Spark memory is 60% of usable memory.

spark-shell --master spark://nn1.hadoop:7077 --executor-memory 1G --total-executor-cores 5 --conf spark.memory.fraction=0.2
Storage memory is half of Spark memory. Spark memory is 20% of usable memory.

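3 BlockManager analysis

    BlockManager is Spark's distributed storage system. It differs from what is usually called a distributed storage system in that it only manages Block data. It runs on every node.
    BlockManager uses a Master-Slave architecture: the Master is the BlockManagerMaster on the Driver, and the Slaves are the BlockManagers on each Executor. BlockManagerMaster accepts registrations from the BlockManagers on the Executors and manages their metadata.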

[Figure: BlockManager runtime diagram]

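Flow:
1) When the Application starts, BlockManagerMaster is registered in SparkEnv.
BlockManagerMaster: manages Block data for the whole cluster;
2) Each Executor, when it starts, instantiates a BlockManagerSlave and registers it with BlockManagerMaster through remote communication;
3) BlockManagerSlave consists of 4 parts:
MemoryStore: stores, reads, and writes data in memory;
DiskStore: stores, reads, and writes data on disk;
BlockTransferService: establishes network connections to the BlockManagers of other, remote Executors;
BlockManagerWorker: reads and writes data held by the BlockManagers of other, remote Executors;
4) Whenever an Executor's BlockManager adds, deletes, or updates a block, it must report the block's BlockStatus to the BlockManagerMaster on the Driver. The BlockManagerMasterEndpoint inside BlockManagerMaster maintains the metadata mappings. Because they are Map and Set structures, adding, updating, and removing metadata entries is straightforward.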
// Maps BlockManagerId to BlockManagerInfo;
// BlockManagerInfo in turn maintains a JHashMap[BlockId, BlockStatus]
private val blockManagerInfo = new mutable.HashMap[BlockManagerId, BlockManagerInfo]

// Maps executor ID to BlockManagerId
private val blockManagerIdByExecutor = new mutable.HashMap[String, BlockManagerId]

// Maps BlockId to HashSet[BlockManagerId], because a block may have replicas
private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]

Summary of the mappings:
BlockId -> HashSet[BlockManagerId]
BlockManagerId -> BlockManagerInfo
BlockId -> BlockStatus
A block can be stored in several BlockManagers (replicas), and one BlockManager holds many blocks.
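
To show how these three maps stay consistent when executors report changes, here is a toy version of the driver-side bookkeeping. It is a sketch under simplifying assumptions: all `Toy*` types and the method names are invented for illustration, block ids are plain strings, and no RPC is involved.

```scala
import scala.collection.mutable

object BlockMetadataSketch {
  // Simplified stand-ins for Spark's BlockManagerId and BlockStatus
  final case class ToyBlockManagerId(executorId: String, host: String, port: Int)
  final case class ToyBlockStatus(memSize: Long, diskSize: Long)

  final class ToyMaster {
    private val blockManagerInfo =
      mutable.HashMap.empty[ToyBlockManagerId, mutable.HashMap[String, ToyBlockStatus]]
    private val blockManagerIdByExecutor =
      mutable.HashMap.empty[String, ToyBlockManagerId]
    private val blockLocations =
      mutable.HashMap.empty[String, mutable.HashSet[ToyBlockManagerId]]

    /** An executor's BlockManager registers itself at startup. */
    def register(id: ToyBlockManagerId): Unit = {
      blockManagerIdByExecutor.update(id.executorId, id)
      blockManagerInfo.getOrElseUpdate(id, mutable.HashMap.empty[String, ToyBlockStatus])
    }

    /** An executor reports that a block was added or updated. */
    def updateBlock(id: ToyBlockManagerId, blockId: String, status: ToyBlockStatus): Unit = {
      val blocks =
        blockManagerInfo.getOrElseUpdate(id, mutable.HashMap.empty[String, ToyBlockStatus])
      blocks.update(blockId, status)
      blockLocations.getOrElseUpdate(blockId, mutable.HashSet.empty[ToyBlockManagerId]) += id
    }

    /** An executor reports that a block was dropped. */
    def removeBlock(id: ToyBlockManagerId, blockId: String): Unit = {
      blockManagerInfo.get(id).foreach(_.remove(blockId))
      blockLocations.get(blockId).foreach { holders =>
        holders -= id
        if (holders.isEmpty) blockLocations.remove(blockId)
      }
    }

    /** All BlockManagers that currently hold the block (replicas included). */
    def locations(blockId: String): Set[ToyBlockManagerId] =
      blockLocations.get(blockId).map(_.toSet).getOrElse(Set.empty[ToyBlockManagerId])
  }
}
```

Keeping a set of holders per block is what lets a reader find any replica, as described for remote reads.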

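5) Block writes
Local write:
Operations such as persisting or shuffling trigger BlockManager writes. For example, when persist is called with the storage level MEMORY_AND_DISK, the data is persisted: it goes into memory first, and when memory is insufficient it is written to disk.
Remote write:
If replication is specified (a storage level with replicas), the data is also copied to another node through BlockTransferService.

6) Block reads
Local read:
When an operator in a Spark job triggers a read, the data is first looked up in the BlockManager where that operator runs.
Remote read:
If the data is not available locally, the Block's actual location is obtained from the driver, and the data is pulled through BlockTransferService from the BlockManager of the remote Executor that holds it.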
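
The read order described in 6) can be sketched as a pure function. Everything here is a simplification: `ToyExecutor` and `read` are invented names, the location table stands in for the driver-side metadata, and a map lookup stands in for the network fetch.

```scala
object BlockReadSketch {
  /** Toy executor: its id plus the blocks it holds locally. */
  final case class ToyExecutor(id: String, localBlocks: Map[String, Seq[Int]])

  /** Local read first; otherwise look up which executors hold the block
    * (the driver's job) and fetch it from the first one that has it. */
  def read(
      blockId: String,
      self: ToyExecutor,
      locations: Map[String, Seq[String]],
      cluster: Map[String, ToyExecutor]): Option[(String, Seq[Int])] =
    self.localBlocks.get(blockId).map(data => ("local", data)).orElse {
      locations.getOrElse(blockId, Seq.empty[String])
        .filter(_ != self.id)
        .flatMap(executorId => cluster.get(executorId))
        .flatMap(executor => executor.localBlocks.get(blockId))
        .headOption
        .map(data => ("remote", data))
    }
}
```

The string tag in the result only marks which path served the block; it has no counterpart in Spark.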

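Block read/write flow
Calling the RDD's iterator function
When a task runs, it calls the RDD's iterator function.
[Figure: call trace of the iterator function]

The iterator function works roughly as follows:
1 If the RDD is marked as cached, read from the cache. On a miss, "compute or read the checkpoint", then store the result in the cache for later use.
2 If the RDD is not marked as cached, go straight to "compute or read the checkpoint".
3 "Compute or read the checkpoint" also distinguishes two cases: the RDD has been checkpointed, or it has not. If it has, the checkpoint data is read and returned. If not, the compute function of the RDD's implementation class is called. compute works by recursing upward, "fetching the parent RDD's partition data and computing on it", until it reaches a checkpointed RDD or a cached RDD.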
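
The three steps can be modelled with a toy RDD that has a single partition. This is only a sketch of the control flow: `ToyRdd` and its members are invented for illustration, and a `Seq[Int]` stands in for a partition's data.

```scala
object IteratorFlowSketch {
  /** Toy RDD with one partition, an optional parent, and cache/checkpoint flags. */
  final class ToyRdd(
      val name: String,
      parent: Option[ToyRdd],
      f: Seq[Int] => Seq[Int],
      val cached: Boolean = false,
      checkpointData: Option[Seq[Int]] = None) {

    var computeCalls = 0 // counts how often compute() actually ran
    private var cachedData: Option[Seq[Int]] = None

    /** Steps 1 and 2: use the cache when marked as cached, else compute or read the checkpoint. */
    def iterator(): Seq[Int] =
      if (cached) {
        cachedData.getOrElse {
          val data = computeOrReadCheckpoint()
          cachedData = Some(data) // store for later use
          data
        }
      } else {
        computeOrReadCheckpoint()
      }

    /** Step 3: prefer checkpoint data, otherwise compute from the parent. */
    private def computeOrReadCheckpoint(): Seq[Int] =
      checkpointData.getOrElse(compute())

    /** Recurses upward through the parent's iterator(). */
    private def compute(): Seq[Int] = {
      computeCalls += 1
      f(parent.map(_.iterator()).getOrElse(Seq.empty[Int]))
    }
  }
}
```

Because compute() goes through the parent's iterator(), the recursion stops as soon as it reaches a cached or checkpointed ancestor.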

The getOrCompute(split, context) method
/**
 * Fetch the block from memory or disk; a block fetched from disk may need to be cached in memory.
 */
private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
  // Build the RDDBlockId from the RDD id and partition index
  val blockId = RDDBlockId(id, partition.index)
  // Whether the block was read from the cache
  var readCachedBlock = true
  // This method is called on executors, so we need call SparkEnv.get instead of sc.env.
  SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
    readCachedBlock = false
    // The block is not cached: read the checkpoint or compute the partition
    computeOrReadCheckpoint(partition, context)
  }) match {
    // A block result was obtained; return it
    case Left(blockResult) =>
      // The block was read from the cache
      if (readCachedBlock) {
        val existingMetrics = context.taskMetrics().inputMetrics
        existingMetrics.incBytesRead(blockResult.bytes)
        new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
          override def next(): T = {
            existingMetrics.incRecordsRead(1)
            delegate.next()
          }
        }
      } else {
        new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
      }
    case Right(iter) =>
      new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
  }
}
The SparkEnv.get.blockManager.getOrElseUpdate method
/*
 * If the given block exists, return it directly; otherwise call makeIterator to compute
 * the block, persist it, and return the result.
 */
def getOrElseUpdate[T](
    blockId: BlockId,
    level: StorageLevel,
    classTag: ClassTag[T],
    makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
  // Attempt to read the block from local or remote storage. If it's present, then we don't need
  // to go through the local-get-or-put path.
  // Try the local store first, then remote executors
  get[T](blockId)(classTag) match {
    case Some(block) =>
      return Left(block)
    case _ =>
      // Need to compute the block.
  }
  // Initially we hold no locks on this block.
  // Neither local nor remote had the data: compute it with makeIterator and write it as a block
  doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
    // The write succeeded
    case None =>
      // doPut() didn't hand work back to us, so the block already existed or was successfully
      // stored. Therefore, we now hold a read lock on the block.
      // Read the block back from the local store
      val blockResult = getLocalValues(blockId).getOrElse {
        // Since we held a read lock between the doPut() and get() calls, the block should not
        // have been evicted, so get() not returning the block indicates some internal error.
        releaseLock(blockId)
        throw new SparkException(s"get() failed for block $blockId even though we held a lock")
      }
      // We already hold a read lock on the block from the doPut() call and getLocalValues()
      // acquires the lock again, so we need to call releaseLock() here so that the net number
      // of lock acquisitions is 1 (since the caller will only call release() once).
      releaseLock(blockId)
      Left(blockResult)
    case Some(iter) => // The write failed
      // The put failed, likely because the data was too large to fit in memory and could not be
      // dropped to disk. Therefore, we need to pass the input iterator back to the caller so
      // that they can decide what to do with the values (e.g. process them without caching).
      Right(iter)
  }
}

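The method above first calls get to read the data, then doPutIterator to compute and write it.
1) get is the entry point for reads. It tries to fetch the block locally or from another executor. If the block is found, the data is returned; if not, the next step runs.
def get[T: ClassTag](blockId: BlockId): Option[BlockResult] = {
  // Look for the block locally and return it if found.
  // Under the hood, getLocalValues calls
  //   memoryStore.getValues(blockId) for blocks in memory
  //   diskStore.getBytes(blockId) for blocks on disk
  val local = getLocalValues(blockId)
  if (local.isDefined) {
    logInfo(s"Found block $blockId locally")
    return local
  }
  // Look for the block on other executors and return it if found.
  // getRemoteValues is implemented on top of blockTransferService.fetchBlockSync
  val remote = getRemoteValues[T](blockId)
  if (remote.isDefined) {
    logInfo(s"Found block $blockId remotely")
    return remote
  }
  // Nothing found: return None
  None
}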
2) doPutIterator is the entry point for writes. Data is written by calling doPutIterator.
private def doPutIterator[T](
    blockId: BlockId,
    iterator: () => Iterator[T],
    level: StorageLevel,
    classTag: ClassTag[T],
    tellMaster: Boolean = true,
    keepReadLock: Boolean = false): Option[PartiallyUnrolledIterator[T]] = {
  doPut(blockId, level, classTag, tellMaster = tellMaster, keepReadLock = keepReadLock) { info =>
    val startTimeMs = System.currentTimeMillis
    var iteratorFromFailedMemoryStorePut: Option[PartiallyUnrolledIterator[T]] = None
    // Size of the block in bytes
    var size = 0L
    // The storage level asks for the data to be written to memory
    if (level.useMemory) {
      // Put it in memory first, even if it also has useDisk set to true;
      // We will drop it to disk later if the memory store can't hold it.
      if (level.deserialized) {
        // Deserialized storage: the data is kept as values, so store it with putIteratorAsValues
        memoryStore.putIteratorAsValues(blockId, iterator(), classTag) match {
          // Written to memory successfully; record the block size
          case Right(s) =>
            size = s
          case Left(iter) =>
            // The memory write failed: write to disk if the storage level allows it,
            // otherwise hand the iterator back
            if (level.useDisk) {
              logWarning(s"Persisting block $blockId to disk instead.")
              diskStore.put(blockId) { fileOutputStream =>
                serializerManager.dataSerializeStream(blockId, fileOutputStream, iter)(classTag)
              }
              size = diskStore.getSize(blockId)
            } else {
              iteratorFromFailedMemoryStorePut = Some(iter)
            }
        }
      } else {
        // Serialized storage: the data is kept as bytes, so store it with putIteratorAsBytes
        memoryStore.putIteratorAsBytes(blockId, iterator(), classTag, level.memoryMode) match {
          case Right(s) =>
            // Written to memory successfully; record the block size
            size = s
          case Left(partiallySerializedValues) =>
            // The memory write failed: write to disk if the storage level allows it and
            // record the size written, otherwise hand the values back
            if (level.useDisk) {
              logWarning(s"Persisting block $blockId to disk instead.")
              diskStore.put(blockId) { fileOutputStream =>
                partiallySerializedValues.finishWritingToStream(fileOutputStream)
              }
              size = diskStore.getSize(blockId)
            } else {
              iteratorFromFailedMemoryStorePut = Some(partiallySerializedValues.valuesIterator)
            }
        }
      }
    } else if (level.useDisk) {
      // Disk-only storage level: write with diskStore.put()
      diskStore.put(blockId) { fileOutputStream =>
        serializerManager.dataSerializeStream(blockId, fileOutputStream, iterator())(classTag)
      }
      // Record the size written
      size = diskStore.getSize(blockId)
    }

    val putBlockStatus = getCurrentBlockStatus(blockId, info)
    val blockWasSuccessfullyStored = putBlockStatus.storageLevel.isValid
    if (blockWasSuccessfullyStored) {
      // The write succeeded: send the block's metadata to the driver
      info.size = size
      if (tellMaster && info.tellMaster) {
        reportBlockStatus(blockId, putBlockStatus)
      }
      addUpdatedBlockStatusToTaskMetrics(blockId, putBlockStatus)
      logDebug("Put block %s locally took %s".format(blockId, Utils.getUsedTimeMs(startTimeMs)))
      if (level.replication > 1) {
        // Replicas are required: read the block's bytes and copy them to other nodes
        val remoteStartTime = System.currentTimeMillis
        val bytesToReplicate = doGetLocalBytes(blockId, info)
        // [SPARK-16550] Erase the typed classTag when using default serialization, since
        // NettyBlockRpcServer crashes when deserializing repl-defined classes.
        // TODO(ekl) remove this once the classloader issue on the remote end is fixed.
        val remoteClassTag = if (!serializerManager.canUseKryo(classTag)) {
          scala.reflect.classTag[Any]
        } else {
          classTag
        }
        try {
          // Copy to other nodes
          replicate(blockId, bytesToReplicate, level, remoteClassTag)
        } finally {
          bytesToReplicate.unmap()
        }
        logDebug("Put block %s remotely took %s"
          .format(blockId, Utils.getUsedTimeMs(remoteStartTime)))
      }
    }
    assert(blockWasSuccessfullyStored == iteratorFromFailedMemoryStorePut.isEmpty)
    iteratorFromFailedMemoryStorePut
  }
}
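
The return conventions of the two methods are easier to see in a toy store: the put returns None when the block was stored and hands the values back otherwise, and getOrElseUpdate turns that into Left (stored block) or Right (uncached values). This sketch is not Spark code: `ToyStore` and `ToyLevel` are invented, sizes are counted in elements instead of bytes, and the iterator is materialized up front, whereas Spark unrolls it incrementally.

```scala
import scala.collection.mutable

object BlockWriteSketch {
  /** Simplified storage level: only the two flags the write path branches on. */
  final case class ToyLevel(useMemory: Boolean, useDisk: Boolean)

  final class ToyStore(memoryCapacity: Int) {
    val memory = mutable.Map.empty[String, Vector[Int]]
    val disk = mutable.Map.empty[String, Vector[Int]]
    private def memoryUsed: Int = memory.values.map(_.size).sum

    /** Returns None when the block was stored, or Some(values) when it could
      * not be stored and the caller must consume the values without caching. */
    def putIterator(
        blockId: String,
        values: Iterator[Int],
        level: ToyLevel): Option[Iterator[Int]] = {
      val data = values.toVector // toy simplification: materialize everything
      if (level.useMemory && memoryUsed + data.size <= memoryCapacity) {
        memory.update(blockId, data) // fits in memory
        None
      } else if (level.useDisk) {
        disk.update(blockId, data) // memory is full or not requested: write to disk
        None
      } else {
        Some(data.iterator) // cannot store: hand the values back
      }
    }

    /** Left = the stored block, Right = values that could not be cached. */
    def getOrElseUpdate(blockId: String, level: ToyLevel)(
        makeIterator: () => Iterator[Int]): Either[Vector[Int], Iterator[Int]] =
      memory.get(blockId).orElse(disk.get(blockId)) match {
        case Some(block) => Left(block)
        case None =>
          putIterator(blockId, makeIterator(), level) match {
            case None => Left(memory.getOrElse(blockId, disk(blockId)))
            case Some(iter) => Right(iter)
          }
      }
  }
}
```

Returning the iterator on failure lets the caller still process the partition once, just without caching it.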

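Typical uses of BlockManager:
1) Data produced during a Spark shuffle is stored through BlockManager.
2) When tasks are scheduled onto multiple executors, broadcast variables use BlockManager as their underlying data store.
3) When an RDD is cached, the cached data is stored through BlockManager.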