Spark's RDD persistence mechanism stores RDDs that must be computed repeatedly in memory, which can greatly improve job performance.
RDD persistence is used as follows:
RDD.persist(storageLevel) ------------ persist the RDD at the given storage level
RDD.unpersist() ------------ remove the RDD from the cache
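A minimal sketch of the two calls above, assuming an existing `SparkContext` named `sc`; the input path is hypothetical:

```scala
import org.apache.spark.storage.StorageLevel

// assumes an existing SparkContext `sc`; the HDFS path is illustrative only
val lines = sc.textFile("hdfs:///data/input.txt")
val words = lines.flatMap(_.split(" "))

// persist with an explicit storage level (persist() with no argument
// and cache() both default to MEMORY_ONLY)
words.persist(StorageLevel.MEMORY_ONLY)

words.count() // first action: computes the RDD and caches its partitions
words.count() // second action: reads the cached partitions instead of recomputing

// release the cached partitions once they are no longer needed
words.unpersist()
```

Note that `persist` is lazy: nothing is cached until the first action (here `count()`) actually materializes the RDD.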
Storage level notes:
MEMORY_ONLY : Spark stores the RDD as deserialized Java objects in the JVM heap, with no serialization applied.
If the RDD is too large to fit entirely in memory, the partitions that do not fit are not cached
and are recomputed each time they are needed.
MEMORY_AND_DISK : stores the RDD as deserialized Java objects in the JVM as far as memory allows; partitions that
do not fit in memory are cached on disk.
MEMORY_ONLY_SER : serializes the RDD (one byte array per partition) and caches it in memory.
Because the data must be deserialized on every read, this costs extra CPU, but it saves memory.
Partitions that do not fit in memory are not cached and are recomputed when needed.
MEMORY_AND_DISK_SER : like MEMORY_ONLY_SER, but partitions that do not fit in memory are cached on disk.
DISK_ONLY : stores the RDD partitions only on disk.
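Choosing between these levels is a memory/CPU trade-off. A sketch, again assuming an existing `SparkContext` `sc` and a hypothetical input path:

```scala
import org.apache.spark.storage.StorageLevel

// assumes an existing SparkContext `sc`; the path is illustrative only
val pairs = sc.textFile("hdfs:///data/logs").map(line => (line.length, line))

// memory-tight cluster: keep partitions serialized in memory (smaller footprint,
// more CPU per read) and spill whatever still does not fit to disk
pairs.persist(StorageLevel.MEMORY_AND_DISK_SER)
```

If the RDD is cheap to recompute, MEMORY_ONLY is usually preferable; the disk-backed levels pay off when recomputing a partition costs more than reading it back from disk.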
| Storage Level | Meaning |
|---|---|
| MEMORY_ONLY | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level. |
| MEMORY_AND_DISK | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed. |
| MEMORY_ONLY_SER | Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read. |
| MEMORY_AND_DISK_SER | Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed. |
| DISK_ONLY | Store the RDD partitions only on disk. |
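The "fast serializer" mentioned for MEMORY_ONLY_SER usually means Kryo, which Spark supports out of the box. A configuration sketch (the application name is hypothetical):

```scala
import org.apache.spark.SparkConf

// switch the *_SER storage levels (and shuffles) from Java serialization
// to Kryo, which typically produces a noticeably smaller serialized footprint
val conf = new SparkConf()
  .setAppName("persist-demo") // hypothetical app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```

This setting affects how the serialized storage levels encode partition data, so it directly reduces the memory cost of MEMORY_ONLY_SER and MEMORY_AND_DISK_SER.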