Spark's RDD persistence mechanism stores RDDs that must be computed repeatedly in memory, which can greatly improve job performance.
RDD persistence is used as follows:
RDD.persist(storageLevel) ------------ persist the RDD at the given storage level
RDD.unpersist() ------------ remove the RDD from the cache
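A minimal sketch of the two calls above, assuming an existing `SparkContext` named `sc`; the input path is hypothetical:

```scala
import org.apache.spark.storage.StorageLevel

// assumes an existing SparkContext `sc`; the HDFS path is illustrative only
val lines = sc.textFile("hdfs:///data/input.txt")
val words = lines.flatMap(_.split(" "))

// persist with an explicit storage level (persist() with no argument
// and cache() both default to MEMORY_ONLY)
words.persist(StorageLevel.MEMORY_ONLY)

words.count() // first action: computes the RDD and caches its partitions
words.count() // second action: reads the cached partitions instead of recomputing

// release the cached partitions once they are no longer needed
words.unpersist()
```

Note that `persist` is lazy: nothing is cached until the first action (here `count()`) actually materializes the RDD.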
Storage level notes:
MEMORY_ONLY : Spark stores the RDD as deserialized Java objects in the JVM heap, with no serialization applied.
If the RDD is too large to fit entirely in memory, the partitions that do not fit are not cached
and are recomputed each time they are needed.
MEMORY_AND_DISK : stores the RDD as deserialized Java objects in the JVM as far as memory allows; partitions that
do not fit in memory are cached on disk.
MEMORY_ONLY_SER : serializes the RDD (one byte array per partition) and caches it in memory.
Because the data must be deserialized on every read, this costs extra CPU, but it saves memory.
Partitions that do not fit in memory are not cached and are recomputed when needed.
MEMORY_AND_DISK_SER : like MEMORY_ONLY_SER, but partitions that do not fit in memory are cached on disk.
DISK_ONLY : stores the RDD partitions only on disk.
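Choosing between these levels is a memory/CPU trade-off. A sketch, again assuming an existing `SparkContext` `sc` and a hypothetical input path:

```scala
import org.apache.spark.storage.StorageLevel

// assumes an existing SparkContext `sc`; the path is illustrative only
val pairs = sc.textFile("hdfs:///data/logs").map(line => (line.length, line))

// memory-tight cluster: keep partitions serialized in memory (smaller footprint,
// more CPU per read) and spill whatever still does not fit to disk
pairs.persist(StorageLevel.MEMORY_AND_DISK_SER)
```

If the RDD is cheap to recompute, MEMORY_ONLY is usually preferable; the disk-backed levels pay off when recomputing a partition costs more than reading it back from disk.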
| Storage Level | Meaning |
|---|---|
| MEMORY_ONLY | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level. |
| MEMORY_AND_DISK | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed. |
| MEMORY_ONLY_SER | Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read. |
| MEMORY_AND_DISK_SER | Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed. |
| DISK_ONLY | Store the RDD partitions only on disk. |
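The "fast serializer" mentioned for MEMORY_ONLY_SER usually means Kryo, which Spark supports out of the box. A configuration sketch (the application name is hypothetical):

```scala
import org.apache.spark.SparkConf

// switch the *_SER storage levels (and shuffles) from Java serialization
// to Kryo, which typically produces a noticeably smaller serialized footprint
val conf = new SparkConf()
  .setAppName("persist-demo") // hypothetical app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```

This setting affects how the serialized storage levels encode partition data, so it directly reduces the memory cost of MEMORY_ONLY_SER and MEMORY_AND_DISK_SER.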