RDD Persistence

--Persistence method 1: RDD caching (cache / persist)
cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).

Notes:
cache()/persist() only marks the RDD and returns it unchanged; operators chained after the call build new RDDs that are not themselves cached.
cache()/persist() is lazy: the data is actually cached only once an Action operator runs on the RDD.

Typical use cases:
Right after loading data from a file, because re-reading the file is expensive.
After a long chain of transformations, because recomputing them is expensive.
After a single, particularly resource-intensive operator.

Storage levels (StorageLevel):
MEMORY_ONLY
MEMORY_AND_DISK
DISK_ONLY

Spark-shell session (standalone sketches of both methods follow at the end of these notes):

val v1 = sc.textFile("file:////opt/datas/abc.txt").cache
v1
Result: res4: org.apache.spark.rdd.RDD[String] = file:////opt/datas/abc.txt MapPartitionsRDD[3] at textFile at <console>:24

v1.collect
Result: res5: Array[String] = Array(hello java, "")

Delete the source file:
rm -rf /opt/datas/abc.txt

v1.collect
Result: Array[String] = Array(hello java, "") -- still answered from the cache even though the file is gone

Clear v1's cache:
v1.unpersist
Result: res8: v1.type = file:////opt/datas/abc.txt MapPartitionsRDD[3] at textFile at <console>:24

v1.collect
Result: error -- the cache was dropped and the source file no longer exists, so the lineage cannot be recomputed

--Persistence method 2: checkpoint
A checkpoint is similar to a snapshot: an action triggers writing the RDD's data into the checkpoint directory, and the RDD's lineage is truncated.

sc.setCheckpointDir("file:///opt/datas")
val rdd = sc.textFile("file:///opt/datas/abc.txt")
rdd.checkpoint
rdd.collect            //the action triggers writing the checkpoint
rdd.isCheckpointed     //Boolean = true
rdd.getCheckpointFile  //Option[String] = Some(file:/opt/datas/711c3013-049e-4b77-bb3a-fc355109505c/rdd-7)
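To make the storage levels above concrete, here is a minimal standalone sketch of persist() with an explicit StorageLevel, rather than the spark-shell default. It is a sketch under assumptions, not part of the original notes: the object name PersistDemo, the local[*] master, and the file path are illustrative placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical standalone app; app name, master, and path are placeholders.
object PersistDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("PersistDemo").setMaster("local[*]"))

    // Mark the RDD with an explicit level; MEMORY_AND_DISK spills
    // partitions that do not fit in memory to disk instead of dropping them.
    val lines = sc.textFile("file:///opt/datas/abc.txt")
      .persist(StorageLevel.MEMORY_AND_DISK)

    // The first action materializes the cache; later actions reuse it
    // instead of re-reading the file.
    println(lines.count())
    println(lines.filter(_.nonEmpty).count())

    lines.unpersist() // release the cached blocks once they are no longer needed
    sc.stop()
  }
}

MEMORY_AND_DISK is a common middle ground: MEMORY_ONLY silently recomputes partitions that do not fit, while DISK_ONLY gives up the speed of in-memory access entirely.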
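The checkpoint transcript above can likewise be sketched as a standalone program. One detail worth showing: the Spark API docs recommend persisting an RDD before checkpointing it, because the checkpoint is written by a separate job that would otherwise recompute the whole lineage. Again a sketch under assumptions; CheckpointDemo, the paths, and the flatMap step are illustrative.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch; object name and directories are placeholders.
object CheckpointDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("CheckpointDemo").setMaster("local[*]"))
    sc.setCheckpointDir("file:///opt/datas/ckpt")

    val words = sc.textFile("file:///opt/datas/abc.txt").flatMap(_.split(" "))

    // Cache first: the checkpoint job re-runs the RDD to write it out,
    // and the cache prevents that re-run from recomputing the lineage.
    words.cache()
    words.checkpoint()

    words.count()                    // action: materializes the cache and writes the checkpoint
    println(words.isCheckpointed)    // true after the action
    println(words.getCheckpointFile) // Some(file:/opt/datas/ckpt/<app-uuid>/rdd-N)
    sc.stop()
  }
}

The design difference between the two methods: cached blocks live on the executors and vanish when the application ends, while a checkpoint is written to external storage, survives the application, and cuts the lineage so failures no longer trigger recomputation from the source.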