java, spark, persist, hadoop – Where does my sparkDF.persist(DISK_ONLY) data get stored?

For the short answer, we can look at the documentation on spark.local.dir:

Directory to use for “scratch” space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.

For a deeper understanding we can look at the code: a DataFrame (which is just a Dataset[Row]) is based on RDDs and leverages the same persistence mechanisms. An RDD delegates persisting to the SparkContext, which marks it for persistence. The task is then actually handled by several classes in the org.apache.spark.storage package: first, a BlockManager manages the blocks of data to be persisted and the policy for doing so, delegating the actual persistence (when writing to disk, of course) to a DiskStore, which represents a high-level interface for writing and which in turn has a DiskBlockManager for lower-level operations.
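To make the entry point concrete, here is a minimal sketch of triggering that machinery from user code. The session setup and the sample data are hypothetical; the relevant calls are `persist(StorageLevel.DISK_ONLY)` and the action that materializes the blocks:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session, just for illustration
    val spark = SparkSession.builder()
      .appName("persist-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // DISK_ONLY asks the BlockManager to write the partitions to the
    // local scratch directories rather than keep them in memory
    df.persist(StorageLevel.DISK_ONLY)
    df.count() // persistence is lazy; an action materializes the blocks

    df.unpersist()
    spark.stop()
  }
}
```

Note that `persist` is lazy: nothing is written until an action such as `count()` runs.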

Hopefully you now see where we stand, so we can move on and understand where the data actually lives and how to configure it: the DiskBlockManager calls the helper Utils.getConfiguredLocalDirs, which for practicality I'll copy here (taken from the linked 2.2.1 version, the latest at the time of writing):

def getConfiguredLocalDirs(conf: SparkConf): Array[String] = {
  val shuffleServiceEnabled = conf.getBoolean("spark.shuffle.service.enabled", false)
  if (isRunningInYarnContainer(conf)) {
    // If we are in yarn mode, systems can have different disk layouts so we must set it
    // to what Yarn on this system said was available. Note this assumes that Yarn has
    // created the directories already, and that they are secured so that only the
    // user has access to them.
    getYarnLocalDirs(conf).split(",")
  } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
    conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
  } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
    conf.getenv("SPARK_LOCAL_DIRS").split(",")
  } else if (conf.getenv("MESOS_DIRECTORY") != null && !shuffleServiceEnabled) {
    // Mesos already creates a directory per Mesos task. Spark should use that directory
    // instead so all temporary files are automatically cleaned up when the Mesos task ends.
    // Note that we don't want this if the shuffle service is enabled because we want to
    // continue to serve shuffle files after the executors that wrote them have already exited.
    Array(conf.getenv("MESOS_DIRECTORY"))
  } else {
    if (conf.getenv("MESOS_DIRECTORY") != null && shuffleServiceEnabled) {
      logInfo("MESOS_DIRECTORY available but not using provided Mesos sandbox because " +
        "spark.shuffle.service.enabled is enabled.")
    }
    // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user
    // configuration to point to a secure directory. So create a subdirectory with restricted
    // permissions under each listed directory.
    conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(",")
  }
}

I believe the code is fairly self-explanatory and well commented (and matches the documentation exactly): when running on Yarn there is a specific policy that relies on the storage of the Yarn containers; on Mesos it uses the Mesos sandbox (unless the shuffle service is enabled); and in all other cases it goes to the location set under spark.local.dir, or java.io.tmpdir (likely /tmp/) as a fallback.

So, if you are just playing around, the data is most likely stored under /tmp/; otherwise it depends heavily on your environment and configuration.
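If you want to steer that last fallback branch yourself, you can set spark.local.dir explicitly. A minimal sketch (the mount paths are hypothetical; spark.local.dir accepts a comma-separated list of directories, ideally on fast local disks):

```scala
import org.apache.spark.sql.SparkSession

object LocalDirExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("scratch-dir-example")
      .master("local[*]")
      // Hypothetical paths: scratch space for persisted blocks and shuffle files
      .config("spark.local.dir", "/mnt/fast-disk1/spark,/mnt/fast-disk2/spark")
      .getOrCreate()

    // Blocks persisted with DISK_ONLY will now land under these directories
    spark.stop()
  }
}
```

Keep in mind that, as the documentation quoted above says, this setting is overridden on a cluster by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN), so it only takes effect in the final `else` branch of getConfiguredLocalDirs.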
