java, spark, persist, hadoop – Where does my sparkDF.persist(DISK_ONLY) data get stored?

For the short answer, we can look at the documentation on spark.local.dir:

Directory to use for “scratch” space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.

For a deeper understanding we can look at the code: a DataFrame (which is just a Dataset[Row]) is based on RDDs and leverages the same persistence mechanisms. An RDD delegates persisting to the SparkContext, which marks it for persistence. The task is then actually handled by several classes in the org.apache.spark.storage package: first, a BlockManager manages the blocks of data to be persisted and the policy for doing so, delegating the actual persistence (when writing to disk, of course) to a DiskStore, which represents a high-level interface for writing and which in turn has a DiskBlockManager for lower-level operations.
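To make the entry point concrete, here is a minimal sketch of triggering that machinery from user code. The session setup and the sample data are hypothetical; the relevant calls are `persist(StorageLevel.DISK_ONLY)` and the action that materializes the blocks:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session, just for illustration
    val spark = SparkSession.builder()
      .appName("persist-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // DISK_ONLY asks the BlockManager to write the partitions to the
    // local scratch directories rather than keep them in memory
    df.persist(StorageLevel.DISK_ONLY)
    df.count() // persistence is lazy; an action materializes the blocks

    df.unpersist()
    spark.stop()
  }
}
```

Note that `persist` is lazy: nothing is written until an action such as `count()` runs.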

Hopefully you now see where we stand, so we can move on and understand where the data actually lives and how to configure it: the DiskBlockManager calls the helper Utils.getConfiguredLocalDirs, which for practicality I'll copy here (taken from the linked 2.2.1 version, the latest at the time of writing):

def getConfiguredLocalDirs(conf: SparkConf): Array[String] = {
  val shuffleServiceEnabled = conf.getBoolean("spark.shuffle.service.enabled", false)
  if (isRunningInYarnContainer(conf)) {
    // If we are in yarn mode, systems can have different disk layouts so we must set it
    // to what Yarn on this system said was available. Note this assumes that Yarn has
    // created the directories already, and that they are secured so that only the
    // user has access to them.
    getYarnLocalDirs(conf).split(",")
  } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
    conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
  } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
    conf.getenv("SPARK_LOCAL_DIRS").split(",")
  } else if (conf.getenv("MESOS_DIRECTORY") != null && !shuffleServiceEnabled) {
    // Mesos already creates a directory per Mesos task. Spark should use that directory
    // instead so all temporary files are automatically cleaned up when the Mesos task ends.
    // Note that we don't want this if the shuffle service is enabled because we want to
    // continue to serve shuffle files after the executors that wrote them have already exited.
    Array(conf.getenv("MESOS_DIRECTORY"))
  } else {
    if (conf.getenv("MESOS_DIRECTORY") != null && shuffleServiceEnabled) {
      logInfo("MESOS_DIRECTORY available but not using provided Mesos sandbox because " +
        "spark.shuffle.service.enabled is enabled.")
    }
    // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user
    // configuration to point to a secure directory. So create a subdirectory with restricted
    // permissions under each listed directory.
    conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(",")
  }
}

I believe the code is fairly self-explanatory and well commented (and matches the documentation exactly): when running on Yarn there is a specific policy that relies on the storage of the Yarn containers; on Mesos it uses the Mesos sandbox (unless the shuffle service is enabled); and in all other cases it goes to the location set under spark.local.dir, or java.io.tmpdir (likely /tmp/) as a fallback.

So, if you are just playing around, the data is most likely stored under /tmp/; otherwise it depends heavily on your environment and configuration.
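If you want to steer that last fallback branch yourself, you can set spark.local.dir explicitly. A minimal sketch (the mount paths are hypothetical; spark.local.dir accepts a comma-separated list of directories, ideally on fast local disks):

```scala
import org.apache.spark.sql.SparkSession

object LocalDirExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("scratch-dir-example")
      .master("local[*]")
      // Hypothetical paths: scratch space for persisted blocks and shuffle files
      .config("spark.local.dir", "/mnt/fast-disk1/spark,/mnt/fast-disk2/spark")
      .getOrCreate()

    // Blocks persisted with DISK_ONLY will now land under these directories
    spark.stop()
  }
}
```

Keep in mind that, as the documentation quoted above says, this setting is overridden on a cluster by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN), so it only takes effect in the final `else` branch of getConfiguredLocalDirs.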
