What this source-code walkthrough covers:
How the .hive-staging_xxx temporary directories you often see are created, and how their data ends up in the final directory.
Data persistence happens in two phases:
1. commitTask
Each task on the executor side runs the commitTask method, moving its data files from the task's temporary directory to the job's temporary directory.
2. commitJob
The driver runs the commitJob method, moving the data files committed by the individual tasks from the job's temporary directory to the job's final target directory, and creating a _SUCCESS file to mark success.
Note that commitTask follows one of two algorithms, depending on which one FileOutputCommitter is configured to use:
FileOutputCommitter V1 commit algorithm
The V1 algorithm performs two rename passes. Each task first writes its data to a temporary directory of the form:
finalTargetDir/_temporary/appAttemptDir/_temporary/taskAttemptDir/dataFile
Once the task finishes writing, its commitTask call performs the first rename, moving the data files from the task attempt directory to:
finalTargetDir/_temporary/appAttemptDir/taskDir/dataFile
Finally, after every task has run commitTask, the driver runs commitJob, which performs the second rename: it moves the data files out of each task directory under the job temporary directory into the final target directory and creates the _SUCCESS marker file:
finalTargetDir/dataFile
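The two renames can be sketched with plain local-filesystem operations. Below is a toy illustration of the V1 flow for a single task, with made-up directory names; it is not the actual Hadoop FileOutputCommitter code, which works through the Hadoop FileSystem API:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Toy sketch of the FileOutputCommitter V1 two-rename flow.
public class V1CommitSketch {

    // Task side: write a data file under the task attempt directory.
    public static Path writeTask(Path finalDir) throws IOException {
        Path attemptDir = finalDir.resolve("_temporary").resolve("0")
                                  .resolve("_temporary").resolve("attempt_0");
        Files.createDirectories(attemptDir);
        Files.writeString(attemptDir.resolve("part-00000"), "rows");
        return attemptDir;
    }

    // commitTask: first rename, attempt directory -> job temporary directory.
    public static Path commitTask(Path finalDir, Path attemptDir) throws IOException {
        Path taskDir = finalDir.resolve("_temporary").resolve("0").resolve("task_0");
        Files.move(attemptDir, taskDir);
        return taskDir;
    }

    // commitJob: second rename, task files -> final directory, then delete
    // _temporary and create the _SUCCESS marker.
    public static void commitJob(Path finalDir, Path taskDir) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(taskDir)) {
            for (Path f : files) Files.move(f, finalDir.resolve(f.getFileName()));
        }
        deleteRecursively(finalDir.resolve("_temporary"));
        Files.createFile(finalDir.resolve("_SUCCESS"));
    }

    static void deleteRecursively(Path p) throws IOException {
        if (Files.isDirectory(p)) {
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(p)) {
                for (Path e : entries) deleteRecursively(e);
            }
        }
        Files.delete(p);
    }

    public static void main(String[] args) throws IOException {
        Path finalDir = Files.createTempDirectory("finalTargetDir");
        Path attemptDir = writeTask(finalDir);
        Path taskDir = commitTask(finalDir, attemptDir);
        commitJob(finalDir, taskDir);
        System.out.println(Files.exists(finalDir.resolve("part-00000"))); // true
        System.out.println(Files.exists(finalDir.resolve("_SUCCESS")));   // true
    }
}
```

After commitJob, only the data files and the _SUCCESS marker remain under the final directory; the _temporary tree is gone.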
FileOutputCommitter V2 commit algorithm
The V1 algorithm handles data consistency well: inconsistency can only arise while a rename is in flight, and in practice that window is tiny. The two renames, however, cost performance: when many tasks write data, the job can take a long time to finish even after every task has completed, because the driver spends that time on the second rename. The problem is especially acute on object stores, where a rename is typically a copy plus a delete.
The V2 algorithm removes the second rename from the commitJob phase to fix this, at the cost of some data consistency. With V2, if some tasks have committed successfully and the job then fails, their output is already visible in the final directory, i.e. consumers can see dirty data. Downstream consumers must check whether a fresh _SUCCESS marker file was created to judge whether the data set is complete.
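Which algorithm runs is controlled by the Hadoop property mapreduce.fileoutputcommitter.algorithm.version (default 1). From spark-submit it can be passed through the spark.hadoop. prefix, for example (the application jar name is a placeholder):

```shell
# 1 = two renames (safer), 2 = single rename (faster, weaker consistency).
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  my-app.jar
```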
The relevant source code:
org.apache.spark.sql.hive.execution.InsertIntoHiveTable#run
// Staging directory, e.g. .hive-staging_hive_2022-12-07_15-26-44_044_728963245003028863-1/-ext-10000/
val tmpLocation = getExternalTmpPath(sparkSession, hadoopConf, tableLocation)
processInsert(sparkSession, externalCatalog, hadoopConf, tableDesc, tmpLocation, child)
saveAsHiveFile(
  sparkSession = sparkSession,
  plan = child,
  hadoopConf = hadoopConf,
  fileSinkConf = fileSinkConf,
  outputLocation = tmpLocation.toString,
  partitionAttributes = partitionAttributes)
// Instantiate the FileCommitProtocol implementation
val committer = FileCommitProtocol.instantiate(
  sparkSession.sessionState.conf.fileCommitProtocolClass,
  jobId = java.util.UUID.randomUUID().toString,
  outputPath = outputLocation)
// The commit protocol class defaults to SQLHadoopMapReduceCommitProtocol
def fileCommitProtocolClass: String = getConf(SQLConf.FILE_COMMIT_PROTOCOL_CLASS)
val FILE_COMMIT_PROTOCOL_CLASS =
  buildConf("spark.sql.sources.commitProtocolClass")
    .internal()
    .stringConf
    .createWithDefault(
      "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")
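Since spark.sql.sources.commitProtocolClass is an ordinary (if internal) config, the default can in principle be overridden with a custom FileCommitProtocol subclass; FileCommitProtocol.instantiate constructs it reflectively from the job ID and output path. A hypothetical example, where com.example.MyCommitProtocol and the jar name are placeholders:

```shell
spark-submit \
  --conf spark.sql.sources.commitProtocolClass=com.example.MyCommitProtocol \
  my-app.jar
```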
org.apache.spark.sql.execution.datasources.FileFormatWriter#write
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol#setupJob
// Set up the committer; unless the user configures one, it defaults to org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol#setupCommitter
// Create the job temporary directory, e.g. file:/D:/vortual/java_project/java-learn/spark-warehouse/xxx/.hive-staging_hive_2022-12-07_15-26-44_044_728963245003028863-1/-ext-10000/_temporary/0
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter#setupJob
// Task-side writing of the output files
org.apache.spark.sql.execution.datasources.FileFormatWriter#executeTask
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter
newOutputWriter()
// Create the task temporary file, e.g. file:/D:/vortual/java_project/java-learn/spark-warehouse/xxx/.hive-staging_hive_2022-12-08_14-29-21_347_396488422317007477-1/-ext-10000/_temporary/0/_temporary/attempt_20221208142922_0000_m_000000_0/part-00000-b46f3490-b74b-469c-933e-fc8a6e6adce6-c000
val currentPath = committer.newTaskTempFile(
  taskAttemptContext,
  None,
  f"-c$fileCounter%03d" + ext)
dataWriter.commit()
org.apache.spark.sql.execution.datasources.FileFormatDataWriter#commit
WriteTaskResult(committer.commitTask(taskAttemptContext), summary)
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol#commitTask
org.apache.spark.mapred.SparkHadoopMapRedUtil#commitTask
// Check whether this task attempt is allowed to commit
if (canCommit) {
  performCommit()
}
// The commit-permission check is implemented here:
org.apache.spark.scheduler.OutputCommitCoordinator#handleAskPermissionToCommit
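handleAskPermissionToCommit implements a first-committer-wins rule: for each partition, the first task attempt that asks is authorized, and later attempts (e.g. speculative duplicates or retries) are denied, so only one attempt's output is committed. A toy model of that arbitration; this is not Spark's actual OutputCommitCoordinator, which also tracks stages and failed attempts:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the driver-side commit arbitration: for each partition, the
// first attempt that asks for permission wins; later attempts are denied.
public class CommitCoordinatorSketch {
    private final Map<Integer, Integer> authorized = new HashMap<>();

    // Returns true iff this attempt may run commitTask for the partition.
    public synchronized boolean canCommit(int partition, int attemptNumber) {
        Integer winner = authorized.putIfAbsent(partition, attemptNumber);
        return winner == null || winner == attemptNumber;
    }

    public static void main(String[] args) {
        CommitCoordinatorSketch c = new CommitCoordinatorSketch();
        System.out.println(c.canCommit(0, 0)); // true: first asker wins
        System.out.println(c.canCommit(0, 1)); // false: speculative duplicate denied
        System.out.println(c.canCommit(1, 1)); // true: a different partition
    }
}
```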
// Commit the task. Under algorithm V1 this renames the task output into the job temporary directory; under V2 it renames it straight into the final output directory
// The final output directory looks like: file:/D:/vortual/java_project/java-learn/spark-warehouse/xxx/.hive-staging_hive_2022-12-07_15-26-44_044_728963245003028863-1/-ext-10000/
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter#commitTask(org.apache.hadoop.mapreduce.TaskAttemptContext, org.apache.hadoop.fs.Path)
// Commit the job
In class org.apache.spark.sql.execution.datasources.FileFormatWriter:
committer.commitJob(job, commitMsgs)
org.apache.spark.internal.io.FileCommitProtocol#commitJob
logInfo(s"Write Job ${description.uuid} committed.")
override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = {
  committer.commitJob(jobContext)
}
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter#commitJob
* The job has completed, so do following commit job, include:
* Move all committed tasks to the final output dir (algorithm 1 only).
* Delete the temporary directory, including all of the work directories.
* Create a _SUCCESS file to make it as successful.
That is: move the output to the final directory, delete the temporary directory, and create the _SUCCESS file to mark success.
The final directory looks like: file:/D:/vortual/java_project/java-learn/spark-warehouse/xxx/.hive-staging_hive_2022-12-08_09-20-27_592_2270078055673944735-1/-ext-10000
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter#commitJobInternal
The code that moves the data under the .hive-staging_xxx directory into the table's directory:
externalCatalog.loadTable(
  table.database,
  table.identifier.table,
  tmpLocation.toString, // TODO: URI
  overwrite,
  isSrcLocal = false)
org.apache.spark.sql.hive.HiveExternalCatalog#loadTable
client.loadTable(
  loadPath,
  s"$db.$table",
  isOverwrite,
  isSrcLocal)
org.apache.spark.sql.hive.client.HiveClientImpl#loadTable
org.apache.spark.sql.hive.client.Shim_v0_14#loadTable
org.apache.hadoop.hive.ql.metadata.Hive#loadTable
// The code that moves the temporary directory's contents under the Hive table directory
Path tableDest = tbl.getPath();
replaceFiles(tableDest, loadPath, tableDest, tableDest, sessionConf, isSrcLocal);
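Conceptually, this last step swaps the staging output in for the table's existing files. A toy local-filesystem sketch of what an overwrite-style load achieves; this is not Hive's real replaceFiles, which also deals with trash, permissions, and partitioned paths:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Toy sketch of an overwrite-style load: clear the table directory's old
// data files, then move the staging directory's files in.
public class ReplaceFilesSketch {
    public static void replaceFiles(Path stagingDir, Path tableDir) throws IOException {
        // Remove the table's existing data files (overwrite semantics).
        try (DirectoryStream<Path> old = Files.newDirectoryStream(tableDir)) {
            for (Path f : old) Files.delete(f);
        }
        // Move the committed staging output under the table directory.
        try (DirectoryStream<Path> fresh = Files.newDirectoryStream(stagingDir)) {
            for (Path f : fresh) Files.move(f, tableDir.resolve(f.getFileName()));
        }
    }

    public static void main(String[] args) throws IOException {
        Path staging = Files.createTempDirectory("staging");
        Path table = Files.createTempDirectory("table");
        Files.writeString(table.resolve("part-old"), "stale");
        Files.writeString(staging.resolve("part-00000"), "fresh");
        replaceFiles(staging, table);
        System.out.println(Files.exists(table.resolve("part-00000"))); // true
        System.out.println(Files.exists(table.resolve("part-old")));   // false
    }
}
```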
References:
http://www.jasongj.com/spark/committer/
https://blog.csdn.net/u013332124/article/details/92001346
https://aws.amazon.com/cn/blogs/china/application-and-practice-of-spark-small-file-merging-function-on-aws-s3/