Note: Spark version 2.3.1
When tuning HiveSQL, you have to enable a parameter to merge input splits, otherwise a large number of splits is produced.
So how does SparkSQL deal with a large number of small input files?
This article uses a Hive table as the example (many small parquet files, which are splittable).
First, we debug into org.apache.spark.sql.execution.FileSourceScanExec.
There is a pattern match here; since our table is not bucketed (the match is on the table's bucketSpec), we fall into the default case, createNonBucketedReadRDD.
The code is as follows:
private def createNonBucketedReadRDD(
    readFile: (PartitionedFile) => Iterator[InternalRow],
    selectedPartitions: Seq[PartitionDirectory],
    fsRelation: HadoopFsRelation): RDD[InternalRow] = {
  val defaultMaxSplitBytes =
    fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
  val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
  val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
  val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
  val bytesPerCore = totalBytes / defaultParallelism

  val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))

  val splitFiles = selectedPartitions.flatMap { partition =>
    partition.files.flatMap { file =>
      val blockLocations = getBlockLocations(file)
      if (fsRelation.fileFormat.isSplitable(
          fsRelation.sparkSession, fsRelation.options, file.getPath)) {
        (0L until file.getLen by maxSplitBytes).map { offset =>
          val remaining = file.getLen - offset
          val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
          val hosts = getBlockHosts(blockLocations, offset, size)
          PartitionedFile(
            partition.values, file.getPath.toUri.toString, offset, size, hosts)
        }
      } else {
        val hosts = getBlockHosts(blockLocations, 0, file.getLen)
        Seq(PartitionedFile(
          partition.values, file.getPath.toUri.toString, 0, file.getLen, hosts))
      }
    }
  }.toArray.sortBy(_.length)(implicitly[Ordering[Long]].reverse)

  val partitions = new ArrayBuffer[FilePartition]
  val currentFiles = new ArrayBuffer[PartitionedFile]
  var currentSize = 0L

  /** Close the current partition and move to the next. */
  def closePartition(): Unit = {
    if (currentFiles.nonEmpty) {
      val newPartition =
        FilePartition(
          partitions.size,
          currentFiles.toArray.toSeq) // Copy to a new Array.
      partitions += newPartition
    }
    currentFiles.clear()
    currentSize = 0
  }

  // Assign files to partitions using "Next Fit Decreasing"
  splitFiles.foreach { file =>
    if (currentSize + file.length > maxSplitBytes) {
      closePartition()
    }
    // Add the given file to the current partition.
    currentSize += file.length + openCostInBytes
    currentFiles += file
  }
  closePartition()

  new FileScanRDD(fsRelation.sparkSession, readFile, partitions)
}
There are several key parameters and functions here:
defaultMaxSplitBytes: the maximum size of a split, 128 MB by default; controlled by "spark.sql.files.maxPartitionBytes".
openCostInBytes: the estimated cost of opening a file, 4 MB by default; controlled by "spark.sql.files.openCostInBytes".
defaultParallelism: the default parallelism (if not set explicitly, usually the total number of CPU cores).
totalBytes: the total size of all parquet files, computed as selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum; note that each file is padded with openCostInBytes.
bytesPerCore: the amount of data handled per core, i.e. totalBytes / defaultParallelism.
maxSplitBytes: Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)); when processing large volumes of data this usually comes out to 128 MB.
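To make the formula concrete, here is a worked example with hypothetical numbers (1000 parquet files of 1 MB each and defaultParallelism = 200; these figures are illustrative, not from the source):

val mb = 1024L * 1024
val defaultMaxSplitBytes = 128 * mb                  // spark.sql.files.maxPartitionBytes
val openCostInBytes = 4 * mb                         // spark.sql.files.openCostInBytes
val defaultParallelism = 200                         // hypothetical cluster parallelism
val totalBytes = 1000 * (1 * mb + openCostInBytes)   // 1000 files, each padded: 5000 MB
val bytesPerCore = totalBytes / defaultParallelism   // 25 MB
val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
// maxSplitBytes = 25 MB: with many small files the effective split size can drop below 128 MB.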
Next, val splitFiles = selectedPartitions.flatMap { ... }:
splitFiles is an Array of file slices (PartitionedFile) produced by logical splitting.
As the code shows, every splittable file is cut into slices no larger than maxSplitBytes (typically 128 MB); a file smaller than that simply becomes a single slice. See the slicing sketch after the snippet below.
(0L until file.getLen by maxSplitBytes).map { offset =>
  val remaining = file.getLen - offset
  val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
  val hosts = getBlockHosts(blockLocations, offset, size)
  PartitionedFile(
    partition.values, file.getPath.toUri.toString, offset, size, hosts)
}
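A minimal sketch of the slicing arithmetic, assuming maxSplitBytes = 128 MB and a single 300 MB splittable file (hypothetical sizes):

val mb = 1024L * 1024
val fileLen = 300 * mb
val maxSplitBytes = 128 * mb
val sliceSizes = (0L until fileLen by maxSplitBytes).map { offset =>
  Math.min(fileLen - offset, maxSplitBytes)   // last slice holds the remainder
}
// sliceSizes: 128 MB, 128 MB, 44 MB -- three PartitionedFile slices for one file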
Then the slices are merged: slices are packed into the current partition until adding the next one would push its size past maxSplitBytes (roughly 128 MB), at which point the partition is closed and a new one is started.
// Assign files to partitions using "Next Fit Decreasing"
splitFiles.foreach { file =>
  if (currentSize + file.length > maxSplitBytes) {
    closePartition()
  }
  // Add the given file to the current partition.
  currentSize += file.length + openCostInBytes
  currentFiles += file
}
/** Close the current partition and move to the next. */
def closePartition(): Unit = {
  if (currentFiles.nonEmpty) {
    val newPartition =
      FilePartition(
        partitions.size,
        currentFiles.toArray.toSeq) // Copy to a new Array.
    partitions += newPartition
  }
  currentFiles.clear()
  currentSize = 0
}
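The packing strategy is "Next Fit Decreasing": slices are sorted by length in descending order, and each slice either fits into the current partition or closes it and opens a new one. Below is a self-contained sketch of the same logic using plain Long lengths instead of PartitionedFile (the function name and numbers are illustrative, not from the Spark source):

import scala.collection.mutable.ArrayBuffer

def packNextFitDecreasing(
    lengths: Seq[Long], maxSplitBytes: Long, openCostInBytes: Long): Seq[Seq[Long]] = {
  val partitions = ArrayBuffer[Seq[Long]]()
  val current = ArrayBuffer[Long]()
  var currentSize = 0L
  def closePartition(): Unit = {
    if (current.nonEmpty) partitions += current.toList
    current.clear()
    currentSize = 0
  }
  lengths.sorted(Ordering[Long].reverse).foreach { len =>
    if (currentSize + len > maxSplitBytes) closePartition()
    // The open cost is added as padding, so a partition of many tiny files stays small.
    currentSize += len + openCostInBytes
    current += len
  }
  closePartition()
  partitions.toList
}

For example, with maxSplitBytes = 25 MB and openCostInBytes = 4 MB, ten 1 MB slices are packed into two partitions of five slices each (5 × (1 MB + 4 MB) = 25 MB).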
Finally the function returns new FileScanRDD(fsRelation.sparkSession, readFile, partitions).
At this point the splitting and merging of the input files is complete.
So SparkSQL does not force the user to care whether the input side contains small files: they are coalesced into reasonably sized partitions automatically, and we can focus on the query logic instead.
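If you do want to tune the merge granularity, the two configurations discussed above can be set when building the session. A minimal sketch (the values shown are just the defaults, and the app name is hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("small-files-demo")                                     // hypothetical app name
  .config("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)  // defaultMaxSplitBytes
  .config("spark.sql.files.openCostInBytes", 4 * 1024 * 1024)      // openCostInBytes
  .getOrCreate()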