How Spark splits Parquet files when reading them

The physical execution node for a data source read is FileSourceScanExec. For a non-bucketed scan it calls the createNonBucketedReadRDD method, defined as follows:

private def createNonBucketedReadRDD(
readFile: (PartitionedFile) => Iterator[InternalRow],
selectedPartitions: Seq[PartitionDirectory],
fsRelation: HadoopFsRelation): RDD[InternalRow] = {
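// Maximum number of bytes to pack into one partition when reading files; 256 MB here, the size of one block
val defaultMaxSplitBytes =
fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
// File open cost: the estimated number of bytes that could be scanned in the time it takes to open a file; 4 MB
val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
// 400 in the first test below (200 executors x 2 cores)
val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
// Total bytes to read: each file's length as reported by the file system, plus the open cost per file
val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum

// Bytes to read per core
val bytesPerCore = totalBytes / defaultParallelism

// Target split size; never exceeds the configured 256 MB
val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))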
logInfo(s"Planning scan with bin packing, max size: $maxSplitBytes bytes, " +
s"open cost is considered as scanning $openCostInBytes bytes.")

val splitFiles = selectedPartitions.flatMap { partition =>
  partition.files.flatMap { file =>
    val blockLocations = getBlockLocations(file)
    // Check whether the file format is splittable. If it is, cut the file into consecutive
    // chunks of at most maxSplitBytes each, producing a sequence of PartitionedFile objects.
    if (fsRelation.fileFormat.isSplitable(
        fsRelation.sparkSession, fsRelation.options, file.getPath)) {
      (0L until file.getLen by maxSplitBytes).map { offset =>
        val remaining = file.getLen - offset
        // If more than maxSplitBytes remain, take a full chunk; otherwise the rest of the file
        // becomes the last chunk. A file is thus cut into several chunks of maxSplitBytes
        // (at most 256 MB) plus one smaller tail.
        val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
        val hosts = getBlockHosts(blockLocations, offset, size)

53.3 G  159.9 G  hdfs:/day=2023-07-04/hour=10

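There are 103 Parquet files in total, each about 530 MB, for a total of 53.3 GB = 54579.2 MB.

1. Observed result: resources were requested with --num-executors 200 --executor-memory 8G --executor-cores 2
The number of tasks actually launched was 412.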

defaultMaxSplitBytes = 256MB
openCostInBytes = 4MB
totalBytes = 54579.2MB + 103 * 4MB = 54991.2MB
bytesPerCore = 54991.2MB / 400 = 137.5MB
maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)) = 137.5MB
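The arithmetic above can be reproduced with a small helper. This is only a sketch of the formula, not Spark's own code: `SplitSizeSketch` and `maxSplitMb` are names made up for illustration, sizes are expressed in MB as `Double`, and the default arguments assume the 256 MB and 4 MB settings used in this test.

```scala
object SplitSizeSketch {
  /** Mirrors min(defaultMaxSplitBytes, max(openCostInBytes, bytesPerCore)); all sizes in MB. */
  def maxSplitMb(
      totalDataMb: Double,
      fileCount: Int,
      defaultParallelism: Int,
      maxPartitionMb: Double = 256.0, // assumed setting, as in this test
      openCostMb: Double = 4.0): Double = {
    // Every file is charged its length plus the open cost.
    val totalMb = totalDataMb + fileCount * openCostMb
    val perCoreMb = totalMb / defaultParallelism
    math.min(maxPartitionMb, math.max(openCostMb, perCoreMb))
  }

  def main(args: Array[String]): Unit = {
    // 103 files, 54579.2 MB in total, 200 executors x 2 cores = 400
    println(maxSplitMb(54579.2, 103, 400)) // roughly 137.5
  }
}
```

Because the result is capped at `maxPartitionMb` and floored at `openCostMb`, changing the number of cores only changes the split size while `bytesPerCore` stays between those two bounds.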

// Where the 412 comes from: extract part of the split logic and test it (all sizes in KB)
val openCostInBytes = 4 * 1024
val totalBytes = (53.3 * 1024 * 1024 + 103 * 4 * 1024).toInt
val maxSplitBytes = (137.5 * 1024).toInt

// 103 files of about 530 MB each, cut into chunks of at most maxSplitBytes, largest first
val data = Array.fill(103)(530 * 1024).flatMap(fileSize => {
  (0 until fileSize by maxSplitBytes).map(offset => {
    val remaining = fileSize - offset
    val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
    size
  })
}).sortBy(-_)

data.foreach(println(_))

import scala.collection.mutable.ArrayBuffer

val partitions = new ArrayBuffer[Int]
val currentFiles = new ArrayBuffer[Int]
var currentSize = 0L

/** Close the current partition and move to the next. */
def closePartition(): Unit = {
  if (currentFiles.nonEmpty) {
    // Only the partition count matters in this test, so record a placeholder.
    partitions.append(1)
  }
  currentFiles.clear()
  currentSize = 0
}

// Assign files to partitions using "First Fit Decreasing" (FFD)
data.foreach { file =>
  if (currentSize + file > maxSplitBytes) {
    closePartition()
  }
  // Add the given file to the current partition.
  currentSize += file + openCostInBytes
  currentFiles += file
}
closePartition()

// Number of partitions, i.e. tasks; 412 for the inputs above
println(partitions.size)
2. Observed result: resources were requested with --num-executors 100 --executor-memory 8G --executor-cores 1
The number of tasks actually launched was 215 = ceil(54991.2 / 256).

defaultMaxSplitBytes = 256MB
openCostInBytes = 4MB
totalBytes = 54579.2MB + 103 * 4MB = 54991.2MB
bytesPerCore = 54991.2MB / 100 = 549.91MB
maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)) = 256MB
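When maxSplitBytes is 256 MB, each 530 MB file leaves a small tail chunk, and several tails can be packed into one partition. The sketch below parameterizes the split-and-pack steps so that both scenarios can be compared. It is an illustration, not Spark's code: `PackingSketch` and `countPartitions` are hypothetical names, sizes are in KB, and every file is assumed to be exactly 530 MB, which the measurements here only state approximately.

```scala
object PackingSketch {
  /** Number of partitions after splitting and packing; all sizes in KB. */
  def countPartitions(fileSizesKb: Seq[Long], maxSplitKb: Long, openCostKb: Long): Int = {
    // Cut each file into chunks of at most maxSplitKb, then sort largest first.
    val splits = fileSizesKb.flatMap { len =>
      (0L until len by maxSplitKb).map(offset => math.min(maxSplitKb, len - offset))
    }.sortBy(-_)

    var partitions = 0
    var currentSize = 0L
    var currentCount = 0

    // Close the current partition (if it holds anything) and start a new one.
    def close(): Unit = {
      if (currentCount > 0) partitions += 1
      currentSize = 0L
      currentCount = 0
    }

    splits.foreach { s =>
      if (currentSize + s > maxSplitKb) close()
      // Each chunk is charged its size plus the open cost.
      currentSize += s + openCostKb
      currentCount += 1
    }
    close()
    partitions
  }

  def main(args: Array[String]): Unit = {
    val files = Seq.fill(103)(530L * 1024) // assumption: all files exactly 530 MB
    println(countPartitions(files, (137.5 * 1024).toLong, 4L * 1024)) // 400 cores
    println(countPartitions(files, 256L * 1024, 4L * 1024))           // 100 cores
  }
}
```

Under the exact-530 MB assumption the first call yields 412 and the second 216: 206 full 256 MB chunks plus 103 tails of 18 MB packed eleven to a partition. That is close to, but not identical with, the observed 215, so the difference is presumably down to the real file sizes varying around 530 MB.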

4.2 G 12.7 G hdfs:///day=2023-07-03/hour=22/minute=00

defaultMaxSplitBytes = 256MB
openCostInBytes = 4MB
totalBytes = 4300.8MB + 120 * 4MB = 4780.8MB
bytesPerCore = 4780.8MB / 400 = 11.95MB
maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)) = 11.95MB

// Remainder of the split loop in createNonBucketedReadRDD: each chunk becomes a PartitionedFile
PartitionedFile(
partition.values, file.getPath.toUri.toString, offset, size, hosts)
}
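The two values read from the session configuration in createNonBucketedReadRDD, filesMaxPartitionBytes and filesOpenCostInBytes, correspond to the settings `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`. A minimal sketch of setting them explicitly follows. It assumes Spark is on the classpath; the application name and the local master are placeholders for illustration, and the 256 MB value mirrors the tests in this post rather than Spark's 128 MB default.

```scala
import org.apache.spark.sql.SparkSession

object SplitConfigSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("split-size-demo") // hypothetical name
      .master("local[*]")         // assumption: local run for illustration only
      .config("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024)
      .config("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)
      .getOrCreate()

    // defaultParallelism is the third input to the maxSplitBytes formula.
    println(spark.sparkContext.defaultParallelism)
    println(spark.conf.get("spark.sql.files.maxPartitionBytes"))

    spark.stop()
  }
}
```

On a real cluster the parallelism comes from the requested executors and cores, so the same data can be cut into a different number of tasks from one submission to the next.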
