How Spark splits Parquet files when reading them

Latest recommended article published on 2024-05-26 00:15:00

荣晓


Category: spark. Tags: spark, big data, distributed

Article link: https://blog.csdn.net/weixin_43015677/article/details/131652418


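When reading from a data source, the corresponding physical execution node is FileSourceScanExec. For a non-bucketed scan it calls the createNonBucketedReadRDD method, defined as follows: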

private def createNonBucketedReadRDD(
    readFile: (PartitionedFile) => Iterator[InternalRow],
    selectedPartitions: Seq[PartitionDirectory],
    fsRelation: HadoopFsRelation): RDD[InternalRow] = {
  // Maximum number of bytes packed into one partition when reading files;
  // 256 MB here, matching the size of one block.
  val defaultMaxSplitBytes =
    fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
  // File open cost: every opened file is counted as at least this many bytes read; 4 MB here.
  val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
  // 400 in the first run below (200 executors x 2 cores).
  val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
  // Total bytes to read: the length of every file as reported by the file system,
  // plus the open cost for each file.
  val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum

  // Bytes to read per core.
  val bytesPerCore = totalBytes / defaultParallelism

  // Resulting split size; it never exceeds the configured 256 MB.
  val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
  logInfo(s"Planning scan with bin packing, max size: $maxSplitBytes bytes, " +
    s"open cost is considered as scanning $openCostInBytes bytes.")

  val splitFiles = selectedPartitions.flatMap { partition =>
    partition.files.flatMap { file =>
      val blockLocations = getBlockLocations(file)
      // Check whether the file is splittable (that is, whether its format can be split).
      // If it is, cut it by maxSplitBytes into a sequence of PartitionedFile objects:
      // consecutive chunks, each no larger than maxSplitBytes.
      if (fsRelation.fileFormat.isSplitable(
          fsRelation.sparkSession, fsRelation.options, file.getPath)) {
        (0L until file.getLen by maxSplitBytes).map { offset =>
          val remaining = file.getLen - offset
          // If remaining is larger than the maximum split size, cut off a chunk of
          // maxSplitBytes; otherwise the rest becomes the final chunk. A file therefore
          // ends up as several 256 MB chunks plus one smaller remainder.
          val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
          val hosts = getBlockHosts(blockLocations, offset, size)
          // (The listing is cut off here; the PartitionedFile(...) construction that
          // follows appears at the end of the article.)

53.3 G  159.9 G  hdfs:/day=2023-07-04/hour=10

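There are 103 Parquet files in total, each about 530 MB, for 53.3 GB = 54579.2 MB overall.

1. Observed result with resources requested as --num-executors 200 --executor-memory 8G --executor-cores 2
The number of tasks actually launched was 412.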

defaultMaxSplitBytes = 256 MB
openCostInBytes = 4 MB
totalBytes = 54579.2 MB + 103 * 4 MB = 54991.2 MB
bytesPerCore = 54991.2 MB / 400 = 137.5 MB
maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)) = 137.5 MB
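The arithmetic above can be sketched as a small standalone function. This is only an illustration of the min/max formula from the listing, not Spark's own code: the object name, the method name and the default values (256 MB and 4 MB, taken from this article's settings) are assumptions, and the 103 files are approximated as exactly 530 MB each, so the result differs slightly from the hand calculation based on 54579.2 MB.

```scala
object MaxSplitBytesSketch {
  private val MB = 1024L * 1024L

  /** Mirrors min(maxPartitionBytes, max(openCostInBytes, totalBytes / parallelism)). */
  def maxSplitBytes(
      fileSizes: Seq[Long],
      parallelism: Int,
      maxPartitionBytes: Long = 256 * MB, // assumed setting, as in this article
      openCostInBytes: Long = 4 * MB      // assumed setting, as in this article
  ): Long = {
    // Every file is charged its length plus the open cost.
    val totalBytes = fileSizes.map(_ + openCostInBytes).sum
    val bytesPerCore = totalBytes / parallelism
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }

  def main(args: Array[String]): Unit = {
    // Approximation: 103 files of exactly 530 MB each.
    val files = Seq.fill(103)(530 * MB)
    println(maxSplitBytes(files, parallelism = 400) / MB.toDouble) // about 137.5
    println(maxSplitBytes(files, parallelism = 100) / MB.toDouble) // capped at 256.0
  }
}
```

With a parallelism of 400 the per-core share is below the 256 MB cap and becomes the split size; with a parallelism of 100 the per-core share exceeds the cap, so the cap wins.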

// Where 412 comes from: part of the splitting logic extracted for a standalone test (all sizes in KB)
import scala.collection.mutable.ArrayBuffer

val openCostInBytes = 4 * 1024
val totalBytes = (53.3 * 1024 * 1024 + 103 * 4 * 1024).toInt
val maxSplitBytes = (137.5 * 1024).toInt

// 103 files of 530 MB each, cut into chunks of at most maxSplitBytes, largest chunks first
val data = Array.fill(103)(530 * 1024).flatMap { fileSize =>
  (0 until fileSize by maxSplitBytes).map { offset =>
    val remaining = fileSize - offset
    if (remaining > maxSplitBytes) maxSplitBytes else remaining
  }
}.sortBy(-_)

data.foreach(println(_))

val partitions = new ArrayBuffer[Int]
val currentFiles = new ArrayBuffer[Int]
var currentSize = 0L

/** Close the current partition and move to the next. */
def closePartition(): Unit = {
  if (currentFiles.nonEmpty) {
    // Only the number of partitions matters here, so record a placeholder.
    partitions.append(1)
  }
  currentFiles.clear()
  currentSize = 0
}

// Assign files to partitions using "First Fit Decreasing" (FFD)
data.foreach { file =>
  if (currentSize + file > maxSplitBytes) {
    closePartition()
  }
  // Add the given file to the current partition.
  currentSize += file + openCostInBytes
  currentFiles += file
}
closePartition()

println(partitions.size) // 412
2. Observed result with resources requested as --num-executors 100 --executor-memory 8G --executor-cores 1
The number of tasks actually launched was 215 = ceil(54991.2/256).

defaultMaxSplitBytes = 256 MB
openCostInBytes = 4 MB
totalBytes = 54579.2 MB + 103 * 4 MB = 54991.2 MB
bytesPerCore = 54991.2 MB / 100 = 549.91 MB
maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)) = 256 MB
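To see what a 256 MB split size does to a single file, the chunking step can be sketched on its own. The sketch mirrors the `(0L until file.getLen by maxSplitBytes)` loop from the listing; `splitSizes` and the object name are hypothetical, and the 530 MB file length is the approximate size quoted earlier, not a measured value. It covers only the cutting of one file, not the later packing of chunks into tasks.

```scala
object SplitSizesSketch {
  private val MB = 1024L * 1024L

  /** Sizes of the consecutive chunks a file of fileLen bytes is cut into. */
  def splitSizes(fileLen: Long, maxSplitBytes: Long): List[Long] =
    (0L until fileLen by maxSplitBytes).map { offset =>
      val remaining = fileLen - offset
      // Full-size chunk while enough data remains, otherwise the remainder.
      if (remaining > maxSplitBytes) maxSplitBytes else remaining
    }.toList

  def main(args: Array[String]): Unit = {
    // Assumed file length of exactly 530 MB, cut with a 256 MB split size.
    val chunksInMb = splitSizes(530 * MB, 256 * MB).map(_ / MB)
    println(chunksInMb) // List(256, 256, 18)
  }
}
```

Each such file yields two full 256 MB chunks and one small remainder; the small remainders are the pieces that can later share a partition.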

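A third directory, read with a parallelism of 400 (the 120 * 4 MB term implies 120 files):

4.2 G 12.7 G hdfs:///day=2023-07-03/hour=22/minute=00

defaultMaxSplitBytes = 256 MB
openCostInBytes = 4 MB
totalBytes = 4300.8 MB + 120 * 4 MB = 4780.8 MB
bytesPerCore = 4780.8 MB / 400 = 11.95 MB
maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)) = 11.95 MB

// Remainder of the map body in createNonBucketedReadRDD, following the getBlockHosts(...) line in the listing above:
PartitionedFile(
  partition.values, file.getPath.toUri.toString, offset, size, hosts)
}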
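The three inputs to the formula can be set and inspected from a job. The following is a minimal configuration sketch, assuming a Spark SQL application with a SparkSession: the configuration keys `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes` are the ones behind filesMaxPartitionBytes and filesOpenCostInBytes, while the object name, the application name and the input path taken from the command line are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object InspectSplitConf {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("inspect-split-conf").getOrCreate()

    // Values used throughout this article: 256 MB per partition, 4 MB open cost.
    spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024)
    spark.conf.set("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)

    // Placeholder: the Parquet directory to scan is passed as the first argument.
    val df = spark.read.parquet(args(0))

    // The parallelism that enters bytesPerCore, and the resulting number of scan partitions.
    println(s"defaultParallelism = ${spark.sparkContext.defaultParallelism}")
    println(s"scan partitions    = ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

Comparing the printed partition count across runs with different executor settings is one way to reproduce the observations recorded above.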
