This post continues from the previous one, focusing on a source-level analysis of job1.
1. Source-level analysis of when job1 is created
job1 is launched on the driver side as execution continues after job0 finishes, so let's first review the call path of job0:
DataFrameReader.load()
DataFrameReader.loadV1Source()
DataSource.resolveRelation()
DataSource.getOrInferFileFormatSchema()
new InMemoryFileIndex(sparkSession, globbedPaths, options, None, fileStatusCache)
InMemoryFileIndex.refresh0()
InMemoryFileIndex.listLeafFiles()
InMemoryFileIndex.bulkListLeafFiles()
The call that produces job1 sits inside DataSource.getOrInferFileFormatSchema().
Here is the method again:
1. DataSource.getOrInferFileFormatSchema()
private def getOrInferFileFormatSchema(
format: FileFormat,
fileStatusCache: FileStatusCache = NoopCache): (StructType, StructType) = {
// the operations below are expensive therefore try not to do them if we don't need to, e.g.,
// in streaming mode, we have already inferred and registered partition columns, we will
// never have to materialize the lazy val below
// job0: tempFileIndex is declared as a lazy val, so it is only initialized on first use
lazy val tempFileIndex = {
val allPaths = caseInsensitiveOptions.get("path") ++ paths
val hadoopConf = sparkSession.sessionState.newHadoopConf()
val globbedPaths = allPaths.toSeq.flatMap { path =>
val hdfsPath = new Path(path)
val fs = hdfsPath.getFileSystem(hadoopConf)
val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
SparkHadoopUtil.get.globPathIfNecessary(fs, qualified)
}.toArray
// job0: the InMemoryFileIndex is constructed here; this is what produces the first job
new InMemoryFileIndex(sparkSession, globbedPaths, options, None, fileStatusCache)
}
val partitionSchema = if (partitionColumns.isEmpty) {
// Try to infer partitioning, because no DataSource in the read path provides the partitioning
// columns properly unless it is a Hive DataSource
// job0: the lazy tempFileIndex is actually used for the first time here, which forces the initialization of InMemoryFileIndex
combineInferredAndUserSpecifiedPartitionSchema(tempFileIndex)
} else {
// maintain old behavior before SPARK-18510. If userSpecifiedSchema is empty used inferred
// partitioning
if (userSpecifiedSchema.isEmpty) {
val inferredPartitions = tempFileIndex.partitionSchema
inferredPartitions
} else {
val partitionFields = partitionColumns.map { partitionColumn =>
userSpecifiedSchema.flatMap(_.find(c => equality(c.name, partitionColumn))).orElse {
val inferredPartitions = tempFileIndex.partitionSchema
val inferredOpt = inferredPartitions.find(p => equality(p.name, partitionColumn))
if (inferredOpt.isDefined) {
logDebug(
s"""Type of partition column: $partitionColumn not found in specified schema
|for $format.
|User Specified Schema
|=====================
|${userSpecifiedSchema.orNull}
|
|Falling back to inferred dataType if it exists.
""".stripMargin)
}
inferredOpt
}.getOrElse {
throw new AnalysisException(s"Failed to resolve the schema for $format for " +
s"the partition column: $partitionColumn. It must be specified manually.")
}
}
StructType(partitionFields)
}
}
// by this point job0 has finished and returned
val dataSchema = userSpecifiedSchema.map { schema =>
StructType(schema.filterNot(f => partitionSchema.exists(p => equality(p.name, f.name))))
}.orElse {
// job1 starts with this call
format.inferSchema(
sparkSession,
caseInsensitiveOptions,
tempFileIndex.allFiles())
}.getOrElse {
throw new AnalysisException(
s"Unable to infer schema for $format. It must be specified manually.")
}
// We just print a warning message if the data schema and partition schema have the duplicate
// columns. This is because we allow users to do so in the previous Spark releases and
// we have the existing tests for the cases (e.g., `ParquetHadoopFsRelationSuite`).
// See SPARK-18108 and SPARK-21144 for related discussions.
try {
SchemaUtils.checkColumnNameDuplication(
(dataSchema ++ partitionSchema).map(_.name),
"in the data schema and the partition schema",
equality)
} catch {
case e: AnalysisException => logWarning(e.getMessage)
}
(dataSchema, partitionSchema)
}
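Two of the comments above hinge on Scala's lazy val semantics: tempFileIndex is declared near the top, but the InMemoryFileIndex (and with it job0) is only created on first access. A standalone sketch of that behavior (plain Scala, not Spark code):
object LazyValDemo extends App {
  // the body of a lazy val does not run at definition time,
  // only on the first access, and the result is then cached
  lazy val tempIndex: Seq[String] = {
    println("listing files...")  // in Spark, this is the point where job0 fires
    Seq("part-00000.parquet")
  }

  println("before first use")    // prints first; nothing has been "listed" yet
  tempIndex.size                 // "listing files..." prints here, exactly once
  tempIndex.size                 // cached; the body never runs again
}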
In this method job0 is triggered first; once it returns, execution continues with the steps below. job0 is mainly file analysis: collecting how many files there are in total, each file's block information, and so on.
job1's main task is to extract the data schema. It is triggered by the snippet below, and by default execution enters the orElse {} block:
// by this point job0 has finished and returned
val dataSchema = userSpecifiedSchema.map { schema =>
StructType(schema.filterNot(f => partitionSchema.exists(p => equality(p.name, f.name))))
}.orElse {
// job1 starts with this call
format.inferSchema(
sparkSession,
caseInsensitiveOptions,
tempFileIndex.allFiles())
}.getOrElse {
throw new AnalysisException(
s"Unable to infer schema for $format. It must be specified manually.")
}
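Why does execution land in orElse {}? Because no schema was specified by the user, userSpecifiedSchema is None, so map is a no-op and orElse supplies the fallback. A minimal sketch of that Option chain (plain Scala, not Spark code):
object OptionChainDemo extends App {
  // None models the case where the caller never passed a schema
  val userSpecifiedSchema: Option[String] = None

  val dataSchema = userSpecifiedSchema
    .map(schema => s"filtered($schema)")    // skipped: None.map stays None
    .orElse(Some("inferred-from-footers"))  // taken: in Spark this is where job1 fires
    .getOrElse(sys.error("Unable to infer schema"))

  assert(dataSchema == "inferred-from-footers")
}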
Here the concrete format is ParquetFileFormat, so the call resolves to ParquetFileFormat.inferSchema().
2. ParquetFileFormat.inferSchema
/**
* When possible, this method should return the schema of the given `files`. When the format
* does not support inference, or no valid files are given should return None. In these cases
* Spark will require that user specify the schema manually.
*/
override def inferSchema(
sparkSession: SparkSession,
parameters: Map[String, String],
files: Seq[FileStatus]): Option[StructType] = {
val parquetOptions = new ParquetOptions(parameters, sparkSession.sessionState.conf)
// Should we merge schemas from all Parquet part-files?
val shouldMergeSchemas = parquetOptions.mergeSchema
val mergeRespectSummaries = sparkSession.sessionState.conf.isParquetSchemaRespectSummaries
val filesByType = splitFiles(files)
// Sees which file(s) we need to touch in order to figure out the schema.
//
// Always tries the summary files first if users don't require a merged schema. In this case,
// "_common_metadata" is more preferable than "_metadata" because it doesn't contain row
// groups information, and could be much smaller for large Parquet files with lots of row
// groups. If no summary file is available, falls back to some random part-file.
//
// NOTE: Metadata stored in the summary files are merged from all part-files. However, for
// user defined key-value metadata (in which we store Spark SQL schema), Parquet doesn't know
// how to merge them correctly if some key is associated with different values in different
// part-files. When this happens, Parquet simply gives up generating the summary file. This
// implies that if a summary file presents, then:
//
// 1. Either all part-files have exactly the same Spark SQL schema, or
// 2. Some part-files don't contain Spark SQL schema in the key-value metadata at all (thus
// their schemas may differ from each other).
//
// Here we tend to be pessimistic and take the second case into account. Basically this means
// we can't trust the summary files if users require a merged schema, and must touch all part-
// files to do the merge.
val filesToTouch =
if (shouldMergeSchemas) {
// Also includes summary files, 'cause there might be empty partition directories.
// If mergeRespectSummaries config is true, we assume that all part-files are the same for
// their schema with summary files, so we ignore them when merging schema.
// If the config is disabled, which is the default setting, we merge all part-files.
// In this mode, we only need to merge schemas contained in all those summary files.
// You should enable this configuration only if you are very sure that for the parquet
// part-files to read there are corresponding summary files containing correct schema.
// As filed in SPARK-11500, the order of files to touch is a matter, which might affect
// the ordering of the output columns. There are several things to mention here.
//
// 1. If mergeRespectSummaries config is false, then it merges schemas by reducing from
// the first part-file so that the columns of the lexicographically first file show
// first.
//
// 2. If mergeRespectSummaries config is true, then there should be, at least,
// "_metadata"s for all given files, so that we can ensure the columns of
// the lexicographically first file show first.
//
// 3. If shouldMergeSchemas is false, but when multiple files are given, there is
// no guarantee of the output order, since there might not be a summary file for the
// lexicographically first file, which ends up putting ahead the columns of
// the other files. However, this should be okay since not enabling
// shouldMergeSchemas means (assumes) all the files have the same schemas.
val needMerged: Seq[FileStatus] =
if (mergeRespectSummaries) {
Seq.empty
} else {
filesByType.data
}
needMerged ++ filesByType.metadata ++ filesByType.commonMetadata
} else {
// Tries any "_common_metadata" first. Parquet files written by old versions or Parquet
// don't have this.
filesByType.commonMetadata.headOption
// Falls back to "_metadata"
.orElse(filesByType.metadata.headOption)
// Summary file(s) not found, the Parquet file is either corrupted, or different part-
// files contain conflicting user defined metadata (two or more values are associated
// with a same key in different files). In either case, we fall back to any of the
// first part-file, and just assume all schemas are consistent.
.orElse(filesByType.data.headOption)
.toSeq
}
ParquetFileFormat.mergeSchemasInParallel(filesToTouch, sparkSession)
}
The method looks long, but most of it is comments. Essentially it decides whether the schema has to be read from every Parquet part-file, and how the files to read are chosen.
Let's simplify the code.
1. Simplified code
override def inferSchema(
sparkSession: SparkSession,
parameters: Map[String, String],
files: Seq[FileStatus]): Option[StructType] = {
val parquetOptions = new ParquetOptions(parameters, sparkSession.sessionState.conf)
// whether to merge the schema from all Parquet part-files; defaults to false
val shouldMergeSchemas = parquetOptions.mergeSchema
// when merging, whether to assume all part-files are consistent with the summary files; defaults to false (no consistency assumed). Only consulted when shouldMergeSchemas is true
val mergeRespectSummaries = sparkSession.sessionState.conf.isParquetSchemaRespectSummaries
val filesByType = splitFiles(files)
val filesToTouch =
// false with the defaults
if (shouldMergeSchemas) {
val needMerged: Seq[FileStatus] =
if (mergeRespectSummaries) {
Seq.empty
} else {
filesByType.data
}
needMerged ++ filesByType.metadata ++ filesByType.commonMetadata
} else {
// with the defaults, execution falls into this branch
filesByType.commonMetadata.headOption
.orElse(filesByType.metadata.headOption)
.orElse(filesByType.data.headOption)
.toSeq
}
// and this is the final call
ParquetFileFormat.mergeSchemasInParallel(filesToTouch, sparkSession)
}
2. parquetOptions.mergeSchema is false by default
The default value of parquetOptions.mergeSchema is false:
val PARQUET_SCHEMA_MERGING_ENABLED = buildConf("spark.sql.parquet.mergeSchema")
.doc("When true, the Parquet data source merges schemas collected from all data files, " +
"otherwise the schema is picked from the summary file or a random data file " +
"if no summary file is available.")
.booleanConf
.createWithDefault(false)
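If you do want schema merging for a particular read, the standard mergeSchema read option flips this flag per load; with it enabled, shouldMergeSchemas above becomes true and the footers of all part-files are read. A sketch, assuming spark is an existing SparkSession:
// per-read override of spark.sql.parquet.mergeSchema, for this load only
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("hdfs:///user/daily/20200828/*.parquet")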
3. isParquetSchemaRespectSummaries is false by default
The default value of sparkSession.sessionState.conf.isParquetSchemaRespectSummaries is false:
val PARQUET_SCHEMA_RESPECT_SUMMARIES = buildConf("spark.sql.parquet.respectSummaryFiles")
.doc("When true, we make assumption that all part-files of Parquet are consistent with " +
"summary files and we will ignore them when merging schema. Otherwise, if this is " +
"false, which is the default, we will merge all part-files. This should be considered " +
"as expert-only option, and shouldn't be enabled before knowing what it means exactly.")
.booleanConf
.createWithDefault(false)
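Both flags can also be set session-wide when the SparkSession is built. A minimal sketch (the values shown are just the defaults made explicit):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("schema-inference-demo")
  .config("spark.sql.parquet.mergeSchema", "false")         // default
  .config("spark.sql.parquet.respectSummaryFiles", "false") // default
  .getOrCreate()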
4. The contents of filesByType
filesByType is a wrapper around all the listed files; the filesToTouch that comes out of it here is:
LocatedFileStatus{path=hdfs://test.com:9000/user/daily/20200828/part-00000-0e0dc5b5-5061-41ca-9fa6-9fb7b3e09e98-c000.snappy.parquet; isDirectory=false; length=193154555; replication=3; blocksize=134217728; modification_time=1599052749676; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
That is, the first file among all the files in filesByType. This single file is then passed as the argument to the ParquetFileFormat.mergeSchemasInParallel(filesToTouch, sparkSession) method.
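splitFiles itself is not quoted above; roughly, it sorts the listed files by path and buckets them by file name into data files, "_metadata" summaries, and "_common_metadata" summaries. A simplified sketch of that classification (an approximation of ParquetFileFormat.splitFiles, reduced to plain name matching):
import org.apache.hadoop.fs.FileStatus

object SplitFilesSketch {
  case class FileTypes(
      data: Seq[FileStatus],
      metadata: Seq[FileStatus],
      commonMetadata: Seq[FileStatus])

  // bucket the listed files by name, keeping a stable path order so that
  // schema merging sees the lexicographically first file first
  def splitFiles(allFiles: Seq[FileStatus]): FileTypes = {
    val leaves = allFiles.sortBy(_.getPath.toString)
    FileTypes(
      data = leaves.filterNot(isSummaryFile),
      metadata = leaves.filter(_.getPath.getName == "_metadata"),
      commonMetadata = leaves.filter(_.getPath.getName == "_common_metadata"))
  }

  private def isSummaryFile(f: FileStatus): Boolean =
    f.getPath.getName == "_metadata" || f.getPath.getName == "_common_metadata"
}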
3. ParquetFileFormat.mergeSchemasInParallel
/**
* Figures out a merged Parquet schema with a distributed Spark job.
*
* Note that locality is not taken into consideration here because:
*
* 1. For a single Parquet part-file, in most cases the footer only resides in the last block of
* that file. Thus we only need to retrieve the location of the last block. However, Hadoop
* `FileSystem` only provides API to retrieve locations of all blocks, which can be
* potentially expensive.
*
* 2. This optimization is mainly useful for S3, where file metadata operations can be pretty
* slow. And basically locality is not available when using S3 (you can't run computation on
* S3 nodes).
*/
def mergeSchemasInParallel(
filesToTouch: Seq[FileStatus],
sparkSession: SparkSession): Option[StructType] = {
val assumeBinaryIsString = sparkSession.sessionState.conf.isParquetBinaryAsString
val assumeInt96IsTimestamp = sparkSession.sessionState.conf.isParquetINT96AsTimestamp
val serializedConf = new SerializableConfiguration(sparkSession.sessionState.newHadoopConf())
// !! HACK ALERT !!
//
// Parquet requires `FileStatus`es to read footers. Here we try to send cached `FileStatus`es
// to executor side to avoid fetching them again. However, `FileStatus` is not `Serializable`
// but only `Writable`. What makes it worse, for some reason, `FileStatus` doesn't play well
// with `SerializableWritable[T]` and always causes a weird `IllegalStateException`. These
// facts virtually prevents us to serialize `FileStatus`es.
//
// Since Parquet only relies on path and length information of those `FileStatus`es to read
// footers, here we just extract them (which can be easily serialized), send them to executor
// side, and resemble fake `FileStatus`es there.
val partialFileStatusInfo = filesToTouch.map(f => (f.getPath.toString, f.getLen))
// Set the number of partitions to prevent following schema reads from generating many tasks
// in case of a small number of parquet files.
val numParallelism = Math.min(Math.max(partialFileStatusInfo.size, 1),
sparkSession.sparkContext.defaultParallelism)
val ignoreCorruptFiles = sparkSession.sessionState.conf.ignoreCorruptFiles
// Issues a Spark job to read Parquet schema in parallel.
val partiallyMergedSchemas =
sparkSession
.sparkContext
.parallelize(partialFileStatusInfo, numParallelism)
.mapPartitions { iterator =>
// Resembles fake `FileStatus`es with serialized path and length information.
val fakeFileStatuses = iterator.map { case (path, length) =>
new FileStatus(length, false, 0, 0, 0, 0, null, null, null, new Path(path))
}.toSeq
// Reads footers in multi-threaded manner within each task
val footers =
ParquetFileFormat.readParquetFootersInParallel(
serializedConf.value, fakeFileStatuses, ignoreCorruptFiles)
// Converter used to convert Parquet `MessageType` to Spark SQL `StructType`
val converter = new ParquetToSparkSchemaConverter(
assumeBinaryIsString = assumeBinaryIsString,
assumeInt96IsTimestamp = assumeInt96IsTimestamp)
if (footers.isEmpty) {
Iterator.empty
} else {
var mergedSchema = ParquetFileFormat.readSchemaFromFooter(footers.head, converter)
footers.tail.foreach { footer =>
val schema = ParquetFileFormat.readSchemaFromFooter(footer, converter)
try {
mergedSchema = mergedSchema.merge(schema)
} catch { case cause: SparkException =>
throw new SparkException(
s"Failed merging schema of file ${footer.getFile}:\n${schema.treeString}", cause)
}
}
Iterator.single(mergedSchema)
}
}.collect()
if (partiallyMergedSchemas.isEmpty) {
None
} else {
var finalSchema = partiallyMergedSchemas.head
partiallyMergedSchemas.tail.foreach { schema =>
try {
finalSchema = finalSchema.merge(schema)
} catch { case cause: SparkException =>
throw new SparkException(
s"Failed merging schema:\n${schema.treeString}", cause)
}
}
Some(finalSchema)
}
}
ParquetFileFormat.mergeSchemasInParallel takes two parameters: the SparkSession, and filesToTouch, a list of files to read. As noted above, in the default configuration only a single file is actually passed in.
This method launches job1, which parses the schema from the given files. The core of job1 is:
val partialFileStatusInfo = filesToTouch.map(f => (f.getPath.toString, f.getLen))
val numParallelism = Math.min(Math.max(partialFileStatusInfo.size, 1),
sparkSession.sparkContext.defaultParallelism)
val ignoreCorruptFiles = sparkSession.sessionState.conf.ignoreCorruptFiles
val partiallyMergedSchemas =
sparkSession
.sparkContext
.parallelize(partialFileStatusInfo, numParallelism)
.mapPartitions { iterator =>
...
...
Iterator.single(mergedSchema)
}.collect()
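Structurally this is a two-phase reduce: each task folds the footers of its partition into one partial schema, and the driver then folds the collected partials into the final schema. The same shape with plain sets of column names standing in for StructType (a sketch, not Spark's actual types):
import org.apache.spark.sql.SparkSession

object TwoPhaseMergeSketch extends App {
  val spark = SparkSession.builder().master("local[2]").appName("merge-sketch").getOrCreate()

  // column-name sets stand in for per-file schemas
  val perFileColumns = Seq(Set("uid", "age"), Set("uid", "city"), Set("uid", "score"))

  val partials = spark.sparkContext
    .parallelize(perFileColumns, numSlices = 2)
    .mapPartitions { iter =>
      // phase 1: merge within each task
      if (iter.isEmpty) Iterator.empty else Iterator.single(iter.reduce(_ union _))
    }
    .collect()

  // phase 2: merge the partial results on the driver
  println(partials.reduceOption(_ union _)) // Some(Set(...)) with uid, age, city, score

  spark.stop()
}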
The parallelism of job1: numParallelism
val numParallelism = Math.min(Math.max(partialFileStatusInfo.size, 1),
sparkSession.sparkContext.defaultParallelism)
partialFileStatusInfo.size comes from the input file list; here it is 1.
sparkSession.sparkContext.defaultParallelism returns 1 in this run. The value differs between environments, but it is never less than 1.
So numParallelism is 1, which means job1 runs with a single partition.
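The clamp is easy to verify by hand:
object NumParallelismDemo extends App {
  def numParallelism(numFiles: Int, defaultParallelism: Int): Int =
    Math.min(Math.max(numFiles, 1), defaultParallelism)

  println(numParallelism(1, 1))   // 1 -> our case: one file, local master
  println(numParallelism(0, 8))   // 1 -> never drops below one partition
  println(numParallelism(500, 8)) // 8 -> capped at defaultParallelism
}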
2. Summary of the job1 call chain
The method call chain is:
DataFrameReader.load()
DataFrameReader.loadV1Source()
DataSource.resolveRelation()
DataSource.getOrInferFileFormatSchema()
ParquetFileFormat.inferSchema()
ParquetFileFormat.mergeSchemasInParallel()
3. Summary of spark read
Let's revisit our code.
public class UserProfileTest {
static String filePath = "hdfs:///user/daily/20200828/*.parquet";
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf()
.setMaster("local")
.setAppName("user_profile_test")
.set(ConfigurationOptions.ES_NODES, "")
.set(ConfigurationOptions.ES_PORT, "")
.set(ConfigurationOptions.ES_MAPPING_ID, "uid");
//the main question: why does this read produce extra jobs?
SparkSession sparkSession = SparkSession.builder().config(sparkConf).getOrCreate();
Dataset<Row> userProfileSource = sparkSession.read().parquet(filePath);
userProfileSource.count();
userProfileSource.write().parquet("hdfs:///user/daily/result2020082808/");
}
}
When the number of paths matched by filePath is greater than 32, a separate job (job0) is launched to recursively list the files and collect all matching file information (including block size and so on), in preparation for schema inference and for the partitioning of later RDD operations.
Once all the files have been found, if they are Parquet files, a new job (job1) is created to parse the schema.
Based on the analysis above, suppose we change the filePath in the code to point at a single concrete file:
static String filePath = "hdfs://bj3-dev-search-01.tencn:9000/user/daily/20200828/part-00057-0e0dc5b5-5061-41ca-9fa6-9fb7b3e09e98-c000.snappy.parquet";
then no job0 is produced when the program runs, because a single matched path stays below the parallel-listing threshold.
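job1, on the other hand, can be skipped by supplying the schema up front: as the getOrInferFileFormatSchema analysis showed, a non-empty userSpecifiedSchema short-circuits format.inferSchema(). A sketch (the column names here are made up for illustration):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object ReadWithExplicitSchema extends App {
  val spark = SparkSession.builder().master("local").appName("no-job1").getOrCreate()
  val filePath = "hdfs:///user/daily/20200828/*.parquet"

  // hypothetical columns, for illustration only; use the file's real schema
  val schema = StructType(Seq(
    StructField("uid", StringType),
    StructField("score", LongType)))

  // with an explicit schema, userSpecifiedSchema is non-empty, so
  // format.inferSchema() is never called and job1 does not run
  val df = spark.read.schema(schema).parquet(filePath)
  df.printSchema()
}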