Every FileFormat implements inferSchema, but it is called only once, when the relation is first initialized, not on every query.
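For intuition, a minimal sketch of when inference is triggered (the path /tmp/events and the column names are made up): reading without a schema invokes inferSchema during relation resolution, while supplying an explicit schema skips it entirely, which is why the cost of each format's implementation matters.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("infer-schema-demo").getOrCreate()

// Triggers ParquetFileFormat.inferSchema once, while the relation is resolved.
val inferred = spark.read.parquet("/tmp/events")

// Providing a schema up front skips inferSchema entirely.
val explicit = spark.read
  .schema(StructType(Seq(
    StructField("id", LongType),
    StructField("name", StringType))))
  .parquet("/tmp/events")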
ParquetFileFormat
Spark obtains the Parquet schema by launching a distributed Spark job, via mergeSchemasInParallel:
/**
 * Figures out a merged Parquet schema with a distributed Spark job.
 *
 * Note that locality is not taken into consideration here because:
 *
 * 1. For a single Parquet part-file, in most cases the footer only resides in the last block of
 *    that file. Thus we only need to retrieve the location of the last block. However, Hadoop
 *    `FileSystem` only provides API to retrieve locations of all blocks, which can be
 *    potentially expensive.
 *
 * 2. This optimization is mainly useful for S3, where file metadata operations can be pretty
 *    slow. And basically locality is not available when using S3 (you can't run computation on
 *    S3 nodes).
 */
def mergeSchemasInParallel(
    filesToTouch: Seq[FileStatus],
    sparkSession: SparkSession): Option[StructType] = {
  ...
  // Issues a Spark job to read Parquet schema in parallel.
  val partiallyMergedSchemas =
    sparkSession
      .sparkContext
      .parallelize(partialFileStatusInfo, numParallelism)
      .mapPartitions {
        ...
      }.collect()
  ...
}
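The cost of that job depends on how many footers it has to touch: with schema merging enabled every part-file is handed to mergeSchemasInParallel, otherwise only a summary file or a single part-file. A short usage sketch (the path is made up):

// Merge schemas across all part-files: the Spark job reads every footer.
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("/tmp/parquet-table")

// Equivalent session-wide switch.
spark.conf.set("spark.sql.parquet.mergeSchema", "true")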
Now let's look at Avro.
Avro infers the schema by reading a single sample file directly on the driver, without launching a Spark job:
override def inferSchema(
    spark: SparkSession,
    options: Map[String, String],
    files: Seq[FileStatus]): Option[StructType] = {
  val conf = spark.sparkContext.hadoopConfiguration
  // Schema evolution is not supported yet. Here we only pick a single random sample file to
  // figure out the schema of the whole dataset.
  val sampleFile = if (conf.getBoolean(IgnoreFilesWithoutExtensionProperty, true)) {
    files.find(_.getPath.getName.endsWith(".avro")).getOrElse {
      throw new FileNotFoundException(
        "No Avro files found. Hadoop option \"avro.mapred.ignore.inputs.without.extension\" is " +
          "set to true. Do all input files have \".avro\" extension?"
      )
    }
  } else {
    files.headOption.getOrElse {
      throw new FileNotFoundException("No Avro files found.")
    }
  }

  // User can specify an optional avro json schema.
  val avroSchema = options.get(AvroSchema).map(new Schema.Parser().parse).getOrElse {
    val in = new FsInput(sampleFile.getPath, conf)
    try {
      val reader = DataFileReader.openReader(in, new GenericDatumReader[GenericRecord]())
      try {
        reader.getSchema
      } finally {
        reader.close()
      }
    } finally {
      in.close()
    }
  }

  SchemaConverters.toSqlType(avroSchema).dataType match {
    case t: StructType => Some(t)
    case _ => throw new RuntimeException(
      s"""Avro schema cannot be converted to a Spark SQL StructType:
         |
         |${avroSchema.toString(true)}
         |""".stripMargin)
  }
}
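Even that single driver-side file read can be avoided by supplying the Avro JSON schema through the avroSchema option, which is the AvroSchema key read in the code above. A hedged sketch, assuming the databricks spark-avro package shown here (with Spark's built-in source since 2.4 the format name is just "avro"; the path and schema literal are made up):

// Supplying the schema up front short-circuits the sample-file read in inferSchema.
val schemaJson =
  """{"type": "record", "name": "Event",
    | "fields": [{"name": "id", "type": "long"},
    |            {"name": "name", "type": "string"}]}""".stripMargin

val df = spark.read
  .format("com.databricks.spark.avro")
  .option("avroSchema", schemaJson)
  .load("/tmp/events-avro")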