Every FileFormat implements inferSchema, but it is called only once, when the relation is first initialized, not on every query.
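For intuition, a minimal sketch of when inference is triggered (the path /tmp/events and the column names are made up): reading without a schema invokes inferSchema during relation resolution, while supplying an explicit schema skips it entirely, which is why the cost of each format's implementation matters.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("infer-schema-demo").getOrCreate()

// Triggers ParquetFileFormat.inferSchema once, while the relation is resolved.
val inferred = spark.read.parquet("/tmp/events")

// Providing a schema up front skips inferSchema entirely.
val explicit = spark.read
  .schema(StructType(Seq(
    StructField("id", LongType),
    StructField("name", StringType))))
  .parquet("/tmp/events")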
ParquetFileFormat
Spark obtains the Parquet schema by launching a distributed Spark job, via mergeSchemasInParallel:
/**
 * Figures out a merged Parquet schema with a distributed Spark job.
 *
 * Note that locality is not taken into consideration here because:
 *
 * 1. For a single Parquet part-file, in most cases the footer only resides in the last block of
 *    that file. Thus we only need to retrieve the location of the last block. However, Hadoop
 *    `FileSystem` only provides API to retrieve locations of all blocks, which can be
 *    potentially expensive.
 *
 * 2. This optimization is mainly useful for S3, where file metadata operations can be pretty
 *    slow. And basically locality is not available when using S3 (you can't run computation on
 *    S3 nodes).
 */
def mergeSchemasInParallel(
    filesToTouch: Seq[FileStatus],
    sparkSession: SparkSession): Option[StructType] = {
  ...
  // Issues a Spark job to read Parquet schema in parallel.
  val partiallyMergedSchemas =
    sparkSession
      .sparkContext
      .parallelize(partialFileStatusInfo, numParallelism)
      .mapPartitions {
        ...
      }.collect()
  ...
}
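The cost of that job depends on how many footers it has to touch: with schema merging enabled every part-file is handed to mergeSchemasInParallel, otherwise only a summary file or a single part-file. A short usage sketch (the path is made up):

// Merge schemas across all part-files: the Spark job reads every footer.
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("/tmp/parquet-table")

// Equivalent session-wide switch.
spark.conf.set("spark.sql.parquet.mergeSchema", "true")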
Now let's look at Avro.
Avro infers the schema by reading a single sample file directly on the driver, without launching a Spark job:
override def inferSchema(
    spark: SparkSession,
    options: Map[String, String],
    files: Seq[FileStatus]): Option[StructType] = {
  val conf = spark.sparkContext.hadoopConfiguration
  // Schema evolution is not supported yet. Here we only pick a single random sample file to
  // figure out the schema of the whole dataset.
  val sampleFile = if (conf.getBoolean(IgnoreFilesWithoutExtensionProperty, true)) {
    files.find(_.getPath.getName.endsWith(".avro")).getOrElse {
      throw new FileNotFoundException(
        "No Avro files found. Hadoop option \"avro.mapred.ignore.inputs.without.extension\" is " +
          "set to true. Do all input files have \".avro\" extension?"
      )
    }
  } else {
    files.headOption.getOrElse {
      throw new FileNotFoundException("No Avro files found.")
    }
  }

  // User can specify an optional avro json schema.
  val avroSchema = options.get(AvroSchema).map(new Schema.Parser().parse).getOrElse {
    val in = new FsInput(sampleFile.getPath, conf)
    try {
      val reader = DataFileReader.openReader(in, new GenericDatumReader[GenericRecord]())
      try {
        reader.getSchema
      } finally {
        reader.close()
      }
    } finally {
      in.close()
    }
  }

  SchemaConverters.toSqlType(avroSchema).dataType match {
    case t: StructType => Some(t)
    case _ => throw new RuntimeException(
      s"""Avro schema cannot be converted to a Spark SQL StructType:
         |
         |${avroSchema.toString(true)}
         |""".stripMargin)
  }
}
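Even that single driver-side file read can be avoided by supplying the Avro JSON schema through the avroSchema option, which is the AvroSchema key read in the code above. A hedged sketch, assuming the databricks spark-avro package shown here (with Spark's built-in source since 2.4 the format name is just "avro"; the path and schema literal are made up):

// Supplying the schema up front short-circuits the sample-file read in inferSchema.
val schemaJson =
  """{"type": "record", "name": "Event",
    | "fields": [{"name": "id", "type": "long"},
    |            {"name": "name", "type": "string"}]}""".stripMargin

val df = spark.read
  .format("com.databricks.spark.avro")
  .option("avroSchema", schemaJson)
  .load("/tmp/events-avro")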